Scalable DSP Core Architecture Addressing

Tampereen teknillinen yliopisto. Julkaisu 483
Tampere University of Technology. Publication 483
Christian Panis
Scalable DSP Core Architecture Addressing Compiler
Requirements
Thesis for the degree of Doctor of Technology to be presented with due permission for
public examination and criticism in Tietotalo Building, Auditorium TB104, at Tampere
University of Technology, on the 13th of August 2004, at 12 o’clock noon.
Tampereen teknillinen yliopisto - Tampere University of Technology
Tampere 2004
ISBN 952-15-1205-9
ISSN 1459-2045
Abstract
This thesis considers the definition and design of an embedded configurable DSP (Digital
Signal Processor) core architecture and will address the necessary requirements for
developing an optimizing high-level language compiler. The introduction provides an
overview of typical DSP core architectural features, briefly discusses the currently available
DSP cores and summarizes the architectural aspects which have to be considered when
developing an optimizing high-level language compiler. The introduction is followed by a
total of 12 publications which outline the research work carried out while providing a
detailed description of the main core features and the design space exploration methodology.
Most of the research work focuses on architectural aspects of the configurable RISC
(Reduced Instruction Set Computer) DSP core based on a modified Dual-Harvard load-store
architecture. Due to increasing application code size and the associated configuration aspect,
the use of automatic code generation by a high-level language compiler is required.
Generating code efficiently requires that the architectural aspects be considered as early as
the definition stage. This results in an orthogonal instruction set architecture with simple issue
rules.
Architectural features are introduced to reduce area consumption and power dissipation to
fulfill requirements of SoC (System-on-Chip) and SiP (System-in-Package) applications
and close the gap between dedicated hardware implementations and software based system
solutions. Code density has a significant influence on the area of the DSP sub-system, thus
xLIW (scalable Long Instruction Word) is introduced. An instruction buffer allows the
reduction of power dissipation during execution of loop-centric DSP algorithms. Simple
issue rules and an exhaustive predicated execution feature enable cycle- and power-efficient
execution of control code.
The scalable DSP core architecture introduced herein allows parameterization of the main
architectural features to application specific requirements. To make use of this feature it is
necessary to analyze the requirements of the application. This thesis introduces a design
space exploration methodology based on a C-compiler and a cycle-true instruction set
simulator. A unique XML-based configuration file is used to reduce the implementation and
validation effort for configuring the tool-chain, updating documentation and for automatic
generation of parts of the VHDL-RTL core description.
Preface
The research work described in this thesis was carried out during 1999-2004 in Infineon
Technologies Austria and in the Institute of Digital and Computer Systems at the Tampere
University of Technology in Tampere, Finland.
I would like to express my deepest gratitude to my thesis advisor, Prof. Jari Nurmi. He
introduced and guided me carefully through the scientific world. Jari hosted me during my
stays at the university in Tampere and warmed the cockles of my heart in the sometimes
cold Finland. Prof. Jarmo Takala as head of the Institute of Digital and Computer Systems
supported my study work and along with Lasse Harju and Timo Rintakoski ensured me a
warm and pleasant working environment during my time at TUT.
A note of gratitude goes out to Prof. Lars Wanhammar and Dr. Mika Kuulusa for reviewing
my thesis and supporting me with imperative feedback.
Defining and developing a new DSP core when considering the approach of Hardware and
Software Co-Definition can only be done with a competent and enthusiastic team. Therefore
I would like to express my deepest thanks to the xDSPcore team which contributed
excellent work during the long period.
Many thanks to Prof. Andreas Krall from Vienna University of Technology, who influenced
the xDSPcore architecture and considered aspects relevant when developing an optimizing
C-compiler, and to Karl Vögler and Ulrich Hirnschrott, who developed the main parts of the
C-compiler backend and supported my thesis by contributing benchmarks and analysis
results alongside many productive discussions.
Many thanks also to the internship and masters students who contributed to the xDSPcore
research project, including Pierre Elbischger, Gunther Laure, Wolfgang Lazian, Raimund
Leitner, Michael Bramberger and many more.
During my time at Infineon Technologies I had the pleasure of meeting many amazing people,
which led to a plethora of fruitful discussions. Representative of them all, many thanks to
Herbert Zojer, who supported the development of an innovative DSP core architecture, and to
Prof. Lajos Gazsi, Fellow of Infineon Technologies, and Dr. Söhnke Mehrgardt, the CTO of
Infineon Technologies, who guided the development team inside the company.
I would like to express my thanks to Dr. Franz Dielacher, Manfred Haas and Reinhard
Petschacher at Infineon Technologies Austria and Prof. Herbert Grünbacher and Erwin
Ofner at Carinthian Tech Institute who enabled me to finalize my research work.
In addition I would like to express my thanks to Prof. Tobias G. Noll and Volker Gierenz at
RWTH Aachen who assisted the project from the beginning with their technical expertise.
The research was financially supported by Infineon Technologies Austria, the European
Commission with the project SoC-Mobinet (IST-2000-30094) and the Carinthian Tech
Institute who hosted me in the last two years. Many thanks.
Most of all I would like to express my deepest gratitude to my parents Maria and Herbert
Panis and brother Peter who supported me unrelentingly throughout the long time period
with their love. Only through their support was it possible for me to complete my studies in
Tampere, Finland.
Tampere, August 2004
Christian Panis
Table of Contents
1 Introduction
  1.1 Motivation
  1.2 Methodology
  1.3 Goals
  1.4 Outline of Thesis
2 DSP Specific Features
  2.1 Introduction
  2.2 Saturation
  2.3 Rounding
  2.4 Fixed-Point, Floating-Point
  2.5 Hardware Loops
  2.6 Addressing Modes
  2.7 Multiple Memory Banks
  2.8 CISC Instruction Sets
  2.9 Orthogonality
  2.10 Real-Time Requirements
3 DSP cores
  3.1 Design Space
  3.2 Architectural Alternatives
  3.3 Available DSP Core Architectures
  3.4 xDSPcore
4 High Level Language Compiler Issues
  4.1 Coding Practices in DSP’s
  4.2 Compiler Overview
  4.3 Requirements
  4.4 HLL-Compiler Friendly Core Architecture
5 Summary of Publications
  5.1 Architectural Aspects of Scalable DSP Core
  5.2 Design Space Exploration
  5.3 Author’s Contribution to Published Work
6 Conclusion
  6.1 Main Results
  6.2 Future Research
References
List of Publications
This thesis is split into two parts with the first containing an introduction into Digital Signal
Processor architectures and the second part a reprint of the publications listed below.
[P1] C. Panis, J. Nurmi, “xDSPcore - a Configurable DSP Core”, Technical Report 1-2004,
Tampere University of Technology, Institute of Digital and Computer Systems, Tampere,
Finland, May 2004.
[P2] C. Panis, R. Leitner, H. Grünbacher, J. Nurmi, “xLIW – a Scaleable Long Instruction
Word”, in Proceedings The 2003 IEEE International Symposium on Circuits and
Systems (ISCAS 2003), Bangkok, Thailand, May 25-28, 2003, pp. V69-V72.
[P3] C. Panis, R. Leitner, H. Grünbacher, J. Nurmi, “Align Unit for a Configurable DSP
Core”, in Proceedings on the IASTED International Conference on Circuits, Signals and
Systems (CSS 2003), Cancun, Mexico, May 19-21, 2003, pp. 247-252.
[P4] C. Panis, M. Bramberger, H. Grünbacher, J. Nurmi, “A Scaleable Instruction Buffer
for a Configurable DSP Core”, in Proceedings of 29th European Solid State Conference
(ESSCIRC 2003), Estoril, Portugal, September 16-18, 2003, pp. 49-52.
[P5] C. Panis, H. Grünbacher, J. Nurmi, “A Scaleable Instruction Buffer and Align Unit
for xDSPcore”, IEEE Journal of Solid-State Circuits, Volume 35, Number 7, July 2004,
pp. 1094-1100.
[P6] C. Panis, U. Hirnschrott, A. Krall, G. Laure, W. Lazian, J. Nurmi, “FSEL – Selective
Predicated Execution for a Configurable DSP Core”, in Proceedings of IEEE Annual
Symposium on VLSI (ISVLSI-04), Lafayette, Louisiana, USA, February 19-20, 2004, pp.
317-320.
[P7] C. Panis, G. Laure, W. Lazian, H. Grünbacher, J. Nurmi, “A Branch File for a
Configurable DSP Core”, in Proceedings of the International Conference on VLSI
(VLSI’03), Las Vegas, Nevada, USA, June 23-26, 2003, pp. 7-12.
[P8] C. Panis, R. Leitner, J. Nurmi, “A Scaleable Shadow Stack for a Configurable DSP
Concept”, in Proceedings The 3rd IEEE International Workshop on System-on-Chip for
Real-Time Applications (IWSOC), Calgary, Canada, June 30-July 2, 2003, pp. 222-227.
[P9] C. Panis, J. Hohl, H. Grünbacher, J. Nurmi, “xICU - a Scaleable Interrupt Unit for a
Configurable DSP Core”, in Proceedings 2003 International Symposium on System-on-Chip
(SOC’03), Tampere, Finland, November 19-21, 2003, pp. 75-78.
[P10] C. Panis, G. Laure, W. Lazian, A. Krall, H. Grünbacher, J. Nurmi, “DSPxPlore –
Design Space Exploration for a Configurable DSP Core”, in Proceedings International
Signal Processing Conference (GSPx), Dallas, Texas, USA, March 31- April 3, 2003,
CD-ROM.
[P11] C. Panis, U. Hirnschrott, G. Laure, W. Lazian, J. Nurmi, “DSPxPlore - Design Space
Exploration Methodology for an Embedded DSP Core”, in Proceedings of the 2004
ACM Symposium on Applied Computing (SAC 04), Nicosia, Cyprus, March 14-17, 2004,
pp. 876-883.
[P12] C. Panis, A. Schilke, H. Habiger, J. Nurmi, “An Automatic Decoder Generator for a
Scaleable DSP Architecture”, in Proceedings of the 20th Norchip Conference
(Norchip’02), Copenhagen, Denmark, November 11-12, 2002, pp. 127-132.
List of Figures
Figure 1: Chosen Methodology for Definition of the Core Architecture.
Figure 2: Principle of Saturation.
Figure 3: Two's Complement Rounding (Motorola 56000 family).
Figure 4: Convergent Rounding (Motorola 56000 family).
Figure 5: Integer versus Fractional Data Representation.
Figure 6: Fractional Multiplication Including Left Shift.
Figure 7: Assembly Code Example for Finite Impulse Response (FIR) Filter.
Figure 8: Example for Implied Addressing: Multiply Operation (Lucent 16xx).
Figure 9: Example for Implied Addressing: MAX2VIT Instruction (Starcore SC140).
Figure 10: Example for Immediate Data Addressing: MOVC instruction (xDSPcore).
Figure 11: Example for Register Direct Addressing: Subtraction (TI C62x).
Figure 12: Principle of Register Indirect Addressing.
Figure 13: Principle of Pre/Post Operation Mode.
Figure 14: Assembly Code Example for Pre/Post Increment Instructions (xDSPcore).
Figure 15: Principle of Using a Modulo Buffer for Address Generation.
Figure 16: Principle of the Bit Reversal Addressing Scheme.
Figure 17: Assembly Code Example for Short Immediate Data (xDSPcore).
Figure 18: Processor Architectures: von Neumann, Harvard, modified Dual-Harvard.
Figure 19: Example for Interleaved Memory Addressing (SC 140).
Figure 20: Example for CISC Instructions: Multiply and Accumulate (MAC).
Figure 21: Influence of Binary Coding on Application Code Density (using the same ISA).
Figure 22: Principle of Define-in-use Dependency.
Figure 23: Principle of Load-in-use Dependency.
Figure 24: Example for Data Memory Bandwidth Limitations (Starcore SC140).
Figure 25: Architectural Alternatives: Issue Rates for Available DSP Core Architectures.
Figure 26: Architectural Alternatives: RISC versus CISC Pipeline.
Figure 27: Architectural Alternatives: Pipeline Depth of Available DSP Cores.
Figure 28: Architectural Alternatives: Direct Memory versus Load-Store.
Figure 29: Architectural Alternatives: Mode Dependent Limitations during Instruction Scheduling.
Figure 30: Architectural Overview: OAKDSP Core.
Figure 31: Architectural Overview: Motorola 56300.
Figure 32: Architectural Overview: TI C54x.
Figure 33: Architectural Overview: ZSP400.
Figure 34: Architectural Overview: Carmel.
Figure 35: Architectural Overview: TI C6xx.
Figure 36: Architectural Overview: Starcore SC140.
Figure 37: Architectural Overview: Blackfin.
Figure 38: Architectural Overview: xDSPcore.
Figure 39: Principle of Software Pipelining.
Figure 40: Data Flow Graph of an Example Issuing Summation of two Data Values.
Figure 41: Example for Assembler Code Implementation including Software Pipelining (xDSPcore).
Figure 42: Data Flow Graph for Maximum Search Example.
Figure 43: C-Code Example for Illustration of Software Pipelining.
Figure 44: Generated Assembler Code without Software Pipelining (xDSPcore).
Figure 45: Generated Assembler Code including Software Pipelining (xDSPcore).
Figure 46: Principle of Loop Unrolling.
Figure 47: Principle of Predicated Execution using Loop Flags.
Figure 48: General High-level Language Compiler Structure.
Figure 49: Example for banked Register Files (TI C62x).
Figure 50: Limitations during Instruction Scheduling caused by Processor Modes.
Figure 51: Example for Address Generation Unit (Motorola 56300).
Figure 52: Example for not Orthogonal Instructions: MAX2VIT D4,D2 (Starcore SC140).
Figure 53: Example for Mode Dependent Instruction Sets: ARM Thumb Decompression Logic.
Figure 54: Example for Address Generation Unit (Starcore SC140).
Figure 55: Configurable Long Instruction Word (CLIW of Carmel DSP Core).
Figure 56: xDSPcore Core Overview.
Figure 57: Orthogonal Register File.
Figure 58: Issuing Rules for xDSPcore Architecture.
Figure 59: Results for Dhrystone Benchmarks generated by C-Compiler.
Figure 60: Results for EFR Benchmarks generated by C-Compiler.
Figure 61: xDSPcore Overview.
Figure 62: DSPxPlore Overview.
Figure 63: Screenshot of xSIM.
Figure 64: DSPxPlore Design Flow.
List of Tables
Table 1: Principle of Resource Allocation Table.
Table 2: Resource Allocation Table including Software Pipeline Technology for increased Usage of Core Resources.
List of Abbreviations
AGU       Address Generation Unit
ALU       Arithmetic Logic Unit
ANSI      American National Standards Institute
ASIC      Application Specific Integrated Circuit
ASIP      Application Specific Instruction Set Processor
BMU       Bit Manipulation Unit
CISC      Complex Instruction Set Computer
CLIW      Configurable Long Instruction Word
CMOS      Complementary Metal Oxide Semiconductor
CPU       Central Processing Unit
DMA       Direct Memory Access
DPG       Data Path Generator
DRAM      Dynamic Random Access Memory
DRM       Digital Radio Mondiale
DSP       Digital Signal Processor
FFT       Fast Fourier Transform
FIR       Finite Impulse Response
FPGA      Field Programmable Gate Array
FSM       Finite State Machine
GOPS      Giga Operations Per Second
GPP       General Purpose Processor
HDL       Hardware Description Language
HLL       High-Level Language
IC        Integrated Circuit
ICU       Interrupt Control Unit
IEEE      Institute of Electrical and Electronics Engineers
ILP       Instruction Level Parallelism
IR        Intermediate Representation
ISA       Instruction Set Architecture
ISO       International Organization for Standardization
ISR       Interrupt Service Routine
ISS       Instruction Set Simulator
LCP       Loop Carry Path
LSB       Least Significant Bit
MAC       Multiply and Accumulate
MSB       Most Significant Bit
MII       Minimum Initiation Interval
MIMD      Multiple Instruction Multiple Data
MIPS      Million Instructions Per Second
MMACS     Million MAC Instructions Per Second
MOPS      Million Operations Per Second
MTCMOS    Multi-Threshold CMOS
NMI       Non Maskable Interrupt
NOP       No Operation
OCE       Open Compiler Environment
OS        Operating System
PCU       Program Control Unit
RAM       Random Access Memory
RISC      Reduced Instruction Set Computer
RTOS      Real-Time Operating System
SIMD      Single Instruction Multiple Data
SiP       System in Package
SJP       Split Join Path
SMT       Simultaneous Multithreading
SoC       System on Chip
SSA       Static Single Assignment
TLB       Translation Lookaside Buffer
TLP       Task Level Parallelism
VHDL      VHSIC Hardware Description Language
VHSIC     Very High Speed Integrated Circuit
VLES      Variable Length Execution Set
VLIW      Very Long Instruction Word
WCET      Worst Case Execution Time
xICU      Scaleable Interrupt Control Unit
xLIW      Scaleable Long Instruction Word
Part I
INTRODUCTION
1 Introduction
The introduction begins with a short description of the motivation for defining and
developing a new DSP core architecture in this thesis. This is followed
by a brief introduction of the chosen methodology. A few sentences then illustrate the
goals of the development project carried out for this thesis before the outline of the thesis is
provided.
1.1 Motivation
Increasing complexity of System-on-Chip (SoC) applications increases the demand for
powerful embedded cores. The flexibility provided by software programmable
cores quite often comes at the price of increased silicon area and increased power
dissipation. Therefore dedicated hardware is favored over software-based platform solutions.
The picture is changing, however, because mask costs are rising significantly with the use of
advanced process technologies and because it is difficult, in a heterogeneous market, to reach
the product volumes that would justify the high non-recurring cost. Together these
elements increase the pressure to develop product platforms. These platforms are used
for a group of applications, so that software executed on programmable core architectures
can be used to differentiate the products.
General purpose processors with a fixed Instruction Set Architecture (ISA) are less well
suited for integration into platforms. To close the gap between dedicated hardware
implementations and software-based solutions requires core architectures which enable
platform-specific and application-specific adaptations.
For embedded Digital Signal Processors (DSP) an additional problem exists. Non-orthogonal
core architectures are preferred because of increased performance and lower area
consumption when mapping DSP algorithms onto a processor. Therefore DSPs are still
programmed manually in assembly language [162]. The drawback of this better usage
of available processor resources is an architecture-dependent description of the algorithms,
which makes changes in the core architecture difficult and costly (due to compatibility
issues) and prohibits application-specific adaptations [113]. Therefore products based on a
programmable core architecture remain with the same architecture for a long time, even if
it is no longer state of the art.
A consequence of using assembly language is long development cycles [174]. Ten years
ago algorithms executed on DSP cores consisted of several hundred lines of code. Manual
coding was reasonable even if minor changes in the application code required several weeks
of coding and verification. Today’s DSP cores are more powerful and enable the execution
of large programs consisting of several hundred thousand lines of code. DSP cores are no
longer used only for filtering operations, most notably in low-cost products where not
more than one core is reasonable and the control code is also executed on the DSP core.
To increase the performance of the DSP subsystems a high degree of parallelism and deep
pipeline structures are introduced. Unfortunately manual programming of highly
parallelized DSP core architectures with deep pipelines, including the resolution of data and
control dependencies, is possible only to a limited extent or not at all. Therefore the
motivation for using assembly code to increase the use of the available resources is no longer valid.
1.2 Methodology
To obtain the definition of a DSP core programmable in a high-level language, and not to
make just another DSP core, the methodology for defining the core architecture has to be
changed. The technical reason, alongside many commercial ones, why
efficient high-level language programming of DSP cores is still not feasible is that
orthogonality has been traded for improved efficiency in terms of area and power consumption.
Orthogonality, however, is the major requirement of the compiler. Since
early DSP cores were regarded as programmable filter structures, the major driving factor for the
architectural features has been the algorithms executed on the cores. Some
features of available core architectures, such as banked register files, mode registers and
complex instruction sets, result from the constraint that the architecture had to be
implementable in hardware with reasonable core performance.
Figure 1 outlines the design methodology used to define the core architecture introduced in
this thesis. The development of an optimizing high-level language compiler has been
considered during the definition of the feature set and the main architectural concepts. In this
respect it differs from already existing core architectures.
Figure 1: Chosen Methodology for Definition of the Core Architecture.
Before an instruction was added to the instruction set architecture (ISA), its suitability with
respect to three aspects, namely algorithm, hardware implementation and software, was verified.
1.3 Goals
To close the gap between dedicated hardware implementations and software-based solutions
a paradigm change is required. The main architectural features of the core subsystem have
to be scalable to enable the definition of an application-specific optimum in terms of area
consumption and power dissipation. To reach this goal it is necessary to consider the DSP
subsystem instead of focusing only on the core architecture. To overcome the software
compatibility issues caused by the scaleable core features, the programming has to take
place in a high-level language (HLL) like C. This enables an architecture-independent
application description. In addition, HLL compilers reduce software development effort and
maintenance costs. Enabling the development of an optimizing HLL compiler that generates
efficient code (where efficient means less than 10% overhead compared with manual
coding) requires restricting the design space for the core architecture. The goal for the core
architecture can be summarized as follows: a scaleable DSP core architecture that meets area
and power targets competitive with hardwired implementations, is suited as a target for
an optimizing C-compiler and is designed for efficient execution of both control code and
loop-centric DSP-specific algorithms.
The proposed approach is to provide an application-specific scaleable DSP core architecture.
To gain the advantage of this approach it is essential to understand the application-specific
requirements on the core architecture. For this purpose a design space exploration
methodology is introduced to analyze the influence of different core configurations on
area consumption (and later on also on power dissipation) for specific application code.
Flexibility and scalability increase verification and validation effort. To keep this effort
reasonably low a unique configuration file is introduced. When parameters are changed, the
current core configuration propagates automatically to the software tools, the VHDL-RTL
description used for generating silicon, and the documentation.
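The idea of a single configuration source feeding several outputs can be sketched in C. This is only an illustration under stated assumptions: the XML format of the configuration file is not shown in this part of the thesis, so the sketch starts from an already parsed parameter value, and the parameter name DATA_REGS as well as both function names are hypothetical. The same value is rendered once as a C preprocessor definition for the software tools and once as a VHDL constant declaration for the hardware description.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helpers: one parameter value, two generated artifacts. */

/* Writes a C preprocessor definition, e.g. for a simulator or compiler build.
 * Returns the number of characters that snprintf would have written. */
int emit_c_define(char *out, size_t n, const char *name, unsigned value)
{
    assert(out != NULL && name != NULL);
    return snprintf(out, n, "#define %s %uu", name, value);
}

/* Writes a VHDL constant declaration for the RTL description.
 * Returns the number of characters that snprintf would have written. */
int emit_vhdl_constant(char *out, size_t n, const char *name, unsigned value)
{
    assert(out != NULL && name != NULL);
    return snprintf(out, n, "constant %s : natural := %u;", name, value);
}
```

Calling both helpers with the same name and value keeps the software and hardware views of a parameter consistent by construction, which is the property the unique configuration file is meant to provide.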
1.4 Outline of Thesis
This thesis consists of two parts: an introductory Part I, structured as outlined below,
followed by Part II, which presents the main research results in 12 publications.
The second chapter introduces DSP-specific architectural features and
system aspects such as worst-case execution time. The first part of the third chapter
briefly discusses the design space of core subsystems, considering area consumption,
performance and power dissipation, followed by architectural alternatives and their
suitability for use in DSP core architectures. The third chapter ends with an
introduction of some commercially available DSP core architectures and a brief illustration
of xDSPcore, the configurable DSP core architecture introduced in this thesis. The
fourth chapter discusses issues concerning high-level language compilers, starting with typical
coding practices used during implementation of algorithms in the field of digital signal
processing. A short introduction to the structure of high-level language
compilers follows. The fourth chapter ends with a discussion of the requirements
of a high-level language compiler and summarizes the architectural requirements
for obtaining efficient compilation results. In the fifth chapter a summary of the publications
provides an overview of the research work and summarizes the author’s contribution. The
sixth and final chapter contains a conclusion with a summary of the results of the project
and provides an overview of future research topics.
2 DSP Specific Features
This section illustrates DSP specific architectural features which differentiate DSPs from
traditional microcontroller architectures. The architectural features are introduced and the
motivation for choosing them is analyzed.
Some of these features also exist in microcontroller architectures, where they are used to
increase the performance of the core when executing algorithms in the field of digital signal
processing [4][7][95].
2.1 Introduction
“DSP is an embedded microprocessor specifically designed to handle signal processing
algorithms cost effectively”, where cost effectiveness means low silicon area and low power
dissipation [102]. To reach this target, algorithm-specific hardware is used to meet the
performance, power and area requirements of digital signal processing. Orthogonality is
ignored where it conflicts with these targets.
Ignoring orthogonal structure leads to highly specialized core
architectures which are programmed manually by experts in assembly language. Developing a
high-level language compiler for the specialized features is costly and requires so-called compiler
known functions and intrinsics to invoke the efficient use of the specialized hardware [37].
As a consequence, algorithm descriptions are not easily portable to different core
architectures.
The alternative is to use pure ANSI-C [109] and not make full use of the specialized
features, which decreases the achievable performance on the core architecture. In 2003 a first
draft of a standard for the embedded-C language was introduced [169]. Based on ANSI-C,
additional standardized enhancements are introduced to make use of the special features
required to implement digital signal processing algorithms efficiently [24]. The advantage
of standardized intrinsics is that compilers for different core architectures are able to
compile the same algorithmic source.
2.2 Saturation
When the result of an operation exceeds the size of the destination register an overflow or
underflow takes place. In signed two’s complement representation the sign bit
changes, which leads to a significant error. As illustrated in Figure 2 the envelope of the signal
is changed significantly and the error caused by the overflow or underflow is severe.
To overcome this problem traditional DSP architectures support saturation circuits. If the
value of a result exceeds the data range of the storage, the highest or lowest value
which can be represented correctly is used instead of the calculated result. The error generated
by the saturation circuit as illustrated in Figure 2 is thus minimal.
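The difference between wrap-around and saturation can be sketched in a few lines of Python. This is only an illustrative model for a 16-bit destination register; the function names wrap16 and saturate16 are chosen for this example and do not refer to any particular core.

```python
INT16_MIN, INT16_MAX = -32768, 32767

def wrap16(value):
    """Keep the low 16 bits and reinterpret them as signed two's complement."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def saturate16(value):
    """Clamp to the largest or smallest value a 16-bit register can hold."""
    return max(INT16_MIN, min(INT16_MAX, value))

exact = 30000 + 5000            # 35000 does not fit into 16 bits
wrapped = wrap16(exact)         # the sign bit flips, the result becomes negative
saturated = saturate16(exact)   # the result is clamped to 32767
```

In this sketch the wrapped result deviates from the exact value by 65536, whereas the saturated result deviates by only 2233.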
In commercially available DSP core architectures three saturation mechanisms are
commonly used.
The first mechanism is based on accumulator registers, which DSP cores support in contrast to
microcontroller architectures with their 16-bit and 32-bit registers. Accumulators are registers
with additional bits called guard bits. For example the SC140 from Starcore LLC [32] supports a
40-bit wide accumulator register. These guard bits allow storage of intermediate results exceeding the
data range, for example 32 bits. To store the final result to data memory the value has to fit
into the data range supported by the data memory port. Therefore the result stored in the
accumulator register has to be evaluated for the required data range. If the value stored in
the accumulator register exceeds the maximum value supported by the data memory port
(indicated by guard bits different to the sign bit) it has to be saturated.
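The guard bit check can be modelled as follows. The sketch assumes a 40-bit accumulator with eight guard bits above a 32-bit data range, as in the example above; the helper names are hypothetical and the model says nothing about how a particular core implements the circuit.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a signed two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def needs_saturation(acc40):
    """True if the guard bits (39..32) differ from the sign bit (bit 31)."""
    raw = acc40 & ((1 << 40) - 1)
    upper = raw >> 31                 # nine bits: eight guard bits plus the sign bit
    return upper not in (0, 0x1FF)    # all zeros or all ones: the value fits

def store32(acc40):
    """Value written to a 32-bit memory port, saturated if required."""
    if needs_saturation(acc40):
        return 0x7FFFFFFF if to_signed(acc40, 40) > 0 else -0x80000000
    return to_signed(acc40, 32)
```

An intermediate result such as 2**31 is held correctly in the accumulator but is clamped to 0x7FFFFFFF when stored.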
Some DSP cores, for example the Blackfin from Analog Devices and Intel [8],
support an additional saturation method. The overflow flag is evaluated to indicate the
necessity of saturation. If an overflow or underflow takes place, e.g. for a 16-bit value, the
result is saturated to fit into a 16-bit destination register.
The third mechanism uses a saturation mode where the computation results are matched to
the allowed word length after each arithmetic operation [21].
[Plot: original signal, saturated signal and signal overflow over samples 1 to 20; value axis from -10000 to 40000]
Figure 2: Principle of Saturation.
Figure 2 illustrates three signals. The solid line shows the original signal flow without
limitations. The dotted line illustrates the signal flow for the original signal when the data
range of the destination register is exceeded, which leads to signal flipping. The dashed line
represents the saturated signal. When the original signal exceeds the data range, in this
example that of a 16-bit register, the result is saturated and the largest possible value is stored.
The generated error is thus kept minimal compared with the flipping signal.
2.3 Rounding
DSP core architectures support different rounding modes. In this subsection the two
rounding modes available in most commercial core architectures are introduced. To
illustrate the rounding modes 40-bit accumulator registers are used. Commercial core
architectures define the rounding modes only for rounding 32-bit values to 16-bit values.
The guard bits remain.
2.3.1 Two’s Complement Rounding
Two’s Complement rounding is also called the round-to-nearest technique [116]. If the value of
the lower half of the data word is greater than or equal to half of the LSB of the resulting
rounded word, the value is rounded up; all values smaller than half are rounded down.
Because the halfway case is always rounded up, statistically a small positive bias is introduced.
In Figure 3 the Two’s Complement rounding is illustrated. Independent of the Least Significant
Bit (LSB) of the high word, one is added as soon as the lower half reaches half of that LSB,
which causes the positive bias.
Figure 3: Two's Complement Rounding (Motorola 56000 family).
2.3.2 Convergent Rounding
A slightly improved rounding methodology is convergent rounding, also called round-to-nearest even number [116]. The above discussed bias caused by the decision at the halfway point is
compensated by rounding down if the high portion is even and rounding up if the high
portion is odd. In Figure 4 convergent rounding is illustrated.
Figure 4: Convergent Rounding (Motorola 56000 family).
In contrast to Two’s Complement Rounding, in the halfway case the addition of one is only
done if the LSB of the high word is equal to one.
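Both rounding modes can be compared with a small Python model that rounds a 32-bit value to its upper 16 bits. The function names are chosen for this illustration only, and overflow of the rounded result is not handled in this sketch.

```python
HALF = 0x8000  # half of the LSB of the high word

def round_twos_complement(value32):
    """Round to nearest; the halfway case is always rounded up."""
    return (value32 + HALF) >> 16

def round_convergent(value32):
    """Round to nearest; the halfway case is rounded to the even high word."""
    high, low = value32 >> 16, value32 & 0xFFFF
    if low > HALF or (low == HALF and high & 1):
        high += 1
    return high
```

For the halfway value 0x00028000 the first function returns 3 while the second returns 2; for 0x00038000 both return 4. Away from the halfway point both modes give the same result.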
2.4 Fixed-Point, Floating-Point
DSP core architectures can be divided into those supporting fixed-point arithmetic and
those supporting floating-point arithmetic. Floating-point DSP architectures mostly also support
integer fixed-point arithmetic, for example for address calculation. Floating-point data representation uses a combination of significand and exponent:
value = significand \cdot 2^{exponent}
Fixed-point representation is chosen for most of the available core architectures, especially
those used as embedded cores for SoC or SiP applications. Algorithm development for
fixed-point arithmetic requires more care, but the hardware implementation consumes less
power and area and is therefore favored.
Fixed-point numbers are also called fractional representation, with integer as a special case.
The place of the virtual binary point in the data word determines the number of integer and
fraction bits in the word. In integers the point is to the right of the LSB whereas in fractional
numbers the point is right after the sign bit. Some operations like addressing and control
code functions are inherently of type integer. Filtering operations on the other hand make use
of fractional data representation.
The difference is illustrated in Figure 5 for a 40-bit wide accumulator register. The common
fixed-point format used for DSP cores is S[15.1], where the S stands for signed [172], and
scales the data in the range -1 ≤ X < 1. The radix point is located on the left side of the
register between the sign bit and the next bit. The bits located to the right of the radix
point encode the fraction.
The guard bits as mentioned before are used to store intermediate results exceeding the data
range of the destination register. For the example in Figure 5 eight guard bits are supported.
This allows a data range of -256 ≤ X < 256. Integer can be interpreted as a special case of
fractional where the radix point is located at the right end of the register because no fractional
values are supported.
Figure 5: Integer versus Fractional Data Representation.
The advantage of using fractions for traditional DSP algorithms like filtering is that when
the number of bits available for representing a value is reduced (e.g. during a rounding
operation) the accuracy changes but the value remains correct. Integers are used for control
code and address generation. In [172] definitions of accuracy, data range and different
variants of fractional data representation can be found.
With the exception of multiplication there is no difference between fractional and integer data
types concerning the hardware implementation of arithmetic functions.
The multiplication of fractional values requires a left shift by one bit and a setting of the LSB
to zero to correct the position of the radix point. In Figure 6 the required shift for signed
multiplication is illustrated.
Figure 6: Fractional Multiplication Including Left Shift.
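The left shift can be reproduced with a minimal Python sketch. It assumes 16-bit fractional operands with one sign bit and 15 fraction bits and a 32-bit fractional product with 31 fraction bits; the names mul_fractional and to_signed are hypothetical.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a signed two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def mul_fractional(a, b):
    """Multiply two 16-bit fractional values and return a 32-bit fractional product."""
    product = a * b                       # 30 fraction bits and a doubled sign bit
    return to_signed(product << 1, 32)    # shift left by one; the LSB becomes zero

half = 1 << 14                            # 0.5 with 15 fraction bits
quarter = mul_fractional(half, half)      # 0.25 with 31 fraction bits
```

Without the shift the product of 0.5 and 0.5 would be read as 0.125. In this model the single case -1 times -1 wraps around to -1, which is why such a multiplication is usually combined with saturation.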
2.5 Hardware Loops
Filter algorithms, which are often executed on DSPs, are loop-centric. The code example in
Figure 7 represents a Finite Impulse Response (FIR) filter [103]. The filter kernel takes only
one clock cycle. It loads the operands (including address calculation) and calculates one
filter tap. Software pipelining is used to compensate for the load-in-use dependency caused by
split execution, which is illustrated in a later section.
ld (r0)+, d0 || ld (r1)+, d1
ld (r0)+, d0 || ld (r1)+, d1 || rep n
mac d0,d1,a4
mac d0,d1,a4
Figure 7: Assembly Code Example for Finite Impulse Response (FIR) Filter.
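As a reference for what the assembly kernel computes, the following Python sketch calculates one FIR output sample. It is a functional model only and ignores word lengths, the pipeline and the parallel loads; the function name is chosen for this example.

```python
def fir_output(samples, coefficients):
    """Sum of products over all filter taps for one output sample."""
    acc = 0                                # plays the role of accumulator a4
    for x, c in zip(samples, coefficients):
        acc += x * c                       # one tap, corresponding to one mac d0,d1,a4
    return acc

y = fir_output([1, 2, 3], [4, 5, 6])       # 1*4 + 2*5 + 3*6
```

Each loop iteration corresponds to one multiply and accumulate operation on a freshly loaded pair of operands.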
Executing the same algorithm on a traditional microcontroller architecture leads to
additional instructions and clock cycles necessary for loop handling. Traditional
microcontrollers do not support single Multiply and Accumulate (MAC) instructions and
therefore require at least two instructions (multiply and accumulate) to calculate a filter tap.
Single-issue microcontroller architectures do not provide enough hardware resources to
make use of software pipelining. Loop handling consists of setting the loop counter once,
decrementing the loop counter with each loop iteration and evaluating the end of loop
condition continuously. If no explicit loop instruction is available, conditional branch
instructions are required to implement the jump back to the loop start.
To implement loop constructs more efficiently DSP core architectures support zero-overhead loop instructions. The loop is invoked so that the loop length and the number of
iterations are part of the loop instruction. The remaining loop handling like decrementing
the loop counter and jumping back to the loop start with each loop iteration is implicitly handled in
hardware. Branch delays caused by regular branch instructions are compensated.
2.6 Addressing Modes
This sub-section introduces different addressing modes supported by DSP core architectures
[116], some of which also exist in traditional microcontroller architectures.
2.6.1 Implied Addressing
For Implied Addressing the addresses of the source operands are implicitly coded in the
instruction word. Examples of implied addressing can be found in earlier core
architectures like the Lucent 16xx core family, where the multiplier source operands are
located in two registers X and Y. Even if the assembler syntax for the multiplication
contains explicitly named registers (X, Y as illustrated in Figure 8), the multiplication does
not allow different registers to be assigned.
P = X*Y
Figure 8: Example for Implied Addressing: Multiply Operation (Lucent 16xx).
Similar examples can be found in later DSP architectures like the Starcore SC140, where
instructions like MAX2VIT use implied addressing [26]. Two register pairs
are supported and selected by a mode bit. An example is illustrated in Figure 9.
MAX2VIT D2, D4
Figure 9: Example for Implied Addressing: MAX2VIT Instruction (Starcore SC140).
Implied addressing can be used to increase code density but restricts the use of the implied
registers during register allocation for other instructions.
2.6.2 Immediate Data Addressing
Immediate Data Addressing is used for operations where the operand is part of the
instruction word. Examples for immediate addressing can be found in most core
architectures supporting the preload of register values with immediate data, e.g. move
constant (movc) instruction of xDSPcore illustrated in Figure 10.
MOVC 27,D4
Figure 10: Example for Immediate Data Addressing: MOVC instruction (xDSPcore).
2.6.3 Memory Direct Addressing
Memory Direct Addressing is also called absolute addressing. The address where data has
to be fetched from or stored to is part of the instruction word. This is the main limiting factor
for the use of absolute addressing: the reachable address space is limited by the
available coding space in the instruction word or instruction words.
2.6.4 Register Direct Addressing
Register Direct Addressing is used for instructions which receive their operands from
registers which are addressed as part of the instruction. The difference to implied addressing
is that the registers are explicitly coded inside the instruction word which allows assignment
of different registers to the same instruction during register allocation.
SUBF R1, R4
Figure 11: Example for Register Direct Addressing: Subtraction (TI C62x).
In Figure 11 an example of TI C62x is illustrated [36]. The two operand subtract instruction
allows an assignment of different source and destination operands to the same subtraction
instruction.
2.6.5 Register Indirect Addressing
Register Indirect Addressing and its variants as explained in the following subsections are
quite often used for algorithms executed on DSP cores. The core architectures
support registers which contain memory addresses and can be used for accessing
memory entries. The memory addresses can be stored in specialized address registers only
used for this purpose or in general purpose registers. These general purpose registers can
also be used for other operations. Large address spaces can be addressed with less
coding effort, which has significant influence on code density. The principle of register
indirect addressing is introduced in Figure 12.
Figure 12: Principle of Register Indirect Addressing.
Register Indirect Pre/Post Addressing
The addressing mode Register Indirect can be used with a Pre/Post Addressing option as
illustrated in Figure 13. Algorithms executed on DSP architectures typically process
blocks of data and therefore consecutive addresses are used. The post-operation mode
accesses a memory address and afterwards increments or decrements the address
stored in the address register. The pre-operation mode accesses a data memory
location with the already updated address.
Figure 13: Principle of Pre/Post Operation Mode.
The value for incrementing or decrementing the address located in the address registers can
be one (or equal to the granularity of the addressed memory space) or for some core
architectures a programmable offset.
LD (R0+), D0 … pre-operation
LD (R0)+, D0 … post-operation
Figure 14: Assembly Code Example for Pre/Post Increment Instructions (xDSPcore).
Pre-operation requires an additional clock cycle for address calculation in many DSP core
architectures. xDSPcore supports both modes without requiring an additional clock cycle.
The related assembly code example is shown in Figure 14.
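The difference between the two modes can be made explicit with a small behavioural model in Python. The class and function names are hypothetical, the memory is modelled as a plain list, and the step of one stands for the granularity of the addressed memory.

```python
class AddressRegister:
    """Holds a memory address and the step used to update it."""
    def __init__(self, address, step=1):
        self.address = address
        self.step = step

def load_post(memory, reg):
    """Post-operation: access first, then update the address register."""
    value = memory[reg.address]
    reg.address += reg.step
    return value

def load_pre(memory, reg):
    """Pre-operation: update the address register first, then access."""
    reg.address += reg.step
    return memory[reg.address]

memory = [10, 20, 30]
r0 = AddressRegister(0)
first = load_post(memory, r0)    # reads address 0, r0 then points to 1
second = load_pre(memory, r0)    # r0 moves to 2, then address 2 is read
```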
Register Indirect Addressing with Indexing
For Register Indirect Addressing with Indexing the content of two address registers is added
and the result is used for addressing data memory locations. The difference to the above
introduced pre/post modification addressing scheme is that none of the register values is
modified.
Two reasons favor this addressing mode. First, Register Indirect Addressing with Indexing allows
the use of the same program code with different data sets. Between different data sets only
the index register value has to be set to the start address of the new block of data.
The second reason is its use by compilers for communicating arguments to subroutines by
passing data via the stack. One address register is assigned as stack frame pointer. This
means that the subroutine does not have to know the absolute addresses. The transferred
arguments are located relative to the stack frame pointer.
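The first reason can be illustrated with a short Python sketch in which the same routine processes two data blocks and only the index value changes. The names load_indexed and block_sum are hypothetical and the memory is modelled as a list.

```python
def load_indexed(memory, base, index):
    """Access memory[base + index]; neither register value is modified."""
    return memory[base + index]

def block_sum(memory, index, length):
    """Same code for any data block: only the index selects the block."""
    return sum(load_indexed(memory, offset, index) for offset in range(length))

memory = [1, 2, 3, 4, 10, 20, 30, 40]
first = block_sum(memory, 0, 4)     # block starting at address 0
second = block_sum(memory, 4, 4)    # block starting at address 4
```

The same pattern applies to stack frames: the subroutine uses fixed offsets and the stack frame pointer supplies the start address.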
Register Indirect Addressing with Modulo Arithmetic
Modulo arithmetic can be used for implementing circular buffers [171]. The data values as
illustrated in Figure 15 [78] are located on consecutive addresses in data memory. If the
address pointer reaches the end of the circular buffer, specialized hardware circuits are used
to reset the pointer to the start address.
begin = n 2ld ( m ) 
n , m∈ N
=8n
end = begin + m
Figure 15: Principle of Using a Modulo Buffer for Address Generation.
This implicit boundary check reduces the effort for manual control of the buffer addressing.
Separate modulo registers are supported to store the size of the chosen buffer.
Some commercially available core architectures support circular buffer addressing with a
defined start address aligned to the size of the supported buffer, e.g. a circular buffer with
a buffer size of 256 can start at the addresses 0, 256, 512 and so on. The drawback of this
implementation is fragmented data memory.
To overcome the fragmentation problem some core architectures like the SC140 [32] or
Carmel [12] support a programmable size and start address of the circular buffer, which
requires an additional base address register and an additional adder circuit for address
calculation.
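The two variants of circular buffer addressing can be contrasted with a small Python model. It is a sketch under simplifying assumptions: the pointer is incremented by one, the aligned variant requires a buffer size that is a power of two, and the function names are chosen for this example.

```python
def next_aligned(address, size):
    """Start address aligned to size (a power of two): only the low bits wrap."""
    mask = size - 1
    return (address & ~mask) | ((address + 1) & mask)

def next_based(address, base, size):
    """Programmable start address and size: needs a base value and an extra addition."""
    return base + (address - base + 1) % size
```

With a buffer of size 8 starting at 256, next_aligned(263, 8) returns 256. With a buffer of size 5 starting at 100, next_based(104, 100, 5) returns 100, which the aligned variant cannot express.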
Register Indirect with Bit Reversal
The Register Indirect with Bit Reversal addressing mode is also called reverse carry
addressing. This address mode is only used during execution of FFT algorithms. FFT
algorithms have the drawback that they either take their input values or produce their output
values in scrambled order. To complicate matters further the scrambling depends on the
particular version of the FFT algorithm [103].
In Figure 16 [78] an example is illustrated. The lower bits of the generated addresses are
mirrored, which produces the scrambling of the addresses required by FFT algorithms [116].
Figure 16: Principle of the Bit Reversal Addressing Scheme.
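The mirroring of the lower address bits can be sketched in Python as follows. The function name is hypothetical and the example assumes an 8-point FFT, so three address bits are mirrored.

```python
def bit_reverse(index, bits):
    """Mirror the lowest `bits` bits of index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)   # move the current LSB to the other end
        index >>= 1
    return result

order = [bit_reverse(i, 3) for i in range(8)]
```

For eight elements the resulting access order is 0, 4, 2, 6, 1, 5, 3, 7. Applying the function twice returns the original index.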
2.6.6 Short Addressing Modes
Code density is a significant factor influencing the consumed silicon area of the core
subsystem. Many of the above described addressing schemes use two instruction words for
addressing to store the immediate or offset values. To increase code density DSP core
architectures support instructions with small immediate values which allow the coding of
them into one instruction word.
Short Immediate Data
One example is short immediate data addressing, where a constant which is part of the
instruction word can be stored in a register. Restricting the data range of the constant to a
reduced value allows only one instruction word to be used. Figure 17 illustrates an example
supported by xDSPcore.
MOVC 0, d0
MOVCL 1234, d0
Figure 17: Assembly Code Example for Short Immediate Data (xDSPcore).
The short version of the instruction supports a data range for the constant of –32 ≤ constant
< 32. For assigning constants exceeding this range a second instruction with the same
function is introduced which uses an additional instruction word for storing the
immediate value.
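The choice between the two instruction variants, as an assembler or compiler might make it, can be sketched as follows. The range is taken from the text and the mnemonics from Figure 17; the function name and the selection logic are assumptions made for this illustration.

```python
SHORT_MIN, SHORT_MAX = -32, 32   # -32 <= constant < 32 fits the short form

def select_move_constant(constant):
    """Return the mnemonic and the number of instruction words needed."""
    if SHORT_MIN <= constant < SHORT_MAX:
        return "MOVC", 1          # constant fits into the instruction word
    return "MOVCL", 2             # additional instruction word holds the constant

small = select_move_constant(0)
large = select_move_constant(1234)
```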
Short Memory-Direct Addressing
As mentioned before the use of memory direct addressing or absolute addressing is limited
by the required coding space. However some core architectures support this address mode
within one instruction word. The address mode can then be used in combination with
special features, as for example in the Motorola 56000 family [20][21] where I/O registers can
be addressed with this address mode. The small offset which can be placed in one
instruction word (e.g. 6 bits in this example) is extended inside the core to a physical
address in the 64k byte address space.
Paged Memory-Direct Addressing
This address scheme splits the available address range into address pages. A reduced coding
space can be used to access addresses in the page once the page is set. This allows the short
version of the addressing schemes to be used. The overhead for the paging mechanism is
not negligible and the addressing schemes can only be used to increase code density if the
executed algorithm allows data to be mapped to pages. Changing the memory page requires
additional instructions (with influence on code density) and additional execution cycles.
2.7 Multiple Memory Banks
Traditional algorithms executed on DSP architectures are data flow algorithms. For
example data values describing a signal are fetched, processed by digital signal processing
algorithms and then stored into data memory.
The implementation of filter algorithms based on MAC instructions as illustrated in Figure
7 requires fetching two data values for the multiply operation. The summation of the
multiplication results takes place in a local register. To perform the two independent data
fetch operations in parallel at least two independent memory ports are required.
Figure 18: Processor Architectures: von Neumann, Harvard, modified Dual-Harvard.
Figure 18 illustrates the principles of the von Neumann architecture with a combined data and
program memory, the Harvard architecture where data and program memory are split, and the
modified Dual-Harvard architecture with two independent data memory ports
[96][113][149]. Some core architectures, for example Carmel [10], support the fetching of up
to four independent data values, which can be used to increase the execution speed of filtering
or FFT algorithms.
Some commercial DSP cores like the SC140 [32] from Starcore LLC feature one address
space for data and program memory, which eases the transfer of data between data and
program memory. Others, including xDSPcore, feature separate address spaces.
The X/Y memory splitting as used for OAKDSP [29] is well suited if the two fetched
operands are located in two different memory spaces (e.g. for the example in Figure 7). If
the fetched operands are located in the same address space the memory operations have to
be serialized which will lead to a reduced system performance.
The Starcore SC140 [32] features interleaved addressing which can be used to reduce the
probability of memory hazards. In Figure 19 the memory mapping is illustrated. The chosen
concept makes use of an implementation aspect. The performance of memory operations is
limited, especially when considering large memory blocks. Therefore the physical
memory implementation is split into small memory blocks which reach higher clock
frequencies.
Figure 19: Example for Interleaved Memory Addressing (SC140).
The small memory blocks can be accessed separately as illustrated in Figure 19 where 4k
physical memory blocks are supported.
2.8 CISC Instruction Sets
CISC instructions are built up of several micro-instructions. The Multiply and Accumulate
(MAC) [97] instruction introduced in Figure 20 is used as an example for CISC instructions.
Two data values are required for multiplication.
A third operand is required for accumulation with the multiplication results. The result of
the accumulation operation is stored in an accumulator register, the same used as the third
source operand. The example in Figure 20 also illustrates the additional left shift operation
required for multiplying fractional data values.
Figure 20: Example for CISC Instructions: Multiply and Accumulate (MAC).
If the core architecture is based on load/store then the operands are fetched from a register
file. For direct-memory architectures the operands have to be fetched from data memory.
This requires memory addressing coded in the instruction word, thus increasing the
complexity of the MAC instruction. Some core architectures support rounding and
saturation logic as part of the MAC instruction.
Driven by compiler requirements modern DSP core architectures feature RISC instruction
sets with some CISC extensions for increasing code density and performance during the
execution of filtering and FFT algorithms.
2.9 Orthogonality
In the definition of a DSP core one of the major arguments was that orthogonality
aspects are ignored for the sake of increased efficiency in terms of silicon area and power
dissipation. Nevertheless the feature Orthogonality is mentioned at least once in white papers,
product briefs or even technical documentation of DSP vendors, often in combination with
instruction set or core architecture [14][35].
In [116] Orthogonality is defined as “to which extent a processor’s instruction set is
consistent”. It is also mentioned that Orthogonality is not so easily measured. Besides the
aspect of instruction set consistency the degree that the operands and addressing modes are
uniformly available for operations is used as a measure for orthogonality.
Examples for missing orthogonality can be found in existing core architectures e.g. the
address registers of the Motorola 56k family [20][21] which are banked. Four of the eight
registers are assigned to one Address Generation Unit (AGU) and the remaining four to the
second AGU which limits register allocation and instruction scheduling. The SC140 [32]
does not allow the higher eight address registers to be used during modulo addressing
because the higher address registers are then used as base registers.
In the following some more examples for missing orthogonality are illustrated.
Reduced Number of Operations
Reducing the number of instructions relaxes the pressure on the coding of the instruction set. An
example is the rotate instruction, which is missing from the Lucent 16xx architecture [19].
Reduced Number of Addressing Modes
Providing all of the known addressing modes for all instructions demands a lot of coding
effort inside the instruction set. To increase code density the core architectures only support
a subset of the addressing modes and restrict their use to a group of instructions.
Reduced Number of Source/Destination Operands
Allowing orthogonal use of registers by each instruction requires a large coding space and
decreases code density. Therefore core architects limit the use of some of the registers to
specialized functions, for example the MAX2VIT instruction of the SC140 [32].
Use of Mode Bits
Most commercial DSP cores make use of mode bits. Depending on the mode indicated by
mode bits the meaning of an instruction is changed. Mode bits are often used for specialized
addressing modes or saturation and rounding modes. The advantage of increased code
density can be outweighed by limitations during register allocation and instruction
scheduling. In a later section the problem is discussed in detail.
2.10 Real-Time Requirements
Real-time requirements form the last aspect where digital signal processing algorithms have
specific requirements and influence the architecture of Digital Signal Processors. Analyzing the micro-architectural improvements in microcontrollers over recent years, it is apparent that
most of the improvements have taken place in cache structures. Cache structures are well
suited to reduce the average execution time of an algorithm.
A similar development has taken place in DSP cores, for example in the SC140 of Starcore
LLC [32] or in the Blackfin from Analog Devices and Intel [8]. Cache structures have been
introduced for data and program memory. The drawback of introducing caches is that a
strong requirement of real-time applications is lost: minimizing the worst case execution
time.
The purpose of Worst Case Execution Time (WCET) analysis is to determine a priori
the worst case execution time of a piece of code. WCET is used in real-time and
embedded systems to perform scheduling of tasks, to determine whether performance goals
for periodic events are met, and also to analyze for example interrupts and their response time
[80]. The main influence on execution time comes from program flow aspects like loop
iterations and function calls and from architectural features like pipeline structures and cache
architectures [80].
In the area of research several algorithms and tools for analyzing the WCET of application
code have been introduced [81][91][93]. The program flow analysis for this purpose can be
split into a global low-level analysis and a local low-level analysis.
The global low-level analysis considers the effect of architectural features like data [111]
[170], instruction cache structures [47][83][94][124][154] and branch prediction [65]. These
analyses determine only global effects but do not generate any actual execution time values.
The local low-level analysis handles effects caused by single instructions and their neighbor
instructions, for example pipeline effects [79][146][155] and the influence of memory
accesses on the execution time. The influence of caches on the WCET is significant as
discussed in [50][119][127][135][168].
If core architectures support instructions whose latency depends on the input
values, for example the multiplication instruction of ARM [5] whose execution time can
differ between 1 and 4 clock cycles, then the calculation of the WCET is more complicated.
The multiplication of the PowerPC 603 [30][57][58] can even consume between 2 and 6 clock
cycles depending on the source operands. In the Alpha 21064 [16][17] the execution time of
a software division algorithm varies between 16 and 144 clock cycles, which implies
a ratio of 1:9.
In [144] the contribution of different architectural features to the variation in the execution
time and therefore to the uncertainty in the WCET analysis is illustrated. The largest impact
arises from Translation Lookaside Buffer (TLB) accesses followed by data and instruction
caches. The influence of instruction execution compared with these dominating aspects is
negligible [74].
To summarize, caches and prediction algorithms are counterproductive for fulfilling real-time
requirements and therefore for minimizing the worst-case execution time. The requirements
for developing an optimizing compiler are similar: simple issue rules and
architectures with few restrictions are preferred as they allow more accurate results.
3 DSP cores
This section starts with an introduction of the design space of DSP core architectures and the
main parameters influencing their design, and illustrates the limiting parameters
which cause the gap between theoretical and practical performance. The second part
introduces some architectural alternatives and discusses their advantages and disadvantages.
The third part describes commercially available core architectures, starting with cores from
the early 1990s up to the latest announcements. This chapter ends with a brief introduction
of xDSPcore.
3.1 Design Space
This section introduces the possible design space for RISC based core architectures. Today
most DSP core architectures are RISC based load-store architectures. The trade-offs
between the main architectural features considering silicon area, performance and power
dissipation are briefly illustrated by some examples. The design space of xDSPcore and the
possibilities to influence these parameters by configuration settings can be found in
[P10][P11].
The purpose of this section is to illustrate the complexity of choosing the “best core” and to
show that there is no general solution [13]. A DSP core is well suited when solving an
application-specific problem efficiently in terms of consumed silicon area and power
consumption. However it also has to be considered that the overall application partitioning
has a significant influence on the costs and that the costs of a product are not only caused by
silicon production and packaging. Software development costs, maintenance and portability
significantly contribute to the costs of SoC and SiP solutions.
3.1.1 Silicon Area
This subsection introduces the main contributors to the silicon consumption of a core
subsystem, with special focus on DSP architectures. Besides the core and the memory
subsystem, the instruction set architecture (ISA) is chosen as an example.
This example shall illustrate the complexity and the mutual influence of these aspects.
Core
Increasing system complexity has led to large programs being executed on core
architectures. The contribution of the core area to the die area of the core subsystem then
becomes insignificant. Nevertheless, this key number is still taken as a decision point for
choosing one particular core. The increasing complexity of modern silicon technologies
makes such a comparison even more difficult. Figures such as the
core area in mm² require additional information like the chosen technology, silicon foundry,
number of metal layers, temperature range and supply voltage.
Memory Subsystem
The increasing size of programs executed on embedded core architectures leads to an
increasing importance of memory subsystems to the area contribution. Therefore the
importance of code density with influence on the program memory area has increased. In
the following item the instruction set is taken as an example to illustrate the influence on
core and memory subsystems.
Instruction Set
The instruction set of a core architecture can be split into two aspects: the instruction set
architecture and the related binary coding. The instruction set is taken as an example to
illustrate the cross coupling of different subsystem features. Further examples can be found
in the design space discussion of [P10] and [P11].
The instruction set architecture mirrors the functionality supported by the core architecture.
Examples are the support of two- or three-operand instructions, features like addressing modes
and saturation modes, and complex instructions like division.
Instructions and the related binary coding are necessary to program the available units. The
mapping of the instruction set architecture to instructions must consider micro-architectural
aspects. It will be difficult to map the ISA onto the native instruction word if the native
instruction word size is 16 bits, the ISA is required to support three-operand
instructions and each operand requires a 4-bit register coding. In this case it is necessary to
map the three-operand instructions onto two instruction words or to increase the size of the
native instruction word.
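The bit budget behind this example can be checked with a short calculation. The following Python sketch is only an illustration: the function names are hypothetical, and it assumes that every operand is coded as a plain register field and that all remaining bits are available for the op-code.

```python
def opcode_bits_left(word_bits: int, operands: int, reg_bits: int) -> int:
    """Bits remaining for the op-code after all register fields are coded."""
    return word_bits - operands * reg_bits


def distinct_opcodes(word_bits: int, operands: int, reg_bits: int) -> int:
    """Number of distinct op-codes that fit into the remaining bits."""
    left = opcode_bits_left(word_bits, operands, reg_bits)
    return 2 ** left if left >= 0 else 0


# Three operands with 4-bit register coding, as in the example above.
for width in (16, 20):
    print(width, opcode_bits_left(width, 3, 4), distinct_opcodes(width, 3, 4))
```

Under these assumptions a 16-bit word leaves four op-code bits for three-operand instructions, while a 20-bit word leaves eight. In a real coding the remaining bits also have to be shared with all other instruction formats.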
Figure 21: Influence of Binary Coding on Application Code Density (using the same ISA).
In Figure 21 the influence of the chosen binary coding is illustrated. The same ISA is once
mapped onto 16-bit wide instruction words and once onto 20-bit wide instruction words. To
illustrate the influence on code density, a piece of traditional control code is used, for
example some PC benchmarks [18]. The results in Figure 21 show that the shorter native
instruction word requires an increased number of long-words which are simply additional
instruction words for the identical instruction. This is reasonable because the coding space
for immediate values and offsets is reduced in 16-bit wide native instructions. The overall
code density for this example, normalized in bytes, is improved by 16 % when using the 16-bit
native instruction words; however, the result will be different for other application code
examples.
3.1.2 Performance
The performance of DSP cores is measured in Million Instructions Per Second (MIPS) or
Million Operations Per Second (MOPS) [25]. MOPS was introduced when multi-issue core
architectures appeared on the market. These numbers are calculated by multiplying the
reachable core frequency by the number of instructions executable in parallel. This led to
announcements like the Texas Instruments C64x [39] with 8 GOPS (eight
instructions executed in parallel multiplied by 1 GHz clock frequency).
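The calculation behind such announcements can be written down in a few lines. The Python sketch below uses hypothetical function names; the average ILP of 2.5 in the second call is an assumed example value, not a measured one.

```python
def peak_mops(clock_mhz: float, instructions_per_cycle: int) -> float:
    """Theoretical peak in MOPS: clock frequency times parallel issue slots."""
    return clock_mhz * instructions_per_cycle


def sustained_mops(clock_mhz: float, average_ilp: float) -> float:
    """MOPS actually reached when only average_ilp slots are filled per cycle."""
    return clock_mhz * average_ilp


print(peak_mops(1000, 8))         # 1 GHz, eight issue slots
print(sustained_mops(1000, 2.5))  # same clock, assumed average ILP of 2.5
```

The first value corresponds to the 8 GOPS quoted above. The second shows how far the figure drops as soon as the code does not fill all issue slots.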
Berkeley Design Technology, Inc. (BDTi) introduced the so-called BDTi benchmark suite,
containing a dozen algorithmic examples. Most of these are based on small loop-centric
kernels for filtering and vector operations, whereas other examples include an FFT, a Viterbi
implementation and a control code example. Certain coding requirements restrict the
implementation in order to simplify comparison between different core architectures. These
small kernels are often not representative for application code executed on DSP cores.
Another possibility to measure performance is counting the Million MAC instructions per
second (MMACs). However, during execution of control code, for example, the number of
executable MAC instructions does not significantly influence performance. Micro-architectural
limitations (e.g. as illustrated in Figure 24) further reduce the accuracy of this
performance factor for a mixture of DSP and control code.
Theoretical versus Practical Performance
The example of Texas Instruments can be used to illustrate the term theoretical
performance: the theoretical performance of the 1 GHz TI C62x is stated as 8 GOPS
[37]; another example can be found in [133]. The practical performance is a measure of how
efficiently a certain algorithm can make use of the resources provided by the core
architecture. Some of the factors limiting the reachable practical performance are
introduced in this sub-section to illustrate the gap between theoretical and practical
performance.
Define-in-use Dependency
One way to increase the number of MIPS and MOPS is to increase the reachable clock
frequency of the core architecture. An increase of the possible clock frequency can be attained
on a technological level by a smaller feature size and on an architectural level by increasing the
number of pipeline stages. This leads to super-pipelined architectures with 10 pipeline
stages and more. Increasing the number of pipeline stages during the execution phase
increases the define-in-use dependency [149] as illustrated in Figure 22. The five-stage
pipeline of Figure 22 supports split execution, where two clock cycles are used for
a calculation, e.g. one MAC instruction. The operands are read at the beginning of EX1 and
the result is written at the end of stage EX2. Filtering operations, for example the FIR filter
illustrated in Figure 7, require consecutive MAC instructions for a cycle-efficient
implementation. Because the result of the first MAC instruction is a source
operand of the second MAC instruction, a NOP cycle is required to prevent data hazards
[149]. For the core architecture of Figure 22, which features a lean pipeline structure, the
additional NOP cycle is acceptable. However, the TI C62x [36] provides an eleven-stage
pipeline containing 5 execution stages, where the define-in-use dependency increases
significantly. One method to compensate for this problem is the use of bypass circuits
(bypassing intermediate results to the next instruction).
Figure 22: Principle of Define-in-use Dependency.
The xDSPcore architecture, which utilizes the pipeline structure of Figure 22, allows the
fetching of the accumulator operand for the MAC instruction at the beginning of EX2 (as
illustrated in Figure 58), which compensates the define-in-use dependency when executing
filter operations.
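The number of NOP cycles caused by a define-in-use dependency can be estimated with a simple model. The Python sketch below is an illustration under stated assumptions: no bypass circuits, the producer writes its result at the end of execution stage write_stage, and the consumer reads the dependent operand at the beginning of execution stage read_stage (both 1-based). The function names are hypothetical.

```python
def define_in_use_nops(write_stage: int, read_stage: int = 1) -> int:
    """NOP cycles needed between a producer and a directly dependent consumer."""
    if write_stage < 1 or read_stage < 1:
        raise ValueError("stage numbers are 1-based")
    return max(0, write_stage - read_stage)


def cycles_for_mac_chain(n_macs: int, write_stage: int, read_stage: int = 1) -> int:
    """Issue cycles for n_macs MACs that each depend on the previous result."""
    if n_macs <= 0:
        return 0
    return n_macs + (n_macs - 1) * define_in_use_nops(write_stage, read_stage)


print(define_in_use_nops(2, 1))  # read in EX1, write in EX2
print(define_in_use_nops(2, 2))  # accumulator fetched at the beginning of EX2
print(define_in_use_nops(5, 1))  # hypothetical: result written in a fifth stage
```

In this model the two-stage execution of Figure 22 needs one NOP between dependent MAC instructions, and fetching the accumulator at the beginning of EX2 removes it. The third call is a hypothetical case in which a result only becomes available after a fifth execution stage.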
This example shall illustrate that increasing the reachable clock frequency by adding
pipeline stages leads to an increased theoretical performance (due to relaxing of the critical
implementation path and by reaching a higher clock frequency) but data and control
dependencies in the application code can limit the increase of practical performance.
Load-in-use Dependency
A similar problem is the load-in-use dependency [149] as illustrated in Figure 23. To relax
the timing at the data memory ports additional pipeline stages are introduced for memory
access. The execution of instructions dependent on the fetched data entries has to be
delayed until the memory access has been finished.
Figure 23: Principle of Load-in-use Dependency.
In contrast to the define-in-use dependency, bypass circuits cannot be used to partly
compensate the load-in-use dependency. It can cause a significant mismatch
between theoretical and practical performance, especially during the execution of control
code featuring short branch distances.
Data Memory Bandwidth
The data memory bandwidth of a core architecture is characterized by the number of
load/store instructions executed in parallel, the size of the data memory ports and the
structure of the access. The structure of the access considers alignment requirements at the
data memory port and the number of independent addresses which can be generated and
accessed each clock cycle. To prevent the practical performance from falling behind
the theoretical performance, the necessary operands for each executed instruction
have to be provided.
Figure 24: Example for Data Memory Bandwidth Limitations (Starcore SC140).
The core architecture as illustrated in Figure 24 allows execution of up to four MAC
instructions in parallel (e.g. the SC140 [32]) which can be used to increase performance
during the execution of filter algorithms. However, each of the four MAC instructions
requires two operands on each clock cycle. The example in Figure 24 enables fetching
of two independent data values from data memory each cycle. The memory bandwidth for
executing the four MAC instructions in parallel is sufficient when fetching two 64-bit
data entries and assuming 16-bit wide operands for the MAC instructions. The structure in
Figure 24 also illustrates a limitation on storing data in data memory. The data has to be
placed in data memory so that it is possible to fetch operands for all four MACs in parallel
by addressing only two independent data entries. This limitation can require a large number
of operations to position the data according to the required scheme, which is normally not
accounted for in benchmark results, e.g. [9][28].
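The bandwidth argument above is a simple bit count, sketched here in Python with hypothetical function names. The operand width of 16 bits and the two 64-bit accesses are the values used in the text; everything else is an assumption of the sketch.

```python
def required_bits(parallel_macs: int, operands_per_mac: int = 2,
                  operand_bits: int = 16) -> int:
    """Operand bits that must arrive every cycle to keep all MAC units busy."""
    return parallel_macs * operands_per_mac * operand_bits


def provided_bits(independent_accesses: int, access_bits: int) -> int:
    """Bits delivered per cycle by the data memory ports."""
    return independent_accesses * access_bits


def operands_per_access(access_bits: int, operand_bits: int = 16) -> int:
    """Operands that must sit side by side in memory for one access."""
    return access_bits // operand_bits


print(required_bits(4), provided_bits(2, 64), operands_per_access(64))
```

The required and the provided bandwidth are both 128 bits per cycle, but each access has to deliver four neighbouring 16-bit operands. This is the data placement restriction described above.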
Program Memory Port
The program memory port is used to fetch instructions from program memory. Multi-way
VLIW architectures require a large number of instruction words to enable programming of
the available parallel resources. An example is the Texas Instruments C6xx family [36],
which requires a program memory port width of 256 bits and thus significant wiring effort.
In combination with the poor code density of the TI C6xx family, its usage in area- and
power-critical applications is not recommended. Therefore core architects have introduced
architectural features to prevent large program memory ports.
Providing a small program memory port requires less wiring to the program memory but
leads to poor usage of the available parallel units. During the execution of control code this
limitation is reasonable, because data and control dependencies limit the average ILP to 2-3
as illustrated in [106][112][115][159]. Loop-centric algorithms often used for typical DSP
functions can make use of more parallelism and therefore the peak-performance of the core
architecture would be limited by the reduced size of the program memory port. For
increasing peak performance of the core architectures, extended program memory ports
have been introduced [157].
Branch Delays
Branch delays are unusable execution cycles caused by taken conditional branch
instructions. Increasing clock frequency by increasing the number of pipeline stages
increases the number of branch delays and therefore decreases the practical performance.
Compared with single issue microcontroller cores this is further deteriorated when
executing control code with short branch distances on multi-way VLIW architectures.
Branch prediction circuits as introduced in [31][44][173] can be used to reduce the number
of branch delays but the drawback of prediction circuits has already been pointed out in
section 2.10.
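Why short branch distances hurt wide architectures more than single issue cores can be illustrated with a deliberately simple cost model. The Python sketch below assumes that every basic block ends with a taken branch, that the instructions of a block can be packed perfectly into execution bundles (an optimistic assumption), and that each taken branch costs a fixed number of delay cycles. The function names and the numbers in the example are hypothetical.

```python
import math


def block_cycles(instructions: int, issue_width: int, branch_delay: int) -> int:
    """Cycles for one basic block including the delay of its taken branch."""
    useful = math.ceil(instructions / issue_width)
    return useful + branch_delay


def wasted_fraction(instructions: int, issue_width: int, branch_delay: int) -> float:
    """Share of the block's cycles lost to branch delays."""
    return branch_delay / block_cycles(instructions, issue_width, branch_delay)


# Eight instructions between taken branches, three branch delay cycles.
for width in (1, 4):
    print(width, block_cycles(8, width, 3), round(wasted_fraction(8, width, 3), 2))
```

In this model the single issue core loses 3 of 11 cycles, whereas the four-issue core loses 3 of 5 cycles, because the useful work per block shrinks while the branch delay stays the same.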
An alternative way to compensate for branch delays is to avoid branch instructions
by making use of predicated or conditional execution. In [132] benchmark results illustrate
that the use of predicated execution can remove about 30% of conditional branch
instructions. The chosen implementation for predicated execution has influence on the
practical performance. An example is the implementation used for the SC140 [32], where only
one flag and few conditions can lead to a poor usage of the resources during control code
sections and to several unused execution cycles. This limitation is caused by restrictions
during instruction scheduling: scheduling of instructions between generation and evaluation
of the status information is not allowed. For multi-way VLIW architectures featuring deep
pipelines the gap between the theoretical and practical performance can increase significantly.
3.1.3 Power Consumption
The power consumption of a core subsystem is influenced by a number of factors, namely the
core architecture itself, the memory subsystem, the execution frequency and the executed
algorithms, which influence the traffic on the data memory bus [139][143]. This
section considers power consumption aspects where embedded core architectures can
contribute to reducing power dissipation; the technology aspects are not considered
in detail.
Power dissipation in CMOS circuits is mainly caused by three sources: leakage current,
short circuit current and the charging and discharging of capacitive loads during logic changes.
$P = P_{leak} + P_{short} + P_{dynamic}$    [1]
Leakage current is primarily determined by the fabrication technology and circuit area. The
short circuit currents can be avoided by careful design [59][62][72][158][161], and the same
is true for leakage [70][73][85][104][141][165].
Three degrees of freedom are an inherent part of the system low power design space:
voltage, physical capacitance and data activity. These factors will be briefly discussed in
this sub-section. Equation 2 contains the factors which mainly influence dynamic power
consumption [110].
$P_{isw} = \frac{1}{2} V^{2} C_{i} T D_{i}$    [2]
Voltage
The quadratic relation between voltage and power dissipation makes this parameter an
effective means of reducing power dissipation. Voltage scaling influences more than one
part of an SoC solution, so system aspects have to be considered carefully, as decreasing the
supply voltage incurs a speed penalty [56][107][131][156].
In [61] an architecture-driven voltage scaling strategy is presented in which pipelined and
parallel architectures are used to compensate the throughput problem caused by reduced
supply voltage. A different approach is illustrated in [175].
Another possibility to compensate the speed decrease caused by reduced supply voltage is
to decrease the threshold voltage Vt. This is limited by constraints of noise margins and by
the need to control the
increase of sub-threshold leakage current. Dual-Vt techniques such as those introduced in
[107] require multi-threshold CMOS transistors (MTCMOS) which have to be supported by
the target technology.
Physical Capacitance
Dynamic power consumption depends linearly on the switched physical capacitance.
Therefore, besides reducing the supply voltage, a reduction of the capacitance can be used to
reduce power dissipation. The physical capacitance can be reduced by using less logic,
smaller devices and shorter wires.
On the other hand, as already mentioned for voltage scaling, it is not possible to minimize one
parameter without influencing others. For example, reducing the device size reduces
the current drive of the transistors, which in turn results in a slower operating speed of the
circuits.
Switching Activity
Dynamic power dissipation also depends linearly on the switching activity. A circuit
containing a large amount of physical capacitance will show no dynamic power dissipation when
there is no switching.
However, calculating switching activity is not simple. This is caused by the fact that
switching consists of spurious activity and functional activity. In certain circuits like adders
and multipliers [52] spurious activity can dominate.
Combining data activity with physical capacitance leads to switched capacitance describing
the average capacitance charged during each data period.
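The proportionalities discussed in this sub-section (quadratic in the supply voltage, linear in the physical capacitance and in the switching activity) can be turned into a small scaling calculation. The Python sketch below only computes ratios relative to a reference design; the function name is hypothetical and the supply voltages are arbitrary example values.

```python
def relative_dynamic_power(v_new: float, v_old: float,
                           capacitance_ratio: float = 1.0,
                           activity_ratio: float = 1.0) -> float:
    """Dynamic power of a modified design relative to the reference design."""
    return (v_new / v_old) ** 2 * capacitance_ratio * activity_ratio


# Lowering the supply from 1.2 V to 1.0 V (example values only).
print(round(relative_dynamic_power(1.0, 1.2), 3))
# Halving the switching activity at unchanged voltage and capacitance.
print(relative_dynamic_power(1.2, 1.2, activity_ratio=0.5))
```

The sketch ignores the speed penalty of a reduced supply voltage and the cross-relations between the three parameters, which are the reason why they cannot be minimized independently.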
Summary
The design space for low power design is mainly influenced by the following parameters:
supply voltage, capacitance and switching activity, which are cross-related to each other and
have influence on static power dissipation. For an embedded DSP core the design space is
even more limited, because aspects like voltage scaling or dual-Vt techniques are
system or technology aspects and thus cannot be influenced by the core architecture itself.
The DSP core architecture introduced in this thesis supports architectural features [P5] and
compiler related aspects [99] for reducing switching activity. Implementation aspects to
reduce capacitance are considered by making use of manual full-custom design [P4].
3.2 Architectural Alternatives
DSP cores are processors that provide specific features for efficient implementation of
algorithms for digital signal processing as illustrated in section 2. Each core architecture
aims to solve specific problems, and an efficient architecture meets the requirements of
the algorithm executed. These requirements can be summarized in three key features: area
consumption, which determines costs; low power dissipation, which increases battery lifetime
or allows higher integration density; and system development costs, which are mainly
dominated by software development costs. In this section some architectural alternatives
used in current DSP core architectures are briefly introduced.
The solution space is multi-dimensional and different parameters have mutually coupled
influence upon the space. More details concerning the available design space for DSP core
architectures and a methodology for finding the best solution to a certain
application-specific problem can be found in [P10][P11][13].
3.2.1 Single Issue versus Multi Issue
Single issue architectures invoke only one instruction each execution cycle. This concept is
well established for microcontroller architectures, for example ARM microcontrollers. The
problem of efficient instruction scheduling is simplified to a linear problem and
programming a single issue core is straightforward. Control code typically executed on
microcontroller architectures is linear code with many dependencies, and therefore
executing more than one instruction per execution bundle (instructions executed during the
same clock cycle) does not significantly increase the performance of the core architecture.
To increase the performance of these core architectures more complex instructions can be
used [20][29].
Figure 25: Architectural Alternatives: Issue Rates for Available DSP Core Architectures.
DSP algorithms are loop-centric algorithms where a significant amount of execution time is
spent in loop iterations. Therefore increasing performance during execution of the loop
bodies significantly increases the performance of the core architecture. Software pipelining
and loop unrolling as introduced in a later section allow execution of several instructions in
parallel to increase system performance.
In Figure 25 the issue rate of available DSP core architectures over time is illustrated. While
most of the core architectures in the 1980s allowed the execution of one instruction per
execution cycle, only 10 years later up to 8 instructions could be executed in parallel. Core
architects have increased the number of instructions executed in parallel to increase the
relative performance of their core architectures.
There are also other aspects to consider, for example the Instruction Level Parallelism (ILP).
The average ILP indicates the average number of instructions executed in parallel. The ILP
is limited by the core resources, i.e. the issue rate, and by data and control dependencies in
the executed algorithm. The issue rate limit means that a single issue core cannot reach a
value of more than one. Increasing the possible number of instructions executed in parallel
will not increase the average ILP when executing an algorithm primarily based on control
code. For loop-centric algorithms the increased parallelism can be used for increasing core
performance.
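Both limits on the ILP can be made concrete with a small bound calculation. The Python sketch below is an illustration only: it assumes unit latency for every instruction, describes the code as a dictionary mapping each instruction to the instructions it depends on, and uses hypothetical function and instruction names.

```python
import math


def critical_path_length(deps: dict) -> int:
    """Length of the longest dependency chain (every key must list its predecessors)."""
    depth = {}

    def visit(node):
        if node not in depth:
            depth[node] = 1 + max((visit(p) for p in deps[node]), default=0)
        return depth[node]

    return max((visit(n) for n in deps), default=0)


def ilp_upper_bound(deps: dict, issue_rate: int) -> float:
    """Upper bound on the average ILP for a given issue rate."""
    n = len(deps)
    if n == 0:
        return 0.0
    min_cycles = max(math.ceil(n / issue_rate), critical_path_length(deps))
    return n / min_cycles


# Control-like code: each instruction depends on the previous one.
control = {f"i{k}": ([f"i{k - 1}"] if k else []) for k in range(6)}
# Loop-like code: eight independent operations.
loop = {f"m{k}": [] for k in range(8)}

print(ilp_upper_bound(control, 8), ilp_upper_bound(loop, 8), ilp_upper_bound(loop, 1))
```

For the dependent chain the bound stays at one even with eight issue slots, while the independent operations can use the full issue rate, and a single issue core never exceeds one.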
It is nearly impossible to develop code for a multi-issue DSP core architecture manually
when deep pipelines and the related dependencies have to be considered; therefore the use
of a high-level language compiler like a C-compiler is required.
3.2.2 VLIW versus Superscalar
Scalar and superscalar architectures are common for microcontrollers. Scalar processors
support the execution of one instruction per cycle which limits the attainable performance.
Superscalar processor architectures overcome this problem by supporting the execution of
several instructions in parallel where resolving of dependencies in the executed application
code is done by hardware circuits. Issuing queues [149], score boards [149] and highly
sophisticated branch prediction circuits [153] take care of making efficient use of the core
resources. The programming model is based on dynamic scheduling, i.e. the execution order
of instructions is defined during run-time based on dependency analysis [148]. Superscalar
architectures allow minimization of the execution time by enabling a change in the program
execution order as long as dependencies are considered. This minimizes average execution
time.
The Very Long Instruction Word (VLIW) programming model is based on static scheduling.
Dependencies in the application code are already resolved during compile time. The
execution order of instructions is not changed during run-time, because VLIW architectures
provide no hardware circuits for dependency resolution. The advantage is reduced core
complexity, which simplifies hardware development. Using caches for VLIW architectures
leads to penalty cycles during cache misses which cannot be used to execute different code
sections. One possibility to overcome this limitation is the use of multithreading, with the
drawback of increased core complexity. Static scheduling allows minimization of the
worst-case execution time, which is required for algorithms with real-time requirements.
Developing a C-compiler for VLIW architectures is more complex because of dependency
analysis and sophisticated instruction scheduling algorithms [45][166].
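How dependencies are resolved at compile time can be sketched with a greedy list scheduler. The Python code below is a minimal illustration and not the algorithm of any particular compiler: it assumes unit latency, identical execution units and a dependency dictionary given in program order. All instruction names are hypothetical.

```python
def static_schedule(deps: dict, issue_width: int) -> list:
    """Pack instructions into execution bundles without violating dependencies."""
    if issue_width < 1:
        raise ValueError("issue_width must be at least 1")
    scheduled_cycle = {}
    bundles = []
    remaining = list(deps)
    while remaining:
        cycle = len(bundles)
        bundle = []
        for instr in remaining:
            if len(bundle) == issue_width:
                break
            # Ready only if every predecessor was issued in an earlier cycle.
            if all(scheduled_cycle.get(p, cycle) < cycle for p in deps[instr]):
                bundle.append(instr)
        if not bundle:
            raise ValueError("cyclic or unknown dependency")
        for instr in bundle:
            scheduled_cycle[instr] = cycle
        remaining = [i for i in remaining if i not in scheduled_cycle]
        bundles.append(bundle)
    return bundles


code = {
    "ld_a": [], "ld_b": [], "mul": ["ld_a", "ld_b"],
    "ld_c": [], "add": ["mul", "ld_c"], "st": ["add"],
}
for width in (1, 2, 4):
    print(width, static_schedule(code, width))
```

Because the bundles are fixed before execution, the cycle count of the sequence is known at compile time. Widening the issue width from two to four does not shorten this example, since the dependency chain from the loads to the store already needs four cycles.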
Most of the latest DSP architectures are based on the VLIW programming model driven by
the real-time requirements of algorithms executed on DSP architectures. To overcome the
drawback of traditional VLIW having poor code density, enhanced implementations like
Variable Length Execution Set (VLES) [36], Configurable Long Instruction Word (CLIW)
[157] and scalable Long Instruction Word (xLIW) [P2] are used in existing core
architectures.
3.2.3 Deep Pipeline versus Lean Pipeline
Pipelines were already introduced in supercomputers in the 1960s. The motivation for
pipelining is to increase instruction throughput by an increased usage of hardware resources.
This is achieved by splitting operations into sub-operations and starting new sub-operations
as early as possible. This split into sub-operations also allows higher clock frequencies to be
reached. In Figure 39 the concept of SW-Pipelining is illustrated for four
operations. The main concept is the same for hardware and software pipelines.
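The throughput gain of pipelining can be shown with a small occupancy table. The Python sketch below assumes an ideal pipeline without dependencies or stalls in which every sub-operation takes one cycle; the function names and the choice of three stages are assumptions of the sketch.

```python
def cycles_unpipelined(n_ops: int, stages: int) -> int:
    """Each operation runs through all stages before the next one starts."""
    return n_ops * stages


def cycles_pipelined(n_ops: int, stages: int) -> int:
    """A new operation starts every cycle once the previous one left stage 1."""
    return 0 if n_ops == 0 else stages + n_ops - 1


def pipeline_table(n_ops: int, stages: int) -> list:
    """One text row per operation showing which stage it occupies in each cycle."""
    total = cycles_pipelined(n_ops, stages)
    rows = []
    for op in range(n_ops):
        row = ["--"] * total
        for s in range(stages):
            row[op + s] = f"S{s + 1}"
        rows.append(" ".join(row))
    return rows


print(cycles_unpipelined(4, 3), cycles_pipelined(4, 3))
print("\n".join(pipeline_table(4, 3)))
```

For four operations and three sub-operations the ideal pipeline needs 6 cycles instead of 12. Dependencies between operations reduce this gain.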
Pipeline structures used in DSPs are CISC and RISC pipelines. Direct memory architectures
are based on CISC pipelines. Besides fetching, decoding and execution of instructions
typical for RISC pipelines the memory operation for fetching operands from data memory
requires additional pipeline stages which are inserted after the decode stage. In Figure 26
the two types of pipelines are illustrated. In RISC pipelines separate instructions are used to
fetch data from data memory and these instructions use the same pipeline structure. In CISC
pipelines the memory operations are part of arithmetic instructions.
Figure 26: Architectural Alternatives: RISC versus CISC Pipeline.
Besides the chosen pipeline structure (which is mainly influenced by the general
architectural concept) the number of clock cycles used to implement the pipeline is an
important performance aspect.
To increase the computational power of core architectures, operations are split into small
sub-operations, so that several clock cycles are used to execute one “natural” pipeline stage.
This leads to super-pipelined architectures. Dependencies between pipeline stages (a more
detailed discussion can be found in the Design Space section) can lead to a poor usage of
available hardware resources.
Figure 27: Architectural Alternatives: Pipeline Depth of Available DSP Cores.
The worst case scenario can even be an increased clock frequency which produces high
power dissipation but reduced system performance due to data and control dependencies. In
Figure 27 the pipeline depth for available core architectures is illustrated. To overcome
data and control dependencies caused by deep pipeline structures, bypass and branch
prediction circuits are introduced.
3.2.4 Direct Memory versus Load-Store
For load-store architectures separate instructions are used to transfer data between data
memory and register file. For direct-memory architectures the data transfer is coded inside
the arithmetic instruction. For load-store architectures the register file plays a central role
and therefore is traditionally located between data memory and execution units whereas for
direct memory architecture the register file is in parallel to the execution units where
intermediate results can be stored. The difference between load-store and direct memory
architecture is illustrated in Figure 28.
Figure 28: Architectural Alternatives: Direct Memory versus Load-Store.
Assuming traditional DSP algorithms, e.g. filtering, the direct memory architecture allows
an increase in code density by using fewer instruction words. The load/store operations are
already included in the coding of the arithmetic instructions; however, the coding space
for the data transfer has to be provided inside the instruction word, thus leading to more
complex instruction words, for example the 24-bit instruction word of Carmel [11]. For code
sections which cannot make use of the more complex instructions, code density is decreased.
The execution of control code with CISC instructions suffers from poor usage of
the binary coding and therefore decreased code density. The application code requires more
instruction words when using less complex instructions. These provide more flexibility and
can be used to increase code density on application level.
3.2.5 Mode Register versus Instruction Coding
Memory dominates the area consumption of embedded DSP subsystems. Code density is a
factor mirroring how efficiently a certain algorithm can make use of the provided core
resources and the related instruction set architecture and instruction coding. High code
density reduces the required program memory and therefore the necessary die area of the
DSP subsystem.
One possibility to increase code density is the usage of a mode register. The mode register
allows the meaning of an instruction to be changed by modifying the related mode bit. Common
examples [32] are mode bits for addressing modes or saturation modes.
The disadvantage of these mode registers is caused by instruction scheduling. In Figure 29
the problem is illustrated. The meaning of an instruction does not depend only on the
instruction word; it also depends on the mode set for a certain
code section.
Figure 29: Architectural Alternatives: Mode Dependent Limitations during Instruction
Scheduling.
It is not possible to move an instruction out of a section without changing the
mode for the other section (which again requires additional instructions, since the new
mode has to be set and reset). This is impossible for multi-issue VLIW DSP core
architectures. Therefore a mode register can help to increase the code density of a small
kernel, but the reduced freedom for scheduling of instructions can lead to an increased
number of execution cycles and even to a decreased code density.
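The scheduling problem can be reproduced with a toy interpreter. The Python sketch below models a hypothetical saturation mode bit for a 16-bit addition; the instruction names SETSAT and ADD and the reset value of the mode bit are assumptions made for this illustration and do not describe a particular core.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def add16(a: int, b: int, saturate: bool) -> int:
    """16-bit addition whose result depends on the saturation mode."""
    total = a + b
    if saturate:
        return max(INT16_MIN, min(INT16_MAX, total))
    # Two's complement wrap-around when saturation is switched off.
    return (total - INT16_MIN) % 65536 + INT16_MIN


def run(program: list, acc: int = 0) -> int:
    """Execute a list of (op-code, argument) pairs on an accumulator."""
    saturate = False  # assumed reset value of the hypothetical mode bit
    for op, arg in program:
        if op == "SETSAT":
            saturate = bool(arg)
        elif op == "ADD":
            acc = add16(acc, arg, saturate)
        else:
            raise ValueError(op)
    return acc


original = [("SETSAT", True), ("ADD", 30000), ("ADD", 10000), ("SETSAT", False)]
# The second ADD has been moved behind the instruction that resets the mode.
reordered = [("SETSAT", True), ("ADD", 30000), ("SETSAT", False), ("ADD", 10000)]

print(run(original), run(reordered))
```

The same instruction word ADD 10000 saturates in the original order and wraps around after the move, so a scheduler may only move it if it also moves or duplicates the mode change.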
3.3 Available DSP Core Architectures
This section introduces commercially available DSP cores. Each core architecture is
introduced and its main aspects are discussed, such as available arithmetic units,
pipeline structure, supported addressing modes and core specific features. At the end of
each description a short summary section briefly assesses the features of the core
architecture from an orthogonality point of view which is a major aspect for developing a
C-compiler. This section does not contain a table comparing available DSP cores on metrics
like reachable frequency, number of parallel executed instructions or pipeline depth in order
to prevent superficial comparisons. These kinds of tables hide micro-architectural
limitations with influence on practical performance as introduced in 3.1.2.
DSP cores are quite often rated by the number of MAC instructions they support each clock
cycle. The first three cores described in this section are OAKDSP, Motorola 56000 and
TI C54x, chosen as examples of so-called single-MAC DSP cores (DSP core architectures
supporting the execution of one MAC instruction at a time). The ZSP has been chosen as an
example of a DSP core based on the superscalar programming model. As examples of
dual-MAC architectures Infineon Carmel DSP, TI C62x and Blackfin have been chosen. As
a last example the SC140 of Starcore LLC has been chosen as a DSP core architecture
supporting the execution of four MAC instructions in parallel. This thesis does not consider
vector processor architectures due to their specialized architecture, which can only be used
for one class of algorithmic problem.
3.3.1 OAKDSP
The OAKDSP [29] core was introduced in the early 1990s by DSPgroup (now Ceva [14])
as a successor to the PineDSP core. The OAKDSP core is a single-issue 16-bit DSP core based
on a traditional direct memory architecture, in which arithmetic instructions fetch their
operands from memory and store the results into memory. The instruction set is based on a
native 16-bit instruction word, and long immediate values or offsets are stored in an
additional instruction word. The pipeline consists of four stages: fetch, decode, operand
fetch and execution. The data memory space is split into an asymmetric X and Y memory
space. The Computation Unit (CU) as illustrated in Figure 30 contains a sixteen by sixteen
multiplier (also supporting double precision), an ALU/Shifter data path for implementation
of MAC instructions and a separate Bit Manipulation Unit (BMU) containing a barrel
shifter, which is the major difference to PineDSP.
Figure 30: Architectural Overview: OAKDSP Core.
Four shadowed accumulator registers, each 36 bits wide including four guard bits, are
supported. Two of these are assigned to the BMU and two to the CU. Zero overhead loop
instructions with four nesting levels are supported. A software stack allows execution of
recursive function calls.
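The benefit of the guard bits can be quantified with a worst-case count. The Python sketch below assumes signed integer products of two 16-bit operands without a fractional left shift; the function name is hypothetical.

```python
def max_worst_case_accumulations(acc_bits: int = 36, operand_bits: int = 16) -> int:
    """Number of largest-magnitude products a signed accumulator can sum safely."""
    worst_product = (1 << (operand_bits - 1)) ** 2   # (-2^15) * (-2^15) = 2^30
    acc_max = (1 << (acc_bits - 1)) - 1              # largest positive value
    return acc_max // worst_product


print(max_worst_case_accumulations(32), max_worst_case_accumulations(36))
```

Under these assumptions a 32-bit accumulator can hold a single worst-case product, whereas the 36-bit accumulator can sum 31 of them before an overflow becomes possible.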
The address generation unit supports post increment/decrement operations and modulo
address generation. Reverse carry addressing is not supported. Three status registers are
available containing flags, status bits, control bits, user I/O and paging bits. Most of the
fields can be modified by the user. The first status register contains the flags (zero, minus,
normalized, overflow, carry …) which are influenced by the last CU or BMU operation.
Most of the registers are shadowed which allows a task switch with reduced spilling of the
core status to data memory. No separate interrupt control unit is available; three
different interrupt sources and an NMI are supported.
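The addressing modes mentioned above can be illustrated in software. The Python sketch below models post-modification with modulo addressing for a circular buffer and the index order produced by reverse carry (bit-reversed) addressing, which radix-2 FFT implementations use to reorder their data. The function names are hypothetical and the code does not describe the OAKDSP address generation unit itself.

```python
def post_modify_modulo(addr: int, step: int, base: int, length: int) -> int:
    """Next address inside the circular buffer [base, base + length)."""
    return base + (addr - base + step) % length


def bit_reverse(index: int, bits: int) -> int:
    """Mirror the lowest 'bits' bits of index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def reverse_carry_sequence(n_points: int) -> list:
    """Access order for an n-point radix-2 FFT (n_points must be a power of two)."""
    if n_points < 1 or n_points & (n_points - 1):
        raise ValueError("n_points must be a power of two")
    bits = n_points.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(n_points)]


print(post_modify_modulo(103, 1, base=100, length=4))
print(reverse_carry_sequence(8))
```

Without hardware support for the second mode, the index calculation of bit_reverse has to be spent as additional instructions for every access or as a separate reordering pass.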
Figure 31: Architectural Overview: Motorola 56300.
Summary: The OAKDSP core is a single-issue DSP core, which limits its relative
performance. The limited address space requires a paging mechanism with the related
limitations for instruction scheduling. Compared with modern core architectures, the
reduced feature set enables good code density for typical DSP algorithms like filtering.
The missing reverse carry address mode prevents an efficient implementation of FFT
algorithms on the OAKDSP core. Status and configuration registers limit instruction
scheduling. Flags that are influenced by the last ALU or BMU instruction limit the use of
conditionally executed instructions. Nested interrupts are not supported because only one
level of shadow registers is available, which limits the usage of interrupt service
routines. Although the software stack eases the development of a C-compiler, the
architectural features of OAKDSP are not orthogonal and therefore the feasibility of a
powerful C-compiler is questionable.
3.3.2 Motorola 56300
The Motorola 56300 DSP [22] core is a powerful member of the Motorola 56k family [20]
introduced in 1995. The 56300 is a single-issue 24-bit load/store architecture where
arithmetic instructions fetch their operands from two operand registers X and Y (each 56
bits wide).
The native instruction set consists of 24-bit wide instruction words; long offsets or
immediate values are stored in an additional instruction word. The instruction format can be
split into a parallel instruction format and a non-parallel instruction format. The parallel
instruction format supports CISC instructions: in addition to the operation op-code and
operand, operations taking place on the X and Y memory bus and a condition can be coded.
The pipeline consists of seven stages: prefetch I+II, decode, address generation I+II and
execute I+II, all of which are hidden from the programmer. The memory space is split into
X and Y memory. The computation unit, illustrated in Figure 31, contains a 24 by 24 bit
multiplier, an accumulator including a shifter and a separate bit-field unit including a barrel
shifter.
The register file consists of six accumulator registers Ax and Bx, each 56 bits wide.
The supported data types are 24-bit based (also including byte support); in addition, 8 guard
bits support higher precision calculation. Zero overhead loops are supported and can be
nested. The nesting level is not limited because loop handling registers are spilled to the
software stack (limited only by the available data space).
The 56300 supports register direct and indirect address modes (including pre- and
post-operation) and specialized address modes used for efficient implementation of
traditional DSP algorithms, namely reverse carry, which allows efficient implementation of
FFT algorithms, and modulo addressing. The size of the modulo buffer stored in the modulo
register is configurable whereas the start address has to be aligned. The address register file
is banked, which means that half of the registers are assigned to each of the two Address
Generation Units (AGU).
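To illustrate why reverse carry addressing helps with FFT algorithms, the following minimal Python sketch models an address update in which the carry propagates from the most significant bit towards the least significant bit. The function names and the mirror-add-mirror model are assumptions chosen for illustration; they are not taken from the 56300 documentation. Stepping by half the FFT length with reverse carry visits the buffer offsets in bit-reversed order.

```python
def bit_reverse(value, bits):
    """Mirror the lowest `bits` bits of `value`."""
    out = 0
    for _ in range(bits):
        out = (out << 1) | (value & 1)
        value >>= 1
    return out


def reverse_carry_add(addr, step, bits):
    """Add `step` to `addr` with the carry running from MSB towards LSB.

    Modelled by mirroring both operands, adding normally and mirroring
    the sum back (an assumed model, not vendor documentation).
    """
    mask = (1 << bits) - 1
    mirrored_sum = (bit_reverse(addr, bits) + bit_reverse(step, bits)) & mask
    return bit_reverse(mirrored_sum, bits)


def fft_index_sequence(n_points):
    """Offsets visited when stepping by n_points/2 with reverse carry."""
    bits = n_points.bit_length() - 1  # n_points is assumed to be a power of two
    addr, sequence = 0, []
    for _ in range(n_points):
        sequence.append(addr)
        addr = reverse_carry_add(addr, n_points // 2, bits)
    return sequence


print(fft_index_sequence(8))  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Without hardware support this reordering has to be computed in software for every access, which is the overhead the specialized address mode removes.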
The actual core status is stored as flag information like carry, zero and overflow in a status
register. Mode registers are available for choosing saturation mode and rounding mode and
an operation mode register is available for determining the status of the core (e.g. stack
overflow). Four interrupts and one NMI are supported.
Although the 56300 is a 24-bit DSP processor, a compatibility mode supporting 16-bit data
format is available. The unused bits are cleared or sign extended, depending on the position.
Summary: The 56300 from Motorola supports a 24-bit datapath and therefore is well suited
for audio algorithms. The seven-stage pipeline allows higher clock frequencies to be
reached, but dependencies in the application code can lead to poor utilization of the clock
cycles. Configuration and mode registers limit instruction scheduling. Even though the
parallel instruction format increases the performance of the DSP core architecture, it is
still a single-issue core. Control code sections in particular suffer from poor code density
because of the 24-bit native instruction word size. The supported address modes allow
efficient implementation of traditional DSP algorithms including FFT. The use of the address
registers is limited by the banked implementation of the address register file.
3.3.3 Texas Instruments 54x
Texas Instruments introduced two major DSP families, namely the embedded core family
TI C5xx and the high performance stand-alone DSP family TI C6xx as outlined in [39]. The
TI C54x [33] has been chosen as an example and is illustrated in Figure 32. Several other
members of the core family are available, e.g. [34], which provide different features, but
the TI C54x has been chosen because it is still the most referenced DSP core. Berkeley
Design Technology, Inc. (BDTi) normalizes the performance figures of analyzed DSP cores
to the performance figures of the TI C54x [9].
TI C54x is based on a direct-memory architecture and supports three data busses and an
independent program memory bus, each of which is 16 bits wide. Arithmetic instructions
include the fetching of their operands from data memory. The native instruction word is 16
bits wide. Several instructions require a second instruction word (e.g. branch
instructions). The second instruction word has to be fetched sequentially and therefore some
pipeline cycles remain unused. Conditional execution is supported, which reduces the
number of branch delays.
The pipeline consists of six stages: pre-fetch, fetch, decode, access, read and execute.
Executing a branch instruction results in three branch delays: the first delay is caused by
the branch instruction itself, which consists of two instruction words, whereas the next
two execution cycles have to be flushed. To overcome this problem TI C54x supports
delayed branching, which allows the branch delays to be filled with unconditionally executed
instructions. The Central Processing Unit (CPU) contains a 17 by 17 multiplier which
supports double precision arithmetic, an adder and a barrel shifter.
Figure 32: Architectural Overview: TI C54x.
The DSP core architecture features two accumulator registers each 40 bits wide including 8
guard bits for internal high-precision arithmetic. Zero overhead hardware loop instructions
are supported.
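The benefit of guard bits can be illustrated with a small Python sketch. It accumulates products in a 40-bit two's-complement accumulator (32 data bits plus 8 guard bits, as described above) and compares the result with a 32-bit accumulator without guard bits. The helper names and the saturation on write-back are assumptions made for illustration and do not describe the exact TI C54x behavior.

```python
ACC_BITS = 40    # 32 data bits + 8 guard bits
DATA_BITS = 32   # accumulator width without guard bits


def wrap(value, bits):
    """Wrap an integer to a signed two's-complement value of `bits` bits."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def saturate(value, bits):
    """Clip a value to the signed range representable with `bits` bits."""
    return max(-(1 << (bits - 1)), min((1 << (bits - 1)) - 1, value))


def mac(samples, coeffs, acc_bits):
    """Multiply-accumulate into an accumulator that is `acc_bits` wide."""
    acc = 0
    for x, c in zip(samples, coeffs):
        acc = wrap(acc + x * c, acc_bits)
    return acc


samples = [32767] * 4          # four full-scale 16-bit values
coeffs = [32767] * 4
exact = sum(x * c for x, c in zip(samples, coeffs))

with_guard = mac(samples, coeffs, ACC_BITS)      # equals the exact sum
without_guard = mac(samples, coeffs, DATA_BITS)  # wraps around silently
print(exact, with_guard, without_guard)
print(saturate(with_guard, DATA_BITS))           # value after a saturating write-back
```

With eight guard bits, at least 256 full-scale products can be accumulated before the accumulator itself can overflow, so saturation only has to be applied once when the result is written back.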
TI C54x supports several addressing modes including absolute and indirect addressing. The
specialized address modes reverse carry and modulo addressing are also available.
Therefore an efficient implementation of FFT algorithms is possible.
Flags indicating the status of the core architecture are stored in core status registers. To
increase code density some of the functions like saturation of multiplication results are
stored in mode registers. TI C54x supports several hardware and software interrupts
including prioritization of different interrupt sources.
Summary: The core architecture of Texas Instruments TI C54x is well suited for efficient
implementation of traditional DSP algorithms. Therefore it is quite often used as a reference
concerning code density and power dissipation. The reachable performance is limited by the
single-issue execution logic, by a narrow program memory port which leads to stalls when
executing instructions consisting of two instruction words, and by the missing register
file, which requires operands to be fetched from data memory for each arithmetic instruction.
Only a few functions are stored in mode and status registers, therefore instruction
scheduling is less limited. Dynamically generated flags are stored in status registers, which
limits instruction scheduling when using predicated execution (i.e. no instructions are
allowed to be scheduled between generating the condition and the conditionally executed
instruction). The support of delayed branching reduces the drawback of branch delays.
3.3.4 ZSP 400
The ZSP 400 DSP [42] core architecture illustrated in Figure 33 is a member of the
ZSP DSP family of LSI Logic. Unlike the other core architectures introduced in
this section, ZSP uses the superscalar programming model. The core is based on a RISC
load-store architecture where arithmetic instructions get their operands from the register file.
Separate instructions are used to load data from and store data to data memory.
Figure 33: Architectural Overview: ZSP400.
The instruction set is based on a native 16-bit instruction word. The core supports the
execution of up to four instructions each execution cycle with some limitations concerning
the grouping of instructions. Therefore ZSP is a multi-issue DSP core where dependencies
are resolved during run-time (dynamic scheduling). The pipeline of the ZSP 400 contains five
stages: fetch/decode, group, read, execute and write-back. The data memory bandwidth is
four words wide. A cache is located between memory and core, and the communication is
established via data links. During data memory access no alignment restrictions have to be
considered. The Computation Unit consists of two MAC units and two ALU paths, each 16
bits wide. It is possible to combine them as a single 32-bit ALU path.
The register file is built up of sixteen 16-bit wide general purpose registers. Two 16-bit
registers can be addressed as one 32-bit register. Two of the 32-bit registers contain
additional 8 guard bits used for internal higher precision calculation. Eight of the 16-bit
registers are shadowed and switching between the two sets is done with a configuration bit.
ZSP 400 supports two circular buffers; explicit address registers are not provided. The
first 13 registers can be used for reverse carry addressing, which enables efficient
implementation of FFT algorithms. The load/store instructions support the auto-increment
and offset address calculation usual for state-of-the-art DSP architectures.
Mode registers are supported to enable saturation and rounding modes. Several other core
functions are controlled by additional configuration registers. Status registers contain core
status information like hardware flags indicating for example overflow, zero or pending
interrupts.
Summary: ZSP 400 is an example of the ZSP family from LSI Logic. The core is based on the
superscalar programming model. This implies dynamic instruction scheduling during
runtime and the intensive usage of cache structures, which limits the possibility of
minimizing the worst-case execution time. The advantage of a unique register file for
instruction scheduling is counteracted by a large number of restrictions and non-orthogonal
architectural features: for example, only a few registers are shadowed, and while some
registers can be used for any addressing mode, others do not support all of them. To
increase code density several typical DSP functions like saturation and rounding modes are
shifted to mode registers, with the related limitations for instruction scheduling.
The ZSP architecture is well suited for implementing a C-compiler due to the superscalar
architecture which eases the compiler development. Several restrictions and non-orthogonal
architectural features limit the possibility of an optimizing compiler. Comparing ZSP with
traditional DSP core architectures gives the indication that ZSP is not a typical DSP core. It
is more a microcontroller with some features used in DSP core architectures (like address
modes, MAC units and circular buffers).
3.3.5 Carmel
The Carmel DSP [10][11][12] core was introduced in the mid-1990s by Infineon Technologies,
the former Siemens Semiconductor group. Carmel is a 16-bit fixed-point direct-memory
architecture where arithmetic instructions fetch their operands from memory locations. This
is reflected in the 8-stage pipeline: program address, program read, decode I+II, data read
address, operand fetch, execution and write address, data write.
The native instruction word size is 24 bits and instructions are built of up to two instruction
words. The instruction coding is optimized for code density, which requires two pipeline
stages for instruction decoding. Carmel is based on the VLIW programming model. The
implementation of the program memory port is patented as Configurable Long Instruction
Word (CLIW) [157]. The regular program memory port is 48 bits wide. An extended
memory port of 96 bits allows the fetching of up to 144-bit instruction words. The CLIW
memory contains parameterized instruction combinations. Some of the instructions are only
supported as part of CLIW instructions.
Figure 34: Architectural Overview: Carmel.
Carmel supports the fetching of data from up to four independent memory locations. The data
memory is split into A1, A2, B1 and B2 memory blocks. The execution unit, illustrated in
Figure 34, contains two data paths, each of which contains a MAC unit and an ALU. The left
path additionally supports a shifter and an exponent unit. The results can be stored into an
intermediate register file of six 40-bit wide accumulator registers or directly to data memory
via two 16-bit wide write ports.
Zero-overhead hardware loops are supported with a nesting level of four. Similar to
OAKDSP, a fast context switch is supported by two secondary accumulator
registers.
Carmel supports a 16-bit address space with the traditional addressing modes found in DSP
cores. Efficient implementation of FFT algorithms is enabled by the bit-reverse
addressing mode. Two modulo addressing schemes are supported: the first requires aligned
boundary addresses, whereas the second supports non-aligned boundary addresses and
thereby prevents memory fragmentation.
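The difference between the two modulo addressing schemes can be sketched in Python as follows. The function names, the power-of-two alignment rule and the explicit start register are assumptions made for illustration; they show the general technique and are not a description of the Carmel hardware.

```python
def modulo_step_aligned(addr, step, size):
    """Circular post-modify when the buffer base is aligned.

    Assumption: the base address is a multiple of the smallest power of
    two that is >= size, so the base can be derived from the address
    itself and no start register is needed.
    """
    k = (size - 1).bit_length()          # number of low bits holding the offset
    low_mask = (1 << k) - 1
    base = addr & ~low_mask
    offset = ((addr & low_mask) + step) % size
    return base | offset


def modulo_step_unaligned(addr, step, start, size):
    """Circular post-modify with an explicit start address (any position)."""
    return start + (addr - start + step) % size


# A 6-entry buffer: the aligned scheme reserves an 8-word aligned block,
# the non-aligned scheme can place the buffer directly behind other data.
print(modulo_step_aligned(0x40 + 5, 1, 6))           # wraps to the aligned base 0x40
print(modulo_step_unaligned(0x43 + 5, 1, 0x43, 6))   # wraps to the start 0x43
```

In the aligned case two of the eight reserved words stay unused for a 6-entry buffer, which is the kind of fragmentation that the non-aligned scheme avoids at the cost of an additional start or boundary register.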
Configuration registers are used to choose the rounding mode, to activate saturation and to
enable the fractional data format. Conditional execution is supported for most of the
instructions and can utilize two condition registers.
Summary: The Carmel DSP core is the 16-bit embedded DSP core created by Infineon
Technologies in cooperation with the Israeli company ICCOM. The traditional direct-memory
architecture favors CISC instructions. To increase code density and execution
speed for traditional DSP algorithms like filtering and FFT, orthogonality aspects have
been ignored. The extended program memory port, with the restriction that some
instructions can only be used with this port, limits the development of an optimizing
C-compiler. The two pipeline stages required for decoding increase the number of branch
delays. Configuration registers for major DSP functions like saturation and rounding modes
limit instruction scheduling. Considering Carmel as a high performance DSP core, the
limitation to a 16-bit address space is critical and requires a paging mechanism with its
related drawbacks. The conditional execution supported by Carmel limits instruction
scheduling because only two registers are available for storing conditions.
In 2002 Carmel was sold to Starcore LLC and in the same year Carmel was discontinued.
Carmel is an example of typical DSP core developments of the mid-1990s and one of the last
examples of direct-memory architectures. Carmel’s BDTi benchmarks still have the
leading edge for traditional DSP algorithms and especially for FFT algorithms.
3.3.6 Texas Instruments 62xx
The Texas Instruments C62x [36][37] is a member of the Texas Instruments C6xx
high performance DSP core family. TI C62x is a 16-bit fixed-point multi-issue DSP core
based on a RISC load-store architecture. Operands for the arithmetic instructions are fetched
from the register file and separate instructions are supported to move data between register
file and data memory. The instruction set is based on a 32-bit native instruction word. Up to
eight instructions can be executed each clock cycle. The programming model is based on VLIW.
To overcome the drawback of poor code density Texas Instruments introduced the Variable
Length Execution Set (VLES), which allows decoupling of the fetch and execution bundles.
Two independent data busses, as illustrated in Figure 35, connect the register file with data
memory. The execution unit supports eight different units: two for data exchange with data
memory, two multipliers, two ALU units and two shift and bit manipulation units. The
feature sets of the units overlap, as explained in [36], which allows functions to be
shifted from one unit to another. There is no explicit support for Multiply and
Accumulate (MAC), which therefore requires two instructions: one for the
multiplication and one for the accumulation.
The register file contains sixteen 32-bit wide general purpose registers. It is possible to store
40-bit values in two consecutive 32-bit registers. The pipeline consists of three phases
(fetch, decode and execute), which are split over 11 clock cycles. The pipeline of TI C62x
can therefore be called super-pipelined.
TI C62x provides full predicated execution which means that each instruction can be
executed conditionally. Six registers of the register file can be used to build the condition.
Figure 35: Architectural Overview: TI C6xx.
Summary: The Texas Instruments TI C62x as an example of the C6xx family is a high
performance processor architecture. A large register file and up to eight instructions
executed in parallel provide an impressive peak performance. The deep pipeline structure
allows high clock frequencies to be reached, but long define-use and load-use dependencies
can result in poor usage of available resources (details in the Design Space section). Some
characteristic features of DSP architectures, like 40-bit accumulators or MAC units, are not
available. These specialized functions are emulated, for example by using two 32-bit
registers to implement an accumulator or by combining a multiplication with an accumulation
to realize a MAC instruction. The predicated execution is a powerful feature for
implementing control code because it reduces the number of branch instructions with their
unusable branch delays. However, only a few of the available registers can be used for
building up the condition, which restricts the use of this feature. An additional limitation
for instruction scheduling is the banked register file with one path for transferring data
between the two banks. An important drawback not to be overlooked is the poor code density.
It is not feasible to use the core as an embedded core, and most applications use the
TI C62x as a standalone device with external memory. The described DSP core is well suited
for development of an optimizing C-compiler.
3.3.7 Starcore SC140
The Starcore SC140 [32] is the high performance DSP core of the Starcore DSP family.
Starcore LLC was founded by Motorola and Agere, the former semiconductor group of
Lucent Technologies. Infineon Technologies joined this cooperation just two years ago. The
SC140 is a multi-issue high performance 16-bit fixed-point DSP core based on a RISC,
nearly ‘pure’ load-store architecture. Most of the instructions get their operands from the
register file. However, a few get operands directly from memory.
The instruction set is based on a 16-bit wide native instruction word where the instructions
are 16, 32 or 48 bits wide. Up to six instructions can be grouped to a Variable Length
Execution Set (VLES). Some limitations exist concerning grouping. The SC140 features a
five-stage pipeline: pre-fetch, fetch, dispatch, address generation and execution. Two
independent data busses connect the core with data memory. The memory addresses are
interleaved to reduce the possibility of an address conflict and related stall cycles.
The CU of the SC140, illustrated in Figure 36, consists of four independent data paths,
each of which supports the execution of a MAC instruction or an ALU operation including
shifting. During execution of filter algorithms the available four MAC units provide
significant peak performance.
Figure 36: Architectural Overview: Starcore SC140.
The register file consists of sixteen 40-bit entries. The accumulator registers contain 8 guard
bits for internal higher precision calculation. Zero overhead looping is supported up to a
nesting level of four.
Two independent address generation units support address modes available on
state-of-the-art DSP cores and due to reverse carry support an efficient implementation of
FFT algorithms is possible. The modulo addressing support allows the addressing of modulo
buffers with any size starting at any position, which prevents fragmented data memory.
Mode registers for coding special addressing, rounding and saturation modes are available.
Only one flag (T) is available for dynamic evaluation of the core status used for conditional
or predicated execution. The SC140 supports a separate interrupt control unit (ICU) with a
feature set similarly powerful to that available for microcontroller architectures [15].
Summary: SC140 is the high performance DSP core architecture of the Starcore DSP family,
where the support of four independent MAC units increases peak performance during execution
of traditional DSP algorithms like filtering. The support of only one flag significantly
limits the use of predicated execution for improving execution of control code on the DSP
architecture. The grouping mechanism to identify the VLES decreases code density; benchmark
results illustrate poor code density [28]. The register file used for address generation is
not fully orthogonal, and some address modes cannot be used with all registers. Mode
registers restrict instruction scheduling. Some specialized instructions are limited to certain
source and destination registers, which limits register allocation and instruction scheduling.
Figure 37: Architectural Overview: Blackfin.
3.3.8 Blackfin DSP
The Blackfin DSP [1][2][8] core was co-developed by Intel Inc and Analog Devices. Blackfin
is a high-performance 16-bit fixed-point DSP core based on a RISC load-store architecture.
Instructions are available for transferring data between register file and data memory
whereas arithmetic instructions receive their operands from the register file. The register file
consists of eight 32-bit wide entries, each of which can be addressed as two 16-bit entries.
Two of the eight 32-bit register entries are extended by eight guard bits each and used as
accumulator registers for internal higher precision calculation.
The instruction set is based on 16-bit wide native instruction words and instructions are 16,
32 and 64 bits wide. The fetch bundle contains 64 bits. Blackfin features an eight-stage
pipeline: Fetch I+II, decode, address generation, execute I+II+III and write back. Nested
loops with a nesting level of two are supported.
The execution unit of Blackfin, illustrated in Figure 37, contains two 16 by 16 bit
multipliers, two 40-bit wide ALU datapaths and one shifter unit. Typical DSP address
modes are available including circular buffers (with no restrictions on the start address and
the buffer size) and reverse carry addressing, which enables efficient implementation of FFT
algorithms.
Status registers are available which mirror the core status including hardware flags and also
configuration details, for example the rounding mode.
Summary: Blackfin (ADSP-21535) is a high performance 16-bit DSP processor developed
by Analog Devices and Intel. The register file provides only two accumulator registers;
the remaining registers are 32 bits wide, which limits instruction scheduling. The core
description emphasizes cache architectures such as L1 and L2 caches, and cache is provided
for both data and program. The main problem of cache architectures is the
unpredictability of cache hit and cache miss events, which makes it difficult to minimize
the worst-case execution time of real-time critical algorithms.
3.4 xDSPcore
xDSPcore is a fixed-point embedded DSP core architecture based on a modified Dual-Harvard
load-store architecture. A brief architectural overview of the core architecture can
be found in Figure 38. The bit-width of the datapath is parameterized whereas the first
implementation will have a 16-bit datapath. The operands for the arithmetic instructions are
fetched from register files and the results stored in the register files. Two independent data
memory ports are used to transfer data values between data memory and register file.
The native instruction word size is also parameterized. The first implementation contains a
20-bit wide instruction word which allows the coding of all 3-operand arithmetic
instructions within one instruction word. A parallel word is used to store long immediate or
offset values, but a rich set of short addressing modes enables high code density. The
chosen programming model is VLIW and, to overcome its drawback of poor code density, a
scalable Long Instruction Word (xLIW) is introduced [P2][P3][P4].
For the core a RISC pipeline is chosen with three phases, namely instruction fetch, decode
and execute. The number of clock cycles used to implement the structure can be
parameterized. The first implementation contains a five stage pipeline implementation: fetch,
align, decode and execute I+II.
Figure 38: Architectural Overview: xDSPcore.
The register file is split into three parts. The first part is a data register file
containing eight accumulator registers, each of which is 40 bits wide. The accumulator
without guard bits can be addressed as a 32-bit long register, which itself can be accessed
as two 16-bit data registers. The number of entries can be scaled. The second part of the
register file contains eight address registers and related modifier registers which are used
for the bit-reversal and modulo addressing schemes. The third part, called branch-file,
contains the flags and reflects the core status used for conditional branch instructions and
predicated execution. The register files are orthogonal and no register is assigned to
specific functions.
Zero overhead loop instructions are supported with a scalable nesting level whereas the first
implementation supports 4 nesting levels. Further nesting levels require spilling of loop
counter and loop addresses.
xDSPcore supports addressing modes usually supported by state-of-the-art DSP cores. Pre-
and post-operations are supported without additional clock cycles. Bit reversal addressing
allows efficient implementation of FFT algorithms. The size of the modulo buffer is
programmable and the start address has to be aligned. The address registers are structured
orthogonally, which means each can be used for both AGUs.
No configuration or mode registers are supported because all functions are coded inside the
instruction word. Core status flags like zero or sign flags are assigned to the destination
registers. The flags are used for predicated execution which reduces branch instructions in
control code sections (if-then-else) without limitations and restrictions for instruction
scheduling and register allocation.
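The idea of replacing an if-then-else branch by predicated execution can be sketched in Python as follows. The small interpreter, the flag name `ge` and the register names are hypothetical and serve only as illustration; the sketch does not model the xDSPcore instruction set or its flag handling.

```python
def run_predicated(program, regs, flags):
    """Issue every instruction; commit a result only if its predicate matches.

    Each instruction is a tuple (flag, expected, dest, operation). A flag
    of None marks an unconditionally executed instruction.
    """
    for flag, expected, dest, operation in program:
        if flag is None or flags[flag] == expected:
            regs[dest] = operation(regs)
    return regs


def abs_diff(a, b):
    """Compute |a - b| without any branch in the instruction stream."""
    regs = {"a": a, "b": b, "r": 0}
    flags = {"ge": a >= b}  # stands for the result of a preceding compare
    program = [
        ("ge", True, "r", lambda r: r["a"] - r["b"]),   # then-path
        ("ge", False, "r", lambda r: r["b"] - r["a"]),  # else-path
    ]
    return run_predicated(program, regs, flags)["r"]


print(abs_diff(3, 10), abs_diff(10, 3))  # 7 7
```

Both paths are issued in a straight line, so no branch delay occurs; the cost is that the instructions of the path not taken still occupy issue slots.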
In the publication part of the thesis the main architectural features are introduced in detail.
4 High Level Language Compiler Issues
This section covers the compiler aspects considered during definition of xDSPcore, starting
with an introduction of coding practices used for implementing algorithms on DSP core
architectures. The second part gives a brief overview of the structure used for high-level
language compilers followed by a discussion about architectural requirements for
implementing an optimizing compiler. The second part ends with a short summary of
why xDSPcore can be called a compiler-friendly architecture.
4.1 Coding Practices in DSP’s
Traditional DSP algorithms like filtering are loop-centric, which means that 90% of the
execution cycles are executed in code sections consuming less than 10% of the application
code. Increasing the usage of core resources used in loop constructs leads to a significantly
decreased number of required execution cycles.
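The effect of the 90% figure can be made concrete with a short calculation. The following Python sketch uses an Amdahl-style estimate, which is an assumption added here for illustration and not a formula taken from the thesis: if a fraction of the execution cycles is spent in loops and only the loops are accelerated, the remaining cycles bound the overall gain.

```python
def overall_speedup(loop_fraction, loop_speedup):
    """Overall speedup when only the loop part of the cycles is accelerated."""
    remaining = 1.0 - loop_fraction
    return 1.0 / (remaining + loop_fraction / loop_speedup)


# 90% of the execution cycles are assumed to be spent in loop code.
for factor in (2, 4, 8):
    print(factor, round(overall_speedup(0.9, factor), 2))
```

Under this estimate, doubling the loop performance reduces the total cycle count to 55% of its original value, while the non-loop code limits the achievable overall speedup to a factor of ten.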
This section introduces coding practices in Digital Signal Processors for increasing ILP in
VLIW architectures, which leads to better performance during execution of loop constructs.
The first part covers software pipelining, which reduces the number of execution cycles and
increases the usage of core resources. The limitations and restrictions of software pipelining
are investigated; further details can be found in [78].
The second part introduces loop unrolling, which is often used in combination with software
pipelining. Unlike software pipelining, loop unrolling is used to increase the work
carried out inside the loop kernel. At the end of the section the specific implementation of
predicated execution for xDSPcore is introduced, which can also be used for increasing
the performance of loops. In [23][46][49][71][123][136] and [140] aspects of
implementing C-code for efficient code generation are discussed, which
indicates the importance of high-level language programming of Digital Signal Processors.
4.1.1 Software Pipelining
Software pipelining tries to invoke the next loop iteration as early as possible resulting in
overlapped execution. The example in Figure 39 illustrates functionality of software
pipelining on a loop body with four instructions A, B, C and D. In the first row A is
executed for the first time (A1). In the second row when B is executed for the first time (B1)
the next loop iteration is initiated in parallel (A2).
Figure 39: Principle of Software Pipelining.
In row four of the example in Figure 39 four instructions are executed in parallel (D1, C2,
B3, and A4) and the usage of resources reaches its maximum. Row four is called the
kernel; the rows before it, which fill the pipeline, are called the prolog, and the
execution cycles that flush the pipeline are called the epilog.
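The overlapped schedule described above can be reproduced with a short Python sketch. The function name and the row representation are assumptions made for illustration; each stage is assumed to take exactly one row, as in the four-instruction example.

```python
def software_pipeline(stages, iterations):
    """Return the rows of an overlapped schedule.

    Row r contains stage s of iteration r - s, so a new loop iteration
    is started in every row (one row per stage is assumed).
    """
    depth = len(stages)
    rows = []
    for row in range(iterations + depth - 1):
        bundle = []
        for s in reversed(range(depth)):
            iteration = row - s
            if 0 <= iteration < iterations:
                bundle.append(f"{stages[s]}{iteration + 1}")
        rows.append(bundle)
    return rows


rows = software_pipeline("ABCD", 4)
for number, bundle in enumerate(rows, start=1):
    if len(bundle) == 4:
        phase = "kernel"
    elif number < 4:
        phase = "prolog"
    else:
        phase = "epilog"
    print(number, phase, bundle)

# Row 4 holds D1, C2, B3 and A4. The overlapped schedule needs 7 rows,
# whereas executing the four iterations back to back needs 16 rows.
```

With more than four iterations only the number of kernel rows grows, while the prolog and epilog keep their length of three rows each.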
Software pipelines face limitations similar to those of hardware pipelines, e.g. data
dependencies between instructions of the loop. Some of these limitations are considered in
the next subsections.
Trip Count
The trip count is equal to the number of loop iterations. If the loop iteration count equals
the trip count the loop is terminated. A minimum number of loop iterations is required to
fill the software pipeline. The software pipeline does not increase system performance when
the number of loop iterations is less than this minimum trip count. For the example in
Figure 39 the minimum trip count is four, which means that at least four loop iterations are
required to make use of the advantages of software pipelining.
Minimum Initiation Interval
The Minimum Initiation Interval (MII) is equal to the minimum number of execution
bundles building up a software pipelined loop kernel. The MII is restricted by data
dependencies, which are introduced later as the live too long problem, and by the number of
available architectural resources.
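The resource-related part of this bound can be sketched in Python. The function name and the dictionary layout are assumptions made for illustration, and the data dependency part of the MII is deliberately left out. The unit counts used as input (two load/store units, two arithmetic units, one program flow unit) and the instruction mix (three load/store instructions, one arithmetic instruction) correspond to the xDSPcore summation example discussed in this section.

```python
from math import ceil


def resource_mii(op_counts, unit_counts):
    """Lower bound on the kernel length imposed by the available units.

    For every unit class the instructions of one iteration have to be
    distributed over the units of that class, which needs at least
    ceil(instructions / units) execution bundles.
    """
    return max(
        ceil(count / unit_counts[unit_class])
        for unit_class, count in op_counts.items()
        if count > 0
    )


units = {"load_store": 2, "arithmetic": 2, "program_flow": 1}
kernel_ops = {"load_store": 3, "arithmetic": 1}   # two loads, one store, one add
print(resource_mii(kernel_ops, units))            # 2
```

The load/store units are the limiting resource here: three load/store instructions on two units require two execution bundles, although one bundle would be sufficient for the single arithmetic instruction.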
Modulo Iteration Interval Scheduling
Modulo iteration interval scheduling provides a methodology to keep track of resources that
are a modulo iteration interval away from others. For example, for a two-cycle loop,
instructions scheduled on cycle n cannot use the same resources as instructions scheduled on
cycle n+2, n+4, … .
The xDSPcore architecture supports the execution of five instructions in parallel, two
load/store, two arithmetic and one program flow instruction. In Figure 40 the Data Flow
Graph (DFG) of a small kernel is illustrated. The sum of two values first loaded from data
memory is calculated and the result then stored in data memory again. The memory
addresses are auto-incremented.
Figure 40: Data Flow Graph of an Example Issuing Summation of two Data Values.
The relative dependency between load instruction and add instruction is two which requires
one cycle distance between fetching data and issuing the add operation. The dependency
between ADD and Store operation is however zero, therefore both instructions can be
issued during the same cycle.
Cycle Number   0               2               4    …
MOV1           LD (R0)+, D0
MOV2           LD (R1)+, D1
CMP1                           ADD D0,D1,D4
CMP2
BR

Cycle Number   1               3               5    …
MOV1                           ST D4, (R2)+
MOV2
CMP1
CMP2
BR
Table 1: Principle of Resource Allocation Table.
In this example three load/store instructions are executed. With two load/store units this
requires a minimum kernel length of two execution bundles. A resource allocation table as
used in Table 1 is useful for performing software pipelining manually. The two load
instructions are scheduled into cycle 0. The data dependency leads to an unused second cycle
(cycle 1). The third cycle (cycle 2) is used to issue the add operation. As mentioned above,
the store operation could be issued in the same cycle as the ADD operation. To prevent
resource conflicts during software pipelining the store operation is shifted to the fourth
cycle (cycle 3). This has no influence on the overall cycle count, which is already limited
by the MII of two caused by the three required load/store instructions.
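The bookkeeping behind such a resource allocation table can be sketched in C as a small modulo reservation table. This is a minimal illustration only: the type and function names are invented here, and the unit counts (two load/store, two arithmetic, one program flow) follow the description of xDSPcore above. An instruction is placed in the first cycle at or after its earliest cycle whose slot (cycle modulo kernel length) still has a free unit.

```c
#include <assert.h>
#include <string.h>

#define MAX_II 8

enum unit_class { LOAD_STORE, ARITH, BRANCH, NUM_CLASSES };

/* Units per class: two load/store, two arithmetic, one program flow. */
static const int units[NUM_CLASSES] = { 2, 2, 1 };

struct mrt {
    int ii;                          /* kernel length in bundles */
    int used[MAX_II][NUM_CLASSES];   /* units taken per modulo slot */
};

void mrt_init(struct mrt *t, int ii)
{
    assert(ii > 0 && ii <= MAX_II);
    memset(t, 0, sizeof *t);
    t->ii = ii;
}

/* Places one instruction at the first cycle >= earliest whose modulo slot
   still has a free unit of class c. Returns that cycle, or -1 if all
   slots are full for this class. */
int mrt_place(struct mrt *t, enum unit_class c, int earliest)
{
    assert(earliest >= 0);
    for (int cycle = earliest; cycle < earliest + t->ii; cycle++) {
        int slot = cycle % t->ii;
        if (t->used[slot][c] < units[c]) {
            t->used[slot][c]++;
            return cycle;
        }
    }
    return -1;
}
```

With a kernel length of two, placing the two loads at cycle 0 fills both load/store units of the even slot, so a store requested for cycle 2 is moved to cycle 3, as in Table 1.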
In Table 2 software pipelining is manually introduced for that example. The first column is
copied into the second column and the second into the third.
Units available in each cycle: MOV1, MOV2, CMP1, CMP2, BR

Cycle number      Column 1          Column 2          Column 3
0, 2, 4           LD (R0)+, D0      LD (R0)+, D0      ADD D0,D1,D4
                  LD (R1)+, D1      LD (R1)+, D1
                                    ADD D0,D1,D4
1, 3, 5                             ST D4, (R2)+      ST D4, (R2)+
Table 2: Resource Allocation Table including Software Pipeline Technology for increased
Usage of Core Resources.
Table 2 makes apparent why moving the store operation into the fourth cycle (cycle 3),
carried out to prevent resource conflicts, has no influence on the overall performance.
The second column in Table 2 (grey shaded) is equal to the kernel as illustrated in Figure 41.
Column one is equal to the prolog, but instead of using a NOP instruction it is possible to
schedule the loop instantiation into the free execution cycle (which increases code density).
Column three is equal to the epilog of the software pipelined loop.
Prolog:  LD (R0)+,D0 || LD (R1)+,D1
         BKREP N-1, epilog
Kernel:  LD (R0)+,D0 || LD (R1)+,D1 || ADD D0, D1, D4
         ST D4,(R2)+
Epilog:  ADD D0, D1, D4 || ST D4,(R2)+
Figure 41: Example for Assembler Code Implementation including Software Pipelining
(xDSPcore).
Live Too Long Problem
An additional limitation is the live too long problem. Consider, for example, a loop kernel
consisting of two execution cycles. A register entry cannot be kept for more than two cycles
because the next loop iteration would overwrite the value before it has been used. The two
aspects influencing the live too long problem are the loop carry path (LCP) and the split
join path (SJP).
To illustrate the related limitations a different code example from the one in the previous
subsection is needed. Figure 42 introduces the implementation of a search function which
finds the maximum value of a vector.
Figure 42: Data Flow Graph for Maximum Search Example.
Loop Carry Path
A loop carry path is caused by an instruction writing a result whose value is used for the
next loop iteration. For the example in Figure 42 the LCP is between the CMP function
responsible for comparing the current maximum value with the new loaded value and the
conditionally executed move register function which updates the latest maximum entry.
In the example in Figure 42 the LCP is equal to two, which restricts the MII to a minimum of two.
Split Join Path
If the same value is used by more than one instruction it has to be valid until all instructions
have been executed. The longest path determines the minimum length of the MII still
guaranteeing correct semantics.
For the example in Figure 42 the SJP is between the load instruction and the conditional
move register instruction and is equal to three. Therefore the minimum number of execution
cycles building up a valid loop kernel is equal to three.
For the example in Figure 42 the SJP dominates the MII. In some code sections it is
possible to reduce the LCP by moving the instruction overwriting the value to a later
execution cycle. Additional register move instructions can be used to split up the SJP into
smaller parts and thereby reduce the MII.
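For orientation, a C version of the maximum search might look as follows. This is a sketch only: the function name and signature are assumptions, and the comments merely indicate where the loop carry path and the split join path described above arise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the maximum value of the n elements in v (n must be > 0). */
int16_t vector_max(const int16_t *v, size_t n)
{
    assert(v != NULL && n > 0);
    int16_t max = v[0];
    for (size_t i = 1; i < n; i++) {
        int16_t x = v[i];   /* load: x is used by compare and move (SJP) */
        if (x > max)        /* compare: reads max of the previous pass (LCP) */
            max = x;        /* conditional move: updates the maximum */
    }
    return max;
}
```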
4.1.2 Compiler Support
The C-compiler for xDSPcore supports software pipelining. Unlike commercially available
C-compilers, it does not require the use of compiler-known functions and intrinsics.
Figure 43 illustrates a small C-code example which calculates 16 products and sums up the
results in an accumulator register.
for (j=0; j<16; j++) sum += a[j]*b[j];
Figure 43: C-Code Example for Illustration of Software Pipelining.
Figure 44 illustrates the code generated by the xDSPcore C-compiler without making use of
software pipelining, which requires 52 execution cycles to calculate the result. The loop
kernel consists of two load instructions (loading data from memory) and a MAC instruction
which executes the multiplication and the summation in one instruction.
          CLR A0 || BKREP 15, loopend
          LD (R0)+, D2 || LD (R1)+, D3
          NOP
          FMAC D2, D3, A0
loopend:  RET
          NOP
          NOP
Figure 44: Generated Assembler Code without Software Pipelining (xDSPcore).
The NOP is necessary due to data dependency (load-in-use dependency). The two NOP
instructions after return (RET) are branch delay NOPs caused by the five-stage pipeline of
xDSPcore. Making use of software pipelining the loop kernel can be reduced to one
execution bundle and the number of cycles used to calculate the result is decreased to 19
execution cycles (illustrated in Figure 45).
LD (R0)+,D2 || LD (R1)+,D3 || CLR A0
LD (R0)+,D2 || LD (R1)+,D3 || REP 13
FMAC D2, D3, A0 || LD (R0)+,D2 || LD (R1)+,D3
FMAC D2, D3, A0 || RET
FMAC D2, D3, A0
NOP
Figure 45: Generated Assembler Code including Software Pipelining (xDSPcore).
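The transformation can also be written out at C level. The sketch below is an illustration under assumptions (function names, 16-bit operands and a 32-bit accumulator are chosen here, not taken from the compiler): the loads run two iterations ahead of the multiply-accumulate, which yields a prolog of two load pairs, a one-statement-group kernel and an epilog of two multiply-accumulates.

```c
#include <assert.h>
#include <stdint.h>

/* Straightforward loop, as in Figure 43. */
int32_t dot_plain(const int16_t *a, const int16_t *b, int n)
{
    int32_t sum = 0;
    for (int j = 0; j < n; j++)
        sum += (int32_t)a[j] * b[j];
    return sum;
}

/* Same computation with the loads two iterations ahead of the MAC. */
int32_t dot_pipelined(const int16_t *a, const int16_t *b, int n)
{
    assert(n >= 2);
    int32_t sum = 0;                  /* clear the accumulator */
    int16_t x0 = a[0], y0 = b[0];     /* prolog: first load pair */
    int16_t x1 = a[1], y1 = b[1];     /* prolog: second load pair */
    for (int j = 2; j < n; j++) {     /* kernel */
        sum += (int32_t)x0 * y0;      /* MAC on the oldest pair */
        x0 = x1; y0 = y1;             /* advance the pipeline */
        x1 = a[j]; y1 = b[j];         /* loads for a later iteration */
    }
    sum += (int32_t)x0 * y0;          /* epilog: drain the pipeline */
    sum += (int32_t)x1 * y1;
    return sum;
}
```

For n = 16 the kernel runs 14 times and the epilog performs the remaining two multiply-accumulates.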
One of the two branch delay slots can be filled with an instruction of the epilog, so only one
NOP instruction remains. This remaining NOP instruction can be removed by using a
non-delayed RET instruction. Another possibility is to use a delayed RET instruction
together with predicated execution. For this purpose the predicated execution
implementation of xDSPcore supports loop conditions. Further details can be found in a
later section.
The software pipelining algorithm is based on modulo scheduling [134], which starts from an
estimate of the MII. The instruction scheduler tries to find a solution fulfilling the
estimated value. If no valid solution is found, the iteration interval is increased until a
solution is obtained or the number of necessary execution cycles exceeds that of the
originally scheduled loop.
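The search described above can be sketched as a simple loop. Everything named here is hypothetical: `try_schedule_fn` stands for whatever routine attempts to build a kernel of a given length, and `succeeds_from` is only a stand-in used to exercise the search, not a real scheduler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: tries to build a kernel of length ii. */
typedef bool (*try_schedule_fn)(int ii, void *ctx);

/* Starts at the estimated MII and increases the kernel length until a
   schedule is found. Returns the kernel length, or 0 if the length would
   exceed the originally scheduled loop body. */
int find_initiation_interval(int mii, int original_length,
                             try_schedule_fn try_schedule, void *ctx)
{
    assert(mii > 0 && try_schedule != NULL);
    for (int ii = mii; ii <= original_length; ii++) {
        if (try_schedule(ii, ctx))
            return ii;
    }
    return 0;
}

/* Stand-in scheduler for demonstration: succeeds once ii reaches the
   threshold stored in ctx. */
bool succeeds_from(int ii, void *ctx)
{
    return ii >= *(const int *)ctx;
}
```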
4.1.3 Loop Unrolling
Loop unrolling is quite often used in combination with software pipelining. If dependencies
or the SJP and LCP limit the performance gain achieved by software pipelining, loop unrolling
can decrease the number of execution bundles.
In Figure 46 the basic functionality is illustrated. The loop has to be executed N times. If
the number of available resources is sufficient and several loop bodies can be implemented
in parallel, it is possible to reduce the number of loop iterations, for example by executing
two loop bodies in parallel to reduce the loop iterations by a factor of two.
Figure 46: Principle of Loop Unrolling.
To illustrate loop unrolling, the summation of vector elements is used again; the related DFG
is illustrated in Figure 40. Software pipelining is limited by the three required load/store
instructions and therefore by an MII of two. If the elements are split into two halves (even
and odd) and the two kernel functions are executed in parallel, the kernel grows by one cycle
but the necessary iteration count is halved. This reduces the number of necessary execution
cycles by about 25%. The drawback is a decreased code density due to the duplicated kernel;
however, the contribution of loop kernels to the application code density is negligible.
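The 25% figure can be checked with a small calculation. The sketch below assumes that only the kernel is counted (prolog and epilog are ignored), that each loop body needs three load/store instructions and that two load/store units are available; the function names are invented for this illustration. The second function shows the vector summation unrolled by two at C level.

```c
#include <assert.h>
#include <stdint.h>

/* Kernel bundles for n elements when `unroll` loop bodies share one
   kernel: ceil(3 * unroll / 2) bundles per pass, n / unroll passes. */
int kernel_cycles(int n, int unroll)
{
    assert(unroll > 0 && n >= 0 && n % unroll == 0);
    int bundles_per_pass = (3 * unroll + 1) / 2;
    return bundles_per_pass * (n / unroll);
}

/* Vector summation unrolled by two: even and odd elements in one pass. */
void vector_add_unrolled2(const int16_t *a, const int16_t *b,
                          int16_t *c, int n)
{
    assert(n >= 0 && n % 2 == 0);
    for (int i = 0; i < n; i += 2) {
        c[i]     = (int16_t)(a[i]     + b[i]);
        c[i + 1] = (int16_t)(a[i + 1] + b[i + 1]);
    }
}
```

For 16 elements the kernel count drops from 32 bundles (no unrolling) to 24 bundles (unrolled by two).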
4.1.4 Predicated Execution using Loop Flags
xDSPcore supports a predicated execution implementation, which allows loop flags to be
used to indicate the status of the loop iteration for building up the condition. Therefore it is
possible to move instructions from epilog or prolog into the loop kernel and execute them
only once (e.g. during first or last loop iteration). In Figure 47 an example is illustrated
making use of this feature. The first implementation shows a standard loop implementation
and the second illustrates the advantage of using predicated execution.
Prolog:           inst1 || inst2 || inst3
                  inst4 || inst5 || BKREP N, loopend
Kernel:           inst1 || inst2 || inst3
                  inst4 || inst5 || inst6
Epilog: loopend:  inst6

Loop:             BKREP N, loopend
                  inst1 || inst2 || inst3
                  FSEL sf0=0: dc:inst4 || dc: inst5 || true: inst6
loopend:          inst6
Figure 47: Principle of Predicated Execution using Loop Flags.
The drawback of predicated execution is the decreased code density caused by condition
coding. For the implementation example illustrated in Figure 47, however, the code density
may even be increased. In Figure 47 parts of the prolog are shifted into the loop kernel; it
is also possible to move the epilog into the loop kernel. For example, moving a return
operation into the loop kernel, where it is then executed only once, can be used to
compensate for the drawback of branch delays. More details can be found in [P6][P7].
4.2 Compiler Overview
The compiler of xDSPcore can be split into two parts namely an architecture-independent
front-end and an architecture-dependent back-end [43].
The front-end performs lexical, syntax and semantic analysis of the source code and carries
out architecture-independent code optimizations. An intermediate representation (IR), for
example the IR of the Open Compiler Environment (OCE) of ATAIR, is built up and used to
transfer the parsed and pre-optimized application code to the back-end.
The back-end transforms the IR into the processor-specific programming language. The
back-end considers the available processor resources and performs architecture-dependent
optimizations.
Figure 48: General High-level Language Compiler Structure.
The same front-end (as illustrated in Figure 48) can be used for different processor
architectures. The optimizations it performs are architecture independent, whereas each
input language requires a modified front-end. The back-end contains architecture-dependent
optimizations and is therefore bound to the target architecture. The same back-end can be
used in combination with different front-ends, and therefore different input languages, as
long as the IR is the same.
The OCE from ATAIR is used as the front-end of the xDSPcore C-compiler with some
minor modifications. The back-end was developed in cooperation between Infineon
Technologies and the Christian Doppler Gesellschaft (CDG) in Vienna, Austria. Some
implementation details of the back-end are illustrated in the next subsection and more
details can be found in [98][163].
The methods introduced on the following pages are well known and have been used in
various compiler back-ends [43][46][114]. Some of these aspects are important when
discussing the suitability of architectural core features and are therefore briefly
introduced [99][105].
Instruction Selection
Instruction selection takes the abstract syntax tree provided by the front-end as input and
generates machine language instructions or low-level intermediate instructions. The most
favored approach is bottom-up tree rewriting [86]. A set of rules specifies the tree patterns
which can be matched, the semantic actions for these patterns (such as machine instructions)
and the related costs, e.g. the number of cycles required to execute the machine instructions.
Typeform symbols determine the location and representation of the values at run time, for
example address register, immediate constant and so on. They appear as operands and
results of the rules.
During the first recursive traversal of the tree all nodes are labeled in bottom-up order
following the rules, according to the patterns matched by the node and its descendant nodes.
The label includes the typeform of the result and the minimum cost to achieve this result.
The generated information is used to determine the applicable rules for ancestor nodes and
the accumulated costs for ancestor sub-trees. In a second pass the tree is traversed
recursively, applying the rules selected in the first pass.
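The two passes can be sketched in C on a toy rule set. This is a strongly simplified illustration, not the rule set of the xDSPcore back-end: there is only one typeform (a value in a register), every instruction is assumed to cost one cycle, and the only multi-node pattern is ADD(x, MUL(y, z)), which is covered by a single multiply-accumulate. All type and function names are invented here.

```c
#include <assert.h>
#include <stddef.h>

enum op   { OP_REG, OP_ADD, OP_MUL };
enum rule { RULE_REG, RULE_ADD, RULE_MUL, RULE_MAC };

struct node {
    enum op op;
    struct node *left, *right;
    int cost;         /* minimum cost found in the labeling pass */
    enum rule rule;   /* rule that achieves this cost */
};

/* First pass: label nodes bottom-up with the cheapest matching rule. */
void label(struct node *n)
{
    assert(n != NULL);
    if (n->op == OP_REG) { n->cost = 0; n->rule = RULE_REG; return; }
    label(n->left);
    label(n->right);
    /* Single-node pattern: one instruction for this operator. */
    n->cost = n->left->cost + n->right->cost + 1;
    n->rule = (n->op == OP_ADD) ? RULE_ADD : RULE_MUL;
    /* Larger pattern: ADD(x, MUL(y, z)) covered by one MAC. */
    if (n->op == OP_ADD && n->right->op == OP_MUL) {
        int mac = n->left->cost + n->right->left->cost
                + n->right->right->cost + 1;
        if (mac < n->cost) { n->cost = mac; n->rule = RULE_MAC; }
    }
}

/* Second pass: follow the selected rules and count the instructions. */
int emit(const struct node *n)
{
    assert(n != NULL);
    switch (n->rule) {
    case RULE_REG: return 0;
    case RULE_MAC: return emit(n->left) + emit(n->right->left)
                        + emit(n->right->right) + 1;
    default:       return emit(n->left) + emit(n->right) + 1;
    }
}
```

For a + b*c the labeling pass selects the MAC rule at the root (one instruction), whereas (a + b)*c needs two instructions.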
Instruction Scheduling
During instruction scheduling the logical instruction stream is reordered without altering the
code semantics. The aim of this reordering is to reduce the number of execution cycles. For
a multi-issue architecture instructions which can be executed in parallel can be identified.
Unused execution cycles caused by data and control dependencies (for example branch
delay slots) can be filled with independent instructions.
One major issue concerning instruction scheduling is the phase ordering problem between
instruction scheduling and register allocation. Instruction scheduling performed before
register allocation (named prepass scheduling) offers maximum flexibility but can lead to
high register pressure, causing a need for additional instructions for saving and restoring
register entries (spilling). Performing register allocation before instruction scheduling can
introduce false dependencies between instructions which reduce the possible instruction
level parallelism (ILP). There are known strategies to lessen this problem, for example
taking register pressure into account during scheduling or a bottom-up reorganization of
scheduled code in order to reduce register pressure [63][90].
Register Allocation
Early passes in the compiler assume an unlimited number of registers - symbolic registers.
During register allocation the mapping of these numerous symbolic registers to the
physically available CPU registers takes place.
The standard solution is the graph-coloring method introduced by Chaitin [60] and refined
by Briggs [54][55]. This method examines data and control flow and builds an interference
graph. The nodes of this graph represent the symbolic registers. Whenever two nodes are
live at the same time an edge between them is added to the graph. Two nodes connected by
an edge cannot use the same CPU register. The allocator then tries to color the graph with
N colors, where N is the number of available CPU registers [100][142][147]. If the mapping
cannot be completed, some of the symbolic registers have to be stored in memory. This
technique is called spilling. Spilling causes additional instructions and additional
execution cycles, and therefore choosing the right nodes for spilling is important to keep
the spilling costs low [66][130][152].
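A much simplified version of this scheme can be sketched in C. The code below is an illustration in the spirit of the simplify and select phases of graph coloring, not the allocator used for xDSPcore: the interference graph is a fixed-size adjacency matrix, the spill candidate is simply the node with the highest remaining degree, and all names are chosen here.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 16
#define UNCOLORED (-2)
#define SPILLED   (-1)

/* Colors the interference graph adj of n symbolic registers with k colors.
   color[i] receives 0..k-1 or SPILLED. Returns the number of spills. */
int color_graph(int n, bool adj[MAX_NODES][MAX_NODES], int k,
                int color[MAX_NODES])
{
    bool removed[MAX_NODES] = { false };
    int stack[MAX_NODES];
    int top = 0, spills = 0;

    assert(n >= 0 && n <= MAX_NODES && k > 0 && k <= MAX_NODES);
    for (int i = 0; i < n; i++)
        color[i] = UNCOLORED;

    /* Simplify: remove nodes with fewer than k remaining neighbors; if
       none exists, spill the node with the highest remaining degree. */
    for (int left = n; left > 0; left--) {
        int pick = -1, pick_deg = -1;
        bool low = false;
        for (int i = 0; i < n && !low; i++) {
            if (removed[i]) continue;
            int deg = 0;
            for (int j = 0; j < n; j++)
                if (j != i && !removed[j] && adj[i][j]) deg++;
            if (deg < k) { pick = i; low = true; }
            else if (deg > pick_deg) { pick = i; pick_deg = deg; }
        }
        removed[pick] = true;
        if (low) stack[top++] = pick;
        else { color[pick] = SPILLED; spills++; }
    }

    /* Select: re-insert nodes in reverse order, lowest free color first. */
    while (top > 0) {
        int v = stack[--top];
        bool used[MAX_NODES] = { false };
        for (int j = 0; j < n; j++)
            if (j != v && adj[v][j] && color[j] >= 0) used[color[j]] = true;
        int c = 0;
        while (used[c]) c++;
        color[v] = c;
    }
    return spills;
}
```

Three mutually interfering symbolic registers can be mapped to three CPU registers without spilling, whereas with only two CPU registers one of them has to be spilled.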
A second task of register allocation is the elimination of copy instructions, called register
coalescing [87]. Earlier passes for example Static Single Assignment (SSA) optimization
[46][68][69] may introduce move instructions which are redundant if source and destination
of the move do not interfere. The two nodes representing source and destination operands
are coalesced to one node which increases code density and decreases the number of
execution cycles.
4.3 Requirements
The requirement section is split into two parts. It begins by describing the requirements
a programmer expects from a C-compiler, followed by an introduction of architectural
features which support the development of an optimizing HLL compiler.
4.3.1 Requirements on the C-Compiler
State-of-the-art C-compilers have to support the ANSI-C standard. For efficient use of
DSP-specific functions like saturation and special rounding modes, support of the
Embedded-C standard is required. Compilers supporting C++ or other object-oriented
languages are built up in the same manner as an ANSI-C compiler, with a modified front-end
as illustrated in Figure 48. The question is whether support of an object-oriented language
is required for describing digital signal processing algorithms.
A state-of-the-art C-compiler has to support fractional data types, whereas ANSI-C is based
on integer data types. This can be done by type conversion through destination variables of
the new type. Multiple data memory banks are another feature of DSP core architectures
which has to be supported by the compiler. The challenge of multiple memory banks for the
compiler is the variable assignment. If it is not possible to distribute the variables
across the two data ports, only half of the memory bandwidth can be used. Methods for
interleaved address assignment can be found in [151].
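To illustrate why fractional data types need dedicated support, the following sketch emulates a 16-bit fractional (Q15) multiplication with plain ANSI-C integer arithmetic. The function name, the rounding choice and the saturation of the single overflow case are assumptions made for this example; the right shift of a negative value is implementation-defined in C and is assumed to be arithmetic here, as it is on common compilers.

```c
#include <assert.h>
#include <stdint.h>

/* Multiplies two Q15 fractional values (range [-1, 1)) and returns the
   rounded, saturated Q15 result. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;   /* Q30 product */
    p = (p + (1 << 14)) >> 15;             /* round and scale to Q15 */
    if (p > INT16_MAX)                     /* only -1 * -1 overflows */
        p = INT16_MAX;
    assert(p >= INT16_MIN && p <= INT16_MAX);
    return (int16_t)p;
}
```

Here 16384 represents 0.5, so `q15_mul(16384, 16384)` yields 8192, which represents 0.25.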
HLL compilers for DSP cores produce significant code overhead compared with manually
written code. Some of the reasons for the reduced code density are discussed in the
following subsections. The lack of optimizing C-compilers for DSP architectures is partly
due to the architectures themselves; however, commercial aspects should also be
considered.
The commercial pressure to provide optimizing compiler technology has not been overly
strong. The application code for high volume products is manually coded and development
costs are negligible compared with the costs of poorly used silicon caused by low code
density. Most DSP core vendors favor a third-party tool concept. Companies like Green
Hills, Tasking and also Metrowerks have experience in the area of tools for
microcontrollers. To obtain efficient code for a DSP core additional knowledge and effort is
required but the possible market to sell licenses is small, especially for embedded DSP
cores. The third party tool provider cannot expect to sell thousands of licenses each year.
4.3.2 Architectural Requirements
The concept of defining a DSP core architecture driven by algorithmic and hardware
development aspects and developing the tool-chain afterwards has led to the status quo:
C-compilers which generate significant code overhead compared with manual coding and are
therefore not suitable for high volume products. The second limitation is the large amount
of manually developed assembly code (legacy code), which limits the efficiency of products
and system solutions.
Load Store Architecture
Load/Store architectures support separate instructions for transferring data entries of the
register files to data memory and vice versa. Arithmetic instructions fetch their operands
from the register files. The decoupling of arithmetic and data move instructions allows the
use of lean pipeline structures and the reduction of the native instruction word size.
Decoupled arithmetic and data move operations make it simple to apply software pipelining
and loop unrolling. An additional advantage during execution of control code is that not
every instruction requires a data transfer to data memory, and intermediate results can be
stored inside the register files.
Large Uniform Register Sets
Register files significantly influence the die area of the core. The internal registers are used
to store intermediate values which reduces the number of data memory accesses. During
instruction scheduling an infinite number of registers is assumed. During register allocation
the used variables are mapped to the physically available register sets.
Figure 49: Example for banked Register Files (TI C62x).
The demand of compiler architects for a large number of registers is neither economic nor
feasible from a hardware point of view. Supporting one large register file limits the
reachable clock frequency of the subsystem, due to the multiplexers, the address decoder
logic and the wiring effort caused by the increased number of read/write ports, and
decreases code density due to the required addressing space. The TI C62x provides a large
register file (32 registers, each 32 bits wide) and the same registers are used for
arithmetic and addressing modes (as illustrated in Figure 49). This is a compiler-friendly
aspect. Because accumulator support (e.g. 40-bit as common in most of today’s core
architectures) is missing, two 32-bit registers are consumed for each 40-bit intermediate
result. To reach a higher clock frequency the register file is banked, with a cross
connection used to transfer data between the two banks.
Banked register files restrict instruction scheduling. Instructions which use results of
prior instructions have to be executed on the same data path. The register allocator of the
compiler is limited to the registers of one bank for consecutive instructions; unused
registers in the second bank cannot be used. The transfer of data entries between two
register banks costs additional clock cycles, resulting in an increased define-in-use
dependency.
No Modes
Mode registers are used to increase code density. The same instruction coding takes on a
different meaning depending on the chosen mode. This is often used for DSP specific
functions like saturation and rounding modes.
Figure 50: Limitations during Instruction Scheduling caused by Processor Modes.
In Figure 50 two small code sections are illustrated where the chosen assembly code
represents xDSPcore assembler. For the first code block the saturate mode is activated and
for the second code block it is not. Saturation influences the result of arithmetic instructions
when the result of the operation exceeds the data range of the destination register.
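The effect of the saturation mode on an addition can be made concrete with two C functions, one per behavior. The names and the 16-bit operand width are assumptions for this sketch. The same operands give different results depending on whether saturation is active, which is why an instruction cannot simply be moved into a code block with a different mode setting.

```c
#include <assert.h>
#include <stdint.h>

/* Addition with saturation: the result is clamped to the 16-bit range. */
int16_t add_sat16(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + (int32_t)b;
    if (s > INT16_MAX) s = INT16_MAX;
    if (s < INT16_MIN) s = INT16_MIN;
    return (int16_t)s;
}

/* Addition without saturation: the result wraps around modulo 2^16. */
int16_t add_wrap16(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + (int32_t)b;
    if (s > INT16_MAX) s -= 65536;
    if (s < INT16_MIN) s += 65536;
    assert(s >= INT16_MIN && s <= INT16_MAX);
    return (int16_t)s;
}
```

Adding 30000 and 10000 gives 32767 with saturation and -25536 without it.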
Moving instructions from one code block into the next, where a different mode is set, is
expensive in terms of code density and cycle count. The mode has to be changed before
executing the moved instruction and reset afterwards. In the example in Figure 50, moving
the add instruction is not possible due to the different mode settings. The same holds for
instructions of different basic blocks with different mode settings, where basic blocks are
code sections between branch instructions.
The advantage of the increased code density gained by introducing mode bits can be
overcompensated by the instructions required for setting and resetting modes. The
alternative is poor usage of the provided core resources. Mode settings have a similar
influence on code fragmentation as branch instructions.
Orthogonal Instruction Set
Orthogonality is defined in the section where DSP-specific aspects are discussed. Figure 51
illustrates an example where orthogonality is given up in order to increase the
theoretically attainable performance through a higher clock frequency. The AGU of the
Motorola 56k DSP family supports address, modifier and base registers. Each of the AGUs has
a group of registers assigned, which means that not every address register can be used for
each address operation.
Figure 51: Example for Address Generation Unit (Motorola 56300).
This restricts the use of the address registers and the AGUs during register allocation and
instruction scheduling which can lead to a significant decrease of the core performance.
In Figure 52 a second example of non-orthogonal instructions is illustrated. The MAX2VIT
D4,D2 instruction of the Starcore SC140 was introduced to increase performance when
calculating Viterbi algorithms [26].
Figure 52: Example for not Orthogonal Instructions: MAX2VIT D4,D2 (Starcore SC140).
The instruction can only make use of two predefined data registers, one of which is also
used as the destination register. Even worse, a mode bit is used to switch between these
two data registers and a second register couple. The implicitly addressed register entries
provide no flexibility to the register allocator and even restrict the use of the register
couple in code sections making use of the MAX2VIT instruction.
Simple Issue Rules
Dependencies between instructions increase complexity during instruction scheduling. Deep
pipelines with implicit dependencies increase the complexity of instruction scheduling when
considering control dependencies. Data-value-dependent execution time, as supported by
ARM [6] for the multiplication operation, can be used as an example. Depending on the data
values of the multiplication operands the execution time can vary between 1 and 4 clock
cycles. The related define-in-use dependency has to be assumed as worst case at compile
time, which leads to unused execution cycles.
Hidden cluster latency, caused for example by banked registers as illustrated in Figure 49
for the Texas Instruments TI C62x, increases issue complexity. The relationship between
register bank and datapath reduces flexibility during instruction scheduling.
Efficient Stack Frame Access
Compilers pass arguments to subroutines by using a stack. One of the address registers is
used as stack frame pointer, so the called subroutines do not have to know the absolute
address where the arguments are located. The passed data values are accessed relative to
the stack frame pointer. For this reason efficient stack frame addressing is important when
using a compiler for automatic code generation.
4.3.3 Architectural Obstacles
Some other examples of architectural obstacles used in commercially available core
architectures to increase code density and reduce consumed silicon area are illustrated in the
next subsections. The drawbacks of these architectural features are mentioned.
Modes for Different Instruction Sets
To increase code density an efficient binary coding of the instruction set architecture is
required. In the case of microcontroller architectures code density is not so important
because often the code is stored in external and therefore cheap memory.
Figure 53: Example for Mode Dependent Instruction Sets: ARM Thumb Decompression
Logic.
In the mid-1990s ARM introduced the Thumb instruction set architecture [3] to increase code
density. Thumb is a condensed version of the ARM instruction set with reduced support for
operands along with immediate and offset values. A decompression logic as illustrated in
Figure 53 is used to internally build up regular ARM instructions, which requires
additional hardware. For example, the multiply instruction supports 4 bits for operand
coding in a regular ARM instruction and 3 bits in a Thumb multiplication. Thumb can be used
to increase code density but it has the drawback of irregularity and mode dependency.
Instruction selection has a significant impact on the output of register allocation. The
compiler has to make conservative assumptions when using the reduced register set for
Thumb instructions. Making use of the reduced register set increases register pressure and
produces more spill code. Additional move instructions are required, which decrease code
density.
Irregular Instructions
The Starcore SC140 [32] supports up to 16 address registers as illustrated in Figure 54. The
upper 8 address registers are also used as base registers during modulo addressing modes,
which limits their use in code sections not using modulo addressing modes.
Figure 54: Example for Address Generation Unit (Starcore SC140).
Conservative assumptions by the compiler prevent the use of the upper eight address
registers. An instruction that makes use of the upper address registers must not be moved
into a code section where modulo addressing is used.
Implicit Dependencies
The implementation of predicated execution in the Starcore SC140 [32] is an example of
implicit dependencies. The Starcore is taken as an example but the problem is similar in
most of today’s core architectures.
Predicated execution or conditional execution can be used to reduce the number of
conditional branch instructions in control code sections typically caused by if-then-else
constructs [149]. The drawback of branch delays caused by branch instructions can be
compensated with conditional execution. In most of the available implementations the
disadvantage is decreased code density. A more detailed analysis can be found in [149].
The Starcore implementation supports one status flag, called the T-Flag, which is used to
build up the condition. During instruction scheduling no instruction which influences the
T-Flag may be scheduled between status generation and evaluation, which significantly
limits the instruction scheduler. Multi-way VLIW, which supports the execution of several
instructions in parallel (e.g. up to 6 instructions on the Starcore), results in short
branch distances. This limitation can have a significant influence on the execution time of
the application code.
Complex Instruction Sets
In the first section the main differences between microcontroller and DSP architectures
were discussed. One major aspect to increase code density is the support of complex
instructions like the MAC instruction with implicit operand fetch from memory.
Complex instruction set architectures, for example the instruction set architecture of the
Carmel DSP, are not suitable for use by automatic compiling tools. To overcome the drawback
of low code density of VLIW architectures, CLIW [12][157] was introduced (illustrated in
Figure 55). An extended program memory port supports the fetching of additional
instructions at moments when peak performance of the DSP core is required.
Making use of the extension port requires a special instruction coding, and some of the
instructions are only supported in CLIW. Both aspects limit the performance reachable by
the instruction scheduler.
Figure 55: Configurable Long Instruction Word (CLIW of Carmel DSP Core).
RISC instructions allow the use of the available parallel resources and coding practices as
illustrated in an earlier part of this section aim to increase code density and decrease the
required execution cycles.
4.4 HLL-Compiler Friendly Core Architecture
This sub-section briefly summarizes the aspects concerning why xDSPcore is suitable as a
target architecture for an optimizing C-compiler. Many of the aspects have already been
illustrated in the subsections before and their individual implementation for xDSPcore is
illustrated in the following subsections.
Load-Store Architecture
xDSPcore is based on a modified Dual-Harvard load-store architecture. Separate
instructions are available for transferring data values between register file and data memory.
A brief overview is illustrated in Figure 56. Program and data are assigned to different
address spaces and an instruction buffer with cache logic [92] is used to increase code
density and decrease power dissipation by reducing the switching activity at the program
memory port [P2][P3][P4][P5].
Figure 56: xDSPcore Core Overview.
Orthogonal Register Set
The register set of xDSPcore is split into three parts: a data register file, an address
register file and a branch file (illustrated in Figure 57). The registers inside each of
the register files are orthogonal, i.e. none of the registers is assigned to a certain
instruction.
Figure 57: Orthogonal Register File.
The first implementation will combine the data and address register files in one register
file structure, which increases the number of read and write ports of that register file
but reduces the total number of read and write ports.
No Mode Register
xDSPcore does not contain mode registers. Supported features and instructions are coded
inside the instruction word. The status bits indicating the core status of xDSPcore like sign,
zero or overflow flags are destination register related which increases flexibility for the
instruction scheduler. The flags are stored in a separate branch file as illustrated in Figure
57.
Orthogonal Instruction Set
The instruction set is orthogonal in the sense of the definition by BDTi [116]. None of the
instructions contain implicit operand addressing or micro architectural limitations in the
sense of mode dependent resource allocation, examples of which can be found in [3].
Simple Issue Rules
The lean RISC pipeline structure as introduced in Figure 58 allows short define-in-use and
load-in-use dependencies. The supported instructions and their dependencies inside the
pipeline can be summarized in five different cases, whose issuing rules are illustrated in
Figure 58.
Figure 58: Issuing Rules for xDSPcore Architecture.
The n-way VLIW architecture allows the execution of n instructions in parallel. The first
implementation supports 5 parallel units. The decoder structure is split into separate units so
removing or adding additional units has only minor influence on the decoder architecture.
Efficient Stack Frame Addressing
Addressing modes common in DSP architectures are supported and index addressing is
frequently used by compilers for subroutine data exchange. The support of short addressing
modes allows an increase in code density.
Examples
To verify the aspects introduced in section 4.3, small application code examples are
used; the results in this section were generated using the C-compiler. These
comparisons are not intended to illustrate the advantage of the proposed DSP core over
existing solutions. The figures illustrate the influence on the output of the HLL compiler
when the aspects introduced in sections 4.3.2 and 4.3.3 are considered.
The first example is control code (parts of the Dhrystone benchmark suite [18]) where a
32-bit microcontroller [40] and a 16-bit DSP core [10] have been chosen for comparison. The
results focus on code density. C-compilers based on the same compiler technology from
ATAIR are used for a comparison between the two DSP cores. The C-compiler for
xDSPcore is a prototype C-compiler in which several optimizations have not been included yet.
In Figure 59 the memory footprints normalized in bytes are illustrated. The ISA of
xDSPcore can be scaled in an application-specific way to increase code density. This has
not been done for this comparison; the standard 20-bit native instruction word has been
used.
[Bar chart "Dhrystone": program memory (byte) for Dhrystone 1 and Dhrystone 2; series: Tricore, Carmel, xDSPcore.]
Figure 59: Results for Dhrystone Benchmarks generated by C-Compiler.
xLIW, the scalable long instruction word concept [P2], the lean pipeline structure
and predicated execution [P6] allow a code density similar to that of microcontroller
architectures. The second code example (parts of the enhanced full rate (EFR) speech codec
algorithm) compares the code density achieved for the two DSP cores.
[Bar chart "EFR Encoder": program memory (byte) for the EFR Encoder; series: Carmel, xDSPcore.]
Figure 60: Results for EFR Benchmarks generated by C-Compiler.
The results illustrated in Figure 60 show a code density improvement of approximately 50%.
The improvement has been achieved by considering the aspects in section 4.3 as both
compilers are based on the same technology.
5 Summary of Publications
This chapter summarizes the publications included in Part II of the thesis. The publications
can be split into two parts, the first part consists of the publications covering the
architectural features of the xDSPcore architecture and the second contains the publications
introducing DSPxPlore which is a design exploration methodology for RISC based core
architectures.
5.1 Architectural Aspects of Scalable DSP Core
In Figure 61 the main architectural features of xDSPcore are illustrated and the publications,
numbered [P1], [P2], …, are assigned to the relevant architectural blocks.
Figure 61: xDSPcore Overview.
Publication [P1]: xDSPcore – a Configurable DSP Core. This publication provides an
overview of the xDSPcore architecture, introduces the main architectural features and briefly
illustrates the concept of DSPxPlore, the design space exploration methodology. xDSPcore
is a RISC DSP core whose architecture definition already takes the development of an
optimizing high-level language compiler into account. The core architecture introduced in this
thesis is the outcome of research done in collaboration between Infineon Technologies
Austria, Vienna University of Technology and Tampere University of Technology. To
meet the power and area consumption requirements of SoC applications the main
architectural core features can be parameterized to obtain application-specific
implementations. DSPxPlore is used to analyze the requirements of the application code in
order to adapt the core configuration to meet power dissipation and area targets.
Publication [P2]: xLIW – a Scaleable Long Instruction Word. A main architectural feature
of xDSPcore is the architecture of the program memory port. The core is based on a VLIW
programming model, making use of VLES to increase code density and reducing the size of
the program memory port by utilizing an instruction buffer architecture. This publication
illustrates the problem that VLIW exhibits poor code density. To overcome this drawback
xLIW (a scalable long instruction word) is introduced. The main architectural blocks like
the instruction buffer for implementing the features of xLIW are illustrated. The possibility
to minimize the worst case execution time strongly influences the chosen structure.
Publication [P3]: Align Unit for a Configurable DSP Core. The Align Unit is the central
part of the xDSPcore program memory port. It reassembles the execution bundles during
run-time. The Align Unit contains an instruction buffer for compensating the memory
bandwidth problem caused by the reduced program memory port during peak performance
of the core architecture. This publication introduces architectural details of the Align Unit
including the instruction buffer management. The alignment process building up the
execution bundles is illustrated in detail including an analysis concerning limitations and
possible stall cycles. A separate section covers the behavior during loop handling and
during servicing of hardware interrupts, which has significant influence on the buffer
management.
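The reassembly of execution bundles from fixed-size fetch bundles can be sketched as follows. This is a minimal illustration and not the xDSPcore implementation: the fetch width of four instruction words, the (instruction, ends_bundle) encoding, the unbounded buffer and the function name align are assumptions made only for this example.

```python
from collections import deque

def align(program, fetch_width=4):
    """Rebuild variable-length execution bundles from fixed-size fetch bundles.

    `program` is a list of (instruction, ends_bundle) pairs; `ends_bundle`
    marks the last instruction of an execution bundle (hypothetical encoding).
    Returns the issued bundles and the number of stall cycles.
    """
    buffer = deque()                 # instruction buffer (unbounded in this sketch)
    bundles, stalls, pc = [], 0, 0
    while pc < len(program) or buffer:
        # One fetch bundle per cycle while program memory still holds code.
        if pc < len(program):
            buffer.extend(program[pc:pc + fetch_width])
            pc += fetch_width
        # Issue only if a complete execution bundle is available.
        if any(end for _, end in buffer):
            bundle = []
            while True:
                instr, end = buffer.popleft()
                bundle.append(instr)
                if end:
                    break
            bundles.append(bundle)
        elif pc < len(program):
            stalls += 1              # bundle still incomplete: wait for next fetch
        else:
            break                    # trailing words without an end marker
    return bundles, stalls

# A 5-wide bundle, a single instruction and a 2-wide bundle.
demo = [("a", False), ("b", False), ("c", False), ("d", False), ("e", True),
        ("f", True), ("g", False), ("h", True)]
demo_bundles, demo_stalls = align(demo)
```

In this example the five-instruction bundle spans two fetch bundles, so one stall cycle occurs before it can be issued; afterwards the buffer already holds the following bundles and no further program memory access is needed.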
Publication [P4]: A Scaleable Instruction Buffer for a Configurable DSP Core. The main
topic of this publication is implementation details of the Instruction Buffer of the Align Unit.
The Align Unit is used to compensate the memory bandwidth mismatch between fetch
bundle and worst case execution bundle and the instruction buffer is used to reduce power
dissipation during execution of loop constructs by reducing the number of program memory
accesses. The reduced switching activity at the program memory port reduces dynamic
power dissipation as illustrated in [P5]. The loop body is fetched once and then executed
from the buffer. To obtain a balanced relation between buffer size (causing increased
silicon consumption) and available storage space to keep instructions, the number of entries
is parameterized. A regular structure such as the instruction buffer of the Align Unit is suited
to be implemented in manually optimized full-custom design. The DPG of RWTH Aachen is
used to implement regular parts of the instruction buffer. This methodology makes use of
the advantages of manual full-custom design like increased performance and decreased
power dissipation [88][167] and satisfies the requirements of scalability for xDSPcore.
Besides architectural and implementation details the publication illustrates the influence of
different buffer configurations on the core area.
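The effect of the instruction buffer on the number of program memory accesses during loops can be estimated with a simple model. The sketch below is an assumption-laden illustration: the function name, the fetch width of four words and the all-or-nothing rule (the loop body either fits completely into the buffer or is refetched on every iteration) are hypothetical simplifications and are not taken from the publication.

```python
def program_memory_fetches(loop_body_words, iterations, buffer_entries, fetch_width=4):
    """Count fetch-bundle accesses to program memory for one loop.

    If the loop body fits into the instruction buffer it is fetched once and
    then replayed from the buffer; otherwise every iteration fetches it again.
    """
    fetches_per_pass = -(-loop_body_words // fetch_width)  # ceiling division
    if loop_body_words <= buffer_entries:
        return fetches_per_pass
    return fetches_per_pass * iterations

# Hypothetical 8-word loop body executed 100 times.
with_buffer = program_memory_fetches(8, 100, buffer_entries=16)
without_buffer = program_memory_fetches(8, 100, buffer_entries=4)
```

Sweeping buffer_entries in such a model gives a first impression of the trade-off between buffer size and the number of program memory accesses that the parameterized number of entries is meant to balance.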
Publication [P5]: A Scaleable Instruction Buffer and Align Unit for xDSPcore. The
publication is an extended paper of publication [P4]. The concept and advantages of xLIW
are summarized and benchmarks highlight the relevance of using an instruction buffer
during handling of loop-centric algorithms which can be found in typical DSP algorithms.
The influence of the switching activity at the program memory port on the overall dynamic
power consumption is illustrated. The publication also contains a short overview of the
DPG of RWTH Aachen, the chosen full-custom methodology used for implementing the
instruction buffer.
Publication [P6]: FSEL – Selective Predicated Execution for a Configurable DSP Core.
Increasing application complexity leads to a shift in the system partitioning between DSP
cores and microcontroller architectures. DSP cores also have to handle control code sections
efficiently. Typically control code contains if-then-else constructs for implementing
decision paths. The drawbacks of branch instructions are branch delays (unused clock
cycles caused by the break in the program flow) which decrease the practical performance
of the core architecture. The number of branch delays can be reduced by minimizing the
number of conditional branch instructions and by introducing predicated or conditional
execution. The publication introduces FSEL, the predicated execution implementation for
xDSPcore. FSEL allows a reduction in the number of branch instructions without
decreasing code density; such a decrease in code density is the major drawback of available
implementations. Benchmark results illustrate the advantages of FSEL. This
implementation (destination register related flags) allows efficient use by a high-level
language compiler.
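The general idea of replacing a conditional branch by predicated execution (if-conversion) can be sketched in a few lines. The example is a generic illustration and does not reproduce the FSEL instruction encoding; the function names and the arithmetic select standing in for a predicated move are assumptions made for this sketch.

```python
def abs_diff_branch(a, b):
    # Control-flow version: the conditional branch may cost branch delay cycles.
    if a > b:
        return a - b
    return b - a

def abs_diff_predicated(a, b):
    # If-converted version: both paths are computed, a flag selects the result.
    flag = a > b          # status flag produced by the compare
    r_then = a - b        # executed unconditionally
    r_else = b - a        # executed unconditionally
    # Arithmetic select stands in for a predicated move (True == 1, False == 0).
    return flag * r_then + (1 - flag) * r_else
```

Both versions return the same value; the second contains no change in program flow, at the price of executing both paths.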
Publication [P7]: A Branch File for a Configurable DSP Core. For cycle-efficient
implementation of control code xDSPcore supports exhaustive predicated execution and a
rich set of delayed and non delayed conditional branch instructions. For both features
hardware flags indicating the status of the core architecture are required. The publication
introduces the concept of static and dynamic flags and illustrates the structure of the branch
file which is used for storing the status information. The separate register file is used to
relax the regular register files concerning read/write ports which are already stressed due to
orthogonality requirements.
Publication [P8]: A Scaleable Shadow Stack for a Configurable DSP Concept. The
performance of core architectures can be improved by increasing the number of pipeline
stages. xDSPcore supports a 3-phase RISC pipeline and the first implementation uses 5
clock cycles for mapping the three phases. The execution phase is split into two execution
cycles. This relaxes timing during execution of MAC instructions (including write back into
the register file). During handling of interrupt service routines a data consistency problem
can occur. A known solution for this problem is adding shadow registers for storing
intermediate results. This publication introduces the shadow stack, which takes care of this
data consistency problem without requiring any MIPS or instruction words. Benchmarks in the
publication illustrate the advantage, especially when supporting nested interrupts, where the
shadow stack reduces the required silicon area and provides additional flexibility.
Publication [P9]: xICU – a Scaleable Interrupt Unit for a Configurable DSP Core. The
changing requirements in DSP core architectures make it necessary to handle interrupts more
efficiently than was common for early DSP architectures. In this publication some of the
features commonly used are illustrated and compared with the features of interrupt control
units (ICUs) used for microcontroller architectures. Prioritizing interrupt sources is supported
by most ICUs; xDSPcore additionally supports a feature called priority morphing.
The interrupt priority of an interrupt source can be changed during run-time. This can be
done by explicitly programming the priority but also automatically over time, which allows an
increase or decrease in priority depending on the number of clock cycles. This feature
cannot be used during handling of real-time critical code segments but can be used by
operating systems (OS) and hardware schedulers to change program flow during run-time.
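Time-based priority morphing can be modeled as a priority that changes by a fixed amount after a given number of clock cycles. The following sketch is purely illustrative: the function names, the linear step rule, the priority range 0 to 15 and the tuple layout of pending requests are assumptions and do not describe the xICU register interface.

```python
def morphed_priority(base_priority, cycles_pending, step_cycles, delta=1, lo=0, hi=15):
    """Priority after `cycles_pending` clock cycles.

    Every `step_cycles` cycles the priority changes by `delta` (positive to
    raise it, negative to lower it) and is clamped to [lo, hi].
    """
    steps = cycles_pending // step_cycles
    return max(lo, min(hi, base_priority + steps * delta))

def next_interrupt(pending, now):
    """Pick the pending source with the highest morphed priority.

    `pending` maps a source name to (base_priority, request_cycle, step_cycles).
    """
    return max(
        pending,
        key=lambda s: morphed_priority(pending[s][0], now - pending[s][1], pending[s][2]),
    )

# Hypothetical sources: a slowly served UART whose priority rises every 100 cycles
# and a timer whose priority is practically constant.
requests = {"uart": (2, 0, 100), "timer": (5, 150, 1000)}
early = next_interrupt(requests, now=200)
late = next_interrupt(requests, now=450)
```

In this example the timer is served first at cycle 200, whereas at cycle 450 the waiting UART request has been raised above it.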
5.2 Design Space Exploration
This subsection introduces the publications concerning the design space exploration
methodology DSPxPlore. xDSPcore is a core architecture where main architectural features can
be modified according to the application. To understand the requirements of the application
code in an early phase of a project the analysis possibilities of DSPxPlore can be used. The
methodology is briefly introduced in Figure 62. At the end of this subsection the topic of
validation is briefly covered.
Figure 62: DSPxPlore Overview.
Publication [P10]: DSPxPlore – Design Space Exploration for a Configurable DSP Core.
This publication is used to give an overview of the concept for design space exploration, the
tools used and parameters to quantify architectural changes resulting in area consumption
and necessary cycle count. Examples for static and dynamic results are included.
Publication [P11]: Design Space Exploration for an Embedded DSP Core. The publication
illustrates further development of the design space exploration methodology. Besides a brief
core overview focusing on the parameters which have a significant influence on the core
performance the design space for RISC based core architectures is discussed. The
parameters generated during static and dynamic analysis are outlined and results on a set of
benchmark programs illustrate the potential of DSPxPlore.
Publication [P12]: An Automatic Decoder Generator for a Scaleable DSP Architecture.
The current core configuration is stored in an XML based configuration file, which is used
to keep the tools, the hardware description and the documentation consistent. As an
example for the hardware description this publication introduces a decoder generator tool
which provides the hardware decoder for the xDSPcore in VHDL-RTL [48][64]. For this
publication the configuration was stored in a file in xls format, whereas the latest version
of the decoder generator already uses the common XML based configuration file.
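How a generator tool can derive a hardware description fragment from an XML based configuration file is sketched below. The XML tags and attributes, the signal names opcode and unit_sel and the function generate_decoder are hypothetical; they are not the format of the xDSPcore configuration file nor the output of the published decoder generator.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration excerpt; tag and attribute names are assumptions.
CONFIG = """
<core name="example">
  <instruction mnemonic="ADD" opcode="0x01" unit="ALU"/>
  <instruction mnemonic="MAC" opcode="0x02" unit="MAC"/>
  <instruction mnemonic="LD"  opcode="0x10" unit="LDST"/>
</core>
"""

def generate_decoder(xml_text, opcode_bits=8):
    """Emit a VHDL case statement mapping opcodes to unit-select values."""
    root = ET.fromstring(xml_text)
    lines = ["case opcode is"]
    for instr in root.iter("instruction"):
        code = int(instr.get("opcode"), 16)
        bits = format(code, f"0{opcode_bits}b")
        lines.append(
            f'  when "{bits}" => unit_sel <= {instr.get("unit")};  -- {instr.get("mnemonic")}'
        )
    lines.append("  when others => unit_sel <= NONE;")
    lines.append("end case;")
    return "\n".join(lines)

vhdl_fragment = generate_decoder(CONFIG)
```

Because the same file can also feed the assembler, the simulator and the documentation, a change to an opcode in the configuration propagates to all of them, which is the consistency argument made in the publication.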
5.3 Author’s Contribution to Published Work
In this section the author’s contribution to the afore-mentioned publications is pointed out.
The author is the primary author in most of the publications under agreement of the
co-authors. None of the publications have previously been used as part of an academic thesis or
dissertation. The publications have been written by the author and Prof. Jari Nurmi, who
provided guidance throughout the published work and took care to polish the text
and reduce the number of errors and Germanisms to an acceptable level.
Publication [P1]: The DSP Core introduced in the publication has been developed in
collaboration between Infineon Technologies Austria, Vienna University of Technology
and Tampere University of Technology. The core architecture has been defined by the
author and topics concerning the compiler friendliness have been contributed by Prof.
Andreas Krall. Dr. Reinhard Rückriem working at Infineon Technologies contributed to the
architecture by pointing out weaknesses during the definition stage. Besides the architecture
and documentation work the author was involved in the VHDL description of the core
architecture. The name of the core architecture has been assigned by the author and the x is
used to indicate the flexibility and the possibility to configure the core architecture to
application specific requirements. The logo for the core architecture was designed by Marco
Pertl employed at Infineon Technologies Austria.
Publication [P2]: xLIW, the scalable long instruction word of xDSPcore, was defined and
named by the author.
Publication [P3]: The first implementation of the Align Unit in VHDL has been done by
Raimund Leitner, under supervision of the author, as part of his master's thesis [122]. The
first implementation used a circular buffer whereas a later version instead used an n-way cache
logic for controlling the instruction buffer content. An additional contribution concerning
the cache logic was carried out by Michael Bramberger, as part of his master's thesis under
the supervision of the author [53].
Publication [P4]: The manual full-custom implementation of the instruction buffer as
illustrated in the publication has been done by Michael Bramberger, as his master thesis
under supervision of the author and by colleagues at Infineon Technologies Austria [53].
Publication [P5]: In this publication the author was responsible for presenting the idea and
the advantages of the chosen program memory port, for illustrating architectural details and
benchmark results to outline the advantages. The raw data for the benchmarks was
generated together with Ulrich Hirnschrott, at Vienna University of Technology. Volker
Gierenz, from Catena contributed valuably to the power dissipation analysis and the
introduction of the DPG.
Publication [P6]: The predicated execution implementation has been contributed by the
author; Prof. Andreas Krall contributed the assessment of its suitability for use by a
high-level language compiler. Raimund Leitner and Gunther Laure contributed implementation
details of the FSEL instructions. The work has been patented under the German patent
number DE10101949C1, with the author as the inventor.
Publication [P7]: The initial idea of a separate branch file was introduced by Manish
Bardwaj during his stay at Infineon Technologies Austria. The branch file structure used for
xDSPcore and implementation details have been introduced and published by the author.
Volker Gierenz, employed at Catena Inc., contributed implementation relevant topics.
Publication [P8]: The shadow stack structure for compensating data consistency problems
during interrupt handling was defined by the author. Raimund Leitner contributed valuably
and partly coded a first behavioral description of the functionality in VHDL. The final
implementation for xDSPcore was carried out by the author in VHDL-RTL. The benchmark
analysis of the advantages compared with existing solutions was also carried out by the
author.
Publication [P9]: The architecture and the feature set of the xICU, the scalable interrupt
control unit for xDSPcore, were defined by the author. The first implementation in VHDL-RTL
was carried out by Johannes Hohl at Infineon Technologies Austria, the final VHDL-RTL
coding of xICU by the author.
Publication [P10]: The initial idea for DSPxPlore came from the author, as did the main
features of the development flow presented in the publication. Prof. Jari Nurmi provided the
name for the design space exploration. Details concerning static analysis have been added
by Ulrich Hirnschrott. Dynamic analysis results have been contributed by Gunther Laure
[117] and Wolfgang Lazian [118], responsible for the development of the ISS for xDSPcore,
called xSIM.
Publication [P11]: Ulrich Hirnschrott at Vienna University of Technology and the author
investigated the topic of design space exploration in detail and analyzed benchmark results
which form part of this publication. These benchmarks are used to illustrate the features and
the use of design space exploration.
Publication [P12]: The initial idea and structure of a central configuration file was
introduced by the author. The main contributors to the final structure and the tags in the
configuration file were Gunther Laure, Wolfgang Lazian, Stefan Farfeleder and Ulrich
Hirnschrott. The idea for generating parts of the VHDL coding automatically from this
description was introduced by the author in the publication. The first version of the decoder
generator tool was carried out by Armin Schilke as his master thesis at CTI under the
supervision of the author [145]. Taking into consideration the first implementation Ulrich
Hirnschrott carried out the final version of the tool.
6 Conclusion
This section summarizes the outcome of the research and development project; note that
not all of the work was carried out by the author. This part is
followed by a brief introduction into planned research work to increase the use of the
processor resources and to reduce area and power dissipation.
6.1 Main Results
This subsection provides a brief overview of the outcome of the research work; the
scientific part is covered in detail by the publications in the second part of the thesis. The
short summary below closes the circle to the goals defined in the introduction.
6.1.1 Core architecture
In the technical report [P1] the xDSPcore architecture is introduced. The following are the
key features of the core:
• RISC, load-store architecture
• Scalable long instruction word (xLIW) including instruction buffer
• lean pipeline structure (for short pipeline implementations where even no bypass circuits
are required)
• orthogonal register file
• destination register based predicated execution enabling efficient control code execution
The first implementation of the core architecture was carried out in VHDL-RTL. The first
tapeout is planned in cooperation with Catena Inc. as a prototype for a Digital Radio Mondiale
(DRM) project and will take place at the end of 2004. The estimated gate count is about 120
k gates (depending on the target frequency and on the chosen core configuration) and the
frequency required for the prototype is about 120 MHz (0.13 µm CMOS technology,
military worst case conditions). Further improvements will be made taking into
account the results from the first prototype.
To increase the potential performance and to reduce power dissipation regular parts of the
core architecture are implemented in manual full-custom design (as introduced in [P5] but
not considered for the first prototype). The influence of different buffer variants on the die
area is also illustrated in this publication. The register file and the data path are regular parts
and will also be implemented in full-custom design methodology making use of the locality
aspect during implementation (limiting the scalability). To ease the verification flow for the
manual full-custom design these parts are implemented in Verilog HDL [128].
6.1.2 Tools
The first draft tool-chain of xDSPcore consists of a prototype C-compiler, an
Assembler/Linker and a cycle-true Instruction Set Simulator (xSIM). The tool-chain was
developed in cooperation with Vienna University of Technology. The lean core architecture
considering the aspects introduced in section 4.3 enables the development of a prototype
C-compiler for a DSP core even with the limited resources of a research project. In Figure 63
a screenshot of the current simulator environment is shown.
Figure 63: Screenshot of xSIM
6.1.3 Design Space Exploration
The core architecture introduced in this thesis enables application specific scaling of the
main architectural features like:
• data paths (number, feature set)
• register file (structure, number of entries, size of entries)
• memory bandwidth
• instruction set (ISA, binary encoding)
• pipeline (number of clock cycles)
• instruction buffer (size and number of entries)
In the publication [P10] DSPxPlore, which is a design space exploration methodology, is
first introduced and the possibilities to analyze the influence of different core architectures
onto the overall system performance are briefly explained. In [P11] further improvements
are introduced, and some application code examples are used to illustrate the influence of
changing some of the parameters.
Figure 64: DSPxPlore Design Flow
The design flow is illustrated in more detail in Figure 64. However, the current status of
DSPxPlore still requires manual analysis of statistics and an understanding of the influence
of different core parameters on the overall system performance. Further research is
necessary to allow automatic suggestion of core configurations.
6.1.4 Validation
In publication [P1] the aspect of consistency is mentioned. A scalable core architecture
requires a configuration file. For the DSP core introduced in the thesis an XML based
configuration file keeps the tools of the tool-chain, the hardware description and
documentation consistent.
The decoder generator introduced in [P12] is an example for updating the hardware
description when changing the core configuration. The configuration file used for the first
implementation as published in [P12] is based on an xls-sheet whereas the updated
decoder generator already uses the XML based configuration file. Further generator tools
are planned, for example for the full-custom description, for better support of scalability.
6.2 Future Research
xDSPcore is a compiler based configurable DSP core whose main architectural features and
the C-compiler were developed together with Vienna University of Technology and the
Christian Doppler Laboratory of Prof. Dr. Andreas Krall “Compilation and De-compilation
Techniques for Embedded Processor Architectures” based on the OCE (Open Compiler
Environment) of ATAIR.
In this section further architectural details are illustrated which will be investigated and
weighted for suitability to increase performance, reduce silicon area and decrease power
consumption of the core subsystem in the future.
6.2.1 Multithreading
Multithreading opens up the possibility to increase instruction level parallelism of core
architectures, however not all multithreading variants are suitable for VLIW architectures
[148].
In research the different multithreading techniques for superscalar processor
architectures are intensively investigated. A comparison between superscalar processors
supporting single threads, fine-grain multithreading and simultaneous multithreading
can be found in [76][125][160]. The influence on the system
performance and the additional hardware effort necessary for
implementing multithreading on a superscalar architecture are investigated. Simultaneous
multithreading has an advantage in increasing the utilization of available hardware resources
on superscalar processors, due to the possibility to combine instruction-level parallelism and
thread-level parallelism.
A comparison between fine-grain multithreading and coarse-grain multithreaded
architectures can be found in [176]. The results indicate that only a few threads are needed
to use the processor efficiently. Throughput at the node of a network can become the main
bottleneck in several applications. In [67] so-called Network Processors are analyzed.
Benchmarks are used to compare performance results on an OOO (out-of-order) superscalar
processor, a fine-grained multithreaded processor, a single-chip multiprocessor and a
simultaneous multithreaded processor. The results for this kind of application indicate the
best performance results for the SMT (simultaneous multithreading) processor, due to the
capability to handle instruction-level parallelism and thread-level parallelism.
Intel presents in [164] an approach using an SMT (simultaneous multithreading) processor to
overcome the cache miss problem for the data caches with speculative pre-computation.
Their analysis shows that only a few loads are responsible for the cache misses with a long
latency; otherwise unused resources are utilized for the speculative pre-computation.
If more threads have to be handled than the available hardware of the SMT processor
supports, a priority decision has to be taken as to which threads will be executed
during the next clock cycle. In [129] several approaches are discussed and their impact on
the necessary hardware and software effort is estimated. The results illustrate that a
prioritization of the highest throughput threads can increase the CPU utilization by up to
15% (at the cost of only some internal counters).
The impact of Simultaneous Multithreading on the operating systems is analyzed in [137].
For the measurements the DEC Unix 4.0d OS has been modified. The results show that the
SMT architectures provide a higher throughput rate with only a few minor adaptations of
the operating system.
In [126] an approach is investigated to share register contents and to allow the usage of a
single global register file for several threads in an SMT processor by register renaming. The
results show that a performance boost of up to a factor of three is possible when between 2
and 4 threads are running in parallel.
Mini-threads are considered in [138] to overcome the drawback of increased register files
due to the hardware support of several threads in parallel. The same register-file will be
used by the mini-threads in parallel.
In [77] a method is analyzed to achieve simplification of issue queuing. The complexity of
issue queuing is a significant problem in architectures supporting multithreading.
Commercial implementations of multithreading technology can for example be found in the
Alpha 21464 [75], which is one of the first products implementing SMT technology. It
supports hardware for 4 threads in parallel (4-way SMT).
Intel is promoting hyper-threading technology [27]. Hyper-threading technology is based on
a single physical processor which appears as multiple logical processors. For each
thread the architecture state is available and the threads use a single set of physical
execution resources. On a program level it looks like a set of logical processors whereas on
a micro-architecture level the threads are executed in parallel on the shared execution
resources.
In May 2001 Imagination Technologies announced the Meta-1 architecture using
multithreading technology. From their announcements one can assume that the Meta-1
supports fine-grain and simultaneous multithreading approaches specifically adapted to this
core architecture.
In the area of VLIW architectures it is evident that less research is being carried out. One
reason can be that due to the static instruction scheduling simultaneous multithreading is not
as well suited for traditional VLIW architectures. The Starcore SC140 is the basis for the
investigations in [108]. Simultaneous multithreading is introduced by supporting up to five
tasks. The motivation to introduce multithreading has been to reduce system power
dissipation for wireless applications. The drawback of the described approach is the
additional pipeline stage needed to decide which functional unit will be used by which
thread during the next cycle. Until now no commercial product from Starcore supporting
simultaneous multithreading has been announced.
The most well known companies in the area of VLIW DSP architectures like Texas
Instruments, Analog Devices and Motorola have no commercial products available
supporting hardware based multithreading. They provide the handling of several threads in
software (controlled by the RTOS). The core architecture supports different user levels and
the possibility to restrict the use of memory addresses. In combination with an efficient task
switch software based multithreading is possible.
Tensilica claims the support of multithreading for FLIX VLIW [41][101]. No architectural
details are available to verify which kind of multithreading technology they support.
Sandbridge Technologies announced at the end of 2001 the support of multithreading
techniques in the Sandblaster Multithreaded DSP [89]. The available description (white
paper and product brief) provides no details about the supported technology and the
influence on the system performance. Sandbridge Technologies announced a vector
processor oriented architecture and efficient execution of JAVA code.
Summary for Multithreading
Multithreading is a known methodology for superscalar core architectures which can be
used to increase the usage of available core resources. Making use of multithreading
technology for VLIW architectures is limited by static instruction scheduling and therefore
missing support for dependency resolution during run-time. Slight modifications of the
xLIW technology introduced in [P2] allow dynamic scheduling of execution bundles. This
will be a topic of further research.
6.2.2 Code compaction
Increasing code density in core subsystems reduces the required program memory. xDSPcore
with the xLIW concept provides an efficient approach but the instructions are still plain coded.
There are several ideas for code compaction available in [82][84][120][121][150].
The major problem of code compaction is the time required for unpacking [51]. This
could be handled for linear code, but the major part of the code is control code,
traditionally with short branch distances. Further research will be carried out.
6.2.3 Design Space Exploration
DSPxPlore, the design space exploration methodology of xDSPcore, is introduced in [P10] and
[P11]. The presented concepts have to be improved, and further research will enable
automatically generated feedback. Making use of DSPxPlore still requires a deep
understanding of processor architectures and of the influence of different features on core
performance. In the near future DSPxPlore shall suggest which parameters should be changed
to increase the performance of the core architecture for the algorithms being executed on
it. For example, automatic assignment of binary coding to a chosen instruction set
architecture to reduce switching activity at the program memory port is feasible.
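A minimal sketch of such an encoding assignment, under simplifying assumptions: switching activity is approximated by the number of bit flips between consecutively fetched opcodes, and the search is exhaustive, which is only practical for a toy instruction set. The mnemonics, the trace and the function names are hypothetical and are not part of DSPxPlore.

```python
from itertools import permutations

def toggles(trace, encoding):
    """Count bit flips between the opcodes of consecutively fetched instructions."""
    return sum(bin(encoding[a] ^ encoding[b]).count("1")
               for a, b in zip(trace, trace[1:]))

def best_encoding(trace, codes):
    """Try every assignment of codes to the instructions occurring in trace."""
    instrs = sorted(set(trace))
    best = min(permutations(codes, len(instrs)),
               key=lambda perm: toggles(trace, dict(zip(instrs, perm))))
    return dict(zip(instrs, best))

# Hypothetical fetch trace of a small MAC loop followed by a store and a branch.
trace = ["LD", "MAC", "LD", "MAC", "LD", "MAC", "ST", "BR"]
naive = {"LD": 0b00, "MAC": 0b11, "ST": 0b01, "BR": 0b10}
tuned = best_encoding(trace, range(4))
```

For this trace the naive assignment causes 13 bit flips, while the best 2-bit assignment needs 7, one per instruction change. A realistic tool would need a heuristic search and a weighting by profiled execution frequency.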
In [P12] the decoder generator is introduced. Another example of making use of the
XML-based configuration file could be a tool providing automatic generation or adaptation
of the VHDL-RTL core description.
7 References
[1] “ADSP-21535, Blackfin DSP Hardware Reference”, Analog Devices Inc.,
Preliminary Edition, November 2001.
[2] “ADSP-21535, Blackfin DSP Hardware Reference”, Digital Signal Processor
Division, Analog Devices Inc., Revision 1.0, November 2002.
[3] “An Introduction to Thumb”, Advanced RISC Machines Ltd., Version 2.0, March
1995.
[4] “ARCtangent - A5 Microprocessor with DSP Extensions”, White Paper, ARC
International, USA, September 2003.
[5] “ARM7TDMI-S, Technical Reference Manual”, Advanced RISC Machines Ltd.,
Revision 3, 2000.
[6] “ARM920T, Technical Reference Manual”, Advanced RISC Machines Ltd., Revision
1, 2000.
[7] “ARM9TDMI, Technical Reference Manual”, Advanced RISC Machines Ltd.,
Revision 3, 2000.
[8] “Blackfin DSP Instruction Set Reference”, Digital Signal Processor Division, Analog
Devices Inc., First Revision, March 2002.
[9] “Buyer’s Guide to DSP Processors, 2004 Edition”, Berkeley Design Technology, Inc.
(BDTi), 2004.
[10] “Carmel DSP Core Architecture Specification”, Infineon Technologies, 2001.
[11] “Carmel Architecture Overview”, Infineon Technologies North America, Revision
1.0, January 06, 2000.
[12] “Carmel 10xx Users Manual”, Infineon Technologies, June 16, 2001.
[13] “Choosing a DSP Processor”, Technical Report, Berkeley Design Technology, Inc.,
2000.
[14] “Ceva-X Architecture, Ceva-X 1620 Datasheet”, Ceva Inc., 2003.
[15] “CPM Interrupt Controller”, Motorola Inc., 2002.
[16] "DECchip 21064-AA Microprocessor Hardware Reference Manual", Digital
Equipment Corporation, 1992.
[17] “DECchip 21064 and DECchip 21064a Alpha AXP Microprocessors Hardware
Reference Manual”, EC-Q9ZUA-TE, DEC, Maynard, Massachusetts, 1994.
[18] “Dhrystone Benchmark, History, Analysis, Scores and Recommendations”, White
Paper, October 1, 2002.
[19] “DSP 16xx, Programmers Guide”, Lucent, 1997.
[20] “DSP56000 – 24 Bit Digital Signal Processor Family Manual”, DSP56KFAMUM/AD,
Motorola Inc., Austin, Texas, USA, 1995.
[21] “DSP56002 – 24 Bit Digital Signal Processor Users Manual”, DSP56KFAMUM/AD,
Motorola Inc., Austin, Texas, USA, 1995.
[22] “DSP 56300 Family Manual”, Motorola Inc., DSP56300FM/AD, Revision 0, May
2001.
[23] “DSP Compilers: Challenges for Efficient DSP Code Generation”, DACS Software
Pvt. Ltd., Version 1.0, March 2001.
[24] “DSP-C Emulation from ACE Associated Compiler Experts Offers Design Flow
Breakthrough for New Architectures”, ACE Associated Compiler Experts, Orlando,
Florida, USA, November 1, 1999.
[25] “Evaluating DSP Processor Performance”, Technical Report, Berkeley Design
Technology, Inc., 2000.
[26] “How to Implement a Viterbi Decoder on the Starcore SC140”, Motorola Inc.
Application Note, ANSC140VIT/D, Alpha Release, July 18, 2000.
[27] “Hyper-Threading Technology Architecture and Micro-Architecture”, Intel Technical
Report, Intel Technology Journal, Volume 6, Issue 1, February 14, 2002.
[28] "Inside the StarCore SC140", Berkeley Design Technology, Inc. (BDTi), Berkeley,
California, USA, 2000.
[29] “OAKDSP Core, Programmers Reference Manual”, Siemens, January 1998.
[30] “Power PC603 RISC Microprocessor Technical Summary”, MPC603/D, Motorola
Inc., 1994.
[31] “Power PC620 RISC Microprocessor Technical Summary”, MPC620/D, Motorola
Inc., 1994.
[32] “SC140 DSP Core Reference Manual”, Motorola Inc., MNSC140CORE/D, Revision
3, November 2001.
[33] “TMS320C54x DSP Reference Set, Volume 1: CPU and Peripherals”, Texas
Instruments, SPRU131G, March 2001.
[34] “TMS320C55x DSP CPU Reference Guide, Preliminary Draft”, SPRU371D, Texas
Instruments, May 2001.
[35] “TMS320C55x Technical Overview”, SPRU393, Texas Instruments, February 2000.
[36] “TMS320C6000 CPU and Instruction Set Reference Guide”, Texas Instruments,
October 2000.
[37] “TMS320C6000 Optimizing Compiler Tutorial”, Texas Instruments, SPRH046, 2001.
[38] “TMS320C6201 Technical Overview”, Texas Instruments, SPRS051G, January 1997
(revised November 2000).
[39] “TMS320C64x Technical Overview”, Texas Instruments, SPRU395, February 2000.
[40] “Tricore 2, 32-bit Unified Processor Core, V2.0 Architecture”, Infineon Technologies,
June 2003.
[41] “Xtensa Architecture and Performance, White Paper”, Tensilica Inc., September
2002.
[42] “ZSP 400, Digital Signal Processor, Architecture”, LSI Logic Corporation,
DB14-000121-03 (Fourth Edition), December 2001.
[43] A.V. Aho, J.D. Ullman, “Principles of Compiler Design”, Addison-Wesley, Narosa,
1999.
[44] D. Albert, D. Avnon, “Architecture of the Pentium Microprocessors”, IEEE Micro, pp.
11–21, June 1993.
[45] R. Allen, K. Kennedy, "Optimizing Compilers for Modern Architectures", Morgan
Kaufmann Publishers, September 2001.
[46] A.W. Appel, “Modern Compiler Implementation in C”, Cambridge University Press,
2000.
[47] R. Arnold, F. Mueller, D. Whalley and M. Harmon, “Bounding Worst-Case
Instruction Cache Performance”, IEEE Real-Time System Symposium, pp. 172-181,
December 1994.
[48] P.J. Ashenden, “The Designers Guide to VHDL”, Morgan Kaufmann Publishers, San
Francisco, California, USA, 1995.
[49] K. Baldwin et al., “Guidelines for Efficient C Code Generation in Accumulator Based
DSPs – the C54x as an Example”, Texas Instruments Inc., ESC 1999.
[50] S. Basumallick, K. Nilsen, “Cache Issues in Real-Time Systems”, in Proceedings of
the 1st ACM SIGPLAN Workshop on Language, Compiler and Tool Support for Real-Time
Systems, June 21, 1994.
[51] M. Benes, A. Wolfe, S.M. Nowick, “A High-Speed Asynchronous Decompression
Circuit for Embedded Processors”, in Proceedings of the 17th Conference on Advanced
Research in VLSI (ARVLSI’97), pp. 219-236, September 15-16, 1997.
[52] L. Benini, M. Favalli, B. Ricco, “Analysis of Hazard Contribution to Power
Dissipation in CMOS IC's”, in IEEE International Workshop on Low Power Design, pp.
27-32, May 1994.
[53] M. Bramberger, “Design of a Cache Structure for a DSP Concept in Full-Custom
Design”, Master Thesis, University of Technology, Graz, Austria, January 2002.
[54] P. Briggs, K.D. Cooper, L. Torczon, “Coloring Register Pairs”, ACM Letters on
Programming Languages and Systems, pp. 3-13, March 1992.
[55] P. Briggs, K.D. Cooper, L. Torczon, “Improvements to Graph Coloring Register
Allocation”, ACM Transactions on Programming Language and Systems, Volume 16,
Issue 3, pp. 428-455, May 1994.
[56] T.D. Burd, R.W. Brodersen, “Design Issues for Dynamic Voltage Scaling”, in
Proceedings of the 2000 International Symposium on Low Power Electronics and Design,
Rapallo, Italy, pp. 9-14, 2000.
[57] B. Burgess, N. Ullah, P. Overen, D. Ogden, “The PowerPC 603 Microprocessor”,
Communications of the ACM, pp. 34–42, June 1994.
[58] B. Burgess, M. Alexander, H. Yingh-wai, S.P. Licht, S. Mallick, D. Ogden, P. Sung-Ho,
J. Slaton, “The PowerPC 603 Microprocessor: A High Performance, Low Power,
Superscalar RISC Microprocessor”, in Proceedings COMPCON, San Francisco,
California, USA, pp. 300-306, February 28 – March 4, 1994.
[59] J.A. Butts and G.S. Sohi, “A Static Power Model for Architects”, in Proceedings of
the 33rd Annual International Symposium on Microarchitecture, Monterey, California,
USA, pp. 191-201, December 10-13, 2000.
[60] G.J. Chaitin, “Register Allocation and Spilling via Graph Coloring”, in Proceedings
of the 1982 SIGPLAN Symposium on Compiler Construction (SIGPLAN’82), Boston,
Massachusetts, USA, pp. 98-105, 1982.
[61] A. Chandrakasan, et al., “Design Considerations and Tools for Low-Voltage Digital
System Design”, in Proceedings of 33rd Annual Conference on Design Automation
Conference, Las Vegas, Nevada, USA, pp. 113-118, 1996.
[62] A. Chandrakasan, S. Sheng, R. Brodersen, “Low-power CMOS design”, IEEE Journal
of Solid-State Circuits, pp. 472-484, April 1992.
[63] G. Chen and M.D. Smith, “Reorganizing Global Schedules for Register Allocation”,
Proceedings of the 1999 Conference on Supercomputing, ACM SIGARCH, ACM Press,
New York, pp. 408-416, June 20-25, 1999.
[64] D.R. Coelho, “The VHDL Handbook”, Kluwer Academic Publishers, Norwell, MA
02061, USA, 1990.
[65] A. Colin, I. Puaut, “Worst Case Execution Time Analysis for a Processor with Branch
Prediction”, Journal of Real-Time Systems, Special Issue on Worst-Case Execution
Time Analysis, pp. 249-274, April 2000.
[66] K.D. Cooper, P. Briggs, K. Kennedy, L. Torczon, “Coloring Heuristics for Register
Allocation”, SIGPLAN Notices, pp. 275-284, July 1989.
[67] P. Crowley, M.E. Fiuczynski, J.-L. Baer, B.N. Bershad, “Characterizing Processor
Architectures for Programmable Network Interfaces”, in Proceedings of the 2000
International Conference on Supercomputing, Santa Fe, New Mexico, USA, pp. 54-64,
May 2000.
[68] R. Cytron, J. Ferrante, B.K. Rosen, M.N. Wegman, F.K. Zadeck, “An Efficient
Method of Computing Static Single Assignment Form”, in Proceedings of the 16th ACM
SIGPLAN-SIGACT Symposium on Principles of Programming Languages, Austin,
Texas, USA, pp. 25-35, 1989.
[69] R. Cytron, J. Ferrante, B.K. Rosen, M.N. Wegman, F.K. Zadeck, “Efficiently
Computing Static Single Assignment Form and the Control Dependence Graph”, ACM
Transactions on Programming Languages and Systems (TOPLAS), pp. 451-490, October
1991.
[70] B. Davari, “CMOS Technology Scaling, 0.1 µm and Beyond”, in Proceedings of the
International Electron Devices Meeting, pp. 555-558, 1996.
[71] A. Davis, “Tips for Writing more Efficient DSP C Code”, TI's Software Development
Systems Semiconductor Group, EDN Magazine, June 1997.
[72] V. De, “Leakage-Tolerant Design Techniques for High Performance Processors
(Invited Paper)”, in Proceedings of the 2002 International Symposium on Physical
Design (ISPD’02), San Diego, California, USA, p. 28, April 7-10, 2002.
[73] V. De, S. Borkar, “Technology and Design Challenges for Low Power and High
Performance”, in Proceedings of the International Symposium on Low Power
Electronics and Design, San Diego, California, USA, pp. 163-168, 1999.
[74] S. Dropsho, “Real-Time Penalties in RISC Processing”, Technical Report at
Department of Computer Science University of Massachusetts-Amherst (TR-95-110),
December 12, 1995.
[75] J. Edmondson, P. Rubinfeld, R. Preston, V. Rajagopalan, “Superscalar Instruction
Execution in the 21164 Alpha Microprocessor”, IEEE Micro, Volume 15, Issue 2, pp.
33–43, April 1995.
[76] S. Eggers, J. Emer, H. Levy, R. Stamm and D. Tullsen, “Simultaneous
Multithreading: A Foundation for next Generation Processors”, IEEE Micro, pp. 17-22,
August 1997.
[77] A. El-Moursy, D.H. Albonesi, “Front-end Policies for Improved Issue Efficiency in
SMT Processors”, in Proceedings of the 9th International Symposium on High-Performance
Computer Architecture (HPCA'03), Anaheim, California, p. 31, February 08 - 12, 2003.
[78] P. Elbischger, “Performance and Architectural Analysis of a Digital Signal Processor”,
Master Thesis, University of Technology, Graz, Austria, June 2001.
[79] J. Engblom, “Processor Pipelines and Static Worst-Case Execution Time Analysis”,
PhD Thesis, Department of Computer Systems Uppsala University, Uppsala, 2002.
[80] J. Engblom, A. Ermedahl, M. Sjödin, “Worst-Case Execution-Time Analysis for
Embedded Real-Time Systems”, in Software Tools for Technology Transfer (STTT)
Special Issue on ASTEC, 2001.
[81] A. Ermedahl, J. Gustafsson, “Deriving Annotations for Tight Calculation of
Execution Time”, in Proceedings Euro-Par’97 Parallel Processing, Springer Verlag, pp.
1298-1307, August 1997.
[82] J. Ernst, W. Evans, C.W. Fraser, S. Lucco, T.A. Proebsting, “Code Compression”, in
Proceedings of the ACM SIGPLAN’97 Conference on Programming Language Design
and Implementation (PLDI), Las Vegas, Nevada, USA, pp. 358-365, June 15-18, 1997.
[83] C. Ferdinand, F. Martin, R. Wilhelm, “Applying Compiler Techniques to Cache
Behavior Prediction”, in Proceedings ACM SIGPLAN Workshop on Languages,
Compilers and Tools for Real-Time Systems, Las Vegas, Nevada, USA, June 15, 1997.
[84] G. Fettweis, “Embedded SIMD Vector Signal Processor Design”, SAMOS Workshop,
Samos, Greece, July 21-23, 2003.
[85] K. Flautner, N.S. Kim, S. Martin, D. Blaauw, T. Mudge, “Drowsy Caches: Simple
Techniques for Reducing Leakage Power”, International Symposium on Computer
Architecture, Anchorage, Alaska, pp. 148-157, May 25-29, 2002.
[86] C.W. Fraser, D.R. Hanson, T.A. Proebsting, “Engineering a Simple, Efficient
Code-Generator Generator”, ACM Letters on Programming Languages and Systems
(LOPLAS), Volume 1, Issue 3, pp. 213-226, September 1992.
[87] L. George, A.W. Appel, “Iterated Register Coalescing”, ACM Transactions on
Programming Languages and Systems, pp. 300-324, May 1996.
[88] V.S. Gierenz, R. Schwann, T.G. Noll, “A Low Power Digital Beamformer for
Handheld Ultrasound Systems”, in Proceedings of the ESSCIRC 2001, Villach, Austria,
pp. 276-279, September 18-20, 2001.
[89] J. Glossner, E. Hokenek, M. Moudgill, “An Overview of the Sandbridge Processor
Technology”, White Paper, Sandbridge Technology Inc., 2002.
[90] J.R. Goodman, W. Hsu, “Code Scheduling and Register Allocation in large Basic
Blocks”, International Conference on Supercomputing, St. Malo, France, pp. 442-452,
1988.
[91] J. Gustafsson, “Analyzing Execution-Time of Object-Oriented Programs Using
Abstract Interpretation”, PhD Thesis, Department of Computer Systems, Information
Technology, Uppsala University, May, 2000.
[92] J. Handy, “The Cache Memory Book”, Academic Press, 1998.
[93] C. Healy, M. Sjödin, V. Rustagi, D. Whalley, “Bounding Loop Iterations for Timing
Analysis”, in Proceedings of the 4th IEEE Real-Time Technology and Applications
Symposium (RTAS’98), pp. 12, June 3-5, 1998.
[94] C. Healy, R. Arnold, F. Müller, D. Whalley, M. Harmon, “Bounding Pipeline and
Instruction Cache Performance”, IEEE Transactions on Computers, Volume 48, Issue 1,
pp. 53-70, January 1999.
[95] F. Hedley, “ARM DSP-Enhanced Extension”, ARM White Paper, ARM Ltd., May
2001.
[96] J.L. Hennessy, D.A. Patterson, “Computer Architecture. A Quantitative Approach”,
San Mateo CA, Morgan Kaufmann Publishers, 1996.
[97] J.L. Hennessy, D.A. Patterson, “Computer Organization and Design: The
Hardware/Software Interface”, Morgan Kaufmann Publishers, 2nd edition, 1997.
[98] U. Hirnschrott, “DSP Compiler Optimization”, Master Thesis, Vienna University of
Technology, Vienna, Austria, January 2001.
[99] U. Hirnschrott, A. Krall, “VLIW Operation Refinement for Reducing Energy
Consumption”, in Proceedings 2003 International Symposium on System-on-Chip
(SOC’03), Tampere, Finland, pp. 131-134, November 19-21, 2003.
[100] U. Hirnschrott, A. Krall, B. Scholz, "Graph Coloring versus Optimal Register
Allocation for Optimizing Compilers", In Proceedings of the International Conference on
Compilers, Architectures and Synthesis of Embedded Systems (CASES'02), Grenoble,
France, pp. October 8-11, 2002.
[101] B. Huffman, M.A. Mohamed, “Flexible Length Instruction Extensions – The Next
Generation Xtensa ISA”, Microprocessor Forum 2002, October 16, 2002.
[102] W. Hwu, “Computer Microarchitecture: Hardware and Software”, Lecture material,
University of Illinois, Urbana-Champaign, 1999.
[103] E.C. Ifeachor, B.W. Jervis, “Digital Signal Processing, A Practical Approach”,
Prentice Hall, Second Edition, 2002.
[104] M. Johnson, D. Somasekhar, K. Roy, “Models and Algorithms for Bounds on
Leakage in CMOS Circuits”, IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, Volume 18, Number 6, pp. 714-725, June 1999.
[105] R. Johnson, M.S. Schlansker, “Analysis Techniques for Predicated Code”, in
Proceedings of the 29th Annual International Symposium on Microarchitecture, Paris,
France, pp. 100-113, December 2-4, 1996.
[106] N.P. Jouppi, “The nonuniform Distribution of Instruction-level and Machine
Parallelism and its Effect on Performance”, in IEEE Transactions on Computers, pp.
1645-1658, December 1989.
[107] J. Kao, A. Chandrakasan, “Dual-Threshold Voltage Techniques for Low-Power
Digital Circuits”, in IEEE Journal of Solid-State Circuits, Volume 35, Issue 7, pp.
1009-1018, July 2000.
[108] S. Kaxiras, G. Narlikar, A.D. Berenbaum, Z. Hu, “Comparing Power Consumption of
an SMT and a CMP DSP for Mobile Phone Workloads”, in CASES’01, Atlanta, Georgia,
USA, pp. 211-220, November 16-17, 2001.
[109] B. Kernighan, D. Ritchie, “The C Programming Language”, Prentice-Hall Software
Series, 2nd Edition, March 22, 1988.
[110] M. Ketkar, S.S. Sapatnekar, P. Patra, “Convexity-Based Optimization for
Power-Delay Tradeoff using Transistor Sizing”, in Proceedings of the IEEE/ACM International
Workshop on Timing Issues in the Specification and Synthesis of Digital Systems
(Tau’00), Austin, Texas, USA, pp. 52-57, 2000.
[111] S.K. Kim, S.L. Min, R. Ha, “Efficient Worst Case Timing Analysis of Data Caching”,
in Proceedings of 2nd IEEE Real-Time Technology and Applications Symposium
(RTAS’96), Boston, MA, USA, pp. 230-240, June 10-12, 1996.
[112] D.J. Kuck, Y. Muraoka, S.-C. Chen, “On the Number of Operations Simultaneously
Executable in Fortran-like Programs and their Resulting Speedup”, IEEE Transactions
on Computers, pp. 1293-1310, December 1972.
[113] M. Kuulusa, “DSP Processor Core-Based Wireless System Design”, PhD Thesis,
Digital and Computer Systems Laboratory, Tampere University of Technology, August
18, 2000.
[114] M. Lam, “Software Pipelining: an Effective Scheduling Technique for VLIW
Machines”, in Proceedings of the ACM SIGPLAN 1988 Conference on Programming
Language Design and Implementation, Atlanta, Georgia, USA, pp. 318-328, 1988.
[115] M.S. Lam, R.P. Wilson, “Limits of Control Flow on Parallelism”, in Proceedings of
the 19th Annual International Symposium on Computer Architecture (AISCA),
Queensland, Australia, pp. 46-57, 1992.
[116] P. Lapsley, J. Bier, A. Shoham and E.A. Lee, “DSP Processor Fundamentals,
Architectures and Features”, New York, IEEE Press, 1997.
[117] G. Laure, “Creation of a Configurable Component Based Framework”, Draft of
Master Thesis, University of Technology, Graz, Austria, June 2004.
[118] W. Lazian, “Simulation of a DSP through a Component Based Framework”, Draft of
Master Thesis, University of Technology, Graz, Austria, June 2004.
[119] R.A. Lebeck, D.A. Wood, “Cache Profiling and the SPEC Benchmarks: A Case
Study”, IEEE Computer, pp. 15–26, October 1994.
[120] C. Lefurgy, P. Bird, I.C. Chen, T. Mudge, “Improving Code Density using
Compression Techniques”, in Proceedings of the 30th Annual ACM/IEEE International
Symposium on Microarchitecture (Micro), Research Triangle Park, NC, USA, pp. 194-203,
December 1-3, 1997.
[121] C. Lefurgy, T. Mudge, “Code Compression for DSP”, in Proceedings of Compiler and
Architecture Support for Embedded Computing Systems (CASES 98), December 4-5,
1998.
[122] R. Leitner, “VHDL Model of a Digital Signal Processor”, Master Thesis, University
of Technology, Graz, Austria, March 2001.
[123] M. Levy, “C Compilers for DSPs flex their Muscles”, EDN Magazine, June 1997.
[124] S.S. Lim, Y.H. Bae, C.T. Jang, B.D. Rhee, S.L. Min, C.Y. Park, H. Shin, K. Park, C.S.
Kim, “An Accurate Worst-Case Timing Analysis for RISC Processors”, IEEE
Transactions on Software Engineering, Volume 21, Issue 7, pp. 593-604, July 1995.
[125] J.L. Lo, S. Eggers, J.S. Emer, H.M. Levy, R.L. Stamm, D.M. Tullsen, “Converting
Thread-Level Parallelism to Instruction-Level Parallelism via Simultaneous
Multithreading”, ACM Transactions on Computer Systems, Volume 15, Number 3, pp.
322-354, August 1997.
[126] J.L. Lo, S.S. Parekh, S.J. Eggers, H.M. Levy, D.M. Tullsen, “Software-Directed
Register Deallocation for Simultaneous Multithreaded Processors”, IEEE Transactions
on Parallel and Distributed Systems, Volume 10, Issue 9, pp. 922-933, September 1999.
[127] F. Müller, “Timing Predictions for Multi-Level Caches”, in Proceedings of ACM
SIGPLAN Workshop on Languages, Compilers and Tools for Real-Time Systems, pp.
29-36, June 1997.
[128] S. Palnitkar, “Verilog HDL – A Guide to Digital Design and Synthesis”, Sun
Microsystems, Mountain View, California, USA, 1996.
[129] S. Parekh, S. Eggers, H. Levy, J. Lo, “Thread-Sensitive Scheduling for SMT
Processors”, Technical Report, Department of Computer Science, University of
Washington, 2000.
[130] J. Park and S.M. Moon, “Optimistic Register Coalescing”, in Proceedings of the 1998
International Conference on Parallel Architectures and Compilation Techniques
(PACT’98), Paris, France, pp. 196-204, October 12-18, 1998.
[131] T. Pering, T. Burd, R. Brodersen, “The Simulation and Evaluation of Dynamic
Voltage Scaling Algorithms”, in Proceedings of the 1998 International Symposium on
Low Power Electronics and Design (ISLPED), Monterey, California, USA, pp. 76-81,
1998.
[132] D.N. Pnevmatikatos, G.S. Sohi, “Guarded Execution and Branch Prediction in
dynamic ILP Processors”, Proceedings of the 21st Annual International Symposium on
Computer Architecture (ISCA), pp. 120-129, 1994.
[133] U. Ramacher, W. Raab, N. Brüls, U. Hachmann, C. Sauer, A. Schackow, J. Gliese, J.
Harnisch, M. Richter, E. Sicheneder, R. Schüffny, U. Schulze, H. Feldkämper, C.
Lütkemeyer, H. Süsse, S. Altmann, “A 53-GOPS Programmable Vision Processor for
Processing, Coding-Decoding and Synthesizing of Images”, in Proceedings of European
Solid State Conference (ESSCIRC), Villach, Austria, September 18-20, 2001.
[134] B.R. Rau, “Iterative Modulo Scheduling: an Algorithm for Software Pipelining
Loops”, in Proceedings of the 27th Annual International Symposium on
Microarchitecture, ACM Press, pp. 63-74, 1994.
[135] J. Rawat, “Static Analysis of Cache Performance for Real-Time Programming”,
Technical Report TR93-19, Iowa State University of Science and Technology,
November 1993.
[136] J. Rayfield, “Using HLLs to Develop DSP Applications”, ARM Inc., DSP World Fall
1999.
[137] J.A. Redstone, S.J. Eggers, H.M. Levy, “An Analysis of Operating System Behavior
on a Simultaneous Multithreaded Architecture”, in Proceedings of the 9th International
Conference on Architectural Support for Programming Languages and Operating
Systems, Volume 35, Issue 11, pp. 245-256, November 2000.
[138] J. Redstone, S. Eggers, H. Levy, “Mini-threads: Increasing TLP on Small-Scale SMT
Processors”, in Proceedings of the 9th International Symposium on High Performance
Computer Architecture (HPCA-9), pp. 19-30, February 8-12, 2003.
[139] S. Rele, S. Pande, S. Onder, R. Gupta, “Optimizing Static Power Dissipation by
Functional Units in Superscalar Processors”, in Proceedings of the 11th International
Conference on Compiler Construction, pp. 261-275, April 8-12, 2002.
[140] R.J. Ridder, “Trends in Application Programming for Digital Signal Processing”,
Tasking Inc., ESC 1999.
[141] K. Roy, “Leakage Power Reduction in Low-Voltage CMOS Design”, in Proceedings
of the IEEE International Conference on Circuits and Systems, pp. 167-173, 1998.
[142] J. Runeson, S.-O. Nyström, “Retargetable Graph-Coloring Register Allocation for
Irregular Architectures”, in Proceedings of 7th International Workshop, SCOPES 2003,
Vienna, Austria, pp. 240-254, September 18-20, 2003.
[143] S. Rusu, “Trends and Challenges in VLSI Technology Scaling Towards 100 nm”,
Invited Talk at European Solid State Conference (ESSCIRC 2001), Villach, Austria,
September 18-20, 2001.
[144] R.H. Saavedra, A.J. Smith, “Measuring Cache and TLB Performance and Their Effect
on Benchmark Runtimes”, IEEE Transactions on Computers, Volume 44, Issue 10, pp.
1223–1235, October 1995.
[145] A. Schilke, “An Automatic Decoder Generator for a Scalable DSP Architecture”,
Master Thesis, Carinthian Tech Institute, Villach, Austria, September 2002.
[146] J. Schneider, C. Ferdinand, “Pipeline Behavior Prediction for Superscalar Processors
by Abstract Interpretation”, in Proceedings of the ACM SIGPLAN 1999 Workshop on
Languages, Compilers and Tools for Embedded Systems, ACM, Atlanta, Georgia, USA,
pp. 35-44, May 1999.
[147] B. Scholz, E. Eckstein, “Register Allocation for Irregular Architectures”, in
Proceedings of the Joint Conference on Languages, Compilers and Tools for Embedded
Systems, ACM Press, pp. 139-148, 2002.
[148] J. Silc, B. Robic, T. Unger, “Processor Architecture, From Dataflow to Superscalar
and Beyond”, Springer-Verlag, Berlin-Heidelberg, 1999.
[149] D. Sima, T. Fountain, P. Kacsuk, “Advanced Computer Architectures: A Design
Space Approach”, Addison Wesley Publishing Company, Harlow, 1997.
[150] P. Simonen, I. Saastamoinen, J. Nurmi, “Variable-Length Instruction Compression for
Area Minimization”, in the 14th International Conference on Application-specific
Systems, Architectures and Processors, The Hague, Netherlands, pp. 155-160,
June 24-26, 2003.
[151] V. Sipkova, “Efficient Variable Allocation to Dual Memory Banks of DSPs”, In
Proceedings of the 7th International Workshop on Software and Compilers for Embedded
Systems (SCOPES'03), Vienna, Austria, pp. 359-372, September 2003.
[152] M.D. Smith and G. Holloway, “Graph-Coloring Register Allocation for Irregular
Architectures”, PLDI’01, 2001.
[153] J.E. Smith, “A Study of Branch Prediction Strategies”, in Proceedings of the 8th
Annual Symposium on Computer Architecture (ASCA), Minneapolis, Minnesota, USA,
pp. 135-148, 1981.
[154] F. Stappert, P. Altenbernd, “Complete Worst-Case Execution Time Analysis of
Straight-line Hard Real-Time Programs”, Journal of Systems Architecture, Volume 46,
Number 4, pp. 339-355, February 2000.
[155] F. Stappert, J. Engblom, A. Ermedahl, “Efficient Longest Executable Path Search for
Programs with Complex Flows and Pipeline Effects”, Proceedings of the International
Conference on Compilers, Architecture, and Synthesis for Embedded Systems, Atlanta,
Georgia, USA, pp. 132-140, 2001.
[156] A. Stratakos, R.W. Brodersen, S.R. Sanders, “High-Efficiency Low-Voltage DC-DC
Conversion for Portable Applications”, in Proceedings of 1994 International Workshop
on Low-Power Design, Napa Valley, CA, April 1994.
[157] R. Sucher, R. Niggebaum, G. Fettweis, A. Rom, “Carmel – A New High Performance
DSP Core Using CLIW”, 9th Annual International Conference on Signal Processing
Applications and Technology, pp. 499-503, September 1998.
[158] S. Thompson, P. Packan, M. Bohr, “CMOS Scaling: Transistor Challenges for the 21st
Century”, Intel Technology Journal, Q3. 1998.
[159] G.S. Tjaden, M.J. Flynn, “Detection and Parallel Execution of Independent
Instructions”, IEEE Transactions on Computers, Vol. C-19, No. 10, pp. 889-895,
October 1970.
[160] D.M. Tullsen, S.J. Eggers, and H.M. Levy, “Simultaneous Multithreading:
Maximizing On-Chip Parallelism”, in Proceedings of the 22nd Annual International
Symposium on Computer Architecture (ISCA’95), Santa Margherita Ligure, Italy, pp.
392, June 22-24, 1995.
[161] H.J.M. Veendrick, “Short-Circuit Dissipation of Static CMOS Circuitry and its
Impact on the Design of Buffer Circuits”, IEEE Journal of Solid State Circuits, Volume
SC-19, pp. 468–473, August 1984.
[162] G. Vink, “Programming DSPs using C: Efficiency and Portability Trade-offs”,
Journal of Embedded Systems, May 2000.
[163] K. Vögler, “A DSP C-Compiler”, Master Thesis, Vienna University of Technology,
Vienna, Austria, April 2002.
[164] H. Wang, P.H. Wang, R.D. Weldon, S.M. Ettinger, H. Saito, M. Girkar, S. Shihwei,
J.P. Shen, “Speculative Precomputation: Exploring the Use of Multithreading for
Latency”, Intel Technology Journal, Volume 6, Issue 1, Q1. 2002.
[165] L. Wanhammar, “DSP Integrated Circuits”, Academic Press, February 1999.
[166] H.S. Warren, “Instruction Scheduling for the IBM RISC System/6000 Processor”,
IBM J. Res. Dev., pp. 85-92, 1990.
[167] O. Weiss, M. Gansen, T.G. Noll, “A Flexible Datapath Generator for Physical
Oriented Design”, Proceedings of the ESSCIRC 2001, Villach, Austria, pp. 408-411,
September 18-20, 2001.
[168] D.B. Whalley, R. Arnold, F. Mueller, M. Harmon, “Bounding Worst-Case Instruction
Cache Performance”, IEEE Symposium on Real-Time Systems, pp. 172–181, December
1994.
[169] WG14 N1005, “Programming Languages, their Environments and System Software
Interfaces - Extensions for the Programming Language C to Support Embedded
Processors”, ISO/IEC DTR 18037.2, April 25, 2003.
[170] R. White, F. Müller, C. Healy, D. Whalley, M. Harmon, “Timing Analysis for Data
Caches and Set-Associative Caches”, in Proceedings of 3rd IEEE Real-Time Technology
and Applications Symposium (RTAS’97), Montreal, Canada, pp. 192-202, June 9-11,
1997.
[171] M. Willems, H. Keding, V. Zivojnovic, “Modulo-Addressing Utilization in Automatic
Software Synthesis for Digital Signal Processors”, in Proceedings of International
Conference on Acoustics, Speech and Signal Processing (ICASSP-97), Munich,
Germany, Volume 1, pp. 687-690, April 21-24, 1997.
[172] R. Yates, “Fixed-Point Arithmetic: An Introduction”, Digital Sound Labs, March 3,
2001.
[173] T.Y. Yeh, Y.N. Patt, “Alternative Implementations of Two-Level Adaptive Branch
Predictions”, in Proceedings of the 19th Annual International Symposium on Computer
Architecture (ISCA), Queensland, Australia, pp.124-134, 1992.
[174] S. Zammattio, “How to Reduce Time-to-Market for System-on-Chip Design”, White
paper ARC International, 2002.
[175] W. Zhang, N. Vijaykrishnan, M. Kandemir, M. J. Irwin, D. Duarte, Y-F. Tsai,
“Exploiting VLIW Schedule Slacks for Dynamic and Leakage Energy Reduction”, in
Proceedings of the 34th Annual ACM/IEEE International Symposium on
Microarchitecture, IEEE Computer Society, Austin, Texas, USA, pp. 102-113, 2001.
[176] W.M. Zuberek, “Performance Comparison of Fine-Grain and Block Multithreaded
Architectures”, in Proceedings High Performance Computing (HPC’2000), 2000
Advanced Simulation Technologies Conference, Washington DC, USA, pp. 383-388,
April 16-20, 2000.
PART II
PUBLICATIONS
PUBLICATION 1
C. Panis, J. Nurmi, “xDSPcore - a Configurable DSP Core”, Technical Report 1-2004,
Tampere University of Technology, Institute of Digital and Computer Systems, Tampere,
Finland, May 2004.
©2004 Tampere University of Technology. Reprinted, with permission, from Technical
Report 1-2004.
xDSPcore - a Configurable DSP Core
Christian Panis (Carinthian Tech Institute)
Jari Nurmi
Institute of Digital and Computer Systems
Tampere University of Technology
Tampere, Finland
Abstract
Exponentially increasing mask costs for submicron technologies have led to a strong demand for
reconfigurable hardware and software-programmable embedded cores for SoC (System on Chip)
applications. General purpose DSP cores suffer from inadequate usage of the core resources and
therefore from an overhead in silicon area and power dissipation. The programs executed on these
cores are written in assembly language to prevent even more overhead caused by poorly performing
C-compilers.
This paper introduces xDSPcore, a configurable DSP core architecture efficiently
programmable in C. The main architectural features with influence on the overall system
performance can be scaled or configured to meet application-specific requirements and thereby
reduce the waste of silicon area and power dissipation. Programming in C enables an
architecture-independent description of the algorithms and overcomes software compatibility
issues. A methodology called DSPxPlore is used for design space exploration to understand the
requirements of the application in an early project phase and to refine the core architecture
configuration.
Introduction
Increasing complexity of SoC applications increases the demand for powerful embedded cores. The
flexibility provided by software-programmable cores quite often leads to an increase in
consumed silicon area and in power dissipation. Therefore dedicated hardware has been
favored over software-based platform solutions. The picture is changing: significantly increasing
mask costs due to the use of advanced process technologies, together with the difficulty of
placing in a heterogeneous market the high-volume products that would justify the high
non-recurring cost, increase the pressure for developing product platforms. These platforms are
used for a group of applications such that software executed on programmable core architectures
is used for differentiating the products.
Embedded general purpose core architectures use the core resources inefficiently. Each
class of applications has requirements that cannot be efficiently covered by a general-purpose
architecture. To close the gap between dedicated hardware and software-based solutions, a
platform-specific adaptation of the core architecture is required.
For embedded DSP cores (Digital Signal Processors) an additional problem is acute. Non-orthogonal
core architectures have been preferred because they give better performance
and less area consumption when mapping DSP algorithms onto a processor. As a result, DSPs are still
programmed manually in assembly language [1]. The price for the better usage of the available
processor resources is an architecture-dependent description of the algorithms, which makes changes
in the core architecture difficult and costly (due to compatibility issues) and prohibits
application-specific adaptations. Therefore products based on a programmable core architecture stick to
the same architecture for a long time, even if it is no longer state-of-the-art. The additional risks
and costs of changing the core architecture thus lead to solutions that are not competitive.
A further consequence of using assembly language is long development cycles. Ten years ago
algorithms executed on DSP cores consisted of several hundred lines of code. Manual coding was
reasonable even if minor changes in the application code required several weeks of coding and
verification. Today’s DSP cores are more powerful and can execute large programs consisting
of several hundred thousand lines of code. DSP cores are no longer used only for filtering
operations. Especially in low-cost products, where more than one core is not economical,
the control code is also executed on the DSP core.
To increase the performance of the DSP subsystems a high degree of parallelism and deep pipeline
structures are provided. Unfortunately, manually programming highly parallelized DSP core
architectures with deep pipelines, and manually resolving the data and control dependencies, is
feasible only to a limited extent, if at all. Therefore the motivation for using assembly code to
increase the use of the available resources is no longer valid.
In late 1999, the development of xDSPcore was started based on the aspects described above.
To overcome the programming issues and the increased time-to-market
pressure, one aim of this project is to provide a DSP core architecture which can be efficiently
programmed in C. Efficiently means in this case less than 10% overhead on application level
compared with manual assembly programming. To keep area consumption
and power dissipation low, the main architectural features of the DSP core can be configured
to application-specific requirements. Compatibility issues between different core architectures are
covered by the architecture-independent algorithm description (e.g. Embedded-C [2]) and the
automatic assembly generation.
In contrast to available DSP core architectures, whose architecture is mainly influenced by
algorithmic aspects and limited by hardware restrictions, aspects of software development have
been considered already during the definition of the instruction set and the core architecture, to
enable the development of an efficient C-compiler.
This paper introduces xDSPcore. The first part gives an overview of the xDSPcore
architecture. The requirements of the C-compiler (like orthogonality and not using configuration
and mode registers) and the aspect of scalability have influenced the main architectural
features. The motto “keep it simple” helps to develop efficient tools and to allow changes in the
implementation.
The second part introduces DSPxPlore, the design space exploration methodology for
xDSPcore. A configurable and scalable DSP core leads to the question: “which kind of core
architecture is well suited to meet my application requirements?”. DSPxPlore can be used to evaluate
how efficiently an application code can make use of the provided resources of the DSP core. These
results can be used to refine the HW/SW partitioning and to refine the core architecture to optimize
the core subsystem to the application requirements. DSPxPlore helps to close the gap between
dedicated hardware and software-based solutions.
The third part discusses aspects of configurability. To keep the configuration parameters
consistent, a single configuration file based on XML is used. In the beginning of part three the
structure of the configuration file is introduced. The configuration file is used for generating parts of
the VHDL-RTL code, as the basis for the tool chain and for automatically updating documentation. For
illustration the VHDL code generator, xSIM (a cycle true ISS, Instruction Set Simulator) and the
documentation generation flow are briefly introduced.
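As a rough illustration of how one XML configuration file can feed several consumers, the following sketch parses a small description and derives values that a hardware generator, a simulator and a documentation flow could share. The element and attribute names (`core`, `registers`, `datapaths`, `pipeline`) are hypothetical and do not reproduce the actual xDSPcore schema; the numbers follow the example architecture described in this paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration; tag and attribute names are assumptions,
# not the real xDSPcore schema.
CONFIG_XML = """
<core name="example">
  <registers data="16" address="8"/>
  <datapaths loadstore="2" arithmetic="2" branch="1"/>
  <pipeline fetch="2" decode="1" execute="2"/>
</core>
"""

def load_config(xml_text):
    """Parse the XML text into a nested dict of integer parameters."""
    root = ET.fromstring(xml_text)
    return {child.tag: {k: int(v) for k, v in child.attrib.items()}
            for child in root}

def issue_width(cfg):
    """Maximum number of instructions per execution bundle."""
    return sum(cfg["datapaths"].values())

def pipeline_cycles(cfg):
    """Clock cycles used to map the three pipeline phases."""
    return sum(cfg["pipeline"].values())

cfg = load_config(CONFIG_XML)
print(issue_width(cfg), pipeline_cycles(cfg))
```

Because every tool reads the same parsed dictionary, a change such as a different number of data paths only has to be made in one place.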
State-of-the-art
This section is used to briefly introduce available concepts and to position the DSP core project
presented in this paper.
Traditional DSP core architectures
The latest announcements and products based on traditional DSP core architectures are Starcore
SC1200/SC1400 [3], announced by Starcore LLC, which is a cooperation between Motorola,
Agere and Infineon Technologies, and Blackfin [4], the outcome of a cooperation between Analog
Devices and Intel Inc. Both core concepts are RISC (Reduced Instruction Set Computer) based
load-store architectures, claiming to be efficiently programmable in high-level languages like
C/C++.
The ISA (Instruction Set Architecture) and the core architecture are fixed. This prevents
application-specific modifications, which are required to close the gap between hardwired
implementations and software-based solutions.
Scaleable core architectures
The best-known representatives of scaleable cores are Tensilica and ARC [5]. Both concepts are
based on traditional microcontroller architectures, e.g. Tensilica’s Xtensa [6] is based on the MIPS
architecture. Therefore efficient implementation of traditional DSP algorithms is not possible, and
aspects like minimizing the worst-case execution time are not sufficiently covered. The software
support for using the DSP-specific features is insufficient. By adding “just an additional MAC
unit” the main focus is on increasing theoretical performance instead of analyzing the overall system
performance.
Architecture Description Languages
LISA (Language for Instruction Set Architecture) from Coware [7], mainly developed at RWTH
Aachen [8], is the best-known architecture description language. Later projects include, e.g., the ArchC
project running in Brazil [9]. The concept of defining a specific core architecture that fulfills
the requirements of the application code sounds brilliant. However, automatically generating the core
microarchitecture from a behavior-level description results in poorly used silicon.
The large solution room provided by these concepts allows describing any kind of core architecture,
but only a few of them allow the development of an optimizing high-level language compiler. Most
of the design teams are interested in integrating a core into their System-on-Chip (SoC) or
System-in-Package (SiP) solution. Using an architecture description language like LISA to generate
efficient solutions requires deep processor architecture knowledge.
Design space exploration, which is strongly required when a core architecture can be scaled or
configured, has to be based on a high-level language compiler. The automatic generation of high-level
language compilers is still not feasible; even with approaches like COSY from ACE [10],
the quality of the code produced by an automatically generated compiler is poor. The quality of the
generated code is important because it is the basis for architectural modifications, and poor results
can even be misleading. The problem can be summarized as a chicken-and-egg problem:
understanding the requirements an application places on a core architecture requires an efficient
high-level language compiler, which cannot feasibly be generated for each core architecture.
Summary of state-of-the-art
The concepts briefly introduced in this section can be split into three groups: traditional DSP core
architectures, which lack the possibility of application-specific adaptations; scaleable core
architectures, which lack support for efficiently implementing DSP algorithms; and, last but not least,
architecture description languages, which open a large solution room but lack efficient software
support and mapping to HW implementations, which is a strong requirement for success.
The concept of xDSPcore introduced in this paper tries to solve the problems described above.
xDSPcore is a general purpose DSP core architecture based on RISC load-store and enables
efficient execution of traditional DSP algorithms. This also includes system aspects like the
possibility of minimizing the worst-case execution time. To close the gap between hardwired ASIC
implementations and software-based solutions, the core concept enables scaling of the main
architectural features while the micro-architectural concept is not changed. The micro-architecture
has been defined under consideration of the requirements for developing an optimizing
C-compiler, which makes it possible to analyze the design space of a certain application code. To keep
validation and verification effort low, a single configuration file based on XML is introduced,
which allows scaling core features without having to ensure manually that all the changes are
considered in hardware, tool-chain and documentation.
DSP Core Architecture
This section briefly introduces the xDSPcore architecture. The development of an efficient
C-compiler has been considered already during the definition of the instruction set architecture and
micro-architecture [11]. To enable an area and power efficient DSP subsystem, the
architectural features with significant influence on area and power consumption can be configured.
Adapting the core subsystem to application-specific requirements allows reducing the gap between
systems based on dedicated hardware and software-based solutions. The configuration aspects are
considered in a later section.
Overview
xDSPcore is based on a modified Dual-Harvard load-store architecture [11]. VLIW (Very Long
Instruction Word) is used as the programming model; static scheduling reduces the core
complexity since data and control dependencies are resolved at compile time [12].
Figure 1 gives a generic overview of the core architecture. Two data busses connect the
data memory with the DSP core; to prevent address conflicts the addresses are interleaved. An
independent data bus connects the program memory and is used for fetching instructions. Data and
program memory have different address spaces. The register file plays a central role in the core
architecture. Load/store architectures feature instructions for moving entries from data memory to
the register file and from register file back to memory. Separate instructions are used for coding
arithmetic functions. The size of the memory ports is scaleable (therefore no values are assigned in
Figure 1). An instruction buffer between program memory and instruction decoder is used to
increase code density and for reducing power dissipation during the execution of loop constructs
which are quite often a central part of traditional DSP algorithms.
Figure 1: Core Overview.
xDSPcore features a 3-phase RISC pipeline with instruction fetch, decode and execute. To
increase the reachable clock frequency (especially to relax the timing at the memory ports)
the three pipeline phases can be split over several clock cycles. The example architecture illustrated
in Figure 2 uses five clock cycles for mapping the three basic pipeline phases.
Figure 2: Pipeline.
The instruction fetch phase is split over two clock cycles, fetch and align, the decode stage takes
only one clock cycle and the execution phase is again split over two clock cycles (EX1, EX2).
Spreading one pipeline phase over several clock cycles allows higher core clock
frequencies, but at system level it can even reduce performance: additional clock cycles for the
instruction fetch phase increase the number of branch delays, and
additional clock cycles for the execution phase increase load-in-use and define-in-use
dependencies [12]. DSPxPlore, introduced in a later section, is used to evaluate the trade-off
between reachable clock frequency and system performance.
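The trade-off can be made concrete with a toy timing model in which execution time is the total cycle count divided by the clock frequency. The workload numbers, clock frequencies and the assumption that no branch delay slot can be filled are purely illustrative and are not measurements of xDSPcore.

```python
def execution_time_us(useful_cycles, taken_branches, branch_delay_cycles,
                      clock_mhz):
    """Execution time in microseconds for a simple pipeline model.

    Assumes (for illustration only) that every taken branch wastes
    `branch_delay_cycles` cycles and that no delay slot is filled.
    """
    total_cycles = useful_cycles + taken_branches * branch_delay_cycles
    return total_cycles / clock_mhz

# Hypothetical workload: 10000 useful cycles, 2000 taken branches.
shallow = execution_time_us(10_000, 2_000, branch_delay_cycles=1, clock_mhz=200)
deep = execution_time_us(10_000, 2_000, branch_delay_cycles=3, clock_mhz=250)
print(shallow, deep)
```

With these made-up numbers the deeper pipeline runs at a 25% higher clock frequency but still needs more time for branch-heavy code, which is the effect described above.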
All instructions of xDSPcore can be assigned to three different operation classes: load/store
instructions used to transfer data between data memory and register file, arithmetic instructions
(including miscellaneous instructions like interrupt disable) and branch instructions influencing the
program flow. Each instruction consists of one or two instruction words. The first three bits of the
instruction word are used for operation class and alignment information. The parallel word (as
illustrated in Figure 3) is used for long offset and long immediate values.
Figure 3: Instruction Coding.
Data copy functions between registers of the register file can be omitted by featuring 3-operand
instructions and an orthogonal register file. To prevent limitations during instruction
scheduling, no mode bits are used. In contrast to available core architectures where mode bits are
used for increasing code density, all functions are coded in instruction words.
Figure 4: Parallelism.
To allow efficient use of the core resources the number of instructions executed in parallel can be
adapted according to the application. The example architecture in Figure 4 allows the execution of
two load/store, two arithmetic and one branch instruction in parallel. As already mentioned, VLIW
is used as the programming model. To overcome the low code density of traditional VLIW
architectures, xLIW (a scalable long instruction word) is introduced. xLIW is
based on VLES (Variable Length Execution Set) and additionally supports a reduced program
memory port. xLIW is introduced in detail in [13].
The example architecture illustrated in Figure 1 features two busses to data memory. Two
independent AGUs (Address Generation Units) generate two independent addresses. Each
of the AGUs can make use of all address registers; they are not banked. If the addresses generated
in parallel access the same physical memory block, the hazard is detected during run-time and the
memory operations are serialized.
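The run-time hazard check can be sketched as follows. The sketch assumes two physical memory blocks that are interleaved on the least significant address bit; the actual number of blocks and the interleaving scheme of xDSPcore are not specified here.

```python
NUM_BANKS = 2  # assumption: two physical memory blocks, interleaved by LSB

def bank_of(address):
    """Physical memory block selected by the low-order address bits."""
    return address % NUM_BANKS

def access_cycles(addr_x, addr_y):
    """Cycles needed for two parallel data accesses.

    One cycle if the accesses hit different blocks; two cycles if they
    collide and have to be serialized.
    """
    return 2 if bank_of(addr_x) == bank_of(addr_y) else 1
```

Under this assumption two consecutive addresses can always be accessed in the same cycle, while two even (or two odd) addresses cost an extra cycle.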
The AGU of xDSPcore supports all common DSP address modes, like memory direct, register
direct and register indirect addressing. The auto inc/decrement address operation supports pre- and
post address calculation and an efficient stack frame addressing. The size of modulo buffers for
modulo addressing schemes is programmable; the start address has to be aligned.
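A minimal sketch of modulo (circular buffer) addressing with an aligned start address is given below. It assumes that the buffer start is aligned to the next power of two of the buffer size, which is a common convention but is not confirmed for xDSPcore; the function names are hypothetical.

```python
def next_pow2(n):
    """Smallest power of two that is >= n (n >= 1)."""
    return 1 << (n - 1).bit_length()

def modulo_post_increment(addr, step, size):
    """Return the address that follows `addr` in a circular buffer.

    Assumption: the buffer start is aligned to the next power of two of
    `size`, so the base can be recovered by masking low-order bits.
    """
    mask = next_pow2(size) - 1
    base = addr & ~mask            # aligned start address of the buffer
    offset = addr & mask           # position inside the buffer
    return base + (offset + step) % size
```

The alignment requirement is what allows the base address to be derived from the current address alone, without a separate base register.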
Figure 5: Register Files.
Load-store architecture implies that all operands for the arithmetic instructions are fetched from the
register file. The structure of the register file and number and size of registers is configurable;
Figure 5 is used for illustrating the register file of the example architecture. The register file is split
into three parts, a data register file, an address register file and a branch file.
Figure 6: Data Register File.
The data register file as in Figure 6 consists of 8 accumulators, 8 long registers or 16 data registers.
Two consecutive data registers can be addressed as a long register. A long register including
additional guard bits (for higher precision calculation, e.g. 8 additional bits) can be addressed as an
accumulator. The register file is orthogonal, which means that each register can be used for each
operation and none of them is assigned to a certain instruction or has a predefined functionality. The
drawback of an orthogonal register file is the crossbar to enable mapping of the read and write ports
to the registers.
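The aliasing of data, long and accumulator registers can be modelled as three views on the same storage. The sketch assumes 16-bit data registers and that the even-numbered data register is the low half of a long register; the 8 guard bits are the example value from the text. None of these widths or the class name are taken from the xDSPcore specification.

```python
DATA_BITS = 16   # assumption: width of one data register
GUARD_BITS = 8   # example value from the text

class DataRegisterFile:
    """Toy model: 8 accumulators aliased onto 8 long / 16 data registers."""

    def __init__(self):
        self.acc = [0] * 8  # each entry holds guard bits plus long bits

    def write_long(self, n, value):
        """Write long register n, leaving the guard bits untouched."""
        long_mask = (1 << (2 * DATA_BITS)) - 1
        self.acc[n] = (self.acc[n] & ~long_mask) | (value & long_mask)

    def read_data(self, d):
        """Read data register d; registers 2n and 2n+1 form long n.

        Assumption: the even register is the low half.
        """
        shift = DATA_BITS * (d % 2)
        return (self.acc[d // 2] >> shift) & ((1 << DATA_BITS) - 1)

    def read_acc(self, n):
        """Read accumulator n including its guard bits."""
        return self.acc[n] & ((1 << (2 * DATA_BITS + GUARD_BITS)) - 1)
```

For example, after `write_long(1, 0x12345678)` the data registers 2 and 3 read back the low and high halves of that value.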
The address register file contains 8 address registers and 8 modifier registers. The modifier registers
are directly connected to the corresponding address registers, therefore it is not possible to use, e.g.,
address register r0 with modifier register m7. The modifier registers are used for modulo addressing
and for bit reversal addressing which can be used for optimizing FFT implementations.
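Bit-reversed addressing produces the access order needed by a radix-2 FFT. The following sketch computes that order in software; in hardware the AGU would generate the same sequence as part of the address update. The function names are chosen for illustration.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result

def bit_reversed_order(n):
    """Access order for an n-point radix-2 FFT (n a power of two)."""
    bits = n.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(n)]

print(bit_reversed_order(8))  # [0, 4, 2, 6, 1, 5, 3, 7]
```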
Figure 7: Branch File.
The third part of the register file is the branch file. The branch file contains flags reflecting the
status of the core and the related register files. A separate branch file is used to avoid adding
more read and write ports to the data and address register files, which are already heavily loaded
due to orthogonality requirements. xDSPcore supports a rich set of conditions for the conditional
branch instructions and predicated execution [14]. Predicated execution is used to reduce the
number of branch instructions and therefore the number of unused branch delays [15], [16], [17],
[18].
Figure 8: SIMD Cross Operation.
The instruction set of xDSPcore features 3-operand arithmetic instructions. Each of the arithmetic
instructions is available for all supported data types (data, long, accu). SIMD instructions are
also used to increase code density and system performance [11]. xDSPcore supports integer and
fractional data types. The SIMD-cross instructions illustrated in Figure 8 reduce the number of
data move instructions between registers. For efficient implementation of control code (e.g.
framing), bit field instructions are included (insertion, extraction).
xDSPcore features a combination of hardware and software stack. The hardware stack is used for
automatically storing the program counter. Handling the hardware stack requires no clock cycles or
instructions; the number of hardware stack entries can be configured. For recursive
function calls a software stack is supported. A separate instruction is available to move the last
hardware stack entry into the register file, as illustrated in Figure 9, and a software stack can be
built up in data memory with regular load/store instructions. When returning from a function call, the
jump-register-indirect instruction is used to preload the program counter with a register value.
Figure 9: Stack Concept (diagram labels: jump register indirect, pop, PC, register file, stack pointer).
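The interplay of hardware and software stack can be sketched with a small behavioral model. The class and method names are hypothetical, the software stack is assumed to grow towards lower addresses, and the return address is assumed to be the program counter plus one instruction word.

```python
class StackModel:
    """Toy model of the combined hardware/software stack."""

    def __init__(self, hw_entries, stack_pointer):
        self.hw_entries = hw_entries   # configurable hardware stack depth
        self.hw_stack = []             # holds return addresses only
        self.memory = {}               # data memory (address -> value)
        self.sp = stack_pointer        # software stack pointer register
        self.pc = 0

    def call(self, target):
        """Call: the return address goes to the hardware stack."""
        if len(self.hw_stack) == self.hw_entries:
            raise OverflowError("hardware stack full; spill first")
        self.hw_stack.append(self.pc + 1)  # assumption: one-word call
        self.pc = target

    def spill(self):
        """Move the last hardware stack entry to the software stack."""
        reg = self.hw_stack.pop()      # 'pop' into a register
        self.sp -= 1                   # assumption: stack grows downwards
        self.memory[self.sp] = reg     # regular store instruction

    def return_from_spilled_call(self):
        """Load the return address and jump register-indirect."""
        reg = self.memory[self.sp]     # regular load instruction
        self.sp += 1
        self.pc = reg
```

A recursive function would call `spill()` on entry, so that the hardware stack never overflows regardless of the recursion depth.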
Today’s DSP cores have to provide efficient interrupt handling, which is characterized by low
latency and small overhead during task switching. xICU (scaleable interrupt control unit) of
xDSPcore supports low latency interrupt handling [19]. Priority morphing is added to prevent
interrupt starvation and to enable control of the execution order of interrupt service routines
during run-time. The scheduling is handled by the operating system.
Design Space Exploration
This section introduces DSPxPlore, a design space exploration methodology for xDSPcore.
DSPxPlore can be used to understand application-specific requirements on the processor
architecture already in an early stage of the project. At later stages DSPxPlore enables fine-tuning
of the chosen core subsystem. The introduced methodology is not limited to xDSPcore. The first
part of the section discusses the design space of RISC based DSP core architectures, with
xDSPcore taken for illustration. The influence of different configuration parameters on consumed
silicon area and power dissipation of the core subsystem is briefly discussed. The second part
introduces the exploration parameters used for application-specific optimizations. The third part
illustrates the methodology DSPxPlore is based on, ending with some examples and
results.
Motivation
To make use of the additional degree of freedom offered by the parameterized core architecture, the
requirements of the application have to be understood. Quite often the core decisions are made by
the most experienced engineers, focusing on the questions “what is already available?” and “what has
already been proven in silicon?” to reduce the risk. Because applications have different
requirements, using one core subsystem for all of them leads to suboptimal solutions concerning
consumed silicon area and power consumption. In the price-critical consumer IC market this can be
crucial for a company’s market position and revenues.
Design Space
The design space of RISC based DSP architectures can be reduced to a few parameters, which
are briefly introduced in this subsection. A more detailed analysis can be found in [20].
• Register File
The register file in load/store DSP architectures plays a major role. Arithmetic instructions are
using entries from the register file as operands, and separate instructions are used for
exchanging data between memory and the register file. Providing fewer entries leads to
additional spill code, which decreases system code density. Spill code is used to copy entries
of the register file to memory, to free space in the register file for new results.
Huge register files with many entries have significant influence on the core area. The register
entries have to be encoded and therefore increasing the number of registers decreases code
density. The requirement of orthogonal architectures to obtain efficient instruction scheduling
does not allow banked register files with the consequence that the increased crossbar to the
execution unit is limiting reachable core clock frequency.
• Data Paths
The number of data paths available for executing instructions in parallel is influencing the
reachable core performance. Dependencies (control and data) in the application code and
micro-architectural limitations like reduced program memory ports (e.g. Carmel DSP [21]) are
limiting system performance.
Providing several data paths in parallel influences system performance due to the crossbars
between the register file and data paths. Adding additional data paths not only increases the
core area but also decreases system code density by additional instruction space necessary for
coding new instructions. Therefore a balanced relation between application-specific
requirements and the number of parallel data paths provided is required, which allows
efficient use of silicon area.
• Memory Bandwidth
The memory bandwidth influences the possible usage of available parallel data paths.
Supporting the execution of four MAC instructions in parallel without providing enough data
to feed the MACs wastes silicon. Therefore a balance between the number of
instructions executed in parallel and the memory bandwidth is required for efficient use of the
core resources. That is true for both data and program memory ports.
• Instruction Set/Size/Encoding
The instruction set architecture is a key feature influencing code density, and code density is a
significant factor influencing the consumed silicon area of the core subsystem. Therefore
application-specific instruction encoding allows efficient usage of the coding space.
Application-specific reduction of the instruction set architecture makes it possible to resize the
instruction word. The switching activity at the program memory port can be reduced by
adapting the binary coding of the fetched instruction words.
• Pipeline Structure
The number of clock cycles spent for implementing the pipeline structure influences the
reachable clock frequency and therefore the performance of the core subsystem. On the other
hand, branch delays, load-use and define-use dependencies limit the utilization of the core
resources. Increasing the number of clock cycles enables providing core architectures with a
theoretical higher possible computation performance. If the restrictions in the application code
do not allow making use of the theoretical performance provided, only the power dissipation
has been increased (due to a higher clock frequency).
• Instruction Buffer
The instruction buffer is a specific feature of xDSPcore. Several commercial cores are also
getting a similar feature in their newer versions, and therefore it is reasonable to discuss this
aspect. The instruction buffer of xDSPcore is used to compensate, on average, for the memory
bandwidth mismatch between fetch and execution bundles, and to execute loop constructs
power efficiently by preventing repeated fetch cycles to program memory. The number of
entries limits the size of loop bodies which can be efficiently handled. On the other hand, too
many buffer entries influence the size of the core and the reachable performance due to the
crossbar connecting the buffer and the decoder ports.
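The statement that more registers cost coding space can be illustrated with a small calculation. The sketch assumes that every register operand of a 3-operand instruction is encoded in a plain binary field; the real xDSPcore encoding is not described here and may differ.

```python
def operand_field_bits(num_registers, operands=3):
    """Bits needed to encode the register operands of one instruction.

    Assumes one binary-coded field per operand (illustrative only).
    """
    bits_per_operand = (num_registers - 1).bit_length()
    return operands * bits_per_operand

for regs in (8, 16, 32):
    print(regs, operand_field_bits(regs))
```

Under this assumption, doubling the number of addressable registers adds one bit per operand, that is three bits per 3-operand instruction, which has to be weighed against the spill code saved.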
Methodology
Providing a configurable DSP core architecture for meeting application-specific requirements
reduces area consumption and power dissipation. For identifying an application-specific core
architecture it is important to understand the specific requirements of the application.
For this purpose DSPxPlore is introduced. DSPxPlore is not a tool, it is a methodology. The
exploration methodology is based on an optimizing C-compiler and a configurable ISS.
Figure 10 provides an overview of the exploration methodology. The optimizing C-compiler is used
to generate static analysis results. For evaluating dynamic results, the cycle true ISS called xSIM is
used. Both results together can be used to analyze the application requirements for the core
subsystem. The core configuration is described in an XML-based configuration file, introduced in
more detail in the next section. In the next sub-section some of the analysis results are introduced.
Figure 10: DSPxPlore Overview.
Analysis
As illustrated in Figure 10 the analysis results are split into static and dynamic analysis. This subsection is used to briefly introduce some analysis results. A more detailed description can be found
in [20].
Static analysis
Static analysis results are generated by the C-compiler. Some examples are:
• Code size
The consumed silicon area of a DSP subsystem is significantly influenced by the program
memory and therefore by code density. Code density is a measure of how efficiently the
application can make use of the chosen instruction set architecture and the related binary
coding. Furthermore, micro-architectural limitations like the structure of the program memory
port are mirrored in code density. The analysis result code size is normalized to bytes, which
makes the results from different sized instruction words comparable.
• Parallelism
Providing the execution of several instructions in parallel increases the theoretical
performance of a DSP core. Control and data dependencies in the application code are limiting
the actual usage of the parallel execution units. Therefore the analysis result parallelism is
used to analyze how efficiently the application code can make use of the hardware resources.
• Instruction histogram
Mapping of an instruction set architecture to the related binary coding has influence on power
dissipation (switching activity at the program memory port) and code density. The analysis
result instruction histogram lists the instructions used and the frequency of their occurrence.
This result can be used to identify the most used instructions for optimizing binary coding.
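Two of the static results listed above can be sketched on a toy program representation. The list-of-bundles format and the mnemonics are hypothetical and only stand in for the output of the C-compiler.

```python
from collections import Counter

# Hypothetical compiler output: one inner list per execution bundle.
program = [
    ["ld", "ld", "mac"],
    ["ld", "mac"],
    ["st"],
    ["br"],
]

def instruction_histogram(bundles):
    """Count how often each instruction occurs in the static code."""
    return Counter(instr for bundle in bundles for instr in bundle)

def static_parallelism(bundles):
    """Average number of instructions per execution bundle."""
    return sum(len(b) for b in bundles) / len(bundles)

print(instruction_histogram(program).most_common(2))
print(static_parallelism(program))
```

The most frequent entries of the histogram are the natural candidates for a short or low-switching binary encoding.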
Dynamic analysis
Dynamic analysis results are generated by the cycle true instruction set simulator xSIM. Some
examples are:
• Execution count per bundle
It is typical for DSP algorithms that the central parts are implemented as loop constructs.
Therefore the execution bundles (instructions executed during the same clock cycle) that are a
part of a loop construct are executed frequently, which makes optimizations more significant
than optimizations in less frequently executed bundles. The analysis result execution count per
bundle is used to identify frequently executed bundles. The result is also used for weighting
the static result parallelism.
• Execution count per instruction
Instructions that are part of loop constructs are executed more frequently than instructions in
sequential control code. The analysis result execution count per instruction identifies the execution
frequency of the instructions used. The result is also used for weighting the static result
instruction histogram.
• Program memory fetch
The number of program memory fetch cycles has significant influence on the power
dissipation of the DSP subsystem. Especially control code featuring low branch distance can
lead to a significant number of fetch cycles that fetch instructions which are never executed. To
identify the unused fetch bundles and to optimize the fetch process at the program memory
port the analysis result program memory fetch is used.
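Weighting the static parallelism with the dynamic execution count per bundle can be sketched as follows. The bundle sizes and execution counts are made-up numbers, not results produced by xSIM.

```python
def weighted_parallelism(bundle_sizes, execution_counts):
    """Average instructions per executed bundle.

    bundle_sizes[i] is the number of instructions in bundle i (static),
    execution_counts[i] is how often the simulator executed it (dynamic).
    """
    executed = sum(execution_counts)
    issued = sum(s * c for s, c in zip(bundle_sizes, execution_counts))
    return issued / executed

# Hypothetical data: two loop bundles dominate the execution profile.
sizes = [3, 2, 1, 1]
counts = [98, 100, 1, 1]
print(weighted_parallelism(sizes, counts))
```

In this example the static average is 1.75 instructions per bundle, whereas the weighted value is about 2.48, because the well-filled loop bundles are executed far more often than the sparse control code.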
Example Results
This subsection is used to illustrate some results generated by using the methodology described
above. The application code used for generating these results consists of typical DSP code examples
like FFT, cryptographic algorithms and control code examples. A more detailed description can be
found in [20].
Register file
The number of entries in the register file has influence on code density and performance of the core
subsystem. Too many entries increase core area and decrease code density due to the additional coding
space needed. Too few entries decrease code density because additional spill code is needed.
Figure 11: Influence of the Register File (series: nr. bundles, nr. instructions, nr. delay NOPs, code size (byte); x-axis: number of registers, 4 to 12; y-axis: 0 to 50000).
The example in Figure 11 illustrates the influence of the size of the register file on the core
subsystem. The x-axis shows the number of register entries, e.g. 4 means 4 accumulator registers,
equivalent to 8 data registers. The number of instructions needed to code the application
decreases with an increasing number of register file entries (less spill code needed). The same is
true for the number of execution bundles needed to execute the algorithm. The number of branch
delay NOPs does not change significantly. Therefore the code size normalized to bytes
decreases with the number of provided registers. The influence of the size of the register file on
code density and number of execution bundles starts to saturate after 12 entries, as can be
seen in Figure 11. Adding more register entries, e.g. 16 or 32, no longer has a significant influence
on system performance.
Data paths
Providing the execution of several instructions in parallel can be used to increase system
performance of a DSP subsystem. Control and data dependencies in the application code can lead to
poor usage of the core resources.
Figure 12: Influence of Parallel Data Paths (series: nr. bundles, nr. instructions, nr. delay NOPs, code size (byte); x-axis: model 0 to model 4; y-axis: 0 to 40000).
Figure 12 illustrates the influence of different VLIW models on code density and execution
bundles. Model 0 can execute 3 instructions in parallel, model 4 up to 6 instructions. The
different models are described in Figure 13.
              model 0   model 1   model 2   model 3   model 4
computation      1         1         2         2         3
load/store       1         2         1         2         2
branch           1         1         1         1         1
Figure 13: Model Description.
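The issue width of each model follows directly from Figure 13 as the sum of its parallel units. The following sketch encodes the table and reproduces the widths mentioned in the text; the dictionary layout and function name are chosen for illustration only.

```python
# Units per model, taken from Figure 13.
MODELS = {
    "model 0": {"computation": 1, "load/store": 1, "branch": 1},
    "model 1": {"computation": 1, "load/store": 2, "branch": 1},
    "model 2": {"computation": 2, "load/store": 1, "branch": 1},
    "model 3": {"computation": 2, "load/store": 2, "branch": 1},
    "model 4": {"computation": 3, "load/store": 2, "branch": 1},
}

def issue_width(model):
    """Maximum number of instructions per execution bundle."""
    return sum(MODELS[model].values())

for name in MODELS:
    print(name, issue_width(name))
```

Model 0 is therefore a 3-way VLIW, model 2 a 4-way VLIW and model 4 a 6-way VLIW.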
Increasing the number of parallel data paths reduces the number of execution bundles needed, but
the gain from adding parallel units is limited by dependencies in the application code. As
illustrated in Figure 12, model 2 (a 4-way VLIW) provides a significant reduction of the execution
bundles needed; adding more data paths has less influence.
The increasing number of branch delay NOPs is caused by the compiler scheduling instructions as
early as possible; this behavior can be changed. In any case, wider VLIW architectures decrease the
branch distance and therefore increase the number of branch delay NOPs.
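The issue width of each model follows directly from the unit counts in Figure 13. The following
minimal Python sketch sums the units per model; the dictionary layout and the helper name
issue_width are illustrative assumptions, not part of DSPxPlore.

```python
# Unit counts per VLIW model, copied from Figure 13.
MODELS = {
    "model 0": {"computation": 1, "load/store": 1, "branch": 1},
    "model 1": {"computation": 1, "load/store": 2, "branch": 1},
    "model 2": {"computation": 2, "load/store": 1, "branch": 1},
    "model 3": {"computation": 2, "load/store": 2, "branch": 1},
    "model 4": {"computation": 3, "load/store": 2, "branch": 1},
}


def issue_width(model: str) -> int:
    """Maximum number of instructions per execution bundle (hypothetical helper)."""
    return sum(MODELS[model].values())


if __name__ == "__main__":
    for name in MODELS:
        print(name, issue_width(name))
```

This reproduces the figures quoted in the text: model 0 is a 3-way, model 2 a 4-way and model 4 a
6-way VLIW.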
Pipeline structure
Increasing the number of clock cycles per pipeline stage is used to increase the reachable clock
frequency of core subsystems. However, dependencies in the application code and branch delays can
even decrease system performance. Branch delays are caused by taken branch instructions and lead to
unused clock cycles. Supporting delayed branching makes it possible to use the branch delay slots,
but how many of them can be filled is limited by the application code.
The example in Figure 14 compares the execution on a 3-way VLIW architecture and on a 6-way VLIW
core. Doubling the number of instructions which can be executed in parallel reduces the number of
execution bundles needed by about 15%. However, reducing the number of execution bundles does not
reduce the code size, which stays the same: the advantages of the wider VLIW are compensated by
additional branch delay NOPs, caused by the reduced branch distance.
[Figure 14 plot "influence branch delays": nr. bundles and code size for model 0 and model 4 versus
nr. of branch delays (2, 3, 4); y-axis from 0 to 40000.]
Figure 14: Influence of Branch Delay Slots.
Increasing the number of branch delays decreases code density, as illustrated in Figure 14: branch
delay slots which cannot be filled with useful code have to be filled with additional branch delay
NOPs.
Summary of examples
The examples in this section illustrate the complexity of measuring system performance. Changing
one of the core architectural features influences several other aspects; none of them can be
analyzed in isolation.
The application code executed on the core subsystem has a significant influence on the results.
Therefore core architectures have to be evaluated with the real application code. Traditional
benchmark examples, as available for commercial core architectures, can be misleading.
Configuration
The DSP core architecture introduced above enables application-specific adaptation of the main
architectural features. Some exploration parameters were introduced in the previous section to
illustrate the design space of RISC-based DSP architectures.
To ease maintenance of configurations and to keep the tools, core and documentation consistent,
xDSPcore has a single common configuration file, so information concerning core details is stored
only once. In this section the configuration file is briefly introduced, together with the related
tools making use of it.
Configuration File structure
The configuration file for the xDSPcore architecture is based on XML. Using a standard language
allows making use of existing tools, which is especially helpful during documentation generation.
The configuration file is split into several sections.
Architectural configuration
The first part contains the chosen architectural configuration. It describes the structure of the
register file, the number and size of the different register types and the mapping of the
registers. It also specifies the chosen pipeline structure and the number of clock cycles for each
pipeline stage. An example for this section of the configuration file is shown in Figure 15.
<config type="register" name="A0">
  <data name="width" value="40"/>
</config>
<config type="shared_register" name="L0">
  <connect method="getValue">
    <hook method="getValue" name="A0"/>
  </connect>
  <connect method="setValue">
    <hook method="setValue" name="A0"/>
  </connect>
  <data name="width" value="32"/>
  <data name="msb" value="31"/>
  <data name="lsb" value="0"/>
  <data name="sign_extension" value="1"/>
</config>
Figure 15: Register Configuration.
The example in Figure 15 defines a register as illustrated in Figure 6. The accumulator with the
name A0 is defined as a register with a word length of 40 bits. The long register L0 is defined as
a shared register; as illustrated in Figure 6, L0 is part of A0. L0 is 32 bits wide and starts at
bit 0, so its MSB (most significant bit) is located at bit position 31. Sign extension is done
automatically. To fully describe one accumulator register as introduced in Figure 6, the data
registers also have to be defined (illustrated in Figure 16). For example D1, the high word of L0,
is again a shared register, 16 bits wide, located between bits 16 and 31. As for the long register,
sign extension is done automatically. Data register D0, the low word of long register L0, is
defined in a similar way. The only difference to D1, besides the position of the data word, is the
configuration parameter sign_extension, which is set to 0: changing the entry of D0 does not
automatically change the leading bits of the accumulator.
<config type="shared_register" name="D1">
  <connect method="getValue">
    <hook method="getValue" name="A0"/>
  </connect>
  <connect method="setValue">
    <hook method="setValue" name="A0"/>
  </connect>
  <data name="width" value="16"/>
  <data name="msb" value="31"/>
  <data name="lsb" value="16"/>
  <data name="sign_extension" value="1"/>
</config>
<config type="shared_register" name="D0">
  <connect method="getValue">
    <hook method="getValue" name="A0"/>
  </connect>
  <connect method="setValue">
    <hook method="setValue" name="A0"/>
  </connect>
  <data name="width" value="16"/>
  <data name="msb" value="15"/>
  <data name="lsb" value="0"/>
  <data name="sign_extension" value="0"/>
</config>
Figure 16: Data Register Definition.
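The register views of Figures 15 and 16 can be sketched as bit-field accessors on a 40-bit
accumulator. The Python below is only one plausible reading of the msb, lsb and sign_extension
parameters: the class names and the exact sign-extension behavior (filling all accumulator bits
above msb with the sign bit of the written value) are assumptions, not the xSIM implementation.

```python
ACC_WIDTH = 40  # width of accumulator A0 in Figure 15


class Accumulator:
    """Holds the raw accumulator bits (hypothetical model of A0)."""

    def __init__(self):
        self.bits = 0


class SharedRegister:
    """View onto bits msb..lsb of a parent accumulator (hypothetical model)."""

    def __init__(self, parent, msb, lsb, sign_extension):
        self.parent, self.msb, self.lsb = parent, msb, lsb
        self.sign_extension = sign_extension
        self.width = msb - lsb + 1

    def get_value(self):
        return (self.parent.bits >> self.lsb) & ((1 << self.width) - 1)

    def set_value(self, value):
        field_mask = ((1 << self.width) - 1) << self.lsb
        bits = (self.parent.bits & ~field_mask) | ((value << self.lsb) & field_mask)
        if self.sign_extension:
            # Assumed behavior: copy the sign bit into all bits above msb.
            upper = ((1 << ACC_WIDTH) - 1) & ~((1 << (self.msb + 1)) - 1)
            sign = (value >> (self.width - 1)) & 1
            bits = (bits | upper) if sign else (bits & ~upper)
        self.parent.bits = bits & ((1 << ACC_WIDTH) - 1)


a0 = Accumulator()
l0 = SharedRegister(a0, msb=31, lsb=0, sign_extension=True)
d1 = SharedRegister(a0, msb=31, lsb=16, sign_extension=True)
d0 = SharedRegister(a0, msb=15, lsb=0, sign_extension=False)
```

Under these assumptions, writing a negative value to D1 also sets the bits above bit 31 of A0, while
writing to D0 leaves every bit above bit 15 untouched, which matches the description of
sign_extension set to 0.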
Like the register definition and the register file structure shown above, the other configurable
architectural features introduced in the previous section are described in the configuration file.
Instruction Set description
To keep the core application-specific, entries of the instruction set architecture can be added,
removed and changed. The binary coding can also be changed, for example to obtain lower switching
activity at the program memory port. For this reason the instruction set architecture and the
instruction set are part of the configuration file.
For illustration, the definition of a three-operand addition of data registers is shown in Figure
17. The instruction is defined with a version number, which allows evolutionary modifications of
the configuration file. The binary coding consists of fixed bits and some parameters; in the
example below the parameters are used for coding the register entries. The assembler mnemonic is
introduced and a name is specified for the documentation. Operands, their order and the chosen
syntax are defined below. The execution cycle section is used to map functions of the instruction
(e.g. reading operand1 from a register) to pipeline stages. This mapping is defined in a section of
the core architecture and mapped to the instruction at this stage.
The restriction, description and operation tags are used for the documentation part of the
instruction.
<instruction name="ADDITION_DATA_REGISTER" type="instance" version="0.1.0">
  <var_info>
    <opcode>10p00000aaaabbbbcccc</opcode>
    <execution_model/> <!-- link to one of the specified models -->
    <image/>
  </var_info>
  <const_info>
    <mnemonic>add</mnemonic>
    <name>ADDITION (DATA)</name>
    <instruction_type>CMP</instruction_type>
    <operands>
      <operand char="a" order="1">
        <operand_type>DATA_REGISTER</operand_type>
      </operand>
      <operand char="b" order="2">
        <operand_type>DATA_REGISTER</operand_type>
      </operand>
      <operand char="c" order="3">
        <operand_type>DATA_REGISTER</operand_type>
      </operand>
    </operands>
    <syntax>
      <define_value>op1, op2, op3</define_value>
    </syntax>
    <execution_cycle>
    </execution_cycle>
    <restrictions>None</restrictions>
    <description>Performs an addition of two data register values
    (op1,op2) and stores the result in a data register (op3).
    </description>
    <operation><description><para>op3 = op1 + op2 </para></description></operation>
  </const_info>
</instruction>
Figure 17: Instruction Definition.
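The opcode string in Figure 17 mixes fixed bits with parameter letters. A minimal Python sketch of
how such a pattern could be turned into an instruction word is shown here. The function encode is
a hypothetical helper, and the single-bit parameter p is treated as a generic field because its
meaning is not stated in this section.

```python
def encode(pattern: str, fields: dict) -> int:
    """Fill the parameter letters of an opcode pattern with field values.

    The pattern is read most significant bit first; '0' and '1' are fixed
    bits, and every other character names a field (hypothetical scheme).
    """
    widths = {ch: pattern.count(ch) for ch in set(pattern) if ch not in "01"}
    for ch, width in widths.items():
        if not 0 <= fields[ch] < (1 << width):
            raise ValueError(f"field {ch!r} does not fit into {width} bits")
    remaining = dict(widths)
    word = 0
    for ch in pattern:
        if ch in "01":
            bit = int(ch)
        else:
            remaining[ch] -= 1  # bit index inside the field, MSB first
            bit = (fields[ch] >> remaining[ch]) & 1
        word = (word << 1) | bit
    return word


ADD_DATA = "10p00000aaaabbbbcccc"  # opcode pattern from Figure 17
word = encode(ADD_DATA, {"p": 0, "a": 1, "b": 2, "c": 3})
print(f"{word:020b}")
```

Changing the binary coding then only means changing the pattern string, which is the property the
configuration file relies on.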
All instructions supported by xDSPcore are located in the instruction set description. In the next
sub-sections some “tools” that make use of the configuration file are introduced.
Decoder generator
The configuration file is used to automatically generate the VHDL description of the instruction
decoder. Because the binary coding of the instructions can change, and each change would otherwise
cause manual coding and verification effort, a tool like this is required.
Figure 18 briefly introduces the structure of the decoder generator. The central block is the
database, which is fed with the instruction set and the chosen binary coding. A spreadsheet is used
for better visualization of the instruction set during definition. The instructions are described
in instruction groups, and symbols are used to identify the instructions bundled into instruction
groups. The second input is the actual register configuration. In [22] a separate part of the
spreadsheet was used; the final version of the decoder generator uses the content of the
configuration file.
Figure 18: Decoder Generator Structure.
The third input to the database is called the container. The container is a C++ class which is
used to map the instruction groups to VHDL statements. Therefore major changes in the instruction
set, like adding new instruction groups, make changes in the container structure necessary.
Adapting the existing instruction set does not require any changes in the container structure.
Examples of container classes can be found in [22].
When building up the database in Figure 18, the three input sources are read and a consistency
check prevents ambiguous coding of instruction groups. All instructions are mapped to instruction
groups, and the instruction groups are mapped to container classes. A further consistency check
prevents instructions from being missing in the VHDL description of the instruction decoder.
A tree structure, generated from the content of the database, is used to map the op-code of the
instructions. Each node in the tree can take one of three states (don’t care, zero or one), which
correspond to the symbols of the instruction words. Each branch in the tree represents an
instruction group.
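The check for ambiguous coding can be illustrated with three-valued bit patterns. In the Python
sketch below, which is not the decoder generator itself, '-' stands for don't care, and the group
names and patterns are made up for illustration. Two groups are ambiguous if no bit position
separates them with a fixed 0 against a fixed 1.

```python
def overlaps(p: str, q: str) -> bool:
    """True if some instruction word matches both patterns ('-' = don't care)."""
    if len(p) != len(q):
        raise ValueError("patterns must have the same length")
    return all(a == b or "-" in (a, b) for a, b in zip(p, q))


def find_ambiguous(groups: dict) -> list:
    """Return all pairs of group names whose codings cannot be told apart."""
    names = sorted(groups)
    return [
        (m, n)
        for i, m in enumerate(names)
        for n in names[i + 1:]
        if overlaps(groups[m], groups[n])
    ]


# Hypothetical 4-bit codings, not the xDSPcore instruction set.
GROUPS = {"G1": "1---", "G2": "0000", "G3": "-000"}
print(find_ambiguous(GROUPS))
```

Here G1 and G2 differ in the first bit and are distinguishable, whereas G3 overlaps with both, so a
generator performing this check would reject the coding.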
case instruction(11) is
  when '1' => -- ADDI_Family
    cmp_instruction := addi;
    cmp_ex1_add1.en := '1';
    cmp_ex1_add1.add_const := '1';
    cmp_ex1_write1 := setDxLxRx(instruction(4 downto 0));
    cmp_ex1_read1 := setDxLxRx(instruction(4 downto 0));
    cmp_ex1_cntrl1.const := signExtend16(instruction(10 downto 5));
  when others => -- MOVR_Family
    cmp_instruction := movr;
    case instruction(10 downto 8) is
      when "000" =>
        cmp_ex1_write1 := setData(instruction(3 downto 0));
        cmp_ex1_read1 := setData(instruction(7 downto 4));
      …
Figure 19: Decoder Code Example.
During output generation the status of each node is checked, starting at the root of the tree.
Nodes with don’t care status are handled first, because the related coding bit could be used in a
different instruction group. When the end of a branch is reached without further sub-branch
connections, the remaining coding of the instructions can be handled in one statement. If further
sub-branches are found, an additional split in the coding (implemented as a case statement) becomes
necessary.
Figure 19 shows an example of generated VHDL-RTL code. The control signals for the DSP core units
are set and the register coding is assigned. The decoder architecture is separated into decoder
structures for each datapath, each assigned to one of three operation classes (load/store,
arithmetic and branch). The described procedure has to be carried out for each of the datapaths.
Splitting the decoder into sub-decoder structures makes it possible to remove a sub-decoder if the
related datapath is not included in a chosen core architecture. Therefore the silicon area consumed
by the instruction decoder scales with the number of instructions that can be executed in parallel.
Tool chain
As an example of a tool making use of the configuration file, xSIM, the cycle-true instruction set
simulator (ISS) of xDSPcore, is chosen. In contrast to the available simulator concepts introduced
in [23][24][25] and [26], xSIM is based on a framework supporting run-time configurability and
supports VLIW architectures. Both are mandatory for using an ISS for xDSPcore.
The framework is based on components. Each component consists of classes, in the sense of
object-oriented programming. A component needs input, provides a certain predefined functionality
and produces some output. The communication between components takes place via hooks. The
components are observable, which means that the function of a component (e.g. the GUI – graphical
user interface) can be activated by another component (e.g. by changing a register value).
Configuration parameters of the components have to be predefined and can then be changed during
run-time (e.g. changing the word length of a register).
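The component idea (hooks, observability, predefined but run-time changeable parameters) can be
sketched in a few lines of Python. This is an illustration of the concept only: the class and
method names are hypothetical and do not reflect the actual xSIM framework or its API.

```python
class Component:
    """Observable building block with predefined parameters (hypothetical)."""

    def __init__(self, name, **params):
        self.name = name
        self.params = dict(params)  # parameters must be predefined here
        self._hooks = {}            # event name -> list of callbacks

    def connect(self, event, callback):
        """Attach another component's function to one of this component's hooks."""
        self._hooks.setdefault(event, []).append(callback)

    def notify(self, event, *args):
        for callback in self._hooks.get(event, []):
            callback(self, *args)

    def configure(self, key, value):
        """Change a predefined parameter at run-time."""
        if key not in self.params:
            raise KeyError(f"{key!r} is not a predefined parameter of {self.name}")
        self.params[key] = value


class Register(Component):
    def __init__(self, name, width):
        super().__init__(name, width=width)
        self.value = 0

    def set_value(self, value):
        self.value = value & ((1 << self.params["width"]) - 1)
        self.notify("setValue", self.value)


log = []  # stands in for a GUI component observing the register
a0 = Register("A0", width=40)
a0.connect("setValue", lambda comp, val: log.append((comp.name, val)))
a0.set_value(0x1FF)
a0.configure("width", 8)  # word length changed at run-time
a0.set_value(0x1FF)
```

The observer is triggered by the register write without the register knowing anything about the
GUI, which is the decoupling the hook mechanism is meant to provide.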
Figure 20: Simulator Generation.
Dynamic link libraries are used to provide components, which allows any user to add their own
types of components following the predefined structure. Different components can be combined to
set up a new component; the simulator itself is also a component. The configuration information
stored in the XML-based configuration file is evaluated during startup of the simulator (run-time
configurable). Figure 20 illustrates the steps during startup. First, reading the model
configuration provides the information about which components are used and how they are connected
to each other. The instantiated components are read from a dynamic component library. Already at
this point the framework checks the connections between the components and verifies their
features. If the framework passes this stage without error messages, it is guaranteed that the
described model consists of properly connected components.
Figure 21 illustrates a layer model of the simulator, used to simulate the different architecture
variants. The base layer reflects the functional model of the core architecture. Further layers
can be added to simulate different aspects, e.g. the power dissipation. The higher layers are
again connected to the base layer via hooks. The layer-based structure allows different
abstraction levels and therefore different simulation speeds. Starting with only a base layer that
describes the chosen architecture, verification on the functional level can be done with high
simulation speed. Adding layers decreases the simulation speed.
Figure 21: Layer Concept.
To make the simulator independent of the execution platform, the GUI is a separate layer. The
supported interface libraries are GTK and Qt.
Documentation
The configuration file is used as the source for automatic generation of documentation. For
example, when the binary coding of instructions is changed, the documentation has to be updated;
doing this automatically reduces the effort for changes and validation. The instruction
definition, as in the example in Figure 17, is used as the source for updating the core
documentation.
The same source is also used for generating, e.g., the instruction set description and the
assembler documentation. Therefore the documents describing the core architecture are consistent
without the need for manual verification.
Figure 22 illustrates the chosen flow. Using the XML core configuration file and style sheet
information as input, an XSL processor (saxon [27]) generates a DocBook file, which is also based
on XML [28]. Depending on the chosen output format, either an HTML description can be generated
directly (again using saxon and style sheet information, this time for DocBook to HTML), or a PDF
version of the documents can be generated via an intermediate step (making use of the Apache FOP
(Formatting Objects Processor) [29]).
Figure 22: Generating Documentation.
Making use of the DocBook intermediate representation makes it possible to use only one style
sheet for different output formats, which removes the need to keep several style sheets consistent
[30].
Summary
This paper introduced xDSPcore, an application-specific configurable DSP core. The main
architectural features with influence on area and power dissipation can be scaled or configured to
close the gap between dedicated hardware implementations and software-based solutions. Therefore
xDSPcore is well suited as a platform core. To identify the requirements an application places on
the core subsystem, the design space exploration methodology DSPxPlore is introduced, which allows
exploring the design space of RISC-based core architectures. Some results illustrate the complexity
of changing parameters and their influence on the key aspects of area and power consumption. To
obtain consistency between core hardware, programming tools and documentation, a unique
configuration file based on XML is introduced and some of the “tools” making use of it are briefly
illustrated.
Acknowledgement
The work has been supported by the EC with the project SOC-Mobinet (IST-2000-30094), the
Christian Doppler Lab “Compilation Techniques for Embedded Processors” and Infineon
Technologies Austria.
References
[1] P.Lapsley, J.Bier, A.Shoham and E.A.Lee, “DSP Processor Fundamentals, Architectures and Features”,
New York, IEEE Press, 1997.
[2] ISO/IEC DTR 18037.2, “Information technology – Programming languages, their environments and
system software interfaces – Programming Language C.” Version for DTR approval ballot, 25.04.03.
[3] “SC140 DSP Core Reference Manual”, Motorola Inc., MNSC140CORE/D, Revision 3, November
2001.
[4] “Blackfin DSP Instruction Set Reference”, Digital Signal Processor Division, Analog Devices Inc.,
First Revision, March 2002.
[5] ARC 600, Reference Manual, http://www.arc.com.
[6] “Xtensa Architecture and Performance”, white paper of Tensilica Inc., September 2002.
[7] Coware, http://www.coware.com.
[8] “Lisa 2.0 Language Reference Manual, Manual RM_2002.02”, LisaTek, February 2002.
[9] “The ArchC Architecture Description Language v0.8.1, Reference Manual”, www.archc.org, 2004.
[10] “DSP specific extension to ANSI C”, ACE, http://www.ace.nl/products/cosydsp.htm.
[11] J. L. Hennessy, D. A. Patterson, “Computer Architecture. A Quantitative Approach”, San Mateo CA,
Morgan Kaufmann Publishers, 1996.
[12] D.Sima, T.Fountain, P.Kacsuk, “Advanced Computer Architectures: A Design Space Approach”,
Addison Wesley Publishing Company, Harlow, 1997.
[13] C.Panis, R.Leitner, H.Grünbacher, J.Nurmi “xLIW – a Scaleable Long Instruction Word”, ISCAS 2003,
Bangkok, Thailand, 2003.
[14] C.Panis, U.Hirnschrott, A.Krall, G.Laure, W.Lazian, J.Nurmi, “FSEL - Selective Predicated Execution
for a Configurable DSP Core”, IEEE Annual Symposium on VLSI, Lafayette, LA, USA, 02. 2004.
[15] Smith J.E., “A study of branch prediction strategies”, in Proc. 8th ISCA, pp.135-48, 1981.
[16] Lee J.K.F. and Smith A.J., “Branch prediction strategies and branch target buffer design”, Computer
17(1), pp.6-22, 1984.
[17] Yeh T.-Y. and Patt Y.N., “Alternative implementations of two-level adaptive branch predictions”, In
Proc. 19th ISCA, pp.124-34, 1992.
[18] Pnevmatikos D.N. and Soshi G.S., “Guarded Execution and branch prediction in dynamic ILP
processors”, In Proc. 21st ISCA, pp. 120-9, 1994.
[19] C.Panis, J.Hohl, H.Grünbacher, J.Nurmi, “xICU - a Scaleable Interrupt Unit for a Configurable DSP
Core", SOC-03, Tampere, Finland, 11.2003.
[20] C.Panis, U.Hirnschrott, G.Laure, W.Lazian, J.Nurmi, “Design Space Exploration for an Embedded DSP
Core”, SAC 04, Cyprus, 03.2004.
[21] Infineon Technologies, “Carmel DSP Core Architecture Specification”, Infineon Technologies, 2001.
[22] C.Panis, A.Schilke, Habiger, J.Nurmi, “An Automatic Decoder Generator for a Scaleable DSP
Architecture”, Norchip, Copenhagen, Denmark, 11.2002.
[23] M.Rosenblum, E.Bugnion, A.Herrod and S.Devine, “Using the SimOS Machine Simulator to Study
Complex Computer Systems”, ACM Transactions on Modeling and Computer Simulation, Vol.7, 01.1997.
[24] K.Shadron and S.A.Pritpal, “HydraScalar: A Multipath Capable Simulator”, Newsletter of the IEEE
Technical Committee on Computer Architecture (TCCA), 01.2001.
[25] J.Emer, P.Ahuja, E.Borch, A.Klausner, C.Luk, S.Manne, S.Mukherjee, H.Patil, S.Wallace, N.Binkert,
R.Espasa and T.Juan, “ASIM: A Performance Model Framework”, IEEE Computer, 02.2002.
[26] T.Austin, E.Larson and D.Ernst, “SimpleScalar: An Infrastructure for Computer System Modelling”,
IEEE Computer, 02.2002.
[27] Michael H. Kay: SAXON The XSLT and XQuery Processor saxon.sourceforge.net.
[28] Norman Walsh, Leonard Muellner: Docbook: The Definitive Guide, O’Reilly, October 1999
(www.docbook.org).
[29] Apache FOP: XSL-FO Processor xml.apache.org.
[30] Norman Walsh: DocBook XSL Stylesheets docbook.sourceforge.net.
PUBLICATION 2
C. Panis, R. Leitner, H. Grünbacher, J. Nurmi, “xLIW – a Scaleable Long Instruction
Word”, in Proceedings The 2003 IEEE International Symposium on Circuits and Systems
(ISCAS 2003), Bangkok, Thailand, May 25-28, 2003, pp. V69-V72.
©2003 IEEE. Reprinted, with permission, from proceedings of the 2003 IEEE International
Symposium on Circuits and Systems.
xLIW – a Scaleable Long Instruction Word
Christian Panis, Carinthian Tech Institute, Fachhochschule Kärnten, Villach, Austria
Raimund Leitner, Infineon Technologies, Villach, Austria
Herbert Grünbacher, Carinthian Tech Institute, Fachhochschule Kärnten, Villach, Austria
Jari Nurmi, Tampere University of Technology, Tampere, Finland
ABSTRACT
Increasing system complexity of SOC applications leads to an increasing requirement of powerful
embedded DSP processors. To increase the computational power of DSP processors, the number of
pipeline stages has been increased for higher frequencies and the number of parallelly executed
instructions to increase the computational bandwidth. To program the parallel units VLIW (Very
Long Instruction Word) has been introduced. Programming the parallel units at the same time leads
to an expanded program memory port or to the limitation that only a few units can be used in
parallel. To overcome this limitation the paper proposes a Scaleable Long Instruction Word (xLIW).
The xLIW concept allows the full usage of the available units in parallel with optimal code
density. An instruction buffer included reduces the power dissipation at the program memory ports
during loop handling. The xLIW concept is part of a development project of a configurable DSP.
INTRODUCTION
The increasing requirements on computational power of embedded processors have led to deeper
pipeline structures and to an increase of parallel execution units. To program these parallel
units the VLIW (Very long instruction word) […] will be used, which is built up of several
instruction words. The scheduling of the instructions and the resolution of data dependencies
between instructions is shifted to software, e.g. to the C-Compiler. Therefore no hardware support
like scoreboards […] in superscalar processor architectures is available for VLIW architectures.
The VLIW will be fetched, decoded and then issued to the parallel execution units. Providing a
high number of parallel units leads therefore to a wide program memory port or to the limitation
that only a few of the available units can be used in parallel. To overcome this limitation this
paper proposes the xLIW (Scaleable Long Instruction Word). The xLIW concept allows full
programmability of the available execution units with a moderate program memory port size, which
is feasible for embedded processors under consideration of power dissipation and silicon area.
Additionally it takes into account that up to …% of today’s DSP applications are control code
dominated […], which does not allow to fully utilize the available parallel units.
To handle the xLIWs and realigning the operations to the corresponding execution units, the align
unit is introduced. One of its building blocks, an instruction buffer, allows power optimized
handling of loop instructions. The first part of the paper introduces an example architecture and
illustrates the structure of the xLIW. In the second part the architecture of the align unit will
be discussed. The main building blocks, instruction buffer, mapping unit and control unit, will be
explained in detail.
[Figure labels: instr 1 to instr 7; unit 1 to unit 5; align unit.]
Figure …: VLIW architectures
xLIW CONCEPT
The next section gives an overview of the xLIW concept. To illustrate the function and the
advantages of the proposed concept an example architecture will be used. The modified Dual-Harvard
load-store DSP architecture consists of five independent data paths as illustrated in Figure ….
Normal instructions consist of one 20-bit wide instruction word. A 20-bit instruction set has been
chosen to support all arithmetic operations with …-operand instructions. Additionally, as it can
be seen in Figure …, the fetch bundle consists of 4 instruction words, which gives an 80 bits wide
program memory bus or possibly a realization with … times … bits wide bus. The first three bits of
each instruction will be used to assign the instruction word to one of the three operation classes
and additionally contain the alignment information. The memory content is not aligned to the fetch
bundle starting addresses, which allows saving memory resources as also illustrated in Figure ….
Long instructions, shown in Figure …, have an additional long word. The long word can be used for
offset values, immediate values or branch target addresses which do not fit into the instruction
word itself. For each long instruction also a short instruction is available, e.g. the branch
instruction consists of only one instruction word. The provided address space for the branch is …
bits wide. If this is not sufficient, a branch long instruction with a long word is available for
additional address space.
[Figure labels: normal instruction (instruction); long instruction (instruction, long word).]
Figure …: Instruction Word
The long word does not contain the assignment to an operand class, because it belongs to an
instruction. Therefore it is mandatory that the long word follows its instruction word inside the
fetch bundle.
An xLIW can be built up of several instruction words as described before. The number of
instruction words combined to an xLIW is only dependent on the algorithm and not on any other
hardware restrictions, with the exception of the number of available parallel units. In the
following we will call the xLIW execution bundle, which indicates instructions executed at the
same time. The execution bundle size can vary from one to ten instructions as shown in Figure ….
[Figure labels: 1 instruction word; 10 instruction words.]
Figure …: xLIW structure
The instructions fetched from program memory at the same time will be called fetch bundle. The
size of the fetch bundle is constant; in the example architecture the fetch bundle consists of 4
instruction words.
ALIGN UNIT
To handle this additional flexibility of a Scaleable Long Instruction Word requires an additional
function unit called align unit. The following section will be used to illustrate the structure of
the align unit and to point out the main issues concerning its implementation. The align unit will
operate within a pipeline stage of its own, located between the fetch stage (fetching the
instruction words from the program memory) and the decode stage (responsible for instruction
decoding).
Figure … shows an overview of the main building blocks of the align unit, which will be described
in the next section more in detail.
[Figure labels: instruction buffer; instruction mapping; align control; unit 1, unit 2, unit 5.]
Figure …: Align Unit
Instruction buffer
The size of the fetch bundle is constant (4 instruction words) and the execution bundle can vary
between 1 and 10 instruction words. VLIW architectures have no hardware support like scoreboards
[…] that are used in superscalar processor architectures to resolve data dependencies. The
instruction scheduling will be done in software, e.g. by a compiler. That is why it is necessary
that the whole execution bundle will be executed at the same time. If the size of the execution
bundle exceeds the size of the fetch bundle, a stall cycle will be necessary.
To prevent stall cycles an instruction buffer is introduced to balance the memory bandwidth
mismatch between the size of the fetch bundle and the size of the execution bundle. The structure
of the instruction buffer is a circular buffer […] as shown in Figure ….
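The interplay between a constant fetch bundle and variable execution bundles can be made concrete
with a small Python model. It is a sketch under stated assumptions, not the align unit design: a
fetch bundle of 4 words, a 16-word buffer, one fetch and at most one issued bundle per cycle, and
an end-of-bundle flag per word standing in for the alignment information.

```python
from collections import deque

FETCH_BUNDLE = 4  # instruction words fetched per cycle (assumed example value)


def simulate(bundle_sizes, buffer_words=16):
    """Return (cycles, stall_cycles) needed to issue all execution bundles."""
    if not bundle_sizes:
        return 0, 0
    if min(bundle_sizes) < 1 or max(bundle_sizes) > buffer_words - FETCH_BUNDLE + 1:
        raise ValueError("bundle size not supported by this buffer model")
    # One flag per instruction word: True marks the last word of a bundle
    # (a hypothetical stand-in for the alignment information).
    words = [i == size - 1 for size in bundle_sizes for i in range(size)]
    buffer, fetched, issued, cycles, stalls = deque(), 0, 0, 0, 0
    while issued < len(bundle_sizes):
        # Fetch side: store the next fetch bundle if it fits into the buffer.
        if fetched < len(words) and len(buffer) + FETCH_BUNDLE <= buffer_words:
            chunk = words[fetched:fetched + FETCH_BUNDLE]
            buffer.extend(chunk)
            fetched += len(chunk)
        # Issue side: a bundle is issued only when all of its words are present.
        if True in buffer:
            while not buffer.popleft():
                pass
            issued += 1
        else:
            stalls += 1
        cycles += 1
    return cycles, stalls


print(simulate([1, 1, 1, 1]))  # control-style code: small bundles
print(simulate([2, 2, 10]))    # a wide bundle after two small ones
```

In this model a ten-word bundle issued from an empty buffer costs two stall cycles, whereas the
same bundle after two small bundles costs only one, because words fetched ahead are already
waiting in the buffer.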
The fetch bundle will be stored from one side into the instruction buffer; on the other side the
execution bundle will be built up of the already fetched instruction words. A second advantage of
the instruction buffer is the possibility to handle loops inside the buffer without any additional
fetch cycles to the program memory, which significantly reduces the power dissipation at the
program memory ports. Effectively the instruction buffer functions as a small instruction cache.
[Figure labels: PRAM[FC], PRAM[FC+1], PRAM[FC+2], PRAM[FC+3]; write enable 0, write enable 1,
write enable n.]
Figure …: Instruction Buffer
To control the content of the instruction buffer an additional signal is required, the write
enable, which allows overwriting a buffer cell. If a hardware loop instruction has been detected
during the decode stage, the content of the buffer will be interlocked. Loops handled in software,
e.g. while loops, cannot be automatically detected. For this purpose an instruction is available
which can be used to interlock the instruction buffer manually.
The minimum size of the instruction buffer is the worst case size that an execution bundle can
consume spread over the fetch bundle size (see Figure …).
[Figure labels: bundle n; bundle n+1.]
Figure …: Cache Size
In this example architecture the value is … fetch bundles, which can always contain a worst-case
execution bundle. Anyway it is recommended to use a larger instruction buffer, which enables the
handling of loops inside the core and therefore reduces the power dissipation at the program
memory ports.
Instruction Mapping
The instructions can be assigned to three operation classes (Computation, Load/Store, Branch). The
instruction mapping unit is responsible to assign the instruction words to the correct decoder
port.
From the implementation point of view a big multiplexer is necessary to connect each buffer entry
to each decoder port. Any sequential assignment, possible with less area consumption, is not
feasible due to performance aspects. In Figure … an example for the architecture is illustrated: …
instruction words have to be analyzed and up to … instruction words will be forwarded to the
decoder ports.
[Figure labels: instruction buffer (16x20); instruction address (10x4); decoder unit (10x20).]
Figure …: Instruction Mapping
Control Unit
This section will be used to explain the control mechanism used to handle the data flow from the
program memory port to the decoder ports.
The instruction buffer contains several execution bundles; that is a typical situation after some
control code sections where the size of the execution bundle is smaller than the size of the fetch
bundle. Four fetch bundles will be analyzed considering where the next execution bundle is located
and how many instruction words are included.
[Figure labels: current instruction bundle; next instruction bundle start; bundle n+1 to bundle
n+4.]
Figure …: Execution Bundle inside the Instruction Buffer
In Figure … the pointer indicates the start position of the next execution bundle inside the
instruction buffer. The alignment information, available in each instruction word, will be used to
identify the affiliation to a specific execution bundle. To increase the performance of the
implementation the functionality will be split into several parts. A well suited splitting point
can be the fetch bundle border. In the example architecture this gives an analysis area of four
instructions.
The four instructions will be analyzed and the results of all separate sections will be summed up
to a global result. In each section a valid end signature will be searched for. A valid end can be
found if a new execution bundle will start in this section. Considering from the next instruction
bundle start (see Figure …) a valid end signature must be found inside the instruction buffer. In
Figure …, even in the same split buffer this condition is true. If the condition is true, the
execution bundle is already completely available inside the instruction buffer and can be
executed.
Figure […]: Split Buffer Content
If the condition is not true, the align unit has to consider one or more stall cycles until the whole execution bundle is available. This should not happen during linear program flow. However, if an interrupt occurs or a branch takes place, this kind of stall cycles are possible.
Figure […]: Branch Target Optimization
An optimizing linker [5] can align branch targets or loop bodies to prevent stall cycles. As in Figure […], an execution bundle consisting of two instruction words (gray blocks) is split over two fetch bundles. If this execution bundle is used as a branch target, a stall cycle will be necessary to fetch the second fetch bundle before the execution bundle is completely available to be issued to the decoder ports. Aligning the branch targets (indicated in Figure […] with a small arrow on the left side) makes the stall cycle unnecessary. With the fetch bundle of the branch target address the execution bundle is available inside the instruction buffer.
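The alignment decision described above can be sketched in a few lines of Python. This is a minimal illustration, not the linker's actual algorithm: the function names are hypothetical, addresses are counted in instruction words, and the fetch bundle size of four words is taken from the four-instruction analysis area of the example architecture.

```python
# Sketch only: names and the 4-word fetch bundle are illustrative assumptions.
FETCH_BUNDLE_WORDS = 4  # the example architecture analyses four instructions per fetch bundle


def crosses_fetch_border(start_word: int, bundle_len: int,
                         fetch_words: int = FETCH_BUNDLE_WORDS) -> bool:
    """Return True if the bundle occupies more than one fetch bundle."""
    first = start_word // fetch_words
    last = (start_word + bundle_len - 1) // fetch_words
    return first != last


def aligned_target(start_word: int, bundle_len: int,
                   fetch_words: int = FETCH_BUNDLE_WORDS) -> int:
    """Word address a linker could choose for a branch target bundle.

    A bundle that already sits inside one fetch bundle stays where it is;
    otherwise it is moved up to the next fetch bundle border.
    """
    if not crosses_fetch_border(start_word, bundle_len, fetch_words):
        return start_word
    return -(-start_word // fetch_words) * fetch_words  # round up to a border


print(crosses_fetch_border(3, 2))  # two-word bundle starting in the last slot
print(aligned_target(3, 2))
```

For a two-word bundle starting at word address 3, the sketch reports a border crossing and proposes word address 4, which corresponds to the situation where the stall cycle at the branch target becomes unnecessary.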
Each instruction word contains information about the operation class it belongs to. This information will be used for predecoding, which enables the assignment of the instruction to the related decoder port. The predecoding and the alignment allow the usage of unit-specific decoder structures and therefore a leaner decoder architecture.
The only exceptions are the long words, which contain alignment information but have no operation class assignment. To overcome the problem of the long words, already during the fetch phase the long words will be detected and stored inside the instruction buffer together with the associated instruction. Due to the predecoding of the long words, the mapping unit has not to take care of the long words. If the long word indication of an instruction word is set, the instruction word will be mapped to the long word port of the decoder unit.
Additional design complexity for the mapping control is introduced if the data paths of the operand groups are not orthogonal. E.g. there are two computational units available. Each can, for example, be used to calculate MAC instructions or ADD operations, but shift operations can take place only on the right one. In this case this kind of irregularity has to be considered already during the instruction mapping.
Conclusion
The xLIW (Scaleable Long Instruction Word) concept enables the full utilization of the parallelism of execution units provided in VLIW architectures. Instruction words will only be issued to the execution units that will be used at a certain cycle, and no NOP instructions are necessary as placeholder. Therefore an optimal code density for VLIW architectures can be reached without limitations concerning the usage of the available computational power. The instruction buffer included can be used to significantly reduce the power dissipation at the memory ports during loop handling. The xLIW concept is part of a development project for a configurable DSP.
References
[1] P. Lapsley, J. Bier, A. Shoham and E. A. Lee, DSP Processor Fundamentals, Architectures and Features, IEEE Press, New York.
[2] D. Sima, T. Fountain, P. Kacsuk, Advanced Computer Architectures: A Design Space Approach, Addison Wesley Publishing Company.
[3] J. Eyre and J. Bier, The evolution of DSP processors, IEEE Signal Processing Magazine, vol. […], no. […].
[4] J. L. Hennessy and D. A. Patterson, Computer Architecture. A Quantitative Approach, Morgan Kaufmann Publishers, San Mateo, CA.
[5] J. R. Levine, Linkers and Loaders, Morgan Kaufmann Publishers, San Francisco, CA.
PUBLICATION 3
C. Panis, R. Leitner, H. Grünbacher, J. Nurmi, “Align Unit for a Configurable DSP Core”,
in Proceedings on the IASTED International Conference on Circuits, Signals and Systems
(CSS 2003), Cancun, Mexico, May 19-21, 2003, pp. 247-252.
©2003 IASTED. Reprinted, with permission, from proceedings of the IASTED
International Conference on Circuits, Signals and Systems.
Proceedings of the IASTED International Conference
CIRCUITS, SIGNALS, AND SYSTEMS
May 19-21, 2003, Cancun, Mexico
ALIGN UNIT FOR A CONFIGURABLE DSP CORE
Christian Panis
Carinthian Tech Institute
A-9524 Villach
Austria
Raimund Leitner
Infineon Technologies
A-9500 Villach
Austria
Herbert Gruenbacher
Carinthian Tech Institute
A-9524 Villach
Austria
Jari Nurmi
Tampere University of Technology
FIN-33101 Tampere
Finland
ABSTRACT
Increasing system complexity of SOC applications leads to an increasing demand for powerful embedded DSP processors. To increase the performance of DSP processors, the number of instructions executed in parallel has been increased. To program the parallel units, VLIW (Very Long Instruction Word) has been introduced. Programming the parallel units at the same time leads either to an expanded program memory port or to the limitation that only a few units can be used in parallel. Traditional VLIW architectures feature poor code density and therefore high area consumption of the program memory. To overcome this limitation, the paper describes the align unit, which allows using unaligned program memory without any limitations on the core performance. The architecture, some implementation details and the influence of the align unit on area consumption and power dissipation are discussed in this paper. The align unit is part of a development project for a configurable DSP.
KEY WORDS
VLIW, unaligned program memory, configurable DSP, align unit
1. INTRODUCTION
The increasing requirements on computational power of embedded processors have led to deeper pipeline structures and to an increase of parallel execution units. To program these parallel units the VLIW (Very Long Instruction Word) [1] is used, which is built up of several instruction words. VLIW architectures have no hardware support such as scoreboards [2] (common in superscalar processor architectures) for the resolution of data dependencies between the instructions. The VLIW is fetched, decoded and then issued to the parallel execution units. In traditional VLIW architectures, a high number of parallel units therefore leads to a wider program memory port. To overcome this problem, the paper describes the architecture of the align unit. The align unit enables the usage of unaligned program memory to increase the code density and therefore to reduce the area consumption and power dissipation caused by the program memory.
In the first part of the paper different DSP concepts are discussed with focus on the micro-architecture of the program fetch unit. The second part explains the architecture of the align unit in detail and analyzes the influence of loop handling and of the execution of branches and interrupt service routines on the align unit. At the end some realization results are discussed.
2. INSTRUCTION ALIGNMENT
This section explains the motivation for using unaligned program memory. Available DSP concepts are discussed with focus on the micro-architecture of the instruction fetch unit. In the last subsection an example architecture is briefly introduced to explain the function of the proposed align unit.
2.1. Single Instruction Execution
For DSP architectures that fetch and execute one instruction per cycle (e.g. the OAK DSP from DSP Group), the memory port to the program memory has the same size as the native instruction word. Figure 1 shows the OAKDSPCore from DSP Group, Inc., which supports a 16-bit address bus PAB and a unidirectional data bus PDB, also 16 bits wide [3]. One instruction per cycle is fetched and then executed. The direct memory architecture of the OAKDSPCore allows encoding of two data memory fetch operations and the arithmetic operation within one instruction word (CISC instruction set).
Therefore, even for MAC (Multiply and Accumulate) operations only one instruction word is needed.
Figure 1: OAKDSPCore
2.2. VLIW Instruction Word
The increasing complexity of the applications using DSPs requires more powerful DSP architectures. One way to increase the performance of DSP architectures is to increase the number of instructions executed in parallel. A VLIW architecture consists of multiple execution units performing multiple instructions in parallel.
Figure 2: VLIW TI C62x
In Figure 2 the C62x from Texas Instruments is used as an example. With the C62x, 8 instruction words (32 bits each) can be fetched each clock cycle. Because of data dependencies between the instructions of the application code, it is not possible to use every unit in parallel in each cycle. The unused instruction space inside the VLIW has to be filled with NOP instruction words. This example illustrates the problem of traditional VLIW architectures: the code density of the application is quite poor due to the NOP instructions. On the other hand, reducing the number of instruction words fetched per cycle leads to a reduced use of the available data paths.
The C62x of Texas Instruments provides the possibility to split the fetch bundle (the instructions that are fetched during the same clock cycle) into execution packages (EP) [4]. One fetch bundle can contain several execution bundles, which reduces the need for NOP instructions to align the execution bundle at the border of the fetch bundle.
The Carmel DSP of Infineon solves the problem with CLIW, a Configurable Long Instruction Word [5]. The program memory port is two native instruction words wide (2 times 24 bits). If more than two instructions have to be executed in parallel, an extended program memory port is available, providing an additional 96 bits for instruction coding.
In the SC140 StarCore a prefix and a serial grouping mechanism are available to identify the instructions executing in parallel during the same clock cycle [6].
2.3. xLIW Instruction Word
The align unit described in this paper is part of a configurable DSP concept. Instead of a traditional VLIW architecture, xLIW, a scaleable long instruction word, is used. The native instruction word size is 20 bits, and long offsets or immediate values are coded in an additional 20-bit word. Each clock cycle four instruction words can be fetched; the size of the fetch bundle is therefore 80 bits, as illustrated in Figure 3.
Figure 3: Fetch Bundle
The size of the execution bundle (the execution bundle contains the instructions executed in parallel) depends on the number of available units and on their possible usage (which depends on the executed algorithm). In the implemented version five parallel units are available, and an execution bundle can therefore have a size of one up to five instruction words. Figure 4 shows examples of possible execution bundles. The first one consists of only one instruction word; the second one assumes the maximum number of instructions executed in parallel, each of them containing an additional word for offsets or immediate values.
Figure 4: Execution Bundle
To increase the code density, unaligned program memory is supported, which means that the beginning of an execution bundle does not have to coincide with the beginning of a fetch bundle. An execution bundle with more than four instruction words has to be spread over several fetch bundles. In the example of Figure 5 the execution bundle consists of 10 instruction words (the grey fields), which are spread over four fetch bundles (the rows).
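The bundle sizes quoted in this subsection follow from simple arithmetic, which the following Python sketch spells out. The constants come from the text (20-bit words, four words per fetch bundle, five parallel units); the function names and the representation of an execution bundle as a list of flags are illustrative assumptions.

```python
# Sketch only: helper names and the flag-list representation are assumptions.
WORD_BITS = 20           # native instruction word size
FETCH_BUNDLE_WORDS = 4   # four instruction words are fetched per clock cycle
MAX_UNITS = 5            # parallel units in the implemented version


def bundle_words(long_flags):
    """Number of 20-bit words in an execution bundle.

    long_flags holds one bool per instruction; True means the instruction
    carries an additional word for a long offset or immediate value.
    """
    if not 1 <= len(long_flags) <= MAX_UNITS:
        raise ValueError("an execution bundle holds one to five instructions")
    return len(long_flags) + sum(long_flags)


def fetch_bundles_spanned(start_offset, n_words):
    """Fetch bundles touched by a bundle whose first word sits at
    position start_offset (0..3) inside its fetch bundle."""
    return (start_offset + n_words + FETCH_BUNDLE_WORDS - 1) // FETCH_BUNDLE_WORDS


smallest = bundle_words([False])
largest = bundle_words([True] * MAX_UNITS)
print(smallest * WORD_BITS, largest * WORD_BITS)
print([fetch_bundles_spanned(off, largest) for off in range(FETCH_BUNDLE_WORDS)])
```

Under these assumptions the largest bundle has 10 words (200 bits) and touches three or four fetch bundles depending on its start position; the four-bundle case arises when it starts in the last slot of a fetch bundle.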
Figure 5: Unaligned Program Memory
VLIW architectures do not contain hardware support for resolving data dependencies between the different instructions, and the scheduling is therefore done in software (e.g. by the compiler) [7]. Before the instructions fetched from the program memory can be executed, the execution bundle has to be aligned. This is done by the align unit explained in the following section.
3. ALIGN UNIT
This section illustrates the architecture of the align unit. The align unit consists of a fetch unit with an instruction buffer, used for fetching instruction words from the program memory and storing them into the buffer. The align phase is used to set up the execution bundle and to map the instruction words to the decoder ports. Additionally, the impact of hardware loops, branches and interrupts on the architecture of the align unit is discussed.
3.1. Pipeline structure
The proposed DSP architecture has a RISC-like pipeline based on three phases: instruction fetch, decode and execute. The three phases can be split over several clock cycles to increase the reachable core frequency. The align unit is part of the instruction fetch phase, as illustrated in Figure 6.
Figure 6: Pipeline Structure
In the example architecture, the instruction fetch from the program memory and the align process are each executed during one clock cycle.
3.2. Fetch phase
Each fetch bundle contains four instruction words, which are fetched from one physical memory address. The fetched values are stored in the instruction buffer. The instruction buffer is organized as a circular buffer, built up of several buffer cells as shown in Figure 7.
Figure 7: Buffer Cell
The four instruction words are stored in the buffer cell, and the related physical address is stored in the associated address field. The address field contains two additional bits, the executed bit E and the valid bit V. Each time a buffer cell is loaded with instructions from the program memory, the valid bit is set to one, which indicates that the buffer cell contains fetched data. During the align phase, the executed bit is set as soon as the buffer entry has been used for execution. If there is no space left inside the instruction buffer for storing further program data, the buffer cells with an executed bit set to one are allowed to be overwritten. Details and exceptions of this rule are discussed in subsections 3.5 and 3.6.
Figure 8 shows an example of an instruction buffer. It is built up of buffer cells as shown in Figure 7. A circulating fill pointer FP is used to assign the next free buffer cell. The parallel word detection already takes place during the fetch phase. As pointed out in subsection 2.3, some of the instructions can have an additional parallel word associated with them. Such long instructions have to be decoded with the additional word included. Therefore, a first pre-decoding takes place while the fetched instruction words are stored in the buffer cell, to detect parallel words. If an instruction has been identified as containing a parallel word, an indication bit beside the instruction word in the instruction buffer is activated. This indication bit is used during the mapping of the instruction words to the decoder ports, which takes place at the end of the align phase.
At the beginning of the fetch phase the fetch counter is compared with all physical address entries stored inside the instruction buffer (only entries with an activated valid bit are considered). If the comparison gives a positive result, the fetch bundle is already stored inside the instruction buffer.
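The fetch-phase bookkeeping described in this subsection (buffer cells with an address field, a valid bit V, an executed bit E and a circulating fill pointer FP) can be modelled behaviourally as below. This is a software sketch under stated assumptions, not the VHDL implementation: the class and method names are hypothetical, the address is the fetch bundle address, and the rule for picking the next cell (first cell from FP onward that is empty or already executed) is one plausible reading of the text.

```python
# Behavioural sketch only; names and the cell selection policy are assumptions.
from dataclasses import dataclass, field
from typing import List, Optional

WORDS_PER_CELL = 4  # one fetch bundle per buffer cell


@dataclass
class BufferCell:
    address: int = 0
    words: List[int] = field(default_factory=lambda: [0] * WORDS_PER_CELL)
    valid: bool = False     # V: cell contains fetched data
    executed: bool = False  # E: cell has been used and may be overwritten


class InstructionBuffer:
    def __init__(self, n_cells: int = 8):
        self.cells = [BufferCell() for _ in range(n_cells)]
        self.fp = 0  # fill pointer FP, circulating over the cells

    def lookup(self, address: int) -> Optional[BufferCell]:
        """Compare the fetch counter with all valid address entries."""
        for cell in self.cells:
            if cell.valid and cell.address == address:
                return cell
        return None

    def fill(self, address: int, words: List[int]) -> bool:
        """Store a fetch bundle in the next empty or executed cell."""
        n = len(self.cells)
        for step in range(n):
            i = (self.fp + step) % n
            if not self.cells[i].valid or self.cells[i].executed:
                self.cells[i] = BufferCell(address, list(words), True, False)
                self.fp = (i + 1) % n
                return True
        return False  # no cell may be overwritten yet

    def fetch(self, address: int, memory: dict) -> str:
        """Return 'hit' (no memory access), 'fetched' or 'full'."""
        if self.lookup(address) is not None:
            return "hit"
        return "fetched" if self.fill(address, memory[address]) else "full"


memory = {0: [1, 2, 3, 4], 1: [5, 6, 7, 8], 2: [9, 10, 11, 12]}
buf = InstructionBuffer(n_cells=2)
print(buf.fetch(0, memory), buf.fetch(0, memory), buf.fetch(1, memory))
print(buf.fetch(2, memory))   # both cells valid and not yet executed
buf.cells[0].executed = True  # the align phase has consumed cell 0
print(buf.fetch(2, memory))
```

A "hit" corresponds to the case in which the fetch bundle is already stored inside the instruction buffer, so that program memory does not have to be accessed.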
When the fetch bundle is found inside the instruction buffer, the fetch from program memory is disabled, which reduces the power dissipation at the program memory port. Especially during loop execution this can be used to reduce the traffic at the program memory port considerably and therefore to reduce the power dissipation. Details concerning loop handling are discussed in subsection 3.5.
Figure 8: Buffer Fill
3.3. Align phase
Figure 9 illustrates the read process from the instruction buffer. As already pointed out, the VLIW does not provide hardware support such as scoreboards to resolve data dependencies between instructions. In subsection 2.3 it has been mentioned that an execution bundle can be split over as many as four fetch bundles. Therefore, if the execution bundle is spread over several fetch bundles and the instructions are not already stored inside the instruction buffer, stall cycles are necessary to fetch the instructions needed to build up the execution bundle. The program counter (pointing to the address of the next instruction) and the consecutive addresses are analyzed to determine whether the instructions of the execution bundle are already inside the buffer. The two lower bits of the program counter are not used for the comparison with the physical address; they are used to address the single instruction words inside the buffer cell. Besides the comparison with the reduced program counter, the consecutive addresses are compared, that is, program counter plus 4, 8 and 12 (or reduced program counter +1, +2 and +3).
Figure 9: Buffer Read
The available buffer cells are read from the instruction buffer and used for further analysis. Missing addresses are fetched in the next clock cycle. In Figure 10 the consecutive instruction words have been put together into an analysis bundle, which is used to identify the next execution bundle. Each of the blocks in Figure 10 corresponds to an instruction word. The p (for parallel) or s (for sequential) inside the blocks indicates whether the instruction word will be executed in parallel with the leading instruction word or separately in the next clock cycle.
Figure 10: Analysis Bundle
Due to hardware realization constraints (a serial search of the cells would take too long), the check for the end of the execution bundle is done in parallel for subsections. In Figure 11 each buffer cell is analyzed separately. Starting from the program counter address (indicated in Figure 11 by the grey arrow), the next valid end (indicated by the arrows below the instruction words) is identified. The analysis of the separate buffer cells takes place in parallel. Because at most 10 instruction words build up one execution bundle, an analysis bundle of 16 entries guarantees one valid end inside the analysis bundle.
Figure 11: Valid End Detection (analysis bundle of 16 instruction words marked p p s p, p p p p, s s p p, s p p s)
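The valid-end search can be illustrated with a small Python sketch. It rests on an interpretation that the text does not state explicitly and that is therefore an assumption: a word marked s is taken to be the last word of an execution bundle, so every s is a candidate valid end. The function names are hypothetical, and the per-cell step is written as a list comprehension only to mirror the fact that the hardware analyzes the cells in parallel.

```python
# Sketch only: 's' is assumed to mark the last word of an execution bundle.
WORDS_PER_CELL = 4


def valid_ends_per_cell(marks):
    """For each 4-word buffer cell, list the offsets of words marked 's'.

    Every cell is inspected independently of the others, which is what
    allows the hardware to do this step for all cells at the same time.
    """
    cells = [marks[i:i + WORDS_PER_CELL]
             for i in range(0, len(marks), WORDS_PER_CELL)]
    return [[k for k, m in enumerate(cell) if m == "s"] for cell in cells]


def next_valid_end(marks, pc_offset):
    """Position of the first valid end at or after pc_offset.

    Returns None if the available cells contain no such end, in which case
    a stall cycle is needed to fetch further instruction words.
    """
    for cell_idx, ends in enumerate(valid_ends_per_cell(marks)):
        for k in ends:
            position = cell_idx * WORDS_PER_CELL + k
            if position >= pc_offset:
                return position
    return None


analysis_bundle = "p p s p p p p p s s p p s p p s".split()
print(next_valid_end(analysis_bundle, 0))
print(next_valid_end(analysis_bundle, 3))
print(next_valid_end(analysis_bundle[:8], 3))  # only two cells available
```

With the full 16-word analysis bundle a valid end is found for either start position. If only the first two cells are available and the program counter points past the first s, the search returns None, which models the stall case.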
If a valid end has been detected, the next execution bundle is already available and can be forwarded to the decoder ports. The proposed instruction set can be split into three operation classes: computation, load/store and program flow operations. The first three bits of each instruction word encode the execution bundle information and the affiliation to the operation class. The operation class of the instruction word is analyzed at the end of the align phase and the instruction is mapped to the corresponding decoder port. Each decoder port is dedicated to a certain operation class. This reduces the overhead of the instruction decoder and allows a decoder port to be removed when the related data path is no longer available (considering that the align unit is part of a configurable DSP concept).
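The assignment of instruction words to class-specific decoder ports can be sketched as follows. The port configuration (two load/store ports, two computation ports and one program flow port) is an assumption about the five-unit version of the core, the bit-level encoding of the operation class is deliberately left out, and all names are hypothetical.

```python
# Sketch only: port counts and names are assumptions, not the core's netlist.
from collections import defaultdict

DECODER_PORTS = {"load/store": 2, "computation": 2, "program flow": 1}


def map_bundle(bundle):
    """Assign each (op_class, word) of an execution bundle to a decoder port.

    Words may appear in any order; each one goes to the next free port of
    its own operation class. A bundle that needs more ports of a class than
    the configuration offers is rejected.
    """
    used = defaultdict(int)
    mapping = {}
    for op_class, word in bundle:
        if op_class not in DECODER_PORTS:
            raise ValueError(f"unknown operation class: {op_class}")
        index = used[op_class]
        if index >= DECODER_PORTS[op_class]:
            raise ValueError(f"no free decoder port for class: {op_class}")
        used[op_class] += 1
        mapping[f"{op_class}[{index}]"] = word
    return mapping


bundle = [("computation", 0x11), ("load/store", 0x22), ("computation", 0x33)]
print(map_bundle(bundle))
```

Removing a data path from the configuration then amounts to lowering the corresponding entry in `DECODER_PORTS`, which mirrors the remark that a decoder port can be dropped together with its data path.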
Figure 12: No Valid End
Figure 12 illustrates an analysis bundle for which stall cycles become necessary. The analysis bundle is the same as the example in Figure 11, but not all four consecutive fetch addresses are available. The analysis gives a valid end, but this valid end is located before the actual program counter and is therefore not relevant. The missing valid end in the analysis bundle leads to a stall cycle.
Linear program flow code can be scheduled to prevent stall cycles due to missing instruction words. Branches are a little more critical. The compiler (or the programmer, if coded manually) has to take care that the branch target fits into one fetch bundle to prevent stall cycles. This can be achieved either by short execution bundles at the branch target or by aligning the branch target to the start address of a fetch bundle.
If the instruction buffer offers enough entries to hold the loop body, the alignment of the loop is not important. Executing the loop for the first time will possibly lead to a stall cycle, but further executions are handled directly from the instruction buffer with no stall cycles necessary.
3.4. Mapping
The execution bundle has been identified and has to be mapped to the decoder ports. During the fetch and align phases the operation class and any parallel words have been detected. With this information, the instructions of the execution bundle are mapped to the related decoder ports. Each of the available decoder ports decodes only instructions of one operation class. Parallel words containing long immediate values and offsets are mapped directly to the execution units. The order of the instruction words inside the execution bundle is not restricted. Figure 13 illustrates the instruction mapping: the decoder port is chosen depending on the operation class.
Figure 13: Instruction Mapping
3.5. Loop handling
The instruction buffer can be used to store the instruction words of the loop body inside the core, so that no further fetch operations from the program memory are necessary during loop execution. The proposed DSP concept supports the execution of single instruction cycle loops (repeat) and of loops consisting of several execution bundles (block repeat).
When a repeat loop is executed, the execution bundle that is executed multiple times is fetched once from the program memory. Further execution cycles take the instruction words from the entries of the instruction buffer. During loop execution, the fetch unit fetches consecutive instruction words into the instruction buffer to prevent stall cycles after the loop execution. To prevent loop body entries in the instruction buffer from being overwritten during loop execution, the executed bit of already executed instructions is not set to one. When the loop is executed for the last time (indicated by a hardware loop counter equal to zero), the executed bits of the instruction buffer entries are set to one (as in linear program flow) and the buffer cells can be filled with new instruction words.
When a block repeat loop is executed, the relationship between the size of the instruction buffer and the size of the loop body influences the efficient use of the buffer. If the loop body fits into the entries of the instruction buffer, the handling is similar to the handling of a repeat loop: the loop body is fetched once from program memory, stored into the instruction buffer and then executed multiple times. During the execution of the loop body the fetch counter is enabled to fill free buffer cells of the instruction buffer.
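The executed-bit rule for hardware loops and its effect on program memory traffic can be summarized in a short Python sketch. Both functions are illustrative assumptions: the first restates the rule for when a used buffer cell may be released, the second is a deliberately coarse estimate that ignores the prefetch of instructions following the loop and assumes that a loop body which does not fit is fetched again in every iteration.

```python
# Sketch only: a coarse model of the loop rules, not a cycle-accurate one.
def executed_bit_after_use(in_hw_loop: bool, loop_fits: bool,
                           last_iteration: bool) -> bool:
    """Value of the executed bit E after a buffer cell has been used.

    Inside a hardware loop that fits into the buffer, E stays at zero until
    the last iteration so that the loop body cannot be overwritten.
    """
    if in_hw_loop and loop_fits and not last_iteration:
        return False
    return True


def loop_fetch_estimate(body_cells: int, iterations: int,
                        buffer_cells: int) -> int:
    """Rough number of fetch bundles read from program memory for one loop."""
    if body_cells <= buffer_cells:
        return body_cells           # fetched once, then replayed from the buffer
    return body_cells * iterations  # simplification: refetched every iteration


print(executed_bit_after_use(True, True, False))
print(loop_fetch_estimate(3, 100, 8), loop_fetch_estimate(10, 100, 8))
```

The two estimates printed at the end contrast a three-cell loop body that fits into an eight-cell buffer with a ten-cell body that does not, for 100 iterations each.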
Before the instruction buffer entries are filled with instructions of the loop body, all executed bits of the instruction buffer entries are set to one. This prevents already fetched entries that are not part of the loop body from occupying entries of the instruction buffer. Again, the executed bits of the loop body are not set to one until the loop body is executed for the last time. If the loop body does not fit into the instruction buffer, the normal loop handling is deactivated: after instructions of the loop body have been executed, the executed bits are set to one. The only difference to linear program flow is that the fetch counter is bound to the program counter, which means that no instructions from outside the loop body are fetched during loop execution. Figure 14 illustrates the program flow for handling a block repeat loop (bkrep).
Figure 14: Bkrep Program Flow
A while loop is implemented with a conditional branch instruction. The hardware loop counter cannot be used, and the loop body cannot automatically be handled inside the instruction buffer. Therefore, instructions are available to control the behavior of the instruction buffer manually. To prevent unintended malfunction during manual control of the buffer content, automatic control features are included.
3.6. Branches, Interrupt handling
If a branch instruction is executed and the branch is taken, the executed bits of the instruction buffer entries are set to one. The fetch unit then starts filling the buffer cells with instruction words of the branch target. To prevent stall cycles, the branch target should fit into the fetch bundle.
The handling of interrupts is similar to the behavior of unconditional branches. The branch target address is provided by the ICU (interrupt control unit). The executed bits of the already fetched entries are set to one and the instruction buffer is filled with entries of the ISR (interrupt service routine).
4. RESULTS
The align unit (including the fetch unit) has been described in VHDL-RTL. The chosen instruction buffer has 8 buffer cells and allows the storage of 32 instruction words. The simulation results have shown that most of the inner loops that are typically part of classical DSP functions can be handled with that instruction buffer size. The synthesis results in 0.13 µm technology give an estimated size of 0.15 mm². For dedicated benchmark examples the use of the instruction buffer leads to 60% less switching activity at the program memory port and therefore to reduced power dissipation. However, the absolute numbers in this section have to be considered carefully due to their strong dependency on the chosen benchmark examples.
5. CONCLUSION
The align unit enables the usage of multiple data paths in parallel without the drawback of wasting program memory (which is a problem of traditional VLIW architectures). The instruction buffer inside the align unit is used to reduce the width of the program memory port. Loops can be handled inside the instruction buffer, which reduces the fetch activity at the program memory port and therefore the power dissipation. The unaligned program memory increases the system code density of the VLIW architecture. The align unit is part of a development project for a configurable DSP.
REFERENCES
[1] P. Lapsley, J. Bier, A. Shoham and E. A. Lee, DSP Processor Fundamentals, Architectures and Features (New York, IEEE Press, 1997).
[2] Dezso Sima, Terence Fountain, Peter Kacsuk, Advanced Computer Architectures: A Design Space Approach (Harlow, Addison Wesley Publishing Company, 1997).
[3] Siemens, OAK DSP Core, Programmers Reference Manual (Siemens AG, 01.98).
[4] Texas Instruments, TMS320C6000 CPU and Instruction Set Reference Guide (Texas Instruments, 10.2000).
[5] Infineon Technologies, Carmel DSP Core Architecture Specification (Infineon Technologies, 2001).
[6] Motorola, SC140 DSP Core Reference Manual (Motorola, Rev. 0, 12/99).
[7] J. L. Hennessy, D. A. Patterson, Computer Architecture. A Quantitative Approach (San Mateo CA, Morgan Kaufmann Publishers, 1996).
PUBLICATION 4
C. Panis, M. Bramberger, H. Grünbacher, J. Nurmi, “A Scaleable Instruction Buffer for a
Configurable DSP Core”, in Proceedings of 29th European Solid State Conference
(ESSCIRC 2003), Estoril, Portugal, September 16-18, 2003, pp. 49-52.
©2003 IEEE. Reprinted, with permission, from proceedings of 29th European Solid State
Conference.
A Scaleable Instruction Buffer for a
Configurable DSP Core
Christian Panis¹, Michael Bramberger², Herbert Grünbacher¹, Jari Nurmi³
¹ Carinthian Tech Institute, Europastrasse 4, A-9524 Villach, Austria
² Infineon Technologies, Siemensstrasse 2, A-9500 Villach, Austria
³ Tampere University of Technology, P.O. Box 553, FIN-33101 Tampere, Finland
Abstract:
Increasing system complexity of SOC applications leads to an increasing demand for powerful embedded DSP processors. To increase the performance of DSP processors, the number of instructions executed in parallel has been increased. To program the parallel units, VLIW (Very Long Instruction Word) has been introduced. Traditional VLIW architectures feature poor code density and therefore high area consumption caused by the program memory. To overcome this limitation, the proposed configurable DSP core supports unaligned program memory. To reduce the size of the program memory port, an execution bundle can be mapped onto several fetch bundles. To overcome the memory bandwidth mismatch between fetch and execution bundle, an instruction buffer is introduced. Using the instruction buffer during execution of inner loops, the power dissipation of the DSP subsystem can be reduced. Cache logic is used to control the entries of the instruction buffer during out-of-order execution. This paper describes the architecture and the implementation of the instruction buffer. The instruction buffer is part of a project for a configurable DSP core.
1. Introduction
Increasing system complexity of SOC applications leads to a strong demand for powerful embedded processors. To increase the performance of embedded processors, the number of pipeline stages is increased to reach higher clock frequencies, and the number of instructions executed in parallel is increased to gain higher system performance. To program the available parallel units, VLIW has been introduced [1]. The drawback of traditional VLIW architectures is an increase of the program memory and therefore a poor code density [2]. To overcome this problem, available DSP architectures decouple fetch and execution bundle. The size of the fetch bundle (and therefore the size of the program memory port) is equal to the size of the maximum possible execution bundle. xLIW [3], a scaleable long instruction word, supports a reduced program memory port size. The size of the fetch bundle is constant; the size of the execution bundle can be different each cycle (depending on application-specific requirements). One execution bundle can be mapped onto several fetch bundles. To prevent stall cycles due to an incomplete execution bundle, an instruction buffer is introduced. The entries of the instruction buffer are controlled by cache logic, so that the advantages of the instruction buffer can also be used during out-of-order execution.
Loop constructs are a typical feature of DSP algorithms. Therefore the proposed DSP architecture supports zero-overhead loop instructions. Loop handling, such as decrementing the loop counter, is done by hardware and does not require further instructions. Using the instruction buffer during loop handling reduces the number of program memory fetch cycles and therefore reduces power dissipation. The loop is fetched once from memory and then executed from the instruction buffer.
This paper describes the architecture and the implementation of the instruction buffer. The first section introduces VLIW architectures and the xLIW concept. The second part illustrates specific requirements of the proposed DSP core. The third part illustrates the architecture of the instruction buffer and is followed by a section describing implementation details.
2. Motivation
This section briefly introduces the drawback of traditional VLIW architectures concerning code density. Available solutions to overcome this problem are illustrated. At the end of this section xLIW, a scaleable long instruction word, is briefly introduced.
2.1. VLIW
Traditional VLIW architectures feature poor code density. Instruction scheduling is done in software, and therefore no hardware support such as scoreboards is available for resolving data and instruction dependencies (static scheduling). The fetch bundle (the instructions which are fetched in parallel) is fetched from program memory and the instructions are decoded. The size of the fetch bundle is equal to the maximum number of instructions executed in parallel. An increasing number of parallel data paths leads to a wide program memory port and, due to missing issue queues, to poor code density. The data dependencies inside the application code lead to a poor usage of the available data paths and, for traditional VLIW architectures, to a poor code density.
2.2. TI C62xx
A possibility to overcome this problem is to decouple
fetch and execution bundle (instructions which are
executed in parallel). The C62xx from Texas Instruments
allows several execution bundles to be mapped to one fetch
bundle. The size of the execution bundle is scaleable; this
concept is called VLES (Variable-Length Execution Set). The
program memory port is 8 instruction words wide, each
instruction 32 bits; this leads to a size of the program
memory port of 256 bits. As illustrated in Figure 1 the
fetch bundle can consist of several execution bundles,
which are executed during consecutive clock cycles. The
same figure illustrates a problem related to this
implementation. The execution bundle has to fit
completely into the size of the fetch bundle (the
execution bundles in Figure 1 are marked with n, n+1,
…).
Figure 1: VLES of TI C62xx
On the one hand this leads to a wide program memory port
(the size of the fetch bundle is equivalent to the size of
the largest possible execution bundle). On the other hand
program memory addresses still remain unused due
to the alignment requirements. If the execution
bundle does not fit completely into the remaining space of the
current fetch bundle, it has to be shifted into the next fetch
bundle. Static scheduling does not allow the order of the
execution bundles to be changed to optimize the
consumed program memory space.
2.3. TI C64xx
To overcome the drawback of unused program memory
due to alignment restrictions, the C64xx of Texas
Instruments allows mapping of parts of the execution
bundle into a fetch bundle. The remaining parts of the
execution bundle are mapped to the next fetch bundle.
For indication of the end of the execution bundle a single
bit in the instruction word is used (the same is true for
the C62xx architecture).
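The difference between the two mapping rules can be made concrete with a small counting model. The following Python sketch is only an illustration under simplifying assumptions: the function name, the example bundle sizes and the fetch width of 8 words are hypothetical, and real devices have further alignment rules that are ignored here.

```python
def fetch_bundles_needed(exec_sizes, fetch_width, allow_split):
    """Count the fetch bundles needed to store execution bundles in program order.

    exec_sizes  -- sizes of consecutive execution bundles in instruction words
    fetch_width -- size of one fetch bundle in instruction words
    allow_split -- False: an execution bundle must fit completely into one
                   fetch bundle (C62xx-like rule); True: it may continue in
                   the next fetch bundle (C64xx-like rule)
    """
    bundles = 0
    free = 0  # unused words left in the current fetch bundle
    for size in exec_sizes:
        if not allow_split:
            if size > fetch_width:
                raise ValueError("execution bundle larger than fetch bundle")
            if size > free:
                # The remaining words stay unused (padding); open a new bundle.
                bundles += 1
                free = fetch_width
            free -= size
        else:
            remaining = size
            while remaining > 0:
                if free == 0:
                    bundles += 1
                    free = fetch_width
                used = min(free, remaining)
                free -= used
                remaining -= used
    return bundles


if __name__ == "__main__":
    sizes = [5, 4, 6, 3]  # hypothetical execution bundle sizes
    print(fetch_bundles_needed(sizes, 8, allow_split=False))
    print(fetch_bundles_needed(sizes, 8, allow_split=True))
```

For this hypothetical sequence the rule without splitting leaves several words per fetch bundle unused, while splitting packs the same 18 instruction words densely.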
2.4. SC140
The Starcore SC140 supports a similar concept of
decoupling fetch and execution bundle. Instead of using
a single bit a prefix is used to indicate the size of the
execution bundle. The prefix word also contains
information e.g. for predicated execution.
The introduced concepts allow the execution bundle
to be sized to the requirements of the application code and
therefore increase the code density. The concepts still
require a wide program memory port. The size of the port
is determined by the maximum number of instructions
which can be executed in parallel.
2.5. xLIW
The proposed DSP core features a similar concept of
decoupling of fetch and execution bundle to increase the
code density compared with traditional VLIW
architectures. The main architectural features of the
proposed DSP core are configurable, which allows the
core architecture to be driven towards an application-specific
optimum concerning power dissipation and area
consumption.
To reduce the size of the program memory port without
limiting the calculation bandwidth, it is
possible to map one execution bundle onto several fetch
bundles. On average the fetch bandwidth and the required
execution bandwidth (driven by the requirements of the
application code) have to match. To compensate for a bandwidth
mismatch, which can occur in small code sections
(e.g. inner loops), an instruction buffer (with n entries) is
introduced. The buffer is filled on one side with fetch
bundles of constant width; on the other side execution
bundles of variable size are assembled.
The instruction buffer is also used to reduce
power dissipation: loops are executed from the buffer.
The instructions of the loop are fetched only once and
are executed several times without additional fetch
cycles. This reduces the switching activity at the program
memory port and therefore the power dissipation
of the DSP subsystem.
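The saving in program memory accesses for a loop can be estimated with a simple count. The Python sketch below is an illustration only: the function and parameter names are hypothetical, alignment of the loop body is ignored, and a loop that does not fit into the buffer is pessimistically assumed to be fetched completely in every iteration.

```python
import math


def program_memory_fetches(loop_words, iterations, fetch_width, buffer_words):
    """Estimate the number of fetch bundles read from program memory for a loop.

    loop_words   -- size of the loop body in instruction words
    iterations   -- number of loop iterations
    fetch_width  -- instruction words per fetch bundle
    buffer_words -- capacity of the instruction buffer in instruction words
    """
    bundles_per_pass = math.ceil(loop_words / fetch_width)
    if loop_words <= buffer_words:
        # The loop body is fetched once and then executed from the buffer.
        return bundles_per_pass
    # Assumption: without a fitting buffer every iteration is fetched again.
    return bundles_per_pass * iterations


if __name__ == "__main__":
    print(program_memory_fetches(12, 100, 4, 16))  # loop fits into the buffer
    print(program_memory_fetches(12, 100, 4, 8))   # loop does not fit
```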
3. Architectural Requirements
This section points out the requirements for
the architecture of the instruction buffer.
The requirements are scalability, fully deterministic
behaviour and power-efficient handling of loop
constructs (including, e.g., while loops, which
are not handled as hardware loops).
• Scalability: The proposed DSP core allows scaling of
most of the architectural features like the size of the
register file, the width and number of the data
paths, the instruction coding and the memory
bandwidth [4]. This is necessary to drive the DSP
core architecture to an application specific optimum
in power dissipation and area consumption. For the
architecture of the instruction buffer this means that
the instruction size and the number of possible
entries of the instruction buffer have to be
scaleable.
• Deterministic behavior: DSP applications require a
deterministic time behavior. The execution time of
a certain program has to be constant (worst case
timing has to be considered during the definition of
the system architecture; therefore a prediction
mechanism does not gain any system performance).
For the architecture of the instruction buffer this
requirement does not allow any influence on the
execution time of the program, regardless of whether the
instruction words are already in the buffer or have
to be fetched from program memory.
• Power efficient loop handling: The instruction buffer
should be used during loop handling to reduce the
power dissipation at the program memory port. The
proposed DSP architecture supports zero-overhead
hardware loops. However, the advantage of the
instruction buffer should also be used during e.g.
while loops (which are implemented by conditional
branch instructions) and branch instructions inside
loop bodies. To handle while loops, manual control
of the buffer entries is necessary. The advantage of the
instruction buffer also has to be available during
out-of-order execution.
Considering these requirements the following section
will be used to illustrate the chosen instruction buffer
architecture.
4. Instruction Buffer
This section is used to introduce the chosen architecture
of the instruction buffer mainly influenced by the
requirements introduced in section 3.
4.1. Program Memory Fetch
The instruction buffer is part of the fetch stage of the
pipeline. To fulfil the requirement concerning
deterministic behaviour, the execution of the program has
to take the same number of cycles regardless of whether
the instructions are already inside the buffer or not.
Therefore at the beginning of the fetch stage, in parallel
with the access to program memory, the content of the
instruction buffer is compared with the required
instruction words. If the data has already been
fetched before and is available inside the
instruction buffer, the fetch cycle to the program
memory is suspended. If more than one clock
cycle is reserved for the program memory fetch, this applies
to all of them. Therefore instructions already available
inside the instruction buffer do not reduce the number of
pipeline stages, but reduce the power dissipation at the
program memory port.
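The fetch decision can be summarised in a few lines of behavioural Python. This is a minimal sketch and not the hardware implementation: the names, the dictionary-based buffer and the assumed fetch latency of two cycles are illustrative assumptions. The point it shows is that a buffer hit changes the number of memory accesses, but not the number of cycles.

```python
FETCH_CYCLES = 2  # assumed number of cycles reserved for the fetch stage


def fetch(address, buffer, memory):
    """Return (words, cycles, memory_accesses) for one fetch request.

    buffer -- dict mapping addresses to already fetched instruction words
    memory -- dict modelling the program memory
    """
    if address in buffer:
        # Hit: the program memory access is suspended, the timing is unchanged.
        words = buffer[address]
        memory_accesses = 0
    else:
        # Miss: the words come from program memory, with the same timing.
        words = memory[address]
        buffer[address] = words
        memory_accesses = 1
    return words, FETCH_CYCLES, memory_accesses


if __name__ == "__main__":
    memory = {0x100: ["mac", "ld", "ld", "nop"]}
    buffer = {}
    print(fetch(0x100, buffer, memory))  # first request reads program memory
    print(fetch(0x100, buffer, memory))  # second request is served by the buffer
```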
Figure 2: Program Fetch Decision
In Figure 2 the fetch stage of the pipeline is illustrated. At
the beginning of the first cycle the decision whether to
suspend the memory access is made.
4.2. Buffer Structure
In Figure 3 an example of the instruction buffer is
illustrated (the fill operation of the buffer). A fill pointer
fills empty cells. Each cell has additional control
bits. The executed bits (E) indicate whether a
fetched instruction word has already been used for
execution and can therefore be overwritten with new
entries. The valid bits (V) indicate valid
entries inside the instruction buffer. Therefore no
initialization of the buffer entries is necessary. Assuming
linear program flow, the buffer can be built up as a
circular buffer.
Figure 3: Instruction Buffer Write
If branch instructions interrupt the program flow, already
fetched instructions cannot be used any more. That is
why cache logic is used to control the entries of the
instruction buffer.
4.3. Cache Logic
For the cache logic a set-associative approach has been
chosen as illustrated in Figure 4. The advantage
compared with a fully-associative approach is that the
cache directory and the cache data memory receive
the address in parallel [5]. In the fully-associative
approach the cache data memory receives the address
sequentially (which lengthens the critical timing path).
The tag address is used to differentiate the different
pages of the memory addresses. One possible problem
is cache thrashing, which takes place
when a frequently used location is replaced by another
frequently used location. The problem of cache thrashing
can be reduced by using an n-way set-associative cache.
Figure 4: Set Associative Cache
5. Implementation
This section illustrates the implementation
aspects of the instruction buffer. For regular structures a
full-custom implementation has significant advantages in
power dissipation, area consumption and reachable
frequency (which is quite important due to the critical
timing of the program memory access). To meet the
requirement of scalability, the DPG (Data Path
Generator) of RWTH Aachen was used to develop the
full-custom parts of the instruction buffer [6].
5.1. Partitioning
As shown in Figure 5, the set-associative cache can be split
into a CDM (cache data memory) and a CAD (cache
address directory).
Figure 5: CDM and CAD
Both are well suited to a full-custom implementation
due to their regular architecture. The EBM (executed bit
management, which also includes the handling of the valid
bits) is part of the control logic and is therefore
implemented in VHDL.
The CDM is used to store the instructions inside the
instruction buffer, which is organized in cache lines. Both
the number of cache lines and the number of
bits of each cache line are scaleable. This enables the
implementation of instruction buffer variants with a
different number of entries and a different size of the
instruction words. Each block in Figure 3 (consisting of 4 entries,
the first one is marked with grey colour) is equal to one
cache line inside the instruction buffer.
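The partitioning into an address directory and a data memory can be mirrored in a small behavioural model. The Python sketch below is an assumption-laden illustration, not the full-custom circuit: the class and method names are hypothetical, and the comparison of all stored addresses is written as a simple list comprehension whose result plays the role of the match lines that select a row of the data memory.

```python
class BufferModel:
    """Behavioural model of an instruction buffer split into CAD and CDM."""

    def __init__(self, cache_lines, words_per_line):
        # CAD: one stored address per cache line (None marks an empty line).
        self.cad = [None] * cache_lines
        # CDM: one row of instruction words per cache line.
        self.cdm = [[None] * words_per_line for _ in range(cache_lines)]

    def load(self, line, address, words):
        """Write an address into the CAD and the words into the CDM."""
        self.cad[line] = address
        self.cdm[line] = list(words)

    def match(self, address):
        """Compare the address with every CAD entry (one flag per line)."""
        return [stored == address for stored in self.cad]

    def read(self, address):
        """Return the CDM row selected by the match flags, or None."""
        for line, hit in enumerate(self.match(address)):
            if hit:
                return self.cdm[line]
        return None


if __name__ == "__main__":
    buf = BufferModel(cache_lines=4, words_per_line=4)
    buf.load(0, 0x40, ["i0", "i1", "i2", "i3"])
    print(buf.match(0x40))
    print(buf.read(0x40))
    print(buf.read(0x44))
```

The number of cache lines and the number of words per line are constructor parameters, which reflects the scalability described above.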
To prevent an unbalanced height-to-width ratio, two
successive bits are placed one upon the other as
illustrated in Figure 6. The data are fed vertically and the
control of the cells (such as load, MatchX or debug control
signals) is done horizontally. With the implementation
example in Figure 6 the granularity of the size of the
instruction words is limited to two bits. If the number of
bits per instruction word is odd, one cell remains unused.
Figure 6: Cache Line (signal labels: Data in, Load, MatchX, dbgRead, Data out; bit cells 0 to 7)
The CAD is used to store the addresses of the instructions
in the instruction buffer. Each of the address bits is
compared with five entries (four data ports and one
debug port, illustrated in Figure 7). If the compared entry
matches one of them, the related MatchX signal is
set. The MatchX lines are cascaded to shorten the
timing-critical path. In this implementation example four
MatchX lines are grouped.
Figure 7: CAD Architecture
5.2. Results
The instruction buffer has been implemented in a
0.13 µm CMOS technology with a supply voltage of 1.2 V
(1.5 V is the nominal supply voltage; the lower one has
been chosen to reduce power dissipation).
cache lines   area        MatchX   data present time
16            0.069 mm²   920 ps   1.2 ns
32            0.138 mm²   920 ps   1.3 ns
64            0.276 mm²   920 ps   1.4 ns
128           0.552 mm²   920 ps   1.6 ns
Table 1: Implementation Results
In Table 1 implementation results for certain
configurations are shown. Due to the chosen
architecture the timing for the MatchX line is
independent of the number of cache lines. The data
present time (access time) increases with the number
of cache lines due to the increasing capacitance of the data
output lines.
In Figure 8 the layout of one cache configuration is
displayed (16 cache lines, 80 data bits, 14 address bits).
The upper part contains the CAD, followed by the drivers
of the MatchX signals. The lower part in Figure 8
contains the CDM.
Figure 8: Layout
6. Conclusion
The scaleable instruction buffer as proposed in this paper
can be used to reduce the width of the program memory
port without limiting the calculation bandwidth.
Inner loops in typical DSP application code can be
handled power-efficiently due to a reduced number of
program memory fetch cycles. The scalable
implementation of the buffer architecture enables
application specific adaptations to minimize the
consumed silicon area. The introduced cache logic to
control the buffer entries allows making use of the
advantages of the instruction buffer during out-of-order
execution. The chosen architecture allows minimizing
the worst-case execution time as required by real-time
constraints of DSP algorithms. The scaleable instruction
buffer is part of a project for a configurable DSP core.
7. Acknowledgement
The work has been supported by RWTH Aachen and by
the EC with the project SOC-Mobinet (IST-2000-30094).
Literature
[1] P.Lapsley, J.Bier, A.Shoham and E.A.Lee, “DSP
Processor Fundamentals, Architectures and Features”,
IEEE Press, New York 1997.
[2] D.Sima, T.Fountain, P.Kacsuk, “Advanced Computer
Architectures: A Design Space Approach”, Addison
Wesley Publishing Company, Harlow, 1997.
[3] C.Panis, R.Leitner, H.Grünbacher, J.Nurmi "xLIW – a
Scaleable Long Instruction Word", ISCAS 2003, Bangkok,
Thailand, 2003.
[4] C.Panis, G.Laure, W.Lazian, A.Krall, H.Grünbacher,
J.Nurmi "DSPxPlore – Design Space Exploration for a
Configurable DSP Core", GSPx 2003, Dallas, Texas, USA,
2003.
[5] J.Handy, “The Cache Memory Book”, Academic Press,
1998
[6] O.Weiss, M.Gansen, T.G. Noll, “A flexible Datapath
Generator for Physical Oriented Design”, Proceedings of
the ESSCIRC 2001, Villach, 18.-20. September 2001, pp.
408-411
PUBLICATION 5
C. Panis, H. Grünbacher, J. Nurmi, “A Scalable Instruction Buffer and Align Unit for
xDSPcore”, IEEE Journal of Solid-State Circuits, Volume 39, Number 7, July 2004, pp.
1094-1100.
©2004 IEEE. Reprinted, with permission, from IEEE Journal of Solid-State Circuits.
1094
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 39, NO. 7, JULY 2004
A Scalable Instruction Buffer and Align Unit
for xDSPcore
Christian Panis, Member, IEEE, Herbert Grünbacher, Member, IEEE, and Jari Nurmi, Senior Member, IEEE
Abstract—Increasing mask costs and decreasing feature sizes
together with productivity demand have led to the trend of platform design. Software programmable embedded cores are used
to provide the necessary flexibility in integrated systems. Facing
increasing system complexity, single-issue digital signal processors
(DSPs) have been replaced by cores providing the execution of
several instructions in parallel. The most common programming
model for multi-issue DSP core architectures is Very Long Instruction Word (VLIW) which is based on static scheduling, and
enables minimization of the worst case execution time and reduces
core complexity. The drawback of traditional VLIW is poor code
density, which leads to high program memory requirements and,
therefore, requires a large silicon area of the DSP subsystem. To
overcome this problem without limiting the core performance,
a scalable long instruction word (xLIW) is introduced. A special
align unit is used for implementing the xLIW program memory
interface. In this paper, the align unit and its main architectural
feature, a scalable instruction buffer, are introduced in detail. xLIW
is part of a project for a parameterized DSP core.
Index Terms—Application-specific integrated circuits (ASICs),
buffer memories, cache memories, digital signal processors, parallel architectures, reduced instruction set computing.
I. INTRODUCTION
THE single-issue digital signal processor (DSP) architecture has proven to be inadequate to achieve the
performance required by high-end media processing applications. DSP architects first tried to tackle this problem by adding
an extra multiply-accumulate (MAC) unit to the traditional
DSP architecture [1]–[3]. However, this solution is not scalable
beyond two MAC units and even then it is cumbersome to
program efficiently. Since the late 1990s, the Very Long
Instruction Word (VLIW) model has dominated high-end
DSP architectures. There are several advantages arising from
this programming model, but several disadvantages, as well.
These disadvantages are especially severe when a low-cost and
low-power implementation is desired. Thus, improvements in
the program memory interface and instruction representation
have emerged to counteract the problems evident in VLIW.
While targeting high performance of the data processing architecture, the ease of programming also needs to be addressed.
It is important that processors are also efficiently programmable
in high-level languages since instruction scheduling in VLIW
relies on compilers that are, at their best, able to reach near-optimal solutions. It is no longer practical to program a highly parallel DSP at assembly level, which was and is still extremely
common with traditional DSP architectures. Therefore, the improved architectural features should make the construction of
efficient compilers easier. One of the targets is orthogonality of
the resources available for the programmer/compiler.
Manuscript received October 30, 2003; revised January 20, 2004. This work
was supported by RWTH Aachen, the EC under Project SOC-Mobinet (IST-2000-30094), and Infineon Technologies Austria.
C. Panis and H. Grünbacher are with the Carinthian Technical Institute,
A-9524 Villach, Austria (e-mail: [email protected]; [email protected]).
J. Nurmi is with the Tampere University of Technology, FIN-33101 Tampere,
Finland (e-mail: [email protected]).
Digital Object Identifier 10.1109/JSSC.2004.829411
When working with DSP architectures, there is one thing that
cannot be compromised: the predictability of the execution time.
Superscalar DSP architectures have been suggested [4], but DSP
programmers are wary of this approach since the instruction
scheduling is performed dynamically. They cannot get a grip
on the schedules before the application is running, but need to
plan the timing based on the worst case scenario instead. Cache
memories are avoided in real-time DSP systems for
exactly the same reason: the uncertainty of meeting the processing deadlines because of something beyond the control of
the programmer. The importance of providing program memory
interfaces that can guarantee real-time operation is, therefore,
apparent.
This paper starts with an overview of existing program
memory port architectures. The requirements for improvements
are summarized and the properties of the xLIW processor
architecture are presented. The related instruction buffer architecture alongside its operation and implementation are covered.
The paper concludes with a presentation of results.
II. STATE OF THE ART
This section briefly introduces available implementations of
program memory ports for DSPs. The first subsection illustrates
VLIW and its related advantages and disadvantages, and the
second part introduces some commercial implementations of
the concept having already considered ways of overcoming the
relevant drawbacks.
A. VLIW
VLIW is the most common programming model for high-performance DSPs [5]. To execute instructions on the units available in parallel, a very long instruction word is built up of several instruction words which are used to code the instructions for the
different units. The long instruction word is fetched, decoded,
and executed. The VLIW model differs from the superscalar programming model in that there is no issuing queue available [6].
Data and control dependencies are resolved in software by static
scheduling [7]. The advantage of static scheduling is deterministic behavior of the executed algorithms, allowing minimization
of the worst case execution time. For real-time critical DSP algorithms, it is essential to minimize the worst case execution time,
thereby ensuring that the execution of a program section meets
a predefined schedule.
0018-9200/04$20.00 © 2004 IEEE
PANIS et al.: SCALABLE INSTRUCTION BUFFER AND ALIGN UNIT FOR xDSPcore
1095
Fig. 1. VLIW featuring poor code density.
Fig. 2. Constant fetch bundle; scalable execution bundle.
The requirement of supporting more computational power
leads to the support of executing several instructions in parallel,
and therefore, to an increased size of the VLIW. However, data
and control dependencies in the algorithms result in poor utilization of the available parallelism (about two to three instructions per clock cycle [7]). The main disadvantage of VLIW, poor
code density, becomes apparent in the static mapping of a fetch bundle to an execution
bundle. Code density indicates how
efficiently an algorithm can be mapped to certain core architectures and the related instruction set. It is the most significant
factor influencing the consumed silicon area in a highly integrated system-on-chip (SoC).
However, an average instruction level parallelism (ILP) of 2 to
3 does not necessarily mean that executing several instructions
in parallel is a poor solution for increasing the computational
power of an embedded DSP core. Inner loops of traditional
DSP algorithms such as filtering and fast Fourier transform
(FFT) can make use of the provided parallelism. Code density
of application code is influenced more by control parts of the
algorithms executed.
Fig. 1 illustrates the described problem in a processor architecture providing the execution of up to five instructions in parallel. Assuming a traditional VLIW programming model and
the execution of control code, the drawback of poor code density becomes obvious.
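The effect can be quantified with a short calculation. The Python sketch below is illustrative only: the function name and the per-cycle instruction counts are hypothetical, and it assumes a traditional VLIW in which every unused slot of the long instruction word occupies program memory.

```python
def vliw_slot_utilization(instructions_per_cycle, issue_width):
    """Fraction of VLIW slots in program memory that hold useful instructions.

    instructions_per_cycle -- useful instructions in each long instruction word
    issue_width            -- number of slots per long instruction word
    """
    used = sum(instructions_per_cycle)
    total = issue_width * len(instructions_per_cycle)
    return used / total


if __name__ == "__main__":
    # Hypothetical control code with two to three instructions per cycle
    # on a core able to execute up to five instructions in parallel.
    print(vliw_slot_utilization([2, 3, 2, 3], issue_width=5))
```

With an average of two to three instructions per cycle on a five-issue core, about half of the stored slots carry no useful instruction in this model.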
B. VLES
One way to overcome the problem of poor code density is
decoupling of fetch and execution bundles. Variable Length
Execution Set (VLES) allows resizing of the execution bundle
depending on the algorithm executed. It is possible to map
several execution bundles to one fetch bundle, which intensifies the usage of program memory. However, the size of the
program memory port is equal to the maximum possible size
of the execution bundle which can lead to significant routing
between DSP core and memory ports, see, e.g., [9] and [10].
C. CLIW
Configurable Long Instruction Word (CLIW) allows a reduction of the regular program memory port size. The disadvantage of reducing the size of the program memory port is the
decreased peak performance of the core architecture. Available
commercial architectures overcome this drawback by supporting an extended program memory port [11] used during execution of inner loops. Even though the extended port is not frequently used, the overall size of the
program memory port is still equal to the supported peak performance. The extended program memory port is a nonorthogonal
feature and therefore efficient C-compilation is not feasible.
D. xLIW
xLIW is the scaleable long instruction word of xDSPcore,
an application-specific configurable DSP core architecture
efficiently programmable in C [12]. Efficiently in this sense
means featuring less than 10% overhead compared with
manual coding, which makes it possible to omit manual assembly
optimization, which is very slow and error-prone. This allows an
architecture-independent description of the algorithms.
xLIW is based on the traditional VLIW programming model
(static scheduling) and as with VLES the fetch bundle and the
execution bundle are decoupled. Additionally, the size of the
program memory port is reduced: for the example in Fig. 2,
the worst case execution bundle consists of up to
ten instruction words, while the associated fetch bundle consists of only four instruction words. Considering an average ILP of
two to three operations [7], the reduced size of the fetch bundles provides enough instructions for efficient use of the core
resources. The memory bandwidth problem of the reduced fetch
bundle compared with the worst case execution bundle when
peak performance is required is solved by introducing the align
unit, which contains an internal instruction buffer. The align unit
takes care of loading the fetch bundle from program memory
and setting up the execution bundle, whose execution order is
not allowed to be changed due to static scheduling.
In Fig. 3, the main functionality of the align unit is illustrated. The unit connects to the program memory to fetch the
fetch bundle. It stores the fetched instructions to compensate for the
memory bandwidth problem during execution of inner loops and
sets up the next execution bundle. The align unit is responsible for the insertion of stall cycles if the next execution
bundle is not available. This can take place at nonaligned branch
targets or if the execution set at the branch target exceeds the size
of the fetch bundle. The central part of the align unit is an instruction buffer which is discussed in detail in the next section.
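The interplay of constant-width fetching and variable-size issuing can be explored with a small cycle-based model. The Python sketch below is a simplification under stated assumptions: the function name and default parameters are hypothetical, a fetched bundle is usable in the same cycle, pipeline latencies and branches are ignored, and the buffer capacity is counted in instruction words.

```python
def simulate_align_unit(exec_sizes, fetch_width=4, buffer_words=16):
    """Return (cycles, stalls) for issuing the given execution bundles in order.

    exec_sizes   -- sizes of consecutive execution bundles in instruction words
    fetch_width  -- instruction words delivered by one fetch bundle
    buffer_words -- capacity of the instruction buffer in instruction words
    """
    if exec_sizes and max(exec_sizes) + fetch_width - 1 > buffer_words:
        raise ValueError("buffer too small for the largest execution bundle")
    total_words = sum(exec_sizes)
    fetched = 0    # words fetched from program memory so far
    buffered = 0   # words waiting in the instruction buffer
    issued = 0     # index of the next execution bundle
    cycles = 0
    stalls = 0
    while issued < len(exec_sizes):
        cycles += 1
        # Fetch side: one fetch bundle per cycle while there is room.
        if fetched < total_words and buffered + fetch_width <= buffer_words:
            new_words = min(fetch_width, total_words - fetched)
            fetched += new_words
            buffered += new_words
        # Execute side: issue the next bundle only if it is complete.
        need = exec_sizes[issued]
        if buffered >= need:
            buffered -= need
            issued += 1
        else:
            stalls += 1  # the align unit inserts a stall cycle
    return cycles, stalls


if __name__ == "__main__":
    # Narrow bundles let the fetch side run ahead of a wide bundle.
    print(simulate_align_unit([1, 1, 1, 1, 1, 1, 10]))
    # A wide bundle with an empty buffer, e.g. at a branch target.
    print(simulate_align_unit([10]))
```

In this model a ten-word execution bundle that follows several narrow bundles issues without a stall, because the fetch side has filled the buffer in advance, whereas the same bundle met with an empty buffer has to wait for several fetch bundles.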
III. INSTRUCTION BUFFER
The instruction buffer compensates for the memory bandwidth mismatch between the fetch bundle and the worst case
execution bundle. It enables a reduced program memory port
without limiting the use of available core resources and reduces
switching activity at the program memory port during loop execution. The first subsection describes the chosen architecture
for the instruction buffer by considering the requirements discussed above. The second part describes some implementation
details.
Fig. 3. Align unit overview.
Fig. 4. Storage cell.
Fig. 5. Hardware loop.
A. Architecture
The instruction buffer is built up of instruction storage cells.
As illustrated in Fig. 4, one storage cell consists of n instruction
words. The number of entries (n) is equal to the number of instruction words building up one fetch bundle. For the example
in Fig. 4, each storage cell contains four entries. As well as the
instruction words, the physical address is stored and later used
for comparing fetch pointer content with addresses stored in the
instruction buffer. Each storage cell includes two control bits,
namely the valid (V) and the executed (E) bits. The valid bit is
used for indicating valid entries in the storage cell. After reset,
all bits are flushed. The same feature can be used during run
time, when the buffer entries are flushed without rewriting the
related cells, thus saving power. The executed bit indicates the
status of the entries in the storage cell. If all entries of a storage
cell have already been used for building up an execution bundle, the
related E bit is set. Storage cells with a set E bit can be overwritten with new fetch bundles fetched from program memory.
The main function of the instruction buffer is to compensate for
the possible memory bandwidth problem between the reduced
fetch bundle and the worst case execution bundle. Furthermore,
the instruction buffer can also be used during the execution of
loop bodies. The loop is fetched once from program memory
and then executed from the buffer. Therefore, the fetching can
be suspended during loop execution, which reduces switching
activity at the program memory port. During execution of a
loop body, the E bits of already-used storage cells are not set
to prevent overwriting of the storage cells containing the loop
body. This feature prevents cache thrashing [13]. When executing
the last loop iteration, the E bits are handled as during handling of linear code, allowing the pre-fetching of instructions
located after the loop body. During normal program flow, the
fetch counter is slightly decoupled from the program counter.
Therefore, it is possible to fetch data into the instruction buffer
several cycles before the entries will be used to build up the next
execution bundle, thus preventing stall cycles due to missing instructions. During loop execution, this feature is disabled to prevent fetching instructions that cannot be stored in the instruction
buffer.
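The handling of the valid and executed bits can be summarised in a behavioural sketch. The Python model below is an illustration under assumptions: the class and method names are hypothetical, a storage cell is marked as executed as soon as it is consumed (rather than after all of its entries have been used), and the search for a free cell is a simple linear scan.

```python
from dataclasses import dataclass, field


@dataclass
class StorageCell:
    address: int = 0
    words: list = field(default_factory=list)
    valid: bool = False     # valid bit
    executed: bool = False  # executed bit


class InstructionBuffer:
    def __init__(self, cells):
        self.cells = [StorageCell() for _ in range(cells)]

    def flush(self):
        """Invalidate all entries without rewriting the stored words."""
        for cell in self.cells:
            cell.valid = False

    def fill(self, address, words):
        """Store a fetch bundle in a free cell; return False if none is free."""
        for cell in self.cells:
            if not cell.valid or cell.executed:
                cell.address, cell.words = address, list(words)
                cell.valid, cell.executed = True, False
                return True
        return False

    def consume(self, address, in_loop=False):
        """Return the words stored for an address, or None if not present.

        Inside a loop body the executed bit is not set, so the cell is
        protected against being overwritten.
        """
        for cell in self.cells:
            if cell.valid and cell.address == address:
                if not in_loop:
                    cell.executed = True
                return cell.words
        return None


if __name__ == "__main__":
    buf = InstructionBuffer(cells=2)
    buf.fill(0, ["a", "b", "c", "d"])
    buf.fill(4, ["e", "f", "g", "h"])
    buf.consume(0, in_loop=True)
    print(buf.fill(8, ["x"] * 4))  # loop body is protected
    buf.consume(0)                 # last iteration, handled as linear code
    print(buf.fill(8, ["x"] * 4))  # the used cell can now be overwritten
```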
Many DSP architectures feature hardware support for
loops. The control of the loop, such as decrementing of the
loop counter, is implicitly done in hardware. Traditional DSP
algorithms like filtering frequently contain loop constructs.
Therefore, the hardware loop can be used to reduce execution
cycles and thus power dissipation, and to reduce the number of
instructions, which increases code density. An example is illustrated in Fig. 5 [5]. The hardware loop instructions
can be identified during run time and the loop body is executed
from the entries of the instruction buffer.
Loop constructs where the loop execution count
depends on a condition (e.g., while loop) have to be implemented as conditional branch instructions. The core hardware
cannot identify these software loop constructs during run time.
To make use of the features of the instruction buffer during execution of while loops, the handling of the buffer content can be
manually controlled by locking and unlocking of storage cells.
The number of storage cells building up the instruction buffer
is scaleable according to the application requirements which allows for making tradeoffs between area consumption caused by
the storage cells and the advantage of executing loop constructs.
If the number of buffer entries is too low, the loop body cannot
be fully stored inside the instruction buffer (as application examples in Fig. 11 illustrate).
B. Dynamic Power Dissipation
To determine and reduce power dissipation at the program
memory port, the switching power dissipation model
$P = \frac{1}{2} V_{DD}^{2} C D$ is used to identify the advantages of the reduced
switching activity, where $V_{DD}$ is the supply voltage, $C$ the output
capacitance and $D$ the transition density, all corresponding to
one gate [14]. Reducing power dissipation by voltage scaling
is efficient due to the quadratic dependency [15], but cannot
be considered within an embedded DSP core separate from the
complete system. Reducing output capacitance by considering
and preserving locality between cells is an implementation issue
and is discussed in a later section. Therefore, minimizing the
transition density factor $D$ can be used to reduce dynamic
power dissipation on core level.
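The dependencies named above can be checked numerically. The Python sketch below assumes the common transition-density form of the switching power, one half times the squared supply voltage times the output capacitance times the transition density; the factor of one half, the function name and all numerical values are assumptions for illustration and are not taken from the measurements in this paper.

```python
def switching_power(v_dd, capacitance, transition_density):
    """Average switching power of one gate in watts.

    v_dd               -- supply voltage in volts
    capacitance        -- output capacitance in farads
    transition_density -- transitions per second at the gate output
    """
    return 0.5 * v_dd ** 2 * capacitance * transition_density


if __name__ == "__main__":
    c_out = 10e-15   # hypothetical output capacitance of 10 fF
    density = 100e6  # hypothetical 100 million transitions per second
    base = switching_power(1.5, c_out, density)
    # Quadratic dependency on the supply voltage.
    print(switching_power(1.2, c_out, density) / base)
    # Linear dependency on the transition density.
    print(switching_power(1.5, c_out, density / 2) / base)
```

Halving the transition density halves the switching power in this model, which is the lever the instruction buffer uses at the program memory port.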
xDSPcore development considers this aspect on all layers:
high code density and, therefore, fewer program memory accesses on architectural level, and reduced switching activity
by instruction reordering during compile time [16] and during
execution time by making use of the instruction buffer during
loop handling. Benchmark results illustrate the efficient use
of the instruction buffer for reducing transition activity during
loop execution (illustrated in Section IV).
C. Buffer Control
Control of the entries of the instruction buffer is done with
n-way set-associative cache logic [13]. The advantage compared with fully associative cache logic is the possibility of a
parallel search in the cache tag directory and the cache memory.
The reason for implementing a more sophisticated cache logic has
been the ability to make use of the instruction buffer during
breaks in program flow, or even if branches in a loop construct
occur. Control by this cache logic does not implicitly mean providing a conventional cache: the behavior is still fully deterministic. If a requested instruction word is already stored in
the instruction buffer, it is possible to suspend the memory access and, therefore, to reduce power dissipation. The execution
time remains unchanged. The same is true if the entry has to be
fetched from program memory. There is no cache miss penalty.
D. Implementation
The instruction buffer introduced in this paper is part of a parameterized embedded DSP core architecture named xDSPcore.
xDSPcore allows architectural key features such as
bit width and register file size to be parameterized to meet application specific requirements. For this purpose, a flexible implementation
which is easily portable to different product variants and different silicon technologies is required. However, the performance
requirements of different applications do not permit a
synthesis-based semi-custom design flow.
The alternative, a manual full-custom design, would meet
the performance requirements, but the effort for implementation
and porting to different technologies is not tolerable. This becomes even more evident due to the fact that scalability, a main
feature of xDSPcore, is not supported by traditional manual
full-custom design flow.
To satisfy the requirements of scalability and portability of the
embedded DSP core architecture on one hand and meeting the
performance requirements on the other hand, critical structures
of the align unit are implemented in a physical-oriented way by
using a dedicated data path generator (DPG) [17].
Starting from a high level description of the signal flow graph
(SFG), the DPG assembles highly optimized macro layouts
from abutment cells. These abutment cells are automatically
derived from a small library of optimized leaf cells. This
exploits the inherent regularity and locality typical for SFGs in
digital signal processing for the optimization of silicon area,
throughput rate, and especially power dissipation, and offers
the possibility for iteratively optimizing the SFG by simply
modifying the SFG description. Porting the align unit to a new
silicon technology only requires porting of the small leaf cell
library, while the architectural description remains unchanged.
Fig. 6. CDM and CAD partitioning.
The above-described methodology is especially well suited
for regular structures and makes use of locality, allowing short
routing distance and low driving capacity, and therefore, leading
to lower power dissipation [18]. VHDL-RTL is used for implementation of the remaining parts of the align unit which are
mainly control logic taking care of storage cell management.
As illustrated in Fig. 6, the n-way associative buffer can be
split into a cache data memory (CDM) and a cache address directory (CAD). Both are well suited to be implemented using
DPG; the executed bit management (EBM) including handling
of the valid bit is part of the control logic and, therefore, implemented in VHDL-RTL. The same is true for the logic that
is responsible for building up the execution bundle. The CDM
is used to store the instructions inside the instruction buffer
which is organized in cache lines. The number of cache lines
(equal to the number of fetch bundles, which can be stored inside the instruction buffer) and the number of bits of each cache
line is scaleable (which allows changing the number of bits
used for each instruction word). This enables the implementation of instruction buffer variants with different numbers of entries and with instruction words with different numbers of bits.
Each storage cell as in Fig. 4 (consisting of four entries) corresponds
to one cache line inside the instruction buffer. To prevent an unbalanced height-to-width ratio (especially for an instruction
buffer with a small number of entries), two successive bits are
placed one upon the other, as illustrated in Fig. 7. The data is
fed vertically and the control (like load, MatchX, or debug control signals) of the cells is done horizontally. With the implementation example in Fig. 7, the granularity of the size of the
instruction words is limited to two bits. If the number of bits per
instruction word is odd, one cell remains unused. The debug port
allows reading the content of storage cells without influencing
the content of the instruction buffer or the buffer behavior.
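The two-bit granularity described above can be checked with a short calculation. The helper name `cdm_cells_per_word` is hypothetical, and the function only restates the arithmetic of stacking two successive bits in one cell column; it is a sketch, not part of the DPG flow.

```python
def cdm_cells_per_word(bits_per_word):
    """Return (cell_columns, unused_cells) when two successive bits are stacked vertically."""
    columns = (bits_per_word + 1) // 2        # each column holds two stacked bit cells
    unused = columns * 2 - bits_per_word      # 1 for an odd width, 0 for an even width
    return columns, unused


print(cdm_cells_per_word(20))  # (10, 0)
print(cdm_cells_per_word(21))  # (11, 1)
```

An even instruction word width fills all cells, while an odd width leaves exactly one cell unused, as stated in the text.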
The CAD is used to store the address of the instructions of
the instruction buffer. Each of the address bits is compared with
five entries (four data ports and one debug port, illustrated in
Fig. 8). If the compared entry matches one of them, then the
related MatchX signal is set. The MatchX lines are cascaded to
reduce the timing critical path. In this implementation example
four MatchX lines are grouped.
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 39, NO. 7, JULY 2004
Fig. 7. Bit alignment.
Fig. 8. Read ports/MatchX lines.
IV. RESULTS
This section is in two parts. The first part considers the architectural advantages of using the above-described instruction
buffer. Implementation examples of algorithms are used to
illustrate reduced switching activity at the program memory
port. The second part illustrates the results of the chosen
implementation.
A. Application Examples
The instruction buffer is used for reducing switching density and, therefore, dynamic power dissipation. For the results
in Fig. 9, different finite-impulse response (FIR) filter implementations have been used. The cycle count is normalized to
100% while the x axis shows different filter configurations. The
grey-shaded part indicates the number of cycles that make use
of the instruction buffer. For these filter kernels, no access to
program memory is required for most of the execution time.
Vector operations are also based on inner loop constructs, as
illustrated in Fig. 10. Increasing the number of operated elements leads to an increasing number of instruction cycles that
make use of the instruction buffer.
Fig. 11 contains some application examples, for example,
the cryptography algorithm Blowfish, some examples of voice
coders (adpcm, G.711, G.723), and an implementation of
Huffman coding. For adpcm, Blowfish, and G.723, most of the
algorithm can make use of the instruction buffer. The Huffman
coding example illustrates the limitation of the instruction
buffer implementation. The loop body produces so many execution cycles that it exceeds the size of the instruction buffer.
Summarizing the results, for loop-centric algorithms which
are typical for algorithms implemented on DSP cores, the instruction buffer significantly reduces switching activity at the
program memory port.
B. Buffer Implementation
The instruction buffer has been implemented in a 0.13-μm
CMOS technology with a supply voltage of 1.2 V. In Table I,
implementation results for certain configurations are illustrated.
Due to the chosen architecture the timing for the MatchX line
is independent of the number of cache lines. The data present
time (access time) increases with the number of cache lines due
to the increasing capacitance of the data output lines. For small
filter kernels or vector operations (as illustrated in the previous
subsection), a cache configuration supporting 16 cache lines is
well suited to compensate for the memory bandwidth problem and
to store loop bodies with reasonable area overhead.
In Fig. 12, the layout of the full-custom parts of a cache configuration is displayed (16 cache lines, 80 data bits, and 14 address bits). The upper part contains the CAD followed by the
driver of the MatchX signals. The lower part of Fig. 12 contains
the CDM.
TABLE I
RESULTS OF BUFFER VARIANTS
V. SUMMARY
This paper introduces a scaleable long instruction word
(xLIW), which is part of a DSP core architecture that enables
efficient programming in a high-level language like C. xLIW
is based on traditional VLIW with static scheduling, which
enables minimizing of the worst case execution time. xLIW
enables efficient use of the program memory and, therefore,
features high code density without limiting the usage of the
available core resources. The key feature of xLIW is the align
unit whose main architectural unit is an instruction buffer
which is introduced in this paper in detail. Making use of the
instruction buffer for loop handling reduces power dissipation
both in hardware and software loops. The chosen architecture enables scaling of the main architectural features of the
instruction buffer to satisfy application-specific requirements
and to prevent silicon area overhead. Within practical buffer
sizes, the total access time in contemporary technologies is
within 1–2 ns, which does not form a bottleneck for the DSP
subsystem. xLIW is part of a project for a parameterized DSP
core architecture.
Fig. 9. FIR filter operations.
Fig. 10. Vector operations.
Fig. 11. Application examples.
Fig. 12. Layout (full-custom part).
REFERENCES
[1] “DSP 16000, Digital Signal Processor Core,” Agere Systems, Reference
Manual 06.2002.
[2] “Teak DSP Core,” ParthusCeva, Inc., Data Sheet, 2002.
[3] “ADSP 21535, Blackfin DSP Hardware Reference,” Analog Devices,
Inc., Preliminary Edition, 11.2001.
[4] “ZSP 400, Digital Signal Processor Architecture,” LSILogic Corp.,
DB14-00012103, 4th ed., Dec. 2001.
[5] P. Lapsley, J. Bier, A. Shoham, and E. A. Lee, DSP Processor Fundamentals, Architectures and Features. New York: IEEE Press, 1997.
[6] J. L. Hennessy and D. A. Patterson, Computer Architecture. A Quantitative Approach. San Mateo, CA: Morgan Kaufmann, 1996.
[7] D. Sima, T. Fountain, and P. Kacsuk, Advanced Computer Architectures:
A Design Space Approach. Harlow, MA: Addison Wesley, 1997.
[8] “TMS320C6000 CPU and Instruction Set Reference Guide,” Texas Instruments, Inc., 10.2002.
[9] “TMS320C6201 Technical Overview,” Texas Instruments, Inc.,
SPRS051G, 01.1997 (revised 11.2000).
[10] “TMS320C64x Technical Overview,” Texas Instruments, Inc., SPRU395, 02.2000.
[11] “Carmel DSP Core Architecture Specification,” Infineon Technologies,
2001.
[12] C. Panis, R. Leitner, H. Grünbacher, and J. Nurmi, “xLIW - a scaleable
long instruction word,” in Proc. IEEE Int. Symp. Circuits and Systems
(ISCAS), Bangkok, Thailand, 2003, pp. V-69–V-72.
[13] J. Handy, The Cache Memory Book. New York: Academic, 1998.
[14] M. Ketkar, S. S. Sapatnekar, and P. Patra, “Convexity-based optimization
for power-delay tradeoff using transistor sizing,” in Proc. ACM/IEEE
Int. Workshop on Timing Issues in the Specification and Synthesis of
Digital Systems (TAU’00), Austin, TX, 2000, pp. 52–57.
[15] T. Pering, T. Burd, and R. Brodersen, “The simulation and evaluation
of dynamic voltage scaling algorithms,” in Proc. Int. Symp. Low Power
Electronics and Design (ISLPED), Monterey, CA, 1998, pp. 76–81.
[16] U. Hirnschrott and A. Krall, “VLIW operation refinement for reducing
energy consumption,” in Proc. IEEE Int. Symp. System-On-Chip (SOC),
Tampere, Finland, 2003, pp. 131–134.
[17] O. Weiss, M. Gansen, and T. G. Noll, “A flexible datapath generator
for physical oriented design,” in Proc. ESSCIRC, Villach, Austria, Sept.
2001, pp. 408–411.
[18] V. S. Gierenz, R. Schwann, and T. G. Noll, “A low power digital beamformer for handheld ultrasound systems,” in Proc. ESSCIRC, Villach,
Austria, 2001, pp. 276–279.
Christian Panis (M’98) received the M.Sc. degree
from Vienna University of Technology, Vienna, Austria, in 1995. Since 2002, he has been working toward the Ph.D. degree at Tampere University of Technology, Tampere, Finland. His research topic is in the
area of digital signal processor architectures.
From 1996 to 2002, he was with Infineon
Technologies Austria as Development Engineer
and Project Manager for wireline communication
products.
Herbert Grünbacher (M’73) was born in Kitzbühel,
Austria, in 1945. He received the Dipl.-Ing. degree
(cum laude) from Graz University of Technology,
Austria, in 1971. In 1979, he received the Ph.D.
degree (cum laude) from the same university on the
subject of computer-aided circuit analysis.
In 1982, he joined Austria Microsystems International, Graz, where he worked on analog and
digital full-custom circuits and was responsible for
the design groups in Unterpremstätten, Austria, and
Swindon, U.K. From 1986 to 1987, he was with
Siemens Components, Munich, Germany, as Deputy Director for MOS CAD.
In 1987, he became a Full Professor of computer engineering at the Vienna
University of Technology, Vienna, Austria, heading the VLSI Design Group
of the Institute for Computer Engineering. From 1998 to 2003, he was on
leave from Vienna University of Technology to head Carinthia Tech Institute,
Villach, Austria. Since 2003, he has been back with Vienna University
of Technology. His current research interest is system on chips with focus on
automotive applications.
Dr. Grünbacher organized the European Solid-State Circuits Conference (ESSCIRC) 1989 in Vienna, Austria, and in 2001 in Villach, Austria. He is a member
of the ESSCIRC/ESSDERC steering committee and was Executive Secretary
from 1992 to 2002. He has also been a member of the steering committee of the FPL
conference series since 1992 and is an evaluator and reviewer in several European
research programs.
Jari Nurmi (S’88–M’91–SM’01) received the
M.Sc., Licentiate of Technology, and Doctor of
Technology degrees from Tampere University of
Technology (TUT), Finland, in 1988, 1990, and
1994, respectively.
From 1987 to 1994, he was with TUT as a Research Assistant, Teaching Assistant, Research Scientist, Project Manager, Senior Research Scientist,
and Acting Associate Professor (1991–1993). From
1995 to 1998, he worked for VLSI Solution Oy, Tampere, as the company Vice President responsible for
DSP processor development activities. Since January 1999, he has been a Professor at the Institute of Digital and Computer Systems of TUT. He is the head
of the national TELESOC graduate school. He is the author or coauthor of over
80 international papers, coeditor of Interconnect-centric Design for Advanced
SoC and NoC (Kluwer), and has supervised more than 60 M.Sc., Licentiate,
and Doctoral theses. His current research interests include system-on-chip integration, on-chip communication, embedded and application-specific processor
architectures, and circuit implementations of digital communication and DSP
systems.
Dr. Nurmi has been the Chairman of the annual International Symposium on
System-on-Chip and its predecessor SoC Seminar in Tampere since 1999 and is
a board member of the NORCHIP conference series. He is a senior member of
the IEEE Signal Processing Society, Circuits and Systems Society, Computer
Society, Solid-State Circuits Society, and Communications Society, and a
member of EIS (the Finnish Society of Electronics Engineers).
PUBLICATION 6
C. Panis, U. Hirnschrott, A. Krall, G. Laure, W. Lazian, J. Nurmi, “FSEL – Selective
Predicated Execution for a Configurable DSP Core”, in Proceedings of IEEE Annual
Symposium on VLSI (ISVLSI-04), Lafayette, Louisiana, USA, February 19-20, 2004, pp.
317-320.
©2004 IEEE. Reprinted, with permission, from proceedings of IEEE Annual Symposium
on VLSI.
FSEL - Selective Predicated Execution for a Configurable DSP Core
C. Panis
Carinthian Tech
Institute
U.Hirnschrott, A.Krall
Vienna University of
Technology
G.Laure, W.Lazian
Infineon Technologies
Austria
J. Nurmi
Tampere
University of
Technology
Abstract
Increasing system complexity of SOC applications
leads to an increased need of powerful embedded DSP
processors. To fulfill the required computational
bandwidth, state-of-the-art DSP processors allow
executing several instructions in parallel and for
reaching higher clock frequencies they increase the
number of pipeline stages. However, deeply pipelined
processors have drawbacks in the execution of branch
instructions: branch delays. On average not more than
two branch delay slots can be used; additional ones
remain unused and decrease the overall system
performance. Instead of compensating the drawback of
branch delays (e.g. branch prediction circuits) it is
possible to reduce the number of branch delays by
reducing the number of branch instructions.
Predicated execution (also guarded execution or
conditional execution) can be used for implementing if-then-else constructs without using branch instructions.
The drawback of traditional predicated execution is
decreased code density. This paper introduces
selective predicated execution based on FSEL which
allows reducing the number of branch instructions
without decreasing code density. Selective predicated
execution based on FSEL is part of a project for a
configurable DSP core.
1. Introduction
Increasing system complexity of SOC applications
leads to an increasing demand on computational power
of embedded processors. Deeply pipelined processors are
used for reaching higher clock frequencies. But deeply
pipelined processors have obstacles when executing
branch instructions: branch delays [1].
Branch delays are caused by taken branch
instructions which cause a break in the linear program
flow. Branch delays can lead to a significant
performance loss of the processor sub-system. To
overcome this drawback branch prediction circuits
have been introduced [2][3][4][5]. Especially in the
area of DSP algorithms deterministic behavior is
required, which contradicts prediction approaches.
During system definition the worst case execution time
has to be considered and the prediction assumed as not
to be taken. Therefore prediction has no added value
for system performance. Another approach for reducing
the number of branch delays is reducing the number of
branch instructions. Predicated execution can be
efficiently used to remove conditional branch
instructions caused by if-then-else constructs.
Predicated or conditional execution was already
introduced in the 1980s. The main drawback of
predicated execution is an increased program code
space. This paper introduces selective predicated
execution based on FSEL enabling a reduced number
of branch instructions without the drawback of
increased code space. Only code sections which can
make use of the advantage of selective predicated
execution need additional instruction space. The
chosen orthogonal implementation of FSEL can be
efficiently used by a C-Compiler.
The first part of the paper is used for illustrating the
motivation of using predicated execution. The second
part introduces two implementation examples of
predicated execution (Texas Instruments C’62xx,
Starcore SC140). The third section is used for
introducing selective predicated execution based on
FSEL. The result section contains some benchmark
results comparing algorithm implementations using
selective predicated execution.
0-7695-2097-9/04 $20.00 (c) 2004 IEEE
2. Motivation
This section describes the branch delay problem
caused by branch instructions in deeply pipelined
processors. The number of branch delays depends on
the number of pipeline stages located between the
instruction fetch stage and the branch condition
evaluation. Two possible solution approaches are
briefly explained in this section: branch prediction and
predicated execution.
Today’s VLIW (Very Long Instruction Word) DSP
processors provide additional computational bandwidth
by supporting the execution of several instructions in
parallel and by increasing the possible clock frequency
due to deep pipeline structures. Whether an application
can make use of the provided parallelism is mainly
influenced by data dependence between instructions
and by the branch instruction frequency. Fewer branch
instructions lead to longer basic code blocks and
therefore to a higher possibility to schedule instructions
in parallel (increased instruction level parallelism). In
[6][7][8] the branch frequency of benchmark examples
is analyzed. The ratio is different for scientific code
and general-purpose programs (GP). On average general
purpose programs have a branch ratio between 20-30%, for scientific code it is still 5-10%. Even for
scientific programs (which will be more significant for
programs running on an embedded DSP core) each 10th
to 20th instruction is a branch instruction. About 75% of
the branch instructions are conditional. Assuming a
processor providing several parallel units, the distance
between branch instructions is getting quite low.
Therefore branch delay slots will consume a significant
number of cycles.
One way to reduce the penalty of branch delays is
the usage of branch prediction. Grohoski [9] divides
conditional branch instructions into loop closing
branch instructions (e.g. caused by while loops) and
other conditional branches. Loop closing conditional
branches will be taken for n-1 times. Assuming that the
remaining conditional branches will be taken with 50%
probability, this leads to a ratio of 5/6 to 1/6 between
taken and not taken branch instructions. Other literature
sources [6][10] estimate a ratio of ¾ to ¼, which still
justifies the emphasis on the effective implementation
of the taken branch instructions.
There are several branch prediction implementations
available, getting trickier as the number of pipeline
stages is increasing [2][3][4][5]. However, this is not in
the focus of this paper.
Another possibility to overcome the problem of
unusable branch delays is predicated or guarded
execution. It can be used to eliminate conditional
branch instructions e.g. generated by if-then-else
constructs, which are common in control code. It
consists of a condition part and an operation part:
(condition) operation
Already in the HP Precision Architecture (1985)
conditional execution has been introduced. To quantify
the advantage of this architectural feature
Pnevmatikatos and Sohi [11] have analyzed benchmark
programs (including Espresso, Gcc and Yacc). About
20% of the instructions have been conditional and 5%
unconditional branch instructions.
In their study they distinguish between full
guarding, which assumes that all instructions can be
executed conditionally, and restricted guarding, which
allows only arithmetic instructions to be executed under
certain conditions. Detailed results can be found in
[11]. For these benchmark examples about one-third of
the conditional and unconditional branches can be
replaced using full guarding. For restricted guarding
the numbers are lower: about 15% of the conditional
and 2% of the unconditional branch instructions can be
replaced.
The drawback of guarded execution is the growth of
the basic block size. In the above discussed benchmark
examples the size of the basic blocks increases from
4.8 to 7.3 instructions for full guarding. Using
restricted guarding the enlargement is considerably smaller.
Today most of the VLIW architectures support a
mechanism of guarded execution. This is mainly
due to the aspect that VLIW architectures
support a high ILP (instruction-level parallelism),
which requires effective branch handling to prevent
severe performance limitation.
3. Predicated Execution
This section is used to illustrate implementation
examples of predicated execution. The Texas
Instruments C62x family and the Starcore SC140 have
been chosen. Both VLIW DSP architectures provide
the possibility to execute several instructions in
parallel, and therefore predicated execution is
mandatory to prevent a performance loss in code
sequences with high branch frequency.
3.1. TI C62x
The C62x architecture of Texas Instruments allows
each instruction to be executed conditionally (full
guarding) [12]. To obtain full guarding, 3 bits of each
instruction word are used to decode the register whose
status is needed to generate the condition. The possible
registers are B0, B1, B2, A1 and A2. Under certain
conditions A0 can also be used. The remaining coding
space (with 3 bits it is possible to encode 8 states) is
used to encode unconditional execution and one code
combination remains reserved.
Figure 1: TI C62x instruction example (addk)
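The guard encoding of the C62x (a 3-bit creg field selecting the condition register, followed by a z bit that selects a test for equal to zero with z=1 or not equal to zero with z=0) can be sketched as a small decoder. This is a hedged illustration only: the concrete bit patterns assigned to each register below are an assumption of this sketch and should be checked against [12], and the function names are hypothetical.

```python
# Assumed mapping of creg bit patterns to condition registers (verify against [12]).
CREG_TO_REGISTER = {
    0b001: "B0",
    0b010: "B1",
    0b011: "B2",
    0b100: "A1",
    0b101: "A2",
    0b110: "A0",  # usable only under certain conditions, per the text
}
UNCONDITIONAL = 0b000  # assumed pattern for unconditional execution


def describe_guard(creg, z):
    """Return a readable description of the guard encoded by creg and z."""
    if creg == UNCONDITIONAL:
        return "unconditional"
    register = CREG_TO_REGISTER.get(creg)
    if register is None:
        return "reserved"
    return f"if {register} {'==' if z == 1 else '!='} 0"


def should_execute(creg, z, registers):
    """Evaluate the guard against a dict of register values."""
    if creg == UNCONDITIONAL:
        return True
    if creg not in CREG_TO_REGISTER:
        raise ValueError("reserved creg encoding")
    is_zero = registers[CREG_TO_REGISTER[creg]] == 0
    return is_zero if z == 1 else not is_zero


print(describe_guard(0b100, 0))              # if A1 != 0
print(should_execute(0b001, 1, {"B0": 0}))   # True
```

Because every instruction word carries these four bits, the guard costs coding space even in code that never executes conditionally, which is the code density drawback that selective predicated execution addresses.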
The instruction example in Figure 1 shows the leading
three bits labeled creg used to code one of the registers.
The z bit following the creg is used to decode whether
the test takes place for equal to zero (z=1) or not equal
to zero (z=0).
Each instruction consumes 4 bits to code the condition
for the predicated execution, which has influence on
the code density. The implementation to encode static
registers is useful for the scheduler of the C-Compiler
which has a certain freedom of reordering instructions,
which is especially necessary for a VLIW architecture
supporting the execution of several instructions in
parallel. The limitation on a few registers of the
register-file supporting predicated execution leads to a
restricted use of these registers.
3.2. Starcore SC140
The architecture of the SC140 supports full guarding
[13]. Instead of spending coding space in each instruction,
the prefix (already used to build up the execution
bundle) is used to code the condition. An execution bundle
consists of those instructions executed during the same clock
cycle. Two subsets per execution bundle are
possible (even and odd). In the assembly syntax, three
instructions are available. IFF is used for instructions
of the current set which will be executed, if the flag T
is equal to zero. If T is one, the instructions are handled
as NOP. The IFT instruction is used for the inverse
function. If T is equal to one, the instructions will be
executed, if T is equal to zero the instructions will be
treated as NOP instructions. The IFA is used for
instructions of the same execution bundle, which are
executed unconditionally.
The predicated execution implementation of the
Starcore SC140 consumes less code space for
implementing predicated execution, but the limitation
on the status of T leads to a significant limitation for
efficient instruction scheduling.
4. Selective Predicated Execution
Selective predicated execution based on FSEL is
implemented as a separate instruction, which allows
the instructions of the same execution bundle to be executed
conditionally (as illustrated in Figure 2). Therefore the
disadvantage of additional coding space (as pointed out
in section 2) is restricted to sections where predicated
execution provides added value.
4.1. Architecture
Referring to section 2, the proposed concept
supports partial guarding, which means that not all of
the instructions can be executed conditionally.
In contrast to the definition of partial guarding by
Pnevmatikatos and Sohi [11], all instructions with
exception of the program flow instructions can be
conditionally executed. The FSEL instruction is part of
the program flow execution slot and therefore no
program flow instruction can be part of the execution
bundle. To enable conditional branch instructions the
condition is coded in the instruction word itself.
Figure 2: Influence of FSEL on instruction slots
In Figure 2 the influence of the FSEL instruction is
illustrated. The FSEL instruction contains the
execution condition for the instructions of the same
execution bundle. However, not all of the instructions
of the execution bundle have to be executed
conditionally. Therefore the FSEL instruction supports
coding space to enable unconditional execution of
instructions in parallel (don’t care section).
4.2. Code example
The code example in Figure 3 illustrates the feature of
the FSEL instruction. An if-then-else construct is well
suited for this purpose. If the condition is true, the first
instruction shall be executed, if not then the second
one. On the right side in Figure 3 the related assembly
code for the chosen DSP concept can be seen.
Assuming a five stage pipeline, two branch delays will
decrease the system performance. In the example of
Figure 3 the worst case scenario has been pointed out:
none of the available branch delays caused by branch
instructions can be filled with useful instructions (NOP
instructions are inserted). Assuming the if-then-else
construct in Figure 3 as part of a longer code section,
some of the branch delays get filled with instructions
executed anyway. In this example the available
resources of the DSP core cannot be used. Therefore
the short program sequence has to be executed
sequentially.
Figure 3: Code example
Using FSEL the if-then-else construct can be coded
within one assembly line and executed within one clock
cycle, as illustrated in Figure 4. The dc (don’t care)
section is not used for this example but can be used for
instructions executed unconditionally.
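The effect of FSEL on one execution bundle can be sketched as a behavioral model. The sketch assumes a bundle split into a section executed when the condition holds, a section executed when it does not, and a don't care section executed in either case; the names `FselBundle`, `execute` and `set_reg` are hypothetical and the real FSEL encoding and assembly syntax are not reproduced here. Operations are applied one after another, which ignores the parallel read semantics of a real VLIW bundle.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

State = Dict[str, int]
Operation = Callable[[State], None]


@dataclass
class FselBundle:
    """One execution bundle guarded by a hypothetical FSEL-style selector."""
    condition: Callable[[State], bool]
    if_true: List[Operation] = field(default_factory=list)
    if_false: List[Operation] = field(default_factory=list)
    dont_care: List[Operation] = field(default_factory=list)


def execute(bundle, state):
    """Execute the bundle as a single cycle; return the number of operations issued."""
    taken = bundle.condition(state)  # evaluated once for the whole bundle
    selected = bundle.if_true if taken else bundle.if_false
    issued = 0
    for operation in list(selected) + list(bundle.dont_care):
        operation(state)
        issued += 1
    return issued


def set_reg(name, value):
    """Build an operation that writes a constant to a register."""
    def operation(state):
        state[name] = value
    return operation


# if (a > 0) r = 1; else r = -1;  plus one unconditional operation in the dc section
bundle = FselBundle(
    condition=lambda s: s["a"] > 0,
    if_true=[set_reg("r", 1)],
    if_false=[set_reg("r", -1)],
    dont_care=[set_reg("count", 7)],
)
state_pos = {"a": 5, "r": 0, "count": 0}
state_neg = {"a": -3, "r": 0, "count": 0}
execute(bundle, state_pos)
execute(bundle, state_neg)
print(state_pos["r"], state_neg["r"])  # 1 -1
```

Both branches of the if-then-else construct share one bundle, so no branch instruction and no branch delay NOPs are needed, while the don't care operation runs regardless of the condition.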
instructions. The number of execution cycles can be
decreased (less branch delays) which reduces the
required clock frequency for executing an algorithm. A
reduced number of branch delay NOPs leads to
reduced switching activity at the program memory port.
Lower clock frequency and less switching activity at
the program memory port decrease the power
dissipation of the DSP subsystem.
Traditional implementations of predicated execution
feature poor code density. Selective predicated
execution as introduced in this paper provides the
advantages of predicated execution by reducing the
number of unused branch delays, without decreasing
code density. Selective predicated execution based on
FSEL is part of a project for a configurable DSP core.
Figure 4: Code example using FSEL
Besides increasing code density (no NOP
instructions are inserted), the number of execution
cycles can be reduced. Both aspects have influence on
the power dissipation of the DSP subsystem. Fetching
fewer instructions reduces the switching activity at the
program memory port. Less cycles for executing a
program reduces the required clock frequency.
5. Results
8. References
In Table 1 some benchmark examples are illustrated.
The first column contains the name of the chosen
algorithm. The remaining columns contain relative
numbers in %. The benchmark results are generated
once without using predicated execution and once
making use of selective predicted execution.
algorithm
Nr.of
bundle (%)
Blowfish
Dspstone
Efr
Huffmann
Serpent
98,4
98,8
91,0
88,6
95,6
Nr.
of
branch
delay
NOPS (%)
80,8
94,6
76,6
79,9
88,9
[1] P.Lapsley, J.Bier, A.Shoham and E.A.Lee, DSP
Processor Fundamentals, Architectures and Features,
IEEE Press, New York, 1997.
[2] Smith J.E., A study of branch prediction strategies, in
Proc. 8th ISCA, pp.135-48, 1981.
[3] Albert D. and Avnon D., “Architecture of the Pentium
Microprocessors”, IEEE Micro, June 1993.
Code size
(%)
[4] Heinrich J., MIPS1000 Microprocessor Users Manual
Alpha Draft 11.Oct, MIPS Technologies Inc., Mountain
View. Ca, 1994
99,2
100,1
98,3
99,3
100,9
[5] Motorola Inc., Power PC620 RISC Microprocessor
Technical Summary, MPC 620/D, Motorola Inc., 1994
[6] Lee J.K.F. and Smith A.J., “Branch prediction strategies
and branch target buffer design”, Computer 17(1), pp.622, 1984.
[7] Stephens C.,Cogswell B. Heinlein J., Palmer G. and
Shen J.P., Instruction level profiling and evaluation of
the IBM RS/6000. In Proc. 18th ISCA, pp.137-46, 1991.
Table 1: Benchmark results
The results in the table indicate a reduction of
execution bundles and a reduction of necessary branch
delay NOPs by using selective predicated execution.
The influence on code density is neglectable. Thus, the
use of selective predicated execution based on FSEL
allows increasing system performance by reducing the
number of branch delays without decreasing code
density.
[8] Yeh T.-Y. and Patt Y.N., Alternative implementations of
two-level adaptive branch predictions. In Proc. 19th
ISCA, pp.124-34, 1992.
[9] Grohoski G.F., “Machine organization of the IBM RISC
System/6000 processor”, IBM J.Res. Develop., 34(1),
Jan., 37-58, 1990.
[10] Edenfield R.W., Gallup M.G., Ledbetter Jr., W.B.,
Mc.Garity R.C., Quintana E.E. and Reininger R.A., ”The
68040 processor”, IEEE Micro, pp. 66-78, Feb. 1990.
6. Acknowledgement
[11] Pnevmatikos D.N. and Soshi G.S., ”Guarded Execution
and branch prediction in dynamic ILP processors”, In
Proc. 21st ISCA, pp. 120-9, 1994.
The work has been supported by European
Commission with the project SOC-Mobinet (IST-200030094) and the CDG Gesellschaft.
7. Conclusion
[12] Texas Instruments, CPU and Instruction Set Reference
Guide, SPRU189B, Texas Instruments, July 1997.
Predicated execution can be used to reduce the number
of branch delays by reducing the number of branch
[13] Motorola Inc. and Lucent Technologies Inc. SC140
DSP Core Reference Manual, MNSC140CORE/D,
Rev.0, 12.1999.
PUBLICATION 7
C. Panis, G. Laure, W. Lazian, H. Grünbacher, J. Nurmi, “A Branch File for a Configurable
DSP Core”, in Proceedings of the International Conference on VLSI (VLSI’03), Las Vegas,
Nevada, USA, June 23-26, 2003, pp. 7-12.
©2003 CSREA. Reprinted, with permission, from proceedings of the International
Conference on VLSI.
VLSI-03 International Conference
7
A Branch File for a Configurable DSP Core
C. Panis
Carinthian Tech
Institute
[email protected]
G. Laure, W.Lazian
Infineon Technologies
[email protected]
[email protected]
Abstract
Increasing system complexity of System-on-Chip
applications leads to an increasing requirement on
powerful embedded processors. Low-cost applications do
not allow using more than one embedded core and
therefore the DSP processor has to handle also control
code and configuration sections efficiently. One aspect of
control code is an increased number of branch
instructions compared with classical DSP functions like
FIR filtering. Pipelined processors have obstacles during
branch instruction handling: branch delays. Branch
delays can lead to a severe performance limitation. A
possibility to reduce branch delays is to reduce the
number of branch instructions. Conditional branches
caused by if-then-else constructs can be removed by using
predicated or conditional execution. For efficient use of
predicated execution a wide range of conditions has to be
available. To build up the conditions status flags are
necessary. This paper describes the architecture of a
branch file. The branch file contains the status flags,
which can be used for conditional branch instructions
and for predicated execution. During architecture
definition an efficient use by the C-Compiler has been
taken into account. The advantage of using a separate
branch file is pointed out and the handling of the status
flags during exception handling described. The branch
file is part of a project for a configurable DSP core.
Keywords
Branch File, Predicated execution, conditional instruction
H. Gruenbacher
Carinthian Tech
Institute
[email protected]
J. Nurmi
Tampere University of
Technology
[email protected]
1. Introduction
The increasing complexity of SoC (System-on-Chip)
applications increases the demand for powerful,
software-programmable embedded processor cores. Low-cost
applications do not allow more than one processor;
therefore the DSP core has to handle an increasing share
of control code. To fulfill the requirements of the
applications, the reachable clock frequency of DSP cores
has been increased by deeper pipelines. The possibility
to execute several instructions in parallel additionally
increases the computational bandwidth of embedded
processors.
Compared with classical DSP functions like filtering,
control code shows an increased number of branch
instructions. Pipelined processors face an obstacle when
executing branch instructions: branch delays [1]. Branch
delays are caused by taken branch instructions, which
break the linear program flow. The instructions located
after the branch instruction are either fetched but not
executed, or a number of instructions have to be moved
into these delay slots, where they are executed
independently of whether the condition evaluates to true
or false. One third of the conditional branch
instructions are caused by if-then-else constructs and
can therefore be removed by predicated execution.
Especially in VLIW architectures, which support the
execution of several instructions in parallel, predicated
execution can prevent severe performance limitations
caused by unusable branch delays. To build up different
conditions a wide range of status flags is necessary.
The drawback of predicated execution is the increased
code size [2] needed to encode the
condition. For the proposed DSP core this problem has
been solved by adding the condition only in code sections
that can benefit from predicated
execution.
This paper describes the architecture and
implementation of a branch file, which is a separate
register file containing the status flags. The proposed
concept is suited for a C-Compiler, which can efficiently
make use of the provided predicated execution and the
available flags for conditional branch instructions.
In the first section the configurable DSP core is
introduced and the implementation of predicated
execution is described. The second part briefly
introduces existing implementations. The third part covers
the architecture of the branch file, introducing the
different types of flags which are supported and pointing
out the advantage of the chosen implementation during
exception handling.
2. Motivation
This section illustrates the need for a
branch file. In the first part the proposed configurable
DSP core is briefly introduced. Besides conditional
branch instructions, one use for the status flags located
inside the branch file is predicated execution. Therefore
the chosen implementation of predicated execution for
this DSP core is also illustrated.
2.1. DSP Core Architecture
This sub-section introduces the
configurable DSP core architecture. DSP cores have to
handle control code efficiently in terms of code density
and cycle count. During architecture definition of the DSP
core, the development of an efficient C-compiler has been
considered, efficient in the sense that the overhead
compared with manual assembly coding is less than 10%.
The modified dual Harvard architecture has two
independent data memory busses, which in the chosen
example are 32 bits wide. Instructions get their
source operands from the register file and store their results
back into the register file; data moves between the
register file and data memory are explicitly coded as
separate instructions (load/store architecture, as illustrated
in Figure 1). The RISC-like pipeline consists of three
phases: fetch, decode and execute (including write back). To
obtain higher clock frequencies each of the phases can be
further split over several clock cycles.
Figure 1: DSP Core overview (memory port, register file, execution unit)
The register file is split into three parts (illustrated in
Figure 2). The data register file contains the data, long and
accumulator registers. Two consecutive data registers can
be addressed as a long register (e.g. l0) and, including guard
bits, as an accumulator (e.g. a0). Due to the orthogonality
requirements of the C-Compiler each of the registers can
be used for each of the instructions. The second part
contains the address registers, e.g. r0 (including modifier
registers, e.g. m0, for modulo buffer and FFT addressing
support). The third part is the branch file, which is the
topic of this paper.
Figure 2: Register file
In the example of Figure 3 a 5-stage pipeline is
considered, which splits the fetch phase over the pipeline
stages instruction fetch (IF) and instruction alignment
(AL) and the execution phase over EX1 and EX2. The
operands are read when they are required. An example is
the MAC (multiply and accumulate) instruction, which
calculates the multiply result in EX1 and performs the ADD
operation in EX2.
Figure 3: Pipeline structure
The operands for the multiplier are fetched from the
register file at the beginning of EX1; for the add operation
(used to sum up the multiply results) the accumulator is
fetched at the beginning of EX2. For instructions using
only the first execution stage the write back takes place at
the end of the pipeline stage EX1. This reduces the
define-in-use dependency between the instructions of consecutive
execution bundles (an execution bundle groups instructions
which are executed in parallel) [2]. The second operand fetch
enables efficient calculation of filter structures. Although
the MAC instruction is split over two clock cycles, the
accumulator is fetched at the beginning of pipeline stage
EX2. Therefore the result of the MAC operation located
in the preceding execution bundle can already be used (no
lost clock cycles due to define-in-use dependency).
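The effect of fetching the accumulator late can be checked with a small cycle count. The following is a minimal sketch, not the core's actual timing model: it assumes that a result written at the end of EX2 becomes readable in the following cycle, and the function name `accumulator_stalls` is hypothetical.

```python
def accumulator_stalls(acc_read_stage):
    """Stall cycles between two back-to-back MAC instructions.

    Assumption of this sketch: a value written back at the end of EX2
    can be read from the register file in the next cycle.
    """
    producer_ex1 = 0                     # cycle in which the first MAC is in EX1
    producer_ex2 = producer_ex1 + 1      # first MAC accumulates and writes back
    result_readable = producer_ex2 + 1   # first cycle in which the new value is visible

    consumer_ex1 = producer_ex1 + 1      # second MAC is issued one bundle later
    offset = {"EX1": 0, "EX2": 1}[acc_read_stage]
    wanted_read_cycle = consumer_ex1 + offset

    # The consumer has to wait if it wants the value before it is visible.
    return max(0, result_readable - wanted_read_cycle)


print(accumulator_stalls("EX2"))  # accumulator fetched late, as described in the text
print(accumulator_stalls("EX1"))  # accumulator fetched together with the multiplier operands
```

Under these assumptions the late fetch needs no stall cycle, whereas fetching the accumulator together with the multiplier operands would cost one cycle per dependent MAC.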
2.2. Predicated Execution
In [3][4][5] the branch frequency in certain benchmark
examples (including Espresso, Gcc and Yacc) is analyzed.
The studies distinguish between general purpose and
scientific code. Assuming that applications running on an
embedded DSP core are comparable with scientific code
examples the ratio of branch instructions is about 5-10%,
which means that every 10th to 20th instruction is a branch
instruction. VLIW DSP architectures support the
execution of several instructions in parallel, e.g. for the
example architecture of this paper up to five. Thus, the
branch distance is getting quite low, about two to ten
execution cycles. In control code sections the need for
branches can be even considerably higher, approaching
the situation where almost every VLIW instruction cycle
will execute a branch.
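The branch distance quoted above follows from simple arithmetic. The sketch below assumes evenly spaced branches; the bundle fill levels passed to the hypothetical helper `branch_distance_in_bundles` are illustrative assumptions, not measurements from the paper.

```python
def branch_distance_in_bundles(branch_percent, instructions_per_bundle):
    """Average number of execution bundles between two branch instructions.

    branch_percent: share of branch instructions in percent (e.g. 5 or 10).
    instructions_per_bundle: assumed average number of instructions
    issued in parallel (the example architecture allows up to five).
    """
    instructions_between_branches = 100 / branch_percent
    return instructions_between_branches / instructions_per_bundle


# Fully used bundles of five instructions:
print(branch_distance_in_bundles(10, 5))
print(branch_distance_in_bundles(5, 5))
# Bundles that are only partly filled (assumed two instructions on average):
print(branch_distance_in_bundles(5, 2))
```

With fully used bundles the distance is two to four bundles; the upper end of the range in the text is reached only when the bundles are partly filled.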
Figure 4: FSEL example
The drawbacks of branch instructions in pipelined
processors are branch delays. This can lead to severe
performance limitations.
Figure 5: Influence of FSEL
Pnevmatikos and Sohi have shown in [6] that about
one third of the branch instructions can be replaced by
predicated execution. If-then-else code examples can be
efficiently covered by predicated execution. The proposed
DSP concept uses FSEL for this purpose. In Figure 4 an
example for predicated execution is illustrated. The FSEL
instruction is located in an execution bundle (illustrated in
Figure 5) such that it enables the remaining instructions in
the execution bundle to be executed conditionally.
The FSEL instruction can be used with a wide range
of conditions. The conditions are based on status flags.
The proposed DSP concept supports static flags (discussed
in detail in sub-section 4.2) and dynamic flags (more
details in sub-section 4.3). The status flags are stored in
the branch file. The flags are also used for the conditional
branch instructions, which cannot be replaced by
predicated execution.
3. Implementation examples
This section illustrates implementation
examples of predicated execution on available VLIW
DSP cores. The Texas Instruments C62x family and the
Starcore SC140 have been chosen to represent typical
architectures. Both VLIW DSP architectures provide the
possibility to execute several instructions in parallel, and
therefore predicated execution is mandatory to prevent a
performance loss in code sequences with a high branch
frequency. For both processor cores, the available flags
are analyzed concerning their suitability for use by a
C-Compiler. The same is true for conditional branches,
which use the same flags to build up the condition.
3.1. TI C62x
The C62x architecture of Texas Instruments supports
predicated execution for each instruction (full guarding)
[7]. To obtain full guarding, 3 bits of each instruction
word are used to encode the register whose status is
needed to generate the condition. The possible registers
are B0, B1, B2, A1 and A2. Under certain conditions A0
can also be used. The remaining coding space (with 3 bits
it is possible to encode 8 states) is used to encode
unconditional execution, and one code combination remains
reserved.
Figure 6: TI C62x instruction example (addk - add a
constant)
The instruction example in Figure 6 shows the leading
three bits, labeled creg, used to encode one of the registers.
The z bit following creg specifies whether the
test is for equal to zero (z=1) or not equal to
zero (z=0).
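The guard evaluation described above can be sketched as follows. The numeric creg values in `CREG_TO_REGISTER` and the use of creg = 0 for unconditional execution are assumptions made for illustration; the authoritative encoding is given in the reference guide [7]. Only the meaning of the z bit is taken from the text.

```python
# Hypothetical creg-to-register mapping, for illustration only.
CREG_TO_REGISTER = {1: "B0", 2: "B1", 3: "B2", 4: "A1", 5: "A2"}


def is_executed(creg, z, registers):
    """Decide whether a guarded instruction is executed.

    creg: 3-bit field selecting the condition register.
    z:    1 tests for equal to zero, 0 tests for not equal to zero.
    registers: dict with the current register values.
    """
    if creg == 0:
        # Assumed code for unconditional execution.
        return True
    value = registers[CREG_TO_REGISTER[creg]]
    if z == 1:
        return value == 0
    return value != 0


registers = {"B0": 0, "B1": 3, "B2": 0, "A1": 7, "A2": 0}
print(is_executed(1, 1, registers))  # guard [!B0]: executed because B0 == 0
print(is_executed(4, 0, registers))  # guard [A1]:  executed because A1 != 0
print(is_executed(2, 1, registers))  # guard [!B1]: not executed because B1 != 0
```

The sketch makes the cost visible: every instruction carries the creg and z fields, whether or not it is guarded.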
Each instruction consumes 4 bits to encode the condition
for predicated execution, which has an influence on
code density. Encoding static
registers is useful for the scheduler of the C-Compiler,
which has a certain freedom in reordering instructions;
this is especially necessary for a VLIW architecture
supporting the execution of several instructions in parallel.
The restriction to a few registers of the register file
supporting predicated execution leads to a restricted use
of these registers.
3.2. Starcore SC140
The architecture of the SC140 supports full guarding
[8]. Instead of spending code space in each instruction, the
prefix (already used to build up the execution bundle) is
used to encode the condition. Two subsets per
execution bundle are possible (even and odd). In the assembly
syntax three instructions are available. IFF is used for
instructions of the current set which are executed if
the flag T is equal to zero. If T is one, the instructions are
handled as NOPs. The IFT instruction is used for the
inverse function: if T is equal to one, the instructions are
executed; if T is equal to zero, the instructions are
treated as NOP instructions. IFA is used for
instructions of the same execution bundle which are
executed unconditionally.
The predicated execution implementation of the
Starcore SC140 consumes less code space for
predicated execution. The restricted availability of status
flags for coding different conditions leads to a significant
limitation for efficient scheduling of the instructions (all
conditions depend on the status of T).
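The IFT/IFF/IFA semantics described above can be written down in a few lines. This is a minimal sketch of the behaviour stated in the text only; the data layout (a list of prefix and instruction-list pairs) and the instruction mnemonics in the usage example are hypothetical.

```python
def executed_instructions(groups, t_flag):
    """Return the instructions of one execution bundle that take effect.

    groups: list of (prefix, instructions) pairs, prefix being
            "IFT", "IFF" or "IFA".
    t_flag: current value of the T flag (0 or 1).
    Instructions whose condition fails are treated as NOPs and omitted.
    """
    executed = []
    for prefix, instructions in groups:
        if prefix == "IFA":
            executed.extend(instructions)      # unconditional
        elif prefix == "IFT" and t_flag == 1:
            executed.extend(instructions)      # executed if T is one
        elif prefix == "IFF" and t_flag == 0:
            executed.extend(instructions)      # executed if T is zero
    return executed


bundle = [("IFT", ["add d0,d1,d2"]), ("IFF", ["sub d0,d1,d2"]), ("IFA", ["move d3,d4"])]
print(executed_instructions(bundle, t_flag=1))
print(executed_instructions(bundle, t_flag=0))
```

Because every condition refers to the single flag T, any instruction that changes T between the compare and the guarded bundle invalidates the condition.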
Figure 7: Problem of limited flags
In Figure 7 an example is used to illustrate the related
problem for instruction scheduling. In this example it is
not possible to schedule any instruction influencing the
T-flag between the compare and the conditional
instruction.
4. Branch File
This section describes the architecture of the
branch file for the DSP core proposed in section 2.1 and
discusses the different types of status flags. At the end,
the requirements during exception handling are
mentioned and the advantage of the proposed concept is
pointed out.
4.1. Architecture
The branch file is a register file containing the status
flags for predicated execution and conditional branch
instructions. The branch file has independent read and
write ports, with no dependency on the read and write
ports of the data and address register files. This enables an
independent usage of the registers and the related flags
which represent the status of the register content (as also
illustrated in Figure 12).
In the predicated execution example of Figure 4 the
register d0 is used as destination register for an add
(addition) instruction and in the same execution bundle
(an execution bundle consists of instructions executed
during the same cycle) the status of the register d0 is used
to decide whether the load operation of the accumulator a4
takes place. If the status flags were part of the register file,
the number of read and write ports of the register file would
have to be doubled, which would reduce the reachable clock
frequency.
Figure 8: Static flags (address registers: Z; data registers: S, Z, OF)
The update of the status flags takes place when the
register contents are stored. At the beginning of the
execution stage (in this example at the beginning of EX1,
see Figure 3) the condition is built from the status
flags located inside the branch file. If the condition is true
the conditional instruction is executed; if not, the
instruction is suspended.
Figure 9: Dynamic flags
In Figure 8 the architecture of the branch file for the
proposed DSP core (only the part for the static flags) is
illustrated. The branch file consists of two parts: a part
containing flags associated with registers of the
register file (static flags) and a part influenced by the
program flow (dynamic flags). The next sub-sections
introduce the different flag types and illustrate related
possibilities to increase the core performance.
4.2. Static Flags
Static flags represent the status of the content of
the registers. For each of the registers of the data register
file three flags are stored inside the branch file
(illustrated in Figure 8): a Zero
Flag (Z), a Sign Flag (S) and an Overflow Flag (OF).
The OF flag is not a static
flag, but it is assigned to a register and is therefore
already mentioned in this sub-section. If an instruction
result exceeds the available data width, the OF flag of the
destination register is set. Because the long
registers are a concatenation of two data registers, the flags
of the data registers have to be combined to evaluate the
status of the long register. The flags representing the
status of the accumulator are stored separately.
The address registers are useful for implementing software
counters. Therefore the Zero Flag of the address register is
evaluated and stored inside the branch file to enable a
condition equal to zero.
Using flags related to the destination register of an
instruction enables the instruction scheduler of the
C-Compiler to reschedule instructions (the dependency
between generating the condition and executing the
dependent instruction cannot be removed, but there is no
need to keep them together).
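How per-register flags might be derived and combined can be sketched as follows. Several points are assumptions of this sketch rather than statements of the paper: the 16-bit data register width, the two's complement overflow test, and the rule that a long register is zero when both halves are zero and takes its sign from the upper half. The class name `RegisterFileWithFlags` is hypothetical.

```python
DATA_WIDTH = 16                      # assumed width of a data register
MASK = (1 << DATA_WIDTH) - 1


class RegisterFileWithFlags:
    """Data registers whose Z, S and OF flags are updated on every write."""

    def __init__(self):
        self.values = {}
        self.flags = {}

    def write(self, name, result):
        wrapped = result & MASK
        in_range = -(1 << (DATA_WIDTH - 1)) <= result < (1 << (DATA_WIDTH - 1))
        self.values[name] = wrapped
        self.flags[name] = {
            "Z": wrapped == 0,                          # zero flag
            "S": bool(wrapped >> (DATA_WIDTH - 1)),     # sign flag (top bit)
            "OF": not in_range,                         # result exceeded the data width
        }

    def long_flags(self, low, high):
        # Assumed combination rule for a long register built from two data registers.
        return {
            "Z": self.flags[low]["Z"] and self.flags[high]["Z"],
            "S": self.flags[high]["S"],
        }


rf = RegisterFileWithFlags()
rf.write("d0", 0)
rf.write("d1", -1)
print(rf.flags["d0"], rf.flags["d1"])
print(rf.long_flags("d0", "d1"))
```

Since the flags belong to a named register rather than to the last executed instruction, a later instruction can test them without having to directly follow the producing instruction.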
4.3. Dynamic Flags
Dynamic flags are related to the program flow.
Examples are the OF flag already introduced above and
the loop status flags (FL - first loop, LL - last loop).
Some examples are illustrated in Figure 9. The proposed
DSP core supports zero-overhead loop handling. When
the loop is executed for the first time, the FL flag is set.
This makes it possible to schedule instructions into the loop
body and to use predicated execution to execute them only
once.
In Figure 10 a loop example is illustrated. In the
chosen example, software pipelining techniques have
already been applied to reduce the load-in-use dependency
(between the load instructions and the related arithmetic
instructions). In this example it was possible to hide the
load-in-use dependency; in other code sections this will
not be possible, with the drawback of introducing NOP
instructions, which means wasted clock cycles and wasted
instruction space, because NOP instructions also have to
be encoded.
In Figure 10 the load operation of the loop body is
executed for the first time before the loop starts; therefore the
loop has to be executed one time less. The add instruction
is located directly in front of the loop body (for this core
architecture the instruction can be scheduled in
parallel to the loop instruction). The add operation is
executed only once.
Figure 10: Loop example
In Figure 11 the same example is shown, but this time
using the flag FL. The add instruction can be shifted into
the loop body. The add operation is still executed
only once (during the first execution of the loop). Using the
flag in combination with predicated execution, it is
possible for this example to reduce the number of cycles.
Figure 11: Loop example using FL
The drawback of an increased loop body, and therefore
the need for additional fetch cycles during loop body
execution, is compensated by an instruction buffer [9].
This buffer is used to execute loop bodies without the
need for further fetch operations from memory, which
would lead to increased power dissipation. During the last
execution of the hardware loop the LL flag is set.
As for the first loop execution, it is possible to
schedule instructions into the loop body and execute them
only during the last iteration of the loop.
For each of the loop nesting levels a pair of loop flags
is available, indicated by the number in front of the status
flags.
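The use of the loop flags as guards can be illustrated with a small model of one hardware loop. This is a sketch under simplifying assumptions: one loop nesting level, a loop body given as a list of guard and operation pairs, and a hypothetical function name `run_hardware_loop`. It does not model cycle counts or the instruction buffer.

```python
def run_hardware_loop(iterations, body):
    """Execute a loop body whose instructions may be guarded by FL or LL.

    body: list of (guard, operation) pairs; guard is "FL", "LL" or None.
    Returns a trace of (iteration, operation) for executed operations.
    """
    trace = []
    for i in range(iterations):
        flags = {
            "FL": i == 0,                # set during the first loop execution
            "LL": i == iterations - 1,   # set during the last loop execution
        }
        for guard, operation in body:
            if guard is None or flags[guard]:
                trace.append((i, operation))
    return trace


body = [("FL", "add"), (None, "mac"), ("LL", "store")]
for entry in run_hardware_loop(3, body):
    print(entry)
```

In this model the add is executed only in the first iteration and the store only in the last one, although both are part of the loop body.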
4.4. Exception handling
During exception handling (handling of interrupt
service routines or task switching) the static and
dynamic flags are handled differently.
The static flags are updated each time a value is
stored into the register file. Therefore it is not necessary to
explicitly save the static flags into data memory during
exception handling. For example, at the beginning of a task
switch all registers of the register file are stored into data
memory. The register values used for the new task are
fetched from memory and stored into the register file.
During the store operation into the register file the static
flags are automatically updated. When switching back to
the first task, only the registers of the second task have to
be stored to data memory, and the register contents of the
first task are fetched from memory and again stored into
the register file. The static flags are again automatically
updated. Therefore the static flags do not have to be
considered during exception handling. The content of the
static flags is always consistent with the register
contents.
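The task switch described above can be sketched in a few lines. The class `Core`, its method names and the restriction to the Z and S flags are assumptions of this sketch; the point it illustrates is only that the saved context contains register values and no flags.

```python
def static_flags(value):
    # Flags derived purely from the register content.
    return {"Z": value == 0, "S": value < 0}


class Core:
    """Register file with a branch file that is updated on every register write."""

    def __init__(self):
        self.registers = {}
        self.branch_file = {}

    def write(self, name, value):
        self.registers[name] = value
        self.branch_file[name] = static_flags(value)   # updated in parallel

    def save_context(self):
        # Only register values go to data memory; static flags are not saved.
        return dict(self.registers)

    def restore_context(self, context):
        for name, value in context.items():
            self.write(name, value)                    # flags follow automatically


core = Core()
core.write("d0", 0)
core.write("d1", -5)
flags_task_a = dict(core.branch_file)

context_a = core.save_context()
core.restore_context({"d0": 3, "d1": 0})   # switch to a second task
core.restore_context(context_a)            # switch back to the first task

print(core.branch_file == flags_task_a)
```

After switching back, the branch file again matches the state of the first task, although no flag was ever written to data memory.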
Figure 12: Handling of static flags
The dynamic flags (including the overflow flag) are
dependent on the program flow and therefore it is
necessary to take care of them during exception handling.
This can be done by regular load/store instructions.
In Figure 12 the mechanism described above is
illustrated. In parallel to the update of the registers in the
register file the related entries in the branch file are
updated. This is true for values fetched from
data memory as well as for results of arithmetic operations.
The advantages of generating the flags in parallel are
consistency of the register content and the related branch
file entries in each cycle, availability of the flags as soon as
they are requested for conditional operations, and no need
for explicit instructions to take care of the static flags
during exception handling.
5. Conclusion
The branch file described in this paper is used to store
dynamic and static flags. Besides the conditional branch
instructions which can make use of the flags of the branch
file, predicated execution can be used to reduce the
number of branch instructions, especially in typical control
code examples like if-then-else constructs. The branch file
with separate read and write ports makes it possible to use the
register content and the status flags independently without
influencing the reachable clock frequency by doubling the
read and write ports of the data register file. The
register-related status flags enable the C-Compiler to reschedule
instructions and therefore to reduce the number of
necessary execution cycles. During exception handling the
static flags do not have to be taken care of. Updating the
register content automatically updates the static flags
in the branch file. The automatic flag handling reduces the
necessary overhead during task switching and handling of
interrupt service routines. The branch file is part of a
configurable DSP core.
References
[1] P.Lapsley, J.Bier, A.Shoham and E.A.Lee, DSP
Processor Fundamentals, Architectures and Features,
IEEE Press, New York 1997.
[2] Dezso Sima, Terence Fountain, Peter Kacsuk,
Advanced Computer Architectures: A Design Space
Approach, Addison Wesley Publishing Company,
Harlow, 1997.
[3] Lee J.K.F. and Smith A.J., “Branch prediction
strategies and branch target buffer design”, Computer,
17(1), 6-22, 1984.
[4] Stephens C.,Cogswell B. Heinlein J., Palmer G. and
Shen J.P., “Instruction level profiling and evaluation
of the IBM RS/6000”, in Proc. 18th ISCA, pp.137-46,
1991.
[5] Yeh T.-Y. and Patt Y.N., “Alternative
implementations of two-level adaptive branch
predictions”, in Proc. 19th ISCA, pp.124-34, 1992.
[6] Pnevmatikos D.N. and Sohi G.S., ”Guarded
Execution and branch prediction in dynamic ILP
processors”, in Proc. 21st ISCA, pp. 120-9, 1994.
[7] Texas Instruments, CPU and Instruction Set
Reference Guide, SPRU189B, Texas Instruments, July
1997.
[8] Motorola Inc. and Lucent Technologies Inc. SC140
DSP Core Reference Manual, MNSC140CORE/D,
Rev.0, 12.1999.
[9] C.Panis, R.Leitner, H.Grünbacher, J.Nurmi, “xLIW –
a Scaleable Long Instruction Word”, ISCAS 2003,
Bangkok, Thailand, 05/2003.
PUBLICATION 8
C. Panis, R. Leitner, J. Nurmi, “A Scaleable Shadow Stack for a Configurable DSP
Concept”, in Proceedings The 3rd IEEE International Workshop on System-on-Chip for
Real-Time Applications (IWSOC), Calgary, Canada, June 30-July 2, 2003, pp. 222-227.
©2003 IEEE. Reprinted, with permission, from proceedings of the IEEE International
Workshop on System-on-Chip for Real-Time Applications.
Scaleable Shadow Stack for a Configurable DSP Concept
Christian Panis
Carinthian Tech Institute
[email protected]
Raimund Leitner
Infineon Technologies
[email protected]
Jari Nurmi
Tampere University of Technology
[email protected]
0-7695-1944-X/03 $17.00 © 2003 IEEE
Abstract
SoC (System-on-Chip) applications map complex
system functions onto a single die. The increasing
importance of flexibility in SoC applications leads to a
rising portion being implemented in firmware. Therefore, the
demand for computational power of the embedded
processors in the application is increasing. The newest
silicon technologies (e.g. 0.13 µm and lower) help to
increase the reachable frequency, but the demand cannot
be sufficiently satisfied. One approach to increase the
processor frequency is the introduction of pipelining. To
guarantee data consistency in deeply pipelined processors
different methods have been developed. Additional
complexity is introduced by the occurrence of interrupts.
This paper describes a concept to enable data consistency
between the instructions of different pipeline stages in
pipelined DSP kernels during interrupt service routines,
without the interaction of the DSP itself and with no
restrictions concerning the nesting level of the interrupts.
The scaleable shadow stack is part of a development
project for a configurable DSP concept.
1. Introduction
Today’s applications map complex systems onto a single
die and therefore the complexity of integrated circuits
is increasing. The importance of software/firmware in
applications is increasing, mainly driven by higher
flexibility requirements, which leads to an increasing
relevance of embedded processors [1]. To meet the
frequency requirements for the embedded processors, the
frequency increase gained from the smaller feature size of
the newest CMOS technologies is not sufficient.
Architectural features like pipelining
(introduced originally in the late 1960’s − IBM 360/91,
1967; CDC 7600, 1970) help to increase the core
frequency by reducing the number of actions during one
clock cycle [2][3].
The application requirements on DSP subsystems are
changing. On the one hand, the classical number crunching
functions like FIR, IIR and FFT will be implemented in
dedicated HW. On the other hand, many applications,
especially in the low-cost area, do not justify the use of
more than one processor core.
Therefore the configuration of these dedicated
hardware blocks has to be done by the DSP and, due
to the missing micro-controller, more control code has to
be executed. Due to these facts, interrupts are becoming
more important and generate a lot of additional problems
in DSP architectures. Besides real-time requirements, a
lean task switch and several interrupt nesting levels have
to be supported.
This paper describes a mechanism to overcome the
data consistency problem between different pipeline
stages for a DSP concept with no restrictions on the
nesting level of interrupts. The so-called scaleable shadow
stack needs no interaction from the DSP kernel itself and
therefore no MIPS or program memory has to be spent.
In the first part the DSP architecture is briefly
introduced. The structure of the pipeline is explained
in more detail to illustrate the problems of nested
interrupts and data consistency between the different
pipeline stages. The second part explains the
architecture of the scaleable shadow stack, describes the
structure of the stack packets and outlines the swapping
mechanism. The third part contains an analysis of the
nesting level at which the proposed concept leads to an
advantage compared with classical approaches.
2. DSP Architecture
This section briefly introduces the DSP
architecture, which is the basis for the scaleable shadow
stack. To meet the requirements of handling control
code on a DSP, a C-compiler entry is mandatory.
Therefore during architecture definition the development
of an efficient C-compiler has been considered (efficient
in the sense that the overhead compared with manual
assembly coding is less than 10%). The modified dual
Harvard architecture has two independent data memory
busses, in the chosen example 32 bits wide.
The instructions will get the source operands from the
register file and store the results again into the register
file, and the data moves between the register file and data
memory are explicitly coded as separate instructions
(load/store architecture). The RISC-like pipeline has a
split execution stage to prevent a timing critical path
inside the execution unit. These two pipeline stages are
called EX1 and EX2 as illustrated in Figure 1.
If the register used for op2 of instruction t+1 will be
the same as the result register of instruction t, then the
result of instruction t will already be used as operand of
instruction t+1. This increases the utilization of available
registers and decreases the data dependency between the
instructions of different pipeline stages.
2.1. Pipeline structure
If an interrupt disrupts the linear program flow an
interrupt service routine (ISR), has to be executed. The
instructions which are already being served in the pipeline
will be finished and their results stored into the register
file. Then the ISR will be executed and after finishing the
interrupt request the original program flow can be
resumed.
The handling of an ISR is illustrated in Figure 3. The
results of instruction t will be stored at the end of EX2 and
the ISR can be started. After finishing the program of the
ISR the instruction t+1 (of Figure 2) will be handled in
t+n. The ISR itself takes care of saving and restoring the
contents of the registers that it needs to occupy.
2.2. Interrupt handling
To increase the reachable core frequency of the DSP
kernel, the execution stage of the RISC-like pipeline
structure has been split into two parts. For example the
MAC instruction (Multiply-Accumulate) is split into the
Multiply operation calculated during the EX1-stage and
the Accumulation operation calculated during the EX2stage.
To reduce the define-use [4] dependency the operands
for the instruction will be fetched from the register file as
late as possible. This enables e.g. the implementation of a
FIR filter without any stall cycles due to data dependency.
Figure 1: Pipeline structure
For the multiply operation the multiplicand and
multiplier will be fetched at the beginning of the EX1stage, for the final accumulation the operand will be
fetched at the beginning of the EX2-stage. At the end of
pipeline stage EX2 the result of the MAC instruction will
be stored into the register file. In Figure 2 two consecutive
cycles are used to illustrate the data dependency between
the instructions of different clock cycles (t, t+1).
Figure 3: Interrupt handling
Due to the delayed execution (from the system point of
view) of the instruction t+1 (now t+n) the operand op1
has been changed by instruction t, when writing the result
at the end of EX2. Therefore the instruction t+1 (now
t+n) will give a different result if the instruction will be
executed in t+1 or in t+n (delayed by the execution of the
ISR).
Figure 4: Register file
Figure 2: Data dependency
An example with register values is now used to
illustrate the problem. In this example a MAC instruction
will be followed by an ADD instruction. The register file
in this example is built up as in Figure 4. The example is
illustrated in Figure 5. Two consecutive (16 bit wide) data
registers (e.g. d0, d1) can be accessed as long register l0
(32 bit wide). The long register and the guard bits (gb)
together can be addressed as accumulator a0.
The operand op1 of the instruction t+1 cannot already
contain the result of the instruction t. This enables the use
of the same register of the register file for 2 instructions,
first to keep the operands for instruction t+1 and then the
result of instruction t.
[Figure 5 content. Register values: d0 := 2, d1 := 3, d3 := 5, d6 := 7, d2 := 4. Instruction t: mac d0,d1,d3; instruction t+1 (t+n after the ISR): add d6,d2,d4. Normal program flow: a3 := 2*3 + 5, d4 := 7 + 4. Flow with the ISR between MAC and ADD: a3 := 2*3 + 5, d4 := 11 + 4.]
consistency problem. The number of possible interrupt
nesting levels is limited to the available register resources.
3. Scaleable shadow stack
The scaleable shadow stack can be used to solve the
described problem in a more elegant way with no
restrictions to the nesting level of interrupts. The
following section is used to explain the architecture of the
stack, the structure of the shadow stack packet in detail
and the fine-tuning mechanism, which enables application
specific optimization.
3.1. Architecture
Figure 7 illustrates the integration of the shadow stack
into the core architecture. If an interrupt service routine
has to be executed, the results of the instructions in the
last cycle before the ISR are not stored into the register
file.
Figure 5: Example
In the example of Figure 5 the source register d6 of the
ADD instruction will be physically part of the result
register a3.
The ADD instruction will get the first operand at the
beginning of EX1 and the second at the beginning of EX2
(in order to reduce the number of read ports of the register
file). The normal program flow (left column) gives a
different result compared to the flow where the ISR has
been executed between MAC and ADD instruction (right
column).
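The two flows of Figure 5 can be reproduced with a small Python model. It is only a sketch under simplifying assumptions: the pipeline is reduced to the EX1 and EX2 stages, the accumulator a3 is modelled by the data register d6 that it physically contains, and the function name `run` and the cycle numbering are made up for this illustration.

```python
def run(add_ex1_cycle):
    """Return d4 when the ADD enters EX1 in the given cycle (MAC enters EX1 in cycle 0)."""
    regs = {"d0": 2, "d1": 3, "d2": 4, "d3": 5, "d4": 0, "d6": 7}
    instrs = [
        # mac: multiply operands read in EX1, accumulate operand read in EX2;
        # the result register a3 is modelled by d6 (assumption of this sketch)
        {"ex1": 0, "srcs1": ("d0", "d1"), "src2": "d3", "dst": "d6",
         "fn": lambda a, b, c: a * b + c},
        # add d6,d2,d4: first operand read in EX1, second operand read in EX2
        {"ex1": add_ex1_cycle, "srcs1": ("d6",), "src2": "d2", "dst": "d4",
         "fn": lambda a, c: a + c},
    ]
    last_cycle = max(i["ex1"] for i in instrs) + 1
    for cycle in range(last_cycle + 1):
        # beginning of the cycle: late operand fetch from the register file
        for ins in instrs:
            if cycle == ins["ex1"]:
                ins["op1"] = [regs[r] for r in ins["srcs1"]]
            elif cycle == ins["ex1"] + 1:
                ins["op2"] = regs[ins["src2"]]
        # end of the cycle: instructions in EX2 write their result back
        for ins in instrs:
            if cycle == ins["ex1"] + 1:
                regs[ins["dst"]] = ins["fn"](*ins["op1"], ins["op2"])
    return regs["d4"]
```

`run(1)` models the normal flow (the ADD directly follows the MAC) and yields 11, while `run(5)` models the ADD being delayed by an ISR and yields 15, because d6 has already been overwritten by the MAC result.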
2.3. Shadow register
One approach to overcome this problem is to save the
results of the instructions in cycle t before starting the ISR.
The results can be stored in a register during the execution of the ISR.
After finishing the interrupt service, the values have to be
restored into the register file. If the number of nesting
levels for the interrupt handling is more than one, for each
nesting level an additional storing place has to be
prepared, as shown in Figure 6. Today’s VLIW
architectures provide a lot of instruction level parallelism
and therefore a lot of possible parallel results.
Figure 7: Shadow stack integration
In the example of Figure 7, four possible results have
to be saved in the shadow stack. This will be done without
any instructions in the DSP program. After finishing the
ISR the register contents previously saved in the shadow
stack will be restored in the register file of the DSP core.
To prevent any data inconsistency as described in the
section before, the write-back has to be handled in the
correct cycle of the pipeline.
Figure 6: Shadow register
For each of the instructions which can be executed in
parallel and for each nesting level such a field of
intermediate registers has to be prepared [5]. This leads to
a lot of additional silicon area to handle the data
Figure 8: Shadow stack structure
The shadow stack has access to the data memory ports
of the DSP subsystem. The already available memory can
be used to store parts of the shadow stack content into the
memory. This has to be done to free up space for the next
results (initiated by the next nested interrupt). When the
storing actually becomes necessary depends on the size of
the shadow stack. Details of the swapping mechanism will
be discussed in the next subsection.
From the core point of view the shadow stack contains
the following information:
• register content: the result values of the last
instructions before the ISR execution
• target address: the target register which should be
written with the register content after finishing the ISR
• interrupt level: if several levels of interrupts are
possible, then the assignment of the stack entries to a
certain level must be possible
Figure 9: Packet structure
To compose the data in the right manner, when they
will be fetched back from the data memories, it is
necessary to know where they have been stored before.
Registers containing more than 20 bits payload have to
be split in parts of 20 bits, to fit into the illustrated raster.
To dimension the shadow stack for optimal utilization a
worst case estimation has to be done (how many shadow
stack packets have to be handled for one interrupt level at
the same time). This analysis also has to be done to set the
thresholds for the swapping mechanism.
The shadow stack structure is depicted in Figure 8. For
the core, the stack looks like a hardware stack, with
pointers which handle the stack administration. The begin
shadow ptr will be incremented if packets have to be
stored on the stack, and decremented if data has to be
restored into the register file. The end shadow ptr will be
used to handle the data exchange with the memories,
while swapping stack contents to the data memories of the
DSP core.
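The two-pointer administration can be sketched in Python with a double-ended queue. This is an illustrative model, not the hardware implementation: the class name `ShadowStack`, its method names and the list standing in for the data memory are assumptions, and swapping is done immediately instead of in free memory cycles.

```python
from collections import deque

class ShadowStack:
    """Toy model: right end = begin shadow ptr side, left end = end shadow ptr side."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()
        self.memory = []              # stands in for the DSP data memory

    def save(self, packet):
        """Store a packet on the stack (begin shadow ptr is incremented)."""
        if len(self.entries) == self.capacity:
            self.swap_out()
        self.entries.append(packet)

    def restore(self):
        """Give a packet back to the register file (begin shadow ptr is decremented)."""
        if not self.entries:
            self.swap_in()
        return self.entries.pop()

    def swap_out(self):
        """Move the oldest packet to the data memory (end shadow ptr side)."""
        self.memory.append(self.entries.popleft())

    def swap_in(self):
        """Fetch the most recently swapped packet back from the data memory."""
        self.entries.appendleft(self.memory.pop())
```

Saving three packets into a stack of capacity two moves the oldest packet to the memory; restoring three times returns the packets in last-in, first-out order.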
3.3. Thresholds
The thresholds of the shadow stack structure can be
seen in Figure 10. There are 4 limits, which can be
configured to application specific requirements.
• LIL_MIN: is used to guarantee the availability of the
results in the stack after executing the current
interrupt service routine.
• LIL_MAX: is used to guarantee enough space in the
shadow stack to keep the next values if a new
interrupt occurs.
• BAL_MIN: is an optional limit which will be used to
have a well balanced stack content. If there is no
value assigned, it will be set with LIL_MIN.
• BAL_MAX: is an optional limit which will be used to
have a well balanced stack content. If there is no
value assigned, it will be set with LIL_MAX.
3.2. Shadow stack packet structure
Due to the instruction level parallelism in VLIW
architectures [3] the shadow stack has to handle several
results at the same time (in this example architecture up to
4). Each of the result registers can have different bit width
(e.g. 16 bits for data registers and 40 bits for an
accumulator). The width of the busses to the data
memories of the DSP core is another factor influencing
the structure of the shadow stack packet. In the following
example illustrated in Figure 9 two busses each 32 bits
wide have been assumed. The biggest result register has
the size of 40 bits (accumulator).
The interface to store the packets into the data
memories is 32 bits wide. Therefore the packet structure is
limited to 32 bits. 20 bits will be used for payload (the
register content), 7 bits are reserved to encode the
destination register. One bit will be needed to identify the
next interrupt level and two bits for the previous packet
location. A single bit for the interrupt level is sufficient
because the shadow stack packets will be served in a
FILO method (first in, last out), and this level bit will be
used to indicate the start of the next interrupt level. The
previous packet location has to be stored, because the next
free bus cycle will be used, independent of which of the
two busses to the data memory is available.
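The field widths given above (20 + 7 + 1 + 2 = 30 of the 32 available bits) can be illustrated with a short Python sketch. The bit order inside the word, the function names and the low-part-first splitting of wide registers are assumptions made for this example; the text only fixes the field widths.

```python
PAYLOAD_BITS, TARGET_BITS, LEVEL_BITS, LOCATION_BITS = 20, 7, 1, 2

def pack(payload, target, level_start, prev_location):
    """Pack the four fields into one word, payload in the lowest bits (assumed order)."""
    fields = ((payload, PAYLOAD_BITS), (target, TARGET_BITS),
              (level_start, LEVEL_BITS), (prev_location, LOCATION_BITS))
    word, shift = 0, 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError("field does not fit into its bit width")
        word |= value << shift
        shift += width
    return word

def unpack(word):
    """Inverse of pack: returns (payload, target, level_start, prev_location)."""
    out = []
    for width in (PAYLOAD_BITS, TARGET_BITS, LEVEL_BITS, LOCATION_BITS):
        out.append(word & ((1 << width) - 1))
        word >>= width
    return tuple(out)

def split_payload(value, width):
    """Split a register of the given bit width into 20-bit payloads, low part first."""
    chunks = -(-width // PAYLOAD_BITS)          # ceiling division
    mask = (1 << PAYLOAD_BITS) - 1
    return [(value >> (i * PAYLOAD_BITS)) & mask for i in range(chunks)]
```

A 16-bit data register fits into one packet, whereas a 40-bit accumulator needs two packets, each carrying the same kind of target information.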
If the stack contains too many entries, additional
interrupts can lead to stalling cycles of the DSP core to
free space for the next register values.
Figure 10: Thresholds (levels LIL_MIN, BAL_MIN, BAL_MAX, LIL_MAX; annotations: "space to restore at least the results of the most recent interrupts" and "space to store at least the results of the next interrupts")
If the stack is kept empty and all shadow stack packets
have been swapped preventively to the data memory, the
stack is well prepared for new interrupts. However, the
register contents for finishing the interrupt service routine
are missing. If the shadow packets have to be fetched, this
will lead again to stalling cycles and to additional power
dissipation due to the switching at the data memory ports.
For balanced use of the stack, the two limits BAL_MIN
and BAL_MAX can be configured individually.
The FSM (finite state machine) handling the swapping
mechanism of the shadow packets is illustrated in Figure
11. As long as there is enough space in the shadow stack
available to save the new values and the values are
available for restoring after finishing the ISR, only the
states Save Results and Restore Results will be used. If
there is no space left for the next interrupt level, the state
Store LIL will be used to free space inside the stack. If the
result values of the ISR currently being served are not available
inside the stack, the state Load LIL will be used to initiate
a fetch from the data memory of the DSP subsystem. Both
operation states have no influence on the performance of
the DSP kernel, because the memory operations will be
done during memory cycles not used by the DSP anyway,
as explained in the next subsection. If the next interrupt
occurs and there is too little space available inside the
stack the state Store Stalling will be used to stall the DSP
program and to free the stack until enough free space is
available. If the ISR is finished and the related register
contents are not available to restore them into the register
file, the state Load Stalling will be used to fetch the
missing data from the data memory. Both states have
influence on the DSP due to the stall mechanism.
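The state selection described above can be summarized in a small Python sketch. The six state names are taken from the text; everything else is an assumption for illustration: the `IDLE` state, the event names, the parameter names, and the exact way the fill level is compared with LIL_MIN and LIL_MAX.

```python
from enum import Enum

class State(Enum):
    SAVE_RESULTS = "Save Results"
    RESTORE_RESULTS = "Restore Results"
    STORE_LIL = "Store LIL"
    LOAD_LIL = "Load LIL"
    STORE_STALLING = "Store Stalling"
    LOAD_STALLING = "Load Stalling"
    IDLE = "Idle"                      # assumption: not named in the text

def next_state(event, fill, in_memory, size, level, lil_min, lil_max):
    """Pick the state for one step.

    event: "interrupt", "isr_done" or "none"; fill: packets in the stack;
    in_memory: packets swapped to data memory; size: stack capacity;
    level: packets of one interrupt level.
    """
    if event == "interrupt":
        # enough room for the new results, otherwise stall and free the stack
        return State.SAVE_RESULTS if size - fill >= level else State.STORE_STALLING
    if event == "isr_done":
        # results present in the stack, otherwise stall and fetch them
        return State.RESTORE_RESULTS if fill >= level else State.LOAD_STALLING
    # no event: rebalance in the background without disturbing the core
    if fill > lil_max:
        return State.STORE_LIL
    if fill < lil_min and in_memory > 0:
        return State.LOAD_LIL
    return State.IDLE
```

Only the two stalling states cost core cycles in this model; the choice of LIL_MIN and LIL_MAX decides how often they are reached.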
Figure 11: FSM for swapping mechanism
If an interrupt service routine finishes and the related
shadow stack packets are not available inside the shadow
stack, the shadow stack has the option to claim the data
bus of the DSP core and to fetch the necessary packets.
During this emergency fetching, the program flow on the
DSP core will be stalled.
4. Comparison
The minimal size of the shadow stack is one level of
results. A level of results consists of the result values
generated by the last instructions before executing the
ISR. Such a small stack requires swapping of the stack
content each time an interrupt occurs or an ISR has been
finished. Additionally, no further interrupts are allowed
during the swapping, which increases the response
time. In the example above 4 cycles are needed for
swapping the data and 4 cycles to restore them.
With a stack of 2 levels of results there is still no
advantage in silicon area compared to the shadow
registers. However, in this configuration not every
interrupt causes swapping to the data memory of the
DSP subsystem.
3.4. Cycle stealing
The scalable shadow stack has access to the ports of
the data memory to swap stack contents to the memory of
the DSP subsystem. To prevent stall cycles where the
program running on the DSP core will be stalled to free
the data busses, cycle stealing [2] is used to regulate the
communication on the data bus. The DSP core is the
master of the data busses and indicates via a signal, which
of the memory busses is not used during the next cycle.
During the state Store LIL the related bits in the shadow
stack packet (previous packet location) are set and the
free bus is used for storing the shadow stack packet in the
data memory. During the state Load LIL the previous
packet location of the shadow stack packets is analyzed
(to get the information on which bus the next value has
been stored before) and the data are fetched from the data
memory of the DSP subsystem.
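The role of the previous packet location can be sketched in Python as follows. The encoding of the two location bits (0 = no previous packet, 1 and 2 = the two buses), the function names and the per-bus lists standing in for the data memories are assumptions of this example. For brevity the reload ignores the free-bus signal, which the real Load LIL state would also have to respect.

```python
NONE, BUS_X, BUS_Y = 0, 1, 2   # assumed encoding of the location bits

def store_lil(packets, free_bus_per_cycle):
    """Store packets using only cycles in which the core leaves a bus free."""
    memory = {BUS_X: [], BUS_Y: []}
    last = NONE                          # location of the most recently stored packet
    pending = list(packets)
    for free_bus in free_bus_per_cycle:
        if not pending:
            break
        if free_bus == NONE:
            continue                     # both buses in use: wait, never stall the core
        memory[free_bus].append((pending.pop(0), last))
        last = free_bus
    return memory, last, pending

def load_lil(memory, last):
    """Follow the chain of previous packet locations to fetch packets back (FILO)."""
    restored = []
    while last != NONE:
        payload, last = memory[last].pop()
        restored.append(payload)
    return restored
```

With the free-bus sequence X, none, Y, Y the three packets 10, 20, 30 end up on both buses, and following the chain returns them as 30, 20, 10.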
Most of these parameters have influence on the
consumed silicon area and the power dissipation.
A FSM is used to balance the fill level to reduce the
power dissipation at the memory ports and to prevent
stalling cycles of the DSP core.
5. Conclusion
The scaleable shadow stack is a smooth solution to
overcome the problem of data consistency between the
instructions of different pipeline stages during interrupt
service routines, without any restrictions on the possible
nesting level of the interrupts. The proposed concept
enables data consistency without the usage of DSP
instructions or cycles and has an area advantage compared
to a shadow register concept at deeper nesting levels of
interrupts.
The scaling parameters can be used to adapt the
shadow stack structure to application specific
requirements. Different thresholds to administrate the
swapping mechanism to the data memory can be used to
reduce the power dissipation through fewer memory accesses.
The shadow stack is a part of a development for a
configurable DSP concept.
Figure 12: Comparison of shadow registers
versus shadow stack
Assuming a stack size of 3 levels of results, the stack is
big enough for a statistical argument to apply: not every
unit executing an instruction before an interrupt occurs
will produce a valid result. Therefore even a 3-level
shadow stack can handle more than 3 nesting levels
without any swapping to the data memory. No
further registers have to be spent for handling of deeper
nesting levels.
In Figure 12 the trade-off between silicon area and
possible nesting level is illustrated for the shadow
registers and for the solution of the shadow stack. For the
first 2 nesting levels the shadow stack has an overhead
due to the stack administration. About at the third level
the break-even point is reached. Further nesting levels will
not lead to additional HW in the case of the shadow stack,
in the worst case some stack packets have to be spilled to
the data memory. The solution with the shadow registers
will require additional HW for the support of further
nesting levels.
The scaleable architecture enables an application-specific
adaptation of the shadow stack architecture. The
following parameters can be influenced.
• number of entries of the stack
• thresholds for swapping administration
• number of memory ports
• bit width of the memory ports
• number of result registers per interrupt level
• size of the result registers (bit width)
6. References
[1] J. Eyre and J. Bier, The evolution of DSP processors, IEEE
Signal Processing Magazine, vol. 17, no. 2, 2000.
[2] J. L. Hennessy and D. A. Patterson, Computer
Architecture. A Quantitative Approach, Morgan Kaufmann
Publishers, San Mateo CA, 1990.
[3] P. Lapsley, J. Bier, A. Shoham and E. A. Lee, DSP Processor
Fundamentals, Architectures and Features, IEEE Press,
New York, 1997.
[4] D. Sima, T. Fountain, P. Kacsuk, Advanced
Computer Architectures: A Design Space Approach,
Addison Wesley Publishing Company, Harlow, 1997.
[5] S. M. Mueller, W. J. Paul, Computer
Architecture: Complexity and Correctness, Springer, New
York, 2000.
PUBLICATION 9
C. Panis, J. Hohl, H. Grünbacher, J. Nurmi, “xICU - a Scaleable Interrupt Unit for a
Configurable DSP Core”, in Proceedings 2003 International Symposium on System-on-Chip
(SOC’03), Tampere, Finland, November 19-21, 2003, pp. 75-78.
©2003 IEEE. Reprinted, with permission, from proceedings of the 2003 International
Symposium on System-on-Chip.
xICU – an Interrupt Control Unit for a configurable DSP Core
C. Panis, Carinthian Tech Institute
J. Hohl, Infineon Technologies Austria
H. Gruenbacher, Carinthian Tech Institute
J. Nurmi, Tampere University of Technology
Abstract
Increasing complexity of SoC applications leads to a
strong demand for powerful software programmable
embedded cores. Low-cost applications do not allow
adding more than one core to the application. Depending
on the application a DSP or a microcontroller will be
used. Therefore DSP cores also have to handle interrupts
typically served by microcontroller sub-systems, with
low latency and small overhead concerning cycle count
and code density.
This paper describes the architecture of an ICU
(interrupt control unit) for a configurable DSP core. The
main architectural features of the ICU can be configured
to reduce the consumed silicon area to an application
specific optimum. Priority morphing is introduced to
enable the control of the execution order of pending
interrupt sources during run-time and to prevent the loss
of interrupt information. A smooth integration into the
program sequencer allows short interrupt latency and low
overhead for serving ISRs (interrupt service routines).
xICU is part of a project for a configurable DSP core.
1. Introduction
Integration of system solutions onto one die (System-on-Chip)
or into one package (System-in-a-package) leads
to a strong demand for powerful embedded software
programmable cores. This is true for microcontroller and
DSP cores. Especially for low-cost applications it is not
possible to add more than one core. Depending on the
algorithms implemented in software, a microcontroller or
a DSP core is chosen. Therefore DSP cores also have to
handle control code and configuration code sections
efficiently [1]. Interrupts have to be served as by
microcontroller architectures, which means with low
latency and little overhead in cycle count and code density
[2].
This paper describes the architecture and integration of
an interrupt control unit (ICU) for a configurable DSP
core. The first section illustrates the changing importance
of serving interrupts in DSP subsystems. The second part
briefly introduces the requirements for the ICU architecture
with focus on priority morphing, which is used to control
the execution order of the interrupt sources during run-time
and to prevent the loss of interrupt information. The
third part briefly covers the architectural details of the
integration into the proposed core architecture.
2. Motivation
This section briefly introduces the features of
interrupt control units of available core architectures. The
chosen examples illustrate the increasing importance
of interrupts for DSP sub-systems. The OAK DSP core
from DSPGroup does not provide a separate interrupt
control unit. As state-of-the-art architecture the ICU of the
Starcore SC140 is introduced. As reference for ICUs the
Power Quicc MPC860 from Motorola is introduced.
2.1. OAK DSP Core
The OAK DSP core from DSPGroup is chosen as
example for traditional DSP core architectures [3] not
supporting an explicit interrupt control unit. The interrupts
(one non-maskable and three maskable interrupts) are
directly connected to the DSP core, the activation of the
pins is high-level sensitive.
Figure 1: OAK DSP Core
The OAK DSP core supports context switching used
during serving of interrupt service routines (ISR). For the
accumulator register a set of shadow registers is available
to handle the second task; therefore no initialization
procedure is mandatory. The limitation to one set of
shadow registers (supporting more than one has
significant influence on the core area) does not allow
the context switching mechanism to be used for nested
interrupts. The OAK DSP is not restricted by this
limitation because there is no support of nesting of maskable interrupts, which are controlled by the IE bit.
2.2. Starcore SC140
The implementation of the exception handling for the
Starcore SC140 is split into two parts [4]. The PIC
For maskable interrupt sources a mask register is
available containing the mask information used during the
selection process of the next interrupt source. The priority
of the different interrupt sources can be assigned by
mapping of the interrupt sources to a priority matrix. The
assigned numbers must be unique.
Comparing the exception handling for a microcontroller
with the exception handling for a state-of-the-art DSP
core, the differences are negligible. Both
architectures have a predefined configuration. The number
of supported interrupt sources, the number of available
priority levels, the size of the address bus and the related
address space are fixed.
xICU, the interrupt control unit introduced in this
paper, allows these parameters to be adapted to
application-specific requirements to reduce the consumed silicon area.
(programmable interrupt controller) is used for
prioritization and arbitration of the different exception
sources and is not part of the DSP core. Up to 7 different
levels of priority are supported (not including the NMI).
Figure 2: SC140-PIC Interface
The interface of the PIC to the DSP core is illustrated
in Figure 2. Exceptions can be generated by traditional
external interrupt sources (“the left side” in Figure 2) or
from the PSEQ unit of the DSP core e.g. by execution of
illegal instructions. The address of the interrupt vector
consists of three parts: the vector base address
(stored in the VBA), an offset (internal or external,
depending on the kind of exception) and an aligned start-address,
initialized with zero (the lower 6 bits, because the
distance between 2 exception vectors is 64 bytes).
The PIC supports up to 32 interrupt requests including
NMIs (non-maskable interrupts). Up to eight edge-triggered
NMIs are supported; the remaining 24 inputs
support edge- or level-triggered interrupt requests.
The PIC provides the possibility to monitor pending
interrupts, which is useful for debugging purposes.
The SC140 supports delayed return from exception
handling, which allows making use of some of the branch
delays of the return from interrupt instruction (up to two
out of six branch delays).
3. xICU Requirements
This section is used to introduce the requirements for
xICU. The main part is used to illustrate priority morphing
(and the specific implementation), which is used to
prevent loss of interrupt information and allows changing
of the order of served ISRs (interrupt service routine)
during run-time.
3.1. Synchronization
Interrupt sources can generate asynchronous interrupt
signals (e.g. from different clock domains). Therefore the
ICU has to take care of synchronizing the incoming
signals (as the PIC of the SC140 does).
3.2. Scalability
The proposed DSP core architecture allows the main
architectural features to be adapted to application specific
requirements to obtain an optimum in power dissipation
and area consumption. The same is required for xICU.
The number of interrupt sources, the number of supported
priority levels (with influence on the size of the related
configuration registers) and the width of the interface to
the core itself have to be scaleable.
3.3. Priority
Figure 3: CPM (CPIC)
Different interrupt sources can be assigned to different
priority levels. There is no restriction for the number of
priority levels and the priority of an interrupt source has to
be changeable during run-time. The same priority level
can be assigned to more than one source at the same time.
If two sources have the same priority assigned, the time of
occurrence is used to decide about the order of execution.
2.3. Motorola Power Quicc
The exception handling for the Motorola Power Quicc
is chosen because this microcontroller is available in
several applications today [5]. The interrupt controller for
the MPC860 is called CPM. The CPIC (a part of the
CPM) is used for synchronization and prioritization of all
internal and external interrupt sources, similar to the PIC
of the SC140 (illustrated in Figure 3).
3.4. Low latency
As pointed out in the introduction an important
requirement of efficient interrupt handling is low latency.
During run-time the assigned priority level can be
changed, increased and decreased. The change can be
done by a predefined time basis (the time basis can be
changed during run time). For example each millisecond
the priority of the external interrupt in Figure 4 is
increased by one.
The priority level can also be changed by the source to
get served earlier. This “self-tuning” can be useful if the
interrupt source indicates that it is losing data or wants to activate
data transfer with another Co-processor as mentioned
before. This feature has influence on the program flow
and therefore has to be explicitly activated by the DSP core.
The DSP core can also change the priority of an
interrupt source. The increased priority level of an
interrupt source leads to an earlier execution.
The latency is influenced by the synchronization logic and
by core restrictions. The proposed DSP core restricts the
handling of ISRs only during the execution of branch delays
(for the five-stage pipeline configuration two branch
delays are necessary).
3.5. Priority Morphing
Similar concepts are known from microprocessor
architectures. The priority is changing on time-basis with
wrap around. Starting with low priority after a predefined
time-basis the priority of a certain interrupt source is
increased. If the highest priority is reached and the source
is still not served, the process starts again with the lowest
priority.
Microcontroller architectures allow minimizing of the
average execution time. Real-time requirements of DSP
algorithms require minimizing the worst case execution
time [6]. Therefore the feature has to be adapted to this
aspect.
Enabling the handling of ISRs during execution of
code sequences has influence on the execution time. One
possibility to overcome this problem during the execution
of real-time critical code sections can be disabling of
interrupt sources. Another possibility is to execute these
code sections in interrupt service routines with a high
priority.
Figure 4 is used to illustrate a possible scenario with a
DSP core and several Co-Processors handling specific
functions. Each of the Co-Processors is controlled by the
DSP core. Each Co-Processor can request a
communication channel with the DSP core by using an
interrupt line. Data transfer between Co-processors can be
initiated without interaction of the DSP core, which
prevents the core from becoming a bottleneck in the data flow. In
Figure 4 an external interrupt source is indicated by an
arrow (e.g. a relay on a board), generating interrupts with
low frequency. The interrupt source has a low priority,
because the relay is working at a lower frequency than the
core or the Co-processors.
4. xICU Architecture
This section is used to introduce the architecture of
xICU. The first part covers an overview of the structure,
the second part discusses topics like interfacing to the
DSP core and data consistency aspects.
4.1. Overview
In Figure 5 an overview of the xICU architecture is
illustrated. Four main blocks can be identified: The
synchronization unit, the scheduling unit, the interrupt
setup unit and the feedback unit. The configuration
parameters for each source can be assigned via pin-strapping
and are also software programmable. The default
after reset is pin-strapping.
Figure 5: xICU Overview
• Synchronization unit
The synchronization unit is used to synchronize the
asynchronous interrupt sources. Each of the interrupt
sources can be edge or level triggered (stored in a
configuration register). For each interrupt source only one
interrupt can be pending. The next interrupt of the same
source is accepted after serving the already pending one.
To reduce interrupt latency, both clock edges can
optionally be used for synchronization.
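The one-pending-interrupt-per-source rule can be modelled with a few lines of Python. The sketch assumes an already synchronized input level that is sampled once per synchronization clock edge. The class name, the method names and the decision to raise the error flag only for edge-triggered sources are assumptions for illustration.

```python
class InterruptInput:
    """Toy model of one interrupt source input of the synchronization unit."""

    def __init__(self, edge_triggered=True):
        self.edge_triggered = edge_triggered   # mode held in a configuration register
        self.pending = False
        self.error = False                     # set when a request arrives while one is pending
        self._last_level = 0

    def sample(self, level):
        """Evaluate the synchronized line level (0 or 1) at one clock edge."""
        if self.edge_triggered:
            request = level == 1 and self._last_level == 0   # rising edge
        else:
            request = level == 1
        self._last_level = level
        if request:
            if self.pending and self.edge_triggered:
                self.error = True              # a second request would be lost
            self.pending = True

    def acknowledge(self):
        """Called when the pending interrupt has been served."""
        self.pending = False
```

A second rising edge before `acknowledge()` sets the error flag, which corresponds to the situation in which interrupt information is lost.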
• Scheduling unit
The scheduling unit is used to prioritize pending
interrupt requests. The interrupts with the highest priority
will be served first. If two pending interrupts are assigned
Figure 4: application example
But if the interrupt request is not served before the next
one of the same interrupt source is requested, an interrupt
gets lost. To overcome this problem the above mentioned
priority morphing can be used. Before information is
lost, the priority level of the source is increased
and the interrupt request gets served earlier. There is no
automatic wrap around of the priority supported.
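Time-based priority morphing without wrap around can be written as a one-line rule. The Python sketch below is an illustration only; the function name, the parameter names and the convention that a larger number means a higher priority are assumptions.

```python
def morphed_priority(base, elapsed, period, max_priority):
    """Priority of a pending source after `elapsed` time units.

    The priority grows by one per `period` and saturates at `max_priority`
    (no automatic wrap around back to the lowest level).
    """
    return min(max_priority, base + elapsed // period)
```

With a period of 1000 microseconds this corresponds to raising the priority by one each millisecond until the highest level is reached.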
depending on the chosen synchronization mode and the
chosen pipeline structure). The core handles an ISR like a
branch to a sub-routine with an externally provided branch
address.
The DSP core supports split execution, which means
that more than one clock cycle is used to execute
instructions. To increase the usage of the available
hardware resources the pipeline is visible, which leads to
data consistency problems during execution of interrupt
service routines. A solution is illustrated in [7].
Due to the support of nested interrupts, no shadow
registers as in the OAK DSP core are available. A
complete software task switch (which is quite often not
necessary for an ISR) consumes about ten clock cycles.
with the same priority the time of occurrence is chosen for
deciding the execution order: first in, first out.
As mentioned in section 3, xICU supports priority
morphing. Therefore the priority of an interrupt source
can be changed during run-time by the source itself, by the
core or by defined time-basis. After reset priority
morphing is deactivated, the priority level indicated by the
pin strapping is used. After setting the related
configuration register with the morphing mode (time,
source, master) a separate bit is used to activate this
feature (for each interrupt source separately).
This unit is also responsible to take care of the masking
information. Masked interrupts will not be considered for
scheduling as long as the mask bit is set. After reset all
mask bits are set.
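The selection rule of the scheduling unit (highest priority first, first in first out among equal priorities, masked sources ignored) can be sketched in Python. The dictionary keys, the function name and the convention that a larger number means a higher priority are assumptions for this illustration.

```python
def select_next(pending):
    """Return the source to serve next, or None if nothing is eligible.

    pending: list of dicts with the keys 'source', 'priority',
    'arrival' (time of occurrence) and 'masked'.
    """
    candidates = [p for p in pending if not p["masked"]]
    if not candidates:
        return None
    # highest priority wins; among equal priorities the earliest arrival wins
    best = min(candidates, key=lambda p: (-p["priority"], p["arrival"]))
    return best["source"]
```

Because the priority is just a field of each entry, priority morphing only has to update that field before the next selection takes place.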
• Interrupt Setup Unit
The chosen interrupt source has to be translated into an
interrupt vector address. There are no predefined memory
addresses for interrupt vectors. The address space of the
interrupt vector table is configurable.
An interrupt is handled like a call of a sub-routine with
an externally provided address. Therefore the xICU has to
provide the start-address of the ISR and an interrupt
request signal, which is executed like a branch sub-routine
instruction. The branch delays of the “external” branch
sub routine cannot be used (two clock cycles for the five
stage pipeline) due to core restrictions. For the same
reason during execution of branch delays no further
interrupt requests (e.g. by an NMI) can be served.
5. Results
The described interrupt control unit (xICU) has been
implemented in VHDL-RTL. The main architectural
features influencing the consumed silicon area like the
number of interrupt sources, the size of the control
registers and the supported interrupt priorities are
configurable. The timing critical part is the interface to the
DSP core (running on core frequency). Therefore the
interface is decoupled from the remaining implementation.
6. Conclusion
This paper describes the architecture of xICU, a
scaleable interrupt control unit. Compared with available
ICUs the main architectural features can be configured to
meet application specific requirements. Priority morphing
enables changing the execution order of the interrupt
service routines during run-time and prevents a loss of
interrupt information due to starvation. xICU is part of a
project for a configurable DSP core.
Figure 6: Control Signals
The interrupt setup unit is also responsible for nested
interrupts. The scaleable core architecture supports any
nesting level. The acknowledge signals are assigned to the
related interrupt source.
• Feedback Unit
The feedback unit is responsible for collecting feedback from
the other units and from the core and for providing control
signals to the interrupt sources. Some of the signals are
illustrated in Figure 6, e.g. the ack feedback signal
indicating that the related interrupt has been served or the
error signal indicating that an interrupt is pending and
another interrupt of the same source has been activated.
4.2. Core Integration
The DSP core using xICU enables smooth handling of
interrupts with low latency (about 4 core clock cycles,
7. References
[1] P. Lapsley, J. Bier, A. Shoham and E. A. Lee, DSP Processor
Fundamentals, Architectures and Features, IEEE Press,
New York, 1997.
[2] D. Sima, T. Fountain, P. Kacsuk, Advanced Computer
Architectures: A Design Space Approach, Addison Wesley
Publishing Company, Harlow, 1997.
[3] Siemens, OAK DSP Core, Programmers Reference
Manual, Siemens AG, Munich, 1998.
[4] Motorola, SC140 DSP Core Reference Manual, Motorola,
Rev.0, 1999.
[5] Motorola, CPM Interrupt Controller, Motorola, 2002.
[6] J. L. Hennessy, D. A. Patterson, Computer Architecture. A
Quantitative Approach, Morgan Kaufmann Publishers, San
Mateo CA, 1996.
[7] C. Panis, R. Leitner, J. Nurmi, “Scaleable Shadow Stack for a
Configurable DSP Concept”, IWSOC 2003, Calgary,
Canada, June 2003, pp 222-227.
PUBLICATION 10
C. Panis, G. Laure, W. Lazian, A. Krall, H. Grünbacher, J. Nurmi, “DSPxPlore – Design
Space Exploration for a Configurable DSP Core”, in Proceedings International Signal
Processing Conference (GSPx), Dallas, Texas, USA, March 31- April 3, 2003, CD-ROM.
©2003 Global Technology Conferences. Reprinted, with permission, from proceedings of
the International Signal Processing Conference.
DSPxPlore – Design Space Exploration for a Configurable
DSP Core
Christian Panis (1)
Carinthian Tech Institute
+43 4242 90500 2124
[email protected]
Gunther Laure (2)
Infineon Technologies
+43 4242 305 0
[email protected]
Wolfgang Lazian (2)
Infineon Technologies
+43 4242 305 0
[email protected]
Herbert Grünbacher (1)
Carinthian Tech Institute
+43 4242 90500 2100
[email protected]
Andreas Krall (3)
Vienna University of Technology
+43 1 58801 18511
[email protected]
Jari Nurmi (4)
Tampere University of Technology
+358 331153884
[email protected]
(1) Europastrasse 4, A-9524 Villach, Austria
(2) Siemensstrasse 2, A-9524 Villach, Austria
(3) Argentinierstrasse 8, A-1040 Vienna, Austria
(4) P.O.Box 553, FIN-33101 Tampere, Finland
ABSTRACT
SOC (System-on-Chip) applications map complex system functions on a single die. To cover the increasing importance of flexibility in SOC applications, a rising portion will be implemented in software. Therefore the importance of embedded processors like microcontrollers, protocol processors and DSPs is increasing.
But which one is the right core to cover the demands of a certain application? Most of the time this decision is made by the most experienced engineers, mainly focusing on the aspects „what is already available?” and „what has been already proven in silicon?”. Due to the different requirements of different applications, one core cannot fit optimally everywhere, and this quite often leads to solutions with overhead concerning silicon area and power consumption. In the price-critical consumer IC market this can be crucial for one's own market position and revenues.
DSPxPlore is a design space exploration method for an embedded configurable DSP processor which can be adapted to application specific requirements. The major pillars of DSPxPlore are an optimizing C-compiler and an instruction set simulator based on a configurable component framework and on a configurable DSP core architecture.
DSPxPlore enables the evaluation of architectural aspects of a DSP core and their influence on the overall system performance at an early stage of the project. This can be used for a better utilization of the silicon area and to reduce the power dissipation of the application. DSPxPlore is part of a development project for a configurable DSP concept.
Categories and Subject Descriptors
SOC IP Design
General Terms
Performance, Design, Experimentation
Keywords
DSP, Design space exploration, Configurable Core, DSPxPlore
INTRODUCTION
The increasing importance of SW leads to an increasing importance of embedded processors like microcontrollers, protocol processors and DSPs. Choosing the right one is quite difficult. Each of the core vendors claims to have the best one, but the best for “what”? The question has to be “which is the best core for my application requirements?”
The SOC requirements concerning silicon area and power dissipation have led in the last years to more application specific core architectures and also to configurable and application adaptable embedded cores.
Having an application specific adaptable core (e.g. a DSP core) enables you to optimize the consumed silicon area and power dissipation of your DSP subsystem and therefore to design competitive SOC products. To make use of the available flexibility and to choose the right HW/SW partitioning, which is becoming one of the key issues in making successful products, you have to understand the requirements of your application at an early stage of the product design cycle.
This paper introduces DSPxPlore, which should help to optimize a configurable DSP core architecture to the specific requirements of a certain application, focusing on the main issues silicon area and power dissipation. DSPxPlore is based on an evaluation C-compiler and a reconfigurable instruction set simulator (ISS) based on a component framework.
In the first section the configurable DSP architecture is briefly introduced. The second part focuses on the exploration parameters and their influence on silicon area and power dissipation. The third part introduces DSPxPlore and explains the analysis results of the evaluation tool chain.
DSP ARCHITECTURE
This section briefly introduces the chosen DSP architecture. The changing requirements on DSP subsystems in the area of SOC applications lead to a strong demand for a high-level language entry like C or Java (the manual coding effort for large application codes is increasing exponentially). But the requirements concerning silicon area consumption and power dissipation do not allow overhead for code automatically generated by a C-compiler. Therefore the architecture definition has been influenced by the requirements of an efficient C-compiler (efficient in the sense that the overhead compared with manual assembly coding is less than […]).
Overview
The […] bit fixed-point modified Dual Harvard architecture has two independent data memory busses (e.g. […] bit wide) […]. The instructions get their source operands from the register file, and the data moves between the register file and data memory are explicitly coded as separate instructions (load-store architecture). The RISC-like pipeline consists of three phases (instruction fetch, instruction decode and instruction execution), which can be split onto several clock cycles to increase the reachable frequency. An extensive set of predicated execution features, a lean task switch support and no limits on the nesting level of interrupts allow efficient handling of control code sections.
Figure […]: Core Architecture Overview (blocks shown: Program Memory, Instruction Buffer, Execution Units, Register Files, Data Memory with Port A and Port B)
The requirements of the evaluation C-compiler, namely an orthogonal instruction set and support for efficient stack frame addressing, have been considered […]. The large uniform register sets, simple issue rules and the abdication of mode dependent instructions enable efficient, automatically generated machine code.
An instruction can be built up of one or two instruction words. The second word (also called parallel word) is used to keep long offsets, immediate values or far branch targets […]. The native instruction size is […] bit, which makes it possible to code the whole instruction set in short instructions and to use the long word only for constants. In subsection […] the possibility of changing the native size and the influence on the system performance will be pointed out. The execution bundle is built up of native instruction words; the size of the execution bundle is only limited by the number of available datapaths (scalable long instruction word).
Besides the aspect of an architecture independent description of the application, and therefore the request for a high-level language entry, the aspects of silicon area and power dissipation have been considered during architecture definition. When analyzing a DSP subsystem, these two parameters are mainly influenced by the memory subsystems. A high code density and as few memory accesses as possible have to be targeted to obtain good results. Besides unaligned program memory, SIMD support and an adaptable instruction coding, a lean core architecture has been chosen. This also supports the development of the C-compiler. A scalable instruction buffer for inner loops and an optimized hardware implementation of the core architecture additionally reduce the power dissipation.
EXPLORATION PARAMETERS
This section explains the exploration parameters available for the DSP architecture described in Section […]. The exploration parameters are discussed and their influence on the overall system performance (cycle count), silicon area and power dissipation of the DSP subsystem is analyzed. The exploration parameters
• Register file
• Number/kind of parallel execution units
• Memory bandwidth (data, program)
• Instruction size, binary encoding
• Pipeline stages
are available for the DSP architecture. The influence of the core itself on the consumed silicon area is quite low compared with the influence of the memory subsystem. But for low cost applications an additional […] mm² is already critical, and therefore the influence on the core area will also be pointed out.
Register File
For a load-store architecture the register file plays an important role. Each of the instructions using operands has its values stored inside the register file. As a rule of thumb about […] of the core area is consumed by the register file.
The advantage of operating on values of the register file is that intermediate results do not have to be stored back into the data memory and fetched again for a consecutive operation, which reduces the power dissipation at the memory ports. However, the size of the register file has to fit the requirements of the application code. Increasing the size of the register file reduces the activity at the data memory ports, but the entries have to be encoded into the instructions. This influences the size of the instruction words and therefore the code density. The number of entries of the register file also influences the size of the crossbar at the read/write ports and therefore the reachable core frequency. To overcome these problems the register file is quite often clustered […]. Unfortunately clustered register files restrict the compiler during instruction scheduling and require additional instructions to transfer values from one cluster to the next. In the proposed DSP concept of section […] all entries of the register files are handled equally and clustering has not been taken into account (data and address values are stored in different register files anyway).
Using a register file with few entries can lead to a waste of clock cycles and an increase of the program memory. If all available entries are already used and further space for intermediate results is required, the register file content has to be spilled to the data memory. The spill code consumes cycles, instruction words (and therefore influences the code density of the application) and additional data memory.
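The link between register file size and instruction word size described above can be made concrete with a small calculation. The following is only a sketch: the function name, the register counts (16 and 32 entries) and the choice of three register operands per instruction are illustrative assumptions, not values taken from the paper.

```python
def operand_field_bits(register_entries: int, operands_per_instruction: int) -> int:
    """Bits an instruction word must reserve to address its register operands.

    Each operand needs enough bits to select one of `register_entries`
    registers, i.e. ceil(log2(register_entries)) bits.
    """
    if register_entries < 1 or operands_per_instruction < 0:
        raise ValueError("need at least one register and a non-negative operand count")
    bits_per_operand = (register_entries - 1).bit_length()
    return bits_per_operand * operands_per_instruction


# Hypothetical example: doubling the register file from 16 to 32 entries
# costs one extra bit per operand, three bits for a three-operand instruction.
small = operand_field_bits(16, 3)
large = operand_field_bits(32, 3)
print(small, large)
```

Under these assumptions a larger register file reduces spill code but leaves fewer bits of a fixed-size native instruction word for the opcode and immediate values, which is the trade-off the section describes.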
Number/Kind of Parallel Units
VLIW DSP architectures support the execution of more than one instruction per cycle. These instructions can be used well for filter operations, which can be seen in the rich set of benchmarks available from various core vendors. In low cost applications (where not more than one core is feasible) number crunching like filter operations is quite often done in dedicated HW, and control code and configuration code dominate the code executed on the DSP. These applications cannot make use of the units available in parallel.
Supporting several instructions in parallel influences the core size. Besides the units themselves, the decoder structure has to be enlarged and the number of read/write ports of the register file rises. Additional instructions have to be fetched and the size of the program memory port has to be increased.
Before adding an additional unit to the core architecture (e.g. an additional MAC (Multiply-Accumulate) unit), the benefit for the application has to be considered quite carefully. Analyzing the full application code can lead to different requirements concerning the number of necessary parallel datapaths than focusing only on filter operation benchmarks.
Memory Bandwidth (data, program)
The usage of the available system resources is influenced by the available memory bandwidth (data and program). Adding computational units to the DSP subsystem without providing the possibility to fetch the necessary operands leads to unused system performance.
Having several data memory ports (e.g. supporting […] independent addresses) increases the performance for algorithms like FFT. On the other hand, the influence on the core size is the need for additional AGUs (Address Generation Units) and the necessary routing to the data memory. Assuming crossbars at the memory subsystem to reduce the probability of memory collisions (more than one address pointing to the same address space), these crossbars have to grow. The data memory access is time critical, and therefore increasing the crossbar influences the reachable core frequency.
VLIW architectures provide the execution of several instructions in parallel. If not all of the units are used each cycle, memory space to store the long instruction word would be wasted (see Figure […]). The proposed DSP concept supports a constant fetch bundle (e.g. 4 instruction words) and a scalable execution bundle (e.g. between 1 and 10 instruction words in parallel). To prevent stall cycles an instruction buffer is included, which compensates the memory bandwidth mismatch between fetch bundle and execution bundle. The relationship between the fetch bundle size and the execution bundle size influences the system performance. Stall cycles are necessary if the fetch bundle size is too small for the requirements of the application. A wider program memory port leads to additional routing effort to the program memory, with influence on the silicon area.
Figure […]: Instruction Buffer (constant fetch bundle, e.g. 4 instr.; scaleable execution bundle, e.g. 1 up to 10 instr.)
The size of the instruction buffer in Figure […] influences the overall system performance for inner loops […]. Once the loop body is fetched, no additional power is dissipated at the program memory port. If the buffer is too small to store the loop body, the advantage of the instruction buffer is lost. Increasing the number of entries influences the core size and the reachable core frequency.
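The interplay of a constant fetch bundle, a variable execution bundle and the instruction buffer can be explored with a toy cycle model. This is a simplified sketch and not the simulator used by DSPxPlore: the function name, the default buffer size of 8 words and the rule that one whole fetch bundle is loaded per cycle whenever it fits are assumptions made for illustration; branches and loops are ignored.

```python
def count_fetch_stalls(bundle_sizes, fetch_words=4, buffer_words=8):
    """Return (cycles, stalls) for issuing the given execution bundles.

    Per cycle: one fetch bundle of `fetch_words` instruction words is loaded
    if it fits into the buffer, then the next execution bundle is issued if
    all of its words are buffered; otherwise the cycle is a stall.
    """
    if fetch_words < 1:
        raise ValueError("fetch bundle must contain at least one word")
    if not bundle_sizes:
        return 0, 0
    # Guarantees that a fetch bundle always fits while a bundle is waiting.
    if buffer_words < max(bundle_sizes) + fetch_words - 1:
        raise ValueError("buffer too small for this toy model")

    buffered = cycles = stalls = index = 0
    while index < len(bundle_sizes):
        if buffered + fetch_words <= buffer_words:
            buffered += fetch_words          # fetch phase
        if buffered >= bundle_sizes[index]:
            buffered -= bundle_sizes[index]  # issue phase
            index += 1
        else:
            stalls += 1                      # bundle not completely fetched yet
        cycles += 1
    return cycles, stalls


# Narrow bundles (control code): the fetch bundle keeps the buffer filled.
print(count_fetch_stalls([2, 2, 2, 2]))                    # (4, 0)
# Bundles wider than the fetch bundle: the fetch bandwidth becomes the limit.
print(count_fetch_stalls([6, 6, 6], fetch_words=4, buffer_words=9))  # (5, 2)
```

In this model stalls appear as soon as the sustained execution bundle width exceeds the fetch bundle width, which matches the statement that stall cycles are necessary if the fetch bundle is too small for the application.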
Figure […]: VLIW architecture (instructions n to n+8 distributed over units 1 to 5 along the time axis t)
Instruction Size/Binary Coding
A high code density is mandatory for an embedded DSP processor. But high code density has to be considered on application level, which does not allow hiding problems in the micro architecture. Frequently used instructions have to be coded more efficiently; unused instructions can even be removed.
The native instruction size of the proposed DSP concept in Section […] is […] bit. A parallel word is used for long immediate values, far branch targets or long offsets. The arithmetic instructions can be used with […] different operands. Executing control code as in the example of Figure […] will not make use of the provided […] operand arithmetic instructions. For the code example in Figure […] the […] bit instruction set (where e.g. the parallel word is used for encoding of the […] operand instructions) increases the code density by about […]. The smaller native instruction word is an advantage for the code of the example in Figure […].
Figure […]: Instruction Size
Instruction size (native/long, bit)    16/32    20/40
number instructions:                     710      710
long instructions:                       215      164
bytes:                                  1850     2185
delay nops:                              211      211
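The byte counts in the instruction size figure can be reproduced from the instruction counts. The sketch below assumes that every long instruction adds exactly one parallel word of native width and that program memory is unaligned, so no padding is added; the function name is illustrative.

```python
def code_size_bytes(num_instructions: int, num_long_instructions: int,
                    native_bits: int) -> float:
    """Code size when each long instruction occupies one extra native word."""
    words = num_instructions + num_long_instructions
    return words * native_bits / 8


size_16 = code_size_bytes(710, 215, 16)   # 16/32-bit coding
size_20 = code_size_bytes(710, 164, 20)   # 20/40-bit coding
print(size_16, size_20)                   # 1850.0 2185.0
print(round(1 - size_16 / size_20, 3))    # 0.153
```

For this control code example the 16/32-bit coding needs more long instructions (215 instead of 164) but still produces about 15 percent fewer bytes than the 20/40-bit coding, because every word is four bits narrower.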
The power dissipation at the program memory ports is caused by the switching activity between “0” and “1” and vice versa. Reducing the switching activity, e.g. by reordering the instructions inside the execution bundle (which has no influence on the scheduling) or by changing the binary coding of the instruction set, has significant influence on the power dissipation of the DSP subsystem.
Pipeline stages
The number of pipeline stages influences the reachable clock frequency of the DSP subsystem. Splitting complex operations or timing critical functions like memory access enables a higher clock frequency. From a system point of view the result can be misleading. Assuming a pipeline structure as described in Section […], with instruction fetch, instruction decode and instruction execute, a higher clock frequency can be reached when splitting them over several clock cycles.
Increasing the number of clock cycles for the instruction fetch phase increases the number of branch delays. In control code sections or in systems with a lot of interrupt service routines the frequency gain can be compensated by the unusable branch delays. Predicated execution can be used for small branch target distances, but it influences the code density of the application. Branch prediction mechanisms cannot be taken into account due to the missing predictability. The system has to be designed with worst case parameters, and therefore a prediction has to be assumed as “not to be taken”.
Splitting the execution phase increases the define-in-use dependency and the load-in-use dependency […]. Data forwarding circuits increase the silicon area and the design complexity.
DSPxPlore
This section introduces the flow using DSPxPlore and discusses the results of the static and dynamic analysis. DSPxPlore supports the system designer in choosing the core architecture fitting the application. It can also be used to find an optimal HW/SW partitioning for a certain application or to decide which parts of the application are well suited for a HW co-processor.
DSPxPlore FLOW
The basis of DSPxPlore is an evaluation C-compiler and a reconfigurable Instruction Set Simulator based on a component framework. Starting with the application described in C (DSP-C preferred, due to the support of fractional data types), the evaluation C-compiler is used to map the application to a certain core architecture. The features of the chosen core architecture are described in a configuration file based on xml. The available exploration parameters of the DSP architecture introduced in Section […] have been discussed in Section […]. In Figure […] the flow of DSPxPlore is illustrated. The application C code is compiled with the evaluation C-compiler and then simulated with the configurable ISS (Instruction Set Simulator). The result of the C-compiler is an assembler description of the application and the static analysis results, which are discussed in detail in Section […]. The dynamic analysis results are generated by simulations with the ISS and are described in Section […]. The assessment results can be used to adapt the core architecture with the configuration file and to restart the analysis process.
Figure […]: DSPxPlore Flow (inputs: application code in C (DSP-C) and core architecture configuration (xml); the evaluation C-compiler produces .asm and static analysis results; the ISS produces dynamic analysis results)
Static Analysis
To obtain the static analysis results the evaluation C-compiler is used. The automatically generated results have to be accurate (compared with manual coding, less than […] overhead). Using a C-compiler with an overhead of about […] to […] times compared with manually optimized code (which is quite usual for today’s available C-compilers) will adulterate the results. This section introduces some static results generated by DSPxPlore to quantify the core architecture.
Code Size
As already pointed out in section […], the memories dominate the silicon area of the DSP subsystem. Therefore a high code density is mandatory for designing a successful product.
The value code size provides an indication concerning the program memory necessary to map the application onto the chosen instruction set and instruction coding. In Figure […] a part of the code size analysis results is illustrated. The number of instructions sums up all instructions independent of their word length; the number of long instructions counts the instructions using a parallel word. Offsets for address generation and branch targets are available at link time. Therefore a linker feedback is part of the compiler backend optimizations. Weighting the different instruction lengths and normalizing them to the result bytes gives a number for the code size. The proposed DSP concept supports unaligned program memory. The result value bytes is equal to the system code size, with no additional memory effort caused by the micro architecture. The chosen DSP architecture supports delayed and non-delayed branch instructions. If the instruction and data dependencies of the application code allow the branch delays to be filled, the system performance can be increased and the code density improved (no NOPs necessary). If the branch delays cannot be used, a non-delayed branch instruction increases the code density (no NOPs necessary); the cycles are wasted anyway. Due to the lean pipeline structure, and therefore fewer branch delays, the use of a branch prediction mechanism has been neglected. The problem of prediction is already mentioned in section […].
Figure […]: Code Size
number instructions:   827
long instructions:     70
bytes:                 2242,5
delay nops:            42
Parallelism
The parallelism result value gives an indication of how well the application code can use the provided parallel units. Due to data dependencies in the application code the parallelism can be quite small (especially in control code sections). The parallelism is analyzed on the level of execution bundles. The application code used for the example in Figure […] is control code. The chosen DSP architecture enables the execution of 5 instructions in parallel (2 memory operations, 2 arithmetic operations, 1 program flow operation). The analysis results in Figure […] give a first indication about the efficient usage of the parallel units.
Figure […]: Bundle Analysis
bundles with 1 instruction(s):   255
bundles with 2 instruction(s):   105
bundles with 3 instruction(s):   52
bundles with 4 instruction(s):   5
bundles with 5 instruction(s):   0
To understand the limits of the chosen architecture for the application code a more detailed analysis is necessary. For the results in Figure […] the same control code example as for the results of Figure […] has been used. The execution bundles are split into different categories, each consisting of up to 5 instructions but mapped to different datapaths.
Figure […]: Parallelism
Datapaths used                  Bundles    Share
Nop (incl. delay fill nops)        80     19.2 %
MemX                               82     19.7 %
MemY                                8      1.9 %
MemX, MemY                         10      2.4 %
ALU1                               48     11.5 %
MemX, ALU1                         22      5.3 %
ALU1, ALU2                         15      3.6 %
MemX, ALU1, ALU2                    4      1.0 %
MemX, MemY, ALU1, ALU2              2      0.5 %
BrUnit                             37      8.9 %
MemX, BrUnit                       26      6.2 %
MemX, MemY, BrUnit                  1      0.2 %
ALU1, BrUnit                       32      7.7 %
MemX, ALU1, BrUnit                 18      4.3 %
ALU1, ALU2, BrUnit                 29      6.9 %
MemX, ALU1, ALU2, BrUnit            2      0.5 %
MemY, ALU1, ALU2, BrUnit            1      0.2 %
The results listed in Figure […] are available for each basic block of the C code, which allows a fine grain analysis of the application.
The static analysis results as in Figure […] have to be weighted by the results of the dynamic analysis. If the static analysis indicates that only one or two execution bundles can make use of the available parallel units, it would be feasible to remove some of the datapaths. But if these bundles are part of inner loops in which the application code spends e.g. […] of the execution time, then removing the units will decrease the system performance significantly.
Kind of Instructions
A list of the instructions used to map the application code to the chosen core architecture enables fine tuning of the instruction set. Frequently used instructions can be encoded more efficiently to increase the code density. For low cost applications unused instructions can even be removed to squeeze the instruction coding and to reduce the number of bits necessary for encoding the instruction set.
Size of Immediate Values
The size of the immediate values can be identified already after static analysis. This result indicates whether the available coding space inside the instruction words already fits the application requirements or whether a lot of parallel words are necessary to code the immediate values. The value range of the immediate values influences the minimum encoding space of the native instruction word.
No Operation
The most useless operation running on an embedded DSP core is the NOP (no operation), because during this time the DSP does not contribute anything to the system performance. For different reasons it cannot be completely omitted, and during this time the consumed power dissipation is normally pretty low. One example of useful NOPs is a NOP for unusable branch delays, or NOPs used to align branch targets or loop bodies. That can be necessary to prevent stall cycles if the branch target is not part of the instruction buffer and the execution bundle is spread over several fetch bundles.
Figure […]: NOP for Branch Target Alignment
In Figure […] the execution bundle of the branch target is split over two fetch bundles. If the second part of the fetch bundle is not part of the instruction buffer, a stall cycle is necessary. Aligning the branch target by adding a NOP increases the system performance at a reasonable overhead for the system code size.
There are several reasons for a NOP, mostly caused by the micro architecture of the DSP core. The static analysis gives an indication why the NOP was introduced. As already mentioned in section […], it is necessary to weight these results by the results of the dynamic analysis to prevent local optimisations with influence on the overall system performance. If the branch target of the example in Figure […] is part of a while loop executed quite frequently and the loop is too long for the instruction buffer, the alignment will have significant influence on the system performance.
Dynamic Analysis
To weight the results of the static analysis process, simulations are necessary. To simulate the assembler code generated by the C-compiler a cycle-true Instruction Set Simulator is used. To cover the flexibility available for the chosen DSP architecture the simulator has to be configurable. The simulator used for DSPxPlore is based on a component framework which can be configured during runtime. The simulator is built up of different layers, which allows a higher simulation frequency in exchange for fewer debugging and evaluation features.
This section contains some examples of dynamic analysis results and their influence on the overall system performance. The simulator supports visual interpretation of most of the analysis results, which enables a comfortable analysis of the application code.
Number of Program Fetch Cycles
As already mentioned, the number of memory accesses significantly influences the power dissipation of the DSP subsystem. The chosen DSP architecture uses an instruction buffer to balance the memory bandwidth between the fetch bundles, which have constant length, and the execution bundle (the size of the execution bundle depends on the number of instructions executed in parallel). Analyzing inner loop code sections can be used to determine the optimal size of the instruction buffer for a certain application.
Unused Program Memory
To prevent stall cycles due to missing instructions the fetch counter is loosely decoupled from the program counter, in order to pre-fetch instructions during the execution of execution bundles with few instruction words. In control code sections with a high rate of branch instructions it can happen that a lot of instructions are fetched into the instruction buffer that will never be executed. However, fetching instructions has an impact on the power dissipation. The value unused program memory makes it possible to identify these code sections. To reduce the fetch overhead the proposed DSP core supports a user/compiler driven handling of the instruction buffer content and therefore control over the correlation between fetch counter and program counter.
Execution Cycles/Bundle
This parameter counts the execution frequency of the execution bundles. Together with the static analysis result parallelism, the frequently executed bundles can be identified. With this correlation the decision concerning the number of parallel data paths can be made.
The parameter can be illustrated graphically and is therefore easily used to identify hot spots inside the application code. This analysis can be used for optimizing the HW/SW partitioning and for identifying code sections which make the use of a hardware co-processor feasible.
Execution Frequency/Instruction
The execution frequency of each instruction is quantified. The result can be used to optimize the instruction set. Frequently fetched instructions can be coded more efficiently to increase the code density. Reduced fetch effort reduces the switching activity at the program memory ports and therefore influences the power dissipation of the DSP subsystem.
Number of Stall Cycles
As already mentioned in section […], the proposed DSP architecture has to compensate a memory bandwidth mismatch between fetch bundles and execution bundles. To achieve this an instruction buffer is used. Nevertheless, at branch targets or interrupt service routines a stall cycle can become necessary to fetch the first execution bundle (e.g. the example in Figure […]). The number of stall cycles caused by fetch operations is counted, and together with the result of execution cycles/bundle, optimizations such as branch target alignment can take place (described in section […]).
The chosen DSP architecture has two data memory ports which can be used independently in parallel. If both addresses point to the same physical memory block a stall cycle is mandatory. The stall cycles initiated by the memory accesses are summed up, and together with the result of execution cycles/bundle the memory partitioning can be optimized.
CONCLUSION
Choosing the “best” DSP core or the “best configuration” (if the DSP core is configurable) that meets the requirements of the application is quite tricky. DSPxPlore can be used to analyze the application's requirements at an early stage of the project and to quantify the influence of architectural decisions on the size and power dissipation of the DSP subsystem.
DSPxPlore is based on an evaluation C-compiler and a configurable component framework and is part of a project for a configurable DSP core.
ACKNOWLEDGMENT
ATAIR, Christian Doppler Forschungsgesellschaft and the Graz University of Technology have supported part of the work.
REFERENCES
[…] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, San Mateo CA.
[…] P. Lapsley, J. Bier, A. Shoham and E. A. Lee, DSP Processor Fundamentals, Architectures and Features, IEEE Press, New York.
[…] C. Panis, A. Schilke, H. Habiger, J. Nurmi, An Automatic Decoder Generator for a Configurable DSP Core, Norchip, Copenhagen.
[…] Texas Instruments, TMS[…]C[…]xx Programmers Guide.
[…] J. Handy, The Cache Memory Book, Academic Press.
[…] Dezso Sima, Terence Fountain, Peter Kacsuk, Advanced Computer Architectures: A Design Space Approach, Addison Wesley Publishing Company.
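As a cross-check of the static analysis figures of the paper above, the per-datapath bundle counts of the Parallelism figure can be folded back into the bundle-size histogram of the Bundle Analysis figure. The sketch below copies the counts from the figure; the data layout and function name are illustrative, and counting the Nop row as a one-instruction bundle is an assumption (it is the reading under which both figures agree).

```python
from collections import Counter

# (datapaths used by the execution bundle, number of bundles)
DATAPATH_USAGE = [
    (("NOP",), 80),
    (("MemX",), 82),
    (("MemY",), 8),
    (("MemX", "MemY"), 10),
    (("ALU1",), 48),
    (("MemX", "ALU1"), 22),
    (("ALU1", "ALU2"), 15),
    (("MemX", "ALU1", "ALU2"), 4),
    (("MemX", "MemY", "ALU1", "ALU2"), 2),
    (("BrUnit",), 37),
    (("MemX", "BrUnit"), 26),
    (("MemX", "MemY", "BrUnit"), 1),
    (("ALU1", "BrUnit"), 32),
    (("MemX", "ALU1", "BrUnit"), 18),
    (("ALU1", "ALU2", "BrUnit"), 29),
    (("MemX", "ALU1", "ALU2", "BrUnit"), 2),
    (("MemY", "ALU1", "ALU2", "BrUnit"), 1),
]


def bundle_histogram(usage):
    """Number of execution bundles per bundle size (instructions per bundle)."""
    histogram = Counter()
    for units, count in usage:
        histogram[len(units)] += count
    return dict(histogram)


histogram = bundle_histogram(DATAPATH_USAGE)
total = sum(count for _, count in DATAPATH_USAGE)
print(histogram)   # {1: 255, 2: 105, 3: 52, 4: 5}
print(total)       # 417
```

The result reproduces the published histogram (255, 105, 52, 5 and no bundle with five instructions), and 80 of 417 bundles corresponds to the 19.2 % listed for NOPs.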
PUBLICATION 11
C. Panis, U. Hirnschrott, G. Laure, W. Lazian, J. Nurmi, “DSPxPlore - Design Space
Exploration Methodology for an Embedded DSP Core”, in Proceedings of the 2004 ACM
Symposium on Applied Computing (SAC 04), Nicosia, Cyprus, March 14-17, 2004, pp. 876-883.
©2004 IEEE. Reprinted, with permission, from proceedings of the 2004 ACM Symposium
on Applied Computing.
DSPxPlore – Design Space Exploration Methodology for
an Embedded DSP Core
Christian Panis
Carinthian Tech Institute
Europastrasse 4
A-9524 Villach, Austria
+43 4242 90500 2124
[email protected]
Ulrich Hirnschrott
Vienna University of Technology
Argentinierstrasse 8
A-1040 Vienna, Austria
+43 1 58801 58520
[email protected]
Gunther Laure
Infineon Technologies Austria
Siemensstrasse 2
A-9524 Villach, Austria
+43 4242 305 0
[email protected]
Wolfgang Lazian
Infineon Technologies Austria
Siemensstrasse 2
A-9524 Villach, Austria
+43 4242 305 0
[email protected]
Jari Nurmi
Tampere University of Technology
P.O.Box 553
FIN-33101 Tampere
+358 3 3115 3884
[email protected]
ABSTRACT
High mask and production costs for the newest CMOS silicon
technologies increase the pressure to develop hardware platforms
usable for different applications or variants of the same
application. To provide flexibility for these platforms, the need for
software-programmable embedded processors is increasing. To
close the gap concerning consumed silicon area and power
dissipation between optimized hardware implementations and
software based solutions, it is necessary to adapt the subsystem of
the embedded processor to application specific requirements.
DSPxPlore can be used to explore the design space of RISC based
embedded core architectures. At an early stage of the project the
main architectural requirements of the application code can be
identified in order to meet the area and power dissipation
requirements. During the development process DSPxPlore
supports fine-tuning of the subsystem architecture (e.g.
modifications of the binary coding of instructions). DSPxPlore is
part of a development project for a configurable DSP core.
Keywords
DSPxPlore, Design Space Exploration, embedded DSP
Permission to make digital or hard copies of all or part of this work
for personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or commercial advantage,
and that copies bear this notice and the full citation on the first
page. To copy otherwise, to republish, to post on servers or to
redistribute to lists, requires prior specific permission and/or a
fee.
SAC’04, March 14-17, 2004, Nicosia, Cyprus
Copyright 2004 ACM 1-58113-812-1/03/04...$5.00
1. INTRODUCTION
Decreasing feature size and increasing system complexity make it
possible to map complex system functions onto one die (SoC,
System-on-Chip) or into one package (SiP, System in a Package).
High mask and production costs for the newest silicon technologies
increase the need for platform solutions that allow the same
silicon to be used for several applications. Embedded
software-programmable cores can provide the flexibility such
platform solutions require.
Therefore the importance of embedded processors such as
microcontrollers, protocol processors and digital signal processors
(DSP) is increasing.
One reason for using dedicated hardware implementations instead
of software-based solutions is their efficiency in terms of
consumed silicon area and power dissipation. To overcome the
efficiency drawbacks of software-based solutions without losing
the advantage of flexible platform architectures, providers of
embedded core architectures offer the possibility of adapting
their core architectures to application-specific requirements [1][2].
To make use of this additional degree of freedom, the requirements
of the application have to be understood. Quite often the core
decisions are made by the most experienced engineers, who focus on
the questions "what is already available?" and "what has already
been proven in silicon?" in order to reduce the risk. Because
applications differ in their requirements, using one core subsystem
for all of them leads to suboptimal solutions in terms of consumed
silicon area and power consumption. In the price-critical consumer IC
market this can be crucial for a company's market position and
revenues.
This paper introduces DSPxPlore, a design space exploration
methodology for an embedded configurable DSP processor.
DSPxPlore can be used to understand the requirements that the
application code places on the processor architecture at an early
stage of the project. During the development project DSPxPlore can be
used to fine-tune the chosen architecture. The first part introduces
the RISC-based DSP core architecture used as the basis for
DSPxPlore. The introduced methodology is not limited to this
architecture. The second part discusses the design space
of RISC-based DSP core architectures and illustrates the influence
of configuration parameters on the consumed silicon area and
power dissipation of the core subsystem. The third
part introduces the DSPxPlore methodology. DSPxPlore is based
on an optimizing C-Compiler (about 5 to 10% overhead compared
with manual assembly coding) and a cycle-true Instruction Set
Simulator (ISS), which is based on a configurable component framework.
An XML-based configuration file contains a description of the
chosen core architecture and is used to configure the tool chain
and to automatically update the documentation for the DSP core.
The last section covers some exploration examples and gives an
outlook on future work.
2. ARCHITECTURAL INTRODUCTION
This section gives a short introduction to the DSP
architecture for which DSPxPlore has been developed. The main
architectural features and the instruction set have been defined
with the aims of low silicon area and power dissipation of
the DSP subsystem and of enabling the development of an
optimizing C-Compiler (about 5-10% overhead compared with
manual assembler coding). An example architecture has been
chosen for this paper and is briefly introduced in this
section.
Figure 1: Core Overview
The proposed DSP core features a modified Dual-Harvard load-store
architecture (an overview is illustrated in Figure 1) [3]. An
independent data bus connects the program memory with the DSP
core, and an instruction buffer is used to execute loop constructs
power-efficiently [4]. Data and program memory have
different address spaces [5]. The bit width of the ports in Figure 1
is scalable, which allows application-specific adaptation of
memory bandwidth.
The core features a RISC-like 3-phase pipeline: instruction
fetch, decode and execute. The three phases can be split over
several clock cycles. The example architecture illustrated in
Figure 2 uses five clock cycles for the three pipeline phases.
The instruction fetch phase is split over a fetch and an align clock
cycle, the decode phase takes one clock cycle, and the execution
phase is split over two clock cycles (EX1, EX2). Splitting a pipeline
phase over several clock cycles makes higher clock
frequencies reachable. However, additional pipeline stages in the
fetch phase increase the number of branch delays, and additional
clock cycles for the execution phase lead to increased load-in-use
and define-in-use dependencies [6]. Therefore deeper pipeline
structures can lead to a decreased overall system performance due to
data and control dependencies in the application code.
Figure 2: Pipeline
The instructions are divided into three operation classes:
load/store instructions, used to transfer data between the data
memory and the register file, arithmetic/logic instructions
performing calculations on register values, and branch
instructions influencing the program flow. Each instruction
consists of one or two instruction words. The size of the native
instruction word for the example architecture is 20 bit; the
optional second word is used for long immediate values and
offsets (parallel word as in Figure 3).
Figure 3: instruction coding
All arithmetic instructions support three operands, which avoids
copying data between different registers of the register file.
All features of the DSP core are coded inside the instruction set;
mode bits, which could be used to increase code density, are not
used. The drawback of mode bits is that they limit instruction
scheduling when instructions are moved between different mode
sections [7]. As illustrated in Figure 3, the first three bits of
the instruction words are used for assigning the operation class
and the alignment information.
Figure 4: parallelism
The number of instructions that can be executed in parallel is
scalable. The example architecture enables the execution of up to
five instructions in parallel: two load/store,
two arithmetic and one branch instruction (illustrated in
Figure 4). The chosen programming model is VLIW (Very Long
Instruction Word), which implies static scheduling (data and
control dependencies are analyzed and resolved in software). The
low code density of traditional VLIW architectures is overcome
by xLIW (a scalable long instruction word) [8].
xLIW is based on VLES (Variable Length Execution Set) and
additionally supports a reduced program memory port. For this
purpose the already mentioned instruction buffer is also used [9].
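The issue rule of the example architecture (two load/store, two arithmetic and one branch instruction per execution bundle) can be sketched as a simple slot check. This is only an illustration: the class letters M, A and B follow the notation used later in Table 3, and the function name `fits_bundle` is hypothetical, not part of the described tool chain.

```python
from collections import Counter

# Issue slots of the example architecture: two load/store (M),
# two arithmetic/logic (A) and one branch (B) instruction per bundle.
EXAMPLE_SLOTS = {"M": 2, "A": 2, "B": 1}


def fits_bundle(op_classes, slots=EXAMPLE_SLOTS):
    """Return True if the given operation classes fit into one bundle."""
    used = Counter(op_classes)
    # Unknown operation classes can never be issued.
    if any(cls not in slots for cls in used):
        return False
    # Every class must stay within its number of parallel units.
    return all(used[cls] <= slots[cls] for cls in used)
```

A static scheduler would apply such a check in addition to the data and control dependency analysis, which is not modeled here.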
The example architecture supports two buses to data memory.
Therefore two independent AGUs (address generation units) are
available. Each AGU can make use of any of the address
registers (the address registers are not banked). If two addresses
generated in parallel access the same physical memory block, the core
hardware automatically detects the hazard and serializes the
memory operations. Data memory operations exceeding the
physical size of the memory port are realized as consecutive
memory operations on the same data bus.
All common DSP address modes such as memory direct, register
direct and register indirect addressing are supported. The auto
increment/decrement address operation supports pre- and post-address
calculation and efficient stack frame addressing. For modulo
addressing, the size of the modulo buffer is programmable, but the
start address of the buffer has to be aligned. This is a compromise
between hardware effort and supported features.
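The hardware saving behind the aligned start address can be illustrated with a small sketch. It assumes, as many DSPs do but as the text does not state, that the buffer starts at a multiple of the next power of two above its size, so that the base address can be recovered by masking the lower address bits. All names are hypothetical.

```python
def next_pow2(n):
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def modulo_post_increment(addr, step, size):
    """Advance addr by step inside a circular buffer of `size` entries.

    Assumption of this sketch: the buffer starts at a multiple of
    next_pow2(size) and addr currently points into the buffer, so the
    base address is obtained by clearing the lower address bits.
    """
    span = next_pow2(size)
    base = addr & ~(span - 1)             # aligned start address
    offset = (addr - base + step) % size  # wrap inside the buffer
    return base + offset
```

With an unaligned start address the base would have to be stored in an additional register and compared against on every update, which is the extra hardware effort the compromise avoids.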
Figure 5: register files
Load-store architecture implies that all operands of the arithmetic
instructions reside in registers. Therefore the register file has an
important role. The structure of the register file and the size and
number of registers are configurable; for the example
architecture a register file as in Figure 5 is used. It is split into two
parts, a data register file and an address register file.
Figure 6: data register file
The data register file as in Figure 6 consists of 8 accumulators, 8
long registers or 16 data registers. Two consecutive data registers
can be addressed as a long register. A long register including
guard bits (for higher precision calculation) can be addressed as an
accumulator. The size of the operands can be adapted to the
application. The registers inside the register file are orthogonal,
which means that none of them is assigned to a certain instruction.
The drawback of an orthogonal register file is the crossbar needed to
enable mapping of the read and write ports to each of the
registers.
3. DESIGN SPACE FOR RISC BASED DSP
ARCHITECTURES
This section introduces the available design space for
RISC based DSP subsystems with influence on area consumption,
power dissipation and overall system performance. The example
architecture is used to illustrate the main architectural features.
The influence of some of the parameters is illustrated by first
exploration results.
3.1 Register File
The register file has a central role in load-store architectures. All
arithmetic instructions fetch their operands from the
register file and store their results into the register file. Therefore
the number of registers supported by the register file influences the
performance parameters of the DSP subsystem.
Supporting fewer registers reduces the necessary core area but can
lead to additional spill code. Spill code is added if no registers are
available to store a result. In this case register file content has to
be stored to data memory to free register resources. If any of the
spilled data is needed again, it has to be reloaded from memory.
The added spill code increases the demand on program memory
and therefore decreases the code density of the application code.
Further, it increases execution time and therefore decreases system
performance.
Supporting a larger register file with more entries increases the
core area and again influences code density. More entries
require more coding space to address the register entries,
especially because the orthogonality required for the development
of an optimizing C-Compiler rules out banked registers and
registers reserved for special functions.
Figure 7: register file (64-bit accu)
It is possible to change the structure of the register file. Figure 7
illustrates an example of a 64-bit data register file (e.g.
used for a 64-bit/quad MAC architecture). The register file on the
left side of Figure 7 has a structure similar to the register file in
Figure 5; instead of using guard bits, the accumulator supports
64 bits. The number of addressable data registers has not been
doubled; the coding space necessary for the additional data
registers would influence the code density. If an application code
requires the use of more than 16 data registers to reduce the spill
code, a register file like the one in Figure 7 can support up to 32
data registers. The same register file on the right side of Figure 7
has a different structure. Eight of the data registers are mapped onto
the first two accumulator registers, the remaining eight are split onto
the next six accumulator registers.
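The overlapping register views described for the data register file of Figure 6 (two consecutive data registers form a long register, a long register plus guard bits forms an accumulator) can be sketched as follows. The widths of 16 data bits and 8 guard bits and the placement of the lower half in the even register are assumptions made for illustration; the text only states that the operand sizes are configurable.

```python
DATA_BITS = 16   # assumed width of one data register
GUARD_BITS = 8   # assumed number of guard bits per accumulator


class DataRegisterFile:
    """Data registers D, long registers L and accumulators A as views
    onto the same storage (illustrative model only)."""

    def __init__(self, n_acc=8):
        self.d = [0] * (2 * n_acc)   # 16 data registers for 8 accumulators
        self.guard = [0] * n_acc     # guard bits, one entry per accumulator

    def write_long(self, i, value):
        mask = (1 << DATA_BITS) - 1
        self.d[2 * i] = value & mask                       # lower half
        self.d[2 * i + 1] = (value >> DATA_BITS) & mask    # upper half

    def read_long(self, i):
        return (self.d[2 * i + 1] << DATA_BITS) | self.d[2 * i]

    def read_acc(self, i):
        guard = self.guard[i] & ((1 << GUARD_BITS) - 1)
        return (guard << (2 * DATA_BITS)) | self.read_long(i)
```

Because every view addresses the same storage, no register is tied to a particular instruction, which is the orthogonality the compiler relies on.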
3.2 Data paths
Increasing the number of data paths and of instructions executed
in parallel increases the maximum possible calculation power of
core architectures. Executing several instructions in parallel
requires that their operands are available.
Therefore a balanced relation between memory bandwidth, the
number of independent load/store instructions and the number of
arithmetic data paths characterizes the possible performance of
core architectures.
Table 1: ILP
Source             Benchmark                ILP (range, average)
Tjaden and Flynn   31 library programs      1.2-3.2  1.9
Kuck et al.        20 Fortran programs      1.2-17   4
Riseman, Foster    7 Fortran/assembler      1.2-3    1.8    1.4-1.6  1.6
Jouppi             8 Modula-2 programs      1.6-2.2  1.9    2.4-3.3  2.8
Lam, Wilson        6 SPECmarks + 4 others   1.5-2.8  2.1    2-293
Additional influence comes from the application program
executed on the core architecture. Control and data dependencies
can lead to a low usage of the provided core resources. Table 1
lists some examples of ILP (instruction level parallelism).
The benchmark examples are based on general purpose
code (columns 3 and 4) as well as scientific code (columns 5 and 6).
The average ILP in these examples is about two to three instructions.
Traditional algorithms executed on DSP cores are filtering
operations. Filter algorithms are characterized by an inner loop
in which a significant amount of execution time is spent. These inner
loops (considering software pipelining) can make efficient use of
resources provided in parallel. Therefore the ILP for this kind of
algorithm is higher than that for general purpose code. The MAC
(multiply and accumulate) instruction is typically used, e.g. for FIR
filter algorithms. Therefore the performance of DSP cores is
measured in the number of provided MAC instructions per second
and in the number of clock cycles needed for execution
(considering the define-in-use dependency).
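As an illustration of why MAC throughput is used as a performance measure, the following sketch shows a direct-form FIR filter whose inner loop consists of exactly one multiply-accumulate per tap. The function name and the returned MAC counter are illustrative additions, not taken from the paper.

```python
def fir(samples, coeffs):
    """Direct-form FIR filter.

    Returns the output samples and the number of MAC operations,
    which is what a DSP core would execute in its inner loop.
    """
    taps = len(coeffs)
    outputs = []
    macs = 0
    for n in range(taps - 1, len(samples)):
        acc = 0
        for k in range(taps):                  # inner loop: one MAC per tap
            acc += coeffs[k] * samples[n - k]
            macs += 1
        outputs.append(acc)
    return outputs, macs
```

Each iteration of the inner loop needs two operands and one MAC, which is why the number of load/store units, the memory bandwidth and the number of MAC data paths have to be balanced against each other.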
Changing the number and kind of data paths influences the
core hardware. If the changes in the data path structure
influence the instruction set (by adding or removing
instructions), the code density is affected. Changes to the data
path structure also influence the execution bundle. Therefore,
after changing the data path structure it is necessary to verify that
the average relation between the size of the fetch and execution
bundle is still balanced and that the memory bandwidth still fits
the data path structure.
3.3 Memory bandwidth
The memory bandwidth is closely related to the data path
parameter. Providing a lot of parallelism with insufficient memory
bandwidth results in poor usage of the available core resources.
The size of the memory ports influences the power dissipation
and consumed silicon area of a DSP subsystem.
Data memory port: Today most commercially available DSP
cores support two independent data memory buses.
Supporting additional buses increases the flexibility of data
transfer, and several algorithms, e.g. FFT algorithms, can make use
of it. The drawback of more memory ports is the hardware
effort for additional AGUs (Address Generation Units) and the
wiring effort to the memory subsystem.
Program memory port: For most commercially available DSP
cores, the size of the program memory port is equal to the
maximum number of instructions executed in parallel. As for
the data memory port, the wiring influences area and power
consumption. One possibility to decouple the size of the program
memory port from the parallelism provided by the execution unit is
the use of an instruction buffer, as mentioned in section 2.
3.4 Instruction size/encoding
The instruction set describes the functionality supported by the
core architecture. The mapping of the instruction set to binary
instruction words has significant influence on the area
consumption of the core subsystem, because the memory used to
store the instructions dominates the area consumption.
Figure 8 illustrates the mapping of the same instruction set to two
different instruction layouts. In the right example, the instruction
set has been mapped using instructions with a native size of 16 bit;
32 bit are used for the remaining instructions that cannot be mapped
to the native instruction word, such as three-operand arithmetic
instructions. For the example in the left column a native instruction
word size of 20 bit is used, which allows all instructions to be
mapped into the native instruction word size. The second word is only
used for long immediate values and offsets. For a certain algorithm
(e.g. some control code as in Figure 8) the smaller native instruction
word size results in a lower overall code effort. This can be different
for another code example which, for instance, requires three-operand
instructions that are coded more efficiently in the longer native
instruction word.
Figure 8: example for instruction set mapping
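The trade-off between the two layouts can be made concrete with a rough estimate. The sketch below assumes, following the description above, that three-operand instructions need a second word in the 16-bit layout but fit into the 20-bit native word, and that long immediates always need a second word. The instruction mixes in the usage note are invented for illustration and are not the data behind Figure 8.

```python
def code_size_bits(n_short, n_three_operand, n_long_immediate, native_bits):
    """Estimate the code size in bits for one instruction layout.

    Assumption of this sketch: a native word narrower than 20 bit
    cannot hold a three-operand instruction; long immediates always
    need a second word.
    """
    one_word = n_short
    two_word = n_long_immediate
    if native_bits < 20:
        two_word += n_three_operand
    else:
        one_word += n_three_operand
    return native_bits * (one_word + 2 * two_word)
```

For a hypothetical control-code mix of 90 short, 5 three-operand and 5 long-immediate instructions the 16-bit layout is smaller (1760 versus 2100 bits); for an arithmetic-heavy mix of 40, 55 and 5 the 20-bit layout wins (2100 versus 2560 bits).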
The binary coding influences the switching activity at the
program memory port, and therefore the mapping of the
instruction set to a certain binary coding influences the
power dissipation of the DSP subsystem. Frequently used
instructions can be coded more efficiently, resulting in an
increased code density. Instructions inside the same execution
bundle can also be reordered in order to decrease
power dissipation at the program memory bus [10][11][12].
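A common first-order proxy for switching activity on a bus is the number of bit transitions (Hamming distance) between consecutive words. The following sketch uses that proxy; it is an illustrative simplification and not the power model of the cited work.

```python
def bus_toggles(words):
    """Count bit transitions between consecutive words on a bus."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))
```

For example, the sequence 0b0000, 0b1111, 0b0000 causes eight transitions, while the reordered sequence 0b0000, 0b0000, 0b1111 causes only four, which is the kind of effect a reordering or recoding step tries to exploit.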
3.5 Instruction buffer size
The instruction buffer mentioned in section 2 is not available in
every core, but is discussed here for the core architecture
introduced in section 2. For this core the instruction buffer is used
to compensate for the memory bandwidth mismatch between fetch
and execution bundle and also to execute loop constructs
power-efficiently by reducing the number of memory accesses. To make
use of this feature, the size of the instruction buffer has to be
scalable so that the buffer can be adapted to the requirements of
the application code. Power-efficient loop handling can only be
achieved if the loop body fits into the buffer. Therefore the chosen
size of the instruction buffer influences the power dissipation of
the core subsystem. On the other hand, a buffer with
many entries leads to a significant increase in core area.
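The effect of the buffer size on program memory accesses can be sketched with a deliberately simple model. It assumes that a loop body which fits into the buffer is fetched once and then replayed from the buffer, and that a larger loop body is fetched again in every iteration; the real buffer behavior described in [4] may differ.

```python
def program_memory_fetches(loop_words, iterations, buffer_words):
    """Instruction words fetched from program memory for one loop.

    Simplified model (an assumption of this sketch): a loop body that
    fits into the buffer is fetched once and replayed from the buffer;
    otherwise every iteration is fetched from program memory again.
    """
    if loop_words <= buffer_words:
        return loop_words
    return loop_words * iterations
```

Under this model an 8-word loop executed 100 times costs 8 fetches with a 16-word buffer, whereas a 32-word loop costs 3200 fetches, which shows why the buffer size has to be matched to the loop bodies of the application.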
3.6 Pipeline stages
Increasing the number of pipeline stages allows a higher core
frequency to be reached. Higher core frequencies lead to
increased power dissipation due to the need for a higher supply
voltage and an increased switching activity [13].
Increasing the number of pipeline stages also increases the core
complexity, because additional hardware circuits such as bypass
logic become necessary to reduce the increased dependency between
instructions in different pipeline stages [14][15][16].
Increasing the number of pipeline stages can even lead to a
decrease of system performance due to control and data
dependencies. Therefore a balanced pipeline structure that considers
the dependencies of the application code and the physical aspects of
the technology is important to obtain a good trade-off between area
consumption, power dissipation and system performance.
Classifying core subsystems by MIPS, MOPs or MMACs or any
other similar parameter is misleading: for an embedded core the
core performance has to be classified by how efficiently an
application code can make use of the available core resources.
Increasing the number of pipeline stages for the fetch phase of the
pipeline relaxes the timing at the program memory but increases
the number of branch delays. Additional hardware circuits have to
be introduced to compensate for the unused branch delays [17][18].
Predicated execution can help to reduce the number of branch
delays by reducing the number of conditional branch instructions
[19].
Adding pipeline stages to speed up the execution phase and to
relax the timing at the data memory ports leads to increased
define-in-use and load-in-use dependencies. Bypass logic can be
used to reduce the dependencies, but again at the cost of increased
core complexity.
3.7 Summary
This section has briefly introduced the architectural
features of RISC based core architectures (with focus on DSP
cores) which significantly influence the area consumption and
power dissipation of the core subsystem. None of these
parameters can be considered in isolation; changing one of them
influences several others. There is no single solution
that satisfies the requirements of all applications efficiently. It is
the application code executed on a core architecture that makes a
certain core configuration efficient. To understand the requirements
of an application code, the following section introduces a
design space exploration methodology for RISC based core
subsystems.
4. EXPLORATION METHODOLOGY
The DSP core architecture introduced in section 2 allows the
architectural features introduced in section 3 to be adapted.
Adapting a configurable DSP core architecture to application specific
requirements makes it possible to reduce area consumption and power
dissipation. To find the optimal core architecture (optimal for one
application) it is important to understand the application specific
requirements.
For this purpose DSPxPlore is introduced. DSPxPlore can be used
to analyze the influence of certain core subsystem configurations
on the system parameters core area, power dissipation and overall
system performance. During the product development process
DSPxPlore supports fine-tuning of the core subsystem. The
exploration methodology is based on an optimizing C-Compiler
and a configurable ISS (instruction set simulator).
Figure 9: DSPxPlore Overview
Figure 9 gives an overview of the exploration methodology.
An optimizing C-Compiler is used to generate static
analysis results. A cycle-true Instruction Set Simulator (ISS) is
used for the evaluation of dynamic results. Both results together can
be used to analyze the application specific requirements on the
core subsystem. The chosen core configuration is stored in an
XML-based configuration file, which is used by both tools.
4.1 Static analysis
To obtain reasonably accurate results from static analysis it is
necessary to use a C-Compiler that generates near-optimal
assembly code (compared to manually optimized code). If the
quality of the C-Compiler is poor, the generated results can be
misleading and architectural decisions can lead to a suboptimal
solution. The C-Compiler for the core architecture introduced in
section 2 generates code with about 5-10% overhead
compared with manual coding. Some of the generated static
evaluation results are described in the following subsections.
4.1.1 code size
The memory of a DSP subsystem dominates the silicon area
consumption. Therefore a high code density reduces area
consumption. An example for the parameter code size is
illustrated in Figure 10. The number of instructions necessary to
port the application code to the chosen core architecture is
counted and the required long instructions are summed up. The
chosen instruction word length is normalized to bytes to obtain a
comparable value. The example architecture uses a 20-bit
native instruction word and therefore the number of counted
instructions has to be multiplied by 2.5 to get the code effort in
bytes. Instructions with long words count double.
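The normalization to bytes described above amounts to a one-line calculation. In the following sketch the function and parameter names are illustrative; the factor 2.5 follows from the 20-bit native word (20 / 8 bytes), and each long instruction contributes one additional word.

```python
def code_size_bytes(n_instructions, n_long_instructions, native_bits=20):
    """Code effort in bytes.

    n_instructions counts every instruction once; instructions with a
    long word count double, i.e. each adds one further native word.
    """
    words = n_instructions + n_long_instructions
    return words * native_bits / 8
```

For example, 100 instructions of which 10 carry a long word occupy 110 words of 20 bit, i.e. 275 bytes.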
Figure 10: code size analysis
4.1.2 parallelism
The analysis result parallelism indicates how well the provided
core resources are used. Data and control dependencies in the
application code restrict the parallel execution of instructions and
lead to a poor use of the available processor resources. The
example in Figure 11 illustrates the dependency problem (a summary
on the left side, more detail on the right side).
Figure 11: bundle assignment
Only a few execution bundles can make use of the parallel units
(the example architecture can execute five instructions in
parallel). DSP architectures like the C62x from Texas
Instruments will not achieve a higher usage of the core resources,
even if their relative performance (calculated as the number of
possible parallel instructions multiplied by the reachable clock
frequency) yields higher numbers [20].
4.1.3 instruction histogram
The instruction histogram lists the instructions used and their
static occurrence inside the application
code. This result can be used to optimize the instruction set during
fine-tuning of the core subsystem (e.g. optimized coding of
frequently used instructions).
4.1.4 immediate values
The size of the immediate values can already be analyzed during
static analysis. This gives an indication of the coding space
needed inside the instruction set. These results (similar to
Figure 8) can be used to choose the optimal size for the native
instruction word.
4.1.5 delay slots
The number of delay slots can significantly influence the overall
system performance of a DSP subsystem. Delay slots are caused
by branch instructions or function calls. Increasing the number of
pipeline cycles in the fetch phase results in more delay slots.
Some of the delay slots can be filled with useful instructions; the
others are lost cycles.
4.2 Dynamic analysis
To weight the static analysis results, dynamic analysis is
necessary. A cycle-true instruction set simulator (ISS) is used to
obtain the results. xSIM, the ISS used for the core introduced in
section 2, is based on a configurable component framework. An
XML-based configuration file is used to define the chosen core
configuration. Some of the dynamic results are described in the
following subsections.
4.2.1 program memory fetch
Fetching instructions from program memory significantly
influences the power dissipation of the DSP subsystem. Therefore
reducing the switching activity at the program memory port
reduces power dissipation. The number of fetch cycles from
program memory is analyzed, the fetch frequency of the different
fetch bundles is counted and the alignment of loop and
branch constructs is considered.
4.2.2 unused program memory
The DSP core introduced in section 2 features an instruction
buffer to overcome the bandwidth mismatch between fetch and
maximum execution bundle size and to execute loop constructs
power-efficiently. Especially at breaks in the program flow,
program data that has already been fetched is not executed. This
parameter is used to analyze which code sections have been fetched
but not executed and can be used to reduce the switching activity at
the program memory port.
4.2.3 execution count per bundle
Counting the execution frequency of each execution bundle can
be used to identify hot spots and to optimize the HW/SW
partitioning (e.g. deciding which parts can be implemented more
efficiently in hardware). Together with the static parallelism
analysis, the provided parallelism can be classified and the number
of data paths adjusted to the requirements of the application code.
The results can be visualized by xSIM, which eases their
interpretation.
4.2.4 execution count per instruction
The list of used instructions generated during static analysis is
extended by the execution count of each instruction. With this
information the instruction set and the binary coding can be
optimized, increasing code density and decreasing switching
activity at the program memory port. Frequently executed
instructions can be coded more efficiently, and unused instructions
can even be removed.
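A minimal sketch of such an execution-count analysis is given below. It assumes that the simulator can deliver a trace of executed instruction mnemonics; the trace format, the function name and the mnemonics in the usage note are hypothetical and do not describe the actual xSIM interface.

```python
from collections import Counter


def instruction_usage(instruction_set, executed_trace):
    """Return (execution count per mnemonic, never executed mnemonics)."""
    counts = Counter(executed_trace)
    unused = sorted(op for op in instruction_set if counts[op] == 0)
    return counts, unused
```

For a trace such as ld, mac, mac, st, br, mac over the instruction set {mac, ld, st, br, abs}, the most frequent entry is mac with three executions, making it a candidate for a short encoding, while abs is never executed and is a candidate for removal.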
4.2.5 stall cycles
During execution of application code stall cycles can occur.
During stall cycles the core does not contribute to the system
performance of the DSP subsystem. Stalls can be caused e.g. by
simultaneous accesses to the same physical memory block
or by missing program data due to an empty instruction buffer
(e.g. at unaligned branch targets). This information can be used
to identify possible reasons and to modify the core architecture
and the application code to prevent unnecessary stall cycles.
5. RESULTS
This section illustrates some of the results obtained with
DSPxPlore. The set of benchmark examples consists of traditional
DSP functions like FFT and code examples from the area of
cryptology, but also of control code examples, e.g. framing
algorithms.
For the results in Table 2 the size of the register file has been
modified. The number in the first column is the number
of supported accumulator registers.
Table 2: register size evaluation
#regs   bundles   inst.   delay nops   code size
4       24008     35284   2473         47263
8       17041     26544   2722         31810
16      14507     23046   2826         26497
Increasing the number of registers relaxes the register pressure
during register allocation, resulting in a decreased code effort. In
the second row the number of available registers has been increased
from four to eight, which reduces the code size from 47263 to 31810,
i.e. by about 33% (put differently, the four-register configuration
needs about 50% more code). Doubling the register file again from
eight to 16 accumulator registers reduces the code size only by
about a further 17%.
The algorithm examples cannot make use of the additional
registers. The results in Table 2 do not include the influence of
the increased number of registers on the coding space. The
coding used for the comparison supports the medium-sized
register file; taking the difference in coding space into account
will reduce the absolute distance between the result values.
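The relative changes can be recomputed directly from the code size column of Table 2. The helper below is an illustrative addition; the values are copied from the table.

```python
# Code size column of Table 2, indexed by the number of accumulators.
CODE_SIZE = {4: 47263, 8: 31810, 16: 26497}


def reduction_percent(before, after):
    """Relative reduction from `before` to `after` in percent."""
    return 100.0 * (before - after) / before


step_4_to_8 = reduction_percent(CODE_SIZE[4], CODE_SIZE[8])
step_8_to_16 = reduction_percent(CODE_SIZE[8], CODE_SIZE[16])
overhead_4_vs_8 = 100.0 * (CODE_SIZE[4] - CODE_SIZE[8]) / CODE_SIZE[8]
```

Measured against the four-register configuration, going to eight registers saves roughly a third of the code; measured against the eight-register configuration, the four-register code is roughly half as large again. The second doubling saves about 17%.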
Table 3: parallelism analysis
model        bundles   inst.   delay nops   code size
0 1M1A-1B    21070     27962   2694         33441
1 2M1A-1B    20783     28022   2714         33441
2 1M2A-1B    18194     27728   2862         33196
3 2M2A-1B    17872     27871   2866         33421
4 2M3A-1B    17347     27890   2958         33558
The first column in Table 3 indicates the number of instructions
executed in parallel: M stands for load/store instructions, A for
arithmetic/logic instructions and B for branch instructions. As
expected, adding more units in parallel decreases the number of
necessary execution bundles (the bundles column in Table 3). Data
and control dependencies reduce the effect of further added units.
One remark concerning the increase in the number of branch delay
NOPs: during compilation, the C-Compiler has been configured
to schedule the instructions as early as possible. Providing more
parallelism leads to shorter branch distances and therefore fewer
instructions are available to fill delay slots. The number of
NOP instructions necessary for filling delay slots therefore
increases.
Table 4: model 0, branch delay slots
branch delay   bundles   inst.   delay nops   code size
2              21051     27943   2694         33422
3              22544     29438   2686         34868
4              24118     31004   2664         36432
For the results in Table 4 (model 0) and Table 5 (model 4) the
same core models as for the results in Table 3 are used. The
parameter in the first column is the number of branch delays.
Additional branch delays, caused by further clock cycles used for
the fetch phase of the pipeline (e.g. to relax the timing at the
program memory port), lead to an increased number of
instructions and therefore to a decreased code density (e.g. due to
additional NOP instructions to fill delay slots).
Table 5: model 4, branch delay slots
branch delay   bundles   inst.   delay nops   code size
2              17324     27867   2959         33537
3              18897     29451   2966         35121
4              20524     31055   2947         36725
Comparing the results for model 0 and model 4, the number of
necessary execution bundles decreases as expected. The
configuration used for Table 4 can execute only
three instructions in parallel, the configuration used for Table 5 up
to six instructions in parallel.
6. OUTLOOK
DSPxPlore is used to understand the requirements of the
application code on the core architecture, to identify hot spots and
to optimize the HW/SW partitioning. DSPxPlore is still an expert
system: interpreting the generated results and deriving the related
modifications of the core architecture requires a deep understanding
of the core architecture, the configuration parameters and the
influence of the chosen configuration on silicon area and power
consumption. In the next development phase it will
be possible to get more easily understandable feedback from
DSPxPlore. This will enable the system architect to optimize the core
subsystem for application specific requirements and to get hints
for further optimizations.
7. SUMMARY
DSPxPlore is a design space exploration methodology for RISC
based embedded cores. Analyzing application specific
requirements at an early stage of the project makes it possible to
modify the core subsystem and thereby to obtain low silicon area
consumption and low power dissipation. During the design
process DSPxPlore can be used for fine-tuning of the core
subsystem, e.g. optimization of the binary coding to reduce power
dissipation. With core subsystems optimized for a specific
application it is possible to reduce the gap between a dedicated
hardware implementation and a core based solution that provides the
flexibility of software programmability. DSPxPlore is part of a
project for a configurable DSP core.
8. ACKNOWLEDGMENTS
The work has been supported by the Christian Doppler Lab for
Compilation Techniques for Embedded Processors and by the EC
with the project SOC-Mobinet (IST-2000-30094).
9. REFERENCES
[1] www.arc.com
[2] www.tensilica.com
[3] Hennessy, J. L., Patterson, D. A., Computer Architecture. A
Quantitative Approach, Morgan Kaufmann Publishers, San
Mateo CA, 1996.
[4] Panis, C., Bramberger, M., Grünbacher, H., and Nurmi, J., A
Scaleable Instruction Buffer for a Configurable DSP Core,
ESSCIRC 2003, Lisbon, Portugal, 2003.
[5] Lapsley, P., Bier, J., Shoham, A., and Lee, E.A., DSP
Processor Fundamentals, Architectures and Features, IEEE
Press, New York, 1997.
[6] Sima, D., Fountain, T., and Kacsuk, P., Advanced Computer
Architectures: A Design Space Approach, Addison Wesley
Publishing Company, Harlow, 1997.
[7] Morgan, R., Building an Optimizing Compiler, Digital Press,
1998.
[8] Panis, C., Leitner, R., Grünbacher, H., and Nurmi, J., xLIW
– a Scaleable Long Instruction Word, ISCAS 2003,
Bangkok, Thailand, 2003.
[9] Panis, C., Leitner, R., Grünbacher, H., and Nurmi, J., Align
Unit for a Configurable DSP Core, CSS 2003, Cancun,
Mexico, 2003.
[10] Hirnschrott, U., and Krall, A., VLIW Operation Refinement
for Reducing Energy Consumption, Proceedings of
International Symposium on System-on-Chip '03, Tampere,
2003.
[11] Shin, D., Kim, J., and Chang, N., An Operation
Rearrangement Technique for Power Optimization in
VLIW Instruction Fetch, ACM, Munich, 2001.
[12] Choi, K., and Chatterjee, A., Efficient Instruction-Level
Optimization Methodology for Low-Power Embedded
Systems, Proceedings of International Symposium on System
Synthesis ISSS 01, 2001.
[13] Chandrakasan, A., Sheng, S., and Brodersen, R., Low-Power
CMOS Digital Design, JSSC, No. 4, 1992.
[14] Smith J.E., A study of branch prediction strategies, in Proc.
8th ISCA, pp. 135-48, 1981.
[15] Albert D. and Avnon D., Architecture of the Pentium
Microprocessors, IEEE Micro, June 1993.
[16] Heinrich J., MIPS R10000 Microprocessor User's Manual, Alpha
Draft 11 Oct., MIPS Technologies Inc., Mountain View, CA,
1994.
[17] Motorola Inc., PowerPC 620 RISC Microprocessor
Technical Summary, MPC 620/D, Motorola Inc., 1994.
[18] Lee J.K.F. and Smith A.J., Branch prediction strategies and
branch target buffer design, Computer 17(1), pp. 6-22, 1984.
[19] Pnevmatikatos D.N. and Sohi G.S., Guarded Execution and
branch prediction in dynamic ILP processors, in Proc. 21st
ISCA, pp. 120-9, 1994.
[20] Texas Instruments, TMS320C6000 CPU and Instruction Set
Reference Guide, Texas Instruments, October 2000.
PUBLICATION 12
C. Panis, A. Schilke, H. Habiger, J. Nurmi, “An Automatic Decoder Generator for a
Scaleable DSP Architecture”, in Proceedings of the 20th Norchip Conference (Norchip’02),
Copenhagen, Denmark, November 11-12, 2002, pp. 127-132.
©2002 IEEE. Reprinted, with permission, from proceedings of the 20th Norchip Conference.
An Automatic Decoder Generator for a Scaleable DSP
Architecture
Ch. Panis, A. Schilke
Infineon Technologies
Siemensstrasse, Villach, Austria
christian.panis@infineon.com
armin.schilke@infineon.com

H. Habiger
Fachhochschule Kärnten (Carinthian Tech Institute)
Europastraße, Villach, Austria
[email protected]

J. Nurmi
Tampere University of Technology
P.O. Box, Tampere, Finland
[email protected]
Abstract
Scaleable and flexible DSP cores are necessary to obtain the area and power dissipation
requirements of SOC applications. The drawback of flexible structures is the additional
verification and test effort. One approach to overcome this drawback is the usage of
generators producing correct code by construction. The paper describes the structure and
implementation of a generator which generates VHDL code for an instruction decoder. The
input of the generator is the chosen instruction set description. This enables the definition of
application specific instruction sets to reach high code density, low power dissipation and
increased performance. The development of the Decoder Generator has been done within the
scope of the project SOC-Mobinet (System-on-Chip for Mobile Internet).
Introduction
Today's SOC applications map a complex system to silicon and have strong demands
concerning power dissipation and consumed silicon area. This often leads to the usage of the
newest CMOS technologies (e.g. … µm and lower), with the drawback that the mask costs
and the costs for silicon fabrication are rising exponentially. Furthermore, the growing
complexity is not fully covered by the reduction of feature size, resulting in an increase of
silicon area and component cost. Therefore the more promising approach will be a better
utilization of the available silicon.
With an increasing importance of software, embedded processors [1] are one example where
increasing performance requirements should not be covered by additional usage of silicon
area or power consumption.
This paper describes a decoder generator which is part of a development project with the aim
to provide a scaleable and flexible DSP kernel meeting the application specific requirements
of SOC applications [2]. To overcome the effort of manual VHDL coding and the related
verification and test effort, the decoder generator automatically generates the VHDL code for
the instruction decoder of the DSP kernel directly from the chosen instruction set description.
In the first section the structure of the instruction set will be explained. The middle part covers
the structure of the decoder generator, and the solution process for different problems will be
pointed out. A short code example illustrates the outcome of the generation process.
Instruction structure
The instructions can be split into 3 instruction classes (computational, load/store, branch
instructions). Each of the instructions consists of 20 bits, shown in Figure 1. The first 3 bits
will be used to assign the instruction to a certain instruction class and to handle the dispatch
of the instructions to the decode stage. The remaining 17 bits will be used to encode the
instruction itself.
bit19  bit18  bit17  ...  bit0
IC     IC     DP     Instruction Coding Area
Figure 1: Instruction Word
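The field layout of the instruction word can be sketched in C++ as follows. This is a minimal illustration, not the paper's code: the names `InstructionFields` and `splitWord` are hypothetical, and the reading of the figure labels (two IC bits for the instruction class, one DP bit for dispatch, 17 coding bits) is an assumption taken from the bit labels bit19 to bit0.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout: bits 19-18 = IC, bit 17 = DP, bits 16-0 = coding area.
struct InstructionFields {
    std::uint32_t ic;      // instruction class selector (2 bits)
    std::uint32_t dp;      // dispatch bit (1 bit)
    std::uint32_t coding;  // instruction coding area (17 bits)
};

// Split a 20-bit instruction word into its three fields.
inline InstructionFields splitWord(std::uint32_t word) {
    assert(word < (1u << 20));  // only 20 bits are defined
    InstructionFields f;
    f.ic = (word >> 18) & 0x3u;
    f.dp = (word >> 17) & 0x1u;
    f.coding = word & 0x1FFFFu;
    return f;
}
```

For example, the word 0xA0005 would split into class selector 2, dispatch bit 1 and coding value 5 under this layout.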
Some parts of the instruction words already contain binary coding, some are filled with
symbols. Figure 2 shows an example ADD instruction.
bit19  bit18  bit17  ...  bit0
IC     IC     DP     I I I I 0 D1 D1 D1 D1 D2 D2 D2 D2 D3 D3 D3 D3
Figure 2: ADD Instruction Word
These symbols can be divided in 3 groups:
• Single bits
Single bits will be used to enable or disable an additional functionality of a certain instruction.
For the class of load/store instructions e.g. the fractional bit (f), which enables mirroring of
the calculated address, used for FFT algorithms. An example for the computational class
would be the saturate functionality (s), indicating a saturation of a result before storing to the
register file. These single bits can be located at each place of the 17 encoding bits.
• Group bits
Group bits will be used to encode flexible parameters. For the class of branches e.g. the loop
lengths of a hardware loop (n), which enable the programmer to use a non-overhead loop
construct. An example for the load/store class would be a relative offset for a load instruction
(o), which will be added to the current value of the address register. The group bits can be
located at each place of the 17 encoding bits and can be torn apart in subgroups, illustrated in
Figure 3 for a branch instruction.
bit19  bit18  bit17  ...  bit0
IC     IC     DP     0 0 0 o o o o o o 1 1 1 o o o o o
Figure 3: BSR Instruction Word
• Operand bits
Operand bits will be used to encode the source and destination registers of a certain function.
The number and the type of addressed registers can differ and have to be mapped correctly by
the decoder generator. Figure 4 illustrates an example for a load instruction. If the number of
data registers exceeds the available coding space, it has to be adjusted.
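Because group bits may be torn apart into subgroups, reading a symbol means collecting its bits from several places in the word. The following C++ sketch shows one way to do that. The function name `extractSymbol` and the string representation of a coding (most significant bit first, '0' and '1' for fixed bits, any letter for a symbol) are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Collect the bits of `word` at every position where `pattern` carries
// `symbol`, most significant bit first. pattern[0] describes the highest
// bit of the coding; the pattern may hold at most 32 characters.
inline std::uint32_t extractSymbol(const std::string& pattern, char symbol,
                                   std::uint32_t word) {
    assert(pattern.size() <= 32);
    std::uint32_t value = 0;
    const std::size_t n = pattern.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (pattern[i] == symbol) {
            const std::uint32_t bit = (word >> (n - 1 - i)) & 1u;
            value = (value << 1) | bit;  // append below the bits read so far
        }
    }
    return value;
}
```

With the branch coding `000oooooo111ooooo`, the two subgroups of `o` are concatenated into a single 11-bit offset value.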
A  a  d d d
0  0  Data registers 0 to 7
0  1  Data registers 8 to 15
1  0  Long registers 0 to 7
1  1  Address registers 0 to 7
Figure 4: Operand Register Coding
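The operand register coding can be read as a small lookup: two selector bits choose the register type and three bits choose the register within it. The C++ sketch below assumes a 5-bit operand field ordered A, a, d, d, d with A as the most significant bit, and assumes that the four rows of the figure follow the selector values 00, 01, 10, 11. The type and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

enum class RegisterType { Data, Long, Address };

struct Operand {
    RegisterType type;
    unsigned index;  // register number within its type
};

// Decode a 5-bit operand field laid out as A a d d d (A is the MSB).
inline Operand decodeOperand(std::uint32_t field) {
    assert(field < 32u);
    const unsigned select = (field >> 3) & 0x3u;  // bits A and a
    const unsigned ddd = field & 0x7u;            // register within the bank
    switch (select) {
        case 0: return {RegisterType::Data, ddd};        // data 0..7
        case 1: return {RegisterType::Data, ddd + 8u};   // data 8..15
        case 2: return {RegisterType::Long, ddd};        // long 0..7
        default: return {RegisterType::Address, ddd};    // address 0..7
    }
}
```

Splitting the data registers over two selector values is what lets three `d` bits address sixteen data registers.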
Decoder Generator
Using the presented instruction structure, the decoder generator can now be defined. We need
to specify the structure, the input and a C++ container class for the instruction mapping [3]
[4]. Then the generation process can be explained, which consists of the database generation,
the handling of a decode tree and finally the output generation, which creates the VHDL
description of the instruction decoder.
Structure
The internal database of the decoder generator is built up of a spreadsheet and a container
class. The spreadsheet defines the instruction set for a certain application. The container class
contains predefined decode statements, which will be updated during the generation process.
The register configuration will be taken to generate a VHDL package which is used to define
extended functions.
[Figure 5: Structure of the Decoder Generator. Inputs: spreadsheet, register configuration,
container. Stages: database generation, decode tree, output generation. Outputs: VHDL
computational instruction decoder, load/store instruction decoder, branch instruction
decoder and VHDL package.]
Input
The instruction set is described in the spreadsheet (a small example is depicted in Figure 6)
and consists of three columns. Each row describes an instruction group, which itself can
consist of a couple of sub-instructions. The instructions of an instruction group differ from the
other group members only by the value of the symbols. The first column contains the coding
of the instruction group. The second column keeps the name of the instruction group and the
third one is optional because it will be only used for instruction groups which have operands
in their coding. The three letters indicate the valid register types for the instruction group
(e.g. DLA: data register, long register, accu register).
[Figure 6: Input Spreadsheet. Columns: instruction coding, instruction group name, valid
operand register types. Group names shown: LD_Family, SR_Family (sub-instructions LSLR,
LSRR, LSR32R, ASRR), BSR_Family, BRCC_Family. Register type entries shown: DLA, DAR.]
The coding of the sub-instructions is described as a subset of the instruction group (e.g. the
SR_Family row in Figure 6). The shift instruction group is divided into sub-instructions which
share the same instruction coding, with the exception of the coding for the ii symbols.
It is possible that two instruction groups share the same instruction coding, e.g. BSR_group
and BRCC_group. The instruction group which only uses a subset has to be located in any
row above.
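The row-order rule can be illustrated with a first-match lookup: when two codings overlap, the row that appears first wins, so the narrower group has to come first. This C++ sketch is an assumption about how such a lookup could work, not the generator's code; `Row`, `matches`, `firstMatch` and the 4-bit codings in the usage note are invented for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct Row {
    std::string coding;  // '0'/'1' are fixed bits, any other character is a symbol
    std::string name;    // instruction group name
};

// True if every fixed bit of `coding` equals the corresponding bit of `word`.
inline bool matches(const std::string& coding, std::uint32_t word) {
    const std::size_t n = coding.size();
    for (std::size_t i = 0; i < n; ++i) {
        const char c = coding[i];
        if (c != '0' && c != '1') continue;  // symbol position: any value fits
        const char bit = ((word >> (n - 1 - i)) & 1u) ? '1' : '0';
        if (bit != c) return false;
    }
    return true;
}

// Return the name of the first row (top to bottom) that matches, or "".
inline std::string firstMatch(const std::vector<Row>& rows, std::uint32_t word) {
    for (const Row& r : rows) {
        if (matches(r.coding, word)) return r.name;
    }
    return "";
}
```

With the rows `01oo` (named NARROW) above `0ooo` (named WIDE), the word 0110 resolves to NARROW and 0010 to WIDE. If the rows were swapped, NARROW could never be selected.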
Container
The container is a C++ class which is used to map the instruction groups to VHDL
statements [5].
• Fixed statement
ADDI_DecodeStatements->addDecodeStatement FixedStmt cmp_instruction := addi
Fixed statements are used to assign opcode independent information. In the example above
the instruction name is assigned to an internal VHDL variable.
• Port Statement
ADDI_DecodeStatements->addDecodeStatement PortStmt cmp_ex1_write1 IF->getSetFunc
IF ad_coding SF_->getRegCount
Port statements are used to assign the operands of an instruction group to the specific
hardware ports. The operands coding will be taken out of the spreadsheet.
• Variable Vector
ADDI_DecodeStatements->addDecodeStatement VariableVector cmp_ex1_cntrl1.const
signExtend16 o IF->getOpcode
Variable Vectors will be built up to assign constants, offsets and immediate values, which
may be spread over the instruction word. The bit position of the vector inside the instruction
word will be defined by the spreadsheet.
• Variable Statement
ADD_LONG_Family->addDecodeStatement VariableStmt s cmp_ex_addsimd := IF->getOpcode
Variable statements are used to directly assign symbols of the instruction word to the related
VHDL construct.
• IF Statement
IfStmt ADD_LONG_IfStmt = new IfStmt x IF->getOpcode
If statements are used to conditionally assign symbols of the instruction word to the related
VHDL construct.
• Case Statement
CaseStmt MOVR_CaseStmt = new CaseStmt instruction downto
Case statements are used to conditionally assign more than one symbol of the instruction
word to the related VHDL construct.
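The idea of a container that collects decode statements per instruction group can be sketched as below. This is a hypothetical stand-in written for illustration only: `DecodeContainer`, `addFixed` and `addSlice` are invented names and do not reproduce the interface of the paper's container class. Only the fixed statement and a plain bit-slice assignment are modelled.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Stores, per instruction group, the VHDL lines to emit for that group.
class DecodeContainer {
public:
    // Fixed statement: opcode independent assignment.
    void addFixed(const std::string& group, const std::string& target,
                  const std::string& value) {
        stmts_[group].push_back(target + " := " + value + ";");
    }
    // Slice statement: assign a bit range of the instruction word.
    void addSlice(const std::string& group, const std::string& target,
                  int hi, int lo) {
        stmts_[group].push_back(target + " := instruction(" +
                                std::to_string(hi) + " downto " +
                                std::to_string(lo) + ");");
    }
    bool has(const std::string& group) const {
        return stmts_.count(group) != 0;
    }
    // Throws std::out_of_range if the group has no entry.
    const std::vector<std::string>& statements(const std::string& group) const {
        return stmts_.at(group);
    }
private:
    std::map<std::string, std::vector<std::string>> stmts_;
};
```

Keeping the statements as an ordered list per group means the output stage can simply copy them into the matching branch of the generated case structure.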
Database generation
The database will be built up of the contents of the spreadsheet and of the container
information [6]. When reading in the spreadsheet, a continuous consistency check will be done
to prevent ambiguous coding of instruction groups. The sub-instructions will be mapped to
their instruction groups. The instruction groups themselves will be linked to the related
container contents. Again a consistency check will be done to ensure that all instruction groups
of the spreadsheet have their corresponding entry in the container structure.
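The paper does not spell out how the ambiguity check works, so the following C++ sketch shows only one plausible interpretation: two codings are treated as ambiguous when some instruction word matches both and neither coding is a strict special case of the other. The function names and the string representation of codings are assumptions.

```cpp
#include <cassert>
#include <string>

inline bool isFixed(char c) { return c == '0' || c == '1'; }

// Two codings of equal length overlap if no position holds conflicting fixed bits.
inline bool overlaps(const std::string& a, const std::string& b) {
    assert(a.size() == b.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (isFixed(a[i]) && isFixed(b[i]) && a[i] != b[i]) return false;
    }
    return true;
}

// `narrow` is a subset of `wide` if every fixed bit of `wide` is fixed to the
// same value in `narrow`.
inline bool isSubsetOf(const std::string& narrow, const std::string& wide) {
    assert(narrow.size() == wide.size());
    for (std::size_t i = 0; i < wide.size(); ++i) {
        if (isFixed(wide[i]) && narrow[i] != wide[i]) return false;
    }
    return true;
}

// Ambiguous here means: overlapping, and neither coding is a subset of the other.
inline bool ambiguous(const std::string& a, const std::string& b) {
    return overlaps(a, b) && !isSubsetOf(a, b) && !isSubsetOf(b, a);
}
```

Under this reading a shared coding such as `01oo` inside `0ooo` is accepted, because row order can resolve it, while `0o1o` against `01oo` is rejected.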
Decode tree
All instruction opcodes which have been built up in the database will be mapped to the tree
structure. Each node in the tree has three possible states (zero, one and don't care), the last of
which is used for the instruction bits of the symbols. Each branch of the tree represents an
instruction group.
[Figure 7: Decode Tree. Starting at the root of the tree, one level per instruction bit (Bit1 to
Bit13), with edges labelled 0, 1 and _ (don't care). The nodes marked a, b, c and d are
referred to in the text.]
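A tree with zero, one and don't-care edges can be modelled as a three-way trie. The C++ sketch below is a simplified assumption of such a structure: `Node`, `insert` and `countGroups` are hypothetical names, one tree level is used per coding character, and any character other than '0' or '1' is stored on the don't-care edge.

```cpp
#include <cassert>
#include <memory>
#include <string>

// One tree level per instruction bit; each node may have a zero, a one and a
// don't-care child.
struct Node {
    std::unique_ptr<Node> zero, one, dontCare;
    std::string group;  // set on the last node of a coding
};

// Insert a coding (MSB first). Symbols follow the don't-care edge.
inline void insert(Node& root, const std::string& coding,
                   const std::string& group) {
    Node* n = &root;
    for (char c : coding) {
        std::unique_ptr<Node>& next =
            (c == '0') ? n->zero : (c == '1') ? n->one : n->dontCare;
        if (!next) next.reset(new Node());
        n = next.get();
    }
    n->group = group;
}

// Count the codings stored in the tree.
inline int countGroups(const Node& n) {
    int total = n.group.empty() ? 0 : 1;
    if (n.zero) total += countGroups(*n.zero);
    if (n.one) total += countGroups(*n.one);
    if (n.dontCare) total += countGroups(*n.dontCare);
    return total;
}
```

Codings that share leading bits share the upper part of their path, so each complete path from the root corresponds to one instruction group.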
Output generation
The output generation is done by a recursive function for each instruction class separately.
Starting at the root of the tree, each node will be checked for the status. If there are all three
possible branches available (zero, one and don't care), the don't care path will be covered first.
This is necessary because it is possible to use unused combinations of symbols in other
instruction groups. If the zero or one path would be covered first, the values for the symbol
will not be available any more (a in Figure 7).
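The don't-care-first rule can be sketched as a recursive walk over a three-way tree. This is a simplified illustration under stated assumptions: it only records the order in which instruction groups are reached, the names `TreeNode`, `add` and `visit` are hypothetical, and visiting the one branch before the zero branch is an assumption rather than something this paragraph states. The node type is repeated here so that the block stands alone.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

struct TreeNode {
    std::unique_ptr<TreeNode> zero, one, dontCare;
    std::string group;  // non-empty on the final node of a coding
};

// Store a coding (MSB first); symbols follow the don't-care edge.
inline void add(TreeNode& root, const std::string& coding,
                const std::string& group) {
    TreeNode* n = &root;
    for (char c : coding) {
        std::unique_ptr<TreeNode>& next =
            (c == '0') ? n->zero : (c == '1') ? n->one : n->dontCare;
        if (!next) next.reset(new TreeNode());
        n = next.get();
    }
    n->group = group;
}

// Recursive walk: don't-care child first, then one, then zero.
inline void visit(const TreeNode& n, std::vector<std::string>& order) {
    if (!n.group.empty()) order.push_back(n.group);
    if (n.dontCare) visit(*n.dontCare, order);
    if (n.one) visit(*n.one, order);
    if (n.zero) visit(*n.zero, order);
}
```

A real generator would emit VHDL while walking; recording the visit order is enough to show which path is handled first at a node with all three branches.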
case instruction(11) is
  when '1' =>  -- ADDI_Family
    cmp_instruction := addi;
    cmp_ex1_add1.en := '1';
    cmp_ex1_add1.add_const := '1';
    cmp_ex1_write1 := setDxLxRx(instruction(4 downto 0));
    cmp_ex1_read1 := setDxLxRx(instruction(4 downto 0));
    cmp_ex1_cntrl1.const := signExtend16(instruction(10 downto 5));
  when others =>  -- MOVR_Family
    cmp_instruction := movr;
    case instruction(10 downto 8) is
      when "000" =>
        cmp_ex1_write1 := setData(instruction(3 downto 0));
        cmp_ex1_read1 := setData(instruction(7 downto 4));
      …
Figure 8: Generated Code Example
If there are only two branches available (zero and one), the function tries to reach the end of
the one branch (b in Figure 7). If the end can be reached without any further branch connections
and no further symbols are found, the coding can be covered by a single case statement. If
symbols (don't care) are placed in the branch, the case statement has to be split. The same
applies for the zero branch (c in Figure 7), which can be handled as the second part of the case
statement.
If the end of the branch cannot be reached because of further branch connections (d in Figure
7), the function searches for a continuous bit group. The bit group will then be translated into a
case statement.
The generated case structure, which covers all instruction groups, will be filled with the
information generated in the database before. Figure 8 illustrates a short code example.
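Translating a continuous bit group into a VHDL case statement can be sketched as plain string generation. The C++ function below is an assumption made for illustration: `emitCase` is a hypothetical name, each branch only carries a `null;` placeholder with the group name as a comment, and a real generator would fill the branches with the statements held in its database.

```cpp
#include <cassert>
#include <map>
#include <string>

// Emit a VHDL case statement over instruction(hi downto lo). `arms` maps each
// value of the bit group (as a bit string) to the instruction group it selects.
inline std::string emitCase(int hi, int lo,
                            const std::map<std::string, std::string>& arms) {
    std::string out = "case instruction(" + std::to_string(hi) + " downto " +
                      std::to_string(lo) + ") is\n";
    for (const auto& arm : arms) {
        out += "  when \"" + arm.first + "\" => null; -- " + arm.second + "\n";
    }
    out += "  when others => null;\n";
    out += "end case;\n";
    return out;
}
```

Because `std::map` iterates in key order, the branches appear sorted by bit pattern, which keeps the generated text stable between runs.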
Conclusion
The decoder generator will be used to automatically generate the VHDL description of an
instruction decoder for a DSP kernel directly from the instruction set description. The
generated VHDL code is correct by construction. This provides the possibility of application
specific instruction sets for higher code density, lower power dissipation and increased
performance without additional VHDL coding effort and the related verification and test
effort.
The Decoder generator has been developed in C++ and will be used in a development project
for a configurable DSP kernel.
References
[1] J. L. Hennessy and D. A. Patterson, Computer Architecture. A Quantitative Approach,
Morgan Kaufmann Publishers, San Mateo, CA.
[2] P. Lapsley, J. Bier, A. Shoham and E. A. Lee, DSP Processor Fundamentals, Architectures
and Features, IEEE Press, New York.
[3] André Willms, C++ Programmierung: Programmiersprache, Programmiertechnik,
Datenorganisation, Addison-Wesley.
[4] Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides, Design Patterns - Elements
of Reusable Object-Oriented Software, Addison-Wesley.
[5] Peter J. Ashenden, The Designer's Guide to VHDL, Morgan Kaufmann Publishers, San
Francisco.
[6] U. Breymann, Die C++ Standard Template Library, Addison-Wesley.