Chapter 1
INTRODUCTION TO VLSI SYSTEMS
Historical Perspective
VLSI Design Flow
Design Hierarchy
Concepts of Regularity, Modularity and Locality
VLSI Design Styles
1.1 Historical Perspective
The electronics industry has achieved phenomenal growth over the last two decades, mainly due to rapid
advances in integration technologies and large-scale system design - in short, due to the advent of VLSI. The number of
applications of integrated circuits in high-performance computing, telecommunications, and consumer electronics
has been rising steadily, and at a very fast pace. Typically, the required computational power (or, in other words, the
intelligence) of these applications is the driving force for the fast development of this field. Figure 1.1 gives an
overview of the prominent trends in information technologies over the next few decades. The current leading-edge
technologies (such as low bit-rate video and cellular communications) already provide the end-users a certain amount
of processing power and portability. This trend is expected to continue, with very important implications for VLSI
and systems design. One of the most important characteristics of information services is their increasing need for
very high processing power and bandwidth (in order to handle real-time video, for example). The other important
characteristic is that the information services tend to become more and more personalized (as opposed to collective
services such as broadcasting), which means that the devices must be more intelligent to answer individual demands,
and at the same time they must be portable to allow more flexibility/mobility.
Figure-1.1: Prominent trends in information service technologies.
As more and more complex functions are required in various data processing and telecommunications devices, the
need to integrate these functions in a small system/package is also increasing. The level of integration as measured
by the number of logic gates in a monolithic chip has been steadily rising for almost three decades, mainly due to the
rapid progress in processing technology and interconnect technology. Table 1.1 shows the evolution of logic
complexity in integrated circuits over the last three decades, and marks the milestones of each era. Here, the numbers
for circuit complexity should be interpreted only as representative examples to show the order-of-magnitude. A logic
block can contain anywhere from 10 to 100 transistors, depending on the function. State-of-the-art examples of ULSI
chips, such as the DEC Alpha or the INTEL Pentium, contain 3 to 6 million transistors.
ERA                                    DATE    COMPLEXITY (number of logic blocks per chip)
Single transistor                      1959    less than 1
Unit logic (one gate)                  1960    1
Multi-function                         1962    2 - 4
Complex function                       1964    5 - 20
Medium Scale Integration (MSI)         1967    20 - 200
Large Scale Integration (LSI)          1972    200 - 2000
Very Large Scale Integration (VLSI)    1978    2000 - 20000
Ultra Large Scale Integration (ULSI)   1989    20000 - ?
Table-1.1: Evolution of logic complexity in integrated circuits.
The most important message here is that the logic complexity per chip has been (and still is) increasing
exponentially. The monolithic integration of a large number of functions on a single chip usually provides:
Less area/volume and therefore, compactness
Less power consumption
Less testing requirements at system level
Higher reliability, mainly due to improved on-chip interconnects
Higher speed, due to significantly reduced interconnection length
Significant cost savings
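The exponential growth of logic complexity noted above can be made concrete with a short calculation. The Python sketch below is a minimal model assuming a fixed doubling period of about two years (a commonly cited Moore's-law figure, not a number taken from this chapter); the 1994 starting point of roughly 3 million transistors is the Pentium-era figure quoted in the text.

```python
# A minimal sketch of the exponential-growth trend described above,
# assuming a fixed doubling period (the roughly two years commonly
# cited for Moore's law; the exact period is an assumption, not a
# figure from this chapter).

def projected_transistor_count(base_count, base_year, year, doubling_years=2.0):
    """Project transistor count per chip from a reference point."""
    return base_count * 2 ** ((year - base_year) / doubling_years)

# Example: starting from ~3 million transistors in 1994 (the Pentium-era
# figure quoted in the text), project forward a few years.
for year in (1994, 1996, 1998, 2000):
    print(year, round(projected_transistor_count(3e6, 1994, year)))
```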
Figure-1.2: Evolution of integration density and minimum feature size, as seen in the early 1980s.
Therefore, the current trend of integration will also continue in the foreseeable future. Advances in device
manufacturing technology, and especially the steady reduction of minimum feature size (minimum length of a
transistor or an interconnect realizable on chip) support this trend. Figure 1.2 shows the history and forecast of chip
complexity - and minimum feature size - over time, as seen in the early 1980s. At that time, a minimum feature size
of 0.3 microns was expected around the year 2000. The actual development of the technology, however, has far
exceeded these expectations. A minimum size of 0.25 microns was readily achievable by the year 1995. As a direct
result of this, the integration density has also exceeded previous expectations - the first 64 Mbit DRAM, and the
INTEL Pentium microprocessor chip containing more than 3 million transistors were already available by 1994,
pushing the envelope of integration density.
When comparing the integration density of integrated circuits, a clear distinction must be made between the memory
chips and logic chips. Figure 1.3 shows the level of integration over time for memory and logic chips, starting in
1970. It can be observed that in terms of transistor count, logic chips contain significantly fewer transistors in any
given year, mainly due to the large consumption of chip area by complex interconnects. Memory circuits are highly
regular and thus more cells can be integrated with much less area for interconnects.
Figure-1.3: Level of integration over time, for memory chips and logic chips.
Generally speaking, logic chips such as microprocessor chips and digital signal processing (DSP) chips contain not
only large arrays of memory (SRAM) cells, but also many different functional units. As a result, their design
complexity is considered much higher than that of memory chips, although advanced memory chips contain some
sophisticated logic functions. The design complexity of logic chips increases almost exponentially with the number
of transistors to be integrated. This translates into an increase in the design cycle time, which is the time period
from the start of chip development until mask-tape delivery. However, in order to make the best use of
the current technology, the chip development time has to be short enough to allow the maturing of chip
manufacturing and timely delivery to customers. As a result, the level of actual logic integration tends to fall short of
the integration level achievable with the current processing technology. Sophisticated computer-aided design (CAD)
tools and methodologies are developed and applied in order to manage the rapidly increasing design complexity.
1.2 VLSI Design Flow
The design process, at various levels, is usually evolutionary in nature. It starts with a given set of requirements.
Initial design is developed and tested against the requirements. When requirements are not met, the design has to be
improved. If such improvement is either not possible or too costly, then the revision of requirements and its impact
analysis must be considered. The Y-chart (first introduced by D. Gajski) shown in Fig. 1.4 illustrates a design flow
for most logic chips, using design activities on three different axes (domains) which resemble the letter Y.
Figure-1.4: Typical VLSI design flow in three domains (Y-chart representation).
The Y-chart consists of three major domains, namely:
behavioral domain,
structural domain,
geometrical layout domain.
The design flow starts from the algorithm that describes the behavior of the target chip. The corresponding
architecture of the processor is first defined. It is mapped onto the chip surface by floorplanning. The next design
evolution in the behavioral domain defines finite state machines (FSMs) which are structurally implemented with
functional modules such as registers and arithmetic logic units (ALUs). These modules are then geometrically placed
onto the chip surface using CAD tools for automatic module placement followed by routing, with a goal of
minimizing the interconnects area and signal delays. The third evolution starts with a behavioral module description.
Individual modules are then implemented with leaf cells. At this stage the chip is described in terms of logic gates
(leaf cells), which can be placed and interconnected by using a cell placement & routing program. The last evolution
involves a detailed Boolean description of leaf cells followed by a transistor level implementation of leaf cells and
mask generation. In standard-cell based design, leaf cells are already pre-designed and stored in a library for logic
design use.
Figure-1.5: A more simplified view of VLSI design flow.
Figure 1.5 provides a more simplified view of the VLSI design flow, taking into account the various representations,
or abstractions of design - behavioral, logic, circuit and mask layout. Note that the verification of design plays a very
important role in every step during this process. The failure to properly verify a design in its early phases typically
causes significant and expensive re-design at a later stage, which ultimately increases the time-to-market.
Although the design process has been described in linear fashion for simplicity, in reality there are many iterations
back and forth, especially between any two neighboring steps, and occasionally even between remotely separated pairs.
Although top-down design flow provides an excellent design process control, in reality, there is no truly
unidirectional top-down design flow. Both top-down and bottom-up approaches have to be combined. For instance,
if a chip designer defined an architecture without close estimation of the corresponding chip area, then it is very
likely that the resulting chip layout exceeds the area limit of the available technology. In such a case, in order to fit
the architecture into the allowable chip area, some functions may have to be removed and the design process must be
repeated. Such changes may require significant modification of the original requirements. Thus, it is very important
to feed forward low-level information to higher levels (bottom up) as early as possible.
In the following, we will examine design methodologies and structured approaches which have been developed over
the years to deal with both complex hardware and software projects. Regardless of the actual size of the project, the
basic principles of structured design will improve the prospects of success. Some of the classical techniques for
reducing the complexity of IC design are: Hierarchy, regularity, modularity and locality.
1.3 Design Hierarchy
The use of hierarchy, or "divide and conquer" technique, involves dividing a module into sub-modules and then
repeating this operation on the sub-modules until the complexity of the smaller parts becomes manageable. This
approach is very similar to the software case where large programs are split into smaller and smaller sections until
simple subroutines, with well-defined functions and interfaces, can be written. In Section 1.2, we have seen that the
design of a VLSI chip can be represented in three domains. Correspondingly, a hierarchy structure can be described
in each domain separately. However, it is important for the simplicity of design that the hierarchies in different
domains can be mapped into each other easily.
As an example of structural hierarchy, Fig. 1.6 shows the structural decomposition of a CMOS four-bit adder into its
components. The adder can be decomposed progressively into one-bit adders, separate carry and sum circuits, and
finally, into individual logic gates. At this lower level of the hierarchy, the design of a simple circuit realizing a
well-defined Boolean function is much easier to handle than at the higher levels of the hierarchy.
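To make the hierarchy concrete, the following Python sketch mirrors the decomposition of Figure 1.6 in behavioral form: a four-bit adder is composed of one-bit full adders, each of which is composed of elementary gate operations. The function names and bit ordering are illustrative choices, not taken from the text.

```python
# A minimal sketch of the structural hierarchy in Figure 1.6: a
# four-bit adder decomposed into one-bit full adders, which are in
# turn decomposed into elementary logic operations.

def full_adder(a, b, cin):
    """One-bit full adder: sum and carry from basic gate operations."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def four_bit_adder(a_bits, b_bits, cin=0):
    """Ripple-carry connection of four full adders (LSB first)."""
    sum_bits = []
    carry = cin
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        sum_bits.append(s)
    return sum_bits, carry

# Example: 5 + 3 = 8 -> sum bits [0, 0, 0, 1] (LSB first), carry 0
print(four_bit_adder([1, 0, 1, 0], [1, 1, 0, 0]))
```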
In the physical domain, partitioning a complex system into its various functional blocks provides valuable
guidance for the actual realization of these blocks on chip. Obviously, the approximate shape and size (area) of each
sub-module should be estimated in order to provide a useful floorplan. Figure 1.7 shows the hierarchical
decomposition of a four-bit adder in physical description (geometrical layout) domain, resulting in a simple
floorplan. This physical view describes the external geometry of the adder, the locations of input and output pins,
and how pin locations allow some signals (in this case the carry signals) to be transferred from one sub-block to the
other without external routing.
Figure-1.6: Structural decomposition of a four-bit adder circuit, showing the hierarchy down to gate level.
Figure-1.7: Hierarchical decomposition of a four-bit adder in physical (geometrical) description domain.
Figure-1.8: Layout of a 16-bit adder, and the components (sub-blocks) of its physical hierarchy.
Figure-1.9: The structural hierarchy of a triangle generator chip.
Figure-1.10: Physical layout of the triangle generator chip.
At lower levels of the physical hierarchy, the internal mask layout of each adder cell defines the locations and the
connections of each transistor and wire. Figure 1.8 shows the
full-custom layout of a 16-bit dynamic CMOS adder, and the sub-modules that describe the lower levels of its
physical hierarchy. Here, the 16-bit adder consists of a cascade connection of four 4-bit adders, and each 4-bit adder
can again be decomposed into its functional blocks such as the Manchester chain, carry/propagate circuits and the
output buffers. Finally, Fig. 1.9 and Fig. 1.10 show the structural hierarchy and the physical layout of a simple
triangle generator chip, respectively. Note that there is a corresponding physical description for every module in the
structural hierarchy, i.e., the components of the physical view closely match this structural view.
1.4 Concepts of Regularity, Modularity and Locality
The hierarchical design approach reduces the design complexity by dividing the large system into several
sub-modules. Usually, other design concepts and design approaches are also needed to simplify the process.
Regularity means that the hierarchical decomposition of a large system should result in not only simple, but also
similar blocks, as much as possible. A good example of regularity is the design of array structures consisting of
identical cells - such as a parallel multiplication array. Regularity can exist at all levels of abstraction: At the
transistor level, uniformly sized transistors simplify the design. At the logic level, identical gate structures can be
used, etc. Figure 1.11 shows regular circuit-level designs of a 2-1 MUX (multiplexer), a D-type edge-triggered
flip-flop, and a one-bit full adder. Note that all of these circuits were designed by using inverters and tri-state buffers
only. If the designer has a small library of well-defined and well-characterized basic building blocks, a number of
different functions can be constructed by using this principle. Regularity usually reduces the number of different
modules that need to be designed and verified, at all levels of abstraction.
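The following behavioral sketch illustrates the regularity principle of Figure 1.11: a 2-1 MUX realized with only two primitives, an inverter and a tri-state buffer. Modeling logic values in Python with None standing for a high-impedance output is an illustrative convention, not a construct from the text.

```python
# A behavioral sketch of the regularity principle in Figure 1.11:
# a 2-1 MUX built only from inverters and tri-state buffers. None
# stands for a high-impedance (Hi-Z) output; names are illustrative.

def inverter(x):
    return 1 - x

def tristate_buffer(x, enable):
    """Drive x when enabled, otherwise present high impedance (None)."""
    return x if enable else None

def mux2to1(a, b, select):
    """Two tri-state buffers with complementary enables share one node."""
    out_a = tristate_buffer(a, inverter(select))  # drives when select = 0
    out_b = tristate_buffer(b, select)            # drives when select = 1
    # Exactly one buffer drives the shared output node at any time.
    return out_a if out_a is not None else out_b

print(mux2to1(0, 1, select=0))  # -> 0 (input a selected)
print(mux2to1(0, 1, select=1))  # -> 1 (input b selected)
```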
Figure-1.11: Regular design of a 2-1 MUX, a DFF and an adder, using inverters and tri-state buffers.
Modularity in design means that the various functional blocks which make up the larger system must have
well-defined functions and interfaces. Modularity allows each block or module to be designed relatively
independently, since there is no ambiguity about the function and the signal interface of these
blocks. All of the blocks can be combined with ease at the end of the design process, to form the large system. The
concept of modularity enables the parallelisation of the design process. It also allows the use of generic modules in
various designs - the well-defined functionality and signal interface allow plug-and-play design.
By defining well-characterized interfaces for each module in the system, we effectively ensure that the internals of
each module become unimportant to the exterior modules. Internal details remain at the local level. The concept of
locality also ensures that connections are mostly between neighboring modules, avoiding long-distance connections
as much as possible. This last point is extremely important for avoiding excessive interconnect delays. Time-critical
operations should be performed locally, without the need to access distant modules or signals. If necessary, the
replication of some logic may solve this problem in large system architectures.
1.5 VLSI Design Styles
Several design styles can be considered for chip implementation of specified algorithms or logic functions. Each
design style has its own merits and shortcomings, and thus a proper choice has to be made by designers in order to
provide the functionality at low cost.
1.5.1 Field Programmable Gate Array (FPGA)
Fully fabricated FPGA chips containing thousands of logic gates or even more, with programmable interconnects,
are available to users for their custom hardware programming to realize desired functionality. This design style
provides a means for fast prototyping and also for cost-effective chip design, especially for low-volume applications.
A typical field programmable gate array (FPGA) chip consists of I/O buffers, an array of configurable logic blocks
(CLBs), and programmable interconnect structures. The programming of the interconnects is implemented by
programming of RAM cells whose output terminals are connected to the gates of MOS pass transistors. A general
architecture of FPGA from XILINX is shown in Fig. 1.12. A more detailed view showing the locations of switch
matrices used for interconnect routing is given in Fig. 1.13.
A simple CLB (model XC2000 from XILINX) is shown in Fig. 1.14. It consists of four signal input terminals (A, B,
C, D), a clock signal terminal, user-programmable multiplexers, an SR-latch, and a look-up table (LUT). The LUT is
a digital memory that stores the truth table of the Boolean function. Thus, it can generate any function of up to four
variables or any two functions of three variables. The control terminals of multiplexers are not shown explicitly in
Fig. 1.14.
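A minimal Python model of such a LUT is sketched below: the four input signals form an address into a 16-entry memory holding the truth table, so any Boolean function of up to four variables can be realized simply by reprogramming the stored bits. The class name and interface are hypothetical.

```python
# A minimal model of the look-up table (LUT) described above: a small
# memory addressed by the input signals, so any Boolean function of up
# to four variables is realized by storing its truth table. This is an
# illustrative sketch, not the XC2000 netlist.

class LUT4:
    def __init__(self, truth_table):
        """truth_table: 16 output bits, indexed by inputs (A, B, C, D)."""
        assert len(truth_table) == 16
        self.table = truth_table

    def __call__(self, a, b, c, d):
        address = (a << 3) | (b << 2) | (c << 1) | d
        return self.table[address]

# Example: program the LUT as a 4-input AND gate.
and4 = LUT4([0] * 15 + [1])
print(and4(1, 1, 1, 1))  # -> 1
print(and4(1, 0, 1, 1))  # -> 0
```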
The CLB is configured such that many different logic functions can be realized by programming its array. More
sophisticated CLBs have also been introduced to map complex functions. The typical design flow of an FPGA chip
starts with the behavioral description of its functionality, using a hardware description language such as VHDL. The
synthesized architecture is then technology-mapped (or partitioned) into circuits or logic cells. At this stage, the chip
design is completely described in terms of available logic cells. Next, the placement and routing step assigns
individual logic cells to FPGA sites (CLBs) and determines the routing patterns among the cells in accordance with
the netlist.
Figure-1.12: General architecture of Xilinx FPGAs.
Figure-1.13: Detailed view of switch matrices and interconnection routing between CLBs.
Figure-1.14: XC2000 CLB of the Xilinx FPGA.
After routing is completed, the on-chip performance of the design can be simulated and verified before
downloading the design for programming of the
FPGA chip. The programming of the chip remains valid as long as the chip is powered-on, or until new
programming is done. In most cases, full utilization of the FPGA chip area is not possible - many cell sites may
remain unused.
The largest advantage of FPGA-based design is the very short turn-around time, i.e., the time required from the start
of the design process until a functional chip is available. Since no physical manufacturing step is necessary for
customizing the FPGA chip, a functional sample can be obtained almost as soon as the design is mapped into a
specific technology. The price of FPGA chips is usually higher than that of other realization alternatives (such as
gate array or standard cells) for the same design, but for small-volume production of ASIC chips and for fast
prototyping, FPGA offers a very valuable option.
1.5.2 Gate Array Design
In terms of fast prototyping capability, the gate array (GA) ranks second after the FPGA. While the design
implementation of the FPGA chip is done with user programming, that of the gate array is done with metal mask
design and processing. Gate array implementation requires a two-step manufacturing process: The first phase, which
is based on generic (standard) masks, results in an array of uncommitted transistors on each GA chip. These
uncommitted chips can be stored for later customization, which is completed by defining the metal interconnects
between the transistors of the array (Fig. 1.15). Since the patterning of metallic interconnects is done at the end of the
chip fabrication, the turn-around time can be still short, a few days to a few weeks. Figure 1.16 shows a corner of a
gate array chip which contains bonding pads on its left and bottom edges, diodes for I/O protection, nMOS
transistors and pMOS transistors for chip output driver circuits in the neighboring areas of bonding pads, arrays of
nMOS transistors and pMOS transistors, underpass wire segments, and power and ground buses along with contact
windows.
Figure-1.15: Basic processing steps required for gate array implementation.
Figure-1.16: A corner of a typical gate array chip.
Figure 1.17 shows a magnified portion of the internal array with metal mask design (metal lines highlighted in dark)
to realize a complex logic function. Typical gate array platforms allow dedicated areas, called channels, for intercell
routing as shown in Figs. 1.16 and 1.17 between rows or columns of MOS transistors. The availability of these
routing channels simplifies the interconnections, even using one metal layer only. The interconnection patterns to
realize basic logic gates can be stored in a library, which can then be used to customize rows of uncommitted
transistors according to the netlist. While most gate array platforms only contain rows of uncommitted transistors
separated by routing channels, some other platforms also offer dedicated memory (RAM) arrays to allow a higher
density where memory functions are required. Figure 1.18 shows the layout views of a conventional gate array and a
gate array platform with two dedicated memory banks.
With the use of multiple interconnect layers, the routing can be achieved over the active cell areas; thus, the routing
channels can be removed as in Sea-of-Gates (SOG) chips. Here, the entire chip surface is covered with uncommitted
nMOS and pMOS transistors. As in the gate array case, neighboring transistors can be customized using a metal
mask to form basic logic gates. For intercell routing, however, some of the uncommitted transistors must be
sacrificed. This approach results in more flexibility for interconnections, and usually in a higher density. The basic
platform of a SOG chip is shown in Fig. 1.19. Figure 1.20 offers a brief comparison between the channeled (GA) vs.
the channelless (SOG) approaches.
Figure-1.17: Metal mask design to realize a complex logic function on a channeled GA platform.
Figure-1.18: Layout views of a conventional GA chip and a gate array with two memory banks.
Figure-1.19: The platform of a Sea-of-Gates (SOG) chip.
In general, the GA chip utilization factor, as measured by the used chip area divided by the total chip area, is higher
than that of the FPGA and so is the chip speed, since more customized design can be achieved with metal mask
designs. The current gate array chips can implement as many as hundreds of thousands of logic gates.
Figure-1.20: Comparison between the channeled (GA) vs. the channelless (SOG) approaches.
1.5.3 Standard-Cells Based Design
The standard-cells based design is one of the most prevalent full custom design styles, which requires the development
of a full custom mask set. The standard cell is also called the polycell. In this design style, all of the commonly used
logic cells are developed, characterized, and stored in a standard cell library. A typical library may contain a few
hundred cells including inverters, NAND gates, NOR gates, complex AOI, OAI gates, D-latches, and flip-flops.
Each gate type can have multiple implementations to provide adequate driving capability for different fanouts. For
instance, the inverter gate can have standard size transistors, double size transistors, and quadruple size transistors so
that the chip designer can choose the proper size to achieve high circuit speed and layout density. The
characterization of each cell is done for several different categories. It consists of
delay time vs. load capacitance
circuit simulation model
timing simulation model
fault simulation model
cell data for place-and-route
mask data
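As a simple illustration of the first item, delay versus load capacitance is often captured by a linear model consisting of an intrinsic delay plus a slope term proportional to the load. The sketch below uses made-up coefficient values (not data from any real cell library) to show why a library offers the same gate at several drive strengths.

```python
# A sketch of the first characterization item above, delay time versus
# load capacitance, using a linear model: intrinsic delay plus a drive
# term per unit load. All coefficients are made-up placeholder values,
# not data from any real cell library.

def cell_delay(intrinsic_delay_ns, drive_resistance_kohm, load_pf):
    """Linear delay model: d = d0 + R_drive * C_load."""
    return intrinsic_delay_ns + drive_resistance_kohm * load_pf

# A double-size inverter has roughly half the drive resistance of the
# standard one, so it tolerates a larger fanout for the same delay.
for name, r in (("INV_1x", 2.0), ("INV_2x", 1.0), ("INV_4x", 0.5)):
    print(name, cell_delay(0.1, r, load_pf=0.2), "ns")
```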
To enable automated placement of the cells and routing of inter-cell connections, each cell layout is designed with a
fixed height, so that a number of cells can be abutted side-by-side to form rows. The power and ground rails
typically run parallel to the upper and lower boundaries of the cell, thus, neighboring cells share a common power
and ground bus. The input and output pins are located on the upper and lower boundaries of the cell. Figure 1.21
shows the layout of a typical standard cell. Notice that the nMOS transistors are located closer to the ground rail
while the pMOS transistors are placed closer to the power rail.
Figure-1.21: A standard cell layout example.
Figure 1.22 shows a floorplan for standard-cell based design. Inside the I/O frame which is reserved for I/O cells, the
chip area contains rows or columns of standard cells. Between cell rows are channels for dedicated inter-cell routing.
As in the case of Sea-of-Gates, with over-the-cell routing, the channel areas can be reduced or even removed
provided that the cell rows offer sufficient routing space. The physical design and layout of logic cells ensure that
when cells are placed into rows, their heights are matched and neighboring cells can be abutted side-by-side, which
provides natural connections for power and ground lines in each row. The signal delay, noise margins, and power
consumption of each cell should be also optimized with proper sizing of transistors using circuit simulation.
Figure-1.22: A simplified floorplan of standard-cells-based design.
If a number of cells must share the same input and/or output signals, a common signal bus structure can also be
incorporated into the standard-cell-based chip layout. Figure 1.23 shows the simplified symbolic view of a case
where a signal bus has been inserted between the rows of standard cells. Note that in this case the chip consists of
two blocks, and power/ground routing must be provided from both sides of the layout area. Standard-cell based
designs may consist of several such macro-blocks, each corresponding to a specific unit of the system architecture
such as ALU, control logic, etc.
Figure-1.23: Simplified floorplan consisting of two separate blocks and a common signal bus.
After chip logic design is done using standard cells in the library, the most challenging task is to place individual
cells into rows and interconnect them in a way that meets stringent design goals in circuit speed, chip area, and
power consumption. Many advanced CAD tools for place-and-route have been developed and used to achieve such
goals. Also from the chip layout, circuit models which include interconnect parasitics can be extracted and used for
timing simulation and analysis to identify timing critical paths. For timing critical paths, proper gate sizing is often
practiced to meet the timing requirements. In many VLSI chips, such as microprocessors and digital signal
processing chips, standard-cells based design is used to implement complex control logic modules. Some full custom
chips can also be implemented exclusively with standard cells.
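To illustrate how timing-critical paths are identified from an extracted netlist, the sketch below computes the latest signal arrival time at each gate output as a longest-path problem over the logic network. The netlist structure and delay values are hypothetical.

```python
# A minimal sketch of critical-path identification as described above:
# with gate and interconnect delays extracted from the layout, the
# timing-critical path is the longest path through the logic network.
# The netlist and delay numbers here are hypothetical.

from functools import lru_cache

# gate -> (delay, list of fan-in gates); "IN" marks primary inputs
netlist = {
    "g1": (1.0, ["IN"]),
    "g2": (2.0, ["IN"]),
    "g3": (1.5, ["g1", "g2"]),
    "g4": (1.0, ["g2"]),
    "out": (0.5, ["g3", "g4"]),
}

@lru_cache(maxsize=None)
def arrival_time(gate):
    """Latest arrival at a gate output (topological recursion)."""
    if gate == "IN":
        return 0.0
    delay, fanins = netlist[gate]
    return delay + max(arrival_time(f) for f in fanins)

print(arrival_time("out"))  # -> 4.0, via IN -> g2 -> g3 -> out
```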
Finally, Fig. 1.24 shows the detailed mask layout of a standard-cell-based chip with an uninterrupted single block of
cell rows, and three memory banks placed on one side of the chip. Notice that within the cell block, the separations
between neighboring rows depend on the number of wires in the routing channel between the cell rows. If a high
interconnect density can be achieved in the routing channel, the standard cell rows can be placed closer to each other,
resulting in a smaller chip area. The availability of dedicated memory blocks also reduces the area, since the
realization of memory elements using standard cells would occupy a larger area.
Figure-1.24: Mask layout of a standard-cell-based chip with a single block of cells and three memory banks.
1.5.4 Full Custom Design
Although the standard-cells based design is often called full custom design, in a strict sense, it is somewhat less than
fully custom since the cells are pre-designed for general use and the same cells are utilized in many different chip
designs. In a truly full-custom design, the entire mask design is done anew without the use of any library. However, the
development cost of such a design style is becoming prohibitively high. Thus, the concept of design reuse is
becoming popular in order to reduce design cycle time and development cost. The most rigorous full custom design
can be the design of a memory cell, be it static or dynamic. Since the same cell layout is replicated many millions of
times, there is no practical alternative to full custom design for high-density memory chips. For logic chip design, a good compromise can be
achieved by using a combination of different design styles on the same chip, such as standard cells, data-path cells
and PLAs. In real full-custom layout in which the geometry, orientation and placement of every transistor is done
individually by the designer, design productivity is usually very low - typically 10 to 20 transistors per day, per
designer.
In digital CMOS VLSI, full-custom design is rarely used due to the high labor cost. Exceptions to this include the
design of high-volume products such as memory chips, high-performance microprocessors and FPGA masters.
Figure 1.25 shows the full layout of the Intel 486 microprocessor chip, which is a good example of a hybrid
full-custom design. Here, one can identify four different design styles on one chip: Memory banks (RAM cache),
data-path units consisting of bit-slice cells, control circuitry mainly consisting of standard cells, and PLA blocks.
Figure-1.25: Mask layout of the Intel 486 microprocessor chip, as an example of full-custom design.
Figure-1.26: Overview of VLSI design styles.
This chapter edited by Y. Leblebici
Chapter 2 CMOS FABRICATION TECHNOLOGY
AND DESIGN RULES
Notice: This chapter is largely based on Chapter 2 (Fabrication of MOSFETs) of the book CMOS Digital
Integrated Circuit Design - Analysis and Design by S.M. Kang and Y. Leblebici.
Introduction
Fabrication Process Flow - Basic Steps
The CMOS n-Well Process
Advanced CMOS Fabrication Technologies
Layout Design Rules
2.1 Introduction
In this chapter, the fundamentals of MOS chip fabrication will be discussed and the major steps of the process
flow will be examined. It is not the aim of this chapter to present a detailed discussion of silicon fabrication
technology, which deserves separate treatment in a dedicated course. Rather, the emphasis will be on the
general outline of the process flow and on the interaction of various processing steps, which ultimately
determine the device and the circuit performance characteristics. The following chapters show that there are
very strong links between the fabrication process, the circuit design process and the performance of the
resulting chip. Hence, circuit designers must have a working knowledge of chip fabrication to create effective
designs and in order to optimize the circuits with respect to various manufacturing parameters. Also, the circuit
designer must have a clear understanding of the roles of various masks used in the fabrication process, and how
the masks are used to define various features of the devices on-chip.
The following discussion will concentrate on the well-established CMOS fabrication technology, which requires
that both n-channel (nMOS) and p-channel (pMOS) transistors be built on the same chip substrate. To
accommodate both nMOS and pMOS devices, special regions must be created in which the semiconductor type
is opposite to the substrate type. These regions are called wells or tubs. A p-well is created in an n-type
substrate or, alternatively, an n-well is created in a p-type substrate. In the simple n-well CMOS fabrication
technology presented, the nMOS transistor is created in the p-type substrate, and the pMOS transistor is created
in the n-well, which is built into the p-type substrate. In the twin-tub CMOS technology, additional tubs of
the same type as the substrate can also be created for device optimization.
The simplified process sequence for the fabrication of CMOS integrated circuits on a p-type silicon substrate is
shown in Fig. 2.1. The process starts with the creation of the n-well regions for pMOS transistors, by impurity
implantation into the substrate. Then, a thick oxide is grown in the regions surrounding the nMOS and pMOS
active regions. The thin gate oxide is subsequently grown on the surface through thermal oxidation. These steps
are followed by the creation of n+ and p+ regions (source, drain and channel-stop implants) and by final
metallization (creation of metal interconnects).
Figure-2.1:
Simplified process sequence for fabrication of the n-well CMOS integrated circuit with a single polysilicon
layer, showing only major fabrication steps.
The process flow sequence pictured in Fig. 2.1 may at first seem to be too abstract, since detailed fabrication
steps are not shown. To obtain a better understanding of the issues involved in the semiconductor fabrication
process, we first have to consider some of the basic steps in more detail.
2.2 Fabrication Process Flow - Basic Steps
Note that each processing step requires that certain areas are defined on chip by appropriate masks.
Consequently, the integrated circuit may be viewed as a set of patterned layers of doped silicon, polysilicon,
metal and insulating silicon dioxide. In general, a layer must be patterned before the next layer of material is
applied on chip. The process used to transfer a pattern to a layer on the chip is called lithography. Since each
layer has its own distinct patterning requirements, the lithographic sequence must be repeated for every layer,
using a different mask.
To illustrate the fabrication steps involved in patterning silicon dioxide through optical lithography, let us first
examine the process flow shown in Fig. 2.2. The sequence starts with the thermal oxidation of the silicon
surface, by which an oxide layer of about 1 micrometer thickness, for example, is created on the substrate (Fig.
2.2(b)). The entire oxide surface is then covered with a layer of photoresist, which is essentially a
light-sensitive, acid-resistant organic polymer, initially insoluble in the developing solution (Fig. 2.2(c)). If the
photoresist material is exposed to ultraviolet (UV) light, the exposed areas become soluble so that they are
no longer resistant to etching solvents. To selectively expose the photoresist, we have to cover some of the areas
on the surface with a mask during exposure. Thus, when the structure with the mask on top is exposed to UV
light, areas which are covered by the opaque features on the mask are shielded. In the areas where the UV light
can pass through, on the other hand, the photoresist is exposed and becomes soluble (Fig. 2.2(d)).
Figure-2.2: Process steps required for patterning of silicon dioxide.
The type of photoresist which is initially insoluble and becomes soluble after exposure to UV light is called
positive photoresist. The process sequence shown in Fig. 2.2 uses positive photoresist. There is another type of
photoresist which is initially soluble and becomes insoluble (hardened) after exposure to UV light, called
negative photoresist. If negative photoresist is used in the photolithography process, the areas which are not
shielded from the UV light by the opaque mask features become insoluble, whereas the shielded areas can
subsequently be etched away by a developing solution. Negative photoresists are more sensitive to light, but
their photolithographic resolution is not as high as that of the positive photoresists. Therefore, negative
photoresists are used less commonly in the manufacturing of high-density integrated circuits.
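The positive/negative distinction can be summarized as a simple Boolean rule, sketched below: exposure occurs where the mask is clear, and the resist survives development where it was not exposed (positive resist) or where it was exposed (negative resist). The function and data layout are illustrative only.

```python
# A small sketch of the positive/negative photoresist distinction just
# described, modeling each mask position as opaque (True) or clear
# (False) and asking which regions remain after development.

def resist_remains(mask_opaque, positive):
    """Return True where photoresist survives the developing solution."""
    exposed = not mask_opaque   # UV reaches the resist only where the
                                # mask is clear
    if positive:
        return not exposed      # positive resist: exposed areas dissolve
    return exposed              # negative resist: exposed areas harden

mask = [True, True, False, False, True]  # a hypothetical 1-D mask pattern
print([resist_remains(m, positive=True) for m in mask])
print([resist_remains(m, positive=False) for m in mask])
```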
Following the UV exposure step, the unexposed portions of the photoresist can be removed by a solvent. Now,
the silicon dioxide regions which are not covered by hardened photoresist can be etched away either by using a
chemical solvent (HF acid) or by using a dry etch (plasma etch) process (Fig. 2.2(e)). Note that at the end of this
step, we obtain an oxide window that reaches down to the silicon surface (Fig. 2.2(f)). The remaining
photoresist can now be stripped from the silicon dioxide surface by using another solvent, leaving the patterned
silicon dioxide feature on the surface as shown in Fig. 2.2(g).
The sequence of process steps illustrated in detail in Fig. 2.2 actually accomplishes a single pattern transfer onto
the silicon dioxide surface, as shown in Fig. 2.3. The fabrication of semiconductor devices requires several such
pattern transfers to be performed on silicon dioxide, polysilicon, and metal. The basic patterning process used in
all fabrication steps, however, is quite similar to the one shown in Fig. 2.2. Also note that for accurate
generation of high-density patterns required in sub-micron devices, electron beam (E-beam) lithography is used
instead of optical lithography. In the following, the main processing steps involved in the fabrication of an
n-channel MOS transistor on p-type silicon substrate will be examined.
Figure-2.3:
The result of a single lithographic patterning sequence on silicon dioxide, without showing the intermediate
steps. Compare the unpatterned structure (top) and the patterned structure (bottom) with Fig. 2.2(b) and Fig.
2.2(g), respectively.
The process starts with the oxidation of the silicon substrate (Fig. 2.4(a)), in which a relatively thick silicon
dioxide layer, also called field oxide, is created on the surface (Fig. 2.4(b)). Then, the field oxide is selectively
etched to expose the silicon surface on which the MOS transistor will be created (Fig. 2.4(c)). Following this
step, the surface is covered with a thin, high-quality oxide layer, which will eventually form the gate oxide of
the MOS transistor (Fig. 2.4(d)). On top of the thin oxide, a layer of polysilicon (polycrystalline silicon) is
deposited (Fig. 2.4(e)). Polysilicon is used both as gate electrode material for MOS transistors and also as an
interconnect medium in silicon integrated circuits. Undoped polysilicon has relatively high resistivity. The
resistivity of polysilicon can be reduced, however, by doping it with impurity atoms.
After deposition, the polysilicon layer is patterned and etched to form the interconnects and the MOS transistor
gates (Fig. 2.4(f)). The thin gate oxide not covered by polysilicon is also etched away, which exposes the bare
silicon surface on which the source and drain junctions are to be formed (Fig. 2.4(g)). The entire silicon surface
is then doped with a high concentration of impurities, either through diffusion or ion implantation (in this case
with donor atoms to produce n-type doping). Figure 2.4(h) shows that the doping penetrates the exposed areas
on the silicon surface, ultimately creating two n-type regions (source and drain junctions) in the p-type
substrate. The impurity doping also penetrates the polysilicon on the surface, reducing its resistivity. Note that
the polysilicon gate, which is patterned before doping, actually defines the precise location of the channel region
and, hence, the location of the source and the drain regions. Since this procedure allows very precise positioning
of the two regions relative to the gate, it is also called the self-aligned process.
Figure-2.4: Process flow for the fabrication of an n-type MOSFET on p-type silicon.
Once the source and drain regions are completed, the entire surface is again covered with an insulating layer of
silicon dioxide (Fig. 2.4(i)). The insulating oxide layer is then patterned in order to provide contact windows for
the drain and source junctions (Fig. 2.4(j)). The surface is covered with evaporated aluminum which will form
the interconnects (Fig. 2.4(k)). Finally, the metal layer is patterned and etched, completing the interconnection
of the MOS transistors on the surface (Fig. 2.4(l)). Usually, a second (and third) layer of metallic interconnect
can also be added on top of this structure by creating another insulating oxide layer, cutting contact (via) holes,
depositing, and patterning the metal.
2.3 The CMOS n-Well Process
Having examined the basic process steps for pattern transfer through lithography, and having gone through the
fabrication procedure of a single n-type MOS transistor, we can now return to the generalized fabrication
sequence of n-well CMOS integrated circuits, as shown in Fig. 2.1. In the following figures, some of the
important process steps involved in the fabrication of a CMOS inverter will be shown by a top view of the
lithographic masks and a cross-sectional view of the relevant areas.
The n-well CMOS process starts with a moderately doped (with impurity concentration typically less than 10^15
cm^-3) p-type silicon substrate. Then, an initial oxide layer is grown on the entire surface. The first lithographic
mask defines the n-well region. Donor atoms, usually phosphorus, are implanted through this window in the
oxide. Once the n-well is created, the active areas of the nMOS and pMOS transistors can be defined. Figures
2.5 through 2.10 illustrate the significant milestones that occur during the fabrication process of a CMOS
inverter.
Figure-2.5:
Following the creation of the n-well region, a thick field oxide is grown in the areas surrounding the transistor
active regions, and a thin gate oxide is grown on top of the active regions. The thickness and the quality of the
gate oxide are two of the most critical fabrication parameters, since they strongly affect the operational
characteristics of the MOS transistor, as well as its long-term reliability.
Figure-2.6:
The polysilicon layer is deposited using chemical vapor deposition (CVD) and patterned by dry (plasma)
etching. The created polysilicon lines will function as the gate electrodes of the nMOS and the pMOS
transistors and their interconnects. Also, the polysilicon gates act as self-aligned masks for the source and drain
implantations that follow this step.
Figure-2.7:
Using a set of two masks, the n+ and p+ regions are implanted into the substrate and into the n-well,
respectively. Also, the ohmic contacts to the substrate and to the n-well are implanted in this process step.
Figure-2.8:
An insulating silicon dioxide layer is deposited over the entire wafer using CVD. Then, the contacts are defined
and etched away to expose the silicon or polysilicon contact windows. These contact windows are necessary to
complete the circuit interconnections using the metal layer, which is patterned in the next step.
Figure-2.9:
Metal (aluminum) is deposited over the entire chip surface using metal evaporation, and the metal lines are
patterned through etching. Since the wafer surface is non-planar, the quality and the integrity of the metal lines
created in this step are very critical and are ultimately essential for circuit reliability.
Figure-2.10:
The composite layout and the resulting cross-sectional view of the chip, showing one nMOS and one pMOS
transistor (built-in n-well), the polysilicon and metal interconnections. The final step is to deposit the
passivation layer (for protection) over the chip, except for wire-bonding pad areas.
The patterning process by the use of a succession of masks and process steps is conceptually summarized in Fig.
2.11. It is seen that a series of masking steps must be sequentially performed for the desired patterns to be
created on the wafer surface. An example of the end result of this sequence is shown as a cross-section on the
right.
Figure-2.11: Conceptual illustration of the mask sequence applied to create desired structures.
2.4 Advanced CMOS Fabrication Technologies
In this section, two examples will be given for advanced CMOS processes which offer additional benefits in
terms of device performance and integration density. These processes, namely, the twin-tub CMOS process and
the silicon-on-insulator (SOI) process, are becoming especially more popular for sub-micron geometries where
device performance and density must be pushed beyond the limits of the conventional n-well CMOS process.
Twin-Tub (Twin-Well) CMOS Process
This technology provides the basis for separate optimization of the nMOS and pMOS transistors, thus making it
possible for threshold voltage, body effect and the channel transconductance of both types of transistors to be
tuned independently. Generally, the starting material is an n+ or p+ substrate, with a lightly doped epitaxial layer
on top. This epitaxial layer provides the actual substrate on which the n-well and the p-well are formed. Since
two independent doping steps are performed for the creation of the well regions, the dopant concentrations can
be carefully optimized to produce the desired device characteristics.
In the conventional n-well CMOS process, the doping density of the well region is typically about one order of
magnitude higher than the substrate, which, among other effects, results in unbalanced drain parasitics. The
twin-tub process (Fig. 2.12) also avoids this problem.
Figure-2.12: Cross-section of nMOS and pMOS transistors in twin-tub CMOS process.
Silicon-on-Insulator (SOI) CMOS Process
Rather than using silicon as the substrate material, technologists have sought to use an insulating substrate to
improve process characteristics such as speed and latch-up susceptibility. The SOI CMOS technology allows
the creation of independent, completely isolated nMOS and pMOS transistors virtually side-by-side on an
insulating substrate (for example: sapphire). The main advantages of this technology are the higher integration
density (because of the absence of well regions), complete avoidance of the latch-up problem, and lower
parasitic capacitances compared to the conventional n-well or twin-tub CMOS processes. A cross-section of
nMOS and pMOS devices created using the SOI process is shown in Fig. 2.13.
The SOI CMOS process is considerably more costly than the standard n-well CMOS process. Yet the
improvements of device performance and the absence of latch-up problems can justify its use, especially for
deep-sub-micron devices.
Figure-2.13: Cross-section of nMOS and pMOS transistors in SOI CMOS process.
2.5 Layout Design Rules
The physical mask layout of any circuit to be manufactured using a particular process must conform to a set of
geometric constraints or rules, which are generally called layout design rules. These rules usually specify the
minimum allowable line widths for physical objects on-chip such as metal and polysilicon interconnects or
diffusion areas, minimum feature dimensions, and minimum allowable separations between two such features.
If a metal line width is made too small, for example, it is possible for the line to break during the fabrication
process or afterwards, resulting in an open circuit. If two lines are placed too close to each other in the layout,
they may form an unwanted short circuit by merging during or after the fabrication process. The main objective
of design rules is to achieve a high overall yield and reliability while using the smallest possible silicon area, for
any circuit to be manufactured with a particular process.
Note that there is usually a trade-off between higher yield which is obtained through conservative geometries,
and better area efficiency, which is obtained through aggressive, high-density placement of various features on
the chip. The layout design rules which are specified for a particular fabrication process normally represent a
reasonable optimum point in terms of yield and density. It must be emphasized, however, that the design rules
do not represent strict boundaries which separate "correct" designs from "incorrect" ones. A layout which
violates some of the specified design rules may still result in an operational circuit with reasonable yield,
whereas another layout observing all specified design rules may result in a circuit which is not functional and/or
has very low yield. To summarize, we can say, in general, that observing the layout design rules significantly
increases the probability of fabricating a successful product with high yield.
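To make the notion of a design rule concrete, the sketch below shows the kind of geometric check a minimum-width rule implies, applied to axis-aligned rectangles on a single layer. Real design rule checkers handle many more rule types (spacing, enclosure, extension); the data representation here is a hypothetical simplification.

```python
# An illustrative sketch of the geometric check a design rule implies:
# verifying that every rectangle on a layer meets a minimum width.
# The data structures are hypothetical simplifications.

def min_width_violations(rectangles, min_width):
    """rectangles: list of (x1, y1, x2, y2); returns offending ones."""
    bad = []
    for (x1, y1, x2, y2) in rectangles:
        if min(x2 - x1, y2 - y1) < min_width:
            bad.append((x1, y1, x2, y2))
    return bad

metal = [(0, 0, 10, 3), (12, 0, 14, 1)]         # widths 3 and 1 (layout units)
print(min_width_violations(metal, min_width=3))  # -> [(12, 0, 14, 1)]
```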
The design rules are usually described in two ways:
Micron rules, in which the layout constraints such as minimum feature sizes and minimum allowable
feature separations are stated in terms of absolute dimensions in micrometers, or,
Lambda rules, which specify the layout constraints in terms of a single parameter (lambda) and, thus, allow
linear, proportional scaling of all geometrical constraints.
Lambda-based layout design rules were originally devised to simplify the industry-standard micron-based
design rules and to allow scaling capability for various processes. It must be emphasized, however, that most
submicron CMOS process design rules do not lend themselves to straightforward linear scaling. The use of
lambda-based design rules must therefore be handled with caution in sub-micron geometries. In the following,
we present a sample set of the lambda-based layout design rules devised for the MOSIS CMOS process and
illustrate the implications of these rules on a section of a simple layout which includes two transistors (Fig. 2.14).
MOSIS Layout Design Rules (sample set)
Rule number   Description                                           L-Rule
R1            Minimum active area width                             3 L
R2            Minimum active area spacing                           3 L
R3            Minimum poly width                                    2 L
R4            Minimum poly spacing                                  2 L
R5            Minimum gate extension of poly over active            2 L
R6            Minimum poly-active edge spacing                      1 L
              (poly outside active area)
R7            Minimum poly-active edge spacing                      3 L
              (poly inside active area)
R8            Minimum metal width                                   3 L
R9            Minimum metal spacing                                 3 L
R10           Poly contact size                                     2 L
R11           Minimum poly contact spacing                          2 L
R12           Minimum poly contact to poly edge spacing             1 L
R13           Minimum poly contact to metal edge spacing            1 L
R14           Minimum poly contact to active edge spacing           3 L
R15           Active contact size                                   2 L
R16           Minimum active contact spacing                        2 L
              (on the same active region)
R17           Minimum active contact to active edge spacing         1 L
R18           Minimum active contact to metal edge spacing          1 L
R19           Minimum active contact to poly edge spacing           3 L
R20           Minimum active contact spacing                        6 L
              (on different active regions)
Figure-2.14: Illustration of some of the typical MOSIS layout design rules listed above.
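Because every lambda rule is an integer multiple of the single parameter lambda, converting the sample table above into absolute micron rules for a given process is a single multiplication per rule, as the sketch below shows. The rule subset and the lambda value of 0.5 micron are examples only.

```python
# A minimal sketch of how lambda-based rules scale: each constraint is
# a multiple of the single parameter lambda, so converting the table
# above to absolute (micron) rules is one multiplication. The rule
# subset and the lambda value below are examples only.

LAMBDA_RULES = {
    "R1: min active width": 3,
    "R3: min poly width": 2,
    "R5: min gate extension over active": 2,
    "R8: min metal width": 3,
    "R20: min active contact spacing (different regions)": 6,
}

def to_microns(rules, lambda_um):
    return {name: mult * lambda_um for name, mult in rules.items()}

# For a hypothetical process with lambda = 0.5 micron:
for name, value in to_microns(LAMBDA_RULES, 0.5).items():
    print(f"{name}: {value} um")
```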
This chapter edited by Y. Leblebici
Chapter 3
FULL-CUSTOM MASK LAYOUT DESIGN
Introduction
CMOS Layout Design Rules
CMOS Inverter Layout Design
Layout of CMOS NAND and NOR Gates
Complex CMOS Logic Gates
3.1 Introduction
In this chapter, the basic mask layout design guidelines for CMOS logic gates will be presented. The design of
physical layout is very tightly linked to overall circuit performance (area, speed, power dissipation) since the
physical structure directly determines the transconductances of the transistors, the parasitic capacitances and
resistances, and obviously, the silicon area which is used for a certain function. On the other hand, the detailed mask
layout of logic gates requires a very intensive and time-consuming design effort, which is justifiable only in special
circumstances where the area and/or the performance of the circuit must be optimized under very tight constraints.
Therefore, automated layout generation (e.g., standard cells + computer-aided placement and routing) is typically
preferred for the design of most digital VLSI circuits. In order to judge the physical constraints and limitations,
however, the VLSI designer must also have a good understanding of the physical mask layout process.
Mask layout drawings must strictly conform to a set of layout design rules as described in Chapter 2; therefore, we
will start this chapter with a review of a complete design rule set. The design of a simple CMOS inverter will be
presented step-by-step, in order to show the influence of various design rules on the mask structure and on the
dimensions. Also, we will introduce the concept of stick diagrams, which can be used very effectively to simplify the
overall topology of layout in the early design phases. With the help of stick diagrams, the designer can have a good
understanding of the topological constraints, and quickly test several possibilities for the optimum layout without
actually drawing a complete mask diagram.
The physical (mask layout) design of CMOS logic gates is an iterative process which starts with the circuit topology
(to realize the desired logic function) and the initial sizing of the transistors (to realize the desired performance
specifications). At this point, the designer can only estimate the total parasitic load at the output node, based on the
fan-out, the number of devices, and the expected length of the interconnection lines. If the logic gate contains more
than 4-6 transistors, the topological graph representation and the Euler-path method allow the designer to determine
the optimum ordering of the transistors. A simple stick diagram layout can now be drawn, showing the locations of
the transistors, the local interconnections between the transistors and the locations of the contacts.
After a topologically feasible layout is found, the mask layers are drawn (using a layout editor tool) according to the
layout design rules. This procedure may require several small iterations in order to accommodate all design rules, but
the basic topology should not change very significantly. Following the final DRC (Design Rule Check), a circuit
extraction procedure is performed on the finished layout to determine the actual transistor sizes, and more
importantly, the parasitic capacitances at each node. The result of the extraction step is usually a detailed
SPICE input file, which is automatically generated by the extraction tool.
Figure-3.1: The typical design flow for the production of a mask layout.
Now, the actual performance of the circuit
can be determined by performing a SPICE simulation, using the extracted net-list. If the simulated circuit
performance (e.g., transient response times or power dissipation) does not match the desired specifications, the layout
must be modified and the whole process must be repeated. The layout modifications are usually concentrated on the
(W/L) ratios of the transistors (transistor re-sizing), since the width-to-length ratios of the transistors determine the
device transconductance and the parasitic source/drain capacitances. The designer may also decide to change parts or
all of the circuit topology in order to reduce the parasitics. The flow diagram of this iterative process is shown in Fig.
3.1.
3.2 CMOS Layout Design Rules
As already discussed in Chapter 2, each mask layout design must conform to a set of layout design rules, which
dictate the geometrical constraints imposed upon the mask layers by the technology and by the fabrication process.
The layout designer must follow these rules in order to guarantee a certain yield for the finished product, i.e., a
certain ratio of acceptable chips out of a fabrication batch. A design which violates some of the layout design rules
may still result in a functional chip, but the yield is expected to be lower because of random process variations.
The design rules below are given in terms of scaleable lambda-rules. Note that while the concept of scaleable design
rules is very convenient for defining a technology-independent mask layout and for memorizing the basic constraints,
most of the rules do not scale linearly, especially for sub-micron technologies. This fact is illustrated in the right
column, where a representative rule set is given in real micron dimensions. A simple comparison with the lambda-based rules shows that there are significant differences. Therefore, lambda-based design rules are simply not useful
for sub-micron CMOS technologies.
Figure-3.2: Illustration of CMOS layout design rules.
3.3 CMOS Inverter Layout Design
In the following, the mask layout design of a CMOS inverter will be examined step-by-step. The circuit consists of
one nMOS and one pMOS transistor, therefore, one would assume that the layout topology is relatively simple. Yet,
we will see that there exist quite a number of different design possibilities even for this very simple circuit.
First, we need to create the individual transistors according to the design rules. Assume that we attempt to design the
inverter with minimum-size transistors. The width of the active area is then determined by the minimum diffusion
contact size (which is necessary for source and drain connections) and the minimum separation from diffusion
contact to both active area edges. The width of the polysilicon line over the active area (which is the gate of the
transistor) is typically taken as the minimum poly width (Fig. 3.3). Then, the overall length of the active area is
simply determined by the following sum: (minimum poly width) + 2 x (minimum poly-to-contact spacing) + 2 x
(minimum spacing from contact to active area edge). The pMOS transistor must be placed in an n-well region, and
the minimum size of the n-well is dictated by the pMOS active area and the minimum n-well overlap over n+. The
distance between the nMOS and the pMOS transistor is determined by the minimum separation between the n+
active area and the n-well (Fig. 3.4). The polysilicon gates of the nMOS and the pMOS transistors are usually
aligned. The final step in the mask layout is the local interconnections in metal, for the output node and for the VDD
and GND contacts (Fig. 3.5). Notice that in order to be biased properly, the n-well region must also have a VDD
contact.
Figure-3.3: Design rule constraints which determine the dimensions of a minimum-size transistor.
Figure-3.4: Placement of one nMOS and one pMOS transistor.
Figure-3.5: Complete mask layout of the CMOS inverter.
The initial phase of layout design can be simplified significantly by the use of stick diagrams - or so-called symbolic
layouts. Here, the detailed layout design rules are simply neglected and the main features (active areas, polysilicon
lines, metal lines) are represented by constant width rectangles or simple sticks. The purpose of the stick diagram is
to provide the designer a good understanding of the topological constraints, and to quickly test several possibilities
for the optimum layout without actually drawing a complete mask diagram. In the following, we will examine a
series of stick diagrams which show different layout options for the CMOS inverter circuit.
The first two stick diagram layouts shown in Fig. 3.6 are the two most basic inverter configurations, with different
alignments of the transistors. In some cases, other signals must be routed over the inverter. For instance, if one or two
metal lines have to be passed through the middle of the cell from left to right, horizontal metal straps can be used to
access the drain terminals of the transistors, which in turn connect to a vertical Metal-2 line. Metal-1 can now be
used to route the signals passing through the inverter. Alternatively, the diffusion areas of both transistors may be
used for extending the power and ground connections. This makes the inverter transistors transparent to horizontal
metal lines which may pass over.
The addition of a second metal layer allows more interconnect freedom. The second-level metal can be used for
power and ground supply lines, or alternatively, it may be used to vertically strap the input and the output signals.
The final layout example in Fig. 3.6 shows one possibility of using a third metal layer, which is utilized for routing
three signals on top.
Figure-3.6: Stick diagrams showing various CMOS inverter layout options.
3.4 Layout of CMOS NAND and NOR Gates
The mask layout designs of CMOS NAND and NOR gates follow the general principles examined earlier for the
CMOS inverter layout. Figure 3.7 shows the sample layouts of a two-input NOR gate and a two-input NAND gate,
using single-layer polysilicon and single-layer metal. Here, the p-type diffusion area for the pMOS transistors and the
n-type diffusion area for the nMOS transistors are aligned in parallel to allow simple routing of the gate signals with
two parallel polysilicon lines running vertically. Also notice that the two mask layouts show a very strong symmetry,
due to the fact that the NAND and the NOR gates have symmetrical circuit topologies. Finally, Figs. 3.8 and 3.9
show the major steps of the mask layout design for both gates, starting from the stick diagram and progressively
defining the mask layers.
Figure-3.7: Sample layouts of a CMOS NOR2 gate and a CMOS NAND2 gate.
Figure-3.8: Major steps required for generating the mask layout of a CMOS NOR2 gate.
Figure-3.9: Major steps required for generating the mask layout of a CMOS NAND2 gate.
3.5 Complex CMOS Logic Gates
The realization of complex Boolean functions (which may include several input variables and several product terms)
typically requires a series-parallel network of nMOS transistors which constitute the so-called pull-down net, and a
corresponding dual network of pMOS transistors which constitute the pull-up net. Figure 3.10 shows the circuit
diagram and the corresponding network graphs of a complex CMOS logic gate. Once the network topology of the
nMOS pull-down network is known, the pull-up network of pMOS transistors can easily be constructed by using the
dual-graph concept.
Figure-3.10: A complex CMOS logic gate realizing a Boolean function with 5 input variables.
Now, we will investigate the problem of constructing a minimum-area layout for the complex CMOS logic gate.
Figure 3.11 shows the stick-diagram layout of a "first attempt", using an arbitrary ordering of the polysilicon gate
columns. Note that in this case, the separation between the polysilicon columns must be sufficiently wide to allow for
two metal-diffusion contacts on both sides and one diffusion-diffusion separation. This certainly consumes a
considerable amount of extra silicon area.
If we can minimize the number of active-area breaks both for the nMOS and for the pMOS transistors, the separation
between the polysilicon gate columns can be made smaller. This, in turn, will reduce the overall horizontal
dimension and the overall circuit layout area. The number of active-area breaks can be minimized by changing the
ordering of the polysilicon columns, i.e., by changing the ordering of the transistors.
Figure-3.11:
Stick diagram layout of the complex CMOS logic gate, with an arbitrary ordering of the polysilicon gate columns.
A simple method for finding the optimum gate ordering is the Euler-path method: simply find an Euler path in the
pull-down network graph and an Euler path in the pull-up network graph with the identical ordering of input labels,
i.e., find a common Euler path for both graphs. An Euler path is defined as an uninterrupted path that traverses each
edge (branch) of the graph exactly once. Figure 3.12 shows the construction of a common Euler path for both graphs
in our example.
Figure-3.12:
Finding a common Euler path in both graphs for the pull-down and pull-up net provides a gate ordering that
minimizes the number of active-area breaks. In both cases, the Euler path starts at (x) and ends at (y).
It is seen that there is a common sequence (E-D-A-B-C) in both graphs. The polysilicon gate columns can be
arranged according to this sequence, which results in uninterrupted active areas for nMOS as well as for pMOS
transistors. The stick diagram of the new layout is shown in Fig. 3.13. In this case, the separation between two
neighboring poly columns must allow only for one metal-diffusion contact. The advantages of this new layout are a
more compact (smaller) layout area, simple routing of signals and, correspondingly, smaller parasitic capacitances.
Figure-3.13: Optimized stick diagram layout of the complex CMOS logic gate.
It may not always be possible to construct a complete Euler path both in the pull-down and in the pull-up network. In
that case, the best strategy is to find sub-Euler-paths in both graphs, which should be as long as possible. This
approach attempts to maximize the number of transistors which can be placed in a single, uninterrupted active area.
Finally, Fig. 3.14 shows the circuit diagram of a CMOS one-bit full adder. The circuit has three inputs, and two
outputs, sum and carry_out. The corresponding mask layout of this circuit is given in Fig. 3.15. All input and output
signals have been arranged in vertical polysilicon columns. Notice that both the sum-circuit and the carry-circuit
have been realized using one uninterrupted active area each.
Figure-3.14: Circuit diagram of the CMOS one-bit full adder.
Figure-3.15: Mask layout of the CMOS full adder circuit.
This chapter edited by Y. Leblebici
Chapter 4
PARASITIC EXTRACTION AND PERFORMANCE
ESTIMATION FROM PHYSICAL STRUCTURE
Introduction
The Reality with Interconnections
MOSFET Capacitances
Interconnect Capacitance Estimation
Interconnect Resistance Estimation
4.1 Introduction
In this chapter, we will investigate some of the physical factors which determine and ultimately limit the performance
of digital VLSI circuits. The switching characteristics of digital integrated circuits essentially dictate the overall
operating speed of digital systems. The dynamic performance requirements of a digital system are usually among the
most important design specifications that must be met by the circuit designer. Therefore, the switching speed of the
circuits must be estimated and optimized very early in the design phase.
The classical approach for determining the switching speed of a digital block is based on the assumption that the loads
are mainly capacitive and lumped. Relatively simple delay models exist for logic gates with purely capacitive load at
the output node, hence, the dynamic behavior of the circuit can be estimated easily once the load is determined. The
conventional delay estimation approaches seek to classify three main components of the gate load, all of which are
assumed to be purely capacitive, as: (1) internal parasitic capacitances of the transistors, (2) interconnect (line)
capacitances, and (3) input capacitances of the fan-out gates. Of these three components, the load conditions imposed
by the interconnection lines present serious problems.
Figure-4.1: An inverter driving three other inverters over interconnection lines.
Figure 4.1 shows a simple situation where an inverter is driving three other inverters, linked over interconnection lines
of different length and geometry. If the total load of each interconnection line can be approximated by a lumped
capacitance, then the total load seen by the primary inverter is simply the sum of all capacitive components described
above. The switching characteristics of the inverter are then described by the charge/discharge time of the load
capacitance, as seen in Fig. 4.2. The expected output voltage waveform of the inverter is given in Fig. 4.3, where the
propagation delay time is the primary measure of switching speed. It can be shown very easily that the signal
propagation delay under these conditions is linearly proportional to the load capacitance.
In most cases, however, the load conditions imposed by the interconnection line are far from being simple. The line,
itself a three-dimensional structure in metal and/or polysilicon, usually has a non-negligible resistance in addition to its
capacitance. The length/width ratio of the wire usually dictates that the parameters are distributed, making the
interconnect a true transmission line. Also, an interconnect is very rarely "alone", i.e., isolated from other influences.
In real conditions, the interconnection line is in very close proximity to a number of other lines, either on the same
level or on different levels. The capacitive/inductive coupling and the signal interference between neighboring lines
should also be taken into consideration for an accurate estimation of delay.
Figure-4.2: CMOS inverter stage with lumped capacitive load at the output node.
Figure-4.3: Typical input and output waveforms of an inverter with purely capacitive load.
4.2 The Reality with Interconnections
Consider the following situation where an inverter is driving two other inverters, over long interconnection lines. In
general, if the time of flight across the interconnection line (as determined by the speed of light) is shorter than the
signal rise/fall times, then the wire can be modeled as a capacitive load, or as a lumped or distributed RC network. If
the interconnection lines are sufficiently long and the rise times of the signal waveforms are comparable to the time of
flight across the line, then the inductance also becomes important, and the interconnection lines must be modeled as
transmission lines. Taking into consideration the RLCG (resistance, inductance, capacitance, and conductance)
parasitics (as seen in Fig. 4.4), the signal transmission across the wire becomes a very complicated matter, compared to
the relatively simplistic lumped-load case. Note that the signal integrity can be significantly degraded especially when
the output impedance of the driver is significantly lower than the characteristic impedance of the transmission line.
Figure-4.4:
(a) An RLCG interconnection tree. (b) Typical signal waveforms at the nodes A and B, showing the signal delay and
the various delay components.
The transmission-line effects have not been a serious concern in CMOS VLSI until recently, since the gate delay
originating from purely or mostly capacitive load components dominated the line delay in most cases. But as the
fabrication technologies move to finer (sub-micron) design rules, the intrinsic gate delay components tend to decrease
dramatically. By contrast, the overall chip size does not decrease - designers just put more functionality on the same
sized chip. A 100 mm2 chip has been a standard large chip for almost a decade. The factors which determine the chip
size are mainly driven by the packaging technology, manufacturing equipment, and the yield. Since the chip size and
the worst-case line length on a chip remain unchanged, the importance of interconnect delay increases in sub-micron
technologies. In addition, as the widths of metal lines shrink, the transmission line effects and signal coupling between
neighboring lines become even more pronounced.
This fact is illustrated in Fig. 4.5, where typical intrinsic gate delay and interconnect delay are plotted qualitatively, for
different technologies. It can be seen that for sub-micron technologies, the interconnect delay starts to dominate the
gate delay. In order to deal with the implications and to optimize a system for speed, the designers must have reliable
and efficient means of (1) estimating the interconnect parasitics in a large chip, and (2) simulating the time-domain
effects. Yet we will see that neither of these tasks is simple - interconnect parasitic extraction and accurate simulation
of line effects are two of the most difficult problems in physical design of VLSI circuits today.
Figure-4.5: Interconnect delay dominates gate delay in sub-micron CMOS technologies.
Once we establish the fact that the interconnection delay becomes a dominant factor in CMOS VLSI, the next question
is: how many of the interconnections in a large chip may cause serious problems in terms of delay. The hierarchical
structure of most VLSI designs offers some insight on this question. In a chip consisting of several functional modules,
each module contains a relatively large number of local connections between its functional blocks, logic gates, and
transistors. Since these intra-module connections are usually made over short distances, their influence on speed can
be simulated easily with conventional models. Yet there is also a fair number of longer connections between the
modules on a chip, the so-called inter-module connections. It is usually these inter-module connections which should
be scrutinized in the early design phases for possible timing problems. Figure 4.6 shows the typical statistical
distribution of wire lengths on a chip, normalized for the chip diagonal length. The distribution plot clearly exhibits
two distinct peaks, one for the relatively shorter intra-module connections, and the other for the longer inter-module
connections. Also note that a small number of interconnections may be very long, typically longer than the chip
diagonal length. These lines are usually required for global signal bus connections, and for clock distribution
networks. Although their numbers are relatively small, these long interconnections are obviously the most problematic
ones.
Figure-4.6: Statistical distribution of interconnection length on a typical chip.
To summarize the message of this section, we state that: (1) interconnection delay is becoming the dominant factor
which determines the dynamic performance of large-scale systems, and (2) interconnect parasitics are difficult to
model and to simulate. In the following sections, we will concentrate on various aspects of on-chip parasitics, and we
will mainly consider capacitive and resistive components.
4.3 MOSFET Capacitances
The first component of capacitive parasitics we will examine is the MOSFET capacitances. These parasitic
components are mainly responsible for the intrinsic delay of logic gates, and they can be modeled with fairly high
accuracy for gate delay estimation. The extraction of transistor parasitics from physical structure (mask layout) is also
fairly straightforward.
The parasitic capacitances associated with a MOSFET are shown in Fig. 4.7 as lumped elements between the device
terminals. Based on their physical origins, the parasitic device capacitances can be classified into two major groups:
(1) oxide-related capacitances and (2) junction capacitances. The gate-oxide-related capacitances are Cgd
(gate-to-drain capacitance), Cgs (gate-to-source capacitance), and Cgb (gate-to-substrate capacitance). Notice that in
reality, the gate-to-channel capacitance is distributed and voltage dependent. Consequently, all of the oxide-related
capacitances described here change with the bias conditions of the transistor. Figure 4.8 shows qualitatively the
oxide-related capacitances during cut-off, linear-mode operation and saturation of the MOSFET. The simplified
variation of the three capacitances with gate-to-source bias voltage is shown in Fig. 4.9.
Figure-4.7: Lumped representation of parasitic MOSFET capacitances.
Figure-4.8:
Schematic representation of MOSFET oxide capacitances during (a) cut-off, (b) linear-mode operation, and (c)
saturation.
Note that the total gate oxide capacitance is mainly determined by the parallel-plate capacitance between the
polysilicon gate and the underlying structures. Hence, the magnitude of the oxide-related capacitances is very closely
related to (1) the gate oxide thickness, and (2) the area of the MOSFET gate. Obviously, the total gate capacitance
decreases with decreasing device dimensions (W and L), yet it increases with decreasing gate oxide thickness. In
sub-micron technologies, the horizontal dimensions (which dictate the gate area) are usually scaled down more easily
than the vertical dimensions, such as the gate oxide thickness. Consequently, MOSFET transistors fabricated using
sub-micron technologies have, in general, smaller gate capacitances.
Figure-4.9: Variation of oxide capacitances as functions of gate-to-source voltage.
Now we consider the voltage-dependent source-to-substrate and drain-to-substrate capacitances, Csb and Cdb. Both of
these capacitances are due to the depletion charge surrounding the respective source or drain regions of the transistor,
which are embedded in the substrate. Figure 4.10 shows the simplified geometry of an n-type diffusion region within
the p-type substrate. Here, the diffusion region has been approximated by a rectangular box, which consists of five
planar pn-junctions. The total junction capacitance is a function of the junction area (sum of all planar junction areas),
the doping densities, and the applied terminal voltages. Accurate methods for estimating the junction capacitances
based on these data are readily available in the literature, therefore, a detailed discussion of capacitance calculations
will not be presented here.
One important aspect of parasitic device junction capacitances is that the amount of capacitance is a linear function of
the junction area. Consequently, the size of the drain or the source diffusion area dictates the amount of parasitic
capacitance. In sub-micron technologies, where the overall dimensions of the individual devices are scaled down, the
parasitic junction capacitances also decrease significantly.
It was already mentioned that the MOSFET parasitic capacitances are mainly responsible for the intrinsic delay of
logic gates. We have seen that both the oxide-related parasitic capacitances and the junction capacitances tend to
decrease with shrinking device dimensions, hence, the relative significance of intrinsic gate delay diminishes in
sub-micron technologies.
Figure-4.10: Three-dimensional view of the n-type diffusion region within the p-type substrate.
4.4 Interconnect Capacitance Estimation
In a typical VLSI chip, the parasitic interconnect capacitances are among the most difficult parameters to estimate
accurately. Each interconnection line (wire) is a three dimensional structure in metal and/or polysilicon, with
significant variations of shape, thickness, and vertical distance from the ground plane (substrate). Also, each
interconnect line is typically surrounded by a number of other lines, either on the same level or on different levels.
Figure 4.11 shows a possible, realistic situation where interconnections on three different levels run in close proximity
of each other. The accurate estimation of the parasitic capacitances of these wires with respect to the ground plane, as
well as with respect to each other, is obviously a complicated task.
Figure-4.11: Example of six interconnect lines running on three different levels.
Unfortunately for the VLSI designers, most of the conventional computer-aided VLSI design tools have a relatively
limited capability of interconnect parasitic estimation. This is true even for the design tools regularly used for
sub-micron VLSI design, where interconnect parasitics were shown to be very dominant. The designer should
therefore be aware of the physical problem and try to incorporate this knowledge early in the design phase, when the
initial floorplanning of the chip is done.
First, consider the section of a single interconnect which is shown in Fig. 4.12. It is assumed that this wire segment has
a length of (l) in the current direction, a width of (w) and a thickness of (t). Moreover, we assume that the interconnect
segment runs parallel to the chip surface and is separated from the ground plane by a dielectric (oxide) layer of height
(h). Now, the correct estimation of the parasitic capacitance with respect to ground is an important issue. Using the
basic geometry given in Fig. 4.12, one can calculate the parallel-plate capacitance Cpp of the interconnect segment.
However, in interconnect lines where the wire thickness (t) is comparable in magnitude to the ground-plane distance
(h), fringing electric fields significantly increase the total parasitic capacitance (Fig. 4.13).
Figure-4.12: Interconnect segment running parallel to the surface, used for parasitic capacitance estimations.
Figure-4.13: Influence of fringing electric fields upon the parasitic wire capacitance.
Figure 4.14 shows the variation of the fringing-field factor FF = Ctotal/Cpp, as a function of (t/h), (w/h) and (w/l). It
can be seen that the influence of fringing fields increases with the decreasing (w/h) ratio, and that the fringing-field
capacitance can be as much as 10-20 times larger than the parallel-plate capacitance. It was mentioned earlier that the
sub-micron fabrication technologies allow the width of the metal lines to be decreased somewhat, yet the thickness of
the line must be preserved in order to ensure structural integrity. This situation, which involves narrow metal lines
with a considerable vertical thickness, is especially vulnerable to fringing field effects.
Figure-4.14: Variation of the fringing-field factor with the interconnect geometry.
A set of simple formulas developed by Yuan and Trick in the early 1980s can be used to estimate the capacitance of
the interconnect structures in which fringing fields complicate the parasitic capacitance calculation. The following two
cases are considered for two different ranges of line width (w).
(4.1)
(4.2)
These formulas permit the accurate approximation of the parasitic capacitance values to within 10% error, even for
very small values of (t/h). Figure 4.15 shows a different view of the line capacitance as a function of (w/h) and (t/h).
The linear dash-dotted line in this plot represents the corresponding parallel-plate capacitance, and the other two
curves represent the actual capacitance, taking into account the fringing-field effects.
Figure-4.15: Capacitance of a single interconnect, as a function of (w/h) and (t/h).
Now consider the more realistic case where the interconnection line is not "alone" but is coupled with other lines
running in parallel. In this case, the total parasitic capacitance of the line is not only increased by the fringing-field
effects, but also by the capacitive coupling between the lines. Figure 4.16 shows the capacitance of a line which is
coupled with two other lines on both sides, separated by the minimum design rule. Especially if both of the
neighboring lines are biased at ground potential, the total parasitic capacitance of the interconnect running in the
middle (with respect to the ground plane) can be more than 20 times as large as the simple parallel-plate capacitance.
Note that the capacitive coupling between neighboring lines is increased when the thickness of the wire is comparable
to its width.
Figure-4.16: Capacitance of coupled interconnects, as a function of (w/h) and (t/h).
Figure 4.17 shows the cross-section view of a double-metal CMOS structure, where the individual parasitic
capacitances between the layers are also indicated. The cross-section does not show a MOSFET, but just a portion of a
diffusion region over which some metal lines may pass. The inter-layer capacitances between the metal-2 and metal-1,
metal-1 and polysilicon, and metal-2 and polysilicon are labeled as Cm2m1, Cm1p and Cm2p, respectively. The other
parasitic capacitance components are defined with respect to the substrate. If the metal line passes over an active
region, the oxide thickness underneath is smaller (because of the active area window), and consequently, the
capacitance is larger. These special cases are labeled as Cm1a and Cm2a. Otherwise, the thick field oxide layer results
in a smaller capacitance value.
Figure-4.17: Cross-sectional view of a double-metal CMOS structure, showing capacitances between layers.
The vertical thickness values of the different layers in a typical 0.8 micron CMOS technology are given below as an
example.
Field oxide thickness          0.52 um
Gate oxide thickness           16.0 nm
Poly 1 thickness               0.35 um   (minimum width 0.8 um)
Poly-metal oxide thickness     0.65 um
Metal 1 thickness              0.60 um   (minimum width 1.4 um)
Via oxide thickness            1.00 um
Metal 2 thickness              1.00 um   (minimum width 1.6 um)
n+ junction depth              0.40 um
p+ junction depth              0.40 um
n-well junction depth          3.50 um
The list below contains the capacitance values between various layers, also for a typical 0.8 micron CMOS
technology.
Poly over field oxide        (area)        0.066 fF/um2
Poly over field oxide        (perimeter)   0.046 fF/um
Metal-1 over field oxide     (area)        0.030 fF/um2
Metal-1 over field oxide     (perimeter)   0.044 fF/um
Metal-2 over field oxide     (area)        0.016 fF/um2
Metal-2 over field oxide     (perimeter)   0.042 fF/um
Metal-1 over poly            (area)        0.053 fF/um2
Metal-1 over poly            (perimeter)   0.051 fF/um
Metal-2 over poly            (area)        0.021 fF/um2
Metal-2 over poly            (perimeter)   0.045 fF/um
Metal-2 over metal-1         (area)        0.035 fF/um2
Metal-2 over metal-1         (perimeter)   0.051 fF/um
For the estimation of interconnect capacitances in a complicated three-dimensional structure, the exact geometry must
be taken into account for every portion of the wire. Yet this requires an unacceptable amount of computation in a large
circuit, even if simple formulas are applied for the calculation of capacitances. Usually, chip manufacturers supply the
area capacitance (parallel-plate cap) and the perimeter capacitance (fringing-field cap) figures for each layer, which
are backed up by measurement of capacitance test structures. These figures can be used to extract the parasitic
capacitances from the mask layout. It is often prudent to include test structures on chip that enable the designer to
independently calibrate a process to a set of design tools. In some cases where the entire chip performance is
influenced by the parasitic capacitance of a specific line, accurate 3-D simulation is the only reliable solution.
4.5 Interconnect Resistance Estimation
The parasitic resistance of a metal or polysilicon line can also have a profound influence on the signal propagation
delay over that line. The resistance of a line depends on the type of material used (polysilicon, aluminum, gold, ...), the
dimensions of the line and finally, the number and locations of the contacts on that line. Consider again the
interconnection line shown in Fig. 4.12. The total resistance in the indicated current direction can be found as
R = ρ · l / (w · t) = R_sheet · (l / w)    (4.3)
where the Greek letter ρ represents the characteristic resistivity of the interconnect material, and R_sheet = ρ / t represents the
sheet resistivity of the line, in (ohm/square). For a typical polysilicon layer, the sheet resistivity is between 20-40
ohm/square, whereas the sheet resistivity of silicide is about 2-4 ohm/square. Using the formula given above, we can
estimate the total parasitic resistance of a wire segment based on its geometry. Typical metal-poly and metal-diffusion
contact resistance values are between 20-30 ohms, while typical via resistance is about 0.3 ohms.
In most short-distance aluminum and silicide interconnects, the amount of parasitic wire resistance is usually
negligible. On the other hand, the effects of the parasitic resistance must be taken into account for longer wire
segments. As a first-order approximation in simulations, the total lumped resistance may be assumed to be connected
in series with the total lumped capacitance of the wire. A much better approximation of the influence of distributed
parasitic resistance can be obtained by using an RC-ladder network model to represent the interconnect segment (Fig.
4.18). Here, the interconnect segment is divided into smaller, identical sectors, and each sector is represented by an
RC-cell. Typically, the number of these RC-cells (i.e., the resolution of the RC model) determines the accuracy of the
simulation results. On the other hand, simulation time restrictions usually limit the resolution of this distributed line
model.
Figure-4.18: RC-ladder network used to model the distributed resistance and capacitance of an interconnect.
This chapter edited by Y. Leblebici
Chapter 5
CLOCK SIGNALS AND SYSTEM TIMING
5.1 On-Chip Clock Generation and Distribution
Clock signals are the heartbeats of digital systems. Hence, the stability of clock signals is highly important. Ideally,
clock signals should have minimum rise and fall times, specified duty cycles, and zero skew. In reality, clock
signals have nonzero skews and noticeable rise and fall times; duty cycles can also vary. In fact, as much as 10% of
a machine cycle time is expended to allow realistic clock skews in large computer systems. The problem is no less
serious in VLSI chip design. A simple technique for on-chip generation of a primary clock signal would be to use a
ring oscillator as shown in Fig. 5.1. Such a clock circuit has been used in low-end microprocessor chips.
Figure 5.1: Simple on-chip clock generation circuit using a ring oscillator.
However, the generated clock signal can be quite process-dependent and unstable. As a result, separate clock chips
which use crystal oscillators have been used for high-performance VLSI chip families. Figure 5.2 shows the circuit
schematic of a Pierce crystal oscillator with good frequency stability. This circuit is a near series-resonant circuit in
which the crystal sees a low load impedance across its terminals. Series resonance exists in the crystal, but its
internal series resistance largely determines the oscillation frequency. In its equivalent circuit model, the crystal
can be represented as a series RLC circuit; thus, the higher the series resistance, the lower the oscillation frequency.
The external load at the terminals of the crystal also has a considerable effect on the frequency and the frequency
stability. The inverter across the crystal provides the necessary voltage differential, and the external inverter
provides the amplification to drive clock loads. Note that the oscillator circuit presented here is by no means a
typical example of the state-of-the-art; design of high-frequency, high-quality clock oscillators is a formidable task,
which is beyond the scope of this section.
Figure-5.2: Circuit diagram of a Pierce crystal oscillator circuit.
Usually a VLSI chip receives one or more primary clock signals from an external clock chip and, in turn, generates
necessary derivatives for its internal use. It is often necessary to use two non-overlapping clock signals. The logical
product of the two clock signals should be zero at all times. Figure 5.3 shows a simple circuit that generates CK-1
and CK-2 from the original clock signal CK. Figure 5.4 shows a clock decoder circuit that takes in the primary
clock signals and generates four phase signals.
Figure-5.3: A simple circuit that generates a pair of non-overlapping clock signals from CK.
Figure-5.4:
Clock decoder circuit: (a) symbolic representation and (b) sample waveforms and gate-level implementation.
Since clock signals are required almost uniformly over the chip area, it is desirable that all clock signals are
distributed with a uniform delay. An ideal distribution network would be the H-tree structure shown in Fig. 5.5. In
such a structure, the distances from the center to all branch points are the same and hence, the signal delays would
be the same. However, this structure is difficult to implement in practice due to routing constraints and different
fanout requirements. A more practical approach for clock-signal distribution is to route main clock signals to
macroblocks and use local clock decoders to carefully balance the delays under different loading conditions.
Figure-5.5: General layout of an H-tree clock distribution network.
The reduction of clock skews, which are caused by the differences in clock arrival times and changes in clock
waveforms due to variations in load conditions, is a major concern in high-speed VLSI design. In addition to
uniform clock distribution (H-tree) networks and local skew balancing, a number of new computer-aided design
techniques have been developed to automatically generate the layout of an optimum clock distribution network with
zero skew. Figure 5.6 shows a zero-skew clock routing network that was constructed based on estimated routing
parasitics.
Regardless of the exact geometry of the clock distribution network, the clock signals must be buffered in multiple
stages as shown in Fig. 5.7 to handle the high fan-out loads. It is also essential that every buffer stage drives the
same number of fan-out gates so that the clock delays are always balanced. In the configuration shown in Fig. 5.8
(used in the DEC Alpha chip designs), the interconnect wires are cross-connected with vertical metal straps in a
mesh pattern, in order to keep the clock signals in phase across the entire chip.
So far we have seen the needs for having equal interconnect lengths and extensive buffering in order to distribute
clock signals with minimal skews and healthy signal waveforms. In practice, designers must spend significant time
and effort to tune the transistor sizes in buffers (inverters) and also the widths of interconnects. Widening the
interconnection wires decreases the series resistance, but at the cost of increased parasitic capacitance.
Figure-5.6: An example of the zero-skew clock routing network, generated by a computer-aided design tool.
Figure-5.7: Three-level buffered clock distribution network.
Figure-5.8: General structure of the clock distribution network used in DEC Alpha microprocessor chips.
The following points should always be considered carefully in digital system design, but especially for successful
high-speed VLSI design:
The ideal duty cycle of a clock signal is 50%, and the signal can travel farther in a chain of inverting buffers with
ideal duty cycle. The duty cycle of a clock signal can be improved, i.e., made closer to 50%, by using
feedback based on the voltage average.
To prevent reflection in the interconnection network, the rise time and the fall time of the clock signal should
not be reduced excessively.
The load capacitance should be reduced as much as possible, by reducing the fan-out, the interconnection
lengths and the gate capacitances.
The characteristic impedance of the clock distribution line should be reduced by using properly increased
(w/h)-ratios (the ratio of the line width to vertical separation distance of the line from the substrate).
Inductive loads can be used to partially cancel the effects of parasitic capacitance of a clock receiver
(matching network).
Adequate separation should be maintained between high-speed clock lines in order to prevent cross-talk.
Also, placing a power or ground rail between two high-speed lines can be an effective measure.
This chapter edited by Y. Leblebici
Chapter 6
ARITHMETIC FOR DIGITAL SYSTEMS
Introduction
Notation Systems
Principle of Generation and Propagation
The 1-bit Full Adder
Enhancement Techniques for Adders
Multioperand Adders
Multiplication
Addition and Multiplication in Galois Fields, GF(2^n)
6.1 Introduction
Computation speeds have increased dramatically during the past three decades resulting from the development of various
technologies. The execution speed of an arithmetic operation is a function of two factors. One is the circuit technology
and the other is the algorithm used. It can be rather confusing to discuss both factors simultaneously; for instance, a
ripple-carry adder implemented in GaAs technology may be faster than a carry-look-ahead adder implemented in CMOS.
Further, in any technology, logic path delay depends upon many different factors: the number of gates through which a
signal has to pass before a decision is made, the logic capability of each gate, cumulative distance among all such serial
gates, the electrical signal propagation time of the medium per unit distance, etc. Because the logic path delay is
attributable to the delay internal and external to logic gates, a comprehensive model of performance would have to
include technology, distance, placement, layout, electrical and logical capabilities of the gates. It is not feasible to make a
general model of arithmetic performance and include all these variables.
The purpose of this chapter is to give an overview of the different components used in the design of arithmetic operators.
The following parts will not exhaustively go through all these components. However, the algorithms used, some
mathematical concepts, the architectures, the implementations at the block, transistor or even mask level will be
presented. This chapter will start with the presentation of various notation systems. These are important because they
influence the architectures, the size and the performance of the arithmetic components. The well known and used
principle of generation and propagation will be explained and basic implementation at transistor level will be given as
examples. The basic full adder cell (FA) will be shown as a brick used in the construction of various systems. After that,
the problem of building large adders will lead to the presentation of enhancement techniques. Multioperand adders are of
particular interest when building special CPUs and especially multipliers. That is why certain algorithms will be
introduced to give a better idea of the building of multipliers. After the presentation of the classical approaches, a logarithmic
multiplier and the multiplication and addition in the Galois Fields will be briefly introduced. Muller [Mull92] and
Cavanagh [Cava83] constitute two reference books on the matter.
6.2 Notation Systems
6.2.1 Integer Unsigned
The binary number system is the most conventional and easily implemented system for internal use in digital computers.
It is also a positional number system. In this mode the number is encoded as a vector of n bits (digits) in which each bit is
weighted according to its position in the vector. Associated with each vector is a base (or radix) r. Each bit has an integer
value in the range 0 to r-1. In the binary system where r=2, each bit has the value 0 or 1. Consider an n-bit vector of the
form:
A = (a_{n-1} a_{n-2} ... a_1 a_0)    (1)
where a_i = 0 or 1 for i in [0, n-1]. This vector can represent positive integer values V = A in the range 0 to 2^n - 1, where:
V = sum from i=0 to n-1 of a_i · 2^i    (2)
The above representation can be extended to include fractions. An example follows. The string of binary digits 1101.11
can be interpreted to represent the quantity:
2^3 · 1 + 2^2 · 1 + 2^1 · 0 + 2^0 · 1 + 2^(-1) · 1 + 2^(-2) · 1 = 13.75    (3)
The following Table 6.1 shows the 3-bit vector representing the decimal expression to the right.
Table-6.1: Binary representation unsigned system with 3 digits
6.2.2 Integer Signed
If only positive integers were to be represented in fixed-point notation, then an n-bit word would permit a range from 0 to
2^n - 1. However, both positive and negative integers are used in computations and an encoding scheme must be devised in
which both positive and negative numbers are distributed as evenly as possible. There must also be an easy way to
distinguish between positive and negative numbers. The leftmost digit is usually reserved for the sign. Consider the
following number A with radix r,
where the sign digit an-1 has the following value:
for binary numbers where r=2, the previous equation becomes:
The remaining digits in A indicate either the true value or the magnitude of A in a complemented form.
6.2.2.1 Absolute value
5/1/2007 10:53 AM
Design of VLSI Systems
3 of 34
file:///D:/ACADEMIC%20DOCUMENTS/eBOOKs/vlsi%20design/Des...
Table-6.2: Binary representation, signed absolute value
In this representation, the high-order bit indicates the sign of the integer (0 for positive, 1 for negative). A positive number
has a range of 0 to 2^(n-1) - 1, and a negative number has a range of 0 to -(2^(n-1) - 1). The representation of a positive number is:
Negative numbers have the following representation:
One problem with this kind of notation is the dual representation of the number 0. Another problem arises when adding two
numbers with opposite signs: the magnitudes have to be compared to determine the sign of the result.
6.2.2.2 1's complement
Table-6.3: Binary representation, signed 1's complement
In this representation, the high-order bit also indicates the sign of the integer (0 for positive, 1 for negative). A positive
number has a range of 0 to 2^(n-1) - 1, and a negative number has a range of 0 to -(2^(n-1) - 1). The representation of a positive
number is:
Negative numbers have the following representation:
One problem with this kind of notation is the dual representation of the number 0. Another problem arises when adding two
numbers with opposite signs: the magnitudes have to be compared to determine the sign of the result.
6.2.2.3 2's complement
Table-6.4: Binary representation, signed 2's complement
In this notation system (radix 2), the value of A is represented such as:
Testing the sign is again a simple bit comparison. There is a unique representation of 0. Addition and subtraction are
easier because the result always comes out in a unique 2's complement form.
6.2.3 Carry Save
In some particular operations requiring big additions, such as multiplication or filtering operations, the carry save
notation is used. This notation can be combined with 1's complement, 2's complement, or any other convention. It only means
that the result of an addition is coded in two digits: the carry and the sum digit. This notion will become clear by itself in the
discussion of multioperand adders and multipliers.
6.2.4 Redundant Notation
It has been stated that each bit in a number system has an integer value in the range 0 to r-1. This produces a digit set S:
S = {0, 1, ..., r-1}    (4)
in which all the digits of the set are positively weighted. It is also possible to have a digit set in which both positive- and
negative-weighted digits are allowed [Aviz61] [Taka87], such as:
T = {-l, ..., -1, 0, 1, ..., l}    (5)
where l is a positive integer representing the upper limit of the set. This is considered as a redundant number system,
because there may be more than one way to represent a given number. Each digit of a redundant number system can
assume the 2l+1 values of the set T. The range of l is:
⌈r/2⌉ ≤ l ≤ r-1    (6)
where ⌈x⌉ is called the ceiling of x.
For any number x, the ceiling of x is the smallest integer not less than x. The floor of x, ⌊x⌋, is the largest integer not
greater than x. Since the integer l is greater than or equal to 1 and r is greater than or equal to 2, the maximum magnitude
of l will be
l_max = r-1    (7)
Thus for r=2, the digit set is:
T = {-1, 0, 1}    (8)
5/1/2007 10:53 AM
Design of VLSI Systems
5 of 34
file:///D:/ACADEMIC%20DOCUMENTS/eBOOKs/vlsi%20design/Des...
For r=4, the digit set is
T = {-3, -2, -1, 0, 1, 2, 3}    (9)
For example, for n=4 and r=2, the number A=-5 has several representations; four of them are shown below in Table 6.5.
       2^3    2^2    2^1    2^0
A =     0     -1      0     -1
A =     0     -1     -1      1
A =    -1      0      1      1
A =    -1      1      0     -1
Table-6.5: Redundant representations of A=-5 when r=2
This multirepresentation makes redundant number systems difficult to use for certain arithmetic operations. Also, since
each signed digit may require more than one bit to represent the digit, this may increase both the storage and the width of
the storage bus.
However, redundant number systems have an advantage for addition: it is possible to eliminate the problem
of the propagation of the carry bit. The operation can be done in a constant time, independent of the length of the data
word. The conversion from binary to binary redundant is usually a duplication or juxtaposition of bits and it does not cost
anything. On the contrary, the opposite conversion implies an addition, and the propagation of the carry bit cannot be
removed.
Let us consider the example where r=2 and l=1. In this system the three digits used are -1, 0 and +1, each coded on two bits, the first positively weighted and the second negatively weighted.
The representation of 1 is 10, because 1-0=1.
The representation of -1 is 01, because 0-1=-1.
One representation of 0 is 00, because 0-0=0.
Another representation of 0 is 11, because 1-1=0.
The addition of 7 and 5 gives 12 in decimal; in a non-redundant binary system this is 111 + 101.
We note that a carry bit has to be added to the next digits when making the operation "by hand". In the redundant system
the same operation absorbs the carry bit, which is never propagated to the higher-order digits.
The result 1001100 now has to be converted to the non-redundant binary system. To achieve this, each couple of bits has
to be added together, and the eventual carry has to be propagated to the higher-order bits.
6.3 Principle of Generation and Propagation
6.3.1 The Concept
The principle of Generation and Propagation seems to have been discussed for the first time by Burks, Goldstine and Von
Neumann [BGNe46]. It is based on a simple remark: when adding two numbers A and B in 2's complement or in the
simplest binary representation (A = an-1...a1a0, B = bn-1...b1b0), if ai = bi then it is not necessary to know the carry ci. So it is
not necessary to wait for its calculation in order to determine ci+1 and the sum si+1.
If ai = bi = 0, then necessarily ci+1 = 0
If ai = bi = 1, then necessarily ci+1 = 1
This means that when ai = bi, it is possible to add the bits of weight greater than i before the carry information ci+1 has arrived.
The time required to perform the addition will be proportional to the length of the longest chain i, i+1, ..., i+p such that ak
is not equal to bk for k in [i, i+p].
It has been shown [BGNe46] that the average length of this longest chain is proportional to the logarithm of the number of
bits used to represent A and B. By using this principle of generation and propagation it is possible to design an adder with
an average delay of O(log n). However, this type of adder is usable only in asynchronous systems [Mull82]. Today the
complexity of the systems is so high that asynchronous timing of the operations is rarely implemented. That is why the
problem is to minimize the maximum delay rather than the average delay.
Generation:
This principle of generation allows the system to take advantage of the occurrences “ai=bi”. In both cases (ai=1 or
ai=0) the carry bit will be known.
Propagation:
If we are able to localize a chain of bits ai ai+1...ai+p and bi bi+1...bi+p for which ak not equal to bk for k in [i,i+p],
then the output carry bit of this chain will be equal to the input carry bit of the chain.
These remarks constitute the principle of generation and propagation used to speed the addition of two numbers.
All adders which use this principle calculate in a first stage.
pi = ai XOR bi (10)
gi = ai bi (11)
The previous equations determine the ability of the ith bit to propagate carry information or to generate a carry information.
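A small Python sketch (illustrative; the function names are ours) shows how the pi and gi signals of equations (10) and (11) drive the carry recurrence ci+1 = gi + pi.ci:

    def propagate_generate(a_bits, b_bits):
        # Per-bit signals: p_i = a_i XOR b_i, g_i = a_i AND b_i.
        p = [a ^ b for a, b in zip(a_bits, b_bits)]
        g = [a & b for a, b in zip(a_bits, b_bits)]
        return p, g

    def ripple_add(a_bits, b_bits, c0=0):
        # LSB-first ripple-carry addition driven by the p/g signals.
        p, g = propagate_generate(a_bits, b_bits)
        c, s = c0, []
        for pi, gi in zip(p, g):
            s.append(pi ^ c)
            c = gi | (pi & c)
        return s, c

    # 7 + 5 = 12 with LSB-first bit vectors
    s, c = ripple_add([1, 1, 1, 0], [1, 0, 1, 0])
    assert s == [0, 0, 1, 1] and c == 0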
6.3.2 Transistor Formulation
[Click to enlarge image]Figure-6.1: A 1-bit adder with the propagation signal controlling the pass-gate
This implementation can achieve very good performance (20 transistors), depending on the way the XOR function is built. The
propagation of the carry is controlled by the output of the XOR gate. The generation of the carry is made directly by the
function at the bottom: when both input signals are 1, the inverse output carry is 0.
In the schematic of Figure 6.1, the carry passes through a complete transmission gate. If the carry path is precharged to
VDD, the transmission gate is reduced to a simple NMOS transistor. In the same way the PMOS transistors of the
carry generation are removed. One gets a Manchester cell.
[Click to enlarge image]Figure-6.2: The Manchester cell
The Manchester cell is very fast, but a large set of such cascaded cells would be slow. This is due to the distributed RC
effect and the body effect, which make the propagation time grow with the square of the number of cells. Practically, an
inverter is added every four cells, as in Figure 6.3.
[Click to enlarge image]Figure-6.3: The Manchester carry cell
6.4 The 1-bit Full Adder
It is the generic cell used not only to perform addition but also arithmetic multiplication, division and filtering operations.
In this part we will analyse the equations and give some implementations with layout examples.
The adder cell receives two operands ai and bi, and an incoming carry ci. It computes the sum and the outgoing carry ci+1:
ci+1 = ai . bi + ai . ci + ci . bi = ai . bi + (ai + bi) . ci
ci+1 = pi . ci + gi
[Click to enlarge image]Figure-6.4: The full adder (FA) and half adder (HA) cells
where
pi = bi XOR ai is the PROPAGATION signal (12)
gi = ai . bi is the GENERATION signal (13)
si = ai XOR bi XOR ci (14)
si = not(ci+1) . (ai + bi + ci) + ai . bi . ci (15)
These equations can be directly translated into two N and P nets of transistors, leading to the following schematics. The
main disadvantage of this implementation is that there is no regularity in the nets.
[Click to enlarge image]Figure-6.5: Direct transcription of the previous equations
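Equations (14) and (15) are equivalent, which a short exhaustive Python check confirms (an illustrative sketch, not from the original text):

    from itertools import product

    for a, b, c in product((0, 1), repeat=3):
        cout = (a & b) | (a & c) | (b & c)
        s_xor = a ^ b ^ c                                   # equation (14)
        s_dual = ((1 - cout) & (a | b | c)) | (a & b & c)   # equation (15)
        assert s_dual == s_xor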
The dual form of each equation described previously can be written in the same manner as the normal form:
(16)
dual of (16) (17)
(18)
In the same way :
(19)
dual of (19) (20)
(21)
The schematic becomes symmetrical (Figure 6.6), and leads to a better layout :
[Click to enlarge image]Figure-6.6: Symmetrical implementation due to the dual expressions of ci and si.
The following Figure 6.7 shows different physical layouts in different technologies. The size, the technology and the
performance of each cell are summarized in Table 6.6.
Name of cell     Number of Tr.   Size (µm2)   Technology   Worst Case Delay (ns) (Typical Conditions)
fa_ly_mini_jk    24              2400         1.2 µ        20
fa_ly_op1        24              3150         1.2 µ        5
Fulladd.L        28              962          0.5 µ        1.5
fa_ly_itt        24              3627         1.2 µ        10
Table-6.6: Characteristics of the layout cells of Figure 6.7
Figure-6.7: Mask layout for different Full Adder cells
6.5 Enhancement Techniques for Adders
The operands of addition are the addend and the augend. The addend is added to the augend to form the sum. In most
computers, the augmented operand (the augend) is replaced by the sum, whereas the addend is unchanged. High-speed
adders are used not only for addition but also for subtraction, multiplication and division. The speed of a digital processor
depends heavily on the speed of its adders. The adders add vectors of bits, and the principal problem is to speed up the carry
signal. A traditional, non-optimized four-bit adder can be made by connecting generic one-bit adder cells
one to the other: this is the ripple carry adder. In this case, the sum resulting at each stage needs to wait for the incoming
carry signal to perform the sum operation. The carry propagation can be sped up in two ways. The first, and most
obvious, way is to use a faster logic circuit technology. The second way is to generate carries by means of forecasting
logic that does not rely on the carry signal being rippled from stage to stage of the adder.
[Click to enlarge image]Figure-6.8: A 4-bit parallel ripple carry adder
Generally, the size of an adder is determined according to the type of operations required, to the precision, or to the time
allowed to perform the operation. Since the operands have a fixed size, it becomes important to determine whether or not
an overflow has occurred.
Overflow: An overflow can be detected in two ways. First, an overflow has occurred when the sign of the sum does not
agree with the signs of the operands while the signs of the operands are the same. In an n-bit adder, overflow can be
defined as:
OVERFLOW = an-1 . bn-1 . not(sn-1) + not(an-1) . not(bn-1) . sn-1 (22)
Secondly, if the carry out of the high-order numeric (magnitude) position of the sum and the carry out of the sign position
of the sum agree, the sum is satisfactory; if they disagree, an overflow has occurred. Thus,
OVERFLOW = cn-1 XOR cn-2 (23)
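Both detection rules can be exercised with a short Python sketch (illustrative only; the function name and test values are ours):

    def overflow(a, b, n):
        # Compare rule (22), based on operand and sum signs, with rule (23),
        # based on the carries into and out of the sign position.
        mask = (1 << n) - 1
        sa, sb = (a >> (n - 1)) & 1, (b >> (n - 1)) & 1
        s = ((a & mask) + (b & mask)) & mask
        ss = (s >> (n - 1)) & 1
        v22 = (sa & sb & (1 - ss)) | ((1 - sa) & (1 - sb) & ss)
        c_in_sign = (((a & (mask >> 1)) + (b & (mask >> 1))) >> (n - 1)) & 1
        c_out_sign = (((a & mask) + (b & mask)) >> n) & 1
        v23 = c_in_sign ^ c_out_sign
        assert v22 == v23
        return v23

    assert overflow(0b0101, 0b0100, 4) == 1   # 5 + 4 does not fit in 4 bits
    assert overflow(0b0101, 0b1100, 4) == 0   # 5 + (-4) = 1, no overflow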
A parallel adder adds two operands, including the sign bits. An overflow from the magnitude part will tend to change the
sign of the sum, so that an erroneous sign will be produced. Table 6.7 summarizes the overflow detection:
an-1   bn-1   sn-1   cn-1   cn-2   Overflow
 0      0      0      0      0       0
 0      0      1      0      1       1
 1      1      0      1      0       1
 1      1      1      1      1       0
Table-6.7: Overflow detection for 1's and 2's complement
Coming back to the acceleration of the computation, two major families of techniques are used: speed-up techniques (Carry
Skip and Carry Select) and anticipation techniques (Carry Look Ahead, Brent and Kung, and C3i). Finally, a combination of these
techniques can prove to be an optimum for large adders.
6.5.1 The Carry-Skip Adder
Depending on the position at which a carry signal has been generated, the propagation time can be variable. In the best
case, when there is no carry generation, the addition time will only take into account the time to propagate the carry
signal. Figure 6.9 is an example illustrating a carry signal generated twice, with the input carry being equal to 0. In this case
three simultaneous carry propagations occur. The longest is the second, which takes 7 cell delays (it starts at the 4th
position and ends at the 11th position). So the addition time of these two numbers with this 16-bit Ripple Carry Adder is
7.k + k', where k is the cell delay and k' is the time needed to compute the 11th sum bit using the 11th carry-in.
With a Ripple Carry Adder, if the input bits Ai and Bi are different for all position i, then the carry signal is propagated at
all positions (thus never generated), and the addition is completed when the carry signal has propagated through the whole
adder. In this case, the Ripple Carry Adder is as slow as it is large. Actually, Ripple Carry Adders are fast only for some
configurations of the input words, where carry signals are generated at some positions.
Carry Skip Adders take advantage of both the generation and the propagation of the carry signal. They are divided into
blocks, where a special circuit quickly detects whether all the bits to be added are different (Pi = 1 at every position in the block). The signal
produced by this circuit will be called block propagation signal. If the carry is propagated at all positions in the block,
then the carry signal entering into the block can directly bypass it and so be transmitted through a multiplexer to the next
block. As soon as the carry signal is transmitted to a block, it starts to propagate through the block, as if it had been
generated at the beginning of the block. Figure 6.10 shows the structure of a 24-bits Carry Skip Adder, divided into 4
blocks.
[Click to enlarge image]Figure-6.10: The "domino" behaviour of the carry propagation and generation signals
[Click to enlarge image]Figure-6.10a: Block diagram of a carry skip adder
To summarize: if in a block all the Ai's differ from the corresponding Bi's, then the carry signal skips over the block. If some are equal, a carry signal is
generated inside the block, and the computation must complete inside the block before the carry information is given to the next
block.
OPTIMISATION TECHNIQUE WITH BLOCKS OF EQUAL SIZE
It now becomes obvious that there exists a trade-off between the speed and the size of the blocks. In this part we analyse
the division of the adder into blocks of equal size. Let us denote by k1 the time needed by the carry signal to propagate
through an adder cell, and k2 the time it needs to skip over one block. Suppose the N-bit Carry Skip Adder is divided into
M blocks, and each block contains P adder cells. The actual addition time of a Ripple Carry Adder depends on the
configuration of the input words. The completion time may be small but it also may reach the worst case, when all adder
cells propagate the carry signal. In the same way, we must evaluate the worst carry propagation time for the Carry Skip
Adder. The worst case of carry propagation is depicted in Figure 6.11.
[Click to enlarge image]Figure-6.11: Worst case for the propagation signal in a Carry Skip adder with blocks of equal size
The configuration of the input words is such that a carry signal is generated at the beginning of the first block. Then this
carry signal is propagated by all the succeeding adder cells but the last which generates another carry signal. In the first
and the last block the block propagation signal is equal to 0, so the entering carry signal is not transmitted to the next
block. Consequently, in the first block, the last adder cells must wait for the carry signal, which comes from the first cell
of the first block. When going out of the first block, the carry signal is distributed to the 2nd, 3rd and last block, where it
propagates. In these blocks, the carry signals propagate almost simultaneously (we must account for the multiplexer
delays). Any other situation leads to a better case. Suppose for instance that the 2nd block does not propagate the carry
signal (its block propagation signal is equal to zero), then it means that a carry signal is generated inside. This carry signal
starts to propagate as soon as the input bits are settled. In other words, at the beginning of the addition, there exist two
sources of carry signals, and the paths of these carry signals are shorter than the carry path of the worst case. Let us
formalize: the total adder is made of N adder cells, arranged in M blocks of P adder cells. The total number of adder cells is
then
N=M.P (24)
The time T needed by the carry signal to propagate through P adder cells is
T=k1.P (25)
The time T' needed by the carry signal to skip through M adder blocks is
T'=k2.M (26)
The problem to solve is to minimize the worst case delay. In the worst case of Figure 6.11, the carry ripples through the P cells of the first block, skips the M-2 intermediate blocks, and ripples again through the P cells of the last block:
Tworst = 2.k1.P + k2.(M - 2) (27)
Substituting M = N/P:
Tworst(P) = 2.k1.P + k2.(N/P - 2) (28)
So that the function to be minimized is:
f(P) = 2.k1.P + k2.N/P (29)
The minimum is obtained for:
P = sqrt(k2.N / (2.k1)) (30)
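A few lines of Python make the optimisation concrete (an illustrative sketch with made-up delay constants; the names are ours):

    from math import ceil, sqrt

    def optimal_block_size(n_bits, k1, k2):
        # Block size minimizing f(P) = 2*k1*P + k2*n/P, from equation (30).
        return sqrt(k2 * n_bits / (2.0 * k1))

    # e.g. a 32-bit adder where a skip is twice as fast as a ripple stage
    print(ceil(optimal_block_size(32, k1=1.0, k2=0.5)))   # -> 3 cells per block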
OPTIMISATION TECHNIQUE WITH BLOCKS OF NON-EQUAL SIZE
Let us formalize the problem as a geometric problem. A square will represent the generic full adder cell. These cells will
be grouped in P groups (in a column-like manner).
L(i) is the number of bits of column i.
L(1), L(2), ..., L(P) are the P adjacent columns. (see Figure 6.12)
[Click to enlarge image]Figure-6.12: Geometric formalization
If a carry signal is generated at the ith section, this carry skips j-i-1 sections and disappears at the end of the jth section.
So the delay of propagation is:
(31)
By defining the constant a equal to:
(32)
one can position two straight lines defined by:
(at the left most position) (33)
(at the right most position) (34)
The constant a is equivalent to the slope of the two straight lines defined by equations (33) and (34). These straight
lines are adjacent to the top of the columns, and the maximum time can be expressed as a geometrical distance y, equal to
the y-value of the intersection of the two straight lines.
(35)
(36)
because
(37)
[Click to enlarge image]Figure-6.13: Representation of the geometrical worst delay
A possible implementation of a block is shown in Figure 6.14. In the precharge mode, the output of the four inverter-like
structures is set to one. In the evaluation mode, the entire block is in action and the output will receive either c0 or the
carry generated inside the comparator cells, according to the values given to A and B. If no carry generation is
needed, c0 is transmitted to the output. In the other case, one of the inverted pi's switches the multiplexer to
enable the other input.
[Click to enlarge image]Figure-6.14: A possible implementation of the Carry Skip block
6.5.2 The Carry-Select Adder
This type of adder is not as fast as the Carry Look Ahead (CLA) presented in a later section. However, despite the larger
amount of hardware needed, it has an interesting design concept. The Carry Select principle requires two identical parallel
adders partitioned into four-bit groups. Each group consists of the same design as that shown in Figure 6.15, and
generates a group carry. In the carry select adder, two sums are generated simultaneously: one assumes that
the carry-in is equal to one, while the other assumes that the carry-in is equal to zero. The predicted group carry is then used
to select one of the two sums.
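The selection step can be sketched in a few lines of Python (illustrative only; it reuses the ripple_add helper from the sketch in Section 6.3.1):

    def carry_select_group(a_bits, b_bits, cin):
        # Both sums are precomputed; the real carry-in only drives a multiplexer.
        s0, c0 = ripple_add(a_bits, b_bits, 0)   # sum assuming carry-in = 0
        s1, c1 = ripple_add(a_bits, b_bits, 1)   # sum assuming carry-in = 1
        return (s1, c1) if cin else (s0, c0)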
It can be seen that the group carry logic grows rapidly when more high-order groups are added to the total adder
length. This complexity can be decreased, with a subsequent increase in the delay, by partitioning a long adder into
sections, with four groups per section, similar to the CLA adder.
[Click to enlarge image]Figure-6.15: The Carry Select adder
[Click to enlarge image]Figure-6.16: The Carry Select adder. (a) The design with non-optimised use of the gates, (b) merging of the redundant gates
A possible implementation is shown in Figure 6.16, where it is possible to merge some redundant logic gates to achieve a
lower complexity with a higher density.
6.5.3 The Carry Look-Ahead Adder
The limitation of the sequential method of forming carries, especially in the Ripple Carry adder, arises from specifying ci
as a specific function of ci-1. It is possible to express a carry as a function of all the preceding low-order carries by using
the recursivity of the carry function. With the following expression a considerable increase in speed can be realized:
ci+1 = gi + pi.gi-1 + pi.pi-1.gi-2 + ... + pi.pi-1...p1.g0 + pi.pi-1...p1.p0.c0 (38)
Usually the size and complexity of a big adder using this equation are not affordable. That is why the equation is used in a
modular way, by making groups of carries (usually four bits). Such a unit then generates a group carry which gives the correctly
predicted information to the next block, giving time to the sum units to perform their calculation.
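A four-bit group of equation (38), with the group signals used to chain blocks, can be sketched in Python as follows (illustrative; the names are ours):

    def cla4_carries(p, g, c0):
        # All carries of a 4-bit slice from the unrolled recurrence (38),
        # plus the group propagate P and generate G handed to the next block.
        c1 = g[0] | (p[0] & c0)
        c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
        c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
        G = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
        P = p[3] & p[2] & p[1] & p[0]
        c4 = G | (P & c0)   # group carry-out
        return [c1, c2, c3, c4], (G, P)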
Figure-6.17: The Carry Generation unit performing the Carry group computation
Such a unit can be implemented in various ways, according to the allowed level of abstraction. In a CMOS process, 17
transistors are able to guarantee the static function (Figure 6.18). However, this design requires a careful sizing of the
transistors put in series.
The same design is available with fewer transistors in a dynamic logic design. The sizing is still an important issue, but the
number of transistors is reduced (Figure 6.19).
[Click to enlarge image]Figure-6.18: Static implementation of the 4-bit carry lookahead chain
[Click to enlarge image]Figure-6.19: Dynamic implementation of the 4-bit carry lookahead chain
To build large adders the preceding blocks are cascaded according to Figure 6.20.
[Click to enlarge image]Figure-6.20: Implementation of a 16-bit CLA adder
6.5.4 The Brent and Kung Adder
The technique used to speed up the addition is to introduce a "new" operator which combines couples of generation and
propagation signals. This "new" operator comes from a reformulation of the carry chain.
REFORMULATION OF THE CARRY CHAIN
Let an an-1 ... a1 and bn bn-1 ... b1 be n-bit binary numbers with sum sn+1 sn ... s1. The usual method for addition
computes the si’s by:
c0 = 0 (39)
ci = aibi + aici-1 + bici-1 (40)
si = ai ++ bi ++ ci-1, i = 1,...,n (41)
sn+1 = cn (42)
Where ++ means the sum modulo-2 and ci is the carry from bit position i. From the previous paragraph we can deduce
that the ci’s are given by:
c0 = 0 (43)
ci = gi + pi ci-1 (44)
gi = ai bi (45)
pi = ai ++ bi for i = 1,...., n (46)
One can explain equation (44) by saying that the carry ci is either generated by ai and bi or propagated from the previous
carry ci-1. The whole idea is now to generate the carries in parallel, so that the nth stage does not have to "wait" for the
(n-1)th carry bit to compute the global sum. To achieve this goal an operator Δ is defined.
Let Δ be defined as follows for any g, g', p and p':
(g, p) Δ (g', p') = (g + p . g', p . p') (47)
Lemma 1: Let (Gi, Pi) = (g1, p1) if i = 1 (48)
(Gi, Pi) = (gi, pi) Δ (Gi-1, Pi-1) if i in [2, n] (49)
Then ci = Gi for i = 1, 2, ..., n.
Proof: The Lemma is proved by induction on i. Since c0 = 0, (44) above gives:
c1 = g1 + p1 . 0 = g1 = G1 (50)
So the result holds for i=1. If i>1 and ci-1 = Gi-1 , then
(Gi, Pi) = (gi, pi) Δ (Gi-1, Pi-1) (51)
(Gi, Pi) = (gi, pi) Δ (ci-1, Pi-1) (52)
(Gi, Pi) = (gi + pi . ci-1, pi . Pi-1) (53)
thus Gi = gi + pi . ci-1 (54)
And from (44) we have: Gi = ci.
Lemma 2: The operator Δ is associative.
Proof: For any (g3, p3), (g2, p2), (g1, p1) we have:
[(g3, p3) Δ (g2, p2)] Δ (g1, p1) = (g3 + p3 . g2, p3 . p2) Δ (g1, p1)
= (g3 + p3 . g2 + p3 . p2 . g1, p3 . p2 . p1) (55)
and,
(g3, p3) Δ [(g2, p2) Δ (g1, p1)] = (g3, p3) Δ (g2 + p2 . g1, p2 . p1)
= (g3 + p3 . (g2 + p2 . g1), p3 . p2 . p1) (56)
One can check that the expressions (55) and (56) are equal using the distributivity of . and +.
To compute the ci's it is only necessary to compute all the (Gi, Pi)'s, but by Lemmas 1 and 2,
(Gi, Pi) = (gi, pi) Δ (gi-1, pi-1) Δ ... Δ (g1, p1) (57)
can be evaluated in any order from the given gi's and pi's. The motivation for introducing the operator Δ is to
generate the carries in parallel. The carries will be generated in a block or carry chain block, and the sum will be obtained
directly from all the carries and pi's, since we use the fact that:
si = pi ++ ci-1 for i = 1,...,n (58)
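The operator and Lemma 1 translate directly into a small Python sketch (illustrative only; a real Brent and Kung block evaluates the same prefix as a balanced tree, while this version is sequential):

    def delta(gp, gp2):
        # (g, p) Δ (g', p') = (g + p.g', p.p'), equation (47)
        g, p = gp
        g2, p2 = gp2
        return (g | (p & g2), p & p2)

    def all_carries(g, p):
        # (Gi, Pi) = (gi, pi) Δ (Gi-1, Pi-1); by Lemma 1, ci = Gi.
        G, P = g[0], p[0]
        carries = [G]
        for gi, pi in zip(g[1:], p[1:]):
            G, P = delta((gi, pi), (G, P))
            carries.append(G)
        return carries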
THE ADDER
Based on the previous reformulation of the carry computation, Brent and Kung have proposed a scheme to add two n-bit
numbers in a time proportional to log(n) and in an area proportional to n.log(n), for n >= 2. Figure 6.21 shows
how the carries are computed in parallel for 16-bit numbers.
[Click to enlarge image]Figure-6.21: The first binary tree allowing the calculation of c1, c2, c4, c8, c16.
Using this binary tree approach, only the ci's where i = 2^k (k = 0, 1, ..., log2 n) are computed. The missing ci's have to be computed
using another tree structure, but this time the root of the tree is inverted (see Figure 6.22).
In Figure 6.21 and Figure 6.22 the squares represent a Δ-cell which performs equation (47). Circles represent a duplication cell where the inputs are separated into two distinct
wires (see Figure 6.23).
When using this structure of two separate binary trees, the addition of two 16-bit numbers is performed in T = 9 stages
of cells. During this time, all the carries are computed in the time necessary to traverse two independent binary trees.
According to Burks, Goldstine and Von Neumann, the fastest way to add two operands is proportional to the logarithm of
the number of bits. Brent and Kung have achieved such a result.
[Click to enlarge image]Figure-6.22: Computation of all the carries for n = 16
[Click to enlarge image]Figure-6.23: (a) The Δ-cell, (b) the duplication cell
6.5.5 The C3i Adder
THE ALGORITHM
Let ai and bi be the digits of A and B, two n-bit numbers, with i = 1, 2, ..., n. The carries will be computed according to (59).
(59)
with: Gi = ci (60)
If we develop (59), we get:
(61)
and by introducing a parameter m <= n such that there exists q in IN with n = q.m, it is possible to obtain the couple
(Gi, Pi) by forming groups of m cells performing the intermediate operations detailed in (62) and (63).
(62)
(63)
This manner of computing the carries is strictly based on the fact that the operator Δ is associative. It shows also that the
calculation is performed sequentially, i.e. in a time proportional to the number of bits n. We will now illustrate this
analytical approach by giving a way to build an architectural layout of this new algorithm. We will proceed with a
graphical method to place the cells defined in the previous paragraph [Kowa92].
THE GRAPHICAL CONSTRUCTION
1. First build a binary tree of cells.
2. Duplicate this binary tree m times to the right (m is a power of two; see Remark 1 in the next pages if m is not a
power of two). The cell at the right of bit 1 determines the least significant bit (LSB).
3. Eliminate the cells at the right side of the LSB. Change the cells not connected to anything into duplication cells.
Eliminate all cells under the second row of cells, except the right-most group of m cells.
4. Duplicate this remaining group of m cells q times to the right, incrementing the row downwards each time. This
gives a visual representation of the delay read in Figure 6.29.
5. Shift up the q groups of cells to get a compact representation of a "floorplan".
This complete approach is illustrated in Figure 6.24, where all the steps are carefully observed. The only cells necessary
for this carry generation block to constitute a real parallel adder are the cells performing equations (45) and (46). The first
row of functions is put at the top of the structure. The second one is pasted at the bottom.
[Click to enlarge image]Figure-6.24: (a) Step1, (b) Step2, (c) Step3 and Step4, (d) Step5
At this point of the definition, two remarks have to be made about this algorithm. Both concern the parameter m
used to define it. Remark 1 specifies the case where m is not equal to 2^q (q in [0, 1, ...]), while Remark 2
deals with the case where m = n.
[Click to enlarge image]Figure-6.25: Adder where m=6. The fan-out of the 11th carry bit is highlighted
Remark 1: For m not a power of two, the algorithm is built the same way up to the very last step. The only reported
difference will concern the delay which will be equal to the next nearest power of two. This means that there is no special
interest in building such versions of these adders. The fan-out of certain cells is even increased to three, so that the electrical
behaviour will be degraded. Figure 6.25 illustrates the design of such an adder based on m=6. The fan-out of the cell of
bit 11 is three. The delay of this adder is equivalent to the delay of an adder with a duplication with m=8.
Remark 2: For m equal to the number of bits of the adder, the algorithm reaches the real theoretical limit demonstrated
by Burks, Goldstine and Von Neumann. The logarithmic time is attained using one depth of a binary tree instead of two in
the case of Brent and Kung. This particular case is illustrated in Figure 6.26. The definition of the algorithm is followed
up to Step3. Once the reproduction of the binary tree is made m times to the right, the only thing to do is to remove the
cells at the negative bit positions and the adder is finished. Mathematically, one can notice that this is the limit. We will
discuss later whether it is the best way to build an adder using m=n.
[Click to enlarge image]Figure-6.26: Adder where m=n. This constitutes the theoretical limit for the computation of the addition.
COMPARISONS
In this section, we develop a comparison between adders obtained using the new algorithm with different values of m. On
the plots of Figure 6.27 through Figure 6.29, the suffixes JK2, JK4, and JK8 will denote different adders obtained for m
equal two, four or eight. They are compared to the Brent and Kung implementation and to the theoretical limit which is
obtained when m equals n, the number of bits.
The comparison between these architectures is done according to the formalisation of a computational model described in
[Kowa93]. We clearly see that BK’s algorithm performs the addition with a delay proportional to the logarithm of the
number of bits. JK2 performs the addition in a linear time, just as JK4 or JK8. The parameter m influences the slope of the
delay: the higher m is, the longer the delay stays below the logarithmic delay of BK. We see that when one
wants to implement the addition faster than BK, there is a choice to make among different values of m. The choice will
depend on the size of the adder because it is evident that a 24-bit JK2 adder (delay = 11 stages of cells) performs worse
than BK (delay = 7 stages of cells).
On the other hand, JK8 (delay = 5 stages of cells) is very attractive. The delay is better than BK up to 57 bits, at which point
both delays are equal. Furthermore, even at equal delays (up to 73 bits) our implementation performs better in terms of
regularity, modularity and ease of construction. The strong advantage of this new algorithm compared to BK is that, for a
size of the input word which is not a power of two, the design of the cells is much easier. There is no partial binary tree to
build: the addition of a bit to the adder is the addition of a bit-slice, and this bit-slice is very compact and regular. Let us
now consider the case where m equals n (denoted by XXX in our figures). The delay of such an adder is exactly one half of
BK, and it is the lowest bound we obtain. For small adders (n < 16), the delay is very close to XXX. And it can be
demonstrated that the delays (always in terms of stages) of JK2, JK4 and JK8 are always at least equal to XXX.
This discussion took into account the two following characteristics of the computational model:
The gate of a stage computes a logical function in a constant time;
The signal is divided into two signals in constant time (this occurs especially at the output of the first stage of
cells).
And the conclusion of this discussion is that m has to be chosen as high as possible to reduce the global delay. When we
turn to the comparisons concerning the area, we will take into account the following characteristics of our computational
model:
At most two wires cross at any point;
A constant but predefined area of the gates with minimum width for the wires is used;
The computation is made in a convex planar region.
For this discussion let us consider Figure 6.28 where we represent the area of the different adders versus the number of
bits. It is obvious that for m being the smallest, the area will be the smallest as well. For m increasing up to n, we can see
that the area will still be proportional to the number of bits, following a straight line. For m equal to n the area will be
exactly one half of the BK area, with a linear variation. The slope of this variation, in both the BK and XXX cases, varies
according to the intervals [2^q, 2^(q+1)], q >= 0.
Here we could point out that the floorplan of BK could be optimised to become comparable to the one of XXX, but the
cost of such an implementation would be very high because of the irregularity of the wirings and the interconnections.
These considerations lead us to the following conclusion: to minimise the area of a new adder, m must be chosen low.
This is contradictory with the previous conclusion. That is why a very wise choice of m will be necessary, and it will
always depend on the targeted application. Finally, Figure 6.27 gives indications about the number of transistors
used to implement our different versions of adders. These calculations are based on the dynamic logic family (TSPC:
True Single Phase Clocking) described in [Kowa93]. When considering this graph, we see that BK and XXX are the two
limits of our family of adders. BK uses the smallest number of transistors, whereas XXX uses up to five times more.
The higher m is, the higher the number of transistors.
Nevertheless, we see that the area is smaller than BK. A high density is an advantage, but an overhead in transistors can
lead to higher power dissipation. This evident drawback in our algorithm is counterbalanced by the progress being made
in the VLSI area. With the shrinking of the design rules, the size of the transistors decreases as well as the size of the
interconnections. This leads to smaller power dissipation. This fact is even more pronounced when the technologies tend
to decrease the power supply from 5V to 3.3V.
In other words, the increase in the number of transistors corresponds to the redundancy we introduce in the calculations to
decrease the delay of our adders.
Now we will discuss an important characteristic of our computational model that differs from the model of Brent and
Kung:
The signal travels along a wire in a time proportional to its length.
This assumption is very important, as the following example shows. Let us consider the 16-bit BK adder (Figure 6.22)
and the 16-bit JK4 adder (Figure 6.24). The longest wire in the BK implementation will be equal to at least eight widths
of Δ-cells, whereas in the JK4 implementation the longest wire will be equal to four widths of Δ-cells. For BK, the output
capacitive load of a Δ-cell will be variable, and a variable sizing of the cell will be necessary. In our case, the parameter m
defines a fixed library of Δ-cells used in the adder. The capacitive load will always be limited to a fixed value, allowing all
cells to be sized to a fixed value.
Figure-6.27: Number of transistors versus the number of bits
Figure-6.28: Area versus the number of bits
Figure-6.29: Delay in number of stages versus the number of bits in the adder
To partially conclude this section, we say that an optimum must be defined when choosing to implement our algorithm.
This optimum will depend on the application for which the operator is to be used.
6.6 Multioperand Adders
6.6.1 General Principle
The goal is to add more than two operands at a time. This generally occurs in multiplication operations or in filtering.
6.6.2 Wallace Trees
For this purpose, Wallace trees were introduced. The addition time grows like the logarithm of the number of bits. The
simplest Wallace tree is the adder cell. More generally, an n-input Wallace tree is an operator with n inputs and log2(n)
outputs, such that the value of the output word is equal to the number of "1"s in the input word. The input bits and the least
significant bit of the output have the same weight (Figure 6.30). An important property of Wallace trees is that they may
be constructed using adder cells, and the number of adder cells needed grows only linearly with the number n of input bits.
Consequently, Wallace trees are useful whenever a large number of operands are to be added, as in
multipliers. In a Braun or Baugh-Wooley multiplier with a Ripple Carry Adder, the completion time of the multiplication
is proportional to twice the number n of bits. If the collection of the partial products is made through Wallace trees, the
time for getting the result in carry save notation should be proportional to log2(n).
[Click to enlarge image]Figure-6.30: Wallace cells made of adders
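The reduction idea can be sketched in Python (illustrative only; it reuses the carry_save_add helper from the sketch in Section 6.2.3): operands are reduced three at a time into carry save form until only two words remain.

    def wallace_reduce(operands):
        # Repeated 3:2 carry-save stages, then one final carry-propagate addition.
        ops = list(operands)
        while len(ops) > 2:
            x, y, z = ops.pop(), ops.pop(), ops.pop()
            s, c = carry_save_add(x, y, z)
            ops += [s, c]
        return ops[0] + ops[1]

    assert wallace_reduce([3, 5, 7, 9, 11]) == 35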
Figure 6.31 represents a 7-input adder: for each weight, Wallace trees are used until there remain only two bits of each
weight, so as to add them using a classical 2-input adder. When taking into account the regularity of the interconnections,
Wallace trees are the most irregular.
[Click to enlarge image]Figure-6.31: A 7-input Wallace tree
6.6.3 Overturned Stairs Trees
To circumvent this irregularity, Mou [Mou91] proposes an alternative way to build multi-operand adders. The method uses
basic cells called branch, connector or root. These basic elements (see Figure 6.32) are connected together to form n-input
trees. One has to take care of the weights of the inputs, because the weights at the input of the 18-input OS-tree
are different. The regularity of this structure is better than with Wallace trees, but the construction of multipliers is
still complex.
[Click to enlarge image]Figure-6.32: Basic cells used to build OS-trees
[Click to enlarge image]Figure-6.33: An 18-input OS-tree
6.7 Multiplication
6.7.1 Introduction
Multiplication can be considered as a series of repeated additions. The number to be added is the multiplicand, the number
of times that it is added is the multiplier, and the result is the product. Each step of the addition generates a partial product. In
most computers, the operands usually contain the same number of bits. When the operands are interpreted as integers, the
product is generally twice the length of the operands, in order to preserve the information content. This repeated addition
method suggested by the arithmetic definition is so slow that it is almost always replaced by an algorithm that makes
use of positional number representation.
It is possible to decompose multipliers into two parts. The first part is dedicated to the generation of partial products, and
the second one collects and adds them. As for adders, it is possible to enhance the intrinsic performance of multipliers.
Acting on the generation part, the Booth (or modified Booth) algorithm is often used because it reduces the number of
partial products. The collection of the partial products can then be made using a regular array, a Wallace tree or a binary
tree [Sinh89].
Figure-6.34: Partial product representation and multioperand addition
6.7.2 Booth Algorithm
This algorithm is a powerful direct algorithm for signed-number multiplication. It generates a 2n-bit product and treats
both positive and negative numbers uniformly. The idea is to reduce the number of additions to be performed. The Booth
algorithm reaches n/2 additions only in the best case, whereas the modified Booth algorithm always requires n/2 additions.
Let us consider a string of k consecutive 1s in a multiplier, from bit position i to bit position i+k-1:
position: ..., i+k, i+k-1, i+k-2, ..., i, i-1, ...
bit:      ...,  0,    1,     1,   ..., 1,  0, ...
By using the following property of binary strings:
2^(i+k) - 2^i = 2^(i+k-1) + 2^(i+k-2) + ... + 2^(i+1) + 2^i
the k consecutive 1s can be replaced by the following string:
position: ..., i+k+1, i+k, i+k-1, ..., i+1,  i,  i-1, ...
bit:      ...,   0,    1,    0,  ...,   0,  -1,   0,  ...
i.e. an addition at position i+k, k-1 consecutive 0s, and a subtraction at position i.
In fact, the modified Booth algorithm converts a signed number from the standard 2's-complement radix into a number
system where the digits are in the set {-2, -1, 0, 1, 2}. In this number system, any number may be written in several forms, so the
system is called redundant.
The coding table for the modified Booth algorithm is given in Table 6.8. The algorithm scans strings composed of three
digits; depending on the value of the string, a certain operation is performed.
A possible implementation of the Booth encoder is given in Figure 6.35. The layout of another possible structure is given
in Figure 6.36.
Yi+1 (2^1)   Yi (2^0)   Yi-1 (2^-1)   OPERATION                            M is multiplied by
0            0          0             add zero (no string)                 +0
0            0          1             add multiplic. (end of string)       +X
0            1          0             add multiplic. (a string)            +X
0            1          1             add twice the mul. (end of string)   +2X
1            0          0             sub. twice the m. (beg. of string)   -2X
1            0          1             sub. the m. (-2X and +X)             -X
1            1          0             sub. the m. (beg. of string)         -X
1            1          1             sub. zero (center of string)         -0
Table-6.8: Modified Booth coding table.
[Click to enlarge image]Figure-6.35: Booth encoder cell
Figure-6.36: Booth encoder cell (layout size: 65.70 µm2 (0.5µCMOS))
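The recoding of Table 6.8 can be expressed compactly in Python (an illustrative sketch; the function names are ours). Each digit of the set {-2, -1, 0, 1, 2} is obtained from an overlapping 3-bit string of the multiplier:

    def modified_booth_digits(y, n):
        # Digit k examines bits (y_{2k+1}, y_{2k}, y_{2k-1}), with y_{-1} = 0;
        # its value is y_{2k-1} + y_{2k} - 2*y_{2k+1}, exactly Table 6.8.
        bit = lambda i: 0 if i < 0 else (y >> i) & 1   # works for negative y too
        return [bit(k - 1) + bit(k) - 2 * bit(k + 1) for k in range(0, n, 2)]

    def booth_multiply(x, y, n):
        # Sum the n/2 partial products selected by the recoded digits.
        return sum(d * x * 4 ** k for k, d in enumerate(modified_booth_digits(y, n)))

    assert booth_multiply(7, -5, 8) == -35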
6.7.3 Serial-Parallel Multiplier
This multiplier is the simplest one: the multiplication is considered as a succession of additions.
If A = (an an-1 ... a0) and B = (bn bn-1 ... b0), the product A.B is expressed as:
A.B = A.2^n.bn + A.2^(n-1).bn-1 + ... + A.2^0.b0
The structure of Figure 6.37 is suited only for positive operands. If the operands are negative and coded in
2’s-complement :
1. The most significant bit of B has a negative weight, so a subtraction has to be performed at the last step.
2. Operand A.2^k must be written on 2N bits, so the most significant bit of A must be duplicated. It may be easier to
shift the content of the accumulator to the right instead of shifting A to the left.
[Click to enlarge image]Figure-6.37: Serial-Parallel multiplier
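For positive operands, the behaviour of this structure reduces to the shift-and-add loop below (a Python sketch for illustration; the names are ours):

    def serial_parallel_multiply(a, b, n):
        # One partial product A.2^k.b_k is accumulated per cycle.
        acc = 0
        for k in range(n):          # scan the multiplier bit by bit
            if (b >> k) & 1:
                acc += a << k       # add A shifted by k positions
        return acc

    assert serial_parallel_multiply(13, 11, 4) == 143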
6.7.4 Braun Parallel Multiplier
The simplest parallel multiplier is the Braun array. All the partial products A.bk are computed in parallel, then collected
through a cascade of Carry Save Adders. At the bottom of the array, the output is in Carry Save notation, so an
additional adder converts it (by means of a carry propagation) into the classical notation (Figure 6.38). The completion
time is limited by the depth of the carry save array, and by the carry propagation in the adder. Note that this multiplier is
only suited for positive operands. Negative operands may be multiplied using a Baugh-Wooley multiplier.
[Click to enlarge image]Figure-6.38: A 4-bit Braun Multiplier without the final adder
Figure 6.38 and Figure 6.40 use the symbols given in Figure 6.39, where CMUL1 and CMUL2 are two generic cells
consisting of an adder without the final inverter and with one input connected to an AND or NAND gate. A non-optimised
(in terms of transistors) multiplier would consist only of adder cells connected one to another, with AND gates generating
the partial products. In these examples, the inverters at the output of the adders have been eliminated and the parity of the
bits has been compensated by the use of CMUL1 or CMUL2.
Figure-6.40: An 8-bit Braun Multiplier without the final adder
6.7.5 Baugh-Wooley Multiplier
This technique has been developed in order to design regular multipliers, suited for 2’s-complement numbers.
Let us consider 2 numbers A and B :
A = -an-1.2^(n-1) + Σi=0..n-2 ai.2^i (64)
B = -bn-1.2^(n-1) + Σi=0..n-2 bi.2^i (65)
The product A.B is given by the following equation :
(66)
We see that subtractor cells must be used. In order to use only adder cells, the negative terms may be rewritten as :
(67)
By this way, A.B becomes :
(68)
The final equation is :
(69)
because :
(70)
A and B are n-bit operands, so their product is a 2n-bit number. Consequently, the most significant weight is 2^(2n-1), and
the first term -2^(2n-1) is taken into account by adding a 1 in the most significant cell of the multiplier.
[Click to enlarge image]Figure-6.41: A 4-bit Baugh-Wooley Multiplier with the final adder
6.7.6 Dadda Multiplier
The advantage of this method is the higher regularity of the array. Signed integers can be processed. The cost for this
regularity is the addition of an extra column of adders.
[Click to enlarge image]Figure-6.42: A 4-bit Dadda Multiplier with the final adder
6.7.7 Mou's Multiplier
In Figure 6.43 the scheme using OS-trees is applied to a 4-bit multiplier. The partial product generation is done according to
the Dadda multiplication. Figure 6.44 represents the OS-tree structure used in a 16-bit multiplier. Although the author claims
a better regularity, his scheme does not allow an easy pipelining.
[Click to enlarge image]Figure-6.43: A 4-bit OS-tree Multiplier with a final adder
[Click to enlarge image]Figure-6.44: A 16-bit OS-tree Multiplier without a final adder and without the partial product cells
6.7.8 Logarithmic Multiplier
The objective of this circuit is to compute the product of two terms. The property used is the following equation :
Log(A * B) = Log (A) + Log (B) (71)
There are several ways to obtain the logarithm of a number: look-up tables, recursive algorithms, or the segmentation of
the logarithmic curve [Hoef91]. The segmentation method: the basic idea is to approximate the logarithm curve with a
set of linear segments.
If y = Log2(x) (72)
an approximation of this value on the segment ]2^n, 2^(n+1)[ can be made using the following equation:
y = a.x + b = (Δy/Δx).x + b = [1/(2^(n+1) - 2^n)].x + (n-1) = 2^(-n).x + (n-1) (73)
What is the hardware interpretation of this formula?
If we take xi = (xi7, xi6, xi5, xi4, xi3, xi2, xi1, xi0), an integer coded with 8 bits, its logarithm is obtained as follows.
The fractional part of the logarithm is obtained by shifting xi n positions to the right, and the integer part is the
position where the MSB occurs.
For instance, if xi is (0,0,1,0,1,1,1,0) = 46, the integer part of the logarithm is 5 because the MSB is xi5, and the fractional
part is 01110. So the logarithm of xi equals 101.01110 = 5.4375, because 01110 is 14 out of a possible 32, and 14/32 =
0.4375.
Table 6.9 illustrates this coding. Once the coding of the two input words has been performed, the addition of the two
logarithms can be done. The last operation to be performed is the antilogarithm of the sum, to obtain the value of the final
product.
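The segmentation rule of equation (73) is easy to emulate (a Python sketch for illustration; the names are ours):

    def segment_log2(x):
        # Integer part = position of the MSB; fractional part = the remaining bits.
        n = x.bit_length() - 1
        return n + (x - (1 << n)) / (1 << n)

    # 46 = 0b101110: MSB at position 5, remaining bits 01110 -> 14/32
    assert segment_log2(46) == 5.4375

    def approx_product(a, b):
        # Multiply through the logarithmic domain: antilog(log a + log b).
        return 2 ** (segment_log2(a) + segment_log2(b))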
Using this method, an 11.6% error on the product of two binary operands (i.e. the sum of two logarithmic numbers) occurs.
We would like to reduce this error without increasing the complexity of the operation nor that of the operator.
Since the transformations used in this system are logarithms and antilogarithms, it is natural to think that the complexity of
the correction systems will grow exponentially if the error approaches zero. We analyze the error to derive an easy and
effective way to increase the accuracy of the result.
Table-6.9: Coding of the binary logarithm according to the segmentation method
Figure 6.45 describes the architecture of the logarithmic multiplier with the different variables used in the system.
[Click to enlarge image]Figure-6.45: Block diagram of a logarithmic multiplier
Error analysis: Let us define the different functions used in this system.
5/1/2007 10:53 AM
Design of VLSI Systems
31 of 34
file:///D:/ACADEMIC%20DOCUMENTS/eBOOKs/vlsi%20design/Des...
The logarithm and antilogarithm curves are approximated by linear segments. The segments start at power-of-two values
and end at the next power-of-two value. Figure 6.46 shows how a logarithm is approximated. The same is
true for the antilogarithm.
[Click to enlarge image]Figure-6.46: Approximated value of the logarithm compared to the exact logarithm
By adding the unique value 17.2^(-8) to the two logarithms, the maximum error comes down from 11.6% to 7.0%, an
improvement of 40% compared with a system without any correction. The only cost is the replacement
of the internal two-input adder by a three-input adder.
A more complex correction system which leads to better precision but at a much higher hardware cost is possible.
In Table 6.10 we suggest a system which chooses one correction among three, depending on the value of the input
bits. Table 6.10 can be read as the values of the logarithms obtained after the coder for either a1 or a2. The penultimate
column represents the ideal correction which should be added to get 100% accuracy. The last column gives the correction
chosen among three possibilities: 32, 16 or 0.
Three decoding functions have to be implemented for this proposal. If the exclusive-OR of a-2 and a-3 is true, then the
added value is 32.2^(-8). If all the bits of the fractional part are zero, then the added value is zero. In all other cases the added
value is 16.2^(-8).
This decreases the average error. But the drawback is that the maximum error will be minimized only if the steps between
two ideal corrections are bigger than the unity step. To minimize the maximum error the correcting functions should
increase in an exponential way. Further research could be performed in this area.
Table-6.10: A more complex correction scheme
6.8 Addition and Multiplication in Galois Fields, GF(2n)
Group theory is used to introduce another algebraic system, called a field. A field is a set of elements in which we can
do addition, subtraction, multiplication and division without leaving the set. Addition and multiplication must satisfy the
commutative, associative and distributive laws. A formal definition of a field is given below.
Definition
Let F be a set of elements on which two binary operations, called addition "+" and multiplication ".", are defined. The set F
together with the two binary operations + and . is a field if the following conditions are satisfied:
1. F is a commutative group under addition +. The identity element with respect to addition is called the zero element
or the additive identity of F and is denoted by 0.
2. The set of nonzero elements in F is a commutative group under multiplication. The identity element with respect
to multiplication is called the unit element or the multiplicative identity of F and is denoted by 1.
3. Multiplication is distributive over addition; that is, for any three elements, a, b, c in F:
a . ( b + c ) = a . b + a . c
The number of elements in a field is called the order of the field.
A field with a finite number of elements is called a finite field.
Let us consider the set {0,1} together with modulo-2 addition and multiplication. We can easily check that {0,1} is
a field of two elements under modulo-2 addition and modulo-2 multiplication. This field is called a binary field and is denoted
by GF(2).
The binary field GF(2) plays an important role in coding theory [Rao74] and is widely used in digital computers and data
transmission or storage systems.
Another example, using the residue number system [Garn59], is given below. Table 6.11 represents the values of N from 0 to 29
with their representation according to the residues of the base (5, 3, 2). The addition and multiplication of two terms in this
base can be performed according to the next example:
Table-6.11: N varying from 0 to 29 and its representation in the residue number system
The most interesting property of these systems is that there is no carry propagation between the digits. This can be attractive
when implementing these operators in VLSI.
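A short Python sketch makes the carry-free property concrete (illustrative only; it follows the base (5, 3, 2) of Table 6.11):

    MODULI = (5, 3, 2)   # range of unique representation: 0..29

    def to_rns(n):
        # One residue digit per modulus; no carries ever connect the digits.
        return tuple(n % m for m in MODULI)

    def rns_add(x, y):
        return tuple((a + b) % m for a, b, m in zip(x, y, MODULI))

    def rns_mul(x, y):
        return tuple((a * b) % m for a, b, m in zip(x, y, MODULI))

    # 7 + 5 = 12 and 7 * 4 = 28, each digit computed independently
    assert rns_add(to_rns(7), to_rns(5)) == to_rns(12)
    assert rns_mul(to_rns(7), to_rns(4)) == to_rns(28)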
References
[Aviz61] A. Avizienis, "Signed-Digit Number Representations for Fast Parallel Arithmetic," IRE Trans. Electron. Comput., Vol. EC-10, pp. 389-400, 1961.
[Cava83] J. J. F. Cavanagh, Digital Computer Arithmetic, McGraw-Hill Computer Science Series, 1983.
[Garn59] H. L. Garner, "The Residue Number System," IRE Trans. Electron. Comput., pp. 140-147, September 1959.
[Hoef91] B. Hoefflinger, M. Selzer, F. Warkowski, "Digital Logarithmic CMOS Multiplier for Very-High-Speed Signal Processing," in Proc. IEEE Custom Integrated Circuits Conference, 1991, pp. 16.7.1-16.7.5.
[Kowa92] J. Kowalczuk and D. Mlynek, "Un nouvel algorithme de génération d'additionneurs rapides dédiés au traitement d'images," Proc. of the Industrial Automation Conference, pp. 20.9-20.13, Montreal, Québec, Canada, June 1992.
[Kowa93] J. Kowalczuk, On the Design and Implementation of Algorithms for Multimedia Systems, PhD Thesis No. 1188, Swiss Federal Institute of Technology, Lausanne, 1994.
[Mull82] J.-M. Muller, Arithmétique des ordinateurs, opérateurs et fonctions élémentaires, Masson, 1989.
[Rao74] T. R. N. Rao, Error Coding for Arithmetic Processors, New York, Academic Press, 1974.
[Sinh89] B. P. Sinha and P. K. Srimani, "Fast Parallel Algorithms for Binary Multiplication and Their Implementation on Systolic Architectures," IEEE Transactions on Computers, Vol. 38, No. 3, pp. 424-431, March 1989.
[Taka87] Y. Harata, Y. Nakamura, H. Nagase, M. Takigawa and N. Takagi, "A High-Speed Multiplier Using a Redundant Binary Adder Tree," IEEE Journal of Solid-State Circuits, Vol. SC-22, No. 1, pp. 28-34, February 1987.
This chapter edited by D. Mlynek
Chapter 7
LOW-POWER VLSI CIRCUITS AND SYSTEMS
Introduction
Overview of Power Consumption
Low-Power Design Through Voltage Scaling
Estimation and Optimization of Switching Activity
Reduction of Switched Capacitance
Adiabatic Logic Circuits
7.1 Introduction
The increasing prominence of portable systems and the need to limit power consumption (and hence, heat
dissipation) in very-high density ULSI chips have led to rapid and innovative developments in low-power design
in recent years. The driving forces behind these developments are portable applications requiring low power
dissipation and high throughput, such as notebook computers, portable communication devices and personal digital
assistants (PDAs). In most of these cases, the requirements of low power consumption must be met along with
equally demanding goals of high chip density and high throughput. Hence, low-power design of digital integrated
circuits has emerged as a very active and rapidly developing field of CMOS design.
The limited battery lifetime typically imposes very strict demands on the overall power consumption of the portable
system. Although new rechargeable battery types such as Nickel-Metal Hydride (NiMH) are being developed with
higher energy capacity than that of the conventional Nickel-Cadmium (NiCd) batteries, a revolutionary increase in
energy capacity is not expected in the near future. The energy density (amount of energy stored per unit weight)
offered by the new battery technologies (e.g., NiMH) is about 30 Watt-hour/pound, which is still low in view of the
expanding applications of portable systems. Therefore, reducing the power dissipation of integrated circuits through
design improvements is a major challenge in portable systems design.
The need for low-power design is also becoming a major issue in high-performance digital systems, such as
microprocessors, digital signal processors (DSPs) and other applications. Increasing chip density and higher
operating speed lead to the design of very complex chips with high clock frequencies. Typically, the power
dissipation of the chip, and thus, the temperature, increase linearly with the clock frequency. Since the dissipated
heat must be removed effectively to keep the chip temperature at an acceptable level, the cost of packaging, cooling
and heat removal becomes a significant factor. Several high-performance microprocessor chips designed in the early
1990s (e.g., Intel Pentium, DEC Alpha, PowerPC) operate at clock frequencies in the range of 100 to 300 MHz, and
their typical power consumption is between 20 and 50 W.
ULSI reliability is yet another concern which points to the need for low-power design. There is a close correlation
between the peak power dissipation of digital circuits and reliability problems such as electromigration and
hot-carrier induced device degradation. Also, the thermal stress caused by heat dissipation on chip is a major
reliability concern. Consequently, the reduction of power consumption is also crucial for reliability enhancement.
The methodologies which are used to achieve low power consumption in digital systems span a wide range, from
device/process level to algorithm level. Device characteristics (e.g., threshold voltage), device geometries and
interconnect properties are significant factors in lowering the power consumption. Circuit-level measures such as the
proper choice of circuit design styles, reduction of the voltage swing and clocking strategies can be used to reduce
power dissipation at the transistor level. Architecture-level measures include smart power management of various
system blocks, utilization of pipelining and parallelism, and design of bus structures. Finally, the power consumed by
the system can be reduced by a proper selection of the data processing algorithms, specifically to minimize the
number of switching events for a given task.
In this chapter, we will primarily concentrate on the circuit- or transistor-level design measures which can be applied
to reduce the power dissipation of digital integrated circuits. Various sources of power consumption will be
discussed in detail, and design strategies will be introduced to reduce the power dissipation. The concept of adiabatic
logic will be given a special emphasis since it emerges as a very effective means for reducing the power
consumption.
7.2 Overview of Power Consumption
In the following, we will examine the various sources (components) of time-averaged power consumption in CMOS
circuits. The average power consumption in conventional CMOS digital circuits can be expressed as the sum of three
main components, namely, (1) the dynamic (switching) power consumption, (2) the short-circuit power
consumption, and (3) the leakage power consumption. If the system or chip includes circuits other than conventional
CMOS gates that have continuous current paths between the power supply and the ground, a fourth (static) power
component should also be considered. We will limit our discussion to the conventional static and dynamic CMOS
logic circuits.
Switching Power Dissipation
This component represents the power dissipated during a switching event, i.e., when the output node voltage of a
CMOS logic gate makes a power consuming transition. In digital CMOS circuits, dynamic power is dissipated when
energy is drawn from the power supply to charge up the output node capacitance. During the charge-up phase, the
output node voltage typically makes a full transition from 0 to VDD, and the energy used for the transition is
relatively independent of the function performed by the circuit. To illustrate the dynamic power dissipation during
switching, consider the circuit example given in Fig. 7.1. Here, a two-input NOR gate drives two NAND gates,
through interconnection lines. The total capacitive load at the output of the NOR gate consists of (1) the output
capacitance of the gate itself, (2) the total interconnect capacitance, and (3) the input capacitances of the driven gates.
Figure-7.1: A NOR gate driving two NAND gates through interconnection lines.
The output capacitance of the gate consists mainly of the junction parasitic capacitances, which are due to the drain
diffusion regions of the MOS transistors in the circuit. The important aspect to emphasize here is that the amount of
capacitance is approximately a linear function of the junction area. Consequently, the size of the total drain diffusion
area dictates the amount of parasitic capacitance. The interconnect lines between the gates contribute to the second
component of the total capacitance. The estimation of parasitic interconnect capacitance was discussed thoroughly in
Chapter 4. Note that especially in sub-micron technologies, the interconnect capacitance can become the dominant
component, compared to the transistor-related capacitances. Finally, the input capacitances are mainly due to gate
oxide capacitances of the transistors connected to the input terminal. Again, the amount of the gate oxide capacitance
is determined primarily by the gate area of each transistor.
Figure-7.2: Generic representation of a CMOS logic gate for switching power calculation
Any CMOS logic gate making an output voltage transition can thus be represented by its nMOS network, pMOS
network, and the total load capacitance connected to its output node, as seen in Fig. 7.2. The average power
dissipation of the CMOS logic gate, driven by a periodic input voltage waveform with ideally zero rise- and
fall-times, can be calculated from the energy required to charge up the output node to VDD and charge down the
total output load capacitance to ground level.
$$P_{avg} = \frac{1}{T}\left[\int_0^{T/2} v_{out}\,C_{load}\,\frac{dv_{out}}{dt}\,dt + \int_{T/2}^{T}\left(V_{DD}-v_{out}\right)C_{load}\left(-\frac{dv_{out}}{dt}\right)dt\right] \quad (7.1)$$
Evaluating this integral yields the well-known expression for the average dynamic (switching) power consumption in
CMOS logic circuits.
$$P_{avg} = \frac{C_{load}\,V_{DD}^2}{T} \quad (7.2)$$
or
$$P_{avg} = C_{load}\,V_{DD}^2\,f_{CLK} \quad (7.3)$$
Note that the average switching power dissipation of a CMOS gate is essentially independent of all transistor
characteristics and transistor sizes. Hence, given an input pattern, the switching delay times have no relevance to the
amount of power consumption during the switching events as long as the output voltage swing is between 0 and
VDD.
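To make the relationship in (7.3) concrete, the short Python sketch below evaluates the switching power of a single gate; the capacitance, supply, and clock values are illustrative assumptions, not taken from the text.

```python
def switching_power(c_load, v_dd, f_clk):
    """Average switching power of a CMOS gate, Eq. (7.3):
    P_avg = C_load * VDD^2 * f_CLK.
    Assumes one full 0-to-VDD output transition per clock cycle."""
    return c_load * v_dd ** 2 * f_clk

# Hypothetical example: 50 fF load, 3.3 V supply, 100 MHz clock.
print(switching_power(50e-15, 3.3, 100e6))  # ~5.4e-05 W, i.e. ~54 uW
```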
Equation (7.3) shows that the average dynamic power dissipation is proportional to the square of the power supply
voltage, hence, any reduction of VDD will significantly reduce the power consumption. Another way to limit the
dynamic power dissipation of a CMOS logic gate is to reduce the amount of switched capacitance at the output. This
issue will be discussed in more detail later. First, let us briefly examine the effect of reducing the power supply
voltage VDD upon switching power consumption and dynamic performance of the gate.
Although the reduction of power supply voltage significantly reduces the dynamic power dissipation, the inevitable
design trade-off is the increase of delay. This can be seen by examining the following propagation delay expressions
for the CMOS inverter circuit.
$$\tau_{PHL} = \frac{C_{load}}{k_n\,(V_{DD}-V_{T,n})}\left[\frac{2\,V_{T,n}}{V_{DD}-V_{T,n}} + \ln\!\left(\frac{4\,(V_{DD}-V_{T,n})}{V_{DD}}-1\right)\right]$$
$$\tau_{PLH} = \frac{C_{load}}{k_p\,(V_{DD}-|V_{T,p}|)}\left[\frac{2\,|V_{T,p}|}{V_{DD}-|V_{T,p}|} + \ln\!\left(\frac{4\,(V_{DD}-|V_{T,p}|)}{V_{DD}}-1\right)\right] \quad (7.4)$$
Assuming that the power supply voltage is being scaled down while all other variables are kept constant, it can be
seen that the propagation delay time will increase. Figure 7.3 shows the normalized variation of the delay as a
function of VDD, where the threshold voltages of the nMOS and the pMOS transistor are assumed to be constant,
VT,n = 0.8 V and VT,p = - 0.8 V, respectively. The normalized variation of the average switching power dissipation
as a function of the supply voltage is also shown on the same plot.
Figure-7.3:
Normalized propagation delay and average switching power dissipation of a CMOS inverter, as a function of the
power supply voltage VDD.
Notice that the dependence of circuit speed on the power supply voltage may also influence the relationship between
the dynamic power dissipation and the supply voltage. Equation (7.3) suggests a quadratic improvement (reduction)
of power consumption as the power supply voltage is reduced. However, this interpretation assumes that the
switching frequency (i.e., the number of switching events per unit time) remains constant. If the circuit is always
operated at the maximum frequency allowed by its propagation delay, on the other hand, the number of switching
events per unit time (i.e., the operating frequency) will obviously drop as the propagation delay becomes larger with
the reduction of the power supply voltage. The net result is that the dependence of switching power dissipation on
the power supply voltage becomes stronger than a simple quadratic relationship, as shown in Fig. 7.3.
The analysis of switching power dissipation presented above is based on the assumption that the output node of a
CMOS gate undergoes one power-consuming transition (0-to-VDD transition) in each clock cycle. This assumption,
however, is not always correct; the node transition rate can be smaller than the clock rate, depending on the circuit
topology, logic style and the input signal statistics. To better represent this behavior, we will introduce aT (node
transition factor), which is the effective number of power-consuming voltage transitions experienced per clock cycle.
Then, the average switching power consumption becomes
$$P_{avg} = a_T\,C_{load}\,V_{DD}^2\,f_{CLK} \quad (7.5)$$
The estimation of switching activity and various measures to reduce its rate will be discussed in detail in Section 7.4.
Note that in most complex CMOS logic gates, a number of internal circuit nodes also make full or partial voltage
transitions during switching. Since there is a parasitic node capacitance associated with each internal node, these
internal transitions contribute to the overall power dissipation of the circuit. In fact, an internal node may undergo
several transitions while the output node voltage of the circuit remains unchanged, as illustrated in Fig. 7.4.
Figure-7.4:
Switching of the internal node in a two-input NOR gate results in dynamic power dissipation even if the output node
voltage remains unchanged.
In the most general case, the internal node voltage transitions can also be partial transitions, i.e., the node voltage
swing may be only Vi which is smaller than the full voltage swing of VDD. Taking this possibility into account, the
generalized expression for the average switching power dissipation can be written as
$$P_{avg} = V_{DD}\,f_{CLK}\sum_i a_{Ti}\,C_i\,V_i \quad (7.6)$$
where Ci represents the parasitic capacitance associated with each node and aTi represents the corresponding node
transition factor associated with that node.
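The generalized sum in (7.6) is straightforward to evaluate numerically. The sketch below is a minimal illustration, assuming a hypothetical list of (a_Ti, C_i, V_i) triples for the circuit nodes.

```python
def total_switching_power(nodes, v_dd, f_clk):
    """Generalized switching power, Eq. (7.6):
    P_avg = f_CLK * VDD * sum_i(a_Ti * C_i * V_i),
    summing over all nodes i with capacitance C_i, voltage swing V_i
    (possibly partial, V_i <= VDD) and transition factor a_Ti."""
    return f_clk * v_dd * sum(a_t * c_i * v_i for a_t, c_i, v_i in nodes)

# Hypothetical node list: (a_Ti, C_i [F], V_i [V]).
nodes = [
    (0.5, 50e-15, 3.3),  # output node, full swing
    (1.2, 10e-15, 2.5),  # internal node, partial swing
]
print(total_switching_power(nodes, v_dd=3.3, f_clk=100e6))
```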
Short-Circuit Power Dissipation
The switching power dissipation examined above is purely due to the energy required to charge up the parasitic
capacitances in the circuit, and the switching power is independent of the rise and fall times of the input signals. Yet,
if a CMOS inverter (or a logic gate) is driven with input voltage waveforms with finite rise and fall times, both the
nMOS and the pMOS transistors in the circuit may conduct simultaneously for a short amount of time during
switching, forming a direct current path between the power supply and the ground, as shown in Fig. 7.5.
The current component which passes through both the nMOS and the pMOS devices during switching does not
contribute to the charging of the capacitances in the circuit, and hence, it is called the short-circuit current
component. This component is especially prevalent if the output load capacitance is small, and/or if the input signal
rise and fall times are large, as seen in Fig. 7.5. Here, the input/output voltage waveforms and the components of the
current drawn from the power supply are illustrated for a symmetrical CMOS inverter with small capacitive load.
The nMOS transistor in the circuit starts conducting when the rising input voltage exceeds the threshold voltage
VT,n. The pMOS transistor remains on until the input reaches the voltage level (VDD - |VT,p|). Thus, there is a time
window during which both transistors are turned on. As the output capacitance is discharged through the nMOS
transistor, the output voltage starts to fall. The drain-to-source voltage drop of the pMOS transistor becomes
nonzero, which allows the pMOS transistor to conduct as well. The short circuit current is terminated when the input
voltage transition is completed and the pMOS transistor is turned off. A similar event is responsible for the short-circuit current component during the falling input transition, when the output voltage starts rising while both
transistors are on.
Note that the magnitude of the short-circuit current component will be approximately the same during both the
rising-input transition and the falling-input transition, assuming that the inverter is symmetrical and the input rise and
fall times are identical. The pMOS transistor also conducts the current which is needed to charge up the small output
load capacitance, but only during the falling-input transition (the output capacitance is discharged through the nMOS
device during the rising-input transition). This current component, which is responsible for the switching power
dissipation of the circuit (current component to charge up the load capacitance), is also shown in Fig. 7.5. The
average of both of these current components determines the total amount of power drawn from the supply.
For a simple analysis consider a symmetric CMOS inverter with k = kn = kp and VT = VT,n = |VT,p|, and with a
very small capacitive load. If the inverter is driven with an input voltage waveform with equal rise and fall times
(τ = τrise = τfall), it can be derived that the time-averaged short-circuit current drawn from the power supply is
$$I_{avg} = \frac{k}{12}\,\frac{(V_{DD}-2V_T)^3}{V_{DD}}\cdot\frac{\tau}{T} \quad (7.7)$$
Figure-7.5:
Input-output voltage waveforms, the supply current used to charge up the load capacitance and the short-circuit
current in a CMOS inverter with small capacitive load. The total current drawn from the power supply is the sum of
both current components.
Hence, the short-circuit power dissipation becomes
$$P_{avg,SC} = I_{avg}\,V_{DD} = \frac{k}{12}\,(V_{DD}-2V_T)^3\,\tau\,f_{CLK} \quad (7.8)$$
Note that the short-circuit power dissipation is linearly proportional to the input signal rise and fall times, and also to
the transconductance of the transistors. Hence, reducing the input transition times will obviously decrease the
short-circuit current component.
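As a rough numerical illustration of (7.7) and (7.8) as reconstructed above, the following sketch evaluates the short-circuit power of a symmetric inverter; the transconductance, voltage, and timing values are hypothetical.

```python
def short_circuit_power(k, v_dd, v_t, tau, f_clk):
    """Short-circuit power of a symmetric inverter with negligible load,
    Eq. (7.8): P = (k/12) * (VDD - 2*VT)^3 * tau * f_CLK,
    where tau is the (equal) input rise/fall time."""
    return (k / 12.0) * (v_dd - 2.0 * v_t) ** 3 * tau * f_clk

# Hypothetical values: k = 100 uA/V^2, VDD = 3.3 V, VT = 0.8 V,
# 1 ns input transitions, 100 MHz clock.
print(short_circuit_power(100e-6, 3.3, 0.8, 1e-9, 100e6))  # ~4 uW
```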
Now consider the same CMOS inverter with a larger output load capacitance and smaller input transition times.
During the rising input transition, the output voltage will effectively remain at VDD until the input voltage completes
its swing, and the output will start to drop only after the input has reached its final value.
Figure-7.6:
Input-output voltage waveforms, the supply current used to charge up the load capacitance and the short-circuit
current in a CMOS inverter with larger capacitive load and smaller input transition times. The total current drawn
from the power supply is approximately equal to the charge-up current.
Although both the nMOS and the pMOS transistors are on simultaneously during the transition, the
pMOS transistor cannot conduct a significant amount of current since the voltage drop between its source and drain
terminals is approximately equal to zero. Similarly, the output voltage will remain approximately equal to 0 V during
a falling input transition and it will start to rise only after the input voltage completes its swing. Again, both
transistors will be on simultaneously during the input voltage transition, yet the nMOS transistor will not be able to
conduct a significant amount of current since its drain-to-source voltage is approximately equal to zero. This
situation is illustrated in Fig. 7.6, which shows the simulated input and output voltage waveforms of the inverter as
well as the short-circuit and dynamic current components drawn from the power supply. Notice that the peak value
of the supply current to charge up the output load capacitance is larger in this case. The reason for this is that the
pMOS transistor remains in saturation during the entire input transition, as opposed to the previous case shown in
Fig. 7.5 where the transistor leaves the saturation region before the input transition is completed.
The discussion concerning the magnitude of the short-circuit current may suggest that the short-circuit power
dissipation can be reduced by making the output voltage transition times larger and/or by making the input voltage
transition times smaller. Yet this goal should be balanced carefully against other performance goals such as
propagation delay, and the reduction of the short-circuit current should be considered as one of the many design
requirements that must be satisfied by the designer.
Leakage Power Dissipation
The nMOS and pMOS transistors used in a CMOS logic gate generally have nonzero reverse leakage and
subthreshold currents. In a CMOS VLSI chip containing a very large number of transistors, these currents can
contribute to the overall power dissipation even when the transistors are not undergoing any switching event. The
magnitude of the leakage currents is determined mainly by the processing parameters.
Of the two main leakage current components found in a MOSFET, the reverse diode leakage occurs when the
pn-junction between the drain and the bulk of the transistor is reversely biased. The reverse-biased drain junction
then conducts a reverse saturation current which is eventually drawn from the power supply. Consider a CMOS
inverter with a high input voltage, where the nMOS transistor is turned on and the output node voltage is discharged
to zero. Although the pMOS transistor is turned off, there will be a reverse potential difference of VDD between its
drain and the n-well, causing a diode leakage through the drain junction. The n-well region of the pMOS transistor is
also reverse-biased with VDD, with respect to the p-type substrate. Therefore, another significant leakage current
component exists due to the n-well junction (Fig. 7.7).
Figure-7.7: Reverse leakage current paths in a CMOS inverter with high input voltage.
A similar situation can be observed when the input voltage is equal to zero, and the output voltage is charged up to
VDD through the pMOS transistor. Then, the reverse potential difference between the nMOS drain region and the
p-type substrate causes a reverse leakage current which is also drawn from the power supply (through the pMOS
transistor).
The magnitude of the reverse leakage current of a pn-junction is given by the following expression
$$I_{reverse} = A\,J_S\left(1 - e^{-qV_{bias}/kT}\right) \approx A\,J_S \quad (7.9)$$
where Vbias is the magnitude of the reverse bias voltage across the junction, JS is the reverse saturation current
density and A is the junction area. The typical magnitude of the reverse saturation current density is 1 - 5
pA/mm2, and it increases quite significantly with temperature. Note that the reverse leakage occurs even during the
stand-by operation when no switching takes place. Hence, the power dissipation due to this mechanism can be
significant in a large chip containing several million transistors.
Another component of leakage current in CMOS circuits is the subthreshold current, which is due to
carrier diffusion between the source and the drain region of the transistor in weak inversion. An MOS transistor in
the subthreshold operating region behaves similar to a bipolar device and the subthreshold current exhibits an
exponential dependence on the gate voltage. The amount of the subthreshold current may become significant when
the gate-to-source voltage is smaller than, but very close to the threshold voltage of the device. In this case, the
power dissipation due to subthreshold leakage can become comparable in magnitude to the switching power
dissipation of the circuit. The subthreshold leakage current is illustrated in Fig. 7.8.
Figure-7.8: Subthreshold leakage current path in a CMOS inverter with high input voltage.
Note that the subthreshold leakage current also occurs when there is no switching activity in the circuit, and this
component must be carefully considered for estimating the total power dissipation in the stand-by operation mode.
The subthreshold current expression is given below, in order to illustrate the exponential dependence of the current
on terminal voltages.
$$I_{subthreshold} = I_0\,e^{\,q(V_{GS}-V_T)/nkT}\left(1 - e^{-qV_{DS}/kT}\right) \quad (7.10)$$
One relatively simple measure to limit the subthreshold current component is to avoid very low threshold voltages,
so that the VGS of the nMOS transistor remains safely below VT,n when the input is logic zero, and the |VGS| of the
pMOS transistor remains safely below |VT,p| when the input is logic one.
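Both leakage mechanisms can be sketched numerically as follows; the saturation current I_0, the slope factor n, and all numeric inputs are illustrative assumptions rather than values from the text.

```python
import math

K_B_OVER_Q = 1.380649e-23 / 1.602176634e-19  # Boltzmann constant over charge

def reverse_diode_leakage(j_s, area, v_bias, temp_k=300.0):
    """Reverse-bias junction leakage, cf. Eq. (7.9): the current
    saturates near A * J_S once V_bias greatly exceeds kT/q."""
    v_th = K_B_OVER_Q * temp_k  # thermal voltage, ~26 mV at 300 K
    return area * j_s * (1.0 - math.exp(-v_bias / v_th))

def subthreshold_current(i_0, v_gs, v_t, n=1.5, temp_k=300.0):
    """Subthreshold current, cf. Eq. (7.10): exponential in (VGS - VT).
    i_0 and the slope factor n are process-dependent assumptions."""
    v_th = K_B_OVER_Q * temp_k
    return i_0 * math.exp((v_gs - v_t) / (n * v_th))

print(reverse_diode_leakage(j_s=2e-12, area=100.0, v_bias=3.3))
print(subthreshold_current(i_0=1e-7, v_gs=0.0, v_t=0.4))
```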
In addition to the three major sources of power consumption in CMOS digital integrated circuits discussed here,
some chips may also contain components or circuits which actually consume static power. One example is the
pseudo-nMOS logic circuits which utilize a pMOS transistor as the pull-up device. The presence of such circuit
blocks should also be taken into account when estimating the overall power dissipation of a complex system.
7.3 Low-Power Design Through Voltage Scaling
The switching power dissipation in CMOS digital integrated circuits is a strong function of the power supply
voltage. Therefore, reduction of VDD emerges as a very effective means of limiting the power consumption. Given a
certain technology, the circuit designer may utilize on-chip DC-DC converters and/or separate power pins to achieve
this goal. As we have already discussed briefly in Section 7.2, however, the savings in power dissipation comes at a
significant cost in terms of increased circuit delay. When considering drastic reduction of the power supply voltage
below the new standard of 3.3 V, the issue of time-domain performance should also be addressed carefully. In the
following, we will examine reduction of the power supply voltage with a corresponding scaling of threshold
voltages, in order to compensate for the speed degradation. At the system level, architectural measures such as the
use of parallel processing blocks and/or pipelining techniques also offer very feasible alternatives for maintaining the
system performance (throughput) despite aggressive reduction of the power supply voltage.
The propagation delay expression (7.4) clearly shows that the negative effect of reducing the power supply voltage
upon delay can be compensated for, if the threshold voltage of the transistor is scaled down accordingly. However,
this approach is limited by the fact that the threshold voltage cannot be scaled to the same extent as the supply
voltage. When scaled linearly, reduced threshold voltages allow the circuit to produce the same speed-performance
at a lower VDD. Figure 7.9 shows the variation of the propagation delay of a CMOS inverter as a function of the
power supply voltage, and for different threshold voltage values.
Figure-7.9:
Variation of the normalized propagation delay of a CMOS inverter, as a function of the power supply voltage VDD
and the threshold voltage VT.
We can see, for example, that reducing the threshold voltage from 0.8 V to 0.2 V can improve the delay at VDD = 2
V by a factor of 2. The influence of threshold voltage reduction upon propagation delay is especially pronounced at
low power supply voltages. It should be noted, however, that the threshold voltage reduction approach is restricted
by concerns about noise margins and subthreshold conduction. Smaller threshold voltages lead to smaller noise
margins for the CMOS logic gates. The subthreshold conduction current also sets a severe limitation against
reducing the threshold voltage. For threshold voltages smaller than 0.2 V, leakage power dissipation due to
subthreshold conduction may become a very significant component of the overall power consumption.
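The trade-off in Fig. 7.9 can be approximated with the first-order proportionality t_p ∝ VDD/(VDD − VT)^2 implied by (7.4); the sketch below is a simplification under that assumption and reproduces the roughly two-fold speed-up quoted in the text.

```python
def normalized_delay(v_dd, v_t, v_dd_ref=5.0, v_t_ref=0.8):
    """First-order delay trend implied by Eq. (7.4):
    t_p is roughly proportional to VDD / (VDD - VT)^2.
    The result is normalized to the (v_dd_ref, v_t_ref) design point."""
    trend = lambda v, vt: v / (v - vt) ** 2
    return trend(v_dd, v_t) / trend(v_dd_ref, v_t_ref)

# Reducing VT from 0.8 V to 0.2 V at VDD = 2 V:
print(normalized_delay(2.0, 0.8))  # ~4.9: much slower than the 5 V reference
print(normalized_delay(2.0, 0.2))  # ~2.2: roughly twice as fast as VT = 0.8 V
```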
In certain types of applications, the reduction of circuit speed which comes as a result of voltage scaling can be
compensated for at the expense of more silicon area. In the following, we will examine the use of architectural
measures such as pipelining and hardware replication to offset the loss of speed at lower supply voltages.
Pipelining Approach
First, consider the single functional block shown in Fig. 7.10 which implements a logic function F(INPUT) of the
input vector, INPUT. Both the input and the output vectors are sampled through register arrays, driven by a clock
signal CLK. Assume that the critical path in this logic block (at a power supply voltage of VDD) allows a maximum
sampling frequency of fCLK; in other words, the maximum input-to-output propagation delay tP,max of this logic
block is equal to or less than TCLK = 1/fCLK. Figure 7.10 also shows the simplified timing diagram of the circuit. A
new input vector is latched into the input register array at each clock cycle, and the output data becomes valid with a
latency of one cycle.
Figure-7.10: Single-stage implementation of a logic function and its simplified timing diagram.
Let Ctotal be the total capacitance switched every clock cycle. Here, Ctotal consists of (i) the capacitance switched in
the input register array, (ii) the capacitance switched to implement the logic function, and (iii) the capacitance
switched in the output register array. Then, the dynamic power consumption of this structure can be found as
$$P_{single} = C_{total}\,V_{DD}^2\,f_{CLK} \quad (7.11)$$
Now consider an N-stage pipelined structure for implementing the same logic function, as shown in Fig. 7.11. The
logic function F(INPUT) has been partitioned into N successive stages, and a total of (N-1) register arrays have been
introduced, in addition to the original input and output registers, to create the pipeline. All registers are clocked at the
original sample rate, fCLK. If all stages of the partitioned function have approximately equal delay of
$$t_{p,stage} \approx \frac{t_{p,max}}{N} \quad (7.12)$$
then the logic blocks between two successive registers can operate N-times slower while maintaining the same
functional throughput as before. This implies that the power supply voltage can be reduced to a value of VDD,new,
to effectively slow down the circuit by a factor of N. The supply voltage to achieve this reduction can be found by
solving (7.4).
Figure-7.11:
N-stage pipeline structure realizing the same logic function as in Fig. 7.10. The maximum pipeline stage delay is
equal to the clock period, and the latency is N clock cycles.
The dynamic power consumption of the N-stage pipelined structure with a lower supply voltage and with the same
functional throughput as the single-stage structure can be approximated by
$$P_{pipeline} = \left[C_{total} + (N-1)\,C_{reg}\right]V_{DD,new}^2\,f_{CLK} \quad (7.13)$$
where Creg represents the capacitance switched by each pipeline register. Then, the power reduction factor achieved
in an N-stage pipeline structure is
$$\frac{P_{pipeline}}{P_{single}} = \left(1 + (N-1)\,\frac{C_{reg}}{C_{total}}\right)\left(\frac{V_{DD,new}}{V_{DD}}\right)^2 \quad (7.14)$$
As an example, consider replacing a single-stage logic block (VDD = 5 V, fCLK = 20 MHz) with a four-stage
pipeline structure, running at the same clock frequency. This means that the propagation delay of each pipeline stage
can be increased by a factor of 4 without sacrificing the data throughput. Assuming that the magnitude of the
threshold voltage of all transistors is 0.8 V, the desired speed reduction can be achieved by reducing the power
supply voltage from 5 V to approximately 2 V (see Fig. 7.9). With a typical ratio of (Creg/Ctotal) = 0.1, the overall
power reduction factor is found from (7.14) as 1/5. This means that replacing the original single-stage logic block
with a four-stage pipeline running at the same clock frequency and reducing the power supply voltage from 5 V to 2
V will provide a dynamic power savings of about 80%, while maintaining the same throughput as before.
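The four-stage example above can be checked directly against (7.14); the following minimal sketch encodes the reduction factor.

```python
def pipeline_power_ratio(n_stages, creg_over_ctotal, v_new, v_old):
    """Power reduction factor of an N-stage pipeline, Eq. (7.14):
    P_pipeline / P_single = (1 + (N-1)*Creg/Ctotal) * (VDD,new/VDD)^2."""
    return (1.0 + (n_stages - 1) * creg_over_ctotal) * (v_new / v_old) ** 2

# The four-stage example from the text: VDD scaled from 5 V to 2 V,
# Creg/Ctotal = 0.1.
print(pipeline_power_ratio(4, 0.1, 2.0, 5.0))  # 0.208, i.e. about 1/5
```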
The architectural modification described here has a relatively small area overhead. A total of (N-1) register arrays
have to be added to convert the original single-stage structure into a pipeline. While trading off area for lower power,
this approach also increases the latency from one to N clock cycles. Yet in many applications such as signal
processing and data encoding, latency is not a very significant concern.
Parallel Processing Approach (Hardware Replication)
Another possibility of trading off area for lower power dissipation is to use parallelism, or hardware replication. This
approach could be useful especially when the logic function to be implemented is not suitable for pipelining.
Consider N identical processing elements, each implementing the logic function F(INPUT) in parallel, as shown in
Fig. 7.12. Assume that the consecutive input vectors arrive at the same rate as in the single-stage case examined
earlier. The input vectors are routed to all the registers of the N processing blocks. Gated clock signals, each with a
clock period of (N TCLK), are used to load each register every N clock cycles. This means that the clock signals to
each input register are skewed by TCLK, such that each one of the N consecutive input vectors is loaded into a
different input register. Since each input register is clocked at a lower frequency of (fCLK / N), the time allowed to
compute the function for each input vector is increased by a factor of N. This implies that the power supply voltage
can be reduced until the critical path delay equals the new clock period of (N TCLK). The outputs of the N
processing blocks are multiplexed and sent to an output register which operates at a clock frequency of fCLK,
ensuring the same data throughput rate as before. The timing diagram of this parallel arrangement is given in Fig.
7.13.
Since the time allowed to compute the function for each input vector is increased by a factor of N, the power supply
voltage can be reduced to a value of VDD,new, to effectively slow down the circuit. The new supply voltage can be
found, as in the pipelined case, by solving (7.4). The total dynamic power dissipation of the parallel structure
(neglecting the dissipation of the multiplexor) is found as the sum of the power dissipated by the input registers and
the logic blocks operating at a clock frequency of (fCLK / N), and the output register operating at a clock frequency
of fCLK.
$$P_{parallel} = \left[N\,(C_{total}-C_{reg})\,\frac{f_{CLK}}{N} + C_{reg}\,f_{CLK}\right]V_{DD,new}^2 = C_{total}\,V_{DD,new}^2\,f_{CLK} \quad (7.15)$$
Figure-7.12:
N-block parallel structure realizing the same logic function as in Fig. 7.10. Notice that the input registers are clocked
at a lower frequency of (fCLK / N).
Note that there is also an additional overhead which consists of the input routing capacitance, the output routing
capacitance and the capacitance of the output multiplexor structure, all of which are increasing functions of N. If this
overhead is neglected, the amount of power reduction achievable in an N-block parallel implementation is
$$\frac{P_{parallel}}{P_{single}} = \left(\frac{V_{DD,new}}{V_{DD}}\right)^2 \quad (7.16)$$
The lower bound of dynamic power reduction realizable with architecture-driven voltage scaling is found, assuming
zero threshold voltage, as
$$\left(\frac{P_{parallel}}{P_{single}}\right)_{min} = \left(\frac{V_{DD}/N}{V_{DD}}\right)^2 = \frac{1}{N^2} \quad (7.17)$$
Figure-7.13: Simplified timing diagram of the N-block parallel structure shown in Fig. 7.12.
Two obvious consequences of this approach are the increased area and the increased latency. A total of N identical
processing blocks must be used to slow down the operation (clocking) speed by a factor of N. In fact, the silicon area
will grow even faster than the number of processors because of signal routing and the overhead circuitry. The timing
diagram in Fig. 7.13 shows that the parallel implementation has a latency of N clock cycles, as in the N-stage
pipelined implementation. Considering its smaller area overhead, however, the pipelined approach offers a more
efficient alternative for reducing the power dissipation while maintaining the throughput.
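A corresponding sketch of the ideal parallel-processing bound of (7.17), assuming zero threshold voltage and neglecting all routing and multiplexer overhead:

```python
def parallel_power_bound(n_blocks):
    """Lower bound of Eq. (7.17): with zero threshold voltage the delay
    scales roughly as 1/VDD, so an N-block parallel design can run at
    VDD/N, and P_parallel / P_single approaches (1/N)^2."""
    return 1.0 / n_blocks ** 2

for n in (2, 4, 8):
    print(n, parallel_power_bound(n))
```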
7.4 Estimation and Optimization of Switching Activity
In the previous section, we have discussed methods for minimizing dynamic power consumption in CMOS digital
integrated circuits by supply voltage scaling. Another approach to low power design is to reduce the switching
activity and the amount of the switched capacitance to the minimum level required to perform a given task. The
measures to accomplish this goal can range from optimization of algorithms to logic design, and finally to physical
mask design. In the following, we will examine the concept of switching activity, and introduce some of the
approaches used to reduce it. We will also examine the various measures used to minimize the amount of capacitance
which must be switched to perform a given task in a circuit.
The Concept of Switching Activity
It was already discussed in Section 7.2 that the dynamic power consumption of a CMOS logic gate depends, among
other parameters, also on the node transition factor aT, which is the effective number of power-consuming voltage
transitions experienced by the output capacitance per clock cycle. This parameter, also called the switching activity
factor, depends on the Boolean function performed by the gate, the logic family, and the input signal statistics.
Assuming that all input signals have an equal probability to assume a logic "0" or a logic "1" state, we can easily
investigate the output transition probabilities for different types of logic gates. First, we will introduce two signal
probabilities, P0 and P1. P0 corresponds to the probability of having a logic "0" at the output, and P1 = (1 - P0)
corresponds to the probability of having a logic "1" at the output. Therefore, the probability that a power-consuming
(0-to-1) transition occurs at the output node is the product of these two output signal probabilities. Consider, for
example, a static CMOS NOR2 gate. If the two inputs are independent and uniformly distributed, the four possible
input combinations (00, 01, 10, 11) are equally likely to occur. Thus, we can find from the truth table of the NOR2
gate that P0 = 3/4, and P1 = 1/4. The probability that a power-consuming transition occurs at the output node is
therefore
$$P_{0\to 1} = P_0\,P_1 = \frac{3}{4}\cdot\frac{1}{4} = \frac{3}{16} \quad (7.18)$$
The transition probabilities can be shown on a state transition diagram which consists of the only two possible output
states and the possible transitions among them (Fig. 7.14). In the general case of a CMOS logic gate with n input
variables, the probability of a power-consuming output transition can be expressed as a function of n0, which is the
number of zeros in the output column of the truth table.
$$P_{0\to 1} = P_0\,P_1 = \frac{n_0}{2^n}\cdot\frac{2^n - n_0}{2^n} \quad (7.19)$$
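Equation (7.19) lends itself to a simple truth-table computation; the sketch below assumes independent, equiprobable inputs and represents a gate by its flattened output column.

```python
def transition_probability(output_column):
    """0-to-1 output transition probability, Eq. (7.19), for a gate with
    independent, equiprobable inputs; output_column is the flattened
    output column of the truth table (length 2^n)."""
    total = len(output_column)
    n0 = output_column.count(0)
    return (n0 / total) * ((total - n0) / total)

print(transition_probability([1, 0, 0, 0]))  # NOR2:  3/16
print(transition_probability([1, 1, 1, 0]))  # NAND2: 3/16
print(transition_probability([0, 1, 1, 0]))  # XOR2:  1/4
```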
Figure-7.14: State transition diagram and state transition probabilities of a NOR2 gate.
The output transition probability is shown as a function of the number of inputs in Fig. 7.15, for different types of
logic gates and assuming equal input probabilities. For a NAND or NOR gate, the truth table contains only one "0"
or "1", respectively, regardless of the number of inputs. Therefore, the output transition probability drops as the
number of inputs is increased. In a XOR gate, on the other hand, the truth table always contains an equal number of
logic "0" and logic "1" values. The output transition probability therefore remains constant at 0.25.
Figure-7.15:
Output transition probabilities of different logic gates, as a function of the number of inputs. Note that the transition
probability of the XOR gate is independent of the number of inputs.
In multi-level logic circuits, the distribution of input signal probabilities is typically not uniform, i.e., one cannot
expect to have equal probabilities for the occurrence of a logic "0" and a logic "1". Then, the output transition
probability becomes a function of the input probability distributions. As an example, consider the NOR2 gate
examined above. Let P1,A represent the probability of having a logic "1" at the input A, and P1,B represent the
probability of having a logic "1" at the input B. The probability of obtaining a logic "1" at the output node is
$$P_1 = (1 - P_{1,A})(1 - P_{1,B}) \quad (7.20)$$
Using this expression, the probability of a power-consuming output transition is found as a function of P1,A and
P1,B.
$$P_{0\to 1} = P_0\,P_1 = \left[1 - (1 - P_{1,A})(1 - P_{1,B})\right](1 - P_{1,A})(1 - P_{1,B}) \quad (7.21)$$
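For non-uniform input statistics, (7.20) and (7.21) can be evaluated as in the following sketch for the NOR2 gate.

```python
def nor2_transition_probability(p1_a, p1_b):
    """NOR2 0-to-1 output transition probability for arbitrary input
    signal probabilities, Eqs. (7.20)-(7.21): the output is '1' only
    when both inputs are '0'."""
    p1 = (1.0 - p1_a) * (1.0 - p1_b)  # Eq. (7.20)
    return (1.0 - p1) * p1            # P(0->1) = P0 * P1, Eq. (7.21)

print(nor2_transition_probability(0.5, 0.5))  # 3/16 for uniform inputs
print(nor2_transition_probability(0.9, 0.9))  # much lower activity
```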
Figure 7.16 shows the distribution of the output transition probability in a NOR2 gate, as a function of two input
probabilities. It can be seen that the evaluation of switching activity becomes a complicated problem in large circuits,
especially when sequential elements, reconvergent nodes and feedback loops are involved. The designer must
therefore rely on computer-aided design (CAD) tools for correct estimation of switching activity in a given network.
Figure-7.16: Output transition probability of NOR2 gate as a function of two input probabilities.
In dynamic CMOS logic circuits, the output node is precharged during every clock cycle. If the output node was
discharged (i.e., if the output value was equal to "0") in the previous cycle, the pMOS precharge transistor will draw
a current from the power supply during the precharge phase. This means that the dynamic CMOS logic gate will
consume power every time the output value equals "0", regardless of the preceding or following values. Therefore,
the power consumption of dynamic logic gates is determined by the signal-value probability of the output node and
not by the transition probability. From the discussion above, we can see that signal-value probabilities are always
larger than transition probabilities, hence, the power consumption of dynamic CMOS logic gates is typically larger
than static CMOS gates under the same conditions.
Reduction of Switching Activity
Switching activity in CMOS digital integrated circuits can be reduced by algorithmic optimization, by architecture
optimization, by proper choice of logic topology and by circuit-level optimization. In the following, we will briefly review
some of the measures that can be applied to optimize the switching probabilities, and hence, the dynamic power
consumption.
Algorithmic optimization depends heavily on the application and on the characteristics of the data such as dynamic
range, correlation, statistics of data transmission. Some of the techniques can be applied only for specific algorithms
such as Digital Signal Processing (DSP) and cannot be used for general-purpose processing. One possibility is
choosing a vector quantization (VQ) algorithm which results in minimum switching activity. For example, the
number of memory accesses, the number of multiplications and the number of additions can be reduced by about a
factor of 30 if a differential tree search algorithm is used instead of the full search algorithm.
The representation of data may also have a significant impact on switching activity at the system level. In
applications where data bits change sequentially and are highly correlated, such as the address bits used to access
instructions, the use of Gray coding leads to a reduced number of transitions compared to simple binary
coding. Another example is using sign-magnitude representation instead of the conventional two's complement
representation for signed data. A change in sign will cause transitions of the higher-order bits in the two's
complement representation, whereas only the sign bit will change in sign-magnitude representation. Therefore, the
switching activity can be reduced by using the sign-magnitude representation in applications where the data sign
changes are frequent.
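The benefit of Gray coding for sequential, highly correlated addresses can be demonstrated with a short transition count; the 16-word address sweep below is a hypothetical example.

```python
def bit_transitions(words):
    """Total number of bit flips between consecutive words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))

def to_gray(x):
    """Binary-to-Gray conversion; consecutive integers differ in one bit."""
    return x ^ (x >> 1)

addresses = list(range(16))  # sequential, highly correlated address stream
print(bit_transitions(addresses))                        # binary coding: 26
print(bit_transitions([to_gray(a) for a in addresses]))  # Gray coding:  15
```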
An important architecture-level measure to reduce switching activity is based on delay balancing and the reduction
of glitches. In multi-level logic circuits, the finite propagation delay from one logic block to the next can cause
spurious signal transitions, or glitches, as a result of critical races or dynamic hazards. In general, if all input signals
of a gate change simultaneously, no glitching occurs. But a dynamic hazard or glitch can occur if input signals
change at different times. Thus, a node can exhibit multiple transitions in a single clock cycle before settling to the
correct logic level (Fig. 7.17). In some cases, the signal glitches are only partial, i.e., the node voltage does not make
a full transition between the ground and VDD levels, yet even partial glitches can have a significant contribution to
dynamic power dissipation.
Figure-7.17: Signal glitching in multi-level static CMOS circuits.
Glitches occur primarily due to a mismatch or imbalance in the path lengths in the logic network. Such a mismatch in
path length results in a mismatch of signal timing with respect to the primary inputs. As an example, consider the
simple parity network shown in Fig. 7.18. Assuming that all XOR blocks have the same delay, it can be seen that the
network in Fig. 7.18(a) will suffer from glitching due to the wide disparity between the arrival times of the input
signals for the gates. In the network shown in Fig. 7.18(b), on the other hand, all arrival times are identical because
the delay paths are balanced. Such redesign can significantly reduce the glitching transitions, and consequently, the
dynamic power dissipation in complex multi-level networks. Also notice that the tree structure shown in Fig. 7.18(b)
results in smaller overall propagation delay. Finally, it should be noted that glitching is not a significant issue in
multi-level dynamic CMOS logic circuits, since each node undergoes at most one transition per clock cycle.
Figure-7.18:
(a) Implementation of a four-input parity (XOR) function using a chain structure. (b) Implementation of the same
function using a tree structure which will reduce glitching transitions.
7.5 Reduction of Switched Capacitance
It was already established in the previous sections that the amount of switched capacitance plays a significant role in
the dynamic power dissipation of the circuit. Hence, reduction of this parasitic capacitance is a major goal for
low-power design of digital integrated circuits.
At the system level, one of the approaches to reduce the switched capacitance is to limit the use of shared resources.
A simple example is the use of a global bus structure for data transmission between a large number of operational
modules (Fig. 7.19). If a single shared bus is connected to all modules as in Fig. 7.19(a), this structure results in a
large bus capacitance due to (i) the large number of drivers and receivers sharing the same transmission medium, and
(ii) the parasitic capacitance of the long bus line. Obviously, driving the large bus capacitance will require a
significant amount of power consumption during each bus access. Alternatively, the global bus structure can be
partitioned into a number of smaller dedicated local busses to handle the data transmission between neighboring
modules, as shown in Fig. 7.19(b). In this case, the switched capacitance during each bus access is significantly
reduced, yet multiple busses may increase the overall routing area on chip.
Figure-7.19:
(a) Using a single global bus structure for connecting a large number of modules on chip results in large bus
capacitance and large dynamic power dissipation. (b) Using smaller local busses reduces the amount of switched
capacitance, at the expense of additional chip area.
The type of logic style used to implement a digital circuit also affects the physical capacitance of the circuit. The
physical capacitance is a function of the number of transistors that are required to implement a given function. For
example, one approach to reduce the physical capacitance is to use transfer gates over conventional CMOS logic
gates to implement logic functions. Pass-gate logic design is attractive since fewer transistors are required for certain
functions such as XOR and XNOR. In many arithmetic operations where binary adders and multipliers are used, pass
transistor logic offers significant advantages. Similarly, multiplexors and other key building blocks can also be
simplified using this design style.
The amount of parasitic capacitance that is switched (i.e., charged up or charged down) during operation can also be
reduced at the physical design level, or mask level. The parasitic gate and diffusion capacitances of MOS transistors
in the circuit typically constitute a significant amount of the total capacitance in a combinational logic circuit. Hence,
a simple mask-level measure to reduce power dissipation is keeping the transistors (especially the drain and source
regions) at minimum dimensions whenever possible and feasible, thereby minimizing the parasitic capacitances.
Designing a logic gate with minimum-size transistors certainly affects the dynamic performance of the circuit, and
this trade-off between dynamic performance and power dissipation should be carefully considered in critical circuits.
Especially in circuits driving large extrinsic capacitive loads, e.g., large fan-out or routing capacitances, the
transistors must be designed with larger dimensions. Yet in many other cases where the load capacitance of a gate is
mainly intrinsic, the transistor sizes can be kept at minimum. Note that most standard cell libraries are designed with
larger transistors in order to accommodate a wide range of capacitive loads and performance requirements.
Consequently, a standard-cell based design may have considerable overhead in terms of switched capacitance in each
cell.
7.6 Adiabatic Logic Circuits
In conventional level-restoring CMOS logic circuits with rail-to-rail output voltage swing, each switching event
causes an energy transfer from the power supply to the output node, or from the output node to the ground. During a
0-to-VDD transition of the output, the total output charge Q = Cload·VDD is drawn from the power supply at a
constant voltage. Thus, an energy of Esupply = Cload·VDD^2 is drawn from the power supply during this transition.
Charging the output node capacitance to the voltage level VDD means that at the end of the transition, the amount of
energy Estored = Cload·VDD^2/2 is stored on the output node. Thus, half of the energy injected from the power
supply is dissipated in the pMOS network while the other half is delivered to the output node. During a subsequent
VDD-to-0 transition of the output node, no charge is drawn from the power supply and the energy stored in the load
capacitance is dissipated in the nMOS network.
To reduce the dissipation, the circuit designer can minimize the switching events, decrease the node capacitance,
reduce the voltage swing, or apply a combination of these methods. Yet in all cases, the energy drawn from the
power supply is used only once before being dissipated. To increase the energy efficiency of logic circuits, other
measures must be introduced for recycling the energy drawn from the power supply. A novel class of logic circuits
called adiabatic logic offers the possibility of further reducing the energy dissipated during switching events, and the
possibility of recycling, or reusing, some of the energy drawn from the power supply. To accomplish this goal, the
circuit topology and the operation principles have to be modified, sometimes drastically. The amount of energy
recycling achievable using adiabatic techniques is also determined by the fabrication technology, switching speed
and voltage swing.
The term "adiabatic" is typically used to describe thermodynamic processes that have no energy exchange with the
environment, and therefore, no energy loss in the form of dissipated heat. In our case, the electric charge transfer
between the nodes of a circuit will be viewed as the process, and various techniques will be explored to minimize the
energy loss, or heat dissipation, during charge transfer events. It should be noted that fully adiabatic operation of a
circuit is an ideal condition which may only be approached asymptotically as the switching process is slowed down.
In practical cases, energy dissipation associated with a charge transfer event is usually composed of an adiabatic
component and a non-adiabatic component. Therefore, reducing all energy loss to zero may not be possible,
regardless of the switching speed.
Adiabatic Switching
Consider the simple circuit shown in Fig. 7.20 where a load capacitance is charged by a constant current source. This
circuit is similar to the equivalent circuit used to model the charge-up event in conventional CMOS circuits, with the
exception that in conventional CMOS, the output capacitance is charged by a constant voltage source and not by a
constant current source. Here, R represents the on-resistance of the pMOS network. Also note that a constant
charging current corresponds to a linear voltage ramp. Assuming that the capacitance voltage VC is equal to zero
initially, the variation of the voltage as a function of time can be found as
$$V_C(t) = \frac{I_{source}}{C}\,t \quad (7.22)$$
Hence, the charging current can be expressed as a simple function of VC and time t.
$$I_{source} = \frac{C\,V_C(t)}{t} \quad (7.23)$$
The amount of energy dissipated in the resistor R from t = 0 to t = T can be found as
$$E_{diss} = R\int_0^T I_{source}^2\,dt = R\,I_{source}^2\,T \quad (7.24)$$
Combining (7.23) and (7.24), the dissipated energy can also be expressed as follows.
$$E_{diss} = \left(\frac{RC}{T}\right)C\,V_C^2(T) \quad (7.25)$$
Figure-7.20: Constant-current source charging a load capacitance C, through a resistance R.
Now, a number of simple observations can be made based on (7.25). First, the dissipated energy is smaller than for
the conventional case if the charging time T is larger than 2 RC. In fact, the dissipated energy can be made arbitrarily
small by increasing the charging time, since Ediss is inversely proportional to T. Also, we observe that the dissipated
energy is proportional to the resistance R, as opposed to the conventional case where the dissipation depends on the
capacitance and the voltage swing. Reducing the on-resistance of the pMOS network will reduce the energy
dissipation.
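A quick numerical comparison of (7.25) against the conventional C·V^2/2 loss, for illustrative R, C, and voltage values:

```python
def adiabatic_dissipation(r, c, v_swing, t_charge):
    """Energy dissipated while charging C through R with a constant
    current over a time T, Eq. (7.25): E_diss = (RC / T) * C * V^2."""
    return (r * c / t_charge) * c * v_swing ** 2

r, c, v = 1e3, 100e-15, 3.3                        # hypothetical values
print(adiabatic_dissipation(r, c, v, 10 * r * c))  # C*V^2/10: adiabatic gain
print(0.5 * c * v ** 2)                            # conventional C*V^2/2 loss
```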
We have seen that the constant-current charging process efficiently transfers energy from the power supply to the
load capacitance. A portion of the energy thus stored in the capacitance can also be reclaimed by reversing the
current source direction, allowing the charge to be transferred from the capacitance back into the supply. This
possibility is unique to adiabatic operation, since in conventional CMOS circuits the energy is dissipated after being
used only once. The constant-current power supply must certainly be capable of retrieving the energy back from the
circuit. Adiabatic logic circuits thus require non-standard power supplies with time-varying voltage, also called
pulsed-power supplies. The additional hardware overhead associated with these specific power supply circuits is one
of the design trade-offs that must be considered when using adiabatic logic.
Adiabatic Logic Gates
In the following, we will examine simple circuit configurations which can be used for adiabatic switching. Note that
most of the research on adiabatic logic circuits is relatively recent; therefore, the circuits presented here should be
considered as examples only. Other circuit topologies are also possible, but the overall approach of energy recycling
should still be applicable, regardless of the specific circuit configuration.
First, consider the adiabatic amplifier circuit shown in Fig. 7.21, which can be used to drive capacitive loads. It
consists of two CMOS transmission gates and two nMOS clamp transistors. Both the input (X) and the output (Y)
are dual-rail encoded, which means that the inverses of both signals are also available, to control the CMOS T-gates.
Figure-7.21:
Adiabatic amplifier circuit which transfers the complementary input signals to its complementary outputs through
CMOS transmission gates.
When the input signal X is set to a valid value, one of the two transmission gates becomes transparent. Next, the
amplifier is energized by applying a slow voltage ramp VA, rising from zero to VDD. The load capacitance at one of
the two complementary outputs is adiabatically charged to VDD through the transmission gate, while the other
output node remains clamped to ground potential. When the charging process is completed, the output signal pair is
valid and can be used as an input to other, similar circuits. Next, the circuit is de-energized by ramping the voltage
VA back to zero. Thus, the energy that was stored in the output load capacitance is retrieved by the power supply.
Note that the input signal pair must be valid and stable throughout this sequence.
Figure-7.22:
(a) The general circuit topology of a conventional CMOS logic gate. (b) The topology of an adiabatic logic gate
implementing the same function. Note the difference in charge-up and charge-down paths for the output capacitance.
The simple circuit principle of the adiabatic amplifier can be extended to allow the implementation of arbitrary logic
functions. Figure 7.22 shows the general circuit topology of a conventional CMOS logic gate and an adiabatic
counterpart. To convert a conventional CMOS logic gate into an adiabatic gate, the pull-up and pull-down networks
must be replaced with complementary transmission-gate networks. The T-gate network implementing the pull-up
function is used to drive the true output of the adiabatic gate, while the T-gate network implementing the pull-down
function drives the complementary output node. Note that all inputs should also be available in complementary form.
Both networks in the adiabatic logic circuit are used to charge up as well as charge down the output capacitances,
which ensures that the energy stored at the output node can be retrieved by the power supply, at the end of each
cycle. To allow adiabatic operation, the DC voltage source of the original circuit must be replaced by a pulsed-power
supply with ramped voltage output. Note that the circuit modifications which are necessary to convert a conventional
CMOS logic circuit into an adiabatic logic circuit increase the device count by a factor of two. Also, the reduction of
energy dissipation comes at the cost of slower switching speed, which is the ultimate trade-off in all adiabatic
methods.
Stepwise Charging Circuits
We have seen earlier that the dissipation during a charge-up event can be minimized, and in the ideal case be reduced
to zero, by using a constant-current power supply. This requires that the power supply be able to generate linear
voltage ramps. Practical supplies can be constructed by using resonant inductor circuits to approximate the constant
output current and the linear voltage ramp with sinusoidal signals. But the use of inductors presents several
difficulties at the circuit level, especially in terms of chip-level integration and overall efficiency.
An alternative to using pure voltage ramps is to use stepwise supply voltage waveforms, where the output voltage of
the power supply is increased and decreased in small increments during charging and discharging. Since the energy
dissipation depends on the average voltage drop traversed by the charge that flows onto the load capacitance, using
smaller voltage steps, or increments, should reduce the dissipation considerably.
Figure 7.23 shows a CMOS inverter driven by a stepwise supply voltage waveform. Assume that the output voltage
is equal to zero initially. With the input voltage set to logic low level, the power supply voltage VA is increased from
0 to VDD, in n equal voltage steps (Fig. 7.24). Since the pMOS transistor is conducting during this transition, the
output load capacitance will be charged up in a stepwise manner. The on-resistance of the pMOS transistor can be
represented by the linear resistor R. Thus, the output load capacitance is being charged up through a resistor, in small
voltage increments. For the ith time increment, the amount of capacitor current can be expressed as
$$i_C = C\,\frac{dV_{out}}{dt} = \frac{V_A(i+1) - V_{out}}{R} \quad (7.26)$$
Solving this differential equation with the initial condition Vout(ti) = VA(i) yields
$$V_{out}(t) = V_A(i+1) - \left[V_A(i+1) - V_A(i)\right]e^{-(t-t_i)/RC} \quad (7.27)$$
Figure-7.23: A CMOS inverter circuit with a stepwise-increasing supply voltage.
Figure-7.24:
Equivalent circuit, and the input and output voltage waveforms of the CMOS inverter circuit in Fig. 7.23 (stepwise
charge-up case).
Here, n is the number of steps of the supply voltage waveform. The amount of energy dissipated during one voltage
step increment can now be found as
$$E_{step} = \frac{1}{2}\,C\left(\frac{V_{DD}}{n}\right)^2 \quad (7.28)$$
Since n steps are used to charge up the capacitance to VDD, the total dissipation is
$$E_{total} = n\cdot\frac{1}{2}\,C\left(\frac{V_{DD}}{n}\right)^2 = \frac{C\,V_{DD}^2}{2n} \quad (7.29)$$
According to this simplified analysis, charging the output capacitance with n voltage steps, or increments, reduces
the energy dissipation per cycle by a factor of n. Therefore, the total power dissipation is also reduced by a factor of
n using stepwise charging. This result implies that if the voltage steps can be made very small and the number of
voltage steps n approaches infinity (i.e., if the supply voltage is a slow linear ramp), the energy dissipation will
approach zero.
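As a quick numerical check of Eq. (7.29), the following Python sketch tabulates the per-cycle charge-up dissipation as the number of supply steps n grows; the load capacitance and supply voltage are illustrative assumed values, and the factor-of-n reduction falls out directly.

# Quick check of Eq. (7.29): charge-up dissipation vs. number of steps n.
C, VDD = 1e-12, 3.3          # 1 pF load, 3.3 V swing (assumed values)

def stepwise_dissipation(n):
    """Energy burned in the switch resistance while charging C in n steps.

    Each step of height VDD/n dissipates C*(VDD/n)**2/2, independently of
    R, provided the output settles completely at every step.
    """
    return n * 0.5 * C * (VDD / n) ** 2

for n in (1, 2, 4, 8, 16):
    print(f"n={n:2d}: E = {stepwise_dissipation(n):.3e} J "
          f"({stepwise_dissipation(1) / stepwise_dissipation(n):.0f}x less)")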
Another example of a simple stepwise charging circuit is the stepwise driver for capacitive loads, implemented with nMOS devices as shown in Fig. 7.25. Here, a bank of n constant voltage supplies with evenly distributed voltage levels is used. The load capacitance is charged up by connecting the constant voltage sources V1 through Vn to the load in succession, using an array of switch devices. To discharge the load capacitance, the constant voltage sources are connected to the load in the reverse sequence.
The switch devices are shown as nMOS transistors in Fig. 7.25, yet some of them may be replaced by pMOS
transistors to prevent the undesirable threshold-voltage drop problem and the substrate-bias effects at higher voltage
levels. One of the most significant drawbacks of this circuit configuration is the need for multiple supply voltages. A
power supply system capable of efficiently generating n different voltage levels would be complex and expensive.
Also, the routing of n different supply voltages to each circuit in a large system would create a significant overhead.
In addition, the concept is not easily extensible to general logic gates. Therefore, stepwise charging driver circuits
can be best utilized for driving a few critical nodes in the circuit that are responsible for a large portion of the overall
power dissipation, such as output pads and large busses.
In general, we have seen that adiabatic logic circuits can offer significant reduction of energy dissipation, but usually
at the expense of switching times. Therefore, adiabatic logic circuits can be best utilized in cases where delay is not
critical. Moreover, the realization of unconventional power supplies needed in adiabatic circuit configurations
typically results in an overhead both in terms of overall energy dissipation and in terms of silicon area. These issues
should be carefully considered when adiabatic logic is used as a method for low-power design.
[Click to enlarge image]
Figure-7.25:
Stepwise driver circuit for capacitive loads. The load capacitance is successively connected to constant voltage
sources Vi through an array of switch devices.
References
1. A.P. Chandrakasan and R.W. Brodersen, Low Power Digital CMOS Design, Norwell, MA: Kluwer Academic
Publishers, 1995.
2. J.M. Rabaey and M. Pedram, ed., Low Power Design Methodologies, Norwell, MA: Kluwer Academic
Publishers, 1995.
3. A. Bellaouar and M.I. Elmasry, Low-Power Digital VLSI Design, Norwell, MA: Kluwer Academic
Publishers, 1995.
4. F. Najm, "A survey of power estimation techniques in VLSI circuits," IEEE Transactions on VLSI Systems,
vol. 2, pp. 446-455, December 1994.
5. W.C. Athas, L. Svensson, J.G. Koller and E. Chou, "Low-power digital systems based on adiabatic-switching principles," IEEE Transactions on VLSI Systems, vol. 2, pp. 398-407, December 1994.
This chapter edited by Y. Leblebici
Chapter 8
TESTABILITY OF INTEGRATED SYSTEMS
Design Constraints
Testing
The Rule of Ten
Terminology
Failures in CMOS
Combinational Logic Testing
Practical Ad-Hoc DFT Guidelines
Scan Design Techniques
8.1 Design Constraints
The following paragraphs remind the designer of some basic rules to consider before starting a design. For each of these constraints, at least one tool exists to help develop the design in accordance with the corresponding set of rules:
8.1.1 Design Rule Checking
Every technology has its design rules, which define the geometrical constraints that a chip layout must satisfy to be manufacturable. These rules are supplied by the technology department of each IC foundry. They are usually described in a document with drawings of the layers available in the technology, annotated with the minimum sizes, spacings and other geometrical constraints allowed in that technology.
[Click to enlarge image]
Figure-8.1:
The designer needs to run a program called DRC to check that the design does not violate the rules defined by the foundry.
This verification step, called DRC, is as important as the simulation of the functionality of the design. Simulation alone cannot determine whether the layout rules are respected; if they are violated, manufacturing the chip could produce shorts or opens in the physical silicon implementation. Other verification tools should also be used, such as ERC and LVS, described below.
8.1.2 Layout Versus Schematic
As a complement to DRC, LVS is another tool to be used, especially if the design started with a schematic entry tool. The aim of LVS is to check that the design at the layout level corresponds to, and remains coherent with, the schematic. Usually, designers start with a schematic, simulate it, and if it is correct proceed to layout. But in some cases, such as full-custom or some semi-custom designs, the layout implementation of the chip differs from the schematic, either because of simulation results or because of a design error that simulation cannot easily detect: simulation can never be exhaustive. LVS checks that the designer produced the same representation at the schematic and layout levels; if not, the LVS tool reports the discrepancy. Of course, a simulation of the extracted layout using the same stimuli as for the schematic gives additional confidence in the final design.
8.1.3 Latch-Up and Electro-Static Discharge
Latch-up caused the early problems that delayed the introduction of CMOS into the electronics industry. Also called the "thyristor effect", it can destroy the chip or a part of it. There is no complete cure for this phenomenon, but a set of design techniques exists to prevent latch-up rather than to fix it after the fact. The origin of latch-up lies in the arrangement of the basic N and P structures of the nMOS and pMOS devices inside the silicon. In some cases, not only PN junctions are formed, but also PNPN or NPNP parasitic thyristor structures. These parasitic elements can behave like a real thyristor and conduct a high current, destroying the surrounding area, including the pMOS and nMOS transistors.
The most widely used technique for preventing the formation of such a structure is to add "butting contacts" tying the n-well (or p-well) to Vdd (or to ground). This technique cannot eliminate the latch-up mechanism entirely, but it greatly reduces its effect.
Another electrical hazard for CMOS is ESD, or Electro-Static Discharge. Handling CMOS chips properly helps avoid the destruction of gates by the electrostatic charges that can accumulate on a person's hands; this is why it is important to wear a conducting bracelet connected to ground when handling CMOS ICs. But even a grounded bracelet is not enough to protect CMOS chips from destruction due to ESD. Inside the chip, two diodes at each pad connect every I/O to Vdd and Gnd. These two large diodes protect the chip core (the CMOS transistor gates) from ESD by clamping over-voltages.
[Click to enlarge image]
Figure-8.2:
8.1.4 Electrical Rule Checking
Following from the previous paragraphs, ERC provides a guarantee that the designer has included all the minimum structures necessary for an ERC-free design. This tool verifies that the designer used a sufficient number of well contacts (polarisations), applied the appropriate ESD protection pads, and connected VDD and VSS at the right places.
8.2 Testing
The design of logic integrated circuits in CMOS technology is becoming more and more complex as VLSI attracts ever more electronic IC users and manufacturers. A problem common to both users and manufacturers is the testing of these ICs.
[Click to enlarge image]
Figure-8.3:
Testing can be defined as checking whether the outputs of a functional system (functional block, integrated circuit, printed circuit board or complete system) correspond to the inputs applied to it. If the test of the functional system is positive, the system is good for use. If the outputs differ from the expected ones, the system has a problem: either it is rejected (go/no-go test), or a diagnosis is applied to it in order to locate and, if possible, eliminate the causes of the problem.
Testing is applied to detect faults after several stages: design, manufacturing and packaging, and especially during the active life of a system, since failures caused by wear-out can occur at any moment of its usage.
Design for Testability (DfT) is the art of simplifying the test of a system. DfT can be summarized as a set of techniques and design guidelines whose goals are:
minimizing the cost of system production
minimizing the complexity of system test: test generation and application
improving quality
avoiding problems of timing discordance or incompatibility between blocks.
8.3 The Rule of Ten
In the production process cycle, a fault can occur at the chip level. If a test strategy is considered from the beginning of the design, the fault can be detected rapidly, located and eliminated at very low cost. Once the faulty chip is soldered onto a printed circuit board, the cost of remedying the fault is multiplied by ten. This factor of ten continues to apply at each level of integration, until the system has been assembled, packaged and shipped to users.
[Click to enlarge image]
Figure-8.4:
8.4 Terminology
At the system level the most used words are the following:
Testability
can be expressed as the ease with which a Device Under Test (DUT) can be observed and controlled from its external environment.
[Click to enlarge image]
Figure-8.5:
The Design for Testability is then reduced to a set of design rules or guidelines to be respected in order to facilitate
the test.
The Reliability
is expressed as the probability that a device works without major problems for a given time. Reliability goes down as the number of components increases.
The Security
is the probability that the user's life is not endangered when a problem occurs in a device. Security is enhanced by adding certain types of components for extra protection.
The Quality
is essential in some types of applications, where a "zero defect" target is often required. Quality can be enhanced by a proper design methodology and a good technology, avoiding problems and simplifying testing.
8.5 Failures in CMOS
Even after a MOS circuit has been fabricated and initially tested, several mechanisms can still cause it to fail. Failures are caused either by design bugs or by wear-out (ageing or corrosion) mechanisms. The MOSFET transistor has two main characteristics, threshold voltage and transconductance, on which the performance of the circuit depends.
[Click to enlarge image]
Figure-8.6:
Design bugs or defects generally result in device lengths and widths that deviate from those specified for the process (design rules). This type of fault is difficult to detect because it may manifest itself only later, during the active life of the circuit, leading mostly to opens and breaks in conductors, or to shorts between conductors.
Failures are also caused by phenomena like "hot carrier injection", "oxide breakdown", "metallization failures" or
"corrosion".
The consequence of hot carrier injection, for instance, is a shift of the threshold voltage and a degradation of the transconductance, because the gate oxide becomes charged when hot carriers are injected into it (usually electrons, in nMOS devices). Cross-talk is also a cause of faults (generally transient ones), and requires the different parts of the device to be properly isolated.
8.6 Combinational Logic Testing
It is convenient to discuss test generation for combinational logic in this section, and test generation for sequential logic in the next. The solution to the problem of testing a purely combinational logic block is a good set of patterns detecting "all" the possible faults.
The first idea for testing an N-input circuit is to drive the inputs from an N-bit counter (controllability), generate all 2^N combinations, and observe the outputs for checking (observability). This is called "exhaustive testing", and it is very efficient... but only for circuits with few inputs. As the number of inputs increases, this technique becomes very time consuming.
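The following Python sketch illustrates the point; the three-input circuit and its stuck-at-faulty variant are hypothetical stand-ins, not taken from the text. The loop visits all 2^N vectors, which is exactly what makes exhaustive testing impractical for wide inputs.

from itertools import product

def exhaustive_test(circuit, n_inputs, reference):
    """Apply all 2**n_inputs vectors; return the first failing one, if any."""
    for vector in product((0, 1), repeat=n_inputs):
        if circuit(*vector) != reference(*vector):
            return vector
    return None            # the circuit passed the exhaustive test

# Hypothetical 3-input unit under test and its fault-free reference model.
good = lambda a, b, c: (a & b) | c
faulty = lambda a, b, c: (a & 1) | c      # models input b stuck-at-1
print(exhaustive_test(faulty, 3, good))   # -> (1, 0, 0), exposing the fault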
[Click to enlarge image]
Figure-8.7:
8.6.1 Sensitized Path Testing
In practice, many of the patterns applied in exhaustive testing never occur during the actual operation of the circuit. So instead of spending a huge amount of time searching for faults everywhere, the possible faults are first enumerated, and a set of appropriate test vectors is then generated. This is called "single-path sensitization", and it is based on "fault-oriented testing".
[Click to enlarge image]
Figure-8.8:
The basic idea is to select a path from the site of a fault, through a sequence of gates, to an output of the combinational logic under test. The process is composed of three steps:
Manifestation: the gate inputs at the site of the fault are set so as to generate the opposite of the faulty value (0 for an SA1 fault, 1 for an SA0 fault).
Propagation: the inputs of the other gates are set so as to propagate the fault signal along the selected path to the primary output of the circuit. This is done by setting these inputs to "1" for AND/NAND gates and to "0" for OR/NOR gates.
Consistency (or justification): this final step finds the primary input pattern that realizes all the necessary internal values. This is done by tracing backward from the gate inputs to the primary inputs of the logic in order to obtain the test patterns.
[Click to enlarge image]
Figure-8.9:
EXAMPLE 1 - SA1 of line 1 (L1): the aim is to find the vector(s) able to detect this fault.
Manifestation: L1 = 0, hence input A = 0. In a fault-free situation, the output F changes with A when B, C and D are fixed; with L1 stuck at 1, F keeps the faulty value (here F = 0) even when A = 0, whereas the fault-free output would be F = 1.
Propagation: through the AND-gate, L5 = L8 = 1; this condition is necessary for the propagation of "L1 = 0", and leads to L10 = 0. Through the NOR-gate, since L10 = 0, we also need L11 = 0, so that the propagated manifestation reaches the primary output F. F is then read and compared with the fault-free value F = 1.
Consistency: from the AND-gate, L5 = 1 implies L2 = B = 1; also L8 = 1 implies L7 = 1. So far we have found the values of A and B. From the NOT-gate, L11 = 0 requires L9 = L7 = 1 (coherent with L8 = L7). From the OR-gate, L7 = 1, and since L6 = L2 = B = 1, the condition B + C + D = L7 = 1 is already satisfied, so C and D can each be either 1 or 0. The test vectors are thus generated and ready to be applied to detect L1 = SA1.
These three steps have led to four possible vectors detecting L1=SA1.
[Click to enlarge image]
Figure-8.10:
EXAMPLE 2 - SA1 of line 8 (L8): the same combinational logic, with one internal line SA1.
Manifestation: L8 = 0.
Propagation: through the AND-gate, L5 = L1 = 1, which gives L10 = 0. Through the NOR-gate, we want L11 = 0 so as not to mask L10 = 0.
Consistency: from the AND-gate, L8 = 0 leads to L7 = 0. From the NOT-gate, L11 = 0 requires L9 = L7 = 1; L7 cannot be set to 1 and 0 at the same time. This incompatibility cannot be resolved, and the fault "L8 SA1" remains undetectable.
[Click to enlarge image]
Figure-8.11:
EXAMPLE 3 - SA1 of line 2 (L2): the same combinational logic again, with line L2 SA1.
Manifestation: L2 = 0, which sets L5 = L6 = 0.
Propagation: through the AND-gate, L1 = 1, and we need L10 = 0. Through the OR-gate, L3 = L4 = 0, so L7 = L8 = L9 = 0, but then, through the NOT-gate, L11 = 1.
The propagated error "L2 SA1" is masked along this reconvergent path, since the NOR-gate cannot distinguish the origin of the propagation.
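The three examples can be checked mechanically. The sketch below models a gate network that is a plausible reconstruction from the worked examples (the figure itself is not reproduced here, so the exact topology is an assumption): brute-forcing all input vectors confirms that L1 SA1 is caught by exactly four vectors, while L8 SA1 is undetectable.

from itertools import product

def net(a, b, c, d, stuck=None):
    """Assumed network: F = NOR(AND(L1, L5, L8), NOT(L9)), with L5 = L6 = B
    and L7 = L8 = L9 = OR(L6, C, D). `stuck` is (line_name, forced_value)."""
    def ln(name, val):
        return val if stuck is None or stuck[0] != name else stuck[1]
    L1, L2 = ln("L1", a), ln("L2", b)
    L5, L6 = ln("L5", L2), ln("L6", L2)
    L7 = ln("L7", L6 | ln("L3", c) | ln("L4", d))
    L8, L9 = ln("L8", L7), ln("L9", L7)
    L10 = ln("L10", L1 & L5 & L8)
    L11 = ln("L11", 1 - L9)
    return 1 - (L10 | L11)

def detecting_vectors(stuck):
    """All (A, B, C, D) vectors whose output differs in the faulty circuit."""
    return [v for v in product((0, 1), repeat=4)
            if net(*v) != net(*v, stuck=stuck)]

print(detecting_vectors(("L1", 1)))   # four vectors with A=0, B=1
print(detecting_vectors(("L8", 1)))   # [] -> L8 SA1 is undetectable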
8.7 Practical Ad-Hoc DFT Guidelines
This section provides a set of practical Design for Testability guidelines classified into three types: those that facilitate test generation, those that facilitate test application, and those that avoid timing problems.
8.7.1 Improve Controllability and Observability
All "design for test" methods ensure that a design has enough observability and controllability to provide for a
complete and efficient testing. When a node has difficult access from primary inputs or outputs (pads of the circuit), a
very efficient method is to add internal pads acceding to this kind of node in order, for instance, to control block B2
and observe block B1 with a probe.
[Click to enlarge image]
Figure-8.12:
It is easy to observe block B1 by adding a pad directly on its output, without breaking the link between the two blocks. Controlling block B2 means being able to set a 0 or a 1 on its input, while remaining transparent to the link B1-B2. The logic functions for this purpose are a NOR-gate, which is transparent to a zero, and a NAND-gate, which is transparent to a one. In this way B2 can be controlled through these two gates.
Another implementation of this cell is based on pass-gate multiplexers performing the same function, but with fewer transistors than the NAND and NOR gates (8 instead of 12).
This simple improvement of observation and control is not enough to guarantee full testability of blocks B1 and B2. The technique has to be complemented with other testing techniques, depending on the internal structures of blocks B1 and B2.
8.7.2 Use Multiplexers
This technique is an extension of the previous one: multiplexers are used when the number of primary inputs and outputs is limited.
[Click to enlarge image]
Figure-8.13:
In this case the major penalties are the extra devices and the propagation delays introduced by the multiplexers. Demultiplexers are also used to improve observability. Using multiplexers and demultiplexers allows internal blocks to be accessed separately from each other, which is the basis of techniques that partition the circuit, or that bypass some blocks in order to observe or control others separately.
8.7.3 Partition Large Circuits
Partitioning large circuits into smaller sub-circuits reduces the test-generation effort. The test-generation effort for a general-purpose circuit of n gates is assumed to be proportional to somewhere between n^2 and n^3. If the circuit is partitioned into two sub-circuits, the test-generation effort is reduced correspondingly.
[Click to enlarge image]
Figure-8.14:
The example of the SN7480 full adder shows that exhaustive testing requires 512 tests (2^9), while a full test after partitioning into four sub-circuits, targeting SA0 and SA1 faults, requires only 24 tests. Logical partitioning of a circuit should be based on recognizable sub-functions, and can be achieved physically by incorporating facilities to isolate and control clock lines, reset lines and power-supply lines. Multiplexers can be used extensively to separate sub-circuits without changing the function of the global circuit.
8.7.4 Divide Long Counter Chains
Based on the same partitioning principle: counters are sequential elements that need a very large number of vectors to be fully tested. Partitioning a long counter means dividing it into sub-counters.
The full test of a 16-bit counter requires the application of 2^16 + 1 = 65537 clock pulses. If this counter is divided into two 8-bit counters, each half can be tested separately with 2^8 + 1 = 257 pulses, for a total of 2 x 257 = 514 pulses: the test time is reduced by a factor of about 128 (2^7). This is also useful if there are subsequent requirements to set the counter to a particular count for tests associated with other parts of the circuit (pre-loading facilities).
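A two-line check of this arithmetic (values computed, not assumed):

full = 2**16 + 1               # pulses to test one 16-bit counter
split = 2 * (2**8 + 1)         # two 8-bit halves tested one after the other
print(full, split, full // split)   # 65537 514 127 (~2**7 reduction)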
[Click to enlarge image]
Figure-8.15:
8.7.5 Initialize Sequential Logic
One of the most important problems in sequential logic testing occurs at power-on, when the initial state is random if there is no initialization. In this case it is impossible to start a test sequence correctly, because of the memory effects of the sequential elements.
[Click to enlarge image]
Figure-8.16:
The solution is to provide flip-flops or latches with a set or reset input, and to use them so that the test sequence starts from a known state.
Ideally, all memory elements should be settable to a known state, but in practice this can consume a lot of silicon area, and it is not always necessary to initialize all the sequential logic. For example, a serial-in serial-out register need only have its first flip-flop provided with an initialization input; after a few clock pulses the whole register is in a known state.
Tester override is sometimes necessary, and requires adding gates before a Set or Reset input so that the tester can override the initialization state of the logic.
8.7.6 Avoid Asynchronous Logic
Asynchronous logic uses memory elements whose state transitions are controlled by the sequence of changes on the primary inputs. There is thus no way to determine easily when the next state will be established. This is again a problem of timing and memory effects.
[Click to enlarge image]
Figure-8.17:
Asynchronous logic is faster than synchronous logic, since its speed is limited only by gate propagation delays and interconnections. The design of asynchronous logic is, however, more difficult than that of synchronous (clocked) logic, and must be carried out with due regard to the possibility of critical races (circuit behaviour depending on two inputs changing simultaneously) and hazards (momentary occurrence of a value opposite to the expected one).
Non-deterministic behavior in asynchronous logic can cause problems during fault simulation. Time dependency of
operation can make testing very difficult, since it is sensitive to tester signal skew.
8.7.7 Avoid Logical Redundancy
Logical redundancy exists either to mask a static-hazard condition, or unintentionally (as a design bug). In both cases, the value of a primary output cannot be made to depend on the value of the redundant node. This means that certain fault conditions on the node cannot be detected, such as the node SA1 in the function F.
[Click to enlarge image]
Figure-8.18:
Another inconvenience of logical redundancy is that a non-detectable fault on a redundant node can mask the detection of a normally-detectable fault, such as the SA0 of input C in the second example, masked by an SA1 of a redundant node.
8.7.8 Avoid Delay Dependent Logic
Automatic test pattern generators work in the logic domain, so they view delay-dependent logic as redundant combinational logic. In this case the ATPG sees an AND of a signal with its complement, and therefore always computes a 0 on the output of the AND-gate (instead of a pulse). Adding an OR-gate after the AND-gate output permits the ATPG to substitute a clock signal directly.
[Click to enlarge image]
Figure-8.19:
8.7.9 Avoid Clock Gating
When a clock signal is gated with any data signal (for example a load signal coming from a tester), a skew or any other hazard on that signal can cause an error at the output of the logic.
[Click to enlarge image]
Figure-8.20:
This is another consequence of asynchronous-style logic. Clock signals should be distributed in the circuit in keeping with a synchronous logic structure.
8.7.10 Strictly Distinguish Between Signal and Clock
This is another timing situation to avoid: the tester cannot be synchronized if one or more clocks depend on asynchronous delays (derived, for example, from the D-input of a flip-flop).
[Click to enlarge image]
Figure-8.21:
The problem is the same when a signal fans out to a clock input and a data input.
8.7.11 Avoid Self Resetting Logic
Self-resetting logic is closely related to asynchronous logic, since its reset input is independent of the clock signal.
Before the delayed reset occurs, the tester reads the set value and continues normal operation. If the reset occurs before the tester's observation, the value read is erroneous. The solution is to let the tester override the reset, for example by adding an OR-gate with an inhibit input driven by the tester. In this way the right response is presented to the tester at the right time.
[Click to enlarge image]
Figure-8.22:
8.7.12 Use Bused Structure
This approach is related, by its structure, to the partitioning technique. It is very useful for microprocessor-like circuits. Using this structure gives an external tester access to three buses, which go to many different modules.
[Click to enlarge image]
Figure-8.23:
The tester can then disconnect any module from the buses by putting its outputs into a high-impedance state. Test patterns can then be applied to each module separately.
8.7.13 Separate Analog and Digital Circuits
Testing analog circuits requires a completely different strategy than testing digital circuits. In addition, the sharp edges of digital signals can cause cross-talk problems on analog lines that run close to them.
[Click to enlarge image]
Figure-8.24:
If it is necessary to route digital signals near analog lines, the digital lines should be properly balanced and shielded. Also, in circuits such as analog-to-digital converters, it is better to bring out the analog signals for observation before conversion; for digital-to-analog converters, the digital signals should likewise be brought out for observation before conversion.
8.7.14 Bypassing Techniques
Bypassing a sub-circuit consists in propagating the sub-circuit's input signals directly to its outputs. The aim of this technique is to bypass one sub-circuit (part of a global circuit) in order to access another sub-circuit to be tested. The partitioning technique relies on the bypassing technique, and both use multiplexers, in two different ways.
With the bypassing technique, sub-circuits can be tested exhaustively by controlling the multiplexers in the whole circuit. To speed up the test, some sub-circuits can be tested simultaneously if their propagation paths run through disjoint or separate sub-circuits.
[Click to enlarge image]
Figure-8.25:
DfT Remarks
The techniques listed above are not an exhaustive catalogue of DfT, but a set of rules to respect as far as possible. Some of these guidelines aim at simplifying test vector generation, others at simplifying test vector application, and many others at avoiding timing problems in the design.
8.8 Scan Design Techniques
The design-for-testability guidelines presented above are a set of ad hoc methods for designing random logic with testability requirements in mind. Scan design techniques, in contrast, are a set of structured approaches to designing sequential circuits for testability.
The major difficulty in testing sequential circuits is determining the internal state of the circuit. Scan design techniques are directed at improving the controllability and observability of the internal states of a sequential circuit. In this way the problem of testing a sequential circuit is reduced to that of testing a combinational circuit, since the internal states are under control.
8.8.1 Scan Path
The goal of the scan path technique is to reconfigure a sequential circuit, for the purpose of testing, into a
combinational circuit. Since a sequential circuit is based on a combinational circuit and some storage elements, the
technique of scan path consists in connecting together all the storage elements to form a long serial shift register.
Thus the internal state of the circuit can be observed and controlled by shifting (scanning) out the contents of the
storage elements. The shift register is then called a scan path.
[Click to enlarge image]
Figure-8.26:
The storage elements can be D, J-K or R-S types of flip-flops, but simple latches cannot be used in a scan path. However, the structure of the storage elements differs slightly from the classical one. Generally, the selection of the input source is achieved with a multiplexer on the data input, controlled by an external mode signal. In our case this multiplexer is integrated into the D flip-flop, which is then called an MD flip-flop (multiplexed-data flip-flop).
The sequential circuit containing a scan path has two modes of operation: a normal mode, and a test mode which configures the storage elements into the scan path.
In the normal mode, the storage elements are connected to the combinational circuit, within the loops of the global sequential circuit, which then operates as a finite state machine.
In the test mode, the loops are broken and the storage elements are connected together as a serial shift register (the scan path), receiving the same clock signal. The input of the scan path is called scan-in and the output scan-out. Several scan paths can be implemented in the same complex circuit if necessary, each with its own scan-in input and scan-out output.
A large sequential circuit can be partitioned into sub-circuits, containing combinational sub-circuits, associated with
one scan path each. Efficiency of the test pattern generation for a combinational sub-circuit is greatly improved by
partitioning, since its depth is reduced.
Before applying test patterns, the shift register itself has to be verified, by shifting in all ones (111...11), then all zeros (000...00), and comparing the scanned-out values.
The method of testing a circuit with the scan path is as follows:
1. Set the test mode signal, so that the flip-flops accept data from the scan-in input
2. Verify the scan path by shifting test data in and out
3. Set the shift register to an initial state
4. Apply a test pattern to the primary inputs of the circuit
5. Set normal mode; the circuit settles, and the primary outputs of the circuit can be monitored
6. Activate the circuit clock for one cycle
7. Return to test mode
8. Scan out the contents of the registers, simultaneously scanning in the next pattern
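A minimal software model of this procedure may help fix the idea; the chain representation and the comb interface below are illustrative assumptions, not part of any scan standard.

def scan_shift(chain, bits_in):
    """Shift bits into the scan chain; return the bits shifted out."""
    out = []
    for b in bits_in:
        out.append(chain.pop())     # bit leaving at the scan-out end
        chain.insert(0, b)          # bit entering at the scan-in end
    return out

def scan_test(comb, n_ff, state_in, primary_in):
    chain = [0] * n_ff
    scan_shift(chain, state_in)                    # steps 1-3: load state
    outputs, next_state = comb(primary_in, chain)  # steps 4-5: settle, observe
    chain[:] = next_state                          # step 6: one normal clock
    captured = scan_shift(chain, [0] * n_ff)       # steps 7-8: unload state
    return outputs, captured

# Toy machine: one primary input x, two scanned flip-flops s.
comb = lambda x, s: ([x ^ s[0]], [s[1], x & s[0]])
print(scan_test(comb, 2, [1, 0], 1))   # primary outputs and captured state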
8.8.2 Boundary Scan Test (BST)
Boundary Scan Test (BST) is a technique combining scan-path and self-testing techniques to address the problem of testing boards carrying VLSI integrated circuits and/or surface-mounted devices (SMD).
Printed circuit boards (PCB) are becoming so dense and complex, especially with SMD circuits, that most test equipment cannot guarantee good fault coverage.
[Click to enlarge image]
Figure-8.27:
BST consists in placing a scan cell (one stage of a shift register) adjacent to each component pin and interconnecting the cells to form a chain around the border of the circuit. The BST circuits contained on one board are then connected together to form a single path through the board.
The boundary scan path is provided with serial input and output pads and appropriate clock pads which make it
possible to :
Test the interconnections between the various chips
Deliver test data to the chips on board for self-testing
Test the chips themselves with internal self-test
[Click to enlarge image]
Figure-8.28:
The advantages of boundary scan techniques are as follows:
No need for complex testers in PCB testing
The test engineer's work is simplified and made more efficient
The time spent on test pattern generation and application is reduced
Fault coverage is greatly increased.
IEEE 1149 is a family of testability bus standards, defined under the auspices of the IEEE Standards Organization by the Joint Test Action Group (JTAG), an international committee of European and American IC manufacturers formed in 1986 at Philips. The "Standard Test Access Port and Boundary-Scan Architecture", IEEE 1149.1, accepted by the IEEE standards committee in February 1990, is the first member of this family; JTAG has led its technical development and promoted its use by all sectors of the electronics industry. Several other standards in the family are under development and have been suggested as drafts to the technical committee of the IEEE 1149 standard.
This chapter edited by D. Mlynek
Chapter 9
FUZZY LOGIC SYSTEMS
Systems Considerations
Fuzzy Logic Based Control Background
Integrated Implementations of Fuzzy Logic Circuits
Digital Implementations of Fuzzy Logic Circuits
Analog Implementations of Fuzzy Logic Circuits
Mixed Digital/Analog Implementations of Fuzzy Systems
CAD Automation for Fuzzy Logic Circuits Design
Neural Networks Implementing Fuzzy Systems
1 Systems Considerations
The use of fuzzy logic is rapidly spreading in the realm of consumer products design in order to satisfy the following
requirements: (1) to develop control systems with nonlinear characteristics and decision making systems for
controllers, (2) to cope with an increasing number of sensors and exploit the larger quantity of information, (3) to
reduce development time, (4) to reduce costs associated with incorporating the technology into the product. Fuzzy
technology can satisfy these requirements for the following reasons.
Nonlinear characteristics are realized in fuzzy logic by partitioning the rule space, by weighting the rules, and by the nonlinear membership functions. Rule-based systems compute their output by combining results from different parts of the partition, each part being governed by separate rules. In fuzzy reasoning, the boundaries of these parts overlap, and the local results are combined by weighting them appropriately. That is why the output of a fuzzy system is a smooth, nonlinear function.
In decision-making systems, the target of modeling is not a control surface but the person whose decision-making is to be emulated. This kind of modeling is outside the realm of conventional control theory. Fuzzy reasoning can tackle it easily, since it can handle qualitative knowledge (e.g. linguistic terms like “big” and “fast”, and rules of thumb) directly. In most applications to consumer products, fuzzy systems do not directly control the actuators, but determine the parameters to be used for control. For example, they may determine the washing time in washing machines, decide whether it is the hand or the image that is shaking in a camcorder, compute which object is supposed to be in focus in an auto-focus system, or determine the optimal contrast for watching television.
A fuzzy system encodes knowledge at two levels: knowledge which incorporates fuzzy heuristics, and the knowledge
that defines the terms being used in the former level. Due to this separation of meaning, it is possible to directly encode
linguistic rules and heuristics. This reduces the development time, since the expert’s knowledge can be directly built in.
Although the developed fuzzy system may have complex input-output characteristics, as long as the mapping is static
during the operation of the device, the mapping can be discretized and implemented as a memory lookup on simple
hardware. This further reduces the costs involved in incorporating the knowledge into the device.
2 Fuzzy Logic Based Control Background
2.1 Mamdani-Type Controllers
E.H. Mamdani is famous in fuzzy circles for the work on fuzzy logic he did in the 1970s, work that is still topical now. He extended the application field of fuzzy logic theory to technical systems, whereas most scientists thought its applications were restricted to non-technical fields (such as human sciences, trade, jurisprudence, etc.). He first suggested that a control task performed by a human operator could be carried out just as well by fuzzy logic, after translating the operator's experience into qualitative linguistic terms ([Mam73]-[Mam77]). Mamdani's method then gave rise to many engineering applications, especially industrial fuzzy control and command systems.
Mamdani introduced the fuzzification/inference/defuzzification scheme and used an inference strategy generally known as the max-min method. This inference type links input linguistic variables to output ones in accordance with the generalized modus ponens, using only the MIN and MAX functions (as T-norm and S-norm (or T-conorm) respectively). It allows approximate reasoning (or interpolative inference) to be achieved.
Consider a set of inference rules in the form of a fuzzy associative memory (FAM), represented in an inference matrix or table, together with the fuzzy sets and respective membership functions that have been attributed to each variable (fuzzification). Figure 2.1 represents a case where 2 rules are activated, involving 2 input variables (x and y) and one output variable (r). Let x be a measurement of x(t) at time t, and y a measurement of y(t) at the same time. Consider the fuzzy sets A1, A2, B0 and B1, whose respective membership functions µA1(x), µA2(x), µB0(y) and µB1(y) take positive values for x and y. The activated inference rules are, for example, as follows:
if (x = A1 AND y = B0) then r = C0
else if (x = A2 AND y = B1) then r = C1
They can also be expressed as follows:
A statement like x = A1 is true to a degree µA1(x), and a rule is activated when the combination of all the membership grades (or truth degrees) in its condition part (premise) takes a strictly positive value. Several rules may be activated simultaneously. The max-min method realizes the AND operators of the rule conditions by taking the minimum of the respective membership functions; premises can also include OR operators, realized by taking the maximum of the membership functions, but this is rarely the case in control systems. The implications (connective then) are realized by truncation (or clipping) of the output sets: for each point, the minimum is taken between the membership grade resulting from the rule condition (fig. 2.1: µA2(x) and µB0(y)) and the membership function of the respective output fuzzy set (fig. 2.1: µC1(r) and µC0(r)):

µ'Ci(r) = min( wi, µCi(r) ),   where wi is the truth degree of the condition of rule i
The rules are finally combined using the connective else, which acts as the OR operator and is interpreted as the maximum operation, for each possible value of the output variable (r in fig. 2.1), over the defined fuzzy sets (n sets):

µres(r) = max( µ'C1(r), ..., µ'Cn(r) )
With the operations defined above, it is then possible to give an algorithm for fuzzy reasoning (in order to achieve a control action, for example).
[Click to enlarge image]
Figure-2.1: Max-min method for 2 rules involving 2 input variables and 1 output variable
The max-min method finally requires a defuzzification stage, which is generally performed using the centre of gravity method. For the above example it yields the crisp value r*, computed from the membership function µres resulting from the max-min method:

r* = ( ∫ r · µres(r) dr ) / ( ∫ µres(r) dr )
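The whole fuzzification/max-min inference/centre-of-gravity pipeline fits in a few lines of code. In the sketch below the triangular membership functions and the two rules are illustrative assumptions, not the exact sets of Figure 2.1.

def tri(u, a, b, c):
    """Triangular membership function peaking at b on the interval [a, c]."""
    return max(min((u - a) / (b - a), (c - u) / (c - b)), 0.0)

def infer(x, y, n_points=1000):
    # Rule truth degrees: AND realized by MIN
    w1 = min(tri(x, 0, 2, 4), tri(y, 0, 3, 6))   # if x=A1 AND y=B0 then r=C0
    w2 = min(tri(x, 2, 5, 8), tri(y, 3, 6, 9))   # if x=A2 AND y=B1 then r=C1
    num = den = 0.0
    for k in range(n_points + 1):
        r = 10.0 * k / n_points                  # output universe [0, 10]
        # implication by clipping (MIN), combination of rules by MAX
        mu = max(min(w1, tri(r, 0, 2, 5)), min(w2, tri(r, 4, 7, 10)))
        num += r * mu
        den += mu
    return num / den                             # centre of gravity

print(infer(3.0, 4.0))   # one crisp control value from two fuzzy rules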
2.2 Characteristics
Several mathematical properties make the MIN and MAX operators well suited (simple and efficient) to fuzzy inference, as described notably in [Dub80] and [God94]. There are, however, several other methods, based on different ways of realizing the OR and AND operators, whose purpose is to improve some mathematical properties or numerical implementation characteristics.
The max-prod method, for example, is similar to the max-min method, except that all implications in the rules (the then operations) are realized by a product instead of a minimization. The truth values of the rule conditions are used to scale the corresponding output sets uniformly, instead of clipping them at a certain level. This preserves the information contained in the shape of these sets, which is partly lost with the max-min method. The product is moreover simpler and faster to execute than the minimum operator in software implementations, and allows some simplifications in numerical realizations of inferences (since it can deal with analytic expressions instead of comparing pairs of stored points).
Another common method is the sum-prod method, which uses the arithmetic mean and the product to realize the OR and AND operators respectively. Unlike the MAX operator, which selects only the maximum values, the sum takes all the involved sets into account and conserves part of the information contained in their shapes.
These different methods are described and compared in [Büh94]. From this analysis emerges the fact that they lead to very similar input/output characteristics in the case of a single input variable, when used with triangular or trapezoidal output sets. With several input variables, the max-min method produces non-linear characteristics with strong discontinuities, while the sum-prod method produces non-linear characteristics with smoother discontinuities. Nevertheless, the choice of a specific method is mainly influenced by the way it is to be implemented. This favours the max-min method for hardware implementations, because the MAX and MIN operators are then the easiest to implement. These two operators are moreover the most robust realizations of T- and S-norms, according to the authors of [Ngu93]. Since they are the most reliable when membership grades have imprecise and noisy values, they are well suited to fuzzy hardware with questionable accuracy. The max-min method is finally well suited to fuzzy rule-based systems when there is no precise model. It generally leads, in a simple way, to consistent rules, as can be observed in practical applications.
The choice of an inference method is nevertheless of great importance for one-rule inferences, for example to select one among several candidates or to choose an optimal solution (a frequent case, especially in non-technical fields). The operators must then be chosen with care, because they directly influence the evaluation criterion and consequently the final decision.
Finally, one important aspect of Mamdani's method is that it is essentially heuristic, and it can sometimes be very difficult to express an operator's or engineer's knowledge in terms of fuzzy sets and implications. Moreover, such knowledge is often incomplete and episodic rather than systematic. There is in fact no specific methodology for deriving and analysing the rule base of a fuzzy inference system (and consequently no exhaustive procedure for choosing optimal numbers of rules and sets, shapes, operators, ...). Although their principle seems rather simple, fuzzy systems include many parameters and can thus exhibit a great variety of complicated characteristics. Problems may occur when the rules have to describe a process that is too complex, or have to deal with a large number of variables. It can then be very difficult to define a sufficient set of coherent rules, and the danger arises of having too few rules, or conflicting ones. Several methods are used to optimise inference rules and fuzzy sets, more or less refined and sometimes quite complicated (typical ones are the gradient method, the least-squares method, simulated annealing, neural networks, etc.). They give rise to adaptive fuzzy systems whose parameters are tuned to the conditions of a specific application. The highly parametric nature of fuzzy systems makes automatic adaptation rather difficult.
2.3 Application field
Mamdani's method is currently and effectively applied to process control, robotics and other expert systems. It is especially well suited to reproducing an operator's control or command actions. It leads to good results, often close to the operator's own, while removing the risk of human error. It has thus been used successfully in the control of several plants, such as those in the chemical, cement or steel industries.
Mamdani-type control is simpler than most standard approaches and requires a much shorter development cycle when linguistic rules can be expressed easily (because there is no need to develop, analyse and implement a mathematical model). In several cases it is just as efficient or more so, especially when no precise model exists, for example when the process to control is governed by non-linear laws or includes poorly known parameters or disturbances. When a mathematical model contains non-linear terms, these are usually linearized and simplified under the assumption of small error signals, whereas a non-linear fuzzy method often allows larger error ranges to be controlled.
The non-linearity of Mamdani-type control can moreover have a favourable influence on transient phenomena. Consequently it can sometimes supplant classic control in providing fast responses. When the response speed of a conventional controller is increased, the overshoot also increases (in position control, for example). Fuzzy controllers with highly non-linear characteristics generally give lower overshoot before the settling time than conventional PID control, but small oscillations often remain after settling. This oscillatory behaviour can become very difficult to restrain, and fuzzy controllers are sometimes combined with conventional ones (in a cascade configuration, for example) to exploit the advantages of both. However, fuzzy controllers can also have perfectly linear characteristics and replace standard controllers, which can sometimes be useful to provide certain features of fuzzy controllers.
Fuzzy control is attractive in cases where parameters vary with time and are easily expressed in linguistic terms. It is indeed easier to rewrite one or several fuzzy rules than to recalculate the mathematical equations of the new model needed to adapt a classic controller. However, classic controllers have a smaller number of parameters to adjust, and these parameters can, for example, be supplied by a fuzzy system that regulates them according to the fluctuating features of the controlled process.
2.4 Defuzzification Strategy
The common centre of gravity defuzzification method requires an amount of calculation that is prohibitive for many real-time applications with software implementations. Its calculation can, however, be simplified when associated with the sum-prod method [Büh94]. The computation of the centre of gravity can exploit the high speed afforded by VLSI when integrated on an IC, but the resulting circuit is quite complex. Simpler defuzzification types are often used, for example the mean-of-maxima or height methods [Dri93],[Büh94],[God94].
2.5 Takagi-Sugeno's method
A Sugeno-type controller differs from a Mamdani-type controller in the nature of its inference rules, generally called Takagi-Sugeno rules in this case. Whereas Mamdani's method uses inference rules that deal only with linguistic variables, the rules of Takagi-Sugeno lead directly to real values that are not membership functions but deterministic crisp values [Tak83],[Tak85]. This method uses fuzzy sets only for the input variables, and there is no need for any defuzzification stage. Whereas the antecedents still consist of logical functions of linguistic variables, the output values result from standard functions of the input variables (real crisp functions). In most cases, only linear functions are used. A variable can appear in such a function and not in the corresponding condition, or vice versa. Each function belongs to the consequence part of a rule and is taken into account when the respective condition has a positive truth degree. This degree results from the membership factors of the input fuzzy sets that the rule condition deals with. The final output value is calculated as the weighted average of all the linear functions, the weights being the truth degrees of the respective rule conditions.
[Click to enlarge image]
Figure-2.2: Takagi-Sugeno's method for 2 rules involving 2 input variables and 1 output variable
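A minimal sketch of first-order Takagi-Sugeno inference is shown below; the membership shapes and the coefficients of the two linear consequence functions are illustrative assumptions. Note the absence of any defuzzification stage: the crisp output is just a weighted average of the consequence functions.

def tri(u, a, b, c):
    """Triangular membership function peaking at b on the interval [a, c]."""
    return max(min((u - a) / (b - a), (c - u) / (c - b)), 0.0)

rules = [
    # (condition truth degree, consequence function f(x, y)); both assumed
    (lambda x, y: min(tri(x, 0, 2, 4), tri(y, 0, 3, 6)),
     lambda x, y: 1.0 + 0.5 * x + 0.2 * y),
    (lambda x, y: min(tri(x, 2, 5, 8), tri(y, 3, 6, 9)),
     lambda x, y: 4.0 - 0.3 * x + 0.1 * y),
]

def ts_output(x, y):
    """Weighted average of the consequence functions; assumes at least
    one rule fires (otherwise the denominator would be zero)."""
    weights = [cond(x, y) for cond, _ in rules]
    values = [f(x, y) for _, f in rules]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

print(ts_output(3.0, 4.0))   # crisp output, no defuzzification stage needed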
Control rules are derived in the following way. The antecedent part results from a specific fuzzification achieved according to a human operator's or engineer's knowledge; that is, the fuzzy sets and the logical functions making up the premises are defined. The input space is thus divided into a certain number of fuzzy subsets, i.e. a certain number of fuzzy rules. Setting up the consequence part of the rules then requires a certain amount of numerical data (corresponding to input and output variables) coming from the operation to be modelled (fig. 2.3). The coefficients of each linear function are identified from the analysis of these data, so as to minimize the difference between the outputs of the original system and those of the model (the fuzzy engine). This optimisation is achieved by a weighted linear regression over a certain amount of input data (the weights being the truth degrees of the fuzzy conditions).
Finally, the system identification is a function of the consequence parameters (the coefficients of the linear functions) as well as of the premise parameters (input sets) and the fuzzy partition of the input space. The identification of the latter has no closed-form solution, since it is a combinatorial problem, and is generally obtained by heuristic search. An iterative algorithm for parameter and structure identification can be used efficiently (fig. 2.4) [Sug85],[Tak85]. The resulting fuzzy model generally gives much better results than a statistical model [Tak83].
[Click to enlarge image]
Figure-2.3: Takagi-Sugeno's method: identification of a fuzzy logic inference engine
[Click to enlarge image]
Figure-2.4: Takagi-Sugeno's method: outline of an identification algorithm [Sug85],[Tak85]
2.6 Characteristics
It is not possible to derive efficient control rules by Mamdani's method when it is too difficult to translate a human operator's or engineer's knowledge exactly into linguistic terms. The adaptive Takagi-Sugeno method is efficient in such cases, provided that inference rules can be derived from the analysis of appropriate numerical data. Another advantage of this method appears in complicated cases where many different variables play a part in a process: it then leads more reliably to consistent rules than Mamdani's method.
When fuzzy reasoning is used to describe processes, methods such as Mamdani's are not so powerful, partly because of their nonlinearity. The use of linear relations in the consequences makes Takagi-Sugeno rules easy to handle as an efficient mathematical tool for fuzzy modelling. It is, however, not possible to cascade inference rules without performing a new fuzzification step on the crisp output values.
The implementation of Takagi-Sugeno rules is simplified because there is no need for a defuzzification stage: output values are simply calculated from input values, using a set of weighted linear functions. One needs, however, to be able to collect a certain amount of data (sometimes over a rather long period, for time-varying processes) and then to find good coefficients with the least-squares method. Measurements and data analysis often require a great deal of time, and it is furthermore not always possible to find satisfactory coefficients. The Takagi-Sugeno method actually becomes difficult or even impossible to apply when the control has strongly non-linear characteristics (for example a large hysteresis loop), or characteristics that change with time.
A method such as simulated annealing could be used to shorten the iterative sequence of tests that achieves the coefficient and structure identification. The approximation of a measured characteristic, or of a computed linear regression, could also be achieved efficiently using a neural network; the structure identification would then result from a supervised learning phase instead of a long and unwieldy iterative search.
2.7 Application field
The rules of Takagi-Sugeno are mainly used for process modelling. Fuzzy controllers are obtained by regarding an operator's control action as a process to be modelled, and thus by collecting a large amount of data while the operator is working. A typical application of the Takagi-Sugeno method is described in [Sug85], pp. 125-138. Sugeno experimented with this method, with good results, on the automated parking of a car in a garage. Fuzzy control rules are derived from observation and modelling of a human driver's parking action. The method is well suited to this situation, which is easier to observe than to express in linguistic terms: arm and leg actions are indeed easier to measure than to verbalise, like most gestures acquired through heuristic experience, or instinctive ones, rather than those issuing from precise reasoning.
2.8 Sugeno-Type controllers
Sugeno's method is a special case of Takagi-Sugeno's, its particularity being that all the output membership functions are precise values. The consequence functions are replaced by constants, which are weighted by the condition truth degrees to give the final output value. The fuzzy sets formed by these precise values are generally called singletons, or sets of Sugeno (fig. 2.5). The method can also be regarded as a particular case of Mamdani's method in which the output sets are vertical, zero-width, non-overlapping lines.
[Click to enlarge image]
Figure-2.5: Example of singletons (sets of Sugeno)
2.9 Characteristics
The main advantage of using singletons is the great simplicity of implementing the consequence part. They can be used with Mamdani's method to simplify the defuzzification stage considerably, its task being reduced to the calculation of a weighted average over a restricted set of crisp values. Another defuzzification type, called the maxima method, is sometimes used (cf. [Mam74bis],[Dri93]); in this case each output crisp set corresponds to a specific action, and only the one with the maximum weight is selected (or an action midway between two maximum values).
The use of singletons has no adverse effect on the domain of the output variable, which can be the same as with triangular or trapezoidal output sets when using the centre of gravity method. The nonlinearity of a controller characteristic can be shaped by the distribution of these sets (but no longer by their shapes). Describing some processes can, however, become harder than with the more complex sets of Mamdani's method or the linear functions of the Takagi-Sugeno method: restricting the consequence part to constant functions no longer allows complicated relationships between input and output variables to be described in a rather simple format. The method then sometimes leads to a worse minimisation of the output error between the real process and the fuzzy model, and the structure identification (fuzzy partition of the input space) has to be much more refined to obtain satisfactory results in complicated cases.
3 Integrated Implementations of Fuzzy Logic Circuits
3.1 Fuzzy dedicated circuits
The recent advances in fuzzy logic theory have brought rule-based algorithms to a growing field of applications.
Several implementation approaches have been proposed during the last ten years, especially in Japan, where fuzzy
systems have proliferated. Fuzzy control has proved that it can deliver good performance with short design times in a
wide variety of low-complexity real-world applications. Many application-specific programs have been developed to
implement fuzzy operations on standard digital computers, and most processor manufacturers provide software
environments to develop and simulate fuzzy applications on their microcontrollers.
The design of fuzzy dedicated integrated circuits is nevertheless of great interest, because of the increasing number
of fuzzy applications requiring highly parallel, high-speed fuzzy processing. Such circuits attempt to give concrete
expression to the idea of "fuzzy computers" (sometimes called computers of the sixth generation), which deal with
analog values or some digital representation of them. Fuzzy processors are designed to optimise fuzzy logic
functions (cf. §3.2) with respect to implementation size and execution speed. Since practical systems often require a
great number of rule evaluations per second, a huge amount of real-time data processing is necessary and the speed
of fuzzy circuits is of prime importance. Their architectures are generally suited to the structure of approximate-
reasoning and decision-making algorithms and usually include three distinct parts, performing fuzzification,
inference and defuzzification respectively. Fuzzy controllers thus consist, on one hand, of a knowledge base, which
contains the rules and the membership function outlines, as well as the other configuration parameters of the system.
On the other hand stands an inference mechanism (based on interpolative reasoning) interfacing a real process in a
feedback-loop configuration.
Figure-3.1: General structure of fuzzy logic dedicated processors for approximate reasoning
The structure of fuzzy dedicated circuits is characterized by the number and shape range of input and output
variables, the number of rules they can evaluate simultaneously, the type(s) of inference (size of premises, operators,
consequences, ...), the type(s) of defuzzification method, and so on. Their performance is evaluated according to their
processing speed (the number of fuzzy logic inferences per second, FLIPS) as well as their precision (error and noise
generation in analog circuits, number of bits representing fuzzy values in digital ones). Fast response is required for
non-linear functions such as MIN and MAX, whose output signals can be subject to sudden discontinuities. These
functions are however piecewise linear, so accuracy is also quite important.
Fuzzy chips are mainly used in expert systems and in the control and command field to achieve real-time
performance. They are however less efficient for applications that deal with a large amount of data, because of their
limited I/O. It can thus be interesting to design several compatible circuits, identical or different, dedicated for
example to inference rules and to defuzzification respectively. In this manner it is possible to connect them together
in several different ways, using some of them, for example, to perform parallel fuzzifications and inferences, and
connecting them to a defuzzification circuit [Hir93]. With a large number of input variables it may be useful to split
the control system into several cascaded or superposed units, which simplifies the inference engine [Büh94].
Fuzzy software is useful when an application can be modelled, in order to simulate and calculate in advance a
multidimensional response characteristic. It yields the parameters of an optimal characteristic, from which a fuzzy
dedicated circuit can be designed or programmed. Some of these parameters (fuzzy sets and inference rules) then
only have to be adjusted in the real implementation. In some simple cases, the numeric response characteristic can be
stored as a reference rule table in a multidimensional associative memory, which then provides response values for
real-time operation without any further inference calculations. The stored values are provided by fuzzy software
(when a model exists), by measuring an operator's actions, or by an adaptive system with a retroactive learning
scheme (which comes close to the principle of fuzzy dedicated neural networks) [Chan93].
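The rule-table idea can be sketched as follows (an assumption-level illustration: controller stands for any model of the fuzzy controller, numpy is assumed, and the names are ours).

    import numpy as np

    def build_table(controller, x_grid, y_grid):
        # off-line: evaluate the full response characteristic on a grid
        return np.array([[controller(x, y) for y in y_grid] for x in x_grid])

    def lookup(table, x_grid, y_grid, x, y):
        # on-line: a table access replaces all inference calculations
        i = int(np.clip(np.searchsorted(x_grid, x), 0, len(x_grid) - 1))
        j = int(np.clip(np.searchsorted(y_grid, y), 0, len(y_grid) - 1))
        return table[i, j]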
When no model exists, a fuzzy processor has to be programmable, because its parameters are established by iterative
testing on the real implementation, not by simulation. This process requires a lot of tests, is time-consuming, and
may furthermore not reach an optimal solution. Since no simulation of the system's dynamic behaviour is performed,
the danger of unstable operating states arises in control applications. The configuration flexibility of integrated fuzzy
systems requires a great storage capacity (digital circuits) or tunable components (analog circuits) to deal with
variable numbers and shapes of membership functions as well as variable numbers and sizes of rules. Since the shape
of the membership functions is generally less important than their degree of overlap, the horizontal position of the
sets should always be tunable, whereas complicated shapes are often not essential.
Figure-3.2: Fuzzy logic circuits
3.2 Basic fuzzy logic functions and their relations
The fuzzy operators presented below have been introduced to deal with fuzzy values and operate on membership
grades µ(x). A fuzzy operator applied to two fuzzy sets A and B defines a new membership function; the real
variable x has been omitted in the expressions below.
Conventional operators such as the algebraic sum (+), the algebraic difference (-) and the algebraic product (.) are
commonly used in fuzzy reasoning systems. The first nine fuzzy operators are the ones most commonly used in
hardware system design to implement fuzzy information processing. They moreover have the particularity that they
can all be expressed by means of only the bounded difference (⊖) and the algebraic sum (+)
[Yam86],[Zhij90],[Sas90]. There have been several approaches using these two functions to design fuzzy units,
which provides attractive perspectives for CAD automation and semicustom circuits. Other operators, such as the
bounded sum (⊕) and the bounded product (⊗) [Lem94], also allow any fuzzy formula to be expressed when
combined together. These are associative and commutative, but not distributive over each other. The
non-distributivity of the bounded operators leads to long and complicated manipulations when they are used to
solve fuzzy formulae, and it is far from obvious how to substitute or eliminate terms. Fuzzy formulae unfortunately
cannot be reduced as extensively or as simply as Boolean formulae.
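These relations can be checked in a few lines of code. The identities below are standard (the function names are ours): with values in [0,1], the complement, MIN, MAX and bounded sum all reduce to bounded differences and algebraic sums.

    def bdiff(a, b):            # bounded difference: max(0, a - b)
        return max(0.0, a - b)

    def f_not(a):               # complement: 1 (-) a
        return bdiff(1.0, a)

    def f_min(a, b):            # MIN(a, b) = a (-) (a (-) b)
        return bdiff(a, bdiff(a, b))

    def f_max(a, b):            # MAX(a, b) = a + (b (-) a)
        return a + bdiff(b, a)

    def f_bsum(a, b):           # bounded sum min(1, a+b) = 1 (-) ((1 (-) a) (-) b)
        return bdiff(1.0, bdiff(bdiff(1.0, a), b))

    assert f_min(0.3, 0.8) == 0.3 and f_max(0.3, 0.8) == 0.8
    assert f_bsum(0.7, 0.6) == 1.0 and f_not(0.2) == 0.8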
Some other fuzzy operators, such as the symmetric difference, the drastic difference, the drastic sum or the drastic
product, are sometimes used to resolve fuzzy equations.
4 Digital Implementations of Fuzzy Logic Circuits
4.1 Digital approach
The VLSI digital implementation of fuzzy logic systems offers several advantages stemming from the sound
knowledge of digital circuit design and technology. Several mature CAD tools allow relatively easy design
automation (synthesis & simulation), consequently reducing the time and cost of development. The automatic
regeneration of logic levels brings high noise immunity and low sensitivity to the variance of transistor
characteristics. This provides accurate
and reliable data and signal processing. Binary data can easily be stored, which allows programmable and multistage
fuzzy processing to be realized. Complex representations of fuzzy vectors and parallel structures are however
required to obtain accurate and fast processing. Digital implementation of common fuzzy operations unfortunately
leads rapidly to complicated, enormous VLSI circuits. The density and speed of these circuits are nevertheless
continually increasing with technological advances, so they will become more and more efficient for implementing
fuzzy logic systems.
4.2 Digital fuzzy circuits features
Digital fuzzy processors are generally designed for multipurpose applications in order to interest a maximum of
potential customers. They should thus implement a large and varied set of fuzzy operators, membership functions
and inference rules. This makes them rather efficient for a large range of applications, provided that appropriate
programming is possible (which supposes an appropriate internal or external memory). Combined with an
appropriate object-oriented programming environment, linguistic rules derived from a human expert can be
translated directly into an implementation on a chip.
The first hardware implementations of fuzzy logic inference engines were the digital circuits designed by Togai and
Watanabe [Tog86],[Tog87],[Wat90]. ASICs implementing specific architectures and specialized instructions (MIN
& MAX) exhibit much higher processing speed than regular microprocessors. The main design choices concern:
operators: MIN and MAX implementation;
architectures: fuzzy microcontroller (embedded system with A/D & D/A converters, fuzzy arithmetic logic unit
(FALU) and specific memories, ...) versus fuzzy coprocessor; sequential processing, whose speed can be enhanced
by parallel, pipelined [Nak93], systolic, RISC or SIMD architectures;
fuzzification and defuzzification strategies.
For applications where the restricted number of pins of a chip can be inappropriate (such as fuzzy databases, i.e.
ambiguous data or fuzzy relations between crisp data), or when other conventional digital processing is required
(flexible system), a fuzzy coprocessor is used rather than a stand-alone chip.
4.3 Digital Fuzzy Processing
Analog fuzzy values must be converted into strictly binary signals before being processed by standard digital
circuits. On one hand the analog input signals must be quantized through A/D converters, and on the other hand the
membership functions must be quantized to obtain their digital representations. Fuzzy sets then become storable, but
only in the guise of staircase functions. The combination of these two round-off effects can deteriorate the fuzzy
processing if the fuzzy values are not represented with a sufficient number of bits. There is however a trade-off
between precision and size (or speed), since these are proportional to the number of bits.
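A small sketch (ours) of the second round-off effect: storing a membership grade on n bits snaps it onto a staircase of 2^n - 1 steps.

    def quantize(mu, n_bits):
        levels = (1 << n_bits) - 1          # e.g. 15 steps for 4 bits
        return round(mu * levels) / levels

    print(quantize(0.62, 4))    # 0.6    (9/15: coarse staircase)
    print(quantize(0.62, 8))    # ~0.62  (158/255: finer representation)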
The inputs are furthermore sampled at the clock frequency of the digital circuits, which leads to sampling effects: the
resulting control is non-continuous (or pseudo-continuous) [Büh94, §7.3].
4.4 Different approaches to digital ICs
Togai's FC110: fuzzy accelerator, coprocessor
Watanabe: RISC approach
THOMSON's WARP (Weight Associative Rule Processor): cell-to-cell mapping approach, automatic synthesis [Pag93]
SIEMENS' coprocessor (SAE81C99): optimized memory organisation [Eich92]
Sasaki: SIMD and logic-in-memory structure [Sas93]
OMRON's FP-3000, FP-1000 & FP-5000
NeuraLogix' controllers
INFORM's FUZZY-166 Processor
4.6 Application fields and future trends
Applications include control, expert systems, robots, image recognition, diagnosis, databases (interfaced with digital
circuits), information retrieval, and so on. High prices must however be reduced to make these processors more
attractive.
Some less common hardware implementations, such as pulse-width modulation [Ung93] or superconducting
processors, have also been realized, and are not excluded from future developments, depending on technological
evolution.
5 Analog Implementations of Fuzzy Logic Circuits
5.1 Analog fuzzy circuits features
Analog circuits present several advantages over digital ones, especially regarding processing speed, power
dissipation and functional density. They can moreover perform continuous-time processing and have the particularity
of being well matched to sensors, actuators and all other analog signals. They are therefore a natural choice for
dealing with fuzzy values, which are analog by nature. Some continuous representations of symbolic membership
functions and some non-linear fuzzy operations can easily be synthesised by exploiting transistor characteristics. No
A/D or D/A converters are needed in a real system, provided that no specific digital signal processing is required.
Analog circuits can then supplant digital controllers for applications requiring low-cost, low-consumption, compact
and high-speed stand-alone chips. They suffer nevertheless from the lack of reliable memory cells. They are
consequently not well suited to pipeline structures and have very restricted programmability. Fortunately, the nature
of fuzzy variable systems requires extensive parallelism, which makes analog circuits well suited to performing
numerous high-speed inferences and also limits the problem of error accumulation.
Analog controllers can achieve real-time fuzzy reasoning with a large number of fuzzy implications, especially when
no high-level accuracy is required. They are well suited to vague and imprecise models, for example tasks that
interface with human senses (eye, tactile nerves, ear, ...) or replace human reasoning (pattern recognition, reflex
processes, ...). Accuracy is actually not always so important in fuzzy systems, since there is no thorough
mathematical background with which to define precise and exhaustive fuzzy methods. Imprecise but adjustable
analog devices are consequently suitable in a great number of cases (tunable membership functions are then needed
to optimize the performance). They are nevertheless much less flexible and adaptable than digital circuits, which are
programmable, and they must be designed according to the structure of a specific application. A basic
programmability is afforded when some external analog parameters can be adjusted, or when binary inputs control
internal switches.
The choice of the common fuzzy operators (§3.2) is not exhaustive, and other methods could just as well be
implemented, since there is no mathematical proof that any of them is really optimal. Thus other operators can be
chosen instead of the common MIN and MAX, in order to be performed efficiently with analog circuits. Dense, fast
and accurate hardware operators can give very good practical results, even if they are theoretically not optimal. A
novel way of evaluating the condition part of fuzzy rules has thus been introduced in [Land93]. The degree of
membership of an input vector in some fuzzy subspace is defined using a measure of the distance between this
vector and a central point of the subspace. MIN and MAX operators are no longer implemented, and very dense,
high-speed analog hardware may be realised (in current mode).
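Purely as an illustration of this idea (the exact distance measure and membership form used in [Land93] may well differ), the degree of membership of an input vector in a fuzzy subspace could be derived from its distance to the subspace centre:

    import math

    def subspace_grade(x, centre, width):
        # grade decreases with the distance between the input vector and
        # the central point of the subspace (inverse-quadratic form assumed)
        d2 = sum(((xi - ci) / width) ** 2 for xi, ci in zip(x, centre))
        return 1.0 / (1.0 + d2)

    print(subspace_grade([0.1, 0.2], centre=[0.0, 0.0], width=0.5))  # ~0.83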
Analog signals are represented either by voltages (voltage mode, §5.2) or by currents (current mode, §5.4). Stable
and low-noise analog technologies (n-well CMOS, BiCMOS) must be used in order to design analog circuits with
sufficient accuracy over a wide frequency range. Reliable CAD tools for automatic design, as well as fast verification
and simulation tools, are required for effective design.
5.2 Voltage-mode
5.2.1 Yamakawa's approach
Voltage mode is attractive because it makes it easy to distribute a signal to various parts of a circuit. Non-linear
operators such as MIN, MAX and truncation are quite easy to implement in voltage mode. Multiple-input MIN &
MAX circuits constructed with bipolar transistors are represented in figures 5.1 and 5.2 and are called
emitter-coupled fuzzy logic gates. These basic non-linear gates present good characteristics and robustness [Yam93].
Such circuits are impractical with MOS transistors, which cause an unacceptable error associated with the transition
regions in which multiple devices are active. CMOS multiple-input MIN & MAX circuits using gain-enhanced
voltage followers based on differential amplifiers are presented in [Fatt93] & [Tou94]. They are more complicated
but, according to the authors, offer high-frequency and accurate performance.
Figure-5.1: MIN cell, voltage-mode [Yam88]
Figure-5.2: MAX cell, voltage-mode [Yam88]
Yamakawa has designed a BiCMOS monolithic rule chip (TG004MC [Yam88], TG005MC [Hir93]) whose
architecture is shown in figure 5.3. It is constructed with about 600 transistors and 800 resistors. The response time
of a fuzzy inference is about 1 µs (1 Mega-FLIPS). This chip implements one fuzzy inference involving three
variables in the antecedent and one variable in the consequent. The antecedent part is made of membership function
circuits (fig. 5.4) providing, for each variable, seven possible fuzzy linguistic values (one of which is selected
according to a voltage label VLAB). These values can have four possible shapes, assigned by an external binary
signal, and the slopes can be changed by external resistors. A special label "not assigned", corresponding to a
constant high level (+5V), can stand for constant membership functions whose value is 1.
A truncation circuit is no more than a certain number n of two-input MIN circuits connected to an n-input MAX
circuit. A voltage level acting as truncation factor is applied to one input of each MIN circuit, and the membership
grades of the output fuzzy sets are applied to the second inputs. The output fuzzy sets are sampled in the consequent
part by a membership function generator (MFG) in order to form fuzzy words of 25 elements which can be
processed by the truncation circuit. These words are realised by a switch array controlled by a switch-setting circuit
[Gup88],[Yam93], and represent the consequent membership functions as analog voltages on 25 lines.
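Behaviourally (a sketch under our own naming, not the chip's netlist), the truncation stage computes, for each of the 25 lines, the MAX over the rules of the consequent grade clipped by the rule's truncation factor:

    def truncate_and_aggregate(trunc_factors, consequent_words):
        # trunc_factors: one truncation level per rule;
        # consequent_words: one 25-element fuzzy word per rule
        return [max(min(w, word[i])
                    for w, word in zip(trunc_factors, consequent_words))
                for i in range(25)]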
Figure-5.3: Architecture of the TG005MC rule chip [Yam93]
Figure-5.4: Principle of a basic membership function circuit [Hir85]
Yamakawa also designed an analog defuzzifier chip (TB005PL [Yam88], TB010PL [Hir93]) which implements the
centre-of-gravity calculation for 25 values (fig. 5.5). Its architecture consists of an ordinary addition circuit in parallel
with a weighted addition circuit, followed by an analog divider. It is constructed with resistor arrays, OP amplifiers,
an analog divider and capacitors in a hybrid structure. The response time of the defuzzification is about 5 µs and is
almost entirely determined by that of the divider. The sum is computed by a simple network of 25 identical resistors
connected to a common node, which produces a current proportional to the desired sum. The weighted addition is
performed in the same way, but with 25 different resistors: 0, R, R/2, R/3, ..., R/24. For the emitter junction of a
bipolar transistor, the base-to-emitter voltage is proportional to the logarithm of the emitter current (over the range
where the effects of series and shunt resistances and of the reverse saturation current are negligible). Thus the
division of the two currents (proportional to the sum and the weighted sum respectively) is implemented by the
subtraction of two base-to-emitter voltages. Finally, the divider is followed by a level shifter with current-voltage
conversion [Yam88bis].
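Numerically, the divider principle can be sketched as follows (our illustration with an idealized junction law; the component values are hypothetical): subtracting two base-to-emitter voltages yields the logarithm of the current ratio, i.e. of (weighted sum)/(sum), which is the centre of gravity.

    import math
    VT = 0.0259                        # thermal voltage at room temperature [V]

    def vbe(i, i_s=1e-15):             # idealized junction: VBE = VT*ln(I/Is)
        return VT * math.log(i / i_s)

    grades = [0.1, 0.8, 0.4]           # grades on three of the 25 lines
    i_sum = sum(grades)                              # identical-resistor network
    i_wsum = sum(k * g for k, g in enumerate(grades))  # weighted-resistor network
    delta_v = vbe(i_wsum) - vbe(i_sum)               # = VT * ln(i_wsum / i_sum)
    print(math.exp(delta_v / VT))      # recovers the quotient: ~1.23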
Several rule chips can be connected to such a defuzzifier chip, which calculates a deterministic value from the
maximal membership function. Several defuzzifier chips can also be used to realise systems involving several
conditions and several conclusions. The main weak point of such systems is that they generally lead to cumbersome,
non-optimal implementations.
OMRON has developed the above analog fuzzy chips, as well as a rudimentary analog fuzzy computer based on
these two circuits: the FZ-5000. The latter is a multi-board system in which each board includes four inference chips
and one defuzzifier chip. The inference time is about 15 µs, including defuzzification. Programming is done using switches
and jumper pins, or by a specific software tool (FT-6000) using a personal computer.
Figure-5.5: Architecture of the TB005PL defuzzifier chip [Yam88bis]
5.2.2 Voltage-mode disadvantages
Voltage-mode fuzzy circuits imply a large energy stored in the node parasitic capacitances (CV²/2), and speed is
limited by the charging delays of the various capacitors. They are moreover penalized by a certain lack of precision,
because the signals are sensitive to changes of the supply voltages. This is especially significant when the voltage
range is restricted in order to confine transistor operation to a small part of its characteristic, or when the electrical
consumption must be limited. The problems mainly lie in the sizing of some components.
Several functions are very difficult to build in voltage mode, even basic ones such as the algebraic sum. The
approach described above needs resistors to perform additions and to convert voltages into currents. Integrated
resistors are unfortunately inaccurate and cumbersome, and involve significant parasitic capacitances. The truncation
of the consequent and the defuzzification pose an important problem as regards the parallelism of the inference
engine (especially when the number and size of the output sets are large). This approach implies high power
dissipation and large chip area, and leads to high-cost implementations.
5.3 Mixed-mode
5.3.1 Transconductance
A hybrid mode can be realized by implementing both current-mode and voltage-mode operations on the same chip.
This avoids some problems inherent to voltage mode while taking advantage of some benefits of current mode. For
example, the sum of analog voltages is rather complex to realize, while it corresponds to simple wire connections in
current mode. This approach was used in the defuzzifier chip of §5.2.1. The difficulty is to obtain linear and accurate
conversions when swapping between voltage and current modes without losing too much precision. Efficient
transconductance elements (fig. 5.6) should exhibit both good linearity and good frequency response, to deal with
fast and non-linearly varying membership values. They should moreover occupy little area and consume little power.
Figure-5.6: A linear tunable transconductance [Park86]
5.3.2 OTA-based approach
Operational transconductance amplifiers (OTAs) can be used as basic building blocks to design analog CMOS
circuits for fuzzy logic. OTAs, diodes and OTA-simulated resistors are sufficient to realize every useful fuzzy circuit.
Fig. 5.7 represents an OTA of the same type as those used in [Inou91]. The fuzzy grade interval [0,1] is represented
by [0V,1V] in order to ensure linear operation of the OTAs. A more effective OTA, based on the transconductance
element of fig. 5.6, is presented in [Park86]; it is very linear and efficient over a wide frequency range, but at the
expense of higher consumption and larger chip area.
Figure-5.7: Operational Transconductance Amplifier (OTA)
A high-speed bounded difference operation can easily be implemented with OTAs. For the circuit represented in
fig. 5.8, the relationship between the input voltages (VIN1 & VIN2) and the current of the diode satisfies the
definition of the bounded difference. As the first OTA supplies a current proportional to the difference of the input
voltages, the bounded difference is realized for µ1 ≥ µ2 when the input voltages represent the membership grades µ1
and µ2. It is also realized for µ1 < µ2, since the diode only passes the unidirectional output current of the OTA. The
second OTA finally converts this current into a proportional output voltage (VOUT), in order to obtain a
voltage-mode bounded difference characteristic. Its output current is indeed identical to that of the diode, and is also
proportional to the output voltage, which is connected back to an input of the second OTA.
Figure-5.8: OTA-based bounded difference operator
This OTA-based bounded difference operator can effectively be used to synthesise all the other fuzzy functions
according to the relationships described in §3.2. The algebraic sum used in these relationships is realised by simple
wire connections before the currents are finally converted into voltages with OTAs. The fuzzy complement
operation is realized when the positive input voltage (VIN1) is connected to the high level (µ1 = 1) in the circuit of
fig. 5.8 (VOUT is then the complement of VIN2). The synthesis of two-input MAX and MIN operators is represented
in fig. 5.9. Since µ1 ≥ (µ1 ⊖ µ2) in any case, MIN(µ1, µ2) = µ1 ⊖ (µ1 ⊖ µ2) = µ1 - (µ1 ⊖ µ2). The algebraic
difference leads to a faster MIN operator synthesis than two cascaded bounded differences. A current equal to the
negative value -(µ1 ⊖ µ2) is obtained by permuting the two input voltages in a common bounded difference circuit
(applying µ2 to VIN1 and µ1 to VIN2) whose diode (D1) is connected in the reverse sense. The diode D2 is only
useful if the output voltage has to be limited to positive values.
Figure-5.9: OTA-based MAX and MIN operators [Inou91]
Multiple-input MAX and MIN circuits are also described and analysed in [Inou91], where a fuzzy membership
function circuit is presented as well. They are made of two-stage OTA structures which provide high-speed signal
processing. OTA-based realisations of other fuzzy functions are however often quite complicated and require more
OTA stages, which radically deteriorates the accuracy and speed of processing. Several OTAs are actually necessary
to synthesise some of the relationships of §3.2, since many voltage-current or current-voltage conversions are needed
before performing bounded differences (which deal with voltages) and algebraic sums (which deal with currents). It
is moreover quite difficult to make simplifications when these different functions are combined, and the principal
optimisation consists of increasing the parallelism of the OTA stages. This OTA-based approach unfortunately leads
rapidly to very large and not very accurate circuits when implementing the fuzzy processing of complex inference
systems.
5.3.3 Current-mode multiple-input MIN and MAX circuits
Multiple-input MIN and MAX circuits can be realized with current mirrors composed of a standard MOS transistor
and a lateral MOS transistor in bipolar operation [Tou94]. They are suitable for current-mode CMOS circuits and are
based on the same principle as the voltage-mode circuits of figs. 5.1 & 5.2. Input currents are converted into
gate-source voltages which are applied to the base terminals of bipolar transistors connected as voltage followers.
The voltages are processed according to the MIN or MAX operation before being converted into an output current.
Precision depends on the symmetry of the structures, based on MOS-bipolar mirrors followed by the inverse
bipolar-MOS mirror (whose bipolar transistor also compensates the DC level shift of approximately 0.7V).
Figure-5.10: Current-mode 3-input MIN circuit [Tou94]
5.4 Current-mode
5.4.1 Current-mode circuits
Current-mode circuits do not need resistors and can achieve summation and subtraction in the simplest possible way,
just by wire connections. This leads to simple and intuitive configurations which exhibit high speed and great
functional density. They are used more and more, especially for systems requiring a high level of interconnectivity
(neural networks for example). High speed is achieved when capacitive nodes are not subject to large voltage
fluctuations. Current-mode circuits can also exhibit advantages such as low power dissipation and low supply
voltage, as well as good insensitivity to fluctuations of the latter. Since they have a single fan-out, current
repeatability is of prime
importance and the distribution of signals requires multiple-output current mirrors.
5.4.2 Current mirrors
A basic realisation of a multiple-output CMOS current mirror is shown in fig. 5.11. This circuit is however not
suitable for synthesising accurate functions, since each output current is slightly modulated by the output voltage
through the Early conductance. The output current should be independent of the output voltage, which is obtained by
reducing the output conductance, as in the three common mirrors shown in fig. 5.12. The drain voltage of the
transistor which imposes the current (drain voltage of T2 in fig. 5.12) is then independent of the output voltage of the
circuit. Multiple-output cascode mirrors are often used, but Wilson ones are preferable for low-power applications
because they require a single bias voltage (VG(T1) in fig. 5.12.a) instead of the two superposed voltages (VG(T1) &
VG(T3) in fig. 5.12.b). The modified Wilson mirror is obtained by adding a transistor to the Wilson mirror (T4 in
fig. 5.12.c) to improve its symmetry. This mirror provides good accuracy, and the input current is well reproduced
with perfectly matched identical transistors. The precision of all these mirrors depends on their output resistance
(which must be as high as possible) and on the matching of their transistors. The quality of current reproduction is
very important in cascade configurations, to limit error accumulation. Dynamic mirrors can be used to obtain greater
accuracy, but at the expense of a clocking scheme which considerably increases their size and complexity.
Figure-5.11: Basic n-output CMOS current mirror and symbolic representation
Figure-5.12: NMOS current mirrors
All the above-mentioned mirrors can be constructed in standard bipolar technology as well as in MOS.
Multiple-output current mirrors can be realized by compact bipolar circuits using multicollector transistors, whose
accuracy is however poorer than that of structures with single-collector transistors. Bipolar transistors produce two
types of significant errors, due to the base current and to the reverse-mode current. The latter causes the saturation of
one collector to affect the other collectors, and the circuits should be designed so that no collector in the
multiple-output current mirrors can saturate. These errors do not appear in multiple-output MOS current mirrors,
since their input-output paths are separated and their drain currents are independent of each other. The design of
cascaded MOS structures is then much easier than that of bipolar structures, whose stages are interdependent. MOS
design is generally preferred, all the more since it is compatible with standard CMOS fabrication processes and
efficient design tools.
The mismatch between two identical transistors is however smaller for bipolar transistors than for MOS ones, since
it does not depend on the collector current level and since bipolar transistors are much less affected by surface
effects. All MOS transistors must work in saturation mode in order to reduce the mismatch effects. Bipolar current
mirrors are therefore more appropriate for working with low voltages, and are more precise and faster at low currents
(the speed of MOS mirrors decreases at low currents because of their intrinsic capacitances).
5.4.3 Fuzzy operator synthesis
Current mirrors can be used as building blocks to synthesise fuzzy logic operations and the relevant processing. In
this way, the nine basic fuzzy functions (cf. §3.2) can easily be implemented on monolithic ICs in standard CMOS
[Zhij90],[Yam86] or bipolar technologies [Yam87]. These current-mode basic logic cells exhibit a good linearity
which cannot easily be achieved in voltage mode, and lead to fuzzy integrated systems which are globally smaller
than voltage-mode ones.
Figure-5.13: CMOS and bipolar implementation of the bounded difference operation [Yam86],[Yam87]
A bounded difference circuit can be obtained by combining a current mirror and a diode (fig. 5.13). The diode can
easily be realized in the CMOS circuit either by a single FET whose gate and drain are connected together (fig. 5.13)
or by a current mirror (fig. 5.14). The first solution involves inevitable voltage drops due to the channel resistance,
which can influence the normal logic function of the circuits. Nevertheless the diode can be omitted in cascade
connections of such circuits, because the input current mirror of the following stage also performs its task.
Figure-5.14: Two different implementations of the bounded difference operation [Zhij90]
Current mirrors are subdivided into two complementary types according to whether their transistors are n- or
p-channel MOSFETs (NPN or PNP in bipolar technology). The directions of the input and output currents depend on
the types of the respective components (input mirror and output diode in the bounded difference circuit). There are
thus four different configurations of input and output current directions (two of which are shown in fig. 5.14 and are
suitable for cascade connections). To each configuration corresponds a complementary one, obtained by substituting
p-channel current mirrors for n-channel ones and vice versa. This is convenient for designing circuits that use such
fuzzy logic units as basic bricks, without worrying about specific current directions between neighbouring bricks.
The circuits of figs. 5.13 and 5.14 realize the bounded difference operation on two membership grades µx & µy,
represented by the two currents I1 and I2 respectively. They also realize the complement operation on µy
(represented by I2) when I1 has the maximum value (representing a membership grade equal to 1). The bounded
product is realized when I1 represents the sum µx + µy and I2 represents 1. As the bounded difference and the
algebraic sum are sufficient to realize all the other fuzzy functions (according to the relations described in §3.2),
fuzzy circuits can be designed merely by specifying the connections between bounded difference subcircuits.
Multiple-output current mirrors are also required to distribute currents to several logic units (fig. 5.15). Such basic
cells are attractive prospects for developing CAD tools and semicustom ICs (arrays of logic cells adaptable to
various specifications according to the wiring configuration). This however generally leads to solutions which are
not optimal as regards the number of transistors.
Multiple-input MAX and MIN circuits are proposed in [Sas90]; they aim at avoiding error accumulation and at
increasing speed relative to binary-tree realizations based on two-input MIN and MAX subcircuits. The operation of
these circuits can be formulated using simultaneous bounded difference equations. A simpler multiple-input MAX
circuit is described in [Batu94].
Figure-5.15: CMOS multiple fan-out circuit [Yam86]
Three "primitive" operators have been introduced to obtain more elementary basic cells than the bounded difference
[Lem93],[Lem93bis],[Lem94]. These operators are defined in the following way :
All fuzzy functions can be formulated as an additive combination of these primitive operators since the bounded
difference can be expressed in the following way:
These operators are used to reduce the complexity of electrical realization of fuzzy functions, since they lead to simple
relationship between transistor-level circuits and symbolic representation of fuzzy formulae. They are actually
realisable in a most simple way by using current mirrors (fig.5.16) and exhibit special properties that help to obtain
reduced forms of compositional expressions. Some of these properties are as following:
Several relations are rather complicated when expressed from bounded difference equations, and should be reduced
with the help of these properties. This will at the same time decrease the number of cascaded current-mirror stages
necessary to implement them. It is also important to favour parallel operations in rule evaluations, so that error
accumulation is reduced and processing speed increased.
Figure-5.16: CMOS implementation of primitive operators [Lem94]
As an example of such a process, the MIN operation can be implemented with two cascaded current mirrors
(fig. 5.17), after a certain number of simplifications have been carried out in its mathematical formulation.
Figure-5.17: Synthesis of the MIN function based on primitive operators [Lem94]
There are various solutions for realizing non-linear analog membership function generators in current mode, but they
are directly influenced by the physical features of the devices and often exhibit bad temperature behaviour. It is
consequently quite attractive to design circuitry producing piecewise-linear membership functions
[Kett93],[Yam88-3]. The representation of complicated membership functions can also be subdivided into an
appropriate composition of elementary logic subcircuits. The fuzzification of a crisp variable x through membership
functions with triangular shapes can be expressed, for example, in terms of the primitive operators [Lem94]. The
input value i determines the horizontal position of the triangular functions and designates in this manner one of the
linguistic labels (negative high, negative low, zero, ...).
5.4.4 Mixed-mode fuzzifier
As current-mode circuits are restricted to a single fan-out, multiple current mirrors are required to distribute signals
among several operational blocks. Voltage-mode inputs are thus preferable for fuzzy hardware systems, since the
inputs must be distributed to the membership function circuits of many rule blocks. Current-mode signals are
appreciated afterwards, because of the advantages of current-mode processing. Tunable voltage-input, current-mode
membership function circuits are consequently useful building blocks for performing the fuzzification in
current-mode analog hardware [Chen92],[Sas92],[Ishi92]. They can also be used with the OTA-based approach
described above. Such circuits can be realized with an OTA including variable source resistors, which consist of
integrated voltage-controlled resistors and which can change the OTA transconductance characteristic [Chen92].
This produces triangular membership functions with variable heights, widths, slopes and horizontal positions. Such
membership function circuits are suitable for realizing high-speed, small-size systems. Nevertheless their sizing
becomes difficult as their complexity increases, and their characteristics are affected by variations of physical
parameters (mismatches, temperature influence, ...).
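A behavioural model of such a tunable triangular membership function (our sketch, not the circuit itself) makes the degrees of freedom explicit:

    def triangle(x, centre, half_width, height=1.0):
        # centre sets the horizontal position, half_width the slopes,
        # height the maximal grade
        return max(0.0, height * (1.0 - abs(x - centre) / half_width))

    print(triangle(0.3, centre=0.5, half_width=0.4))   # ~0.5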
5.4.5 Defuzzification strategy and normalization circuit
When performed with the centre-of-gravity method, the defuzzification consists of a weighted average requiring a
sum, a multiplication (weighting) and a division. The sum operation reduces to wire connections, while the
multiplication can be realized by scaling currents with asymmetrical current mirrors. The analog division remains the
most complex operation and requires a rather large chip area and processing time. The defuzzification operation is
consequently often simplified in hardware implementations.
The required division can however be performed easily and rapidly by current-mode normalization circuits [Ishi92],
such as the 3-input circuit of figure 5.18.
Figure-5.18: 3-input normalization circuit [Ishi92]
The normalization circuit can directly be extended to more than three inputs. The sum of all output currents is
normalized to I0, so each output current (representing a membership grade) is divided by this sum (the sum of all
membership grades). There is then no longer any need to divide the weighted sum of these currents. The weight of
each current can be set in the normalization circuit by the W/L ratio of the respective output transistor. Good
precision is obtained when such a circuit is implemented with bipolar transistors or bipolar-operated MOS. When
using saturated MOS transistors operating in weak inversion with VS = 0, the circuit is very inaccurate and
temperature-dependent, due to the large variation of VT0 from device to device.
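The benefit can be seen in a short behavioural sketch (ours): once the grade currents are normalized to I0, the defuzzified output is a plain weighted sum of the singleton positions, with no divider in the signal path (in the circuit, the division written below is performed implicitly by the normalization stage).

    def normalize(grades, i0=1.0):
        s = sum(grades)
        return [i0 * g / s for g in grades]   # done by the circuit itself

    def defuzzify(grades, positions):
        return sum(g * p for g, p in zip(normalize(grades), positions))

    print(defuzzify([0.2, 0.6, 0.2], [-1.0, 0.0, 1.0]))   # 0.0 (symmetric case)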
5.5 Singleton consequences and normalization loop
A main disadvantage of analog circuits concerns the defuzzification stage (and especially the analog divider) in
terms of size, accuracy and processing speed. The quantity and complexity of the calculations are however greatly
decreased by using singleton consequences. These are more appropriate for hardware implementations than
complicated output sets, which must in any case be reduced to a limited number of discrete analog values. It can
moreover be observed that singleton consequences facilitate a linear interpolation between the inference results of
different control rules [Yam89],[Yam93].
Yamakawa has sought to eliminate the analog division in a voltage-mode singleton-consequent controller by using
grade-controllable membership function circuits (obtained by modifying ordinary membership function circuits
[Yam89],[Yam93]). These can be tuned in such a way that the sum of all membership grades is equal to unity (or
constant), so that the weighted sum no longer needs a division. The regulation consists of shifting each membership
function characteristic up and down according to the variations of the sum of membership grades, through a negative
feedback loop.
The current-mode rule block with voltage interface (and OTA-based membership function circuits) described in
[Sas92] also includes a normalization locked loop which eliminates the division of the weighted average in a similar
way (fig. 5.19). The sum of the membership grades is regulated by using its fluctuations as a modulation factor (Vm)
for the membership function circuits through a negative feedback loop. This regulation is attractive because it is
faster than the analog division in fuzzy hardware implementations. Nevertheless the implementation of such circuits
requires complicated units and connections, and leads to difficult sizing and testing. The normalization solution
described in §5.4.5 is finally far simpler for current- and mixed-mode circuits.
Figure-5.19: Principle of a normalization locked loop [Sas92]
5.6 Fuzzy analog memories
The main weak point of analog circuits is the lack of reliable analog memory modules. Since there is no accurate and
lasting way to store analog fuzzy values, analog fuzzy computers exhibit poor features concerning programmability
and multistage sequential processing.
Temporary memory elements have however been proposed to keep signals stable within a sampling period. This
allows fuzzy inference engines with pipelined structures, and consequently enhanced speed, to be designed
[Zhij90bis]. Such basic memory cells are however not suitable for implementing complicated sequential algebra in
the way digital circuits do with binary flip-flops and registers. An analog value can nevertheless be stored in a more
lasting way, when represented by a voltage, by means of a capacitor. This component is the core of sample-and-hold
circuits, which are the basic cells of analog memory elements.
A voltage-mode fuzzy flip-flop has been proposed as an extension of the binary J-K flip-flop [Gup88]. It is based on
fuzzy negation, a T-norm and an S-norm, which are respectively restricted to the complementation, MIN and MAX
operations in order to ease its hardware implementation. The characteristics of fuzzy flip-flops based on other
operations, such as the algebraic product and sum, the bounded product and sum, and the drastic product and sum,
are reported in [Hir89]. Its structure is described by set and reset equations which generate the same state values as a
digital J-K flip-flop in the case of Boolean input and state values.
The fuzzy flip-flop can be implemented with a combinatorial part synthesising these equations, and two
sample-and-hold circuits driven by two control pulse circuits with opposite phases (fig. 5.20). The present output
Q(t) is memorized by the two sample-and-hold circuits, since this information is needed for the next state.
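A software model of such a fuzzy flip-flop can be sketched with one common form of the next-state equation found in the fuzzy flip-flop literature (the exact set and reset equations of [Gup88] may differ); negation, T-norm and S-norm are restricted to complement, MIN and MAX as stated above.

    def fuzzy_jk_next(q, j, k):
        # Q(t+1) = MAX( MIN(J, 1-Q), MIN(1-K, Q) )
        return max(min(j, 1.0 - q), min(1.0 - k, q))

    # With Boolean inputs it reproduces the digital J-K behaviour:
    assert fuzzy_jk_next(0, 1, 0) == 1     # set
    assert fuzzy_jk_next(1, 0, 1) == 0     # reset
    assert fuzzy_jk_next(0, 1, 1) == 1     # toggle up
    print(fuzzy_jk_next(0.4, 0.7, 0.2))    # fuzzy state update: 0.6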
Figure-5.20: Fuzzy J-K flip-flop: circuit block diagram [Gup88]
Current-mode continuous-valued memory elements can be realized in the same way by using voltage-controlled
current sources in both sample-and-hold circuits, to store a control voltage representing the sampled current [Balt93].
The capacitor of a sample-and-hold circuit is thus charged according to the control voltage of a first current source.
In the "hold" mode (after half a clock cycle), this stored voltage sets the output current, through a second current
source, in the image of the input one.
Storage capacitors are designed according to a trade-off between high speed and small silicon area on one hand, and
sufficient accuracy on the other (their capacitance should be high relative to the parasitic ones). High accuracy is
however difficult to achieve, since it is affected by the imprecision of integrated capacitors, the mismatch between
pairs of current sources, and above all the charge injection into parasitic capacitances. This last source of sampling
error is however reduced by the master-slave configuration of the sample-and-hold circuits, since the injected
charges are opposite in sign (the first-order sampling errors thus cancel).
6 Mixed Digital/Analog Implementations of Fuzzy Systems
Fuzzy logic systems lend themselves well to analog integration, except for some control and reconfiguration
structures. Several switches are thus often integrated on analog circuits and controlled by digital inputs.
It can actually be attractive to exploit the complementarity of digital and analog features further and to merge them
into a single mixed chip, in order to compensate the weak points of each [Fatt93],[Zhij90bis]. A fuzzy knowledge
base can be programmed in a digital memory which consists of dedicated locations and stores a variable number of
parameters, notably those characterizing the membership function shapes and the inference rules. Highly parallel,
non-sequential analog processing is then possible, provided that D/A converters are used. A/D converters can also be
used when a digital computation of the centre of gravity is desired.
The VLSI design group of SGS-THOMSON has also undertaken the design of a hybrid controller implemented in a
mixed analog/digital technology [Pag93bis],[Pag93-3]. It consists of a digital storage and distribution unit followed
by an analog inference core. The membership grades are converted into analog values by an internal D/A converter.
Compared with digital controllers, this system does not need expensive A/D and D/A converters. Its high speed
should suit very demanding real-time requirements with a limited number of rules.
7 CAD Automation for Fuzzy Logic Circuits Design
7.1 CAD strategy for digital circuits
Specific cells are defined to implement the fuzzy operations: a standard-cell approach within the usual CAD
environments. See THOMSON's approach [Pag93].
7.2 CAD strategy for analog circuits
Design automation of analog fuzzy blocks relies on a standard-cell approach, allowing a fast and reliable design
strategy (very similar to a digital one) to be built. The use of p- and n-channel static current mirrors as building
blocks is well suited to creating a design automation framework for generating the layout of fuzzy units. Such a
development environment for current-mode CMOS fuzzy logic circuits has been created from a standard graphical
tool and a specific silicon compiler [Lem93],[Lem94]. The graphical tool provides a logical simulation of fuzzy
algorithms and assists in the design of fuzzy system architectures. The silicon compiler generates the corresponding
layout from the mathematical expression of a given fuzzy algorithm. The system is based on a three-level hierarchy:
current mirror circuits serve as generic cells for building elementary logical blocks (MIN, MAX, ...), which can
finally be assembled into sophisticated fuzzy units. The use of the three primitive operators (cf. §5.4.3) leads to
analytical expressions which are suitable for describing a fuzzy unit at all levels, from the current mirrors up to
high-level fuzzy algorithms. The aim of this methodology is to eliminate the redundant fuzzy function elements
which exist in the trivial implementation of fuzzy functions. The silicon compiler then replaces each term of the
mathematical equations by its electrical counterpart. It then tries to reduce and place each electrical block in such an
optimized way that wire lengths and IC area are minimised. The interconnections of these wires determine the
vertical current flow
between p- and n-channel mirrors from one stage to another. Certain configurations which cause functional problems
must be forbidden, and trivial connections of p- and n-mirrors involving static consumption must be avoided (the
case where serial p- and n-mirrors of a branch drive a static current between the high and low levels). The automatic
cell generator then produces the final layout as the physical representation of the fuzzy algorithms. As an application
of this environment, the prototype fuzzy controller FLC#001 has been designed in a standard 1.2 µm CMOS
technology [Lem94]. This low-power, small-size circuit achieves 10 MegaFLIPS and is quite efficient for real-time
control applications.
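The compiler principle can be conveyed by a toy sketch (entirely hypothetical names, far simpler than the real tool): a fuzzy expression held as a tree of primitive operations is walked once, and each node is mapped to one current-mirror cell of a netlist, which a layout generator could then place.

    def compile_netlist(expr, netlist):
        # expr: an input name, or a tuple ('op', left, right)
        if isinstance(expr, str):
            return expr                       # a primary input current
        op, a, b = expr
        ia = compile_netlist(a, netlist)
        ib = compile_netlist(b, netlist)
        out = "n%d" % len(netlist)
        netlist.append((op, ia, ib, out))     # one mirror cell per operation
        return out

    nl = []
    compile_netlist(('bdiff', 'mu1', ('bdiff', 'mu1', 'mu2')), nl)  # MIN(mu1,mu2)
    print(nl)  # [('bdiff','mu1','mu2','n0'), ('bdiff','mu1','n0','n1')]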
Since a design produced by such a strategy stays close to the fuzzy algorithms, testing the resulting circuits is easy.
Internal currents cannot however be measured without adding supplementary output current mirrors, which increase
the circuit size. Gate voltages can be measured, but they give only an imprecise estimate of the transistor currents.
Figure-7.1: Structure of a silicon compiler [Lem94]
8 Neural Networks Implementing Fuzzy Systems
Artificial neural networks and fuzzy logic systems exhibit similar or complementary features that suggest merging
the two techniques. Both have a certain ability to work in a natural analog environment and to deal with the
uncertainty of real-world problems. Neural networks seem appropriate for implementing fuzzy logic since, in
principle, they provide massively parallel and highly parametric computation, achieve highly non-linear operations,
do not require any model, and are concerned with approximating input-output relationships. They moreover have the
ability to achieve interpolative reasoning similar to fuzzy inference. This means that once a neural network has been
trained on a representative set of data, it can interpolate and produce answers for cases not present in the training
data set (the training is a supervised learning which consists of supplying inputs and demanding the desired outputs,
in order to adjust the weights of the connections between neurons in different layers). The neural output
characteristics (threshold functions with sigmoid characteristics) are able to synthesise complicated membership
functions. Finally, the sum-of-products operation that neural networks realize is close to the MIN-MAX and
weighting operations that occur in approximate reasoning. One should however bear in mind that the implementation
of neural networks is still complicated, and actually not really easier than the direct implementation of fuzzy logic
engines. Hardware realizations of neural networks are thus not yet very efficient as regards parallelism, computation
speed, functional density, and so on.
The fusion of neural networks and fuzzy logic does not only rest on their similarities, but also on mutual
compensation of their different features. Thus neural networks can offer fuzzy systems a solution to the problems of structure
identification (number and size of the rules making up a suitable fuzzy partition of the feature space) and of
parameter identification (number and characteristics of the membership functions). They are actually able, in some
cases, to replace the tedious and sometimes haphazard identification scheme by supervised learning. The knowledge
acquisition of self-organizing networks can be used to realize adaptive fuzzy systems, which are quite attractive
when linguistic rules cannot easily be derived or when a great number of rules is required. There is however no
guarantee of convergence in the learning scheme. The necessary number of neurons is moreover unpredictable, and
the acquisition of new knowledge is fairly difficult without starting a new learning phase.
There are several ways to combine neural networks and fuzzy logic, which differ from author to author. The first idea consists of using the high flexibility that learning gives to neural networks in order to provide automatic design and fine tuning of the membership functions used in fuzzy control. Such an adaptive neuro-fuzzy controller adjusts its performance automatically by accumulating experience during a learning phase. National Semiconductor Corp. has developed the NeuFuz embedded system, which provides a neural-network front end for fuzzy logic technology. A first layer performs fuzzification, a second creates the rule base, and a third, a single neuron, performs rule evaluation and defuzzification. Neural networks involve a great deal of computation and hardware investment, which is prohibitive for many real-time applications. In this case they therefore emulate and optimise a fuzzy logic system rather than implement the final application directly. The National Semiconductor solution accordingly includes the capability to generate assembly code for a strictly fuzzy logic solution. Recently, NeuraLogix has put on the market the NLX230 fuzzy microcontroller, a VLSI digital fuzzy logic engine based on the min-max method. It includes a neural network implementing a high-speed minimum comparator block connected to 16 parallel fuzzifiers on one side and a maximum comparator on the other. Another approach aims at solving the problems of consequence identification and defuzzification by using Takagi-Sugeno's method, which lends itself well to implementation frameworks based on adaptive neural networks [Jan92].
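As a hedged illustration of why the Takagi-Sugeno form lends itself to adaptive networks, the sketch below evaluates a zero-order Sugeno model: the consequent of each rule is a tunable constant, so defuzzification collapses into a weighted average whose parameters a gradient procedure such as ANFIS [Jan92] can adjust. The membership shapes, rule constants and function names are illustrative assumptions.

```python
def triangular(x, a, b, c):
    # Triangular membership function with support [a, c] and peak at b.
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def sugeno_zero_order(x, rules):
    # rules: list of ((a, b, c), consequent_constant) pairs. The output
    # is the firing-strength-weighted average of the constants, so no
    # separate defuzzification stage is needed.
    weights = [triangular(x, a, b, c) for (a, b, c), _ in rules]
    total = sum(weights)
    if total == 0.0:
        return 0.0
    return sum(w * k for w, (_, k) in zip(weights, rules)) / total

rules = [((-0.5, 0.0, 0.5), 1.0),    # "x is LOW  -> y = 1.0"
         (( 0.0, 0.5, 1.0), 0.3),    # "x is MED  -> y = 0.3"
         (( 0.5, 1.0, 1.5), -0.5)]   # "x is HIGH -> y = -0.5"
print(sugeno_zero_order(0.25, rules))   # 0.65
```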
Instead of using neurons to implement some part of a fuzzy system, the structure of approximate reasoning can be applied to neural networks. The aim of this approach is to improve neural network frameworks by bringing in some of the advantages of fuzzy logic. The latter allows an explicit representation of the knowledge and has a logic structure that makes high-order processing easier to handle. In pattern recognition, for example, the learning scheme is efficient for acquiring knowledge about reference objects. As for the structure of the approximate reasoning rules, it can give information on the distribution of the knowledge inside the whole network. It is then easier to find out which internal parts cause poor performance, by analysing the errors according to the rule structure [Tak90bis].
References
Fuzzy Sets and Systems, Theory and Applications
[Dub80] D. Dubois & H. Prade, Fuzzy Sets and Systems, Theory and Applications, Academic Press, New York,
1980
[God94] J. Godjevac, Introduction à la logique floue, Cours de perfectionnement, Institut suisse de pédagogie
pour la formation professionelle, EPFL, Lausanne, 1994
[Ngu93] H.T. Nguyen, V. Kreinovich & D. Tolbert, On Robustness of Fuzzy Logics, IEEE Int. Conf. on Fuzzy
Systems, Vol.1, pp.543-547, San Francisco, Ca, USA, 1993
[Ter92] T. Terano, K. Asai & M. Sugeno, Fuzzy Systems Theory and its Applications, 1st ed., Academic Press,
San Diego, 1992
Fuzzy Control
[Büh94] H. Bühler, Réglage par logique floue, Presses Polytechniques Romandes, Lausanne, 1994
[Dri93] D. Driankov, H. Hellendoorn & M. Reinfrank, An Introduction to Fuzzy Control, Springer-Verlag,
Berlin Heidelberg, 1993
[Gee92] H.P. Geering, Introduction to Fuzzy Control, Institut für Mess- und Regeltechnik, ETH, IMRT-Bericht
Nr.24, Zürich, 1992
[Mam74] E.H. Mamdani & S. Assilian, A Case Study on the Application of Fuzzy Set Theory to Automatic
Control, pp.643-649, Budapest, Sept. 1974
[Mam74bis] E.H. Mamdani, Application of Fuzzy Algorithms for Control of Simple Dynamic Plant, Proc. IEE,
Vol.121, No.12, pp.1585-1588, Dec. 1974
[Mam75] E.H. Mamdani & S. Assilian, An Experiment in Linguistic Synthesis with a Fuzzy Logic Controller,
Int. J. Man-Machine Studies, Vol.7, pp.1-13, 1975
[Mam76] E.H. Mamdani, Advances in the Linguistic synthesis of Fuzzy Controllers, Int. J. Man-Machine
Studies, Vol.11, pp.669-678, 1976
[Mam77] E.H. Mamdani, Application of Fuzzy Logic to Approximate Reasoning Using Linguistic Synthesis,
IEEE Trans. on Computers, Vol.26, pp.1182-1191, Dec. 1977
[Mam81] E.H. Mamdani & B.R. Gaines, Fuzzy Reasoning and Its Applications, Academic Press, Cambridge,
Mass., 1981
[Mam84] E.H. Mamdani, Development in Fuzzy Logic Control, Proc. of 23rd Conference on Decision and
Control, pp.888-893, 1984
[Sug85] M. Sugeno, Industrial Applications of Fuzzy Control, North-Holland Publ., 1985
[Tak83] T. Takagi & M. Sugeno, Derivation of Fuzzy Control Rules from Human Operator's Control Actions,
Proc. of the IFAC Symp. on Fuzzy Information, Knowledge Representation and Decision analysis, pp.55-60,
July 1983
[Tak85] T. Takagi & M. Sugeno, Fuzzy Identification of Systems and Its Application to Modeling and Control,
IEEE Trans. Syst. Man & Cybern., Vol.20, No.2, pp.116-132, 1985
Hardware Implementation of Fuzzy Systems
[Chan93] C.-H. Chang & J.-Y. Cheung, The Dimensions Effect of FAM Rule Table in Fuzzy PID Logic Control
Systems, Second IEEE Int. Conf. on Fuzzy Systems, Vol.1, pp.441-446, San Francisco, CA, USA, 1993
[Gup88] Fuzzy Computing: Theory, Hardware, and Applications, M.M. Gupta & T. Yamakawa Eds., Elsevier
Science Publishers B.V. (North-Holland), 1988
[Hir93] Industrial Applications of Fuzzy Technology, K. Hirota Ed., Tokyo, 1993
Digital Implementations
[Eich92] H. Eichfeld, M. Löhner & M. Müller, Architecture of a CMOS Fuzzy Logic Controller with Optimized
Memory Organisation and Operator Design, Second IEEE Int. Conf. on Fuzzy Systems, pp.1317-1323, San
Diego, CA, USA, March 1992
[Eich93] H. Eichfeld, T. Künemund & M. Klimke, An 8b Fuzzy Coprocessor for Fuzzy Control, Proc. of the Int.
Solid State Circuits Conf. '93, San Francisco, CA, USA, Feb. 1993
[Nak93] K. Nakamura, N. Sakeshita, Y. Nitta, K. Shimanura, T. Ohono, K. Egushi & T. Tokuda, Twelve Bit
Resolution 200 kFLIPS Fuzzy Inference Processor, Proc. of the Int. Solid State Circuits Conf. '93, San
Francisco, CA, USA, Feb. 1993
[Pag93] A. Pagni, R. Poluzzi & G. Rizzotto, Automatic Synthesis, Analysis and Implementation of a Fuzzy
Controller, IEEE Int. Conf. on Fuzzy Systems, Vol.1, pp.105-110, San Francisco, CA, USA, 1993
[Pag93bis] A. Pagni, R. Poluzzi & G. Rizzotto, Integrated Development Environment for Fuzzy Logic
Applications, Applications of Fuzzy Logic Technology, B. Bosacchi & J.C. Bezdek Ed., pp.66-77, Boston,
Massachusetts, USA, 1993
[Pag93-3] A. Pagni, R. Poluzzi & G. Rizzotto, Fuzzy Logic Program at SGS-THOMSON, Applications of Fuzzy
Logic Technology, B. Bosacchi & J.C. Bezdek Ed., pp.80-90, Boston, Massachusetts, USA, 1993
[Sas93] M. Sasaki, F. Ueno & T. Inoue, 7.5MFLIPS Fuzzy Microprocessor Using SIMD and Logic-in-Memory
Structure, IEEE Int. Conf. on Fuzzy Systems, Vol.1, pp.527-534, San Francisco, CA, USA, 8-10 Sept. 1993
[Tog86] M. Togai & H. Watanabe, A VLSI Implementation of Fuzzy Inference Engine: toward an Expert
System on a Chip, Information Sciences, Vol.38, pp.147-163, 1986
[Tog87] M. Togai & S. Chiu, A Fuzzy Chip and a Fuzzy Inference Accelerator for Real-Time Approximate
Reasoning, Proc. 17th IEEE Int. Symp. Multiple-Valued Logic, pp.25-29, May 1987
[Wat90] H. Watanabe, W.D. Dettloff & K.E. Yount, A VLSI Fuzzy Logic Controller with Reconfigurable
Cascadable Architecture, IEEE JSSC, Vol.25, No.2, pp.376-382, April 1990
[Wat92] H. Watanabe, RISC Approach to Design of Fuzzy Processor Architecture, Second IEEE Int. Conf. on
Fuzzy Systems, pp.431-440, San Diego, CA, USA, March 1992
[Wat93] H. Watanabe & D. Chen, Evaluation of Fuzzy Instructions in a RISC Processor, IEEE Int. Conf. on
Fuzzy Systems, Vol.1, pp.521-526, San Francisco, CA, USA, 1993
Analog Implementation: Voltage-Mode
[Hir89] K. Hirota & K. Ozawa, The Concept of Fuzzy Flip-Flop, IEEE Trans., SMC-19, (5), pp.980-997, 1989
[Tou94] C. Toumazou & T. Lande, Building Blocks for Fuzzy Processors, Voltage- and Current-Mode Min-Max
circuits in CMOS can operate from 3.3V Supply, IEEE Circuits & Devices, Vol.10, No.4, pp.48-50, July 1994
[Yam88] T. Yamakawa, Fuzzy Microprocessors - Rule Chip and Defuzzifier Chip -&- How It Works -, Proc. Int.
Workshop on Fuzzy Syst. Appl., pp.51-52 & 79-87, Iizuka, Japan, Aug. 1988
[Yam88bis] T. Yamakawa, High-Speed Fuzzy Controller Hardware System: The Mega Fips Machine,
Information Sciences, Vol.45, pp.113-128, 1988
[Yam89] T. Yamakawa, An Application of a Grade-Controllable Membership Function Circuit to a
Singleton-Consequent Fuzzy Logic Controller, Proc. of the Third IFSA Congress, pp.296-302, Seattle, WA,
USA, Aug. 6-11, 1989
[Yam93] T. Yamakawa, A Fuzzy Inference Engine in Nonlinear Analog Mode and Its Application to a Fuzzy
Logic Control, IEEE Trans. on Neural Networks, Vol.4, No.3, May 1993
Analog Implementation: Mixed-Mode
[Chen92] J.-J. Chen, C.-C. Chen & H.-W. Tsao, Tunable Membership Function Circuit for Fuzzy Control
Systems using CMOS Technology, Electronics Letters, Vol.28, No.22, pp.2101-2103, Oct. 1992
[Inou91] T. Inoue, F. Ueno, T. Motomura, R. Matsuo & O. Setoguchi, Analysis and Design of Analog CMOS
Building Blocks For Integrated Fuzzy Inference Circuits, Proc. IEEE International Symp. on Circuits and
Systems, Vol.4, pp.2024-2027, 1991
[Park86] C.-S. Park & R. Schaumann, A High-Frequency CMOS Linear Transconductance Element, IEEE
Trans. on Circuits and Systems, Vol.CAS-33, No.11, Nov. 1986
Analog Implementation: Current-Mode
[Batu94] I. Baturone, J.L. Huertas, A. Barriga & S. Sanchez-Solano, Current-Mode Multiple-Input MAX Circuit,
Electronics Letters, Vol.30, No.9, pp.678-680, April 1994
[Balt93] F. Balteanu, I. Opris & G. Kovacs, Current-Mode Fuzzy Memory Element, Electronics Letters, Vol.29,
No.2, pp.236-237, Jan. 1993
[Ishi92] O. Ishizuka, K. Tanno, Z. Tang & H. Matsumoto, Design of a Fuzzy Controller with Normalization
Circuits, Second IEEE Int. Conf. on Fuzzy Systems, pp.1303-1308, San Diego, CA, USA, March 1992
[Kett93] T. Kettner, C. Heite & K. Schumacher, Analog CMOS Realisation of Fuzzy Logic Membership
Functions, IEEE Journal of Solid-State Circuits, Vol.29, No.7, pp.857-861, July 1993
[Land93] O. Landolt, Efficient Analog CMOS Implementation of Fuzzy Rules by Direct Synthesis of
Multidimensional Fuzzy Subspaces, IEEE Int. Conf. on Fuzzy Systems, Vol.1, pp.453-458, San Francisco, CA,
USA, 1993
[Lem93] L. Lemaître, M.J. Patyra & D. Mlynek, Synthesis and Design Automation of Analog Fuzzy Logic VLSI
Circuits, IEEE Symp. on Multiple-Valued Logic, pp.74-79, Sacramento, CA, USA, May 1993
[Lem93bis] L. Lemaître, M.J. Patyra & D. Mlynek, Fuzzy Logic Functions Synthesis: A CMOS Current Mirror
Based Solution, IEEE ISCAS93, Part 3 (of 4), pp.2015-2018, Chicago, IL, USA, 1993
[Lem94] L. Lemaître, Theoretical Aspects of the VLSI Implementation of Fuzzy Algorithms, rapport de thèse
nº1226, Département d'Electricité, EPFL, Lausanne, 1994
[Sas90] M. Sasaki, T. Inoue, Y. Shirai & F. Ueno, Fuzzy Multiple-Input Maximum and Minimum Circuits in
Current Mode and Their Analyses Using Bounded-Difference Equations, IEEE Trans. on Computers, Vol.39,
No.6, pp.768-774, June 1990
[Sas92] M. Sasaki, N. Ishikawa, F. Ueno & T. Inoue, Current-Mode Analog Fuzzy Hardware with Voltage Input
Interface and Normalization Locked Loop, Second IEEE Int. Conf. on Fuzzy Systems, pp.451-457, San Diego,
CA, USA, March 1992
[Yam86] T. Yamakawa & T. Miki, The Current Mode Fuzzy Logic Integrated Circuits Fabricated by the
Standard CMOS Process, Reprint of IEEE Trans. on Computers, Vol. C-35, No.2, pp.161-167, Feb. 1986
[Yam87] T. Yamakawa, Fuzzy Logic Circuits in Current Mode, Analysis of Fuzzy Information, James C.
Bezdek Ed., Vol.1, pp.241-262, CRC Press, Boca Raton, Florida, 1987
[Yam88-3] T. Yamakawa & H. Kabuo, A Programmable Fuzzifier Integrated Circuit - Synthesis, Design, and
Fabrication, Information Sciences, Vol.45, pp.75-112, 1988
[Zhij90] L. Zhijian & J. Hong, CMOS Fuzzy Logic Circuits in Current-Mode toward Large Scale Integration,
Proc. Int. Conf. on Fuzzy Logic & Neural Networks, pp.155-158, Iizuka, Japan, July 20-24, 1990
[Zhij90bis] L. Zhijian & J. Hong, A CMOS Current-Mode, High Speed Fuzzy Logic Microprocessor for a
Real-Time Expert System, IEEE Proc. of the 20th Int. Symp. on MVL, Charlotte, NC, USA, May 23-25, 1990
Mixed Analog-Digital Implementation
[Fatt93] J. Fattarasu, S.S. Mahant-Shetti & J.B. Barton, A Fuzzy Inference Processor, 1993 Symposium on VLSI
Circuits, Digest of Technical Papers, pp.33-34, Kyoto, Japan, May 19-21, 1993
Implementation on FPGA
[Manz92] M.A. Manzoul & D. Jayabharathi, Fuzzy Controller on FPGA Chip, IEEE Int. Conf. on Fuzzy
Systems, pp.1309-1316, San Diego, Ca, USA, Mar. 8-12, 1992
[Ung93] A.P. Ungering, K. Thuener & K. Goser, Architecture of a PDM VLSI Fuzzy Logic Controller with
Pipelining and Optimized Chip Area, IEEE Int. Conf. on Fuzzy Systems, Vol.1, pp.447-452, San Francisco, Ca,
USA, 1993
[Yam92] T. Yamakawa, A Fuzzy Programmable Logic Array (Fuzzy PLA), IEEE Int. Conf. on Fuzzy Systems,
pp.459-465, San Diego, Ca, USA, Mar. 8-12, 1992
Technological Aspects
[Pat90] M.J. Patyra, Fuzzy Properties of IC Subcircuits, Int. Conf. on Fuzzy Logic and Neural Networks, Iizuka,
Japan, 1990
[Oeh92] J. Oehm & K. Schumacher, Accuracy Optimization of Analog Fuzzy Circuitry in Network Analysis
Environment, ESSCIRC, 1992
Neural Networks and Adaptive Fuzzy Systems
[Ber90] H. Berenji, Neural Networks and Fuzzy Logic in Intelligent Control, IEEE, 1990
[God93] J. Godjevac, State of the Art in the Neuro Fuzzy Field, Technical Report No.93.25, Laboratoire de
Microinformatique, DI, EPFL, Lausanne, 1993
[Jan92] J.-S. Roger Jang, ANFIS, Adaptive-Network-Based Fuzzy Inference Systems, IEEE Trans. on Systems,
Man and Cybernetics, 1992
[Kos92] B. Kosko, Neural Networks and Fuzzy Systems, Prentice-Hall International, Inc., Englewood Cliffs
N.J., 1992
[Tak90] H. Takagi, Fusion Technology of Fuzzy Theory and Neural Networks, Survey and Future Directions,
Proc. of the Int. Conf. on Fuzzy Logic & Neural Networks, Vol.1, pp.13-26, Iizuka, Japan, July 1990
[Tak90bis] H. Takagi, T. Kouda & Y. Kojima, Neural Network Designed on Approximate Reasoning
Architecture and Its Application to Pattern Recognition, Proc. of the Int. Conf. on Fuzzy Logic & Neural
Networks, Vol.2, pp.671-674, Iizuka, Japan, July 1990
This chapter edited by D. Mlynek
Chapter 10
VLSI FOR MULTIMEDIA APPLICATIONS
Case Study: Digital TV
I. Introduction
II. Digitization of TV functions
III. Points of concern for the Design Methodology
IV. Conclusion
Today there is a race to design interoperable video systems for basic digital computer functions, involving multimedia applications in areas such as media information, education, medicine and entertainment, to name but a few. This chapter provides an overview of the current status in industry of digitized television, including the techniques used and their limitations, the technological concerns, and the design methodologies needed to achieve the goals for highly integrated systems. Digital TV functions can be optimized for encoding and decoding and implemented in silicon in a more dedicated way, using a kind of automated custom design approach that allows enough flexibility.
Some practical examples are shown in the chapter "Multimedia Architectures".
I. Introduction
Significance of VLSI for Digital TV Systems
When, at the 1981 Berlin Radio and TV Exhibition, the ITT Intermetall company exhibited a digital television VLSI concept to the public for the first time [1], [2], opinions among experts were by no means unanimously favourable. Some were enthusiastic, while others doubted the technological and economic feasibility. Today, after 13 years, more than 30 million TV sets worldwide have already been equipped with this system. The intensive use of VLSI chips no longer needs particular justification, the main reasons being increased reliability, mainly because of the long-term stability of the color reproduction brought about by digital systems, and medium- and long-term cost advantages in manufacturing, which are essential for ensuring international competitiveness.
Digital signal processing permits solutions that guarantee a high degree of compatibility with future developments, whether in terms of quality improvements or new features like intermediate picture storage or adaptive comb filtering, for example. In addition to these benefits, a digital system offers a number of advantages with regard to the production of TV sets:
- Digital circuits are tolerance-free and are not subject to drift or aging phenomena. These well-known properties of digital technology considerably simplify factory tuning of the sets and even permit fully automatic, computer-controlled tuning.
- Digital components can be programmable. This means that the level of user convenience and the features offered by the set can be tailored to the manufacturer's individual requirements via the software.
- A digital system is inherently modular, with a standard circuit architecture. All the chips in a given system are compatible with each other, so that TV models of various specifications, from the low-cost basic model to the multi-standard satellite receiver, can be built with a host of additional quality and performance features.
- Modular construction means that set assembly can be fully automated as well. Together with automatic tuning, the production process can be greatly simplified and accelerated.
Macro-function Processing
The modular design of digital TV systems is reflected in its subdivision into largely independent functional blocks, with the possibility of having special data-bus structures. It is useful to divide the structure into a data-oriented flow and a control-oriented flow, so that we have four main groups of components:
1.- The control unit and peripherals, based on well-known microprocessor structures, with a central communication bus for flexibility and ease of use. An arrangement around a central bus makes it possible to expand the system easily and thereby add further quality-enhancing and special functions for picture, text and/or sound processing at no great expense. A non-volatile storage element, in which the factory settings are stored, is associated with this control processor.
2.- The video functions are mainly the video signal processing and some additional features like, for example, deflection; a detailed description follows later in the chapter. The key point for VLSI implementation, however, is a well-organized definition of the macro-blocks. This serves to facilitate the interconnection of circuit components and minimizes power consumption, which can be considerable at the processor speeds needed.
3.- The digital concept facilitates the decoding of today's new digital sound broadcasting standards as well as the input of external signal sources, such as Digital Audio Tape (DAT) and Compact Disc (CD). Programmability permits mono, stereo, and multilingual broadcasts; compatibility with the other functions in the TV system is resolved with the common communication bus. This leads us to part two, which is dedicated to the description of this problem.
4.- With a digital system, it is possible to add special or quality-enhancing functions simply by incorporating a single additional macro-function or chip. Therefore, standards are no longer so important, owing to the high level of adaptability of digital solutions. For example, adaptation to a 16:9 picture tube is easy.
Figure 1 shows the computation power needed for typical consumer goods applications. Notice from the figure that when the data changes at a frequency x, a digital treatment of that data must be an order of magnitude faster [3].
Fig.1: Computation Power for Consumer Goods
In this chapter we first discuss the digitization of TV functions by analyzing general concepts based on existing systems. The second section deals with silicon technologies and, in particular, with design methodology concerns. The intensive use of submicron technologies, associated with fast on-chip clock frequencies and huge numbers of transistors on the same substrate, affects traditional methods of designing chips. As this chapter only outlines a general approach to the VLSI integration techniques for Digital TV, interested readers will find more detailed descriptions of VLSI design methodologies and realizations in [9], [13], [15], [24], [26], [27], [28].
II. Digitization of "TV functions"
The idea of digitization of TV functions is not new. At the time some companies started to work on it, the technology was not really adequate for the needed computing power, so that the most effective solutions were full custom designs. This forced the block-oriented architecture, in which the digital functions introduced were one-to-one replacements of existing analog functions. Figure 2 gives a simplified representation of the general concept.
Fig.2: Block Diagram of first generation digital TV set
The natural separation of video and audio resulted in some incompatibilities and duplication of primary functions. The transmission principle is not changed, and redundancy is a big handicap: for example, while a SECAM channel is running, the PAL functions are not in operation. New generations of digital TV systems should re-think the whole concept top-down before VLSI system partitioning.
In today's state-of-the-art solutions one can recognize all the basic functions of the analog TV set, with, however, a modularity in the concept that permits additional features. Some special digital possibilities are exploited, e.g. storage and filtering techniques, to improve signal reproduction (adaptive filtering, 100 Hz technology), to integrate special functions (picture-in-picture, zoom, still picture) or to receive digital broadcasting standards (MAC, NICAM). Figure 3 shows the ITT Semiconductors solution, which was the first on the market in 1983 [4].
Fig.3: The DIGIT2000 TV receiver block diagram
By its very nature, computer technology is digital, while consumer electronics is geared to the analog world. Starts have been made only recently to digitize TV and radio broadcasts at the transmitter end (in the form of DAB, DSR, D2-MAC, NICAM etc.). The most difficult technical tasks involved in the integration of different media are interface matching and data compression [5].
After this second step in the integration of multimedia signals, an attempt was made towards standardization, namely the integration of 16 identical high-speed processors with communication and programmability concepts comprised in the architecture (see Figure 4, chip photograph courtesy of ITT Semiconductors).
Fig.4: Chip Photograph – (ITT Semiconductors Courtesy)
Many solutions proposed today (mainly for MPEG-1) are derived from microprocessor architectures or DSPs, but there is a gap between today's circuits and the functions needed for a real, fully digital HDTV system. The AT&T hybrid codec [29], for instance, introduces a new way to design multimedia chips by optimizing the cost of the equipment, considering both processing and memory requirements. Pirsch [6] gives a detailed description of today's digital principles and circuit integration. Other component manufacturers also provide different solutions for VLSI system integration [35][36][37][38][39][40]. In part IV of this chapter a full HDTV system based on wavelet transforms is described. The concept is to provide generic architectures that can be applied to a wide variety of systems, taking into account that certain functions have to be optimized and that some other parts of the algorithms have to be ported to generic processors.
Basics of current video coding standards
Compression methods take advantage of both data redundancy and the non-linearity of human vision. They exploit correlation in space for still images and in both space and time for video signals. Compression in space is known as intra-frame compression, while compression in time is called inter-frame compression. Generally, methods that achieve high compression ratios (10:1 to 50:1 for still images and 50:1 to 200:1 for video) rely on approximations which lead to a reconstructed image that is not identical to the original.
Methods that cause no loss of data do exist, but their compression ratios are lower (no better than 3:1). Such techniques are used only in sensitive applications such as medical imaging. For example, artifacts introduced by a lossy algorithm into an X-ray radiograph may cause an incorrect interpretation and alter the diagnosis of a medical condition. Conversely, for commercial, industrial and consumer applications, lossy algorithms are preferred because they save storage and communication bandwidth.
Lossy algorithms also generally exploit aspects of the human visual system. For instance, the eye is much more receptive to fine detail in the luminance (or brightness) signal than in the chrominance (or color) signals. Consequently, the luminance signal is usually sampled at a higher spatial resolution. Second, the encoded representation of the luminance signal is assigned more bits (a higher dynamic range) than are the chrominance signals. The eye is also less sensitive to energy with high spatial frequency than with low spatial frequency [7]. Indeed, if the images on a personal computer monitor were formed by an alternating spatial signal of black and white pixels, the human viewer would see a uniform gray instead of the alternating checkerboard pattern. This deficiency is exploited by coding the high-frequency coefficients with fewer bits and the low-frequency coefficients with more bits.
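As a minimal sketch of the first of these observations, the fragment below subsamples the chrominance planes of an image by two in each direction while keeping the luminance at full resolution; this alone cuts the raw data rate in half (6 samples per 4 pixels instead of 12). The 2x2 averaging filter and the array layout are illustrative choices on our part, not mandated by any particular standard.

```python
import numpy as np

def subsample_420(y, cb, cr):
    # Keep the luminance plane at full resolution; average each 2x2
    # block of the two chrominance planes (the common 4:2:0 pattern).
    h, w = cb.shape
    cb_sub = cb.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    cr_sub = cr.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return y, cb_sub, cr_sub

y = np.random.rand(8, 8)    # luminance, kept at full resolution
cb = np.random.rand(8, 8)   # chrominance planes, to be subsampled
cr = np.random.rand(8, 8)
_, cb_s, cr_s = subsample_420(y, cb, cr)
print(cb_s.shape)           # (4, 4)
```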
All these techniques add up to powerful compression algorithms. In many subjective tests, reconstructed images that were encoded with a 20:1 compression ratio are hard to distinguish from the original. Video data, after compression at ratios of 100:1, can be decompressed with close to analog videotape quality.
Lack of open standards could slow the growth of this technology and its applications. That is why several digital video standards have been proposed:
JPEG (Joint Photographic Experts Group) for still picture coding
H.261, at p x 64 kbit/s, proposed by the CCITT (Consultative Committee on International Telephony and Telegraphy) for teleconferencing
MPEG-1 (Moving Picture Experts Group), up to 1.5 Mbit/s, proposed for full-motion compression on digital storage media
MPEG-2, proposed for digital TV compression; the bandwidth depends on the chosen level and profile [33].
Another standard, MPEG-4, for very low bit-rate coding (4 kbit/s up to 64 kbit/s), is currently being debated. For more detail concerning the different standards and their definitions, please see the paper included in these Proceedings: "Digital Video Coding Standards and their Role in Video Communication", by R. Schäfer and T. Sikora.
III. Points of Concern for the Design Methodology
As mentioned above, the main idea is to think system-wise through the whole process of development; in doing so, we had to select a suitable architecture as a demonstrator for this coherent design methodology. It makes no sense to reproduce existing concepts or VLSI chips; we therefore focused our demonstrator on the subband coding principle, of which the DCT is only a particular case. Following this line, there is no point in focusing on blocks only, considering just the motion problem to solve, but rather in considering the entire screen in a first global approach. This gives us the ability to define macro-functions which are not restricted in their design limits; the only restrictions will come from practical parameters like block area or processing speed, for example, which depend on the technology selected for developing the chips but not on the architecture or the functionality.
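As a hedged sketch of the subband coding principle adopted here, the fragment below performs one level of a two-band analysis/synthesis filter bank on a one-dimensional signal, using the Haar filter pair purely for brevity; the demonstrator described in this chapter would use longer wavelet filters, and the function names are ours.

```python
import numpy as np

def haar_analysis(x):
    # Split the signal into a low band (coarse approximation) and a
    # high band (detail); together they permit perfect reconstruction.
    x = np.asarray(x, dtype=float)
    even, odd = x[0::2], x[1::2]
    return (even + odd) / np.sqrt(2.0), (even - odd) / np.sqrt(2.0)

def haar_synthesis(low, high):
    # Inverse transform: rebuild and re-interleave even/odd samples.
    even = (low + high) / np.sqrt(2.0)
    odd = (low - high) / np.sqrt(2.0)
    x = np.empty(even.size + odd.size)
    x[0::2], x[1::2] = even, odd
    return x

x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
lo, hi = haar_analysis(x)
print(np.allclose(haar_synthesis(lo, hi), x))   # True
```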
Before going into the details of the system architecture, we would like to discuss in this section the main design-related and technology-dependent factors which influence the hardware design process and the use of CAD tools. We propose a list of the major concerns one should consider when setting out to integrate digital TV functions. The purpose is to give a feeling for the decision process in the management of such a project. In a first step, we discuss RLC effects, down scaling, power management, requirements for the process technology, and design effects like parallelism and logic styles, and we conclude the section with some criteria for the proposed methodology.
R,L,C effects
In computer systems today, the clocks that drive ICs are increasingly fast; 100 MHz is a "standard" clock frequency, and several chip and system manufacturers are already working on microprocessors with GHz clocks. By the end of this decade, maybe earlier, Digital Equipment will provide a 1-2 GHz version of the Alpha AXP chip; Intel promises faster Pentiums; and Fujitsu will have a 1-GHz Sparc processor.
When working frequencies are so high, the wires that connect the circuits, boards and modules, and even the wires inside integrated circuits, start behaving like transmission lines. New analysis tools become necessary to circumvent and to master the high-speed effects.
As long as the electrical connections are short and clock rates are low, the wires can be modeled as RC circuits. In such cases, the designer has to ensure that rise and fall times are sufficiently short with respect to the internal clock frequency. This method is still used in fast clock-rate designs by deriving clock trees to manage good signal synchronization on big chips or boards. However, when the wire lengths increase, their inductance starts to play a major role. This is what transmission-line theory deals with. Transmission line effects include reflections, overshoot, undershoot, and crosstalk. RLC effects have to be analyzed in a first step, but it might be necessary to use another circuit analysis to gain better insight into circuit behaviour. The interconnect will behave like a distributed transmission line and coupled lines (electromagnetic characteristics also have to be taken into account). But transmission line effects can also appear in low clock-rate systems. Take for example a 1 MHz system with a rise time of 1 ns. Capacitor loading will be dictated by the timings, but during the transition time, reflections and ringing will occur, causing false triggering of other circuits. As a rule of thumb, high-speed design techniques should be applied when the propagation delay in the interconnect exceeds 20-25% of the signal rise and fall time [30], [34].
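A minimal sketch of this rule of thumb, assuming a typical on-board propagation velocity of roughly half the speed of light (the velocity and the 25% threshold are illustrative values, not prescriptions taken from [30] or [34]):

```python
def needs_transmission_line_analysis(length_m, rise_time_s,
                                     velocity_m_per_s=1.5e8,
                                     threshold=0.25):
    # Flag an interconnect when its propagation delay exceeds roughly
    # 20-25% of the signal rise/fall time (rule of thumb from the text).
    prop_delay = length_m / velocity_m_per_s
    return prop_delay > threshold * rise_time_s

# A slow edge on a short trace is harmless, a 1 ns edge on 10 cm is not:
print(needs_transmission_line_analysis(0.01, 10e-9))   # False
print(needs_transmission_line_analysis(0.10, 1e-9))    # True
```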
Some possible problems are listed below:
1. Short transition time compared to the total clock period. The effect was described above.
2. Inaccurate semiconductor models. It is important to take into account the physical, electrical and electromagnetic characteristics of the connections and the transistors. Important electrical parameters are metal-line resistivity and skin depth, dielectric constant and dielectric loss factor.
3. Inappropriate geometry of the interconnect. Width, spacing, and thickness of the lines and thickness of the dielectric are of real importance.
4. Lack of a good ground is often a key problem. Inductance often exists between virtual and real ground, due for instance to interconnect and lead inductance.
A solution to these problems could be a higher degree of integration, eliminating a number of the long wires. MCM (Multi-Chip Module) technology is an example of this alternative. MCM simplifies the components, improves the yield and shrinks the footprint. Nevertheless, the word alternative is not entirely correct, since MCM eliminates a certain type of problem by replacing it with another type. The narrowness of the lines introduced with MCMs tends to engender significant losses due to resistance and the skin effect. Furthermore, due to the via structure, the effect of crossover and via parasitics is stronger than in traditional board design. Finally, ground plane effects need special study, since severe ground bounce and an effective shift in the threshold voltage can result from the combination of high current densities with the thinner metallizations used for power distribution.
How then does one get a reliable high-speed design? The best way is to analyze the circuit as deeply as possible. The problem here is that circuit analysis is usually very costly in CPU time. Circuit analysis can be carried out in steps: first EM and EMI analysis; then, according to the component models available in the database, an electrical analysis can be performed using two general approaches: one relies on versions of Spice, the other on direct-solution methods using fixed time increments.
EM (electromagnetic field solver, or Maxwell's equations solver) and EMI (electromagnetic interference) analyzers perform a scan of the layout database for unique interconnect and coupling structures and discontinuities. EM field solvers then use the layout information to solve Maxwell's equations by numerical methods. Inputs to the solver include physical data about the printed circuit or multichip module, such as the layer stack-up dielectrics and their thicknesses, the placement of power and ground layers, and the interconnect metal width and spacing. The output is a mathematical representation of the electrical properties. Solvers analyze the structure in two, two and a half, or three dimensions. In choosing among the variety of solvers, the complexity of the structure and the accuracy of the computation must be weighed against performance and computational cost.
Electrical models are important for analysis tools, and they can be automatically generated from measurements in the time domain or from mathematical equations.
Finally, the time complexity of solving the system matrices that represent a large electrical network is an order of magnitude larger for an electrical simulator like Spice than for a digital simulator.
Down Scaling
As CMOS devices scale down into the submicron region, the intrinsic speed of the transistors increases (frequencies between 60 and 100 MHz are common). The transition times are also reduced, so that the increase in output switching speed increases the rate of change of the switching current (di/dt). Due to parallelization, simultaneous switching of the I/Os creates so-called simultaneous switching noise (SSN), also known as Delta-I noise or ground bounce [8]. It is important that SSN be kept within a maximum allowable level to avoid spurious errors like false triggering, double clocking or missing clock pulses. Output driver design is now no longer trivial, and techniques like current-controlled circuits or controlled slew-rate driver designs are used to minimize the effect of the switching noise [8]. An in-depth analysis of the chip-package interface is required to ensure the functionality of high-speed chips (Figure 5).
Fig.5: Chip-Package Interface
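A hedged back-of-the-envelope illustration of why simultaneous switching is troublesome: the ground bounce V = L di/dt grows with the effective lead inductance and with the number of drivers switching at once. All numbers below are illustrative assumptions, not figures taken from [8].

```python
def ground_bounce(n_drivers, delta_i, delta_t, lead_inductance):
    # V = L * di/dt, summed over the drivers sharing one ground lead.
    return n_drivers * lead_inductance * (delta_i / delta_t)

# 16 output drivers, each ramping 20 mA in 2 ns through a 5 nH ground lead:
v = ground_bounce(n_drivers=16, delta_i=0.020, delta_t=2e-9,
                  lead_inductance=5e-9)
print(f"{v:.2f} V of ground bounce")   # 0.80 V, enough for false triggering
```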
The lesson is that some important parts of each digital submicron chip have to be considered as working in analog mode rather than digital. This applies not only to the I/Os but also to the timing and clocking blocks in a system. The entire floor plan has to be analyzed in the context of noise immunity, parasitics, and also propagation and reflection in the buses and main communication lines. Our idea was to reduce the number of primitive cells and to structure the layout in such a way as to be able to use the common software tools for electrical optimization of the interconnections (abutment optimization). Down scaling of the silicon technology is a common way today to obtain a new technology in a short time and so be able to compete in the digital market, but this shrinking is only linear in x and y (with some differences and restrictions, like vias, for example). The third dimension is not shrunk linearly, for technical and physical reasons. The designer has to make sure that the models describing the devices and the parasitics are valid for the frequency of the particular application considered.
Power Management
As chips grow in size and speed, power consumption is drastically amplified. The current demand for portable consumer products implies that power consumption must be controlled at the same time that complex user interfaces and multimedia applications are driving up computational complexity. But there are limits to how much power can be slashed in analog and digital circuits. In analog circuits, a desired signal-to-noise ratio must be maintained, while for digital IC power the lower limit is set by cycle time, operating voltage, and circuit capacitance [9].
A smaller supply voltage is not the only way to reduce power in digital circuits. Minimizing the number of device transitions needed to perform a given function, local suppression of the clock, reduction of the clock frequency, and elimination of the system clock in favor of self-timed modules are other means to reduce power. This means that for cell-based design technology there is a crucial need to design the cell library to minimize energy consumption. There are various ways to reduce the energy of each transition, which is proportional to the capacitance and to the square of the supply voltage (E = CV²). Capacitances are being reduced along with the feature size in scaled-down processes, but this reduction is not linear. With the appropriate use of design techniques like minimization of the interconnections or use of abutment or optimum placement, it is possible to reduce the capacitances in a more effective way. So what are the main techniques used to decrease the consumption? Decrease the frequency, the size and the power supply. Technology has evolved to 3.3 V processes in production, and most current processors take advantage of this progress. Nevertheless, reducing the number of transistors and the operating frequency cannot be done in so brutal a manner, so trade-offs have to be found. Let us bring some insight into power management by looking at different approaches found in actual products. A wise combination of these approaches may eventually even lead to new methods.
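A minimal sketch of the switching-energy relation above, comparing two hypothetical operating points (the capacitance, frequency and activity figures are illustrative):

```python
def dynamic_power(c_switched, v_supply, f_clock, activity=1.0):
    # P = a * C * V^2 * f: an energy C*V^2 per transition, f transitions
    # per second, scaled by the fraction a of nodes actually switching.
    return activity * c_switched * v_supply ** 2 * f_clock

# The same 100 pF of switched capacitance at 100 MHz:
p_5v0 = dynamic_power(100e-12, 5.0, 100e6)
p_3v3 = dynamic_power(100e-12, 3.3, 100e6)
print(f"5.0 V: {p_5v0:.2f} W   3.3 V: {p_3v3:.2f} W")   # 0.25 W vs 0.11 W
```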
The MicroSparc II uses a 3.3 V power supply and fully static logic. It can cut power to the caches by 75% when they are not being accessed, and in standby mode it can stop the clock to all logic blocks. At 85 MHz it is expected to consume about 5 W.
Motorola and IBM had the goal of providing high performance while consuming little power when they produced the PowerPC 603. Using a 3.3 V supply, a 0.5-micron four-level-metal CMOS process, and static design technology, the chip is smaller than its predecessor (85.5 mm² in 0.5 micron for 1.6 million transistors, instead of 132 mm² in 0.6 micron for 2.8 million transistors). The new load/store unit and the SRU (System-Register Unit) are used to implement dynamic power management, with a maximum of 3 W power consumption at 80 MHz. But a lot more can be expected from the reduction of power consumption associated with a reduction of the voltage swing, on buses or on memory bit lines for example. To achieve a reasonable operating power for a VLSI chip it is necessary to decrease drastically the power consumption of the internal bus drivers; a circuit technology with a reduced voltage swing for the internal buses is a good solution. Nakagome [10] proposed a new internal bus architecture that reduces operating power by suppressing the bus signal swing to less than 1 V; this architecture can achieve a low power dissipation while maintaining high speed and low standby current. This principle is shown in Figure 6.
Fig.6: Architecture of internal bus
An electrothermal analysis of the IC will show the non-homogeneous local power dissipation. This makes it possible to avoid hot-spots in the chip itself (or in a multichip solution) and thereby secure good yield, since the failure rate of microelectronic devices doubles for every 10°C increase in temperature. To optimize both long-term reliability and performance, it has become essential to perform both thermal and electrothermal simulations during the chip design. For example, undesirable thermal feedback due to temperature gradients across a chip degrades the performance of electrical circuits such as reduced-swing bus drivers or mixed analog-digital components, where matching and symmetry are important parameters [11].
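A small sketch of the reliability rule just quoted, under the stated doubling-per-10°C assumption:

```python
def relative_failure_rate(delta_t_celsius):
    # Failure-rate multiplier if the rate doubles for every 10 deg C
    # of temperature increase (the rule of thumb quoted in the text).
    return 2.0 ** (delta_t_celsius / 10.0)

# A hot-spot running 30 deg C above nominal fails 8 times more often:
print(relative_failure_rate(30))   # 8.0
```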
Reducing chip power consumption is not the only issue. When targeting low system cost and power consumption, it becomes interesting to include a PLL (phase-locked loop), allowing the processor to run at higher frequencies than the main system clock. By multiplying the system clock by 1, 2, 3 or 4, the PowerPC 603 operates properly even when a slower system clock is used. Three software-controllable modes can be selected to reduce consumption: most of the processor can be switched off, only the bus snooping is disabled, or the time-base register is switched off. It naturally takes some clock cycles to bring the processor back into full power mode. Dynamic power management is also used to switch off only certain processor subsystems, and even the cache protocol has been reduced from four to three states, while remaining compatible with the previous one.
Silicon Graphics' goal has been to maintain RISC performance at a reduced price. Being neither superscalar nor super-pipelined (only 5 pipeline stages), its processor combines the integer and floating-point units into a single unit. The result is a degraded performance, but with a big saving in the number of transistors. It can also power down unused execution units; this is perhaps even more necessary since dynamic logic is used. The chip should dissipate typically about 1.5 W. Table I lists four RISC chips competing with the top end of the 80x86 microprocessor line. What can be drawn from these CPU considerations is also applicable to television systems. The requirements of compression algorithms and their pre- and post-processing lead to system sizes very similar to computer workstations. Our methodology was to reduce the size of the primitive cells of the library by using optimizing software developed in-house.
Table I: Four RISC chips competing with the top-end of the 80x86 line.

Chip             Trans. (Mio)  Max. Power Dissip.  Price for 1000  Size (mm2)  specint92      specfp92       Operating voltage
DECchip 21066    1.75          20W @ 166MHz        US$424          209         70 @ 166MHz    105 @ 166MHz   3.3V & 5V peripherals
PowerPC 603      1.6           3W @ 80MHz          US$?            85          75 @ 80MHz     85 @ 80MHz     3.3V & 5V peripherals
MicroSparc II    2.3           5W @ 85MHz          US$500          233         57.2 @ 85MHz   49.5 @ 85MHz   3.3V & 5V peripherals
Mips/NEC Rs4200  1.3           2W @ 40/80MHz       US$100          81          55 @ 40/80MHz  30 @ 40/80MHz  3.3V & 5V peripherals
Silicon Technology Requirements
It is important to note that today's process technologies are not adapted to the new tasks in the consumer field: low power, high speed, huge amounts of data. Historically, most progress was made in memory processes, because of the potential business and the real need for storage ever since the microprocessor architecture has existed. More or less all the so-called ASIC process technologies have been extrapolated from the original DRAM technologies, with some drastic simplifications because of yield sensitivity. Now the new algorithms for multimedia applications require parallel architectures and, because of the computation needs, local memorization, which means a drastic increase in interconnections. New ways to improve the efficiency of the designs lie in the improvement of the technologies, not only by shrinking the features linearly or decreasing the supply voltage, but also by providing the possibility of storing the information at the place where it is needed, avoiding interconnection. This could be done by using floating-gate or, better, ferroelectric memories. This material allows memorization on a small capacitance placed on top of the flip-flop which generates the data to be memorized; in addition, the information will not be destroyed, and the material is not sensitive to radiation. Another way is the use of SOI (Silicon On Insulator). In this case the parasitic capacitances of the active devices are reduced to near zero, so that it is possible to work with very small feature sizes (0.1 µm to 0.3 µm) and to achieve very high speed at very low power consumption [12].
Another concern is the multilayer interconnect. Due to the ASIC-oriented methodology it was useful to have more than one metal interconnection layer; this permits the so-called prediffused wafer techniques (like Gate-Arrays or Sea-of-Gates). Software tools developed for this methodology enabled users to use an automatic router. The bad news for high-speed circuits is that the wires are routed in a non-predictive way, so that their lengths are often not compatible with the speed requirements. It was demonstrated long ago that two interconnection layers are sufficient to solve any routing problem for digital circuits; one of these could also be in polysilicon or, better, salicide material, so that only one metalization is really needed for high-speed digital circuits, and perhaps another for the power supply and the main clock system. If the designer uses power minimization techniques for the basic layout cells and takes the poly layer into account for cell-to-cell interconnections, the reduction in power consumption will be significant, mainly due to the reduction of the cell size.
Effects of parallelism
From the power management point of view, it is interesting to note that for a CMOS gate the delay is approximately inversely proportional to the voltage. Therefore, to maintain the same operational frequency, a reduction of the supply voltage (for power saving) must be compensated for by computing in n parallel functional units, each of them operating n times slower. This parallelism is inherent in some tasks like image processing. Bit-parallelism is the most immediate form; pipelining and systolic arrays are other approaches. The good news is that they do not need much overhead for communication and control. The bad news is that they are not applicable when increasing the latency is not acceptable, for example if loops are required in some algorithms. When multiprocessors are used for more general computation, the circuit overhead for the control and communication tasks grows more than linearly, and the overall speed of the chip slows down by several orders of magnitude; this is, by the way, the handicap of standard DSPs applied to multimedia tasks. The heavy use of parallelism also means a need for memorization on chip or, if the memory blocks are outside, an increase in wiring, which means more power and less speed.
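A hedged sketch of the voltage-scaling argument that opens this section: run n units in parallel at 1/n of the rate each, lower the voltage in proportion to the tolerated delay increase, and compare total switching power. This is a first-order model only; it deliberately ignores the control and communication overhead the text warns about.

```python
def parallel_power_ratio(n, c=1.0, v=1.0, f=1.0):
    # First-order model: gate delay ~ 1/V, so n-way parallelism allows
    # the supply to drop to V/n while keeping the same throughput.
    #   P_serial   = C * V^2 * f
    #   P_parallel = n * [C * (V/n)^2 * (f/n)] = C * V^2 * f / n^2
    p_serial = c * v ** 2 * f
    p_parallel = n * (c * (v / n) ** 2 * (f / n))
    return p_parallel / p_serial

print(parallel_power_ratio(2))   # 0.25: two half-speed units, 1/4 the power
```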
Logic styles
The architectures of the developed systems are usually described at a high level to ensure correct functionality. These descriptions cannot generally take into account low-level considerations such as logic design. Usually the tools used are C descriptions converted into VHDL code. Such code is then remodeled into more structured VHDL descriptions, known as RTL (Register Transfer Level) descriptions. This new model of the given circuit or system is then coupled with standard-cell CMOS libraries. Finally, the layout generation is produced. In such a process, the designer is faced with successive operations where the only decisions are made at the highest level of abstraction. Even with the freedom of changing or influencing the RTL model, the lowest level of abstraction, i.e. the logic style, cannot be influenced. It is contained in the use of the standard library, and is rarely different from the pure CMOS style of design.
Nevertheless, the fact that the logic style is frozen can lead to some aberrations, or at least to certain difficulties, when designing a VLSI system. Clock loads may be too high, pipelines may not be easy to manage, and special tricks have to be used to satisfy the gate delays. Particular circuitry has to be developed for system clocking management. One way to circumvent clock generation units is to use only one clock. The TSPC (True Single Phase Clock) technique [13] performs with only one clock. It is particularly suited to fast pipelined circuits when correctly sized, with a non-prohibitive cell area.
Other enhancements
In the whole plethora of logic families (pure CMOS, pseudo-NMOS, dynamic CMOS, Cascade Voltage Switch logic, pass-transistor logic, etc.) it is not possible to obtain excellent speed performance with minimal gate size. There is always a critical path to optimize. Piguet [14] introduces a technique to minimize the critical path of basic cells. All cells are made exclusively of branches that contain one or several transistors connected in series between a power line and a logical node. Piguet demonstrates that any logical function can be implemented with branches only. The result is that, generally, the number of transistors is greater than for conventional schematics. However, it shows that by decomposing complex cells into several simple cells, the best schematics can be found in terms of speed, power consumption and chip area. This concept of minimizing the number of transistors between a power node and a logical node is used in our concept.
Asynchronous design also tends to speed up systems. The clocking strategy is abandoned at the expense of switches which enable a wave of signals to propagate as soon as they are available. The communication is equivalent to "handshaking". The general drawbacks of this technique are the overhead in area, and the careful consideration required to avoid races and hazards [15], [16]. It is also necessary to carry the clocking information with the data, which increases the number of communication lines. Finally, detecting state transitions requires an additional delay, even if this can be kept to a minimum.
Redundancy of information enables interesting realizations [16] in asynchronous and synchronous designs. The technique consists in creating additional information to permit choosing the number representation which fits the minimum delay in the calculating process. Avizienis [17] introduced this field, and research has continued in this subject [18], [19], since it is not difficult to convert the binary representation into the redundant one, though it is more complex to do the reverse. While there is no carry propagation in such a representation, the conversion from a redundant binary number back into a binary number is there to "absorb" the carry propagation.
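As a minimal sketch of this carry-free property, the fragment below adds two numbers held in a radix-2 signed-digit representation with digits in {-1, 0, 1}. The particular transfer rules (each output digit looks only one position behind, so no carry can ripple) are one common formulation of this family of adders, not necessarily the exact scheme of [17]-[19].

```python
def to_int(digits):
    # Signed digits in {-1, 0, 1}, least-significant first.
    return sum(d * (1 << i) for i, d in enumerate(digits))

def sd_add(x, y):
    # Carry-free addition: each output digit depends on at most two
    # adjacent input positions, so there is no carry propagation chain.
    n = max(len(x), len(y))
    x = x + [0] * (n - len(x))
    y = y + [0] * (n - len(y))
    s = [x[i] + y[i] for i in range(n)]        # position sums, in -2..2
    t = [0] * (n + 1)                          # transfer digits
    w = [0] * n                                # interim digits
    for i in range(n):
        prev = s[i - 1] if i > 0 else 0        # one-position lookbehind
        if s[i] == 2:
            t[i + 1], w[i] = 1, 0
        elif s[i] == 1:
            t[i + 1], w[i] = (1, -1) if prev >= 0 else (0, 1)
        elif s[i] == -1:
            t[i + 1], w[i] = (0, -1) if prev >= 0 else (-1, 1)
        elif s[i] == -2:
            t[i + 1], w[i] = -1, 0
    return [w[i] + t[i] for i in range(n)] + [t[n]]

a, b = [1, 1, 0, 1], [1, 0, -1, 1]             # 11 and 5, signed-digit form
z = sd_add(a, b)
print(to_int(a), to_int(b), to_int(z))         # 11 5 16
```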
Enhancement can also be obtained by changing the technology. Implementation in BiCMOS or GaAs [21] will also allow better performance than pure CMOS. But the trade-off of price versus performance has to be carefully studied before making any such decision. Three-dimensional design of physical transistors could also be a possibility for enhancing a system. The capacitive load could be decreased and the density increased, but such methods are not yet reliable [22].
Design Methodology
The aim of the proposed methodology is to show a simple and powerful way to achieve very high speed implementations in the framework of algorithms for image compression as well as pre- and post-processing. These are the main goals and criteria:
• Generality: The approach used should be as general as possible without making the implementation methodology too complex. The range of applications covered by the strategy should be as extensive as possible, concerning both speed and the types of algorithms implemented.
• Flexibility: Here we consider the flexibility for a given task. For instance, if a processing element is designed, it is essential that different solutions can be generated, all with slightly different parameters. This includes speed, word length, accuracy, etc.
• Efficiency: This is an indispensable goal and implies not only efficiency of the algorithm used, but also efficiency of performing the design task itself. Efficiency is commonly measured as the performance of the chip compared to its relative cost.
• Simplicity: This point is a milestone to the previous one. By using simple procedures, or simple macro blocks, the design task will be simplified as well. Restrictions will occur, but if the strategy itself is well structured, it will also be simple to use.
• CAD portability: It is a must for the methodology to be fully supported by CAD tools. A design implementation approach that is not supported by a CAD environment cannot claim to conform to the points given above. The methodology must be defined such that it is feasible and simple to introduce its elements into these tools. So it is important that the existing CAD tools and systems can adopt and incorporate the concepts developed earlier.
ASICs are desirable for their high potential performance, their reliability and their suitability for high-volume production. On the other hand, considering the complexity of development and design, micro- or DSP-processor based implementations usually represent cheaper solutions. However, the performance of the system is the decisive factor here. For consumer applications, cost is generally defined as a measure of the required chip area. This is the most common and important factor. Other cost measures take into account the design time and the testing and verification of the chip: complex chips cannot be redesigned several times. This reduces the time-to-market and gives the opportunity to adapt to the evolving tools and to follow the technology. Some physical constraints can also be imposed on the architecture, such as power dissipation, reliability under radiation, etc. Modularity and regularity are two additional factors that improve the cost and flexibility of a design (it is also much simpler to link these architectures with CAD tools).
The different points developed above were intended to show in a general way the complexity of the design of modern systems. For this reason we focused on different sensitive problems of VLSI development. Today the design methodology is too strongly oriented by the final product. This is usually justified by historical reasons and the natural parallel development of CAD tools and technology processes. The complexity of the tools inhibits the methodology needed for modern system requirements. To prove the feasibility of concurrent engineering with the present CAD tools, the natural approach is the reuse policy. It means that, to reduce the development time, one reuses already existing designs and architectures that are not necessarily adapted to the needs of future systems. This behaviour is led only by the commercial constraint to sell the already possessed product slightly modified.
On the contrary, the EPFL solution presents a global approach with a complex system (from low bit-rate to HDTV) using a design methodology which takes into account the requirements mentioned above. It shows that architectural bottlenecks are removed if powerful macrocells and macrofunctions are developed. Several functions have been hardwired, but libraries of powerful macrocells are not enough, the arising problems being the complex control of these functions and the data bandwidth. That is why a certain balance between hard and soft functions has to be found. System analysis and optimization tools are needed to achieve this goal. We developed software tools enabling fast and easy system analysis by giving the optimal configuration of the architecture of a given function. This tool takes into account functionality, power consumption and area. The access to the hardwired functions needs to be controlled by dedicated but embedded microcontroller cores. The way of designing these cores has to be generic, since each microcontroller will be dedicated to certain subtasks of the algorithm. On the other hand, the same core will be used to achieve tasks at higher levels.
Because it is very expensive to build dedicated processors and dedicated macrofunctions for each application, it is necessary to provide these functions with the optimal genericity to allow their use in a large spectrum of applications and, at the same time, with an amount of customization to allow optimal system performance. This was achieved by using in-house hierarchical analysis tools adapted to the sub-system, giving a figure of "flexibility" of the considered structure in the global system context.
IV. Conclusion
Digitization of the fundamental TV functions has been of great interest for more than 10 years. Several million TV sets containing digital systems have been produced. However, the real and fully digital system is still to come. A lot of work is being done in this field today; the considerations are more technical than economical, which is a normal situation for an emerging technology. The success of this new multimedia technology will be determined by the applications running with these techniques.
The necessary technologies and methodologies were discussed in order to emphasize the main parameters influencing the design of VLSI chips for Digital TV applications, such as parallelization, electrical constraints, power management, scalability and so on.
This chapter edited by D. Mlynek.
Chapter 11
VLSI FOR TELECOMMUNICATION
SYSTEMS
Introduction
Telecommunication Fundamentals
General Telecommunication Network Taxonomy
Comparison Between Different Switching Techniques
ATM Networks
Case Study: ATM Switch
Case study: ATM Transmission of Multiplexed-MPEG Streams
Conclusions
Bibliography
11.1. Introduction
This document is organised as follows: a review of telecommunication fundamentals and a network taxonomy is given in sections 11.2 and 11.3. In section 11.4 switching networks are explained as an introduction to section 11.5, in which ATM network concepts are visited. Sections 11.6 and 11.7 explain two case studies to show the main elements that will be found in a telecommunication system-on-a-chip. The former is an ATM switch, the latter a system to transmit MPEG streams over ATM networks.
This chapter edited by E. Juarez, L. Cominelli and D. Mlynek
11.2. Telecommunication fundamentals
Figure 11.1 shows a switching network. Lines are the media links. Ovals are called network nodes. Media links simply carry data from one point to another. Nodes take the incoming data and route them to an output port.
Figure-11.1: Switching network
If two different communication paths intersect through this network they have to share some resources. Two paths can share a media link or a network node. The next sections describe these sharing techniques.
11.2.1. Media sharing techniques
Media sharing occurs when two communication channels use the same media.
Figure-11.2: Media sharing
This section presents how several communication channels can use the same media link. There are three main techniques.
11.2.1.1. Time Division Multiple Access (TDMA)
This simple method consists of multiplexing data in time. Each user transmits during a fraction of time equal to 1/(number of possible channels), using the full bandwidth W. This sharing mode can be synchronous or asynchronous.
Figure 11.3 shows a synchronous TDMA system. Each channel uses a time slot every T periods. Selecting a time slot identifies one channel. Classical wired telephony uses this technique.
Figure-11.3: Synchronous TDMA diagram
In synchronous TDMA, if an established channel stops transferring data without freeing the assigned time slot, the unused bandwidth is lost and hence other channels can not take advantage of it. This technique has evolved to asynchronous TDMA to avoid this problem.
Figure 11.4 shows an asynchronous TDMA system. Each channel uses a time slot when the user needs to transfer data and a time slot is unused. The header of each time slot identifies the channel. ATM networks use this technique.
Figure-11.4: Asynchronous TDMA diagram
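As a rough illustration of the difference, the following Python sketch (the slot counts and backlog values are hypothetical, not from the text) contrasts the two modes: in synchronous TDMA an idle channel wastes its reserved slot, while in asynchronous TDMA any free slot can carry any channel's data, labelled by a header.

    def sync_tdma_slots(frames, backlog):
        """Synchronous TDMA: channel i may only use slot i of each frame."""
        wasted = 0
        for _ in range(frames):
            for ch in range(len(backlog)):
                if backlog[ch] > 0:
                    backlog[ch] -= 1      # slot carries one cell of channel ch
                else:
                    wasted += 1           # reserved slot stays empty: bandwidth lost
        return wasted

    def async_tdma_slots(frames, backlog):
        """Asynchronous TDMA: every slot goes to any channel with pending data."""
        wasted = 0
        for _ in range(frames * len(backlog)):
            busy = [ch for ch in range(len(backlog)) if backlog[ch] > 0]
            if busy:
                backlog[busy[0]] -= 1     # slot header would label this channel
            else:
                wasted += 1
        return wasted

    print(sync_tdma_slots(4, [8, 0, 0, 0]))   # 12 of 16 slots wasted
    print(async_tdma_slots(4, [8, 0, 0, 0]))  # only 8 of 16 slots wasted

With one busy channel and three idle ones, the synchronous scheme wastes the three reserved slots of every frame, while the asynchronous scheme reuses them.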
These two techniques are used to connect users. Providing broadcast channels in TDMA can not be done easily. The Frequency Division Multiple Access technique avoids this problem. The next section presents this sharing mode.
11.2.1.2. Frequency Division Multiple Access (FDMA)
This sharing method consists of giving each channel a piece of the available bandwidth.
Each user transmits over a constant bandwidth equal to W/(number of possible channels). Filtering the whole W bandwidth spectrum with a bandwidth equal to W' = W/(number of possible channels) selects one channel. TV and radio broadcasters use this media sharing technique. Figure 11.5 shows an FDMA spectrum diagram.
Figure-11.5: FDMA diagram
Another method has been developed based on the frequency dimension. This method, called Code Division Multiple Access, uses an encoding-decoding system initially used for military communications. Today, consumer market applications also use this technique. The next section presents this method.
11.2.1.3. Code Division Multiple Access (CDMA)
Each user transmits using the full bandwidth. Demodulating the whole W band using a given identification code selects one channel out of the others. The next mobile phone standards (IS-95 or W-CDMA) use this media sharing technique. Figure 11.6 shows a CDMA spectrum diagram.
Figure-11.6: CDMA diagram
These techniques can be merged together. For example, the Global System for Mobile communications (GSM = Natel D) phone standard uses an FDMA-TDMA technique.
After this description, we will present in the next section how a network node routes data from an input port to a given output port.
11.2.2. Node sharing technique
Node sharing occurs when two communication channels use the same network node. The question is how several communication channels can use the same node in a cell switching network, e.g. an ATM network.
Figure-11.7: Shared node
Before answering this question, we have to define the specification of the switching function. The next section presents this concept.
11.2.2.1. Switching function
As shown in figure 11.8, a switch has N input ports and N output ports. Data come in on the lines attached to the input ports. After identifying their destination, data are routed through the switch to the appropriate output port. After this stage, data can be sent on the communication line attached to the output port.
Figure-11.8: Canonical switch
We can directly implement this canonical switch in hardware. However, this technological solution poses some throughput problems. In section 11.6.1.2.1 (the one describing the crossbar switch architecture) we will see why. In section 11.6.1.2.2 (the one describing the Batcher-Banyan network) we will see how the throughput problems can be solved.
Furthermore, the incoming data sequence can pose some routing problems. The next part of this section shows these critical scenarios.
11.2.2.2. Switching scenario
Figure 11.9 shows some switching scenarios. Scenario 1 shows two cells from two different input ports going through the switch to two different output ports. These two cells can be simultaneously routed. Scenario 2 shows two cells from the same input port going through the switch to two different output ports. Both cells are routed to their output destinations.
Figure-11.9: Three switch scenarios
Scenario 3 shows two cells from two different input ports going through the switch to the same output port. There are five possible strategies to solve this problem:
To drop one cell and route the other. This solution involves data loss; hence, it is not a good approach.
To route both cells simultaneously and memorize in the output port the cell that has not been sent on the attached line. This technique is called output buffering.
To memorize the incoming cells in the input ports and then route them. This technique is called input buffering.
The two other solutions consist of memorizing the extra cells during the routing task. These techniques are derived from input buffering.
Section 11.6.1 considers why output buffering is better than input buffering.
11.3. General telecommunication network taxonomy
Telecommunication networks can be mainly classified into two groups based on the criterion of who makes the decision of which nodes are not going to receive the transmitted information. When the network takes the responsibility of this decision, we have a switching network. When this decision is left to the end-nodes, we have a broadcast network, which can be divided into packet radio networks, satellite networks and local area networks.
Switching networks use any of the following switching techniques: circuit, message or packet switching, the latter implemented as either virtual circuit or datagram. Let us compare these techniques.
11.4. Comparison between different switching techniques
We can begin with two rough classifications. If a connection (path) between the origin and the end node is established at the beginning of a session, we are talking about circuit or packet (virtual circuit) switching; if it is not, we refer to message or packet (datagram) switching. On the other hand, when considering how a message is transmitted, if the whole message is divided into pieces we have packet switching (based either on virtual circuit or datagram); if it is not, we have circuit or message switching.
In the following paragraphs we get into the details of the different switching techniques.
11.4.1. Circuit switching
In figure 11.11, the most important events in the life of a connection in a four-node circuit switching network (see figure 11.10) are shown. When a connection is established, the origin-node identifies the first intermediate node (node A) in the path to the end-node and sends it a communication request signal. After the first intermediate node receives this signal, the process is repeated as many times as needed to reach the end-node. Afterwards, the end-node sends a communication acknowledge signal to the origin-node through all the intermediate nodes that have been used in the communication request. Then, a full duplex transmission line, which is going to be kept for the whole communication, is set up between the origin-node and the end-node. To release the communication, the origin-node sends a communication end signal to the end-node.
Figure-11.10:
Figure-11.11:
11.4.2. Message switching
Figure 11.12 shows the connection life events for a message switching network. When a connection is established, the origin-node identifies the first intermediate node in the path to the end-node and sends it the whole message. After receiving and storing this message, the first intermediate node (node A) identifies the second one (node B)
and, when the transmission line is not busy, the former sends it the whole message (store-and-forward philosophy). This process is repeated up to the end-node. As can be seen in figure 11.12, no communication establishment or release is needed.
Figure-11.12:
11.4.3. Packet switching based on virtual circuit
Figure 11.13 shows the same events for a virtual circuit (packet) switching network. When a connection is established, the origin-node identifies the first intermediate node (node A) in the path to the end-node and sends it a communication request packet. This process is repeated as many times as needed to reach the end-node. Then, the end-node sends a communication acknowledge packet to the origin-node through the intermediate nodes (A, B, C and D) that have been traversed in the communication request. The virtual circuit established in this way will be kept for the whole communication. Once a virtual circuit has been established, the origin-node begins to send packets (each of them carrying a virtual circuit identifier) to the first intermediate node. Then, the first intermediate node (node A) begins to send packets to the following node in the virtual circuit without waiting to store all message packets received from the origin-node. This process is repeated until all message packets arrive at the end-node. In the communication release, when the origin-node sends the end-node a communication end packet, the latter answers with an acknowledge packet. There are two possibilities to release a connection:
No trace of the virtual circuit information is left, so every communication is set-up as if it were the first
one.
The virtual circuit information is kept for future connections.
Figure-11.13:
11.4.4. Packet switching based on datagram
The most important events in the life of a communication in a datagram switching network are shown in figure 11.14. The origin-node identifies the first intermediate node in the path and begins to send packets. Each packet carries an origin-node and end-node identifier. The first intermediate node (node A) begins to send packets, without storing the whole message, to the following intermediate node. This process is repeated up to the end-node. As there is neither connection establishment nor connection release, the path followed by each packet from the origin-node to the end-node can be different and therefore, as a consequence of different propagation delays, packets can arrive out of order.
Figure-11.14:
11.5. ATM Networks
11.5.1. Asynchronous Transfer Mode
Before describing the fundamentals of ATM networks, we will define a few concepts, such as transfer mode and multiplexing, needed to understand the main ATM points.
The concept of transfer mode summarizes two ideas related to information transmission in telecommunication networks: how information is multiplexed, i.e. how different messages share the same communication circuit, and how information is switched, i.e. how the messages are routed to the destination-node.
11.5.1.1. Multiplexing fundamentals
The concept of multiplexing is related to the way in which several communications can share the same transmission medium. As seen in 11.2.1, the different techniques used are time-division multiplexing (TDM) and frequency-division multiplexing (FDM). The former can be synchronous or asynchronous.
In STD (synchronous time-division) multiplexing, a periodic structure divided into time intervals, called a frame, is defined, and each time interval is assigned to a communication channel. As the number of time intervals in each frame is fixed, each channel has a fixed capacity. The information delay is just a function of the distance and the access time, because there is no conflict to access the resources (time intervals).
In ATD (asynchronous time-division) multiplexing, the time intervals used by a communication channel are neither inside a frame nor previously assigned. Every time interval can be assigned to any channel. The time interval assigned to each information unit carries an appropriate label as identifier. With this scheme, every source might transmit information at any time, given that there are enough free resources in the network.
11.5.1.2. Switching fundamentals
The switching concept relates to the idea of information routing from an origin-node to an end-node. We have already talked about the different switching techniques in sections 11.4.1-11.4.4.
11.5.1.3. Multiplexing and switching techniques used in ATM networks
ATM networks use ATD (asynchronous time-division) as multiplexing technique and cell switching as switching technique.
With ATD multiplexing, variable binary rate sources can be connected to the network because of the dynamic assignment of time intervals to channels.
Circuit switching is not a suitable technique if variable binary rate sources are to be used, because after connection establishment the binary rate with this switching technique must be constant. This fixed assignment is not just an inefficient usage of the available resources but a contradiction to the main goal of B-ISDN (broadband integrated services digital network), where each service has different requirements. ATM networks will be a key element in the development of B-ISDN, as stated in the ITU (International Telecommunication Union) recommendation I.121.
Neither is general packet switching a suitable solution in ATM networks, because of the difficulty of integrating real-time services. However, as it has the advantage of an efficient resource usage for bursty sources, the switching technique adopted in ATM networks is a variant of it: cell switching.
Cell switching works similarly to packet switching. The differences between the two are the following:
All information - data, voice, video - is transported from the origin-node to the end-node in small, constant-size packets of 53 octets, called cells (in traditional packet switching the packet size is variable).
Just lightweight protocols are used in order to allow fast switching in the nodes. As a drawback, the protocols will be less efficient.
Signaling is completely separated from the information flow, in contrast to packet switching, in which both information and signaling are mixed.
Arbitrary binary rate traffic flows can be integrated in the same network.
The size of the ATM cell header is 5 octets (approx. 10 % of the total size of the cell). This small header allows fast processing in the network. The size of the cell payload is 48 octets. This small payload reduces store-and-forward delays in the network switching nodes (see figure 11.15).
The decision about the payload size was a trade-off between different proposals. While in conventional data communication longer payloads are preferred to reduce the information overhead, in video communication, more sensitive to delays, smaller ones are desired. The choice of the current payload size was a salomonic decision: in Europe, the preferred payload size was 32 octets, but in the USA and Japan, the preferred payload size was 64 octets. Finally, in a meeting held in Geneva in June 1989, the average of the two proposals was agreed upon as payload size: 48 octets.
11.5.2. ATM network interfaces
In ATM networks, the interface between the network user (either an end-node or a gateway to another network) and the network is called UNI (User-Network Interface). The UNI specifies the possible physical media, the cell format, the mechanisms to identify different connections established through the same interface, the total access rate and the mechanisms to define the parameters that determine the quality of service.
The interface between a pair of network nodes is called NNI (Network-Node Interface). This interface is mainly dedicated to routing and switching between nodes. Besides, it is designed to allow interoperability between switching fabrics of different companies.
11.5.3. ATM Cell format
The header format depends on whether a cell is at the UNI or at the NNI. The functions of each cell header field are the following (Fig 11.15):
GFC Generic Flow Control. This field appears just at the UNI and is responsible for medium access control, as there is the possibility that more than one end-user might be connected to the same UNI.
VPI, VCI Virtual Path Identifier, Virtual Channel Identifier. A connection in an ATM network is defined
uniquely thanks to the combination of these two fields. It allows routing and addressing at two levels. The
network routing function considers them as labels that can be changed in each node.
PT Payload Type. The main objective of this field is to distinguish between user information and OAM
(Operation & Maintenance) information.
CLP Cell Loss Priority. This field allows assigning one of two priorities (high or low) to cells. For example, if a user does not meet the set-up requirements, the network can mark a cell as a low priority one. Low priority cells will be the first to be dropped when a congestion state is detected in any of the ATM network node queues.
HEC Header Error Control. This field allows error detection (and single-bit correction) for the header information.
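To make the field layout concrete, here is a minimal Python sketch (not from the text; the field widths follow the UNI format described above: GFC 4 bits, VPI 8 bits, VCI 16 bits, PT 3 bits, CLP 1 bit, HEC 8 bits) that packs the five header octets:

    def pack_uni_header(gfc, vpi, vci, pt, clp, hec):
        """Return the 5 header octets of a UNI cell as a bytes object."""
        # First four octets: GFC | VPI | VCI | PT | CLP, most significant bit first.
        word = (gfc << 28) | (vpi << 20) | (vci << 4) | (pt << 1) | clp
        return word.to_bytes(4, "big") + bytes([hec])

    # Example: connection VPI/VCI = (12, 34), user data cell, low drop priority.
    header = pack_uni_header(gfc=0, vpi=12, vci=34, pt=0, clp=1, hec=0x00)
    print(header.hex())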
Cells can be classified into one of the following types:
Figure-11.15:
Non-assigned cells: cells with no useful information. They pass transparently through the physical layer and arrive at the remote ATM layer without modification.
Empty cells: they are also cells with no useful information. When information sources have no cell to be sent, the physical layer introduces these cells to match the cell flow to the maximum transmission capacity. They will never arrive at the remote ATM layer because the physical layer will filter them out.
Metasignaling cells:
cells to negotiate the establishment of a virtual circuit between the network and the end-user. Once the
virtual circuit has been established, all set-up and release operations will use this circuit.
Broadcasting cells: cells whose end-node is every node connected to the same interface.
Physical layer cells:
cells with OAM (Operations & Maintenance) information for the network physical layer.
11.5.4. Protocol Architecture
The protocol stack architecture used in ATM Networks considers three different planes:
User plane, whose main function is the transmission of user information.
Control plane, which is responsible for the transfer of signaling information, connection and admission control.
Management plane, for all OAM operations.
We will describe now the functions of different layers in the user plane of the protocol stack.
11.5.4.1 Physical layer
This is the layer responsible for information transport. It is divided into two sublayers.
PM (Physical Medium) sublayer. It provides bit synchronization, line coding, electro-optical conversion and
the transmission media (currently, monomode optical fibers).
TC (Transmission Convergence) sublayer. The functions associated with this sublayer are the following:
HEC (Header Error Control) field generation and checking.
Matching of the cell flow to the maximum transmission capacity.
Cell delimitation. When an end-node receives a bit-flow it needs to discern where each cell begins and ends, i.e. where the header of each cell is located within the flow. The method to do so consists of searching the flow for an octet that is the HEC of the four previously received octets.
The TC sublayer adapts the cells received from the ATM layer to the specific format used in the transmission.
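As an illustration of this search, the sketch below (a minimal Python version, assuming the standard ITU-T I.432 HEC definition: a CRC-8 with generator x^8 + x^2 + x + 1 over the first four header octets, XORed with the coset value 0x55) computes the HEC and scans a bit-flow for a cell boundary:

    def hec(header4):
        """CRC-8 (x^8 + x^2 + x + 1) of the first four header octets, XOR 0x55."""
        crc = 0
        for octet in header4:
            crc ^= octet
            for _ in range(8):
                crc = ((crc << 1) ^ 0x07) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
        return crc ^ 0x55

    def find_cell_boundary(flow):
        """First offset where the fifth octet is the HEC of the previous four."""
        for i in range(len(flow) - 4):
            if hec(flow[i:i + 4]) == flow[i + 4]:
                return i
        return -1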
11.5.4.2. ATM layer
This layer provides a connection-oriented service, independently of the transmission medium used. Its main functions are the following:
Cell multiplexing and demultiplexing from several connections into a unique cell flow. A pair of identifiers, VPI/VCI, characterizes each connection.
Cell switching. This function consists of changing the input VCI/VPI pair for a different output pair.
Cell header generation/extraction (except for the HEC field, whose generation/checking is the responsibility of the physical layer).
Flow control and medium access control for those UNIs shared by more than one terminal.
11.5.4.3. AAL (ATM Adaptation Layer)
This layer adapts, on the transmitter side, the information coming from higher layers to the ATM layer and, on the receiver side, the ATM services to higher level requirements. It is divided into three sublayers:
CS (Convergence Sublayer). It is divided into two parts:
SSCS (Specific Service CS)
CPCS (Common Part CS)
SAR (Segmentation And Reassembly). On the transmitter side, its function consists of segmenting the CS data units into data units of length equal to the cell payload: 48 octets. On the receiver side, payloads are reassembled to reconstruct the initial data units that were given to the network to be sent. This reassembly function is assigned to the end-nodes and not to the intermediate nodes.
11.5.5. ATM switching
As cell switching networks, ATM networks require a connection establishment. It is here, at this moment, where all the communication requirements are specified: bandwidth, delay, information priority and so on. These parameters are defined for each connection and, independently of what is happening at other network points, determine the connection quality of service (QoS). A connection is established if and only if the network can guarantee the quality demanded by the user without disturbing the quality of the already existing connections.
In ATM networks it is possible to distinguish two levels in each virtual connection. Each of them is defined with an identifier:
VPI, Virtual Path Identifier
VCI, Virtual Channel Identifier
Virtual paths are associated with the highest level of the virtual connection hierarchy. A virtual path is a set of virtual channels connecting ATM switches to ATM switches or ATM switches to end-nodes.
Virtual channels are associated with the lowest level of the virtual connection hierarchy. A virtual channel is a unidirectional communication between end-nodes, between gateways and end-nodes, and between LANs (Local Area Networks) and ATM networks. As the provided communication is unidirectional, each full-duplex communication will consist of two virtual channels (each of them with the same path through the network).
Virtual channels and paths can be established dynamically, by signaling protocols, or permanently. Usually, paths are permanent connections while channels are dynamic ones. In an ATM virtual connection, the input cell sequence is always guaranteed at the output.
In ATM networks, cell routing is achieved thanks to the information pair VPI/VCI. This information is not an explicit address but a label, i.e. cells do not carry in their headers the end-node address but identifiers that change from switch to switch before arriving at the end-node. Switching in a node begins by reading the VPI/VCI fields of the input cell header (empty cells are managed in a special way: after they are identified, they are just dropped at the switch input). This pair of identifiers is used to access the routing table of the switch to obtain, as a result, the output port and a newly assigned VPI/VCI pair. The next switch in the path will use this new pair of identifiers in the same way, and the procedure is repeated.
Switches can be of two types:
VP switches: they analyze just the VPI to route cells. As a virtual path (VP) groups several virtual channels (VC), if the VCIs are not considered, all VCs associated with a VP are switched together.
VC switches: both identifiers, VPI and VCI, are analyzed to route cells.
11.5.6. ATM services
In an ATM network it is possible to negotiate different levels or qualities of service to adapt the network to the applications and to offer the users a flexible way to access the resources.
If we study the main service characteristics, we can establish a service classification and define different adaptation levels for each service. Four different service classes are defined for ATM networks (Table 11.1).
CLASS   BINARY RATE   DELAY          CONNECTION-ORIENTED   APPLICATIONS
A       Constant      Constant       Yes                   Telephony, voice
B       Variable      Constant       Yes                   Compressed video and voice
C       Variable      Not constant   Yes                   Data applications
D       Variable      Not constant   No                    LAN interconnections
Table-11.1: ATM service classes.
Once the different services have been characterized, it is possible to define the different adaptation layers. There are four adaptation layers in ATM networks:
AAL 1, for class A services.
AAL 2, for class B services.
AAL 3/4, for class C or D services.
AAL 5, also for class C or D services.
11.5.7. Traffic control in ATM networks
The main objective of the traffic control function in ATM networks is to guarantee an optimal network performance in the following aspects:
Number of cells dropped in the network.
Cell transfer delay.
Cell transfer delay variance, or delay jitter.
Basically, network traffic control in ATM networks is a preventive approach: it avoids congestion states, whose immediate effects are excessive cell dropping and unacceptable end-to-end delays.
Traffic control can be applied from two different sides: on the network side, it incorporates two main functions:
Call Acceptance Control (CAC) and Usage Parameter Control (UPC). On the user side, it mainly takes the form of
either source rate control or layer source coding (prioritization) to conform to the service contract specification.
11.5.7.1. Call acceptance control
CAC (call acceptance control) is implemented during the call setup to ensure that the admission of a call will not disturb the existing connections and also that enough network resources are available for this call. It is also referred to as call admission control. The CAC results in a service contract.
11.5.7.2. Usage parameter control
UPC (usage parameter control) is performed during the connection's life. It is performed to check whether the source characteristics respect the service contract specification. If excessive traffic is detected, it can be either immediately discarded or tagged for selective discarding if congestion is encountered in the network. It is also referred to as traffic monitoring, traffic shaping, bandwidth enforcement or cell admission control. The Leaky Bucket (LB) scheme is a widely accepted implementation of a UPC function.
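As a sketch of the idea (a minimal Python version, not a normative implementation; the rate and depth values are illustrative), a leaky bucket drains at the contracted cell rate, each arriving cell adds one unit, and a cell that would overflow the bucket is the excess traffic to tag or discard:

    class LeakyBucket:
        def __init__(self, rate, depth):
            self.rate = rate          # contracted drain rate, cells per second
            self.depth = depth        # bucket depth: tolerated burst size, cells
            self.level = 0.0
            self.last_t = 0.0

        def conforms(self, t):
            """Check a cell arriving at time t against the service contract."""
            self.level = max(0.0, self.level - (t - self.last_t) * self.rate)
            self.last_t = t
            if self.level + 1.0 <= self.depth:
                self.level += 1.0
                return True           # conforming cell
            return False              # excess traffic: tag CLP or discard

    bucket = LeakyBucket(rate=100.0, depth=5.0)
    # A burst of 8 cells at 1 ms spacing: the first five conform, the rest do not.
    print([bucket.conforms(i * 0.001) for i in range(8)])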
11.6. Case study: ATM switch
This section shows the architecture of the critical routing part of an ATM switch. Before talking about an existing ATM chip, we will present the technological constraints that drive the design.
The switch functionality can be split into two main parts:
A routing function to carry data from an input port to an output port.
A queuing function to temporarily memorize the incoming data affected by the blocking problem.
11.6.1. Main switching considerations
11.6.1.1. Solving the blocking problem (head-of-line blocking)
This section shows why output buffering is a better solution to the blocking problem (section 11.2.2.2 shows the blocking scenario).
Consider a simple 2X2 (2 input ports and 2 output ports) switch (see figure 11.16). Each number represents a destination port address. Queued cells are in yellow and routed cells are in blue.
Figure-11.16: Input and Output buffering sequence
With an input buffering technique we need four cycles to route all cells:
The first cycle shows the queuing of one cell and the routing of the other.
The second cycle shows the routing of the previously queued cell and the queuing of two incoming cells.
The third cycle shows the routing of the two previously queued cells and the queuing of the incoming cell.
The last cycle shows the routing of the last cell.
With an output buffering technique we need only three cycles to route all cells:
The first cycle shows the routing of all incoming cells. One is queued, the other is sent through the connected output line #2.
The second cycle shows the routing of the second couple of incoming cells. One is queued in queue #2, the previously queued one is sent through the connected output line #2 and the last one is directly sent through the output line #1.
The last cycle shows the sending of the queued cell through the line #2 and the routing and sending of the cell to the line #1.
In certain cases, output buffering thus allows a smaller cell latency. Therefore, a lower memory capacity is needed in the switch. To solve the blocking problem, the output buffering technique has been chosen.
After this choice, we need to know how the routing function can be implemented. Next section presents th
currently used techniques.
11.6.1.2. Routing function implementation
The simplest technique to implement the routing function is to link all the inputs to all the outputs. By programming this array of connections, the data can be routed from any of the input ports to any of the output ports. We can implement this function using a crossbar architecture.
11.6.1.2.1. Crossbar switch
A crossbar is an array of buses and transmission gates implementing paths from any input port to any output port. This section describes this technique. To understand the limitations of such a technique, we first describe the transmission gate.
Figure-11.17: Electric view of a transmission gate.
11.6.1.2.1.1 Transmission gate
Figure 11.17 shows an electric view of a transmission gate. Figure 11.18 shows a schematic view of the transmission gate. Two complementary transistors transmit the input signal without degradation (the NMOS transmits the VSS level and the PMOS transmits the VDD level). The command input enables or disables the transmission function. For instance:
If Command = VSS, both transistors are cut off.
If Command = VDD, both transistors conduct.
Cin represents the parasitic load on the input line and Cout represents the parasitic load on the output line.
Figure-11.18: Schematic view of a transmission gate.
11.6.1.2.1.2 The crossbar switch
If we wire an array of transmission gates as shown in figure 11.19, we obtain a programmable system capable of routing any incoming data to any output port.
Figure-11.19: 2X2-crossbar switch.
We can implement a 4X4 switch by repeating this 2X2 structure (see figure 11.20).
Figure-11.20: 4X4-crossbar switch.
We can repeat this structure N times to obtain the required number of input and output ports. This approach causes a bus load problem: the larger the number of input and output ports, the larger the load and length of each bus. For example, in figure 11.20 the load on input bus #1 is four times the input load of one transmission gate plus the parasitic capacitance of the wire. Therefore, the routing delay from an input to an output is long. We cannot use this technique to implement high-throughput switches with a large number of ports.
To solve this problem, a switch based on a network of 2X2 switches has been developed. The next section shows how these switches are implemented.
11.6.1.2.2. The Batcher-Banyan switch
Figure 11.21 shows the 2X2-switch module. This switch is composed of one 2X2 crossbar implementing the routing function and four FIFO memories implementing the output buffer function. The delay to carry data from an input to an output is lower than that of the crossbar switch because the buses are short and are loaded by only two transmission gates.
Figure 11.22 shows an 8X8 Banyan switch. Input ports are connected to output ports by a three-stage routing network. There is exactly one path from any input to any output port. Each 2X2-switch module simply routes one input to one of its two outputs.
Figure-11.21: 2X2 switch.
Figure-11.22: Banyan network switch.
A blocking scenario in a Banyan switch is shown in figure 11.23. In this figure, red paths show successfully routed cells and blue ones show blocked cells. The numbers at the inputs represent cell destination output port numbers. All the incoming cells have a different output destination, but only two cells are routed. Some internal collision causes this problem.
A solution to this problem is to make sure that this internal collision scenario never appears. This can be achieved if incoming cells are sorted before the Banyan routing network. The sorter should sort the incoming cells according to bitonic sequence rules. A Batcher sorter using a network of 2X2 comparators implements this function.
Figure-11.23: Blocking in a Banyan network
Figure 11.24 shows some routing scenarios without internal collisions.
Figure-11.24: Routing scenario without collision
For instance, the following sequence is a bitonic sequence: {7, 5, 2, 1, 0, 3, 4, 6}.
Rules to identify bitonic sequences are as follows:
An ascending order sequence, {0, 1, 2, 3, 4, 5, 6, 7}, like in the first scenario of figure 11.24.
A descending order sequence.
An ascending order sequence followed by a descending order sequence.
A descending order sequence followed by an ascending order sequence {7, 5, 2, 1, 0, 3, 4, 6}, like in the
second scenario of figure 11.24.
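These rules can be checked mechanically; the following Python sketch (not from the text) counts direction changes between adjacent elements, so at most one change means the sequence matches one of the four bitonic patterns listed above:

    def is_bitonic(seq):
        """True for at most one ascending run followed by one descending run,
        or vice versa (a pure ascending/descending run has zero changes)."""
        changes, rising = 0, None
        for a, b in zip(seq, seq[1:]):
            r = b > a
            if rising is not None and r != rising:
                changes += 1
            rising = r
        return changes <= 1

    print(is_bitonic([7, 5, 2, 1, 0, 3, 4, 6]))   # True: descending then ascending
    print(is_bitonic([0, 1, 2, 3, 4, 5, 6, 7]))   # True: ascending
    print(is_bitonic([3, 6, 0, 5, 4, 1, 7, 2]))   # False: several direction changes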
This well-known architecture is currently used to implement the switching function. The next section comments on an existing switching chip using this technique.
11.6.2. ATM Cell Switching
11.6.2.1. ATM high-level Switch Architecture
Table 11.2 shows the main functions of each ATM layer.
LAYER   SUBLAYER   FUNCTIONS
AAL     CS         Convergence
        SAR        Segmentation and reassembly
ATM                GFC field management; header generation and extraction;
                   VCI and VPI processing; multiplexing and demultiplexing of the cells
PL      TC         Flow rate adaptation; HEC generation and check; cell synchronization;
                   transmission adaptation
        PM         Synchronization; data emission and detection
Table-11.2: ATM layer structure
AAL: ATM Adaptation Layer
CS: Convergence Sublayer
SAR: Segmentation and Reassembly sublayer
ATM: ATM Layer
PL: Physical Layer
TC: Transmission Convergence
PM: Physical Medium
Figure 11.25 shows a switch high-level architecture. Each block implements some of the functions described in Table 11.2.
Figure-11.25: Switch architecture
An explanation of the general functionality of each layer can be found in section 11.5.4.
The management block drives and synchronizes the other layers; for instance, it drives the control check and administrative functions. High data transfer rates can be reached (up to some gigabits per second).
One of the critical blocks of this architecture is the switching module (surrounded in bold in figure 11.25). The previous section discussed one of the most commonly used techniques to implement this function. In the next section we will comment on an existing chip designed with the previously described techniques.
11.6.2.2. An Existing Switch Architecture
Figure 11.26, Yam[97], shows the mapping between the chip architecture and the functional architecture.
Figure-11.26: Comparison Functional to Real architecture
There are three main blocks in this chip:
The first block implements the header processing.
The second one implements the commutation table.
The third one implements the switch function.
Figure 11.27 shows the details of the entire switching system.
Figure-11.27: switching system
The switching network module is mainly composed of the following blocks: a Batcher-Banyan network, one input multiplexer bank and one output demultiplexer bank. The Batcher-Banyan network implements the switching function. The multiplexer-demultiplexer banks are used to reduce the internal Batcher-Banyan network bus
width (from 8 bits to 2 bits and vice versa).
This means that to switch one incoming 8-bit word in one cycle, four internal Batcher-Banyan network cycles are needed. The drawback of the bus width reduction is a four-fold increase in the internal switch frequency. Therefore, the chip designers had to choose a faster technology to keep a high-throughput switching function. In this case they chose GaAs technology, usually used for high-frequency systems.
11.7. Case study: ATM transmission of multiplexed-MPEG streams. Introduction
Available ATM network throughputs, in the order of Gb/s, allow broadband applications to interconnect using ATM infrastructures. As a case study, to give some intuition about the main elements that will be found in a telecommunication system-on-a-chip, we will consider the architectural design of an ATM ASIC. The architecture is conceived to give service to applications in which we need to multiplex and transport multimedia information to an end-node through an ATM network. Interactive multimedia and mobile multimedia are examples of applications that will use such a system.
Interactive multimedia (INM) relates to the network delivery of rich digital content, including audio and video, to client devices (e.g. desktop computer, TV and set-top box), typically as part of an application having user-controlled interactions. It includes interactive movies, where viewers can explore different subplots; interactive games, where players take different paths based on previous event outcomes; training-on-demand, in which the training content tunes to each student's existing knowledge, experience, and rate of information absorption; interactive marketing and shopping; digital libraries; video-on-demand and so on.
Mobile multimedia applies in general to every scenario in which a remote delivery of expertise to mobile agents is needed. It includes applications in computer supported cooperative work (CSCW), where mobile workers with difficult problems receive advice to enhance the efficiency and quality of their tasks, or emergency-response applications (ambulance services, police, fire brigades).
A system offering this service of multiplexing and transport through ATM networks should meet the following requirements if it is to cover applications such as those explained above:
The system should easily scale the number of streams and the bandwidth associated to each of them to
accommodate future service demand increases.
The system should fairly share the available multiplex bandwidth between all different sources. This
feature will enable either to increase the number of streams to be multiplexed when the available
bandwidth is fixed or to reduce the necessary bandwidth to multiplex a fixed number of them.
The system should guarantee a bandwidth reservation if sources with heterogeneous traffic patterns want
to be simultaneously served.
The system should be able to give service to mobile/portable sources connected by either wireless or
infrared links.
The system should be able to control the quality of the service (QoS) offered because if no control is
applied in order to keep it constant, image quality degradation will depend sharply on transient congestion
conditions in the network when information is dropped randomly.
Last, but not least, the system should be integrated on a single chip.
11.7.1. A system view
Distributing the multiplexing function between the different sources allows efficiently meeting the requirements of mobility/portability and streaming scalability.
Figure-11.28:
This distribution can be achieved with a basic unit that applies the multiplexing function locally to each source, as can be seen in figure 11.28. This basic unit is repeated for each stream that we want to multiplex. Figure 11.29 shows how the basic unit works: there is a queue, where cells carrying information from the source wait until the MAC (Medium Access Control) unit gives the cells permission to be inserted. When an empty cell is found and the MAC unit allows insertion, this empty cell disappears from the flow and a new cell is inserted.
Figure 11.30 shows the details of this basic unit. There are four main blocks:
Cell multiplexing unit: where empty cells are substituted by source cells when MAC makes the decision.
MAC: decides when the information coming from the video source is introduced into the high-speed
flow.
QoS control: manages video information in order to produce a smooth quality of service degradation when the network suffers from congestion.
Protocol processing & DMA blocks: they, respectively, adapt information coming from the source for
ATM transmission and communicate with the software running in the host processor.
Figure-11.29:
The path followed by a cell from the source to the output module when it is multiplexed is also shown in figure 11.30.
Figure-11.30:
In what follows, we will get into the details of the QoS block, the MAC block and the protocol processing and DMA block, leaving the cell multiplexing unit block for the end, to explain the main design features of telecommunication ASICs.
11.7.2. Quality of Service (QoS) control (Prioritization)
One potential problem in ATM networks, caused by the bursty nature of the traffic, is cell loss. When several sources transmit at their peak rates simultaneously, the buffers available at some switches may overflow. The subsequent drops of cells lead to severe degradation in service quality (multiplicative effect) due to the loss of synchronization at the decoder. In figure 11.31, the effect on the quality of the received image due to cell drops is shown. The decoded picture has been transmitted through an ATM network with congestion problems.
Rather than randomly dropping cells during network congestion, we might specify to the ATM network the relative importance of different cells (prioritization) so that only the less important ones are dropped. This is possible in ATM networks thanks to the CLP (cell loss priority) cell header bit. Thus, if we do so, when the network enters a period of congestion, cells are dropped in an intelligent fashion (non-priority cells first) so that the end-user only perceives a small degradation in the service's QoS.
Figure-11.31:
However, when the network is operating under normal conditions, both high priority and low priority cells are successfully transmitted and a high quality service is available to the end user. In the worst-case scenario, the end user is guaranteed a predetermined minimum QoS dictated by the high priority packets.
Figure-11.32:
Figure-11.33:
In figures 11.32 and 11.33, the effect on the quality of the received image due to cell drops is shown. However, as the priority mechanism is applied (low frequency image information as high priority data and high frequency image information as low priority data), an improvement in the quality of the decoded image is observed.
Figure 11.34 shows the effect of non-priority cell drops in the high frequency portion of the decoded information.
Figure-11.34:
11.7.3. Medium access control (MAC)
The basic functionality of the distributed multiplexing algorithm is to incorporate low speed ATM sources into a single ATM flow. When two or more sources try to access the common resource, a conflict can occur.
The medium access control (MAC) algorithm should solve the conflicts between two or more sources simultaneously accessing the high-speed bus. Each MAC block controls the behavior of a basic unit. It can be considered as a state machine which acts depending on the basic unit inputs: empty cells from the high-speed bus, cells from the MPEG source connected to it and access requests from other basic units.
The MAC algorithm can adopt the DQDB (Distributed Queue Dual Bus) philosophy, taking into account that there is just one information flow (downstream). A dedicated channel is responsible for sending requests upstream.
The main objective of the DQDB protocol is to create and maintain a global queue of access requests to the shared bus. That queue is distributed among all connected basic units. If a basic unit wants to send an ATM cell, a request is sent to all its predecessors. Therefore, each basic unit receives, from the neighbor on the right, the access requests coming from every basic unit on the right. These requests and the requests of the current basic unit are sent to the neighbor on the left. For each request, an empty cell passes through a basic unit without being assigned.
When QoS control is applied, these algorithms should be modified to allow all HP cells to be sent before any LP
cell queued at any basic unit. This mechanism achieves critical information to be sent first when congesti
appears.
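To make the distributed-queue idea concrete, the following C sketch models the per-unit bookkeeping in a
DQDB-like scheme, simplified to a single priority level and one queued cell per unit; the structure and
function names are illustrative, not taken from the design itself.

/* Per-unit DQDB-style bookkeeping (hedged sketch). */
typedef struct {
    int rq;        /* request counter: downstream requests not yet served */
    int cd;        /* countdown: requests that were ahead of our cell     */
    int queued;    /* 1 while a local cell waits for transmission         */
} mac_unit;

/* A request arrives on the dedicated upstream request channel. */
void on_request(mac_unit *u)
{
    u->rq++;                      /* one more unit joined the global queue */
}

/* The local MPEG source queues a cell: join the distributed queue. */
void on_local_cell(mac_unit *u)
{
    u->queued = 1;
    u->cd = u->rq;                /* everyone who asked earlier goes first */
    u->rq = 0;
    /* ...and the unit sends its own request upstream. */
}

/* An empty cell arrives on the downstream bus.
 * Returns 1 if this unit seizes the cell for its own data. */
int on_empty_cell(mac_unit *u)
{
    if (u->queued) {
        if (u->cd == 0) {         /* our turn in the distributed queue */
            u->queued = 0;
            return 1;
        }
        u->cd--;                  /* empty cell serves someone ahead of us */
    } else if (u->rq > 0) {
        u->rq--;                  /* empty cell serves a downstream request */
    }
    return 0;
}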
11.7.4. Communication with the host processor: protocol processing & DMA.
Another important issue is the information exchange between the software running on the host processor
and the basic unit. The main mechanism used for these transactions is DMA (Direct Memory Access). In this
technique, all communication passes through special shared data structures - they can be read from or written to
by both the processor and the basic unit - that are allocated in the system's main memory.
Any time data is read from or written to main memory, it is considered to be "touched". A design should
minimize data touches because of the large negative impact they can have on performance.
Let us imagine we are running, on a typical monolithic Unix kernel machine, an INM application
implementation of the AAL/ATM protocol. Figure 11.35 shows all data touch operations involved in
transmitting a cell from host main memory to the basic unit. The sequence of events is as follows:
1. The application generates data to be sent and writes it to its user-space buffer. Afterwards, it issues a
system call to the socket layer to transmit the data.
To copy data from the user buffer into a set of kernel buffers, both of them located in main memory, steps
2 and 3 are needed:
2. The socket layer reads data from main memory.
3. The socket layer writes data to main memory.
Figure-11.35: Data touch operations involved in transmitting a cell from host main memory to the basic unit.
To adapt this data to ATM transmission, step 4 is needed.
4. The AAL layer implementation reads the data so that it can segment it and compute the checksum that has to
be inserted in the AAL_PDU trailer.
5. The basic unit reads the data from the kernel buffers, adds the ATM cell header, and transmits the cell.
Figure 11.36 shows what happens in hardware for the events explained above.
Figure-11.36:
Some of the lines are dashed to indicate that the corresponding read operation might be satisfied from the cache
memory rather than from the main memory. In the best case, there are three data touches for any given piece of
data; in the worst case, there are five.
11.7.4.1. A quantitative approach to data touches
Why is the number of data touches so important? Let us consider a main memory bandwidth of about 1.5 GB/s
for sequential writes and 0.7 GB/s for sequential reads. If we assume that on average there are three reads
for every two writes (see figure 11.36), the resulting average memory bandwidth is ~1.0 GB/s. If our basic unit
requires five data touch operations for every word in every cell, then the average throughput we can expect will
be only a fifth of the average memory bandwidth, i.e., 0.2 GB/s. Clearly, every data touch that we can save will
provide a significant improvement in throughput.
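The arithmetic behind this estimate is easy to reproduce; the short C program below recomputes it, taking the
bandwidth figures from the text and assuming the same 3:2 read/write mix.

#include <stdio.h>

int main(void)
{
    const double write_bw = 1.5;  /* GB/s, sequential writes */
    const double read_bw  = 0.7;  /* GB/s, sequential reads  */

    /* Three reads for every two writes: weighted average bandwidth. */
    double avg_bw = (3.0 * read_bw + 2.0 * write_bw) / 5.0;  /* ~1.02 GB/s */

    /* Five data touches per word cut the usable throughput fivefold. */
    double throughput = avg_bw / 5.0;                        /* ~0.2 GB/s */

    printf("average memory bandwidth: %.2f GB/s\n", avg_bw);
    printf("achievable throughput:    %.2f GB/s\n", throughput);
    return 0;
}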
11.7.4.2. Reducing the number of data touches
The number of data touches can be reduced if either the kernel buffers, or both the user and kernel buffers, are
allocated from extra on-chip memory added to the basic unit.
In figure 11.37, kernel buffers are allocated from memory on the basic unit to reduce the number of data touches.
Programmed I/O is the technique used to move data from the user buffer to these on-chip kernel buffers (data is
touched by the processor before it is transferred to the basic unit).
Figure-11.37: Kernel buffers allocated from on-chip memory, with programmed I/O.
Figure 11.38 shows the same data touch reduction but with DMA being used instead of programmed I/O. In
this case, as the data arriving from main memory at the basic unit is not touched by the processor, the processor
cannot compute the checksum needed in the AAL layer; therefore, this computation has to be implemented in
hardware in the basic unit.
Figure-11.38: Kernel buffers allocated from on-chip memory, with DMA transfers.
Figure 11.39 shows an alternative that involves no main memory accesses at all (zero data touches). Both user
and kernel buffers are allocated from on-chip memory. Although this approach drastically reduces the number
of data touches, it has two disadvantages:
A very large amount of on-chip memory is needed to allocate the user and kernel buffers.
The API (Application Programming Interface) presented to programmers in this kind of
framework is incompatible with the existing socket-based API.
Figure-11.39: User and kernel buffers allocated from on-chip memory (zero data touches).
11.7.5. Cell multiplexing unit: explanation of main design features of telecommunication ASICs.
There are four modules in the Cell Multiplexing Unit (figure 11.40):
Input module
Input FIFO module
Multiplexing module
Output module
Figure-11.40: Modules of the Cell Multiplexing Unit.
Their functionalities and main design features are as follows:
The Input and Output modules implement the UTOPIA protocol (levels one and two), the ATM Forum standard
communication protocol between an ATM layer and a physical layer entity. Common design elements used in
both modules are registers, finite-state machines, counters, and logic to compare register values, as shown in the
following figures (figure 11.41 and figure 11.42).
Figure-11.41:
Figure-11.42:
The FIFO module isolates two different clock domains: the input cell clock domain from the output cell clock
domain. Besides, it allows cells to be stored (first in, first out) when the UTOPIA protocol stops the cell flow.
Having different clock domains is a characteristic feature of telecommunication systems-on-a-chip that adds a
new dimension to the design complexity: unsynchronized clock domains generate metastable behavior in the
flip-flops that interface the two domains. If reliable system operation is desired, techniques to reduce the
probability of metastable behavior in a flip-flop have to be implemented.
The FIFO queue is implemented with a dual-port RAM memory and two registers to store addresses: the write
and read pointers. Part of this queue is shown in figure 11.43.
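The pointer logic of such a queue can be sketched in C as follows; the depth and cell size are illustrative, and
the dual-port RAM is modeled as a plain array.

#include <stdint.h>

#define FIFO_DEPTH 16                 /* cells; power of two for cheap wrap */
#define CELL_BYTES 53

typedef struct {
    uint8_t  ram[FIFO_DEPTH][CELL_BYTES];  /* dual-port RAM */
    unsigned wr;                           /* write pointer */
    unsigned rd;                           /* read pointer  */
} cell_fifo;

static int fifo_full(const cell_fifo *f)  { return f->wr - f->rd == FIFO_DEPTH; }
static int fifo_empty(const cell_fifo *f) { return f->wr == f->rd; }

/* Write side (input cell clock domain). */
static int fifo_push(cell_fifo *f, const uint8_t cell[CELL_BYTES])
{
    if (fifo_full(f)) return 0;            /* UTOPIA would stop the flow */
    for (int i = 0; i < CELL_BYTES; i++)
        f->ram[f->wr % FIFO_DEPTH][i] = cell[i];
    f->wr++;
    return 1;
}

/* Read side (output cell clock domain). */
static int fifo_pop(cell_fifo *f, uint8_t cell[CELL_BYTES])
{
    if (fifo_empty(f)) return 0;
    for (int i = 0; i < CELL_BYTES; i++)
        cell[i] = f->ram[f->rd % FIFO_DEPTH][i];
    f->rd++;
    return 1;
}

In the real hardware the two pointers cross clock domains, so each pointer would typically be Gray-coded and
double-registered in the destination domain to keep the probability of metastability low; the model above
deliberately ignores that aspect.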
Figure-11.43: Part of the FIFO queue (dual-port RAM with write and read pointers).
The Multiplexing module replaces empty cells with assigned ones. The insertion module has two registers to
avoid the loss of parts of a cell when the UTOPIA protocol stops, another two registers to delay the information
coming from the network, and one register for pipelining the module (figure 11.44).
Figure-11.44: The cell insertion (multiplexing) module.
11.8. Conclusions
Through these two case studies within the ATM domain, we have shown the main characteristics common to
telecommunication ASIC design. Briefly, these features are the following:
Different clock domains can coexist; therefore, techniques to reduce the probability of metastable behavior
have to be applied in the design.
High-throughput networks imply dealing with high-frequency clock designs (hundreds of megahertz).
FIFO memories are usually needed either to separate different clock domains or to store information before
accessing a common resource.
Designs are mainly dominated by the presence of registers.
This chapter edited by E. Juarez, L. Cominelli and D. Mlynek
Chapter 12
Digital Signal Processing Architectures
Introduction
History
Typical DSP applications
The FIR Example
General Architectures
Data Path
Addressing
Peripherals
How is a DSP different from a general-purpose processor
Superscalar Architectures
12.1 Introduction
Digital signal processing is concerned with the representation of signals in digital form and the transformation
or processing of such signal representation using numerical computation.
Sophisticated signal processing functions can be realized using digital techniques; numerous important signal
processing techniques are difficult or impossible to implement using analog (continuous-time) methods.
Reprogrammability is a strong advantage over conventional analog systems. Furthermore, digital systems are
inherently more reliable, more compact, and less sensitive to environmental conditions and component aging
than analog systems. The digital approach also allows the possibility of time-sharing (or multiplexing) a
microprocessor among a number of different signal processing functions.
12.2 History
Since the invention of the transistor and the integrated circuit, digital signal processing functions have been
implemented on many hardware platforms ranging from special-purpose architectures to general-purpose
computers. One of the earliest descriptions of a special-purpose hardware architecture for digital filtering was
published by Bell Labs in 1968.[1] The problem with such architectures, however, is their lack of flexibility. In
order to realize a complete application, one needs to be able to perform functions that go beyond simple
filtering, such as control, adaptive coefficient generation, and non-linear functions such as detection.
The solution is to use an architecture that is more like a general-purpose computer, but which can perform basic
signal processing operations very efficiently. This means satisfying the following criteria:
The ability to perform a multiply and add operation in parallel in the time of one instruction.[2]
The ability to perform data moves to and from the arithmetic unit in parallel with arithmetic operations
and modification of address pointers.
The ability to perform logical operations on data and alter control flow based on the results of these
operations.
The ability to control operations by sequencing through a stored program.
In the 1960s and 1970s, multiple chips or special-purpose computers were designed for computing DSP
algorithms efficiently. These systems were too costly to be used for anything but research or military
applications. It was not until all of this functionality (arithmetic, addressing, control, I/O, data storage, control
storage) could be realized on a single chip that DSP could become an alternative to analog signal processing for
the wide span of applications that we see today.
In the late 1970s, large-scale integration technology reached the point of maturity at which it became practical
to consider realizing a single-chip DSP. Several companies developed products along these lines, including
AMI, Intel, NEC, and Bell Labs.
The first DSP generation
AMI S2811
AMI announced a "Signal Processing Peripheral" in 1978.[1] The S2811 was designed to operate in conjunction
with a microprocessor such as the 6800 and depended upon it for initialization and configuration.[2] With a
small, nonexpandable program memory of only 256 words, the S2811 was intended to be used to offload some
math-intensive subroutines from the microprocessor. Therefore, as a peripheral, it could not "stand alone" as
could DSPs from Bell Labs, NEC, and other companies. The part was to be implemented in an exotic
technology called "V-groove." First silicon only became available after 1979, and the part was never used in
any volume product.[3]
Intel 2920
Intel announced an "Analog Signal Processor," the 2920, at the 1979 Institute of Electrical and Electronics
Engineers (IEEE) Solid State Circuits Conference.[4] A unique feature of this device was the on-chip
analog/digital and digital/analog converter capability. The drawback was the lack of a multiplier. Multiplication
was performed by a series of instructions involving shifting (scaling) and adding partial products to an
accumulator. Multiplication of two variables was even more involved, requiring conditional instruction
execution.
In addition, the mechanism for addressing memory was limited to direct addressing, and the program could not
perform branching.[5] As such, while it could perform some signal processing calculations a little more
efficiently than a general-purpose microprocessor, it greatly sacrificed flexibility and bears little resemblance to
today's single-chip DSPs. Too slow for any complete application, it was used as a component for part of a
modem.[6]
NEC µPD7720
NEC announced a digital signal processor, the 7720, at the IEEE Solid State Circuits Conference in February
1980 (the same conference at which Bell Labs disclosed its first single-chip DSP). The 7720 has all of the
attributes of a modern single-chip DSP as described above. However, devices and tools were not available in the
U.S. until as late as April 1981.[7]
The Bell Labs DSP1
The genesis of Bell Labs' first single-chip DSP was the recommendation of a study group that began to consider
the possibility of developing a multipurpose, large-scale integration circuit for digital signal processing in
January 1977.[8] Their report, issued in October 1977, outlined the basic elements of a minimal DSP
architecture, which consisted of a multiplier/accumulator, an addressing unit, and control. The plan was for the
I/O, data, and control memories to be external to the 40-pin DIP until large-scale integration technology could
support their integration. The spec was completed in April 1978 and the design a year later. First samples were
tested in May 1979. By October, devices and tools had been distributed to other Bell Labs development groups.
The device became a key component in AT&T's first digital switch, the 5ESS, and many other
telecommunications products. Devices with this architecture are still in manufacture today.
The first Bell Labs DSP was different from what was in the report. The DSP1 contained all of the functional
elements found in today's DSPs, including a multiplier-accumulator (MAC), parallel addressing unit, control,
control memory, data memory, and I/O. It fully meets the above criteria for a single-chip DSP.
The DSP1 was first disclosed outside AT&T at the IEEE Solid State Circuits Conference in February 1980.[9]
A special issue of the Bell System Technical Journal was published in 1981 which described the architecture,
tools, and nine fully developed telecommunications applications for the device.[10]
The following table summarizes the evolution of DSPs:
First generation (1979-1985): Harvard architecture, hardwired multiplier. Example processors: NEC µPD7720,
Intel 2920, Bell Labs DSP1, Texas Instruments TMS320C10.
Second generation (1985-1988): concurrency, multiple busses, on-chip memory. Example processors:
TMS320C25, MC56001, DSP16 (AT&T).
Third generation (1988-1992): on-chip floating-point operations. Example processors: TMS320C30, MC96002,
DSP32C (AT&T).
Fourth generation (1992-1997): multi-processing features (TMS320C40 & C50), image and video processors
(TMS320C80), low-power DSPs (AT&T).
Fifth generation (1997- ): VLIW. Example processors: TMS320C6x, Philips TriMedia, Motorola Starcore.
12.3 Typical DSP applications
Digital signal processing in general, and DSP processors in particular, are used in a wide variety of applications,
from military radar systems to consumer electronics. Naturally, no one processor can meet the needs of all
applications. Criteria such as performance, cost, integration, ease of development, and power consumption are
the points to examine when designing or selecting a particular DSP for a class of applications. The table below
summarizes different processor applications.
Table 1. Common DSP algorithms and typical applications ([11])
Speech coding and decoding: digital cellular phones, personal communications systems, multimedia computers,
secure communication
Speech encryption and decryption: digital cellular phones, personal communications systems, secure
communication
Speech recognition: advanced user interfaces, multimedia workstations, robotics, automotive applications,
digital cellular phones, ...
Speech synthesis: multimedia PCs, advanced user interfaces, robotics
Speaker identification: security, multimedia workstations, advanced user interfaces
Hi-fi audio encoding and decoding: consumer audio & video, digital audio broadcast, professional audio,
multimedia computers
Modem algorithms: digital cellular phones, personal communication systems, digital audio broadcast, digital
signalling on cable TV, multimedia computers, wireless computing, navigation, data/fax modems, secure
communications
Noise cancellation: professional audio, advanced vehicular audio, industrial applications
Audio equalization: consumer audio, professional audio, advanced vehicular audio, music
Ambient acoustics emulation: consumer audio, professional audio, advanced vehicular audio, music
Audio mixing and editing: professional audio, music, multimedia computers
Sound synthesis: professional audio, music, multimedia computers, advanced user interfaces
Vision: security, multimedia computers, advanced user interfaces, instrumentation, robotics, navigation
Image compression and decompression: digital photography, digital video, multimedia computers,
video-over-voice, consumer video
Image composition: multimedia computers, consumer video, advanced user interfaces, navigation
Beamforming: navigation, medical imaging, radar/sonar, signals intelligence
Echo cancellation: speakerphones, modems, telephone switches
Spectral estimation: signals intelligence, radar/sonar, professional audio, music
12.4. The FIR Example
The Finite Impulse Response (FIR) filter is a convenient way to introduce the features needed in typical DSP
systems. An N-tap FIR filter is described by the following equation:
y(n) = h(0)x(n) + h(1)x(n-1) + ... + h(N-1)x(n-N+1), i.e., the sum of h(k)x(n-k) for k = 0, ..., N-1.
The following diagram shows an FIR filter. This illustrates the basic DSP operations:
additions and multiplications
delays
array handling to access coefficients
Each of these operations has its own special set of requirements:
additions and multiplications require:
fetching two operands
performing the addition or multiplication (usually both)
storing the result, or holding it for a repetition
delays require:
holding a value for later use
array handling requires:
fetching values from consecutive memory locations
copying data from memory to memory
To suit these fundamental operations DSP processors often have:
parallel multiply and add (MAC operation)
multiple memory accesses (to fetch two operands and store the result)
lots of registers to hold data temporarily
efficient address generation for array handling
special features such as delays or circular addressing
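As a concrete reference point, a direct C implementation of the FIR equation above is shown below; on a DSP
the inner loop maps onto a single-cycle MAC with parallel operand fetches and post-incremented address
pointers.

/* Direct-form FIR filter: y(n) = sum over k of h[k] * x[n-k].
 * Assumes n >= N-1 so that all needed past samples exist. */
double fir(const double *h, const double *x, int N, int n)
{
    double acc = 0.0;                 /* the accumulator */
    for (int k = 0; k < N; k++)       /* one MAC per tap */
        acc += h[k] * x[n - k];
    return acc;
}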
12.5. General Architectures
The simplest processor memory structure is a single bank of memory, which the processor accesses through a
single set of address and data lines. This structure, which is common among non-DSP processors, is referred to
as the Von Neumann architecture. In this implementation, data and instructions are stored in the same bank, and
one memory access is performed during each instruction cycle. As seen previously, a typical DSP operation is a
MAC executed in one cycle. This operation requires fetching two operands from memory, multiplying them
together, and adding the result to the previous result. With such a Von Neumann model it is not possible to
fetch the instruction and the data in the same cycle. This is one reason why conventional processors do not
perform well on DSP applications in general.
The solution to the memory access problem is known as the Harvard architecture and the modified Harvard
architecture. The following diagram shows the Harvard architecture. An instruction is fetched from the program
memory using the program counter and stored in the instruction register. In parallel, the Address Calculation
Unit fetches one operand from the data memory and feeds the execution unit with it. This architecture allows
one instruction word and one data word to be fetched in a single cycle. This system requires four buses: two
address buses and two data buses.
The next picture represents the modified Harvard architecture. Two data words are now fetched from memory
in a single cycle. Since it is not possible to access the same memory twice in the same cycle, this implementation
requires three memory banks: a program memory bank and two data memory banks, commonly designated X
and Y, each with its own set of address and data buses.
12.6. Data Path
The data path of a DSP processor is where the vital arithmetic manipulations of signals take place. DSP
processor data paths are highly specialized to achieve high performance on the types of computation most
common in DSP applications, such as multiply-accumulate operations. Registers, adders, multipliers,
comparators, logic operators, multiplexers, and buffers represent 95% of a typical DSP data path.
Multiplier
A single-cycle multiplier is the essence of a DSP, since multiplication is an essential operation in all DSP
applications. An important distinction between multipliers in DSPs is the size of the product relative to the
size of the operands. In general, multiplying two n-bit fixed-point numbers requires 2n bits to represent the
correct result. For this reason, DSPs in general have a multiplier whose product is twice the word length of its
input operands (for example, a 24 x 24-bit multiply produces a 48-bit product).
Accumulator Registers
Accumulator registers hold intermediate and final results of multiply-accumulate and other arithmetic
operations. Most DSP processors have two or more accumulators. In general, the size of the accumulator is
larger than the size of the result of a product. These additional bits are called guard bits; they allow values to be
accumulated without the risk of overflow and without rescaling. N additional bits allow up to 2^N
accumulations to be performed without overflow. The guard-bit method is more advantageous than scaling the
multiplier product, since it allows the maximum precision to be retained in intermediate steps of computation.
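A small C sketch makes the point, modeling a 56-bit accumulator (as in the DSP56K family discussed below)
with a 64-bit integer; the function name is ours.

#include <stdint.h>

/* Accumulating 24-bit x 24-bit products (48-bit results) into a 56-bit
 * accumulator: the 8 guard bits absorb up to 2^8 = 256 worst-case
 * accumulations before the sum can overflow, with no prescaling. */
int64_t mac_block(const int32_t *a, const int32_t *b, int n)
{
    int64_t acc = 0;                          /* 56-bit accumulator, modeled in 64 */
    for (int i = 0; i < n; i++)               /* n <= 256 stays overflow-free */
        acc += (int64_t)a[i] * (int64_t)b[i]; /* full-precision 48-bit product */
    return acc;
}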
ALU
Arithmetic logic units implement basic arithmetic and logical operations. Operations such as addition,
subtraction, AND, and OR are performed in the ALU.
Shifter
In fixed-point arithmetic, multiplications and accumulations often induce a growth in the bit width of results.
Scaling is then necessary to pass results from stage to stage, and it is performed through the use of shifters.
The following diagram shows the Motorola 56002 Data Path
Data ALU Input Registers
X1, X0, Y1, and Y0 are four 24-bit, general-purpose data registers. They can be treated as four independent
24-bit registers or as two 48-bit registers called X and Y, formed by concatenating X1:X0 and Y1:Y0,
respectively. X1 is the most significant word in X and Y1 is the most significant word in Y. The registers serve
as input buffer registers between the X Data Bus or Y Data Bus and the MAC unit. They act as Data ALU
source operands and allow new operands to be loaded for the next instruction while the current instruction uses
the register contents. The registers may also be read back out to the appropriate data bus to implement
memory-delay operations and save/restore operations for interrupt service routines.
MAC and Logic Unit
The MAC and logic unit shown in the figure below conducts the main arithmetic processing and performs
calculations on data operands in the DSP.
For arithmetic instructions, the unit accepts up to three input operands and outputs one 56-bit result in the
following form: extension:most significant product:least significant product (EXT:MSP:LSP). The operation of
the MAC unit occurs independently and in parallel with XDB and YDB activity, and its registers facilitate
buffering for Data ALU inputs and outputs. Latches on the MAC unit input permit writing an input register
which is the source for a Data ALU operation in the same instruction. The arithmetic unit contains a multiplier
and two accumulators. The input to the multiplier can only come from the X or Y registers (X1, X0, Y1, Y0).
The multiplier executes 24-bit x 24-bit, parallel, twos-complement fractional multiplies. The 48-bit product is
right justified and added to the 56-bit contents of either the A or B accumulator. The 56-bit sum is stored back
in the same accumulator. An 8-bit adder, which acts as an extension accumulator for the MAC array,
accommodates overflow of up to 256 and allows the two 56-bit accumulators to be added to and subtracted
from each other. The extension adder output is the EXT portion of the MAC unit output. This
multiply/accumulate operation is not pipelined, but is a single-cycle operation. If the instruction specifies a
multiply without accumulation (MPY), the MAC clears the accumulator and then adds the contents to the
product.
In summary, the results of all arithmetic instructions are valid (sign-extended and zero-filled) 56-bit operands in
the form of EXT:MSP:LSP (A2:A1:A0 or B2:B1:B0). When a 56-bit result is to be stored as a 24-bit operand,
the LSP can be simply truncated, or it can be rounded (using convergent rounding) into the MSP. Convergent
rounding (round-to-nearest) is performed when the instruction (for example, the signed multiply-accumulate
and round (MACR) instruction) specifies adding the multiplier’s product to the contents of the accumulator.
The scaling mode bits in the status register specify which bit in the accumulator shall be rounded. The logic unit
performs the logical operations AND, OR, EOR, and NOT on Data ALU registers. It is 24 bits wide and
operates on data in the MSP portion of the accumulator; the LSP and EXT portions of the accumulator are not
affected. The Data ALU features two general-purpose, 56-bit accumulators, A and B. Each consists of three
concatenated registers (A2:A1:A0 and B2:B1:B0, respectively). The 8-bit sign extension (EXT) is stored in A2
or B2 and is used when more than 48-bit accuracy is needed; the 24-bit most significant product (MSP) is stored
in A1 or B1; the 24-bit least significant product (LSP) is stored in A0 or B0. Overflow occurs when a source
operand requires more bits for accurate representation than are available in the destination. The 8-bit extension
registers offer protection against overflow. In the DSP56K chip family, the extreme values that a word operand
can assume are -1 and +0.9999998. If the sum of two numbers is less than -1 or greater than +0.9999998, the
result (which cannot be represented in a 24-bit word operand) has underflowed or overflowed. The 8-bit
extension registers can accurately represent the result of 255 overflows or 255 underflows. Whenever the
accumulator extension registers are in use, the V bit in the status register is set.
Automatic sign extension occurs when the 56-bit accumulator is written with a smaller operand of 48 or 24 bits.
A 24-bit operand is written to the MSP (A1 or B1) portion of the accumulator, the LSP (A0 or B0) portion is
zero filled, and the EXT (A2 or B2) portion is sign extended from the MSP. A 48-bit operand is written into the
MSP:LSP portion (A1:A0 or B1:B0) of the accumulator, and the EXT portion is sign extended from the MSP.
No sign extension occurs if an individual 24-bit register is written (A1, A0, B1, or B0). When either A or B is
read, it may be optionally scaled one bit left or one bit right for block floating-point arithmetic. Sign extension
can also occur when writing A or B from the XDB and/or YDB or with the results of certain Data ALU
operations (such as the transfer conditionally (Tcc) or transfer Data ALU register (TFR) instructions).
Overflow protection occurs when the contents of A or B are transferred over the XDB and YDB by substituting
a limiting constant for the data. Limiting does not affect the content of A or B; only the value transferred over
the XDB or YDB is limited. This overflow protection occurs after the content of the accumulator has been
shifted according to the scaling mode. Shifting and limiting occur only when the entire 56-bit A or B
accumulator is specified as the source for a parallel data move over the XDB or YDB. When the individual
registers A0, A1, A2, B0, B1, or B2 are specified as the source for a parallel data move, shifting and limiting
are not performed.
The accumulator shifter is an asynchronous parallel shifter with a 56-bit input and a 56-bit output that is
implemented immediately before the MAC accumulator input. The source accumulator shifting operations are
as follows:
No Shift (Unmodified)
1-Bit Left Shift (Arithmetic or Logical) ASL, LSL, ROL
1-Bit Right Shift (Arithmetic or Logical) ASR, LSR, ROR
Force to zero
12.7. Addressing
The ability to generate new addresses efficiently is a characteristic feature of DSP processors. Most DSP
processors include one or more special address generation units (AGUs) that are dedicated to calculating
addresses. An AGU can perform one or more address calculations per instruction cycle without using the
processor's main data path. The calculation of addresses takes place in parallel with arithmetic operations on data,
improving processor performance.
One of the main addressing modes is register-indirect addressing. The data addressed is in memory, and the
address of the memory location containing the data is held in a register. This gives a natural way to work with
arrays of data. Another advantage is efficiency from an instruction-set point of view, since it allows powerful
and flexible addressing with relatively few bits in the instruction word.
Whenever an operand is fetched from memory using register-indirect addressing, the address register can be
incremented to point to the next needed value in the array. The following table summarizes the most common
increment methods in DSPs:
*rP (register indirect): read the data pointed to by the address in register rP
*rP++ (postincrement): having read the data, postincrement the address pointer to point to the next value in the
array
*rP-- (postdecrement): having read the data, postdecrement the address pointer to point to the previous value in
the array
*rP++rI (register postincrement): having read the data, postincrement the address pointer by the amount held in
register rI, to point to rI values further down the array
*rP++rIr (bit-reversed, for FFT): having read the data, postincrement the address pointer to point to the next
value in the array, as if the address bits were in bit-reversed order
An additional convenient feature of the AGU is the presence of modulo addressing modes, which are extensively
used for circular addressing. Instead of comparing the address to a calculated value to see whether or not the end
of the buffer has been reached, dedicated registers are used to perform this check automatically and take the
necessary action (i.e., reset the register to the start address of the buffer).
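In C terms, the modulo mode implicitly performs the wrap-around that the following illustrative sketch spells
out explicitly (the buffer length is an arbitrary example):

#define BUF_LEN 64

typedef struct {
    float    buf[BUF_LEN];
    unsigned idx;            /* current position, always < BUF_LEN */
} circ_buf;

/* Read the current sample, then postincrement modulo BUF_LEN:
 * the software equivalent of *rP++ with a modulo modifier register. */
float circ_read_postinc(circ_buf *c)
{
    float v = c->buf[c->idx];
    c->idx = (c->idx + 1) % BUF_LEN;  /* wrap instead of running off the end */
    return v;
}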
The following picture represents the address generation unit of the Motorola 56002
This AGU uses integer arithmetic to perform the effective address calculations necessary to address data
operands in memory, and contains the registers used to generate the addresses. It implements linear, modulo,
and reverse-carry arithmetic, and operates in parallel with other chip resources to minimize address-generation
overhead. The AGU is divided into two identical halves, each of which has an address arithmetic logic unit
(ALU) and four sets of three registers. They are the address registers (R0 - R3 and R4 - R7), the offset registers
(N0 - N3 and N4 - N7), and the modifier registers (M0 - M3 and M4 - M7). The eight Rn, Nn, and Mn registers
are treated as register triplets; e.g., only N2 and M2 can be used to update R2. The eight triplets are R0:N0:M0,
R1:N1:M1, R2:N2:M2, R3:N3:M3, R4:N4:M4, R5:N5:M5, R6:N6:M6, and R7:N7:M7.
The two arithmetic units can generate two 16-bit addresses every instruction cycle, one for any two of the
XAB, YAB, or PAB. The AGU can directly address 65,536 locations on the XAB, 65,536 locations on the
YAB, and 65,536 locations on the PAB. The two independent address ALUs work with the two data
memories to feed the data ALU two operands in a single cycle. Each operand may be addressed by an Rn, Nn,
Mn triplet.
12.8. Peripherals
Most DSP processors provide on-chip peripherals and interfaces that allow the DSP to be used in an embedded
system with a minimum amount of external hardware to support its operation and interfacing.
Serial port
A serial interface transmits and receives data one bit at a time. These ports have a variety of applications, like
sending and receiving data samples to and from A/D and D/A converters and codecs, sending and receiving
data to and from other microprocessors or DSPs, and communicating with other hardware. The two main
categories are synchronous and asynchronous interfaces. Synchronous serial ports transmit a bit clock signal in
addition to the serial bits; the receiver uses this clock to decide when to sample received data. In contrast,
asynchronous serial interfaces do not transmit a separate clock signal; they rely on the receiver deducing a clock
signal from the data itself.
A direct extension of serial interfaces leads to parallel ports, where data are transmitted in parallel instead of
sequentially. Faster communication is obtained at the cost of additional pins.
Host Port
Some DSPs provide a host port for connection to a general-purpose processor or another DSP. Host ports are
usually specialized 8- or 16-bit bi-directional parallel ports that can be used to transfer data between the DSP and
the host processor.
Link ports or communication ports
This kind of port is dedicated to multiprocessor operation. It is in general a parallel port intended for
communication between DSPs of the same type.
Interrupt controller
An interrupt is an external event that causes the processor to stop executing its current program and branch to a
special block of code called an interrupt service routine. Typically this code deals with the origin of the
interrupt and then returns from the interrupt. There are different interrupt sources:
On-chip peripherals: serial ports, timers, DMA,…
External interrupt lines: dedicated pins on the chip to be asserted by external circuitry
Software interrupts: also called exceptions or traps, these interrupts are generated under software control or
occur, for example, on floating-point exceptions (division by zero, overflow, and so on).
DSPs associate interrupts with different memory locations, called interrupt vectors. These vectors contain the
addresses of the interrupt routines. When an interrupt occurs, the following sequence is encountered:
Save the program counter on a stack
Branch to the relevant address given by the interrupt vector table
Save all registers used in the interrupt routine
Perform the dedicated operations
Restore all registers
Restore the program counter
Priority levels can be assigned to the different interrupts through the use of dedicated registers. An interrupt is
acknowledged when its priority level is strictly higher than the current priority level.
Timers
Programmable timers are often used as a source of periodic interrupts. They are completely software-controlled
and can activate specific tasks at chosen times. A timer is generally a counter that is preloaded with a desired
value and decremented on clock cycles. When zero is reached, an interrupt is issued.
DMA
Direct Memory Access is a technique whereby data can be transferred to or from the processor's memory
without the involvement of the processor itself. DMA is commonly used to provide improved performance with
input/output devices. Rather than have the processor read data from an I/O device and copy the data into
memory, or vice versa, a separate DMA controller can handle such transfers in parallel. Typically, the processor
loads the DMA controller with control information, including the starting address for the transfer, the number
of words to be transferred, and the source and the destination. The DMA controller uses the bus request pin to
notify the DSP core that it is ready to make a transfer to or from external memory. The DSP core completes its
current instruction, releases control of external memory, and signals the DMA controller via the bus grant pin
that the DMA transfer can proceed. The DMA controller then transfers the specified number of data words and
optionally signals completion through an interrupt. Some processors also have multiple DMA channels,
managing several DMA transfers in parallel.
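A hedged sketch of this programming sequence in C follows; the descriptor layout and the DMA_BASE register
address are hypothetical, standing in for whatever a concrete controller defines.

#include <stdint.h>

/* Control information typically loaded into a DMA controller
 * (illustrative layout, not a specific chip's register map). */
typedef struct {
    uint32_t src_addr;     /* source address                     */
    uint32_t dst_addr;     /* destination address                */
    uint32_t word_count;   /* number of words to transfer        */
    uint32_t irq_on_done;  /* optionally interrupt on completion */
} dma_descriptor;

/* Hypothetical memory-mapped register block of the controller. */
#define DMA_BASE ((volatile dma_descriptor *)0x40000000u)

void dma_start(uint32_t src, uint32_t dst, uint32_t nwords)
{
    DMA_BASE->src_addr    = src;
    DMA_BASE->dst_addr    = dst;
    DMA_BASE->irq_on_done = 1;
    DMA_BASE->word_count  = nwords;  /* writing the count kicks off the transfer */
    /* The controller now raises bus request; the DSP core grants the bus,
     * the words move without core involvement, and an interrupt signals
     * completion. */
}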
12.9. How is a DSP different from a general-purpose processor
DSPs intended for real-time embedded control/signal processing applications, not general-purpose
computing
DSPs strictly non-user-programmable (typically no memory management, no operating system, no cache,
no shared variables, single-process oriented)
DSPs usually employ some form of "Harvard Architecture" to allow simultaneous code and data fetches
Salient characteristic of all DSPs is devoting significant chip real estate to the
"multiply-accumulate" (MAC) operation – most DSPs perform a MAC operation in a single clock
DSP programs often resident in fast on-chip ROM and/or RAM (although off-chip bus expansion is
usually possible)
most DSPs have at least two multi-ported on-chip RAMs for storing operand data
DSP interrupt handling is simple, fast, and efficient (minimum context switch overhead)
many DSP applications assembly-coded, due to real-time processing constraints (although C compilers
exist for most DSPs)
DSP I/O provisions usually fairly simple
DSP address bus widths typically smaller than those of general-purpose processors (code size tends to be
small, "tight-loop" oriented)
fixed-point DSPs utilize saturation arithmetic, rather than allowing 2's complement overflow to occur (see the
sketch after this list)
DSP addressing modes geared toward signal processing applications (direct support for circular buffers,
"butterfly" access patterns)
DSPs often provide direct hardware support for implementation of "do" loops
many DSPs employ an on-chip hardware stack to facilitate subroutine linkage
most "lower end" DSPs have integrated program ROM and scratchpad RAM to facilitate single-chip
solutions
most DSPs do not have integrated ADCs and DACs – these interfaces (if desired) are usually
implemented externally
benchmark suites used to compare DSPs are totally different from those used to compare general-purpose
(RISC/CISC) processors:
• FIR/IIR filters
• FFTs
• convolution
• dot product
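As promised above, here is what saturation arithmetic looks like when spelled out in C; hardware performs the
same clipping in the MAC path at no extra cost.

#include <stdint.h>

/* Saturating 32-bit addition: instead of wrapping on 2's-complement
 * overflow, the result clips to the most positive or negative value. */
int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + (int64_t)b;   /* widen to detect overflow */
    if (sum > INT32_MAX) return INT32_MAX;   /* clip positive overflow   */
    if (sum < INT32_MIN) return INT32_MIN;   /* clip negative overflow   */
    return (int32_t)sum;
}

For signal samples, clipping to full scale is far less damaging than the sign flip that a wrapped 2's-complement
overflow would produce.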
12.10 Superscalar Architectures
The term "superscalar" is commonly used to designate architectures that enable more than one instruction
executed per clock cycle
Nowadays multimedia architectures, supported by the continuous improvement in technologies, are rapidl
moving towards highly parallel solutions like SIMD and VLIW machines. What do these acronyms mean ?
SIMD
stands for Single Instruction on Multiple Data. In simple words, the architecture has a single program control
unit that fetches and decodes the program instructions for multiple execution units, with multiple sets of
datapaths, registers, and data memories. Of course, a SIMD architecture can be realized as a multiprocessor
configuration, but the exploitation of deep submicron technologies has made it possible to integrate such
architectures in a single chip. It is easy at this point to imagine each execution unit being driven by a different
program control unit, permitting different instructions of the same program, or different programs, to execute in
parallel at the same time; in this case the resulting architecture is called Multiple Instructions on Multiple Data
(MIMD). Again, a MIMD machine can be implemented by a multiprocessor structure or integrated in a single
chip.
Historically, the first examples of the so-called multiple-issue machines appeared in the early '80s, and they
were called VLIW machines (for Very Long Instruction Word). These machines exploit an instruction word
consisting of several (up to 8) instruction fragments. Each fragment controls a precise execution unit; the
register set must therefore be multiported to support simultaneous access, because the multiple instructions may
need to share variables. In order to accommodate the multiple instruction fragments, the instruction word is
often over 100 bits long. [12]
The reasons that push towards these parallel approaches are essentially two. First of all, many scientific
processing algorithms, whether for calculus or, more recently, for communication and multimedia applications,
contain a high degree of parallelism. Secondly, a parallel architecture is a cost-effective way to compute (when
the program is parallelizable), since internal and on-chip communications are much faster and much more
efficient than external communication channels.
On the other hand, parallel architectures bring with them a number of problems and new challenges that are not
present in simple processors. First of all, even if it is true that many programs are parallelizable, extensive
research has shown that often the level of parallelism that can be achieved is theoretically not greater than 3;
this means that on actual architectures the speedup factor is not greater than 2. Based upon this, it would seem
that in the absence of significant compiler breakthroughs the available speedup is limited. A second problem
concerns memories and registers: highly parallel routines require a high memory access rate, and thus a very
deep optimization of the register set, cache memory, and data buses in order to feed the necessary amount of
data into the execution units. Finally, such complex architectures, with highly optimized datapaths and data
transfers, are very difficult to program. Traditionally, DSP programmers developed applications directly in
assembly language, which is in some respects close to the natural sequential way of thinking of human beings
and well suited to smart optimizations. Machines like the MIMD and VLIW ones are not programmable in
assembly anymore, and thus processor designers have to spend a great amount of resources (often more than
the time to develop the chip itself) in order to provide Software Development Kits able to exploit the potential
of the processor, taking care of every aspect from powerful optimization techniques up to understandable user
interfaces.
More recent attempts at multiple-issue processors have been directed at rather lower amounts of concurrency
than in the first VLIW architectures (4-5 parallel execution blocks). Three examples of this new generation of
superscalar machines will be briefly discussed in the next subsections, underlining architectural aspects and
specific solutions to deal with the problems of parallelization.
The Pentium processor with multimedia extensions
The Pentium processor explicitly supports multimedia since the introduction of the so-called MMX
(MultiMedia eXtension) family. The well-known key enhancement of this technology consists of exploiting the
64-bit floating-point registers of the processor to "pack" 8-, 16-, or 32-bit data that can be processed in parallel
by a SIMD operating unit. Fifty-seven new instructions are implemented in the processor in order to exploit
these new functionalities, among them "multiply and add", the basic operation in the case of digital
convolutions (FIR filters) or FFT algorithms. [13] Two considerations can be made about this
processor. First of all, the packed data are fixed point, so the use of these extensions for a DSP-oriented task
limits the use of floating-point arithmetic; conversely, full use of floating-point operations does not allow any
boost in performance in comparison with the common Pentium family.
Moreover, MMX technology was conceived to specifically support multimedia algorithms, but at the same time
to completely preserve code compatibility with previous processors; as a result, the increased potential
fixed-point processing power is not supported by the necessary memory and bus redesign, and it is often not
possible to "feed" the registers with the correct data. Extensive tests conducted after the disclosure of the MMX
technology have shown that for typical video applications it is often hard to achieve even a 50% speedup.
Figure 1. How the Pentium MMX exploits the 64-bit floating-point registers to "pack" data in parallel and send
them to a SIMD execution unit.
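The packed "multiply and add" operation can be emulated in plain C to show what one such instruction
computes; this scalar version is for illustration only and is of course far slower than the real SIMD unit.

#include <stdint.h>

/* Emulation of a packed 16-bit multiply-and-add: four 16-bit values in
 * one 64-bit register are multiplied pairwise, and adjacent products
 * are summed into two 32-bit halves. */
void packed_madd(const int16_t a[4], const int16_t b[4], int32_t out[2])
{
    out[0] = (int32_t)a[0] * b[0] + (int32_t)a[1] * b[1];
    out[1] = (int32_t)a[2] * b[2] + (int32_t)a[3] * b[3];
    /* On the real hardware both sums come from one instruction,
     * which is why FIR and FFT inner loops benefit so much. */
}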
The TriMedia processor
Another multimedia processor, besides the Intel MMX, that is growing in interest is the TriMedia by Philips
Electronics. This chip is not designed as a completely general-purpose CPU, but with the double functionality
of CPU and DSP in the same chip, and its core processing master unit presents a VLIW architecture.
The key features of TriMedia are:
A very powerful, general-purpose VLIW processor core (the DSPCPU) that coordinates all on-chip
activities. In addition to implementing the non-trivial parts of multimedia algorithms, this processor runs a
small real-time operating system that is driven by interrupts from the other units.
DMA-driven multimedia input/output units that operate independently and that properly format data to
make software media processing efficient.
DMA-driven multimedia coprocessors that operate independently and in parallel with the DSPCPU to
perform operations specific to important multimedia algorithms.
A high-performance bus and memory system that provides communication between TM1000’s
processing units.
The architecture of the TriMedia is shown in figure 2.
The real DSP processing must be implemented in the master CPU/DSP, which is also responsible for directing
the algorithm. This unit is a 32-bit floating-point, 133 MHz general-purpose unit whose VLIW instructions can
address up to five instructions out of the 27 functional operations (integer and floating-point
multipliers and 5 ALUs).
The DSPCPU is provided with a 32-Kbyte instruction cache memory and a dual-port 16-Kbyte data cache
memory. [14]
TriMedia also provides a set of multimedia instructions, mostly targeted at MPEG-2 video decoding.
Figure 2. The TriMedia processor architecture.
Some of the programming challenges for parallel architectures are solved in the DSPCPU through the concept
of guarded conditional operations. An instruction takes the following form:
Rg: Rdest = imul Rsrc1, Rsrc2
In this instruction, the integer multiplication of the two source registers is written into the destination register
under the condition contained in the "guard" register Rg. This allows better control of the optimization
strategies of the parallel compiler, since, for instance, the problem of branches is relaxed and the result is
accepted or not only at the last execution stage of the pipeline.
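In C terms, a guarded operation behaves like the following branch-free helper (an illustration of the semantics,
not TriMedia code):

/* The multiply always executes, but the destination is updated only
 * if the guard is true; no branch disturbs the VLIW schedule. */
static inline int guarded_imul(int g, int dest, int src1, int src2)
{
    return g ? src1 * src2 : dest;   /* Rg ? Rsrc1*Rsrc2 : old Rdest */
}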
As mentioned above, complex processors/DSPs like TriMedia need a large amount of development tools and
software support. For this reason, the TriMedia comes with an extensive set of tools to deal with the real-time
kernel, DSPCPU programming, and complete system exploitation.
The TriMedia Software Development Environment provides a comprehensive suite of system software tools to
compile and debug multimedia applications, analyse and optimize performance, and simulate execution on the
TriMedia processor. The main features are:
VLIW ANSI C/C++ compilation system
Source- and machine-level debugging
Performance analysis and enhancement
Cycle-accurate machine-level simulator
TriMedia applications library, including:
MPEG-1 decoder
MPEG-2 program stream decoder
3-D graphics pipeline
PC audio synthesis (FM, wavetable)
V.34 modem
H.324 (PSTN) / H.320 (ISDN) PC video conferencing
PUMA - Parallel Universal Music Architecture
A very interesting solution recently developed for advanced audio applications is the PUMA (Parallel Universal
Music Architecture) DSP by Studer Professional Audio. This chip was conceived and realised in collaboration
with the Integrated Systems Center (C3I) of the EPFL.
This integrated circuit is designed and optimized for digital mixing consoles. It is provided with 4 independent
channel processors, and thus with four 33-MHz, 24-bit fixed-point multipliers and adders fully dedicated to data
processing (another multiplier is provided in the master DSP, which is in charge of the final data processing and
directs the whole chip's functionality and I/O units). The important feature of this chip lies in the multiple
processing units that can work in parallel on similar blocks of data: each channel processor has its own internal
data memory (256 24-bit words per processor), and the Master DSP and the Array DSP have independent
program memories and program control units. The design of the I/O units deserved great care: digital audio
input and output are supported by 20 serial lines each; interprocessor communication is supported through
independent units (the Cascade Data Input and Cascade Data Output) providing 64 channels on 8 lines at
processor speed. A general-purpose DRAM/SRAM External Memory Interface and the External Host Interface
permit memory extension and flexible programmability via an external host processor. The following figure
shows the top-level architecture of the PUMA DSP.
The following figure shows the internal datapath of each channel processor; three units can work in parallel in a
single clock cycle: a 24x24-bit multiplier, a comparator, and the general-purpose ALU (adder, shifter, logical
operations).
Puma design flow
To conclude, it is interesting to spend a few words on the PUMA design flow, to understand how a modern and
complex architecture, characterised by several million transistors, can be practically realised.
First of all, the functional specification of the processor is developed, defining the functionalities, basic blocks,
and instruction set; at the same time, a C model of the architecture is implemented in order to test the
algorithms and the architecture with this methodology.
The second step is the VHDL description and simulation of the C model at the RTL level, followed by the
synthesis to the gate level. All of this was accomplished exploiting the Synopsys Design Compiler and Design
Analyzer.
After that, an optimization technique called hierarchical compiling is used: after setting the boundary constraints
for the main blocks, the constraints for the inner blocks are derived hierarchically by the compiler, and this
permits relaxing the timing paths wherever tight constraints are not strictly necessary.
The preliminary place & route follows; then the parasitic parameters (R and C) for each wire are extracted, and
the so-called back-annotation, or in-place compilation, is performed in order to better adapt each load to the real
netlist placement. The place & route was done with the Compass tool, the back-annotation again with the
Synopsys Design Compiler.
Finally, the last place & route is performed, and extensive simulations are run for every part of the chip in
order to verify the timing of every specific operation. Then the design is ready for the foundry.
References
1. Nicholson, Blasco and Reddy, "The S2811 Signal Processing Peripheral," WESCON Tech Papers, Vol. 2,
1978, pp. 1-12.
2. S2811 Signal Processing Peripheral, Advanced Product Description, AMI, May 1979.
3. Strauss, DSP Strategies 2000, Forward Concepts, Tempe, AZ, November 1996, p. 24.
4. Hoff and Townsend, "An analog input/output microprocessor for signal processing," ISSCC Digest of
Technical Papers, February 1979, p. 220.
5. 2920 Analog Signal Processor Design Handbook, Intel, Santa Clara, CA, August 1980.
6. Strauss, DSP Strategies 2000, Forward Concepts, Tempe, AZ, November 1996, p. 24.
7. Brodersen, "VLSI for Signal Processing," Trends and Perspectives in Signal Processing, Vol. 1, No. 1,
January 1981, p. 7.
8. Stanzione et al., "Final Report: Study Group on Digital Integrated Signal Processors," Bell Labs Internal
Memorandum, October 1977.
9. Boddie, Daryanani, Eldumtan, Gadenz, Thompson, Walters, Pedersen, "A Digital Signal Processor for
Telecommunications Applications," ISSCC Digest of Technical Papers, February 1980, p. 44.
10. Bell System Technical Journal, Vol. 60, No. 7, September 1981.
11. Phil Lapsley, Jeff Bier, Amit Shoham, "DSP Processor Fundamentals: Architectures and Features," IEEE
Press.
12. Michael J. Flynn, "Computer Architecture: Pipelined and Parallel Processor Design," Jones and Bartlett,
1995.
13. Peleg A., Wilkie S., Weiser U., "Intel MMX for Multimedia PCs," Communications of the ACM, Vol. 40,
No. 1, January 1997.
14. TM1000 Preliminary Data Book, Philips Electronics NAC, 1997.
ARCHITECTURES FOR VIDEO
PROCESSING
Integrated System Laboratory C3I
Swiss Federal Institute of Technology, EPFL
The first question we would like to answer is: what do we mean nowadays by video processing? In the past, more
or less until the end of the 80's, there were two distinct worlds: an analog TV world and a digital computer world.
All TV processing, from the camera to the receiver, was based on analog processing, analog modulation and
analog recording. With the progress of digital technology, a part of the analog processing could be implemented
by digital circuits, with consistent advantages in terms of reproducibility of the circuits, leading to cost and
stability advantages, and noise sensitivity, leading to quality advantages. At the end of the 80's, completely new
video processing possibilities became feasible with digital circuits. Today, image compression and decompression
is the dominant digital video processing in terms of importance and complexity in the whole TV chain.
Figure 1. Schematic representation of a TV chain.
In the near future, digital processing will be used to pass from standard-resolution TV to HDTV, for which
compression and decompression is a must, considering the bandwidth that would otherwise be required for
transmission. Other applications will be found at the level of the camera, to increase the image quality by
increasing the number of bits per pixel from 8 to 10 or 12, or by using appropriate processing aimed at
compensating the sensors' limitations (image enhancement by non-linear filtering and processing). Digital
processing will also enter the studio for digital recording, editing and 50/60 Hz standard conversions. Today, the
high communication bandwidth required between studio devices by the uncompressed digital video necessary for
editing and recording operations limits the use of fully digital video and digital video processing at the studio
level.
Video Compression
Why has video compression become the dominant video processing application of TV? An analog TV signal needs only a 5 MHz analog channel for transmission; conversely, digital video with an 8-bit A/D and 720 pixels on 576 lines (54 MHz sampling rate) needs a transmission channel with a capacity of 168.8 Mbit/s. For digital HDTV, with a 10-bit A/D and 1920 pixels on 1152 lines, the required capacity rises to about 1.1 Gbit/s. No cost-affordable applications are thus possible without video compression.
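As a sanity check on these figures, the raw bit-rate of uncompressed digital video follows from a few multiplications. The following small C program is a sketch under assumptions of ours (25 frames per second and 4:2:2 chroma sampling, so the two chrominance components together contribute as many samples again as the luminance); it reproduces the order of magnitude quoted above.

#include <stdio.h>

/* Raw bit-rate of uncompressed digital video.
   Assumption: 25 frames/s and 4:2:2 chroma sampling, i.e. the two
   chrominance components add as many samples as the luminance. */
static double raw_bitrate(int width, int height, int bits, double fps)
{
    double luma   = (double)width * height * fps;  /* luminance samples/s */
    double chroma = luma;                          /* 4:2:2: Cb + Cr      */
    return (luma + chroma) * bits;                 /* bits per second     */
}

int main(void)
{
    printf("TV   (720x576, 8 bit):    %.1f Mbit/s\n",
           raw_bitrate(720, 576, 8, 25.0) / 1e6);    /* about 166 Mbit/s */
    printf("HDTV (1920x1152, 10 bit): %.2f Gbit/s\n",
           raw_bitrate(1920, 1152, 10, 25.0) / 1e9); /* about 1.1 Gbit/s */
    return 0;
}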
These constraints have also raised the need for worldwide video compression standards, so as to achieve interoperability and compatibility among devices and operators. H.261 is the name given to the first digital video compression standard, specifically designed for videoconference applications; MPEG-1 is the name of the one designed for CD storage applications (up to 1.5 Mbit/s); MPEG-2 addresses digital TV and HDTV, at roughly 4 to 9 Mbit/s for TV and up to 20 Mbit/s for HDTV; H.263 targets videoconferencing at very low bit rates (16 - 128 kbit/s). All these standards are best considered as a family sharing quite similar processing algorithms and features.
All of them are based on the same basic philosophy:
The decoder must be simple.
For TV and HDTV there are very few encoders, used by broadcasting companies (in the limit, just one per channel), but there must be a decoder in every TV set.
The decoding syntax is completely specified.
This means that any compressed video bit-stream can be decoded without any ambiguity, yielding the same video result.
A decoder must be conformant.
This means that a decoder must be able to decode any video bit-stream that respects the decoding syntax.
The encoding syntax is specified.
This means that an encoder must encode the video content in a conformant syntax.
The encoder (i.e. the encoding algorithm) is not specified.
This means that encoding algorithms are a competitive issue: encoders can be optimized to achieve higher quality of the compressed video, or to simplify the encoding algorithm so as to obtain a cheap encoder. It also means that in the future, as more processing power becomes available, ever more sophisticated and computationally demanding encoding algorithms can be used to find the best choices within the available encoding syntax.
These basic principles of the video compression standards clearly have strong consequences on the architectures implementing video compression. To understand the main processing and architectural issues in video compression, we briefly analyze in more detail the basic processing of the MPEG-2 standard.
MPEG-2 Video Compression
MPEG-2 is a complete standard that specifies all stages from video acquisition up to the interface with the communication protocols. Figure 2 reports a schematic diagram of how MPEG-2 provides a transport layer after compression. Compressed audio and video bit-streams are multiplexed and packetized in a suitable transport format. This part of the processing cannot be classified as video processing and is not considered here in detail.
Figure 2. MPEG-2 transport stream diagram.
Figure 3. Basic processing for MPEG-2 compression.
Figure 4. MPEG-2 pre-filtering and spatial redundancy reduction by DCT.
Figure 5. MPEG-2 spatial redundancy reduction by quantization and entropy coding.
The basic video processing algorithms of MPEG-2 are reported in Figure 3. These algorithms are also found, with some variants, in all the other compression standards mentioned before. The first stage is the conversion of the image from RGB format to YUV format, with subsequent filtering and sub-sampling of the chrominance components to yield smaller color images. Images are then partitioned into 8x8 pixel blocks, and blocks are grouped into macro-blocks of 16x16 pixels. Two main processes are then applied: one is the reduction of spatial redundancy, the other the reduction of temporal redundancy.
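As an illustration of this first stage, the following C fragment sketches the color conversion for a pair of neighbouring pixels, using the common ITU-R BT.601 weights and a naive 2:1 horizontal averaging of the chrominance; the actual filters and sub-sampling pattern (4:2:0 or 4:2:2) of a real encoder are more elaborate.

/* Sketch: RGB -> YCbCr (BT.601 weights, full-range approximation)
   followed by naive 2:1 horizontal chroma sub-sampling. */
typedef struct { unsigned char r, g, b; } Rgb;

static double luma(Rgb p) { return  0.299*p.r + 0.587*p.g + 0.114*p.b; }
static double cb(Rgb p)   { return -0.169*p.r - 0.331*p.g + 0.500*p.b + 128.0; }
static double cr(Rgb p)   { return  0.500*p.r - 0.419*p.g - 0.081*p.b + 128.0; }

/* Two neighbouring pixels keep their two Y samples, while Cb and Cr
   are averaged over the pair (one chroma sample per two pixels). */
static void convert_pair(Rgb p0, Rgb p1,
                         double y_out[2], double *cb_out, double *cr_out)
{
    y_out[0] = luma(p0);
    y_out[1] = luma(p1);
    *cb_out  = 0.5 * (cb(p0) + cb(p1));
    *cr_out  = 0.5 * (cr(p0) + cr(p1));
}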
Figure 6. MPEG-2 temporal redundancy reduction by motion compensated prediction.
Spatial redundancy is reduced by applying the DCT transform to blocks and then entropy coding the quantized transform coefficients with Huffman tables. Temporal redundancy is reduced by motion compensation applied to macro-blocks according to the IBBP group-of-pictures structure.
In more detail (see Figures 4 and 5), spatial redundancy is reduced by applying an 8x1 DCT transform 8 times horizontally and 8 times vertically. The transform coefficients are then quantized, which reduces the small high-frequency coefficients to zero, scanned in zig-zag order starting from the DC coefficient at the upper left corner of the block, and coded using Huffman tables, a step also referred to as Variable Length Coding (VLC).
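The separable structure just described (eight 8x1 transforms along the rows, then eight along the columns) can be written down directly. The C sketch below uses the textbook O(N^2) form of the 1-D DCT-II with orthonormal scaling, rather than one of the fast reduced-multiplication factorizations used in real implementations; quantization then simply divides each coefficient by a step size and rounds, which is what forces small high-frequency coefficients to zero before the zig-zag scan.

#include <math.h>

#define N  8
#define PI 3.14159265358979323846

/* 1-D DCT-II of 8 samples (orthonormal scaling). */
static void dct_1d(const double in[N], double out[N])
{
    for (int k = 0; k < N; k++) {
        double s = 0.0;
        for (int n = 0; n < N; n++)
            s += in[n] * cos(PI * (2*n + 1) * k / (2.0 * N));
        out[k] = s * ((k == 0) ? sqrt(1.0/N) : sqrt(2.0/N));
    }
}

/* 2-D 8x8 DCT: 8 row transforms followed by 8 column transforms. */
static void dct_8x8(double block[N][N])
{
    double tmp[N], res[N];
    for (int r = 0; r < N; r++) {                /* rows */
        dct_1d(block[r], res);
        for (int c = 0; c < N; c++) block[r][c] = res[c];
    }
    for (int c = 0; c < N; c++) {                /* columns */
        for (int r = 0; r < N; r++) tmp[r] = block[r][c];
        dct_1d(tmp, res);
        for (int r = 0; r < N; r++) block[r][c] = res[r];
    }
}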
The reduction of temporal redundancy is the process that drastically reduces the bit rate and makes high compression rates achievable. It is based on the principle of finding the current macro-block in already transmitted pictures, either at the same position in the image or displaced by a so-called "motion vector" (see Figure 6). Since an exact copy of the macro-block is not guaranteed to be found, the macro-block with the lowest average error is chosen as the reference macro-block. The "error macro-block" is then processed to reduce its spatial redundancy, if any, by the procedure described above, and transmitted together with the "motion vector" indicating the reference, so that the decoder can reconstruct the desired macro-block from the reference and the residual error.
Figure 7 reports the so-called MPEG-2 Group of Pictures structure, which shows that images are classified as I (Intra), P (Predicted) and B (Bi-directionally interpolated). The standard specifies that Intra macro-blocks can only be processed to reduce spatial redundancy; P macro-blocks can also be processed to reduce temporal redundancy, referring only to past I or P frames; and B macro-blocks can additionally be processed using an interpolation of past and future reference macro-blocks. Of course, a B macro-block can also be coded as Intra or Predicted if this is found to be convenient for compression. Note that since B pictures can use both past and future I or P frames as references, the MPEG-2 image transmission order differs from the display order: B pictures are transmitted in the compressed bit-stream after their reference I and P pictures.
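A small hypothetical routine makes this reordering concrete: B pictures are buffered until the I or P picture they refer forward to has been emitted, so a display-order GOP such as I B B P B B P leaves the encoder as I P B B P B B.

#include <stdio.h>

/* Sketch: display order -> MPEG-2 transmission order.
   B pictures are held back until their future reference (I or P)
   has been emitted. Assumes a well-formed sequence with at most
   16 consecutive B pictures. */
static void transmission_order(const char *display, int n)
{
    char pending[16];
    int npend = 0;
    for (int i = 0; i < n; i++) {
        if (display[i] == 'B') {
            pending[npend++] = 'B';          /* wait for next reference */
        } else {
            putchar(display[i]);             /* emit the I or P first,  */
            for (int j = 0; j < npend; j++)  /* then the buffered Bs    */
                putchar(pending[j]);
            npend = 0;
        }
    }
}

int main(void)
{
    transmission_order("IBBPBBP", 7);  /* prints IPBBPBB */
    putchar('\n');
    return 0;
}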
Figure 7. Structure of an MPEG-2 GOP, showing the reference pictures used for motion compensated prediction of P and B pictures.
Complexity of MPEG Video Processing
At the end of the 80's there was much discussion about the complexity of implementing DCT transforms in real-time at video rate. 8x8 blocks were chosen instead of 16x16 in order to reduce the complexity of the transform, the main objective being to avoid complex processing at the decoder side. With this goal, many optimized DCT implementations appeared, both as dedicated chips and as software, using a reduced number of multiplications and additions.
Nowadays, digital technology has made such progress in speed and processing performance that DCT coding and decoding is no longer a critical issue. Figure 8 shows a schematic block diagram of an MPEG-2 decoder, very similar to those of the other compression standards. A buffer is needed because the compressed bits are received at a constant bit-rate but are not "consumed" at a constant rate during decoding. VLD (Variable Length Decoding) is a relatively simple process that can be implemented by means of look-up tables or memories. Being a bit-wise process, it cannot be parallelized and turns out to be quite inefficient on general-purpose processors. This is the reason why new multimedia processors such as the Philips TriMedia use specific VLC/VLD units for entropy coding. The most costly elements of the MPEG-2 decoder are the memories for storing the past and future reference frames, and the handling of the data flow between the Motion Compensated Interpolator unit and the reference video memories.
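The look-up-table approach to VLD mentioned above can be shown with a toy, purely hypothetical code (the real MPEG-2 tables are far larger and include escape mechanisms): a fixed number of bits is peeked from the stream, the table returns the decoded symbol together with the true code length, and only that many bits are consumed.

/* Sketch: table-driven variable-length decoding for the toy code
   "1"->A, "01"->B, "001"->C, "000"->D (NOT a real MPEG-2 table).
   Three bits are peeked and used to index an 8-entry table; each
   entry stores the decoded symbol and the code length to consume. */
typedef struct { char sym; int len; } VlcEntry;

static const VlcEntry vlc_table[8] = {
    {'D',3},{'C',3},{'B',2},{'B',2},   /* 000, 001, 01x */
    {'A',1},{'A',1},{'A',1},{'A',1}    /* 1xx           */
};

/* bits: stream as '0'/'1' characters, assumed padded with trailing
   '0's; *pos is the bit cursor, advanced by the decoded code length. */
static char vlc_decode(const char *bits, int *pos)
{
    int idx = 0;
    for (int i = 0; i < 3; i++)                 /* peek 3 bits  */
        idx = (idx << 1) | (bits[*pos + i] == '1');
    *pos += vlc_table[idx].len;                 /* consume code */
    return vlc_table[idx].sym;
}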
Figure 8. Block diagram of an MPEG-2 decoder.
For an MPEG-2 encoder (see Figure 9) the situation is very different. First of all, we can recognize a path that implements a complete MPEG-2 decoder, necessary to reconstruct the reference images exactly as they will be found at the decoder side. Then there is a motion estimation block (the bi-directional motion estimator), whose goal is to find the motion vectors, and a block that selects and controls the macro-block encoding modes. As discussed in the previous paragraphs, the way to find the best motion vectors, as well as the way to choose the right coding mode for each macro-block, is not specified by the standard. Therefore, very simple algorithms (with limited quality) or extremely complex ones (with high quality) can be implemented for these functions. Moreover, MPEG-2 allows the dynamic definition of the GOP structure, opening up many coding-mode possibilities. In general, an MPEG-2 encoder has two critical issues: the motion estimation processor, and the handling of the complex data flow, with the associated bandwidth problems, between the original and coded frame memories, the motion estimation processor and the coding control unit.
We should also mention that the coding modes of MPEG-2 are much more complex than this brief description might suggest. In fact, existing TV is based on interlaced images, and all coding modes can be applied in distinct ways to "frame" blocks and macro-blocks or to "field" blocks and macro-blocks. The same applies to motion estimation, for which both field-based and frame-based vectors can be used. Moreover, all references for predictions can be made on true image pixels or on "virtual" image pixels obtained by bilinear interpolation, as shown in Figure 10.
Figure 9. Block diagram of an MPEG-2 encoder.
Figure 10. MPEG-2 macro-block references can also be made on "virtual" pixels (in red) obtained by bilinear interpolation, instead of image pixels from the original raster (gray).
In this case, motion vectors with half-pixel precision also need to be estimated. The possibility of using all these encoding modes largely increases the quality of the compressed video, but it can become extremely demanding in terms of processing complexity.
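A minimal sketch of the half-pixel case: when a motion vector component has a half-pel fraction, the "virtual" reference sample is the rounded bilinear average of the two or four surrounding raster pixels, as in the following illustrative C routine (coordinates in half-pel units; bounds checks omitted).

/* Sketch: fetch one reference sample at half-pel position (x2, y2),
   both expressed in half-pel units: even values fall on raster
   pixels, odd values on "virtual" interpolated pixels. */
static int half_pel_sample(const unsigned char *ref, int width,
                           int x2, int y2)
{
    int x  = x2 >> 1, y  = y2 >> 1;   /* integer raster position */
    int fx = x2 & 1,  fy = y2 & 1;    /* half-pel fractions      */
    int a = ref[ y       * width + x     ];
    int b = ref[ y       * width + x + fx];      /* right neighbour */
    int c = ref[(y + fy) * width + x     ];      /* lower neighbour */
    int d = ref[(y + fy) * width + x + fx];
    return (a + b + c + d + 2) >> 2;  /* rounded bilinear average  */
}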
The challenge for the MPEG-2 encoder designer is to find the best trade-off between the complexity of the implemented algorithms and the quality of the compressed video. Architectural and algorithmic issues are very tightly related in MPEG-2 encoder architectures.
Digital Video and Computer Graphics
In the past, digital video on computers was equivalent to computer graphics. Differently from the TV world, all processing was obviously digital, mainly treating synthetic images generated from 2-D or 3-D models. The concept of a real-time computer graphics application was very approximate, since the application was usually intended to run as fast as possible on the available processors, using graphics accelerators in parallel for the arithmetic operations on pixels.
Figure 11. Sequence of typical computer graphic processing steps.
Figure 11 shows a schematic diagram of the basic computer graphics operations. For each image, 2-D or 3-D models composed of triangles or polygons are placed in the virtual space by the application, which can be interactive. The position of each vertex is calculated according to the geometric transformation of the object and projected onto the screen. The texture mapped on each polygon is transformed according to the light model corresponding to the position of the polygon in space. The screen pixels corresponding to the screen raster are obtained from the "original" texture pixels on the polygon by appropriate filtering operations. Finally, the polygon is displayed on the screen.
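The geometric part of this pipeline reduces to one matrix product and one perspective divide per vertex, repeated for every polygon of the scene; the sketch below assumes a combined 4x4 model-view-projection matrix m, a convention of ours for the example.

/* Sketch: transform a 3-D vertex by a 4x4 matrix in homogeneous
   coordinates and project it onto the screen plane. */
typedef struct { double x, y, z; } Vec3;

static void project_vertex(const double m[4][4], Vec3 v,
                           double *sx, double *sy)
{
    double in[4] = { v.x, v.y, v.z, 1.0 }, out[4];
    for (int i = 0; i < 4; i++) {
        out[i] = 0.0;
        for (int j = 0; j < 4; j++)
            out[i] += m[i][j] * in[j];   /* matrix * vector    */
    }
    *sx = out[0] / out[3];               /* perspective divide */
    *sy = out[1] / out[3];
}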
Figure 12. Processing requirements of 3-D graphic content in terms of pixels and polygons per second.
Computer graphics applications strongly rely on the performance of acceleration cards, which are specialized to handle, in parallel and with deep pipelines, all these numerous but simple pixel operations. Figure 12 reports a diagram of the processing requirements, in terms of polygons/s and pixels/s, of various graphic contents.
TV, Computer Graphics and Multimedia: MPEG-4?
The new MPEG-4 multimedia standard, defined as a draft ISO international standard in October 1998, takes on the ambitious challenge of bringing together the world of natural video and TV with the world of computers and computer graphics.
In MPEG-4 we can in fact find both natural compressed video and 2-D and 3-D models. The standard is based on the concept of elementary streams, each of which represents and carries the information of a single "object" that can be of any type, "natural" or "synthetic", audio or video.
Figure 13 reports an example of the possible content of an MPEG-4 scene. Natural and 2-D and 3-D synthetic audio-visual objects are received and composed into a scene as seen by a hypothetical viewer.
Figure 13. Example of the content and construction of an MPEG-4 scene.
Figure 14. Diagram of MPEG-4 System layer and interface with the network layer.
Two virtual levels are necessary to interface the "elementary stream" level with the network level: the first multiplexes/demultiplexes each communication stream into packets, and the second synchronizes each packet and builds the "elementary streams" carrying the "object" information, as shown in Figure 14.
The processing related to the MPEG-4 Systems layer cannot be considered video processing; it is very similar to the packet processing typical of network communications.
An MPEG-4 terminal can be schematized as shown in Figure 15. The communication network provides the stream, which is demultiplexed into a set of "elementary streams". Each "elementary stream" is decoded into audio/video objects. Using the scene description transmitted with the elementary streams, all objects are "composed" together in the video memory according to their size, view angle and position in space, and then "rendered" on the display. The scene can be interactive, originating upstream data due to user interaction that is sent back to the MPEG-4 encoder.
MPEG-4 systems therefore implement not only the classical MPEG-2-like compression/decompression processing and functionality, but also computer graphics processing such as "composition" and "rendering". The main difference compared with the natural video of MPEG-1, MPEG-2 and H.263 is the introduction of "shape coding", enabling the use of arbitrarily shaped video objects as illustrated in Figure 16. Shape coding is based on macro-block data structures and arithmetic coding of the contour information associated with each boundary block.
Figure 15. Illustration of the processing and functionality implemented in an MPEG-4 terminal.
Figure 16. Compressed shape information is necessary for arbitrarily shaped objects.
Figure 17. MPEG-4 encoder block diagram; shape information is coded in parallel with the DCT-based texture coding. Shape coding can be of "Intra" type, or use motion compensation and prediction error like texture coding.
The block diagram of an MPEG-4 encoder is depicted in Figure 17. Its architecture is in general very similar to that of an MPEG-2 encoder. We can notice a new "shape coding" block in the motion estimation loop, which produces shape information transmitted in parallel with the classical texture coding information.
Video Processing Architectures: Generalities
In general, we can classify the circuits implementing video processing into four families:
Application Specific Integrated Circuits (ASICs).
To this group belong all hardwired circuits specifically designed for a single processing task. The level of programmability is very low, and the circuits are usually clocked at the input/output data sampling rate or at multiples of it.
Application Specific Digital Signal Processors (AS-DSPs).
These architectures are based on a DSP core plus special functions (such as 1-D and 2-D filters, FFT, graphics accelerators, block matching engines) that are specific to a selected application.
Digital Signal Processors (DSPs).
These are the classical processor architectures specialized for, and efficient at, multiply-accumulate operations on 16-, 24- or 32-bit data. The classical well-known families are those of Motorola and Texas Instruments. The level of programmability of these processors is very high. They are also employed for real-time applications with constant input/output rates.
General Purpose Processors (GPPs).
These are the classical PC processors (Intel, IBM PowerPC) and workstation processors (Digital Alpha, Sun UltraSparc). Originally designed for general-purpose software applications, they are in general, although very powerful, not really adapted to video processing. Moreover, the operating systems employed are not real-time OSs. Designing a real-time video application on these architectures is not as simple a task as it might appear.
Considering the video processing implementations of recent years, we can in general observe the trend over time illustrated in Figure 18. If we consider different video processing algorithms (indicated as Proc.1, Proc.2, etc. in order of increasing complexity), such as the DCT on an 8x8 block for instance, implementations based on ASIC architectures are the first to appear. After some years, with the evolution of IC technology, the same functions can be implemented in real-time by AS-DSPs, then by standard DSPs, and then by GPPs. This trend corresponds to the desire of transferring the complexity of the processing from the hardware architecture to the software implementation. However, this trend does not present only advantages, and it does not apply to all implementation cases. Figures 19, 22 and 23 illustrate advantages and disadvantages of each class of architectures that should be weighed case by case. Let us analyze and discuss each feature in detail.
Figure 18. Trend of algorithm implementations over time on different architectures.
Figure 19. Conflicting trade-offs for architecture families.
Figure 19 shows how the various families of architectures behave with respect to the two conflicting requirements of real-time performance and flexibility/programmability. For highly resource-demanding processing there is no doubt that dedicated circuits can be orders of magnitude faster than GPPs, but the advantages of programmability, and the possibility of changing the software to implement new processing capabilities, become attractive for some applications. For instance, a GPP can decode any video standard (H.261, H.263, MPEG-1, MPEG-2) just by changing the software depending on the application. On the other hand, real-time performance is not so easy to guarantee on most GPP platforms, and the difficulty of handling real-time processing and other processes at the same time has to be carefully evaluated and verified. Figure 20 illustrates the concept with a simple FIR filter example. In a dedicated implementation (ASIC), the filter can be realized with simple and extremely fast circuitry: an architecture based on registers and multipliers of just the size and speed necessary for the processing at hand. The guarantee of real-time processing is easily achieved by appropriately clocking the system to the input data. Conversely, a programmable solution is much more complex. Figure 21 reports the various processing elements usually found: ALUs, memories for the data and for the program instructions, communication buses, etc. Moreover, even a simple processing algorithm such as FIR filtering needs to access the data and program memories several times, as reported in the instruction example.
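In C, the inner loop that a programmable processor executes for an N-tap FIR filter is the classical multiply-accumulate sketched below; each output sample costs N coefficient fetches, N data fetches and N MAC operations, which is exactly the memory traffic referred to above, whereas a dedicated circuit can simply wire one multiplier per tap.

/* Sketch: N-tap FIR filter, y[n] = sum over k of h[k] * x[n-k]. */
static void fir(const double *x, double *y, int nsamp,
                const double *h, int ntaps)
{
    for (int n = ntaps - 1; n < nsamp; n++) {
        double acc = 0.0;                  /* accumulator        */
        for (int k = 0; k < ntaps; k++)
            acc += h[k] * x[n - k];        /* one MAC per tap,   */
        y[n] = acc;                        /* two memory fetches */
    }
}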
These considerations lead to clear cost advantages for ASICs when high volumes are required (see Figure 23). Simpler circuits requiring smaller silicon area are the right solution for set-top boxes and other high-volume applications (MPEG-2 decoders for digital TV broadcasting, for instance). In these cases the high development costs and the lack of debugging and software tools for simulation and design do not constitute a serious drawback. Modifications of the algorithms and the introduction of new versions are not possible, but neither are they required by this kind of application. Conversely, for low-volume applications, the right solution may be a programmable device immediately available on the market and well supported by compilers, debuggers and simulation tools that can effectively reduce development time and cost. The much higher cost of the programmable processor then becomes acceptable for relatively low device volumes.
Another conflict between hardwired and programmable solutions arises from the need to design low-power solutions, driven by the increasing importance of portable device applications and by the need to reduce the growing power dissipation of high-performance processors (see Figure 24). This need conflicts with the desire to transfer the increasing complexity of processing algorithms from the architecture to the software, which is much easier and faster to modify, correct and debug. The optimization of memory size and accesses, clock frequency, and the other architectural features that yield low power consumption is only possible on ASIC architectures.
What is the range of power consumption reduction that can be reached by passing from a GPP to an ASIC? It is difficult to answer this question with a single figure; it varies from architecture to architecture and from processing task to processing task. For instance, Figure 24 reports the power dissipation of a 2-D convolution with a 3x3 filter kernel on a 256x256 image on three different architectures. The result is that the ARM RISC implementation, besides being slower than the other alternatives (and so providing an under-estimate), is about 3 times more power-demanding than an FPGA implementation and 18 times more than an ASIC-based one. The example of the IMAGE motion estimation chip reported at the end of this document shows that much higher reduction factors (even more than two orders of magnitude) can be reached by low-power optimized ASIC architectures for specific processing tasks, when compared to GPPs providing the same performance.
Figure 20. Example of FIR filtering implementation on a dedicated architecture.
Figure 21. Example of a FIR filtering implementation on a DSP architecture.
Figure 22. Conflicting trade-off for architecture families.
Figure 23. Conflicting trade-off for architecture families.
Figure 24. Power dissipation reduction for the same processing (2-D convolution 3x3) on three different architectures.
A last general consideration about the efficiency of the various architectures for video processing regards memory usage. Video processing applications, as we have seen in more detail for MPEG-2, require the handling of very large amounts of data (pixels) that need to be processed and accessed several times in a video encoder or decoder. Images are filtered, coded, decoded, and used as references for motion compensation and motion estimation of different frames; in other words, they are accessed, in order or "randomly", several times in a compression/decompression stage. If we observe the speed of processors and the speed of access to cache SRAM and synchronous DRAM over the last years, we notice two distinct trends (see Figure 25). The speed of processors was similar to the memory access speed in 1990, but it is now more than double, and the trend is towards even higher speed ratios. This means that the performance bottleneck of today's video processing architectures is the efficiency of the data flow. A correct design of the software for GPPs, and a careful evaluation of the achievable memory bandwidth for the various data exchanges, are necessary to avoid the risk that the largest fraction of time is spent by the processing unit just waiting for the correct data to process. For graphics accelerators, data flow handling is the very objective of the processing. Figure 26 reports the performance of some state-of-the-art devices versus the graphic content.
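To make the memory-bandwidth evaluation mentioned above concrete, the following deliberately coarse C estimate counts only the two dominant pixel flows of an MPEG-2 decoder; the numbers and simplifications are ours (4:2:0 at 720x576 and 25 frames/s, one 17x17 half-pel read per macro-block, no B-picture double reads, no display refresh).

#include <stdio.h>

/* Rough estimate of external-memory pixel traffic in MPEG-2 decoding. */
int main(void)
{
    double fps = 25.0;
    double pix_frame = 720.0 * 576 * 1.5;   /* 4:2:0: Y + Cb + Cr, bytes  */
    double write_bw  = pix_frame * fps;     /* store each decoded frame   */
    double mc_bw = 1620.0 * (17.0 * 17 * 1.5) * fps; /* reference reads   */
    printf("frame writes: %.1f Mbyte/s\n", write_bw / 1e6);
    printf("MC reads:     %.1f Mbyte/s (single prediction)\n", mc_bw / 1e6);
    return 0;
}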
Figure 25. Evolution of the processing speed of processors, SRAM and Synch. DRAM in the last years. Memory access speed has become the performance bottleneck of data-intensive processing systems.
Figure 26. Performance and power dissipation of state-of-the-art graphics accelerators (AS-DSPs) versus polygons/s and pixels/s.
Motion Estimation Case Study
Block motion estimation for high quality video compression applications (e.g. digital TV broadcasting, multimedia content production, ...) is a typical example for which GPP architectures are not a good implementation choice.
Motion estimation is indeed the most computationally demanding stage of video compression at the encoder side. For standard resolution TV we have to encode 1620 macro-blocks per frame, at 25 frames per second. Roughly, evaluating the matching error for one candidate motion vector takes about 510 arithmetic operations on data of 8 to 16 bits. The number of candidate vector displacements depends on the search window size, which should be large to guarantee high quality coding; for sport sequences, for instance, a size of about 100x100 is required. This leads to about 206 x 10^9 arithmetic operations per second on 8- to 16-bit data. Even if we select an "intelligent" search algorithm that reduces the number of search points by one to two orders of magnitude, the number of operations remains extremely high and is not feasible on state-of-the-art GPPs. Moreover, 32- or 64-bit arithmetic cores are wasted when only operations on 8 to 16 bits are necessary. Completely different architectures, implementing a high level of parallelism down to the bit level, are necessary.
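A full-search block matcher written naively in C shows where these numbers come from: the sketch below spends 16 x 16 x 2 = 512 operations (one difference and one absolute-accumulate per pixel) on every candidate vector and visits every position of the search window. With a +/-50 window this is about 100 x 100 candidates per macro-block, which, multiplied by 1620 macro-blocks, 25 frames/s and about 510 operations per candidate, gives the 206 x 10^9 operations/s quoted above.

#include <stdlib.h>
#include <limits.h>

/* Sketch: sum of absolute differences between the current 16x16
   macro-block and a displaced reference block (bounds checks omitted). */
static int sad_16x16(const unsigned char *cur, const unsigned char *ref,
                     int width)
{
    int sad = 0;
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            sad += abs(cur[y * width + x] - ref[y * width + x]);
    return sad;
}

/* Exhaustive search over a +/-R displacement range. */
static void full_search(const unsigned char *cur, const unsigned char *ref,
                        int width, int R, int *best_dx, int *best_dy)
{
    int best = INT_MAX;
    for (int dy = -R; dy <= R; dy++)
        for (int dx = -R; dx <= R; dx++) {
            int sad = sad_16x16(cur, ref + dy * width + dx, width);
            if (sad < best) { best = sad; *best_dx = dx; *best_dy = dy; }
        }
}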
To be more accurate, we can notice that B pictures require both forward and backward motion estimation and that, for TV applications for instance, each macro-block can use the best among frame-based or field-based motion vectors at full or half pixel resolution. We therefore realize that the real processing needs can increase by more than a factor of 10 if all possible motion vectors are estimated.
Another reason why ASICs or AS-DSPs are an interesting and topical choice for motion estimation is the still unsolved need for motion estimation in TV displays. Large TV displays require doubling the refresh rate to avoid the annoying flickering phenomenon appearing on the side portions of large screens. A conversion of interlaced content from 50 to 100 Hz by simply doubling each field provides satisfactory results if there is no motion. For moving objects, however, the image quality provided by field doubling is low, and motion compensated interpolation is necessary to reconstruct the movement phase of the interpolated images. An efficient and low-cost motion estimation stage is necessary for high quality up-conversion on TV displays.
IMAGE: a Motion Estimation Chip for MPEG-2 Applications
We briefly describe the characteristics of a motion estimation chip designed in the C3I laboratory of EPFL in the framework of the European ATLANTIC project, in collaboration with the BBC, CSELT, Snell & Wilcox and the Fraunhofer Institute. IMAGE is an acronym for Integrated MIMD Architecture for Genetic motion Estimation. The requirement for the chip was to provide estimations for MPEG-2 encoders over very large search windows, for forward, backward, field-based, frame-based, full-pel and half-pel precision motion vectors. Figures 27 and 28 report the MPEG-2 broadcasting chain and the main input-output specifications of the chip. The same chip is also required to evaluate the candidate motion compensation modes (forward, backward, field, frame, intra) and to select the corresponding best coding decision. Since all these operations are macro-block based, they share the same level of parallelism as the motion estimation algorithms.
The basic architectural idea has been to design a processing engine that is extremely efficient at computing the mean absolute difference between macro-blocks (the matching error), with fast access to a large image section (the search window). By extremely efficient we mean exploiting as much as possible the parallelism intrinsic to pixel operations on 16x16 blocks of pixels, and being able to access randomly any position in the search window without useless waiting times (i.e. providing the engine with sufficient memory bandwidth to fully exploit its processing power). Figure 29 reports the block diagram of the "block-matching" engine. We can notice in the center the "pixel processor" for the parallel execution of the macro-block difference, two cache memory banks for the storage of the current macro-block and of the search window reference, and a RISC processor for the handling of the genetic motion estimation algorithm and for the communications between processing units. The basic processing unit of Figure 29 appears in the general architecture of the chip, reported in Figure 30. We can notice two macro-block processing units in parallel, the various I/O modules for the communication with the external frame memory, and the communication interfaces for cascading chips for forward and backward motion estimation and for larger search window sizes. As mentioned when discussing data-intensive applications, one of the main difficulties of the chip design is the correct balancing of the processing times of the various units and the optimization of the communications between modules. It is fundamental that the processing of all modules is scheduled so as to avoid wait times, and that the communication buses have the necessary bandwidth.
Low power optimizations are summarized in Figure 31. Deactivation of idle processing units, local gated clocks and the implementation of a low-power internal SRAM as cache memory made it possible to keep power dissipation below 1 W. Figure 32 reports the final layout of the chip with the main design parameters.
In conclusion, the IMAGE chip can be classified as an AS-DSP, given its high programmability; the application-specific function for which special hardware is used is the calculation of macro-block differences. Its motion estimation performance is much higher than that of any state-of-the-art GPP, and it is obtained with a relatively small chip dissipating less than 1 W while providing real-time motion estimation for MPEG-2 video compression. More details about the IMAGE chip can be found in: F. Mombers, M. Gumm et al., "IMAGE: a low-cost low-power video processor for high quality motion estimation in MPEG-2 encoding", IEEE Trans. on Consumer Electronics, Vol. 44, No. 3, August 1998, pp. 774-783.
Figure 27. Block diagram of a TV broadcasting chain based on MPEG-2 compression.
Figure 28. Requirements of a motion estimation/prediction selection chip for MPEG-2 encoding.
Figure 29. Block diagram of the "block matching" processor.
Figure 30. High level architecture of the IMAGE chip with the indication of the critical communication paths.
Figure 31. Low power optimizations achieved on the IMAGE chip.
Figure 32. Main design data of the IMAGE chip.