Chapter 1 INTRODUCTION TO VLSI SYSTEMS
Design of VLSI Systems - Chapter 1

Historical Perspective
VLSI Design Flow
Design Hierarchy
Concepts of Regularity, Modularity and Locality
VLSI Design Styles

1.1 Historical Perspective

The electronics industry has achieved phenomenal growth over the last two decades, mainly due to rapid advances in integration technologies and large-scale systems design - in short, due to the advent of VLSI. The number of applications of integrated circuits in high-performance computing, telecommunications, and consumer electronics has been rising steadily, and at a very fast pace. Typically, the required computational power (or, in other words, the intelligence) of these applications is the driving force for the fast development of this field. Figure 1.1 gives an overview of the prominent trends in information technologies over the next few decades. The current leading-edge technologies (such as low bit-rate video and cellular communications) already provide end-users with a certain amount of processing power and portability. This trend is expected to continue, with very important implications for VLSI and systems design. One of the most important characteristics of information services is their increasing need for very high processing power and bandwidth (in order to handle real-time video, for example). The other important characteristic is that information services tend to become more and more personalized (as opposed to collective services such as broadcasting), which means that the devices must be more intelligent to answer individual demands, and at the same time they must be portable to allow more flexibility and mobility.

Figure-1.1: Prominent trends in information service technologies.
As more and more complex functions are required in various data processing and telecommunications devices, the need to integrate these functions in a small system/package is also increasing. The level of integration, as measured by the number of logic gates in a monolithic chip, has been steadily rising for almost three decades, mainly due to the rapid progress in processing technology and interconnect technology. Table 1.1 shows the evolution of logic complexity in integrated circuits over the last three decades, and marks the milestones of each era. Here, the numbers for circuit complexity should be interpreted only as representative examples to show the order of magnitude. A logic block can contain anywhere from 10 to 100 transistors, depending on the function. State-of-the-art examples of ULSI chips, such as the DEC Alpha or the INTEL Pentium, contain 3 to 6 million transistors.

ERA                                     DATE    COMPLEXITY (number of logic blocks per chip)
Single transistor                       1959    less than 1
Unit logic (one gate)                   1960    1
Multi-function                          1962    2 - 4
Complex function                        1964    5 - 20
Medium Scale Integration (MSI)          1967    20 - 200
Large Scale Integration (LSI)           1972    200 - 2000
Very Large Scale Integration (VLSI)     1978    2000 - 20000
Ultra Large Scale Integration (ULSI)    1989    20000 - ?

Table-1.1: Evolution of logic complexity in integrated circuits.

The most important message here is that the logic complexity per chip has been (and still is) increasing exponentially. The monolithic integration of a large number of functions on a single chip usually provides:

- Less area/volume and, therefore, compactness
- Less power consumption
- Fewer testing requirements at system level
- Higher reliability, mainly due to improved on-chip interconnects
- Higher speed, due to significantly reduced interconnection length
- Significant cost savings

Figure-1.2: Evolution of integration density and minimum feature size, as seen in the early 1980s.
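The exponential growth read off Table 1.1 can be made concrete with a small calculation. The following is a minimal sketch, not part of the original text: it takes the upper bound of each complexity range as a representative value (an assumption, since the table gives order-of-magnitude ranges only) and fits a straight line to log2(complexity) versus year. The function name `doubling_time_years` is hypothetical.

```python
import math

# Representative complexities (logic blocks per chip) taken from Table 1.1.
# Using the upper bound of each range is an assumption of this sketch.
ERAS = [
    (1960, 1),
    (1962, 4),
    (1964, 20),
    (1967, 200),
    (1972, 2000),
    (1978, 20000),
]


def doubling_time_years(points):
    """Least-squares fit of log2(complexity) vs. year; returns years per doubling."""
    xs = [year for year, _ in points]
    ys = [math.log2(blocks) for _, blocks in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # doublings per year
    return 1.0 / slope


if __name__ == "__main__":
    print(f"Estimated doubling time: {doubling_time_years(ERAS):.2f} years")
```

A straight line on a logarithmic scale is what "exponential" means here; the fitted slope should be read only as an order-of-magnitude indication, in the same spirit as the table itself.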
Therefore, the current trend of integration will also continue in the foreseeable future. Advances in device manufacturing technology, and especially the steady reduction of minimum feature size (minimum length of a transistor or an interconnect realizable on chip), support this trend. Figure 1.2 shows the history and forecast of chip complexity - and minimum feature size - over time, as seen in the early 1980s. At that time, a minimum feature size of 0.3 microns was expected around the year 2000. The actual development of the technology, however, has far exceeded these expectations. A minimum size of 0.25 microns was readily achievable by the year 1995. As a direct result of this, the integration density has also exceeded previous expectations - the first 64 Mbit DRAM, and the INTEL Pentium microprocessor chip containing more than 3 million transistors were already available by 1994, pushing the envelope of integration density.

When comparing the integration density of integrated circuits, a clear distinction must be made between memory chips and logic chips. Figure 1.3 shows the level of integration over time for memory and logic chips, starting in 1970. It can be observed that, in terms of transistor count, logic chips contain significantly fewer transistors in any given year, mainly due to the large consumption of chip area for complex interconnects. Memory circuits are highly regular, and thus more cells can be integrated with much less area for interconnects.

Figure-1.3: Level of integration over time, for memory chips and logic chips.

Generally speaking, logic chips such as microprocessor chips and digital signal processing (DSP) chips contain not only large arrays of memory (SRAM) cells, but also many different functional units.
As a result, their design complexity is considered much higher than that of memory chips, although advanced memory chips contain some sophisticated logic functions. The design complexity of logic chips increases almost exponentially with the number of transistors to be integrated. This translates into an increase in the design cycle time, which is the time period from the start of the chip development until the mask-tape delivery time. However, in order to make the best use of the current technology, the chip development time has to be short enough to allow the maturing of chip manufacturing and timely delivery to customers. As a result, the level of actual logic integration tends to fall short of the integration level achievable with the current processing technology. Sophisticated computer-aided design (CAD) tools and methodologies are developed and applied in order to manage the rapidly increasing design complexity.

1.2 VLSI Design Flow

The design process, at various levels, is usually evolutionary in nature. It starts with a given set of requirements. An initial design is developed and tested against the requirements. When the requirements are not met, the design has to be improved. If such improvement is either not possible or too costly, then a revision of the requirements and an analysis of its impact must be considered. The Y-chart (first introduced by D. Gajski) shown in Fig. 1.4 illustrates a design flow for most logic chips, using design activities on three different axes (domains) which resemble the letter Y.

Figure-1.4: Typical VLSI design flow in three domains (Y-chart representation).

The Y-chart consists of three major domains, namely: the behavioral domain, the structural domain, and the geometrical layout domain. The design flow starts from the algorithm that describes the behavior of the target chip.
The corresponding architecture of the processor is first defined. It is mapped onto the chip surface by floorplanning. The next design evolution in the behavioral domain defines finite state machines (FSMs), which are structurally implemented with functional modules such as registers and arithmetic logic units (ALUs). These modules are then geometrically placed onto the chip surface using CAD tools for automatic module placement followed by routing, with the goal of minimizing the interconnect area and signal delays. The third evolution starts with a behavioral module description. Individual modules are then implemented with leaf cells. At this stage the chip is described in terms of logic gates (leaf cells), which can be placed and interconnected by using a cell placement and routing program. The last evolution involves a detailed Boolean description of leaf cells followed by a transistor-level implementation of leaf cells and mask generation. In standard-cell based design, leaf cells are already pre-designed and stored in a library for logic design use.

Figure-1.5: A more simplified view of VLSI design flow.

Figure 1.5 provides a more simplified view of the VLSI design flow, taking into account the various representations, or abstractions, of design - behavioral, logic, circuit and mask layout. Note that the verification of design plays a very important role in every step during this process. The failure to properly verify a design in its early phases typically causes significant and expensive re-design at a later stage, which ultimately increases the time-to-market. Although the design process has been described in linear fashion for simplicity, in reality there are many iterations back and forth, especially between any two neighboring steps, and occasionally even between remotely separated pairs.
Although a top-down design flow provides excellent design process control, in reality there is no truly unidirectional top-down design flow. Both top-down and bottom-up approaches have to be combined. For instance, if a chip designer defines an architecture without a close estimate of the corresponding chip area, then it is very likely that the resulting chip layout will exceed the area limit of the available technology. In such a case, in order to fit the architecture into the allowable chip area, some functions may have to be removed and the design process must be repeated. Such changes may require significant modification of the original requirements. Thus, it is very important to feed low-level information to higher levels (bottom-up) as early as possible.

In the following, we will examine design methodologies and structured approaches which have been developed over the years to deal with both complex hardware and software projects. Regardless of the actual size of the project, the basic principles of structured design will improve the prospects of success. Some of the classical techniques for reducing the complexity of IC design are: hierarchy, regularity, modularity and locality.

1.3 Design Hierarchy

The use of hierarchy, or the "divide and conquer" technique, involves dividing a module into sub-modules and then repeating this operation on the sub-modules until the complexity of the smaller parts becomes manageable. This approach is very similar to the software case, where large programs are split into smaller and smaller sections until simple subroutines, with well-defined functions and interfaces, can be written. In Section 1.2, we have seen that the design of a VLSI chip can be represented in three domains. Correspondingly, a hierarchy structure can be described in each domain separately.
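The divide-and-conquer idea can be sketched directly in software. The following is a minimal illustration, assuming a ripple-carry organization (an assumption of this sketch, chosen for simplicity): a four-bit adder is built from one-bit adders, each one-bit adder from separate sum and carry circuits, and those from individual logic gates. All function names are hypothetical.

```python
# Gate level: the leaf cells at the bottom of the hierarchy.
def and_gate(a, b):
    return a & b


def or_gate(a, b):
    return a | b


def xor_gate(a, b):
    return a ^ b


# Circuit level: separate sum and carry circuits, built only from gates.
def sum_circuit(a, b, carry_in):
    return xor_gate(xor_gate(a, b), carry_in)


def carry_circuit(a, b, carry_in):
    return or_gate(and_gate(a, b), and_gate(carry_in, xor_gate(a, b)))


# One-bit adder: combines the sum and carry circuits.
def one_bit_adder(a, b, carry_in):
    return sum_circuit(a, b, carry_in), carry_circuit(a, b, carry_in)


# Top level: four one-bit adders chained through their carry signals.
def four_bit_adder(a_bits, b_bits, carry_in=0):
    """Add two 4-bit values given as lists of bits, least significant bit first."""
    carry = carry_in
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        s, carry = one_bit_adder(a, b, carry)
        sum_bits.append(s)
    return sum_bits, carry


def to_bits(value, width=4):
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))


if __name__ == "__main__":
    sum_bits, carry_out = four_bit_adder(to_bits(9), to_bits(5))
    print(from_bits(sum_bits), carry_out)
```

Each level only calls the level directly below it through a small, well-defined interface, which is what keeps every individual piece simple enough to verify on its own.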
However, it is important for the simplicity of design that the hierarchies in different domains can be mapped into each other easily.

As an example of structural hierarchy, Fig. 1.6 shows the structural decomposition of a CMOS four-bit adder into its components. The adder can be decomposed progressively into one-bit adders, separate carry and sum circuits, and finally, into individual logic gates. At this lower level of the hierarchy, the design of a simple circuit realizing a well-defined Boolean function is much easier to handle than at the higher levels of the hierarchy.

Figure-1.6: Structural decomposition of a four-bit adder circuit, showing the hierarchy down to gate level.

In the physical domain, partitioning a complex system into its various functional blocks will provide valuable guidance for the actual realization of these blocks on chip. Obviously, the approximate shape and size (area) of each sub-module should be estimated in order to provide a useful floorplan. Figure 1.7 shows the hierarchical decomposition of a four-bit adder in the physical description (geometrical layout) domain, resulting in a simple floorplan. This physical view describes the external geometry of the adder, the locations of input and output pins, and how pin locations allow some signals (in this case the carry signals) to be transferred from one sub-block to the other without external routing. At lower levels of the physical hierarchy, the internal mask layout of each adder cell defines the locations and the connections of each transistor and wire. Figure 1.8 shows the full-custom layout of a 16-bit dynamic CMOS adder, and the sub-modules that describe the lower levels of its physical hierarchy. Here, the 16-bit adder consists of a cascade connection of four 4-bit adders, and each 4-bit adder can again be decomposed into its functional blocks such as the Manchester chain, carry/propagate circuits and the output buffers. Finally, Fig. 1.9 and Fig. 1.10 show the structural hierarchy and the physical layout of a simple triangle generator chip, respectively. Note that there is a corresponding physical description for every module in the structural hierarchy, i.e., the components of the physical view closely match this structural view.

Figure-1.7: Hierarchical decomposition of a four-bit adder in physical (geometrical) description domain.

Figure-1.8: Layout of a 16-bit adder, and the components (sub-blocks) of its physical hierarchy.

Figure-1.9: The structural hierarchy of a triangle generator chip.

Figure-1.10: Physical layout of the triangle generator chip.

1.4 Concepts of Regularity, Modularity and Locality

The hierarchical design approach reduces the design complexity by dividing the large system into several sub-modules. Usually, other design concepts and design approaches are also needed to simplify the process. Regularity means that the hierarchical decomposition of a large system should result in not only simple, but also similar blocks, as much as possible. A good example of regularity is the design of array structures consisting of identical cells - such as a parallel multiplication array. Regularity can exist at all levels of abstraction: at the transistor level, uniformly sized transistors simplify the design; at the logic level, identical gate structures can be used, etc. Figure 1.11 shows regular circuit-level designs of a 2-1 MUX (multiplexer), a D-type edge-triggered flip-flop, and a one-bit full adder. Note that all of these circuits were designed by using inverters and tri-state buffers only.
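To illustrate how a small set of identical building blocks can realize a function, the following sketch models a 2-1 MUX behaviorally using only an inverter and two tri-state buffers. It is not a transcription of Fig. 1.11: the wiring, the use of non-inverting buffers, and the representation of a high-impedance output as `None` are all assumptions made for illustration, and the function names are hypothetical.

```python
HIGH_Z = None  # models an undriven (high-impedance) output


def inverter(a):
    return 1 - a


def tristate_buffer(data, enable):
    """Drive the output with `data` when enabled, otherwise leave it floating."""
    return data if enable else HIGH_Z


def resolve(*drivers):
    """Resolve a shared wire: at most one distinct value may be driven at a time."""
    active = [value for value in drivers if value is not HIGH_Z]
    if not active:
        return HIGH_Z
    if any(value != active[0] for value in active):
        raise ValueError("bus contention: conflicting drivers on the same wire")
    return active[0]


def mux2(a, b, select):
    """2-1 MUX: output follows `a` when select is 0 and `b` when select is 1."""
    select_n = inverter(select)
    return resolve(
        tristate_buffer(a, select_n),
        tristate_buffer(b, select),
    )


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            for s in (0, 1):
                print(a, b, s, "->", mux2(a, b, s))
```

Because exactly one buffer is enabled for either value of the select signal, the shared output wire always has a single driver.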
If the designer has a small library of well-defined and well-characterized basic building blocks, a number of different functions can be constructed by using this principle. Regularity usually reduces the number of different modules that need to be designed and verified, at all levels of abstraction.

Figure-1.11: Regular design of a 2-1 MUX, a DFF and an adder, using inverters and tri-state buffers.

Modularity in design means that the various functional blocks which make up the larger system must have well-defined functions and interfaces. Modularity allows each block or module to be designed relatively independently of the others, since there is no ambiguity about the function and the signal interface of these blocks. All of the blocks can be combined with ease at the end of the design process, to form the large system. The concept of modularity enables the parallelisation of the design process. It also allows the use of generic modules in various designs - the well-defined functionality and signal interface allow plug-and-play design.

By defining well-characterized interfaces for each module in the system, we effectively ensure that the internals of each module become unimportant to the exterior modules. Internal details remain at the local level. The concept of locality also ensures that connections are mostly between neighboring modules, avoiding long-distance connections as much as possible. This last point is extremely important for avoiding excessive interconnect delays. Time-critical operations should be performed locally, without the need to access distant modules or signals. If necessary, the replication of some logic may solve this problem in large system architectures.

1.5 VLSI Design Styles

Several design styles can be considered for chip implementation of specified algorithms or logic functions.
Each design style has its own merits and shortcomings, and thus a proper choice has to be made by designers in order to provide the functionality at low cost.

1.5.1 Field Programmable Gate Array (FPGA)

Fully fabricated FPGA chips containing thousands of logic gates or even more, with programmable interconnects, are available to users for their custom hardware programming to realize desired functionality. This design style provides a means for fast prototyping and also for cost-effective chip design, especially for low-volume applications. A typical field programmable gate array (FPGA) chip consists of I/O buffers, an array of configurable logic blocks (CLBs), and programmable interconnect structures. The programming of the interconnects is implemented by programming of RAM cells whose output terminals are connected to the gates of MOS pass transistors. A general architecture of FPGA from XILINX is shown in Fig. 1.12. A more detailed view showing the locations of switch matrices used for interconnect routing is given in Fig. 1.13.

A simple CLB (model XC2000 from XILINX) is shown in Fig. 1.14. It consists of four signal input terminals (A, B, C, D), a clock signal terminal, user-programmable multiplexers, an SR-latch, and a look-up table (LUT). The LUT is a digital memory that stores the truth table of the Boolean function. Thus, it can generate any function of up to four variables or any two functions of three variables. The control terminals of multiplexers are not shown explicitly in Fig. 1.14. The CLB is configured such that many different logic functions can be realized by programming its array. More sophisticated CLBs have also been introduced to map complex functions.

The typical design flow of an FPGA chip starts with the behavioral description of its functionality, using a hardware description language such as VHDL.
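The look-up table described above can be modeled as a small memory addressed by its inputs. The following is a minimal behavioral sketch, not a model of the actual XC2000 circuit: the class name `LookUpTable`, its methods, and the bit ordering of the address are assumptions made for illustration.

```python
class LookUpTable:
    """A memory with 2**num_inputs one-bit entries holding a truth table."""

    def __init__(self, num_inputs=4):
        self.num_inputs = num_inputs
        self.table = [0] * (1 << num_inputs)

    def program(self, func):
        """Fill the memory by evaluating `func` for every input combination."""
        for address in range(len(self.table)):
            bits = [(address >> i) & 1 for i in range(self.num_inputs)]
            self.table[address] = int(bool(func(*bits)))

    def read(self, *inputs):
        """Use the input signals as an address and return the stored bit."""
        if len(inputs) != self.num_inputs:
            raise ValueError("wrong number of input signals")
        address = sum(bit << i for i, bit in enumerate(inputs))
        return self.table[address]


if __name__ == "__main__":
    # One function of four variables (A, B, C, D) in a 16-entry table.
    lut = LookUpTable(4)
    lut.program(lambda a, b, c, d: (a & b) ^ (c | d))
    print(lut.read(1, 1, 0, 0))

    # Alternatively, two functions of three variables in two 8-entry tables.
    majority = LookUpTable(3)
    majority.program(lambda a, b, c: (a & b) | (b & c) | (a & c))
    parity = LookUpTable(3)
    parity.program(lambda a, b, c: a ^ b ^ c)
    print(majority.read(1, 0, 1), parity.read(1, 0, 1))
```

Since every entry of the memory can be set independently, the same hardware realizes any Boolean function of its inputs; only the stored contents change.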
The synthesized architecture is then technology-mapped (or partitioned) into circuits or logic cells. At this stage, the chip design is completely described in terms of available logic cells. Next, the placement and routing step assigns individual logic cells to FPGA sites (CLBs) and determines the routing patterns among the cells in accordance with the netlist. After routing is completed, the on-chip performance of the design can be simulated and verified before downloading the design for programming of the FPGA chip. The programming of the chip remains valid as long as the chip is powered on, or until new programming is done. In most cases, full utilization of the FPGA chip area is not possible - many cell sites may remain unused.

Figure-1.12: General architecture of Xilinx FPGAs.

Figure-1.13: Detailed view of switch matrices and interconnection routing between CLBs.

Figure-1.14: XC2000 CLB of the Xilinx FPGA.

The largest advantage of FPGA-based design is the very short turn-around time, i.e., the time required from the start of the design process until a functional chip is available. Since no physical manufacturing step is necessary for customizing the FPGA chip, a functional sample can be obtained almost as soon as the design is mapped into a specific technology. The typical price of FPGA chips is usually higher than that of other realization alternatives (such as gate array or standard cells) of the same design, but for small-volume production of ASIC chips and for fast prototyping, FPGA offers a very valuable option.

1.5.2 Gate Array Design

In terms of fast prototyping capability, the gate array (GA) ranks second after the FPGA. While the design implementation of the FPGA chip is done with user programming, that of the gate array is done with metal mask design and processing.
Gate array implementation requires a two-step manufacturing process: The first phase, which is based on generic (standard) masks, results in an array of uncommitted transistors on each GA chip. These uncommitted chips can be stored for later customization, which is completed by defining the metal interconnects between the transistors of the array (Fig. 1.15). Since the patterning of metallic interconnects is done at the end of the chip fabrication, the turn-around time can still be short, a few days to a few weeks. Figure 1.16 shows a corner of a gate array chip which contains bonding pads on its left and bottom edges, diodes for I/O protection, nMOS transistors and pMOS transistors for chip output driver circuits in the neighboring areas of bonding pads, arrays of nMOS transistors and pMOS transistors, underpass wire segments, and power and ground buses along with contact windows.

Figure-1.15: Basic processing steps required for gate array implementation.

Figure-1.16: A corner of a typical gate array chip.

Figure 1.17 shows a magnified portion of the internal array with metal mask design (metal lines highlighted in dark) to realize a complex logic function. Typical gate array platforms allow dedicated areas, called channels, for intercell routing between rows or columns of MOS transistors, as shown in Figs. 1.16 and 1.17. The availability of these routing channels simplifies the interconnections, even when using only one metal layer. The interconnection patterns to realize basic logic gates can be stored in a library, which can then be used to customize rows of uncommitted transistors according to the netlist.
While most gate array platforms only contain rows of uncommitted transistors separated by routing channels, some other platforms also offer dedicated memory (RAM) arrays to allow a higher density where memory functions are required. Figure 1.18 shows the layout views of a conventional gate array and a gate array platform with two dedicated memory banks.

With the use of multiple interconnect layers, the routing can be achieved over the active cell areas; thus, the routing channels can be removed as in Sea-of-Gates (SOG) chips. Here, the entire chip surface is covered with uncommitted nMOS and pMOS transistors. As in the gate array case, neighboring transistors can be customized using a metal mask to form basic logic gates. For intercell routing, however, some of the uncommitted transistors must be sacrificed. This approach results in more flexibility for interconnections, and usually in a higher density. The basic platform of a SOG chip is shown in Fig. 1.19. Figure 1.20 offers a brief comparison between the channeled (GA) vs. the channelless (SOG) approaches.

Figure-1.17: Metal mask design to realize a complex logic function on a channeled GA platform.

Figure-1.18: Layout views of a conventional GA chip and a gate array with two memory banks.

Figure-1.19: The platform of a Sea-of-Gates (SOG) chip.

In general, the GA chip utilization factor, as measured by the used chip area divided by the total chip area, is higher than that of the FPGA and so is the chip speed, since more customized design can be achieved with metal mask designs. The current gate array chips can implement as many as hundreds of thousands of logic gates.
Figure-1.20: Comparison between the channeled (GA) vs. the channelless (SOG) approaches.

1.5.3 Standard-Cells Based Design

Standard-cell based design is one of the most prevalent full custom design styles, which require the development of a full custom mask set. The standard cell is also called the polycell. In this design style, all of the commonly used logic cells are developed, characterized, and stored in a standard cell library. A typical library may contain a few hundred cells including inverters, NAND gates, NOR gates, complex AOI and OAI gates, D-latches, and flip-flops. Each gate type can have multiple implementations to provide adequate driving capability for different fanouts. For instance, the inverter gate can have standard size transistors, double size transistors, and quadruple size transistors so that the chip designer can choose the proper size to achieve high circuit speed and layout density. The characterization of each cell is done for several different categories. It consists of:

- delay time vs. load capacitance
- circuit simulation model
- timing simulation model
- fault simulation model
- cell data for place-and-route
- mask data

To enable automated placement of the cells and routing of inter-cell connections, each cell layout is designed with a fixed height, so that a number of cells can be abutted side-by-side to form rows. The power and ground rails typically run parallel to the upper and lower boundaries of the cell, so that neighboring cells share a common power and ground bus. The input and output pins are located on the upper and lower boundaries of the cell. Figure 1.21 shows the layout of a typical standard cell. Notice that the nMOS transistors are located closer to the ground rail while the pMOS transistors are placed closer to the power rail.
Figure-1.21: A standard cell layout example.

Figure 1.22 shows a floorplan for standard-cell based design. Inside the I/O frame, which is reserved for I/O cells, the chip area contains rows or columns of standard cells. Between cell rows are channels for dedicated inter-cell routing. As in the case of Sea-of-Gates, with over-the-cell routing, the channel areas can be reduced or even removed provided that the cell rows offer sufficient routing space. The physical design and layout of logic cells ensure that when cells are placed into rows, their heights are matched and neighboring cells can be abutted side-by-side, which provides natural connections for power and ground lines in each row. The signal delay, noise margins, and power consumption of each cell should also be optimized with proper sizing of transistors using circuit simulation.

Figure-1.22: A simplified floorplan of standard-cells-based design.

If a number of cells must share the same input and/or output signals, a common signal bus structure can also be incorporated into the standard-cell-based chip layout. Figure 1.23 shows the simplified symbolic view of a case where a signal bus has been inserted between the rows of standard cells. Note that in this case the chip consists of two blocks, and power/ground routing must be provided from both sides of the layout area. Standard-cell based designs may consist of several such macro-blocks, each corresponding to a specific unit of the system architecture such as ALU, control logic, etc.

Figure-1.23: Simplified floorplan consisting of two separate blocks and a common signal bus.

After chip logic design is done using standard cells in the library, the most challenging task is to place individual cells into rows and interconnect them in a way that meets stringent design goals in circuit speed, chip area, and power consumption.
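To give a feel for what "placing cells well" means, the sketch below compares two placements of the same small netlist using the half-perimeter wirelength, a widely used rough estimate of routing length. This metric is not mentioned in the source text and is introduced here only for illustration; the cell names, coordinates and nets are hypothetical.

```python
def half_perimeter_wirelength(placement, nets):
    """Sum, over all nets, of the half-perimeter of each net's bounding box.

    placement: dict mapping cell name -> (column, row) position
    nets:      iterable of tuples of cell names connected by one net
    """
    total = 0
    for net in nets:
        xs = [placement[cell][0] for cell in net]
        ys = [placement[cell][1] for cell in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


# Hypothetical netlist: two nets connecting four cells.
NETS = [
    ("inv1", "nand1"),
    ("nand1", "dff1", "nor1"),
]

# Two candidate placements on a grid of cell sites (column, row).
PLACEMENT_A = {"inv1": (0, 0), "nand1": (1, 0), "dff1": (2, 0), "nor1": (0, 1)}
PLACEMENT_B = {"inv1": (2, 0), "nand1": (1, 0), "dff1": (0, 0), "nor1": (0, 1)}

if __name__ == "__main__":
    print("A:", half_perimeter_wirelength(PLACEMENT_A, NETS))
    print("B:", half_perimeter_wirelength(PLACEMENT_B, NETS))
```

Swapping two cells changes the estimate, which is the kind of cost a placement tool tries to reduce; real tools also have to account for timing, congestion and power, which this sketch ignores.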
Many advanced CAD tools for place-and-route have been developed and used to achieve such goals. Also, from the chip layout, circuit models which include interconnect parasitics can be extracted and used for timing simulation and analysis to identify timing-critical paths. For timing-critical paths, proper gate sizing is often practiced to meet the timing requirements. In many VLSI chips, such as microprocessors and digital signal processing chips, standard-cells based design is used to implement complex control logic modules. Some full custom chips can also be implemented exclusively with standard cells.

Finally, Fig. 1.24 shows the detailed mask layout of a standard-cell-based chip with an uninterrupted single block of cell rows, and three memory banks placed on one side of the chip. Notice that within the cell block, the separations between neighboring rows depend on the number of wires in the routing channel between the cell rows. If a high interconnect density can be achieved in the routing channel, the standard cell rows can be placed closer to each other, resulting in a smaller chip area. The availability of dedicated memory blocks also reduces the area, since the realization of memory elements using standard cells would occupy a larger area.

Figure-1.24: Mask layout of a standard-cell-based chip with a single block of cells and three memory banks.

1.5.4 Full Custom Design

Although the standard-cells based design is often called full custom design, in a strict sense it is somewhat less than fully custom, since the cells are pre-designed for general use and the same cells are utilized in many different chip designs. In a truly full custom design, the entire mask design is done anew without the use of any library. However, the development cost of such a design style is becoming prohibitively high.
Thus, the concept of design reuse is becoming popular in order to reduce design cycle time and development cost. The most rigorous full custom design can be the design of a memory cell, be it static or dynamic. Since the same layout design is replicated, there would not be any alternative to high-density memory chip design. For logic chip design, a good compromise can be achieved by using a combination of different design styles on the same chip, such as standard cells, data-path cells and PLAs. In real full-custom layout, in which the geometry, orientation and placement of every transistor is done individually by the designer, design productivity is usually very low - typically 10 to 20 transistors per day, per designer.

In digital CMOS VLSI, full-custom design is rarely used due to the high labor cost. Exceptions to this include the design of high-volume products such as memory chips, high-performance microprocessors and FPGA masters. Figure 1.25 shows the full layout of the Intel 486 microprocessor chip, which is a good example of a hybrid full-custom design. Here, one can identify four different design styles on one chip: memory banks (RAM cache), data-path units consisting of bit-slice cells, control circuitry mainly consisting of standard cells, and PLA blocks.

Figure-1.25: Mask layout of the Intel 486 microprocessor chip, as an example of full-custom design.

Figure-1.26: Overview of VLSI design styles.

This chapter edited by Y. Leblebici
Chapter 2 CMOS FABRICATION TECHNOLOGY AND DESIGN RULES

Notice: This chapter is largely based on Chapter 2 (Fabrication of MOSFETs) of the book CMOS Digital Integrated Circuit Design - Analysis and Design by S.M. Kang and Y. Leblebici.

Introduction
Fabrication Process Flow - Basic Steps
The CMOS n-Well Process
Advanced CMOS Fabrication Technologies
Layout Design Rules

2.1 Introduction

In this chapter, the fundamentals of MOS chip fabrication will be discussed and the major steps of the process flow will be examined. It is not the aim of this chapter to present a detailed discussion of silicon fabrication technology, which deserves separate treatment in a dedicated course. Rather, the emphasis will be on the general outline of the process flow and on the interaction of various processing steps, which ultimately determine the device and the circuit performance characteristics. The following chapters show that there are very strong links between the fabrication process, the circuit design process and the performance of the resulting chip. Hence, circuit designers must have a working knowledge of chip fabrication to create effective designs and to optimize the circuits with respect to various manufacturing parameters. Also, the circuit designer must have a clear understanding of the roles of various masks used in the fabrication process, and how the masks are used to define various features of the devices on-chip.

The following discussion will concentrate on the well-established CMOS fabrication technology, which requires that both n-channel (nMOS) and p-channel (pMOS) transistors be built on the same chip substrate. To accommodate both nMOS and pMOS devices, special regions must be created in which the semiconductor type is opposite to the substrate type. These regions are called wells or tubs. A p-well is created in an n-type substrate or, alternatively, an n-well is created in a p-type substrate.
In the simple n-well CMOS fabrication technology presented, the nMOS transistor is created in the p-type substrate, and the pMOS transistor is created in the n-well, which is built into the p-type substrate. In the twin-tub CMOS technology, additional tubs of the same type as the substrate can also be created for device optimization.

The simplified process sequence for the fabrication of CMOS integrated circuits on a p-type silicon substrate is shown in Fig. 2.1. The process starts with the creation of the n-well regions for pMOS transistors, by impurity implantation into the substrate. Then, a thick oxide is grown in the regions surrounding the nMOS and pMOS active regions. The thin gate oxide is subsequently grown on the surface through thermal oxidation. These steps are followed by the creation of n+ and p+ regions (source, drain and channel-stop implants) and by final metallization (creation of metal interconnects).

Figure-2.1: Simplified process sequence for fabrication of the n-well CMOS integrated circuit with a single polysilicon layer, showing only major fabrication steps.

The process flow sequence pictured in Fig. 2.1 may at first seem to be too abstract, since detailed fabrication steps are not shown. To obtain a better understanding of the issues involved in the semiconductor fabrication process, we first have to consider some of the basic steps in more detail.

2.2 Fabrication Process Flow - Basic Steps

Note that each processing step requires that certain areas are defined on chip by appropriate masks. Consequently, the integrated circuit may be viewed as a set of patterned layers of doped silicon, polysilicon, metal and insulating silicon dioxide. In general, a layer must be patterned before the next layer of material is applied on chip.
The process used to transfer a pattern to a layer on the chip is called lithography. Since each layer has its own distinct patterning requirements, the lithographic sequence must be repeated for every layer, using a different mask.

To illustrate the fabrication steps involved in patterning silicon dioxide through optical lithography, let us first examine the process flow shown in Fig. 2.2. The sequence starts with the thermal oxidation of the silicon surface, by which an oxide layer of about 1 micrometer thickness, for example, is created on the substrate (Fig. 2.2(b)). The entire oxide surface is then covered with a layer of photoresist, which is essentially a light-sensitive, acid-resistant organic polymer, initially insoluble in the developing solution (Fig. 2.2(c)). If the photoresist material is exposed to ultraviolet (UV) light, the exposed areas become soluble so that they are no longer resistant to etching solvents. To selectively expose the photoresist, we have to cover some of the areas on the surface with a mask during exposure. Thus, when the structure with the mask on top is exposed to UV light, areas which are covered by the opaque features on the mask are shielded. In the areas where the UV light can pass through, on the other hand, the photoresist is exposed and becomes soluble (Fig. 2.2(d)).

Figure-2.2: Process steps required for patterning of silicon dioxide.

The type of photoresist which is initially insoluble and becomes soluble after exposure to UV light is called positive photoresist. The process sequence shown in Fig. 2.2 uses positive photoresist. There is another type of photoresist which is initially soluble and becomes insoluble (hardened) after exposure to UV light, called negative photoresist.
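The difference between the two resist types can be made concrete with a small model. The following Python sketch is only an illustration and is not part of the original text: it treats the surface as a one-dimensional row of positions, and the function name resist_after_development and the list-of-booleans mask encoding are assumptions made for this example.

```python
def resist_after_development(mask_opaque, resist_type="positive"):
    """Return, per position, whether photoresist remains after development.

    mask_opaque[i] is True where an opaque mask feature blocks the UV light
    (hypothetical 1-D encoding chosen for this sketch).
    """
    if resist_type not in ("positive", "negative"):
        raise ValueError("resist_type must be 'positive' or 'negative'")
    remaining = []
    for opaque in mask_opaque:
        exposed = not opaque
        if resist_type == "positive":
            # Positive resist: exposed areas become soluble and are removed.
            remaining.append(not exposed)
        else:
            # Negative resist: exposed areas harden and stay in place.
            remaining.append(exposed)
    return remaining


mask = [True, True, False, False, True]  # True = shielded by the mask
print(resist_after_development(mask, "positive"))
print(resist_after_development(mask, "negative"))
```

With the same mask, the two resist types leave complementary patterns, which is why the choice of resist determines whether the mask features or the gaps between them end up protected during the subsequent etch.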
If negative photoresist is used in the photolithography process, the areas which are not shielded from the UV light by the opaque mask features become insoluble, whereas the shielded areas can subsequently be etched away by a developing solution. Negative photoresists are more sensitive to light, but their photolithographic resolution is not as high as that of the positive photoresists. Therefore, negative photoresists are used less commonly in the manufacturing of high-density integrated circuits.

Following the UV exposure step, the exposed (soluble) portions of the positive photoresist can be removed by a solvent. Now, the silicon dioxide regions which are not covered by the remaining hardened photoresist can be etched away either by using a chemical solvent (HF acid) or by using a dry etch (plasma etch) process (Fig. 2.2(e)). Note that at the end of this step, we obtain an oxide window that reaches down to the silicon surface (Fig. 2.2(f)). The remaining photoresist can now be stripped from the silicon dioxide surface by using another solvent, leaving the patterned silicon dioxide feature on the surface as shown in Fig. 2.2(g).

The sequence of process steps illustrated in detail in Fig. 2.2 actually accomplishes a single pattern transfer onto the silicon dioxide surface, as shown in Fig. 2.3. The fabrication of semiconductor devices requires several such pattern transfers to be performed on silicon dioxide, polysilicon, and metal. The basic patterning process used in all fabrication steps, however, is quite similar to the one shown in Fig. 2.2. Also note that for accurate generation of high-density patterns required in sub-micron devices, electron beam (E-beam) lithography is used instead of optical lithography. In the following, the main processing steps involved in the fabrication of an n-channel MOS transistor on p-type silicon substrate will be examined.
Figure-2.3: The result of a single lithographic patterning sequence on silicon dioxide, without showing the intermediate steps. Compare the unpatterned structure (top) and the patterned structure (bottom) with Fig. 2.2(b) and Fig. 2.2(g), respectively.

The process starts with the oxidation of the silicon substrate (Fig. 2.4(a)), in which a relatively thick silicon dioxide layer, also called field oxide, is created on the surface (Fig. 2.4(b)). Then, the field oxide is selectively etched to expose the silicon surface on which the MOS transistor will be created (Fig. 2.4(c)). Following this step, the surface is covered with a thin, high-quality oxide layer, which will eventually form the gate oxide of the MOS transistor (Fig. 2.4(d)). On top of the thin oxide, a layer of polysilicon (polycrystalline silicon) is deposited (Fig. 2.4(e)). Polysilicon is used both as gate electrode material for MOS transistors and also as an interconnect medium in silicon integrated circuits. Undoped polysilicon has relatively high resistivity. The resistivity of polysilicon can be reduced, however, by doping it with impurity atoms.

After deposition, the polysilicon layer is patterned and etched to form the interconnects and the MOS transistor gates (Fig. 2.4(f)). The thin gate oxide not covered by polysilicon is also etched away, which exposes the bare silicon surface on which the source and drain junctions are to be formed (Fig. 2.4(g)). The entire silicon surface is then doped with a high concentration of impurities, either through diffusion or ion implantation (in this case with donor atoms to produce n-type doping). Figure 2.4(h) shows that the doping penetrates the exposed areas on the silicon surface, ultimately creating two n-type regions (source and drain junctions) in the p-type substrate. The impurity doping also penetrates the polysilicon on the surface, reducing its resistivity.
Note that the polysilicon gate, which is patterned before doping, actually defines the precise location of the channel region and, hence, the location of the source and the drain regions. Since this procedure allows very precise positioning of the two regions relative to the gate, it is also called the self-aligned process.

Figure-2.4: Process flow for the fabrication of an n-type MOSFET on p-type silicon.

Once the source and drain regions are completed, the entire surface is again covered with an insulating layer of silicon dioxide (Fig. 2.4(i)). The insulating oxide layer is then patterned in order to provide contact windows for the drain and source junctions (Fig. 2.4(j)). The surface is covered with evaporated aluminum which will form the interconnects (Fig. 2.4(k)). Finally, the metal layer is patterned and etched, completing the interconnection of the MOS transistors on the surface (Fig. 2.4(l)). Usually, a second (and third) layer of metallic interconnect can also be added on top of this structure by creating another insulating oxide layer, cutting contact (via) holes, depositing, and patterning the metal.

2.3 The CMOS n-Well Process

Having examined the basic process steps for pattern transfer through lithography, and having gone through the fabrication procedure of a single n-type MOS transistor, we can now return to the generalized fabrication sequence of n-well CMOS integrated circuits, as shown in Fig. 2.1.
In the following figures, some of the important process steps involved in the fabrication of a CMOS inverter will be shown by a top view of the lithographic masks and a cross-sectional view of the relevant areas.

The n-well CMOS process starts with a moderately doped (with impurity concentration typically less than 10^15 cm^-3) p-type silicon substrate. Then, an initial oxide layer is grown on the entire surface. The first lithographic mask defines the n-well region. Donor atoms, usually phosphorus, are implanted through this window in the oxide. Once the n-well is created, the active areas of the nMOS and pMOS transistors can be defined. Figures 2.5 through 2.10 illustrate the significant milestones that occur during the fabrication process of a CMOS inverter.

Figure-2.5: Following the creation of the n-well region, a thick field oxide is grown in the areas surrounding the transistor active regions, and a thin gate oxide is grown on top of the active regions. The thickness and the quality of the gate oxide are two of the most critical fabrication parameters, since they strongly affect the operational characteristics of the MOS transistor, as well as its long-term reliability.

Figure-2.6: The polysilicon layer is deposited using chemical vapor deposition (CVD) and patterned by dry (plasma) etching. The created polysilicon lines will function as the gate electrodes of the nMOS and the pMOS transistors and their interconnects. Also, the polysilicon gates act as self-aligned masks for the source and drain implantations that follow this step.
Figure-2.7: Using a set of two masks, the n+ and p+ regions are implanted into the substrate and into the n-well, respectively. Also, the ohmic contacts to the substrate and to the n-well are implanted in this process step.

Figure-2.8: An insulating silicon dioxide layer is deposited over the entire wafer using CVD. Then, the contacts are defined and etched away to expose the silicon or polysilicon contact windows. These contact windows are necessary to complete the circuit interconnections using the metal layer, which is patterned in the next step.

Figure-2.9: Metal (aluminum) is deposited over the entire chip surface using metal evaporation, and the metal lines are patterned through etching. Since the wafer surface is non-planar, the quality and the integrity of the metal lines created in this step are very critical and are ultimately essential for circuit reliability.

Figure-2.10: The composite layout and the resulting cross-sectional view of the chip, showing one nMOS and one pMOS transistor (built-in n-well), the polysilicon and metal interconnections. The final step is to deposit the passivation layer (for protection) over the chip, except for wire-bonding pad areas.

The patterning process by the use of a succession of masks and process steps is conceptually summarized in Fig. 2.11. It is seen that a series of masking steps must be sequentially performed for the desired patterns to be created on the wafer surface. An example of the end result of this sequence is shown as a cross-section on the right.
Figure-2.11: Conceptual illustration of the mask sequence applied to create desired structures.

2.4 Advanced CMOS Fabrication Technologies

In this section, two examples will be given for advanced CMOS processes which offer additional benefits in terms of device performance and integration density. These processes, namely, the twin-tub CMOS process and the silicon-on-insulator (SOI) process, are becoming increasingly popular for sub-micron geometries where device performance and density must be pushed beyond the limits of the conventional n-well CMOS process.

Twin-Tub (Twin-Well) CMOS Process

This technology provides the basis for separate optimization of the nMOS and pMOS transistors, thus making it possible for threshold voltage, body effect and the channel transconductance of both types of transistors to be tuned independently. Generally, the starting material is an n+ or p+ substrate, with a lightly doped epitaxial layer on top. This epitaxial layer provides the actual substrate on which the n-well and the p-well are formed. Since two independent doping steps are performed for the creation of the well regions, the dopant concentrations can be carefully optimized to produce the desired device characteristics.

In the conventional n-well CMOS process, the doping density of the well region is typically about one order of magnitude higher than the substrate, which, among other effects, results in unbalanced drain parasitics. The twin-tub process (Fig. 2.12) also avoids this problem.

Figure-2.12: Cross-section of nMOS and pMOS transistors in twin-tub CMOS process.

Silicon-on-Insulator (SOI) CMOS Process

Rather than using silicon as the substrate material, technologists have sought to use an insulating substrate to improve process characteristics such as speed and latch-up susceptibility.
The SOI CMOS technology allows the creation of independent, completely isolated nMOS and pMOS transistors virtually side-by-side on an insulating substrate (for example: sapphire). The main advantages of this technology are the higher integration density (because of the absence of well regions), complete avoidance of the latch-up problem, and lower parasitic capacitances compared to the conventional n-well or twin-tub CMOS processes. A cross-section of nMOS and pMOS devices created using the SOI process is shown in Fig. 2.13.

The SOI CMOS process is considerably more costly than the standard n-well CMOS process. Yet the improvements of device performance and the absence of latch-up problems can justify its use, especially for deep-sub-micron devices.

Figure-2.13: Cross-section of nMOS and pMOS transistors in SOI CMOS process.

2.5 Layout Design Rules

The physical mask layout of any circuit to be manufactured using a particular process must conform to a set of geometric constraints or rules, which are generally called layout design rules. These rules usually specify the minimum allowable line widths for physical objects on-chip such as metal and polysilicon interconnects or diffusion areas, minimum feature dimensions, and minimum allowable separations between two such features. If a metal line width is made too small, for example, it is possible for the line to break during the fabrication process or afterwards, resulting in an open circuit. If two lines are placed too close to each other in the layout, they may form an unwanted short circuit by merging during or after the fabrication process. The main objective of design rules is to achieve a high overall yield and reliability while using the smallest possible silicon area, for any circuit to be manufactured with a particular process.
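The width and spacing constraints described above lend themselves to a mechanical check. The following Python sketch shows one simplified way such a check might look for axis-aligned rectangles on a single layer. It is an illustration only, not a description of any real DRC tool: the rectangle encoding (x1, y1, x2, y2), the function names, and the numeric limits in the example are all assumptions made here.

```python
from itertools import combinations
from math import hypot


def width(rect):
    """Smaller side of an axis-aligned rectangle (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = rect
    return min(x2 - x1, y2 - y1)


def spacing(a, b):
    """Edge-to-edge distance between two rectangles (0 if they touch or overlap)."""
    dx = max(b[0] - a[2], a[0] - b[2], 0)
    dy = max(b[1] - a[3], a[1] - b[3], 0)
    return hypot(dx, dy)


def check_layer(rects, min_width, min_spacing):
    """Return a list of width and spacing violations for one layer."""
    violations = []
    for i, r in enumerate(rects):
        if width(r) < min_width:
            violations.append(("width", i))
    for (i, a), (j, b) in combinations(enumerate(rects), 2):
        s = spacing(a, b)
        # Touching or overlapping shapes are treated as one merged shape
        # (a simplifying assumption of this sketch).
        if 0 < s < min_spacing:
            violations.append(("spacing", i, j))
    return violations


# Hypothetical metal shapes, with illustrative limits of 3 units each.
metal = [(0, 0, 3, 20), (5, 0, 8, 20), (12, 0, 14, 20)]
print(check_layer(metal, min_width=3, min_spacing=3))
```

In this example the third line is narrower than the limit and the first two lines are closer together than the limit, so one width violation and one spacing violation are reported.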
Note that there is usually a trade-off between higher yield, which is obtained through conservative geometries, and better area efficiency, which is obtained through aggressive, high-density placement of various features on the chip. The layout design rules which are specified for a particular fabrication process normally represent a reasonable optimum point in terms of yield and density. It must be emphasized, however, that the design rules do not represent strict boundaries which separate "correct" designs from "incorrect" ones. A layout which violates some of the specified design rules may still result in an operational circuit with reasonable yield, whereas another layout observing all specified design rules may result in a circuit which is not functional and/or has very low yield. To summarize, we can say, in general, that observing the layout design rules significantly increases the probability of fabricating a successful product with high yield.

The design rules are usually described in two ways:

Micron rules, in which the layout constraints, such as minimum feature sizes and minimum allowable feature separations, are stated in terms of absolute dimensions in micrometers, or

Lambda rules, which specify the layout constraints in terms of a single parameter (λ) and, thus, allow linear, proportional scaling of all geometrical constraints.

Lambda-based layout design rules were originally devised to simplify the industry-standard micron-based design rules and to allow scaling capability for various processes. It must be emphasized, however, that most of the submicron CMOS process design rules do not lend themselves to straightforward linear scaling. The use of lambda-based design rules must therefore be handled with caution in sub-micron geometries.
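The linear scaling behind lambda rules amounts to multiplying every rule by the same factor. The short Python sketch below illustrates this under stated assumptions: the rule names and the dictionary layout are invented for the example, the multipliers follow the sample rule set in this section, and the two values of λ are arbitrary illustrative choices rather than data for any actual process.

```python
# Rule multipliers in units of lambda (names are hypothetical labels).
LAMBDA_RULES = {
    "min_poly_width": 2,
    "min_poly_spacing": 2,
    "min_metal_width": 3,
    "min_metal_spacing": 3,
}


def to_microns(rules, lambda_um):
    """Scale every lambda-based rule by the same factor (in micrometers)."""
    return {name: multiple * lambda_um for name, multiple in rules.items()}


# Two illustrative values of lambda; real processes define their own.
for lambda_um in (1.0, 0.5):
    print(lambda_um, to_microns(LAMBDA_RULES, lambda_um))
```

Because every entry is scaled by the same factor, the ratios between rules stay fixed. This is exactly the property that real sub-micron rule sets tend not to have, which is why the caution above applies.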
In the following, we present a sample set of the lambda-based layout design rules devised for the MOSIS CMOS process and illustrate the implications of these rules on a section of a simple layout which includes two transistors (Fig. 2.14).

MOSIS Layout Design Rules (sample set)

Rule number   Description                                                     λ-Rule
R1    Minimum active area width                                               3 λ
R2    Minimum active area spacing                                             3 λ
R3    Minimum poly width                                                      2 λ
R4    Minimum poly spacing                                                    2 λ
R5    Minimum gate extension of poly over active                              2 λ
R6    Minimum poly-active edge spacing (poly outside active area)             1 λ
R7    Minimum poly-active edge spacing (poly inside active area)              3 λ
R8    Minimum metal width                                                     3 λ
R9    Minimum metal spacing                                                   3 λ
R10   Poly contact size                                                       2 λ
R11   Minimum poly contact spacing                                            2 λ
R12   Minimum poly contact to poly edge spacing                               1 λ
R13   Minimum poly contact to metal edge spacing                              1 λ
R14   Minimum poly contact to active edge spacing                             3 λ
R15   Active contact size                                                     2 λ
R16   Minimum active contact spacing (on the same active region)              2 λ
R17   Minimum active contact to active edge spacing                           1 λ
R18   Minimum active contact to metal edge spacing                            1 λ
R19   Minimum active contact to poly edge spacing                             3 λ
R20   Minimum active contact spacing (on different active regions)            6 λ

Figure-2.14: Illustration of some of the typical MOSIS layout design rules listed above.

This chapter edited by Y. Leblebici.

Chapter 3 FULL-CUSTOM MASK LAYOUT DESIGN

Introduction
CMOS Layout Design Rules
CMOS Inverter Layout Design
Layout of CMOS NAND and NOR Gates
Complex CMOS Logic Gates

3.1 Introduction

In this chapter, the basic mask layout design guidelines for CMOS logic gates will be presented. The design of physical layout is very tightly linked to overall circuit performance (area, speed, power dissipation) since the physical structure directly determines the transconductances of the transistors, the parasitic capacitances and resistances, and obviously, the silicon area which is used for a certain function. On the other hand, the detailed mask layout of logic gates requires a very intensive and time-consuming design effort, which is justifiable only in special circumstances where the area and/or the performance of the circuit must be optimized under very tight constraints. Therefore, automated layout generation (e.g., standard cells + computer-aided placement and routing) is typically preferred for the design of most digital VLSI circuits. In order to judge the physical constraints and limitations, however, the VLSI designer must also have a good understanding of the physical mask layout process.

Mask layout drawings must strictly conform to a set of layout design rules as described in Chapter 2; therefore, we will start this chapter with the review of a complete design rule set. The design of a simple CMOS inverter will be presented step-by-step, in order to show the influence of various design rules on the mask structure and on the dimensions. Also, we will introduce the concept of stick diagrams, which can be used very effectively to simplify the overall topology of layout in the early design phases.
With the help of stick diagrams, the designer can have a good understanding of the topological constraints, and quickly test several possibilities for the optimum layout without actually drawing a complete mask diagram.

The physical (mask layout) design of CMOS logic gates is an iterative process which starts with the circuit topology (to realize the desired logic function) and the initial sizing of the transistors (to realize the desired performance specifications). At this point, the designer can only estimate the total parasitic load at the output node, based on the fan-out, the number of devices, and the expected length of the interconnection lines. If the logic gate contains more than 4-6 transistors, the topological graph representation and the Euler-path method allow the designer to determine the optimum ordering of the transistors. A simple stick diagram layout can now be drawn, showing the locations of the transistors, the local interconnections between the transistors and the locations of the contacts.

After a topologically feasible layout is found, the mask layers are drawn (using a layout editor tool) according to the layout design rules. This procedure may require several small iterations in order to accommodate all design rules, but the basic topology should not change very significantly. Following the final DRC (Design Rule Check), a circuit extraction procedure is performed on the finished layout to determine the actual transistor sizes, and more importantly, the parasitic capacitances at each node. The result of the extraction step is usually a detailed SPICE input file, which is automatically generated by the extraction tool.

Figure-3.1: The typical design flow for the production of a mask layout.
Now, the actual performance of the circuit can be determined by performing a SPICE simulation, using the extracted net-list. If the simulated circuit performance (e.g., transient response times or power dissipation) does not match the desired specifications, the layout must be modified and the whole process must be repeated. The layout modifications are usually concentrated on the (W/L) ratios of the transistors (transistor re-sizing), since the width-to-length ratios of the transistors determine the device transconductance and the parasitic source/drain capacitances. The designer may also decide to change parts or all of the circuit topology in order to reduce the parasitics. The flow diagram of this iterative process is shown in Fig. 3.1.

3.2 CMOS Layout Design Rules

As already discussed in Chapter 2, each mask layout design must conform to a set of layout design rules, which dictate the geometrical constraints imposed upon the mask layers by the technology and by the fabrication process. The layout designer must follow these rules in order to guarantee a certain yield for the finished product, i.e., a certain ratio of acceptable chips out of a fabrication batch. A design which violates some of the layout design rules may still result in a functional chip, but the yield is expected to be lower because of random process variations.

The design rules below are given in terms of scalable lambda-rules. Note that while the concept of scalable design rules is very convenient for defining a technology-independent mask layout and for memorizing the basic constraints, most of the rules do not scale linearly, especially for sub-micron technologies. This fact is illustrated in the right column, where a representative rule set is given in real micron dimensions. A simple comparison with the lambda-based rules shows that there are significant differences.
Therefore, lambda-based design rules are simply not useful for sub-micron CMOS technologies.

Figure-3.2: Illustration of CMOS layout design rules.

3.3 CMOS Inverter Layout Design

In the following, the mask layout design of a CMOS inverter will be examined step-by-step. The circuit consists of one nMOS and one pMOS transistor; therefore, one would assume that the layout topology is relatively simple. Yet, we will see that there exist quite a number of different design possibilities even for this very simple circuit.

First, we need to create the individual transistors according to the design rules. Assume that we attempt to design the inverter with minimum-size transistors. The width of the active area is then determined by the minimum diffusion contact size (which is necessary for source and drain connections) and the minimum separation from diffusion contact to both active area edges. The width of the polysilicon line over the active area (which is the gate of the transistor) is typically taken as the minimum poly width (Fig. 3.3). Then, the overall length of the active area is simply determined by the following sum: (minimum poly width) + 2 x (minimum poly-to-contact spacing) + 2 x (minimum contact size) + 2 x (minimum spacing from contact to active area edge).

The pMOS transistor must be placed in an n-well region, and the minimum size of the n-well is dictated by the pMOS active area and the minimum n-well overlap over n+. The distance between the nMOS and the pMOS transistor is determined by the minimum separation between the n+ active area and the n-well (Fig. 3.4). The polysilicon gates of the nMOS and the pMOS transistors are usually aligned. The final step in the mask layout is the local interconnections in metal, for the output node and for the VDD and GND contacts (Fig. 3.5).
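The dimension sums for the minimum-size transistor described above can be sketched as a short calculation. The Python example below is only an illustration under stated assumptions: the function names are invented here, the contact's own size is counted as an explicit term on each side of the gate, and the numeric values are the sample lambda rules from Chapter 2 (poly width 2, active contact size 2, contact-to-poly spacing 3, contact-to-active-edge spacing 1), not values for any specific process.

```python
def min_active_width(contact_size, contact_to_edge):
    """Active area width: one contact plus its clearance to both edges."""
    return contact_size + 2 * contact_to_edge


def min_active_length(poly_width, contact_to_poly, contact_size, contact_to_edge):
    """Active area length: the gate plus a contact and its clearances on each side."""
    per_side = contact_to_poly + contact_size + contact_to_edge
    return poly_width + 2 * per_side


# Sample values in units of lambda (assumed from the Chapter 2 rule set).
POLY_WIDTH = 2
CONTACT_SIZE = 2
CONTACT_TO_POLY = 3
CONTACT_TO_EDGE = 1

print(min_active_width(CONTACT_SIZE, CONTACT_TO_EDGE))
print(min_active_length(POLY_WIDTH, CONTACT_TO_POLY, CONTACT_SIZE, CONTACT_TO_EDGE))
```

With these sample values the active area comes out 4 lambda wide and 14 lambda long. A different rule set changes the numbers but not the structure of the sums.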
Notice that in order to be biased properly, the n-well region must also have a VDD contact.

Figure-3.3: Design rule constraints which determine the dimensions of a minimum-size transistor.

Figure-3.4: Placement of one nMOS and one pMOS transistor.

Figure-3.5: Complete mask layout of the CMOS inverter.

The initial phase of layout design can be simplified significantly by the use of stick diagrams - or so-called symbolic layouts. Here, the detailed layout design rules are simply neglected and the main features (active areas, polysilicon lines, metal lines) are represented by constant width rectangles or simple sticks. The purpose of the stick diagram is to provide the designer a good understanding of the topological constraints, and to quickly test several possibilities for the optimum layout without actually drawing a complete mask diagram.

In the following, we will examine a series of stick diagrams which show different layout options for the CMOS inverter circuit. The first two stick diagram layouts shown in Fig. 3.6 are the two most basic inverter configurations, with different alignments of the transistors. In some cases, other signals must be routed over the inverter. For instance, if one or two metal lines have to be passed through the middle of the cell from left to right, horizontal metal straps can be used to access the drain terminals of the transistors, which in turn connect to a vertical Metal-2 line. Metal-1 can now be used to route the signals passing through the inverter. Alternatively, the diffusion areas of both transistors may be used for extending the power and ground connections. This makes the inverter transistors transparent to horizontal metal lines which may pass over. The addition of a second metal layer allows more interconnect freedom.
The second-level metal can be used for power and ground supply lines, or alternatively, it may be used to vertically strap the input and the output signals. The final layout example in Fig. 3.6 shows one possibility of using a third metal layer, which is utilized for routing three signals on top.

Figure-3.6: Stick diagrams showing various CMOS inverter layout options.

3.4 Layout of CMOS NAND and NOR Gates

The mask layout designs of CMOS NAND and NOR gates follow the general principles examined earlier for the CMOS inverter layout. Figure 3.7 shows the sample layouts of a two-input NOR gate and a two-input NAND gate, using single-layer polysilicon and single-layer metal. Here, the p-type diffusion area for the pMOS transistors and the n-type diffusion area for the nMOS transistors are aligned in parallel to allow simple routing of the gate signals with two parallel polysilicon lines running vertically. Also notice that the two mask layouts show a very strong symmetry, due to the fact that the NAND and the NOR gate have a symmetrical circuit topology. Finally, Figs. 3.8 and 3.9 show the major steps of the mask layout design for both gates, starting from the stick diagram and progressively defining the mask layers.

Figure-3.7: Sample layouts of a CMOS NOR2 gate and a CMOS NAND2 gate.

Figure-3.8: Major steps required for generating the mask layout of a CMOS NOR2 gate.
Figure-3.9: Major steps required for generating the mask layout of a CMOS NAND2 gate.

3.5 Complex CMOS Logic Gates

The realization of complex Boolean functions (which may include several input variables and several product terms) typically requires a series-parallel network of nMOS transistors, which constitutes the so-called pull-down net, and a corresponding dual network of pMOS transistors, which constitutes the pull-up net. Figure 3.10 shows the circuit diagram and the corresponding network graphs of a complex CMOS logic gate. Once the network topology of the nMOS pull-down network is known, the pull-up network of pMOS transistors can easily be constructed by using the dual-graph concept.

Figure-3.10: A complex CMOS logic gate realizing a Boolean function with 5 input variables.

Now, we will investigate the problem of constructing a minimum-area layout for the complex CMOS logic gate. Figure 3.11 shows the stick-diagram layout of a "first attempt", using an arbitrary ordering of the polysilicon gate columns. Note that in this case, the separation between the polysilicon columns must be sufficiently wide to allow for two metal-diffusion contacts (one on each side) and one diffusion-diffusion separation. This certainly consumes a considerable amount of extra silicon area. If we can minimize the number of active-area breaks both for the nMOS and for the pMOS transistors, the separation between the polysilicon gate columns can be made smaller. This, in turn, will reduce the overall horizontal dimension and the overall circuit layout area. The number of active-area breaks can be minimized by changing the ordering of the polysilicon columns, i.e., by changing the ordering of the transistors.
Figure-3.11: Stick diagram layout of the complex CMOS logic gate, with an arbitrary ordering of the polysilicon gate columns.

A simple method for finding the optimum gate ordering is the Euler-path method: simply find an Euler path in the pull-down network graph and an Euler path in the pull-up network graph with the identical ordering of input labels, i.e., find a common Euler path for both graphs. The Euler path is defined as an uninterrupted path that traverses each edge (branch) of the graph exactly once. Figure 3.12 shows the construction of a common Euler path for both graphs in our example.

Figure-3.12: Finding a common Euler path in both graphs for the pull-down and pull-up net provides a gate ordering that minimizes the number of active-area breaks. In both cases, the Euler path starts at (x) and ends at (y).

It is seen that there is a common sequence (E-D-A-B-C) in both graphs. The polysilicon gate columns can be arranged according to this sequence, which results in uninterrupted active areas for nMOS as well as for pMOS transistors. The stick diagram of the new layout is shown in Fig. 3.13. In this case, the separation between two neighboring poly columns must allow only for one metal-diffusion contact. The advantages of this new layout are a more compact (smaller) layout area, simple routing of signals, and correspondingly, smaller parasitic capacitance.

Figure-3.13: Optimized stick diagram layout of the complex CMOS logic gate.

It may not always be possible to construct a complete Euler path both in the pull-down and in the pull-up network. In that case, the best strategy is to find sub-Euler-paths in both graphs, which should be as long as possible.
This approach attempts to maximize the number of transistors which can be placed in a single, uninterrupted active area. Finally, Fig. 3.14 shows the circuit diagram of a CMOS one-bit full adder. The circuit has three inputs, and two outputs, sum and carry_out. The corresponding mask layout of this circuit is given in Fig. 3.15. All input and output signals have been arranged in vertical polysilicon columns. Notice that both the sum-circuit and the carry-circuit have been realized using one uninterrupted active area each.

Figure-3.14: Circuit diagram of the CMOS one-bit full adder.

Figure-3.15: Mask layout of the CMOS full adder circuit.

This chapter edited by Y. Leblebici

Chapter 4 PARASITIC EXTRACTION AND PERFORMANCE ESTIMATION FROM PHYSICAL STRUCTURE

Introduction
The Reality with Interconnections
MOSFET Capacitances
Interconnect Capacitance Estimation
Interconnect Resistance Estimation

4.1 Introduction

In this chapter, we will investigate some of the physical factors which determine and ultimately limit the performance of digital VLSI circuits. The switching characteristics of digital integrated circuits essentially dictate the overall operating speed of digital systems. The dynamic performance requirements of a digital system are usually among the most important design specifications that must be met by the circuit designer. Therefore, the switching speed of the circuits must be estimated and optimized very early in the design phase. The classical approach for determining the switching speed of a digital block is based on the assumption that the loads are mainly capacitive and lumped.
Relatively simple delay models exist for logic gates with a purely capacitive load at the output node; hence, the dynamic behavior of the circuit can be estimated easily once the load is determined. The conventional delay estimation approaches classify the gate load into three main components, all of which are assumed to be purely capacitive: (1) internal parasitic capacitances of the transistors, (2) interconnect (line) capacitances, and (3) input capacitances of the fan-out gates. Of these three components, the load conditions imposed by the interconnection lines present serious problems.

Figure-4.1: An inverter driving three other inverters over interconnection lines.

Figure 4.1 shows a simple situation where an inverter is driving three other inverters, linked over interconnection lines of different length and geometry. If the total load of each interconnection line can be approximated by a lumped capacitance, then the total load seen by the primary inverter is simply the sum of all capacitive components described above. The switching characteristics of the inverter are then described by the charge/discharge time of the load capacitance, as seen in Fig. 4.2. The expected output voltage waveform of the inverter is given in Fig. 4.3, where the propagation delay time is the primary measure of switching speed. It can be shown very easily that the signal propagation delay under these conditions is linearly proportional to the load capacitance. In most cases, however, the load conditions imposed by the interconnection line are far from being simple. The line, itself a three-dimensional structure in metal and/or polysilicon, usually has a non-negligible resistance in addition to its capacitance.
The length/width ratio of the wire usually dictates that the parameters are distributed, making the interconnect a true transmission line. Also, an interconnect is very rarely "alone", i.e., isolated from other influences. In real conditions, the interconnection line is in very close proximity to a number of other lines, either on the same level or on different levels. The capacitive/inductive coupling and the signal interference between neighboring lines should also be taken into consideration for an accurate estimation of delay.

Figure-4.2: CMOS inverter stage with lumped capacitive load at the output node.

Figure-4.3: Typical input and output waveforms of an inverter with purely capacitive load.

4.2 The Reality with Interconnections

Consider the following situation where an inverter is driving two other inverters, over long interconnection lines. In general, if the time of flight across the interconnection line (as determined by the speed of light in the surrounding dielectric) is much shorter than the signal rise/fall times, then the wire can be modeled as a capacitive load, or as a lumped or distributed RC network. If the interconnection lines are sufficiently long and the rise times of the signal waveforms are comparable to the time of flight across the line, then the inductance also becomes important, and the interconnection lines must be modeled as transmission lines. Taking into consideration the RLCG (resistance, inductance, capacitance, and conductance) parasitics (as seen in Fig. 4.4), the signal transmission across the wire becomes a very complicated matter, compared to the relatively simplistic lumped-load case. Note that the signal integrity can be significantly degraded, especially when the output impedance of the driver is significantly lower than the characteristic impedance of the transmission line.
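The comparison between time of flight and rise time can be turned into a quick check. The sketch below is an illustration under stated assumptions: the wire is taken to be surrounded by silicon dioxide (relative permittivity 3.9), and the thresholds of 5 and 2.5 times the time of flight are a common rule of thumb, not values given in this text. The function names are hypothetical.

```python
import math

C_LIGHT = 299_792_458.0  # speed of light in vacuum, m/s


def time_of_flight(length_m, eps_r=3.9):
    """Propagation time along a wire embedded in a dielectric (default SiO2)."""
    return length_m * math.sqrt(eps_r) / C_LIGHT


def suggest_model(length_m, t_rise_s, eps_r=3.9,
                  lumped_ratio=5.0, tline_ratio=2.5):
    """Pick a wire model from the ratio of rise time to time of flight.

    The two ratio thresholds are assumed rule-of-thumb values.
    """
    ratio = t_rise_s / time_of_flight(length_m, eps_r)
    if ratio > lumped_ratio:
        return "lumped or distributed RC"
    if ratio < tline_ratio:
        return "transmission line"
    return "either (borderline)"


# A 1 cm line driven with three different rise times.
for t_rise in (1e-9, 250e-12, 100e-12):
    print(f"{t_rise * 1e12:6.0f} ps -> {suggest_model(0.01, t_rise)}")
```

For a 1 cm line the time of flight is roughly 66 ps, so a 1 ns edge can still be handled with an RC model while a 100 ps edge cannot.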
Figure-4.4: (a) An RLCG interconnection tree. (b) Typical signal waveforms at the nodes A and B, showing the signal delay and the various delay components.

The transmission-line effects have not been a serious concern in CMOS VLSI until recently, since the gate delay originating from purely or mostly capacitive load components dominated the line delay in most cases. But as the fabrication technologies move to finer (sub-micron) design rules, the intrinsic gate delay components tend to decrease dramatically. By contrast, the overall chip size does not decrease - designers just put more functionality on the same sized chip. A 100 mm^2 chip has been a standard large chip for almost a decade. The factors which determine the chip size are mainly driven by the packaging technology, manufacturing equipment, and the yield. Since the chip size and the worst-case line length on a chip remain unchanged, the importance of interconnect delay increases in sub-micron technologies. In addition, as the widths of metal lines shrink, the transmission line effects and signal coupling between neighboring lines become even more pronounced. This fact is illustrated in Fig. 4.5, where typical intrinsic gate delay and interconnect delay are plotted qualitatively, for different technologies. It can be seen that for sub-micron technologies, the interconnect delay starts to dominate the gate delay. In order to deal with the implications and to optimize a system for speed, the designers must have reliable and efficient means of (1) estimating the interconnect parasitics in a large chip, and (2) simulating the time-domain effects. Yet we will see that neither of these tasks is simple - interconnect parasitic extraction and accurate simulation of line effects are two of the most difficult problems in physical design of VLSI circuits today.
Figure-4.5: Interconnect delay dominates gate delay in sub-micron CMOS technologies.

Once we establish the fact that the interconnection delay becomes a dominant factor in CMOS VLSI, the next question is: how many of the interconnections in a large chip may cause serious problems in terms of delay? The hierarchical structure of most VLSI designs offers some insight on this question. In a chip consisting of several functional modules, each module contains a relatively large number of local connections between its functional blocks, logic gates, and transistors. Since these intra-module connections are usually made over short distances, their influence on speed can be simulated easily with conventional models. Yet there are also a fair number of longer connections between the modules on a chip, the so-called inter-module connections. It is usually these inter-module connections which should be scrutinized in the early design phases for possible timing problems. Figure 4.6 shows the typical statistical distribution of wire lengths on a chip, normalized for the chip diagonal length. The distribution plot clearly exhibits two distinct peaks, one for the relatively shorter intra-module connections, and the other for the longer inter-module connections. Also note that a small number of interconnections may be very long, typically longer than the chip diagonal length. These lines are usually required for global signal bus connections, and for clock distribution networks. Although their numbers are relatively small, these long interconnections are obviously the most problematic ones.

Figure-4.6: Statistical distribution of interconnection length on a typical chip.

To summarize the message of this section: (1) interconnection delay is becoming the dominating factor which determines the dynamic performance of large-scale systems, and (2) interconnect parasitics are difficult to model and to simulate.
In the following sections, we will concentrate on various aspects of on-chip parasitics, and we will mainly consider capacitive and resistive components.

4.3 MOSFET Capacitances

The first component of capacitive parasitics we will examine is the MOSFET capacitances. These parasitic components are mainly responsible for the intrinsic delay of logic gates, and they can be modeled with fairly high accuracy for gate delay estimation. The extraction of transistor parasitics from physical structure (mask layout) is also fairly straightforward. The parasitic capacitances associated with a MOSFET are shown in Fig. 4.7 as lumped elements between the device terminals. Based on their physical origins, the parasitic device capacitances can be classified into two major groups: (1) oxide-related capacitances and (2) junction capacitances. The gate-oxide-related capacitances are Cgd (gate-to-drain capacitance), Cgs (gate-to-source capacitance), and Cgb (gate-to-substrate capacitance). Notice that in reality, the gate-to-channel capacitance is distributed and voltage dependent. Consequently, all of the oxide-related capacitances described here change with the bias conditions of the transistor. Figure 4.8 shows qualitatively the oxide-related capacitances during cut-off, linear-mode operation and saturation of the MOSFET. The simplified variation of the three capacitances with gate-to-source bias voltage is shown in Fig. 4.9.

Figure-4.7: Lumped representation of parasitic MOSFET capacitances.

Figure-4.8: Schematic representation of MOSFET oxide capacitances during (a) cut-off, (b) linear-mode operation, and (c) saturation.

Note that the total gate oxide capacitance is mainly determined by the parallel-plate capacitance between the polysilicon gate and the underlying structures.
Hence, the magnitude of the oxide-related capacitances is very closely related to (1) the gate oxide thickness, and (2) the area of the MOSFET gate. Obviously, the total gate capacitance decreases with decreasing device dimensions (W and L), yet it increases with decreasing gate oxide thickness. In sub-micron technologies, the horizontal dimensions (which dictate the gate area) are usually scaled down more easily than the vertical dimensions, such as the gate oxide thickness. Consequently, MOSFET transistors fabricated using sub-micron technologies have, in general, smaller gate capacitances.

Figure-4.9: Variation of oxide capacitances as functions of gate-to-source voltage.

Now we consider the voltage-dependent source-to-substrate and drain-to-substrate capacitances, Csb and Cdb. Both of these capacitances are due to the depletion charge surrounding the respective source or drain regions of the transistor, which are embedded in the substrate. Figure 4.10 shows the simplified geometry of an n-type diffusion region within the p-type substrate. Here, the diffusion region has been approximated by a rectangular box, which consists of five planar pn-junctions. The total junction capacitance is a function of the junction area (sum of all planar junction areas), the doping densities, and the applied terminal voltages. Accurate methods for estimating the junction capacitances based on these data are readily available in the literature; therefore, a detailed discussion of capacitance calculations will not be presented here. One important aspect of parasitic device junction capacitances is that the amount of capacitance is a linear function of the junction area. Consequently, the size of the drain or the source diffusion area dictates the amount of parasitic capacitance.
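The box approximation of the diffusion region lends itself to a short area calculation. The sketch below sums the five planar junctions (the bottom plus four sidewalls) and scales the result by a zero-bias capacitance per unit area. The 0.40 um junction depth is the n+ value quoted later in this chapter for a typical 0.8 micron process; the diffusion dimensions and the value of `cj0_fF_per_um2` are hypothetical. Treating all five faces with one capacitance density ignores the different doping along the sidewalls and the voltage dependence, so this is a first-order estimate only.

```python
def junction_area_um2(w_um, y_um, xj_um):
    """Total pn-junction area of a rectangular diffusion box.

    Five planar junctions: one bottom face plus four sidewalls of depth xj.
    """
    bottom = w_um * y_um
    sidewalls = 2 * (w_um + y_um) * xj_um
    return bottom + sidewalls


def zero_bias_junction_cap_fF(w_um, y_um, xj_um, cj0_fF_per_um2):
    """First-order estimate: capacitance is linear in the junction area."""
    return cj0_fF_per_um2 * junction_area_um2(w_um, y_um, xj_um)


# Hypothetical 2.0 um x 2.4 um drain with the 0.40 um junction depth.
area = junction_area_um2(2.0, 2.4, 0.4)
cap = zero_bias_junction_cap_fF(2.0, 2.4, 0.4, cj0_fF_per_um2=0.3)
print(f"junction area = {area:.2f} um^2, C_j(0) ~ {cap:.2f} fF")
```

For small diffusion regions the sidewall contribution is far from negligible, which is why the junction depth matters as much as the drawn area.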
In sub-micron technologies, where the overall dimensions of the individual devices are scaled down, the parasitic junction capacitances also decrease significantly. It was already mentioned that the MOSFET parasitic capacitances are mainly responsible for the intrinsic delay of logic gates. We have seen that both the oxide-related parasitic capacitances and the junction capacitances tend to decrease with shrinking device dimensions; hence, the relative significance of intrinsic gate delay diminishes in sub-micron technologies.

Figure-4.10: Three-dimensional view of the n-type diffusion region within the p-type substrate.

4.4 Interconnect Capacitance Estimation

In a typical VLSI chip, the parasitic interconnect capacitances are among the most difficult parameters to estimate accurately. Each interconnection line (wire) is a three-dimensional structure in metal and/or polysilicon, with significant variations of shape, thickness, and vertical distance from the ground plane (substrate). Also, each interconnect line is typically surrounded by a number of other lines, either on the same level or on different levels. Figure 4.11 shows a possible, realistic situation where interconnections on three different levels run in close proximity to each other. The accurate estimation of the parasitic capacitances of these wires with respect to the ground plane, as well as with respect to each other, is obviously a complicated task.

Figure-4.11: Example of six interconnect lines running on three different levels.

Unfortunately for the VLSI designers, most of the conventional computer-aided VLSI design tools have a relatively limited capability of interconnect parasitic estimation.
This is true even for the design tools regularly used for sub-micron VLSI design, where interconnect parasitics were shown to be very dominant. The designer should therefore be aware of the physical problem and try to incorporate this knowledge early in the design phase, when the initial floorplanning of the chip is done. First, consider the section of a single interconnect which is shown in Fig. 4.12. It is assumed that this wire segment has a length of (l) in the current direction, a width of (w) and a thickness of (t). Moreover, we assume that the interconnect segment runs parallel to the chip surface and is separated from the ground plane by a dielectric (oxide) layer of height (h). Now, the correct estimation of the parasitic capacitance with respect to ground is an important issue. Using the basic geometry given in Fig. 4.12, one can calculate the parallel-plate capacitance Cpp of the interconnect segment. However, in interconnect lines where the wire thickness (t) is comparable in magnitude to the ground-plane distance (h), fringing electric fields significantly increase the total parasitic capacitance (Fig. 4.13).

Figure-4.12: Interconnect segment running parallel to the surface, used for parasitic capacitance estimations.

Figure-4.13: Influence of fringing electric fields upon the parasitic wire capacitance.

Figure 4.14 shows the variation of the fringing-field factor FF = Ctotal/Cpp, as a function of (t/h), (w/h) and (w/l). It can be seen that the influence of fringing fields increases with the decreasing (w/h) ratio, and that the fringing-field capacitance can be as much as 10-20 times larger than the parallel-plate capacitance.
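As a minimal sketch of the parallel-plate term, the capacitance of the segment follows from its footprint (w times l) and the dielectric height (h), and the total can then be obtained by applying a fringing-field factor FF = Ctotal/Cpp. The sketch assumes a silicon dioxide dielectric (relative permittivity 3.9); the segment dimensions and the fringing factor of 3 are hypothetical illustration values, and the function names are not from the text.

```python
EPS0_F_PER_UM = 8.854e-18  # vacuum permittivity expressed in F/um
EPS_R_SIO2 = 3.9           # relative permittivity of silicon dioxide


def parallel_plate_cap_fF(w_um, l_um, h_um, eps_r=EPS_R_SIO2):
    """Cpp = eps * w * l / h, returned in femtofarads."""
    return eps_r * EPS0_F_PER_UM * w_um * l_um / h_um * 1e15


def total_cap_fF(w_um, l_um, h_um, fringing_factor):
    """Ctotal = FF * Cpp, with FF read off a curve such as Fig. 4.14."""
    return fringing_factor * parallel_plate_cap_fF(w_um, l_um, h_um)


# Hypothetical segment: 1 um wide, 100 um long, 1 um above the ground plane.
cpp = parallel_plate_cap_fF(1.0, 100.0, 1.0)
print(f"Cpp    = {cpp:.3f} fF")
print(f"Ctotal = {total_cap_fF(1.0, 100.0, 1.0, fringing_factor=3.0):.3f} fF")
```

Because FF grows as (w/h) shrinks, narrowing a line reduces Cpp proportionally but reduces the total capacitance by much less.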
It was mentioned earlier that the sub-micron fabrication technologies allow the width of the metal lines to be decreased somewhat, yet the thickness of the line must be preserved in order to ensure structural integrity. This situation, which involves narrow metal lines with a considerable vertical thickness, is especially vulnerable to fringing field effects.

Figure-4.14: Variation of the fringing-field factor with the interconnect geometry.

A set of simple formulas developed by Yuan and Trick in the early 1980s can be used to estimate the capacitance of the interconnect structures in which fringing fields complicate the parasitic capacitance calculation. The following two cases are considered for two different ranges of line width (w).

(4.1)

(4.2)

These formulas permit the approximation of the parasitic capacitance values to within 10% error, even for very small values of (t/h). Figure 4.15 shows a different view of the line capacitance as a function of (w/h) and (t/h). The linear dash-dotted line in this plot represents the corresponding parallel-plate capacitance, and the other two curves represent the actual capacitance, taking into account the fringing-field effects.

Figure-4.15: Capacitance of a single interconnect, as a function of (w/h) and (t/h).

Now consider the more realistic case where the interconnection line is not "alone" but is coupled with other lines running in parallel. In this case, the total parasitic capacitance of the line is not only increased by the fringing-field effects, but also by the capacitive coupling between the lines. Figure 4.16 shows the capacitance of a line which is coupled with two other lines on both sides, separated by the minimum design rule.
Especially if both of the neighboring lines are biased at ground potential, the total parasitic capacitance of the interconnect running in the middle (with respect to the ground plane) can be more than 20 times as large as the simple parallel-plate capacitance. Note that the capacitive coupling between neighboring lines is increased when the thickness of the wire is comparable to its width.

Figure-4.16: Capacitance of coupled interconnects, as a function of (w/h) and (t/h).

Figure 4.17 shows the cross-section view of a double-metal CMOS structure, where the individual parasitic capacitances between the layers are also indicated. The cross-section does not show a MOSFET, but just a portion of a diffusion region over which some metal lines may pass. The inter-layer capacitances between metal-2 and metal-1, metal-1 and polysilicon, and metal-2 and polysilicon are labeled as Cm2m1, Cm1p and Cm2p, respectively. The other parasitic capacitance components are defined with respect to the substrate. If the metal line passes over an active region, the oxide thickness underneath is smaller (because of the active area window), and consequently, the capacitance is larger. These special cases are labeled as Cm1a and Cm2a. Otherwise, the thick field oxide layer results in a smaller capacitance value.

Figure-4.17: Cross-sectional view of a double-metal CMOS structure, showing capacitances between layers.

The vertical thickness values of the different layers in a typical 0.8 micron CMOS technology are given below as an example.
Field oxide thickness          0.52 um
Gate oxide thickness           16.0 nm
Poly 1 thickness               0.35 um   (minimum width 0.8 um)
Poly-metal oxide thickness     0.65 um
Metal 1 thickness              0.60 um   (minimum width 1.4 um)
Via oxide thickness            1.00 um
Metal 2 thickness              1.00 um   (minimum width 1.6 um)
n+ junction depth              0.40 um
p+ junction depth              0.40 um
n-well junction depth          3.50 um

The list below contains the capacitance values between various layers, also for a typical 0.8 micron CMOS technology.

Poly over field oxide (area)           0.066 fF/um^2
Poly over field oxide (perimeter)      0.046 fF/um
Metal-1 over field oxide (area)        0.030 fF/um^2
Metal-1 over field oxide (perimeter)   0.044 fF/um
Metal-2 over field oxide (area)        0.016 fF/um^2
Metal-2 over field oxide (perimeter)   0.042 fF/um
Metal-1 over poly (area)               0.053 fF/um^2
Metal-1 over poly (perimeter)          0.051 fF/um
Metal-2 over poly (area)               0.021 fF/um^2
Metal-2 over poly (perimeter)          0.045 fF/um
Metal-2 over metal-1 (area)            0.035 fF/um^2
Metal-2 over metal-1 (perimeter)       0.051 fF/um

For the estimation of interconnect capacitances in a complicated three-dimensional structure, the exact geometry must be taken into account for every portion of the wire. Yet this requires an unacceptable amount of computation in a large circuit, even if simple formulas are applied for the calculation of capacitances. Usually, chip manufacturers supply the area capacitance (parallel-plate cap) and the perimeter capacitance (fringing-field cap) figures for each layer, which are backed up by measurement of capacitance test structures. These figures can be used to extract the parasitic capacitances from the mask layout. It is often prudent to include test structures on chip that enable the designer to independently calibrate a process to a set of design tools.
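The extraction step described here can be sketched for a single rectangular wire segment: the area figure is multiplied by the drawn area, the perimeter figure by the drawn perimeter, and the two are added. The coefficients below are the 0.8 micron values from the list above; the dictionary keys, the function name `rect_cap_fF` and the example segment are hypothetical. A real extractor would also have to avoid double-counting the edges shared by abutting rectangles, which this sketch ignores.

```python
# (area capacitance in fF/um^2, perimeter capacitance in fF/um)
CAP_TABLE = {
    "poly_field":    (0.066, 0.046),
    "metal1_field":  (0.030, 0.044),
    "metal2_field":  (0.016, 0.042),
    "metal1_poly":   (0.053, 0.051),
    "metal2_poly":   (0.021, 0.045),
    "metal2_metal1": (0.035, 0.051),
}


def rect_cap_fF(layer, w_um, l_um):
    """Capacitance of one isolated rectangle: area term plus perimeter term."""
    c_area, c_perim = CAP_TABLE[layer]
    return c_area * w_um * l_um + c_perim * 2 * (w_um + l_um)


# Minimum-width (1.4 um) metal-1 line, 100 um long, over field oxide.
c_area_part = CAP_TABLE["metal1_field"][0] * 1.4 * 100.0
c_total = rect_cap_fF("metal1_field", 1.4, 100.0)
print(f"area part = {c_area_part:.2f} fF, total = {c_total:.2f} fF")
```

For this minimum-width line the perimeter term is about twice the area term, which is consistent with the earlier observation that fringing fields dominate for narrow wires.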
In some cases where the entire chip performance is influenced by the parasitic capacitance of a specific line, accurate 3-D simulation is the only reliable solution.

4.5 Interconnect Resistance Estimation

The parasitic resistance of a metal or polysilicon line can also have a profound influence on the signal propagation delay over that line. The resistance of a line depends on the type of material used (polysilicon, aluminum, gold, ...), the dimensions of the line and finally, the number and locations of the contacts on that line. Consider again the interconnection line shown in Fig. 4.12. The total resistance in the indicated current direction can be found as

$$ R_{total} = \rho \cdot \frac{l}{w \cdot t} = R_{sheet} \cdot \frac{l}{w} $$   (4.3)

where \rho represents the characteristic resistivity of the interconnect material, and R_{sheet} = \rho / t represents the sheet resistance of the line, in (ohm/square). For a typical polysilicon layer, the sheet resistance is between 20 and 40 ohm/square, whereas the sheet resistance of silicide is about 2 to 4 ohm/square. Using the formula given above, we can estimate the total parasitic resistance of a wire segment based on its geometry. Typical metal-poly and metal-diffusion contact resistance values are between 20 and 30 ohms, while typical via resistance is about 0.3 ohms. In most short-distance aluminum and silicide interconnects, the amount of parasitic wire resistance is usually negligible. On the other hand, the effects of the parasitic resistance must be taken into account for longer wire segments. As a first-order approximation in simulations, the total lumped resistance may be assumed to be connected in series with the total lumped capacitance of the wire. A much better approximation of the influence of distributed parasitic resistance can be obtained by using an RC-ladder network model to represent the interconnect segment (Fig. 4.18). Here, the interconnect segment is divided into smaller, identical sectors, and each sector is represented by an RC-cell.
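The difference between the single lumped RC and the RC-ladder model can be illustrated without a circuit simulator. The sketch below computes the wire resistance from the sheet resistance and the number of squares (l/w), and then uses the Elmore (first-moment) delay as a stand-in for a simulated delay; the choice of the Elmore delay is an assumption made here for illustration, not a method taken from the text. The example line (30 ohm/square polysilicon, 1000 um long, 0.8 um wide, 0.15 pF total capacitance) is hypothetical.

```python
def wire_resistance_ohm(r_sheet_ohm_sq, length_um, width_um):
    """R = R_sheet * (l / w): sheet resistance times the number of squares."""
    return r_sheet_ohm_sq * length_um / width_um


def ladder_elmore_delay_s(r_total, c_total, n_cells):
    """Elmore delay at the far end of a ladder of n identical RC-cells.

    The k-th capacitor charges through k series resistors, so the delay
    is the sum of k * R_seg * C_seg over all cells.
    """
    r_seg = r_total / n_cells
    c_seg = c_total / n_cells
    return sum(k * r_seg * c_seg for k in range(1, n_cells + 1))


r_wire = wire_resistance_ohm(30.0, 1000.0, 0.8)  # 37.5 kilo-ohm
c_wire = 0.15e-12                                # assumed total capacitance
for n in (1, 2, 5, 10, 100):
    delay_ns = ladder_elmore_delay_s(r_wire, c_wire, n) * 1e9
    print(f"{n:4d} cells -> {delay_ns:.3f} ns")
```

With one cell the estimate equals the full product R times C; as the number of cells grows it approaches half of that value, which is why the single lumped model is pessimistic for long resistive lines.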
Typically, the number of these RC-cells (i.e., the resolution of the RC model) determines the accuracy of the simulation results. On the other hand, simulation time restrictions usually limit the resolution of this distributed line model.

Figure-4.18: RC-ladder network used to model the distributed resistance and capacitance of an interconnect.

This chapter edited by Y. Leblebici

Chapter 5 CLOCK SIGNALS AND SYSTEM TIMING

5.1 On-Chip Clock Generation and Distribution

Clock signals are the heartbeats of digital systems. Hence, the stability of clock signals is highly important. Ideally, clock signals should have minimum rise and fall times, specified duty cycles, and zero skew. In reality, clock signals have nonzero skews and noticeable rise and fall times; duty cycles can also vary. In fact, as much as 10% of a machine cycle time is expended to allow realistic clock skews in large computer systems. The problem is no less serious in VLSI chip design. A simple technique for on-chip generation of a primary clock signal would be to use a ring oscillator as shown in Fig. 5.1. Such a clock circuit has been used in low-end microprocessor chips.

Figure-5.1: Simple on-chip clock generation circuit using a ring oscillator.

However, the generated clock signal can be quite process-dependent and unstable. As a result, separate clock chips which use crystal oscillators have been used for high-performance VLSI chip families. Figure 5.2 shows the circuit schematic of a Pierce crystal oscillator with good frequency stability. This circuit is a near series-resonant circuit in which the crystal sees a low load impedance across its terminals.
Series resonance exists in the crystal but its internal series resistance largely determines the oscillation frequency. In its equivalent circuit model, the crystal can be represented as a series RLC circuit; thus, the higher the series resistance, the lower the oscillation frequency. The external load at the terminals of the crystal also has a considerable effect on the frequency and the frequency stability. The inverter across the crystal provides the necessary voltage differential, and the external inverter provides the amplification to drive clock loads. Note that the oscillator circuit presented here is by no means a typical example of the state of the art; the design of high-frequency, high-quality clock oscillators is a formidable task, which is beyond the scope of this section.

Figure-5.2: Circuit diagram of a Pierce crystal oscillator circuit.

Usually a VLSI chip receives one or more primary clock signals from an external clock chip and, in turn, generates the necessary derivatives for its internal use. It is often necessary to use two non-overlapping clock signals. The logical product of two such clock signals should be zero at all times. Figure 5.3 shows a simple circuit that generates CK-1 and CK-2 from the original clock signal CK. Figure 5.4 shows a clock decoder circuit that takes in the primary clock signals and generates four phase signals.

Figure-5.3: A simple circuit that generates a pair of non-overlapping clock signals from CK.

Figure-5.4: Clock decoder circuit: (a) symbolic representation and (b) sample waveforms and gate-level implementation.

Since clock signals are required almost uniformly over the chip area, it is desirable that all clock signals are distributed with a uniform delay.
An ideal distribution network would be the H-tree structure shown in Fig. 5.5. In such a structure, the distances from the center to all branch points are the same and hence, the signal delays would be the same. However, this structure is difficult to implement in practice due to routing constraints and different fanout requirements. A more practical approach for clock-signal distribution is to route main clock signals to macroblocks and use local clock decoders to carefully balance the delays under different loading conditions.

Figure-5.5: General layout of an H-tree clock distribution network.

The reduction of clock skews, which are caused by the differences in clock arrival times and changes in clock waveforms due to variations in load conditions, is a major concern in high-speed VLSI design. In addition to uniform clock distribution (H-tree) networks and local skew balancing, a number of new computer-aided design techniques have been developed to automatically generate the layout of an optimum clock distribution network with zero skew. Figure 5.6 shows a zero-skew clock routing network that was constructed based on estimated routing parasitics.

Regardless of the exact geometry of the clock distribution network, the clock signals must be buffered in multiple stages as shown in Fig. 5.7 to handle the high fan-out loads. It is also essential that every buffer stage drives the same number of fan-out gates so that the clock delays are always balanced. In the configuration shown in Fig. 5.8 (used in the DEC Alpha chip designs), the interconnect wires are cross-connected with vertical metal straps in a mesh pattern, in order to keep the clock signals in phase across the entire chip.
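The equal-distance property of the H-tree mentioned above can be checked with a small geometric sketch. This is an illustration only: the function name, the coordinate convention and the rule that each level halves the arm length are assumptions, not details taken from Fig. 5.5.

```python
def h_tree_leaves(x, y, half, levels, dist=0.0):
    """Return (x, y, wire_length_from_root) for every leaf of an H-tree.

    Each level draws one "H" centred at (x, y): a horizontal bar reaching
    `half` to each side, then a vertical bar reaching `half` up and down at
    both ends. The four tips become the centres of the next, smaller H.
    """
    if levels == 0:
        return [(x, y, dist)]
    leaves = []
    for dx in (-half, half):
        for dy in (-half, half):
            # Path from the centre: along the horizontal bar, then along the
            # vertical bar, so the added wire length is 2 * half.
            leaves += h_tree_leaves(x + dx, y + dy, half / 2,
                                    levels - 1, dist + 2 * half)
    return leaves


if __name__ == "__main__":
    leaves = h_tree_leaves(0.0, 0.0, 4.0, 3)
    lengths = {d for (_, _, d) in leaves}
    print(len(leaves), "leaves, distinct wire lengths:", lengths)
```

Every leaf is reached through the same wire length, which is the geometric reason the unloaded delays match; unequal fan-out or routing detours, as noted in the text, break this symmetry in practice.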
So far we have seen the need for equal interconnect lengths and extensive buffering in order to distribute clock signals with minimal skews and healthy signal waveforms. In practice, designers must spend significant time and effort to tune the transistor sizes in buffers (inverters) and also the widths of interconnects. Widening the interconnection wires decreases the series resistance, but at the cost of increasing the parasitic capacitance.

Figure-5.6: An example of the zero-skew clock routing network, generated by a computer-aided design tool.

Figure-5.7: Three-level buffered clock distribution network.

Figure-5.8: General structure of the clock distribution network used in DEC Alpha microprocessor chips.

The following points should always be considered carefully in digital system design, but especially for successful high-speed VLSI design:

The ideal duty cycle of a clock signal is 50%, and the signal can travel farther in a chain of inverting buffers with the ideal duty cycle. The duty cycle of a clock signal can be improved, i.e., made closer to 50%, by using feedback based on the voltage average.

To prevent reflection in the interconnection network, the rise time and the fall time of the clock signal should not be reduced excessively.

The load capacitance should be reduced as much as possible, by reducing the fan-out, the interconnection lengths and the gate capacitances.

The characteristic impedance of the clock distribution line should be reduced by using properly increased (w/h)-ratios (the ratio of the line width to the vertical separation distance of the line from the substrate).

Inductive loads can be used to partially cancel the effects of parasitic capacitance of a clock receiver (matching network).
Adequate separation should be maintained between high-speed clock lines in order to prevent cross-talk. Also, placing a power or ground rail between two high-speed lines can be an effective measure.

This chapter edited by Y. Leblebici

Chapter 6 ARITHMETIC FOR DIGITAL SYSTEMS

Introduction
Notation Systems
Principle of Generation and Propagation
The 1-bit Full Adder
Enhancement Techniques for Adders
Multioperand Adders
Multiplication
Addition and Multiplication in Galois Fields, GF(2^n)

6.1 Introduction

Computation speeds have increased dramatically during the past three decades as a result of the development of various technologies. The execution speed of an arithmetic operation is a function of two factors. One is the circuit technology and the other is the algorithm used. It can be rather confusing to discuss both factors simultaneously; for instance, a ripple-carry adder implemented in GaAs technology may be faster than a carry-look-ahead adder implemented in CMOS. Further, in any technology, logic path delay depends upon many different factors: the number of gates through which a signal has to pass before a decision is made, the logic capability of each gate, the cumulative distance among all such serial gates, the electrical signal propagation time of the medium per unit distance, etc. Because the logic path delay is attributable to the delay internal and external to logic gates, a comprehensive model of performance would have to include technology, distance, placement, layout, and the electrical and logical capabilities of the gates. It is not feasible to make a general model of arithmetic performance and include all these variables.
The purpose of this chapter is to give an overview of the different components used in the design of arithmetic operators. The following parts will not go through all these components exhaustively. However, the algorithms used, some mathematical concepts, the architectures, and the implementations at the block, transistor or even mask level will be presented. This chapter begins by presenting various notation systems. These are important because they influence the architectures, the size and the performance of the arithmetic components. The well-known and widely used principle of generation and propagation will be explained, and basic implementations at transistor level will be given as examples. The basic full adder cell (FA) will be shown as a brick used in the construction of various systems. After that, the problem of building large adders will lead to the presentation of enhancement techniques. Multioperand adders are of particular interest when building special CPUs and especially multipliers. That is why certain algorithms will be introduced to give a better idea of how multipliers are built. After the classical approaches have been presented, a logarithmic multiplier and multiplication and addition in the Galois fields will be briefly introduced. Muller [Mull92] and Cavanagh [Cava83] constitute two reference books on the matter.

6.2 Notation Systems

6.2.1 Integer Unsigned

The binary number system is the most conventional and easily implemented system for internal use in digital computers. It is also a positional number system. In this mode the number is encoded as a vector of n bits (digits) in which each is weighted according to its position in the vector. Associated with each vector is a base (or radix) r. Each digit has an integer value in the range 0 to r-1. In the binary system where r=2, each bit has the value 0 or 1.
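The positional weighting just described can be sketched in a few lines. This is a minimal illustration with a hypothetical function name; digits are given most significant first, which is an assumed convention.

```python
def positional_value(digits, radix=2):
    """Value of an unsigned digit vector, most significant digit first."""
    value = 0
    for d in digits:
        if not 0 <= d < radix:
            raise ValueError(f"digit {d} is outside the range 0..{radix - 1}")
        # Shifting the running value by one position multiplies it by the
        # radix, so each digit ends up weighted by radix**position.
        value = value * radix + d
    return value


if __name__ == "__main__":
    print(positional_value([1, 1, 0, 1]))      # binary vector 1101
    print(positional_value([1, 0, 1], 3))      # the same idea in radix 3
    print(positional_value([1] * 8))           # largest 8-bit value
```

The last call shows the upper end of the unsigned range: an n-bit vector of all ones has the value 2^n - 1.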
Consider an n-bit vector of the form:

$A = (a_{n-1}, a_{n-2}, \ldots, a_1, a_0)$ (1)

where $a_i = 0$ or $1$ for $i$ in $[0, n-1]$. This vector can represent positive integer values V = A in the range 0 to $2^n - 1$, where:

$A = \sum_{i=0}^{n-1} a_i \cdot 2^i$ (2)

The above representation can be extended to include fractions. An example follows. The string of binary digits 1101.11 can be interpreted to represent the quantity:

$2^3 \cdot 1 + 2^2 \cdot 1 + 2^1 \cdot 0 + 2^0 \cdot 1 + 2^{-1} \cdot 1 + 2^{-2} \cdot 1 = 13.75$ (3)

Table 6.1 shows the 3-bit vectors together with the decimal values they represent.

Table-6.1: Binary representation unsigned system with 3 digits

6.2.2 Integer Signed

If only positive integers were to be represented in fixed-point notation, then an n-bit word would permit a range from 0 to $2^n - 1$. However, both positive and negative integers are used in computations, and an encoding scheme must be devised in which both positive and negative numbers are distributed as evenly as possible. There must also be an easy way to distinguish between positive and negative numbers. The leftmost digit is usually reserved for the sign. Consider a number A with radix r, where the sign digit $a_{n-1}$ has the following value:

$a_{n-1} = 0$ if $A \ge 0$, and $a_{n-1} = r - 1$ if $A < 0$

For binary numbers, where r=2, the previous equation becomes:

$a_{n-1} = 0$ if $A \ge 0$, and $a_{n-1} = 1$ if $A < 0$

The remaining digits in A indicate either the true value or the magnitude of A in a complemented form.

6.2.2.1 Absolute value

Table-6.2: binary representation signed absolute value

In this representation, the high-order bit indicates the sign of the integer (0 for positive, 1 for negative). A positive number has a range of 0 to $2^{n-1} - 1$, and a negative number has a range of 0 to $-(2^{n-1} - 1)$. A positive number is represented by a sign bit of 0 followed by the n-1 bits of its magnitude; a negative number is represented by a sign bit of 1 followed by the same magnitude bits. One problem with this kind of notation is the dual representation of the number 0. A second problem arises when adding two numbers with opposite signs.
The magnitudes have to be compared to determine the sign of the result.

6.2.2.2 1's complement

Table-6.3: binary representation signed

In this representation, the high-order bit also indicates the sign of the integer (0 for positive, 1 for negative). A positive number has a range of 0 to $2^{n-1} - 1$, and a negative number has a range of 0 to $-(2^{n-1} - 1)$. A positive number is represented as in the absolute-value system; a negative number is obtained by complementing every bit of the corresponding positive number. One problem with this kind of notation is the dual representation of the number 0. A second problem is that addition requires an end-around carry: a carry out of the sign position has to be added back into the least significant position.

6.2.2.3 2's complement

Table-6.4: binary representation signed in 2's complement

In this notation system (radix 2), the value of A is represented as:

$A = -a_{n-1} \cdot 2^{n-1} + \sum_{i=0}^{n-2} a_i \cdot 2^i$

The sign test is also a simple comparison of two bits. There is a unique representation of 0. Addition and subtraction are easier because the result always comes out in a unique 2's complement form.

6.2.3 Carry Save

In some particular operations requiring big additions, such as multiplication or filtering operations, the carry save notation is used. This notation can be used with 1's complement, 2's complement or any other representation. It only means that the result of an addition is coded in two digits, which are the carry digit and the sum digit. When we come to the multioperand adders and multipliers, this notion will become clear.

6.2.4 Redundant Notation

It has been stated that each digit in a number system has an integer value in the range 0 to r-1. This produces a digit set S:

$S = \{0, 1, 2, \ldots, r-1\}$ (4)

in which all the digits of the set are positively weighted.
It is also possible to have a digit set in which both positive- and negative-weighted digits are allowed [Aviz61] [Taka87], such as:

$T = \{-l, \ldots, -1, 0, 1, \ldots, l\}$ (5)

where l is a positive integer representing the upper limit of the set. This is considered a redundant number system, because there may be more than one way to represent a given number. Each digit of a redundant number system can assume the 2l+1 values of the set T. The range of l is:

$\lceil r/2 \rceil \le l \le r - 1$ (6)

where $\lceil r/2 \rceil$ is called the ceiling of r/2. For any number x, the ceiling of x is the smallest integer not less than x. The floor of x is the largest integer not greater than x. Since the integer l is greater than or equal to 1 and r is greater than or equal to 2, the maximum magnitude of l will be

$l_{max} = r - 1$ (7)

Thus for r=2, the digit set is:

$T = \{-1, 0, 1\}$ (8)

For r=4, the digit set is

$T = \{-3, -2, -1, 0, 1, 2, 3\}$ (9)

For example, for n=4 and r=2, the number A=-5 has several representations, four of which are shown in Table 6.5.

        2^3   2^2   2^1   2^0
A =      0    -1     0    -1
A =      0    -1    -1     1
A =     -1     0     1     1
A =     -1     1     0    -1

Table-6.5: Redundant representation of A=-5 when n=4 and r=2

This multiple representation makes redundant number systems difficult to use for certain arithmetic operations. Also, since each signed digit may require more than one bit, this may increase both the storage and the width of the storage bus. However, redundant number systems have an advantage for addition: it is possible to eliminate the problem of the propagation of the carry bit. The operation can be done in a constant time independent of the length of the data word. The conversion from binary to binary redundant is usually a duplication or juxtaposition of bits and does not cost anything. The opposite conversion, on the contrary, requires an addition, and the propagation of the carry bit cannot be removed.

Let us consider the example where r=2 and l=1. In this system the three digits used are -1, 0, +1.
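The redundancy of the signed-digit set can be made concrete by brute force. The sketch below is an illustration with a hypothetical function name; it simply tries every n-digit vector over {-l, ..., l} and keeps those whose weighted sum equals the target value.

```python
from itertools import product


def signed_digit_representations(value, n, radix=2, limit=1):
    """All n-digit vectors over {-limit..limit} (MSD first) equal to value."""
    digits = range(-limit, limit + 1)
    found = []
    for vector in product(digits, repeat=n):
        # Digit i (counted from the left) has weight radix**(n - 1 - i).
        total = sum(d * radix ** (n - 1 - i) for i, d in enumerate(vector))
        if total == value:
            found.append(vector)
    return found


if __name__ == "__main__":
    for rep in signed_digit_representations(-5, 4):
        print(rep)
```

For A=-5 with n=4 and r=2 the search returns the four vectors listed in Table 6.5 and, in addition, (-1, 1, -1, 1), since -8 + 4 - 2 + 1 = -5.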
Each digit is coded on two bits, its value being the first bit minus the second:
The representation of 1 is 10, because 1-0=1.
The representation of -1 is 01, because 0-1=-1.
One representation of 0 is 00, because 0-0=0.
Another representation of 0 is 11, because 1-1=0.

The addition of 7 and 5 gives 12 in decimal. In a binary non-redundant system the same operation is 111 + 101 = 1100. We note that a carry bit has to be added to the next digits when performing the operation "by hand". In the redundant system the same operation absorbs the carry bit, which is never propagated to the higher-order digits. The result 1001100 now has to be converted to the binary non-redundant system. To achieve that, each couple of bits has to be added together. Any resulting carry has to be propagated to the higher-order bits.

6.3 Principle of Generation and Propagation

6.3.1 The Concept

The principle of generation and propagation seems to have been discussed for the first time by Burks, Goldstine and Von Neumann [BGNe46]. It is based on a simple remark: when adding two numbers A and B in 2's complement or in the simplest binary representation ($A = a_{n-1} \ldots a_1 a_0$, $B = b_{n-1} \ldots b_1 b_0$), when $a_i = b_i$ it is not necessary to know the carry $c_i$. So it is not necessary to wait for its calculation in order to determine $c_{i+1}$ and the sum $s_{i+1}$.

If $a_i = b_i = 0$, then necessarily $c_{i+1} = 0$.
If $a_i = b_i = 1$, then necessarily $c_{i+1} = 1$.

This means that when $a_i = b_i$, it is possible to add the bits greater than the ith before the carry information $c_{i+1}$ has arrived. The time required to perform the addition will be proportional to the length of the longest chain i, i+1, i+2, ..., i+p such that $a_k \ne b_k$ for k in [i, i+p]. It has been shown [BGNe46] that the average value of this longest chain is proportional to the logarithm of the number of bits used to represent A and B. By using this principle of generation and propagation it is possible to design an adder with an average delay O(log n).
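The quantity that governs this average delay, the longest run of positions where the operand bits differ, is easy to compute directly. The sketch below is illustrative only; the function names are hypothetical, and the exhaustive average is practical only for small word lengths.

```python
def longest_propagate_chain(a, b, n):
    """Length of the longest run of consecutive positions with a_k != b_k."""
    p = (a ^ b) & ((1 << n) - 1)   # bit k of p is 1 exactly when a_k != b_k
    longest = run = 0
    for k in range(n):
        if (p >> k) & 1:
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return longest


def average_longest_chain(n):
    """Average of the longest chain over all pairs of n-bit operands.

    Every pattern of a XOR b is equally likely, so it is enough to sweep
    one operand while holding the other at zero.
    """
    total = sum(longest_propagate_chain(a, 0, n) for a in range(1 << n))
    return total / (1 << n)


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(n, average_longest_chain(n))
```

Running the loop for increasing n shows the average growing far more slowly than n itself, in line with the logarithmic behaviour cited from [BGNe46].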
However, an adder of this type, whose completion time depends on the operands, is usable only in asynchronous systems [Mull82]. Today the complexity of systems is so high that asynchronous timing of the operations is rarely implemented. That is why the problem is to minimize the maximum delay rather than the average delay.

Generation: This principle of generation allows the system to take advantage of the occurrences of $a_i = b_i$. In both cases ($a_i = 1$ or $a_i = 0$) the carry bit will be known.

Propagation: If we are able to localize a chain of bits $a_i a_{i+1} \ldots a_{i+p}$ and $b_i b_{i+1} \ldots b_{i+p}$ for which $a_k \ne b_k$ for k in [i, i+p], then the output carry bit of this chain will be equal to the input carry bit of the chain.

These remarks constitute the principle of generation and propagation used to speed up the addition of two numbers. All adders which use this principle calculate, in a first stage:

$p_i = a_i \oplus b_i$ (10)
$g_i = a_i \cdot b_i$ (11)

The previous equations determine the ability of the ith bit to propagate carry information or to generate carry information.

6.3.2 Transistor Formulation

Figure-6.1: A 1-bit adder with propagation signal controlling the pass-gate

This implementation can be very efficient (20 transistors), depending on the way the XOR function is built. The propagation of the carry is controlled by the output of the XOR gate. The generation of the carry is performed directly by the function at the bottom. When both input signals are 1, the inverse output carry is 0.

In the schematic of Figure 6.1, the carry passes through a complete transmission gate. If the carry path is precharged to VDD, the transmission gate is reduced to a simple NMOS transistor. In the same way, the PMOS transistors of the carry generation are removed. One gets a Manchester cell.
Figure-6.2: The Manchester cell

The Manchester cell is very fast, but a long chain of such cascaded cells would be slow. This is due to the distributed RC effect and the body effect, which make the propagation time grow with the square of the number of cells. In practice, an inverter is added every four cells, as in Figure 6.3.

Figure-6.3: The Manchester carry cell

6.4 The 1-bit Full Adder

The 1-bit full adder is the generic cell used not only to perform addition but also multiplication, division and filtering operations. In this part we will analyse the equations and give some implementations with layout examples.

The adder cell receives two operands $a_i$ and $b_i$, and an incoming carry $c_i$. It computes the sum and the outgoing carry $c_{i+1}$:

$c_{i+1} = a_i \cdot b_i + a_i \cdot c_i + c_i \cdot b_i = a_i \cdot b_i + (a_i + b_i) \cdot c_i$
$c_{i+1} = p_i \cdot c_i + g_i$

Figure-6.4: The full adder (FA) and half adder (HA) cells

where

$p_i = b_i \oplus a_i$ is the PROPAGATION signal (12)
$g_i = a_i \cdot b_i$ is the GENERATION signal (13)
$s_i = a_i \oplus b_i \oplus c_i$ (14)
$s_i = \overline{c_{i+1}} \cdot (a_i + b_i + c_i) + a_i \cdot b_i \cdot c_i$ (15)

These equations can be directly translated into two N and P nets of transistors, leading to the following schematics. The main disadvantage of this implementation is that there is no regularity in the nets.

Figure-6.5: Direct transcription of the previous equations

The dual form of each equation described previously can be written in the same manner as the normal form, which gives equations (16) to (18) and, in the same way, equations (19) to (21) for $c_i$ and $s_i$. The schematic becomes symmetrical (Figure 6.6), and leads to a better layout:

Figure-6.6: Symmetrical implementation due to the dual expressions of ci and si.

Figure 6.7 shows different physical layouts in different technologies.
The size, the technology and the performance of each cell are summarized in Table 6.6.

Name of cell     Number of Tr.   Size (µm2)   Technology   Worst Case Delay (ns) (Typical Conditions)
fa_ly_mini_jk    24              2400         1.2 µ        20
fa_ly_op1        24              3150         1.2 µ        5
Fulladd.L        28              962          0.5 µ        1.5
fa_ly_itt        24              3627         1.2 µ        10

Table-6.6: Characteristics of layout cells from Figure 6.7.

Figure-6.7: Mask layout for different Full Adder cells

6.5 Enhancement Techniques for Adders

The operands of addition are the addend and the augend. The addend is added to the augend to form the sum. In most computers, the augmented operand (the augend) is replaced by the sum, whereas the addend is unchanged. High-speed adders are used not only for addition but also for subtraction, multiplication and division. The speed of a digital processor depends heavily on the speed of its adders. The adders add vectors of bits, and the principal problem is to speed up the carry signal. A traditional, non-optimized four-bit adder can be made by connecting generic one-bit adder cells one to the other. This is the ripple carry adder. In this case, each stage has to wait for the incoming carry signal before it can produce its sum. The carry propagation can be sped up in two ways. The first, and most obvious, way is to use a faster logic circuit technology. The second way is to generate carries by means of forecasting logic that does not rely on the carry signal being rippled from stage to stage of the adder.

Figure-6.8: A 4-bit parallel ripple carry adder

Generally, the size of an adder is determined according to the type of operations required, to the precision or to the time allowed to perform the operation.
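The ripple carry adder described above can be sketched at the bit level using the propagation and generation signals of the full adder cell. This is a behavioural illustration with hypothetical function names, not a model of any of the layouts in Table 6.6.

```python
def full_adder(a, b, c):
    """One-bit full adder built from the propagate and generate signals."""
    p = a ^ b            # propagation signal
    g = a & b            # generation signal
    s = p ^ c            # sum bit
    c_out = g | (p & c)  # outgoing carry: generated here or propagated
    return s, c_out


def ripple_carry_add(a, b, n, c_in=0):
    """Add two n-bit operands cell by cell; return (sum, carry_out)."""
    carry = c_in
    total = 0
    for i in range(n):
        # Each cell must wait for the carry of the previous cell.
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total, carry


if __name__ == "__main__":
    print(ripple_carry_add(7, 5, 4))    # 7 + 5 in four bits
    print(ripple_carry_add(15, 1, 4))   # wraps around with a carry out
```

The loop makes the sequential dependency explicit: the carry variable is threaded through every cell, which is exactly the path the enhancement techniques of this section try to shorten.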
Since the operands have a fixed size, it becomes important to determine whether or not an overflow has occurred.

Overflow: An overflow can be detected in two ways. First, an overflow has occurred when the signs of the operands are the same and the sign of the sum does not agree with them. In an n-bit adder, overflow can be defined as:

$OVF = a_{n-1} \cdot b_{n-1} \cdot \overline{s_{n-1}} + \overline{a_{n-1}} \cdot \overline{b_{n-1}} \cdot s_{n-1}$ (22)

Secondly, if the carry out of the high-order numeric (magnitude) position of the sum and the carry out of the sign position of the sum agree, the sum is satisfactory; if they disagree, an overflow has occurred. Thus,

$OVF = c_{n-1} \oplus c_{n-2}$ (23)

A parallel adder adds two operands, including the sign bits. An overflow from the magnitude part will tend to change the sign of the sum, so that an erroneous sign will be produced. Table 6.7 summarizes the overflow detection.

a(n-1)   b(n-1)   s(n-1)   c(n-1)   c(n-2)   Overflow
0        0        0        0        0        0
0        0        1        0        1        1
1        1        0        1        0        1
1        1        1        1        1        0

Table-6.7: Overflow detection for 1's and 2's complement

Coming back to the acceleration of the computation, two major techniques are used: speed-up techniques (Carry Skip and Carry Select) and anticipation techniques (Carry Look Ahead, Brent and Kung, and C3i). Finally, a combination of these techniques can prove to be an optimum for large adders.

6.5.1 The Carry-Skip Adder

Depending on the position at which a carry signal has been generated, the propagation time can be variable. In the best case, when there is no carry generation, the addition time will only take into account the time to propagate the carry signal. Figure 6.9 is an example illustrating a carry signal generated twice, with the input carry being equal to 0. In this case three simultaneous carry propagations occur. The longest is the second, which takes 7 cell delays (it starts at the 4th position and ends at the 11th position).
So the addition time of these two numbers with this 16-bit Ripple Carry Adder is $7k + k'$, where k is the cell delay and k' is the time needed to compute the 11th sum bit using the 11th carry-in.

With a Ripple Carry Adder, if the input bits $A_i$ and $B_i$ are different for all positions i, then the carry signal is propagated at all positions (thus never generated), and the addition is completed when the carry signal has propagated through the whole adder. In this case, the Ripple Carry Adder is as slow as it is large. Actually, Ripple Carry Adders are fast only for some configurations of the input words, where carry signals are generated at some positions.

Carry Skip Adders take advantage of both the generation and the propagation of the carry signal. They are divided into blocks, where a special circuit quickly detects whether all the bits to be added are different ($P_i = 1$ in the whole block). The signal produced by this circuit will be called the block propagation signal. If the carry is propagated at all positions in the block, then the carry signal entering the block can directly bypass it and so be transmitted through a multiplexer to the next block. As soon as the carry signal is transmitted to a block, it starts to propagate through the block, as if it had been generated at the beginning of the block. Figure 6.10 shows the structure of a 24-bit Carry Skip Adder, divided into 4 blocks.

Figure-6.10: The "domino" behaviour of the carry propagation and generation signals

Figure-6.10a: Block diagram of a carry skip adder

To summarize, if in a block all $A_i \ne B_i$, then the carry signal skips over the block. If some of them are equal, a carry signal is generated inside the block, and the computation inside the block has to complete before the carry information is passed to the next block.
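The skipping behaviour summarized above can be explored with a small timing sketch. The model below is an assumption made for illustration, not the circuit of Figure 6.10a: a cell that propagates adds a delay k1 to its incoming carry, a cell with equal input bits produces its carry locally after k1, and a block whose cells all propagate forwards its incoming carry after a skip delay k2. All names are hypothetical.

```python
def carry_skip_add(a, b, n, block, k1=1.0, k2=1.0):
    """Add two n-bit numbers with a carry-skip structure.

    Returns (sum, carry_out, latest), where `latest` is the latest arrival
    time of a carry-in at any cell under the simplified delay model.
    """
    carry, t = 0, 0.0            # value and arrival time of the current carry
    total, latest = 0, 0.0
    for lo in range(0, n, block):
        t_in = t                 # arrival time of the carry entering the block
        all_propagate = True
        for i in range(lo, min(lo + block, n)):
            ai, bi = (a >> i) & 1, (b >> i) & 1
            latest = max(latest, t)       # cell i needs its carry-in for s_i
            total |= (ai ^ bi ^ carry) << i
            if ai != bi:
                t += k1                   # propagate: wait for incoming carry
            else:
                carry, t = ai, k1         # generate or kill: known locally
                all_propagate = False
        if all_propagate:
            # The carry value is unchanged, and it may bypass the block.
            t = min(t, t_in + k2)
    return total, carry, latest


if __name__ == "__main__":
    a, b = 0xFFFF, 0x8001   # generate at bit 0, propagate up to bit 14
    print(carry_skip_add(a, b, 16, 16))  # one block: plain ripple behaviour
    print(carry_skip_add(a, b, 16, 4))   # four blocks of four cells
```

Under these assumptions the operand pair in the usage example needs 15 cell delays as a plain ripple chain but only 9 delay units with four blocks of four cells, because the two middle blocks are bypassed.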
OPTIMISATION TECHNIQUE WITH BLOCKS OF EQUAL SIZE

It now becomes obvious that there exists a trade-off between the speed and the size of the blocks. In this part we analyse the division of the adder into blocks of equal size. Let us denote by k1 the time needed by the carry signal to propagate through an adder cell, and by k2 the time it needs to skip over one block. Suppose the N-bit Carry Skip Adder is divided into M blocks, and each block contains P adder cells. The actual addition time of a Ripple Carry Adder depends on the configuration of the input words. The completion time may be small, but it may also reach the worst case, when all adder cells propagate the carry signal. In the same way, we must evaluate the worst carry propagation time for the Carry Skip Adder. The worst case of carry propagation is depicted in Figure 6.11.

Figure-6.11: Worst case for the propagation signal in a Carry Skip adder with blocks of equal size

The configuration of the input words is such that a carry signal is generated at the beginning of the first block. This carry signal is then propagated by all the succeeding adder cells except the last, which generates another carry signal. In the first and the last block the block propagation signal is equal to 0, so the entering carry signal is not transmitted to the next block. Consequently, in the first block, the last adder cells must wait for the carry signal, which comes from the first cell of the first block. When going out of the first block, the carry signal is distributed to the 2nd, 3rd and last block, where it propagates. In these blocks, the carry signals propagate almost simultaneously (we must account for the multiplexer delays). Any other situation leads to a better case. Suppose for instance that the 2nd block does not propagate the carry signal (its block propagation signal is equal to zero); this means that a carry signal is generated inside.
This carry signal starts to propagate as soon as the input bits are settled. In other words, at the beginning of the addition, there exist two sources for the carry signals. The paths of these carry signals are shorter than the carry path of the worst case.

Let us formalize the problem: the total adder is made of N adder cells and contains M blocks of P adder cells. The total number of adder cells is then

$N = M \cdot P$ (24)

The time T needed by the carry signal to propagate through P adder cells is

$T = k_1 \cdot P$ (25)

The time T' needed by the carry signal to skip through M adder blocks is

$T' = k_2 \cdot M$ (26)

The problem to solve is to minimize the worst-case delay, which is given by equations (27) and (28), so that the function to be minimized is that of equation (29). The minimum is obtained for the value given by equation (30).

OPTIMISATION TECHNIQUE WITH BLOCKS OF NON-EQUAL SIZE

Let us formalize the problem as a geometric problem. A square will represent the generic full adder cell. These cells will be grouped in P groups (in a column-like manner). L(i) is the number of bits of one column. L(1), L(2), ..., L(P) are the P adjacent columns (see Figure 6.12).

Figure-6.12: Geometric formalization

If a carry signal is generated at the ith section, this carry skips j-i-1 sections and disappears at the end of the jth section. The propagation delay is then given by equation (31). By defining the constant a as in equation (32), one can position two straight lines, defined by equation (33) at the leftmost position and by equation (34) at the rightmost position. The constant a is equivalent to the slope, in the geometrical problem, of the two straight lines defined by equations (33) and (34).
These straight lines are adjacent to the top of the columns, and the maximum time can be expressed as a geometrical distance y equal to the y-value of the intersection of the two straight lines. This value is given by equations (35) and (36), which hold because of equation (37).

Figure-6.13: Representation of the geometrical worst delay

A possible implementation of a block is shown in Figure 6.14. In the precharge mode, the output of the four inverter-like structures is set to one. In the evaluation mode, the entire block is in action, and the output will receive either c0 or the carry generated inside the comparator cells, according to the values given to A and B. If no carry generation is needed, c0 will be transmitted to the output. Otherwise, one of the inverted pi's will switch the multiplexer to enable the other input.

Figure-6.14: A possible implementation of the Carry Skip block

6.5.2 The Carry-Select Adder

This type of adder is not as fast as the Carry Look Ahead (CLA) adder presented in a later section. However, despite the larger amount of hardware it needs, it has an interesting design concept. The Carry Select principle requires two identical parallel adders that are partitioned into four-bit groups. Each group has the same design as that shown in Figure 6.15. The group generates a group carry. In the carry select adder, two sums are generated simultaneously. One sum assumes that the carry in is equal to one, while the other assumes that the carry in is equal to zero. The predicted group carry is then used to select one of the two sums.

It can be seen that the group carry logic grows rapidly when more high-order groups are added to the total adder length. This complexity can be decreased, with a subsequent increase in the delay, by partitioning a long adder into sections, with four groups per section, similar to the CLA adder.
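The carry-select principle described above can be sketched behaviourally. The function names below are hypothetical, and the four-bit group size simply follows the text; the sketch computes both candidate sums of every group and then lets the incoming group carry pick one of them.

```python
def ripple_add(a, b, width, c_in):
    """Reference ripple adder for one group; returns (sum, carry_out)."""
    carry, total = c_in, 0
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        total |= (ai ^ bi ^ carry) << i
        carry = (ai & bi) | ((ai ^ bi) & carry)
    return total, carry


def carry_select_add(a, b, n, group=4):
    """Add two n-bit operands group by group using carry selection."""
    carry, total = 0, 0
    for lo in range(0, n, group):
        width = min(group, n - lo)
        mask = (1 << width) - 1
        ga, gb = (a >> lo) & mask, (b >> lo) & mask
        # In hardware both sums are produced at the same time, before the
        # incoming carry of the group is known.
        s0, c0 = ripple_add(ga, gb, width, 0)
        s1, c1 = ripple_add(ga, gb, width, 1)
        # The incoming carry only drives the selection (a multiplexer).
        s, carry = (s1, c1) if carry else (s0, c0)
        total |= s << lo
    return total, carry


if __name__ == "__main__":
    print(carry_select_add(0x7F, 0x01, 8))
    print(carry_select_add(0xFF, 0xFF, 8))
```

The cost visible in the sketch is the duplicated adder per group; the benefit is that the carry crossing a group only has to pass through a selection step rather than ripple through every cell.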
Figure-6.15: The Carry Select adder

Figure-6.16: The Carry Select adder. (a) The design with non-optimised use of the gates, (b) merging of the redundant gates

A possible implementation is shown in Figure 6.16, where some redundant logic gates can be merged to achieve a lower complexity with a higher density.

6.5.3 The Carry Look-Ahead Adder

The limitation of the sequential method of forming carries, especially in the Ripple Carry adder, arises from specifying c_i as a specific function of c_{i-1}. It is possible to express a carry as a function of all the preceding low-order carries by using the recursivity of the carry function. With expression (38), a considerable increase in speed can be realized.

Usually the size and complexity of a big adder using this equation are not affordable. That is why the equation is used in a modular way, by making groups of carries (usually four bits). Such a unit then generates a group carry, which gives the correct predicted information to the next block and so gives the sum units time to perform their calculation.

Figure-6.17: The Carry Generation unit performing the Carry group computation

Such a unit can be implemented in various ways, according to the allowed level of abstraction. In a CMOS process, 17 transistors are able to guarantee the static function (Figure 6.18). However, this design requires a careful sizing of the transistors put in series. The same design is available with fewer transistors in a dynamic logic design: the sizing is still an important issue, but the number of transistors is reduced (Figure 6.19).
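To make the look-ahead idea concrete, the following sketch computes every carry of a group directly from the generate and propagate signals and the group carry-in, without using any previously computed carry. It assumes the usual definitions g = a AND b and p = a XOR b; the function names and the LSB-first bit lists are hypothetical, and the expanded expression in the comment is the standard look-ahead form, which may be written differently from equation (38).

```python
def lookahead_carries(a_bits, b_bits, c0=0):
    """Return the carries out of each bit position for LSB-first bit lists.

    Each carry is an OR of AND terms of g, p and c0 only:
        c = g_i + p_i.g_{i-1} + ... + p_i...p_{1}.g_0 + p_i...p_0.c0
    (indices here are 0-based list positions).
    """
    g = [a & b for a, b in zip(a_bits, b_bits)]
    p = [a ^ b for a, b in zip(a_bits, b_bits)]
    carries = []
    for i in range(len(g)):
        carry = 0
        prefix = 1                    # running product p_i . p_{i-1} ...
        for j in range(i, -1, -1):
            carry |= prefix & g[j]
            prefix &= p[j]
        carry |= prefix & c0          # carry-in propagated through all bits
        carries.append(carry)
    return carries


def ripple_carries(a_bits, b_bits, c0=0):
    """Reference model: carries formed sequentially, one bit after another."""
    carries, carry = [], c0
    for a, b in zip(a_bits, b_bits):
        carry = (a & b) | (a & carry) | (b & carry)
        carries.append(carry)
    return carries
```

Both functions return the same list of carries; the difference is that the look-ahead version has no data dependence between successive carries, which is what allows a group of (usually four) carries to be produced in parallel.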
Figure-6.18: Static implementation of the 4-bit carry lookahead chain

Figure-6.19: Dynamic implementation of the 4-bit carry lookahead chain

To build large adders, the preceding blocks are cascaded according to Figure 6.20.

Figure-6.20: Implementation of a 16-bit CLA adder

6.5.4 The Brent and Kung Adder

The technique to speed up the addition is to introduce a "new" operator Δ which combines couples of generation and propagation signals. This "new" operator comes from the reformulation of the carry chain.

REFORMULATION OF THE CARRY CHAIN

Let a_n a_{n-1} ... a_1 and b_n b_{n-1} ... b_1 be n-bit binary numbers with sum s_{n+1} s_n ... s_1. The usual method for addition computes the s_i's by:

c_0 = 0 (39)
c_i = a_i . b_i + a_i . c_{i-1} + b_i . c_{i-1} (40)
s_i = a_i ++ b_i ++ c_{i-1}, i = 1, ..., n (41)
s_{n+1} = c_n (42)

where ++ means the sum modulo 2 and c_i is the carry from bit position i. From the previous paragraph we can deduce that the c_i's are given by:

c_0 = 0 (43)
c_i = g_i + p_i . c_{i-1} (44)
g_i = a_i . b_i (45)
p_i = a_i ++ b_i for i = 1, ..., n (46)

Equation (44) can be read as follows: the carry c_i is either generated by a_i and b_i or propagated from the previous carry c_{i-1}. The whole idea is now to generate the carries in parallel, so that the nth stage does not have to "wait" for the (n-1)th carry bit to compute the global sum. To achieve this goal an operator Δ is defined.

Let Δ be defined as follows for any g, g', p and p':

(g, p) Δ (g', p') = (g + p . g', p . p') (47)

Lemma 1: Let

(G_i, P_i) = (g_1, p_1) if i = 1 (48)
(G_i, P_i) = (g_i, p_i) Δ (G_{i-1}, P_{i-1}) if i in [2, n] (49)

Then c_i = G_i for i = 1, 2, ..., n.

Proof: The lemma is proved by induction on i. Since c_0 = 0, (44) above gives:

c_1 = g_1 + p_1 . 0 = g_1 = G_1 (50)

So the result holds for i = 1.
If i > 1 and c_{i-1} = G_{i-1}, then

(G_i, P_i) = (g_i, p_i) Δ (G_{i-1}, P_{i-1}) (51)
(G_i, P_i) = (g_i, p_i) Δ (c_{i-1}, P_{i-1}) (52)
(G_i, P_i) = (g_i + p_i . c_{i-1}, p_i . P_{i-1}) (53)

thus

G_i = g_i + p_i . c_{i-1} (54)

and from (44) we have G_i = c_i.

Lemma 2: The operator Δ is associative.

Proof: For any (g_3, p_3), (g_2, p_2), (g_1, p_1) we have:

[(g_3, p_3) Δ (g_2, p_2)] Δ (g_1, p_1) = (g_3 + p_3 . g_2, p_3 . p_2) Δ (g_1, p_1) = (g_3 + p_3 . g_2 + p_3 . p_2 . g_1, p_3 . p_2 . p_1) (55)

and

(g_3, p_3) Δ [(g_2, p_2) Δ (g_1, p_1)] = (g_3, p_3) Δ (g_2 + p_2 . g_1, p_2 . p_1) = (g_3 + p_3 . (g_2 + p_2 . g_1), p_3 . p_2 . p_1) (56)

One can check that expressions (55) and (56) are equal using the distributivity of . over +.

To compute the c_i's it is only necessary to compute all the (G_i, P_i)'s, and by Lemmas 1 and 2,

(G_i, P_i) = (g_i, p_i) Δ (g_{i-1}, p_{i-1}) Δ ... Δ (g_1, p_1) (57)

can be evaluated with any grouping of the operations from the given g_i's and p_i's. The motivation for introducing the operator Δ is to generate the carries in parallel. The carries are generated in a carry chain block, and the sum is obtained directly from all the carries and p_i's, since

s_i = p_i ++ c_{i-1} for i = 1, ..., n (58)

THE ADDER

Based on the previous reformulation of the carry computation, Brent and Kung proposed a scheme to add two n-bit numbers in a time proportional to log(n) and in an area proportional to n.log(n), for n greater than or equal to 2. Figure 6.21 shows how the carries are computed in parallel for 16-bit numbers.

Figure-6.21: The first binary tree allowing the calculation of c1, c2, c4, c8, c16.

Using this binary tree approach, only the c_i's where i = 2^k (k = 0, 1, ..., log2(n)) are computed. The missing c_i's have to be computed using another tree structure, but this time the root of the tree is inverted (see Figure 6.22). In Figure 6.21 and Figure 6.22 the squares represent a cell which performs equation (47).
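The operator of equation (47) and the two lemmas can be checked with a few lines of code. The sketch below is only a behavioural illustration: the names `delta`, `prefix_carries`, `tree_reduce` and `add_via_prefix` are hypothetical, bit lists are LSB-first (list index 0 holds bit 1 of the text), and the balanced grouping in `tree_reduce` is one possible grouping, not the exact wiring of Figures 6.21 and 6.22.

```python
def delta(left, right):
    """(g, p) operator (g', p') = (g + p.g', p.p'), as in equation (47)."""
    g, p = left
    g2, p2 = right
    return (g | (p & g2), p & p2)


def prefix_carries(a_bits, b_bits):
    """Carries c_1..c_n obtained as G_i of Lemma 1 (sequential evaluation)."""
    pairs = [(a & b, a ^ b) for a, b in zip(a_bits, b_bits)]  # (g_i, p_i)
    carries, acc = [], None
    for pair in pairs:
        acc = pair if acc is None else delta(pair, acc)
        carries.append(acc[0])
    return carries


def tree_reduce(pairs):
    """Same product as equation (57), evaluated with a balanced grouping.

    `pairs` lists (g_i, p_i) from bit 1 upwards.
    """
    if len(pairs) == 1:
        return pairs[0]
    mid = len(pairs) // 2
    low, high = tree_reduce(pairs[:mid]), tree_reduce(pairs[mid:])
    return delta(high, low)            # higher-order bits stand on the left


def add_via_prefix(a, b, n):
    """Add two unsigned n-bit integers using s_i = p_i ++ c_{i-1}."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    c = [0] + prefix_carries(a_bits, b_bits)      # c[0] is c_0 = 0
    s = [(a_bits[i] ^ b_bits[i]) ^ c[i] for i in range(n)]
    s.append(c[n])                                # s_{n+1} = c_n
    return sum(bit << i for i, bit in enumerate(s))
```

Because the operator is associative, `tree_reduce` returns the same (G_n, P_n) as the sequential evaluation, which is what allows the carries to be computed by a tree of cells rather than a chain.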
Circles represent a duplication cell where the inputs are separated into two distinct wires (see Figure 6.23). When using this structure of two separate binary trees, the addition of two 16-bit numbers is performed in T = 9 stages of cells. During this time, all the carries are computed, in the time necessary to traverse two independent binary trees.

According to Burks, Goldstine and Von Neumann, the fastest way to add two operands takes a time proportional to the logarithm of the number of bits. Brent and Kung have achieved such a result.

Figure-6.22: Computation of all the carries for n = 16

Figure-6.23: (a) The Δ cell, (b) the duplication cell

6.5.5 The C3i Adder

THE ALGORITHM

Let a_i and b_i be the digits of A and B, two n-bit numbers with i = 1, 2, ..., n. The carries are computed according to equation (59), with:

G_i = c_i (60)

Developing (59) gives equation (61). By introducing a parameter m less than or equal to n such that there exists an integer q with n = q.m, it is possible to obtain the couple (G_i, P_i) by forming groups of m cells performing the intermediate operations detailed in (62) and (63).

This manner of computing the carries is strictly based on the fact that the operator Δ is associative. It also shows that the calculation is performed sequentially, i.e. in a time proportional to the number of bits n. We will now illustrate this analytical approach by giving a way to build an architectural layout of this new algorithm, using a graphical method to place the cells defined in the previous paragraph [Kowa92].

THE GRAPHICAL CONSTRUCTION

1. First build a binary tree of cells.
2. Duplicate this binary tree m times to the right (m is a power of two; see Remark 1 below if m is not a power of two). The cells at the right of bit 1 determine the least significant bit (LSB).
3. Eliminate the cells at the right side of the LSB. Change the cells not connected to anything into duplication cells. Eliminate all cells under the second row of cells, except the rightmost group of m cells.
4. Duplicate q times to the right, incrementing the row downwards, the only group of m cells left after step 3. This gives a visual representation of the delay read in Figure 6.29.
5. Shift up the q groups of cells to get a compact representation of a "floorplan".

This complete approach is illustrated in Figure 6.24, where all the steps are carefully observed. The only cells still needed for this carry generation block to constitute a real parallel adder are the cells performing equations (45) and (46). The first row of functions is put at the top of the structure. The second one is pasted at the bottom.

Figure-6.24: (a) Step 1, (b) Step 2, (c) Step 3 and Step 4, (d) Step 5

At this point, two remarks have to be made about the definition of this algorithm. Both concern the parameter m used to define the algorithm. Remark 1 covers the case where m is not equal to 2^q (q in [0, 1, ...]), whereas Remark 2 deals with the case where m = n.

Figure-6.25: Adder where m=6. The fan-out of the 11th carry bit is highlighted

Remark 1: For m not a power of two, the algorithm is built the same way up to the very last step. The only difference concerns the delay, which is equal to the delay obtained for the next power of two. This means that there is no special interest in building such versions of these adders. The fan-out of certain cells is even increased to three, so that the electrical behaviour is degraded. Figure 6.25 illustrates the design of such an adder based on m = 6. The fan-out of the cell of bit 11 is three. The delay of this adder is equivalent to the delay of an adder with a duplication with m = 8.
Remark 2: For m equal to the number of bits of the adder, the algorithm reaches the theoretical limit demonstrated by Burks, Goldstine and Von Neumann. The logarithmic time is attained using one depth of a binary tree instead of two in the case of Brent and Kung. This particular case is illustrated in Figure 6.26. The definition of the algorithm is followed up to Step 3: once the binary tree has been reproduced m times to the right, the only thing left to do is to remove the cells at the negative bit positions, and the adder is finished. Mathematically, one can notice that this is the limit. We will discuss later whether using m = n is the best way to build an adder.

Figure-6.26: Adder where m=n. This constitutes the theoretical limit for the computation of the addition.

COMPARISONS

In this section, we compare adders obtained using the new algorithm with different values of m. On the plots of Figure 6.27 through Figure 6.29, the suffixes JK2, JK4 and JK8 denote the adders obtained for m equal to two, four or eight. They are compared to the Brent and Kung (BK) implementation and to the theoretical limit, which is obtained when m equals n, the number of bits. The comparison between these architectures is done according to the formalisation of a computational model described in [Kowa93].

We clearly see that BK's algorithm performs the addition with a delay proportional to the logarithm of the number of bits. JK2 performs the addition in linear time, just as JK4 and JK8 do. The parameter m influences the slope of the delay: the higher m is, the longer the delay stays under the logarithmic delay of BK. When one wants to implement the addition faster than BK, there is therefore a choice to make among different values of m. The choice depends on the size of the adder, because a 24-bit JK2 adder (delay = 11 stages of cells) clearly performs worse than BK (delay = 7 stages of cells).
On the other hand, JK8 (delay = 5 stages of cells) is very attractive. Its delay is better than that of BK up to 57 bits, at which point both delays are equal. Furthermore, even at equal delays (up to 73 bits) our implementation performs better in terms of regularity, modularity and ease of construction. The strong advantage of this new algorithm compared to BK is that, for an input word size which is not a power of two, the design of the cells is much easier. There is no partial binary tree to build: adding a bit to the adder means adding a bit-slice, and this bit-slice is very compact and regular.

Let us now consider the case where m equals n (denoted by XXX on our figures). The delay of such an adder is exactly one half of that of BK, and it is the lowest bound we obtain. For small adders (n < 16), the delay is very close to XXX. It can also be demonstrated that the delays (always in terms of stages) of JK2, JK4 and JK8 are always at least equal to that of XXX.

This discussion took into account the following two characteristics of the computational model:

The gate of a stage computes a logical function in constant time;
The signal is divided into two signals in constant time (this occurs especially at the output of the first stage of cells).

The conclusion of this discussion is that m has to be chosen as high as possible to reduce the global delay.

Turning to the comparisons concerning the area, we take into account the following characteristics of our computational model:

At most two wires cross at any point;
A constant but predefined area of the gates, with minimum width for the wires, is used;
The computation is made in a convex planar region.

For this discussion let us consider Figure 6.28, which represents the area of the different adders versus the number of bits. Clearly, the smaller m is, the smaller the area.
As m increases up to n, we can see that the area remains proportional to the number of bits, following a straight line. For m equal to n, the area is exactly one half of the BK area, with a linear variation. The slope of this variation, in both the BK and XXX cases, varies according to the intervals [2^q, 2^{q+1}], where q = 0, 1, ... Here we could point out that the floorplan of BK could be optimised to become comparable to that of XXX, but the cost of such an implementation would be very high because of the irregularity of the wiring and the interconnections. These considerations lead us to the following conclusion: to minimise the area of a new adder, m must be chosen low. This contradicts the previous conclusion. That is why a careful choice of m is necessary, and it will always depend on the targeted application.

Finally, Figure 6.27 indicates the number of transistors used to implement our different versions of adders. These calculations are based on the dynamic logic family (TSPC: True Single Phase Clocking) described in [Kowa93]. When considering this graph, we see that BK and XXX are the two limits of the family of our adders. BK uses the smallest number of transistors, whereas XXX uses up to five times more. The higher m is, the higher the number of transistors. Nevertheless, we see that the area is smaller than that of BK. A high density is an advantage, but an overhead in transistors can lead to higher power dissipation. This evident drawback of our algorithm is counterbalanced by the progress being made in the VLSI area. With the shrinking of the design rules, the size of the transistors decreases, as does the size of the interconnections. This leads to smaller power dissipation. This effect is even more pronounced as technologies decrease the power supply from 5V to 3.3V.
In other words, the increase in the number of transistors corresponds to the redundancy we introduce in the calculations to decrease the delay of our adders.

We now discuss an important characteristic of our computational model that differs from the model of Brent and Kung:

The signal travels along a wire in a time proportional to its length.

The importance of this assumption is best seen with an example. Let us consider the 16-bit BK adder (Figure 6.22) and the 16-bit JK4 adder (Figure 6.24). The longest wire in the BK implementation is equal to at least eight widths of Δ-cells, whereas in the JK4 implementation the longest wire is equal to four widths of Δ-cells. For BK, the output capacitive load of a Δ-cell is variable, and a variable sizing of the cell is necessary. In our case, the parameter m defines a fixed library of Δ-cells used in the adder. The capacitive load is always limited to a fixed value, allowing all cells to be sized to a fixed value.

Figure-6.27: Number of transistors versus the number of bits

Figure-6.28: Area versus the number of bits

Figure-6.29: Delay in number of stages versus the number of bits in the adder

To partially conclude this section, an optimum must be defined when choosing to implement our algorithm. This optimum depends on the application for which the operator is to be used.

6.6 Multioperand Adders

6.6.1 General Principle

The goal is to add more than two operands at a time. This generally occurs in multiplication or filtering operations.

6.6.2 Wallace Trees

For this purpose, Wallace trees were introduced. The addition time grows like the logarithm of the number of bits. The simplest Wallace tree is the adder cell.
More generally, an n-input Wallace tree is an operator with n inputs and about log2(n) outputs, such that the value of the output word is equal to the number of "1"s in the input word. The input bits and the least significant bit of the output have the same weight (Figure 6.30). An important property of Wallace trees is that they may be constructed using adder cells. Furthermore, the number of adder-cell stages a signal has to traverse, and hence the delay, grows like the logarithm log2(n) of the number n of input bits. Consequently, Wallace trees are useful whenever a large number of operands are to be added, as in multipliers. In a Braun or Baugh-Wooley multiplier with a Ripple Carry Adder, the completion time of the multiplication is proportional to twice the number n of bits. If the partial products are collected through Wallace trees, the time to get the result in carry-save notation should be proportional to log2(n).

Figure-6.30: Wallace cells made of adders

Figure 6.31 represents a 7-input adder: for each weight, Wallace trees are used until only two bits of each weight remain, which are then added using a classical 2-input adder. In terms of the regularity of the interconnections, Wallace trees are the most irregular.

Figure-6.31: A 7-input Wallace tree

6.6.3 Overturned Stairs Trees

To circumvent this irregularity, Mou [Mou91] proposes an alternative way to build multi-operand adders. The method uses basic cells called branch, connector or root. These basic elements (see Figure 6.32) are connected together to form n-input trees. One has to take care of the weights of the inputs, because in this case the weights at the inputs of the 18-input OS-tree are different. The regularity of this structure is better than that of Wallace trees, but the construction of multipliers is still complex.
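As an illustration of the counting property stated in Section 6.6.2 (the output word equals the number of 1s in the input word), the following sketch builds such a counter from adder cells only. It is one possible recursive arrangement, assumed here for illustration and not necessarily the wiring of Figure 6.30 or Figure 6.31; the function names are hypothetical and the input count is restricted to 2^k - 1 (1, 3, 7, 15, ...).

```python
def full_adder(x, y, z):
    """Adder cell: the sum keeps the input weight, the carry has twice that weight."""
    return x ^ y ^ z, (x & y) | (x & z) | (y & z)


def wallace_counter(bits):
    """Count the 1s among 2**k - 1 input bits using only adder cells.

    Returns the count as a list of bits, LSB first.
    """
    n = len(bits)
    assert n >= 1 and (n & (n + 1)) == 0, "input count must be 2**k - 1"
    if n == 1:
        return [bits[0]]
    half = (n - 1) // 2
    low = wallace_counter(bits[:half])             # count of the first half
    high = wallace_counter(bits[half:2 * half])    # count of the second half
    carry = bits[-1]                               # last input enters as carry-in
    out = []
    for x, y in zip(low, high):                    # add the two partial counts
        s, carry = full_adder(x, y, carry)
        out.append(s)
    out.append(carry)
    return out


def bits_to_int(bits):
    """Value of an LSB-first bit list."""
    return sum(bit << i for i, bit in enumerate(bits))
```

For seven inputs, `wallace_counter` returns three output bits; `bits_to_int` converts them back to an integer so that the result can be compared with a direct count of the 1s.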
Figure-6.32: Basic cells used to build OS-trees

Figure-6.33: An 18-input OS-tree

6.7 Multiplication

6.7.1 Introduction

Multiplication can be considered as a series of repeated additions. The number to be added is the multiplicand, the number of times that it is added is the multiplier, and the result is the product. Each step of the addition generates a partial product. In most computers, the operands usually contain the same number of bits. When the operands are interpreted as integers, the product is generally twice the length of the operands in order to preserve the information content. The repeated addition method suggested by the arithmetic definition is so slow that it is almost always replaced by an algorithm that makes use of positional number representation.

It is possible to decompose multipliers into two parts. The first part is dedicated to the generation of partial products, and the second one collects and adds them. As for adders, it is possible to enhance the intrinsic performance of multipliers. In the generation part, the Booth (or modified Booth) algorithm is often used because it reduces the number of partial products. The collection of the partial products can then be made using a regular array, a Wallace tree or a binary tree [Sinh89].

Figure-6.34: Partial product representation and multioperand addition

6.7.2 Booth Algorithm

This algorithm is a powerful direct algorithm for signed-number multiplication. It generates a 2n-bit product and treats both positive and negative numbers uniformly. The idea is to reduce the number of additions to perform. The Booth algorithm allows n/2 additions in the best case, whereas the modified Booth algorithm always allows n/2 additions.

Let us consider a string of k consecutive 1s in a multiplier:

position: ..., i+k, i+k-1, i+k-2, ..., i, i-1, ...
bit:      ..., 0,   1,     1,     ..., 1, 0,   ...

By using the following property of binary strings:

2^{i+k} - 2^i = 2^{i+k-1} + 2^{i+k-2} + ... + 2^{i+1} + 2^i

the k consecutive 1s can be replaced by the following string:

position: ..., i+k+1, i+k, i+k-1, i+k-2, ..., i+1, i,  i-1, ...
digit:    ..., 0,     1,   0,     0,     ..., 0,   -1, 0,   ...

that is, k-1 consecutive 0s between an addition (the digit 1 at position i+k) and a subtraction (the digit -1 at position i).

In fact, the modified Booth algorithm converts a signed number from the standard 2's-complement radix into a number system where the digits are in the set {-1,0,1}. In this number system, any number may be written in several forms, so the system is called redundant.

The coding table for the modified Booth algorithm is given in Table 6.8. The algorithm scans strings composed of three digits. Depending on the value of the string, a certain operation is performed. A possible implementation of the Booth encoder is given in Figure 6.35. The layout of another possible structure is given in Figure 6.36.

BIT (weight)
2^1      2^0   2^-1
Y_{i+1}  Y_i   Y_{i-1}   OPERATION                                               M is multiplied by
0        0     0         add zero (no string)                                    +0
0        0     1         add multiplicand (end of string)                        +X
0        1     0         add multiplicand (a string)                             +X
0        1     1         add twice the multiplicand (end of string)              +2X
1        0     0         subtract twice the multiplicand (beginning of string)   -2X
1        0     1         subtract the multiplicand (-2X and +X)                  -X
1        1     0         subtract the multiplicand (beginning of string)         -X
1        1     1         subtract zero (center of string)                        -0

Table-6.8: Modified Booth coding table.

Figure-6.35: Booth encoder cell

Figure-6.36: Booth encoder cell (layout size: 65.70 µm2 (0.5µCMOS))

6.7.3 Serial-Parallel Multiplier

This multiplier is the simplest one: the multiplication is considered as a succession of additions.
If A = (a_n a_{n-1} ... a_0) and B = (b_n b_{n-1} ... b_0), the product A.B is expressed as:

A.B = A.2^n.b_n + A.2^{n-1}.b_{n-1} + ... + A.2^0.b_0

The structure of Figure 6.37 is suited only for positive operands. If the operands are negative and coded in 2's complement:

1. The most significant bit of B has a negative weight, so a subtraction has to be performed at the last step.
2. Operand A.2^k must be written on 2N bits, so the most significant bit of A must be duplicated. It may be easier to shift the content of the accumulator to the right instead of shifting A to the left.

Figure-6.37: Serial-Parallel multiplier

6.7.4 Braun Parallel Multiplier

The simplest parallel multiplier is the Braun array. All the partial products A.b_k are computed in parallel, then collected through a cascade of Carry Save Adders. At the bottom of the array, the output is in carry-save notation, so an additional adder converts it (by means of a carry propagation) into the classical notation (Figure 6.38). The completion time is limited by the depth of the carry-save array and by the carry propagation in the adder. Note that this multiplier is only suited for positive operands. Negative operands may be multiplied using a Baugh-Wooley multiplier.

Figure-6.38: A 4-bit Braun Multiplier without the final adder

Figure 6.38 and Figure 6.40 use the symbols given in Figure 6.39, where CMUL1 and CMUL2 are two generic cells consisting of an adder without the final inverter and with one input connected to an AND or NAND gate. A non-optimised (in terms of transistors) multiplier would consist only of adder cells connected to one another, with AND gates generating the partial products. In these examples, the inverters at the output of the adders have been eliminated, and the parity of the bits has been compensated by the use of CMUL1 or CMUL2.
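Before moving on to the Baugh-Wooley scheme, the two signed multiplication ideas seen so far (the shift-and-add scheme of Section 6.7.3, with a subtraction at the last step, and the modified Booth coding of Section 6.7.2) can be cross-checked with a short behavioural sketch. The function names are hypothetical, operands are ordinary Python integers assumed to fit in n bits of 2's complement, n is assumed even for the Booth recoding, and nothing here models the hardware cells.

```python
def shift_add_multiply(a, b, n):
    """Product of two n-bit 2's-complement operands by successive additions."""
    acc = 0
    for k in range(n):
        if (b >> k) & 1:
            partial = a << k              # Python integers sign-extend A implicitly
            # The MSB of B has a negative weight: subtract at the last step.
            acc = acc - partial if k == n - 1 else acc + partial
    return acc


def booth_digits(y, n):
    """Modified Booth recoding of an n-bit 2's-complement y (n even).

    Scans overlapping triples (Y_{i+1}, Y_i, Y_{i-1}) and returns one digit
    in {-2, -1, 0, +1, +2} per pair of bits, least significant first.
    """
    digits = []
    prev = 0                              # Y_{-1} = 0
    for i in range(0, n, 2):
        y_i = (y >> i) & 1
        y_i1 = (y >> (i + 1)) & 1
        digits.append(-2 * y_i1 + y_i + prev)   # value given by the coding table
        prev = y_i1
    return digits


def booth_multiply(x, y, n):
    """Sum of n/2 partial products, each a multiple 0, +-X or +-2X of x."""
    return sum((d * x) << (2 * j) for j, d in enumerate(booth_digits(y, n)))
```

With n = 4, `booth_digits(6, 4)` gives the digits -2 and +2 (that is, -2 + 2.4 = 6), so only two partial products are needed instead of four.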
Figure-6.40: An 8-bit Braun Multiplier without the final adder

6.7.5 Baugh-Wooley Multiplier

This technique was developed in order to design regular multipliers suited for 2's-complement numbers. Let us consider two numbers A and B, given by equations (64) and (65). The product A.B is given by equation (66).

We see that subtractor cells must be used. In order to use only adder cells, the negative terms may be rewritten as in equation (67). In this way, A.B becomes equation (68). The final equation is (69), because of (70).

A and B are n-bit operands, so their product is a 2n-bit number. Consequently, the most significant weight is 2^{2n-1}, and the first term -2^{2n-1} is taken into account by adding a 1 in the most significant cell of the multiplier.

Figure 6.41 shows a 4-bit Baugh-Wooley multiplier.

Figure-6.41: A 4-bit Baugh-Wooley Multiplier with the final adder

6.7.6 Dadda Multiplier

The advantage of this method is the higher regularity of the array. Signed integers can be processed. The cost of this regularity is the addition of an extra column of adders.

Figure-6.42: A 4-bit Baugh-Wooley Multiplier with the final adder

6.7.7 Mou's Multiplier

In Figure 6.43 the scheme using OS-trees is used in a 4-bit multiplier. The partial product generation is done according to Dadda multiplication. Figure 6.44 represents the OS-tree structure used in a 16-bit multiplier. Although the author claims a better regularity, his scheme does not allow easy pipelining.
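Returning to the Baugh-Wooley rewriting of Section 6.7.5, the idea of replacing the negative terms by complemented partial products plus constant terms can be checked numerically. The sketch below uses one commonly quoted form of the rewriting (complemented cross terms, a constant 2^n and the constant -2^{2n-1}); this form is an assumption and may be arranged differently from equations (67) to (69). The function names are hypothetical.

```python
def bits_of(value, n):
    """The n 2's-complement bits of value, LSB first."""
    return [(value >> i) & 1 for i in range(n)]


def baugh_wooley_product(a, b, n):
    """Signed product of two n-bit operands using only additions of bit products."""
    x, y = bits_of(a, n), bits_of(b, n)
    # Constant terms left over after complementing the negative terms.
    total = -(1 << (2 * n - 1)) + (1 << n)
    # Product of the two sign bits.
    total += (x[n - 1] & y[n - 1]) << (2 * n - 2)
    for i in range(n - 1):
        # Ordinary positive partial products.
        for j in range(n - 1):
            total += (x[i] & y[j]) << (i + j)
        # Complemented cross terms replace the two subtractions.
        total += (1 - (x[n - 1] & y[i])) << (n - 1 + i)
        total += (1 - (x[i] & y[n - 1])) << (n - 1 + i)
    return total
```

Since -2^{2n-1} and +2^{2n-1} are congruent modulo 2^{2n}, the leading constant can be realised on 2n bits by adding a 1 in the most significant position, which matches the remark made in the text.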
Figure-6.43: A 4-bit OS-tree Multiplier with a final adder

Figure-6.44: A 16-bit OS-tree Multiplier without a final adder and without the partial product cells

6.7.8 Logarithmic Multiplier

The objective of this circuit is to compute the product of two terms. The property used is the following equation:

Log(A * B) = Log(A) + Log(B) (71)

There are several ways to obtain the logarithm of a number: look-up tables, recursive algorithms or the segmentation of the logarithmic curve [Hoef91].

The segmentation method: The basic idea is to approximate the logarithm curve with a set of linear segments. If

y = Log2(x) (72)

an approximation of this value on the segment [2^n, 2^{n+1}[ can be made using the following equation:

y = a.x + b = (Δy / Δx).x + b = [1 / (2^{n+1} - 2^n)].x + n - 1 = 2^{-n}.x + (n - 1) (73)

What is the hardware interpretation of this formula? If we take x_i = (x_i7, x_i6, x_i5, x_i4, x_i3, x_i2, x_i1, x_i0), an integer coded with 8 bits, its logarithm is obtained as follows. The integer part of the logarithm is the position n where the MSB (the most significant 1) occurs, and the fractional part is obtained by shifting x_i n positions to the right, which leaves the bits below the MSB after the binary point. For instance, if x_i is (0,0,1,0,1,1,1,0) = 46, the integer part of the logarithm is 5, because the MSB is x_i5, and the fractional part is 01110. So the logarithm of x_i equals 101.01110 = 5.4375, because 01110 is 14 out of a possible 32, and 14/32 = 0.4375. Table 6.9 illustrates this coding.

Once the coding of two linear words has been performed, the addition of the two logarithms can be done. The last operation to be performed is the antilogarithm of the sum, to obtain the value of the final product.

Using this method, an 11.6% error on the product of two binary operands (i.e. the sum of two logarithmic numbers) occurs. We would like to reduce this error without increasing the complexity of the operation or of the operator.
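The segmentation method and the worked example above can be reproduced with a short numerical sketch. It is a floating-point model, assumed here for illustration only: the function names are hypothetical, no correction term is applied, and the fixed-point word lengths of the actual coder are not modelled.

```python
def log2_segment(x):
    """Piecewise-linear log2 of a positive integer, following equation (73).

    Integer part: position n of the most significant 1.
    Fractional part: the bits below the MSB, read as a fraction.
    """
    n = x.bit_length() - 1
    return n + (x - (1 << n)) / (1 << n)


def antilog2_segment(y):
    """Inverse of the same piecewise-linear curve, for y >= 0."""
    n = int(y)                          # the integer part selects the segment
    return (1 << n) * (1 + (y - n))


def log_multiply(a, b):
    """Approximate a*b as antilog(log(a) + log(b))."""
    return antilog2_segment(log2_segment(a) + log2_segment(b))


def max_relative_error(limit=256):
    """Largest relative error of log_multiply over operands 1..limit-1."""
    worst = 0.0
    for a in range(1, limit):
        for b in range(1, limit):
            exact = a * b
            worst = max(worst, abs(exact - log_multiply(a, b)) / exact)
    return worst
```

Here `log2_segment(46)` reproduces the value 5.4375 of the example, and products of exact powers of two are recovered without error; `max_relative_error()` can be used to measure the worst case of this uncorrected model over 8-bit operands.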
Since the transformations used in this system are logarithms and antilogarithms, it is natural to think that the complexity of the correction systems will grow exponentially as the error approaches zero. We analyze the error to derive an easy and effective way to increase the accuracy of the result.

Table-6.9: Coding of the binary logarithm according to the segmentation method

Figure 6.45 describes the architecture of the logarithmic multiplier with the different variables used in the system.

Figure-6.45: Block diagram of a logarithmic multiplier

Error analysis: Let us define the different functions used in this system. The logarithm and antilogarithm curves are approximated by linear segments. Each segment starts at a power of two and ends at the next power of two. Figure 6.46 shows how a logarithm is approximated. The same is true for the antilogarithm.

Figure-6.46: Approximated value of the logarithm compared to the exact logarithm

By adding the unique value 17*2^{-8} to the two logarithms, the maximum error comes down from 11.6% to 7.0%, an improvement of 40% compared with a system without any correction. The only cost is the replacement of the internal two-input adder by a three-input adder.

A more complex correction system, which leads to better precision but at a much higher hardware cost, is possible. In Table 6.10 we suggest a system which would choose one correction among three, depending on the value of the input bits. Table 6.10 can be read as the values of the logarithms obtained after the coder for either a1 or a2. The penultimate column represents the ideal correction which should be added to get 100% accuracy.
The last column gives the correction chosen among three possibilities: 32, 16 or 0. Three decoding functions have to be implemented for this proposal. If the exclusive -OR of a-2 and a-3 is true, then the added value is 32*2-8. If all the bits of the decimal part are zero, then the added value is zero. In all other cases the added value is 16*2-8. This decreases the average error. But the drawback is that the maximum error will be minimized only if the steps between two ideal corrections are bigger than the unity step. To minimize the maximum error the correcting functions should increase in an exponential way. Further research could be performed in this area. 5/1/2007 10:53 AM Design of VLSI Systems 32 of 34 file:///D:/ACADEMIC%20DOCUMENTS/eBOOKs/vlsi%20design/Des... Table-6.10: A more complex correction scheme 6.8 Addition and Multiplication in Galois Fields, GF(2n) The group theory is used to introduce another algebraic system, called a field. A field is a set of elements in which we can do addition, subtraction, multiplication and division without leaving the set. Addition and multiplication must satisfy the commutative, associative, and distributive laws. A formal definition of a field is given below. Definition Let F be a set of elements on which two binary operations called addition "+" and multiplication".", are defined. The set F together with the two binary operations + and . is a field if the following conditions are satisfied: 1. F is a commutative group under addition +. The identity element with respect to addition is called the zero element or the additive identity of F and is denoted by 0. 2. The set of nonzero elements in F is a commutative group under multiplication . .The identity element with respect to multiplication is called the unit element or the multiplicative identity of F and is denoted 1. 3. Multiplication is distributive over addition; that is, for any three elements, a, b, c in F: a . ( b + c ) = a . b + a . 
c

The number of elements in a field is called the order of the field. A field with a finite number of elements is called a finite field. Let us consider the set {0,1} together with modulo-2 addition and multiplication. We can easily check that the set {0,1} is a field of two elements under modulo-2 addition and modulo-2 multiplication. This field is called a binary field and is denoted by GF(2). The binary field GF(2) plays an important role in coding theory [Rao74] and is widely used in digital computers and data transmission or storage systems.

Another example, using the residue number system [Garn59], is given below. Table 6.11 lists the values of N from 0 to 29 together with their representation as residues with respect to the base (5, 3, 2). The addition and multiplication of two terms in this base can be performed according to the next example:

Table-6.11: N varying from 0 to 29 and its representation in the residue number system

The most interesting property of these systems is that there is no carry propagation inside the set. This can be attractive when implementing these operators in VLSI.

References

[Aviz61] Avizienis, Signed-Digit Number Representations For Fast Parallel Arithmetic, IRE Trans. Electron. Comput., Vol. EC-10, pp. 389-400, 1961.
[Cava83] J. J. F. Cavanagh, Digital Computer Arithmetic, McGraw-Hill computer sciences series, 1983.
[Garn59] H. L. Garner, The Residue Number System, IRE Transactions on Elec. Comput., pp. 140-147, September 1959.
[Hoef91] B. Hoefflinger, M. Selzer, F. Warkowski, Digital Logarithmic CMOS Multiplier for Very-High Speed Signal Processing, in Proc. IEEE Custom Integrated Circuit Conference, 1991, pp. 16.7.1-16.7.5.
[Kowa92] J. Kowalczuk and D. Mlynek, Un Nouvel Algorithme De Generation D'Additionneurs Rapides Dédiés
Au traitement d'images, Proceeding of the Industrial Automation Conference, pp. 20.9-20.13, Montreal, Québec, Canada, June 1992.
[Kowa93] J. Kowalczuk, On the Design and implementation of algorithms for Multimedia Systems, PhD Thesis No 1188, Swiss Federal Institute of Techn., Lausanne 1994.
[Mull82] J.M. Muller, Arithmétique des ordinateurs, opérateurs et fonctions élémentaires, Masson 1989.
[Rao74] T. R. N. Rao, Error Coding for Arithmetic Processors, New York, Academic Press, 1974.
[Sinh89] B. P. Sinha and P. K. Srimani, Fast Parallel Algorithms for Binary Multiplication and Their Implementation on Systolic Architectures, IEEE Transactions on Computers, Vol. 38, No. 3, pp. 424-431, March 1989.
[Taka87] Y. Harata, Y. Nakamura, H. Nagase, M. Takigawa and N. Takagi, A High-Speed Multiplier Using a Redundant Binary Adder Tree, IEEE Journal of Solid States Circuits, Vol. SC-22, No. 1, pp. 28-34, February 1987.

This chapter edited by D. Mlynek

Chapter 7 LOW-POWER VLSI CIRCUITS AND SYSTEMS

Introduction
Overview of Power Consumption
Low-Power Design Through Voltage Scaling
Estimation and Optimization of Switching Activity
Reduction of Switched Capacitance
Adiabatic Logic Circuits

7.1 Introduction

The increasing prominence of portable systems and the need to limit power consumption (and hence, heat dissipation) in very-high density ULSI chips have led to rapid and innovative developments in low-power design in recent years. The driving forces behind these developments are portable applications requiring low power dissipation and high throughput, such as notebook computers, portable communication devices and personal digital assistants (PDAs). In most of these cases, the requirements of low power consumption must be met along with equally demanding goals of high chip density and high throughput.
Hence, low-power design of digital integrated circuits has emerged as a very active and rapidly developing field of CMOS design. The limited battery lifetime typically imposes very strict demands on the overall power consumption of the portable system. Although new rechargeable battery types such as Nickel-Metal Hydride (NiMH) are being developed with higher energy capacity than that of the conventional Nickel-Cadmium (NiCd) batteries, revolutionary increase of the energy capacity is not expected in the near future. The energy density (amount of energy stored per unit weight) offered by the new battery technologies (e.g., NiMH) is about 30 Watt-hour/pound, which is still low in view of the expanding applications of portable systems. Therefore, reducing the power dissipation of integrated circuits through design improvements is a major challenge in portable systems design. The need for low-power design is also becoming a major issue in high-performance digital systems, such as microprocessors, digital signal processors (DSPs) and other applications. Increasing chip density and higher operating speed lead to the design of very complex chips with high clock frequencies. Typically, the power dissipation of the chip, and thus, the temperature, increase linearly with the clock frequency. Since the dissipated heat must be removed effectively to keep the chip temperature at an acceptable level, the cost of packaging, cooling and heat removal becomes a significant factor. Several high-performance microprocessor chips designed in the early 1990s (e.g., Intel Pentium, DEC Alpha, PowerPC) operate at clock frequencies in the range of 100 to 300 MHz, and their typical power consumption is between 20 and 50 W. ULSI reliability is yet another concern which points to the need for low-power design. There is a close correlation between the peak power dissipation of digital circuits and reliability problems such as electromigration and hot-carrier induced device degradation. 
Also, the thermal stress caused by heat dissipation on chip is a major reliability concern. Consequently, the reduction of power consumption is also crucial for reliability enhancement.

The methodologies which are used to achieve low power consumption in digital systems span a wide range, from device/process level to algorithm level. Device characteristics (e.g., threshold voltage), device geometries and interconnect properties are significant factors in lowering the power consumption. Circuit-level measures such as the proper choice of circuit design styles, reduction of the voltage swing and clocking strategies can be used to reduce power dissipation at the transistor level. Architecture-level measures include smart power management of various system blocks, utilization of pipelining and parallelism, and design of bus structures. Finally, the power consumed by the system can be reduced by a proper selection of the data processing algorithms, specifically to minimize the number of switching events for a given task.

In this chapter, we will primarily concentrate on the circuit- or transistor-level design measures which can be applied to reduce the power dissipation of digital integrated circuits. Various sources of power consumption will be discussed in detail, and design strategies will be introduced to reduce the power dissipation. The concept of adiabatic logic will be given a special emphasis since it emerges as a very effective means for reducing the power consumption.

7.2 Overview of Power Consumption

In the following, we will examine the various sources (components) of time-averaged power consumption in CMOS circuits.
The average power consumption in conventional CMOS digital circuits can be expressed as the sum of three main components, namely, (1) the dynamic (switching) power consumption, (2) the short-circuit power consumption, and (3) the leakage power consumption. If the system or chip includes circuits other than conventional CMOS gates that have continuous current paths between the power supply and the ground, a fourth (static) power component should also be considered. We will limit our discussion to the conventional static and dynamic CMOS logic circuits.

Switching Power Dissipation

This component represents the power dissipated during a switching event, i.e., when the output node voltage of a CMOS logic gate makes a power-consuming transition. In digital CMOS circuits, dynamic power is dissipated when energy is drawn from the power supply to charge up the output node capacitance. During the charge-up phase, the output node voltage typically makes a full transition from 0 to VDD, and the energy used for the transition is relatively independent of the function performed by the circuit. To illustrate the dynamic power dissipation during switching, consider the circuit example given in Fig. 7.1. Here, a two-input NOR gate drives two NAND gates, through interconnection lines. The total capacitive load at the output of the NOR gate consists of (1) the output capacitance of the gate itself, (2) the total interconnect capacitance, and (3) the input capacitances of the driven gates.

Figure-7.1: A NOR gate driving two NAND gates through interconnection lines.

The output capacitance of the gate consists mainly of the junction parasitic capacitances, which are due to the drain diffusion regions of the MOS transistors in the circuit. The important aspect to emphasize here is that the amount of capacitance is approximately a linear function of the junction area.
Consequently, the size of the total drain diffusion area dictates the amount of parasitic capacitance. The interconnect lines between the gates contribute to the second component of the total capacitance. The estimation of parasitic interconnect capacitance was discussed thoroughly in Chapter 4. Note that especially in sub-micron technologies, the interconnect capacitance can become the dominant component, compared to the transistor-related capacitances. Finally, the input capacitances are mainly due to gate oxide capacitances of the transistors connected to the input terminal. Again, the amount of the gate oxide capacitance is determined primarily by the gate area of each transistor.

Figure-7.2: Generic representation of a CMOS logic gate for switching power calculation.

Any CMOS logic gate making an output voltage transition can thus be represented by its nMOS network, pMOS network, and the total load capacitance connected to its output node, as seen in Fig. 7.2. The average power dissipation of the CMOS logic gate, driven by a periodic input voltage waveform with ideally zero rise- and fall-times, can be calculated from the energy required to charge up the output node to VDD and to discharge the total output load capacitance to ground level.

(7.1)

Evaluating this integral yields the well-known expression for the average dynamic (switching) power consumption in CMOS logic circuits.

(7.2)

or

(7.3)

Note that the average switching power dissipation of a CMOS gate is essentially independent of all transistor characteristics and transistor sizes. Hence, given an input pattern, the switching delay times have no relevance to the amount of power consumption during the switching events as long as the output voltage swing is between 0 and VDD.
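The expressions numbered (7.1) to (7.3) are not legible in this copy. A sketch of the standard derivation that is consistent with the surrounding description is given below. The symbols $C_{load}$, $T$ and $f_{CLK} = 1/T$, and the convention that the output falls during the first half period and rises during the second, are assumptions made for this illustration.

```latex
\begin{align*}
P_{avg} &= \frac{1}{T}\left[\int_{0}^{T/2} V_{out}\left(-C_{load}\frac{dV_{out}}{dt}\right)dt
         + \int_{T/2}^{T}\left(V_{DD}-V_{out}\right)\left(C_{load}\frac{dV_{out}}{dt}\right)dt\right] \\
        &= \frac{1}{T}\left[\frac{C_{load}V_{DD}^{2}}{2} + \frac{C_{load}V_{DD}^{2}}{2}\right]
         = \frac{1}{T}\,C_{load}\,V_{DD}^{2}
         = C_{load}\,V_{DD}^{2}\,f_{CLK}
\end{align*}
```

Each half period contributes $C_{load}V_{DD}^{2}/2$: once in the nMOS network while the load is discharged, and once in the pMOS network while it is charged. No transistor parameter appears in the result, which is the independence property noted in the text.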
Equation (7.3) shows that the average dynamic power dissipation is proportional to the square of the power supply voltage; hence, any reduction of VDD will significantly reduce the power consumption. Another way to limit the dynamic power dissipation of a CMOS logic gate is to reduce the amount of switched capacitance at the output. This issue will be discussed in more detail later. First, let us briefly examine the effect of reducing the power supply voltage VDD upon switching power consumption and dynamic performance of the gate.

Although the reduction of power supply voltage significantly reduces the dynamic power dissipation, the inevitable design trade-off is the increase of delay. This can be seen by examining the following propagation delay expressions for the CMOS inverter circuit.

(7.4)

Assuming that the power supply voltage is being scaled down while all other variables are kept constant, it can be seen that the propagation delay time will increase. Figure 7.3 shows the normalized variation of the delay as a function of VDD, where the threshold voltages of the nMOS and the pMOS transistor are assumed to be constant, VT,n = 0.8 V and VT,p = -0.8 V, respectively. The normalized variation of the average switching power dissipation as a function of the supply voltage is also shown on the same plot.

Figure-7.3: Normalized propagation delay and average switching power dissipation of a CMOS inverter, as a function of the power supply voltage VDD.

Notice that the dependence of circuit speed on the power supply voltage may also influence the relationship between the dynamic power dissipation and the supply voltage. Equation (7.3) suggests a quadratic improvement (reduction) of power consumption as the power supply voltage is reduced.
However, this interpretation assumes that the switching frequency (i.e., the number of switching events per unit time) remains constant. If the circuit is always operated at the maximum frequency allowed by its propagation delay, on the other hand, the number of switching events per unit time (i.e., the operating frequency) will drop as the propagation delay becomes larger with the reduction of the power supply voltage. The net result is that the dependence of switching power dissipation on the power supply voltage becomes stronger than a simple quadratic relationship, as shown in Fig. 7.3.

The analysis of switching power dissipation presented above is based on the assumption that the output node of a CMOS gate undergoes one power-consuming transition (0-to-VDD transition) in each clock cycle. This assumption, however, is not always correct; the node transition rate can be smaller than the clock rate, depending on the circuit topology, logic style and the input signal statistics. To better represent this behavior, we will introduce αT (node transition factor), which is the effective number of power-consuming voltage transitions experienced per clock cycle. Then, the average switching power consumption becomes

(7.5)

The estimation of switching activity and various measures to reduce its rate will be discussed in detail in Section 7.4. Note that in most complex CMOS logic gates, a number of internal circuit nodes also make full or partial voltage transitions during switching. Since there is a parasitic node capacitance associated with each internal node, these internal transitions contribute to the overall power dissipation of the circuit. In fact, an internal node may undergo several transitions while the output node voltage of the circuit remains unchanged, as illustrated in Fig. 7.4.
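As a quick numerical illustration of the node transition factor, the sketch below assumes that expression (7.5) has the form P = αT · Cload · VDD² · fCLK, i.e. the full-swing switching power scaled by αT. The function name and the example values (a 50 fF load and a 100 MHz clock) are hypothetical.

```python
def switching_power(c_load: float, v_dd: float, f_clk: float,
                    alpha_t: float = 1.0) -> float:
    """Average switching power in watts, assuming P = alpha_T * C_load * V_DD^2 * f_CLK."""
    if alpha_t < 0.0:
        raise ValueError("the node transition factor cannot be negative")
    return alpha_t * c_load * v_dd ** 2 * f_clk


if __name__ == "__main__":
    c_load, f_clk = 50e-15, 100e6      # hypothetical node: 50 fF at 100 MHz
    for v_dd in (5.0, 3.3, 2.0):
        for alpha_t in (1.0, 0.25):
            p = switching_power(c_load, v_dd, f_clk, alpha_t)
            print(f"VDD = {v_dd:.1f} V  alpha_T = {alpha_t:.2f}  P = {p * 1e6:.2f} uW")
```

At a fixed clock frequency, the supply voltage enters quadratically while the transition factor enters linearly, so lowering VDD from 5 V to 2 V has the same effect as reducing the switching activity to 16% of its original value.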
Figure-7.4: Switching of the internal node in a two-input NOR gate results in dynamic power dissipation even if the output node voltage remains unchanged.

In the most general case, the internal node voltage transitions can also be partial transitions, i.e., the node voltage swing may be only Vi, which is smaller than the full voltage swing of VDD. Taking this possibility into account, the generalized expression for the average switching power dissipation can be written as

(7.6)

where Ci represents the parasitic capacitance associated with each node and αTi represents the corresponding node transition factor associated with that node.

Short-Circuit Power Dissipation

The switching power dissipation examined above is purely due to the energy required to charge up the parasitic capacitances in the circuit, and the switching power is independent of the rise and fall times of the input signals. Yet, if a CMOS inverter (or a logic gate) is driven with input voltage waveforms with finite rise and fall times, both the nMOS and the pMOS transistors in the circuit may conduct simultaneously for a short amount of time during switching, forming a direct current path between the power supply and the ground, as shown in Fig. 7.5.

The current component which passes through both the nMOS and the pMOS devices during switching does not contribute to the charging of the capacitances in the circuit, and hence, it is called the short-circuit current component. This component is especially prevalent if the output load capacitance is small, and/or if the input signal rise and fall times are large, as seen in Fig. 7.5. Here, the input/output voltage waveforms and the components of the current drawn from the power supply are illustrated for a symmetrical CMOS inverter with small capacitive load. The nMOS transistor in the circuit starts conducting when the rising input voltage exceeds the threshold voltage VT,n.
The pMOS transistor remains on until the input reaches the voltage level (VDD - |VT,p|). Thus, there is a time window during which both transistors are turned on. As the output capacitance is discharged through the nMOS transistor, the output voltage starts to fall. The drain-to-source voltage drop of the pMOS transistor becomes nonzero, which allows the pMOS transistor to conduct as well. The short-circuit current is terminated when the input voltage transition is completed and the pMOS transistor is turned off. A similar event is responsible for the short-circuit current component during the falling input transition, when the output voltage starts rising while both transistors are on.

Note that the magnitude of the short-circuit current component will be approximately the same during both the rising-input transition and the falling-input transition, assuming that the inverter is symmetrical and the input rise and fall times are identical. The pMOS transistor also conducts the current which is needed to charge up the small output load capacitance, but only during the falling-input transition (the output capacitance is discharged through the nMOS device during the rising-input transition). This current component, which is responsible for the switching power dissipation of the circuit (current component to charge up the load capacitance), is also shown in Fig. 7.5. The average of both of these current components determines the total amount of power drawn from the supply.

For a simple analysis consider a symmetric CMOS inverter with k = kn = kp and VT = VT,n = |VT,p|, and with a very small capacitive load.
If the inverter is driven with an input voltage waveform with equal rise and fall times (t = trise = tfall), it can be derived that the time-averaged short-circuit current drawn from the power supply is

(7.7)

Figure-7.5: Input-output voltage waveforms, the supply current used to charge up the load capacitance and the short-circuit current in a CMOS inverter with small capacitive load. The total current drawn from the power supply is the sum of both current components.

Hence, the short-circuit power dissipation becomes

(7.8)

Note that the short-circuit power dissipation is linearly proportional to the input signal rise and fall times, and also to the transconductance of the transistors. Hence, reducing the input transition times will decrease the short-circuit current component.

Figure-7.6: Input-output voltage waveforms, the supply current used to charge up the load capacitance and the short-circuit current in a CMOS inverter with larger capacitive load and smaller input transition times. The total current drawn from the power supply is approximately equal to the charge-up current.

Now consider the same CMOS inverter with a larger output load capacitance and smaller input transition times. During the rising input transition, the output voltage will effectively remain at VDD until the input voltage completes its swing, and the output will start to drop only after the input has reached its final value. Although both the nMOS and the pMOS transistors are on simultaneously during the transition, the pMOS transistor cannot conduct a significant amount of current since the voltage drop between its source and drain terminals is approximately equal to zero.
Similarly, the output voltage will remain approximately equal to 0 V during a falling input transition and it will start to rise only after the input voltage completes its swing. Again, both transistors will be on simultaneously during the input voltage transition, yet the nMOS transistor will not be able to conduct a significant amount of current since its drain-to-source voltage is approximately equal to zero. This situation is illustrated in Fig. 7.6, which shows the simulated input and output voltage waveforms of the inverter as well as the short-circuit and dynamic current components drawn from the power supply. Notice that the peak value of the supply current to charge up the output load capacitance is larger in this case. The reason for this is that the pMOS transistor remains in saturation during the entire input transition, as opposed to the previous case shown in Fig. 7.5 where the transistor leaves the saturation region before the input transition is completed.

The discussion concerning the magnitude of the short-circuit current may suggest that the short-circuit power dissipation can be reduced by making the output voltage transition times larger and/or by making the input voltage transition times smaller. Yet this goal should be balanced carefully against other performance goals such as propagation delay, and the reduction of the short-circuit current should be considered as one of the many design requirements that must be satisfied by the designer.

Leakage Power Dissipation

The nMOS and pMOS transistors used in a CMOS logic gate generally have nonzero reverse leakage and subthreshold currents. In a CMOS VLSI chip containing a very large number of transistors, these currents can contribute to the overall power dissipation even when the transistors are not undergoing any switching event. The magnitude of the leakage currents is determined mainly by the processing parameters.
Of the two main leakage current components found in a MOSFET, the reverse diode leakage occurs when the pn-junction between the drain and the bulk of the transistor is reverse-biased. The reverse-biased drain junction then conducts a reverse saturation current which is eventually drawn from the power supply. Consider a CMOS inverter with a high input voltage, where the nMOS transistor is turned on and the output node voltage is discharged to zero. Although the pMOS transistor is turned off, there will be a reverse potential difference of VDD between its drain and the n-well, causing a diode leakage through the drain junction. The n-well region of the pMOS transistor is also reverse-biased with VDD, with respect to the p-type substrate. Therefore, another significant leakage current component exists due to the n-well junction (Fig. 7.7).

Figure-7.7: Reverse leakage current paths in a CMOS inverter with high input voltage.

A similar situation can be observed when the input voltage is equal to zero, and the output voltage is charged up to VDD through the pMOS transistor. Then, the reverse potential difference between the nMOS drain region and the p-type substrate causes a reverse leakage current which is also drawn from the power supply (through the pMOS transistor). The magnitude of the reverse leakage current of a pn-junction is given by the following expression

(7.9)

where Vbias is the magnitude of the reverse bias voltage across the junction, JS is the reverse saturation current density and A is the junction area. The typical magnitude of the reverse saturation current density is 1 to 5 pA/µm², and it increases quite significantly with temperature. Note that the reverse leakage occurs even during the stand-by operation when no switching takes place.
Hence, the power dissipation due to this mechanism can be significant in a large chip containing several million transistors.

Another component of leakage current which occurs in CMOS circuits is the subthreshold current, which is due to carrier diffusion between the source and the drain region of the transistor in weak inversion. An MOS transistor in the subthreshold operating region behaves similarly to a bipolar device, and the subthreshold current exhibits an exponential dependence on the gate voltage. The amount of the subthreshold current may become significant when the gate-to-source voltage is smaller than, but very close to, the threshold voltage of the device. In this case, the power dissipation due to subthreshold leakage can become comparable in magnitude to the switching power dissipation of the circuit. The subthreshold leakage current is illustrated in Fig. 7.8.

Figure-7.8: Subthreshold leakage current path in a CMOS inverter with high input voltage.

Note that the subthreshold leakage current also occurs when there is no switching activity in the circuit, and this component must be carefully considered for estimating the total power dissipation in the stand-by operation mode. The subthreshold current expression is given below, in order to illustrate the exponential dependence of the current on terminal voltages.

(7.10)

One relatively simple measure to limit the subthreshold current component is to avoid very low threshold voltages, so that the VGS of the nMOS transistor remains safely below VT,n when the input is logic zero, and the |VGS| of the pMOS transistor remains safely below |VT,p| when the input is logic one.

In addition to the three major sources of power consumption in CMOS digital integrated circuits discussed here, some chips may also contain components or circuits which actually consume static power.
One example is the pseudo-nMOS logic circuits which utilize a pMOS transistor as the pull-up device. The presence of such circuit blocks should also be taken into account when estimating the overall power dissipation of a complex system.

7.3 Low-Power Design Through Voltage Scaling

The switching power dissipation in CMOS digital integrated circuits is a strong function of the power supply voltage. Therefore, reduction of VDD emerges as a very effective means of limiting the power consumption. Given a certain technology, the circuit designer may utilize on-chip DC-DC converters and/or separate power pins to achieve this goal. As we have already discussed briefly in Section 7.2, however, the savings in power dissipation come at a significant cost in terms of increased circuit delay. When considering drastic reduction of the power supply voltage below the new standard of 3.3 V, the issue of time-domain performance should also be addressed carefully. In the following, we will examine reduction of the power supply voltage with a corresponding scaling of threshold voltages, in order to compensate for the speed degradation. At the system level, architectural measures such as the use of parallel processing blocks and/or pipelining techniques also offer very feasible alternatives for maintaining the system performance (throughput) despite aggressive reduction of the power supply voltage.

The propagation delay expression (7.4) clearly shows that the negative effect of reducing the power supply voltage upon delay can be compensated for, if the threshold voltage of the transistor is scaled down accordingly. However, this approach is limited due to the fact that the threshold voltage cannot be scaled to the same extent as the supply voltage. When scaled linearly, reduced threshold voltages allow the circuit to produce the same speed-performance at a lower VDD.
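The interplay between supply voltage, threshold voltage and delay can be explored numerically. Expression (7.4) is not legible in this copy, so the sketch below assumes a commonly used first-order inverter delay expression, delay = (Cload / k) / (VDD - VT) · [ 2·VT / (VDD - VT) + ln( 4·(VDD - VT) / VDD - 1 ) ], with Cload / k normalized to 1. That expression is only meaningful for VDD ≥ 2·VT, and the function name is hypothetical.

```python
import math


def inverter_delay(v_dd: float, v_t: float, c_over_k: float = 1.0) -> float:
    """First-order CMOS inverter delay (assumed form of the delay expression).

    c_over_k is the ratio C_load / k; with the default of 1.0 the result is
    a normalized delay, useful only for comparing operating points.
    """
    if v_t < 0.0 or v_dd <= 0.0 or v_dd < 2.0 * v_t:
        raise ValueError("the assumed expression requires V_DD >= 2 * V_T >= 0")
    drive = v_dd - v_t
    return (c_over_k / drive) * (2.0 * v_t / drive
                                 + math.log(4.0 * drive / v_dd - 1.0))


if __name__ == "__main__":
    reference = inverter_delay(5.0, 0.8)       # normalize to VDD = 5 V, VT = 0.8 V
    for v_t in (0.8, 0.5, 0.2):
        cells = ", ".join(
            f"VDD={v_dd:.1f} V: {inverter_delay(v_dd, v_t) / reference:.2f}"
            for v_dd in (5.0, 3.3, 2.0)
        )
        print(f"VT = {v_t:.1f} V -> normalized delay at {cells}")
```

Under this assumed model the delay rises steeply once VDD approaches a few multiples of VT, and lowering VT recovers much of the lost speed at low supply voltages.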
Figure 7.9 shows the variation of the propagation delay of a CMOS inverter as a function of the power supply voltage, and for different threshold voltage values.

Figure-7.9: Variation of the normalized propagation delay of a CMOS inverter, as a function of the power supply voltage VDD and the threshold voltage VT.

We can see, for example, that reducing the threshold voltage from 0.8 V to 0.2 V can improve the delay at VDD = 2 V by a factor of 2. The influence of threshold voltage reduction upon propagation delay is especially pronounced at low power supply voltages. It should be noted, however, that the threshold voltage reduction approach is restricted by concerns about noise margins and subthreshold conduction. Smaller threshold voltages lead to smaller noise margins for the CMOS logic gates. The subthreshold conduction current also sets a severe limitation against reducing the threshold voltage. For threshold voltages smaller than 0.2 V, leakage power dissipation due to subthreshold conduction may become a very significant component of the overall power consumption.

In certain types of applications, the reduction of circuit speed which comes as a result of voltage scaling can be compensated for at the expense of more silicon area. In the following, we will examine the use of architectural measures such as pipelining and hardware replication to offset the loss of speed at lower supply voltages.

Pipelining Approach

First, consider the single functional block shown in Fig. 7.10 which implements a logic function F(INPUT) of the input vector, INPUT. Both the input and the output vectors are sampled through register arrays, driven by a clock signal CLK.
Assume that the critical path in this logic block (at a power supply voltage of VDD) allows a maximum sampling frequency of fCLK; in other words, the maximum input-to-output propagation delay tP,max of this logic block is equal to or less than TCLK = 1/fCLK. Figure 7.10 also shows the simplified timing diagram of the circuit. A new input vector is latched into the input register array at each clock cycle, and the output data becomes valid with a latency of one cycle.

Figure-7.10: Single-stage implementation of a logic function and its simplified timing diagram.

Let Ctotal be the total capacitance switched every clock cycle. Here, Ctotal consists of (i) the capacitance switched in the input register array, (ii) the capacitance switched to implement the logic function, and (iii) the capacitance switched in the output register array. Then, the dynamic power consumption of this structure can be found as

(7.11)

Now consider an N-stage pipelined structure for implementing the same logic function, as shown in Fig. 7.11. The logic function F(INPUT) has been partitioned into N successive stages, and a total of (N-1) register arrays have been introduced, in addition to the original input and output registers, to create the pipeline. All registers are clocked at the original sample rate, fCLK. If all stages of the partitioned function have approximately equal delay of

(7.12)

then the logic blocks between two successive registers can operate N times slower while maintaining the same functional throughput as before. This implies that the power supply voltage can be reduced to a value of VDD,new, to effectively slow down the circuit by a factor of N. The supply voltage to achieve this reduction can be found by solving (7.4).

Figure-7.11: N-stage pipeline structure realizing the same logic function as in Fig. 7.10.
The maximum pipeline stage delay is equal to the clock period, and the latency is N clock cycles.

The dynamic power consumption of the N-stage pipelined structure with a lower supply voltage and with the same functional throughput as the single-stage structure can be approximated by

$$P_{pipeline} = \left[ C_{total} + (N-1)\, C_{reg} \right] V_{DD,new}^{2}\, f_{CLK} \tag{7.13}$$

where Creg represents the capacitance switched by each pipeline register. Then, the power reduction factor achieved in an N-stage pipeline structure is

$$\frac{P_{pipeline}}{P_{reference}} = \left[ 1 + \frac{C_{reg}}{C_{total}}\,(N-1) \right] \frac{V_{DD,new}^{2}}{V_{DD}^{2}} \tag{7.14}$$

As an example, consider replacing a single-stage logic block (VDD = 5 V, fCLK = 20 MHz) with a four-stage pipeline structure, running at the same clock frequency. This means that the propagation delay of each pipeline stage can be increased by a factor of 4 without sacrificing the data throughput. Assuming that the magnitude of the threshold voltage of all transistors is 0.8 V, the desired speed reduction can be achieved by reducing the power supply voltage from 5 V to approximately 2 V (see Fig. 7.9). With a typical ratio of (Creg/Ctotal) = 0.1, the overall power reduction factor is found from (7.14) as (1 + 0.1 x 3) x (2/5)^2 = 0.208, or about 1/5. This means that replacing the original single-stage logic block with a four-stage pipeline running at the same clock frequency and reducing the power supply voltage from 5 V to 2 V will provide a dynamic power savings of about 80%, while maintaining the same throughput as before.

The architectural modification described here has a relatively small area overhead. A total of (N-1) register arrays have to be added to convert the original single-stage structure into a pipeline. While trading off area for lower power, this approach also increases the latency from one to N clock cycles. Yet in many applications such as signal processing and data encoding, latency is not a very significant concern.
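The four-stage example can be checked with a few lines of Python. This is only a sketch: it assumes the power reduction factor (7.14) has the form [1 + (Creg/Ctotal)(N-1)] (VDD,new/VDD)^2, and the function and parameter names are illustrative rather than taken from the text.

```python
def pipeline_power_ratio(n_stages, creg_over_ctotal, vdd, vdd_new):
    """Ratio P_pipeline / P_reference for an N-stage pipeline.

    Assumed form of (7.14): register overhead term times the squared
    supply-voltage ratio.
    """
    register_overhead = 1.0 + creg_over_ctotal * (n_stages - 1)
    return register_overhead * (vdd_new / vdd) ** 2


# Values from the worked example: N = 4, Creg/Ctotal = 0.1, 5 V -> 2 V.
ratio = pipeline_power_ratio(n_stages=4, creg_over_ctotal=0.1, vdd=5.0, vdd_new=2.0)
print(f"P_pipeline / P_reference = {ratio:.3f}")
print(f"dynamic power savings    = {100.0 * (1.0 - ratio):.0f}%")
```

The register overhead term grows linearly with N, so for deep pipelines the savings depend increasingly on how small Creg is relative to Ctotal.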
Parallel Processing Approach (Hardware Replication)

Another way of trading off area for lower power dissipation is to use parallelism, or hardware replication. This approach could be useful especially when the logic function to be implemented is not suitable for pipelining. Consider N identical processing elements, each implementing the logic function F(INPUT) in parallel, as shown in Fig. 7.12. Assume that the consecutive input vectors arrive at the same rate as in the single-stage case examined earlier. The input vectors are routed to all the registers of the N processing blocks. Gated clock signals, each with a clock period of (N TCLK), are used to load each register every N clock cycles. This means that the clock signals to each input register are skewed by TCLK, such that each one of the N consecutive input vectors is loaded into a different input register. Since each input register is clocked at a lower frequency of (fCLK / N), the time allowed to compute the function for each input vector is increased by a factor of N. This implies that the power supply voltage can be reduced until the critical path delay equals the new clock period of (N TCLK). The outputs of the N processing blocks are multiplexed and sent to an output register which operates at a clock frequency of fCLK, ensuring the same data throughput rate as before. The timing diagram of this parallel arrangement is given in Fig. 7.13.

Since the time allowed to compute the function for each input vector is increased by a factor of N, the power supply voltage can be reduced to a value of VDD,new, to effectively slow down the circuit. The new supply voltage can be found, as in the pipelined case, by solving (7.4).
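Finding VDD,new amounts to solving a delay equation numerically. The sketch below uses bisection on a first-order long-channel expression for the inverter delay. That expression is an assumption made here for illustration, since (7.4) itself is not reproduced in this part of the chapter, and the function names are hypothetical.

```python
import math


def normalized_delay(vdd, vt):
    """Inverter delay up to a constant factor (assumed first-order model).

    Only meaningful for vdd > 4*vt/3, where the logarithm is defined.
    """
    overdrive = vdd - vt
    return (2.0 * vt / overdrive + math.log(4.0 * overdrive / vdd - 1.0)) / overdrive


def scaled_supply(vdd, vt, slowdown, tol=1e-6):
    """Supply voltage at which the delay is `slowdown` times the delay at vdd.

    Bisection between 1.5*vt and vdd; assumes the delay at the lower end
    of that interval exceeds the target delay.
    """
    target = slowdown * normalized_delay(vdd, vt)
    low, high = 1.5 * vt, vdd
    while high - low > tol:
        mid = 0.5 * (low + high)
        if normalized_delay(mid, vt) > target:
            low = mid    # circuit is still too slow: raise the voltage
        else:
            high = mid   # circuit is faster than needed: lower the voltage
    return 0.5 * (low + high)


# Slow a 5 V design with VT = 0.8 V down by a factor of N = 4.
vdd_new = scaled_supply(vdd=5.0, vt=0.8, slowdown=4.0)
print(f"VDD,new = {vdd_new:.2f} V")
```

Under this assumed model the result lands slightly above 2 V, in line with the value of approximately 2 V read from Fig. 7.9 in the pipelining example.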
The total dynamic power dissipation of the parallel structure (neglecting the dissipation of the multiplexor) is found as the sum of the power dissipated by the input registers and the logic blocks operating at a clock frequency of (fCLK / N), and the output register operating at a clock frequency of fCLK.

$$P_{parallel} = N\, C_{total}\, V_{DD,new}^{2}\, \frac{f_{CLK}}{N} + C_{reg}\, V_{DD,new}^{2}\, f_{CLK} = \left( 1 + \frac{C_{reg}}{C_{total}} \right) C_{total}\, V_{DD,new}^{2}\, f_{CLK} \tag{7.15}$$

Figure-7.12: N-block parallel structure realizing the same logic function as in Fig. 7.10. Notice that the input registers are clocked at a lower frequency of (fCLK / N).

Note that there is also an additional overhead which consists of the input routing capacitance, the output routing capacitance and the capacitance of the output multiplexor structure, all of which are increasing functions of N. If this overhead is neglected, the amount of power reduction achievable in an N-block parallel implementation is

$$\frac{P_{parallel}}{P_{reference}} = \frac{V_{DD,new}^{2}}{V_{DD}^{2}} \left( 1 + \frac{C_{reg}}{C_{total}} \right) \tag{7.16}$$

The lower bound of dynamic power reduction realizable with architecture-driven voltage scaling is found, assuming zero threshold voltage, as

$$\frac{P_{parallel}}{P_{reference}} \ge \frac{1}{N^{2}} \tag{7.17}$$

Figure-7.13: Simplified timing diagram of the N-block parallel structure shown in Fig. 7.12.

Two obvious consequences of this approach are the increased area and the increased latency. A total of N identical processing blocks must be used to slow down the operation (clocking) speed by a factor of N. In fact, the silicon area will grow even faster than the number of processors because of signal routing and the overhead circuitry. The timing diagram in Fig. 7.13 shows that the parallel implementation has a latency of N clock cycles, as in the N-stage pipelined implementation.
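For a rough numerical comparison of the two architectures, the sketch below evaluates the assumed forms of the power ratios: [1 + (Creg/Ctotal)(N-1)] (VDD,new/VDD)^2 for the pipeline and (1 + Creg/Ctotal) (VDD,new/VDD)^2 for the parallel structure with routing and multiplexor overhead neglected. The function names and the example values are illustrative.

```python
def pipeline_ratio(n, creg_over_ctotal, vdd, vdd_new):
    """Assumed form of (7.14): N-stage pipeline power ratio."""
    return (1.0 + creg_over_ctotal * (n - 1)) * (vdd_new / vdd) ** 2


def parallel_ratio(creg_over_ctotal, vdd, vdd_new):
    """Assumed form of (7.16): N-block parallel power ratio, overhead neglected."""
    return (1.0 + creg_over_ctotal) * (vdd_new / vdd) ** 2


def zero_threshold_bound(n):
    """Assumed form of (7.17): with VT = 0 the delay scales as 1/VDD,
    so VDD,new = VDD/N and the ratio cannot fall below 1/N^2."""
    return 1.0 / n ** 2


N = 4
pipe = pipeline_ratio(N, 0.1, vdd=5.0, vdd_new=2.0)
par = parallel_ratio(0.1, vdd=5.0, vdd_new=2.0)
print(f"pipeline: {pipe:.3f}  parallel: {par:.3f}  bound: {zero_threshold_bound(N):.4f}")
```

The parallel figure looks slightly better here only because the routing and multiplexor capacitances, which grow with N, are left out, and it is obtained with roughly N times the logic area.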
Considering its smaller area overhead, however, the pipelined approach offers a more efficient alternative for reducing the power dissipation while maintaining the throughput.

7.4 Estimation and Optimization of Switching Activity

In the previous section, we have discussed methods for minimizing dynamic power consumption in CMOS digital integrated circuits by supply voltage scaling. Another approach to low power design is to reduce the switching activity and the amount of the switched capacitance to the minimum level required to perform a given task. The measures to accomplish this goal can range from optimization of algorithms to logic design, and finally to physical mask design. In the following, we will examine the concept of switching activity, and introduce some of the approaches used to reduce it. We will also examine the various measures used to minimize the amount of capacitance which must be switched to perform a given task in a circuit.

The Concept of Switching Activity

It was already discussed in Section 7.2 that the dynamic power consumption of a CMOS logic gate depends, among other parameters, on the node transition factor αT, which is the effective number of power-consuming voltage transitions experienced by the output capacitance per clock cycle. This parameter, also called the switching activity factor, depends on the Boolean function performed by the gate, the logic family, and the input signal statistics.

Assuming that all input signals have an equal probability to assume a logic "0" or a logic "1" state, we can easily investigate the output transition probabilities for different types of logic gates. First, we will introduce two signal probabilities, P0 and P1. P0 corresponds to the probability of having a logic "0" at the output, and P1 = (1 - P0) corresponds to the probability of having a logic "1" at the output.
Therefore, the probability that a power-consuming (0-to-1) transition occurs at the output node is the product of these two output signal probabilities. Consider, for example, a static CMOS NOR2 gate. If the two inputs are independent and uniformly distributed, the four possible input combinations (00, 01, 10, 11) are equally likely to occur. Thus, we can find from the truth table of the NOR2 gate that P0 = 3/4, and P1 = 1/4. The probability that a power-consuming transition occurs at the output node is therefore

$$P_{0 \to 1} = P_{0}\, P_{1} = \frac{3}{4} \cdot \frac{1}{4} = \frac{3}{16} \tag{7.18}$$

The transition probabilities can be shown on a state transition diagram which consists of the two possible output states and the possible transitions among them (Fig. 7.14). In the general case of a CMOS logic gate with n input variables, the probability of a power-consuming output transition can be expressed as a function of n0, which is the number of zeros in the output column of the truth table.

$$P_{0 \to 1} = P_{0}\, P_{1} = \left( \frac{n_{0}}{2^{n}} \right) \left( \frac{2^{n} - n_{0}}{2^{n}} \right) \tag{7.19}$$

Figure-7.14: State transition diagram and state transition probabilities of a NOR2 gate.

The output transition probability is shown as a function of the number of inputs in Fig. 7.15, for different types of logic gates and assuming equal input probabilities. For a NAND or NOR gate, the truth table contains only one "0" or "1", respectively, regardless of the number of inputs. Therefore, the output transition probability drops as the number of inputs is increased. In a XOR gate, on the other hand, the truth table always contains an equal number of logic "0" and logic "1" values. The output transition probability therefore remains constant at 0.25.

Figure-7.15: Output transition probabilities of different logic gates, as a function of the number of inputs. Note that the transition probability of the XOR gate is independent of the number of inputs.
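The truth-table argument can be reproduced by brute force. The sketch below counts the zeros n0 in the output column for n equally likely, independent inputs and forms the product P0 P1; the helper names are illustrative.

```python
from fractions import Fraction
from itertools import product


def transition_probability(gate, n_inputs):
    """P(0->1) = P0 * P1 for independent, uniformly distributed inputs."""
    outputs = [gate(bits) for bits in product((0, 1), repeat=n_inputs)]
    rows = len(outputs)            # 2**n rows in the truth table
    n0 = outputs.count(0)          # number of zeros in the output column
    return Fraction(n0, rows) * Fraction(rows - n0, rows)


def nor(bits):
    return int(not any(bits))


def nand(bits):
    return int(not all(bits))


def xor(bits):
    return sum(bits) % 2


for n in (2, 3, 4):
    print(n,
          transition_probability(nor, n),
          transition_probability(nand, n),
          transition_probability(xor, n))
```

For n = 2 this gives 3/16 for both NOR and NAND, and the XOR column stays at 1/4 for every n, matching the trend described for Fig. 7.15.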
In multi-level logic circuits, the distribution of input signal probabilities is typically not uniform, i.e., one cannot expect to have equal probabilities for the occurrence of a logic "0" and a logic "1". Then, the output transition probability becomes a function of the input probability distributions. As an example, consider the NOR2 gate examined above. Let P1,A represent the probability of having a logic "1" at the input A, and P1,B represent the probability of having a logic "1" at the input B. The probability of obtaining a logic "1" at the output node is

$$P_{1} = (1 - P_{1,A})\,(1 - P_{1,B}) \tag{7.20}$$

Using this expression, the probability of a power-consuming output transition is found as a function of P1,A and P1,B.

$$P_{0 \to 1} = P_{0}\, P_{1} = (1 - P_{1})\, P_{1} = \left[ 1 - (1 - P_{1,A})(1 - P_{1,B}) \right] (1 - P_{1,A})(1 - P_{1,B}) \tag{7.21}$$

Figure 7.16 shows the distribution of the output transition probability in a NOR2 gate, as a function of two input probabilities. It can be seen that the evaluation of switching activity becomes a complicated problem in large circuits, especially when sequential elements, reconvergent nodes and feedback loops are involved. The designer must therefore rely on computer-aided design (CAD) tools for correct estimation of switching activity in a given network.

Figure-7.16: Output transition probability of NOR2 gate as a function of two input probabilities.

In dynamic CMOS logic circuits, the output node is precharged during every clock cycle. If the output node was discharged (i.e., if the output value was equal to "0") in the previous cycle, the pMOS precharge transistor will draw a current from the power supply during the precharge phase. This means that the dynamic CMOS logic gate will consume power every time the output value equals "0", regardless of the preceding or following values. Therefore, the power consumption of dynamic logic gates is determined by the signal-value probability of the output node and not by the transition probability.
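The difference between the two activity measures can be made concrete for the NOR2 gate. The sketch below assumes statistically independent inputs, takes P1 = (1 - P1,A)(1 - P1,B) as the output probability, and uses P0 P1 for the static gate and P0 for the dynamic gate. Function names are illustrative.

```python
def nor2_p1(p1_a, p1_b):
    """Probability of a logic 1 at the NOR2 output (independent inputs assumed)."""
    return (1.0 - p1_a) * (1.0 - p1_b)


def static_activity(p1_a, p1_b):
    """Static CMOS: probability of a power-consuming 0-to-1 output transition."""
    p1 = nor2_p1(p1_a, p1_b)
    return (1.0 - p1) * p1


def dynamic_activity(p1_a, p1_b):
    """Dynamic CMOS: power is drawn whenever the previous output was 0, i.e. P0."""
    return 1.0 - nor2_p1(p1_a, p1_b)


for p_a, p_b in [(0.5, 0.5), (0.1, 0.1), (0.9, 0.9)]:
    print(f"P1,A={p_a} P1,B={p_b}: "
          f"static={static_activity(p_a, p_b):.4f} "
          f"dynamic={dynamic_activity(p_a, p_b):.4f}")
```

With uniform inputs the static value reduces to 3/16 while the dynamic value is 3/4, since P0 is multiplied by a factor P1 that never exceeds 1 in the static case.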
From the discussion above, we can see that signal-value probabilities are always larger than transition probabilities; hence, the power consumption of dynamic CMOS logic gates is typically larger than that of static CMOS gates under the same conditions.

Reduction of Switching Activity

Switching activity in CMOS digital integrated circuits can be reduced by algorithmic optimization, by architecture optimization, by proper choice of logic topology and by circuit-level optimization. In the following, we will briefly discuss some of the measures that can be applied to optimize the switching probabilities, and hence, the dynamic power consumption.

Algorithmic optimization depends heavily on the application and on the characteristics of the data such as dynamic range, correlation, and statistics of data transmission. Some of the techniques can be applied only to specific classes of algorithms, such as digital signal processing (DSP), and cannot be used for general purpose processing. One possibility is choosing a proper vector quantization (VQ) algorithm which results in minimum switching activity. For example, the number of memory accesses, the number of multiplications and the number of additions can be reduced by about a factor of 30 if a differential tree search algorithm is used instead of the full search algorithm.

The representation of data may also have a significant impact on switching activity at the system level. In applications where data bits change sequentially and are highly correlated (such as the address bits to access instructions), for example, the use of Gray coding leads to a reduced number of transitions compared to simple binary coding. Another example is using sign-magnitude representation instead of the conventional two's complement representation for signed data. A change in sign will cause transitions of the higher-order bits in the two's complement representation, whereas only the sign bit will change in sign-magnitude representation.
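Both data-representation effects are easy to quantify by counting bit flips between consecutive words. The following sketch does this for a sequential 8-bit address sequence and for a short, made-up sequence of small signed samples with frequent sign changes. The word width, the sample data and the helper names are assumptions chosen for illustration.

```python
def bit_flips(words):
    """Total number of bit transitions between consecutive words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


def to_gray(value):
    """Standard binary-reflected Gray code."""
    return value ^ (value >> 1)


def twos_complement(value, width):
    return value & ((1 << width) - 1)


def sign_magnitude(value, width):
    """Sign bit in the MSB, magnitude in the remaining bits (|value| < 2**(width-1))."""
    sign = (1 << (width - 1)) if value < 0 else 0
    return sign | abs(value)


WIDTH = 8

# Sequential addresses 0, 1, 2, ..., 255.
addresses = list(range(2 ** WIDTH))
binary_flips = bit_flips(addresses)
gray_flips = bit_flips([to_gray(a) for a in addresses])
print(f"sequential addresses: binary={binary_flips} flips, Gray={gray_flips} flips")

# Hypothetical small-amplitude samples whose sign changes at every step.
samples = [3, -3, 2, -2, 1, -1, 2, -2]
tc_flips = bit_flips([twos_complement(s, WIDTH) for s in samples])
sm_flips = bit_flips([sign_magnitude(s, WIDTH) for s in samples])
print(f"alternating-sign data: two's complement={tc_flips} flips, sign-magnitude={sm_flips} flips")
```

Gray coding changes exactly one bit per address increment, while binary counting flips roughly two bits per increment on average. The advantage of sign-magnitude shown here relies on the samples being small and changing sign often; for data that rarely changes sign the two representations behave similarly.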
Therefore, the switching activity can be reduced by using the sign-magnitude representation in applications where the data sign changes are frequent.

An important architecture-level measure to reduce switching activity is based on delay balancing and the reduction of glitches. In multi-level logic circuits, the finite propagation delay from one logic block to the next can cause spurious signal transitions, or glitches, as a result of critical races or dynamic hazards. In general, if all input signals of a gate change simultaneously, no glitching occurs. But a dynamic hazard or glitch can occur if input signals change at different times. Thus, a node can exhibit multiple transitions in a single clock cycle before settling to the correct logic level (Fig. 7.17). In some cases, the signal glitches are only partial, i.e., the node voltage does not make a full transition between the ground and VDD levels, yet even partial glitches can have a significant contribution to dynamic power dissipation.

Figure-7.17: Signal glitching in multi-level static CMOS circuits.

Glitches occur primarily due to a mismatch or imbalance in the path lengths in the logic network. Such a mismatch in path length results in a mismatch of signal timing with respect to the primary inputs. As an example, consider the simple parity network shown in Fig. 7.18. Assuming that all XOR blocks have the same delay, it can be seen that the network in Fig. 7.18(a) will suffer from glitching due to the wide disparity between the arrival times of the input signals for the gates. In the network shown in Fig. 7.18(b), on the other hand, all arrival times are identical because the delay paths are balanced. Such redesign can significantly reduce the glitching transitions, and consequently, the dynamic power dissipation in complex multi-level networks.
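The effect of path balancing can be illustrated with a small unit-delay simulation of the two four-input parity networks. The sketch assumes that every XOR gate has a delay of exactly one time unit, that all four primary inputs switch simultaneously, and that every full output change counts as one transition. The node names and the simulation scheme are illustrative, not a model of any particular CAD tool.

```python
from itertools import product

# Each gate is (output, input_1, input_2); all gates are 2-input XORs,
# listed in topological order.
CHAIN = [("y1", "a", "b"), ("y2", "y1", "c"), ("y3", "y2", "d")]
TREE = [("t1", "a", "b"), ("t2", "c", "d"), ("y", "t1", "t2")]


def count_transitions(gates, old_inputs, new_inputs):
    """Gate-output transitions after the inputs switch from old to new."""
    state = dict(old_inputs)
    for out, x, y in gates:                 # settle to the old steady state
        state[out] = state[x] ^ state[y]
    state.update(new_inputs)                # all inputs switch at t = 0
    transitions = 0
    for _ in range(len(gates)):             # enough unit steps to settle
        nxt = {out: state[x] ^ state[y] for out, x, y in gates}
        transitions += sum(nxt[out] != state[out] for out in nxt)
        state.update(nxt)
    return transitions


def total_transitions(gates):
    """Sum over all 16 x 16 pairs of old and new input vectors."""
    vectors = [dict(zip("abcd", bits)) for bits in product((0, 1), repeat=4)]
    return sum(count_transitions(gates, old, new)
               for old in vectors for new in vectors)


chain_total = total_transitions(CHAIN)
tree_total = total_transitions(TREE)
print(f"chain: {chain_total} transitions, tree: {tree_total} transitions")
```

In the tree every node changes at most once per input change, so its total equals the number of useful transitions; any excess in the chain total is glitching caused by the unequal arrival times.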
Also notice that the tree structure shown in Fig. 7.18(b) results in smaller overall propagation delay. Finally, it should be noted that glitching is not a significant issue in multi-level dynamic CMOS logic circuits, since each node undergoes at most one transition per clock cycle.

Figure-7.18: (a) Implementation of a four-input parity (XOR) function using a chain structure. (b) Implementation of the same function using a tree structure which will reduce glitching transitions.

7.5 Reduction of Switched Capacitance

It was already established in the previous sections that the amount of switched capacitance plays a significant role in the dynamic power dissipation of the circuit. Hence, reduction of this parasitic capacitance is a major goal for low-power design of digital integrated circuits.

At the system level, one of the approaches to reduce the switched capacitance is to limit the use of shared resources. A simple example is the use of a global bus structure for data transmission between a large number of operational modules (Fig. 7.19). If a single shared bus is connected to all modules as in Fig. 7.19(a), this structure results in a large bus capacitance due to (i) the large number of drivers and receivers sharing the same transmission medium, and (ii) the parasitic capacitance of the long bus line. Obviously, driving the large bus capacitance will require a significant amount of power consumption during each bus access. Alternatively, the global bus structure can be partitioned into a number of smaller dedicated local busses to handle the data transmission between neighboring modules, as shown in Fig. 7.19(b). In this case, the switched capacitance during each bus access is significantly reduced, yet multiple busses may increase the overall routing area on chip.
Figure-7.19: (a) Using a single global bus structure for connecting a large number of modules on chip results in large bus capacitance and large dynamic power dissipation. (b) Using smaller local busses reduces the amount of switched capacitance, at the expense of additional chip area.

The type of logic style used to implement a digital circuit also affects the physical capacitance of the circuit. The physical capacitance is a function of the number of transistors that are required to implement a given function. For example, one approach to reduce the physical capacitance is to use transfer gates (pass gates) instead of conventional CMOS logic gates to implement logic functions. Pass-gate logic design is attractive since fewer transistors are required for certain functions such as XOR and XNOR. In many arithmetic operations where binary adders and multipliers are used, pass transistor logic offers significant advantages. Similarly, multiplexors and other key building blocks can also be simplified using this design style.

The amount of parasitic capacitance that is switched (i.e., charged up or discharged) during operation can also be reduced at the physical design level, or mask level. The parasitic gate and diffusion capacitances of MOS transistors in the circuit typically constitute a significant amount of the total capacitance in a combinational logic circuit. Hence, a simple mask-level measure to reduce power dissipation is keeping the transistors (especially the drain and source regions) at minimum dimensions whenever possible and feasible, thereby minimizing the parasitic capacitances. Designing a logic gate with minimum-size transistors certainly affects the dynamic performance of the circuit, and this trade-off between dynamic performance and power dissipation should be carefully considered in critical circuits.
Especially in circuits driving large extrinsic capacitive loads, e.g., large fan-out or routing capacitances, the transistors must be designed with larger dimensions. Yet in many other cases where the load capacitance of a gate is mainly intrinsic, the transistor sizes can be kept at minimum. Note that most standard cell libraries are designed with larger transistors in order to accommodate a wide range of capacitive loads and performance requirements. Consequently, a standard-cell based design may have considerable overhead in terms of switched capacitance in each cell.

7.6 Adiabatic Logic Circuits

In conventional level-restoring CMOS logic circuits with rail-to-rail output voltage swing, each switching event causes an energy transfer from the power supply to the output node, or from the output node to the ground. During a 0-to-VDD transition of the output, the total output charge $Q = C_{load} V_{DD}$ is drawn from the power supply at a constant voltage. Thus, an energy of $E_{supply} = C_{load} V_{DD}^{2}$ is drawn from the power supply during this transition. Charging the output node capacitance to the voltage level VDD means that at the end of the transition, the amount of energy $E_{stored} = C_{load} V_{DD}^{2}/2$ is stored on the output node. Thus, half of the injected energy from the power supply is dissipated in the pMOS network while only one half is delivered to the output node. During a subsequent VDD-to-0 transition of the output node, no charge is drawn from the power supply and the energy stored in the load capacitance is dissipated in the nMOS network.

To reduce the dissipation, the circuit designer can minimize the switching events, decrease the node capacitance, reduce the voltage swing, or apply a combination of these methods. Yet in all cases, the energy drawn from the power supply is used only once before being dissipated.
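The half-and-half energy split of the conventional charge-up event follows from two short integrals. In the sketch below, $i(t)$ denotes the charging current and $V_{out}$ the output node voltage; these symbols are introduced here only for the derivation.

$$
\begin{aligned}
E_{supply} &= \int_{0}^{\infty} V_{DD}\, i(t)\, dt = V_{DD} \int_{0}^{V_{DD}} C_{load}\, dV_{out} = C_{load} V_{DD}^{2} \\
E_{stored} &= \int_{0}^{\infty} V_{out}\, i(t)\, dt = \int_{0}^{V_{DD}} C_{load}\, V_{out}\, dV_{out} = \tfrac{1}{2}\, C_{load} V_{DD}^{2} \\
E_{diss,pMOS} &= E_{supply} - E_{stored} = \tfrac{1}{2}\, C_{load} V_{DD}^{2}
\end{aligned}
$$

Neither integral involves the on-resistance of the pMOS network, so with a constant supply voltage the dissipated energy cannot be lowered by making the transistors more or less resistive.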
To increase the energy efficiency of logic circuits, other measures must be introduced for recycling the energy drawn from the power supply. A novel class of logic circuits called adiabatic logic offers the possibility of further reducing the energy dissipated during switching events, and the possibility of recycling, or reusing, some of the energy drawn from the power supply. To accomplish this goal, the circuit topology and the operation principles have to be modified, sometimes drastically. The amount of energy recycling achievable using adiabatic techniques is also determined by the fabrication technology, switching speed and voltage swing.

The term "adiabatic" is typically used to describe thermodynamic processes that have no energy exchange with the environment, and therefore, no energy loss in the form of dissipated heat. In our case, the electric charge transfer between the nodes of a circuit will be viewed as the process, and various techniques will be explored to minimize the energy loss, or heat dissipation, during charge transfer events. It should be noted that fully adiabatic operation of a circuit is an ideal condition which may only be approached asymptotically as the switching process is slowed down. In practical cases, energy dissipation associated with a charge transfer event is usually composed of an adiabatic component and a non-adiabatic component. Therefore, reducing all energy loss to zero may not be possible, regardless of the switching speed.

Adiabatic Switching

Consider the simple circuit shown in Fig. 7.20 where a load capacitance is charged by a constant current source. This circuit is similar to the equivalent circuit used to model the charge-up event in conventional CMOS circuits, with the exception that in conventional CMOS, the output capacitance is charged by a constant voltage source and not by a constant current source. Here, R represents the on-resistance of the pMOS network.
Also note that a constant charging current corresponds to a linear voltage ramp. Assuming that the capacitance voltage VC is equal to zero initially, the variation of the voltage as a function of time can be found as

$$V_{C}(t) = \frac{1}{C}\, I_{source}\, t \tag{7.22}$$

Hence, the charging current can be expressed as a simple function of VC and time t.

$$I_{source} = C\, \frac{V_{C}(t)}{t} \tag{7.23}$$

The amount of energy dissipated in the resistor R from t = 0 to t = T can be found as

$$E_{diss} = R \int_{0}^{T} I_{source}^{2}\, dt = R\, I_{source}^{2}\, T \tag{7.24}$$

Combining (7.23) and (7.24), the dissipated energy can also be expressed as follows.

$$E_{diss} = \frac{RC}{T}\, C\, V_{C}^{2}(T) \tag{7.25}$$

Figure-7.20: Constant-current source charging a load capacitance C, through a resistance R.

Now, a number of simple observations can be made based on (7.25). First, the dissipated energy is smaller than for the conventional case if the charging time T is larger than 2 RC. In fact, the dissipated energy can be made arbitrarily small by increasing the charging time, since Ediss is inversely proportional to T. Also, we observe that the dissipated energy is proportional to the resistance R, as opposed to the conventional case where the dissipation depends on the capacitance and the voltage swing. Reducing the on-resistance of the pMOS network will reduce the energy dissipation.

We have seen that the constant-current charging process efficiently transfers energy from the power supply to the load capacitance. A portion of the energy thus stored in the capacitance can also be reclaimed by reversing the current source direction, allowing the charge to be transferred from the capacitance back into the supply. This possibility is unique to adiabatic operation, since in conventional CMOS circuits the energy is dissipated after being used only once. The constant-current power supply must certainly be capable of retrieving the energy back from the circuit.
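The break-even point at T = 2 RC can be checked numerically. The sketch below compares the constant-current dissipation, assumed to be (RC/T) C V^2 as in (7.25), against the conventional value of C V^2 / 2. The component values and names are hypothetical and chosen only to give readable numbers.

```python
def adiabatic_dissipation(r, c, v, t_charge):
    """Energy lost in R when C is charged to v by a constant current in t_charge.

    Assumed form of (7.25): E_diss = (R*C / T) * C * v**2.
    """
    return (r * c / t_charge) * c * v ** 2


def conventional_dissipation(c, v):
    """Energy lost in the pull-up network with a constant supply voltage."""
    return 0.5 * c * v ** 2


# Hypothetical values: 1 kOhm on-resistance, 1 pF load, 5 V swing.
R, C, V = 1e3, 1e-12, 5.0
tau = R * C

for multiple in (1, 2, 10, 100):
    e_adiabatic = adiabatic_dissipation(R, C, V, multiple * tau)
    ratio = e_adiabatic / conventional_dissipation(C, V)
    print(f"T = {multiple:>3} RC: E_diss / E_conventional = {ratio:.3f}")
```

For charging times shorter than 2 RC the constant-current scheme dissipates more than the conventional circuit, which is why adiabatic switching only pays off when the ramp is slow compared to the RC time constant.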
Adiabatic logic circuits thus require non-standard power supplies with time-varying voltage, also called pulsed-power supplies. The additional hardware overhead associated with these specific power supply circuits is one of the design trade-offs that must be considered when using adiabatic logic.

Adiabatic Logic Gates

In the following, we will examine simple circuit configurations which can be used for adiabatic switching. Note that most of the research on adiabatic logic circuits is relatively recent; therefore, the circuits presented here should be considered as examples only. Other circuit topologies are also possible, but the overall approach of energy recycling should still be applicable, regardless of the specific circuit configuration.

First, consider the adiabatic amplifier circuit shown in Fig. 7.21, which can be used to drive capacitive loads. It consists of two CMOS transmission gates and two nMOS clamp transistors. Both the input (X) and the output (Y) are dual-rail encoded, which means that the inverses of both signals are also available, to control the CMOS T-gates.

Figure-7.21: Adiabatic amplifier circuit which transfers the complementary input signals to its complementary outputs through CMOS transmission gates.

When the input signal X is set to a valid value, one of the two transmission gates becomes transparent. Next, the amplifier is energized by applying a slow voltage ramp VA, rising from zero to VDD. The load capacitance at one of the two complementary outputs is adiabatically charged to VDD through the transmission gate, while the other output node remains clamped to ground potential. When the charging process is completed, the output signal pair is valid and can be used as an input to other, similar circuits. Next, the circuit is de-energized by ramping the voltage VA back to zero.
Thus, the energy that was stored in the output load capacitance is retrieved by the power supply. Note that the input signal pair must be valid and stable throughout this sequence.

Figure-7.22: (a) The general circuit topology of a conventional CMOS logic gate. (b) The topology of an adiabatic logic gate implementing the same function. Note the difference in charge-up and charge-down paths for the output capacitance.

The simple circuit principle of the adiabatic amplifier can be extended to allow the implementation of arbitrary logic functions. Figure 7.22 shows the general circuit topology of a conventional CMOS logic gate and an adiabatic counterpart. To convert a conventional CMOS logic gate into an adiabatic gate, the pull-up and pull-down networks must be replaced with complementary transmission-gate networks. The T-gate network implementing the pull-up function is used to drive the true output of the adiabatic gate, while the T-gate network implementing the pull-down function drives the complementary output node. Note that all inputs should also be available in complementary form. Both networks in the adiabatic logic circuit are used to charge up as well as charge down the output capacitances, which ensures that the energy stored at the output node can be retrieved by the power supply at the end of each cycle. To allow adiabatic operation, the DC voltage source of the original circuit must be replaced by a pulsed-power supply with ramped voltage output.

Note that the circuit modifications which are necessary to convert a conventional CMOS logic circuit into an adiabatic logic circuit increase the device count by a factor of two. Also, the reduction of energy dissipation comes at the cost of slower switching speed, which is the ultimate trade-off in all adiabatic methods.
Stepwise Charging Circuits

We have seen earlier that the dissipation during a charge-up event can be minimized, and in the ideal case be reduced to zero, by using a constant-current power supply. This requires that the power supply be able to generate linear voltage ramps. Practical supplies can be constructed by using resonant inductor circuits to approximate the constant output current and the linear voltage ramp with sinusoidal signals. But the use of inductors presents several difficulties at the circuit level, especially in terms of chip-level integration and overall efficiency.

An alternative to using pure voltage ramps is to use stepwise supply voltage waveforms, where the output voltage of the power supply is increased and decreased in small increments during charging and discharging. Since the energy dissipation depends on the average voltage drop traversed by the charge that flows onto the load capacitance, using smaller voltage steps, or increments, should reduce the dissipation considerably.

Figure 7.23 shows a CMOS inverter driven by a stepwise supply voltage waveform. Assume that the output voltage is equal to zero initially. With the input voltage set to logic low level, the power supply voltage VA is increased from 0 to VDD, in n equal voltage steps (Fig. 7.24). Since the pMOS transistor is conducting during this transition, the output load capacitance will be charged up in a stepwise manner. The on-resistance of the pMOS transistor can be represented by the linear resistor R. Thus, the output load capacitance is being charged up through a resistor, in small voltage increments.
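The benefit of small voltage steps can be illustrated with a simple energy bookkeeping sketch. It assumes that the output settles completely to the supply level within each step. Each step then draws the charge C ΔV from the supply at the new supply level, and the difference between the energy delivered by the supply and the increase in stored energy is dissipated in the resistor. The function name and the component values are hypothetical.

```python
def stepwise_dissipation(c, vdd, n_steps):
    """Total energy dissipated when c is charged from 0 to vdd in equal steps.

    Assumes full settling within every step; the result does not depend
    on the series resistance.
    """
    step = vdd / n_steps
    v_out = 0.0
    dissipated = 0.0
    for i in range(1, n_steps + 1):
        v_supply = i * step
        charge = c * (v_supply - v_out)               # charge moved in this step
        from_supply = v_supply * charge               # energy delivered by the supply
        stored_gain = 0.5 * c * (v_supply ** 2 - v_out ** 2)
        dissipated += from_supply - stored_gain       # lost in the resistor
        v_out = v_supply
    return dissipated


C_LOAD, VDD = 1e-12, 5.0   # hypothetical 1 pF load, 5 V swing
single_step = stepwise_dissipation(C_LOAD, VDD, 1)
for n in (1, 2, 5, 10, 100):
    ratio = stepwise_dissipation(C_LOAD, VDD, n) / single_step
    print(f"n = {n:>3}: E_diss / E_single_step = {ratio:.3f}")
```

With a single step the result is the conventional C VDD^2 / 2, and the total falls in proportion to 1/n as the number of steps increases.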
For the ith time increment, the amount of capacitor current can be expressed as

$$i_{C} = C\, \frac{dV_{out}}{dt} = \frac{V_{A(i+1)} - V_{out}}{R} \tag{7.26}$$

Solving this differential equation with the initial condition Vout(ti) = VA(i) yields

$$V_{out}(t) = V_{A(i+1)} - \frac{V_{DD}}{n}\, e^{-(t - t_{i})/RC} \tag{7.27}$$

Figure-7.23: A CMOS inverter circuit with a stepwise-increasing supply voltage.

Figure-7.24: Equivalent circuit, and the input and output voltage waveforms of the CMOS inverter circuit in Fig. 7.23 (stepwise charge-up case).

Here, n is the number of steps of the supply voltage waveform. The amount of energy dissipated during one voltage step increment can now be found as

$$E_{step} = \int_{t_{i}}^{\infty} i_{C}^{2}\, R\, dt = \frac{1}{2}\, C\, \frac{V_{DD}^{2}}{n^{2}} \tag{7.28}$$

Since n steps are used to charge up the capacitance to VDD, the total dissipation is

$$E_{total} = n\, E_{step} = \frac{1}{2}\, C\, \frac{V_{DD}^{2}}{n} \tag{7.29}$$

According to this simplified analysis, charging the output capacitance with n voltage steps, or increments, reduces the energy dissipation per cycle by a factor of n. Therefore, the total power dissipation is also reduced by a factor of n using stepwise charging. This result implies that if the voltage steps can be made very small and the number of voltage steps n approaches infinity (i.e., if the supply voltage is a slow linear ramp), the energy dissipation will approach zero.

Another example for simple stepwise charging circuits is the stepwise driver for capacitive loads, implemented with nMOS devices as shown in Fig. 7.25. Here, a bank of n constant voltage supplies with evenly distributed voltage levels is used. The load capacitance is charged up by connecting the constant voltage sources V1 through Vn to the load successively, using an array of switch devices. To discharge the load capacitance, the constant voltage sources are connected to the load in the reverse sequence. The switch devices are shown as nMOS transistors in Fig.
7.25, yet some of them may be replaced by pMOS transistors to prevent the undesirable threshold-voltage drop problem and the substrate-bias effects at higher voltage levels.

One of the most significant drawbacks of this circuit configuration is the need for multiple supply voltages. A power supply system capable of efficiently generating n different voltage levels would be complex and expensive. Also, the routing of n different supply voltages to each circuit in a large system would create a significant overhead. In addition, the concept is not easily extensible to general logic gates. Therefore, stepwise charging driver circuits can be best utilized for driving a few critical nodes in the circuit that are responsible for a large portion of the overall power dissipation, such as output pads and large busses.

In general, we have seen that adiabatic logic circuits can offer significant reduction of energy dissipation, but usually at the expense of switching times. Therefore, adiabatic logic circuits can be best utilized in cases where delay is not critical. Moreover, the realization of unconventional power supplies needed in adiabatic circuit configurations typically results in an overhead both in terms of overall energy dissipation and in terms of silicon area. These issues should be carefully considered when adiabatic logic is used as a method for low-power design.

Figure-7.25: Stepwise driver circuit for capacitive loads. The load capacitance is successively connected to constant voltage sources Vi through an array of switch devices.

References

1. A.P. Chandrakasan and R.W. Brodersen, Low Power Digital CMOS Design, Norwell, MA: Kluwer Academic Publishers, 1995.
2. J.M. Rabaey and M. Pedram, ed., Low Power Design Methodologies, Norwell, MA: Kluwer Academic Publishers, 1995.
3. A. Bellaouar and M.I.
Elmasry, Low-Power Digital VLSI Design, Norwell, MA: Kluwer Academic Publishers, 1995.
4. F. Najm, "A survey of power estimation techniques in VLSI circuits," IEEE Transactions on VLSI Systems, vol. 2, pp. 446-455, December 1994.
5. W.C. Athas, L. Swensson, J.G. Koller and E. Chou, "Low-power digital systems based on adiabatic-switching principles," IEEE Transactions on VLSI Systems, vol. 2, pp. 398-407, December 1994.

This chapter edited by Y. Leblebici

Chapter 8 TESTABILITY OF INTEGRATED SYSTEMS

Design Constraints
Testing
The Rule of Ten
Terminology
Failures in CMOS
Combinational Logic Testing
Practical Ad-Hoc DFT Guidelines
Scan Design Techniques

8.1 Design Constraints

The following paragraphs remind the designer of some basic rules to consider before starting a design. Each of these constraints is supported by at least one tool that checks the design against a set of rules.

8.1.1 Design Rule Checking

Every technology has its design rules, which define the geometrical implementations that can actually be manufactured. These rules are given by the technology department of every IC foundry. They are often described in a document in which boxes represent the layers available in the technology, annotated with the sizes, distances and geometrical constraints allowed in this technology.

Figure-8.1: The designer needs to run a program called DRC to check that the design does not violate the rules defined by the foundry.

This verification step, called DRC, is as important as the simulation of the functionality of the design. A sole
simulation cannot tell whether the rules are respected; if they are not, the manufacturing of the chip could lead to shorts or cuts in the physical silicon implementation. Other verification tools should also be used, such as ERC and LVS, described below.

8.1.2 Layout Versus Schematic

As a complement to DRC, LVS is another tool to be used, especially if the design started with a schematic entry tool. The aim of LVS is to check that the design at the layout level still corresponds to the schematic. Usually, designers start with a schematic and simulate it; if the results are satisfactory, they proceed to layout. But in some cases, such as full-custom or some semi-custom designs, the layout implementation of the chip differs from the schematic, either because of changes prompted by simulation results or because of a design error that simulation cannot easily detect: simulation can never be exhaustive. LVS checks that the designer produced the same representation at the schematic and layout levels; if not, the LVS tool reports the mismatch. Of course, a simulation of the layout using the same stimuli as for the schematic gives additional confidence in the final design.

8.1.3 Latch-Up and Electro-Static Discharge

Latch-up caused the early problems with CMOS that delayed its introduction in the electronics industry. It is also called the "thyristor effect" and can destroy the chip or a part of it. There is no real cure for this phenomenon, but a set of design techniques exists to avoid the occurrence of latch-up rather than solve it. The origin of latch-up is the distribution of the N and P basic structures of the NMOS and PMOS transistors inside the silicon. In some cases, not only PN junctions are formed but also parasitic thyristor structures such as PNPN or NPNP. These parasitic elements can behave like a real thyristor and develop a high current that destroys the area around them, including the PMOS and NMOS transistors.
The most widely used technique to avoid the formation of such a structure is to add "butting contacts" that tie the N-well (or P-well) to Vdd (or to ground). This technique cannot eliminate latch-up but reduces its effect.

Another electrical constraint in CMOS is ESD, or Electro-Static Discharge. Handling CMOS chips properly helps to avoid the destruction of gates by the electrostatic charge that people can carry on the surface of their hands. This is why it is important to wear a conducting bracelet linked to ground when handling CMOS ICs. But even a grounded bracelet is not enough to protect CMOS chips from destruction due to ESD. Two diodes at each pad inside the chip link every I/O to Vdd and Gnd. These two large diodes protect the chip core (the CMOS transistor gates) from ESD by limiting over-voltage.

Figure-8.2:

8.1.4 Electrical Rule Checking

ERC verifies that the designer has included the minimum protective measures described in the previous section. The tool checks that a sufficient number of well contacts is used, that appropriate ESD pads are applied, and that VDD and VSS are connected at the right places.

8.2 Testing

The design of logic integrated circuits in CMOS technology is becoming more and more complex, since VLSI is of interest to many electronic IC users and manufacturers. A common problem to be solved by both users and manufacturers is the testing of these ICs.

Figure-8.3:

Testing consists in checking whether the outputs of a functional system (functional block, integrated circuit, printed circuit board or a complete system) correspond to the responses expected for the inputs applied to it. If the system passes the test, then it is good for use.
If the outputs are different from those expected, then the system has a problem: either the system is rejected (Go/No Go test), or a diagnosis is applied to it in order to locate and, where possible, eliminate the causes of the problem.

Testing is applied to detect faults after several operations: design, manufacturing, packaging, and especially during the active life of a system, since failures caused by wear-out can occur at any moment of its usage.

Design for Testability (DfT) is the ability to simplify the test of any system. DfT can be summarized as a set of techniques and design guidelines whose goals are:

- minimizing the costs of system production
- minimizing system test complexity: test generation and application
- improving quality
- avoiding problems of timing discordance or block nature incompatibility.

8.3 The Rule of Ten

In the production process cycle, a fault can occur at the chip level. If a test strategy is considered at the beginning of the design, then the fault can be detected rapidly, located and eliminated at a very low cost. Once the faulty chip is soldered on a printed circuit board, the cost of remedying the fault is multiplied by ten. The same cost factor applies again at each later stage, as the system is assembled, packaged and then sent to users.

Figure-8.4:

8.4 Terminology

At the system level the most used terms are the following.

Testability is the ability of a Device Under Test (DUT) to be easily observed and controlled from its external environment.

Figure-8.5:

Design for Testability is then reduced to a set of design rules or guidelines to be respected in order to facilitate the test.

Reliability is expressed as the probability that a device works without major problems for a given time.
Reliability goes down as the number of components increases.

Security is the probability that the user's life is not put in danger when a problem occurs in a device. Security is enhanced if certain types of components are added for more protection.

Quality is essential in some types of applications. A "zero defect" target is often required. Quality can be enhanced by a proper design methodology and a good technology, which avoid problems and simplify testing.

8.5 Failures in CMOS

When a MOS circuit has been fabricated and initially tested, some mechanisms can still cause it to fail. Failures are caused either by design bugs or by wear-out (ageing or corrosion) mechanisms. The MOSFET transistors currently used have two main characteristics, threshold voltage and transconductance, on which the performance of the circuit depends.

Figure-8.6:

Design bugs or defects generally result in device lengths and widths deviating from those specified for a process (design rules). This type of fault is difficult to detect since it shows up later, during the active life of the circuit, and leads mostly to opens and breaks in conductors or shorts between conductors.

Failures are also caused by phenomena like "hot carrier injection", "oxide breakdown", "metallization failures" or "corrosion". The consequence of hot carrier injection, for instance, is a shift of the threshold voltage and a degradation of the transconductance, because the gate oxide becomes charged when hot carriers are injected (usually electrons in NMOS). Cross-talk is also a cause of faults (generally transient) and requires the different parts of the device to be properly isolated.
8.6 Combinational Logic Testing

It is more convenient to talk about "test generation for combinational logic testing" in this section, and about "test generation for sequential logic testing" in the next section. The solution to the problem of testing a purely combinational logic block is a good set of patterns detecting "all" the possible faults.

The first idea for testing an N-input circuit would be to apply an N-bit counter to the inputs (controllability), generate all the $2^N$ combinations, and observe the outputs for checking (observability). This is called "exhaustive testing", and it is very efficient, but only for circuits with few inputs. When the number of inputs increases, this technique becomes very time consuming.

Figure-8.7:

8.6.1 Sensitized Path Testing

Most of the time, in exhaustive testing, many patterns never occur during the normal application of the circuit. So instead of spending a huge amount of time searching for faults everywhere, the possible faults are first enumerated and a set of appropriate vectors is then generated. This is called "single-path sensitization" and it is based on "fault oriented testing".

Figure-8.8:

The basic idea is to select a path from the site of a fault, through a sequence of gates, leading to an output of the combinational logic under test. The process is composed of three steps:

Manifestation : the gate inputs at the site of the fault are specified so as to generate the opposite of the faulty value (0 for SA1, 1 for SA0).

Propagation : the inputs of the other gates are determined so as to propagate the fault signal along the specified path to the primary output of the circuit. This is done by setting these inputs to "1" for AND/NAND gates and "0" for OR/NOR gates.

Consistency : or justification.
This final step finds the primary input pattern that realizes all the necessary input values. This is done by tracing backward from the gate inputs to the primary inputs of the logic in order to obtain the test patterns.

Figure-8.9:

EXAMPLE 1 - SA1 of line 1 (L1) : the aim is to find the vector(s) able to detect this fault.

Manifestation : L1 = 0, so input A = 0. In a fault-free situation, the output F changes with A if B, C and D are fixed; with B, C and D fixed, L1 SA1 gives F = 0, for instance, even if A = 0 (F = 1 when fault-free).

Propagation : Through the AND-gate : L5 = L8 = 1. This condition is necessary for the propagation of "L1 = 0", and it leads to L10 = 0. Through the NOR-gate : since L10 = 0, L11 must be 0 so that the propagated manifestation can reach the primary output F. F is then read and compared with the fault-free value, F = 1.

Consistency : From the AND-gate : L5 = 1, so L2 = B = 1. Also L8 = 1, so L7 = 1. At this point the values of A and B are known. Once C and D are found in the same manner, the test vectors are generated and ready to be applied to detect L1 SA1. From the NOT-gate : L11 = 0, so L9 = L7 = 1 (consistent with L8 = L7). From the OR-gate : L7 = 1, and since L6 = L2 = B = 1, the condition B + C + D = L7 = 1 holds whatever the values of C and D, so each of them can be either 1 or 0. These three steps have led to four possible vectors detecting L1 SA1.

Figure-8.10:

EXAMPLE 2 - SA1 of line 8 (L8) : the same combinational logic, with one internal line SA1.

Manifestation : L8 = 0.

Propagation : Through the AND-gate : L5 = L1 = 1, then L10 = 0. Through the NOR-gate : we want L11 = 0, so as not to mask L10 = 0.

Consistency : From the AND-gate, L8 = 0 leads to L7 = 0.
From the NOT-gate, L11 = 0 means L9 = L7 = 1. L7 cannot be set to 1 and 0 at the same time. This incompatibility cannot be resolved in this case, and the fault "L8 SA1" remains undetectable.

Figure-8.11:

EXAMPLE 3 - SA1 of line 2 (L2) : again the same combinational logic, with the line L2 SA1.

Manifestation : L2 = 0 sets L5 = L6 = 0.

Propagation : Through the AND-gate : L1 = 1, and we need L10 = 0. Through the OR-gate : L3 = L4 = 0, so that L7 = L8 = L9 = 0, but through the NOT-gate L11 = 1. The error "L2 SA1" propagated across a reconvergent path is masked, since the NOR-gate does not distinguish the origin of the propagation.

8.7 Practical Ad-Hoc DFT Guidelines

This section provides a set of practical Design for Testability guidelines classified into three types: those that facilitate test generation, those that facilitate test application, and those that avoid timing problems.

8.7.1 Improve Controllability and Observability

All "design for test" methods ensure that a design has enough observability and controllability to allow complete and efficient testing. When a node is difficult to access from the primary inputs or outputs (pads of the circuit), a very efficient method is to add internal pads giving access to this kind of node in order, for instance, to control block B2 and observe block B1 with a probe.

Figure-8.12:

It is easy to observe block B1 by adding a pad directly on its output, without breaking the link between the two blocks. Controlling block B2 means forcing a 0 or a 1 on its input while remaining transparent to the link B1-B2 in normal operation. The logic functions used for this purpose are a NOR-gate, transparent to a zero, and a NAND-gate, transparent to a one. In this way the control of B2 is possible through these two gates.
Another implementation of this cell is based on pass-gate multiplexers performing the same function, but with fewer transistors than with the NAND and NOR gates (8 instead of 12).

The simple optimization of observation and control is not enough to guarantee full testability of the blocks B1 and B2. This technique has to be complemented by other testing techniques that depend on the internal structures of blocks B1 and B2.

8.7.2 Use Multiplexers

This technique is an extension of the preceding one, in which multiplexers are used when the number of primary inputs and outputs is limited.

Figure-8.13:

In this case the major penalties are the extra devices and the propagation delays due to the multiplexers. Demultiplexers are also used to improve observability. Using multiplexers and demultiplexers allows internal access to blocks separately from each other, which is the basis of techniques that partition or bypass blocks in order to observe or control other blocks separately.

8.7.3 Partition Large Circuits

Partitioning large circuits into smaller sub-circuits reduces the test-generation effort. The test-generation effort for a general purpose circuit of n gates is assumed to be proportional to somewhere between $n^2$ and $n^3$. If the circuit is partitioned into two sub-circuits of n/2 gates each, the effort drops to between $2(n/2)^2 = n^2/2$ and $2(n/2)^3 = n^3/4$, that is, by a factor of two to four.

Figure-8.14:

The example of the SN7480 full adder shows that exhaustive testing requires 512 tests ($2^9$), while a full test for SA0 and SA1 faults after partitioning into four sub-circuits requires 24 tests. Logical partitioning of a circuit should be based on recognizable sub-functions and can be achieved physically by incorporating some facilities to isolate and control clock lines, reset lines and power supply lines.
The multiplexers can be used extensively to separate sub-circuits without changing the function of the global circuit.

8.7.4 Divide Long Counter Chains

Following the same principle of partitioning, counters are sequential elements that need a large number of vectors to be fully tested. Partitioning a long counter means dividing it into sub-counters.

The full test of a 16-bit counter requires the application of $2^{16} + 1 = 65537$ clock pulses. If this counter is divided into two 8-bit counters, then each counter can be tested separately with $2^8 + 1 = 257$ pulses, 514 in total, and the total test time is reduced about 128 times ($2^7$). This is also useful if there are subsequent requirements to set the counter to a particular count for tests associated with other parts of the circuit (pre-loading facilities).

Figure-8.15:

8.7.5 Initialize Sequential Logic

One of the most important problems in sequential logic testing occurs at power-on, where the first state is random if there is no initialization. In this case it is impossible to start a test sequence correctly, because of the memory effects of the sequential elements.

Figure-8.16:

The solution is to provide flip-flops or latches with a set or reset input, and to use them so that the test sequence starts from a known state.

Ideally, all memory elements should be able to be set to a known state, but in practice this could consume a lot of area, and it is not always necessary to initialize all the sequential logic. For example, a serial-in serial-out counter could have only its first flip-flop provided with an initialization; after a few clock pulses the counter is then in a known state.

Overriding by the tester is sometimes necessary, and requires the addition of gates before a Set or a Reset input so that the tester can override the initialization state of the logic.
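The point about initializing only the first stage can be illustrated with a small sketch. It is an illustration under stated assumptions, not the circuit of Figure-8.16: the serial structure is modeled as a plain shift register, every stage powers up in an unknown state written "X", and the first stage is fed a known value on each clock. The names `clock_shift_register` and `clocks_until_known` are hypothetical.

```python
X = "X"  # marker for an unknown (uninitialized) logic value


def clock_shift_register(state, serial_in):
    """One clock pulse: the serial input enters stage 0, every stage shifts by one."""
    return [serial_in] + state[:-1]


def clocks_until_known(length, serial_bits):
    """Count the clock pulses needed until no stage holds an unknown value.

    Returns (pulse_count, state), or (None, state) if the supplied bits
    run out before every stage is known.
    """
    state = [X] * length  # power-on: every stage is unknown
    for count, bit in enumerate(serial_bits, start=1):
        state = clock_shift_register(state, bit)
        if X not in state:
            return count, state
    return None, state


if __name__ == "__main__":
    print(clocks_until_known(4, [1, 0, 1, 1, 0]))
    print(clocks_until_known(4, [1, 0]))
```

In this model a register of k stages reaches a fully known state after exactly k clock pulses, which is why a single controllable entry point can be enough for serial structures.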
8.7.6 Avoid Asynchronous Logic

Asynchronous logic uses memory elements in which state transitions are controlled by the sequence of changes on the primary inputs. There is thus no easy way to determine when the next state will be established. This is again a problem of timing and memory effects.

Figure-8.17:

Asynchronous logic is faster than synchronous logic, since its speed is limited only by gate propagation delays and interconnects. On the other hand, the design of asynchronous logic is more difficult than that of synchronous (clocked) logic and must be carried out with due regard to the possibility of critical races (circuit behavior depending on two inputs changing simultaneously) and hazards (occurrence of a momentary value opposite to the expected value).

Non-deterministic behavior in asynchronous logic can cause problems during fault simulation. Time dependency of operation can make testing very difficult, since it is sensitive to tester signal skew.

8.7.7 Avoid Logical Redundancy

Logical redundancy exists either to mask a static-hazard condition, or unintentionally (design bug). In both cases, with a logically redundant node it is not possible to make a primary output value dependent on the value of the redundant node. This means that certain fault conditions on the node cannot be detected, such as a node SA1 of the function F.

Figure-8.18:

Another drawback of logical redundancy is that a non-detectable fault on a redundant node can mask the detection of a normally detectable fault, such as a SA0 of input C in the second example, masked by a SA1 of a redundant node.

8.7.8 Avoid Delay Dependent Logic

Automatic test pattern generators work in the logic domain, so they view delay dependent logic as redundant combinational logic.
In this case the ATPG will see an AND of a signal with its complement, and will therefore always compute a 0 on the output of the AND-gate (instead of a pulse). Adding an OR-gate after the AND-gate output allows the ATPG to substitute a clock signal directly.

Figure-8.19:

8.7.9 Avoid Clock Gating

When a clock signal is gated with any data signal, for example a load signal coming from a tester, a skew or any other hazard on that signal can cause an error on the output of the logic.

Figure-8.20:

This is again a consequence of asynchronous-style logic. Clock signals should be distributed in the circuit in accordance with the synchronous logic structure.

8.7.10 Strictly Distinguish Between Signal and Clock

This is another timing situation to avoid, in which the tester cannot be synchronized if one or more clocks depend on asynchronous delays (across the D-input of flip-flops, for example).

Figure-8.21:

The problem is the same when a signal fans out to both a clock input and a data input.

8.7.11 Avoid Self Resetting Logic

Self resetting logic is closely related to asynchronous logic, since a reset input is independent of the clock signal.

Before the delayed reset, the tester reads the set value and continues the normal operation. If a reset has occurred before the tester observation, then the value read is erroneous. The solution to this problem is to allow the tester to override the reset, for example by adding an OR-gate with an inhibition input coming from the tester. In this way the right response is given to the tester at the right time.

Figure-8.22:

8.7.12 Use Bused Structure

This approach is related, by structure, to partitioning technique.
It is very useful for microprocessor-like circuits. Using this structure gives the external tester access to three buses, which go to many different modules.

Figure-8.23:

The tester can then disconnect any module from the buses by putting its output into a high-impedance state. Test patterns can then be applied to each module separately.

8.7.13 Separate Analog and Digital Circuits

Testing analog circuits requires a completely different strategy from testing digital circuits. Also, the sharp edges of digital signals can cause cross-talk problems on analog lines if they are close to each other.

Figure-8.24:

If it is necessary to route digital signals near analog lines, then the digital lines should be properly balanced and shielded. Also, for circuits like Analog-Digital converters, it is better to bring out the analog signals for observation before conversion. For Digital-Analog converters, the digital signals should likewise be brought out for observation before conversion.

8.7.14 Bypassing Techniques

Bypassing a sub-circuit consists in propagating the sub-circuit input signals directly to the outputs. The aim of this technique is to bypass a sub-circuit (part of a global circuit) in order to access another sub-circuit to be tested. The partitioning technique builds on the bypassing technique, and both use multiplexers, though in two different ways.

With the bypassing technique, sub-circuits can then be tested exhaustively by controlling the multiplexers in the whole circuit. To speed up the test, some sub-circuits are tested simultaneously if the propagation paths are associated with other disjoint or separated sub-circuits.
Figure-8.25:

DfT Remarks

The techniques listed above are not an exhaustive list for DfT, but give a set of rules to respect as far as possible. Some of these guidelines aim to simplify test vector generation, others to simplify test vector application, and many others to avoid timing problems in the design.

8.8 Scan Design Techniques

The set of design for testability guidelines presented above is a set of ad hoc methods for designing random logic with respect to testability requirements. Scan design techniques are a set of structured approaches to designing sequential circuits for testability.

The major difficulty in testing sequential circuits is determining the internal state of the circuit. Scan design techniques are directed at improving the controllability and observability of the internal states of a sequential circuit. In this way the problem of testing a sequential circuit is reduced to that of testing a combinational circuit, since the internal states of the circuit are under control.

8.8.1 Scan Path

The goal of the scan path technique is to reconfigure a sequential circuit, for the purpose of testing, into a combinational circuit. Since a sequential circuit is based on a combinational circuit and some storage elements, the scan path technique consists in connecting together all the storage elements to form a long serial shift register. Thus the internal state of the circuit can be controlled and observed by shifting (scanning) the contents of the storage elements in and out. The shift register is then called a scan path.

Figure-8.26:

The storage elements can be D, J-K, or R-S types of flip-flops, but simple latches cannot be used in a scan path. However, the structure of storage elements is slightly different than classical ones.
Generally the selection of the input source is achieved using a multiplexer on the data input, controlled by an external mode signal. In our case this multiplexer is integrated into the D-flip-flop, which is then called an MD-flip-flop (multiplexed flip-flop).

A sequential circuit containing a scan path has two modes of operation, a normal mode and a test mode, which configure the storage elements in the scan path.

In the normal mode, the storage elements are connected to the combinational circuit, in the loops of the global sequential circuit, which is then considered as a finite state machine.

In the test mode, the loops are broken and the storage elements are connected together as a serial shift register (scan path), all receiving the same clock signal. The input of the scan path is called scan-in and the output scan-out. Several scan paths can be implemented in the same complex circuit if necessary, in which case there are several scan-in inputs and scan-out outputs.

A large sequential circuit can be partitioned into sub-circuits, each containing a combinational sub-circuit associated with one scan path. The efficiency of test pattern generation for a combinational sub-circuit is greatly improved by partitioning, since its depth is reduced.

Before applying test patterns, the shift register itself has to be verified by shifting in all ones (111...11) or all zeros (000...00) and comparing the data shifted out.

The method of testing a circuit with the scan path is as follows:
1. Set the test mode signal, so that the flip-flops accept data from the scan-in input
2. Verify the scan path by shifting test data in and out
3. Set the shift register to an initial state
4. Apply a test pattern to the primary inputs of the circuit
5. Set normal mode; the circuit settles and the primary outputs of the circuit can be monitored
6. Activate the circuit clock for one cycle
7. Return to test mode
8. Scan out the contents of the registers, simultaneously scanning in the next pattern

8.8.2 Boundary Scan Test (BST)

Boundary Scan Test (BST) is a technique involving scan path and self-testing techniques to resolve the problem of testing boards carrying VLSI integrated circuits and/or surface mounted devices (SMD).

Printed circuit boards (PCB) are becoming so dense and complex, especially with SMD circuits, that most test equipment cannot guarantee good fault coverage.

Figure-8.27:

BST consists in placing a scan path (shift register) cell adjacent to each component pin and interconnecting the cells in order to form a chain around the border of the circuit. The BST circuits contained on one board are then connected together to form a single path through the board.

The boundary scan path is provided with serial input and output pads and appropriate clock pads, which make it possible to:

- Test the interconnections between the various chips
- Deliver test data to the chips on the board for self-testing
- Test the chips themselves with internal self-test

Figure-8.28:

The advantages of boundary scan techniques are as follows:

- No need for complex testers in PCB testing
- The test engineer's work is simplified and more efficient
- The time spent on test pattern generation and application is reduced
- Fault coverage is greatly increased.
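Both the internal scan path of Section 8.8.1 and the boundary scan register behave as serial shift registers, so the load, capture and unload sequence can be sketched with a small behavioral model. This is a simplified illustration, not a model of any real device: the three-cell chain, the `next_state_logic` function and its primary input are hypothetical stand-ins for the logic under test.

```python
def scan_shift(chain, scan_in_bits):
    """Test mode: shift one bit per clock; return (new_chain, scan_out_bits)."""
    chain = list(chain)
    scan_out = []
    for bit in scan_in_bits:
        scan_out.append(chain[-1])   # the last cell drives scan-out
        chain = [bit] + chain[:-1]   # each cell takes the value of its predecessor
    return chain, scan_out


def next_state_logic(state, primary_in):
    """Hypothetical combinational block placed between the storage elements."""
    q0, q1, q2 = state
    return [q1 ^ primary_in, q2, q0 & q1]


def scan_test(pattern, primary_in):
    """Load a state, apply one normal-mode clock, then unload the response."""
    chain = [0] * len(pattern)
    chain, _ = scan_shift(chain, reversed(pattern))         # test mode: load state
    chain = next_state_logic(chain, primary_in)             # normal mode: one clock
    chain, observed = scan_shift(chain, [0] * len(chain))   # test mode: unload
    return list(reversed(observed))                         # restore cell order


if __name__ == "__main__":
    # Chain check: bits shifted in should reappear unchanged at scan-out.
    loaded, _ = scan_shift([0, 0, 0], [1, 1, 0])
    print(scan_shift(loaded, [0, 0, 0])[1])
    print(scan_test([1, 1, 0], primary_in=0))
```

In a real test the unload of one response and the load of the next pattern share the same shift clocks; the sketch keeps them separate for readability.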
Boundary scan techniques are grouped by the IEEE in a "standard test access port and boundary scan architecture", namely IEEE P1149.1-1990. IEEE 1149 is a family of overall testability bus standards defined by the Joint Test Action Group (JTAG), an international committee of European and American IC manufacturers originally formed in 1986 at Philips. JTAG has set the technical development of the IEEE P1149 standard and promoted its use by all sectors of the electronics industry. The "standard Test Access Port and Boundary Scan architecture", IEEE P1149.1, accepted by the IEEE standards committee in February 1990, is the first standard of this family. Several other standards are under development and have been submitted as drafts to the technical committee of the IEEE 1149 standard.

This chapter was edited by D. Mlynek.
Chapter 9 FUZZY LOGIC SYSTEMS

Systems Considerations
Fuzzy Logic Based Control Background
Integrated Implementations of Fuzzy Logic Circuits
Digital Implementations of Fuzzy Logic Circuits
Analog Implementations of Fuzzy Logic Circuits
Mixed Digital/Analog Implementations of Fuzzy Systems
CAD Automation for Fuzzy Logic Circuits Design
Neural Networks Implementing Fuzzy Systems

1 Systems Considerations

The use of fuzzy logic is rapidly spreading in the realm of consumer products design in order to satisfy the following requirements: (1) to develop control systems with nonlinear characteristics and decision-making systems for controllers, (2) to cope with an increasing number of sensors and exploit the larger quantity of information, (3) to reduce development time, (4) to reduce the costs associated with incorporating the technology into the product. Fuzzy technology can satisfy these requirements for the following reasons.

Nonlinear characteristics are realized in fuzzy logic by partitioning the rule space, by weighting the rules, and by the nonlinear membership functions. Rule-based systems compute their output by combining results from different parts of the partition, each part being governed by separate rules. In fuzzy reasoning, the boundaries of these parts overlap, and the local results are combined by weighting them appropriately. That is why the output of a fuzzy system is a smooth, nonlinear function.

In decision-making systems, the target of modeling is not a control surface but the person whose decision-making is to be emulated. This kind of modeling is outside the realm of conventional control theory. Fuzzy reasoning can tackle it easily, since it can handle qualitative knowledge (e.g. linguistic terms like "big" and "fast", and rules of thumb) directly. In most applications to consumer products, fuzzy systems do not directly control the actuators, but determine the parameters to be used for control.
For example, they may determine the washing time in washing machines, decide whether it is the hand or the image that is shaking in a camcorder, compute which object is supposed to be in focus in an auto-focus system, or determine the optimal contrast for watching television.

A fuzzy system encodes knowledge at two levels: knowledge which incorporates fuzzy heuristics, and knowledge that defines the terms used at the former level. Due to this separation of meaning, it is possible to encode linguistic rules and heuristics directly. This reduces the development time, since the expert's knowledge can be built in directly.

Although the developed fuzzy system may have complex input-output characteristics, as long as the mapping is static during the operation of the device, the mapping can be discretized and implemented as a memory lookup on simple hardware. This further reduces the costs involved in incorporating the knowledge into the device.

2 Fuzzy Logic Based Control Background

2.1 Mamdani-Type Controllers

E.H. Mamdani is well known in the fuzzy logic community for the work he did in the 1970s, which is still topical today. He extended the application field of fuzzy logic theory to technical systems, whereas most scientists thought that its applications were restricted to non-technical fields (such as human sciences, trade, jurisprudence, etc.). He first suggested that a control task performed by an operator could be performed equally well by fuzzy logic, once the operator's experience had been translated into qualitative linguistic terms ([Mam73]-[Mam77]). Mamdani's method then gave rise to many engineering applications, especially for industrial fuzzy control and command systems.
Mamdani introduced the fuzzification/inference/defuzzification scheme and used an inference strategy that is generally referred to as the max-min method. This inference type is a way of linking input linguistic variables to output ones in accordance with the generalized modus ponens, using only the MIN and MAX functions (as T-norm and S-norm (or T-conorm) respectively). It makes it possible to achieve approximate reasoning (or interpolative inference).

Let us consider a set of inference rules in the guise of a fuzzy associative memory (FAM), represented in an inference matrix or table, and some fuzzy sets and respective membership functions that have been attributed to each variable (fuzzification). Figure 2.1 represents the case where 2 rules are activated, involving 2 input variables (x and y) and one output variable (r). Let x be a measure of x(t) at time t, and y a measure of y(t) at the same time. Let us now consider the fuzzy sets A1, A2, B0 and B1, whose respective membership functions µA1(x), µA2(x), µB0(y) and µB1(y) take positive values for x and y. The affected inference rules are, for example, as follows:

    if (x = A1 AND y = B0) then r = C0
    else if (x = A2 AND y = B1) then r = C1

They can also be expressed as follows:

A statement like x = A1 is true to a degree µA1(x), and a rule is activated when the combination of all membership grades (or truth degrees) µi(k) in its condition part (premise) takes a strictly positive value. Several rules may be activated simultaneously. The max-min method realises the AND operators of the different rule conditions by taking the minimum of the respective membership grades. The premises can also include some OR operators, realised by taking the maximum of the membership grades, but this is rarely the case in control systems.

The implications (connective then) are realised by the truncation (or clipping) of the output sets. That consists in taking, for each point, the minimum between the membership grade resulting from the rule condition (fig. 2.1: µA2(x) and µB0(y)) and the membership function of the respective output fuzzy set (fig. 2.1: µC1(r) and µC0(r)):

$$\mu'_{C_i}(r) = \min\bigl(\alpha_i,\ \mu_{C_i}(r)\bigr)$$

where $\alpha_i$ denotes the truth degree of the condition of rule i.

The rules are finally combined by using the connective else, which then acts as the OR operator and is interpreted as the maximum operation for each possible value of the output variable (r on fig. 2.1) according to the defined fuzzy sets (n sets):

$$\mu(r) = \max_{i=1,\dots,n} \mu'_{C_i}(r)$$

With the operations defined above it is then possible to give an algorithm for fuzzy reasoning (in order to achieve a control action, for example).

Figure-2.1: max-min method for 2 rules involving 2 input and 1 output variables

The max-min method finally requires a defuzzification stage, which is generally performed using the centre of gravity method. For the above example it gives the real value r, using the membership function resulting from the max-min method:

$$r = \frac{\int r'\,\mu(r')\,dr'}{\int \mu(r')\,dr'}$$

2.2 Characteristics

Several mathematical properties make the MIN and MAX operators well suited (simple and efficient) to fuzzy inferences, as notably described in [Dub80] and [God94]. There are however several other methods, based on different ways to realise the OR and AND operators, which aim to improve some mathematical properties or numerical implementation characteristics.

The max-prod method, for example, is similar to the max-min method, except that all implications in the rules (then operations) are realised by a product instead of a minimisation. The truth values of the rule conditions are used to multiply the corresponding output sets uniformly, instead of clipping them at a certain level. This preserves the information contained in the shape of these sets, which is partly lost with the max-min method.
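The max-min and max-prod inference schemes and the centre of gravity defuzzification described above can be sketched as follows. This is a minimal illustration under stated assumptions: the triangular membership functions, the numerical ranges and the names `tri`, `mamdani` and `centroid` are hypothetical choices made here, and the integral of the centre of gravity is approximated by a sum over a sampled output axis.

```python
from typing import Callable, List, Sequence, Tuple

Membership = Callable[[float], float]


def tri(a: float, b: float, c: float) -> Membership:
    """Triangular membership function with feet at a and c and peak at b (assumed shape)."""
    def mu(v: float) -> float:
        if v <= a or v >= c:
            return 0.0
        return (v - a) / (b - a) if v <= b else (c - v) / (c - b)
    return mu


def mamdani(x: float, y: float,
            rules: Sequence[Tuple[Membership, Membership, Membership]],
            r_grid: Sequence[float],
            implication: Callable[[float, float], float] = min) -> List[float]:
    """Return the aggregated output membership function sampled on r_grid.

    Each rule is (mu_A, mu_B, mu_C), meaning: if (x = A AND y = B) then r = C.
    implication=min gives the max-min method, a product gives the max-prod method.
    """
    aggregated = [0.0] * len(r_grid)
    for mu_a, mu_b, mu_c in rules:
        truth = min(mu_a(x), mu_b(y))                 # AND realised by MIN
        for k, r in enumerate(r_grid):
            clipped = implication(truth, mu_c(r))     # "then": clip (or scale) the output set
            aggregated[k] = max(aggregated[k], clipped)  # "else": combine rules by MAX
    return aggregated


def centroid(r_grid: Sequence[float], mu: Sequence[float]) -> float:
    """Discrete approximation of the centre of gravity."""
    total = sum(mu)
    if total == 0.0:
        raise ValueError("no rule is activated")
    return sum(r * m for r, m in zip(r_grid, mu)) / total


if __name__ == "__main__":
    A1, A2 = tri(0, 2, 4), tri(2, 4, 6)
    B0, B1 = tri(0, 1, 3), tri(1, 3, 5)
    C0, C1 = tri(0, 2, 4), tri(2, 4, 6)
    rules = [(A1, B0, C0), (A2, B1, C1)]
    grid = [k * 0.01 for k in range(601)]
    print("max-min :", centroid(grid, mamdani(2.5, 1.5, rules, grid)))
    print("max-prod:", centroid(grid, mamdani(2.5, 1.5, rules, grid, lambda w, m: w * m)))
```

Passing the implication operator as a parameter makes the difference between the two methods explicit: only the "then" step changes, while the AND and the rule combination stay the same.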
The product is moreover simpler and faster to execute than the minimum operator in software implementations, and allows some simplifications in numerical realizations of inferences (since it can deal with analytic expressions instead of comparing each pair of stored points).

Another common method is the sum-prod method, which uses the arithmetical mean and the product to realise all the OR and AND operators respectively. Unlike the MAX operator, which selects only the maximum values, the sum takes into account all the sets involved and conserves part of the information contained in their shapes.

These different methods are described and compared in [Büh94]. This analysis shows that they lead to very similar input/output characteristics in the case of a single input variable, when used with triangular or trapezoid output sets. With several input variables, the max-min method produces non-linear characteristics with strong discontinuities, while the sum-prod method produces non-linear characteristics with smoother discontinuities.

Nevertheless, the choice of a specific method is mainly influenced by the way it is implemented. This suggests choosing the max-min method for hardware implementations, because the MAX and MIN operators are then the easiest to implement. According to the authors of [Ngu93], these two operators are moreover the most robust realizations of the T- and S-norm. Since they are the most reliable when membership grades have imprecise and noisy values, they are well suited to fuzzy hardware of questionable accuracy.

Finally, the max-min method is well suited to fuzzy rule-based systems when there is no precise model. It generally leads in a simple way to consistent rules, as can be seen in practical applications.
The choice of an inference method is nevertheless of great importance for one-rule inferences, for example to select one among several candidates or to choose an optimal solution (this case is frequent, especially in non-technical fields). The operators must then be chosen with care, because they directly influence the evaluation criterion and consequently the final decision.

Finally, one important aspect of Mamdani's method is that it is essentially heuristic, and it can sometimes be very difficult to express an operator's or engineer's knowledge in terms of fuzzy sets and implications. Moreover, such knowledge is often incomplete and episodic rather than systematic. There is in fact no specific methodology for clearly deriving and analysing the rule base of a fuzzy inference system (there is consequently no exhaustive choice of optimal rule and set numbers, shapes, operators, ...). Although their principle seems rather simple, fuzzy systems include many parameters and can therefore lead to a great variety of different and complicated characteristics.

Some problems may occur when the rules have to describe a process that is too complex, or have to deal with a high number of variables. It can then be very difficult to define a sufficient set of coherent rules, and there is a danger of having too few rules or conflicting ones.

Several methods are used to optimise inference rules and fuzzy sets; they are more or less refined and sometimes quite complicated (typical ones are the gradient method, the least squares method, simulated annealing, neural networks, etc.). They give rise to adaptive fuzzy systems whose parameters are suited to the conditions of a specific application. The large number of parameters of fuzzy systems makes automatic adaptation rather difficult.
2.3 Application field

Mamdani's method is currently and effectively applied to process control, robotics and other expert systems. It is especially well suited to executing an operator's control or command action. It leads to good results that are often close to the operator's own, while removing the risk of human error. It has thus been used successfully in the control of several plants, such as those in the chemical, cement or steel industries.

Mamdani-type control is simpler than most standard control methods and requires a much shorter development cycle when linguistic rules can be easily expressed (because there is no need to develop, analyse and implement a mathematical model). In several cases it is even as efficient or more efficient, especially when no precise model exists, for example when the process to be controlled is governed by non-linear laws or includes poorly known parameters or disturbances. When the mathematical model contains non-linear terms, they are linearized and simplified under the assumption of small error signals, whereas a non-linear fuzzy method can often control larger ranges of error.

The non-linearity of Mamdani-type control can moreover have a favourable influence on transient phenomena. Consequently, it can sometimes supplant classic control to provide fast responses. When the response speed of conventional controllers is increased, the overshoot also increases (in position control, for example). Fuzzy controllers with highly non-linear characteristics generally give lower overshoot before the settling time than conventional PID control, but small oscillations often remain after settling. This oscillatory behaviour can become very difficult to restrain, and fuzzy controllers are sometimes combined with conventional ones (in a cascade configuration, for example) to exploit the advantages of both.
However, fuzzy controllers can also have perfectly linear characteristics and replace standard controllers, which can sometimes be useful to provide some features of fuzzy controllers.

Fuzzy control is attractive in some cases where parameters vary with time and are easily expressed in linguistic terms. It is indeed easier to reassign one or several fuzzy rules than to calculate the mathematical equations of a new model needed to adapt classic controllers. However, the latter have a smaller number of parameters to adjust, and these parameters can for example be supplied by a fuzzy system which regulates them according to the fluctuations of the features of the controlled process.

2.4 Defuzzification Strategy

The common centre of gravity defuzzification method requires a quantity of calculation that is prohibitive for many real-time applications with software implementations. Its calculation can however be simplified when associated with the sum-prod method [Büh94]. The computation of the centre of gravity can take advantage of the high speed afforded by VLSI when integrated on an IC, although this is quite complex. Simpler defuzzification types are often used, for example the mean of maximum or the height method [Dri93],[Büh94],[God94].

2.5 Takagi-Sugeno's method

A Sugeno-type controller differs from a Mamdani-type controller in the nature of the inference rules, which are generally called Takagi-Sugeno's rules in this case. Whereas Mamdani's method uses inference rules that only deal with linguistic variables, the rules of Takagi-Sugeno lead directly to real values that are not membership functions but deterministic crisp values [Tak83],[Tak85]. This method only uses fuzzy sets for the input variables, and there is no need for any defuzzification stage.
Whereas the antecedents still consist of logical functions of linguistic variables, the output values result from standard functions of the input variables (real crisp functions). In most cases, only linear functions are used. Some variables can appear in such a function and not in the corresponding condition, or vice versa. Each function belongs to the consequence part of a rule and is considered when the respective condition has a positive truth degree. This degree results from the different membership grades of the input fuzzy sets that the rule condition deals with. The final output value is calculated as the weighted average of all the linear functions, the weights being the truth degrees of their respective rule conditions.

Figure-2.2: Takagi-Sugeno's method for 2 rules involving 2 input and 1 output variables

Control rules are derived in the following way. The antecedent part results from a specific fuzzification achieved according to a human operator's or engineer's knowledge. That means some fuzzy sets and some logical functions making up the premises are defined. The input space is thus divided into a certain number of fuzzy subsets, i.e. a certain number of fuzzy rules. The setting-up of the consequence part of the rules then requires a certain amount of numerical data (corresponding to input and output variables) coming from the operation which has to be modelled (fig. 2.3). The coefficients of each linear function are identified from the analysis of these data in order to minimize the difference between the outputs of the original system and those of the model (fuzzy engine). This optimisation is achieved by a weighted linear regression using a certain amount of input data (the weights are calculated as the truth degrees of the fuzzy conditions).
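The weighted-average computation of the output described above can be illustrated with a short sketch. The function names, the triangular input sets and the coefficients of the linear consequences are hypothetical values chosen for illustration; the AND of the premise is realised here by MIN, which is an assumption, since the text does not fix the operator for this method.

```python
from typing import Callable, Sequence, Tuple

Membership = Callable[[float], float]
# A rule is (mu_A, mu_B, (p, q, c)), meaning: if (x = A AND y = B) then r = p*x + q*y + c
Rule = Tuple[Membership, Membership, Tuple[float, float, float]]


def tri(a: float, b: float, c: float) -> Membership:
    """Triangular membership function with feet at a and c and peak at b (assumed shape)."""
    def mu(v: float) -> float:
        if v <= a or v >= c:
            return 0.0
        return (v - a) / (b - a) if v <= b else (c - v) / (c - b)
    return mu


def takagi_sugeno(x: float, y: float, rules: Sequence[Rule]) -> float:
    """Weighted average of the linear consequences; weights are the rule truth degrees."""
    numerator = 0.0
    denominator = 0.0
    for mu_a, mu_b, (p, q, c) in rules:
        truth = min(mu_a(x), mu_b(y))        # truth degree of the rule condition (MIN assumed)
        numerator += truth * (p * x + q * y + c)
        denominator += truth
    if denominator == 0.0:
        raise ValueError("no rule is activated for this input")
    return numerator / denominator           # crisp output, no defuzzification stage


if __name__ == "__main__":
    rules = [
        (tri(0, 2, 4), tri(0, 1, 3), (1.0, 0.0, 0.0)),   # r = x
        (tri(2, 4, 6), tri(1, 3, 5), (0.0, 2.0, 1.0)),   # r = 2*y + 1
    ]
    print(takagi_sugeno(2.5, 1.5, rules))   # prints 2.875
```

For x = 2.5 and y = 1.5 the two rules have truth degrees 0.75 and 0.25, so the output is 0.75 * 2.5 + 0.25 * 4 = 2.875. Replacing each linear function by a constant gives the singleton form of the controller.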
Finally, the system identification depends on the consequence parameters (coefficients in the linear functions) as well as on the premise parameters (input sets) and the fuzzy partition of the input space. The identification of the latter has no direct solution, since it is a combinatorial problem, and is generally obtained through a heuristic search. An iterative algorithm for parameter and structure identification can be used efficiently (fig. 2.4) [Sug85],[Tak85]. The resulting fuzzy model generally gives much better results than a statistical model [Tak83].

Figure-2.3: Takagi-Sugeno's method: identification of a fuzzy logic inference engine

Figure-2.4: Takagi-Sugeno's method: outline of an identification algorithm [Sug85],[Tak85]

2.6 Characteristics

It is not possible to derive efficient control rules by Mamdani's method when it is too difficult to translate a human operator's or engineer's knowledge exactly into linguistic terms. The adaptive Takagi-Sugeno method is however efficient in such cases, provided that inference rules can be derived from the analysis of appropriate numerical data. One more advantage of this method appears in some complicated cases where many different variables play a part in a process. It then leads more reliably to consistent rules than Mamdani's method.

When fuzzy reasoning is used to describe some processes, methods such as Mamdani's are not so powerful, partly because of their nonlinearity. The use of linear relations in the consequences makes it easy to handle Takagi-Sugeno rules as an efficient mathematical tool for fuzzy modelling. It is however not possible to cascade inference rules without performing a new fuzzification step on the crisp output values.

The implementation of Takagi-Sugeno rules is simplified because there is no need for a defuzzification stage.
Output values are simply calculated from input ones, using a set of weighted linear functions. One needs however to be able to collect a certain amount of data (sometimes over a rather long period for time-varying processes) and then to find good coefficients with the least squares method. Measurements and data analysis often require a great deal of time, and it is furthermore not always possible to find satisfactory coefficients. The Takagi-Sugeno method actually becomes difficult or even impossible to apply when a control has strongly non-linear characteristics (for example when there is a large hysteresis loop) or characteristics that change with time.

A method such as simulated annealing could be used to shorten the iterative sequence of tests that achieves the coefficient and structure identification. The approximation of a measured characteristic or of a calculated linear regression could also be achieved efficiently by using a neural network. The structure identification would then result from a supervised learning phase instead of a long and unwieldy iterative search.

2.7 Application field

The rules of Takagi-Sugeno are mainly used for process modelling. Fuzzy controllers are realised by regarding an operator's control action as a process to be modelled, and thus by collecting a lot of data while the operator is working. A typical application of the Takagi-Sugeno method is described in [Sug85], pp.125-138. Sugeno tested this method with good results in the case of automated parking of a car in a garage. Fuzzy control rules are then derived from the observation and modelling of a person's parking actions. The method is well suited to this situation, which is easier to observe than to express in linguistic terms. The actions of arms and legs are indeed easier to measure than to verbalise, just like most gestures which have been acquired through heuristic experience or which are instinctive, rather than resulting from precise reasoning.
2.8 Sugeno-Type controllers

Sugeno's method is a special case of the Takagi-Sugeno method; its particularity lies in the fact that all the output membership functions are precise values. The consequence functions are then replaced by constants, which are weighted by the condition truth degrees to give the final output values. The fuzzy sets forming these precise values are generally called singletons or sets of Sugeno (fig. 2.5). It can also be regarded as a particular case of Mamdani's method where straight, non-overlapping output sets without thickness are used.

Figure-2.5: Example of Singletons (Sets of Sugeno)

2.9 Characteristics

The main advantage of using singletons is the great simplicity of implementing the consequence part. They can be used with Mamdani's method to simplify the defuzzification stage considerably: its task is reduced to the calculation of a weighted average over a restricted set of crisp values. Another defuzzification type, called the maxima method, is sometimes used (cf. [Mam74bis],[Dri93]). In this case each output crisp set corresponds to a specific action, and only the one with the maximum weight is selected (or an action midway between two maximum values).

The use of singletons has no adverse effect on the output variable domain, which can be the same as with triangular or trapezoid output sets when using the centre of gravity method. The nonlinearity of a controller characteristic can be modulated by the distribution of these sets (but no longer by their shapes).

The difficulty of describing some processes can however then be increased in comparison with the use of more complex sets in Mamdani's method or the use of linear functions in the Takagi-Sugeno method.
For the latter, the restriction to constant functions in the consequence part no longer makes it possible to describe complicated relationships between input and output variables in a rather simple format. This method then sometimes leads to a worse minimisation of the output error between a real process and a fuzzy model, and structure identification (fuzzy partition of the input space) has to be refined much further to obtain satisfactory results in complicated cases.

3 Integrated Implementations of Fuzzy Logic Circuits

3.1 Fuzzy dedicated circuits

Recent advances in fuzzy logic theory have brought rule-based algorithms to a growing field of applications. Several implementation approaches have been proposed during the last ten years, especially in Japan, where fuzzy systems have seen a real proliferation. Fuzzy control has proved that it can lead to good performance with short design times in a wide variety of low-complexity real-world applications. Many specific application programs have been developed to implement fuzzy operations on standard digital computers, and most processor manufacturers provide software environments to develop and simulate fuzzy applications on their available microcontrollers.

The design of fuzzy dedicated integrated circuits is however of great interest, because of the increasing number of fuzzy applications requiring highly parallel and high-speed fuzzy processing. These circuits attempt to give concrete expression to the idea of "fuzzy computers" (sometimes called computers of the sixth generation), which deal with analog values or some digital representation of them.

Fuzzy processors are designed to optimise fuzzy logic functions (cf. §3.2) as regards their implementation size and execution speed. Since practical systems often require a great number of rule evaluations per second, a huge amount of real-time data processing is necessary and the speed of fuzzy circuits is of prime importance.
Their architectures are generally suited to the structure of approximate reasoning and decision-making algorithms, and generally include three distinct parts, performing fuzzification, inference and defuzzification respectively. Fuzzy controllers are thus made up, on the one hand, of a knowledge base, which contains the rules and the outlines of the membership functions, as well as other configuration parameters of the system. On the other hand stands an inference mechanism (based on interpolative reasoning) interfacing with a real process in a feedback loop configuration.

Figure-3.1: General structure of fuzzy logic dedicated processors for approximate reasoning

The structure of fuzzy dedicated circuits is characterized by the number and shape range of input and output variables, the number of rules they can evaluate simultaneously, the type(s) of inference (size of premises, operators, consequences, ...), the type(s) of defuzzification method, and so on. Their performance is evaluated according to their processing speed, that is, the number of fuzzy logic inferences per second (FLIPS), and according to their precision (error and noise generation in analog circuits, and number of bits representing fuzzy values in digital ones).

Fast response is required for non-linear functions such as MIN and MAX, whose output signals can be subject to sudden discontinuities. These functions are however piecewise linear, and accuracy is also quite important.

Fuzzy chips are mainly used in expert systems, as well as in the control and command field, to achieve real-time performance. They are however less efficient for applications that deal with a large amount of data, because of the limited I/O. It can thus be interesting to design several compatible circuits, identical or different. They can for example be dedicated to inference rules and to defuzzification respectively.
In this manner it is possible to connect them together in several different ways, for example using some of them to perform parallel fuzzifications and inferences, and connecting them to a defuzzification circuit [Hir93]. With a large number of input variables it may be useful to split the control system into several cascaded or superposed units, which aims at simplifying the inference engine [Büh94].

Fuzzy software is useful when an application can be modelled, in order to simulate and calculate in advance a multidimensional response characteristic. It gives the different parameters relative to an optimal characteristic in order to design or program a fuzzy dedicated circuit. Some of these parameters (fuzzy sets and inference rules) then only have to be adjusted in the real implementation.

In some simple cases, the numeric response characteristic can be stored as a reference rule table in a multidimensional associative memory, which then provides response values for real-time processing without any further inference calculations. The stored values are provided by fuzzy software (when a model exists), by measuring an operator's action, or by an adaptive system with a retroactive learning scheme (which comes close to the principle of fuzzy dedicated neural networks) [Chan93].

When no model exists, a fuzzy processor has to be programmable, because its parameters are established by iterative testing on the real implementation, not by simulation. This process requires a lot of tests and is time consuming, and may furthermore not reach an optimal solution. Since no simulation of the system's dynamic behaviour is carried out, there is a danger of unstable working states in control applications.

The configuration flexibility of an integrated fuzzy system requires a large storage capacity (digital circuits) or tunable components (analog circuits) to deal with variable numbers and shapes of membership functions, as well as variable numbers and sizes of rules.
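The idea mentioned above of storing a static response characteristic as a table, so that no inference has to be computed at run time, can be sketched as follows. Everything named here is hypothetical: `fuzzy_map` is only a stand-in for a static fuzzy input-output mapping with two inputs in [0, 1], and `LookupController` is an illustrative class, not a description of any particular chip or memory organisation.

```python
from typing import Callable, List


def fuzzy_map(x: float, y: float) -> float:
    """Stand-in for a static fuzzy mapping with two singleton rules (assumed, for illustration).

    "x is high OR y is high"   -> output 8
    "x is low AND y is low"    -> output 2
    """
    w_high = max(x, y)
    w_low = min(1.0 - x, 1.0 - y)        # equals 1 - w_high, so the weights sum to 1
    return (8.0 * w_high + 2.0 * w_low) / (w_high + w_low)


class LookupController:
    """Precomputes a mapping on a regular grid and answers queries by table lookup."""

    def __init__(self, mapping: Callable[[float, float], float],
                 lo: float, hi: float, bits: int):
        self.lo, self.hi = lo, hi
        self.levels = 1 << bits                          # number of grid points per input
        self.step = (hi - lo) / (self.levels - 1)
        grid = [lo + k * self.step for k in range(self.levels)]
        self.table: List[List[float]] = [[mapping(x, y) for y in grid] for x in grid]

    def _index(self, v: float) -> int:
        v = min(max(v, self.lo), self.hi)                # clamp to the stored range
        return round((v - self.lo) / self.step)          # nearest grid point

    def __call__(self, x: float, y: float) -> float:
        return self.table[self._index(x)][self._index(y)]


if __name__ == "__main__":
    controller = LookupController(fuzzy_map, 0.0, 1.0, bits=4)   # 16 x 16 entries
    print(controller(0.30, 0.62), fuzzy_map(0.30, 0.62))
```

The table size grows as 2^(bits × number of inputs), which is one reason why this approach is restricted to simple cases with few input variables.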
Since the shape of the membership functions is generally less important than their degree of overlap, the horizontal position of the sets should always be tunable, but complicated shapes are often not essential.

Figure-3.2: Fuzzy logic circuits

3.2 Basic fuzzy logic functions and their relations

The fuzzy operators presented below have been introduced to deal with fuzzy values and are relative to specific membership grades µ(x). A fuzzy operator that deals with two fuzzy sets A and B defines a new membership function. The real variable (x) has been omitted in the expressions below. Conventional operators such as the algebraic sum (+), the algebraic difference (-) and the algebraic product (.) are commonly used in fuzzy reasoning systems.

The first nine fuzzy operators are the most commonly used in hardware system design to implement fuzzy information processing. They moreover have the particularity that they can all be expressed using only the bounded difference and algebraic sum (+) functions [Yam86],[Zhij90],[Sas90]. There have been several approaches using these two functions to design fuzzy units, which offers attractive perspectives for CAD automation and semicustom circuits.

Other operators also make it possible to express any fuzzy formula when combined together, for example the bounded sum and the bounded product [Lem94]. These are associative and commutative, but are not distributive over each other. The non-distributivity of bounded operators leads to long and complicated manipulations when they are used to solve fuzzy formulae, and it is far from obvious how to substitute or eliminate some terms. Fuzzy formulae unfortunately cannot be reduced as far or as simply as boolean formulae.
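The statement that common fuzzy operators can be built from the bounded difference and the algebraic sum alone can be checked with a few lines of code. The table of operators is not reproduced in this text, so the definitions used below are the commonly used ones and are assumptions here: bounded difference max(0, a - b), bounded sum min(1, a + b), bounded product max(0, a + b - 1), complement 1 - a. The function names are illustrative.

```python
def bounded_difference(a: float, b: float) -> float:
    """Primitive operator: a minus b, bounded below by 0 (assumed definition)."""
    return max(0.0, a - b)


# Every operator below uses only bounded_difference and the algebraic sum (+).

def complement(a: float) -> float:
    return bounded_difference(1.0, a)                       # 1 - a


def fuzzy_min(a: float, b: float) -> float:
    return bounded_difference(a, bounded_difference(a, b))  # logic product (MIN)


def fuzzy_max(a: float, b: float) -> float:
    return a + bounded_difference(b, a)                     # logic sum (MAX)


def bounded_sum(a: float, b: float) -> float:
    # min(1, a + b) = 1 - max(0, (1 - a) - b)
    return bounded_difference(1.0, bounded_difference(bounded_difference(1.0, a), b))


def bounded_product(a: float, b: float) -> float:
    # max(0, a + b - 1) = max(0, a - (1 - b))
    return bounded_difference(a, bounded_difference(1.0, b))


def absolute_difference(a: float, b: float) -> float:
    return bounded_difference(a, b) + bounded_difference(b, a)   # |a - b|


if __name__ == "__main__":
    a, b = 0.75, 0.5
    print(fuzzy_min(a, b), fuzzy_max(a, b), bounded_sum(a, b), bounded_product(a, b))
```

Since each function is a fixed composition of the same two primitives, a hardware library containing only a bounded-difference cell and an adder would in principle be sufficient for all of them, which is the property that makes this representation attractive for semicustom design.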
Some other fuzzy operators, such as the symmetric difference, drastic difference, drastic sum or drastic product, are sometimes used to resolve some fuzzy equations.

4 Digital Implementations of Fuzzy Logic Circuits

4.1 Digital approach

The VLSI digital implementation of fuzzy logic systems offers several advantages that stem from the sound knowledge of digital circuit design and technology. Several mature CAD tools allow relatively easy design automation (synthesis and simulation), consequently reducing development time and cost. The automatic regeneration of logic levels provides high noise immunity and low sensitivity to the variance of transistor characteristics. This provides accurate and reliable data and signal processing. Binary data can be easily stored, which makes programmable and multistage fuzzy processing possible.

Complex representations of fuzzy vectors and parallel structures are however required to obtain accurate and fast processing. Digital implementations of common fuzzy operations unfortunately lead rapidly to complicated, enormous VLSI circuits. The density and speed of these circuits are nevertheless continually increasing with technological advances, so that they will become more and more efficient for implementing fuzzy logic systems.

4.2 Digital fuzzy circuits features

Digital fuzzy processors are generally designed for multipurpose applications in order to interest a maximum of potential customers. They should thus implement a large and varied number of fuzzy operators, membership functions and inference rules.
This makes them reasonably efficient for a wide range of applications, provided that appropriate programming is possible (which presupposes an appropriate internal or external memory). Combined with an appropriate object-oriented programming environment, linguistic rules derived from a human expert can be directly translated into an implementation on a chip. The first hardware implementations of a fuzzy logic inference engine were the digital circuits designed by Togai and Watanabe [Tog86],[Tog87],[Wat90]. ASICs implementing specific architectures and specialized instructions (MIN & MAX) exhibit much higher processing speed than regular microprocessors.

operators: Min, Max
implementation architectures: fuzzy microcontroller (embedded systems with A/D & D/A converters, fuzzy arithmetic logic unit (FALU) and specific memories, ...) versus fuzzy coprocessor
sequential processing -> enhance speed by parallel, pipeline [Nak93], systolic, RISC, SIMD architectures
fuzzification strategy, defuzzification strategy

For applications for which the restricted number of pins of a chip is inappropriate (such as fuzzy databases, i.e. ambiguous data or fuzzy relations between crisp data), a fuzzy coprocessor is used rather than a stand-alone chip. The same applies when other conventional digital processing is required (flexible system).

4.3 Digital Fuzzy Processing

Analog fuzzy values must be converted into strictly binary signals before being processed by standard digital circuits. On the one hand, the analog input signals must be quantized by A/D converters; on the other hand, the membership functions must be quantized to obtain their digital representations. Fuzzy sets can then be stored, but in the guise of staircase functions. The combination of these two round-off effects can degrade fuzzy processing if the fuzzy values are not represented with a sufficient number of bits.
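The two round-off effects just described (quantization of the input and quantization of the membership grade) can be illustrated with a small numerical experiment. This is a sketch under stated assumptions: inputs are normalised to [0, 1], the membership function is a triangle with hypothetical centre and half-width, and both quantizers are uniform. None of these values come from the text.

```python
def quantize(value: float, bits: int) -> float:
    """Round a value in [0, 1] to the nearest of 2**bits uniform levels."""
    levels = (1 << bits) - 1
    return round(value * levels) / levels


def triangle(x: float, centre: float = 0.5, half_width: float = 0.25) -> float:
    """Ideal (continuous) triangular membership function."""
    return max(0.0, 1.0 - abs(x - centre) / half_width)


def quantized_grade(x: float, input_bits: int, grade_bits: int) -> float:
    """Grade seen by a digital circuit: A/D-converted input, stored staircase set."""
    x_q = quantize(x, input_bits)                 # first round-off: the input
    return quantize(triangle(x_q), grade_bits)    # second round-off: the grade


def worst_case_error(input_bits: int, grade_bits: int, samples: int = 1001) -> float:
    """Largest deviation from the ideal grade over a uniform sweep of the input."""
    xs = (i / (samples - 1) for i in range(samples))
    return max(abs(quantized_grade(x, input_bits, grade_bits) - triangle(x)) for x in xs)
```

Comparing `worst_case_error(3, 3)` with `worst_case_error(8, 8)` shows how quickly the combined error shrinks as bits are added, which is the precision side of the precision versus size trade-off.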
There is however a trade-off between precision and size (or speed), since the latter is proportional to the number of bits. The inputs are furthermore sampled at the frequency of the digital circuits. Sampling effects: non-continuous (or pseudo-continuous) control [Büh94] § 7.3

4.4 Different approaches of digital IC's

Togai's FC110 fuzzy accelerator, coprocessor
Watanabe RISC approach
THOMSON's WARP (Weight Associative Rule Processor)
Cell-to-cell mapping approach, automatic synthesis [Pag93]
SIEMENS's coprocessor (SAE81C99), optimized memory organisation [Eich92]
Sasaki SIMD and Logic-in-Memory structure [Sas93]
OMRON's FP-3000, FP-1000 & FP-5000
NeuraLogix' controllers
INFORM's FUZZY-166 Processor

4.6 Application fields and future trends

Control, expert systems, robots, image recognition, diagnosis, database (interface with digital circuits), information retrieval, ... High prices must be reduced to make these processors more attractive. Some less common hardware implementations, such as pulse-width modulation [Ung93] or superconducting processors, have also been realised and cannot be excluded from future developments as technology evolves.

5 Analog Implementations of Fuzzy Logic Circuits

5.1 Analog fuzzy circuits features

Analog circuits present several advantages over digital ones, especially regarding processing speed, power dissipation and functional density. They can moreover perform continuous-time processing and are well compatible with sensors, actuators and all other analog signals. They are therefore naturally suited to dealing with fuzzy values, which are analog by nature. Some continuous representations of symbolic membership functions and some non-linear fuzzy operations can be easily synthesised by exploiting transistor characteristics.
There is no need for A/D or D/A converters when they are implemented in a real system, provided that no specific digital signal processing is required. Analog circuits can then supplant digital controllers for some applications requiring low-cost, low-consumption, compact and high-speed stand-alone chips. They nevertheless suffer from the lack of reliable memory cells. They are consequently not well suited to pipeline structures and offer very restricted programmability. Fortunately, the nature of fuzzy variable systems calls for extensive parallelism, which makes analog circuits well suited to performing numerous high-speed inferences and also limits the problem of error accumulation. Analog controllers can achieve real-time fuzzy reasoning with a large number of fuzzy implications, especially when high accuracy is not required. They are well suited to dealing with vague and imprecise models, for example tasks that interface with human senses (eye, tactile nerves, ear,...) or replace human reasoning (pattern recognition, reflex processes,...). The accuracy of fuzzy systems is actually not always so important, since there is no thorough mathematical background that can be used to define precise and exhaustive fuzzy methods. Imprecise but adjustable analog devices are consequently suitable for a great number of cases (tunable membership functions are thus needed to optimize performance). They are nevertheless much less flexible and adaptable than programmable digital ones, and they must be designed according to the structure of a specific application. Basic programmability is provided when some external analog parameters can be adjusted or when some binary inputs allow the control of internal switches. The choice of common fuzzy operators (§3.2) is not exhaustive, and other methods could just as well be implemented, since there is no mathematical proof that some of them are really optimal.
Thus some other operators can be chosen instead of the common MIN and MAX ones, so that they can be performed efficiently with analog circuits. Some dense, fast and accurate hardware operators can give very good practical results, even if they are theoretically not optimal. A novel way of evaluating the condition part of fuzzy rules has thus been introduced in [Land93]. The degree of membership of an input vector to some fuzzy subspace is defined using a measure of distance between this vector and some central point in the subspace. MIN and MAX operators are no longer implemented, and very dense, high-speed analog hardware can be realised (in current-mode). Analog signals are represented either by voltages (voltage-mode, §5.2) or by currents (current-mode, §5.4). Stable and low-noise analog technologies (n-well CMOS, BiCMOS) must be used in order to design analog circuits with sufficient accuracy over a wide frequency range. Reliable CAD tools for automatic design, as well as fast verification and simulation tools, are required for effective design.

5.2 Voltage-mode

5.2.1 Yamakawa's approach

Voltage-mode is attractive because it makes it easy to distribute a signal to various parts of a circuit. Non-linear operators such as MIN, MAX and truncation are quite easy to implement in voltage-mode. Multiple-input MIN & MAX circuits built with bipolar transistors are shown in figures 5.1 and 5.2 and are called emitter-coupled fuzzy logic gates. These basic non-linear gates exhibit good characteristics and robustness [Yam93]. Such circuits are impractical with MOS transistors, which cause an unacceptable error associated with the transition regions in which multiple devices are active. CMOS multiple-input MIN & MAX circuits using gain-enhanced voltage followers based on differential amplifiers are presented in [Fatt93] & [Tou94].
They are more complicated but, according to the authors, offer high-frequency and accurate performance.

Figure-5.1: MIN cell, voltage-mode [Yam88]

Figure-5.2: MAX cell, voltage-mode [Yam88]

Yamakawa has designed a BiCMOS monolithic rule chip (TG004MC [Yam88], TG005MC [Hir93]) whose architecture is shown in figure 5.3. It is built with about 600 transistors and 800 resistors. The response time of a fuzzy inference is about 1µs (1 Mega-FLIPS). This chip implements one fuzzy inference with three variables in the antecedent and one variable in the consequent. The antecedent part is made of membership function circuits (fig. 5.4) providing, for each variable, seven possible fuzzy linguistic values (one of which is selected according to a voltage label VLAB). These values can have four possible shapes assigned by an external binary signal, and the slopes can be changed by external resistors. A special label "not assigned", corresponding to a constant high level (+5V), can stand for constant membership functions whose value is 1. A truncation circuit is nothing more than a number n of 2-input MIN circuits connected to an n-input MAX circuit. A voltage level acting as the truncation factor is applied to one input of each MIN circuit, and the membership grades of the output fuzzy sets are applied to the second inputs. The output fuzzy sets are sampled in the consequent part by a membership function generator (MFG) in order to form fuzzy words of 25 elements which can be processed by the truncation circuit. These words are realised by a switch array controlled by a switch setting circuit [Gup88],[Yam93], and represent the consequent membership functions as analog voltages on 25 lines.
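The inference performed by such a rule chip can be mimicked in software. The following is a behavioural sketch only, not a model of the TG005MC circuit. It assumes the usual MIN-MAX (Mamdani-style) scheme: the three antecedent grades are combined by MIN, the result truncates a 25-element consequent word element by element, and the words of several rules are merged by an element-wise MAX. All function names are hypothetical.

```python
WORD_LENGTH = 25  # number of sampled elements per fuzzy word, as in the text


def rule_strength(antecedent_grades):
    """Combine the antecedent membership grades of one rule with MIN."""
    return min(antecedent_grades)


def truncate(word, strength):
    """Clip every element of a consequent word with a 2-input MIN."""
    return [min(strength, grade) for grade in word]


def aggregate(words):
    """Merge the truncated words of several rules with an element-wise MAX."""
    return [max(column) for column in zip(*words)]


def triangular_word(centre, half_width, length=WORD_LENGTH):
    """Sample a triangular consequent set on `length` equally spaced points."""
    return [max(0.0, 1.0 - abs(i - centre) / half_width) for i in range(length)]
```

For instance, `truncate(triangular_word(12, 4), rule_strength([0.75, 0.5, 1.0]))` yields a trapezoidal word whose plateau sits at 0.5, which is the clipped shape the truncation circuit would present on its 25 output lines.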
Figure-5.3: Architecture of the TG005MC rule chip [Yam93]

Figure-5.4: Principle of a basic membership function circuit [Hir85]

Yamakawa also designed an analog defuzzifier chip (TB005PL [Yam88], TB010PL [Hir93]) which implements the centre of gravity calculation for 25 values (fig. 5.5). Its architecture consists of an ordinary addition circuit in parallel with a weighted addition circuit, followed by an analog divider. It is built with resistor arrays, operational amplifiers, an analog divider and capacitors in a hybrid structure. The response time of the defuzzification is about 5µs and is almost entirely determined by that of the divider. The sum is computed by a simple network of 25 identical resistors connected to a common node, which produces a current proportional to the desired sum. The weighted addition is performed in the same way, but with 25 different resistors: 0, R, R/2, R/3,..., R/24. For the emitter junction of a bipolar transistor, the base-to-emitter voltage is proportional to the logarithm of the emitter current (over the range where the effects of series and shunt resistors and of the reverse saturation current are negligible). Thus the division of the two currents (proportional to the sum and the weighted sum respectively) is implemented by the subtraction of two base-to-emitter voltages. Finally, the divider is followed by a level shifter with current-voltage conversion [Yam88bis]. Several rule chips can be connected to such a defuzzifier chip, which calculates a deterministic value from the maximal membership function. Several defuzzifier chips can also be used to realise systems involving several conditions and several conclusions. The main weak point of such systems is that they generally lead to cumbersome, non-optimal implementations. OMRON has developed the above analog fuzzy chips, as well as a rudimentary analog fuzzy computer based on these two circuits: the FZ-5000.
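The centre of gravity computation of the defuzzifier described above can be sketched numerically. The code is illustrative and rests on assumptions: the 25 elements are weighted by their index 0 to 24 (conductances proportional to 0, 1, ..., 24), and the divider is idealised as an exact subtraction of logarithms, mirroring the base-to-emitter voltage principle. The function names are hypothetical.

```python
import math


def plain_sum(word):
    """Current from the network of identical resistors: proportional to the sum."""
    return sum(word)


def weighted_sum(word):
    """Current from the weighted network: element i contributes i times its grade."""
    return sum(index * grade for index, grade in enumerate(word))


def log_domain_divide(numerator, denominator):
    """Idealised divider: ln(a) - ln(b) = ln(a / b), then back to the linear domain."""
    return math.exp(math.log(numerator) - math.log(denominator))


def centre_of_gravity(word):
    """Crisp output as weighted sum over plain sum (direct division)."""
    total = plain_sum(word)
    if total <= 0.0:
        raise ValueError("empty fuzzy word: centre of gravity undefined")
    return weighted_sum(word) / total


def centre_of_gravity_log(word):
    """Same result, obtained through the log-domain divider model."""
    return log_domain_divide(weighted_sum(word), plain_sum(word))
```

A symmetric triangular word centred on element 12 gives a centre of gravity of 12 with both functions. The log-domain version additionally requires a strictly positive weighted sum, a restriction of this sketch rather than a documented property of the chip.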
The FZ-5000 consists of a multi-board system in which each board includes four inference chips and one defuzzifier chip. Inference time is about 15µs, including defuzzification. Programming is done using switches and jumper pins, or with a specific software tool (FT-6000) running on a personal computer.

Figure-5.5: Architecture of the TB005PL defuzzifier chip [Yam88bis]

5.2.2 Voltage-mode disadvantages

Voltage-mode fuzzy circuits imply a large energy stored in the node parasitic capacitances (CV²/2), and speed is limited by the charging delays of the various capacitors. They are moreover penalized by a certain lack of precision, because the signals are sensitive to changes in the supply voltages. This is especially significant when the voltage range is restricted in order to limit transistor operation to a small part of their characteristic, or when power consumption must be limited. The problems mainly lie in the sizing of some components. Several functions are very difficult to build in voltage-mode, and this is also true of some basic ones such as the algebraic sum. The approach described above needs resistors to perform additions and to convert voltages into currents. Integrated resistors are unfortunately inaccurate and cumbersome, and involve significant parasitic capacitances. The truncation of the consequent and the defuzzification pose a significant problem as regards the parallelism of the inference engine (especially when the number and size of the output sets are large). This approach implies high power dissipation and large chip area, and leads to high-cost implementations.

5.3 Mixed-mode

5.3.1 Transconductance

A hybrid mode can be realized by implementing both current-mode and voltage-mode operations on the same chip.
This makes it possible to avoid some problems inherent to voltage-mode while taking advantage of some benefits of current-mode. For example, the sum of analog voltages is rather complex to realize, while it corresponds to simple wire connections in current-mode. This technique has been used in the defuzzifier chip of §5.2.1. The difficulty is to obtain linear and accurate conversions between voltage-mode and current-mode without losing too much precision. Efficient transconductance elements (fig.5.6) should exhibit both good linearity and good frequency response in order to deal with fast and non-linearly varying membership values. They should moreover occupy little area and consume little energy.

Figure-5.6: A linear tunable transconductance [Park86]

5.3.2 OTA-based approach

Operational transconductance amplifiers (OTAs) can be used as basic building blocks to design analog CMOS circuits for fuzzy logic. OTAs, diodes and OTA-simulated resistors are sufficient to realize every useful fuzzy circuit. Fig.5.7 shows an OTA of the same type as those used in [Inou91]. The fuzzy grade interval [0,1] is represented by [0V,1V] in order to ensure linear operation of the OTAs. A more effective OTA based on the transconductance element of fig.5.6 is also presented in [Park86]. It is highly linear and efficient over a wide range of frequencies, but at the expense of higher power consumption and chip area.

Figure-5.7: Operational Transconductance Amplifier (OTA)

A high-speed bounded difference operation can easily be implemented using OTAs. For the circuit shown in fig.5.8, the relationship between the input voltages (VIN1 & VIN2) and the diode current satisfies the definition of the bounded difference.
As the first OTA supplies a current proportional to the difference of the input voltages, the bounded difference is realized for µ1 greater than or equal to µ2 when the input voltages represent the membership functions µ1 and µ2. It is also realized for µ1 < µ2, since the diode can only pass the unidirectional output current of the OTA. The second OTA finally converts this current into a proportional output voltage (VOUT), in order to obtain a voltage-mode bounded difference characteristic. Its output current is indeed identical to the diode current, and is also proportional to the output voltage, which is fed back to an input of the second OTA.

Figure-5.8: OTA-based bounded difference operator

This OTA-based bounded difference operator can be used effectively to synthesise all other fuzzy functions according to the relationships described in §3.2. The algebraic sum used in these relationships is realised by simple wire connections before the currents are finally converted into voltages with OTAs. The fuzzy complement operation is realized when the positive input voltage (VIN1) is connected to the high level (µ1 = 1) in the circuit of fig.5.8 (VOUT is then the complement of VIN2). The synthesis of two-input MAX and MIN operators is shown in fig.5.9. Writing ⊖ for the bounded difference, since µ1 ≥ (µ1 ⊖ µ2) in any case, MIN(µ1,µ2) = µ1 ⊖ (µ1 ⊖ µ2) = µ1 - (µ1 ⊖ µ2). The algebraic difference leads to a faster MIN operator synthesis than two cascaded bounded differences. A current equal to the negative value -(µ1 ⊖ µ2) is obtained by permuting the two input voltages (VIN1 as µ1 & VIN2 as µ2) in a common bounded difference circuit whose diode (D1) has been connected in the reverse direction. The diode D2 is only useful if the output voltage has to be limited to positive values.
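The behaviour described for the OTA-based bounded difference and for the MIN synthesis can be captured in an idealised model. This is a rough sketch, not a simulation of fig.5.8 or fig.5.9: it assumes perfectly linear OTAs with a common transconductance `gm` (an arbitrary assumed value), an ideal diode, and a second OTA whose feedback settles at VOUT = I / gm. The function names are hypothetical.

```python
GM = 1.0e-4  # assumed transconductance in siemens (hypothetical value)


def ota_current(v_plus: float, v_minus: float, gm: float = GM) -> float:
    """Ideal OTA: output current proportional to the input voltage difference."""
    return gm * (v_plus - v_minus)


def diode(current: float) -> float:
    """Ideal diode: passes positive current only."""
    return max(0.0, current)


def bounded_difference_voltage(v_in1: float, v_in2: float, gm: float = GM) -> float:
    """First OTA and diode give the clipped current; second OTA converts it to a voltage."""
    return diode(ota_current(v_in1, v_in2, gm)) / gm


def min_voltage(v_in1: float, v_in2: float, gm: float = GM) -> float:
    """MIN as v1 minus the bounded difference, summed as currents on one node."""
    i_v1 = gm * v_in1                                   # current representing v1
    i_negative = -diode(ota_current(v_in1, v_in2, gm))  # reversed-diode branch
    return (i_v1 + i_negative) / gm                     # wire sum, then I-to-V conversion
```

With grades mapped to [0 V, 1 V], `bounded_difference_voltage(1.0, v)` returns the complement of `v`, matching the remark that tying VIN1 to the high level yields the complement operation.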
Figure-5.9: OTA-based MAX and MIN operators [Inou91]

Multiple-input MAX and MIN circuits are also described and analysed in [Inou91], where a fuzzy membership function circuit is also presented. They are made of two-stage OTA structures which provide high-speed signal processing. OTA-based realisations of other fuzzy functions are however often quite complicated and require more OTA stages, which radically degrades the accuracy and speed of processing. Several OTAs are actually necessary to synthesise some relationships of §3.2, since many voltage-current or current-voltage conversions are needed before performing bounded differences (which deal with voltages) and algebraic sums (which deal with currents). It is moreover quite difficult to carry out simplifications when these different functions are combined together, and the principal optimisation consists of increasing the parallelism of the OTA stages. This OTA-based approach unfortunately leads rapidly to very large and not very accurate circuits when implementing the fuzzy processing of complex inference systems.

5.3.3 Current-mode multiple-input MIN and MAX circuits

Multiple-input MIN and MAX circuits can be realized with current mirrors composed of a standard MOS transistor and a lateral MOS transistor in bipolar operation [Tou94]. They are suitable for current-mode CMOS circuits and are based on the same principle as the voltage-mode circuits of fig. 5.1 & 5.2. Input currents are converted into gate-source voltages which are applied to the base terminals of bipolar transistors connected as voltage followers. The voltages are processed according to the MIN or MAX operation before being converted into an output current. Precision depends on the symmetry of the structures based on MOS-bipolar mirrors followed by the inverse bipolar-MOS mirror (whose bipolar transistor also compensates the DC level shift of approximately 0.7V).
Figure-5.10: Current-mode 3-input MIN circuit [Tou94]

5.4 Current-mode

5.4.1 Current-mode circuits

Current-mode circuits do not need resistors and can achieve summation and subtraction in the simplest way, just by wire connections. This leads to simple and intuitive configurations, which exhibit high speed and great functional density. They are used more and more, especially for systems requiring a high level of interconnectivity (neural networks for example). High speed is obtained when capacitive nodes are not subject to large voltage fluctuations. Current-mode circuits can also offer advantages such as low power dissipation and low supply voltage, as well as good insensitivity to fluctuations of the latter. Since they have a single fan-out, current repeatability is of prime importance and the distribution of signals requires multiple-output current mirrors.

5.4.2 Current mirrors

A basic realisation of a multiple-output CMOS current mirror is shown in fig.5.11. This circuit is however not suitable for synthesising accurate functions, since each output current is slightly modulated by the output voltage through the Early conductance. The output current should be independent of the output voltage, which is achieved by reducing the output conductance, as in the three common mirrors shown in fig.5.12. The drain voltage of the transistor which sets the current (drain voltage of T2 in fig.5.12) is then independent of the output voltage of the circuit. Multiple-output cascode mirrors are often used, but Wilson ones are preferable for low-power applications because they require a single bias voltage (VG(T1) in fig.5.12.a) instead of the two stacked voltages VG(T1) & VG(T3) in fig.5.12.b. The Mod-Wilson mirror is obtained by adding a transistor to the Wilson mirror (T4 in fig.5.12.c) to improve its symmetry.
This mirror provides good accuracy, and the input current is well reproduced with perfectly matched identical transistors. The precision of all these mirrors depends on their output resistance (which must be as high as possible) and on the matching of their transistors. The quality of current reproducibility is very important in cascade configurations to limit error accumulation. Dynamic mirrors can be used to obtain greater accuracy, but at the expense of a clocking scheme which considerably increases their size and complexity.

Figure-5.11: Basic n-output CMOS current mirror and symbolic representation

Figure-5.12: NMOS current mirrors

All the above-mentioned mirrors can be built in standard bipolar technology as well as in MOS technology. Multiple-output current mirrors can be realized by compact bipolar circuits using multicollector transistors, whose accuracy is however poorer than that of structures with single-collector transistors. Bipolar transistors produce two types of significant error, due to the base current and to the reverse-mode current. The latter causes the saturation of one collector to affect the other collectors, so the circuits should be designed so that no collector in the multiple-output current mirrors can saturate. These errors do not appear in multiple-output MOS current mirrors, since their input-output paths are separated and their drain currents are independent of each other. The design of cascade MOS structures is therefore much easier than that of bipolar structures, whose stages are interdependent. It is generally preferred, all the more so since it is compatible with standard CMOS fabrication processes and efficient design tools. The mismatch between two identical transistors is however smaller for bipolar than for MOS transistors, since it does not depend on the collector current level and since bipolar transistors are much less affected by surface effects.
All MOS transistors must operate in saturation mode in order to reduce mismatch effects. Bipolar current mirrors are thus better suited to working at low voltages and are more precise and faster at low currents (the speed of MOS mirrors decreases at low currents because of their intrinsic capacitances).

5.4.3 Fuzzy operator synthesis

Current mirrors can be used as building blocks to synthesise fuzzy logic operations and related processing. In this way, nine basic fuzzy functions (cf. §3.2) can be easily implemented on monolithic ICs with standard CMOS [Zhij90],[Yam86] or bipolar technologies [Yam87]. These current-mode basic logic cells exhibit good linearity, which cannot easily be achieved in voltage-mode, and lead to fuzzy integrated systems which are globally smaller than in voltage-mode.

Figure-5.13: CMOS and bipolar implementation of the bounded difference operation [Yam86],[Yam87]

A bounded difference circuit can be obtained by combining a current mirror and a diode (fig.5.13). The diode can easily be realized in the CMOS circuit either by a single FET whose gate and drain are connected together (fig.5.13) or by a current mirror (fig.5.14). The first solution involves unavoidable voltage drops due to the channel resistance, which can affect the normal logic function of the circuits. Nevertheless, the diode can be omitted in a cascade connection of such circuits, because the input current mirror of the following stage also performs its task.

Figure-5.14: Two different implementations of the bounded difference operation [Zhij90]

Current mirrors are subdivided into two complementary types depending on whether their transistors are n- or p-channel MOSFETs (NPN or PNP in bipolar technology).
The directions of the input and output currents depend on the type of the respective components (input mirror and output diode in the bounded difference circuit). Thus there are four different configurations of input and output current directions (two of which are shown in fig.5.14 and are suitable for cascade connections). To each configuration corresponds a complementary one, which is obtained by substituting p-channel current mirrors for n-channel ones and vice versa. This is convenient for designing circuits that use such fuzzy logic units as basic bricks, without worrying about specific current directions between neighbouring bricks. The circuits of fig.5.13 and 5.14 realize the bounded difference operation on two membership function values µx & µy represented by the two currents I1 and I2 respectively. They also realize the complement operation on µy (represented by I2) when I1 has the maximum value (which represents a membership grade equal to 1). The bounded product is realized when I1 represents the sum µx + µy and I2 represents 1, since the bounded product is the sum µx + µy reduced by 1 and bounded below by 0. As the bounded difference and the algebraic sum are sufficient to realize all other fuzzy functions (according to the relations described in §3.2), fuzzy circuits can be designed simply by specifying connections between bounded difference subcircuits. Multiple-output current mirrors are also required to distribute current to several logic units (fig.5.15). Such basic cells are attractive prospects for developing CAD tools and semicustom ICs (that is, arrays of logic cells adaptable to various specifications according to the wiring configuration). This however leads to solutions which are generally not optimal as regards the number of transistors. Multiple-input MAX and MIN circuits are proposed in [Sas90] which aim at avoiding error accumulation and increasing speed relative to binary tree realizations based on two-input MIN and MAX subcircuits.
The operation of these circuits can be formulated using simultaneous bounded difference equations. A simpler multiple-input MAX circuit is described in [Batu94].

Figure-5.15: CMOS multiple fan-out circuit [Yam86]

Three "primitive" operators have been introduced to obtain more elementary basic cells than the bounded difference [Lem93],[Lem93bis],[Lem94]. These operators are defined in the following way :

All fuzzy functions can be formulated as an additive combination of these primitive operators since the bounded difference can be expressed in the following way:

These operators are used to reduce the complexity of the electrical realization of fuzzy functions, since they lead to a simple relationship between transistor-level circuits and the symbolic representation of fuzzy formulae. They can actually be realised very simply using current mirrors (fig.5.16) and exhibit special properties that help to obtain reduced forms of compositional expressions. Some of these properties are as follows:

Several relations are rather complicated when expressed in terms of bounded difference equations and should be reduced with the help of the above properties. This will at the same time decrease the number of cascaded current mirror stages needed to implement them. It is also important to favour parallel operations in rule evaluations, so that error accumulation is reduced and processing speed increased.
Figure-5.16: CMOS implementation of primitive operators [Lem94]

As an example of such a process, the MIN operation can be implemented with two cascaded current mirrors (fig.5.17), after a certain number of simplifications have been carried out in its mathematical formulation:

Figure-5.17: Synthesis of the MIN function based on primitive operators [Lem94]

There are various solutions for realizing non-linear analog membership function generators in current mode, but they are directly influenced by their physical features and often exhibit poor temperature behaviour. Consequently it is quite attractive to design circuitry achieving piecewise-linear membership functions [Kett93],[Yam88-3]. The representation of complicated membership functions can also be subdivided into an appropriate composition of elementary logic subcircuits. The fuzzification of a crisp variable x through membership functions with triangular shapes can be expressed, for example, in terms of the primitive operators (among them P and N) as follows [Lem94]:

The input value i determines the horizontal position of the triangular functions and in this way designates one of the linguistic labels (negative high, negative low, zero,...).

5.4.4 Mixed-mode fuzzifier

As current-mode circuits are restricted to a single fan-out, multiple current mirrors are required to share signals among several operational blocks. Voltage-mode inputs are thus preferable for fuzzy hardware systems, since they must be distributed to the membership function circuits of many rule blocks. Current-mode signals are preferred afterwards, because of the advantages provided by current-mode processing.
Tunable voltage-input current-mode membership function circuits are consequently useful building blocks for performing fuzzification with current-mode analog hardware [Chen92],[Sas92],[Ishi92]. They can also be used with the OTA-based approach described above. Such circuits can be realized with an OTA including variable source resistors, which consist of integrated voltage-controlled resistors and which can change the OTA transconductance characteristic [Chen92]. This yields triangular membership functions with variable heights, widths, slopes and horizontal positions. Such membership function circuits are suitable for realizing high-speed and small-size systems. Nevertheless, their sizing becomes difficult as their complexity increases, and their characteristics are affected by variations in physical parameters (mismatches, temperature influence,...).

5.4.5 Defuzzification strategy and normalization circuit

When performed with the centre-of-gravity method, defuzzification consists of a weighted average requiring a sum, a multiplication (weighting) and a division. The sum operation consists of wire connections, while the multiplication can be realized by scaling currents by means of asymmetrical current mirrors. The analog division is still the most complex operation and requires a rather large chip area and processing time. The defuzzification operation is consequently often simplified in hardware implementations. However, the required division can be performed easily and rapidly by current-mode normalization circuits [Ishi92]. For the circuit of figure 5.18 there are the following relations:

Figure-5.18: 3-input normalization circuit [Ishi92]

The normalization circuit can directly be extended to more than three inputs.
The sum of all output currents is normalized to I0, so each output current (representing a membership grade) is divided by this sum (the sum of all membership grades). There is then no need to divide the weighted sum of these currents. The weight of each current can be realized in the normalization circuit by the W/L ratio of the respective output transistor. Good precision is obtained when such a circuit is implemented by means of bipolar transistors or bipolar-operated MOS devices. When using saturated MOS transistors operating in weak inversion with VS=0, the circuit is very inaccurate and temperature-dependent, due to the large variation of VT0 from device to device.

5.5 Singleton consequences and normalization loop

A main disadvantage of analog circuits concerns the defuzzification stage (and especially the analog divider) in terms of size, accuracy and processing speed. The quantity and complexity of calculations are however greatly reduced by the use of singleton consequences. These are more appropriate for hardware implementations than complicated output sets, which must consist of a limited number of discrete analog values. It can moreover be observed that singleton consequences facilitate a linear interpolation between the inference results of different control rules [Yam89],[Yam93]. Yamakawa has sought to eliminate the analog division in a voltage-mode singleton-consequent controller by using grade-controllable membership function circuits (obtained by modifying ordinary membership function circuits [Yam89],[Yam93]). These can be tuned in such a way that the sum of all membership grades is equal to unity (or to a constant), so that the weighted sum no longer needs any division. The regulation consists of shifting each membership function characteristic down and up according to the variations of the sum of membership grades, through a negative feedback loop.
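The feedback idea can be illustrated with a small discrete-time model in Python. This is only a sketch under simplifying assumptions: the membership grades are taken to scale linearly with a single control variable (called `scale` here, standing in for the tuning quantity), and the loop is a simple integrator with a hypothetical gain. The real circuits shift or modulate the membership characteristics in ways that are not modelled.

```python
from typing import List, Sequence


def normalization_loop(raw_grades: Sequence[float], target: float = 1.0,
                       loop_gain: float = 0.5, steps: int = 50) -> List[float]:
    """Drive the sum of the grades towards `target` by negative feedback.

    At each step the error between the target and the current sum is
    integrated into `scale`. The iteration is stable only when
    0 < loop_gain * sum(raw_grades) < 2 (a property of this toy model).
    """
    scale = 1.0
    raw_total = sum(raw_grades)
    for _ in range(steps):
        error = target - scale * raw_total
        scale += loop_gain * error
    return [scale * g for g in raw_grades]


if __name__ == "__main__":
    grades = normalization_loop([0.2, 0.5, 0.1])
    print(grades, sum(grades))
```

Once the loop has settled, the sum of the grades equals the target, so a weighted sum of singleton consequences gives the defuzzified output without a divider.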
The current-mode rule block with voltage interface (and OTA-based membership function circuits) described in [Sas92] also includes a normalization locked loop, which removes the division operation of the weighted average in a similar way (fig.5.19). The sum of membership grades ( ) is regulated by using its fluctuations as the modulation factor (Vm) of the membership function circuits through a negative feedback loop. This regulation is attractive since it is faster than analog division in fuzzy hardware implementations. Nevertheless, the implementation of such circuits requires complicated units and connections, and leads to difficult sizing and testing. The normalization solution described in §5.4.5 is ultimately far simpler for current- and mixed-mode circuits.

Figure-5.19: Principle of a normalization locked loop [Sas92]

5.6 Fuzzy analog memories

The main weak point of analog circuits is the lack of reliable analog memory modules. Since there is no accurate and lasting way to store analog fuzzy values, analog fuzzy computers perform poorly with regard to programmability and multistage sequential processing. Temporary memory elements have however been proposed to keep signals stable within a sampling period. This makes it possible to design fuzzy inference engines with pipelined structures and consequently enhanced speed [Zhij90bis]. Such basic memory cells are however not suitable for implementing complicated sequential algebra in the way that digital circuits do, thanks to binary flip-flop and register circuits. An analog value can however be stored in a more lasting way when it is represented by a voltage, by means of a capacitor. This component is the core of sample-and-hold circuits, which are the basic cells of analog memory elements.
A voltage-mode fuzzy flip-flop has been proposed as an extension of the binary J-K flip-flop [Gup88]. It is based on fuzzy negation, T-norm and S-norm, which are respectively restricted to complementation, MIN and MAX operations in order to ease its hardware implementation. The characteristics of fuzzy flip-flops based on other operations, such as algebraic product and sum, bounded product and sum, and drastic product and sum, are also reported in [Hir89]. Its structure is described by the following set and reset equations, which generate the same state values as a digital J-K flip-flop in the case of Boolean input and state values.

The fuzzy flip-flop can be implemented with a combinatorial part synthesising the above equations, and two sample-and-hold circuits driven by two control pulse circuits with opposite phases (fig.5.20). The present output Q(t) is memorized by the two sample-and-hold circuits, since this information is needed to compute the next state.

Figure-5.20: Fuzzy J-K flip-flop: circuit block diagram [Gup88]

Current-mode continuous-valued memory elements can be realized in the same way by using voltage-controlled current sources in both sample-and-hold circuits to store a control voltage representing the sampled current [Balt93]. The capacitor of a sample-and-hold circuit is thus charged according to the control voltage of a first current source. In the "hold mode" (after half a clock cycle), this stored voltage sets the output current through a second current source as an image of the input one.

Storage capacitors are designed according to a trade-off between high speed and small silicon area on the one hand, and sufficient accuracy on the other hand (their capacitance should be high relative to parasitic ones).
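The set and reset equations of the fuzzy J-K flip-flop discussed above are not reproduced in the text. A form commonly given in the fuzzy flip-flop literature, using complementation, MIN and MAX, is sketched below in Python; it should be read as an assumption about the intended equations rather than a transcription of [Gup88] or [Hir89]. Only the combinatorial part is modelled, not the sample-and-hold timing.

```python
from itertools import product


def fuzzy_not(a: float) -> float:
    """Fuzzy negation restricted to complementation."""
    return 1.0 - a


def jk_reset_type(j: float, k: float, q: float) -> float:
    """Assumed reset-type equation: (J AND NOT Q) OR (NOT K AND Q)."""
    return max(min(j, fuzzy_not(q)), min(fuzzy_not(k), q))


def jk_set_type(j: float, k: float, q: float) -> float:
    """Assumed set-type equation: (J OR Q) AND (NOT K OR NOT Q)."""
    return min(max(j, q), max(fuzzy_not(k), fuzzy_not(q)))


def binary_jk(j: int, k: int, q: int) -> int:
    """Next state of an ordinary binary J-K flip-flop."""
    return int((j and not q) or (not k and q))


if __name__ == "__main__":
    # Both fuzzy forms reproduce the binary truth table for Boolean inputs.
    for j, k, q in product((0, 1), repeat=3):
        print(j, k, q, jk_reset_type(j, k, q), jk_set_type(j, k, q), binary_jk(j, k, q))
    # For intermediate grades the two forms can differ.
    print(jk_reset_type(0.6, 0.7, 0.5), jk_set_type(0.6, 0.7, 0.5))
```

Running the loop shows the property stated in the text: for Boolean J, K and Q both equations give the same next state as the digital flip-flop.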
High accuracy is however difficult to achieve, since it is affected by the imprecision of integrated capacitors, the mismatch between paired current sources, and above all the charge injection into parasitic capacitances. This last source of sampling error is however reduced by the master-slave configuration of the sample-and-hold circuits, since the injected charges are opposite in sign (the first-order sampling errors are thus cancelled).

6 Mixed Digital/Analog Implementations of Fuzzy Systems

Fuzzy logic systems lend themselves well to analog integration, except for some control and reconfiguration structures. Several switches are thus often integrated in analog circuits and controlled by digital inputs. It can be attractive to increase the complementarity between digital and analog features and to merge them into a single mixed chip, in order to compensate for the weak points of both [Fatt93],[Zhij90bis]. A fuzzy knowledge base can be programmed in a digital memory, which consists of dedicated locations and stores a variable number of parameters characterizing, in particular, membership function shapes and inference rules. Highly parallel, non-sequential analog processing is then possible provided that D/A converters are used. A/D converters can also be used when a digital computation of the centre of gravity is desired. The VLSI Design group of SGS-THOMSON has also undertaken the design of a hybrid controller implemented in a mixed analog/digital technology [Pag93bis],[Pag93-3]. It consists of a digital storage and distribution unit followed by an analog inference core. The membership grades are converted into analog values by an internal D/A converter. In comparison with digital controllers, this system does not need expensive A/D and D/A converters. Its high speed should be suitable for very demanding real-time requirements with a limited number of rules.
7 CAD Automation for Fuzzy Logic Circuits Design

7.1 CAD strategy for digital circuits

Specific cells are defined to implement fuzzy operations, which amounts to a standard-cell approach using the usual CAD environments. This is the approach taken by THOMSON [Pag93].

7.2 CAD strategy for analog circuits

Design automation of analog fuzzy blocks provides a standard-cell approach that makes it possible to build a fast and reliable design strategy (very similar to a digital one). The use of p- and n-channel static current mirrors as building blocks is well suited to creating a design automation framework for generating the layout of fuzzy units. Such a development environment for current-mode CMOS fuzzy logic circuits has been created from a standard graphical tool and a specific silicon compiler [Lem93],[Lem94]. The graphical tool provides a logical simulation of fuzzy algorithms and helps in the design of fuzzy system architectures. The silicon compiler generates the layout corresponding to a given fuzzy algorithm from its mathematical expression. The system is based on a three-level hierarchy, in which current mirror circuits serve as generic cells for building elementary logical blocks (MIN, MAX, ,...), which can finally be assembled into sophisticated fuzzy units. The use of the three primitive operators , P and N (cf.§5.4.3) leads to analytical expressions that are suitable for describing a fuzzy unit at all levels, from the current mirrors to high-level fuzzy algorithms. The aim of this methodology is to eliminate the redundant fuzzy function elements that exist in the trivial implementation of the fuzzy functions. The silicon compiler should also replace each term of the mathematical equations by its electrical counterpart. It then tries to reduce and place each electrical block in such an optimized way that wire lengths and IC area are minimised.
The interconnections of these wires determine the vertical current flow between p- and n-mirrors from one stage to another. Certain configurations that cause functional problems should be forbidden, and trivial connections of p- and n-mirrors involving static consumption should be avoided (the case where serial p- and n-mirrors of a branch drive a static current between the high and low levels). The automatic cell generator then produces the final layout as the physical representation of the fuzzy algorithms. As an application of this environment, the prototype fuzzy controller FLC#001 has been designed in a standard 1.2 µm CMOS technology [Lem94]. This low-power, small-size circuit achieves 10 MegaFLIPS and is quite efficient for real-time control applications. As the design resulting from such a strategy is close to the fuzzy algorithms, it makes the testing of the resulting circuits easy. Internal currents can however not be measured without adding supplementary output current mirrors, which increase the circuit size. Gate voltages can be measured, but they give only an imprecise estimate of the transistor currents.

Figure-7.1: Structure of a silicon compiler [Lem94]

8 Neural Networks Implementing Fuzzy Systems

Artificial neural networks and fuzzy logic systems exhibit some similar or complementary features that suggest merging these two techniques. Both have a certain ability to work in a natural analog environment and to deal with the uncertainty of real-world problems. Neural networks seem appropriate for implementing fuzzy logic since in principle they should provide massively parallel and highly parametric computation, achieve highly non-linear operations, do not require any model and are concerned with approximating input-output relationships.
They have moreover the ability to perform interpolative reasoning similar to fuzzy inference. This means that once a neural network has been trained on a representative set of data, it can interpolate and produce answers for cases not present in the training data set (the training is a supervised learning process, which consists of supplying inputs and demanding the specific output in order to adjust the weights of the connections between neurons in different layers). The neural output characteristics (threshold functions with sigmoid characteristics) are able to synthesise complicated membership functions. Finally, the sum-of-products operation that neural networks realize is close to the MIN-MAX and weighting operations that occur in approximate reasoning. One should however bear in mind that the implementation of neural networks is still complicated and actually not much easier than the direct implementation of fuzzy logic engines. Thus hardware realizations of neural networks do not perform so well as regards parallelism, computation speed, functional density, and so on. The fusion of neural networks and fuzzy logic concerns not only their similarities, but also the mutual compensation between their different features. Thus neural networks can offer a solution in fuzzy systems to the problems of structure identification (number and size of rules making up a suitable fuzzy partition of the feature space) and parameter identification (number and characteristics of membership functions). They are able in some cases to replace the tedious and sometimes hazardous identification scheme by supervised learning. Knowledge acquisition by self-organizing networks can be used to realize adaptive fuzzy systems, which are quite attractive when linguistic rules cannot be easily derived or when a great many rules are required.
However, there is no guarantee of convergence in the learning scheme. The necessary number of neurons is moreover unpredictable, and the acquisition of new knowledge is fairly difficult without beginning a new learning scheme. There are several ways to combine neural networks and fuzzy logic, which differ from author to author. The first idea consists of using the high flexibility that neural networks acquire through learning to provide automatic design and fine tuning of the membership functions used in fuzzy control. Such an adaptive neuro-fuzzy controller adjusts its performance automatically by accumulating experience in a learning phase. National Semiconductor Corp. has developed the Neufuz embedded system, which provides a front-end neural network suitable for fuzzy logic technology. A first layer performs fuzzification, a second creates the rule base and a third, a single neuron, does rule evaluation and defuzzification. Neural networks involve a great deal of computation and hardware investment, which is prohibitive for many real-time applications. In this case they therefore emulate and optimise a fuzzy logic system rather than directly implementing the final application. The National Semiconductor solution accordingly includes the capability to generate assembly code for a strictly fuzzy logic solution. Recently Neuralogix has put on the market the NLX230 fuzzy microcontroller, which consists of a VLSI digital fuzzy logic engine based on the min-max method. It includes a neural network implementing a high-speed minimum comparator block connected to 16 parallel fuzzifiers on one side and a maximum comparator on the other side. Another approach aims at solving the problems of consequence identification and of defuzzification by using the Takagi-Sugeno method, which lends itself well to an implementation framework based on adaptive neural networks [Jan92].
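To make the Takagi-Sugeno idea more concrete, here is a minimal Python sketch of first-order Takagi-Sugeno inference for a single input. Each rule has a triangular premise and a linear consequent y = a*x + b, and the output is the average of the rule outputs weighted by their firing strengths. The rule format and the function names are hypothetical, and the learning step that an adaptive network such as the one in [Jan92] would use to tune the parameters is not shown.

```python
from typing import Sequence, Tuple

# Hypothetical rule format: (center, half_width, a, b), consequent y = a*x + b.
Rule = Tuple[float, float, float, float]


def triangle(x: float, center: float, half_width: float) -> float:
    """Triangular premise membership function with unit height."""
    return max(0.0, 1.0 - abs(x - center) / half_width)


def takagi_sugeno(x: float, rules: Sequence[Rule]) -> float:
    """Weighted average of the linear consequents of all firing rules."""
    numerator = 0.0
    denominator = 0.0
    for center, half_width, a, b in rules:
        strength = triangle(x, center, half_width)
        numerator += strength * (a * x + b)
        denominator += strength
    if denominator == 0.0:
        raise ValueError("no rule fires for this input")
    return numerator / denominator


if __name__ == "__main__":
    rules = [(0.0, 1.0, 0.0, 0.0),   # "x is about 0" -> y = 0
             (1.0, 1.0, 0.0, 2.0)]   # "x is about 1" -> y = 2
    print([takagi_sugeno(x, rules) for x in (0.0, 0.5, 1.0)])
```

Because the consequents are crisp functions of the input, the weighted average doubles as the defuzzification step, which is why no separate centre-of-gravity computation over output sets is needed.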
Instead of using neurons to implement some part of a fuzzy system, the structure of approximate reasoning can be applied to neural networks. The aim of this approach is to improve neural network frameworks by bringing in some advantages of fuzzy logic. The latter allows an explicit representation of knowledge and has a logic structure that makes high-order processing easier to handle. In pattern recognition, for example, the learning scheme is efficient at acquiring knowledge about reference objects. As for the structure of approximate reasoning rules, it can give information on the knowledge distribution inside the whole network. It is then easier to find the internal part that causes poor performance by analysing the error according to the rule structure [Tak90bis].

References

Fuzzy Sets and Systems, Theory and Applications [Dub80] D. Dubois & H. Prade, Fuzzy Sets and Systems, Theory and Applications, Academic Press, New York, 1980 [God94] J. Godjevac, Introduction à la logique floue, Cours de perfectionnement, Institut suisse de pédagogie pour la formation professionelle, EPFL, Lausanne, 1994 [Ngu93] H.T. Nguyen, V. Kreinovich & D. Tolbert, On Robustness of Fuzzy Logics, IEEE Int. Conf. on Fuzzy Systems, Vol.1, pp.543-547, San Francisco, CA, USA, 1993 [Ter92] T. Terano, K. Asai & M. Sugeno, Fuzzy Systems Theory and its Applications, 1st Ed., Academic Press, San Diego, 1992 Fuzzy Control [Büh94] H. Bühler, Réglage par logique floue, Presses Polytechniques Romandes, Lausanne, 1994 [Dri93] D. Driankov, H. Hellendoorn & M. Reinfrank, An Introduction to Fuzzy Control, Springer-Verlag, Berlin Heidelberg, 1993 [Gee92] H.P. Geering, Introduction to Fuzzy Control, Institut für Mess- und Regeltechnik, ETH, IMRT-Bericht Nr.24, Zürich, 1992 [Mam74] E.H.
Mamdani & S.Assilian, A Case Study on the Application of Fuzzy Set Theory to Automatic Control, pp.643-649, Budapest, Sept. 1974 [Mam74bis] E.H. Mamdani, Application of Fuzzy Algorithms for Control of Simple Dynamic Plant, Proc. IEE, Vol.121, No.12, pp.1585-1588, Dec. 1974 [Mam75] E.H. Mamdani & S.Assilian, An Experiment in Linguistic Synthesis with a Fuzzy Logic Controller, Int. J. Man-Machine Studies, Vol.7, pp.1-13, 1975 [Mam76] E.H. Mamdani, Advances in the Linguistic synthesis of Fuzzy Controllers, Int. J. Man-Machine Studies, Vol.11, pp.669-678, 1976 [Mam77] E.H. Mamdani, Application of Fuzzy Logic to Approximate Reasoning Using Linguistic Synthesis, IEEE Trans. on Computers, Vol.26, pp.1182-1191, Dec. 1977 [Mam81] E.H. Mamdani & B.R. Gaines, Fuzzy Reasoning and Its Applications, Academic Press, Cambridge, Mass., 1981 [Mam84] E.H. Mamdani, Development in Fuzzy Logic Control, Proc. of 23rd Conference on Decision and Control, pp.888-893, 1984 [Sug85] M. Sugeno, Industrial Applications of Fuzzy Control, North-Holland Publ., 1985 [Tak83] T. Takagi & M. Sugeno, Derivation of Fuzzy Control Rules from Human Operator's Control Actions, Proc. of the IFAC Symp. on Fuzzy Information, Knowledge Representation and Decision analysis, pp.55-60, July 1983 [Tak85] T. Takagi & M. Sugeno, Fuzzy Identification of Systems and Its Application to Modeling and Control, IEEE Trans. Syst. Man & Cybern., Vol.20, No.2, pp.116-132, 1985 Hardware Implementation of Fuzzy Systems [Chan93] C.-H. Chang & J.-Y. Cheung, The Dimensions Effect of FAM Rule Table in Fuzzy PID Logic Control Systems, Second IEEE Int. Conf. on Fuzzy Systems, Vol.1, pp.441-446, San Francisco, CA, USA, 1993 [Gup88] Fuzzy Computing: Theory, Hardware, and Applications, M.M. Gupta & T. Yamakawa Eds., Elsevier Science Publishers B.V. (North-Holland), 1988 [Hir93] Industrial Applications of Fuzzy Technology, K. Hirota Ed., Tokyo, 1993 Digital Implementations [Eich92] H. Eichfeld, M. Löhner & M. 
Müller, Architecture of a CMOS Fuzzy Logic Controller with Optimized Memory Organisation and Operator Design, Second IEEE Int. Conf. on Fuzzy Systems, pp.1317-1323, San Diego, CA, USA, March 1992 [Eich93] H. Eichfeld, T. Künemund & M. Klimke, An 8b Fuzzy Coprocessor for Fuzzy Control, Proc. of the Int. Solid State Circuits Conf.'93 Conference, San Francisco, CA, USA, Feb. 1993 [Nak93] K. Nakamura, N. Sakeshita, Y. Nitta, K. Shimanura, T. Ohono, K. Egushi & T. Tokuda, Twelve Bit Resolution 200 kFLIPS Fuzzy Inference Processor, Proc. of the Int. Solid State Circuits Conf'93 Conference, San Francisco, CA, USA, Feb. 1993 [Pag93] A. Pagni, R. Poluzzi & G. Rizzotto, Automatic Synthesis, Analysis and Implementation of a Fuzzy Controller, IEEE Int. Conf. on Fuzzy Systems, Vol.1, pp.105-110, San Francisco, CA, USA, 1993 [Pag93bis] A. Pagni, R. Poluzzi & G. Rizzotto, Integrated Development Environment for Fuzzy Logic Applications, Applications of Fuzzy Logic Technology, B. Bosacchi & J.C. Bezdek Ed., pp.66-77, Boston, Massachusetts, USA, 1993 [Pag93-3] A. Pagni, R. Poluzzi & G. Rizzotto, Fuzzy Logic Program at SGS-THOMSON, Applications of Fuzzy Logic Technology, B. Bosacchi & J.C. Bezdek Ed., pp.80-90, Boston, Massachusetts, USA, 1993 [Sas93] M. Sasaki, F. Ueno & T. Inoue, 7.5MFLIPS Fuzzy Microprocessor Using SIMD and Logic-in-Memory Structure, IEEE Int. Conf. on Fuzzy Systems, Vol.1, pp.527-534, San Francisco, CA, USA, 8-10 Sept. 1993 [Tog86] M. Togai & H. Watanabe, A VLSI Implementation of Fuzzy Inference Engine: toward an Expert System on a Chip, Information Sciences, Vol.38, pp.147-163, 1986 [Tog87] M. Togai & S. Chiu, A Fuzzy Chip and a Fuzzy Inference Accelerator for Real- Time Approximate Reasoning, Proc. 17th IEEE Int. Symp. Multiple-Valued Logic, pp.25-29, May 1987 [Wat90] H. Watanabe, W.D. Dettloff & K.E. Yount, A VLSI Fuzzy Logic Controller with Reconfigurable Cascadable Architecture, IEEE JSSC, Vol.25, No.2, pp.376-382, April 1990 [Wat92] H. 
Watanabe, RISC Approach to Design of Fuzzy Processor Architecture, Second IEEE Int. Conf. on Fuzzy Systems, pp.431-440, San Diego, CA, USA, March 1992 [Wat93] H. Watanabe & D. Chen, Evaluation of Fuzzy Instructions in a RISC Processor, IEEE Int. Conf. on Fuzzy Systems, Vol.1, pp.521-526, San Francisco, CA, USA, 1993 Analog Implementation: Voltage-Mode [Hir89] K. Hirota & K. Ozawa, The Concept of Fuzzy Flip-Flop, IEEE Trans., SMC- 19, (5), pp.980-997, 1989 [Tou94] C. Toumazou & T. Lande, Building Blocks for Fuzzy Processors, Voltage- and Current-Mode Min-Max circuits in CMOS can operate from 3.3V Supply, IEEE Circuits & Devices, Vol.10, No.4, pp.48-50, July 1994 [Yam88] T. Yamakawa, Fuzzy Microprocessors - Rule Chip and Defuzzifier Chip -&- How It Works -, Proc. Int. Workshop on Fuzzy Syst. Appl., pp.51-52 & 79-87, Iizuka, Japan, Aug. 1988 [Yam88bis] T. Yamakawa, High-Speed Fuzzy Controller Hardware System: The Mega Fips Machine, Information Sciences, Vol.45, pp.113-128, 1988 [Yam89] T. Yamakawa, An Application of a Grade-Controllable Membership Function Circuit to a Singleton-Consequent Fuzzy Logic Controller, Proc. of the Third IFSA Congress, pp.296-302, Seattle, WA, USA, Aug. 6-11, 1989 [Yam93] T. Yamakawa, A Fuzzy Inference Engine in Nonlinear Analog Mode and Its Application to a Fuzzy Logic Control, IEEE Trans. on Neural Networks, Vol.4, No.3, May 1993 Analog Implementation: Mixed-Mode [Chen92] J.-J. Chen, C.-C. Chen & H.-W. Tsao, Tunable Membership Function Circuit for Fuzzy Control Systems using CMOS Technology, Electronics Letters, Vol.28, No.22, pp.2101-2103, Oct. 1992 [Inou91] T. Inoue, F. Ueno, T. Motomura, R. Matsuo & O. Setoguchi, Analysis and Design of Analog CMOS Building Blocks For Integrated Fuzzy Inference Circuits, Proc. IEEE International Symp. on Circuits and Systems, Vol.4, pp.2024-2027, 1991 [Park86] C.-S.
Park & R. Schaumann, A High-Frequency CMOS Linear Transconductance Element, IEEE Trans. on Circuits and Systems, Vol.CAS- 33, No.11, Nov. 1986 Analog Implementation: Current-Mode [Batu94] I. Baturone, J.L. Huertas, A. Barriga & S. Sanchez-Solano, Current-Mode Multiple-Input MAX Circuit, Electronics Letters, Vol.30, No.9, pp.678-680, April 1994 [Balt93] F. Balteanu, I. Opris & G. Kovacs, Current-Mode Fuzzy Memory Element, Electronics Letters, Vol.29, No.2, pp.236-237, Jan. 1993 [Ishi92] O. Ishizuka, K. Tanno, Z. Tang & H. Matsumoto, Design of a Fuzzy Controller with Normalization Circuits, Second IEEE Int. Conf. on Fuzzy Systems, pp.1303-1308, San Diego, CA, USA, March 1992 [Kett93] T. Kettner, C. Heite & K. Schumacher, Analog CMOS Realisation of Fuzzy Logic Membership Functions, IEEE Journal of Solid-State Circuits, Vol.29, No.7, pp.857-861, July 1993 [Land93] O. Landolt, Efficient Analog CMOS Implementation of Fuzzy Rules by Direct Synthesis of Multidimensional Fuzzy Subspaces, IEEE Int. Conf. on Fuzzy Systems, Vol.1, pp.453-458, San Francisco, CA, USA, 1993 [Lem93] L. Lemaître, M.J. Patyra & D. Mlynek, Synthesis and Design Automation of Analog Fuzzy Logic VLSI Circuits, IEEE Symp. on Multiple-Valued Logic, pp.74-79, Sacramento, CA, USA, May 1993 [Lem93bis] L. Lemaître, M.J. Patyra & D. Mlynek, Fuzzy Logic Functions Synthesis: A CMOS Current Mirror Based Solution, IEEE ISCAS93, Part 3 (of 4), pp.2015- 2018, Chicago, IL, USA, 1993 [Lem94] L. Lemaître, Theoretical Aspects of the VLSI Implementation of Fuzzy Algorithms, rapport de thèse nº1226, Département d'Electricité, EPFL, Lausanne, 1994 [Sas90] M.Sasaki, T. Inoue, Y. Shirai & F. Ueno, Fuzzy Multiple-Input Maximum and Minimum Circuits in Current Mode and Their Analyses Using Bounded- Difference Equations, IEEE Trans. on Computers, Vol.39, No.6, pp.768-774, June 1990 [Sas92] M. Sasaki, N. Ishikawa, F. Ueno & T. 
Inoue, Current-Mode Analog Fuzzy Hardware with Voltage Input Interface and Normalization Locked Loop, Second IEEE Int. Conf. on Fuzzy Systems, pp.451-457, San Diego, CA, USA, March 1992 [Yam86] T. Yamakawa & T. Miki, The Current Mode Fuzzy Logic Integrated Circuits Fabricated by the Standard CMOS Process, Reprint of IEEE Trans. on Computers, Vol. C-35, No.2, pp.161-167, Feb. 1986 [Yam87] T. Yamakawa, Fuzzy Logic Circuits in Current Mode, Analysis of Fuzzy Information, James C. Bezdek Ed., Vol.1, pp.241-262, CRC Press, Boca Raton, Florida, 1987 [Yam88-3] T. Yamakawa, H. Kabuo, A Programmable Fuzzifier Integrated Circuit - Synthesis, Design, and Fabrication, Information Sciences, Vol.45, pp.75-112, 1988 [Zhij90] L. Zhijian & J. Hong, CMOS Fuzzy Logic Circuits in Current-Mode toward Large Scale Integration, Proc. Int. Conf. on Fuzzy Logic & Neural Networks, pp.155-158, Iizuka, Japan, July 20-24, 1990 [Zhij90bis] L. Zhijian & J. Hong, A CMOS Current-Mode, High Speed Fuzzy Logic Microprocessor for a Real-Time Expert System, IEEE Proc. of the 20th Int. Symp. on MVL, Charlotte, NC, USA, May 23-25, 1990 Mixed Analog-Digital Implementation [Fatt93] J. Fattarasu, S.S. Mahant-Shetti & J.B. Barton, A Fuzzy Inference Processor, 1993 Symposium on VLSI Circuits, Digest of Technical Papers, pp.33-34, Kyoto, Japan, May 19-21, 1993 Implementation on FPGA [Manz92] M.A. Manzoul & D. Jayabharathi, Fuzzy Controller on FPGA Chip, IEEE Int. Conf. on Fuzzy Systems, pp.1309-1316, San Diego, Ca, USA, Mar. 8-12, 1992 [Ung93] A.P. Ungering, K. Thuener & K. Goser, Architecture of a PDM VLSI Fuzzy Logic Controller with Pipelining and Optimized Chip Area, IEEE Int. Conf. on Fuzzy Systems, Vol.1, pp.447-452, San Francisco, Ca, USA, 1993 [Yam92] T. Yamakawa, A Fuzzy Programmable Logic Array (Fuzzy PLA), IEEE Int. Conf.
on Fuzzy Systems, pp.459-465, San Diego, Ca, USA, Mar. 8-12, 1992 Technological Aspects [Pat90] M.J. Patyra, Fuzzy Properties of IC Subcircuits, Int. Conf. on Fuzzy Logic and Neural Networks, Iizuka, Japan, 1990 [Oeh92] J. Oehm & K. Schumacher, Accuracy Optimization of Analog Fuzzy Circuitry in Network Analysis Environment, ESSCIRC, 1992 Neural Networks and Adaptive Fuzzy Systems [Ber90] H. Berenji, Neural Networks and Fuzzy Logic in Intelligent Control, IEEE, 1990 [God93] J. Godjevac, State of the Art in the Neuro Fuzzy Field, Technical Report No.93.25, Laboratoire de Microinformatique, DI, EPFL, Lausanne, 1993 [Jan92] J.-S. Roger Jang, ANFIS, Adaptive-Network-Based Fuzzy Inference Systems, IEEE Trans. on Systems, Man and Cybernetics, 1992 [Kos92] B. Kosko, Neural Networks and Fuzzy Systems, Prentice-Hall International, Inc., Englewood Cliffs N.J., 1992 [Tak90] H. Takagi, Fusion Technology of Fuzzy Theory and Neural Networks, Survey and Future Directions, Proc. of the Int. Conf. on Fuzzy Logic & Neural Networks, Vol.1, pp.13-26, Iizuka, Japan, July 1990 [Tak90bis] H. Takagi, T. Kouda & Y. Kojima, Neural Network Designed on Approximate Reasoning Architecture and Its Application to the Pattern Recognition, Proc. of the Int. Conf. on Fuzzy Logic & Neural Networks, Vol.2, pp.671-674, Iizuka, Japan, July 1990

This chapter edited by D. Mlynek

Chapter 10 VLSI FOR MULTIMEDIA APPLICATIONS

Case Study: Digital TV

I. Introduction
II. Digitization of TV functions
III. Points of concern for the Design Methodology
IV. Conclusion

Today there is a race to design interoperable video systems for basic digital computer functions, involving multimedia applications in areas such as media information, education, medicine and entertainment, to name but a few.
This chapter provides an overview of the current status in industry of digitized television, including the techniques used and their limitations, technological concerns and the design methodologies needed to achieve the goals for highly integrated systems. Digital TV functions can be optimized for encoding and decoding and implemented in silicon in a more dedicated way, using a kind of automated custom design approach allowing enough flexibility. Some practical examples are shown in the chapter "Multimedia Architectures".

I. Introduction

Significance of VLSI for Digital TV Systems

When, at the 1981 Berlin Radio and TV Exhibition, the ITT Intermetall company exhibited to the public for the first time a digital television VLSI concept [1], [2], opinions among experts were by no means unanimously favourable. Some were enthusiastic, while others doubted the technological and economic feasibility. Today, after 13 years, more than 30 million TV sets worldwide have already been equipped with this system. Today, the intensive use of VLSI chips needs no particular justification, the main reasons being increased reliability, mainly because of the long-term stability of the color reproduction brought about by digital systems, and medium- and long-term cost advantages in manufacturing, which are essential for ensuring international competitiveness.

Digital signal processing permits solutions that guarantee a high degree of compatibility with future developments, whether in terms of quality improvements or new features such as intermediate picture storage or adaptive comb filtering, for example. In addition to these benefits, a digital system offers a number of advantages with regard to the production of TV sets:

- Digital circuits are tolerance-free and are not subject to drift or aging phenomena.
These well-known properties of digital technology considerably simplify factory tuning of the sets and even permit fully automated, computer-controlled tuning.
- Digital components can be programmable. This means that the level of user convenience and the features offered by the set can be tailored to the manufacturer's individual requirements via the software.
- A digital system is inherently modular with a standard circuit architecture. All the chips in a given system are compatible with each other, so that TV models of various specifications, from the low-cost basic model to the multi-standard satellite receiver, can be built with a host of additional quality and performance features.
- Modular construction means that set assembly can be fully automated as well. Together with automatic tuning, the production process can be greatly simplified and accelerated.

Macro-function Processing

The modular design of digital TV systems is reflected in its subdivision into largely independent functional blocks, with the possibility of having special data-bus structures. It is useful to divide the structure into data-oriented flow and control-oriented flow, so that we have four main groups of components:

1.- The control unit and peripherals, based on well-known microprocessor structures, with a central communication bus for flexibility and ease of use. An arrangement around a central bus makes it possible to expand the system easily and continually, and thereby to add further quality-enhancing and special functions for picture, text and/or sound processing at no great expense. A non-volatile storage element, in which the factory settings are stored, is associated with this control processor.

2.- The video functions are mainly the video signal processing and some additional features such as, for example, deflection; a detailed description follows in the paper. However, the key point for VLSI implementation is a well-organized definition of the macro-blocks.
This serves to facilitate the interconnection of circuit components and minimizes power consumption, which can be considerable at the processor speeds needed.

3.- The digital concept facilitates the decoding of today's new digital sound broadcasting standards as well as the input of external signal sources such as Digital Audio Tape (DAT) and Compact Disk (CD). Programmability permits mono, stereo, and multilingual broadcasts; compatibility with other functions in the TV system is resolved through the common communication bus. This leads us to part two, which is dedicated to the description of this problem.

4.- With a digital system, it is possible to add special or quality-enhancing functions simply by incorporating a single additional macro-function or chip. Standards are therefore no longer so important, owing to the high level of adaptability of digital solutions. For example, adaptation to a 16:9 picture tube is easy.

Figure 1 shows the computation power needed for typical consumer goods applications. Notice from the figure that when the data changes at a frequency x, the digital treatment of that data must be an order of magnitude faster [3].

Fig.1: Computation Power for Consumer Goods

In this chapter we first discuss the digitization of TV functions by analyzing general concepts based on existing systems. The second section deals with silicon technologies and, in particular, with design methodology concerns. The intensive use of submicron technologies, associated with fast on-chip clock frequencies and huge numbers of transistors on the same substrate, affects traditional methods of designing chips. As this chapter only outlines a general approach to VLSI integration techniques for digital TV, interested readers will find further descriptions of VLSI design methodologies and realizations in [9], [13], [15], [24], [26], [27], [28].

II. Digitization of "TV functions"

The idea of digitizing TV functions is not new. At the time when some companies started to work on it, technology was not really adequate for the computing power needed, so that the most effective solutions were full custom designs. This forced a block-oriented architecture in which each digital function introduced was the one-to-one replacement of an existing analog function. Figure 2 gives a simplified representation of the general concept.

Fig.2: Block Diagram of first generation digital TV set

The natural separation of video and audio resulted in some incompatibilities and in duplication of primary functions. The emitting principle is not changed, and redundancy is a big handicap: for example, while a SECAM channel is running, the PAL functions are not in operation. New generations of digital TV systems should re-think the whole concept top-down before VLSI system partitioning. In today's state-of-the-art solution one can recognize all the basic functions of the analog TV set with, however, a modularity in the concept that makes additional features possible. Some special digital possibilities are exploited, e.g. storage and filtering techniques, to improve signal reproduction (adaptive filtering, 1... technology), to integrate special functions (picture-in-picture, zoom, still picture) or to receive digital broadcasting standards (MAC, NICAM). Figure 3 shows the ITT Semiconductors solution, which was the first on the market in 1983 [4].

Fig.3: The DIGIT2000 TV receiver block diagram

By its very nature, computer technology is digital, while consumer electronics are geared to the analog world. Only recently has a start been made on digitizing TV and radio broadcasts at the transmitter end (in the form of DAB, DSR, D2-MAC, NICAM etc).
The most difficult technical tasks involved in the integration of different media are interface matching and data compression [5].

After this second step in the integration of multimedia signals, an attempt was made towards standardization, namely the integration of 16 identical high-speed processors with communication and programmability concepts built into the architecture (see Figure 4, chip photograph courtesy of ITT Semiconductors).

Fig.4: Chip Photograph (courtesy of ITT Semiconductors)

Many solutions proposed today (mainly for MPEG 1) are derived from microprocessor architectures or DSPs, but there is a gap between today's circuits and the functions needed for a real, full HDTV system. The AT&T hybrid codec [29], for instance, introduces a new way to design multimedia chips by optimizing the cost of the equipment, considering both processing and memory requirements. Pirsch [6] gives a detailed description of today's digital principles and circuit integration. Other component manufacturers also provide different solutions for VLSI system integration [35][36][37][38][39][40]. In part IV of this paper a full HDTV system based on wavelet transforms is described. The concept is to provide generic architectures that can be applied to a wide variety of systems, taking into account that certain functions have to be optimized and that some other c... algorithms have to be ported to generic processors.

Basics of current video coding standards

Compression methods take advantage of both data redundancy and the non-linearity of human vision. They exploit correlation in space for still images and in both space and time for video signals. Compression in space is known as intra-frame compression, while compression in time is called inter-frame compression.
Generally, methods that achieve high compression ratios (10:1 to 50:1 for still images and 50:1 to 200:1 for video) use approximations which lead to a reconstructed image that is not identical to the original.

Methods that cause no loss of data do exist, but their compression ratios are lower (no better than 3:1). Such techniques are used only in sensitive applications such as medical imaging. For example, artifacts introduced by a lossy algorithm into an X-ray radiograph may cause an incorrect interpretation and alter the diagnosis of a medical condition. Conversely, for commercial, industrial and consumer applications, lossy algorithms are preferred because they save storage and communication bandwidth.

Lossy algorithms also generally exploit aspects of the human visual system. First, the eye is much more receptive to fine detail in the luminance (or brightness) signal than in the chrominance (or color) signals. Consequently, the luminance signal is usually sampled at a higher spatial resolution. Second, the encoded representation of the luminance signal is assigned more bits (a higher dynamic range) than the chrominance signals are. The eye is also less sensitive to energy at high spatial frequencies than at low spatial frequencies [7]. Indeed, if an image on a personal computer monitor were formed by a finely alternating spatial signal of black and white, the human viewer would see a uniform gray instead of the alternating checkerboard pattern. This deficiency is exploited by coding the high-frequency coefficients with fewer bits and the low-frequency coefficients with more bits.

All these techniques add up to powerful compression algorithms. In many subjective tests, reconstructed images that were encoded with a 20:1 compression ratio are hard to distinguish from the original. Video data, even with compression at ratios of 100:1, can be decompressed with close to analog videotape quality.
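The saving obtained by sampling the chrominance at a lower spatial resolution than the luminance can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name, the 720 x 576 frame size, the 8 bits per sample and the subsampling factor of 2 in each direction are assumptions chosen for the example, not values taken from this chapter.

```python
def raw_frame_bits(width, height, bits_per_sample=8, chroma_div_x=1, chroma_div_y=1):
    """Bits for one uncompressed frame: one luminance plane plus two chrominance planes.

    chroma_div_x and chroma_div_y are the factors by which each chrominance
    plane is subsampled horizontally and vertically (1 means full resolution).
    """
    luma_samples = width * height
    chroma_samples = 2 * (width // chroma_div_x) * (height // chroma_div_y)
    return (luma_samples + chroma_samples) * bits_per_sample


# Hypothetical frame: 720 x 576 samples, 8 bits per sample (assumed values).
full = raw_frame_bits(720, 576)
subsampled = raw_frame_bits(720, 576, chroma_div_x=2, chroma_div_y=2)

# Ratio of raw data before and after chroma subsampling.
print(full / subsampled)
```

With these assumed numbers, halving the chrominance resolution in both directions halves the raw data before any transform coding or quantization is applied.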
Lack of open standards could slow the growth of this technology and its applications. That is why several digital video standards have been proposed:

- JPEG (Joint Photographic Experts Group) for still picture coding.
- H.261, at p times 64 kbit/s, was proposed by the CCITT (Consultative Committee on International Telephony and Telegraphy) for teleconferencing.
- MPEG-1 (Moving Picture Experts Group), up to 1.5 Mbit/s, was proposed for full-motion compression on digital storage media.
- MPEG-2 was proposed for digital TV compression; the bandwidth depends on the chosen level and profile [33].

Another standard, MPEG-4, for very low bit rate coding (4 kbit/s up to 64 kbit/s), is currently being debated. For more detail concerning the different standards and their definition, please see the paper included in these Proceedings: "Digital Video Coding Standards and their Role in Video Communication", by R. Schäfer and T. Sikora.

III. Points of Concern for the Design Methodology

As stated above, the main idea is to think system-wise through the whole development process. To do that, we had to select a suitable architecture as a demonstrator for this coherent design methodology. It makes no sense to reproduce existing concepts or VLSI chips, so we focused our demonstrator on the subband coding principle, of which the DCT is only a particular case. Following this line, there is no interest in focusing on blocks only because of the motion problem to be solved; rather, the entire screen is considered in a first global approach. This gives us the ability to define macro-functions which are not restricted in their design limits. The only restrictions come from practical parameters such as block area or processing speed, which depend on the technology selected for developing the chips but not on the architecture or the functionality.
Before going into the detail of the system architecture, we would like to discuss in this section the main design-related and technology-dependent factors which influence the hardware design process and the use of some CAD tools. We propose a list of major concerns that one should consider when integrating digital TV functions. The purpose is to give a feeling for the decision process in the management of such a project. In a first step we discuss RLC effects, down scaling, power management, requirements for the process technology, and design effects such as parallelism and logic styles, and we conclude the section with some criteria for the proposed methodology.

R,L,C effects

In computer systems today, the clocks that drive ICs are increasingly fast; 100 MHz is a "standard" clock frequency, and several chip and system manufacturers are already working on microprocessors with GHz clocks. By the end of this decade, maybe earlier, Digital Equipment will provide a 1-2 GHz version of the Alpha AXP chip; Intel promises faster Pentiums; and Fujitsu will have a 1 GHz Sparc processor.

When working frequencies are so high, the wires that connect the circuit boards and modules, and even the wires inside integrated circuits, start behaving like transmission lines. New analysis tools become necessary to circumvent and to master the high-speed effects.

As long as the electrical connections are short and the clock rates low, the wires can be modeled as RC circuits. In such cases, the designer has to take care that rise and fall times are sufficiently short with respect to the internal clock period. This method is still used in fast clock rate designs, by deriving clock trees to achieve good signal synchronization on big chips or boards. However, when the wire lengths increase, their inductance starts to play a major role. This is what transmission lines deal with.
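Whether a wire can still be treated as a lumped RC load can be estimated by comparing its propagation delay with the signal rise time. The sketch below is a rough check only. The function names, the propagation velocity of 15 cm/ns (roughly half the speed of light, a plausible order of magnitude for a board trace) and the example lengths are assumptions for illustration; the default threshold of 20% follows the lower end of the 20-25% rule of thumb quoted later in this section.

```python
def propagation_delay_ns(length_cm, velocity_cm_per_ns=15.0):
    """One-way delay of a wire; the velocity is an assumed, material-dependent value."""
    return length_cm / velocity_cm_per_ns


def needs_transmission_line_model(length_cm, rise_time_ns, threshold=0.20,
                                  velocity_cm_per_ns=15.0):
    """True when the wire delay reaches the given fraction of the rise time."""
    delay = propagation_delay_ns(length_cm, velocity_cm_per_ns)
    return delay >= threshold * rise_time_ns


# Hypothetical wires driven by an edge with a 1 ns rise time.
for length in (1.0, 5.0, 10.0):
    print(length, needs_transmission_line_model(length, rise_time_ns=1.0))
```

Note that the clock frequency does not appear in this check: only the edge rate and the wire length matter, so a slowly clocked system with fast edges can still require transmission line analysis.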
Transmission line effects include reflections, overshoot, undershoot, and crosstalk. RLC effects have to be analyzed in a first step, but it might be necessary to use another circuit analysis to gain better insight into circuit behaviour. The interconnect will behave like a distributed transmission line and like coupled lines (electromagnetic characteristics also have to be taken into account).

Transmission line effects can also appear in systems with low clock rates. Take for example a 1 MHz system with a rise time of 1 ns. Capacitive loading will be dictated by the timings, but during the transition time reflections and ringing will occur, causing false triggering of other circuits. As a rule of thumb, high-speed design techniques should be applied when the propagation delay in the interconnect reaches 20-25% of the signal rise and fall time [30], [34].

Some possible problems are listed below:

1. Short transition time compared to the total clock period. The effect was described above.
2. Inaccurate semiconductor models. It is important to take into account the physical, electrical and electromagnetic characteristics of the connections and the transistors. Important electrical parameters are metal-line resistivity and skin depth, dielectric constant and dielectric loss factor.
3. Inappropriate geometry of the interconnect. Width, spacing, and thickness of the lines and thickness of the dielectric are of real importance.
4. Lack of a good ground, which is often a key problem. Inductance often exists between virtual and real ground, due for instance to interconnect and lead inductance.

A solution to these problems could be a higher degree of integration, which reduces the number of long wires. MCM (Multi Chip Module) technology is an example of this alternative. MCM simplifies the compo..., improves the yield and shrinks the footprint. Nevertheless, the word alternative is not entirely correct, since MCM eliminates one type of problem by replacing it with another.
The narrowness of the lines introduced with MCMs tends to engender significant losses due to resistance and the skin effect. Furthermore, because of the via structure, the effect of crossover and via parasitics is stronger than in traditional board design. Finally, ground plane effects need special study, since severe ground bounce and an effective shift in the threshold voltage can result from the combination of high current densities with the thinner metallizations used for power distribution.

How then does one get a reliable high-speed design? The best way is to analyze the circuit as deeply as possible. The problem here is that circuit analysis is usually very costly in CPU time. Circuit analysis can be carried out in three steps: first EM and EMI analysis, then, according to the component models available in the database, electrical analysis. The electrical analysis can be performed using two general approaches: one relies on versions of Spice, the other on direct-solution methods using fixed time increments.

EM (electromagnetic field solver, or Maxwell's equations solver) and EMI (electromagnetic interference) analyzers scan the layout database for unique interconnect and coupling structures and discontinuities. EM field solvers then use the layout information to solve Maxwell's equations by numerical methods. Inputs to the solver include physical data about the printed circuit or multichip module, such as the layer stack-up, the dielectrics and their thickness, the placement of power and ground layers, and the interconnect metal width and spacing. The output is a mathematical representation of the electrical properties. In this way, solvers analyze the structure in two, two and a half, or three dimensions. In choosing among the variety of solvers, the complexity of the structure and the accuracy of the computation must be weighed against performance and computational cost.
Electrical models are important for analysis tools, and they can be generated automatically from measurements in the time domain or from mathematical equations.

Finally, the time complexity of solving the system matrices that represent a large electrical network is an order of magnitude larger for an electrical simulator like Spice than for a digital simulator.

Down Scaling

As CMOS devices scale down into the submicron region, the intrinsic speed of the transistors increases (frequencies between 60 and 100 MHz are common). Transition times are also reduced, so that the increase in output switching speed increases the rate of change of the switching current (di/dt). Because of parallelization, simultaneous switching of the I/Os creates so-called simultaneous switching noise (SSN), also known as Delta-I noise or ground bounce [8]. It is important that SSN be kept below a maximum allowable level to avoid spurious errors such as false triggering, double clocking or missing clock pulses. The output driver design is therefore no longer trivial, and techniques such as current-controlled circuits or controlled slew rate drivers are used to minimize the effect of the switching noise [8]. An in-depth analysis of the chip-package interface is required to ensure the functionality of high-speed chips (Figure 5).

Fig.5: Chip-Package Interface

The lesson is that some important parts of each digital submicron chip have to be considered as working in analog mode rather than digital. This applies not only to the I/Os but also to the timing and clocking blocks in a system. The entire floor plan has to be analyzed in the context of noise immunity, parasitics, and also propagation and reflection in the buses and main communication lines.
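The order of magnitude of the simultaneous switching noise discussed above can be estimated from the voltage induced across a shared ground inductance, V = N * L * di/dt, which is a common first-order model. In the sketch below the function name and all numerical values (lead inductance, current step, transition time, number of drivers) are assumptions for illustration, not figures taken from [8].

```python
def ground_bounce_volts(n_drivers, inductance_nh, delta_i_ma, transition_ns):
    """First-order SSN estimate for n drivers sharing one ground inductance.

    Each driver is assumed to change its current by delta_i_ma linearly
    over transition_ns, so di/dt is constant during the transition.
    """
    di_dt = (delta_i_ma * 1e-3) / (transition_ns * 1e-9)  # A/s per driver
    return n_drivers * (inductance_nh * 1e-9) * di_dt


# Hypothetical case: 5 nH lead, 10 mA per driver, 1 ns transition.
for n in (1, 8, 16):
    print(n, ground_bounce_volts(n, 5.0, 10.0, 1.0))
```

Under these assumptions the bounce grows in proportion to the number of outputs switching together and to di/dt, which illustrates why controlled slew rate drivers, which lower di/dt, are attractive.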
Our idea was to reduce the number of primitive cells and to structure the layout in such a way that the common software tools can be used for electrical optimization of the interconnections (abutment optimization).

Down scaling of the silicon technology is today a common way to obtain a new technology in a short time in order to compete in the digital market, but this shrinking is linear only in x and y (with some differences and restrictions, for vias for example). The third dimension is not shrunk linearly, for technical and physical reasons. The designer has to make sure that the models describing the devices and the parasitics are valid for the frequency considered in the particular application.

Power Management

As chips grow in size and speed, power consumption is drastically amplified. The current demand for portable consumer products implies that power consumption must be controlled at the same time that complex user interfaces and multimedia applications are driving up computational complexity. But there are limits to how much power can be slashed in analog and digital circuits. In analog circuits, a desired signal-to-noise ratio must be maintained, while for digital ICs the lower limit on power is set by cycle time, operating voltage, and circuit capacitance [9].

A smaller supply voltage is not the only way to reduce power in digital circuits. Minimizing the number of device transitions needed to perform a given function, local suppression of the clock, reduction of the clock frequency, and elimination of the system clock in favor of self-timed modules are other means of reducing power. This means that for cell-based design technology there is a crucial need to design the cell library to minimize energy consumption. There are various ways to reduce the energy of each transition, which is proportional to the capacitance and to the square of the supply voltage (E = CV^2).
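The quadratic dependence on the supply voltage is easy to quantify. The following sketch evaluates E = CV^2 and the resulting average switching power for a given number of transitions per second. The function names and the 1 pF load are hypothetical; 5 V and 3.3 V are used simply as representative supply voltages.

```python
def transition_energy_joules(capacitance_f, supply_v):
    """Energy per transition, taken as E = C * V^2 as in the text."""
    return capacitance_f * supply_v ** 2


def switching_power_watts(capacitance_f, supply_v, transitions_per_s):
    """Average power when the node makes the given number of transitions per second."""
    return transition_energy_joules(capacitance_f, supply_v) * transitions_per_s


# Hypothetical 1 pF node at two supply voltages.
e_5v = transition_energy_joules(1e-12, 5.0)
e_3v3 = transition_energy_joules(1e-12, 3.3)

# Fraction of the 5 V energy that remains at 3.3 V for the same capacitance.
print(e_3v3 / e_5v)
```

For the same capacitance, moving from 5 V to 3.3 V leaves (3.3/5)^2, or about 44%, of the energy per transition, which is why the supply voltage is usually the first parameter considered.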
Capacitances are reduced along with the feature size in scaled-down processes, but this reduction is not linear. With the appropriate use of design techniques such as minimization of the interconnections, abutment, or optimum placement, it is possible to reduce the capacitances more effectively. So what are the main techniques used to decrease the power consumption? Decrease the frequency, the size and the power supply. Technology has evolved to 3.3 V processes in production, and most current processors take advantage of this progress. Nevertheless, the number of transistors and the operating frequency cannot be reduced so brutally, and so trade-offs have to be found.

Let us bring some insight into power management by looking at different approaches found in actual products. A wise combination of those approaches may eventually even lead to new methods.

The MicroSparc II uses a 3.3 V power supply and fully static logic. It can cut power to the caches by 75% when they are not being accessed, and in standby mode it can stop the clock to all logic blocks. At 85 MHz it is expected to consume about 5 W.

Motorola and IBM had the goal of providing high performance with low power consumption when they produced the PowerPC603. Using a supply voltage of 3.3 V, a 0.5 micron CMOS process with four metal levels, and static design technology, the chip is smaller than its predecessor (85.5 mm2 in 0.5 micron for 1.6 million transistors instead of 132 mm2 in 0.6 micron for 2.8 million transistors). The new load/store unit and the SRU (System-Register Unit) are used to implement dynamic power management, with a maximum power consumption of 3 W at 80 MHz. But a lot more reduction in power consumption can be expected from a reduction of the voltage swing, for example on buses or on memory bit lines.
To achieve a reasonable operating power for a VLSI chip it is necessary to decrease drastically the power consumption of the internal bus drivers; a circuit technology with a reduced voltage swing on the internal buses is a good solution. Nakagome [10] proposed a new internal bus architecture that reduces operating power by suppressing the bus signal swing to less than 1 V. This architecture can achieve low power dissipation while maintaining high speed and low standby current. The principle is shown in Figure 6.

Fig.6: Architecture of internal bus

An electrothermal analysis of the IC will show the non-homogeneous local power dissipation. This makes it possible to avoid hot spots in the chip itself (or in a multichip solution) and so to secure good yield, since the failure rate of microelectronic devices doubles for every 10°C increase in temperature. To optimize both long-term reliability and performance, it has become essential to perform both thermal and electrothermal simulations during chip design. For example, undesirable thermal feedback due to temperature gradients across a chip degrades the performance of electrical circuits such as reduced-swing bus drivers or mixed analog-digital components, where matching and symmetry are important parameters [11].

Reducing chip power consumption is not the only issue. When targeting low system cost and low power consumption, it becomes interesting to include a PLL (Phase Locked Loop) that allows the processor to run at higher frequencies than the main system clock. By multiplying the system clock by 1, 2, 3 or 4, the PowerPC603 operates properly even when a slower system clock is used. Three software-controllable modes can be selected to reduce power consumption: most of the processor can be switched off, bus snooping alone can be disabled, or the time-base register can be switched off.
Naturally, it takes some clock cycles to bring the processor back into full power mode. Dynamic power management is also used to switch off only certain processor subsystems, and even the cache protocol has been reduced from four to three states while remaining compatible with the previous one.

Silicon Graphics' goal has been to maintain RISC performance at a reduced price. Being neither superscalar nor super-pipelined (only 5 pipeline stages), its chip combines the integer and floating-point units into a single unit. The result is a degraded performance but a big saving in the number of transistors. It can also power down unused execution units; this is perhaps even more necessary since dynamic logic is used. The chip should typically dissipate about 1.5 W.

Table I lists four RISC chips competing with the top end of the 80x86 microprocessor line. What is drawn from these CPU considerations is also applicable to television systems. The requirements of compression algorithms and of their pre- and post-processing lead to system sizes very similar to those of computer workstations. Our methodology was to reduce the size of the primitive cells of the library by using optimizing software developed in-house.

Table I: Four RISC chips competing with the top end of the 80x86 line.

Chip             Transistors  Max. power       Price for  Size   SPECint92       SPECfp92        Operating voltage
                 (millions)   dissipation      1000       (mm2)
DECchip 21066    1.75         20 W @ 166 MHz   US$424     209    70 @ 166 MHz    105 @ 166 MHz   3.3 V & 5 V peripherals
PowerPC 603      1.6          3 W @ 80 MHz     US$?       85     75 @ 80 MHz     85 @ 80 MHz     3.3 V & 5 V peripherals
MicroSparc II    2.3          5 W @ 85 MHz     US$500     233    57.2 @ 85 MHz   49.5 @ 85 MHz   3.3 V & 5 V peripherals
Mips/Nec Rs4200  1.3          2 W @ 40/80 MHz  US$100     81     55 @ 40/80 MHz  30 @ 40/80 MHz  3.3 V & 5 V peripherals

Silicon Technology Requirements

It is important to notice that today's process technologies are not adapted to the new tasks in the consumer field: low power, high speed, and huge amounts of data.
Historically, most progress was made on the memory processes, because of the business potential and the real need for storage that has existed ever since the microprocessor architecture appeared. More or less all the so-called ASIC process technologies have been extrapolated from the original DRAM technologies, with some drastic simplifications because of yield sensitivity. Now the new algorithms for multimedia applications require parallel architectures and, because of the computation needs, local memorization, which means a drastic increase in interconnections. New ways to improve the efficiency of the designs lie in improving the technologies, not only by shrinking the features linearly or decreasing the supply voltage, but also by making it possible to store the information where it is needed, avoiding interconnection. This could be done by using floating gate or, better, ferroelectric memories. This material allows memorization on a small capacitance placed on top of the flip-flop which generates the data to be memorized; in addition, the information will not be destroyed, and the material is not sensitive to radiation. Another way is the use of SOI (Silicon On Insulator). In this case the parasitic capacitances of the active devices are reduced to nearly zero, so that it is possible to work with very small feature sizes (0.1 µm to 0.3 µm) and to achieve very high speed at very low power consumption [12].

Another concern is the multilayer interconnect. Because of the ASIC-oriented methodologies it was useful to have more than one metal interconnection layer; this permits the so-called prediffused wafer techniques (such as Gate-Arrays or Sea-of-Gates). Software tools developed for this methodology enabled users to use an automatic router. The bad news for high-speed circuits is that wires are routed in a non-predictive way, so that their lengths are often not compatible with the speed requirements.
It was demonstrated a long time ago that two interconnection layers are sufficient to solve any problem for digital circuits. One of these could also be in polysilicon or, better, salicide material, so that only one metallization is really needed for high-speed digital circuits, and perhaps another for the power supply and the main clock system. If the designer uses power minimization techniques for the basic layout cells and takes into account the poly layer for cell-to-cell interconnections, the reduction in power consumption will be significant, mainly because of the reduction in the size of the cell.

Effects of parallelism

From the power management point of view, it is interesting to notice that for a CMOS gate the delay is approximately inversely proportional to the supply voltage. Therefore, to maintain the same operational frequency, a reduction of the supply voltage (for power saving) must be compensated for by computing in n parallel functions, each of them operating n times slower. This parallelism is inherent in some tasks, such as image processing. Bit-parallelism is the most immediate approach; pipelining and systolic arrays are others. The good news is that they do not need much overhead for communication and control. The bad news is that they are not applicable when increasing the latency is not acceptable, for example if loops are required in some algorithms. If multiprocessors are used for more general computation, the circuit overhead for control and communication tasks grows more than linearly and the overall speed of the chip slows down by several orders of magnitude; this is, incidentally, the handicap of standard DSPs applied to multimedia tasks. The heavy use of parallelism also means a need for on-chip memorization or, if the memory blocks are outside, an increase in wiring, which means more power and less speed.
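The trade-off between supply voltage and parallelism can be sketched numerically. The model below is deliberately idealized: it takes the gate delay as exactly inversely proportional to the supply voltage, as approximated above, uses an energy per transition proportional to CV^2, and ignores the threshold voltage, the communication overhead and the extra wiring. The function name and the chosen values of n are assumptions for illustration.

```python
def parallel_power_ratio(n_units):
    """Power of n parallel units relative to one unit at full voltage and speed.

    Idealized assumptions:
      - each unit runs n times slower, so its supply can drop to V/n
        (delay taken as inversely proportional to V);
      - the total switched capacitance grows to n*C;
      - the overall throughput is unchanged.
    """
    voltage_scale = 1.0 / n_units        # V -> V/n
    capacitance_scale = float(n_units)   # C -> n*C
    rate_per_unit_scale = 1.0 / n_units  # f -> f/n in each unit
    return capacitance_scale * voltage_scale ** 2 * rate_per_unit_scale


for n in (1, 2, 4):
    print(n, parallel_power_ratio(n))
```

In this idealized model the power falls as 1/n^2 at constant throughput. In practice the gain is smaller, since the supply cannot be scaled far toward the threshold voltage and the control and wiring overhead mentioned above grows with n.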
Logic styles

The architectures of the developed systems are usually described at a high level to ensure correct functionality. These descriptions generally cannot take into account low-level considerations such as logic design. Usually the tools used are C descriptions converted into VHDL code. Such code is then remodeled into more structured VHDL descriptions, known as RTL (Register Transfer Level) descriptions. This new model of the given circuit or system is then coupled with standard cell CMOS libraries. Finally, the layout is generated. In such a process, the designer is faced with successive operations where the only decisions are made at the highest level of abstraction. Even with the freedom of changing or influencing the RTL model, the lowest level of abstraction, i.e. the logic style, will not be influenced. It is fixed by the use of a standard library, and is rarely different from the pure CMOS style of design.

Nevertheless, the fact that the logic style is frozen can lead to some aberrations, or at least to certain difficulties, when designing a VLSI system. Clock loads may be too high, the pipeline may not be easy to manage, and special tricks have to be used to satisfy the gate delays. Particular circuitry has to be developed for system clocking management. One way to circumvent clock generation units is to use only one clock. The TSPC (True Single Phase Clock) technique [13] operates with only one clock. It is particularly suited to fast pipelined circuits when correctly sized, with a non-prohibitive cell area.

Other enhancements

In the whole plethora of logic families (pure CMOS, Pseudo-NMOS, Dynamic CMOS, Cascode Voltage Switch Logic, Pass Transistor, etc.) it is not possible to obtain excellent speed performance with minimal gate size. There is always a critical path to optimize. Piguet [14] introduces a technique to minimize the critical path of basic cells.
All cells are made exclusively of branches that contain one or several transistors connected in series between a power line and a logical node. Piguet demonstrates that any logical function can be implemented with branches only. The result is that, generally, the number of transistors is greater than for conventional schematics. However, he shows that by decomposing complex cells into several simple cells, the best schematics can be found in terms of speed, power consumption and chip area. This idea of minimizing the number of transistors between a power node and a logical node is used in our concept.

Asynchronous designs also tend to speed up systems. The clocking strategy is abandoned in favor of switches which let a wave of signals propagate as soon as they are available. The communication is equivalent to "handshaking". The general drawback of this technique is the area overhead, and careful consideration is required to avoid races and hazards [15], [16]. It is also necessary to carry the clocking information with the data, which increases the number of communication lines. Finally, detecting state transitions requires an additional delay, even if this can be kept to a minimum.

Redundancy of information enables interesting realizations [16] in asynchronous and synchronous designs. The technique consists in creating additional information that makes it possible to choose, during the calculation, the number representation which gives the minimum delay. Avizienis [17] introduced this field, and research has continued on the subject [18], [19], since it is not difficult to convert the binary representation into the redundant one, though it is more complex to do the reverse. While there is no carry propagation in such a representation, the conversion from redundant binary into a binary number is there to "absorb" the carry propagation.

Enhancement can also be obtained by changing the technology. Implementation in BiCMOS or GaAs [21] will also allow better performance than pure CMOS.
But the trade-off of price versus performance has to be carefully studied before making any such decision. Three-dimensional design of physical transistors could also be a possibility to enhance a system. The capacitive load could be decreased and the density increased, but such methods are not yet reliable [22].

Design Methodology

The aim of the proposed methodology is to show a simple and powerful way to achieve very high speed implementations in the framework of algorithms for image compression as well as pre- and post-processing. These are the main goals and criteria:

• Generality: The approach used should be as general as possible without making the implementation methodology too complex. The range of applications covered by the strategy should be as extensive as possible, concerning both speed and types of algorithms implemented.

• Flexibility: Here we consider the flexibility for a given task. For instance, if a processing element is designed, it is essential that different solutions can be generated, all with slightly different parameters. This includes speed, word length, accuracy, etc.

• Efficiency: This is an indispensable goal and implies not only efficiency of the algorithm used, but also efficiency in performing the design task itself. Efficiency is commonly measured as the performance of the chip compared to its relative cost.

• Simplicity: This point is a prerequisite for the previous one. By using simple procedures, or simple macro blocks, the design task will be simplified as well. Restrictions will occur, but if the strategy itself is well structured, it will also be simple to use.

• CAD portability: It is a must for the methodology to be fully supported by CAD tools. A design implementation approach that is not supported by a CAD environment cannot claim to conform to the points given above.
The methodology must be defined such that it is feasible and simple to introduce the elements in these tools. It is therefore important that the existing CAD tools and systems can adopt and incorporate the concepts developed earlier.

ASICs are desirable for their high potential performance, their reliability and their suitability for high-volume production. On the other hand, considering the complexity of development and design, microprocessor- or DSP-processor-based implementations usually represent cheaper solutions. However, the performance of the system is the decisive factor here. For consumer applications, cost is generally defined as a measure of the required chip area. This is the most common and important factor. Other cost measures take into account the design time, testing and verification of the chip: complex chips cannot be redesigned several times. Avoiding redesign cycles reduces time-to-market and gives the opportunity to adapt to evolving tools and follow the technology. Some physical constraints can also be imposed on the architecture, such as power dissipation, reliability under radiation, etc. Modularity and regularity are two additional factors that improve the cost and flexibility of a design (it is also much simpler to link such architectures with CAD tools).

The points developed above were intended to show, in a general way, the complexity of the design of modern systems. For this reason we focused on several sensitive problems of VLSI development. Today the design methodology is too strongly oriented by the final product. This is usually justified by historical reasons and by the natural parallel development of CAD tools and technology processes. The complexity of the tools inhibits the methodology needed for modern system requirements. To prove the feasibility of concurrent engineering with the present CAD tools, the natural approach is the reuse policy.
It means that, to reduce the development time, one reuses already existing designs and architectures that are not necessarily adapted to the needs of future systems. This behaviour is driven only by the commercial constraint of selling an already owned product in slightly modified form. In contrast, the EPFL solution presents a global approach to a complex system (from low bit-rate to HDTV) using a design methodology which takes into account the requirements mentioned above. It shows that architectural bottlenecks are removed if powerful macrocells and macrofunctions are developed. Several functions have been hardwired, but libraries of powerful macrocells are not enough. The problem that arises is the complex control of these functions and the data bandwidth. That is why a certain balance between hard and soft functions has to be found. System analysis and optimization tools are needed to achieve this goal. We developed software tools enabling fast and easy system analysis by giving the optimal configuration of the architecture of a given function. These tools take into account functionality, power consumption and area.

Access to the hardwired functions needs to be controlled by dedicated but embedded microcontroller cores. The way of designing these cores has to be generic, since each microcontroller will be dedicated to certain subtasks of the algorithm. On the other hand, the same core will be used to achieve tasks at higher levels. Because it is very expensive to build dedicated processors and dedicated macrofunctions for each application, it is necessary to give these functions enough genericity to allow their use in a large spectrum of applications and, at the same time, enough customization to allow optimal system performance.
This was achieved by using in-house hierarchical analysis tools adapted to the sub-system, giving a figure of "flexibility" of the considered structure in the global system context.

IV. Conclusion

Digitalization of the fundamental TV functions has been of great interest for more than 10 years. Several million TV sets containing digital systems have been produced. However, the real, fully digital system is still to come. A lot of work is being done in this field today; the considerations are more technical than economic, which is a normal situation for an emerging technology. The success of this new multimedia technology will be determined by the applications running on it.

The required technologies and methodologies were discussed to emphasize the main parameters influencing the design of VLSI chips for digital TV applications, such as parallelization, electrical constraints, power management, scalability and so on.

REFERENCES

[1] Fischer T., "Digital VLSI breeds next generation TV-Receiver", Electronics, H.16 (11.8.1981).
[2] Fischer T., "What is the impact of digital TV?", Transaction of 1982 IEEE ICCE.
[3] "Economy and Intelligence of Microchips", Extract of "Funkschau" 12/31. May 1991.
[4] "DIGIT2000 Digital TV System", ITT Semiconductor Group, July 1991.
[5] Heberle K., "Multimedia and digital signal processing", Elektronik Industrie, No. 11, 1993.
[6] Pirsch P., Demassieux N., Gehrke W., "VLSI Architectures for Video Compression - A Survey", Special Issue Proceedings of the IEEE, Advances in Image and Video Compression, Early '95.
[7] Kunt M., Ikonomopoulos A., Kocher M., "Second-Generation Image-Coding Techniques", Proceedings of the IEEE, Vol. 73, No. 4, pp. 549-574, April 1985.
[8] Senthinathan R. and Prince J.L., "Application Specific CMOS Output Driver Circuit Design Techniques to Reduce Simultaneous Switching Noise", IEEE Journal of Solid-State Circuits, Vol. 28, No. 12, pp. 1383-1388, December 1993.
[9] Vittoz E., "Low Power Design: ways to approach the limits", Digest of Technical Papers, ISSCC'94, pp. 14-18, 1994.
[10] Nakagome Y. et al., "Sub 1 V Swing Internal Bus Architecture for Future Low-Power ULSI's", IEEE Journal of Solid-State Circuits, Vol. 28, No. 4, pp. 414-419, April 1993.
[11] Sang Soo Lee, Allstot D., "Electrothermal Simulation of Integrated Circuits", IEEE Journal of Solid-State Circuits, Vol. 28, No. 12, pp. 1283-1293, December 1993.
[12] Fujishima M. et al., "Low-Power 1/2 Frequency Dividers using 0.1-µm CMOS Circuits built with ultrathin SIMOX Substrates", IEEE Journal of Solid-State Circuits, Vol. 28, No. 4, pp. 510-512, April 1993.
[13] Kowalczuk J., "On the Design and Implementation of Algorithms for Multi-Media Systems", PhD Thesis, EPFL, December 1993.
[14] Masgonty J.-M., Mosch P., Piguet C., "Branch-Based Digital Cell Libraries", Internal Report, CSEM, 1990.
[15] Yuan J., Svensson C., "High-Speed CMOS Circuit Technique", IEEE Journal of Solid-State Circuits, Vol. sc-24, No +, pp. 62-70, February 1989.
[16] McAuley A. J., "Dynamic Asynchronous Logic for High-Speed CMOS Systems", IEEE Journal of Solid-State Circuits, Vol. sc-27, No. 3, pp. 382-388, March 1992.
[17] Avizienis, "Signed-Digit Number Representations for Fast Parallel Arithmetic", IRE Trans. Electron. Comput., Vol. EC-10, pp. 389-400, 1961.
[18] Takagi N. et al., "A High-Speed Multiplier using a Redundant Binary Adder Tree", IEEE Journal of Solid-State Circuits, Vol. sc-22, No. 1, pp. 28-34, February 1987.
[19] McAuley A. J., "Four State Asynchronous Architectures", IEEE Transactions on Computers, Vol. sc-41, No. 2, pp. 129-142, February 1992.
[20] Ercegovac M. D., Lang T., "On line Arithmetic: A Design Methodology and Applications in Digital Signal Processing", Journal of VLSI Signal Processing, Vol. III, pp. 252-163, 1988.
[21] Hoe D. H. K., Salama C. A. T., "Dynamic GaAs Capacitively Coupled Domino Logic (CCDL)", IEEE Journal of Solid-State Circuits, Vol. sc-26, No. 1, pp. 844-849, June 1991.
[22] Roos G., Hoefflinger B., and Zingg R., "Complex 3D-CMOS Circuits based on a Triple-Decker Cell", CICC 1991, pp. 125-128, 1991.
[23] Ebrahimi T. et al., "EPFL Proposal for MPEG-2", Kurihama, Japan, November 1991.
[24] Hervigo R., Kowalczuk J., Mlynek D., "A Multiprocessor Architecture for a HDTV Motion Estimation System", Transaction on Consumer Electronics, Vol. 38, No. 3, pp. 690-697, August 1992.
[25] Langdon G. G., "An introduction to Arithmetic coding", IBM Journal of Research and Development, Vol. 28, No. 2, pp. 135-149, March 1984.
[26] Duc P., Nicoulaz D., Mlynek D., "A RISC Controller with customization facility for Flexible System Integration", ISCAS '94, Edinburgh, June 1994.
[27] Hervigo R., Kowalczuk J., Ebrahimi T., Mlynek D., Kunt M., "A VLSI Architecture for Digital HDTV Codecs", ISCAS '92, Vol. 3, pp. 1077-1080, 1992.
[28] Kowalczuk J., Mlynek D., "Implementation of multipurpose VLSI Filters for HDTV codecs", IEEE Transactions on Consumer Electronics, Vol. 38, No. 3, pp. 546-551, August 1992.
[29] Duardo O. et al., "Architecture and Implementation of ICs for a DSC-HDTV Video Decoder System", IEEE Micro, pp. 22-27, October 1992.
[30] Goyal R., "Managing Signal Integrity", IEEE Spectrum, pp. 54-58, March 1994.
[31] Rissanen J. J. and Langdon G. G., "Arithmetic Coding", IBM Journal of Research and Development, Vol. 23, pp. 149-162, 1979.
[32] Daugman J. G., "Complete Discrete 2-D Gabor Transforms by Neural Networks for Image Analysis and Compression", IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 36, No. 7, pp. 1169-1179, July 1988.
[33] Draft International Standard ISO/IEC DIS 13818-2, pp. 188-190, 1993.
[34] Forstner P., "Timing Measurement with Fast Logic Circuit", TI Technical Journal - Engineering Technology, pp. 29-39, May-June 1993.
[35] Rijns H., "Analog CMOS Teletext Data Slicer", Digest of Technical Papers - ISSCC94, pp. 70-71, February 1994.
[36] Demura T., "A Single Chip MPEG2 Video Decoder LSI", Digest of Technical Papers - ISSCC94, pp. 72-73, February 1994.
[37] Toyokura M., "A Video DSP with Macroblock-Level-Pipeline and a SIMD Type Vector Pipeline Architecture for MPEG2 CODEC", Digest of Technical Papers - ISSCC94, pp. 74-75, February 1994.
[38] SGS-Thomson Microelectronics, "MPEG2/CCIR 601 Video Decoder - STi3500", Preliminary Data Sheet, January 1994.
[39] Array Microsystems, "Image Compression Coprocessor (ICC) - a77100", Advanced Information Data Sheet, Rev. 1.1, July 1993.
[40] Array Microsystems, "Motion Estimation Coprocessor (MEC) - a77300", Product Preview Data Sheet, Rev. 0.1, April 1993.

This chapter edited by D. Mlynek.

Chapter 11 VLSI FOR TELECOMMUNICATION SYSTEMS

Introduction
Telecommunication Fundamentals
General Telecommunication Network Taxonomy
Comparison Between Different Switching Techniques
ATM Networks
Case Study: ATM Switch
Case Study: ATM Transmission of Multiplexed-MPEG Streams
Conclusions
Bibliography

11.1. Introduction

This document is organised as follows. Sections 11.2 and 11.3 review telecommunication fundamentals and give a network taxonomy. Section 11.4 explains switching networks as an introduction to section 11.5, in which ATM network concepts are presented. Sections 11.6 and 11.7 present two case studies that show the main elements found in a telecommunication system-on-a-chip. The former is an ATM switch, the latter a system to transmit MPEG streams over ATM networks.

This chapter edited by E.
Juarez, L. Cominelli and D. Mlynek

11.2. Telecommunication fundamentals

Figure 11.1 shows a switching network. Lines are the media links. Ovals are called network nodes. Media links simply carry data from one point to another. Nodes take the incoming data and route them to an output port.

Figure-11.1: Switching network

If two different communication paths intersect in this network, they have to share some resources. Two paths can share a media link or a network node. The next sections describe these sharing techniques.

11.2.1. Media sharing techniques

Media sharing occurs when two communication channels use the same media.

Figure-11.2: Media sharing

This section presents how several communication channels can use the same media link, without architectural considerations. There are three main techniques.

11.2.1.1. Time Division Multiple Access (TDMA)

This simple method consists of multiplexing data in time. Each user transmits during a fraction of time equal to 1/(number of possible channels), using the full bandwidth W. This sharing mode can be synchronous or asynchronous.

Figure 11.3 shows a synchronous TDMA system. Each channel uses one time slot every T periods. Selecting a time slot identifies one channel. Classical wired telephony uses this technique.
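The slot-to-channel rule of synchronous TDMA can be sketched in a few lines. This is only an illustration under simplifying assumptions: every channel has one data unit ready in every frame, and the function names `tdma_multiplex` and `tdma_demultiplex` are hypothetical, not taken from the text.

```python
def tdma_multiplex(channels):
    """Interleave per-channel data into one slot stream, frame by frame.

    channels[i][k] is the unit that channel i sends in frame k.
    Assumption: all channels supply the same number of units.
    """
    n_frames = len(channels[0])
    if any(len(ch) != n_frames for ch in channels):
        raise ValueError("every channel needs one unit per frame")
    stream = []
    for frame in range(n_frames):
        for ch in channels:           # slot position == channel index
            stream.append(ch[frame])
    return stream


def tdma_demultiplex(stream, n_channels, channel):
    """Recover one channel by selecting its time slot in every frame."""
    return stream[channel::n_channels]


channels = [["a0", "a1"], ["b0", "b1"], ["c0", "c1"]]
stream = tdma_multiplex(channels)
print(stream)
print(tdma_demultiplex(stream, n_channels=3, channel=1))
```

Because the slot position alone identifies the channel, no header is needed, but a channel that has nothing to send still occupies its slot.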
Figure-11.3: Synchronous TDMA diagram

In synchronous TDMA, if an established channel stops transferring data without freeing its assigned time slot, the unused bandwidth is lost and other channels cannot take advantage of it. The technique has evolved into asynchronous TDMA to avoid this problem.

Figure 11.4 shows an asynchronous TDMA system. Each channel uses a time slot when the user needs to transfer data and a time slot is unused. The header of the data carried in each time slot identifies the channel. ATM networks use this technique.

Figure-11.4: Asynchronous TDMA diagram

These two techniques are used to connect users. Providing broadcast channels in TDMA cannot be done easily. The Frequency Division Multiple Access technique avoids this problem. The next section presents this sharing mode.

11.2.1.2. Frequency Division Multiple Access (FDMA)

This sharing method consists of giving each channel a piece of the available bandwidth. Each user transmits over a constant bandwidth equal to W/(number of possible channels). Filtering the whole W bandwidth spectrum with a bandwidth equal to W' = W/(number of possible channels) selects one channel. TV and radio broadcasters use this media sharing technique. Figure 11.5 shows an FDMA spectrum diagram.

Figure-11.5: FDMA diagram

Another method has been developed based on the frequency dimension. This method, called Code Division Multiple Access, uses an encoding-decoding system initially used for military communications. Today consumer market applications also use this technique. The next section presents this method.

11.2.1.3. Code Division Multiple Access (CDMA)

Each user transmits using the full bandwidth.
Demodulating the whole W band using a given identification code selects one channel out of the others. The next generation of mobile phone standards (IS-95 or W-CDMA) uses this media sharing technique. Figure 11.6 shows a CDMA spectrum diagram.

Figure-11.6: CDMA diagram

These techniques can be combined. For example, the Global System for Mobile communications (GSM = Natel D) phone standard uses an FDMA-TDMA technique.

After this description, the next section presents how a network node routes data from an input port to a given output port.

11.2.2. Node sharing technique

Node sharing occurs when two communication channels use the same network node. The question is how several communication channels can use the same node in a cell switching network, i.e. an ATM network.

Figure-11.7: Shared node

Before answering this question, we have to define the specification of the switching function. The next section presents this concept.

11.2.2.1. Switching function

As shown in figure 11.8, a switch has N input ports and N output ports. Data arrive on the lines attached to the input ports. After their destination has been identified, data are routed through the switch to the appropriate output port. After this stage, data can be sent on the communication line attached to the output port.

Figure-11.8: Canonical switch

This canonical switch can be implemented directly in hardware. However, this technological solution poses some throughput problems. Section 11.6.1.2.1 (describing the crossbar switch architectures) explains why. Section 11.6.1.2.2 (describing the Batcher-Banyan network) shows how the throughput problems can be solved. Furthermore, the incoming data sequence can pose some routing problems. The next part of this section shows these critical scenarios.

11.2.2.2. Switching scenario

Figure 11.9 shows some switching scenarios.
Scenario 1 shows two cells from two different input ports going through the switch to two different output ports. These two cells can be routed simultaneously. Scenario 2 shows two cells from the same input port going through the switch to two different output ports. Both cells are routed to their output destinations.

Figure-11.9: Three switch scenarios

Scenario 3 shows two cells from two different input ports going through the switch to the same output port. There are five possible strategies to solve this problem:

• To drop one cell and route the other. This solution involves a loss of data and is therefore not a good approach.

• To route both cells simultaneously and memorize in the output port the cell that has not been sent on the attached line. This technique is called output buffering.

• To memorize the incoming cells in the input ports and then route them. This technique is called input buffering.

• The two other solutions consist of memorizing the extra cells during the routing task. These techniques are derived from input buffering.

Section 11.6.1 considers why output buffering is better than input buffering.

11.3. General telecommunication network taxonomy

Telecommunication networks can mainly be classified into two groups according to who decides which nodes are not going to receive the transmitted information.
When the network takes responsibility for this decision, we have a switching network. When this decision is left to the end-nodes, we have a broadcast network, which can be divided into packet radio networks, satellite networks and local area networks.

Switching networks use any of the following switching techniques: circuit, message or packet switching, the last being implemented as either virtual circuit or datagram. Let us compare these techniques.

11.4. Comparison between different switching techniques

We can begin with two rough classifications. If a connection (path) between the origin-node and the end-node is established at the beginning of a session, we are talking about circuit or packet (virtual circuit) switching. If it is not, we refer to message and packet (datagram) switching. On the other hand, when considering how a message is transmitted, if the whole message is divided into pieces we have packet switching (based either on virtual circuit or on datagram), but if it is not, we have circuit and message switching. The following paragraphs go into the details of the different switching techniques.

11.4.1. Circuit switching

Figure 11.11 shows the most important events in the life of a connection in a four-node circuit switching network (see figure 11.10). When a connection is established, the origin-node identifies the first intermediate node (node A) in the path to the end-node and sends it a communication request signal.
After the first intermediate node receives this signal, the process is repeated as many times as needed to reach the end-node. Afterwards, the end-node sends a communication acknowledge signal to the origin-node through all the intermediate nodes that were used in the communication request. Then a full-duplex transmission line, which is kept for the whole communication, is set up between the origin-node and the end-node. To release the communication, the origin-node sends a communication end signal to the end-node.

Figure-11.10:

Figure-11.11:

11.4.2. Message switching

Figure 11.12 shows the connection life events for a message switching network. When a connection is established, the origin-node identifies the first intermediate node in the path to the end-node and sends it the whole message. After receiving and storing this message, the first intermediate node (node A) identifies the second one (node B) and, when the transmission line is not busy, sends it the whole message (store-and-forward philosophy). This process is repeated up to the end-node. As can be seen in figure 11.12, no connection establishment or release is needed.

Figure-11.12:

11.4.3. Packet switching based on virtual circuit

Figure 11.13 shows the same events for a virtual circuit (packet) switching network. When a connection is established, the origin-node identifies the first intermediate node (node A) in the path to the end-node and sends it a communication request packet. This process is repeated as many times as needed to reach the end-node. Then the end-node sends a communication acknowledge packet to the origin-node through the intermediate nodes (A, B, C and D) that were traversed in the communication request.
The virtual circuit established in this way is kept for the whole communication. Once a virtual circuit has been established, the origin-node begins to send packets (each of them carrying a virtual circuit identifier) to the first intermediate node. The first intermediate node (node A) then begins to send packets to the following node in the virtual circuit, without waiting to store all the message packets received from the origin-node. This process is repeated until all message packets arrive at the end-node. In the communication release, when the origin-node sends a communication end packet to the end-node, the latter answers with an acknowledge packet. There are two possibilities when releasing a connection:

• No trace of the virtual circuit information is left, so every communication is set up as if it were the first one.

• The virtual circuit information is kept for future connections.

Figure-11.13:

11.4.4. Packet switching based on datagram

The most important events in the life of a communication in a datagram switching network are shown in figure 11.14. The origin-node identifies the first intermediate node in the path and begins to send packets. Each packet carries an origin-node and an end-node identifier. The first intermediate node (node A) begins to send packets to the following intermediate node without storing the whole message. This process is repeated up to the end-node. As there is neither connection establishment nor connection release, the path followed by each packet from the origin-node to the end-node can be different and therefore, as a consequence of different propagation delays, packets can arrive out of order.

Figure-11.14:

This chapter edited by E. Juarez, L. Cominelli and D.
Mlynek a joint production of EJM 17/2/1999

11.5. ATM Networks

11.5.1. Asynchronous Transfer Mode

Before describing the fundamentals of ATM networks, we define a few concepts, such as transfer mode and multiplexing, that are needed to understand the main ATM points.

The concept of transfer mode summarizes two ideas related to information transmission in telecommunication networks: how information is multiplexed, i.e. how different messages share the same communication circuit, and how information is switched, i.e. how the messages are routed to the destination-node.

11.5.1.1. Multiplexing fundamentals

The concept of multiplexing is related to the way in which several communications can share the same transmission medium. As seen in 11.2.1, the techniques used include time-division multiplexing (TDM) and frequency-division multiplexing (FDM). The former can be synchronous or asynchronous.

In STD (synchronous time-division) multiplexing, a periodic structure divided into time intervals, called a frame, is defined, and each time interval is assigned to a communication channel. As the number of time intervals in each frame is fixed, each channel has a fixed capacity. The information delay is a function only of the distance and the access time, because there is no conflict in accessing the resources (time intervals).

In ATD (asynchronous time-division) multiplexing, the time intervals used by a communication channel are neither inside a frame nor previously assigned. Any time interval can be assigned to any channel. Each information unit carries an appropriate label that identifies the channel it belongs to.
With this scheme, any source may transmit information at any time, provided that there are enough free resources in the network.

11.5.1.2. Switching fundamentals

The switching concept refers to the idea of routing information from an origin-node to an end-node. The different switching techniques have already been discussed in 11.4.1-11.4.4.

11.5.1.3. Multiplexing and switching techniques used in ATM networks

ATM networks use ATD (asynchronous time-division) as the multiplexing technique and cell switching as the switching technique.

With ATD multiplexing, variable bit rate sources can be connected to the network because of the dynamic assignment of time intervals to channels.

Circuit switching is not a suitable technique for variable bit rate sources because, after connection establishment, the bit rate with this switching technique must be constant. This fixed assignment is not just an inefficient use of the available resources but also contradicts the main goal of B-ISDN (broadband integrated services digital network), where each service has different requirements. ATM networks will be a key element in the development of B-ISDN, as stated in the ITU (International Telecommunication Union) recommendation I.121.

General packet switching is not a suitable solution for ATM networks either, because of the difficulty of integrating real-time services. However, as it has the advantage of efficient resource usage for bursty sources, the switching technique adopted in ATM networks is a variant of it: cell switching.

Cell switching works similarly to packet switching. The differences between the two are the following:

• All information (data, voice, video) is transported from the origin-node to the end-node in small, constant-size packets of 53 octets, called cells (in traditional packet switching the packet size is variable).
• Only lightweight protocols are used, in order to allow fast switching in the nodes. As a drawback, the protocols are less efficient.

• Signaling is completely separated from the information flow, in contrast to packet switching, in which information and signaling are mixed.

• Traffic flows of arbitrary bit rate can be integrated in the same network.

The size of the ATM cell header is 5 octets (approx. 10 % of the total size of the cell). This small header allows fast processing in the network. The size of the cell payload is 48 octets. This small payload allows short store-and-forward delays in the network switching nodes (see figure 11.15).

The decision about the payload size was a trade-off between different proposals. While conventional communication prefers longer payloads to reduce information overhead, video communication, being more sensitive to delays, favours smaller ones. The choice of the current payload size was a Solomonic decision. In Europe, the preferred payload size was 32 octets, but in the USA and Japan the preferred payload size was 64 octets. Finally, at a meeting held in Geneva in June 1989, it was agreed to adopt the average of the two proposals as the payload size: 48 octets.

11.5.2. ATM network interfaces

In ATM networks, the interface between the network user (either an end-node or a gateway to another network) and the network is called the UNI (User-Network Interface). The UNI specifies the possible physical media, the cell format, the mechanisms to identify different connections established through the same interface, the total access rate and the mechanisms to define the parameters that determine the quality of service.

The interface between a pair of network nodes is called the NNI (Network-Node Interface). This interface is mainly dedicated to routing and switching between nodes. It is also designed to allow interoperability between switching fabrics of different companies.

11.5.3. ATM Cell format
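As a small worked example of the cell sizes quoted in section 11.5.1.3 (5-octet header, 48-octet payload) and of the header fields discussed in this section, the sketch below computes the header overhead and builds a UNI header. The field widths (GFC 4 bits, VPI 8, VCI 16, PT 3, CLP 1, HEC 8) and the HEC computation (CRC-8 with generator x^8 + x^2 + x + 1, result XORed with 0x55) are not given in the text; they are assumed here from the standard UNI cell layout and should be checked against the relevant ITU-T recommendations. The function names are hypothetical.

```python
HEADER_OCTETS = 5
PAYLOAD_OCTETS = 48
CELL_OCTETS = HEADER_OCTETS + PAYLOAD_OCTETS  # 53 octets per cell


def header_overhead():
    """Fraction of each cell taken by the header (the text rounds it to 10 %)."""
    return HEADER_OCTETS / CELL_OCTETS


def hec(first_four_octets):
    """CRC-8 over the first four header octets.

    Assumption: generator x^8 + x^2 + x + 1 (0x07), final XOR with 0x55.
    """
    crc = 0
    for octet in first_four_octets:
        crc ^= octet
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ 0x07) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc ^ 0x55


def pack_uni_header(gfc, vpi, vci, pt, clp):
    """Build a 5-octet UNI header. Assumed widths: 4, 8, 16, 3 and 1 bits."""
    for value, bits in ((gfc, 4), (vpi, 8), (vci, 16), (pt, 3), (clp, 1)):
        if not 0 <= value < (1 << bits):
            raise ValueError("field value does not fit in its bit width")
    word = (gfc << 28) | (vpi << 20) | (vci << 4) | (pt << 1) | clp
    first_four = word.to_bytes(4, "big")
    return first_four + bytes([hec(first_four)])


header = pack_uni_header(gfc=0, vpi=1, vci=32, pt=0, clp=0)
print(len(header), round(100 * header_overhead(), 1))
```

The exact overhead is 5/53, slightly below the 10 % quoted above. At the NNI the GFC field is absent, so a real implementation would need a second packing routine for that interface.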
The header format depends on whether a cell is at the UNI or at the NNI. The functions of the cell header fields are the following (Fig 11.15):
GFC (Generic Flow Control). This field appears only at the UNI and is responsible for medium access control, as more than one end-user might be connected to the same UNI.
VPI, VCI (Virtual Path Identifier, Virtual Channel Identifier). A connection in an ATM network is uniquely defined by the combination of these two fields, which allows routing and addressing at two levels. The network routing function treats them as labels that can be changed in each node.
PT (Payload Type). The main purpose of this field is to distinguish between user information and OAM (Operation & Maintenance) information.
CLP (Cell Loss Priority). This field allows one of two priorities (high or low) to be assigned to a cell. For example, if a user does not meet the set-up requirements, the network can mark a cell as a low priority one. Low priority cells are the first to be dropped when congestion is detected in any of the ATM network node queues.
HEC (Header Error Control). This field allows errors in the header information to be detected.

Figure-11.15:

Cells can be classified into the following types:
Non-assigned cells: cells with no useful information. They pass transparently through the physical layer and arrive at the remote ATM layer without modification.
Empty cells: also cells with no useful information. When the information sources have no cell to send, the physical layer inserts these cells to match the cell flow to the maximum transmission capacity. They never arrive at the remote ATM layer because the physical layer filters them out.
Metasignaling cells: cells used to negotiate the establishment of a virtual circuit between the network and the end-user.
Once the virtual circuit has been established, all set-up and release operations use this circuit.
Broadcasting cells: cells whose end-node is every node connected to the same interface.
Physical layer cells: cells with OAM (Operations & Maintenance) information for the network physical layer.

11.5.4. Protocol Architecture

The protocol stack architecture used in ATM networks considers three different planes:
User plane, whose main function is the transmission of user information.
Control plane, responsible for signaling information transfer, connection control and admission control.
Management plane, for all OAM operations.

We now describe the functions of the different layers in the user plane of the protocol stack.

11.5.4.1 Physical layer

This is the layer responsible for information transport. It is divided into two sublayers.
PM (Physical Medium) sublayer. It provides bit synchronization, line coding, electro-optical conversion and the transmission medium (currently, monomode optical fibers).
TC (Transmission Convergence) sublayer. The functions associated with this sublayer are the following:
  HEC (Header Error Control) field generation and checking.
  Matching of the cell flow to the maximum transmission capacity.
  Cell delimitation. When an end-node receives a bit flow, it needs to discern where each cell begins and ends, i.e. where the header of each cell is located within the flow. The method consists of searching the flow for an octet that is the HEC of the four previously received octets.
  Adaptation of the cells received from the ATM layer to the specific format used in the transmission.

11.5.4.2. ATM layer

This layer provides a connection-oriented service, independently of the transmission medium used. Its main functions are the following:
Cell multiplexing and demultiplexing from several connections into a single cell flow. A pair of identifiers, VCI/VPI, characterizes each connection.
Cell switching.
This function consists of replacing the input VCI/VPI pair with a different output pair.
Cell header generation/extraction (except the HEC field, whose generation/checking is the responsibility of the physical layer).
Flow control and medium access control for those UNIs shared by more than one terminal.

11.5.4.3. AAL (ATM Adaptation Layer)

In the transmitter, this layer adapts the information coming from higher layers to the ATM layer; in the receiver, it adapts the ATM services to the requirements of the higher layers. It is divided into two sublayers:
CS (Convergence Sublayer). It is divided into two parts: SSCS (Service Specific CS) and CPCS (Common Part CS).
SAR (Segmentation And Reassembly). In the transmitter, its function consists of segmenting the CS data units into data units whose length equals the cell payload: 48 octets. In the receiver, payloads are reassembled to reconstruct the initial data units that were handed to the network. This reassembly function is assigned to the end-nodes and not to the intermediate nodes.

11.5.5. ATM switching

As cell switching networks, ATM networks require a connection establishment. It is at this moment that all the communication requirements are specified: bandwidth, delay, information priority and so on. These parameters are defined for each connection and, independently of what is happening at other network points, determine the connection quality of service (QoS). A connection is established if and only if the network can guarantee the quality demanded by the user without disturbing the quality of already existing connections.

In ATM networks it is possible to distinguish two levels in each virtual connection, each of them defined with an identifier:
VPI, Virtual Path Identifier
VCI, Virtual Channel Identifier

Virtual paths are associated with the highest level of the virtual connection hierarchy.
A virtual path is a set of virtual channels connecting ATM switches to ATM switches or ATM switches to end-nodes.

Virtual channels are associated with the lowest level of the virtual connection hierarchy. A virtual channel provides unidirectional communication between end-nodes, between gateways and end-nodes, and between LANs (Local Area Networks) and ATM networks. As the communication provided is unidirectional, each full-duplex communication consists of two virtual channels (each of them following the same path through the network).

Virtual channels and paths can be established dynamically, by signaling protocols, or permanently. Usually, paths are permanent connections while channels are dynamic ones. In an ATM virtual connection, the cell sequence at the input is always preserved at the output.

In ATM networks, cell routing is based on the VPI/VCI pair. This information is not an explicit address but a label, i.e. cells do not carry the end-node address in their headers but identifiers that change from switch to switch before arriving at the end-node. Switching in a node begins by reading the VPI/VCI fields of the input cell header (empty cells are handled in a special way: once identified, they are simply dropped at the switch input). This pair of identifiers is used to access the routing table in the switch and obtain, as a result, the output port and a newly assigned VPI/VCI pair. The next switch in the path uses this new pair of identifiers in the same way, and the procedure is repeated.

Switches can be of two types:
VP switches: they analyze only the VPI to route cells. As a virtual path (VP) groups several virtual channels (VC), if the VCIs are not considered, all the VCs associated with a VP are switched together.
VC switches: both identifiers, VPI and VCI, are analyzed to route cells.

11.5.6.
ATM services

In an ATM network it is possible to negotiate different levels or qualities of service to adapt the network to the applications and to offer users a flexible way to access the resources.

If we study the main service characteristics, we can establish a service classification and define different adaptation levels for each service. Four different service classes are defined for ATM networks (Table 11.1).

CLASS  BINARY RATE  DELAY         CONNECTION-ORIENTED  APPLICATIONS
A      Constant     Constant      Yes                  Telephony, voice
B      Variable     Constant      Yes                  Compressed voice and video
C      Variable     Not constant  Yes                  Data applications
D      Variable     Not constant  No                   LAN interconnections

Table-11.1:

Once the different services have been characterized, it is possible to define the different adaptation layers. There are four adaptation layers in ATM networks:
AAL1, for class A services.
AAL2, for class B services.
AAL3/4, for class C or D services.
AAL5, also for class C or D services.

11.5.7. Traffic control in ATM networks

The main objective of the traffic control function in ATM networks is to guarantee optimal network performance in the following aspects:
Number of cells dropped in the network.
Cell transfer delay.
Cell transfer delay variance, or delay jitter.

Basically, traffic control in ATM networks is a preventive approach: it avoids congestion states, whose immediate effects are excessive cell dropping and unacceptable end-to-end delays.

Traffic control can be applied from two different sides. On the network side, it incorporates two main functions: Call Acceptance Control (CAC) and Usage Parameter Control (UPC). On the user side, it mainly takes the form of either source rate control or layered source coding (prioritization) to conform to the service contract specification.

11.5.7.1.
Call acceptance control

CAC (call acceptance control) is implemented during call setup to ensure that the admission of a call will not disturb the existing connections and that enough network resources are available for this call. It is also referred to as call admission control. The CAC results in a service contract.

11.5.7.2. Usage parameter control

UPC (usage parameter control) is performed during the lifetime of a connection, to check whether the source characteristics respect the service contract specification. If excessive traffic is detected, it can be either immediately discarded or tagged for selective discarding if congestion is encountered in the network. It is also referred to as traffic monitoring, traffic shaping, bandwidth enforcement or cell admission control. The Leaky Bucket (LB) scheme is a widely accepted implementation of a UPC function.

This chapter edited by E. Juarez, L. Cominelli and D. Mlynek a joint production of EJM 17/2/1999

Chapter 11 VLSI FOR TELECOMMUNICATION SYSTEMS

11.6. Case study: ATM switch

This section shows the architecture of the critical routing part of an ATM switch. Before discussing an existing ATM chip, we present the technological constraints that drive the design. The switch functionality can be split into two main parts:
A routing function to carry data from one input port to an output port.
A queuing function to temporarily store the incoming data that cause the blocking problem.

11.6.1. Main switching considerations

11.6.1.1.
Solving the blocking problem (Head of line)

This section shows why output buffering is a better solution to the blocking problem (section 11.2.2.1 shows the blocking scenario).

Consider a simple 2X2 switch (2 input ports and 2 output ports; see figure 11.16). Each number represents a destination port address. Queued cells are in yellow and routed cells are in blue.

Figure-11.16: Input and Output buffering sequence

With an input buffering technique we need four cycles to route all cells.
The first cycle shows the queuing of one cell and the routing of the other.
The second cycle shows the routing of the previously queued cell and the queuing of two incoming cells.
The third cycle shows the routing of the two previously queued cells and the queuing of the incoming cell.
The last cycle shows the routing of the last cell.

With an output buffering technique we need three cycles to route all cells.
The first cycle shows the routing of all incoming cells. One is queued; the other is sent through the connected output line #2.
The second cycle shows the routing of the second pair of incoming cells. One is queued in queue #2, the previously queued cell is sent through the connected output line #2 and the last one is sent directly through output line #1.
The last cycle shows the sending of the queued cells through line #2 and the routing and sending of the cell to line #1.

In certain cases, output buffering allows smaller cell latency. Therefore, a lower memory capacity is needed in the switch. For these reasons, the output buffering technique has been chosen to solve the blocking problem.

After this choice, we need to know how the routing function can be implemented. The next section presents the currently used techniques.

11.6.1.2. Routing function implementation

The simplest technique to implement the routing function is to link all the inputs to all the outputs.
By programming this array of connections, data can be routed from any of the input ports to any of the output ports. We can implement this function using a crossbar architecture.

11.6.1.2.1. Crossbar switch

A crossbar is an array of buses and transmission gates implementing paths from any input port to any output port. This section describes this technique. To understand its limitations, we first describe the transmission gate.

Figure-11.17: Electric view of a transmission gate.

11.6.1.2.1.1 Transmission gate

Figure 11.17 shows an electric view of a transmission gate. Figure 11.18 shows a schematic view of the transmission gate. Two complementary transistors transmit the input signal without degradation (the NMOS transistor transmits VSS and the PMOS transistor transmits VDD). The Command input enables or disables the transmission function. For instance:
If Command = VSS, both transistors are off.
If Command = VDD, both transistors conduct.
Cin represents the parasitic load on the input line and Cout represents the parasitic load on the output line.

Figure-11.18: Schematic view of a transmission gate.

11.6.1.2.1.2 The crossbar switch

If we wire an array of transmission gates as shown in figure 11.19, we obtain a programmable system capable of routing any incoming data to any output port.

Figure-11.19: 2X2-crossbar switch.

We can implement a 4X4 switch by repeating this 2X2 structure (see figure 11.20).

Figure-11.20: 4X4-crossbar switch.

We can repeat this structure N times to obtain the required number of input and output ports. This approach causes a bus load problem: the larger the number of input and output ports, the greater the load and length of each bus.
For example, in figure 11.20 the load on input bus #1 is four times the input load of one transmission gate plus the parasitic capacitance of the wire. Therefore, the routing delay from an input to an output is long. We cannot use this technique to implement high-throughput switches with a large number of ports.

To solve this problem, a switch based on a network of 2X2 switches has been developed. The next section shows how these switches are implemented.

11.6.1.2.2. The Batcher-Banyan switch

Figure 11.21 shows the 2X2-switch module. This switch is composed of one 2X2 crossbar implementing the routing function and four FIFO memories implementing the output buffer function. The delay to carry data from an input to an output is lower than that of the crossbar switch because the buses are short and are loaded by only two transmission gates.

Figure 11.22 shows an 8X8 Banyan switch. Input ports are connected to output ports by a three-stage routing network. There is exactly one path from any input to any output port. Each 2X2-switch module simply routes one input to one of its two outputs.

Figure-11.21: 2X2 switch.

Figure-11.22: Banyan network switch.

A blocking scenario in a Banyan switch is shown in figure 11.23. In this figure, red paths show successfully routed cells and blue ones show blocked cells. The numbers at the inputs represent the destination output port number of each cell. All the incoming cells have a different output destination, but only two cells are routed. An internal collision causes this problem.

A solution is to make sure that this internal collision scenario never appears. This can be achieved if incoming cells are sorted before the Banyan routing network.
The sorter should sort the incoming cells according to bitonic sequence rules. A Batcher sorter built from a network of 2X2 comparators implements this function.

Figure-11.23: Blocking in a Banyan network

Figure 11.24 shows some routing scenarios without internal collisions.

Figure-11.24: Routing scenario without collision

For instance, the following sequence is a bitonic sequence: {7, 5, 2, 1, 0, 3, 4, 6}. The rules to identify bitonic sequences are as follows:
An ascending order sequence, {0, 1, 2, 3, 4, 5, 6, 7}, as in the first scenario of figure 11.24.
A descending order sequence.
An ascending order sequence followed by a descending order sequence.
A descending order sequence followed by an ascending order sequence, {7, 5, 2, 1, 0, 3, 4, 6}, as in the second scenario of figure 11.24.

This well-known architecture is currently used to implement the switching function. The next section discusses an existing switching chip that uses this technique.

11.6.2. ATM Cell Switching

11.6.2.1. ATM high-level Switch Architecture

Table 11.2 shows the main functions of each ATM layer.

FUNCTION                                      SUBLAYER  LAYER
Convergence                                   CS        AAL
Segmentation and reassembly                   SAR       AAL
GFC field management                                    ATM
Header generation and extraction                        ATM
VCI and VPI processing                                  ATM
Multiplexing and demultiplexing of the cells            ATM
Flow rate adaptation                          TC        PL
HEC generation and check                      TC        PL
Cell synchronization                          TC        PL
Transmission adaptation                       TC        PL
Synchronization                               PM        PL
Data emission and detection                   PM        PL

Table-11.2: ATM layer structure

AAL: ATM Adaptation Layer
CS: Convergence Sublayer
SAR: Segmentation and Reassembly sublayer
ATM: ATM Layer
PL: Physical Layer
TC: Transmission Convergence
PM: Physical Medium

Figure 11.25 shows a high-level switch architecture. Each block implements some of the functions described in Table 11.2.
Figure-11.25: Switch architecture

An explanation of the general functionality of each layer can be found in section 11.5.4.

The management block drives and synchronizes the other layers; for instance, it drives the control check and administrative functions.

High data transfer rates can be reached (up to some gigabits per second). One of the critical blocks of this architecture is the switching module (outlined in bold in figure 11.25). The previous section discussed one of the most widely used techniques to implement this function. In the next section we discuss an existing chip designed with the previously described techniques.

11.6.2.2. Existing Switch Architecture

Figure 11.26, Yam[97], shows the mapping between the chip architecture and the functional architecture.

Figure-11.26: Comparison Functional to Real architecture

There are three main blocks in this chip:
The first block implements the header processing.
The second one implements the commutation (switching) table.
The third one implements the switch function.

Figure 11.27 shows the details of the entire switching system.

Figure-11.27: switching system

The switching network module is mainly composed of the following blocks: a Batcher-Banyan network, one input multiplexer bank and one output demultiplexer bank. The Batcher-Banyan network implements the switching function. The multiplexer and demultiplexer banks are used to reduce the internal Batcher-Banyan network bus width (from 8 bits to 2 bits and vice versa). This means that switching one incoming 8-bit word in one cycle requires four internal Batcher-Banyan network cycles. A drawback of the bus width reduction is a fourfold increase in the internal switch frequency.
Therefore, the chip designers had to choose a faster technology to keep a high-throughput switching function. In this case they chose GaAs technology, which is commonly used for high-frequency systems.

11.7. Case study: ATM transmission of multiplexed-MPEG streams.

Introduction

Available ATM network throughputs, in the order of Gb/s, allow broadband applications to interconnect using ATM infrastructures. As a case study that gives some intuition about the main elements found in a telecommunication system-on-a-chip, we will consider the architectural design of an ATM ASIC. The architecture is conceived to serve applications that need to multiplex and transport multimedia information to an end-node through an ATM network. Interactive multimedia and mobile multimedia are examples of applications that will use such a system.

Interactive multimedia (INM) relates to the network delivery of rich digital content, including audio and video, to client devices (e.g. desktop computer, TV and set-top box), typically as part of an application having user-controlled interactions. It includes interactive movies, where viewers can explore different subplots; interactive games, where players take different paths based on previous event outcomes; training-on-demand, in which training content is tuned to each student's existing knowledge, experience, and rate of information absorption; interactive marketing and shopping; digital libraries; video-on-demand and so on.
Mobile multimedia applies in general to every scenario in which remote delivery of expertise to mobile agents is needed. It includes applications in computer supported cooperative work (CSCW), where mobile workers with difficult problems receive advice to enhance the efficiency and quality of their tasks, and emergency-response applications (ambulance services, police, fire brigades).

A system offering this service of multiplexing and transport through ATM networks should meet the following requirements if it is to cover the applications described above:
The system should easily scale the number of streams and the bandwidth associated with each of them to accommodate future increases in service demand.
The system should share the available multiplex bandwidth fairly between all the different sources. This feature makes it possible either to increase the number of streams to be multiplexed when the available bandwidth is fixed or to reduce the bandwidth needed to multiplex a fixed number of them.
The system should guarantee a bandwidth reservation if sources with heterogeneous traffic patterns are to be served simultaneously.
The system should be able to serve mobile/portable sources connected by either wireless or infrared links.
The system should be able to control the quality of service (QoS) offered: if no control is applied to keep it constant, image quality degradation will depend sharply on transient congestion conditions in the network, where information is dropped randomly.
Last but not least, the system should be integrated on a single chip.

11.7.1. A system view

Distributing the multiplexing function between the different sources allows the requirements of mobility/portability and stream scalability to be met efficiently.
Figure-11.28:

This distribution can be achieved with a basic unit that applies the multiplexing function locally to each source, as can be seen in figure 11.28. This basic unit is repeated for each stream that we want to multiplex. Figure 11.29 shows how the basic unit works: there is a queue where cells carrying information from the source wait until the MAC (Medium Access Control) unit gives permission for them to be inserted. When an empty cell is found and the MAC unit allows insertion, this empty cell disappears from the flow and a new cell is inserted in its place.

Figure 11.30 shows the details of this basic unit. There are four main blocks:
Cell multiplexing unit: where empty cells are replaced by source cells when the MAC makes the decision.
MAC: decides when the information coming from the video source is introduced into the high-speed flow.
QoS control: manages video information in order to produce a smooth degradation of the quality of service when the network suffers from congestion.
Protocol processing & DMA blocks: they, respectively, adapt the information coming from the source for ATM transmission and communicate with the software running on the host processor.

Figure-11.29:

The path followed by a cell from the source to the output module when it is multiplexed is also shown in figure 11.30.

Figure-11.30:

In what follows, we go into the details of the QoS block, the MAC block and the protocol processing and DMA block, leaving the cell multiplexing unit block to the end in order to explain the main design features of telecommunication ASICs.

11.7.2. Quality of Service (QoS) control (Prioritization)

One potential problem in ATM networks, caused by the bursty nature of traffic, is cell loss.
When several sources transmit at their peak rates simultaneously, the buffers available at some switches may overflow. The subsequent cell drops lead to severe degradation in service quality (multiplicative effect) due to the loss of synchronization at the decoder. Figure 11.31 shows the effect of cell drops on the quality of the received image. The decoded picture has been transmitted through an ATM network with congestion problems.

Rather than dropping cells randomly during network congestion, we can specify to the ATM network the relative importance of the different cells (prioritization) so that only the less important ones are dropped. This is possible in ATM networks thanks to the CLP (cell loss priority) bit of the cell header. If we do so, when the network enters a period of congestion, cells are dropped in an intelligent fashion (non-priority cells first) so that the end-user perceives only a small degradation in the service's QoS.

Figure-11.31:

However, when the network is operating under normal conditions, both high priority and low priority cells are successfully transmitted and a high quality service is available to the end user. In the worst-case scenario, the end user is guaranteed a predetermined minimum QoS dictated by the high priority packets.

Figure-11.32:

Figure-11.33:

Figures 11.32 and 11.33 also show the effect of cell drops on the quality of the received image. However, as the priority mechanism is applied (low frequency image information as high priority data and high frequency image information as low priority data), an improvement in the quality of the decoded image is observed.
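The prioritized dropping described above can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the text does not say how a switch buffer selects its victim, so the push-out policy below (a high priority arrival evicts the most recently queued low priority cell) is a hypothetical choice, as are the names `Cell` and `SwitchBuffer`. The sketch uses the usual convention that CLP = 1 marks a low priority cell.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    name: str
    clp: int  # assumed convention: 0 = high priority, 1 = low priority


class SwitchBuffer:
    """Bounded buffer that sacrifices low priority cells when it is full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cells = []    # queued cells, oldest first
        self.dropped = []  # cells discarded because of congestion

    def enqueue(self, cell):
        """Queue a cell; return True if it was accepted."""
        if len(self.cells) < self.capacity:
            self.cells.append(cell)
            return True
        if cell.clp == 0:
            # Congestion: make room by evicting the newest queued low priority cell.
            for i in range(len(self.cells) - 1, -1, -1):
                if self.cells[i].clp == 1:
                    self.dropped.append(self.cells.pop(i))
                    self.cells.append(cell)
                    return True
        # The arrival is low priority, or only high priority cells are queued.
        self.dropped.append(cell)
        return False


def demo():
    """Feed a burst of six cells into a buffer with room for three."""
    buf = SwitchBuffer(capacity=3)
    burst = [Cell("lp1", 1), Cell("hp1", 0), Cell("lp2", 1),
             Cell("hp2", 0), Cell("hp3", 0), Cell("lp3", 1)]
    for cell in burst:
        buf.enqueue(cell)
    return [c.name for c in buf.cells], [c.name for c in buf.dropped]
```

In this burst all three high priority cells survive and only low priority cells are lost, which corresponds to the small, controlled degradation that the priority mechanism aims for.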
Figure 11.34 shows the effect of non-priority cell drops on the high frequency portion of the decoded information.

Figure-11.34:

11.7.3. Medium access control (MAC)

The basic function of the distributed multiplexing algorithm is to incorporate low speed ATM sources into a single ATM flow. When two or more sources try to access the common resource, a conflict can occur. The medium access control (MAC) algorithm should resolve the conflicts between two or more sources simultaneously accessing the high-speed bus.

Each MAC block controls the behavior of a basic unit. It can be considered as a state machine that acts depending on the basic unit inputs: an empty cell from the high-speed bus, a cell from the MPEG source connected to it and access requests from other basic units.

The MAC algorithm can adopt the DQDB (Distributed Queue Dual Bus) philosophy, taking into account that there is just one information flow (downstream). A dedicated channel is responsible for sending requests upstream. The main objective of the DQDB protocol is to create and maintain a global queue of access requests to the shared bus. That queue is distributed among all the connected basic units. If a basic unit wants to send an ATM cell, a request is sent to all its predecessors. Therefore, each basic unit receives, from its neighbor on the right, the access requests coming from every basic unit on the right. These requests and the requests of the current basic unit are sent to the neighbor on the left. For each request, an empty cell passes through a basic unit without being assigned.

When QoS control is applied, these algorithms should be modified to allow all HP cells to be sent before any LP cell queued at any basic unit. This mechanism ensures that critical information is sent first when congestion appears.
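The distributed queue can be made concrete with a short Python sketch. It is a simplified model under stated assumptions, not the MAC of the ASIC: requests are assumed to reach all upstream units instantaneously, the HP/LP modification is left out, and the names `BasicUnit`, `Bus`, `rq` (request counter) and `cd` (countdown counter) are introduced here for illustration, loosely following the usual DQDB counter terminology.

```python
from collections import deque


class BasicUnit:
    """One basic unit with a request counter (rq) and a countdown counter (cd)."""

    def __init__(self, name):
        self.name = name
        self.queue = deque()  # cells waiting for insertion
        self.rq = 0           # downstream requests not yet served by an empty cell
        self.cd = None        # empty cells to let pass before inserting (None: idle)


class Bus:
    """Units are listed from upstream (index 0) to downstream; empty cells flow downstream."""

    def __init__(self, names):
        self.units = [BasicUnit(n) for n in names]

    def _register_head(self, i):
        unit = self.units[i]
        unit.cd, unit.rq = unit.rq, 0    # queue up behind the requests seen so far
        for upstream in self.units[:i]:  # request sent upstream (assumed instantaneous)
            upstream.rq += 1

    def submit(self, i, cell):
        """The source attached to unit i hands over a cell for transmission."""
        unit = self.units[i]
        unit.queue.append(cell)
        if unit.cd is None:
            self._register_head(i)

    def empty_cell(self):
        """Send one empty cell down the bus; return the cell inserted into it, or None."""
        for i, unit in enumerate(self.units):
            if unit.cd is None:
                if unit.rq > 0:
                    unit.rq -= 1         # this empty cell serves a downstream request
            elif unit.cd > 0:
                unit.cd -= 1             # an earlier request is served first
            else:
                cell = unit.queue.popleft()
                unit.cd = None
                if unit.queue:
                    self._register_head(i)
                return cell              # the cell is now busy for downstream units
        return None
```

For example, with units A, B and C listed from upstream to downstream, submitting a cell at C, then at A, then at B and calling `empty_cell()` three times yields the cells in that same arrival order, although A sees every empty cell first. The counters let the units behave as one global first-come, first-served queue.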
11.7.4. Communication with the host processor: protocol processing & DMA.

Another important point is the information exchange between the software running on the host processor and the basic unit. The main mechanism used for these transactions is DMA (Direct Memory Access). In this technique all communication passes through special shared data structures, which can be read from or written to by both the processor and the basic unit and which are allocated in the system's main memory.

Any time data is read from or written to main memory, it is considered to be "touched". A design should minimize data touches because of the large negative impact they can have on performance.

Let us imagine we are running, on a typical monolithic Unix kernel machine, an INM application over an implementation of the AAL/ATM protocol. Figure 11.35 shows all the data touch operations involved in transmitting a cell from host main memory to the basic unit. The sequence of events is as follows:

1. The application generates the data to be sent and writes it to its user-space buffer. Afterwards, it issues a system call to the socket layer to transmit the data.

To copy the data from the user buffer into a set of kernel buffers, both of them located in main memory, steps 2 and 3 are needed:

2. The socket layer reads the data from main memory.
3. The socket layer writes the data to main memory.

Figure-11.35:

To adapt this data to ATM transmission, step 4 is needed.

4. The AAL layer implementation reads the data so that it can segment it and compute the checksum that has to be inserted in the AAL_PDU trailer.
5. The basic unit reads the data from the kernel buffers, adds the ATM cell header and transmits it.

Figure 11.36 shows what happens in hardware for the events explained above.
Figure-11.36:

Some of the lines are dashed to indicate that the corresponding read operation might be satisfied from cache memory rather than from main memory. In the best case there are three data touches for any given piece of data; in the worst case there are five.

11.7.4.1. A quantitative approach to data touches

Why is the number of data touches so important? Let us consider a main memory bandwidth of about 1.5 GB/s for sequential writes and 0.7 GB/s for sequential reads. If we assume that on average there are three reads for every two writes (see figure 11.36), the resulting average memory bandwidth is approximately 1.0 GB/s. If our basic unit requires five data touch operations for every word in every cell, then the average throughput we can expect will be only a fifth of the average memory bandwidth, i.e. 0.2 GB/s. Clearly, every data touch that we can save will provide a significant improvement in throughput.

11.7.4.2. Reducing the number of data touches

The number of data touches can be reduced if either the kernel buffers, or both the user and kernel buffers, are allocated from extra on-chip memory added to the basic unit. In figure 11.37, kernel buffers are allocated from memory on the basic unit to reduce the number of data touches. Programmed I/O is the technique used to move data from the user buffer to these on-chip kernel buffers (the data is touched by the processor before it is transferred to the basic unit).

Figure-11.37:

Figure 11.38 shows the same data touch reduction, but with DMA used instead of programmed I/O. In this case, since the data arriving from main memory at the basic unit is not touched by the processor, the processor cannot compute the checksum needed in the AAL layer; this computation therefore has to be implemented in hardware in the basic unit.
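The estimate of Section 11.7.4.1, and the gain obtained by removing data touches, can be reproduced with a few lines of arithmetic. The sketch below assumes the simple weighted average that the text appears to use (three reads at 0.7 GB/s for every two writes at 1.5 GB/s). The function names are illustrative only.

```python
def average_bandwidth(read_bw, write_bw, reads, writes):
    """Average memory bandwidth weighted by the read/write mix (GB/s)."""
    return (reads * read_bw + writes * write_bw) / (reads + writes)


def expected_throughput(avg_bw, touches):
    """Each word crosses the memory interface once per data touch."""
    return avg_bw / touches


avg = average_bandwidth(read_bw=0.7, write_bw=1.5, reads=3, writes=2)
print(f"average bandwidth: {avg:.2f} GB/s")
for touches in (5, 3, 2, 1):
    print(f"{touches} data touches: {expected_throughput(avg, touches):.2f} GB/s")
```

Under these assumptions the throughput scales inversely with the number of touches, so going from five touches to three recovers roughly two thirds more throughput.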
Figure-11.38:

Figure 11.39 shows an alternative that involves no main memory accesses at all (zero data touches). Both user and kernel buffers are allocated from on-chip memory. Although this approach drastically reduces the number of data touches, it has two disadvantages:

A very large amount of memory is needed to allocate the user and kernel buffers.
The API (Application Programming Interface) presented to programmers in this kind of framework is incompatible with the existing socket-based API.

Figure-11.39:

11.7.5. Cell multiplexing unit: explanation of main design features of Tcomm. ASICs

There are four modules in the Cell Multiplexing Unit (figure 11.40):

Input module
Input FIFO module
Multiplexing module
Output module

Figure-11.40:

Their functions and main design features are as follows.

The Input and Output modules implement the UTOPIA protocol (levels one and two), the ATM Forum standard communication protocol between an ATM layer and a Physical layer entity. Common design elements used in both modules are registers, finite-state machines, counters, and logic to compare register values, as shown in figures 11.41 and 11.42.

Figure-11.41:
Figure-11.42:

The FIFO module isolates two different clock domains: the input cell clock domain and the output cell clock domain. It also stores cells (first in, first out) when the UTOPIA protocol stops the cell flow. Having different clock domains is a characteristic feature of telecommunication systems-on-a-chip, and it adds a new dimension to the design complexity: unsynchronized clock domains cause metastable behavior in the flip-flops that interface the two domains. If reliable system operation is required, techniques that reduce the probability of metastable behavior in a flip-flop have to be implemented. The FIFO queue is implemented with a dual-port RAM and two registers that store addresses: the write pointer and the read pointer. Part of this queue is shown in figure 11.43.

Figure-11.43:

The Multiplexing module replaces empty cells with assigned ones. The insertion module has two registers to avoid losing parts of a cell when the UTOPIA protocol stops, another two registers to delay the information coming from the network, and one register for pipelining the module (figure 11.44).

Figure-11.44:

This chapter edited by E. Juarez, L. Cominelli and D. Mlynek
a joint production of EJM 17/2/1999

11.8. Conclusions

Through these two case studies within the ATM domain, we have shown the main characteristics common to telecommunication ASIC design. Briefly, these features are the following:

Different clock domains can coexist; therefore, techniques that reduce the probability of metastable behavior have to be applied in the design.
High-throughput networks imply high-frequency clock designs (hundreds of megahertz).
FIFO memories are usually needed either to separate different clock domains or to store information before accessing a common resource.
Designs are mainly dominated by the presence of registers.

DSP course 1998-99

Chapter 12 Digital Signal Processing Architectures

Introduction
History
Typical DSP applications
The FIR Example
General Architectures
Data Path
Addressing
Peripherals
How is a DSP different from a general-purpose processor
Superscalar Architectures

12.1 Introduction

Digital signal processing is concerned with the representation of signals in digital form and with the transformation or processing of such signal representations using numerical computation.

Sophisticated signal processing functions can be realized using digital techniques: numerous important signal processing techniques are difficult or impossible to implement using analog (continuous-time) methods. Reprogrammability is a strong advantage over conventional analog systems. Furthermore, digital systems are inherently more reliable, more compact, and less sensitive to environmental conditions and component aging than analog systems. The digital approach also allows a microprocessor to be time-shared (multiplexed) among a number of different signal processing functions.
12.2 History

Since the invention of the transistor and the integrated circuit, digital signal processing functions have been implemented on many hardware platforms, ranging from special-purpose architectures to general-purpose computers. One of the earliest descriptions of a special-purpose hardware architecture for digital filtering was published by Bell Labs in 1968.[1] The problem with such architectures, however, is their lack of flexibility. In order to realize a complete application, one needs to perform functions that go beyond simple filtering, such as control, adaptive coefficient generation, and non-linear functions such as detection.

The solution is to use an architecture that is more like a general-purpose computer, but which can perform basic signal processing operations very efficiently. This means satisfying the following criteria:

The ability to perform a multiply and an add operation in parallel in the time of one instruction.[2]
The ability to perform data moves to and from the arithmetic unit in parallel with arithmetic operations and modification of address pointers.
The ability to perform logical operations on data and alter control flow based on the results of these operations.
The ability to control operations by sequencing through a stored program.

In the 1960s and 1970s, multiple chips or special-purpose computers were designed for computing DSP algorithms efficiently. These systems were too costly to be used for anything but research or military applications. It was not until all of this functionality (arithmetic, addressing, control, I/O, data storage, control storage) could be realized on a single chip that DSP could become an alternative to analog signal processing for the wide span of applications that we see today.
In the late 1970s, large-scale integration technology reached the point of maturity at which it became practical to consider realizing a single-chip DSP. Several companies developed products along these lines, including AMI, Intel, NEC, and Bell Labs.

The first DSP generation

AMI S2811

AMI announced a "Signal Processing Peripheral" in 1978.[1] The S2811 was designed to operate in conjunction with a microprocessor such as the 6800 and depended upon it for initialization and configuration.[2] With a small, non-expandable program memory of only 256 words, the S2811 was intended to offload some math-intensive subroutines from the microprocessor. Therefore, as a peripheral, it could not "stand alone" as could DSPs from Bell Labs, NEC, and other companies. The part was to be implemented in an exotic technology called "V-groove." First silicon became available after 1979, and the part was never used in any volume product.[3]

Intel 2920

Intel announced an "Analog Signal Processor," the 2920, at the 1979 Institute of Electrical and Electronics Engineers (IEEE) Solid State Circuits Conference.[4] A unique feature of this device was its on-chip analog/digital and digital/analog converter capability. The drawback was the lack of a multiplier. Multiplication was performed by a series of instructions that shifted (scaled) and added partial products to an accumulator. Multiplication of two variables was even more involved, requiring conditional instruction execution.

In addition, the mechanism for addressing memory was limited to direct addressing, and the program could not perform branching.[5] As such, while it could perform some signal processing calculations a little more efficiently than a general-purpose microprocessor, it greatly sacrificed flexibility and bears little resemblance to today's single-chip DSP.
Too slow for any complete application, it was used as a component for part of a modem.[6]

NEC µPD7720

NEC announced a digital signal processor, the 7720, at the IEEE Solid State Circuits Conference in February 1980 (the same conference at which Bell Labs disclosed its first single-chip DSP). The 7720 does have all of the attributes of a modern single-chip DSP as described above. However, devices and tools were not available in the U.S. until as late as April 1981.[7]

The Bell Labs DSP1

The genesis of Bell Labs' first single-chip DSP was the recommendation of a study group that began to consider the possibility of developing a multipurpose, large-scale integration circuit for digital signal processing in January 1977.[8] Their report, issued in October 1977, outlined the basic elements of a minimal DSP architecture, which consisted of a multiplier/accumulator, an addressing unit, and control. The plan was for the I/O, data, and control memories to be external to the 40-pin DIP until large-scale integration technology could support their integration. The spec was completed in April 1978 and the design a year later. First samples were tested in May 1979. By October, devices and tools were distributed to other Bell Labs development groups. The device became a key component in AT&T's first digital switch, 5ESS, and many other telecommunications products. Devices with this architecture are still in manufacture today.

The first Bell Labs DSP was different from what was in the report. The DSP1 contained all of the functional elements found in today's DSPs, including a multiplier-accumulator (MAC), parallel addressing unit, control, control memory, data memory, and I/O. It fully meets the above criteria for a single-chip DSP.
The DSP1 was first disclosed outside AT&T at the IEEE Solid State Circuits Conference in February 1980.[9] A special issue of the Bell System Technical Journal was published in 1981 which described the architecture, tools, and nine fully developed telecommunications applications for the device.[10]

The following table summarizes the evolution of DSPs:

Date | Features | Example processors
First generation: 1979-1985 | Harvard architecture, hardwired multiplier | NEC µPD7720, Intel 2920, Bell Labs DSP1, Texas Instruments TMS320C10
Second generation: 1985-1988 | Concurrency, multiple busses, on-chip memory | TMS320C25, MC56001, DSP16 (AT&T)
Third generation: 1988-1992 | On-chip floating point operations | TMS320C30, MC96002, DSP32C (AT&T)
Fourth generation: 1992-1997 | Multi-processing features | TMS320C40&50
 | Image and video processors | TMS320C80
 | Low-power DSPs | (AT&T)
Fifth generation: 1997- | VLIW | TMS320C6x, Philips TriMedia, Motorola Starcore

12.3 Typical DSP applications

Digital signal processing in general, and DSP processors in particular, are used in a wide variety of applications, from military radar systems to consumer electronics. Naturally, no one processor can meet the needs of all applications. Criteria such as performance, cost, integration, ease of development, and power consumption are points to examine when designing or selecting a particular DSP for a class of applications. The table below summarizes different processor applications.

Table 1.
Common DSP algorithms and typical applications ([11])

DSP Algorithm | System Application
Speech coding and decoding | Digital cellular phones, personal communications systems, multimedia computers, secure communication
Speech encryption and decryption | Digital cellular phones, personal communications systems, secure communication
Speech recognition | Advanced user interfaces, multimedia workstation, robotics, automotive applications, digital cellular phones, …
Speech synthesis | Multimedia PCs, advanced user interface, robotics
Speaker identification | Security, multimedia workstations, advanced user interfaces
Hi-fi audio encoding and decoding | Consumer audio & video, digital audio broadcast, professional audio, multimedia computers
Modem algorithms | Digital cellular phones, personal communication systems, digital audio broadcast, digital signalling on cable TV, multimedia computers, wireless computing, navigation, data/fax modems, secure communications
Noise cancellation | Professional audio, advanced vehicular audio, industrial applications
Audio equalization | Consumer audio, professional audio, advanced vehicular audio, music
Ambient acoustics emulation | Consumer audio, professional audio, advanced vehicular audio, music
Audio mixing and editing | Professional audio, music, multimedia computers
Sound synthesis | Professional audio, music, multimedia computers, advanced user interfaces
Vision | Security, multimedia computers, advanced user interfaces, instrumentation, robotics, navigation
Image compression and decompression | Digital photography, digital video, multimedia computers, video-over-voice, consumer video
Image composition | Multimedia computers, consumer video, advanced user interfaces, navigation
Beamforming | Navigation, medical imaging, radar/sonar, signals intelligence
Echo cancellation | Speakerphones, modems, telephone switches
Spectral Estimation | Signals intelligence, radar/sonar, professional audio, music

12.4. The FIR Example

The Finite Impulse Response (FIR) filter is a convenient way to introduce the features needed in typical DSP systems. The FIR filter is described by the following equation:

y(n) = \sum_{k=0}^{N-1} h(k)\, x(n-k)

where x(n) is the input sequence, h(k) are the N filter coefficients, and y(n) is the output.

The following diagram shows an FIR filter. This illustrates the basic DSP operations:

additions and multiplications
delays
array handling to access coefficients

Each of these operations has its own special set of requirements.

Additions and multiplications require the processor to:
    fetch two operands
    perform the addition or multiplication (usually both)
    store the result or hold it for a repetition

Delays require the processor to:
    hold a value for later use

Array handling requires the processor to:
    fetch values from consecutive memory locations
    copy data from memory to memory

To suit these fundamental operations, DSP processors often have:

parallel multiply and add (MAC operation)
multiple memory accesses (to fetch two operands and store the result)
lots of registers to hold data temporarily
efficient address generation for array handling
special features such as delays or circular addressing

12.5. General Architectures

The simplest processor memory structure is a single bank of memory, which the processor accesses through a single set of address and data lines. This structure, which is common among non-DSP processors, is referred to as the Von Neumann architecture. In this implementation, data and instructions are stored in the same single bank, and one access to memory is performed during each instruction cycle.
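To make the memory traffic concrete, the FIR filter of Section 12.4 can be written as a short routine. This is a plain software sketch, not a model of any particular DSP: it assumes the usual direct-form structure, with input samples before time zero taken as zero, and the name `fir_filter` is illustrative. Each tap performs one multiply-accumulate and reads two operands (a coefficient and a delayed sample), which is the pair of fetches the memory architecture has to supply.

```python
def fir_filter(coeffs, samples):
    """Direct-form FIR: y[n] = sum over k of coeffs[k] * x[n - k], with x[n] = 0 for n < 0."""
    delay_line = [0.0] * len(coeffs)   # holds x[n], x[n-1], ..., x[n-N+1]
    output = []
    for x in samples:
        # Delay: shift older samples down one place and insert the new one.
        delay_line = [x] + delay_line[:-1]
        acc = 0.0
        # One multiply-accumulate (MAC) per tap, two operand fetches each.
        for h, xd in zip(coeffs, delay_line):
            acc += h * xd
        output.append(acc)
    return output


impulse_response = fir_filter([0.5, 0.25, 0.125], [1.0, 0.0, 0.0, 0.0])
print(impulse_response)
```

Feeding a unit impulse returns the coefficients themselves, which is a convenient check that the delay line and the accumulation are wired correctly.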
As seen previously, a typical DSP requirement is to execute a MAC operation in one cycle. This operation requires fetching two data words from memory, multiplying them together, and adding the product to the previous result. With the Von Neumann model it is not possible to fetch the instruction and the data in the same cycle. This is one reason why conventional processors do not perform well on DSP applications in general.

The solution to the memory access problem is known as the Harvard architecture and the modified Harvard architecture. The following diagram shows the Harvard architecture. The program counter is used to fetch an instruction from the program memory, which is then stored in the instruction register. In parallel, the Address Calculation Unit fetches one operand from the data memory and feeds it to the execution unit. This architecture allows one instruction word and one data word to be fetched in a single cycle. This system requires four buses: two address buses and two data buses.

The next picture represents the modified Harvard architecture. Two data words are now fetched from memory in a single cycle. Since it is not possible to access the same memory twice in the same cycle, this implementation requires three memory banks: a program memory bank and two data memory banks, commonly designated X and Y, each with its own set of buses (address and data).

12.6. Data Path

The data path of a DSP processor is where the vital arithmetic manipulations of signals take place. DSP processor data paths are highly specialized to achieve high performance on the types of computation most common in DSP applications, such as multiply-accumulate operations. Registers, adders, multipliers, comparators, logic operators, multiplexers, and buffers represent 95% of a typical DSP data path.
Multiplier

A single-cycle multiplier is the essence of a DSP, since multiplication is an essential operation in all DSP applications. An important distinction between multipliers in DSPs is the size of the product relative to the size of the operands. In general, multiplying two n-bit fixed-point numbers requires 2n bits to represent the correct result. For this reason, DSPs in general have a multiplier whose output is twice the word length of its operands.

Accumulator Registers

Accumulator registers hold intermediate and final results of multiply-accumulate and other arithmetic operations. Most DSP processors have two or more accumulators. In general, the accumulator is wider than the result of a product. The additional bits are called guard bits. These bits allow values to be accumulated without the risk of overflow and without rescaling. N additional bits allow up to 2^N accumulations to be performed without overflow. The guard-bit method is more advantageous than scaling the multiplier product, since it allows the maximum precision to be retained in the intermediate steps of computations.

ALU

Arithmetic logic units implement basic arithmetic and logical operations. Operations such as addition, subtraction, AND, and OR are performed in the ALU.

Shifter

In fixed-point arithmetic, multiplications and accumulations often induce a growth in the bit width of results. Scaling is then necessary to pass results from stage to stage, and is performed through the use of shifters.

The following diagram shows the Motorola 56002 Data Path.

Data ALU input Registers

X1, X0, Y1, and Y0 are four 24-bit, general-purpose data registers. They can be treated as four independent 24-bit registers or as two 48-bit registers called X and Y, formed by concatenating X1:X0 and Y1:Y0, respectively. X1 is the most significant word in X and Y1 is the most significant word in Y.
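The word lengths mentioned so far (2n-bit products, guard bits, and the X1:X0 concatenation) can be checked with ordinary integer arithmetic. The sketch below assumes 24-bit operands and 8 guard bits, as in the 56002 data path discussed here. The helper names `fits_signed` and `concat` are illustrative and do not belong to any vendor tool.

```python
WORD = 24                 # operand width in bits (X1, X0, Y1, Y0)
GUARD = 8                 # guard bits assumed for this sketch
ACC = 2 * WORD + GUARD    # 56-bit accumulator


def fits_signed(value, bits):
    """True if value is representable as a two's-complement number of that width."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1


def concat(high, low, bits=WORD):
    """Concatenate two words, e.g. X = X1:X0 with X1 the most significant word."""
    return (high << bits) | (low & ((1 << bits) - 1))


most_negative = -(1 << (WORD - 1))
largest_product = most_negative * most_negative        # worst-case magnitude
total = largest_product * (1 << GUARD)                  # 2**GUARD accumulations

print(fits_signed(largest_product, 2 * WORD - 1))       # one bit short of 2n
print(fits_signed(largest_product, 2 * WORD))           # 2n bits
print(fits_signed(total, 2 * WORD), fits_signed(total, ACC))
print(hex(concat(0x123456, 0xABCDEF)))
```

The worst-case product needs the full 48 bits, and 256 such products no longer fit in 48 bits but do fit in the 56-bit accumulator, which is the role of the guard bits.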
The registers serve as input buffer registers between the X Data Bus or Y Data Bus and the MAC unit. They act as Data ALU source operands and allow new operands to be loaded for the next instruction while the current instruction uses the register contents. The registers may also be read back out to the appropriate data bus to implement memory-delay operations and save/restore operations for interrupt service routines.

MAC and Logic Unit

The MAC and logic unit shown in the figure below conducts the main arithmetic processing and performs the calculations on data operands in the DSP.

For arithmetic instructions, the unit accepts up to three input operands and outputs one 56-bit result in the following form: extension:most significant product:least significant product (EXT:MSP:LSP). The operation of the MAC unit occurs independently and in parallel with XDB and YDB activity, and its registers facilitate buffering for Data ALU inputs and outputs. Latches on the MAC unit input permit writing an input register which is the source for a Data ALU operation in the same instruction. The arithmetic unit contains a multiplier and two accumulators. The input to the multiplier can only come from the X or Y registers (X1, X0, Y1, Y0). The multiplier executes 24-bit x 24-bit, parallel, twos-complement fractional multiplies. The 48-bit product is right-justified and added to the 56-bit contents of either the A or B accumulator. The 56-bit sum is stored back in the same accumulator. An 8-bit adder, which acts as an extension accumulator for the MAC array, accommodates overflow of up to 256 and allows the two 56-bit accumulators to be added to and subtracted from each other. The extension adder output is the EXT portion of the MAC unit output. This multiply/accumulate operation is not pipelined, but is a single-cycle operation.
If the instruction specifies multiply without accumulation (MPY), the MAC clears the accumulator and then adds the contents to the product.

In summary, the results of all arithmetic instructions are valid (sign-extended and zero-filled) 56-bit operands in the form of EXT:MSP:LSP (A2:A1:A0 or B2:B1:B0). When a 56-bit result is to be stored as a 24-bit operand, the LSP can be simply truncated, or it can be rounded (using convergent rounding) into the MSP. Convergent rounding (round-to-nearest) is performed when the instruction (for example, the signed multiply-accumulate and round (MACR) instruction) specifies adding the multiplier's product to the contents of the accumulator. The scaling mode bits in the status register specify which bit in the accumulator shall be rounded.

The logic unit performs the logical operations AND, OR, EOR, and NOT on Data ALU registers. It is 24 bits wide and operates on data in the MSP portion of the accumulator. The LSP and EXT portions of the accumulator are not affected.

The Data ALU features two general-purpose, 56-bit accumulators, A and B. Each consists of three concatenated registers (A2:A1:A0 and B2:B1:B0, respectively). The 8-bit sign extension (EXT) is stored in A2 or B2 and is used when more than 48-bit accuracy is needed; the 24-bit most significant product (MSP) is stored in A1 or B1; the 24-bit least significant product (LSP) is stored in A0 or B0. Overflow occurs when a source operand requires more bits for accurate representation than are available in the destination. The 8-bit extension registers offer protection against overflow. In the DSP56K chip family, the extreme values that a word operand can assume are -1 and +0.9999998. If the sum of two numbers is less than -1 or greater than +0.9999998, the result (which cannot be represented in a 24-bit word operand) has underflowed or overflowed.
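The extreme values quoted above follow directly from the fractional two's-complement format. As a minimal sketch, assuming a 24-bit word with one sign bit and 23 fraction bits, the range can be computed as follows. The helper names are illustrative only.

```python
FRACTION_BITS = 23   # 24-bit word: 1 sign bit + 23 fraction bits


def word_to_fraction(word):
    """Interpret a signed 24-bit integer as a two's-complement fraction."""
    return word / (1 << FRACTION_BITS)


def fits_in_word(value):
    """True if a real value lies inside the range of a 24-bit fractional word."""
    return -1.0 <= value <= word_to_fraction(0x7FFFFF)


largest = word_to_fraction(0x7FFFFF)     # 1 - 2**-23
smallest = word_to_fraction(-0x800000)   # -1.0
print(smallest, largest)
print(fits_in_word(0.75 - 0.5), fits_in_word(0.75 + 0.5))
```

The largest word is 1 - 2^-23, which the text quotes as +0.9999998. A sum such as 0.75 + 0.5 falls outside the word range and is exactly the case the extension register is there to absorb.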
The 8-bit extension registers can accurately represent the result of 255 overflows or 255 underflows. Whenever the accumulator extension registers are in use, the V bit in the status register is set.

Automatic sign extension occurs when the 56-bit accumulator is written with a smaller operand of 48 or 24 bits. A 24-bit operand is written to the MSP (A1 or B1) portion of the accumulator, the LSP (A0 or B0) portion is zero-filled, and the EXT (A2 or B2) portion is sign-extended from the MSP. A 48-bit operand is written into the MSP:LSP portion (A1:A0 or B1:B0) of the accumulator, and the EXT portion is sign-extended from the MSP. No sign extension occurs if an individual 24-bit register is written (A1, A0, B1, or B0). When either A or B is read, it may optionally be scaled one bit left or one bit right for block floating-point arithmetic. Sign extension can also occur when writing A or B from the XDB and/or YDB, or with the results of certain Data ALU operations (such as the transfer conditionally (Tcc) or transfer Data ALU register (TFR) instructions).

Overflow protection occurs when the contents of A or B are transferred over the XDB and YDB by substituting a limiting constant for the data. Limiting does not affect the content of A or B; only the value transferred over the XDB or YDB is limited. This overflow protection occurs after the content of the accumulator has been shifted according to the scaling mode. Shifting and limiting occur only when the entire 56-bit A or B accumulator is specified as the source for a parallel data move over the XDB or YDB. When individual registers A0, A1, A2, B0, B1, or B2 are specified as the source for a parallel data move, shifting and limiting are not performed.

The accumulator shifter is an asynchronous parallel shifter with a 56-bit input and a 56-bit output that is implemented immediately before the MAC accumulator input.
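The sign extension and limiting behavior described above can be sketched in a few lines. This is a simplified model under stated assumptions: the accumulator is held as a signed Python integer, only 24-bit transfers are modeled, the LSP is simply truncated, and the scaling-mode shift is ignored. The limiting constants used are the most positive and most negative 24-bit words. The function names are hypothetical.

```python
WORD_MASK = 0xFFFFFF
MAX_WORD = 0x7FFFFF   # most positive 24-bit word
MIN_WORD = 0x800000   # most negative 24-bit word (two's-complement bit pattern)


def write_word(word):
    """Write a 24-bit word to the accumulator: MSP = word, LSP zero-filled,
    EXT sign-extended. Returns the accumulator as a signed integer."""
    word &= WORD_MASK
    signed = word - (1 << 24) if word & 0x800000 else word
    return signed << 24


def read_word_limited(acc):
    """Return the 24-bit pattern placed on the bus when the whole accumulator is the source."""
    ext_msp = acc >> 24                 # EXT:MSP as a signed integer
    if ext_msp > 0x7FFFFF:
        return MAX_WORD                 # positive overflow: substitute the limit
    if ext_msp < -0x800000:
        return MIN_WORD                 # negative overflow: substitute the limit
    return ext_msp & WORD_MASK          # extension not in use: MSP passes unchanged


half = write_word(0x400000)             # +0.5 as a fraction
print(hex(read_word_limited(half)))
print(hex(read_word_limited(half + half + half)))   # 1.5 does not fit in a word
```

Note that the accumulator itself keeps the value 1.5 in this model; only the value sent over the bus is clamped, as the text states.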
The source accumulator shifting operations are as follows:

No Shift (Unmodified)
1-Bit Left Shift (Arithmetic or Logical) ASL, LSL, ROL
1-Bit Right Shift (Arithmetic or Logical) ASR, LSR, ROR
Force to zero

12.7. Addressing

The ability to generate new addresses efficiently is a characteristic feature of DSP processors. Most DSP processors include one or more special address generation units (AGUs) that are dedicated to calculating addresses. An AGU can perform one or more address calculations per instruction cycle without using the processor's main data path. The calculation of addresses takes place in parallel with arithmetic operations on data, improving processor performance.

One of the main addressing modes is register-indirect addressing. The data addressed is in memory, and the address of the memory location containing the data is held in a register. This gives a natural way to work with arrays of data. Another advantage is efficiency from an instruction-set point of view, since it allows powerful and flexible addressing with relatively few bits in the instruction word.

Whenever an operand is fetched from memory using register-indirect addressing, the address register can be incremented to point to the next needed value in the array.
The following table summarizes the most common increment methods in DSPs:

Syntax | Mode | Operation
*rP | register indirect | Read the data pointed to by the address in register rP
*rP++ | postincrement | Having read the data, postincrement the address pointer to point to the next value in the array
*rP-- | postdecrement | Having read the data, postdecrement the address pointer to point to the previous value in the array
*rP++rI | register postincrement | Having read the data, postincrement the address pointer by the amount held in register rI to point to rI values further down the array
*rP++rIr | bit reversed (FFT) | Having read the data, postincrement the address pointer to point to the next value in the array, as if the address bits were in bit-reversed order

An additional convenient feature of an AGU is the presence of modulo addressing modes, which are extensively used for circular addressing. Instead of comparing the address to a calculated value to see whether or not the end of the buffer has been reached, dedicated registers are used to perform this check automatically and take the necessary action (i.e. reset the register to the start address of the buffer).

The following picture represents the address generation unit of the Motorola 56002.

This AGU uses integer arithmetic to perform the effective address calculations necessary to address data operands in memory, and contains the registers used to generate the addresses. It implements linear, modulo, and reverse-carry arithmetic, and operates in parallel with other chip resources to minimize address-generation overhead. The AGU is divided into two identical halves, each of which has an address arithmetic logic unit (ALU) and four sets of three registers. They are the address registers (R0 - R3 and R4 - R7), the offset registers (N0 - N3 and N4 - N7), and the modifier registers (M0 - M3 and M4 - M7).
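The modulo and bit-reversed update modes mentioned above can be expressed in software as follows. This is only a sketch of the arithmetic: a real AGU does this in hardware and typically places constraints on the buffer base address that are not modeled here. The function names are illustrative.

```python
def post_increment_modulo(pointer, step, base, length):
    """Advance a pointer by step inside the circular buffer [base, base + length)."""
    return base + (pointer - base + step) % length


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index (the access order used by radix-2 FFTs)."""
    reversed_index = 0
    for _ in range(bits):
        reversed_index = (reversed_index << 1) | (index & 1)
        index >>= 1
    return reversed_index


# A 4-word circular buffer starting at address 0x100: the pointer wraps at the end.
pointer = 0x102
visited = []
for _ in range(4):
    visited.append(hex(pointer))
    pointer = post_increment_modulo(pointer, 1, base=0x100, length=4)
print(visited)

# Bit-reversed order for an 8-point FFT.
print([bit_reverse(i, 3) for i in range(8)])
```

The wrap-around replaces the explicit end-of-buffer comparison that a general-purpose processor would need in the inner loop.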
The eight Rn, Nn, and Mn registers are treated as register triplets; e.g., only N2 and M2 can be used to update R2. The eight triplets are R0:N0:M0, R1:N1:M1, R2:N2:M2, R3:N3:M3, R4:N4:M4, R5:N5:M5, R6:N6:M6, and R7:N7:M7.

The two arithmetic units can generate two 16-bit addresses every instruction cycle, one for any two of the XAB, YAB, or PAB. The AGU can directly address 65,536 locations on the XAB, 65,536 locations on the YAB, and 65,536 locations on the PAB. The two independent address ALUs work with the two data memories to feed the data ALU two operands in a single cycle. Each operand may be addressed by an Rn, Nn, and Mn triplet.

12.8. Peripherals

Most DSP processors provide on-chip peripherals and interfaces that allow the DSP to be used in an embedded system with a minimum amount of external hardware to support its operation and interfacing.

Serial port

A serial interface transmits and receives data one bit at a time. These ports have a variety of applications, such as sending and receiving data samples to and from A/D and D/A converters and codecs, sending and receiving data to and from other microprocessors or DSPs, and communicating with other hardware. The two main categories are synchronous and asynchronous interfaces. Synchronous serial ports transmit a bit clock signal in addition to the serial bits. The receiver uses this clock to decide when to sample the received data. In contrast, asynchronous serial interfaces do not transmit a separate clock signal; they rely on the receiver deducing a clock signal from the data itself.

A direct extension of serial interfaces leads to parallel ports, where data are transmitted in parallel instead of sequentially. Faster communication is obtained at the cost of additional pins.

Host Port

Some DSPs provide a host port for connection to a general-purpose processor or another DSP. Host ports are usually specialized 8- or 16-bit bidirectional parallel ports that can be used to transfer data between the DSP and the host processor.
Link ports or communications ports

This kind of port is dedicated to multiprocessor operation. It is in general a parallel port intended for communication between DSPs of the same type.

Interrupt controller

An interrupt is an external event that causes the processor to stop executing its current program and branch to a special block of code called an interrupt service routine. Typically this code deals with the origin of the interrupt and then returns from the interrupt. There are different interrupt sources:

• On-chip peripherals: serial ports, timers, DMA, ...
• External interrupt lines: dedicated pins on the chip to be asserted by external circuitry
• Software interrupts: also called exceptions or traps, these interrupts are generated under software control or occur, for example, on floating-point exceptions (division by zero, overflow and so on).

DSPs associate interrupts with different memory locations. These locations are called interrupt vectors. The vectors contain the addresses of the interrupt routines. When an interrupt occurs, the following scenario is encountered:

• Save the program counter on a stack
• Branch to the relevant address given by the interrupt vector table
• Save all registers used in the interrupt routine
• Perform the dedicated operations
• Restore all registers
• Restore the program counter

Priority levels can be assigned to the different interrupts through the use of dedicated registers. An interrupt is acknowledged when its priority level is strictly higher than the current priority level.

Timers

Programmable timers are often used as a source of periodic interrupts. They are completely software-controlled and are used to activate specific tasks at chosen times. A timer is generally a counter that is preloaded with a desired value and decremented on clock cycles. When zero is reached, an interrupt is issued.
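As a minimal sketch of the two mechanisms just described, the Python code below models a down-counting timer and the priority test used to acknowledge an interrupt. The class and function names are hypothetical, and the automatic reload of the counter after it reaches zero is an assumption made so that the timer produces periodic interrupts; real devices differ in these details.

```python
def should_acknowledge(request_level, current_level):
    """An interrupt is acknowledged only if its priority is strictly higher
    than the current priority level."""
    return request_level > current_level


class DownCounterTimer:
    """Counter preloaded with a value and decremented on every clock cycle."""

    def __init__(self, preload):
        if preload < 1:
            raise ValueError("preload must be at least 1")
        self.preload = preload
        self.counter = preload

    def tick(self):
        """Advance one clock cycle; return True when an interrupt is issued."""
        self.counter -= 1
        if self.counter == 0:
            self.counter = self.preload  # assumed auto-reload for periodic operation
            return True
        return False


def interrupt_cycles(preload, cycles):
    """Clock cycles (counted from 1) at which the timer issues an interrupt."""
    timer = DownCounterTimer(preload)
    return [cycle for cycle in range(1, cycles + 1) if timer.tick()]
```

Calling interrupt_cycles(3, 10) lists the cycles at which a timer preloaded with 3 would interrupt during ten clock cycles, one interrupt every three cycles.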
DMA

Direct Memory Access is a technique whereby data can be transferred to or from the processor's memory without the involvement of the processor itself. DMA is commonly used to provide improved performance with input/output devices. Rather than have the processor read data from an I/O device and copy the data into memory or vice versa, a separate DMA controller can handle such transfers in parallel. Typically, the processor loads the DMA controller with control information including the starting address for the transfer, the number of words to be transferred, the source and the destination. The DMA controller uses the bus request pin to notify the DSP core that it is ready to make a transfer to or from external memory. The DSP core completes its instruction, releases control of external memory and signals the DMA controller via the bus grant pin that the DMA transfer can proceed. The DMA controller then transfers the specified number of data words and optionally signals completion through an interrupt. Some processors also have multiple DMA channels managing DMA transfers in parallel.

12.9. How is a DSP different from a general-purpose processor

• DSPs are intended for real-time embedded control/signal processing applications, not general-purpose computing
• DSPs are strictly non-user-programmable (typically no memory management, no operating system, no cache, no shared variables, single-process oriented)
• DSPs usually employ some form of "Harvard Architecture" to allow simultaneous code and data fetches
• The salient characteristic of all DSPs is devoting significant chip real estate to the "multiply-accumulate" (MAC) operation; most DSPs perform a MAC operation in a single clock cycle
• DSP programs are often resident in fast on-chip ROM and/or RAM (although off-chip bus expansion is usually possible)
• Most DSPs have at least two multi-ported on-chip RAMs for storing operand data
• DSP interrupt handling is simple, fast, and efficient (minimum context switch overhead)
• Many DSP applications are assembly-coded, due to real-time processing constraints (although C compilers exist for most DSPs)
• DSP I/O provisions are usually fairly simple
• DSP address bus widths are typically smaller than those of general-purpose processors (code size tends to be small, "tight-loop" oriented)
• Fixed-point DSPs utilize saturation arithmetic (rather than allowing 2's complement overflow to occur)
• DSP addressing modes are geared toward signal processing applications (direct support for circular buffers, "butterfly" access patterns)
• DSPs often provide direct hardware support for the implementation of "do" loops
• Many DSPs employ an on-chip hardware stack to facilitate subroutine linkage
• Most "lower end" DSPs have integrated program ROM and scratchpad RAM to facilitate single-chip solutions
• Most DSPs do not have integrated ADCs and DACs; these interfaces (if desired) are usually implemented externally
• The benchmark suites used to compare DSPs are totally different from those used to compare general-purpose (RISC/CISC) processors:
    • FIR/IIR filters
    • FFTs
    • convolution
    • dot product

12.10 Superscalar Architectures

The term "superscalar" is commonly used to designate architectures that enable more than one instruction to be executed per clock cycle.

Nowadays multimedia architectures, supported by the continuous improvement in technologies, are rapidly moving towards highly parallel solutions like SIMD and VLIW machines. What do these acronyms mean?
SIMD stands for Single Instruction on Multiple Data. In simple words, the architecture has a single program control unit that fetches and decodes the program instructions for multiple execution units, i.e. multiple sets of datapaths, registers and data memories. Of course a SIMD architecture can be realized by a multiprocessor configuration, but the exploitation of deep submicron technologies has made it possible to integrate such architectures in a single chip. It is easy at this point to imagine each execution unit being driven by a different program control unit, making it possible to execute in parallel different instructions of the same program or different programs at the same time; in this case the resulting architecture is called Multiple Instructions on Multiple Data (MIMD). Again, a MIMD machine can be implemented by a multiprocessor structure or integrated in a single chip.

Historically, the first examples of the so-called multiple-issue machines appeared in the early '80s, and they were called VLIW machines (for Very Long Instruction Word). These machines exploit an instruction word consisting of several (up to 8) instruction fragments. Each fragment controls a precise execution unit; consequently the register set must be multiported to support simultaneous access, because the multiple instructions could need to share the variables. In order to accommodate the multiple instruction fragments, the instruction word is often over 100 bits long. [12]

The reasons that push towards these parallel approaches are essentially two. First of all, many scientific processing algorithms, either for calculus or, more recently, for communication and multimedia applications, contain a high degree of parallelism.
Secondly, a parallel architecture is a cost-effective way to compute (when the program is parallelizable), since internal and on-chip communications are much faster and much more efficient than external communication channels.

On the other hand, parallel architectures bring with them a number of problems and new challenges that are not present in simple processors. First of all, while it is true that many programs are parallelizable, extensive research has shown that the level of parallelism that can be achieved is often theoretically not greater than 3; this means that on actual architectures the speedup factor is not greater than 2. Based upon this, it would seem that in the absence of significant compiler breakthroughs the available speedup is limited. A second problem concerns memories and registers: highly parallel routines require a high memory access rate, and therefore a very deep optimization of the register set, cache memory and data buses in order to feed the necessary amount of data into the execution units. Finally, such complex architectures, with heavily optimized datapaths and data transfers, are very difficult to program. Normally DSP programmers used to develop applications directly in assembly language, which is in some respects very similar to the natural sequential way of thinking of human beings and is specifically conceived for smart optimizations. Machines like the MIMD and VLIW ones are no longer programmable in assembly, so processor designers have to spend a great amount of resources (often more than the time needed to develop the chip itself) in order to provide Software Development Kits able to exploit the potential of the processor, taking care of every aspect from powerful optimization techniques to understandable user interfaces.

More recent attempts at multiple-issue processors have been directed at rather lower amounts of concurrency than in the first VLIW architectures (4-5 parallel execution blocks).
Three examples of this new generation of superscalar machines will be briefly discussed in the next subsections, underlining architectural aspects and specific solutions to deal with the problems of parallelization.

The Pentium processor with multimedia extensions

The Pentium processor has explicitly supported multimedia since the introduction of the so-called MMX (MultiMedia eXtension) family. The well-known key enhancement of this technology consists of exploiting the 64-bit floating-point registers of the processor to "pack" 8-, 16-, or 32-bit data that can be processed in parallel by a SIMD operating unit. Fifty-seven new instructions are implemented in the processor in order to exploit these new functionalities, among them "multiply and add", the basic operation in the case of digital convolutions (FIR filters) or FFT algorithms. [13] Two considerations can be made about this processor. First of all, the packed data are fixed point, so the use of these extensions for a DSP-oriented task limits the use of floating-point arithmetic; conversely, a full use of floating-point operations does not allow any boost in performance in comparison with the common Pentium family.

Moreover, MMX technology has been conceived to specifically support multimedia algorithms, but at the same time to completely preserve code compatibility with previous processors; as a result, the increased potential fixed-point processing power is not supported by the necessary memory and bus re-design, and it is often not possible to "feed" the registers with the correct data. Extensive tests conducted after the disclosure of the MMX technology have shown that for typical video applications it is often hard to achieve a speedup factor of 50%.

Figure 1. How the Pentium MMX exploits the 64-bit floating-point registers to "pack" data in parallel and send them to a SIMD execution unit

The TriMedia processor

Another multimedia processor, other than Intel MMX, that is growing in interest is the TriMedia by Philips Electronics. This chip is not designed as a completely general-purpose CPU, but with the double functionality of CPU and DSP in the same chip, and its core processing master unit presents a VLIW architecture.

The key features of TriMedia are:

• A very powerful, general-purpose VLIW processor core (the DSPCPU) that coordinates all on-chip activities. In addition to implementing the non-trivial parts of multimedia algorithms, this processor runs a small real-time operating system that is driven by interrupts from the other units.
• DMA-driven multimedia input/output units that operate independently and that properly format data to make software media processing efficient.
• DMA-driven multimedia coprocessors that operate independently and in parallel with the DSPCPU to perform operations specific to important multimedia algorithms.
• A high-performance bus and memory system that provides communication between the TM1000's processing units.

The architecture of the TriMedia is shown in figure 2. The real DSP processing must be implemented in the master CPU/DSP, which is also responsible for the algorithm direction. This unit is a 32-bit floating-point, 133 MHz general-purpose unit whose VLIW instructions can address up to five instructions out of the 27 functional operations (integer and floating-point multipliers and 5 ALUs).

The DSPCPU is provided with a 32 Kbyte instruction cache memory and a dual-port 16 Kbyte data cache memory.
[14] TriMedia also provides a set of multimedia instructions, mostly targeted at MPEG-2 video decoding.

Figure 2. The TriMedia processor architecture

Some of the programming challenges for parallel architectures are solved in the DSPCPU through the concept of guarded conditional operations. An instruction takes the following form:

Rg : Rdest = imul Rsrc1, Rsrc2

In this instruction the integer product of the two source registers is put into the destination register under the condition contained in the "guard" register Rg. This allows better control of the optimization strategies of the parallel compiler, since for instance the problem of branches is relaxed and the result is accepted or rejected only at the last execution stage of the pipeline.

As mentioned above, complex processors/DSPs like TriMedia need a large amount of development tools and software support. For this reason the TriMedia comes with a large set of tools to deal with the real-time kernel, the DSPCPU programming and the complete system exploitation.

The TriMedia Software Development Environment provides a comprehensive suite of system software tools to compile and debug multimedia applications, analyse and optimize performance, and simulate execution on the TriMedia processor. The main features are:

• VLIW ANSI C/C++ compilation system
• Source- and machine-level debugging
• Performance analysis and enhancement
• Cycle-accurate machine-level simulator
• TriMedia applications library, including:
    • MPEG-1 decoder
    • MPEG-2 program stream decoder
    • 3-D graphics pipeline
    • PC audio synthesis (FM, wavetable)
    • V.34 modem
    • H.324 (PSTN) / H.320 (ISDN) PC video conferencing

PUMA - Parallel Universal Music Architecture

A very interesting solution recently developed for advanced audio applications is the PUMA (Parallel Universal Music Architecture) DSP by Studer Professional Audio.
This chip was conceived and realised in collaboration with the Integrated Systems Center (C3I) of the EPFL.

This integrated circuit is designed and optimized for digital mixing consoles. It is provided with 4 independent channel processors, and therefore with four 33-MHz, 24-bit fixed-point multipliers and adders fully dedicated to data processing (another multiplier is provided in the master DSP, which is in charge of the final data processing and directs the whole chip functionalities and I/O units). The important feature of this chip lies in the multiple processing units that can work in parallel on similar blocks of data. Each channel processor has its own internal data memory (256 24-bit words for each processor), and the Master DSP and the Array DSP have independent program memories and program control units. The design of the I/O units required great care: digital audio input and output are supported by 20 serial lines each, and interprocessor communication is supported through independent units (the Cascade Data Input and Cascade Data Output) providing 64 channels on 8 lines at processor speed. A general-purpose DRAM/SRAM External Memory Interface and the External Host Interface permit memory extension and flexible programmability via an external host processor. The following figure shows the top-level architecture of the PUMA DSP.

The following figure shows the internal datapath of each channel processor; three units can work in parallel in a single clock cycle: a 24x24-bit multiplier, a comparator and the general-purpose ALU (adder, shifter, logical operations).

PUMA design flow

To conclude, it is interesting to spend a few words on the PUMA design flow, to understand how a modern and complex architecture, characterised by several million transistors, can be practically realised.
First of all the functional specification of the processor is developed, defining functionalities, basic blocks and instruction set; at the same time the C-model of the architecture is implemented, in order to test with methodology the algorithms and the architecture.

The second step is the VHDL description and simulation of the C-model at the RTL level, followed by the synthesis to the gate level. All of this was accomplished using the Synopsys Design Compiler and Analyzer.

After that, an optimization technique called hierarchical compiling is used: after setting the boundary constraints for the main blocks, the constraints for the inner blocks are derived hierarchically by the compiler, and this makes it possible to relax the timing paths wherever it is not strictly necessary.

The preliminary place & route follows; then the parasitic parameters (R and C) for each wire are extracted, and the so-called back-annotation, or in-place compilation, is performed in order to better adapt each load to the real netlist placement. The place & route was done with the Compass tool, the back-annotation again in Synopsys Design Compiler.

Finally the last place & route is made, and extensive simulations are performed for every part of the chip, in order to verify the timing of every specific operation. Now the design is ready for the foundry.

References

1. Nicholson, Blasco and Reddy, "The S2811 Signal Processing Peripheral," WESCOM Tech Papers, Vol. 2, 1978, pp. 1-12.
2. S2811 Signal Processing Peripheral, Advanced Product Description, AMI, May 1979.
3. Strauss, DSP Strategies 2000, Forward Concepts, Tempe, AZ, November 1996, p. 24.
4. Hoff and Townsend, "An analog input/output microprocessor for signal processing," ISSCC Digest of Technical Papers, February 1979, p. 220.
5. 2920 Analog Signal Processor Design Handbook, Intel, Santa Clara, CA, August 1980.
6. Strauss, DSP Strategies 2000, Forward Concepts, Tempe, AZ, November 1996, p. 24.
7. Brodersen, "VLSI for Signal Processing," Trends and Perspectives in Signal Processing, Vol. 1, No. 1, January 1981, p. 7.
8. Stanzione et al., "Final Report Study Group on Digital Integrated Signal Processors," Bell Labs Internal Memorandum, October 1977.
9. Boddie, Daryanani, Eldumtan, Gadenz, Thompson, Walters, Pedersen, "A Digital Signal Processor for Telecommunications Applications," ISSCC Digest of Technical Papers, February 1980, p. 44.
10. Bell System Technical Journal, Vol. 60, No. 7, September 1981.
11. Phil Lapsley, Jeff Bier, Amit Shoham, "DSP Processor Fundamentals, Architectures and Features", IEEE Press.
12. Michael J. Flynn, "Computer Architecture. Pipelined and parallel processor design", Jones and Bartlett, 1995.
13. Peleg A., Wilkie S., Weiser U., "Intel MMX for Multimedia PCs", Communications of the ACM, Vol. 40, No. 1, Jan 1997.
14. TM1000 Preliminary Data Book, Philips Electronics NAC, 1997.

ARCHITECTURES FOR VIDEO PROCESSING

Integrated System Laboratory C3I
Swiss Federal Institute of Technology, EPFL

The first question we would like to answer is: what do we mean nowadays by video processing?

In the past, more or less until the end of the 80's, there were two distinct worlds: an analog TV world and a digital computer world. All TV processing from the camera to the receiver was based on analog processing, analog modulation and analog recording. With the progress of digital technology, a part of the analog processing could be implemented by digital circuits, with consistent advantages in terms of reproducibility of the circuits, leading to cost and stability advantages, and of noise sensitivity, leading to quality advantages.
At the end of the 80's completely new video processing possibilities became feasible with digital circuits. Today, image compression and decompression is the dominant digital video processing task in terms of importance and complexity in the whole TV chain.

Figure 1. Schematic representation of a TV chain.

In the near future digital processing will be used to pass from standard resolution TV to HDTV, for which compression and decompression is a must, considering the bandwidth that it would otherwise require for transmission.

Other applications will be found at the level of the camera to increase the image quality by increasing the number of bits from 8 to 10 or 12 for each pixel, or by using appropriate processing aimed at compensating the limitations of the sensors (image enhancement by non-linear filtering and processing). Digital processing will also enter the studio for digital recording, editing and 50/60 Hz standard conversions. Today the high communication bandwidth between studio devices required by uncompressed digital video, which is necessary for editing and recording operations, limits the use of full digital video and digital video processing at studio level.

Video Compression

Why has video compression become the dominant video processing application of TV? An analog TV channel only needs a 5 MHz analog channel for transmission. Conversely, in the case of digital video with 8-bit A/D and 720 pixels by 576 lines (54 MHz sampling rate), we need a transmission channel with a capacity of 168.8 Mbit/s! In the case of digital HDTV with 10-bit A/D and 1920 pixels by 1152 lines, the required capacity rises to 1.1 Gbit/s! No affordable applications, in terms of cost, are thus possible without video compression.

These reasons have also raised the need for worldwide video compression standards, so as to achieve interoperability and compatibility among devices and operators.
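The order of magnitude of the raw bit rates quoted above can be checked with simple arithmetic. The sketch below is only an estimate under assumptions that the text does not state: 25 frames per second and 4:2:2 chrominance sub-sampling, i.e. an average of two samples (one luminance, one chrominance) per pixel. The function name is chosen for this example.

```python
def raw_video_bit_rate(width, height, frames_per_second, bits_per_sample, samples_per_pixel):
    """Uncompressed video bit rate in bit/s."""
    return width * height * frames_per_second * bits_per_sample * samples_per_pixel


# Assumed parameters: 25 frames/s, 4:2:2 sampling (2 samples per pixel on average).
sd_rate = raw_video_bit_rate(720, 576, 25, 8, 2)
hd_rate = raw_video_bit_rate(1920, 1152, 25, 10, 2)

print(f"Standard TV: {sd_rate / 1e6:.1f} Mbit/s")
print(f"HDTV:        {hd_rate / 1e9:.2f} Gbit/s")
```

Under these assumptions the standard-resolution case gives about 166 Mbit/s, close to the 168.8 Mbit/s quoted in the text (the small difference presumably comes from parameters the text leaves unstated), and the HDTV case gives about 1.1 Gbit/s.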
H.261 is the name given to the first digital video compression standard, specifically designed for videoconference applications; MPEG-1 is the name of the one designed for CD storage applications (up to 1.5 Mbit/s); MPEG-2 is the one for digital TV and HDTV, respectively from 4 up to 9 Mb/s for TV and up to 20 Mb/s for HDTV; H.263 is the one for videoconferencing at very low bit rates (16 - 128 kb/s). All these standards are better considered as a family of standards sharing quite similar processing algorithms and features.

All of them are based on the same basic philosophy:

• The decoder must be simple. For TV and HDTV we have very few encoders, used by broadcasting companies (at the limit just one for each channel), while we must have a decoder in each TV set.
• The decoding syntax is completely specified. This means that any compressed video bit-stream can be decoded without any ambiguity, yielding the same video result.
• A decoder must be conformant. This means that a decoder must be able to decode any video bit-stream that respects the decoding syntax.
• The encoding syntax is specified. This means that an encoder must encode video content in a conformant syntax.
• The encoder (i.e. the encoding algorithm) is not specified. This means that encoding algorithms are a competitive issue: encoders can be optimized to achieve higher quality of the compressed video, or to simplify the encoding algorithm so as to have a simple encoder. It also means that in the future, as more processing power becomes available, we can use more and more sophisticated and processing-demanding encoding algorithms to find the best choices within the available encoding syntax.

These basic principles of the video compression standards clearly have strong consequences for the architectures implementing video compression.

In order to understand the main processing and architectural issues in video compression, we briefly analyze in more detail the basic processing of the MPEG-2 standard.
MPEG-2 Video Compression

MPEG-2 is a complete standard that specifies all stages from video acquisition up to the interface with the communication protocols. Figure 2 reports a schematic diagram of how MPEG-2 provides a transport layer after compression. Compressed audio and video bit-streams are multiplexed and put in packets in a suitable transport format. This part of the processing cannot be classified as video processing, and is not considered here in detail.

Figure 2. MPEG-2 transport stream diagram.

Figure 3. Basic processing for MPEG-2 compression.

Figure 4. MPEG-2 pre-filtering and spatial redundancy reduction by DCT.

Figure 5. MPEG-2 spatial redundancy reduction by quantization and entropy coding.

The basic video processing algorithms of MPEG-2 are reported in Figure 3. These algorithms are also found, with some variants, in all the other compression standards mentioned before. The first stage is the conversion of the image from the RGB format to the YUV format, followed by filtering and sub-sampling of the chrominance components to yield smaller color images. Then images are partitioned into blocks of 8x8 pixels, and blocks are grouped into macro-blocks of 16x16 pixels. Two main processes are then applied. One is the reduction of spatial redundancy, the other is the reduction of temporal redundancy.

Figure 6. MPEG-2 temporal redundancy reduction by motion compensated prediction.

Spatial redundancy is reduced by applying the DCT transform to blocks and then entropy coding the quantized transform coefficients by means of Huffman tables.
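The spatial-redundancy step can be illustrated with a small pure-Python sketch: an 8x8 DCT computed as one-dimensional 8-point transforms applied first to the rows and then to the columns, followed by quantization. This is a simplified illustration, not the MPEG-2 procedure itself: the orthonormal scaling and the single uniform quantization step are assumptions made for the example (the standard uses quantization matrices), and the function names are hypothetical.

```python
import math

N = 8  # block size used for the transform


def dct_1d(samples):
    """Orthonormal 8-point DCT-II of a sequence of N numbers."""
    output = []
    for k in range(N):
        scale = math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
        total = sum(samples[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                    for n in range(N))
        output.append(scale * total)
    return output


def dct_2d(block):
    """8x8 DCT built from 1-D transforms: 8 along the rows, then 8 along the columns."""
    rows = [dct_1d(row) for row in block]
    columns = [dct_1d([rows[r][c] for r in range(N)]) for c in range(N)]
    # Transpose back so that result[v][h] holds vertical frequency v, horizontal frequency h.
    return [[columns[c][r] for c in range(N)] for r in range(N)]


def quantize(coefficients, step):
    """Uniform quantization (assumed single step size); small coefficients become zero."""
    return [[int(round(value / step)) for value in row] for row in coefficients]
```

For a completely flat block all the energy ends up in the DC coefficient at the upper left corner, and every other quantized coefficient is zero, which is what makes the subsequent entropy coding effective.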
Temporal redundancy is reduced by motion compensation applied to macro-blocks according to the IBBP group of pictures structure.

In more detail (see Figures 4 and 5), spatial redundancy is reduced by applying an 8x1 DCT transform 8 times horizontally and 8 times vertically. Then the transform coefficients are quantized, thus reducing small high-frequency coefficients to zero, scanned in zig-zag order starting from the DC coefficient at the upper left corner of the block, and coded using Huffman tables, also referred to as Variable Length Coding (VLC).

The reduction of temporal redundancy is the process that drastically reduces the bit rate and makes it possible to achieve high compression rates. It is based on the principle of finding the current macro-block in already transmitted pictures, at the same position in the image or displaced by a so-called "motion vector" (see Figure 6). Since an exact copy of the macro-block is not guaranteed to be found, the macro-block that has the lowest average error is chosen as the reference macro-block. The "error macro-block" is then processed so as to reduce its spatial redundancy, if any, by means of the above-mentioned procedure, and transmitted, so that the desired macro-block can be reconstructed from the "motion vector" indicating the reference and from the relative error.

Figure 7 reports the so-called MPEG-2 Group of Pictures structure, which shows that images are classified as I (Intra), P (Predicted) and B (Bi-directionally interpolated). The standard specifies that Intra image macro-blocks can only be processed to reduce spatial redundancy; P image macro-blocks can also be processed to reduce temporal redundancy, referring only to past I or P frames; B image macro-blocks can also be processed using an interpolation of past and future reference macro-blocks. Obviously a B macro-block can also be coded as Intra or Predicted if this is found to be convenient for the compression.
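The search for the reference macro-block described above can be sketched as an exhaustive ("full search") block matching. This is a minimal illustration under assumptions of its own: the error measure is the mean absolute difference, pictures are plain lists of pixel rows, and the function names are hypothetical. The standard does not prescribe how the search is performed.

```python
def mean_abs_error(current_block, reference, top, left):
    """Average absolute pixel difference between the current block and the
    same-sized block of the reference picture whose corner is (top, left)."""
    size = len(current_block)
    total = 0
    for r in range(size):
        for c in range(size):
            total += abs(current_block[r][c] - reference[top + r][left + c])
    return total / (size * size)


def find_motion_vector(current_block, reference, top, left, search_range):
    """Try every displacement (dy, dx) within +/- search_range around (top, left)
    and return (best_vector, lowest_error)."""
    size = len(current_block)
    height, width = len(reference), len(reference[0])
    best_vector, best_error = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            r, c = top + dy, left + dx
            if r < 0 or c < 0 or r + size > height or c + size > width:
                continue  # candidate block falls outside the reference picture
            error = mean_abs_error(current_block, reference, r, c)
            if best_error is None or error < best_error:
                best_vector, best_error = (dy, dx), error
    return best_vector, best_error
```

The number of candidates grows with the square of the search range, and each candidate costs one comparison per pixel of the block, which gives a feeling for why motion estimation dominates the cost of an encoder while the decoder only has to apply the transmitted vector.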
Note that since B pictures can use both past and future I or P frames as references, the MPEG-2 image transmission order is different from the display order: B pictures are transmitted in the compressed bit-stream after the relative I and P pictures.

Figure 7. Structure of an MPEG-2 GOP, showing the reference pictures for motion compensated prediction of P and B pictures.

Complexity of MPEG Video Processing

At the end of the 80's there was a lot of discussion about the complexity of implementing DCT transforms in real time at video rate. Blocks of 8x8 were chosen instead of 16x16 in order to reduce the complexity of the transform. The main objective was to avoid complex processing at the decoder side. With this goal, many optimized DCT implementations have appeared, both as dedicated chips and as software, using a reduced number of multiplications and additions.

Nowadays digital technology has made so much progress in terms of speed and processing performance that DCT coding or decoding is no longer a critical issue. Figure 8 shows a schematic block diagram of an MPEG-2 decoder, which is very similar to those of the other compression standards. A buffer is needed to receive at a constant bit-rate the compressed bits, which during decoding are not "consumed" at a constant rate. VLD is a relatively simple process that can be implemented by means of look-up tables or memories. Being a bit-wise process, it cannot be parallelized and is quite inefficient to implement on general-purpose processors. This is the reason why new multimedia processors such as the Philips "Trimedia" use specific VLC/VLD units for entropy coding.
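A look-up-table variable length decoder of the kind mentioned above can be sketched as follows. The code table is a toy prefix code invented for this example (it is not one of the MPEG-2 tables), the bit-stream is represented as a string of '0' and '1' characters for readability, and the function names are hypothetical.

```python
# Hypothetical prefix code for illustration only; not an MPEG-2 VLC table.
TOY_CODE = {"0": "a", "10": "b", "110": "c", "111": "d"}


def build_lookup(code):
    """Build a table indexed by the next max_len bits of the stream.
    Each entry holds (symbol, code length) for the code word that prefixes it."""
    max_len = max(len(word) for word in code)
    table = [None] * (1 << max_len)
    for word, symbol in code.items():
        free_bits = max_len - len(word)
        first = int(word, 2) << free_bits
        for suffix in range(1 << free_bits):
            table[first + suffix] = (symbol, len(word))
    return table, max_len


def decode(bits, code):
    """Decode a string of '0'/'1' characters with one table look-up per symbol."""
    table, max_len = build_lookup(code)
    symbols = []
    position = 0
    while position < len(bits):
        # Peek at the next max_len bits, padding with zeros at the end of the stream.
        window = bits[position:position + max_len].ljust(max_len, "0")
        entry = table[int(window, 2)]
        if entry is None:
            raise ValueError("invalid code word")
        symbol, length = entry
        if position + length > len(bits):
            raise ValueError("truncated bit-stream")
        symbols.append(symbol)
        position += length  # the next code word starts where this one ends
    return symbols
```

The start of each code word is known only after the previous one has been decoded, so the loop is inherently sequential; this is the dependency that prevents the process from being parallelized.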
The more costly elements of the MPEG-2 decoder are the memories for the storage of past and future reference frames, and the handling of the data flow between the Motion Compensated Interpolator unit and the reference video memories.

Figure 8. Block diagram of an MPEG-2 decoder.

For an MPEG-2 encoder (see Figure 9) the situation is very different. First of all, we can recognize a path that implements a complete MPEG-2 decoder, necessary to reconstruct the reference images as they are found at the decoder side. Then we have a motion estimation block (bi-directional motion estimator) that has the goal of finding the motion vectors, and a block that selects and controls the macro-block encoding modes. As discussed in the previous paragraphs, the way to find the best motion vectors, as well as the way to choose the right coding for each macro-block, is not specified by the standard. Therefore very simple algorithms (with limited quality performance) or extremely complex ones (with high quality performance) can be implemented for these functions. Moreover, MPEG-2 allows the dynamic definition of the GOP structure, making many coding modes possible. In general, there are two critical issues in an MPEG-2 encoder: the motion estimation processor, and the handling of the complex data flow, with the related bandwidth problems, between the original and coded frame memories, the motion estimation processor and the coding control unit.

We also have to mention that the coding modes of MPEG-2 are much more complex than might appear from this brief description. In fact, existing TV is based on interlaced images, and all coding modes can be applied in distinct ways to "frame" blocks and macro-blocks or to "field" blocks and macro-blocks. The same applies to motion estimation, for which we can use either field-based or frame-based vectors.
Moreover, all references for predictions can be made on true image pixels or on "virtual" image pixels obtained by bilinear interpolation, as shown in Figure 10.
Figure 9. Block diagram of an MPEG-2 encoder.
Figure 10. MPEG-2 macro-block references can also be made on "virtual" pixels (in red) obtained by bilinear interpolation, instead of image pixels from the original raster (gray). In this case, motion vectors with half-pixel precision also need to be estimated.
The possibility of using all these encoding modes greatly increases the quality of the compressed video, but it can become extremely demanding in terms of processing complexity. The challenge for the MPEG-2 encoder designer is to find the best trade-off between the complexity of the implemented algorithms and the quality of the compressed video. Architectural and algorithmic issues are very closely related in MPEG-2 encoder architectures.
Digital Video and Computer Graphics
In the past, digital video on computers was equivalent to computer graphics. Unlike in the TV world, all processing was obviously digital, mainly treating synthetic images from 2-D or 3-D models. The concept of a real-time computer graphics application was very approximate, since the application was usually intended to run as fast as possible on the available processors, using graphics accelerators in parallel for the arithmetic operations on pixels.
Figure 11. Sequence of typical computer graphics processing steps.
Figure 11 shows a schematic diagram of the basic computer graphics operations. For each image, 2-D or 3-D models composed of triangles or polygons are placed in the virtual space by the application, which can be interactive.
The position of each vertex is calculated according to the geometric transformation of the object and projected on the screen. The texture mapped on each polygon is transformed according to the light model corresponding to the position of the polygon in space. The pixels on the screen corresponding to the screen raster are obtained from the "original" texture pixels on the polygon by appropriate filtering operations. Finally, the polygon is displayed on the screen.
Figure 12. Processing requirements of 3-D graphics content in terms of pixels and polygons per second.
Computer graphics applications rely strongly on the performance of acceleration cards that are specialized to handle all these numerous but simple pixel operations in parallel, with a high degree of pipelining. Figure 12 shows the processing requirements, in polygons/s and pixels/s, of various graphics contents.
TV, Computer Graphics and Multimedia: MPEG-4?
The new MPEG-4 multimedia standard, which was defined as a draft ISO international standard in October 1998, takes on the ambitious challenge of bringing together the world of natural video and TV with the world of computers and computer graphics. In fact, MPEG-4 includes both natural compressed video and 2-D and 3-D models. The standard is based on the concept of elementary streams, each of which represents and carries the information of a single "object" that can be of any type: "natural" or "synthetic", audio or video. Figure 13 shows an example of the content of an MPEG-4 scene. Natural and 2-D and 3-D synthetic audio-visual objects are received and composed into a scene as seen by a hypothetical viewer.
Figure 13. Example of the content and construction of an MPEG-4 scene.
Figure 14. Diagram of the MPEG-4 Systems layer and its interface with the network layer.
Two virtual levels are necessary to interface the "elementary stream" level with the network level. The first is necessary to multiplex/demultiplex each communication stream into packets, and the second to synchronize each packet and build the "elementary streams" carrying the "object" information, as shown in Figure 14. The processing related to the MPEG-4 Systems layer cannot be considered video processing and is very similar to the packet processing typical of network communication.
An MPEG-4 terminal can be schematized as shown in Figure 15. The communication network provides the stream, which is demultiplexed into a set of "elementary streams". Each "elementary stream" is decoded into audio/video objects. Using the scene description transmitted with the elementary streams, all objects are "composed" together in the video memory according to their size, view angle and position in space, and then "rendered" on the display. The display can be interactive, in which case the user interaction generates upstream data that is sent back to the MPEG-4 encoder. MPEG-4 systems therefore implement not only the classical MPEG-2-like compression/decompression processing and functionality, but also computer graphics processing such as "composition" and "rendering". The main difference with respect to the natural video of MPEG-1, MPEG-2 and H.263 is the introduction of "shape coding", which enables the use of arbitrarily shaped video objects, as illustrated in Figure 16. Shape coding information is based on macro-block data structures and on arithmetic coding of the contour information associated with each boundary block.
Figure 15. Illustration of the processing and functionality implemented in an MPEG-4 terminal.
Figure 16. Compressed shape information is necessary for arbitrarily shaped objects.
Figure 17. MPEG-4 decoder block diagram: shape information is coded in parallel with the DCT-based texture coding. Shape coding can be of "Intra" type, or can use motion compensation and a prediction error, like texture coding.
The block diagram of an MPEG-4 encoder is depicted in Figure 17. Its architecture is in general very similar to that of an MPEG-2 encoder. We can notice a new "shape coding" block in the motion estimation loop, which produces the shape coding information transmitted in parallel with the classical texture coding information.
Video Processing Architectures: Generalities
In general, we can classify the circuits implementing video processing into four families:
Application Specific Integrated Circuits (ASICs). This group includes all hardwired circuits specifically designed for a single processing task. The level of programmability is very low, and the circuits are usually clocked at the input/output data sampling rate or at a multiple of it.
Application Specific Digital Signal Processors (AS-DSPs). These architectures are based on a DSP core plus special functions (such as 1-D and 2-D filters, FFT, graphics accelerators, block matching engines) that are specific to a selected application.
Digital Signal Processors (DSPs). These are the classical processor architectures, specialized and efficient for multiply-accumulate operations on 16-, 24- or 32-bit data. The classical well-known families are those of Motorola and Texas Instruments. The level of programmability of these processors is very high. They are also employed for real-time applications with constant input/output rates.
General Purpose Processors (GPPs). These are the classical PC processors (Intel, IBM PowerPC) and workstation processors (Digital Alpha, Sun UltraSparc).
Originally they were designed for general purpose software applications and in general, although very powerful, they are not really adapted to video processing. Moreover, the operating systems employed are not real-time operating systems. Designing real-time video applications on these architectures is not as simple a task as it might appear.
Considering the video processing implementations of the last years, we can observe the trend over time illustrated in Figure 18. If we consider different video processing algorithms (indicated as Proc. 1, Proc. 2, etc., in order of increasing complexity), such as a DCT on an 8x8 block, the first implementations to appear are based on ASIC architectures. After some years, with the evolution of IC technology, these functions can be implemented in real time by AS-DSPs, then by standard DSPs, and then by GPPs. This trend corresponds to the desire to transfer the complexity of the processing from the hardware architecture to the software implementation. However, this trend does not bring only advantages and does not apply to all implementation cases. Figures 19, 22 and 23 illustrate the advantages and disadvantages of each class of architecture, which should be considered case by case. Let us analyze and discuss each feature in detail.
Figure 18. Trend of algorithm implementations over time on different architectures.
Figure 19. Conflicting trade-offs for architecture families.
Figure 19 shows how the various families of architectures behave with respect to the two conflicting requirements of real-time performance and flexibility/programmability.
For highly resource-demanding processing there is no doubt that dedicated circuits can be orders of magnitude faster than GPPs, but the advantages of programmability and the possibility of changing the software to implement new processing capabilities are attractive for some applications. For instance, a GPP can decode any video standard (H.261, H.263, MPEG-1 or MPEG-2) just by changing the software, depending on the application. On the other hand, real-time performance is not easy to guarantee on most GPP platforms, and the difficulty of handling real-time processing and other processes at the same time has to be carefully evaluated and verified.
Figure 20 illustrates the concept with a simple FIR filter example. In a dedicated implementation (ASIC), a filter can be implemented with simple and extremely fast circuitry. A simple architecture based on registers and multipliers, of just the size and speed necessary for the processing at hand, is employed. Real-time processing is easily guaranteed by appropriately clocking the system to the input data. Conversely, a programmable solution is much more complex. Figure 21 shows the various processing elements that are usually found: ALUs, memories for the data and for the program instructions, communication buses, etc. Moreover, even simple processing algorithms such as FIR filtering need to access the data and program memories several times, as shown in the instruction example.
These considerations lead to clear cost advantages for ASICs when high volumes are required (see Figure 23). Simpler circuits that require smaller silicon areas are the right solution for set-top boxes and other high-volume applications (MPEG-2 decoders for digital TV broadcasting, for instance). In these cases the high development costs and the lack of debugging and software tools for simulation and design do not constitute a serious drawback.
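The FIR comparison above can be made concrete with a small Python sketch of how a programmable processor computes the filter, one multiply-accumulate at a time. The function name and the access counter are assumptions made for this illustration. Only data and coefficient reads are counted; instruction fetches, which a real processor also needs, are ignored.

```python
def fir_programmable(samples, coeffs):
    """Direct-form FIR, y[n] = sum_k h[k] * x[n-k], computed sequentially.

    Returns the filtered samples and the number of memory reads
    (one coefficient read plus one data read per multiply-accumulate).
    """
    outputs = []
    memory_reads = 0
    for n in range(len(samples)):
        acc = 0
        for k, h in enumerate(coeffs):
            if n - k >= 0:  # samples before the start are taken as zero
                acc += h * samples[n - k]
                memory_reads += 2
        outputs.append(acc)
    return outputs, memory_reads


filtered, reads = fir_programmable([1, 2, 3, 4], [1, 1, 1])
```

A dedicated circuit with one multiplier per tap would instead keep the coefficients and the last input samples in registers and produce one output per clock cycle, with no memory traffic of this kind.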
For such high-volume applications, modifications of the algorithms and the introduction of new versions are not possible, but they are not required either. Conversely, for low-volume applications, programmable solutions that are immediately available on the market and well supported by compilers, debuggers and simulation tools, which can effectively reduce development time and cost, might be the right choice. The much higher cost of the programmable processor becomes acceptable in some cases for relatively low volumes of devices.
Another conflict between hardwired and programmable solutions arises from the need to design low-power solutions, which is driven by the increasing importance of portable devices and by the need to reduce the increasing power dissipated by high-performance processors (see Figure 24). This need conflicts with the desire to transfer the increasing complexity of processing algorithms from the architecture to the software, which is much easier and faster to modify, correct and debug. The optimization of memory size and accesses, clock frequency and other architectural features that yield low power consumption is only possible on ASIC architectures.
How much can power consumption be reduced by moving from a GPP to an ASIC? It is difficult to answer this question with a single figure: it depends on the architecture and on the processing. For instance, Figure 24 reports the power dissipation of a 2-D convolution with a 3x3 filter kernel on a 256x256 image on three different architectures. The result is that an ARM RISC implementation, besides being slower than the other alternatives and therefore giving an underestimated result, is about 3 times more demanding than an FPGA implementation and 18 times more demanding than an ASIC-based one.
The example of the IMAGE motion estimation chip reported at the end of this document shows that much higher reduction factors (even more than two orders of magnitude) can be reached by low-power optimized ASIC architectures for specific processing tasks, when compared with GPPs providing the same performance.
Figure 20. Example of a FIR filtering implementation on a dedicated architecture.
Figure 21. Example of a FIR filtering implementation on a DSP architecture.
Figure 22. Conflicting trade-off for architecture families.
Figure 23. Conflicting trade-off for architecture families.
Figure 24. Power dissipation reduction for the same processing (2-D convolution 3x3) on three different architectures.
A last general consideration about the efficiency of the various architectures for video processing concerns memory usage. Video processing applications, as we have seen in more detail for MPEG-2, require the handling of very large amounts of data (pixels) that need to be processed and accessed several times in a video encoder or decoder. Images are filtered, coded, decoded and used as references for motion compensation and motion estimation of different frames; in other words, they are accessed in order or "randomly" several times in a compression/decompression stage. If we look at the speed of processors and at the access speed of cache SRAM and synchronous DRAM over the last years, we observe two distinct trends (see Figure 25). Processor speed was similar to memory access speed in 1990, but it is now more than double, and the trend is towards even higher speed ratios. This means that the performance bottleneck of today's video processing architectures is the efficiency of the data flow.
A correct design of the software for GPPs and a careful evaluation of the achievable memory bandwidth of the various data exchanges are necessary to avoid the risk that the processing unit spends most of its time just waiting for the correct data to be processed. For graphics accelerators, handling the data flow is the basic objective of the processing. Figure 26 reports the performance of some state-of-the-art devices versus the graphics content.
Figure 25. Evolution of the processing speed of processors, SRAM and synchronous DRAM in the last years. Memory access speed has become the performance bottleneck of data-intensive processing systems.
Figure 26. Performance and power dissipation of state-of-the-art graphics accelerators (AS-DSPs) versus polygons/s and pixels/s.
Motion Estimation Case Study
Block motion estimation for high-quality video compression applications (e.g. digital TV broadcasting, multimedia content production) is a typical example for which GPP architectures are not a good implementation choice. Motion estimation is indeed the most computationally demanding stage of video compression at the encoder side. For normal-resolution TV we have to encode 1620 macro-blocks per frame, at 25 frames per second. Roughly, computing the matching error of one candidate motion vector requires about 510 arithmetic operations on 8- to 16-bit data. The number of vector displacements depends on the search window size, which should be large to guarantee high-quality coding. For instance, for sport sequences a size of about 100x100 is required. This leads to about 206 x 10^9 arithmetic operations per second on 8- to 16-bit data.
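The figure of about 206 x 10^9 operations per second can be reproduced with a short calculation. The sketch below assumes that every position of a 100x100 search window is one candidate displacement (10^4 candidates per macro-block); this is an interpretation that matches the stated result, not something the text spells out. The function and constant names are hypothetical.

```python
MACROBLOCKS_PER_FRAME = 1620  # from the text (consistent with 45 x 36 blocks of 16x16)
FRAMES_PER_SECOND = 25        # from the text
OPS_PER_CANDIDATE = 510       # from the text: operations per matching error
SEARCH_WINDOW = (100, 100)    # from the text: window size for sport sequences


def full_search_ops_per_second(macroblocks=MACROBLOCKS_PER_FRAME,
                               fps=FRAMES_PER_SECOND,
                               ops_per_candidate=OPS_PER_CANDIDATE,
                               window=SEARCH_WINDOW):
    """Operations per second when every window position is tested."""
    candidates = window[0] * window[1]  # assumed: one candidate per position
    return macroblocks * fps * ops_per_candidate * candidates


total_ops = full_search_ops_per_second()
print(f"{total_ops / 1e9:.2f} x 10^9 operations per second")
```

The product is 206,550,000,000, that is, about 206 x 10^9 operations per second.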
Even if we are able to select an "intelligent" search algorithm that reduces the number of search points by one to two orders of magnitude, the number of operations remains extremely high and is not feasible for state-of-the-art GPPs. Moreover, 32- or 64-bit arithmetic cores are wasted when only operations on 8 to 16 bits are necessary. Completely different architectures that implement a high level of parallelism at the bit level are necessary. To be more accurate, B pictures require both forward and backward motion estimation and, for TV applications for instance, each macro-block can use the best among frame-based and field-based motion vectors at full- or half-pixel resolution. Therefore, the real processing needs can increase by more than a factor of 10 if all possible motion vectors are estimated.
Another reason why ASICs or AS-DSPs are an interesting and current choice for motion estimation is the still unsolved need for motion estimation in TV displays. Large TV displays need to double the refresh rate to avoid the annoying flicker that appears on the side portions of large screens. Converting interlaced content from 50 to 100 Hz by simply doubling each field gives satisfactory results if there is no motion. With moving objects, the image quality provided by field doubling is low, and motion-compensated interpolation is necessary to reconstruct the movement phase of the interpolated images. An efficient and low-cost motion estimation stage is necessary for high-quality up-conversion on TV displays.
IMAGE: a Motion Estimation Chip for MPEG-2 Applications
We briefly describe the characteristics of a motion estimation chip designed in the C3I laboratory of the EPFL in the framework of the ATLANTIC European project, in collaboration with the BBC, CSELT, Snell&Wilcox and the Fraunhofer Institute.
IMAGE is an acronym for Integrated MIMD Architecture for Genetic motion Estimation. The requirement for the chip was to provide motion estimation for MPEG-2 encoders in very large search windows, for forward, backward, field-based, frame-based, full-pel and half-pel precision motion vectors. Figures 27 and 28 show the MPEG-2 broadcasting chain and the main input-output specifications of the chip. The same chip is also required to provide the result of the candidate motion compensation modes (forward, backward, field, frame, intra) and the selection of the corresponding best coding decision. Since all these operations are macro-block based, they share the same level of parallelism as motion estimation algorithms.
The basic architectural idea was to design a processing engine that is extremely efficient at computing the mean absolute difference between macro-blocks (the matching error), with fast access to a large image section (the search window). "Extremely efficient" here means exploiting as much as possible the parallelism intrinsic to pixel operations on 16x16 blocks of pixels, and being able to access any position in the search window randomly without useless waiting times (i.e. providing the engine with sufficient memory bandwidth to fully exploit its processing power). Figure 29 shows the block diagram of the "block-matching" engine. In the center is the "pixel processor" for the parallel execution of the macro-block difference; there are also two cache memory banks, for the storage of the current macro-block and of the search window reference, and a RISC processor that handles the genetic motion estimation algorithm and the communications between processing units. The basic processing unit of Figure 29 appears again in the general architecture of the chip shown in Figure 30.
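The matching error mentioned above, the mean absolute difference between the current macro-block and a candidate position in the search window, can be sketched in Python as follows. For simplicity the sketch tests every position exhaustively; it does not implement the genetic search used by the IMAGE chip, and all function names are assumptions made for this illustration.

```python
def mean_absolute_difference(current, reference, top, left, size=16):
    """Matching error between `current` and the block of `reference`
    whose upper-left corner is at (top, left)."""
    total = 0
    for r in range(size):
        for c in range(size):
            total += abs(current[r][c] - reference[top + r][left + c])
    return total / (size * size)


def full_search(current, reference, size=16):
    """Test every block position in the search window `reference`.

    Returns (error, top, left) of the best match. This is an exhaustive
    search, swapped in here for the chip's genetic algorithm.
    """
    best = None
    for top in range(len(reference) - size + 1):
        for left in range(len(reference[0]) - size + 1):
            err = mean_absolute_difference(current, reference, top, left, size)
            if best is None or err < best[0]:
                best = (err, top, left)
    return best


# Small usage example: an 8x8 window and a 4x4 block taken from row 2, column 3.
window = [[r * 8 + c for c in range(8)] for r in range(8)]
block = [row[3:7] for row in window[2:6]]
best_match = full_search(block, window, size=4)
```

All the absolute differences of one candidate are independent of each other, which is the pixel-level parallelism that a dedicated pixel processor can exploit.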
In Figure 30 we can notice two macro-block processing units in parallel, the various I/O modules for communication with the external frame memory, and the communication interfaces for cascading chips for forward and backward motion estimation and for larger search window sizes. As mentioned in the discussion of data-intensive applications, one of the main difficulties of the chip design is the correct balancing of the processing time of the various units and the optimization of the communications between modules. It is fundamental that the processing of all modules is scheduled so as to avoid wait times and that the communication buses have the necessary bandwidth. The low-power optimizations are summarized in Figure 31. Deactivation of processing units, local gated clocks and the implementation of a low-power internal SRAM as cache memory made it possible to keep power dissipation below 1 W. Figure 32 shows the final layout of the chip with the main design parameters.
In conclusion, the IMAGE chip can be classified as an AS-DSP because of its high programmability; the application-specific function for which special hardware is used is the calculation of macro-block differences. Its motion estimation performance is much higher than that of any state-of-the-art GPP, and it is obtained with a relatively small chip dissipating less than 1 W when providing real-time motion estimation for MPEG-2 video compression. More details about the IMAGE chip can be found in: F. Mombers, M. Gumm et al., "IMAGE: a low-cost low-power video processor for high quality motion estimation in MPEG-2 encoding", IEEE Trans. on Consumer Electronics, Vol. 44, No. 3, August 1998, pp. 774-783.
Figure 27. Block diagram of a TV broadcasting chain based on MPEG-2 compression.
Figure 28. Requirements of a motion estimation/prediction selection chip for MPEG-2 encoding.
Figure 29. Block diagram of the "block matching" processor.
Figure 30. High-level architecture of the IMAGE chip with the indication of the critical communication paths.
Figure 31. Low-power optimizations achieved on the IMAGE chip.
Figure 32. Main design data of the IMAGE chip.