Register Clustering Methodology for Low Power Clock Tree
Deng C, Cai YC, Zhou Q. Register clustering methodology for low power clock tree synthesis. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 30(2): 391–403 Mar. 2015. DOI 10.1007/s11390-015-1531-4

Register Clustering Methodology for Low Power Clock Tree Synthesis

Chao Deng (邓 超), Student Member, IEEE, Yi-Ci Cai (蔡懿慈), Senior Member, CCF, ACM, IEEE, and Qiang Zhou (周 强), Senior Member, CCF, ACM, IEEE

Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

E-mail: [email protected]; {caiyc, zhouqiang}@mail.tsinghua.edu.cn

Received March 26, 2014; revised November 24, 2014.

Abstract Clock networks dissipate a significant fraction of the entire chip power budget. Therefore, optimizing the power consumption of clock networks has become one of the most important objectives in high performance IC designs. In contrast to most traditional studies, which handle this problem with clock routing or buffer insertion strategies, this paper proposes a novel register clustering methodology for generating the leaf level topology of the clock tree to reduce the power consumption. Three register clustering algorithms called KMR, KSR and GSR are developed, and a comprehensive study of them is presented in this paper. Meanwhile, a buffer allocation algorithm is proposed to satisfy the slew constraint within the clusters at a minimum cost of power consumption. We integrate our algorithms into a classical clock tree synthesis (CTS) flow to test the register clustering methodology on ISPD 2010 benchmark circuits. Experimental results show that all three register clustering algorithms achieve more than 20% reduction in power consumption without affecting the skew and the maximum latency of the clock tree. As the most effective of the three algorithms, the GSR algorithm achieves a 31% reduction in power consumption as well as a 4% reduction in skew and a 5% reduction in maximum latency.
Moreover, the total runtime of the CTS flow with our register clustering algorithms is significantly reduced by almost an order of magnitude.

Keywords low power, register clustering, clock tree synthesis

1 Introduction

1.1 Motivation

With the increasingly high degree of integration and fast clock frequency in VLSI technology, power dissipation has become a crucial concern in modern IC design, especially for portable devices, which have strict requirements on power density and battery life. Dynamic power is the power dissipated while charging and discharging the load capacitance, and it occupies the largest fraction of the total power consumption in most cases. Dynamic power is calculated by (1), where α is a constant, V is the supply voltage, C_load is the total load capacitance including both the wire capacitance and the gate input pin capacitance, and f is the switching frequency[1].

$$P_{dynamic} = \alpha V^{2} C_{load} f. \tag{1}$$

Among all circuit elements, the clock network is a major contributor to dynamic power dissipation because of its huge fanout size and frequent switching activities. A clock network typically consumes as much as 40% of the entire chip power budget[2-3]. Therefore, many researchers have paid attention to the power optimization of clock networks according to the parameters formulated in (1).

Clock gating[3-5] and multiple-supply voltage[6-8] are two commonly used technologies to reduce the power consumption of a clock network. The clock signal switches continuously, while data need not be loaded into registers in every clock period. Clock gating methods disable the clock signals of inactive registers in idle clock periods by inserting control gates and control signals into the clock network. Thus the power dissipation of the gated subtrees can be saved to some degree.
However, the control gates themselves consume a certain amount of power during the switching of the control signals, which could possibly cancel out the power reduction brought by clock gating. Multiple-supply voltage methods reduce the power consumption by allocating different supply voltages to different areas or at different periods depending on various performance requirements.

Regular Paper. This work was supported by the National Natural Science Foundation of China under Grant No. 61274031. ©2015 Springer Science + Business Media, LLC & Science Press, China

In this paper, we focus on cutting down the power dissipation by reducing the load capacitance of the clock tree without modifying the netlist or supply voltages. The load capacitance of a clock tree consists of the interconnect capacitance, the register (clock sink) capacitance, and the buffer capacitance. Traditional clock tree construction methods handle the power optimization problem by clock routing or buffer sizing. Chao et al.[9] proposed the Deferred Merge Embedding (DME) algorithm, which is widely used in zero-skew clock routing with minimum wirelength[10-12]. Because buffers are a non-negligible part of the load capacitance, recent research has also attached great importance to power reduction on buffers[13-16]. Buffers are inserted along the paths in the clock tree at a minimum cost of power dissipation, while satisfying the skew and slew constraints at the same time. Although clock routing based on the DME algorithm can achieve the minimum wirelength of a binary clock tree, and buffer sizing can cut down the buffer capacitance to some extent, the reduction in total power dissipation is limited by the binary tree structure. Cheon et al. indicated that most of the clock tree capacitance (about 80%) is at the leaf level, which includes all the registers, the wires connecting them, and the driving buffers[2].
We collected simple statistics on the clock tree capacitance of the eight ISPD 2010 benchmark circuits. The result in Fig.1 shows that buffer capacitance is the major contributor to the total capacitance of the clock tree. Buffers are inserted into the clock tree not only to avoid slew constraint violations but also to optimize skew and delay. Quite a few of the buffers are inserted at the leaf level to tune the delay of the registers and thus to reduce the skew. Therefore, an effective way of reducing clock tree capacitance is to reduce the capacitance at the leaf level. The register clustering algorithm clumps the registers into several leaf clusters and allocates a local clock buffer to each cluster. Thus the capacitance of buffers at the leaf level is significantly reduced, and the potential skew of the clock tree is limited to a small range when an efficient register clustering algorithm is adopted.

Fig.1. Average distribution of clock tree capacitance on ISPD 2010 benchmarks. (Pie chart; segment labels: Buffer Capacitance, Wire Capacitance, Register Capacitance; segment values: 14%, 28%, 58%.)

1.2 Previous Work

Clumping the registers into clusters is not difficult in itself. The real difficulty lies in reducing the negative effects of the clustering procedure on the skew/delay and on the signal nets. Lu et al.[17] proposed a register placement algorithm to further reduce clock routing wirelength and power. Hou and Liu[18] presented an automatic register placement technique that enables the synthesis of low-power clock trees for low-power ICs. The register clustering algorithms proposed in [19] and [2] clump the registers into several clusters after the original placement and shrink the clusters proportionally in order to reduce the wirelength and potential skew. However, there exist not only clock nets but also signal nets on the chip.
The movement of registers often has a negative influence on the signal nets, such as timing[19] and power[2] problems. Much additional effort must then be made to eliminate the negative effects of the register movement. Moreover, shrinking the clock network increases the local congestion and thus may lead to a routing failure in subsequent steps of the physical design flow. A slicing-based clustering algorithm was proposed in [20], which obtains clusters of approximately equal capacitance loading. Identical buffers are allocated to the clusters for process variation and skew optimization. However, the power dissipation of the clock tree is not taken into consideration in the algorithm. Shelar[21] proposed a polynomial time greedy clustering algorithm for low power clock tree synthesis. However, the influence of the clustering on skew is not discussed in that paper.

1.3 Our Contributions

In this paper, we focus on cutting down the total power dissipation of the clock tree without affecting the skew/delay and the signal nets. In order to verify the validity of our register clustering methodology, we develop three independent register clustering algorithms called KMR, KSR and GSR to generate the leaf level topology of the clock tree. The three algorithms are named after the key techniques adopted in them. KMR is short for K-Means Based Register Clustering algorithm, KSR is short for K-Splitting Based Register Clustering algorithm, and GSR is short for Greedy Search Based Register Clustering algorithm. A local clock buffer is inserted at the center of each cluster, and the registers within the cluster are connected directly to the buffer. The register clustering problem for the clock tree is similar to the unsupervised learning problem in the field of machine learning[22].
The KMR algorithm borrows some ideas from a classical unsupervised learning algorithm called the K-means algorithm[23]. By iteratively updating the cluster centers, the KMR algorithm can clump the registers in an optimal mode, thus limiting the local skew to a small range and minimizing the leaf level wirelength at the same time. A “pseudo center” technology is introduced into KMR to prevent the algorithm from failing when the registers to be clustered are unevenly distributed. The K-means algorithm has several known disadvantages: the number of clusters to be generated is difficult to determine, and the result is sensitive to the initialization of the cluster centers. In order to overcome these shortcomings, we improve the KMR algorithm to the KSR algorithm by introducing a “K-Splitting” algorithm as a pre-process. The GSR algorithm is a greedy algorithm which starts with one register and generates the clusters by absorbing adjacent registers until the maximum loading or maximum fanout limit is reached. It is worth mentioning that no negative effect is brought to the signal nets because there is no movement of registers in our algorithms. Finally, a buffer allocation algorithm is proposed to satisfy the slew constraint within the clusters at a minimum cost of power consumption. KMR, KSR and GSR are three independent register clustering algorithms, and thus we integrate each of them into a classical CTS (clock tree synthesis) flow[24] to test the effectiveness of our register clustering methodology.

The contributions of this paper are summarized as follows.

• We propose a novel register clustering methodology for power reduction of the clock tree, which can be easily integrated into any particular CTS approach.
• We develop three independent register clustering algorithms to verify the validity of our register clustering methodology: 1) we propose KMR, which reduces the power consumption significantly without affecting the skew and the maximum latency of the clock tree; 2) we develop KSR, which improves the result of the KMR algorithm by introducing a “K-Splitting” algorithm to determine the number of clusters and generate a good initialization for the clustering procedure; 3) we introduce the GSR algorithm for register clustering, which not only cuts down the power dissipation but also optimizes the skew and the maximum latency of the clock tree at the same time.

• A buffer allocation algorithm is proposed to insert local clock buffers into the clusters, which satisfies the slew constraint at a minimum cost of power consumption.

The rest of this paper is organized as follows. Section 2 defines the problem to be solved in this paper and gives a problem formulation. Section 3 presents the details of our register clustering algorithms. Section 4 presents our buffer allocation algorithm. Section 5 discusses the suitability of our register clustering methodology. Section 6 reports the experimental setup and the experimental results. Finally, Section 7 concludes our work.

2 Problem Formulation

As mentioned before, the total capacitance of a clock tree consists of the interconnect capacitance, the register capacitance, and the buffer capacitance. For a given layout, the register capacitance is fixed. The buffer capacitance of the whole clock tree is determined by several factors, including the load capacitance driven by the buffers and the skew and slew requirements during the optimization of the CTS flow. By the analysis in Section 1, plenty of buffers are inserted at the leaf level of the clock tree to tune the delay of the registers and thus to reduce the skew.
Therefore, an efficient register clustering algorithm is necessary to cut down the number of buffers at the leaf level. Meanwhile, a high-quality register clustering algorithm can clump the registers in an optimal mode, thus limiting the local skew to a small range and reducing the interconnect capacitance at the same time. The wirelength at the leaf level of the clock tree is a good measure of the quality of the register clustering algorithm. Therefore, we define the power reduction problem as a register clustering problem which aims at minimizing the wirelength at the leaf level of the clock tree. All lengths and distances are Manhattan distances in this paper.

• Register Clustering Problem.

Input: a set of registers $S_{register} = \{r_1, \cdots, r_n\}$, with the location $(x_{r_i}, y_{r_i})$ for register $r_i$.

Output: a set of clusters $S_{cluster} = \{c_1, \cdots, c_m\}$. The center location of cluster $c_i$ is calculated by the following two equations.

$$x_{c_i} = \sum_{j=1}^{p} x_{r_j} / p, \quad r_j \in c_i,\ j = 1, \cdots, p, \tag{2}$$

$$y_{c_i} = \sum_{j=1}^{p} y_{r_j} / p, \quad r_j \in c_i,\ j = 1, \cdots, p. \tag{3}$$

Objective: to minimize the total wirelength at the leaf level of the clock tree, which is described by the following equation.

$$WL = \sum_{i=1}^{m} \sum_{r_j \in c_i} \left( |x_{r_j} - x_{c_i}| + |y_{r_j} - y_{c_i}| \right). \tag{4}$$

Subject to:

$$c_i \neq \emptyset, \quad i = 1, \cdots, m,$$

$$c_i \cap c_j = \emptyset, \quad i, j \in \{1, \cdots, m\} \text{ and } i \neq j,$$

$$\bigcup_{i=1}^{m} c_i = S_{register}.$$

3 Register Clustering Algorithms for Power Reduction

3.1 Algorithm KMR

Before we introduce KMR, the “pseudo center” technology is presented as a preparation for KMR.

3.1.1 “Pseudo Center” Technology

In our KMR algorithm, the cluster set $S_{cluster}$ is refined by an iterative method. During every iteration, each register finds the cluster center closest to it and joins the corresponding cluster. The cluster centers are then updated after all the registers have been allocated to clusters in that iteration. The center of cluster $c_i$ is calculated by (2) and (3). However, the distribution of registers is severely uneven in some circuits, where many registers are located together as an IP core. Fig.2 shows an example benchmark from the ISPD 2010 CNS contest, which has several IP cores in it. In this situation, the registers in an IP core will easily form a large cluster during the iterations of the register clustering algorithm. As a result, some “empty” clusters will appear because no register chooses to join them. This leads to the failure of the register clustering algorithm because (2) and (3) are invalid for those “empty” clusters. The “pseudo center” technology is introduced in our algorithm to solve this problem.

Fig.2. Benchmark 10f07 from ISPD 2010 CNS contest.

Algorithm 1 is developed to calculate the center location for cluster C. If C is empty, the algorithm finds the largest cluster C′ in the cluster set $S_{cluster}$. Then the geometric center of half of the registers in C′ is used as the “pseudo center” for the empty cluster C (see details in steps 1∼9). A certain number of registers in the largest cluster will then be absorbed into the empty cluster C in the next iteration. Thus the “pseudo center” technology not only avoids the failure of the register clustering algorithm but also breaks up large clusters which would possibly violate the constraints (including slew, maximum fanout, maximum load capacitance, etc.) in the following CTS flow.

Algorithm 1.
Get Center
Input: a cluster C and the set of clusters S_cluster
Output: the center of cluster C: (x_C, y_C)
1: if C is empty then
2:   Find the largest cluster C′ in S_cluster
3:   x_sum = y_sum = 0
4:   for i = 0 to [C′.size()/2] − 1 do
5:     x_sum = x_sum + x_ri, r_i ∈ C′
6:     y_sum = y_sum + y_ri, r_i ∈ C′
7:   end for
8:   x_C = x_sum/[C′.size()/2]
9:   y_C = y_sum/[C′.size()/2]
10: else
11:   (x_C, y_C) is calculated by (2) and (3)
12: end if

It is worth mentioning that the “pseudo center” technology is easily extended to take more constraints into consideration, such as the maximum fanout or the maximum load capacitance of the local clock buffers, by generating a “pseudo center” whenever the total capacitance or the number of registers in a cluster reaches its limit.

3.1.2 Details of KMR Algorithm

In the classical K-means algorithm, K is an input parameter which indicates the number of clusters to be generated. However, given a layout with a set of registers, K is unknown until the clustering result is obtained, which is a chicken-and-egg problem. Therefore, in our KMR algorithm, we compute K from the total number of registers Total_r and the maximum fanout of the buffers Max_f by (5). In this equation, α is a user-defined parameter that keeps the clusters from violating the maximum fanout constraint in the iterative process. Thus α is usually larger than 1.

$$K = \alpha \times \lceil Total_r / Max_f \rceil. \tag{5}$$

The detailed procedure of our KMR algorithm is shown in Algorithm 2. The final cluster set S_cluster is refined by an iterative method, which decreases the total wirelength defined in (4) in every iteration until the total wirelength no longer decreases. At first, the input parameter K is computed by (5). Then, a center list CE is created and initialized (in steps 2∼5). After that, an initial wirelength newWL is calculated in step 7.
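As a rough illustration of the quantities introduced so far, the following Python sketch computes K from (5), the cluster center of Algorithm 1 (including the “pseudo center” case), and the wirelength of (4). It is not the authors' implementation. The function names, the representation of registers as (x, y) tuples, the rounding of K up to an integer, and the guards for degenerate inputs are assumptions made here for illustration.

```python
import math


def compute_k(total_r, max_f, alpha):
    """Eq. (5): K = alpha * ceil(total_r / max_f).

    Rounding the product up to an integer is an assumption of this sketch.
    """
    return math.ceil(alpha * math.ceil(total_r / max_f))


def get_center(cluster, clusters):
    """Algorithm 1: mean location of a cluster, or a pseudo center if empty.

    Each cluster is a list of (x, y) register locations.
    """
    members = cluster
    if not members:
        # Empty cluster: borrow the first half of the largest cluster.
        largest = max(clusters, key=len)
        if not largest:
            raise ValueError("all clusters are empty")
        half = max(1, len(largest) // 2)  # guard for size-1 clusters (assumption)
        members = largest[:half]
    x_c = sum(x for x, _ in members) / len(members)
    y_c = sum(y for _, y in members) / len(members)
    return (x_c, y_c)


def total_wirelength(clusters, centers):
    """Eq. (4): sum of Manhattan distances from registers to their centers."""
    return sum(
        abs(x - cx) + abs(y - cy)
        for cluster, (cx, cy) in zip(clusters, centers)
        for x, y in cluster
    )
```

With four registers on the corners of a 2 × 2 square in one cluster and a second, empty cluster, the pseudo center of the empty cluster is the mean of the first two registers of the large cluster, so the empty cluster can attract nearby registers in the next iteration.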
Steps 8∼21 explain the iterative method to reduce the total wirelength. At the beginning of every iteration, all the registers are cleared out of the clusters. Then every register finds the closest cluster center in CE and joins the corresponding cluster. Note that the function Closest Center() finds the closest center for a register on the condition that the load capacitance of the cluster satisfies the maximum load constraint after adding the register. After that, CE and newWL are updated according to the new cluster set S_cluster. The iteration stops when the total wirelength is no longer reduced.

Algorithm 2. KMR
Input: the set of registers S_register = {r_1, · · · , r_n}, with the location (x_ri, y_ri) for register r_i
Output: final result of cluster set S_cluster = {c_1, · · · , c_K}
1: Compute K by (5)
2: Create a cluster center list CE = {ce_1, · · · , ce_K}, ce_i is the center of cluster c_i
3: for i = 0 to K − 1 do
4:   CE[i] = location(r_i)
5: end for
6: oldWL = 0
7: Calculate newWL with (4)
8: while |newWL − oldWL| > 1 do
9:   for i = 0 to K − 1 do
10:     c_i.clear()
11:   end for
12:   for i = 0 to S_register.size() − 1 do
13:     ce_flag = Closest Center(r_i, CE)
14:     c_flag.add(r_i)
15:   end for
16:   for i = 0 to K − 1 do
17:     CE[i] = Get Center(c_i, S_cluster)
18:   end for
19:   oldWL = newWL
20:   Calculate newWL with (4)
21: end while

3.2 Algorithm KSR

In this subsection, we focus on explaining the details of KSR. The “K-Splitting” algorithm is presented first as a preparation for the KSR algorithm.

3.2.1 “K-Splitting” Algorithm

The KMR algorithm adopts a K-means based process to generate K clusters. However, the distribution of registers is not taken into consideration in (5) when K is calculated. If an improper K is selected, the result will be far from satisfactory. Fig.3(a) shows a small benchmark circuit which contains only 16 registers.
We can easily see that K = 4 is the best choice for this benchmark. However, if K is set to 3, the result becomes much worse, as shown in Fig.3(b). Both the interconnect capacitance and the local skew/delay within the two bigger clusters increase. The clustering result for K = 5 is shown in Fig.3(c). An unnecessary local clock buffer is inserted at the center of the blue cluster, so the total capacitance of the clock tree increases. Thus a proper K is very important to the final clustering result. Therefore, we develop a “K-Splitting” algorithm to generate a proper K as well as a reasonable initialization of the K clusters for the KSR algorithm.

Fig.3. (a) A small benchmark circuit. (b) Clustering result when K = 3. (c) Clustering result when K = 5. (d) MST of the circuit. (e) Three edges to be deleted for splitting the MST. (f) Splitting result of the “K-Splitting” algorithm.

The “K-Splitting” algorithm is shown in Algorithm 3. First, a complete weighted graph G is constructed, with each node representing a register. The weight of the edge between two nodes is the Manhattan distance between the two corresponding registers. Then the minimum spanning tree (MST) T of G is generated by the Kruskal algorithm[25] in line 3 (shown in Fig.3(d)). The edges in T are sorted in ascending order in one step of the Kruskal algorithm. A specified limit EL is computed from the parameters of the layout and the number of registers using (6), where Width and Length are the width and the length of the layout respectively, Size_ob is the total size of the obstacles on the layout, Num_registers is the total number of registers on the layout, and α is a user-defined parameter to control the number of clusters to be generated. After that, the edges whose weight is larger than EL are deleted from T (shown in Fig.3(e)), and the parameter K is recorded during the iterations (in steps 6∼9). Finally, we adopt a depth-first-search (DFS) algorithm[25] to generate K connected components, which constitute the original cluster set S_cluster, as shown in Fig.3(f). From the “K-Splitting” algorithm, we can see that a larger α indicates a smaller number of clusters, which is used for the trade-off between different requirements of power dissipation and skew/delay. In this paper, we set α to 1.

Algorithm 3. “K-Splitting”
Input: a set of registers S_register = {r_1, · · · , r_n}, with the location (x_ri, y_ri) for register r_i
Output: K and a set of K clusters S_cluster = {c_1, · · · , c_K}
1: Create a complete weighted graph G(S_register, E), where E = {e(r_i, r_j), i, j ∈ {1, · · · , n} and i ≠ j}
2: Calculate the weight for edges in E, w(e(r_i, r_j)) = |x_ri − x_rj| + |y_ri − y_rj|
3: Build the minimum spanning tree (MST) of G using the Kruskal algorithm: T(S_register, E_T) = Kruskal(G), the edges in E_T are sorted in ascending order by the Kruskal algorithm
4: Calculate EL with (6)
5: K = 1
6: while w(E_T.back()) > EL do
7:   E_T.pop_back()
8:   K = K + 1
9: end while
10: Adopt Depth-First-Search (DFS) algorithm on T to generate K connected components, which constitute the cluster set S_cluster = {c_1, · · · , c_K}
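The splitting procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the edge limit is passed in as a plain number (with the square root in (6) assumed to cover the whole fraction), and the connected components are collected with a union-find structure instead of the DFS named in Algorithm 3, which yields the same components.

```python
import math
from itertools import combinations


def edge_limit(width, length, size_ob, num_registers, alpha=1.0):
    """Eq. (6), assuming the square root covers the whole fraction."""
    return alpha * math.sqrt((width * length - size_ob) / num_registers)


def k_splitting(registers, edge_lim):
    """Split the Manhattan-distance MST at edges heavier than edge_lim.

    registers: list of (x, y) tuples. Returns (K, list of clusters).
    """
    n = len(registers)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def manhattan(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    # Complete graph, edges sorted in ascending order of weight.
    edges = sorted(
        (manhattan(registers[i], registers[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    # Kruskal: keep an edge only if it joins two different components.
    mst = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            mst.append((w, i, j))

    # Delete MST edges above the limit, then rebuild the components
    # from the remaining edges (union-find used here instead of DFS).
    parent[:] = range(n)
    for w, i, j in mst:
        if w <= edge_lim:
            parent[find(i)] = find(j)

    groups = {}
    for idx in range(n):
        groups.setdefault(find(idx), []).append(registers[idx])
    clusters = list(groups.values())
    return len(clusters), clusters
```

For 16 registers arranged as four tight groups of four on a 12 × 12 layout without obstacles, the limit from (6) with α = 1 is 3, the three long MST edges between groups are removed, and the sketch returns K = 4, which mirrors the situation drawn in Fig.3.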
$$EL = \alpha \times \sqrt{\frac{Width \times Length - Size_{ob}}{Num_{registers}}}. \tag{6}$$

3.2.2 Details of KSR Algorithm

The detailed procedure of KSR is shown in Algorithm 4. The input parameter K and the original S_cluster are generated by our “K-Splitting” algorithm, which is described in Algorithm 3. The main difference between KMR and KSR is the generation of K and the initialization of the center list CE. As the “K-Splitting” algorithm generates an original S_cluster, the center list CE is initialized by applying the “Get Center” algorithm to S_cluster (in steps 2∼4). The remaining process of the KSR algorithm is similar to the KMR algorithm.

Algorithm 4. KSR
Input: the set of registers S_register = {r_1, · · · , r_n}, with the location (x_ri, y_ri) for register r_i, K, and the initialization of cluster set S_cluster = {c_1, · · · , c_K}
Output: final result of cluster set S′_cluster = {c_1, · · · , c_K}
1: Create a cluster center list CE = {ce_1, · · · , ce_K}, ce_i is the center of cluster c_i
2: for i = 0 to K − 1 do
3:   CE[i] = Get Center(c_i, S_cluster)
4: end for
5: oldWL = 0
6: Calculate newWL with (4)
7: while |newWL − oldWL| > 1 do
8:   for i = 0 to K − 1 do
9:     c_i.clear()
10:   end for
11:   for i = 0 to S_register.size() − 1 do
12:     ce_flag = Closest Center(r_i, CE)
13:     c_flag.add(r_i)
14:   end for
15:   for i = 0 to K − 1 do
16:     CE[i] = Get Center(c_i, S_cluster)
17:   end for
18:   oldWL = newWL
19:   Calculate newWL with (4)
20: end while

3.3 Algorithm GSR

GSR is a simple yet effective method for clumping registers into clusters. The key idea of GSR is that the radius of each cluster cannot exceed a certain value. By limiting the radius of clusters, GSR clumps registers in a small bounding box for every cluster. Therefore, during the clustering process, the registers only join clusters within a certain range. The maximum distance between the registers and the center of the corresponding cluster is calculated by (7).
In this equation, Width and Length are the width and the length of the layout respectively, Size_ob is the total size of the obstacles on the layout, Total_r is the total number of registers on the layout, Max_f is the maximum fanout limit of the buffers, and α is a user-defined parameter.

$$Max_{dis} = \alpha \times \sqrt{\frac{Width \times Length - Size_{ob}}{Total_r / Max_f}}. \tag{7}$$

Algorithm 5 shows the detailed procedure of GSR. At first, the algorithm generates the first cluster by adding r_1 to it. The loading and fanout of cluster c_1 are updated in step 2, while the center of c_1 is updated in step 3. Then, the maximum distance Max_dis between the registers and the center of the corresponding cluster is calculated by (7). After that, a traversal is performed on the remaining registers from step 5 to step 16. Every register r_i looks for the nearest cluster c_f that satisfies the maximum distance, the maximum loading, and the maximum fanout constraints. If c_f exists, r_i joins c_f, and the loading, the fanout and the center of c_f are updated from step 7 to step 10. If no cluster satisfies the constraints, a new cluster is created and added to the cluster set S_cluster.

Algorithm 5.
GSR
Input: a set of registers S_register = {r_1, · · · , r_n}, with the location (x_ri, y_ri) for register r_i
Output: a set of clusters S_cluster = {c_1, · · · , c_m}
1: Create cluster c_1 and add r_1 into c_1
2: L_c1 = L_r1, F_c1 = 1
3: Calculate the center of c_1 by (2) and (3)
4: Compute Max_dis by (7)
5: for i from 2 to n do
6:   if r_i finds the nearest cluster c_f in S_cluster, condition: Dis(c_f, r_i) < Max_dis, L_cf + L_ri < MaxLoad and F_cf + 1 < MaxFanout then
7:     Add r_i into c_f
8:     L_cf = L_cf + L_ri
9:     F_cf = F_cf + 1
10:     Calculate the center of c_f by (2) and (3)
11:   else
12:     Create a new cluster c_j and add r_i into c_j
13:     L_cj = L_ri, F_cj = 1
14:     Calculate the center of c_j by (2) and (3)
15:   end if
16: end for

3.4 Blockage Avoidance

After the register clustering process, a local clock buffer is inserted at the center of each cluster. The center location of a cluster c is calculated by (2) and (3). However, the original center location may overlap with existing blockages in some cases. In order to solve this problem, we consider moving the center location in four directions (left, right, up, down) and calculate the distance between the original center location and the four borders of the blockage. The new center location is set by moving the original center location to the nearest border. The process is shown in Fig.4. Based on this refinement of center locations, we can guarantee that all the local clock buffers are inserted off the blockages. Moreover, the top-level CTS we adopt in this paper is also a blockage-avoiding flow. Therefore, no buffer overlaps with the blockages in the final clock trees.

Fig.4. Refinement of center locations. The black spot is the original center location while the gray spot is the new center location.

4 Buffer Allocation of Clusters

After the register clustering algorithm clumps registers into several clusters, a local clock buffer is allocated to each cluster to satisfy the slew constraint.
Then all the registers within a cluster are connected directly to the local clock buffer. In order to reduce the power consumption as much as possible, a buffer allocation algorithm is adopted.

4.1 Slew Models for Wire and Register

Slew Model for Wire. The wire slew model adopted in this paper is shown in (8), which follows [26]. It is applied to the situation in Fig.5. In (8), Sl_e is the slew degradation on the wire, and d_e is the wire delay, which is computed with the Elmore delay model.

Slew Model for Register. The input slew of register Sl(r) is modeled as (9) in [26]. It is decided by both the upstream buffer’s output slew Sl_{bu,out}(b) and the slew degradation Sl_e on the wire. In this paper, a look-up table is built through NGSPICE simulation to obtain a relatively accurate output slew of the upstream buffer. As shown in Fig.5, the buffer’s output slew, which is affected by the input slew and load capacitance, can be looked up from the look-up table.

Fig.5. Slew models for wire and register. (Diagram labels: input slew and output slew of buffer b, slew degradation on wire, input slew of register r.)

$$Sl_e = \ln 9 \times d_e, \tag{8}$$

$$Sl(r) = \sqrt{Sl_{bu,out}(b)^2 + Sl_e^2}. \tag{9}$$

From (8) and (9), we can see that the worst register input slew within a cluster appears at the register which has the longest path to the local clock buffer. Therefore, we need to insert a local clock buffer in the cluster at a minimum power cost to satisfy the worst register’s input slew.

4.2 Buffer Allocation Algorithm

In this paper, the two buffers provided in the ISPD contests 2009 and 2010 are adopted as the basic buffers. To extend the drive strength and increase the diversity of buffers, a buffer library B is generated by paralleling the basic buffers, as shown in Table 1. The buffer with bigger output capacitance drives a longer wire for the same slew constraint, given the same input slew and load capacitance.
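The slew check implied by (8) and (9) can be sketched in Python as follows. This is an illustrative sketch only. The Elmore delay expression for a single uniform wire segment driving a lumped load is a standard textbook form assumed here, since the paper does not spell it out, and the function names and the `out_slew_of` callable (standing in for the NGSPICE-derived look-up table) are hypothetical.

```python
import math


def elmore_wire_delay(length, r_unit, c_unit, c_load):
    """Elmore delay of a uniform wire driving a lumped load (assumed form).

    r_unit and c_unit are per-unit-length resistance and capacitance.
    """
    return r_unit * length * (c_unit * length / 2.0 + c_load)


def wire_slew(d_e):
    """Eq. (8): slew degradation on the wire, Sl_e = ln 9 * d_e."""
    return math.log(9) * d_e


def register_input_slew(buf_out_slew, sl_e):
    """Eq. (9): root-sum-square of buffer output slew and wire slew."""
    return math.hypot(buf_out_slew, sl_e)


def pick_buffer(candidates, out_slew_of, sl_e, slew_limit):
    """Return the first candidate whose register input slew meets the limit.

    candidates: buffers in the order they should be tried.
    out_slew_of: hypothetical stand-in for the look-up table, mapping a
    buffer to its output slew. Returns None if no candidate qualifies.
    """
    for b in candidates:
        if register_input_slew(out_slew_of(b), sl_e) <= slew_limit:
            return b
    return None
```

Because the check is evaluated only for the register with the longest path to the buffer, one call per cluster is enough under this model; passing the candidates in ascending order of driving strength makes the first hit the cheapest buffer that meets the limit.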
The detailed procedure of our buffer allocation algorithm is shown in Algorithm 6. The purpose of the algorithm is to allocate a local clock buffer to each cluster at a minimum power cost while satisfying the slew constraint at the same time. At first, we sort the buffers in the buffer library B in ascending order of their driving strength (in step 1). If there is only one register in the cluster, no buffer is inserted at the cluster’s center because the input slew of the register will be satisfied by the top-level CTS (in steps 3∼6). Steps 7∼8 find the register which has the longest path to the buffer and compute the wire slew with (8). Steps 9∼18 test the buffers in B in ascending order of their driving strength. In this way, no slew violation occurs within the clusters and the power consumption cost is reduced to a minimum.

Algorithm 6. Buffer Allocation
Input: a set of clusters S_cluster = {c_1, · · · , c_K} and a buffer library B
Output: specific buffer type for each cluster in S_cluster
1: Sort the buffers in B in ascending order of their driving strength
2: for i = 0 to K − 1 do
3:   if c_i.size() = 1 then
4:     c_i has no buffer
5:     Continue
6:   else
7:     Find the farthest register r_f from the buffer
8:     Compute the wire slew with (8)
9:     for each buffer b in B do
10:       Look up the output slew of b in the look-up table
11:       Compute input slew Sl_f of r_f with (9)
12:       if Sl_f > SlewLimit then
13:         Continue
14:       else
15:         Set b as the buffer type of c_i
16:         Break
17:       end if
18:     end for
19:   end if
20: end for

5 Suitability of Our Register Clustering Methodology

The register clustering process in our approach is not only an algorithm but also a methodology. As a pre-process of the registers, our register clustering methodology can be easily integrated into any particular CTS method for power reduction. Given a CTS method, the input of the flow is a set of registers.
After register clustering, the set of local buffers becomes the input registers for the given CTS method. No other optimization algorithm in the CTS method needs to be modified.

6 Experiments

6.1 Experimental Setup

We implemented our algorithms in C++ on a 2.33 GHz Intel® Xeon® Linux workstation with 8 GB memory. Meanwhile, in order to compare our algorithms with Shelar's algorithm[21], which is an efficient register clustering algorithm, we also implemented that algorithm according to the description in [21]. KMR, KSR, GSR and Shelar's algorithm are four independent register clustering algorithms, and thus we integrate each of them into a classical CTS flow[24] to test the effectiveness of our register clustering methodology. Some improvements have been made to the CTS approach in [24], including signal polarity correction and better performance of the sample technology for buffer insertion. The experimental flow is shown in Fig.6. All the conditions and configurations on the ISPD benchmarks are set up exactly as in [24]. Wire sizing and process variation are not taken into consideration in the algorithms. All configurations of the ISPD benchmarks remain unchanged except that the wire library consists only of wire 0 and the parameters for process variation are disabled. The final results are obtained by NGSPICE simulation.

Table 1. Buffer Library B
ID   Basic Buffer Type   Parallel Number   Input Cap (fF)   Output Cap (fF)   Output Res (Ω)
0    0                   1                 35.0             80.0              61.200
1    1                   1                 4.2              6.1               440.000
2    0                   2                 70.0             160.0             30.600
3    0                   3                 105.0            240.0             20.400
4    0                   4                 140.0            320.0             15.300
5    1                   2                 8.4              12.2              220.000
6    1                   3                 12.6             18.3              146.667
7    1                   4                 16.8             24.4              110.000
Note: input capacitance (input cap) and output capacitance (output cap) are measured in femtoFarads (fF). Output resistance (output res) is measured in Ω.
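To make Algorithm 6 of Subsection 4.2 concrete with the library of Table 1, the following Python sketch builds the library by paralleling the two basic buffers and then picks, for each cluster, the weakest buffer that meets the slew limit. Several parts are assumptions rather than the paper's implementation: the names `Buffer`, `parallel`, `output_slew_ps` and `allocate` are hypothetical; drive strength is approximated by a lower output resistance; and the NGSPICE look-up table is replaced by a single-pole RC estimate of the 10%-90% output slew, $\ln 9 \cdot R \cdot C$. Each cluster is described only by its register count, its load capacitance, and the Elmore delay to its farthest register.

```python
import math
from dataclasses import dataclass

LN9 = math.log(9.0)


@dataclass(frozen=True)
class Buffer:
    ident: int
    in_cap: float   # fF
    out_cap: float  # fF
    out_res: float  # ohm


def parallel(ident, base, n):
    """n copies of a basic buffer in parallel: caps scale by n, resistance by 1/n."""
    return Buffer(ident, base.in_cap * n, base.out_cap * n, base.out_res / n)


# The two basic buffers, with the values of rows 0 and 1 of Table 1.
BASIC = {0: Buffer(0, 35.0, 80.0, 61.2), 1: Buffer(1, 4.2, 6.1, 440.0)}
# (ID, basic buffer type, parallel number) as listed in Table 1.
ROWS = [(0, 0, 1), (1, 1, 1), (2, 0, 2), (3, 0, 3),
        (4, 0, 4), (5, 1, 2), (6, 1, 3), (7, 1, 4)]
LIBRARY = [parallel(i, BASIC[t], n) for i, t, n in ROWS]


def output_slew_ps(buf, load_ff):
    """ASSUMED stand-in for the look-up table: single-pole RC 10%-90% slew."""
    return LN9 * buf.out_res * (buf.out_cap + load_ff) * 1e-3  # ohm*fF = 1e-3 ps


def allocate(clusters, library, slew_limit_ps):
    """clusters: list of (num_registers, load_fF, farthest_wire_delay_ps).

    Returns one buffer ID per cluster, or None when no buffer is assigned
    (single-register cluster, or no buffer in the library meets the limit).
    """
    # Step 1: ascending drive strength, taken here as descending output resistance.
    ordered = sorted(library, key=lambda b: b.out_res, reverse=True)
    result = []
    for size, load, d_e in clusters:
        if size == 1:                      # steps 3-5: left to the top-level CTS
            result.append(None)
            continue
        wire = LN9 * d_e                   # step 8, Eq. (8)
        chosen = None
        for buf in ordered:                # steps 9-18
            slew = math.hypot(output_slew_ps(buf, load), wire)  # Eq. (9)
            if slew <= slew_limit_ps:
                chosen = buf.ident
                break
        result.append(chosen)
    return result


# Hypothetical clusters: one with 8 registers, one with a single register.
assignment = allocate([(8, 100.0, 5.0), (1, 4.2, 0.0)], LIBRARY, 100.0)
```

Stopping at the first feasible buffer in this ordering mirrors the Break in step 16; whether the weakest feasible buffer is also the cheapest in power depends on the real look-up table and cost model, which this sketch does not reproduce.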
[Flow diagram: Input → Register Clustering Algorithm (KMR, KSR, GSR or [21]) → Buffer Allocation Algorithm → CTS Approach in [24] → Result]

Fig.6. Experimental flow for evaluating our register clustering algorithms.

6.2 Experimental Results

A comparison is made between the experimental results of the CTS flows with and without Shelar's algorithm and our register clustering algorithms. We perform our experiments on two ISPD 2009 benchmarks and eight ISPD 2010 benchmarks. Table 2 demonstrates the effectiveness of our register clustering methodology clearly. The third column shows five different flows: [24] represents the CTS flow without a register clustering process; [21], KMR, KSR and GSR represent the CTS flows with Shelar's algorithm[21], KMR, KSR and GSR respectively. From the table, we can see that KMR, KSR and GSR all achieve a significant power reduction on every benchmark. Compared with [24], KMR, KSR and GSR achieve 23%, 29% and 31% reduction in total power dissipation on average respectively. Meanwhile, the average skew is maintained or slightly reduced by the three algorithms, which indicates that the local skew within the clusters is limited to a small range by our high-quality register clustering algorithms. The maximum latency is also reduced by 8%, 6% and 5% with KMR, KSR and GSR respectively. Moreover, the total runtime is reduced by almost an order of magnitude with the three algorithms because the clustering procedure leaves far fewer leaf nodes for the top-level CTS than before. The last column of the table is the runtime of the register clustering algorithms. It should be noted that the clustering procedure finishes in seconds on all the benchmark circuits. Compared with Shelar's algorithm[21], which is also an excellent register clustering algorithm, our algorithms are more effective in cutting down the power dissipation of clock trees. The benchmarks 09f32 and 09fnb1 contain more blockages on the layout.
Thus the experiments on these two benchmarks show the effectiveness of our algorithms on benchmarks with many blockages.

Table 2. Comparison of Results on CTS Flows with and Without Our Register Clustering Algorithms on ISPD 2010 Benchmarks
Benchmark    # of Register   Algorithm   Skew (ps)   Power (fF)    MaxLatency (ps)   Total CPU (s)   Clustering CPU (s)
10f01        1 107           [24]        45.22       194 024.00    798.97            106.00          -
                             [21]        79.19       151 773.00    791.25            15.65           0.65
                             KMR         30.67       153 885.00    756.28            4.14            0.35
                             KSR         46.81       146 083.00    779.67            24.12           0.98
                             GSR         34.67       142 847.00    763.59            18.97           0.31
10f02        2 249           [24]        36.62       399 411.00    941.17            183.00          -
                             [21]        195.11      295 862.00    1 088.21          30.73           1.23
                             KMR         37.77       287 330.00    901.55            11.24           1.92
                             KSR         47.19       272 822.00    924.48            51.82           6.25
                             GSR         50.15       267 875.00    963.18            27.39           0.96
10f03        1 200           [24]        30.92       60 972.10     456.98            127.00          -
                             [21]        32.57       60 895.80     430.23            11.20           0.78
                             KMR         31.36       46 214.70     426.90            3.24            0.26
                             KSR         39.68       39 257.50     434.75            10.13           1.15
                             GSR         16.89       40 542.00     426.94            10.86           0.22
10f04        1 845           [24]        36.03       91 432.40     493.18            208.00          -
                             [21]        106.52      104 270.00    532.45            22.43           1.20
                             KMR         49.88       83 567.70     441.18            10.29           1.68
                             KSR         37.50       60 849.00     481.49            37.24           3.47
                             GSR         44.58       59 142.30     460.74            20.07           0.35
10f05        1 016           [24]        37.20       41 840.10     509.30            107.00          -
                             [21]        64.40       48 114.30     496.35            8.75            0.25
                             KMR         26.26       35 890.40     438.91            4.41            0.27
                             KSR         46.76       31 938.80     472.96            12.32           0.87
                             GSR         31.74       32 918.10     459.57            10.56           0.20
10f06        1 981           [24]        40.75       51 217.90     457.53            94.00           -
                             [21]        65.27       47 920.00     446.85            5.67            0.56
                             KMR         47.04       36 390.30     413.82            3.30            0.30
                             KSR         15.63       36 125.10     389.61            3.42            1.34
                             GSR         37.22       34 526.80     423.38            10.92           0.14
10f07        1 915           [24]        30.41       82 566.70     468.25            217.00          -
                             [21]        40.92       82 807.00     444.73            6.78            0.55
                             KMR         49.74       71 328.00     448.62            10.95           0.57
                             KSR         26.73       65 253.00     439.09            5.78            2.35
                             GSR         43.82       58 168.50     448.13            18.25           0.43
10f08        1 134           [24]        42.90       57 927.40     458.17            122.00          -
                             [21]        31.63       55 611.70     412.75            5.21            0.42
                             KMR         26.58       43 879.40     403.34            3.29            0.37
                             KSR         35.50       42 767.90     399.12            3.99            1.45
                             GSR         28.55       42 468.80     399.72            10.42           0.19
09f32        1 190           [24]        30.55       138 456.00    1 295.63          65.00           -
                             [21]        140.88      103 463.00    1 353.67          3.12            0.22
                             KMR         29.33       118 460.00    1 289.51          3.45            0.17
                             KSR         40.63       120 417.00    1 287.43          3.79            0.89
                             GSR         45.06       111 615.00    1 260.34          4.42            0.18
09fnb1       1 330           [24]        40.41       30 501.50     465.65            87.00           -
                             [21]        31.63       28 387.60     417.75            3.02            0.23
                             KMR         11.97       25 985.10     408.69            4.12            0.27
                             KSR         27.85       25 583.70     403.45            3.89            0.87
                             GSR         24.72       24 230.20     405.34            4.22            0.23
Comparison                   [24]        1.00        1.00          1.00              1.00            -
                             [21]        2.05        0.87          1.01              0.09            1.00
                             KMR         1.00        0.77          0.92              0.04            1.01
                             KSR         0.99        0.71          0.94              0.13            3.17
                             GSR         0.96        0.69          0.95              0.11            0.50
Note: # of register means the total number of registers in the corresponding benchmark. MaxLatency means the maximum latency from the source to registers in the corresponding result. Skew and maximum latency are measured in picoseconds (ps), power is measured in femtoFarads (fF), and runtime is measured in seconds. Total CPU is the total runtime of the CTS and clustering CPU is the runtime of clustering process, which is included in total CPU.

In this paper, we focus on the power reduction of clock trees brought by our register clustering methodology. Therefore, we do not introduce many techniques to reduce the skew of the clock tree. Within the clusters, the registers are directly connected to the local clock buffers, and no skew tuning techniques such as wire snaking are performed in the top-level CTS. On the other hand, the high quality of our register clustering algorithms guarantees that the global skew is kept within a reasonable range. For instance, with our KSR algorithm, the clock skew decreases on four benchmarks and increases on the other six. However, the clock skew is still reduced by 1.2% on average.
In the future, we will develop an integrated CTS system based on our register clustering methodology, which includes local CTS within the clusters and skew tuning techniques such as wire snaking. Figs.7∼9 show the clustering results of benchmark 10f01 generated by KMR, KSR and GSR respectively. In the three figures, gray rectangles represent the obstacles in the layout, and black rectangles represent the bounding boxes of clusters. It is also worth mentioning that all the experimental results satisfy the constraints of the ISPD contest: 1) no slew-rate violation occurs at the input of the registers and the buffers; 2) no buffer overlaps with the existing blockages; 3) the signal polarities of all the registers are exactly the same as that of the clock source. All the above experimental results demonstrate that our register clustering methodology is effective in reducing the power consumption of the clock tree without affecting the clock skew while optimizing the maximum latency at the same time.

Fig.7. Clustering results of KMR algorithm on 10f01.

Fig.8. Clustering results of KSR algorithm on 10f01.

Fig.9. Clustering results of GSR algorithm on 10f01.

7 Conclusions

We presented an efficient register clustering methodology to reduce the power dissipation of the clock tree without affecting the clock skew while optimizing the maximum latency at the same time. Three different clustering algorithms called KMR, KSR and GSR were developed to verify the validity of the proposed register clustering methodology. All three algorithms achieve significant power reduction on every benchmark of ISPD 2010 CNS contest. As the most effective method among the three algorithms, GSR algorithm achieves a 31% reduction in power consumption as well as a 4% reduction in skew and a 5% reduction in maximum latency.
Moreover, the total runtime of the CTS flow with our register clustering algorithms is reduced by almost an order of magnitude. In addition, no negative influence is brought to the signal nets because there is no register movement in our algorithms. In the future, we will try to develop an integrated low-power CTS flow based on our register clustering algorithms. More algorithms, such as the routing algorithm inside the clusters, will be involved in the CTS system. Meanwhile, we will take more factors into consideration, such as wire sizing, wire snaking, and process variation.

References

[1] Pedram M, Rabaey J M. Power Aware Design Methodologies. Kluwer Academic Publisher, 2002.
[2] Cheon Y, Ho P H, Kahng A B, Reda S, Wang Q. Power-aware placement. In Proc. the 42nd Annual Design Automation Conference, Jun. 2005, pp.795–800.
[3] Donno M, Macii E, Mazzoni L. Power-aware clock tree planning. In Proc. the 2004 International Symposium on Physical Design, April 2004, pp.138–147.
[4] Lam T K, Yang X, Tang W C, Wu Y L. On applying erroneous clock gating conditions to further cut down power. In Proc. the 16th Asia and South Pacific Design Automation Conference, Jan. 2011, pp.509–514.
[5] Lu J, Mao X, Taskin B. Clock mesh synthesis with gated local trees and activity driven register clustering. In Proc. IEEE/ACM International Conference on Computer-Aided Design, Nov. 2012, pp.691–697.
[6] Igarashi M, Usami K, Nogami K et al. A low-power design method using multiple supply voltages. In Proc. the 1997 International Symposium on Low Power Electronics and Design, Aug. 1997, pp.36–41.
[7] Lin K Y, Lin H T, Ho T Y. An efficient algorithm of adjustable delay buffer insertion for clock skew minimization in multiple dynamic supply voltage designs. In Proc. the 16th Asia and South Pacific Design Automation Conference, Jan. 2011, pp.825–830.
[8] Li L, Sun J, Lu Y, Zhou H, Zeng X. Low power discrete voltage assignment under clock skew scheduling. In Proc. the 16th Asia and South Pacific Design Automation Conference, Jan. 2011, pp.515–520.
[9] Chao T H, Hsu Y C, Ho J M, Kahng A. Zero skew clock routing with minimum wirelength. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 1992, 39(11): 799–814.
[10] Liu W H, Li Y L, Chen H C. Minimizing clock latency range in robust clock tree synthesis. In Proc. the 15th Asia and South Pacific Design Automation Conference, Jan. 2010, pp.389–394.
[11] Shih X W, Cheng C C, Ho Y K, Chang Y W. Blockage-avoiding buffered clock-tree synthesis for clock latency-range and skew minimization. In Proc. the 15th Asia and South Pacific Design Automation Conference, Jan. 2010, pp.395–400.
[12] Lee D J, Markov I L. Contango: Integrated optimization of SoC clock network. In Proc. the 2010 Conference on Design, Automation and Test in Europe, Mar. 2010, pp.1468–1473.
[13] Rakai L, Farshidi A, Behjat L, Westwick D. Buffer sizing for clock networks using robust geometric programming considering variations in buffer sizes. In Proc. the 2013 ACM International Symposium on Physical Design, Mar. 2013, pp.154–161.
[14] Singh J, Nookala V, Luo Z Q, Sapatnekar S. Robust gate sizing by geometric programming. In Proc. the 42nd Annual Design Automation Conference, Jun. 2005, pp.315–320.
[15] Vittal A, Marek-Sadowska M. Low-power buffered clock tree design. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 1997, 16(9): 965–975.
[16] Lillis J, Cheng C K, Lin T T Y. Optimal wire sizing and buffer insertion for low power and a generalized delay model. IEEE Journal of Solid-State Circuits, 1996, 31(3): 437–447.
[17] Lu Y, Sze C, Hong X, Zhou Q, Cai Y, Huang L, Hu J. Register placement for low power clock network. In Proc. the 2005 Asia and South Pacific Design Automation Conference, Jan. 2005, pp.588–593.
[18] Hou W, Liu D, Ho P H. Automatic register banking for low-power clock trees. In Proc. the 10th International Symposium on Quality Electronic Design, Mar. 2009, pp.647–652.
[19] Papa D, Alpert C, Sze C, Li Z, Viswanathan N, Nam G J, Markov I L. Physical synthesis with clock-network optimization for large systems on chips. IEEE Micro, 2011, 31(4): 51–62.
[20] Mehta A D, Chen Y P, Menezes N, Wong D, Pileggi L. Clustering and load balancing for buffered clock tree synthesis. In Proc. the 1997 IEEE International Conference on Computer Design: VLSI in Computers and Processors, Oct. 1997, pp.217–223.
[21] Shelar R S. An efficient clustering algorithm for low power clock tree synthesis. In Proc. the 2007 International Symposium on Physical Design, Mar. 2007, pp.181–188.
[22] Mitchell T. Machine Learning. McGraw Hill, 1997.
[23] Selim S Z, Ismail M A. K-means-type algorithms: A generalized convergence theorem and characterization of local optimality. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1984, PAMI-6(1): 81–87.
[24] Niu F, Zhou Q, Yao H, Cai Y, Yang J, Sze C N. Obstacle-avoiding and slew-constrained buffered clock tree synthesis for skew optimization. In Proc. the 21st Edition of the Great Lakes Symposium on VLSI, May 2011, pp.199–204.
[25] Cormen T H, Leiserson C E, Rivest R L, Stein C. Introduction to Algorithms. Prentice-Hall India, 1998.
[26] Hu S, Alpert C J, Hu J, Karandikar S K, Li Z, Shi W, Sze C N. Fast algorithms for slew-constrained minimum cost buffering. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2007, 26(11): 2009–2022.

Chao Deng received his B.S. degree in computer science and technology from Harbin Institute of Technology (HIT), Harbin, in 2010. He is currently pursuing his Ph.D. degree from the EDA Lab, Tsinghua University, Beijing. His research interests include clock network synthesis and optimization.
Yi-Ci Cai is a professor in the Department of Computer Science and Technology, Tsinghua University, Beijing. She received her B.S. degree in electronic engineering from Tsinghua University in 1983, M.S. degree in computer science and technology from Tsinghua University in 1986, and Ph.D. degree in computer science from the University of Science and Technology of China, Hefei, in 2007. Her research interests include design automation for VLSI integrated circuits algorithms and theory, power/ground distribution network analysis and optimization, high performance clock synthesis, and low power physical design.

Qiang Zhou received his B.S. degree in computer science and technology from the University of Science and Technology of China, Hefei, in 1983, M.S. degree in computer science and technology from Tsinghua University, Beijing, in 1986, and Ph.D. degree in control theory and control engineering from Chinese University of Mining and Technology, Beijing, in 2002. His research interests include VLSI layout theory and algorithms.