Register Clustering Methodology for Low Power Clock Tree

Deng C, Cai YC, Zhou Q. Register clustering methodology for low power clock tree synthesis. JOURNAL OF COMPUTER
SCIENCE AND TECHNOLOGY 30(2): 391–403 Mar. 2015. DOI 10.1007/s11390-015-1531-4
Register Clustering Methodology for Low Power Clock Tree Synthesis
Chao Deng (邓 超), Student Member, IEEE, Yi-Ci Cai (蔡懿慈), Senior Member, CCF, ACM, IEEE, and
Qiang Zhou (周 强), Senior Member, CCF, ACM, IEEE
Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
E-mail: [email protected]; {caiyc, zhouqiang}@mail.tsinghua.edu.cn
Received March 26, 2014; revised November 24, 2014.
Abstract Clock networks dissipate a significant fraction of the entire chip power budget. Therefore, the optimization for
power consumption of clock networks has become one of the most important objectives in high performance IC designs. In
contrast to most of the traditional studies that handle this problem with clock routing or buffer insertion strategy, this paper
proposes a novel register clustering methodology in generating the leaf level topology of the clock tree to reduce the power
consumption. Three register clustering algorithms called KMR, KSR and GSR are developed and a comprehensive study
of them is discussed in this paper. Meanwhile, a buffer allocation algorithm is proposed to satisfy the slew constraint within
the clusters at a minimum cost of power consumption. We integrate our algorithms into a classical clock tree synthesis
(CTS) flow to test the register clustering methodology on ISPD 2010 benchmark circuits. Experimental results show that
all the three register clustering algorithms achieve more than 20% reduction in power consumption without affecting the
skew and the maximum latency of the clock tree. As the most effective method among the three algorithms, GSR algorithm
achieves a 31% reduction in power consumption as well as a 4% reduction in skew and a 5% reduction in maximum latency.
Moreover, the total runtime of the CTS flow with our register clustering algorithms is significantly reduced by almost an
order of magnitude.
Keywords low power, register clustering, clock tree synthesis
1 Introduction
1.1 Motivation
With the increasingly high degree of integration and
fast clock frequency in VLSI technology, power dissipation has become a crucial concern in modern IC design,
especially for the portable devices which have strict requirements on power density and battery life. Dynamic
power is the power dissipation during the process of
charging and discharging the load capacitance and occupies the largest fraction of the total power consumption in most cases. Dynamic power is calculated by
(1), where α is a constant, V is the supply voltage,
Cload is the total load capacitance including both the
wire capacitance and the gate input pin capacitance,
and f is the switching frequency[1] . Among all circuit
elements, the clock network is a major contributor to
dynamic power dissipation because of its huge fanout
size and frequent switching activities. A clock network
typically consumes as much as 40% of the entire chip
power budget[2-3] . Therefore, many researchers have
paid attention to the power optimization of the clock
networks according to the parameters formulated in (1).
$$P_{dynamic} = \alpha V^{2} C_{load} f. \qquad (1)$$
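As a small illustration of (1), the following Python sketch evaluates the formula and shows that, with α, V and f held fixed, dynamic power scales linearly with the load capacitance. The function name and all numeric values are illustrative assumptions and are not taken from the paper.

```python
def dynamic_power(alpha: float, v: float, c_load: float, f: float) -> float:
    """Dynamic power according to (1): P = alpha * V^2 * C_load * f."""
    return alpha * v * v * c_load * f


# Illustrative values only (assumptions, not data from the paper).
base = dynamic_power(alpha=1.0, v=1.0, c_load=100e-12, f=1e9)
reduced = dynamic_power(alpha=1.0, v=1.0, c_load=80e-12, f=1e9)

# Cutting the load capacitance by 20% cuts the dynamic power by the same ratio.
print(reduced / base)
```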
Clock gating[3-5] and multiple-supply voltage[6-8]
are two commonly-used technologies to reduce the
power consumption of a clock network. Clock signal
is continuously switching while the data need not be
loaded into registers in every clock period. Clock gating methods disable the clock signals from the inactive
registers in idle clock periods by inserting some control
gates and control signals into the clock network. Thus
the power dissipation of the gated subtrees can be saved
to some degree. However, the control gates themselves
will consume a certain amount of power during the
switching of the control signals, which could possibly
cancel out the power reduction brought by clock gating. Multiple-supply voltage methods reduce the power
consumption by allocating different supply voltages to
different areas or at different periods depending on various performance requirements. In this paper, we focus
on cutting down the power dissipation by reducing the
load capacitance of the clock tree without modifying
the netlist or supply voltages.
Regular Paper
This work was supported by the National Natural Science Foundation of China under Grant No. 61274031.
©2015 Springer Science + Business Media, LLC & Science Press, China
The load capacitance of a clock tree consists of
the interconnect capacitance, the register (clock sink)
capacitance, and the buffer capacitance. Traditional
clock tree construction methods handle the power optimization problem by clock routing or buffer sizing.
Chao et al.[9] proposed the Deferred Merge Embedding
(DME) algorithm which is widely used in zero-skew
clock routing with minimum wirelength[10-12]. Because buffers are a
non-negligible part of the load capacitance, recent research[13-16] also attaches great importance to reducing
buffer power. Buffers are inserted along the
paths in the clock tree at a minimum cost of power dissipation, while satisfying the constraints of skew and
slew at the same time.
Despite the fact that clock routing based on DME
algorithm can achieve the minimum wirelength of a binary clock tree, and buffer sizing technology can cut
down the buffer capacitance to some extent, the effect
of reducing the total power dissipation is limited by
the binary tree structure. Cheon et al. indicated that
most of the clock tree capacitance (about 80%) is at the
leaf level, which includes all the registers and the wires
connecting them and the driving buffers[2]. We measured
the distribution of the clock tree capacitance on the
eight ISPD 2010 benchmark circuits. The result in Fig.1 shows that buffer capacitance is the major
contributor towards the total capacitance of the clock
tree. To our knowledge, buffers are inserted into the
clock tree not only for the avoidance of slew constraint
violation but also for the optimization of skew and delay. Quite a few of the buffers are inserted at the leaf
level to tune the delay of the registers and thus to reduce the skew. Therefore, an effective way of reducing
clock tree capacitance is to reduce the capacitance at
the leaf level. The register clustering algorithm clumps
the registers into several leaf clusters and allocates a local clock buffer for each cluster. Thus the capacitance
of buffers at the leaf level is significantly reduced and
the potential skew of the clock tree is limited to a small
range when an efficient register clustering algorithm is
adopted.
J. Comput. Sci. & Technol., Mar. 2015, Vol.30, No.2
Fig.1. Average distribution of clock tree capacitance on ISPD 2010 benchmarks. (Pie chart with three slices labeled Buffer Capacitance, Wire Capacitance, and Register Capacitance; the slice values are 14%, 28%, and 58%.)
1.2 Previous Work
As we know, clumping the registers into clusters is
not difficult. The real difficulty lies in reducing the
negative effects on the skew/delay and the signal nets
caused by the clustering procedure. Lu et al.[17] proposed a register placement algorithm to further reduce clock routing wirelength and power. Hou
and Liu[18] presented an automatic register placement
technique that enables the synthesis of low-power clock
trees for low-power ICs. The register clustering algorithms proposed in [19] and [2] clump the registers into
several clusters after the original placement and shrink
the clusters proportionally in order to reduce the wirelength and potential skew. However, there exist not
only clock nets but also signal nets on the chip. The
movement of registers often brings negative influence
on the signal nets such as timing[19] and power[2] problems. Then much additional effort must be made to
eliminate the negative effects brought by the register
movement. Moreover, shrinking the clock network
will increase the local congestion and thus possibly lead
to a routing failure in subsequent steps of the physical design flow. A slicing-based clustering algorithm
was proposed in [20], which obtains clusters of approximately equal capacitance loading. Identical buffers are
allocated to the clusters for process variation and skew
optimization. However, the power dissipation of the
clock tree is not taken into consideration in the algorithm. Shelar[21] proposed a polynomial time greedy
clustering algorithm for low power clock tree synthesis.
However, the influence on skew results brought by the
clustering is not discussed in the paper.
1.3 Our Contributions
In this paper, we focus on cutting down the total
power dissipation of the clock tree without affecting the
skew/delay and the signal nets. In order to verify the
validity of our register clustering methodology, we develop three independent register clustering algorithms
called KMR, KSR and GSR to generate the leaf level
topology of the clock tree. The three algorithms are
Chao Deng et al.: Register Clustering Methodology for Low Power Clock Tree Synthesis
named by the key techniques adopted in them. KMR is
short for K-Means Based Register Clustering algorithm,
KSR is short for K-Splitting Based Register Clustering
algorithm and GSR is short for Greedy Search Based
Register Clustering algorithm. A local clock buffer is
inserted at the center of each cluster and the registers within the cluster are connected directly with the
buffer. The register clustering problem for the clock
tree is similar to the unsupervised learning problem in
the field of machine learning[22] . The KMR algorithm
borrows some ideas from a classical unsupervised learning algorithm which is called K-means algorithm[23] .
By the iterations of updating the cluster centers, the
KMR algorithm can clump the registers in an optimal
mode, thus limiting the local skew to a small range and
minimizing the leaf level wirelength at the same time.
A “pseudo center” technology is introduced into KMR
to prevent the algorithm from failing when the registers to be clustered are unevenly distributed. To our
knowledge, the K-means algorithm has several disadvantages including that the number of clusters to be
generated is difficult to determine and the result is sensitive to the initialization of the cluster centers. In order
to overcome these shortcomings, we improve the KMR
algorithm to the KSR algorithm by introducing a “K-Splitting” algorithm as a pre-process. GSR algorithm
is a greedy algorithm which starts with one register and
generates the clusters by absorbing the adjacent registers until the maximum loading or maximum fanout
limitation is reached. It is worth mentioning that no
negative effect is brought to the signal nets because
there is no movement of registers in our algorithms. At
last, a buffer allocation algorithm is proposed to satisfy
the slew constraint within the clusters at a minimum
cost of power consumption.
KMR, KSR and GSR are three independent register clustering algorithms, and thus we integrate them
into a classical CTS (clock tree synthesis) flow[24] respectively to test the effectiveness of our register clustering methodology. The contributions of this paper
are summarized as follows.
• We propose a novel register clustering methodology for power reduction of the clock tree, which can be
easily integrated into any particular CTS approach.
• We develop three independent register clustering
algorithms to verify the validity of our register clustering methodology, which includes: 1) we propose
KMR, which reduces the power consumption significantly without affecting the skew and the maximum
latency of the clock tree; 2) we develop KSR, which
improves the result of KMR algorithm by introducing
a “K-Splitting” algorithm to determine the number of
clusters and generate a good initialization for the clustering procedure; 3) we introduce GSR algorithm for
register clustering, which not only cuts down the power
dissipation but also optimizes the skew and the maximum latency of the clock tree at the same time.
• A buffer allocation algorithm is proposed to insert
local clock buffers into the clusters, which satisfies the
slew constraint at a minimum cost of power consumption.
The rest of this paper is organized as follows. Section 2 defines the problem to be solved in this paper
and gives a problem formulation. Section 3 presents
the details of our register clustering algorithms. Section 4 presents our buffer allocation algorithm. Section 5 discusses the suitability of our register clustering
methodology. Section 6 reports the experimental setup
and the experimental results. Finally Section 7 concludes our work.
2 Problem Formulation
As we mentioned before, the total capacitance of
a clock tree consists of the interconnect capacitance,
the register capacitance, and the buffer capacitance.
For a given layout, the register capacitance is certain.
The buffer capacitance of the whole clock tree is determined by several elements including the load capacitance driven by the buffers, the skew and slew requirements during the optimization of the CTS flow, etc. By
the analysis in Section 1, plenty of buffers are inserted
at the leaf level of the clock tree to tune the delay of
the registers and thus to reduce the skew. Therefore,
an efficient register clustering algorithm is necessary to
cut down the number of buffers at the leaf level. Meanwhile, a register clustering algorithm of high quality can
clump the registers in an optimal mode, thus limiting
the local skew to a small range and reducing the interconnect capacitance at the same time. The wirelength
at the leaf level of the clock tree is a good evaluation
standard for the quality of the register clustering algorithm. Therefore, we formulate the power reduction problem as a register clustering problem that aims at the
minimization of the wirelength at the leaf level of the
clock tree. All lengths and distances are Manhattan
distance in this paper.
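The wirelength measure used as the objective in this section can be illustrated with a short Python sketch. It assumes that each cluster center is the coordinate-wise mean of its registers and that the leaf-level wirelength is the sum of Manhattan distances from every register to its cluster center, as formalized in (2)∼(4). The function names and the sample coordinates are assumptions made for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def cluster_center(cluster: List[Point]) -> Point:
    """Coordinate-wise mean of the registers in a non-empty cluster."""
    n = len(cluster)
    return (sum(x for x, _ in cluster) / n, sum(y for _, y in cluster) / n)


def leaf_wirelength(clusters: List[List[Point]]) -> float:
    """Sum of Manhattan distances from each register to its cluster center."""
    total = 0.0
    for cluster in clusters:
        cx, cy = cluster_center(cluster)
        total += sum(abs(x - cx) + abs(y - cy) for x, y in cluster)
    return total


# Hypothetical register locations, grouped into two clusters.
clusters = [[(0, 0), (2, 0)], [(10, 10), (10, 14)]]
print(leaf_wirelength(clusters))
```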
• Register Clustering Problem.
Input: a set of registers Sregister = {r1 , · · · , rn },
with the location (xri , yri ) for register ri .
Output: a set of clusters Scluster = {c1 , · · · , cm }.
The center location of cluster ci is calculated by the
following two equations.
$$x_{c_i} = \Big(\sum_{j=1}^{p} x_{r_j}\Big) / p, \quad r_j \in c_i,\ j = 1, \cdots, p, \qquad (2)$$
$$y_{c_i} = \Big(\sum_{j=1}^{p} y_{r_j}\Big) / p, \quad r_j \in c_i,\ j = 1, \cdots, p. \qquad (3)$$
Objective: to minimize the total wirelength at the
leaf level of the clock tree, which is described by the
following equation.
$$WL = \sum_{i=1}^{m} \sum_{r_j \in c_i} \big(|x_{r_j} - x_{c_i}| + |y_{r_j} - y_{c_i}|\big). \qquad (4)$$
Subject to:
$$c_i \neq \emptyset, \quad i = 1, \cdots, m,$$
$$c_i \cap c_j = \emptyset, \quad i, j \in \{1, \cdots, m\} \text{ and } i \neq j,$$
$$\bigcup_{i=1}^{m} c_i = S_{register}.$$
3 Register Clustering Algorithms for Power Reduction
3.1 Algorithm KMR
Before we introduce KMR, the “pseudo center”
technology is presented as a preparation
for KMR.
3.1.1 “Pseudo Center” Technology
In our KMR algorithm, the cluster set Scluster is refined by an iterative method. During every iteration,
each register finds the cluster center which is closest
to it and joins the corresponding cluster. Then the
cluster centers are updated after all the registers
are allocated into clusters in one iteration. The center of cluster ci is calculated by (2) and (3). However,
the distribution of registers is severely uneven in some
circuits where many registers are located together as an IP
core. Fig.2 shows an example benchmark from ISPD
2010 CNS contest, which has several IP cores in it. In
this situation, the registers in an IP core will easily
form a large cluster during the iterations of the register clustering algorithm. As a result, there will appear
some “empty” clusters because no register chooses to
join them. This will lead to the failure of the register
clustering algorithm because (2) and (3) are invalid for
those “empty” clusters. The “pseudo center” technology is introduced in our algorithm to solve this problem.
Fig.2. Benchmark 10f07 from ISPD 2010 CNS contest.
Algorithm 1 is developed to calculate the center location for cluster C. If C is empty, the algorithm will
find the largest cluster C ′ in the cluster set Scluster .
Then the geometric center of half of the registers in C ′
will be used as the “pseudo center” for the empty cluster C (see details in steps 1∼9). Then a certain amount
of registers in the largest cluster will be absorbed into
the empty cluster C in the next iteration. Thus the
“pseudo center” technology not only avoids the failure
of the register clustering algorithm but also breaks up
the large clusters which would possibly violate the constraints (including slew, maximum fanout, maximum
load capacitance, etc.) in the following CTS flow.
Algorithm 1. Get_Center
Input: a cluster C and the set of clusters Scluster
Output: the center of cluster C: (xC, yC)
1: if C is empty then
2:   Find the largest cluster C′ in Scluster
3:   x_sum = y_sum = 0
4:   for i = 0 to [C′.size()/2] − 1 do
5:     x_sum = x_sum + xri, ri ∈ C′
6:     y_sum = y_sum + yri, ri ∈ C′
7:   end for
8:   xC = x_sum/[C′.size()/2]
9:   yC = y_sum/[C′.size()/2]
10: else
11:   (xC, yC) is calculated by (2) and (3)
12: end if
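A minimal Python sketch of Algorithm 1 follows. It is not the authors' implementation: the bracket [C′.size()/2] is assumed here to mean rounding down, with a lower bound of one register so that the division stays defined, and "half of the registers" is taken to be the first half in list order. The names get_center and mean_point are illustrative.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def mean_point(points: List[Point]) -> Point:
    """Coordinate-wise mean, i.e. the center defined by (2) and (3)."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def get_center(cluster: List[Point], clusters: List[List[Point]]) -> Point:
    """Return the center of `cluster`, or a pseudo center if it is empty."""
    if cluster:
        return mean_point(cluster)
    # Empty cluster: borrow the first half of the largest cluster.
    largest = max(clusters, key=len)
    if not largest:
        raise ValueError("all clusters are empty")
    half = max(1, len(largest) // 2)  # assumption: floor, at least one register
    return mean_point(largest[:half])


# Hypothetical example: one empty cluster and one large cluster.
clusters = [[], [(0, 0), (2, 0), (10, 0), (12, 0)]]
print(get_center(clusters[0], clusters))  # pseudo center from the first two registers
print(get_center(clusters[1], clusters))  # ordinary mean
```

Because the pseudo center lies inside the largest cluster, part of that cluster's registers are closer to it than to their current center in the next iteration, which is the splitting effect described above.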
It is worth mentioning that if we generate a “pseudo
center” when the total capacitance or the number of
registers in a cluster reaches its limit, the
“pseudo center” technology is easily extended to take
more constraints into consideration such as the maximum fanout or the maximum load capacitance of the
local clock buffers.
3.1.2 Details of KMR Algorithm
In the classical K-means algorithm, K is an input
parameter which indicates the number of clusters to
be generated. However, given a layout with a set of
registers, K is unknown until the clustering result is
achieved, which becomes a chicken-and-egg problem.
Therefore, in our KMR algorithm, we compute K with
the total number of registers Totalr and the maximum
fanout of the buffers Maxf by (5). In this equation,
α is a user-defined parameter to avoid the clusters violating the maximum fanout constraint in the iterative
process. Thus α is usually larger than 1.
$$K = \alpha \times \lceil Total_r / Max_f \rceil. \qquad (5)$$
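The computation in (5) can be sketched in Python as below. Since K must be an integer while α may be fractional, the sketch rounds the product up; this outer rounding is an assumption that (5) does not state, and the function name compute_k is illustrative.

```python
import math


def compute_k(total_r: int, max_f: int, alpha: float = 1.0) -> int:
    """Number of clusters according to (5), rounded up to an integer (assumption)."""
    return math.ceil(alpha * math.ceil(total_r / max_f))


# Hypothetical example: 1000 registers, maximum fanout 40.
print(compute_k(1000, 40))        # alpha = 1
print(compute_k(1000, 40, 1.5))   # a larger alpha leaves slack for the fanout limit
```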
The detailed procedure of our KMR algorithm is
shown in Algorithm 2. The final cluster set
Scluster is refined by an iterative method, which decreases the total wirelength defined in (4) in every
iteration until the total wirelength no longer decreases.
At first, the input parameter K is computed by (5).
Then, a center list CE is created and initialized (in
steps 2∼5). After that, an initial wirelength newWL is
calculated in step 7. Steps 8∼21 describe the iterative
method to reduce the total wirelength. At the beginning of every iteration, all the registers are cleared out
of the clusters. Then every register finds the closest
cluster center in CE and joins the corresponding cluster. Note that the function Closest_Center() finds
the closest center for a register on the condition that the
load capacitance of the cluster satisfies the maximum
load constraint after adding the corresponding register.
After that, CE and newWL are updated according to
the new cluster set Scluster. The iteration
stops when the total wirelength is no longer reduced.
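The iteration described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than the authors' C++ implementation: the maximum-load check inside Closest_Center() is omitted, the pseudo center uses the first half of the largest cluster rounded down, and an iteration cap max_iter is added as a safeguard. All function names are illustrative.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def manhattan(a: Point, b: Point) -> float:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def mean_point(points: List[Point]) -> Point:
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def get_center(cluster: List[Point], clusters: List[List[Point]]) -> Point:
    """Mean of the cluster, or a pseudo center if the cluster is empty."""
    if cluster:
        return mean_point(cluster)
    largest = max(clusters, key=len)
    half = max(1, len(largest) // 2)  # assumption: floor, at least one register
    return mean_point(largest[:half])


def assign(registers: List[Point], centers: List[Point]) -> List[List[Point]]:
    """Each register joins the cluster whose center is closest (load check omitted)."""
    clusters: List[List[Point]] = [[] for _ in centers]
    for r in registers:
        nearest = min(range(len(centers)), key=lambda i: manhattan(r, centers[i]))
        clusters[nearest].append(r)
    return clusters


def kmr(registers: List[Point], k: int, tol: float = 1.0, max_iter: int = 100):
    """Refine K clusters until the wirelength changes by at most `tol`."""
    centers = list(registers[:k])  # initial centers: the first K registers
    clusters: List[List[Point]] = []
    old_wl, new_wl = 0.0, float("inf")
    for _ in range(max_iter):
        if abs(new_wl - old_wl) <= tol:
            break
        clusters = assign(registers, centers)
        centers = [get_center(c, clusters) for c in clusters]
        old_wl = new_wl
        new_wl = sum(manhattan(r, ce) for c, ce in zip(clusters, centers) for r in c)
    return clusters, centers


# Hypothetical layout: two well-separated groups of four registers.
regs = [(0, 0), (0, 2), (2, 0), (2, 2), (100, 100), (100, 102), (102, 100), (102, 102)]
clusters, centers = kmr(regs, k=2)
print(centers)
```

In this example both initial centers fall into the first group, yet the center updates pull the second center over to the far group within a few iterations.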
3.2 Algorithm KSR
In this subsection, we focus on explaining the details
of KSR. The “K-Splitting” algorithm will be presented
at first as a preparation for KSR algorithm.
Algorithm 2. KMR
Input: the set of registers Sregister = {r1, · · · , rn}, with
the location (xri, yri) for register ri
Output: final result of cluster set Scluster = {c1, · · · , cK}
1: Compute K by (5)
2: Create a cluster center list CE = {ce1, · · · , ceK},
   cei is the center of cluster ci
3: for i = 0 to K − 1 do
4:   CE[i] = location(ri)
5: end for
6: oldWL = 0
7: Calculate newWL with (4)
8: while |newWL − oldWL| > 1 do
9:   for i = 0 to K − 1 do
10:     ci.clear()
11:   end for
12:   for i = 0 to Sregister.size() − 1 do
13:     ceflag = Closest_Center(ri, CE)
14:     cflag.add(ri)
15:   end for
16:   for i = 0 to K − 1 do
17:     CE[i] = Get_Center(ci, Scluster)
18:   end for
19:   oldWL = newWL
20:   Calculate newWL with (4)
21: end while
3.2.1 “K-Splitting” Algorithm
The KMR algorithm adopts a K-means based process in generating K clusters. However, the distribution of registers is not taken into consideration in (5)
when K is calculated. To our knowledge, if an improper K is selected, the result will be far from satisfactory. Fig.3(a) shows a small benchmark circuit which
contains only 16 registers in it. We can easily figure
out that K = 4 is the best choice for this benchmark.
However, if K is set to 3, the result will become much
worse, which is shown in Fig.3(b). Both the interconnect capacitance and the local skew/delay within the
two bigger clusters will increase. The clustering result
for K = 5 is shown in Fig.3(c). An unnecessary local
clock buffer will be inserted at the center of the blue
cluster so that the total capacitance of the clock tree
will increase. Thus a proper K is very important to
the final clustering result. Therefore, we develop a “K-Splitting” algorithm to generate a proper K as well as
a reasonable initialization of the K clusters for KSR
algorithm.
The “K-Splitting” algorithm is shown in Algorithm 3. First, a complete weighted graph G is constructed, with each node representing a register. The
weight of the edge between two nodes is calculated
by the Manhattan distance of the two corresponding
Fig.3. (a) A small benchmark circuit. (b) Clustering result
when K = 3. (c) Clustering result when K = 5. (d) MST of the
circuit. (e) Three edges to be deleted for splitting the MST. (f)
Splitting result of the “K-Splitting” algorithm.
Algorithm 3. “K-Splitting”
Input: a set of registers Sregister = {r1, · · · , rn}, with the
location (xri, yri) for register ri
Output: K and a set of K clusters Scluster = {c1, · · · , cK}
1: Create a complete weighted graph G(Sregister, E),
   where E = {e(ri, rj), i, j ∈ {1, · · · , n} and i ≠ j}
2: Calculate the weight for edges in E, w(e(ri, rj)) =
   |xri − xrj| + |yri − yrj|
3: Build the minimum spanning tree (MST) of G using
   the Kruskal algorithm: T(Sregister, ET) = Kruskal(G),
   the edges in ET are sorted in ascending order by the
   Kruskal algorithm
4: Calculate EL with (6)
5: K = 1
6: while w(ET.back()) > EL do
7:   ET.pop_back()
8:   K = K + 1
9: end while
10: Adopt Depth-First-Search (DFS) algorithm on T to
    generate K connected components, which constitute
    the cluster set Scluster = {c1, · · · , cK}
registers. Then the minimum spanning tree (MST) T
of G is generated by the Kruskal algorithm[25] in line
3 (shown in Fig.3(d)). The edges in T are sorted in
ascending order in one step of the Kruskal algorithm.
A specified limit EL is computed from the parameters
of the layout and the number of registers using (6),
where Width and Length are the width and the length
of the layout respectively, Sizeob is the total size of
the obstacles on the layout, Numregisters is the total
number of the registers on the layout, and α is a user-defined parameter to control the number of clusters
to be generated. After that, the edges whose weight
is larger than EL are deleted from T (shown in
Fig.3(e)) and parameter K is recorded during the iterations (in steps 6∼9). Finally, we adopt a depth-first-search (DFS) algorithm[25] to generate K connected
components which constitute the original cluster set
Scluster, which is shown in Fig.3(f). From the “K-Splitting” algorithm, we can see that a larger α
yields a smaller number of clusters, which is used for
the trade-off between different requirements of power
dissipation and skew/delay. In this paper, we set α to
1.
$$EL = \alpha \times \sqrt{\frac{Width \times Length - Size_{ob}}{Num_{registers}}}. \qquad (6)$$
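The “K-Splitting” procedure can be sketched in Python as follows. The sketch builds the complete graph explicitly, runs Kruskal with a small union-find, drops the MST edges heavier than EL, and collects connected components with an iterative DFS. It assumes the square root in (6) covers the whole fraction; the function names edge_limit and k_splitting and the sample coordinates are illustrative assumptions.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def edge_limit(width: float, length: float, size_ob: float,
               num_registers: int, alpha: float = 1.0) -> float:
    """Edge length limit EL from (6)."""
    return alpha * math.sqrt((width * length - size_ob) / num_registers)


def k_splitting(registers: List[Point], el: float) -> List[List[Point]]:
    """Split the Manhattan MST at edges longer than `el`; K = len(result)."""
    n = len(registers)
    edges = sorted(
        (abs(registers[i][0] - registers[j][0]) + abs(registers[i][1] - registers[j][1]), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    parent = list(range(n))

    def find(u: int) -> int:
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    # Kruskal: keep an edge only if it joins two different components.
    adj: List[List[int]] = [[] for _ in range(n)]
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            if w <= el:  # MST edges heavier than EL are dropped (steps 6-9)
                adj[i].append(j)
                adj[j].append(i)

    # DFS over the remaining forest to collect connected components.
    seen = [False] * n
    clusters: List[List[Point]] = []
    for s in range(n):
        if seen[s]:
            continue
        seen[s] = True
        stack, comp = [s], []
        while stack:
            u = stack.pop()
            comp.append(registers[u])
            for v in adj[u]:
                if not seen[v]:
                    seen[v] = True
                    stack.append(v)
        clusters.append(comp)
    return clusters


regs = [(0, 0), (1, 0), (0, 1), (50, 50), (51, 50)]
clusters = k_splitting(regs, el=5.0)
print(len(clusters), [len(c) for c in clusters])
```

The complete graph has O(n²) edges, so this form only suits small inputs; a production flow would likely restrict the candidate edges first.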
3.2.2 Details of KSR Algorithm
The detailed procedure of KSR is shown in Algorithm 4. The input parameter K and the original
Scluster are generated by our “K-Splitting” algorithm
which is described in Algorithm 3. The main difference between KMR and KSR is the generation of K
and the initialization of the center list CE. As the “K-Splitting” algorithm generates an initial Scluster, the
center list CE is initialized by applying the “Get_Center”
algorithm to Scluster (in steps 2∼4). The remaining
process of KSR algorithm is similar to KMR algorithm.
Algorithm 4. KSR
Input: the set of registers Sregister = {r1, · · · , rn}, with
the location (xri, yri) for register ri, K, and the initialization of cluster set Scluster = {c1, · · · , cK}
Output: final result of cluster set S′cluster = {c1, · · · , cK}
1: Create a cluster center list CE = {ce1, · · · , ceK},
   cei is the center of cluster ci
2: for i = 0 to K − 1 do
3:   CE[i] = Get_Center(ci, Scluster)
4: end for
5: oldWL = 0
6: Calculate newWL with (4)
7: while |newWL − oldWL| > 1 do
8:   for i = 0 to K − 1 do
9:     ci.clear()
10:   end for
11:   for i = 0 to Sregister.size() − 1 do
12:     ceflag = Closest_Center(ri, CE)
13:     cflag.add(ri)
14:   end for
15:   for i = 0 to K − 1 do
16:     CE[i] = Get_Center(ci, Scluster)
17:   end for
18:   oldWL = newWL
19:   Calculate newWL with (4)
20: end while
3.3 Algorithm GSR
GSR is a simple yet effective method in clumping
registers into clusters. The key idea of GSR is that the
radius of each cluster cannot exceed a certain value. By
limiting the radius of clusters, GSR clumps registers in
a small bounding box for every cluster. Therefore, during the clustering process, the registers only join the
clusters within a certain range. The maximum distance
between the registers and the center of the corresponding cluster is calculated by (7). In this equation, Width
and Length are the width and the length of the layout
respectively, Sizeob is the total size of the obstacles on
the layout, Totalr is the total number of the registers
on the layout, Maxf is the maximum fanout limitation
of buffers and α is a user-defined parameter.
$$Max_{dis} = \alpha \times \sqrt{\frac{Width \times Length - Size_{ob}}{Total_r / Max_f}}. \qquad (7)$$
Algorithm 5 shows the detailed procedure of GSR.
At first, the algorithm generates the first cluster by
adding r1 into it. The loading and fanout of cluster
c1 are updated in step 2 while the center of c1 is updated in step 3. Then, the maximum distance between
the registers and the center of the corresponding cluster, Maxdis, is calculated by (7). After that, a traversal
is performed on the remaining registers from step 5 to
step 16. Every register ri looks for the nearest
cluster cf that satisfies the maximum distance,
the maximum loading, and the maximum fanout constraints. If cf exists, ri joins cf and the loading,
the fanout and the center of cf are updated from step
7 to step 10. If no cluster satisfies the constraints, a
new cluster is created and added into the cluster
set Scluster.
Algorithm 5. GSR
Input: a set of registers Sregister = {r1, · · · , rn}, with the
location (xri, yri) for register ri
Output: a set of clusters Scluster = {c1, · · · , cm}
1: Create cluster c1 and add r1 into c1
2: Lc1 = Lr1, Fc1 = 1
3: Calculate the center of c1 by (2) and (3)
4: Compute Maxdis by (7)
5: for i from 2 to n do
6:   if ri finds the nearest cluster cf in Scluster, condition: Dis(cf, ri) < Maxdis, Lcf + Lri < MaxLoad
     and Fcf + 1 < MaxFanout then
7:     Add ri into cf
8:     Lcf = Lcf + Lri
9:     Fcf = Fcf + 1
10:     Calculate the center of cf by (2) and (3)
11:   else
12:     Create a new cluster cj and add ri into cj
13:     Lcj = Lri, Fcj = 1
14:     Calculate the center of cj by (2) and (3)
15:   end if
16: end for
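The greedy search of Algorithm 5 can be sketched in Python as below. The sketch interprets "nearest cluster" as the nearest cluster among those that satisfy all three constraints, which is an assumption about step 6, and it keeps the strict inequalities of the pseudocode. The dictionary layout, the function names, and the sample loads are illustrative and not taken from the paper.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def max_distance(width: float, length: float, size_ob: float,
                 total_r: int, max_f: int, alpha: float = 1.0) -> float:
    """Cluster radius limit Max_dis from (7)."""
    return alpha * math.sqrt((width * length - size_ob) / (total_r / max_f))


def gsr(registers: List[Point], loads: List[float], max_dis: float,
        max_load: float, max_fanout: int) -> List[Dict]:
    """Greedy clustering: join the nearest feasible cluster, else open a new one."""
    clusters: List[Dict] = []
    for r, load in zip(registers, loads):
        best, best_d = None, None
        for c in clusters:
            d = abs(r[0] - c["center"][0]) + abs(r[1] - c["center"][1])
            feasible = (d < max_dis
                        and c["load"] + load < max_load
                        and len(c["members"]) + 1 < max_fanout)
            if feasible and (best is None or d < best_d):
                best, best_d = c, d
        if best is None:
            clusters.append({"members": [r], "load": load,
                             "center": (float(r[0]), float(r[1]))})
        else:
            best["members"].append(r)
            best["load"] += load
            n = len(best["members"])
            best["center"] = (sum(x for x, _ in best["members"]) / n,
                              sum(y for _, y in best["members"]) / n)
    return clusters


regs = [(0, 0), (1, 0), (0, 1), (50, 50), (51, 50)]
result = gsr(regs, loads=[1.0] * 5, max_dis=5.0, max_load=10.0, max_fanout=10)
print([len(c["members"]) for c in result])
```

Because each register is visited once and no center is revisited, the result depends on the order of the registers, which is the usual trade-off of a single-pass greedy method.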
3.4 Blockage Avoidance
After the register clustering process, a local clock
buffer will be inserted at the center of each cluster. The
center location of a cluster c is calculated by (2) and
(3). However, the original center location may overlap
with the existing blockages in some cases. In order to
solve this problem, we move the center location towards
four directions (left, right, up, down) and calculate the
distance between the original center location and the
four borders of blockages. The new center location is
set by moving the original center location to the nearest border. The process is shown in Fig.4. Based on
the above refinement of center locations, we can guarantee that all the local clock buffers are inserted off
the blockages. Moreover, the top-level CTS we adopt
in this paper is also a blockage-avoiding flow. Therefore, no buffer overlaps with the blockages in the final
clock trees.
Fig.4. Refinement of center locations. The black spot is the
original center location while the gray spot is the new center
location.
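The refinement of center locations can be illustrated with the following Python sketch. It assumes axis-aligned rectangular blockages given as (x_min, y_min, x_max, y_max), treats a point on a border as lying off the blockage, and does not handle the case where the moved point falls into another blockage. The names Rect and avoid_blockage are illustrative assumptions.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def avoid_blockage(center: Point, blockages: List[Rect]) -> Point:
    """Move a center that lies inside a blockage to the nearest border."""
    x, y = center
    for x_min, y_min, x_max, y_max in blockages:
        if x_min < x < x_max and y_min < y < y_max:
            candidates = [
                (x - x_min, (x_min, y)),  # move left
                (x_max - x, (x_max, y)),  # move right
                (y - y_min, (x, y_min)),  # move down
                (y_max - y, (x, y_max)),  # move up
            ]
            # Pick the direction with the smallest travel distance.
            return min(candidates)[1]
    return center


# Hypothetical blockage covering the square from (0, 0) to (10, 10).
print(avoid_blockage((3, 5), [(0, 0, 10, 10)]))
print(avoid_blockage((20, 20), [(0, 0, 10, 10)]))
```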
4 Buffer Allocation of Clusters
After the register clustering algorithm clumps registers into several clusters, a local clock buffer will be
allocated to each cluster to satisfy the slew constraint.
Then all the registers within a cluster are connected
with the local clock buffer directly. In order to reduce
the power consumption as much as possible, a buffer
allocation algorithm is adopted.
4.1 Slew Models for Wire and Register
Slew Model for Wire. The wire slew model adopted
in this paper is shown in (8) and follows [26].
It applies to the situation in Fig.5. In (8), Sle is
the slew degradation on the wire, and de is the wire delay,
which is computed with the Elmore delay model.
Slew Model for Register. The input slew of a register,
Sl(r), is modeled by (9) as in [26]. It is determined by both the
upstream buffer’s output slew Slbu,out(b) and the slew
degradation Sle on the wire. In this paper, a look-up
table is built through NGSPICE simulation to achieve
relatively accurate output slew of the upstream buffer.
As shown in Fig.5, the buffer’s output slew which is
affected by the input slew and load capacitance can be
looked up from the look-up table.
Fig.5. Slew models for wire and register. (Diagram labels: input slew and output slew of buffer b, slew degradation on wire, input slew of register r.)
$$Sl_e = \ln 9 \times d_e, \qquad (8)$$
$$Sl(r) = \sqrt{Sl_{bu,out}(b)^2 + Sl_e^2}. \qquad (9)$$
From (8) and (9), we can see that the worst register’s input slew within a cluster appears on the register
which has the longest path connected with the local
clock buffer. Therefore, we need to insert a local clock
buffer in the cluster at a minimum power cost to satisfy
the worst register’s input slew.
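Equations (8) and (9) translate directly into the Python sketch below. The function names are illustrative, the wire delay is assumed to be supplied by a separate Elmore delay computation, and the numeric values are chosen only to make the arithmetic easy to follow.

```python
import math


def wire_slew(wire_delay: float) -> float:
    """Slew degradation on the wire, (8): Sl_e = ln 9 * d_e."""
    return math.log(9) * wire_delay


def register_input_slew(buffer_output_slew: float, wire_delay: float) -> float:
    """Register input slew, (9): root-sum-square of buffer slew and wire slew."""
    return math.hypot(buffer_output_slew, wire_slew(wire_delay))


# Illustrative values: a wire delay that gives a wire slew of exactly 4 units.
d_e = 4.0 / math.log(9)
print(register_input_slew(3.0, d_e))  # combines 3 and 4 into approximately 5
```

Since (9) grows monotonically with the wire delay, only the register with the longest connection needs to be checked per cluster.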
4.2 Buffer Allocation Algorithm
In this paper, the two buffers provided in ISPD contests 2009 and 2010 are adopted as the basic buffers.
To extend the drive strength and increase the diversity of buffers, a buffer library B is generated by
connecting the basic buffers in parallel, as shown in Table 1.
A buffer with larger output capacitance can drive a longer
wire under the same slew constraint, given the same input slew and load capacitance.
The detailed procedure of our buffer allocation algorithm is shown in Algorithm 6. The purpose of the
algorithm is to allocate a local clock buffer to each cluster at a minimum power cost while satisfying the slew
constraint at the same time. At first, we sort the buffers
in the buffer library B in ascending order of their driving strength (in step 1). If there is only one register
in the cluster, no buffer will be inserted at the cluster’s center because the input slew of the register will
be satisfied by the top-level CTS (in steps 3∼6). Steps
7∼8 find the register which has the longest path connected with the buffer and compute the wire slew with
(8). Steps 9∼18 test the buffers in B in ascending
order of their driving strength and select the first one that meets the slew limit. In this way, no slew
violation occurs within the clusters and the power
consumption cost is kept to a minimum.
Algorithm 6. Buffer Allocation
Input: a set of clusters Scluster = {c1, · · · , cK} and a buffer library B
Output: specific buffer type for each cluster in Scluster
1: Sort the buffers in B in ascending order of their driving strength
2: for i = 1 to K do
3:   if ci.size() = 1 then
4:     ci has no buffer
5:     Continue
6:   else
7:     Find the farthest register rf from the buffer
8:     Compute the wire slew with (8)
9:     for each buffer b in B do
10:      Look up the output slew of b in the look-up table
11:      Compute input slew Slf of rf with (9)
12:      if Slf > SlewLimit then
13:        Continue
14:      else
15:        Set b as the buffer type of ci
16:        Break
17:      end if
18:    end for
19:  end if
20: end for
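The steps of Algorithm 6 can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' C++ implementation: each cluster is represented by the list of wire delays d_e from its buffer to its registers, each buffer is a dict with hypothetical `name` and `strength` fields, and the look-up table is replaced by a caller-supplied function (`toy_output_slew` below is an invented stand-in).

```python
import math

def allocate_buffers(clusters, library, slew_limit, output_slew):
    """Pick, for each cluster, the weakest buffer that meets the slew limit."""
    ordered = sorted(library, key=lambda b: b["strength"])  # step 1
    chosen = []
    for delays in clusters:                                 # step 2
        if len(delays) == 1:                                # steps 3-5
            chosen.append(None)  # single register: no local buffer
            continue
        worst_delay = max(delays)                           # step 7
        sl_e = math.log(9) * worst_delay                    # step 8, Eq. (8)
        pick = None  # stays None if no buffer in B meets the limit
        for buf in ordered:                                 # step 9
            sl_out = output_slew(buf, delays)               # step 10
            sl_f = math.hypot(sl_out, sl_e)                 # step 11, Eq. (9)
            if sl_f <= slew_limit:                          # steps 12-16
                pick = buf
                break
        chosen.append(pick)
    return chosen

def toy_output_slew(buf, delays):
    """Invented stand-in for the look-up table: stronger buffer, smaller slew."""
    return 60.0 / buf["strength"] + 5.0

library = [{"name": "x4", "strength": 4},
           {"name": "x1", "strength": 1},
           {"name": "x2", "strength": 2}]
clusters = [[10.0, 20.0], [5.0], [2.0, 3.0]]  # hypothetical wire delays (ps)
result = allocate_buffers(clusters, library, 50.0, toy_output_slew)
names = [b["name"] if b else None for b in result]
```

Because the library is scanned from the weakest buffer upward and the loop stops at the first buffer that meets the limit, each cluster receives the smallest buffer that is sufficient, which is the source of the power saving.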
5 Suitability of Our Register Clustering Methodology
The register clustering process in our approach is
not only an algorithm, but also a methodology. As
a preprocess of the registers, our register clustering
methodology can be easily integrated into any particular CTS method for power reduction. Given a CTS
method, the input of the flow is a set of registers. After
register clustering, the set of local buffers will become
the input registers for the given CTS method. No
other optimization algorithm in the CTS method
needs to be modified.
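The preprocessing role described above can be sketched as a thin wrapper around an existing CTS routine. In this hypothetical Python sketch, `cluster_fn` and `cts_fn` are placeholders for any clustering algorithm and any CTS method, registers are (x, y) tuples, and the local buffer is placed at the arithmetic mean of the register positions, which is an assumption about what "the cluster's center" means.

```python
def cluster_center(registers):
    """Assumed buffer location: mean of the register coordinates."""
    xs = [x for x, _ in registers]
    ys = [y for _, y in registers]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def sinks_for_top_level_cts(clusters):
    """Replace each multi-register cluster by its local buffer location.

    A single-register cluster gets no local buffer (Section 4.2), so the
    register itself remains a sink of the top-level CTS.
    """
    sinks = []
    for registers in clusters:
        if len(registers) == 1:
            sinks.append(registers[0])
        else:
            sinks.append(cluster_center(registers))
    return sinks

def run_flow(registers, cluster_fn, cts_fn):
    """Clustering as a preprocess: the CTS routine itself is left unchanged."""
    return cts_fn(sinks_for_top_level_cts(cluster_fn(registers)))

# Toy usage with placeholder clustering and CTS functions.
regs = [(0, 0), (2, 0), (1, 3), (10, 10)]
sinks = run_flow(regs, lambda r: [r[:3], r[3:]], lambda s: s)
```

The CTS routine only ever sees a shorter list of sinks, which is also why the top-level runtime drops when clustering is applied.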
6 Experiments
6.1 Experimental Setup
We implemented our algorithms in C++ on a 2.33 GHz Intel® Xeon® Linux workstation with 8 GB memory. Meanwhile, in order to compare our algorithms with Shelar's algorithm[21], which is an efficient register clustering algorithm, we also implemented that algorithm according to the description in [21]. KMR, KSR, GSR and Shelar's algorithm are four independent register clustering algorithms, and thus we integrate each of them into a classical CTS flow[24] to test the effectiveness of our register clustering methodology. Some improvements have been made to optimize the CTS approach in [24], including signal polarity correction and better performance of the sample technology for buffer insertion. The experimental flow is shown in Fig.6. All the conditions and configurations on the ISPD benchmarks are set up exactly the same as in [24]. Wire sizing and process variation are not taken into consideration in the algorithms. All configurations of the ISPD benchmarks remain unchanged except that the wire library consists only of wire 0 and the parameters related to process variation are disabled. The final results are obtained by NGSPICE simulation.

Table 1. Buffer Library B

| ID | Basic Buffer Type | Parallel Number | Input Cap (fF) | Output Cap (fF) | Output Res (Ω) |
|----|-------------------|-----------------|----------------|-----------------|----------------|
| 0 | 0 | 1 | 35.0 | 80.0 | 61.200 |
| 1 | 1 | 1 | 4.2 | 6.1 | 440.000 |
| 2 | 0 | 2 | 70.0 | 160.0 | 30.600 |
| 3 | 0 | 3 | 105.0 | 240.0 | 20.400 |
| 4 | 0 | 4 | 140.0 | 320.0 | 15.300 |
| 5 | 1 | 2 | 8.4 | 12.2 | 220.000 |
| 6 | 1 | 3 | 12.6 | 18.3 | 146.667 |
| 7 | 1 | 4 | 16.8 | 24.4 | 110.000 |

Note: input capacitance (input cap) and output capacitance (output cap) are measured in femtoFarads (fF). Output resistance (output res) is measured in Ω.
Fig.6. Experimental flow for evaluating our register clustering algorithms: Input → Register Clustering Algorithm (KMR, KSR, GSR, or [21]) → Buffer Allocation Algorithm → CTS Approach in [24] → Result.
6.2 Experimental Results
A comparison is made between the experimental results of the CTS flows with and without Shelar's algorithm and our register clustering algorithms.
We perform our experiments on two ISPD 2009 benchmarks and eight ISPD 2010 benchmarks.
Table 2 clearly demonstrates the effectiveness of our register clustering methodology. The third column
shows five different flows: [24] represents the CTS
flow without a register clustering process; [21], KMR,
KSR and GSR represent the CTS flows with Shelar's
algorithm[21], KMR, KSR and GSR respectively. From
the table, we can see that KMR, KSR and GSR all
achieve a significant power reduction on every benchmark. Compared with [24], KMR, KSR and GSR
achieve 23%, 29% and 31% reduction in total power dissipation on average respectively. Meanwhile, the average skew is maintained or slightly reduced by the three
algorithms, which indicates that the local skew within
the clusters is limited to a small range by our high-quality register clustering algorithms. The maximum latency is also improved by 8%, 6% and 5% in
KMR, KSR and GSR respectively. Moreover, the total runtime is
reduced by almost an order of magnitude in the three
algorithms because the clustering procedure generates
far fewer leaf nodes for the top-level CTS
than before. The last column of the table
is the runtime of the register clustering algorithms. It
should be noted that the clustering procedure finishes
in seconds on all the benchmark circuits. Compared with Shelar's algorithm[21], which is also an excellent register clustering algorithm, our
algorithms are more effective in cutting down the power
dissipation of clock trees. The benchmarks 09f32 and
09fnb1 contain more blockages on the layout. Thus the
experiment on these two benchmarks shows the effectiveness of our algorithms on benchmarks with many blockages.

Table 2. Comparison of Results on CTS Flows with and Without Our Register Clustering Algorithms on ISPD 2010 Benchmarks

| Benchmark | # of Registers | Algorithm | Skew (ps) | Power (fF) | MaxLatency (ps) | Total CPU (s) | Clustering CPU (s) |
|-----------|----------------|-----------|-----------|------------|-----------------|---------------|--------------------|
| 10f01 | 1 107 | [24] | 45.22 | 194 024.00 | 798.97 | 106.00 | - |
| | | [21] | 79.19 | 151 773.00 | 791.25 | 15.65 | 0.65 |
| | | KMR | 30.67 | 153 885.00 | 756.28 | 4.14 | 0.35 |
| | | KSR | 46.81 | 146 083.00 | 779.67 | 24.12 | 0.98 |
| | | GSR | 34.67 | 142 847.00 | 763.59 | 18.97 | 0.31 |
| 10f02 | 2 249 | [24] | 36.62 | 399 411.00 | 941.17 | 183.00 | - |
| | | [21] | 195.11 | 295 862.00 | 1 088.21 | 30.73 | 1.23 |
| | | KMR | 37.77 | 287 330.00 | 901.55 | 11.24 | 1.92 |
| | | KSR | 47.19 | 272 822.00 | 924.48 | 51.82 | 6.25 |
| | | GSR | 50.15 | 267 875.00 | 963.18 | 27.39 | 0.96 |
| 10f03 | 1 200 | [24] | 30.92 | 60 972.10 | 456.98 | 127.00 | - |
| | | [21] | 32.57 | 60 895.80 | 430.23 | 11.20 | 0.78 |
| | | KMR | 31.36 | 46 214.70 | 426.90 | 3.24 | 0.26 |
| | | KSR | 39.68 | 39 257.50 | 434.75 | 10.13 | 1.15 |
| | | GSR | 16.89 | 40 542.00 | 426.94 | 10.86 | 0.22 |
| 10f04 | 1 845 | [24] | 36.03 | 91 432.40 | 493.18 | 208.00 | - |
| | | [21] | 106.52 | 104 270.00 | 532.45 | 22.43 | 1.20 |
| | | KMR | 49.88 | 83 567.70 | 441.18 | 10.29 | 1.68 |
| | | KSR | 37.50 | 60 849.00 | 481.49 | 37.24 | 3.47 |
| | | GSR | 44.58 | 59 142.30 | 460.74 | 20.07 | 0.35 |
| 10f05 | 1 016 | [24] | 37.20 | 41 840.10 | 509.30 | 107.00 | - |
| | | [21] | 64.40 | 48 114.30 | 496.35 | 8.75 | 0.25 |
| | | KMR | 26.26 | 35 890.40 | 438.91 | 4.41 | 0.27 |
| | | KSR | 46.76 | 31 938.80 | 472.96 | 12.32 | 0.87 |
| | | GSR | 31.74 | 32 918.10 | 459.57 | 10.56 | 0.20 |
| 10f06 | 1 981 | [24] | 40.75 | 51 217.90 | 457.53 | 94.00 | - |
| | | [21] | 65.27 | 47 920.00 | 446.85 | 5.67 | 0.56 |
| | | KMR | 47.04 | 36 390.30 | 413.82 | 3.30 | 0.30 |
| | | KSR | 15.63 | 36 125.10 | 389.61 | 3.42 | 1.34 |
| | | GSR | 37.22 | 34 526.80 | 423.38 | 10.92 | 0.14 |
| 10f07 | 1 915 | [24] | 30.41 | 82 566.70 | 468.25 | 217.00 | - |
| | | [21] | 40.92 | 82 807.00 | 444.73 | 6.78 | 0.55 |
| | | KMR | 49.74 | 71 328.00 | 448.62 | 10.95 | 0.57 |
| | | KSR | 26.73 | 65 253.00 | 439.09 | 5.78 | 2.35 |
| | | GSR | 43.82 | 58 168.50 | 448.13 | 18.25 | 0.43 |
| 10f08 | 1 134 | [24] | 42.90 | 57 927.40 | 458.17 | 122.00 | - |
| | | [21] | 31.63 | 55 611.70 | 412.75 | 5.21 | 0.42 |
| | | KMR | 26.58 | 43 879.40 | 403.34 | 3.29 | 0.37 |
| | | KSR | 35.50 | 42 767.90 | 399.12 | 3.99 | 1.45 |
| | | GSR | 28.55 | 42 468.80 | 399.72 | 10.42 | 0.19 |
| 09f32 | 1 190 | [24] | 30.55 | 138 456.00 | 1 295.63 | 65.00 | - |
| | | [21] | 140.88 | 103 463.00 | 1 353.67 | 3.12 | 0.22 |
| | | KMR | 29.33 | 118 460.00 | 1 289.51 | 3.45 | 0.17 |
| | | KSR | 40.63 | 120 417.00 | 1 287.43 | 3.79 | 0.89 |
| | | GSR | 45.06 | 111 615.00 | 1 260.34 | 4.42 | 0.18 |
| 09fnb1 | 1 330 | [24] | 40.41 | 30 501.50 | 465.65 | 87.00 | - |
| | | [21] | 31.63 | 28 387.60 | 417.75 | 3.02 | 0.23 |
| | | KMR | 11.97 | 25 985.10 | 408.69 | 4.12 | 0.27 |
| | | KSR | 27.85 | 25 583.70 | 403.45 | 3.89 | 0.87 |
| | | GSR | 24.72 | 24 230.20 | 405.34 | 4.22 | 0.23 |
| Comparison | | [24] | 1.00 | 1.00 | 1.00 | 1.00 | - |
| | | [21] | 2.05 | 0.87 | 1.01 | 0.09 | 1.00 |
| | | KMR | 1.00 | 0.77 | 0.92 | 0.04 | 1.01 |
| | | KSR | 0.99 | 0.71 | 0.94 | 0.13 | 3.17 |
| | | GSR | 0.96 | 0.69 | 0.95 | 0.11 | 0.50 |

Note: # of registers means the total number of registers in the corresponding benchmark. MaxLatency means the maximum latency from the source to registers in the corresponding result. Skew and maximum latency are measured in picoseconds (ps), power is measured in femtoFarads (fF), and runtime is measured in seconds. Total CPU is the total runtime of the CTS and clustering CPU is the runtime of the clustering process, which is included in total CPU.
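The average percentages quoted in the text follow directly from the normalized "Comparison" values of Table 2, where the baseline flow [24] equals 1.00. The short Python sketch below makes this conversion explicit; the dictionary and function names are illustrative, and the numbers are the normalized power and maximum-latency averages reported for KMR, KSR and GSR.

```python
# Normalized averages relative to the baseline flow [24] (= 1.00).
power = {"KMR": 0.77, "KSR": 0.71, "GSR": 0.69}
max_latency = {"KMR": 0.92, "KSR": 0.94, "GSR": 0.95}

def percent_reduction(normalized):
    """Convert a normalized value into a percentage reduction vs. baseline."""
    return round((1.0 - normalized) * 100)

power_reduction = {name: percent_reduction(v) for name, v in power.items()}
latency_reduction = {name: percent_reduction(v) for name, v in max_latency.items()}
```

These evaluate to the 23%, 29% and 31% power reductions and the 8%, 6% and 5% maximum-latency reductions stated above.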
In this paper, we focus on the power reduction of
clock trees brought by our register clustering methodology. Therefore, we do not introduce many techniques
to reduce the skew of the clock tree. Within the clusters, the registers are directly connected with the local clock buffers, and no skew tuning techniques such as
wire snaking are performed in the top-level CTS. On the
other hand, the high quality of our register clustering algorithms guarantees that the global skew is limited to a
reasonable range. For instance, in our KSR algorithm,
the clock skew decreases on four benchmarks and
increases on the other six. However, the clock skew is still reduced by 1.2% on average.
In the future, we will develop an integrated CTS system
based on our register clustering methodology, which will include the local CTS within the clusters and skew
tuning techniques such as wire snaking.
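The per-benchmark statement about KSR can be checked against the skew column of Table 2. The Python sketch below copies the skew values (in ps) of the baseline flow [24] and of KSR, assuming the flattened table is read benchmark by benchmark in the order [24], [21], KMR, KSR, GSR; the variable names are illustrative.

```python
# Skew in ps, copied from Table 2.
skew_baseline = {"10f01": 45.22, "10f02": 36.62, "10f03": 30.92,
                 "10f04": 36.03, "10f05": 37.20, "10f06": 40.75,
                 "10f07": 30.41, "10f08": 42.90, "09f32": 30.55,
                 "09fnb1": 40.41}
skew_ksr = {"10f01": 46.81, "10f02": 47.19, "10f03": 39.68,
            "10f04": 37.50, "10f05": 46.76, "10f06": 15.63,
            "10f07": 26.73, "10f08": 35.50, "09f32": 40.63,
            "09fnb1": 27.85}

# Benchmarks on which KSR lowers or raises the skew relative to [24].
decreased = [b for b in skew_baseline if skew_ksr[b] < skew_baseline[b]]
increased = [b for b in skew_baseline if skew_ksr[b] > skew_baseline[b]]
```

The skew decreases on 10f06, 10f07, 10f08 and 09fnb1 and increases on the remaining six benchmarks, matching the four-versus-six split described above.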
Figs.7∼9 show the clustering results of benchmark
10f01 generated by KMR, KSR and GSR respectively.
In the three figures, gray rectangles represent the obstacles in the layout, and black rectangles represent the
bounding boxes of clusters.
It is also worth mentioning that all the experimental
results satisfy the constraints of the ISPD contest: 1) no
slew-rate violation occurs at the input of the registers
and the buffers; 2) no buffer overlaps with the existing
blockages; 3) the signal polarities of all the registers are
exactly the same as those of the clock source.
All the above experimental results demonstrate that
our register clustering methodology is effective in reducing the power consumption of the clock tree without affecting the clock skew while optimizing maximum
latency at the same time.
Fig.7. Clustering results of KMR algorithm on 10f01.
Fig.8. Clustering results of KSR algorithm on 10f01.
Fig.9. Clustering results of GSR algorithm on 10f01.
(In Figs.7∼9, both axes run from 1 to 8 in units of ×10^6.)
7 Conclusions
We presented an efficient register clustering
methodology to reduce the power dissipation of the
clock tree without affecting the clock skew while optimizing the maximum latency at the same time. Three
different clustering algorithms called KMR, KSR and
GSR were developed to verify the validity of the proposed register clustering methodology. All three algorithms achieve significant power reduction on every
benchmark of the ISPD 2010 CNS contest. As the most effective of the three, the GSR algorithm achieves a 31% reduction in power consumption
as well as a 4% reduction in skew and a 5% reduction
in maximum latency. Moreover, the total runtime of
the CTS flow with our register clustering algorithms
is reduced by almost an order of magnitude. In addition, no negative influence is brought to
the signal nets because there is no register movement
in our algorithms. In the future, we will try to develop
an integrated low power CTS flow based on our register clustering algorithms. More algorithms, such as a
routing algorithm inside the clusters, will be involved
in the CTS system. Meanwhile, we will try to take
more factors into consideration, such as wire sizing, wire snaking, and
process variation.
References
[1] Pedram M, Rabaey J M. Power Aware Design Methodologies. Kluwer Academic Publisher, 2002.
[2] Cheon Y, Ho P H, Kahng A B, Reda S, Wang Q. Power-aware placement. In Proc. the 42nd Annual Design Automation Conference, Jun. 2005, pp.795–800.
[3] Donno M, Macii E, Mazzoni L. Power-aware clock tree planning. In Proc. the 2004 International Symposium on Physical Design, April 2004, pp.138–147.
[4] Lam T K, Yang X, Tang W C, Wu Y L. On applying erroneous clock gating conditions to further cut down power. In
Proc. the 16th Asia and South Pacific Design Automation
Conference, Jan. 2011, pp.509–514.
[5] Lu J, Mao X, Taskin B. Clock mesh synthesis with gated
local trees and activity driven register clustering. In Proc.
IEEE/ACM International Conference on Computer-Aided
Design, Nov. 2012, pp.691–697.
[6] Igarashi M, Usami K, Nogami K et al. A low-power design
method using multiple supply voltages. In Proc. the 1997
International Symposium on Low Power Electronics and
Design, Aug. 1997, pp.36–41.
[7] Lin K Y, Lin H T, Ho T Y. An efficient algorithm of adjustable delay buffer insertion for clock skew minimization
in multiple dynamic supply voltage designs. In Proc. the
16th Asia and South Pacific Design Automation Conference, Jan. 2011, pp.825–830.
[8] Li L, Sun J, Lu Y, Zhou H, Zeng X. Low power discrete
voltage assignment under clock skew scheduling. In Proc.
the 16th Asia and South Pacific Design Automation Conference, Jan. 2011, pp.515–520.
[9] Chao T H, Hsu Y C, Ho J M, Kahng A. Zero skew clock
routing with minimum wirelength. IEEE Transactions on
Circuits and Systems II: Analog and Digital Signal Processing, 1992, 39(11): 799–814.
[10] Liu W H, Li Y L, Chen H C. Minimizing clock latency range
in robust clock tree synthesis. In Proc. the 15th Asia and
South Pacific Design Automation Conference, Jan. 2010,
pp.389–394.
[11] Shih X W, Cheng C C, Ho Y K, Chang Y W. Blockage-avoiding buffered clock-tree synthesis for clock latency-range
and skew minimization. In Proc. the 15th Asia and South
Pacific Design Automation Conference, Jan. 2010, pp.395–
400.
[12] Lee D J, Markov I L. Contango: Integrated optimization of
soc clock network. In Proc. the 2010 Conference on Design,
Automation and Test in Europe, Mar. 2010, pp.1468–1473.
[13] Rakai L, Farshidi A, Behjat L, Westwick D. Buffer sizing for
clock networks using robust geometric programming considering variations in buffer sizes. In Proc. the 2013 ACM
International Symposium on Physical Design, Mar. 2013,
pp.154–161.
[14] Singh J, Nookala V, Luo Z Q, Sapatnekar S. Robust gate
sizing by geometric programming. In Proc. the 42nd Annual
Design Automation Conference, Jun. 2005, pp.315–320.
[15] Vittal A, Marek-Sadowska M. Low-power buffered clock tree
design. IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, 1997, 16(9): 965–975.
[16] Lillis J, Cheng C K, Lin T T Y. Optimal wire sizing
and buffer insertion for low power and a generalized delay
model. IEEE Journal of Solid-State Circuits, 1996, 31(3):
437–447.
[17] Lu Y, Sze C, Hong X, Zhou Q, Cai Y, Huang L, Hu J. Register placement for low power clock network. In Proc. the
2005 Asia and South Pacific Design Automation Conference, Jan. 2005, pp.588–593.
[18] Hou W, Liu D, Ho P H. Automatic register banking for low-power clock trees. In Proc. the 10th International Symposium on Quality Electronic Design, Mar. 2009, pp.647–652.
[19] Papa D, Alpert C, Sze C, Li Z, Viswanathan N, Nam G
J, Markov I L. Physical synthesis with clock-network optimization for large systems on chips. IEEE Micro, 2011,
31(4): 51–62.
[20] Mehta A D, Chen Y P, Menezes N, Wong D, Pileggi L. Clustering and load balancing for buffered clock tree synthesis.
In Proc. the 1997 IEEE International Conference on Computer Design: VLSI in Computers and Processors, Oct.
1997, pp.217–223.
[21] Shelar R S. An efficient clustering algorithm for low power
clock tree synthesis. In Proc. the 2007 International Symposium on Physical Design, Mar. 2007, pp.181–188.
[22] Mitchell T. Machine Learning. McGraw Hill, 1997.
[23] Selim S Z, Ismail M A. K-means-type algorithms: A generalized convergence theorem and characterization of local optimality. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1984, PAMI-6(1): 81–87.
[24] Niu F, Zhou Q, Yao H, Cai Y, Yang J, Sze C N. Obstacle-avoiding and slew-constrained buffered clock tree synthesis
for skew optimization. In Proc. the 21st Edition of the Great
Lakes Symposium on VLSI, May 2011, pp.199–204.
[25] Cormen T H, Leiserson C E, Rivest R L, Stein C. Introduction to Algorithms. Prentice-Hall India, 1998.
[26] Hu S, Alpert C J, Hu J, Karandikar S K, Li Z, Shi W, Sze
C N. Fast algorithms for slew-constrained minimum cost
buffering. IEEE Transactions on Computer Aided Design
of Integrated Circuits and Systems, 2007, 26(11): 2009–
2022.
Chao Deng received his B.S. degree
in computer science and technology
from Harbin Institute of Technology
(HIT), Harbin, in 2010. He is currently
pursuing his Ph.D. degree from the EDA
Lab, Tsinghua University, Beijing. His
research interests include clock network
synthesis and optimization.
Yi-Ci Cai is a professor in the
Department of Computer Science
and Technology, Tsinghua University,
Beijing. She received her B.S. degree
in electronic engineering from Tsinghua
University in 1983, M.S. degree in
computer science and technology from
Tsinghua University in 1986, and Ph.D.
degree in computer science from the University of Science
and Technology of China, Hefei, in 2007. Her research
interests include design automation for VLSI integrated
circuits algorithms and theory, power/ground distribution
network analysis and optimization, high performance clock
synthesis, and low power physical design.
Qiang Zhou received his B.S. degree
in computer science and technology
from the University of Science and
Technology of China, Hefei, in 1983,
M.S. degree in computer science and
technology from Tsinghua University,
Beijing, in 1986, and Ph.D. degree in
control theory and control engineering
from Chinese University of Mining and Technology,
Beijing, in 2002. His research interests include VLSI
layout theory and algorithms.