A RECONFIGURABLE HARDWARE IMPLEMENTATION OF TREE CLASSIFIERS BASED
ON A CUSTOM CHIP AND A CPLD FOR GAS SENSORS APPLICATIONS
Amine Bermak
Hong Kong University of Science and Technology
EEE Department, Hong Kong, SAR.
Dominique Martinez
LORIA-CNRS, BP239 Vandoeuvre,
Nancy, 54506, France.
ABSTRACT
This paper describes a hardware implementation of tree
classifiers based on a custom VLSI chip and a CPLD chip.
The tree classifier comprises a first layer of threshold logic
units (TLUs), implemented on a reconfigurable custom chip, followed by a logical function implemented using a CPLD
chip. We first describe the architecture of tree classifiers
and compare their performance with that of support vector machines
(SVMs) on different data sets. The reconfigurability of the
hardware (number of classifiers, topology and precision)
is discussed. Experimental results show that the hardware
presents a number of interesting features such as reconfigurability and improved fault tolerance.
1. INTRODUCTION
Classification is one of the most important tasks in pattern
recognition systems. Neural networks (NNs) and decision trees
have been widely used as classifiers, and different classification schemes have been proposed in the literature [1]. A
number of interesting applications have also emerged in the
last decade, particularly in smart gas sensor applications, in
which the data from the front-end sensors are processed using advanced pattern recognition classifiers in order to
improve the selectivity and compensate for a number of inherent problems encountered in gas sensors. To date, two
general approaches have been used to implement the NN
or decision tree classifier for real-time odor sensing applications. The first approach relies on a software
implementation running on conventional computers. This
kind of implementation allows considerable flexibility in the
topology and operation of the NN or decision tree classifier. The second approach relies on an application-specific, dedicated NN hardware implementation. This
solution ultimately associates the front-end sensor with
on-chip pattern recognition processing which, if realized using advanced microelectronic packaging technology (such
as MCM, flip-chip or surface-mount), would result in a
miniaturized odor sensing system. This rapprochement between
the front-end sensing devices and the processing is
particularly interesting because it results in improved performance in terms of real-time detection, low power consumption and reduced noise, with lower overall system cost.
Unfortunately, the neural network or decision tree classifier implemented in such specialized hardware is not flexible, which makes the processing chip suitable only for a very
specialized detection problem. It is therefore impossible to
set the detection and classification parameters (topology, accuracy, etc.) of the final system after production.
0-7803-8560-8/04/$20.00 ©2004 IEEE
In this paper we address this issue by proposing a reconfigurable decision tree hardware architecture that
can be programmed in terms of topology and processing
precision, allowing the user to externally set the detection
parameters. The resulting hardware offers both the
flexibility of a software implementation and the high performance and compactness of a dedicated hardware implementation. Section 2 describes the decision tree classifier
architecture and compares its performance with other classifiers. Section 3 describes the reconfigurable hardware architecture. Section 4 concludes the paper.
2. DECISION TREE CLASSIFIERS
Decision tree classifiers can be considered as threshold networks, as illustrated by the example shown in Figure 1.
The decision tree shown in Figure 1.A implements a classifier that discriminates between three classes (denoted 1,
2 and 3) shown in Figure 1.B. Each node in the tree is a
Threshold Logic Unit (TLU) implementing a linear discriminant and each leaf is associated with a given class. Classifying
an input pattern then reduces to a sequence of binary decisions, starting from the root node and ending when a leaf
is reached. Each class can then be represented by a logical
function that combines the binary decisions encountered at
the nodes. A decision tree can thus be considered as a threshold network having a hidden layer of TLUs
followed by one logical function per class.
Once a decision tree has been constructed, it is a simple matter to convert it into an equivalent threshold network
by extracting one logical function per class from the tree
structure. A logical function for a given class has a number
of conjunctions equal to the number of leaves associated with
this class (see Figure 1.A and the logical expressions reported
in the figure caption). While it is not possible to reduce this
number without losing the equivalence between the decision tree and the threshold network, it is possible to
simplify the conjunctions themselves. Any node that has a
leaf of a given class as a child can be removed from the
other conjunctions of the same class. For example, node c
in Figure 1.A has a leaf of class 1. The conjunction associated with this leaf is $a\bar{c}$.
It is easy to check that node c can be
removed from the other conjunction $ac\bar{d}$, since $a\bar{c} + ac\bar{d} = a\bar{c} + a\bar{d}$.
Fig. 1. Equivalence between decision trees and threshold networks. (A.)
and (C.) are two examples of a tree and a threshold network respectively
and (B.) and (D.) are their respective partitions of the input space. Each node
of the tree corresponds to a separating hyperplane and each leaf corresponds
to a given class. Each class can be represented by a logical function that
combines a set of nodes. In our example, it can be seen from the tree
structure that $\text{class1} = a\bar{c} + ac\bar{d}$, $\text{class2} = \bar{a}\bar{b} + acd$ and $\text{class3} = \bar{a}b$.
Note however that the logical function for class 1 extracted by our program
is $\text{class1} = a\bar{c} + a\bar{d}$, which is better optimized.
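The simplification described above can be checked exhaustively. The following minimal Python sketch treats the four TLU outputs a, b, c and d as Booleans and uses the class expressions as read from the Fig. 1 caption (the bar placement is reconstructed from the text, and the function names are illustrative only). It verifies that removing node c from the second conjunction of class 1 does not change the function, and that exactly one class fires for every combination of TLU outputs.

```python
from itertools import product

def class1_tree(a, b, c, d):
    # Class 1 as read directly from the tree: a.(not c) + a.c.(not d)
    return (a and not c) or (a and c and not d)

def class1_optimized(a, b, c, d):
    # Class 1 after removing node c from the second conjunction
    return (a and not c) or (a and not d)

def class2(a, b, c, d):
    return (not a and not b) or (a and c and d)

def class3(a, b, c, d):
    return (not a) and b

def classify(a, b, c, d):
    """Return the class label (1, 2 or 3) selected by the logical functions."""
    flags = [class1_optimized(a, b, c, d), class2(a, b, c, d), class3(a, b, c, d)]
    return flags.index(True) + 1

ALL_INPUTS = list(product([False, True], repeat=4))

# The optimized and unoptimized class-1 functions agree on all 16 inputs.
equivalent = all(class1_tree(*v) == class1_optimized(*v) for v in ALL_INPUTS)

# Exactly one logical function is true for every combination of TLU outputs.
one_hot = all(
    sum([class1_optimized(*v), class2(*v), class3(*v)]) == 1 for v in ALL_INPUTS
)
```

In hardware terms, the three functions above are what would be synthesized on the CPLD, while a, b, c and d come from the TLU layer.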
Bagging [2] is a popular and effective technique that can
be used to improve classification performance by
creating ensembles. Bagging uses random sampling with
replacement from the original data set in order to obtain different training sets. Because the sampled data
set has the same size as the original one, many of the original examples may be repeated while others may be left out.
On average, 63% of the original examples appear in the sampled training set [2]. An individual classifier is built on
each training set by applying the same learning algorithm.
The resulting classifiers are then combined by a simple majority vote. Previous work suggests that ensembles with
ten members are adequate to improve the classification performance of decision trees [3]. Thus, ensembles of ten decision trees were created by using bagging and transformed into
ensembles of ten threshold networks. The tree building procedure OC1 [4] was used because it produces smaller and more accurate decision trees than other procedures [4]. OC1 is a randomized algorithm that builds
oblique decision trees by simulated annealing. The OC1
program was modified to incorporate bagging, and a C program was written to transform each decision tree
into an equivalent threshold network by automatically extracting one optimized logical function per class from the tree
structure. Because the datasets we used were small, generalization performance was estimated by a leave-one-out
procedure. Table 1 reports the leave-one-out performance
of bagged decision trees implemented as threshold network (TN) ensembles, in comparison to that of single threshold networks and support vector machines (SVMs). For these
datasets, threshold network ensembles were always more
accurate than single threshold networks. Moreover, they
outperformed SVMs on two of the four datasets (hepatitis and ionosphere).
Dataset      Single TN      Ensemble of 10 TNs   SVM
hepatitis    78.75 (78.4)   88.75 (83.5)         83.75
ionosphere   89.46 (91.9)   93.16 (94.0)         89.74
iris         93.33 (94.0)   96.67 (95.4)         96.67
odor         90.32          96.77                100
Table 1. Leave-one-out accuracy (in %) for the different datasets.
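The two ingredients of bagging described in this section, sampling with replacement and majority voting, can be sketched in a few lines of Python. This is only an illustration using the standard library; it is not the modified OC1 program, and the function names, the data set size of 1000 and the number of trials are arbitrary assumptions. The average fraction of distinct examples in a bootstrap sample should come out close to 1 - 1/e, i.e. about 0.632, which is the 63% figure quoted above.

```python
import random
from collections import Counter

def bootstrap_indices(n, rng):
    """Draw n indices with replacement from range(n)."""
    return [rng.randrange(n) for _ in range(n)]

def mean_unique_fraction(n, trials, rng):
    """Average fraction of distinct original examples in a bootstrap sample."""
    fractions = [len(set(bootstrap_indices(n, rng))) / n for _ in range(trials)]
    return sum(fractions) / trials

def majority_vote(votes):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
fraction = mean_unique_fraction(1000, 50, rng)

# Hypothetical class labels produced by an ensemble of ten classifiers.
ensemble_votes = [1, 2, 2, 3, 2, 1, 2, 2, 3, 2]
decision = majority_vote(ensemble_votes)
```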
3. VLSI IMPLEMENTATION
Decision trees require only TLUs and combinatorial logic
and are very suitable for a VLSI implementation. The
threshold function is indeed easy to implement in digital logic, and
this results in significant silicon area savings compared to
the sigmoidal or radial basis functions used in multilayer perceptrons or RBF networks, which are implemented through area-consuming
look-up tables. This simplification results in
very compact arithmetic units, and makes the prospect of
building VLSI chips implementing bagged threshold
networks particularly promising for real-time decisions.
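To illustrate why the threshold function is so cheap in digital logic, the sketch below models a TLU as an integer weighted sum followed by a test of the two's-complement sign bit, with no look-up table involved. It is a behavioural illustration only: the function names, the 32-bit accumulator width, the optional bias term and the convention that a non-negative sum gives an output of 1 are assumptions, not details taken from the chip.

```python
def weighted_sum(weights, inputs):
    """Integer dot product of the weights and the inputs."""
    return sum(w * x for w, x in zip(weights, inputs))

def sign_bit(value, width=32):
    """Two's-complement sign bit of value, assuming it fits in width bits."""
    return (value >> (width - 1)) & 1

def tlu(weights, inputs, bias=0, width=32):
    """Threshold logic unit: output 1 when the weighted sum is non-negative."""
    s = weighted_sum(weights, inputs) + bias
    return 1 - sign_bit(s, width)
```

For example, `tlu([3, -2], [1, 1])` evaluates a sum of 1 and `tlu([3, -2], [1, 2])` a sum of -1, so the two calls fall on opposite sides of the separating hyperplane.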
In our VLSI implementation the TLUs were implemented
using a reconfigurable custom chip, while the logical function (which varies from one application to another) is synthesized on a CPLD device. This turns the relatively expensive hardware implementation into a general problem-solving system. The custom VLSI chip is based on a 2D systolic
array architecture. This scalable array consists of 4 × 4
Processing Elements (PEs). The array can be configured,
using a control signal (cne), to perform either a weighted
sum or a TLU operation (Fig. 2.A). Each PE includes a
local configurable memory to store either one 16-bit weight,
two 8-bit weights or four 4-bit weights. Wider systolic arrays
(more inputs) can be realized by bypassing the activation
function and connecting the partial weighted sums across
different chips. A 10-bit internal control register is used
to configure the 4 × 4 array of PEs in terms of decision tree
topology and weight precision. The activation function is
simply realized by detecting the sign bit of the weighted
sum, which corresponds to the last bit generated by the
serial-parallel processor. A sampling circuit is also used
to detect the sign bit of each TLU and to multiplex
the different TLU outputs in time, so that only one physical pin is used for the out signal of the entire array. This
time-multiplexing scheme does not affect the overall speed
of the system, since a systolic architecture is
used and data are processed in a pipelined way. For example, the sign bit for row 1 is obtained one clock cycle earlier
than that of row 2, and therefore only one physical pin
is required to sample the data of the two rows. This has allowed
us to reduce the number of physical pins and to facilitate the
interfacing of several chips within the multi-chip hardware.
Other methods, such as bus sharing, were also
used to further reduce the number of physical pins
and buses within the architecture. Since the system has
different modes of operation, a single bus is assigned different tasks in the different modes. Figure
2.B shows the internal block diagram of each PE within the
systolic array. It can be seen from this figure that the
4-bit Sin bus is used to load the 16-bit internal weight register and also to provide the partial sum inputs either to the
configurable arithmetic unit or to the 2-input/1-output multiplexer, depending on the value of the control bit (NO/OP)
stored in the internal flip-flop. When NO/OP = 0, the Sin bus
is passed directly to the neighboring PE located on
the right (Sout = Sin) and hence no processing is performed
within the PE; the processor is in Non-Operational (NO)
mode. This mode provides two interesting features for the systolic array: (i) the possibility of
exhaustively testing individual PEs within the array and (ii) improved fault tolerance of the system. With the NO mode, it
is possible to isolate a defective PE within the systolic array
and/or a defective chip within a multi-chip hardware.
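The effect of the NO/OP control bit can be illustrated with a small behavioural model of one row of PEs. In this sketch each PE either adds its weight-input product to the incoming partial sum (NO/OP = 1) or passes the partial sum through unchanged (NO/OP = 0, Sout = Sin). The function names are hypothetical and the model deliberately ignores the bit-serial datapath, the 4-bit buses and the timing of the real circuit.

```python
def pe(s_in, weight, x, op):
    """One processing element: accumulate when op is 1, bypass when op is 0."""
    if op:
        return s_in + weight * x
    return s_in  # Non-Operational mode: Sout = Sin

def row_sum(weights, inputs, op_bits):
    """Propagate the partial sum from left to right through a row of PEs."""
    s = 0
    for w, x, op in zip(weights, inputs, op_bits):
        s = pe(s, w, x, op)
    return s

weights = [1, 2, 3, 4]
inputs = [1, 1, 1, 1]

all_operational = row_sum(weights, inputs, [1, 1, 1, 1])
second_pe_isolated = row_sum(weights, inputs, [1, 0, 1, 1])
```

Clearing the second bit removes that PE's contribution while the remaining PEs still deliver their partial sums, which is the mechanism used above to isolate a defective PE or to test PEs one at a time.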
Fig. 2. (A.) Internal architecture of a basic VLSI chip. (B.) Processing
Element (PE) building block diagram.
3.1. Reconfigurability
The 4 × 4 systolic array can be configured to operate at
three different weight precisions, as shown
in Figure 3. The input precision is arbitrarily selected by the
user and hence Xi can take any word length. As can be
seen from Figure 3, depending on the selected precision, different topologies of threshold networks can be configured.
Pb p × q stands for a network topology with b bits of precision, p inputs and q TLUs. For 4-bit precision, three
configurations are possible, namely P4 16 × 4, P4 8 × 8 and
P4 4 × 16. For 8-bit precision, two configurations are possible: P8 4 × 8 and P8 8 × 4. Only one configuration is possible for 16-bit precision: P16 4 × 4. The available resources
of the circuit are a trade-off between the three parameters
(b, p and q). A lower precision
allows the number of inputs or TLUs to be increased according
to b × p × q = c, where c is a constant that depends
on the number of chips interfaced. For example, c = 256 for
a single chip while c = 1024 for four cascaded chips.
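The resource relation b × p × q = c can be checked against the six topologies listed for a single chip. The short Python sketch below does this and also shows how the relation trades precision against the number of TLUs; the constant and function names are illustrative, and the helper `max_tlus` only applies the arithmetic of the formula, so it does not capture any wiring constraint of the actual array.

```python
BITS_PER_CHIP = 256  # value of c for a single chip, as stated in the text

# (b, p, q) = (weight precision in bits, number of inputs, number of TLUs)
TOPOLOGIES = [
    (4, 16, 4), (4, 8, 8), (4, 4, 16),  # 4-bit precision
    (8, 4, 8), (8, 8, 4),               # 8-bit precision
    (16, 4, 4),                         # 16-bit precision
]

def resource_product(b, p, q):
    """Left-hand side of b x p x q = c."""
    return b * p * q

def max_tlus(b, p, chips=1):
    """Number of TLUs q allowed by b x p x q = c for a given number of chips."""
    return (BITS_PER_CHIP * chips) // (b * p)

all_consistent = all(resource_product(*t) == BITS_PER_CHIP for t in TOPOLOGIES)
```

With c = 1024 for four cascaded chips, `max_tlus(4, 16, chips=4)` gives 16 TLUs of 16 inputs each at 4-bit precision, four times the single-chip figure.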
3.2. Experimental results and fault tolerance
A total of 50 basic chips were manufactured (44 dies and 6
packaged chips). After verifying the full functionality of the
packaged chips, the remaining 44 dies were mounted on 11
MCMs, each containing 4 chips. Special care was paid to the
design of the MCM in order to make the test and debugging
of each chip within the MCM possible. This was achieved
by designing special back-side contacts on the MCM such
that they would fit on a test socket. Access to the inter-chip
connections was also obtained using back-side MCM contacts designed to route the test contacts of the MCM to
a standard DIL package. For each MCM, each chip
was separately addressed and tested. Several tests were conducted, namely: (i) Test 1: wire bonding and MCM-level
connectivity test; (ii) Test 2: propagation of the control signals within the MCM; (iii) Test 3: synaptic weight loading
test; (iv) Test 4: NO/OP programming test; and (v) Test 5:
exhaustive functional test. Figure 4 summarizes the test results for the 11 MCMs. Of the 11 MCMs, 6 were found
to be fully operational (about 54.5%). Faulty wire bonding was
detected for MCM 1 and 2. MCM 3 passed the connectivity test but failed all the remaining tests. MCM 4 and 5
successfully passed all tests except the last functional test.
Even though MCM 4 and 5 do not operate properly, their
malfunction would not affect the correct operation of neighboring MCMs if they were mounted within a larger system.
This is made possible by the NO mode, which considerably improves the fault tolerance of the system. Indeed, 2
MCMs out of the 5 faulty ones (40%) do not present catastrophic faults, thanks to the NO feature. In order to test
the reconfigurable hardware for gas sensor applications, an
experimental setup was developed to extract the
response of a gas sensor array composed of 5 commercial
Figaro TGS gas sensors. Ethanol or butanol vapors were injected into the gas chamber at a flow rate determined by the
mass flow controllers. We used 16 different concentrations
for ethanol ranging from 1360 to 5165 ppm and 15 different
concentrations for butanol ranging from 870 to 3050 ppm.
We then recorded the steady-state outputs of 4 sensor arrays,
each composed of the 5 TGS sensors, for the 31 different
concentrations of ethanol and butanol. This yielded a total
of 124 patterns. This odor dataset was used for estimating the performance of decision trees trained with bagging.
The test and training performance were experimentally
measured on the chip and compared with the performance
obtained by simulation for 4-bit and 8-bit precision and for
ensembles of 10 and 20 threshold networks. A test performance of 96% was obtained for an ensemble of
10 threshold networks with 8-bit weight precision, while the
performance dropped to 84% for 4-bit weight precision.
Using an ensemble of 20 threshold networks with 4-bit
precision improves the performance by 3 percentage points, resulting in an
accuracy of 87%. The performance measurements obtained
for both training and test match those obtained by
simulation. Peak classification rates
of 28M samples/s, 11M samples/s and 4M samples/s
were obtained for 4-bit, 8-bit and 16-bit weight precision
respectively, with an input precision of 11 bits.
Fig. 3. The different topologies of threshold networks implemented by the circuit.
Panels: 16-bit precision (P16 4x4); 8-bit precision (P8 4x8, P8 8x4); 4-bit precision (P4 16x4, P4 8x8, P4 4x16).
MCM #    Test 1  Test 2  Test 3  Test 4  Test 5  Shade
MCM 1    F       F       F       F       F       dark
MCM 2    F       F       F       F       F       dark
MCM 3    P       F       F       F       F       dark
MCM 4    P       P       P       P       F       light
MCM 5    P       P       P       P       F       light
MCM 6    P       P       P       P       P       none
MCM 7    P       P       P       P       P       none
MCM 8    P       P       P       P       P       none
MCM 9    P       P       P       P       P       none
MCM 10   P       P       P       P       P       none
MCM 11   P       P       P       P       P       none
Fig. 4. Fault characterization of the 11 MCMs with respect to the 5 tests
described. P and F stand for pass and fail results of the test, respectively.
Dark shades represent samples with catastrophic fault. Light shades represent non-operational samples whose defective MCMs can nevertheless be isolated
using the NO mode. No shade indicates samples with no fault.
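The yield figures quoted in this section can be re-derived from the pass/fail results of Fig. 4. The Python sketch below encodes the five test results of each MCM as a string and counts the three categories. The rule used to label a faulty MCM as isolable (it passed Test 4, the NO/OP programming test) is an assumption made for this illustration; the text only states that MCM 4 and 5 can be isolated using the NO mode.

```python
from collections import Counter

# Results of Tests 1 to 5 for each MCM, as reported in Fig. 4.
FIG4 = {1: "FFFFF", 2: "FFFFF", 3: "PFFFF", 4: "PPPPF", 5: "PPPPF"}
FIG4.update({mcm: "PPPPP" for mcm in range(6, 12)})

def classify(results):
    """Label one MCM from its five pass/fail results."""
    if results == "PPPPP":
        return "operational"
    if results[3] == "P":  # assumed rule: NO/OP programming (Test 4) works
        return "isolable"
    return "catastrophic"

counts = Counter(classify(r) for r in FIG4.values())
faulty = len(FIG4) - counts["operational"]

operational_yield = counts["operational"] / len(FIG4)  # 6 of 11 MCMs
isolable_share = counts["isolable"] / faulty           # 2 of 5 faulty MCMs
```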
4. CONCLUSION
In this paper we presented a reconfigurable VLSI implementation of decision trees. The proposed architecture supports a configurable decision tree structure as well as variable-precision computation, in which it is possible for low-precision
applications to either reduce the power consumption or increase the ensemble size to improve the classification performance. Inefficient usage of the hardware resources is avoided since both the weight and input precisions are user-defined for the application at hand. Tests
performed on the manufactured MCMs showed that
the proposed NO/OP feature implemented in our circuit improves the fault tolerance of the system: 40% of the faulty MCMs (2 out of 5) could be isolated instead of causing a catastrophic fault.
Operation of the hardware as a classifier was also demonstrated through a gas detection application. Experimental results suggest a peak classification performance of 28M,
11M and 4M samples/s for 4-bit, 8-bit and 16-bit weight precision, respectively.
5. ACKNOWLEDGMENTS
This work was supported in part by a Procore Research
Grant ref: F-HK15/03T.
6. REFERENCES
[1] R. O. Duda, P. E. Hart and D. G. Stork, “Pattern Classification,” Wiley-Interscience, 2001.
[2] L. Breiman, “Bagging predictors,” Machine Learning,
24(2), pp. 123-140, 1996.
[3] R. Maclin and D. Opitz, “An empirical evaluation of
bagging and boosting,” Proc. of AAAI, 1997.
[4] S. K. Murthy et al., “A System for Induction of Oblique
Decision Trees,” Journal of Artificial Intelligence Research, 2, pp. 1-33, 1994.