A RECONFIGURABLE HARDWARE IMPLEMENTATION OF TREE CLASSIFIERS BASED ON A CUSTOM CHIP AND A CPLD FOR GAS SENSORS APPLICATIONS

Amine Bermak
Hong Kong University of Science and Technology, EEE Department, Hong Kong, SAR.

Dominique Martinez
LORIA-CNRS, BP239 Vandoeuvre, Nancy, 54506, France.

ABSTRACT

This paper describes a hardware implementation of tree classifiers based on a custom VLSI chip and a CPLD chip. The tree classifier comprises a first layer of threshold logic units, implemented on a reconfigurable custom chip, followed by a logical function implemented using a CPLD chip. We first describe the architecture of tree classifiers and compare their performance with that of support vector machines (SVMs) on different data sets. The reconfigurability of the hardware (number of classifiers, topology and precision) is then discussed. Experimental results show that the hardware presents a number of interesting features, such as reconfigurability and improved fault tolerance.

1. INTRODUCTION

Classification is one of the most important tasks in pattern recognition systems. Neural networks (NNs) and decision trees have been widely used as classifiers, and different classification schemes have been proposed in the literature [1]. A number of interesting applications have also emerged in the last decade, particularly in smart gas sensors, in which the data from the front-end sensors are processed by advanced pattern recognition classifiers in order to improve selectivity and compensate for a number of inherent problems encountered in gas sensors.

To date, two general approaches have been used to implement the NN or decision tree classifier for real-time odor sensing applications. The first approach relies on a software implementation running on conventional computers. This kind of implementation allows considerable flexibility in the topology and operation of the NN or decision tree classifier.
The second approach relies on application-specific, dedicated NN hardware. This solution ultimately associates the front-end sensor with on-chip pattern recognition processing which, if realized using advanced microelectronic packaging technology (such as MCM, flip-chip or surface-mount), would result in a miniaturized odor sensing system. This rapprochement between the front-end sensing devices and the processing is particularly interesting because it results in improved performance in terms of real-time detection, low power consumption and reduced noise, with lower overall system cost. Unfortunately, the neural network or decision tree classifier implemented in such specialized hardware is not flexible, which makes the processing chip suitable only for a very specialized detection problem. It is therefore impossible to set the detection and classification parameters (topology, accuracy, etc.) of the final system after production.

In this paper we address this issue by proposing a reconfigurable decision tree hardware architecture that can be programmed in terms of topology and processing precision, allowing the user to set the detection parameters externally. The resulting hardware offers both the flexibility of a software implementation and the high performance and compactness of a dedicated hardware implementation. Section 2 describes the decision tree classifier architecture and compares its performance with that of other classifiers. Section 3 describes the reconfigurable hardware architecture. Section 4 concludes the paper.

0-7803-8560-8/04/$20.00 ©2004 IEEE

2. DECISION TREE CLASSIFIERS

Decision tree classifiers can be considered as threshold networks, as illustrated by the example shown in Figure 1. The decision tree shown in Figure 1.A implements a classifier that discriminates between three classes (denoted 1, 2 and 3) shown in Figure 1.B.
Fig. 1. Equivalence between decision trees and threshold networks. (A.) and (C.) are two examples of a tree and a threshold network respectively, and (B.) and (D.) are their respective partitions of the input space. Each node of the tree corresponds to a separating hyperplane and each leaf corresponds to a given class. Each class can be represented by a logical function that combines a set of nodes. In our example, it can be seen from the tree structure that $\mathrm{class}_1 = a\bar{c} + ac\bar{d}$, $\mathrm{class}_2 = \bar{a}\bar{b} + acd$ and $\mathrm{class}_3 = \bar{a}b$. Note however that the logical function for class 1 extracted by our program is $\mathrm{class}_1 = a\bar{c} + a\bar{d}$, which is better optimized.

Each node in the tree is a Threshold Logic Unit (TLU) implementing a linear discriminant, and each leaf is associated with a given class. Classifying an input pattern then reduces to a sequence of binary decisions, starting from the root node and ending when a leaf is reached. Each class can then be represented by a logical function that combines the binary decisions encountered at the nodes. A decision tree can thus be considered as a threshold network having a hidden layer of TLUs followed by one logical function per class. Once a decision tree has been constructed, it is a simple matter to convert it into an equivalent threshold network by extracting one logical function per class from the tree structure. A logical function for a given class has a number of conjunctions equal to the number of leaves associated with this class (see Figure 1.A and the logical expressions reported in the figure caption).
While it is not possible to reduce this number without losing the equivalence between the decision tree and the threshold network, it can be possible to simplify the conjunctions themselves. Any node that has a leaf of a given class as a child can be removed from the other conjunctions of the same class. For example, node c in Figure 1.A has a leaf of class 1. The conjunction associated with this leaf is $a\bar{c}$. It is easy to check that node c can be removed from the other conjunction $ac\bar{d}$, since $a\bar{c} + ac\bar{d} = a\bar{c} + a\bar{d}$.

Bagging [2] is a popular and effective technique for improving classification performance by creating ensembles. Bagging uses random sampling with replacement from the original data set in order to obtain different training sets. Because each sampled data set has the same size as the original one, many of the original examples may be repeated while others may be left out. On average, 63% of the original examples appear in a sampled training set [2]. An individual classifier is built on each training set by applying the same learning algorithm. The resulting classifiers are then combined by a simple majority vote. Previous work suggests that ensembles with ten members are adequate to improve the classification performance of decision trees [3]. Thus, ensembles of ten decision trees were created by bagging and transformed into ensembles of ten threshold networks.

The tree building procedure OC1 [4] was used because it performs better than other procedures, producing smaller and more accurate decision trees [4]. OC1 is a randomized algorithm that builds oblique decision trees by simulated annealing. The OC1 program was modified to incorporate bagging, and a C program was written to transform each decision tree into an equivalent threshold network by automatically extracting one optimized logical function per class from the tree structure.
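The equivalence between the example tree of Figure 1 and its threshold network can be checked exhaustively. The following Python sketch is only an illustration and not the authors' C program: it assumes the reading of the figure caption in which class 1 = a·c̄ + a·c·d̄, class 2 = ā·b̄ + a·c·d and class 3 = ā·b, and the function names are hypothetical. Each argument stands for the binary decision of one TLU.

```python
from itertools import product

def classify_tree(a, b, c, d):
    """Walk the example tree; a, b, c, d are the binary TLU decisions."""
    if not a:
        return 3 if b else 2
    if not c:
        return 1
    return 2 if d else 1

def class_functions(a, b, c, d):
    """One logical function per class (threshold-network form).

    Class 1 uses the simplified expression a.not(c) + a.not(d):
    node c has been dropped from the second conjunction.
    """
    class1 = bool((a and not c) or (a and not d))
    class2 = bool((not a and not b) or (a and c and d))
    class3 = bool((not a) and b)
    return (class1, class2, class3)

def classify_network(a, b, c, d):
    """Return the index (1..3) of the class whose function is true."""
    return class_functions(a, b, c, d).index(True) + 1

def check_equivalence():
    """True if tree and network agree on all 16 combinations of decisions."""
    return all(
        sum(class_functions(*bits)) == 1
        and classify_network(*bits) == classify_tree(*bits)
        for bits in product([False, True], repeat=4)
    )

if __name__ == "__main__":
    print("tree and network agree:", check_equivalence())
```

Because the three logical functions partition the 16 possible decision vectors, exactly one class output is active for any input, which is what allows the logical layer to be implemented as plain combinatorial logic.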
Because the datasets we used were small, generalization performance was estimated by a leave-one-out procedure. Table 1 reports the leave-one-out performance of bagged decision trees implemented as threshold network (TN) ensembles, in comparison with that of single threshold networks and support vector machines (SVMs). For these datasets, threshold network ensembles were always more accurate than single threshold networks. Moreover, they outperformed SVMs on two of the four datasets.

Dataset      Single TN      Ensemble of 10 TNs   SVM
hepatitis    78.75 (78.4)   88.75 (83.5)         83.75
ionosphere   89.46 (91.9)   93.16 (94.0)         89.74
iris         93.33 (94.0)   96.67 (95.4)         96.67
odor         90.32          96.77                100

Table 1. Leave-one-out accuracy (in %) for the different datasets.

3. VLSI IMPLEMENTATION

Decision trees require only TLUs and combinatorial logic and are therefore very suitable for a VLSI implementation. The threshold function is easy to implement in digital logic, which results in a significant silicon area saving compared with the sigmoidal or radial basis functions used in multilayer perceptrons or RBF networks, which are implemented through area-consuming look-up tables. This simplification results in very compact arithmetic units and makes the prospect of building VLSI chips implementing bagged threshold networks particularly promising for real-time decisions.

In our VLSI implementation the TLUs are implemented using a reconfigurable custom chip, while the logical function (which varies from one application to another) is synthesized on a CPLD device. This turns the relatively expensive custom hardware into a general problem-solving system. The custom VLSI chip is based on a 2D systolic array architecture. This array consists of a scalable 4 × 4 grid of Processing Elements (PEs). The array can be configured, using a control signal (cne), to operate either as a weighted-sum unit or as a TLU (Fig. 2.A).
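The two operating modes selected by cne can be pictured with a small behavioural model. The Python sketch below is not a description of the chip's circuitry: it assumes that cne = 1 selects TLU mode, that the TLU output is 1 when the weighted sum is non-negative (sign bit equal to 0 in two's complement), and that a 32-bit accumulator is wide enough. None of these details is stated in the text, and all names are hypothetical.

```python
def weighted_sum(weights, inputs):
    """Integer dot product, as accumulated along one row of PEs."""
    return sum(w * x for w, x in zip(weights, inputs))

def sign_bit(value, width=32):
    """Most significant bit of the two's-complement encoding of value."""
    return (value >> (width - 1)) & 1

def row_output(weights, inputs, cne):
    """Behavioural model of one row of the array.

    cne = 0: return the raw weighted sum (to be chained to another chip).
    cne = 1: return the TLU decision derived from the sign bit.
    The polarity of cne and of the decision are assumptions.
    """
    s = weighted_sum(weights, inputs)
    if cne:
        return 1 - sign_bit(s)
    return s

if __name__ == "__main__":
    w, x = [1, -2], [1, 3]
    print("weighted sum:", row_output(w, x, cne=0))
    print("TLU decision:", row_output(w, x, cne=1))
```

Returning the raw sum when cne = 0 mirrors the idea of bypassing the activation function so that partial sums can be combined across several chips before thresholding.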
Each PE includes a local configurable memory that stores either one 16-bit weight, two 8-bit weights or four 4-bit weights. A wider systolic array (more inputs) can be realized by bypassing the activation function and connecting the partial weighted sums across different chips. A 10-bit internal control register is used to configure the 4 × 4 array of PEs in terms of decision tree topology and weight precision. The activation function is simply realized by detecting the sign bit of the weighted sum, which corresponds to the last bit generated by the serial-parallel processor.

A sampling circuit is also used to detect the sign bit of each TLU and to multiplex the different TLU outputs in time, so that only one physical pin is used for the out signal of the entire array. This time-multiplexing scheme does not affect the overall speed of the system, since a systolic architecture is used and data are processed in a pipelined way. For example, the sign bit for row 1 is obtained one clock cycle earlier than that of row 2, so a single physical pin suffices to sample the data of both rows. This has allowed us to reduce the number of physical pins and to facilitate the interfacing of several chips within the multi-chip hardware. Other methods, such as bus sharing, were also used to further reduce the number of physical pins and buses within the architecture. Since the system has different modes of operation, a single bus is assigned different tasks in the different modes.

Figure 2.B shows the internal block diagram of each PE within the systolic array. It can be seen from this figure that the 4-bit Sin bus is used to load the 16-bit internal weight register and also to provide the partial-sum inputs either to the configurable arithmetic unit or to the 2-input/1-output multiplexer, depending on the value of the control bit (NO/OP) stored in the internal flip-flop.
When NO/OP = 0, the bus Sin is passed directly to the neighboring PE located on the right (Sin = Sout) and hence no processing is performed within the PE; the processor is in a Non-Operational (NO) mode. This mode provides two interesting features for the systolic array, namely: (i) the possibility of exhaustively testing individual PEs within the array and (ii) improved fault tolerance of the system. With the NO mode, it is possible to isolate a defective PE within the systolic array and/or a defective chip within a multi-chip hardware.

Fig. 2. (A.) Internal architecture of a basic VLSI chip. (B.) Processing Element (PE) building block diagram.

3.1. Reconfigurability

The 4 × 4 systolic array can be configured to operate at three different weight precisions, as shown in Figure 3. The input precision is arbitrarily selected by the user and hence Xi can take any word length. As can be seen from Figure 3, different topologies of threshold networks can be configured depending on the selected precision. $P_b$ p × q stands for a network topology with b bits of precision, p inputs and q TLUs. For 4-bit precision, three configurations are possible, namely $P_4$ 16 × 4, $P_4$ 8 × 8 and $P_4$ 4 × 16. For 8-bit precision, two configurations are possible: $P_8$ 4 × 8 and $P_8$ 8 × 4. Only one configuration is possible for 16-bit precision: $P_{16}$ 4 × 4. The available resources of the circuit are a trade-off between the three parameters (b, p and q). Lowering the precision allows the number of inputs or TLUs to be increased, according to b × p × q = c, where c is a constant that depends on the number of chips interfaced. For example, c = 256 for a single chip, while c = 1024 for four cascaded chips.

3.2. Experimental results and fault tolerance

A total of 50 basic chips were manufactured (44 dies and 6 packaged chips). After verifying the full functionality of the packaged chips, the remaining 44 dies were mounted on 11 MCMs, each containing 4 chips. Special care was paid to the design of the MCM in order to make the test and debugging of each chip within the MCM possible. This was achieved by designing special back-side contacts on the MCM such that they would fit on a test socket. Access to inter-chip connections was also obtained using back-side MCM contacts designed to bring the test contacts of the MCM out to a standard DIL package. For each MCM, every chip was separately addressed and tested. Several tests were conducted, namely: (i) Test 1: wire bonding and MCM-level connectivity test; (ii) Test 2: propagation of the control signals within the MCM; (iii) Test 3: synaptic weight loading test; (iv) Test 4: NO/OP programming test; and (v) Test 5: exhaustive functional test.

Figure 4 summarizes the test results for the 11 MCMs. Of the 11 MCMs, 6 were found to be fully operational (54%). Faulty wire bonding was detected for MCMs 1 and 2. MCM 3 passed the connectivity test but failed all the remaining tests. MCMs 4 and 5 successfully passed all tests except the last functional test. Even though MCMs 4 and 5 do not operate properly, their malfunction would not affect the correct operation of neighboring MCMs if they were mounted within a larger system. This is made possible by the NO mode, which considerably improves the fault tolerance of the system. Indeed, 2 MCMs out of the 5 faulty ones (40%) do not present catastrophic faults, thanks to the NO feature.
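The resource trade-off b × p × q = c of Section 3.1 can be checked numerically against the six configurations listed there. The Python sketch below assumes that c corresponds to the total weight storage of the array (16 PEs × one 16-bit weight register = 256 bits per chip); this is an interpretation that is consistent with the quoted values c = 256 and c = 1024, not a statement taken from the chip description, and all names are hypothetical.

```python
PES_PER_CHIP = 4 * 4          # 4 x 4 systolic array
WEIGHT_REGISTER_BITS = 16     # one 16-bit weight register per PE

# (b, p, q): weight precision in bits, number of inputs, number of TLUs
SINGLE_CHIP_CONFIGS = [
    (4, 16, 4), (4, 8, 8), (4, 4, 16),
    (8, 4, 8), (8, 8, 4),
    (16, 4, 4),
]

def capacity(chips=1):
    """Constant c, assumed equal to the total number of weight bits."""
    return chips * PES_PER_CHIP * WEIGHT_REGISTER_BITS

def weights_per_pe(b):
    """Number of b-bit weights that fit in one 16-bit register."""
    return WEIGHT_REGISTER_BITS // b

def satisfies_constraint(b, p, q, chips=1):
    """Check b * p * q == c for the given number of chips."""
    return b * p * q == capacity(chips)

if __name__ == "__main__":
    for b, p, q in SINGLE_CHIP_CONFIGS:
        print(f"P{b} {p}x{q}: {weights_per_pe(b)} weight(s) per PE,",
              "ok" if satisfies_constraint(b, p, q) else "violates b*p*q = c")
```

Under this reading, halving the weight precision doubles the number of weights each PE can hold, which is why the product p × q doubles from 16 to 32 to 64 as b goes from 16 to 8 to 4 bits.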
Fig. 3. The different topologies of threshold networks implemented by the circuit. Panels: 16-bit precision ($P_{16}$ 4 × 4); 8-bit precision ($P_8$ 4 × 8 and $P_8$ 8 × 4); 4-bit precision ($P_4$ 16 × 4, $P_4$ 8 × 8 and $P_4$ 4 × 16). Each panel shows the inputs, the synapses (weights), the neurons and the reconfiguration switches controlled by cne.

MCM #    Test 1   Test 2   Test 3   Test 4   Test 5
MCM 1    F        F        F        F        F
MCM 2    F        F        F        F        F
MCM 3    P        F        F        F        F
MCM 4    P        P        P        P        F
MCM 5    P        P        P        P        F
MCM 6    P        P        P        P        P
MCM 7    P        P        P        P        P
MCM 8    P        P        P        P        P
MCM 9    P        P        P        P        P
MCM 10   P        P        P        P        P
MCM 11   P        P        P        P        P

Fig. 4. Fault characterization of the 11 MCMs with respect to the 5 tests described. P and F stand for pass and fail, respectively. In the original figure, dark shading marks samples with a catastrophic fault (MCMs 1 to 3), light shading marks non-operational samples that can nevertheless be isolated using the NO mode (MCMs 4 and 5), and unshaded rows are samples with no fault (MCMs 6 to 11).

In order to test the reconfigurable hardware for gas sensor applications, an experimental setup was developed to extract the response of a gas sensor array composed of 5 commercial Figaro TGS gas sensors. Ethanol or butanol vapors were injected into the gas chamber at a flow rate determined by mass flow controllers. We used 16 different concentrations of ethanol ranging from 1360 to 5165 ppm and 15 different concentrations of butanol ranging from 870 to 3050 ppm. We then recorded the steady-state outputs of 4 sensor arrays, each composed of the 5 TGS sensors, for the 31 different concentrations of ethanol and butanol. This yielded a total of 124 patterns (4 arrays × 31 concentrations). This odor dataset was used for estimating the performance of decision trees trained with bagging.
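The bagging procedure of Section 2 rests on two simple operations: drawing a bootstrap sample of the same size as the data set, and combining the ensemble outputs by majority vote. For a data set of n examples, the expected fraction of distinct examples in one bootstrap sample is 1 − (1 − 1/n)^n, which tends to 1 − 1/e ≈ 0.632 for large n; this is the 63% figure quoted earlier. The Python sketch below illustrates both operations. It is not the modified OC1 program: the function names are hypothetical, and ties in the vote are broken in favour of the label seen first, a choice the text does not specify.

```python
import random
from collections import Counter

def bootstrap_indices(n, rng):
    """Draw n indices from range(n) uniformly, with replacement."""
    return [rng.randrange(n) for _ in range(n)]

def unique_fraction(n, rng):
    """Fraction of the n original examples present in one bootstrap sample."""
    return len(set(bootstrap_indices(n, rng))) / n

def expected_unique_fraction(n):
    """Closed form 1 - (1 - 1/n)^n for the expected fraction."""
    return 1.0 - (1.0 - 1.0 / n) ** n

def majority_vote(predictions):
    """Most frequent class label among the ensemble members' outputs."""
    return Counter(predictions).most_common(1)[0][0]

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so that runs are repeatable
    print("sampled fraction :", unique_fraction(10_000, rng))
    print("expected fraction:", expected_unique_fraction(10_000))
    print("ensemble decision:", majority_vote([1, 2, 2, 3, 2]))
```

With ten members voting on a three-class problem, ties are possible in principle, so a hardware or software implementation has to fix a tie-breaking rule; the one used here is only a placeholder.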
The test and training performance were measured experimentally on the chip and compared with the performance obtained by simulation, for 4-bit and 8-bit precisions and for ensembles of 10 and 20 threshold networks. A test performance of 96% was obtained for an ensemble of 10 threshold networks with 8-bit weight precision, while the performance dropped to 84% for 4-bit weight precision. Using an ensemble of 20 threshold networks at 4-bit precision improves the performance by 3 percentage points, resulting in 87% accuracy. The performance measurements obtained for both training and test match those obtained by simulation. Peak classification rates of 28M, 11M and 4M samples/s were obtained for 4-bit, 8-bit and 16-bit weight precision respectively, with an input precision of 11 bits.

4. CONCLUSION

In this paper we presented a reconfigurable VLSI implementation of decision trees. The proposed architecture supports a configurable decision tree structure as well as variable-precision computation, in which low-precision applications can either reduce the power consumption or increase the ensemble size to improve the classification performance. Inefficient usage of the hardware resources is avoided since both the weight and input precisions are user-defined for the application at hand. Statistical tests performed on the manufactured MCMs showed that the proposed NO/OP feature implemented in our circuit improves the fault tolerance of the system by as much as 40%. Operation of the hardware as a classifier was also demonstrated through a gas detection application. Experimental results suggest a peak classification performance of 28M, 11M and 4M samples/s for 4-bit, 8-bit and 16-bit weight precision, respectively.

5. ACKNOWLEDGMENTS

This work was supported in part by a Procore Research Grant ref: F-HK15/03T.

6. REFERENCES

[1] R. O. Duda, P. E. Hart and D. G. Stork, “Pattern Classification,” Wiley-Interscience, 2001.
[2] L. Breiman, “Bagging predictors,” Machine Learning, 24(2), pp. 123-140, 1996.
[3] R. Maclin and D. Opitz, “An Empirical evaluation of bagging and boosting,” Proc. of AAAI, 1997.
[4] S. K. Murthy et al., “A System for Induction of Oblique Dec. Trees,” J. of AI 2, pp. 1-33, 1994.