An assessment of support vector machines for land cover classification
int. j. remote sensing, 2002, vol. 23, no. 4, 725–749

C. HUANG†, Department of Geography, University of Maryland, College Park, MD 20742, USA
L. S. DAVIS, Institute for Advanced Computing Studies, University of Maryland, College Park, MD 20742, USA
J. R. G. TOWNSHEND, Department of Geography and Institute for Advanced Computing Studies, University of Maryland, College Park, MD 20742, USA

(Received 27 October 1999; in final form 27 November 2000)

Abstract. The support vector machine (SVM) is a group of theoretically superior machine learning algorithms. It has been found competitive with the best available machine learning algorithms in classifying high-dimensional data sets. This paper gives an introduction to the theoretical development of the SVM and an experimental evaluation of its accuracy, stability and training speed in deriving land cover classifications from satellite images. The SVM was compared with three other popular classifiers: the maximum likelihood classifier (MLC), neural network classifiers (NNC) and decision tree classifiers (DTC). The impacts of kernel configuration on the performance of the SVM, and of the selection of training data and input variables on the four classifiers, were also evaluated in this experiment.

1. Introduction
Land cover information has been identified as one of the crucial data components for many aspects of global change studies and environmental applications (Sellers et al. 1995). The derivation of such information increasingly relies on remote sensing technology because of its ability to acquire measurements of land surfaces at various spatial and temporal scales. One of the major approaches to deriving land cover information from remotely sensed images is classification. Numerous classification algorithms have been developed since the first Landsat image was acquired in the early 1970s (Townshend 1992, Hall et al. 1995).
Among the most popular are the maximum likelihood classifier (MLC), neural network classifiers and decision tree classifiers. The MLC is a parametric classifier based on statistical theory. Despite limitations arising from its assumption of normally distributed class signatures (e.g. Swain and Davis 1978), it is perhaps one of the most widely used classifiers (e.g. Wang 1990, Hansen et al. 1996). Neural networks avoid some of the problems of the MLC by adopting a non-parametric approach. Their potential discriminating power has attracted a great deal of research effort. As a result, many types of neural networks have been developed (Lippman 1987); the most widely used in the classification of remotely sensed images is a group of networks called the multi-layer perceptron (MLP) (e.g. Paola and Schowengerdt 1995, Atkinson and Tatnall 1997). A decision tree classifier takes a different approach to land cover classification. It breaks an often very complex classification problem into multiple stages of simpler decision-making processes (Safavian and Landgrebe 1991). Depending on the number of variables used at each stage, there are univariate and multivariate decision trees (Friedl and Brodley 1997). Univariate decision trees have been used to develop land cover classifications at a global scale (DeFries et al. 1998, Hansen et al. 2000). Though multivariate decision trees are often more compact and can be more accurate than univariate decision trees (Brodley and Utgoff 1995), they involve more complex algorithms and, as a result, are affected by a suite of algorithm-related factors (Friedl and Brodley 1997).

†Current address: Raytheon ITSS, USGS/EROS Data Center, Sioux Falls, SD 57108, USA; e-mail: [email protected]

International Journal of Remote Sensing, ISSN 0143-1161 print/ISSN 1366-5901 online, © 2002 Taylor & Francis Ltd, http://www.tandf.co.uk/journals, DOI: 10.1080/01431160110040323
The univariate decision tree developed by Quinlan (1993) is evaluated in this study.

The support vector machine (SVM) represents a group of theoretically superior machine learning algorithms. As described in the following section, the SVM employs optimization algorithms to locate the optimal boundaries between classes. Statistically, the optimal boundaries should generalize to unseen samples with the fewest errors among all possible boundaries separating the classes, thereby minimizing the confusion between classes. In practice, the SVM has been applied to optical character recognition, handwritten digit recognition and text categorization (Vapnik 1995, Joachims 1998b). These experiments found the SVM to be competitive with the best available classification methods, including neural networks and decision tree classifiers. The superior performance of the SVM has also been demonstrated in classifying hyperspectral images acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) (Gualtieri and Cromp 1998). While hundreds of variables were used as input in the experiments mentioned above, there are far fewer variables in data acquired from operational sensor systems such as Landsat, the Advanced Very High Resolution Radiometer (AVHRR) and the Moderate Resolution Imaging Spectroradiometer (MODIS). Because these are among the major sensor systems from which land cover information is derived, an evaluation of the performance of the SVM using images from such sensor systems should have practical implications for land cover classification. The purpose of this paper is to demonstrate the applicability of this algorithm to deriving land cover from such operational sensor systems and to evaluate its performance systematically in comparison with other popular classifiers: the statistical maximum likelihood classifier (MLC), a back-propagation neural network classifier (NNC) (Pao 1989) and a decision tree classifier (DTC) (Quinlan 1993).
The SVM was implemented by Joachims (1998a) as SVMlight. A brief introduction to the theoretical development of the SVM is given in the following section. This is deemed necessary because the SVM is relatively new to the remote sensing community compared with the other three methods. The data set and experimental design are presented in §3. Experimental results are discussed in the following three sections, covering the impact of kernel configuration on the performance of the SVM, the comparative performances of the four classifiers, and the impact of non-algorithm factors. The results of this study are summarized in the last section.

2. Theoretical development of SVM
There are a number of publications detailing the mathematical formulation of the SVM (see e.g. Vapnik 1995, 1998, Burges 1998). The algorithm development in this section follows Vapnik (1995) and Burges (1998). The inductive principle behind the SVM is structural risk minimization (SRM). According to Vapnik (1995), the risk of a learning machine (R) is bounded by the sum of the empirical risk estimated from training samples (R_emp) and a confidence interval (Ψ):

  R ≤ R_emp + Ψ

The strategy of SRM is to keep the empirical risk (R_emp) fixed and to minimize the confidence interval (Ψ), or equivalently to maximize the margin between a separating hyperplane and the closest data points (figure 1). A separating hyperplane is a plane in a multi-dimensional space that separates the data samples of two classes. The optimal separating hyperplane is the separating hyperplane that maximizes the margin from the closest data points to the plane. One SVM classifier can only separate two classes; integration strategies are needed to extend this method to classifying multiple classes.

2.1.
The optimal separating hyperplane
Let the training data of two separable classes, with k samples, be represented by (x_1, y_1), …, (x_k, y_k), where x ∈ Rⁿ is an n-dimensional vector and y ∈ {+1, −1} is the class label. Suppose the two classes can be separated by two hyperplanes parallel to the optimal hyperplane (figure 1(a)):

  w·x_i + b ≥ +1  for y_i = +1,  i = 1, 2, …, k    (1)
  w·x_i + b ≤ −1  for y_i = −1    (2)

Figure 1. The optimal separating hyperplane between (a) separable samples and (b) non-separable data samples.

where w = (w_1, …, w_n) is a vector of n elements. Inequalities (1) and (2) can be combined into a single inequality:

  y_i [w·x_i + b] ≥ 1,  i = 1, …, k    (3)

As shown in figure 1, the optimal separating hyperplane is the one that separates the data with the maximum margin. This hyperplane can be found by minimizing the norm of w, i.e. the function

  F(w) = ½ (w·w)    (4)

under inequality constraint (3). The saddle point of the following Lagrangian gives the solution to the above optimization problem:

  L(w, b, α) = ½ (w·w) − Σ_{i=1}^{k} α_i { y_i [w·x_i + b] − 1 }    (5)

where the α_i ≥ 0 are Lagrange multipliers (Sundaram 1996). The solution to this optimization problem requires that the gradient of L(w, b, α) with respect to w and b vanishes, giving the following conditions:

  w = Σ_{i=1}^{k} y_i α_i x_i    (6)
  Σ_{i=1}^{k} α_i y_i = 0    (7)

By substituting (6) and (7) into (5), the optimization problem becomes: maximize

  L(α) = Σ_{i=1}^{k} α_i − ½ Σ_{i=1}^{k} Σ_{j=1}^{k} α_i α_j y_i y_j (x_i·x_j)    (8)

under the constraints α_i ≥ 0, i = 1, …, k. Given an optimal solution α⁰ = (α⁰_1, …, α⁰_k) to (8), the solution w⁰ to (5) is a linear combination of the training samples:

  w⁰ = Σ_{i=1}^{k} y_i α⁰_i x_i    (9)

According to the Kuhn–Tucker theory (Sundaram 1996), only points that satisfy the equalities in (1) and (2) can have non-zero coefficients α⁰_i. These points lie on the two parallel hyperplanes and are called support vectors (figure 1).
Let x⁰(+1) be a support vector of one class and x⁰(−1) one of the other; then the constant b⁰ can be calculated as follows:

  b⁰ = −½ [w⁰·x⁰(+1) + w⁰·x⁰(−1)]    (10)

The decision rule that separates the two classes can be written as:

  f(x) = sign( Σ_{support vectors} y_i α⁰_i (x_i·x) + b⁰ )    (11)

2.2. Dealing with non-separable cases
An important assumption in the above solution is that the data are separable in the feature space. It is easy to check that there is no optimal solution if the data cannot be separated without error. To resolve this problem, a penalty value C for misclassification errors and positive slack variables ξ_i are introduced (figure 1(b)). These variables are incorporated into constraints (1) and (2) as follows:

  w·x_i + b ≥ +1 − ξ_i  for y_i = +1    (12)
  w·x_i + b ≤ −1 + ξ_i  for y_i = −1    (13)
  ξ_i ≥ 0,  i = 1, …, k    (14)

The objective function (4) then becomes:

  F(w, ξ) = ½ (w·w) + C ( Σ_{i=1}^{k} ξ_i )^l    (15)

where C is a preset penalty value for misclassification errors. If l = 1, the solution to this optimization problem is similar to that of the separable case.

2.3. Support vector machines
To generalize the above method to non-linear decision functions, the support vector machine implements the following idea: it maps the input vector x into a high-dimensional feature space H and constructs the optimal separating hyperplane in that space. Suppose the data are mapped into the high-dimensional space H through a mapping function Φ:

  Φ: Rⁿ → H    (16)

A vector x in the input space is then represented as Φ(x) in the high-dimensional space H. Since the only way in which the data appear in the training problem (8) is in the form of dot products of two vectors, the training algorithm in the high-dimensional space H would depend on the data in this space only through dot products, i.e. on functions of the form Φ(x_i)·Φ(x_j).
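The decision rule of equations (10) and (11) can be sketched in a few lines of Python. This is an illustrative toy rather than the paper's SVMlight implementation: the support vectors, multipliers and bias below are hand-picked example values, not the result of solving optimization problem (8).

```python
def svm_decision(x, support_vectors, b0):
    """Evaluate equation (11): f(x) = sign(sum_i y_i * a_i * (x_i . x) + b0),
    where the sum runs over support vectors only (all other training points
    have a_i = 0 and drop out of the sum)."""
    s = sum(a * y * sum(xk * vk for xk, vk in zip(xi, x))
            for xi, y, a in support_vectors)
    return 1 if s + b0 >= 0 else -1

# Hand-picked 2-D example (assumed values, for illustration only):
# support vectors (1, 1) labelled +1 and (-1, -1) labelled -1, each with
# multiplier 0.25, give w0 = (0.5, 0.5) by equation (9) and b0 = 0 by (10).
svs = [((1.0, 1.0), 1, 0.25), ((-1.0, -1.0), -1, 0.25)]
print(svm_decision((2.0, 3.0), svs, 0.0))    # prints 1
print(svm_decision((-2.0, -1.0), svs, 0.0))  # prints -1
```

Note that the decision depends on the training data only through dot products x_i·x, which is what makes the kernel substitution of the next subsection possible.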
Now, if there is a kernel function K such that

  K(x_i, x_j) = Φ(x_i)·Φ(x_j)    (17)

we would only need to use K in the training program, without knowing the explicit form of Φ. The same trick can be applied to the decision function (11), because there too the data appear only in the form of dot products. Thus if a kernel function K can be found, we can train and use a classifier in the high-dimensional space without knowing the explicit form of the mapping function. The optimization problem (8) can be rewritten as:

  L(α) = Σ_{i=1}^{k} α_i − ½ Σ_{i=1}^{k} Σ_{j=1}^{k} α_i α_j y_i y_j K(x_i, x_j)    (18)

and the decision rule expressed in equation (11) becomes:

  f(x) = sign( Σ_{support vectors} y_i α⁰_i K(x_i, x) + b⁰ )    (19)

A kernel that can be used to construct a SVM must meet Mercer's condition (Courant and Hilbert 1953). The following two types of kernels meet this condition and will be considered in this study (Vapnik 1995): the polynomial kernels,

  K(x_1, x_2) = (x_1·x_2 + 1)^p    (20)

and the radial basis functions (RBF),

  K(x_1, x_2) = exp(−c (x_1 − x_2)²)    (21)

2.4. From binary classifier to multi-class classifier
In the above theoretical development the SVM was developed as a binary classifier, i.e. one SVM can only separate two classes. Strategies are needed to adapt this method to multi-class cases. Two simple strategies have been proposed to adapt the SVM to N-class problems (Gualtieri and Cromp 1998). One is to construct a machine for each pair of classes, resulting in N(N−1)/2 machines. When applied to a test pixel, each machine gives one vote to the winning class, and the pixel is labelled with the class having the most votes. The other strategy is to break the N-class case into N two-class cases, in each of which a machine is trained to classify one class against all others.
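The two kernel families of equations (20) and (21) are simple to write down directly. A minimal sketch in Python, with parameter names p and c following the text:

```python
import math

def polynomial_kernel(x1, x2, p):
    """Equation (20): K(x1, x2) = (x1 . x2 + 1)^p."""
    return (sum(a * b for a, b in zip(x1, x2)) + 1) ** p

def rbf_kernel(x1, x2, c):
    """Equation (21): K(x1, x2) = exp(-c * ||x1 - x2||^2)."""
    return math.exp(-c * sum((a - b) ** 2 for a, b in zip(x1, x2)))

# With p = 1 the polynomial kernel reduces to an offset linear kernel;
# the RBF kernel equals 1 when x1 = x2 and decays with squared distance.
```

Either function can be substituted for the dot product in (18) and (19); the explicit mapping Φ is never constructed.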
When applied to a test pixel, a value measuring the confidence that the pixel belongs to a class can be calculated from equation (19), and the pixel is labelled with the class for which it has the highest confidence value (Vapnik 1995). Without an evaluation of the two strategies, the second one was used in this study because it only requires training N SVM machines for an N-class case, while for the same classification the first strategy requires training N(N−1)/2 SVM machines. With the second strategy, each SVM machine is constructed to separate one class from all other classes. An obvious problem with this strategy is that, in constructing each SVM machine, the sizes of the two classes concerned can be highly unbalanced, because one of them is the aggregation of N−1 classes. For data samples that cannot be separated without error, a classifier may not be able to find a boundary between two highly unbalanced classes. For example, a classifier may not be able to find a boundary between the two classes shown in figure 2, because it probably makes the fewest errors by labelling all pixels belonging to the smaller class with the larger one. To avoid this problem, the samples of the smaller class are replicated such that the two classes have approximately the same size. A similar trick was employed in constructing decision tree classifiers for highly unbalanced classes (DeFries et al. 1998).

Figure 2. An example of highly unbalanced training samples in a two-dimensional space defined by two arbitrary variables, features 1 and 2. A classifier might incur more errors by drawing boundaries between the two classes than by labelling pixels of the smaller class with the larger one.

3. Data and experimental design
3.1. Data and preprocessing
A spatially degraded Thematic Mapper (TM) image and a corresponding reference map were used in this evaluation study.
The TM image, acquired over eastern Maryland on 14 August 1985, has a spatial resolution of 28.5 m. The six spectral bands (bands 1–5 and 7) of the TM image were converted to top-of-atmosphere (TOA) reflectance according to Markham and Barker (1986). Atmospheric correction was not necessary because the image was quite clear within the study area. Three broad cover types—forest, non-forest land and water—were delimited from this image, giving a land cover map with the same spatial resolution as the TM image. This three-class scheme was selected to ensure that a highly accurate land cover map could be produced at this resolution. Confused pixels were labelled according to aerial photographs and field visits covering the study area. Both the TM image and the derived land cover map were degraded to a spatial resolution of 256.5 m with a degrading ratio of 9:1, i.e. each degraded pixel corresponds to 9 by 9 TM pixels. The main reason for evaluating the classifiers using degraded data is that a highly reliable reference land cover map with a reasonable number of classes can be generated at the degraded resolution. The image was degraded using a simulation programme embedded with models of the point spread functions (PSF) of the TM and MODIS sensors (Barker and Burelhach 1992). By considering the PSF of both sensor systems, the simulation programme gives more realistic images than spatial averaging (Justice et al. 1989). Overlaying the 256.5 m grids on the 28.5 m land cover map and calculating the proportions of forest, non-forest land and water within each 256.5 m grid gave proportion images of forest, non-forest land and water at the 256.5 m resolution. A land cover map at the 256.5 m resolution was developed by reclassifying the proportion images according to the class definitions given in table 1. These definitions were based on the IGBP classification scheme (Belward and Loveland 1996, DeFries et al. 1998). Class names were chosen to match the definitions used in this study.
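The overlay-and-proportion step described above can be sketched as follows. This is a simplified stand-in for the paper's procedure: it uses plain block counting on a toy 3:1 degrading ratio, whereas the paper used a 9:1 ratio and a PSF-based simulation programme rather than simple aggregation.

```python
def class_proportions(land_cover, block):
    """Aggregate a fine-resolution land cover map (a 2-D list of class
    labels) into per-block class proportions, one dict per coarse cell.
    Assumes the map dimensions are exact multiples of `block`."""
    out = []
    for r0 in range(0, len(land_cover), block):
        row = []
        for c0 in range(0, len(land_cover[0]), block):
            cells = [land_cover[r][c]
                     for r in range(r0, r0 + block)
                     for c in range(c0, c0 + block)]
            row.append({cls: cells.count(cls) / len(cells) for cls in set(cells)})
        out.append(row)
    return out

# Toy 3x3 map -> one coarse cell: 4/9 forest, 4/9 non-forest, 1/9 water.
toy = [["F", "F", "N"],
       ["F", "F", "N"],
       ["W", "N", "N"]]
props = class_proportions(toy, 3)
```

Reclassifying such proportion values against thresholds like those in table 1 yields the coarse-resolution reference map.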
Table 1. Definition of land cover classes for the Maryland data set.

  Code  Cover type       Definition
  1     Closed forest    tree cover > 60%, water ≤ 20%
  2     Open forest      30% < tree cover ≤ 60%, water ≤ 20%
  3     Woodland         10% < tree cover ≤ 30%, water ≤ 20%
  4     Non-forest land  tree cover ≤ 10%, water ≤ 20%
  5     Land-water mix   20% < water ≤ 70%
  6     Water            water > 70%

3.2. Experimental design
Many factors affect the performance of a classifier, including the selection of training and testing data samples as well as input variables (Gong and Howarth 1990, Foody et al. 1995). Because the impact of testing data selection on accuracy assessment has been investigated in many works (e.g. Genderen and Lock 1978, Stehman 1992), only the selection of training samples and the selection of input variables were considered in this study. In order to avoid biases in the confidence level of accuracy estimates due to inappropriately sampled testing data (Fitzpatrick-Lins 1981, Dicks and Lo 1990), the accuracy measure of each test was estimated from all pixels not used as training data.

3.2.1. Training data selection
Training data selection is one of the major factors determining to what degree the classification rules can be generalized to unseen samples (Paola and Schowengerdt 1995). A previous study showed that this factor can be more important for obtaining accurate classifications than the choice of classification algorithm (Hixson et al. 1980). To assess the impact of training data size on the different classification algorithms, the selected algorithms were tested using training data of varying sizes. Specifically, the four algorithms were trained using approximately 2, 4, 6, 8, 10 and 20% of the pixels of the entire image. With data sizes fixed, training pixels can be selected in many ways. A commonly used sampling method is to identify and label small patches of homogeneous pixels in an image (Campbell 1996). However, adjacent pixels tend to be spatially correlated or have similar values (Campbell 1981).
Training samples collected in this way underestimate the spectral variability of each class and are likely to give degraded classifications (Gong and Howarth 1990). A simple method to minimize the effect of spatial correlation is random sampling (Campbell 1996). Two random sampling strategies were investigated in this experiment. One is called equal sample rate (ESR), in which a fixed percentage of pixels is randomly sampled from each class as training data. The other is called equal sample size (ESS), in which a fixed number of pixels is randomly sampled from each class as training data. In both strategies the total number of training samples is approximately the same as that calculated according to the predefined 2, 4, 6, 8, 10 and 20% sampling rates for the whole data set.

3.2.2. Selection of input variables
The six TM spectral bands roughly correspond to six MODIS bands at 250 m and 500 m resolutions (Barnes et al. 1998). Only the red (TM band 3) and near-infrared (NIR, TM band 4) bands are available at 250 m resolution. The other four TM bands are available at 500 m resolution. Because these four bands contain information that is complementary to the red and NIR bands (Townshend 1984, Toll 1985), not having them at 250 m resolution may limit the ability to derive land cover information at this resolution. Two sets of tests were performed to evaluate the impact of not having the four TM bands on land cover characterization at the 250 m resolution. In the first set only the red band, the NIR band and the normalized difference vegetation index (NDVI) were used as input to the classifiers, while in the second set the other four bands were also included. NDVI is calculated from the red and NIR bands as follows:

  NDVI = (NIR − red) / (NIR + red)    (22)

Table 2 summarizes the training conditions under which the four classification algorithms were evaluated.
Table 2. Training data conditions under which the classification algorithms were tested.

  Sampling method     Sample size (% of image)  Input variables  Training case no.
  Equal sample size    2                         3, 7             1, 2
                       4                         3, 7             3, 4
                       6                         3, 7             5, 6
                       8                         3, 7             7, 8
                       10                        3, 7             9, 10
                       20                        3, 7             11, 12
  Equal sample rate    2                         3, 7             13, 14
                       4                         3, 7             15, 16
                       6                         3, 7             17, 18
                       8                         3, 7             19, 20
                       10                        3, 7             21, 22
                       20                        3, 7             23, 24

3.2.3. Cross validation
In the above experiment only one training data set was sampled from the image at each training size level. In order to evaluate the stability of the selected classifiers, and for the results to be statistically valid, cross validations were performed at two training data size levels: 6% of the pixels, representing a relatively small training size, and 20%, representing a relatively large training size. At each size level, ten sets of training samples were randomly selected from the image using the equal sample rate (ESR) method. As will be discussed in §6.1, this method gave slightly higher accuracies than ESS. On each training data set the four classification algorithms were trained using three and seven variables.

3.3. Methods for performance assessment
The criteria for evaluating the performance of classification algorithms include accuracy, speed, stability and comprehensibility, among others. Which criterion or group of criteria to use depends on the purpose of the evaluation. As the criterion most relevant to all parties and all purposes, accuracy was selected as the primary criterion in this assessment. Speed and stability are also important factors in algorithm selection and were considered as well. Two widely used accuracy measures—overall accuracy and the kappa coefficient—were used in this study (Rosenfield and Fitzpatrick-Lins 1986, Congalton 1991, Janssen and Wel 1994).
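Both measures can be computed directly from a classification's confusion matrix. A minimal sketch (taking rows as reference classes and columns as mapped classes, an assumed convention; the counts below are made-up illustrative numbers, not results from the paper):

```python
def overall_accuracy(cm):
    """Proportion of correctly classified pixels: trace / total."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total

def kappa(cm):
    """Kappa coefficient: agreement corrected for chance agreement,
    kappa = (p_o - p_e) / (1 - p_e), where p_e is the agreement expected
    from the row and column marginals alone."""
    total = sum(sum(row) for row in cm)
    p_o = overall_accuracy(cm)
    p_e = sum(sum(cm[i]) * sum(row[i] for row in cm)
              for i in range(len(cm))) / total ** 2
    return (p_o - p_e) / (1 - p_e)

# Illustrative 2-class confusion matrix: 85 of 100 pixels correct.
cm = [[40, 10],
      [5, 45]]
```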
The overall accuracy has the advantage of being directly interpretable as the proportion of pixels classified correctly (Janssen and Wel 1994, Stehman 1997), while the kappa coefficient allows a statistical test of the significance of the difference between two algorithms (Congalton 1991).

4. Impact of kernel configuration on the performances of the SVM
According to the theoretical development of the SVM presented in §2, the kernel function plays a major role in locating complex decision boundaries between classes. By mapping the input data into a high-dimensional space, the kernel function converts non-linear boundaries in the original data space into linear ones in the high-dimensional space, which can then be located using an optimization algorithm. Therefore the selection of the kernel function and of appropriate values for the corresponding kernel parameters, referred to as the kernel configuration, may affect the performance of the SVM.

4.1. Polynomial kernels
The parameter to be predefined for the polynomial kernels is the polynomial order p. Following previous studies (Cortes and Vapnik 1995), p values of 1 to 8 were tested for each of the 24 training cases. Rapid increases in computing time as p increases limited experiments with higher p values. Kernel performance is measured using the overall agreement between a classification and a reference map—the overall accuracy (Stehman 1997). Figure 3 shows the impact of p on kernel performance. In general, the linear kernel (p = 1) performed worse than non-linear kernels, which is expected because the boundaries between many classes are more likely to be non-linear. With three variables as input, there are obvious trends of improved accuracy as p increases (figure 3(c) and (d)). Such trends are also observed in the training cases with seven input variables when p increases from 1 to 4 (figure 3(a) and (b)).
This observation contrasts with the studies of Cortes and Vapnik (1995), in which no obvious trend was observed when the polynomial order p increased from 2 to higher values. This is probably because the number of input variables used in this study is quite different from those used in previous studies. The data set used in this experiment has only a few variables, while those used in previous studies had hundreds. The differences between the observations of this experiment and those of previous studies suggest that the polynomial order p has different impacts on kernel performance when different numbers of input variables are used. With large numbers of input variables, complex non-linear decision boundaries can still be mapped into linear ones using relatively low-order polynomial kernels. However, if a data set has only a few variables, it is necessary to try high-order polynomial kernels in order to achieve optimal performance with a polynomial SVM.

4.2. RBF kernels
The parameter to be preset for the RBF kernel defined in equation (21) is c. In previous studies, c values of around 1 were used (Vapnik 1995, Joachims 1998b).

Figure 3. Performance of polynomial kernels as a function of polynomial order p (training data size is % pixels of the image). (a) Equal sample size, 7 variables; (b) equal sample rate, 7 variables; (c) equal sample size, 3 variables; (d) equal sample rate, 3 variables.

For this specific data set, c values between 1 and 20 gave reasonable results (figure 4). A comparison between figure 3 and figure 4 reveals that the performance of the RBF kernel is less affected by c than that of the polynomial kernel is by p. With seven input variables (figure 4(a) and (b)), the overall accuracy changed only slightly when c varied between 1 and 20. With three input variables, however, the impact is more significant.
Figure 4(c) and (d) show obvious trends of increased performance as c increased from 1 to 7.5. For most training cases the overall accuracy changed only slightly when c increased beyond 7.5.

The impact of the kernel parameter on kernel performance can be illustrated using an experiment performed on arbitrary data samples collected in a two-dimensional space. Figure 5 shows the data samples of two classes and the decision boundaries between the two classes as located by polynomial and RBF kernels. Notice that although the decision boundaries located by all non-linear kernels (all polynomial kernels with p > 1 and all RBF kernels) are similar, for this specific set of samples the shape of the decision boundary is adjusted slightly and misclassification errors are reduced gradually as p increases from 3 to 12 for the polynomial kernel (figure 5(a)), or as c decreases from 1 to 0.1 for the RBF kernel (figure 5(b)). With appropriate kernel parameter values, both the polynomial (p = 12) and the RBF (c = 0.1) kernels classified this arbitrary data set without error, though the decision boundaries defined by the two types of kernels are not exactly the same.

Figure 4. Performance of RBF kernels as a function of c (training data size is % pixels of the image). (a) Equal sample size, 7 variables; (b) equal sample rate, 7 variables; (c) equal sample size, 3 variables; (d) equal sample rate, 3 variables.

How well these decision boundaries can be generalized to unseen samples depends on the distribution of the unseen data samples. As will be discussed in §6, classification accuracy is affected by training sample size and the number of input variables. Figures 3 and 4 show that most SVM kernels gave higher accuracies with a larger training size and more input variables.
With three input variables, however, most SVM kernels gave unexpectedly higher accuracies on the training case with 2% of the pixels sampled using the equal sample size (ESS) method than on several larger training data sets selected using the same sampling method (figures 3(c) and 4(c)). This is probably because the SVM defines decision boundaries between classes using support vectors rather than statistical attributes, which are sample size dependent (figure 5). Although a larger training data set has a better chance of including the support vectors that define the actual decision boundaries, and hence should give higher accuracies, there are occasions when a smaller training data set includes such support vectors while larger ones do not. In §6.1 we will show that the other three classifiers did not show such abnormally high accuracies on this training case (see figure 8(c) later).

Figure 5. Impact of kernel configuration on the decision boundaries and misclassification errors of the SVM. Empty and solid circles represent two arbitrary classes. Circled points are support vectors. Checked points represent misclassification errors. Red and blue represent high confidence areas for class one (empty circles) and class two (solid circles) respectively. Optimal separating hyperplanes are highlighted in white.

5. Comparative performances of the four classifiers
The previous section illustrated the impact of kernel parameter settings on the accuracy of the SVM. Similarly, the performances of the other classification algorithms may also be affected by their parameter settings. For example, the performance of the NNC is influenced by the network structure (e.g. Sui 1994, Paola and Schowengerdt 1997), while that of the DTC is affected by the degree of pruning (Breiman et al. 1984, Quinlan 1993).
In this experiment the NNC took a three-layer (input, hidden and output) network structure, which is considered sufficient for classifying multispectral imagery (Paola and Schowengerdt 1995). The numbers of units in the first and last layers were set to the numbers of input variables and output classes respectively. There is no guideline for determining the number of hidden units. In this experiment it was determined according to the number of input variables: three hidden layer configurations were tested on each training case, with the number of hidden units equal to one, two and three times the number of input variables. A major issue in pruning a classification tree is when to stop in order to produce a tree that generalizes well to unseen data samples. Too simple a tree may not fully exploit the explanatory power of the data, while too complex a tree may generalize poorly. Yet there is no practical guideline that guarantees a 'perfect' tree that is neither too simple nor too complex. In this experiment a wide range of pruning degrees was tested. Because of the different nature of the impacts of algorithm parameters on different algorithms, it is impossible to account for such differences in evaluating the comparative performances of the algorithms. To avoid this problem, the best performance of each algorithm on each training case is reported in the following comparison. The performances were evaluated in terms of algorithm accuracy, stability and speed.

5.1. Classification accuracy
The accuracy of the classifications was measured using the overall accuracy. The significance of accuracy differences was tested using the kappa statistic according to Congalton et al. (1983) and Hudson and Ramm (1987). Figure 6 shows the overall accuracies of the four algorithms on the 24 training cases. Table 3 gives the significance values of the accuracy differences between the four algorithms.
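The kappa-based significance test just cited compares two classifications through a Z statistic of the standard form Z = |k1 − k2| / sqrt(var1 + var2), where var1 and var2 are the estimated variances of the two kappa coefficients. A minimal sketch (the variance values in the example are made-up illustrative numbers, not results from the paper):

```python
import math

def kappa_z(k1, var1, k2, var2):
    """Z statistic for the difference between two independent kappa
    coefficients; |Z| > 1.96 indicates a significant difference at the
    95% confidence level."""
    return abs(k1 - k2) / math.sqrt(var1 + var2)

z = kappa_z(0.80, 0.0004, 0.70, 0.0005)  # illustrative values, Z ~ 3.33
```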
Table 4 gives the mean and standard deviation of the overall accuracies of classifications developed through cross validation at two training size levels: 6% and 20% of the pixels of the image. Several patterns can be observed from figure 6 and tables 3 and 4, as follows.

(1) Generally the SVM was more accurate than the DTC or the MLC. It gave significantly higher accuracies than the MLC in 18 out of 20 training cases (the MLC could not run on four training cases due to insufficient training samples) and than the DTC in 14 of 24 training cases. In all remaining training cases the MLC and DTC did not generate significantly better results than the SVM. The SVM also gave significantly better results than the NNC in six of the 12 training cases with seven input variables and, though not significantly, gave higher accuracies than the NNC in five of the remaining six training cases. On average, when seven variables were used, the overall accuracy of the SVM was 1–2% higher than that of the NNC, and 2–4% higher than those of the DTC and the MLC (table 4). When only three variables were used, the average overall accuracies of the SVM were about 1–2% higher than those of the DTC and the MLC. These observations are in general agreement with previous studies in which the SVM was found to be more accurate than either the NNC or the DTC (Vapnik 1995, Joachims 1998b). This is expected because, as discussed in §2, the SVM is designed to locate an optimal separating hyperplane, while the other three algorithms may not be able to locate this separating hyperplane. Statistically, the optimal separating hyperplane located by the SVM should generalize to unseen samples with the fewest errors among all separating hyperplanes.

(2) Unexpectedly, however, the SVM did not give significantly higher accuracies than the NNC in any of the 12 training cases with three input variables. On the contrary, it was significantly less accurate than the NNC in three of the 12 training cases. The average overall accuracies of the SVM were slightly lower than those of the NNC (table 4). The lower accuracies of the SVM relative to the NNC on data with three variables are probably due to the inability of the SVM to transform non-linear class boundaries in the original space into linear ones in a high-dimensional space. According to the algorithm development detailed in §2, the applicability of the SVM to non-linear decision boundaries depends on whether those boundaries can be transformed into linear ones by mapping the input data into a high-dimensional space. With only three input variables, the SVM might have less success in transforming complex decision boundaries in the original input space into linear ones in a high-dimensional space. The complex network structure of the NNC, however, might be able to approximate complex decision boundaries even when the data contain very few variables, and therefore performed comparatively better than the SVM. The comparative performance of the SVM on data sets with very few variables should be investigated further, because data sets with so few variables were not considered in previous studies (Cortes and Vapnik 1995, Joachims 1998b).

Figure 6. Overall accuracies of classifications developed using the four classifiers. Y-axis is overall accuracy (%). X-axis is training data size (% pixels of the image). (a) Equal sample size, 7 variables; (b) equal sample rate, 7 variables; (c) equal sample size, 3 variables; (d) equal sample rate, 3 variables.

Table 3. Significance values (Z) of differences between the accuracies of the four classifiers, for each pairwise comparison (SVM vs. NNC, SVM vs. DTC, SVM vs. MLC, DTC vs. NNC, DTC vs. MLC, NNC vs. MLC) at sample sizes of 2, 4, 6, 8, 10 and 20% under the four training conditions (equal sample size and equal sample rate, each with seven and three variables). (Individual Z values not reproduced; the significant comparisons are reported in the text.)
Notes. 1. Differences significant at the 95% confidence level (|Z| > 1.96) are highlighted in bold face. A positive value indicates better performance of the first classifier, while a negative one indicates better performance of the second classifier. 2. NA indicates that the MLC did not work due to insufficient training samples for certain classes and no comparison was made.

Table 4. Mean and standard deviation (s) of the overall accuracies (%) of classifications developed using ten sets of training samples randomly selected from the Maryland data set.

Training condition                         SVM mean (s)   NNC mean (s)   DTC mean (s)   MLC mean (s)
Training size = 20%, input variables = 7   75.62 (0.19)   74.02 (0.81)   73.31 (0.65)   71.76 (0.79)
Training size = 6%, input variables = 7    74.20 (0.60)   72.10 (1.31)   71.82 (0.94)   70.92 (1.04)
Training size = 20%, input variables = 3   66.41 (0.39)   66.82 (0.91)   65.92 (0.52)   64.59 (0.62)
Training size = 6%, input variables = 3    65.49 (1.20)   65.97 (0.79)   64.45 (0.58)   63.95 (0.97)

(3) Of the other three algorithms, the NNC gave significantly higher accuracies than the DTC in ten of the 12 training cases with three input variables and in three of the 12 training cases with seven input variables. Again the NNC performed comparatively better on training cases with three variables than on training cases with seven variables.
The DTC did not give significantly better results than the NNC in any of the remaining training cases. Both the NNC and the DTC were more accurate than the MLC. The NNC had significantly higher accuracies than the MLC in 18 of 20 training cases, while the DTC did so in eight of 20 training cases. The MLC did not have significantly higher accuracies than the NNC or the DTC in any of the remaining training cases.

(4) The accuracy differences of the four algorithms on the data set used in this study were generally small. However, many of them were statistically significant.

5.2. Algorithm stability and speed
The standard deviation of the overall accuracy of an algorithm estimated in cross validation is a quantitative measure of its relative stability (table 4). Figure 7 shows the variations of the accuracies of the four classifiers. Both table 4 and figure 7 reveal that the stabilities of the algorithms differed greatly and were affected by training data size and number of input variables. In general, the overall accuracies of the algorithms were more stable when trained using 20% of the pixels than using 6%, especially when seven variables were used (figures 7(a) and (b)). The SVM gave far more stable overall accuracies than the other three algorithms when trained using 20% of the pixels with seven variables. It also gave more stable overall accuracies than the other three algorithms when trained using 6% of the pixels with seven variables (figure 7(b)) and using 20% of the pixels with three variables (figure 7(c)). But when trained using 6% of the pixels with three variables, it gave overall accuracies in a wider range than the other three algorithms (figure 7(d)). Of the other three algorithms, the DTC gave slightly more stable overall accuracies than the NNC or the MLC, both of which gave overall accuracies in wider ranges in all cases. The training speeds of the four classifiers were substantially different.
In all training cases, training the MLC and DTC took no more than a few minutes on a SUN Ultra 2 workstation, while training the NNC and the SVM took hours and days respectively.

Figure 7. Boxplots of the overall accuracies of classifications developed using ten sets of training samples randomly selected from the Maryland data set. (a) Training size = 20% of the pixels of the image, number of input variables = 7. (b) Training size = 6% of the pixels, number of input variables = 7. (c) Training size = 20% of the pixels, number of input variables = 3. (d) Training size = 6% of the pixels, number of input variables = 3.

Furthermore, the training speeds of the above algorithms were affected by many factors, including the numbers of training samples and input variables, the noise level in the training data set, and algorithm parameter settings. This is especially the case for the SVM and the NNC. Many studies have demonstrated that the training speed of the NNC depends on network structure, momentum rate, learning rate and convergence criteria (Paola and Schowengerdt 1995). The training of the SVM was affected by training data size, kernel parameter setting and class separability. Generally, when the training data size was doubled, the training time more than doubled. Training the SVM to classify two highly mixed classes could take several times longer than training it to classify two separable classes. For the SVM programme used in this study, polynomial kernels, especially high-order kernels, took far more time to train than RBF kernels.

6. Impacts of non-algorithm factors
6.1. Impact of training sample selection
Training sample selection includes two parts: training data size and selection method. Reorganizing the numbers in figure 6 shows the impact of training data size on algorithm performance (figure 8). As expected, increases in training data size generally led to improved performances.
While the increases in overall accuracy were not monotonic as training data size increased, larger training data sets (>6% of the image) generally gave better results than smaller ones (<6%).

Figure 8. Impact of training data size on the performances of the classifiers. Y-axis is overall accuracy (%). Training data size is % pixels of the image. (a) Equal sample size, 7 variables; (b) equal sample rate, 7 variables; (c) equal sample size, 3 variables; (d) equal sample rate, 3 variables.

One of the goals of this experiment was to determine the minimum training data size for sufficient training of an algorithm. The obvious increases in overall accuracy as training data size increased from 2% to 6% indicate that, for this test data set, training samples of less than 6% of the entire image are insufficient for training the four algorithms. Beyond 6%, however, it is hard to tell when an algorithm is trained adequately. When seven variables were used and the training samples were selected using the equal sample rate (ESR) method (figure 8(b)), the largest training data set (20% of the pixels) gave the best results. For other training cases, however, the best performance of an algorithm was often achieved with training samples of less than 20% of the image (figure 8(a), (c), (d)). Hepner et al. (1990) considered a 10 by 10 block of pixels for each class as the minimum data size for training the NNC. Zhuang et al. (1994) suggested that training data sets of approximately 5–10% of an image were needed to train a neural network classifier adequately. The results of this experiment suggest that the minimum number of samples for adequately training an algorithm may depend on the algorithm concerned, the number of input variables, the method used to select the training samples, and the size and spatial variability of the study area.
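The training-size experiment above follows a simple protocol: draw progressively larger random training sets from a pool of labelled pixels and track accuracy on held-out pixels. A minimal sketch of that protocol, using synthetic two-band pixels and a stand-in nearest-centroid classifier (the class centres are hypothetical and the classifier is deliberately simpler than any of the four evaluated in the paper):

```python
import random
random.seed(42)

def make_pixels(n_per_class):
    """Synthetic two-band pixels for three hypothetical cover classes."""
    centres = [(10.0, 30.0), (40.0, 20.0), (25.0, 50.0)]
    data = []
    for label, (cx, cy) in enumerate(centres):
        for _ in range(n_per_class):
            data.append(((random.gauss(cx, 6), random.gauss(cy, 6)), label))
    return data

def nearest_centroid(train):
    """Fit per-class centroids; return a predict function."""
    sums, counts = {}, {}
    for (x, y), label in train:
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    cents = {l: (sx / counts[l], sy / counts[l]) for l, (sx, sy) in sums.items()}
    def predict(p):
        return min(cents, key=lambda l: (p[0] - cents[l][0]) ** 2
                                        + (p[1] - cents[l][1]) ** 2)
    return predict

pool = make_pixels(400)      # labelled pixels available for training
holdout = make_pixels(200)   # held-out pixels for accuracy assessment
for frac in (0.02, 0.06, 0.20):
    train = random.sample(pool, int(frac * len(pool)))
    clf = nearest_centroid(train)
    acc = sum(clf(p) == y for p, y in holdout) / len(holdout)
    print(f"{frac:.0%} of pool -> accuracy {acc:.3f}")
```

On well-separated synthetic classes the curve flattens quickly; on real imagery with mixed classes, as the results above show, the point of adequate training is far harder to identify.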
The impact of the two sampling methods for selecting training data, equal sample size (ESS) and equal sample rate (ESR), on classification accuracy was assessed using the kappa statistic. Table 5 shows that the two sampling methods did give significantly different accuracies for some training cases. For most training cases, slightly higher accuracies were achieved when the training samples were selected using the ESR method. Considering the ESR method's disadvantage of undersampling, or even entirely missing, rare classes, the sampling rate of very rare classes should be increased when this method is employed.

6.2. Impact of input variables
It is evident from figures 6 and 8 that substantial improvements were achieved when the classifications were developed using seven variables instead of three. The average improvements in overall accuracy for the SVM, NNC, DTC and the MLC were 8.8%, 5.8%, 8.0% and 5.9% respectively when training samples were selected using the ESS method, and 8.1%, 6.1%, 7.6% and 7.3% when they were selected using the ESR method. Figure 9 shows two SVM classifications developed using three and seven variables. They were developed from the training data set consisting of 20% of the pixels of the image selected using the ESR method. A visual inspection of the two classifications reveals that using the four additional TM bands led to substantial improvements in discriminating between the four land classes (closed forest, open forest, woodland and non-forest land). Table 6 gives the number of pixels classified correctly in the two classifications. The last row shows that the relative increases in the number of pixels classified correctly for the four land classes are much higher than those for the water and land–water mix classes.
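The two sampling schemes compared in §6.1 can be sketched as follows; the class names and pool sizes are hypothetical. The sketch makes the rare-class problem of ESR visible: at a 2% rate, a 40-pixel class contributes no training samples at all.

```python
import random
random.seed(1)

# Hypothetical per-class pixel pools; the rare class has only 40 pixels.
pools = {"forest": list(range(1000)), "water": list(range(40))}

def equal_sample_size(pools, n_per_class):
    """ESS: draw the same number of training pixels from every class."""
    return {c: random.sample(p, min(n_per_class, len(p)))
            for c, p in pools.items()}

def equal_sample_rate(pools, rate):
    """ESR: draw the same fraction of each class, so rare classes
    contribute very few (possibly zero) training pixels."""
    return {c: random.sample(p, int(rate * len(p)))
            for c, p in pools.items()}

ess = equal_sample_size(pools, 50)
esr = equal_sample_rate(pools, 0.02)
print({c: len(s) for c, s in ess.items()})  # {'forest': 50, 'water': 40}
print({c: len(s) for c, s in esr.items()})  # {'forest': 20, 'water': 0}
```

Raising the sampling rate for very rare classes, as recommended above, amounts to giving such classes a floor on their sample count under ESR.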
It should be noted that the improvements in classification accuracy achieved by using more variables were substantially higher than those achieved by choosing better classification algorithms or by increasing training data size, underlining the importance of using as much information as possible in land cover classification.

Table 5. Significance values (Z) of differences between classifications developed from training samples selected using the equal sample size (ESS) and equal sample rate (ESR) methods, for each algorithm (SVM, DTC, NNC, MLC) with three and seven bands at sample rates of 2, 4, 6, 8, 10 and 20%. (Individual Z values not reproduced; see the discussion in §6.1.)
Note. Differences significant at the 95% confidence level (|Z| > 1.96) are highlighted in bold face. Positive Z values indicate higher accuracies for the ESS method while negative ones indicate higher accuracies for the ESR method.

Figure 9. SVM classifications developed for the study area in eastern Maryland, USA, using three and seven variables from the training data set consisting of 20% training pixels selected using the equal sample rate (ESR) method. The classifications cover an area of 22.5 km by 22.5 km. (a) Classification developed using three variables. (b) Classification developed using seven variables.

Table 6. Number of pixels classified correctly in the two classifications shown in figure 9 and per-class improvement due to using seven instead of three variables in the classification.
Class              Three variables   Seven variables   Relative increase (%)
Closed forest            1317              1533                16.4
Open forest               587               695                18.4
Woodland                  376               447                18.9
Non-forest land           612               752                22.9
Land–water mix            276               291                 5.4
Water                     974               982                 0.8

Note. Entries under 'Three variables' and 'Seven variables' are the per-class agreement (number of pixels) between each classification and the reference map; the last column is the relative increase in per-class agreement when the number of input variables increased from three to seven.

Many studies have demonstrated the usefulness of the two mid-infrared bands of the TM sensor in discriminating between vegetation types (e.g. DeGloria 1984, Townshend 1984), yet the two bands will not be available at 250 m resolution on the MODIS instrument (Running et al. 1994). Results from this experiment show that the loss of discriminatory power due to not having the two mid-infrared bands at 250 m resolution could not be fully compensated for by using better classification algorithms or by increasing training data size. Whether the lost information can be fully compensated for by incorporating spatial and temporal information needs to be investigated further.

7. Summary and conclusions
The support vector machine (SVM) is a machine learning algorithm based on statistical learning theory. In previous studies it had been found competitive with the best available machine learning algorithms for handwritten digit recognition and text categorization. In this study, an experiment was performed to evaluate the comparative performances of this algorithm and three other popular classifiers (the maximum likelihood classifier (MLC), neural network classifiers (NNC) and decision tree classifiers (DTC)) in land cover classification. In addition to the comparative performances of the four classifiers, the impacts of the configuration of SVM kernels on its performance, and of the selection of training data and input variables on all four classifiers, were also evaluated.
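The two kernel families whose configurations were evaluated in this study are commonly written as K(u, v) = (u·v + 1)^p for a polynomial kernel of order p and K(u, v) = exp(-gamma·||u - v||²) for an RBF kernel. The exact parameterization used by the SVM programme in the paper may differ slightly, so the following is a textbook sketch:

```python
import math

def polynomial_kernel(u, v, p=2):
    """K(u, v) = (u . v + 1)^p; higher order p allows more flexible
    decision boundaries in the original input space."""
    dot = sum(a * b for a, b in zip(u, v))
    return (dot + 1.0) ** p

def rbf_kernel(u, v, gamma=1.0):
    """K(u, v) = exp(-gamma * ||u - v||^2); the width parameter gamma
    controls how localized the kernel's influence is."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

print(polynomial_kernel((1.0, 0.0), (1.0, 0.0), p=2))  # (1 + 1)^2 = 4.0
print(rbf_kernel((0.0, 0.0), (0.0, 0.0)))              # 1.0
```

Each kernel implicitly maps pixels into a higher-dimensional feature space in which a linear separating hyperplane is sought, which is the mechanism the following paragraphs discuss.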
The SVM uses kernel functions to map non-linear decision boundaries in the original data space into linear ones in a high-dimensional space. Results from this experiment revealed that kernel type and kernel parameters affect the shape of the decision boundaries located by the SVM and thus influence its performance. For polynomial kernels, better accuracies were achieved on data with three input variables as the polynomial order p increased from 1 to 8, suggesting the need for high-order polynomial kernels when the input data have very few variables. When seven variables were used in the classification, improved accuracies were achieved as p increased from 1 to 4; further increases in p had little impact on classification accuracy. For RBF kernels the accuracy increased slightly when c increased from 1 to 7.5. No obvious trend of improvement was observed when c increased from 5 to 20. However, an experiment using arbitrary data points revealed that misclassification error is a function of c. Of the four algorithms evaluated, the MLC had lower accuracies than the three non-parametric algorithms. The SVM was more accurate than the DTC in 22 out of 24 training cases. It also gave higher accuracies than the NNC when seven variables were used in the classification. This observation is in agreement with several previous studies. The higher accuracies of the SVM should be attributed to its ability to locate an optimal separating hyperplane. As shown in figure 1, statistically, the optimal separating hyperplane found by the SVM algorithm should generalize to unseen samples with fewer errors than any other separating hyperplane that might be found by other classifiers. Unexpectedly, however, the NNC was more accurate than the SVM when only three variables were used in the classification. This is probably because the SVM had less success in transforming non-linear class boundaries in a very low-dimensional space into linear ones in a high-dimensional space.
On the other hand, the complex network structure of the NNC might be more efficient in approximating non-linear decision boundaries even when the data have only very few variables. Generally the absolute differences in classification accuracy among the four classifiers were small. However, many of the differences were statistically significant. In terms of algorithm stability, the SVM gave more stable overall accuracies than the other three algorithms except when trained using 6% of the pixels with three variables. Of the other three algorithms, the DTC gave slightly more stable overall accuracies than the NNC or the MLC, both of which gave overall accuracies in wide ranges. In terms of training speed, the MLC and DTC were much faster than the SVM and NNC. While the training speed of the NNC depended on network structure, momentum rate, learning rate and convergence criteria, that of the SVM was affected by training data size, kernel parameter setting and class separability.

All four classifiers were affected by the selection of training samples. It was not possible to determine the minimum number of samples for sufficiently training an algorithm from the results of this experiment. However, the initial trends of improved classification accuracies for all four classifiers as training data size increased emphasize the necessity of adequate training samples in land cover classification. Feature selection is another factor affecting classification accuracy. Substantial increases in accuracy were achieved when all six TM spectral bands and the NDVI were used instead of only the red, NIR and the NDVI. The additional four TM bands improved the discrimination between land classes.
Improvements due to the inclusion of the four TM bands exceeded those due to the use of better classification algorithms or increased training data size, underlining the need to use as much information as possible in deriving land cover classifications from satellite images.

Acknowledgments
This study was made possible through an NSF grant (BIR9318183) and a contract from the National Aeronautics and Space Administration (NAS596060). The SVM programme used in this study was made available by Mr Thorsten Joachims.

References
Atkinson, P. M., and Tatnall, A. R. L., 1997, Neural networks in remote sensing. International Journal of Remote Sensing, 18, 699–709.
Barker, J. L., and Burelhach, J. W., 1992, MODIS image simulation from Landsat TM imagery. In Proceedings ASPRS/ACSM/RT, Washington, DC, April 22–25, 1992 (Washington, DC: ASPRS), pp. 156–165.
Barnes, W. L., Pagano, T. S., and Salomonson, V. V., 1998, Prelaunch characteristics of the Moderate Resolution Imaging Spectroradiometer (MODIS) on EOS-AM1. IEEE Transactions on Geoscience and Remote Sensing, 36, 1088–1100.
Belward, A., and Loveland, T., 1996, The DIS 1 km land cover data set. Global Change News Letter, 27, 7–9.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J., 1984, Classification and Regression Trees (Belmont, CA: Wadsworth International Group).
Brodley, C. E., and Utgoff, P. E., 1995, Multivariate decision trees. Machine Learning, 19, 45–77.
Burges, C. J. C., 1998, A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2, 121–167.
Campbell, J. B., 1981, Spatial correlation effects upon accuracy of supervised classification of land cover. Photogrammetric Engineering and Remote Sensing, 47, 355–363.
Campbell, J. B., 1996, Introduction to Remote Sensing (New York: The Guilford Press).
Congalton, R., 1991, A review of assessing the accuracy of classifications of remotely sensed data. Remote Sensing of Environment, 37, 35–46.
Congalton, R.
G., Oderwald, R. G., and Mead, R. A., 1983, Assessing Landsat classification accuracy using discrete multivariate analysis statistical techniques. Photogrammetric Engineering and Remote Sensing, 49, 1671–1678.
Cortes, C., and Vapnik, V., 1995, Support vector networks. Machine Learning, 20, 273–297.
Courant, R., and Hilbert, D., 1953, Methods of Mathematical Physics (New York: John Wiley).
DeFries, R. S., Hansen, M., Townshend, J. R. G., and Sohlberg, R., 1998, Global land cover classifications at 8 km spatial resolution: the use of training data derived from Landsat imagery in decision tree classifiers. International Journal of Remote Sensing, 19, 3141–3168.
DeGloria, S., 1984, Spectral variability of Landsat-4 Thematic Mapper and Multispectral Scanner data for selected crop and forest cover types. IEEE Transactions on Geoscience and Remote Sensing, GE-22, 303–311.
Dicks, S. E., and Lo, T. H. C., 1990, Evaluation of thematic map accuracy in a land-use and land-cover mapping program. Photogrammetric Engineering and Remote Sensing, 56, 1247–1252.
Fitzpatrick-Lins, K., 1981, Comparison of sampling procedures and data analysis for a land-use and land-cover map. Photogrammetric Engineering and Remote Sensing, 47, 343–351.
Foody, G. M., McCulloch, M. B., and Yates, W. B., 1995, The effect of training set size and composition on artificial neural network classification. International Journal of Remote Sensing, 16, 1707–1723.
Friedl, M. A., and Brodley, C. E., 1997, Decision tree classification of land cover from remotely sensed data. Remote Sensing of Environment, 61, 399–409.
Genderen, V. J. L., and Lock, B. F., 1978, Remote sensing: statistical testing of thematic map accuracy. Remote Sensing of Environment, 7, 3–14.
Gong, P., and Howarth, P. J., 1990, An assessment of some factors influencing multispectral land-cover classification. Photogrammetric Engineering and Remote Sensing, 56, 597–603.
Gualtieri, J. A., and Cromp, R.
F., 1998, Support vector machines for hyperspectral remote sensing classification. In Proceedings of the 27th AIPR Workshop: Advances in Computer Assisted Recognition, Washington, DC, Oct. 27, 1998 (Washington, DC: SPIE), pp. 221–232.
Hall, F. G., Townshend, J. R., and Engman, E. T., 1995, Status of remote sensing algorithms for estimation of land surface state parameters. Remote Sensing of Environment, 51, 138–156.
Hansen, M., DeFries, R. S., Townshend, J. R. G., and Sohlberg, R., 2000, Global land cover classification at 1 km spatial resolution using a classification tree approach. International Journal of Remote Sensing, 21, 1331–1364.
Hansen, M., Dubayah, R., and DeFries, R., 1996, Classification trees: an alternative to traditional land cover classifiers. International Journal of Remote Sensing, 17, 1075–1081.
Hepner, G. F., Logan, T., Ritter, N., and Bryant, N., 1990, Artificial neural network classification using a minimal training set: comparison to conventional supervised classification. Photogrammetric Engineering and Remote Sensing, 56, 469–473.
Hixson, M., Scholz, D., Fuhs, N., and Akiyama, T., 1980, Evaluation of several schemes for classification of remotely sensed data. Photogrammetric Engineering and Remote Sensing, 46, 1547–1553.
Hudson, W. D., and Ramm, C. W., 1987, Correct formulation of the Kappa coefficient of agreement. Photogrammetric Engineering and Remote Sensing, 53, 421–422.
Janssen, L. L. F., and Wel, F., 1994, Accuracy assessment of satellite derived land cover data: a review. Photogrammetric Engineering and Remote Sensing, 60, 419–426.
Joachims, T., 1998a, Making large scale SVM learning practical. In Advances in Kernel Methods: Support Vector Learning, edited by B. Scholkopf, C. Burges and A. Smola (New York: MIT Press).
Joachims, T., 1998b, Text categorization with support vector machines: learning with many relevant features.
In Proceedings of the European Conference on Machine Learning, Chemnitz, Germany, April 10, 1998 (Berlin: Springer), pp. 137–142.
Justice, C. O., Markham, B. L., Townshend, J. R. G., and Kennard, R. L., 1989, Spatial degradation of satellite data. International Journal of Remote Sensing, 10, 1539–1561.
Lippman, R. P., 1987, An introduction to computing with neural nets. IEEE ASSP Magazine, 4, 2–22.
Markham, B. L., and Barker, J. L., 1986, Landsat MSS and TM post-calibration dynamic ranges, exoatmospheric reflectances and at-satellite temperatures. EOSAT Landsat Technical Notes, 1, 3–8.
Pao, Y.-H., 1989, Adaptive Pattern Recognition and Neural Networks (New York: Addison-Wesley).
Paola, J. D., and Schowengerdt, R. A., 1995, A review and analysis of backpropagation neural networks for classification of remotely sensed multi-spectral imagery. International Journal of Remote Sensing, 16, 3033–3058.
Paola, J. D., and Schowengerdt, R. A., 1997, The effect of neural network structure on a multispectral land-use/land cover classification. Photogrammetric Engineering and Remote Sensing, 63, 535–544.
Quinlan, J. R., 1993, C4.5: Programs for Machine Learning (San Mateo, CA: Morgan Kaufmann Publishers).
Rosenfield, G. H., and Fitzpatrick-Lins, K., 1986, A coefficient of agreement as a measure of thematic classification accuracy. Photogrammetric Engineering and Remote Sensing, 52, 223–227.
Running, S. W., Justice, C. O., Salomonson, V., Hall, D., Barker, J., Kaufmann, Y. J., Strahler, A. H., Huete, A. R., Muller, J. P., Vanderbilt, V., Wan, Z. M., Teillet, P., and Carneggie, D., 1994, Terrestrial remote sensing science and algorithms planned for EOS/MODIS. International Journal of Remote Sensing, 15, 3587–3620.
Safavian, S. R., and Landgrebe, D., 1991, A survey of decision tree classifier methodology. IEEE Transactions on Systems, Man, and Cybernetics, 21, 660–674.
Sellers, P. J., Meeson, B. W., Hall, F. G., Asrar, G., Murphy, R.
E., Schiffer, R. A., Bretherton, F. P., et al., 1995, Remote sensing of the land surface for studies of global change: models, algorithms, experiments. Remote Sensing of Environment, 51, 3–26.
Stehman, S. V., 1992, Comparison of systematic and random sampling for estimating the accuracy of maps generated from remotely sensed data. Photogrammetric Engineering and Remote Sensing, 58, 1343–1350.
Stehman, S. V., 1997, Selecting and interpreting measures of thematic classification accuracy. Remote Sensing of Environment, 62, 77–89.
Sui, D. Z., 1994, Recent applications of neural networks for spatial data handling. Canadian Journal of Remote Sensing, 20, 368–380.
Sundaram, R. K., 1996, A First Course in Optimization Theory (New York: Cambridge University Press).
Swain, P. H., and Davis, S. M. (editors), 1978, Remote Sensing: The Quantitative Approach (New York: McGraw-Hill).
Toll, D. L., 1985, Effect of Landsat Thematic Mapper sensor parameters on land cover classification. Remote Sensing of Environment, 17, 129–140.
Townshend, J. R. G., 1984, Agricultural land-cover discrimination using Thematic Mapper spectral bands. International Journal of Remote Sensing, 5, 681–698.
Townshend, J. R. G., 1992, Land cover. International Journal of Remote Sensing, 13, 1319–1328.
Vapnik, V. N., 1995, The Nature of Statistical Learning Theory (New York: Springer-Verlag).
Vapnik, V. N., 1998, Statistical Learning Theory (New York: Wiley).
Wang, F., 1990, Fuzzy supervised classification of remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 28, 194–201.
Zhuang, X., Engel, B. A., Lozano-Garcia, D. F., Fernandez, R. N., and Johannsen, C. J., 1994, Optimization of training data required for neuro-classification. International Journal of Remote Sensing, 15, 3271–3277.