An assessment of support vector machines for land cover classification

Int. J. Remote Sensing, 2002, Vol. 23, No. 4, 725–749
C. HUANG†
Department of Geography, University of Maryland, College Park, MD 20742,
USA
L. S. DAVIS
Institute for Advanced Computing Studies, University of Maryland, College
Park, MD 20742, USA
and J. R. G. TOWNSHEND
Department of Geography and Institute for Advanced Computing Studies,
University of Maryland, College Park, MD 20742, USA
(Received 27 October 1999; in final form 27 November 2000)
Abstract. The support vector machine (SVM) is a group of theoretically superior
machine learning algorithms. It was found competitive with the best available
machine learning algorithms in classifying high-dimensional data sets. This paper
gives an introduction to the theoretical development of the SVM and an experimental
evaluation of its accuracy, stability and training speed in deriving land
cover classifications from satellite images. The SVM was compared to three other
popular classifiers: the maximum likelihood classifier (MLC), neural
network classifiers (NNC) and decision tree classifiers (DTC). The impacts of
kernel configuration on the performance of the SVM, and of the selection of
training data and input variables on all four classifiers, were also evaluated in
this experiment.
1. Introduction
Land cover information has been identified as one of the crucial data components
for many aspects of global change studies and environmental applications (Sellers
et al. 1995). The derivation of such information increasingly relies on remote sensing
technology due to its ability to acquire measurements of land surfaces at various
spatial and temporal scales. One of the major approaches to deriving land cover
information from remotely sensed images is classification. Numerous classification
algorithms have been developed since the first Landsat image was acquired in the early
1970s (Townshend 1992, Hall et al. 1995). Among the most popular are the maximum
likelihood classifier (MLC), neural network classifiers and decision tree classifiers.
†Current address: Raytheon ITSS, USGS/EROS Data Center, Sioux Falls, SD 57108,
USA; e-mail address: [email protected]
International Journal of Remote Sensing, ISSN 0143-1161 print/ISSN 1366-5901 online
© 2002 Taylor & Francis Ltd, http://www.tandf.co.uk/journals
DOI: 10.1080/01431160110040323

The MLC is a parametric classifier based on statistical theory. Despite limitations
due to its assumption of a normal distribution of class signatures (e.g. Swain and Davis
1978), it is perhaps one of the most widely used classifiers (e.g. Wang 1990, Hansen
et al. 1996). Neural networks avoid some of the problems of the MLC by adopting
a non-parametric approach. Their potential discriminating power has attracted a
great deal of research effort. As a result, many types of neural networks have been
developed (Lippman 1987); the most widely used in the classification of remotely
sensed images is a group of networks called multi-layer perceptrons (MLP) (e.g.
Paola and Schowengerdt 1995, Atkinson and Tatnall 1997).
A decision tree classifier takes a different approach to land cover classification.
It breaks an often very complex classification problem into multiple stages of simpler
decision-making processes (Safavian and Landgrebe 1991). Depending on the
number of variables used at each stage, there are univariate and multivariate decision
trees (Friedl and Brodley 1997). Univariate decision trees have been used to develop
land cover classifications at a global scale (DeFries et al. 1998, Hansen et al. 2000).
Though multivariate decision trees are often more compact and can be more accurate
than univariate decision trees (Brodley and Utgoff 1995), they involve more complex
algorithms and, as a result, are affected by a suite of algorithm-related factors (Friedl
and Brodley 1997). The univariate decision tree developed by Quinlan (1993) is
evaluated in this study.
The support vector machine (SVM) represents a group of theoretically superior
machine learning algorithms. As will be described in the following section, the SVM
employs optimization algorithms to locate the optimal boundaries between classes.
Statistically, the optimal boundaries should generalize to unseen samples with the
least error among all possible boundaries separating the classes, therefore minimizing
the confusion between classes. In practice, the SVM has been applied to optical
character recognition, handwritten digit recognition and text categorization (Vapnik
1995, Joachims 1998b). These experiments found the SVM to be competitive with
the best available classification methods, including neural networks and decision tree
classifiers. The superior performance of the SVM was also demonstrated in classifying
hyperspectral images acquired from the Airborne Visible/Infrared Imaging
Spectrometer (AVIRIS) (Gualtieri and Cromp 1998). While hundreds of variables
were used as input in the experiments mentioned above, there are far fewer
variables in data acquired from operational sensor systems such as Landsat, the
Advanced Very High Resolution Radiometer (AVHRR) and the Moderate Resolution
Imaging Spectroradiometer (MODIS). Because these are among the major sensor
systems from which land cover information is derived, an evaluation of the performance
of the SVM using images from such sensor systems should have practical
implications for land cover classification. The purpose of this paper is to demonstrate
the applicability of this algorithm to deriving land cover from such operational
sensor systems and to systematically evaluate its performance in comparison to
other popular classifiers, including the statistical maximum likelihood classifier
(MLC), a back-propagation neural network classifier (NNC) (Pao 1989) and a
decision tree classifier (DTC) (Quinlan 1993). The SVM used was the SVMlight
implementation of Joachims (1998a).
A brief introduction to the theoretical development of the SVM is given in the
following section. This is deemed necessary because the SVM is relatively new to
the remote sensing community compared to the other three methods. The data
set and experimental design are presented in §3. Experimental results are discussed
in the following three sections, including the impacts of kernel configuration on the
performance of the SVM, the comparative performances of the four classifiers, and the
impacts of non-algorithm factors. The results of this study are summarized in the
last section.
2. Theoretical development of SVM
There are a number of publications detailing the mathematical formulation of
the SVM (see e.g. Vapnik 1995, 1998, Burges 1998). The algorithm development of
this section follows Vapnik (1995) and Burges (1998).
The inductive principle behind the SVM is structural risk minimization (SRM).
According to Vapnik (1995), the risk of a learning machine (R) is bounded by the
sum of the empirical risk estimated from the training samples (R_emp) and a
confidence interval (Ψ):

    R ≤ R_emp + Ψ

The strategy of SRM is to keep the empirical risk (R_emp) fixed and to minimize the
confidence interval (Ψ), or equivalently to maximize the margin between a separating
hyperplane and the closest data points (figure 1). A separating hyperplane is a plane
in a multi-dimensional space that separates the data samples of two classes. The
optimal separating hyperplane is the separating hyperplane that maximizes the margin
from the closest data points to the plane. One SVM classifier can only separate two
classes; integration strategies are needed to extend this method to classifying
multiple classes.
2.1. The optimal separating hyperplane
Let the training data of two separable classes with k samples be represented by
(x_1, y_1), ..., (x_k, y_k), where x ∈ R^n is an n-dimensional vector and y ∈ {+1, −1}
is the class label. Suppose the two classes can be separated by two hyperplanes
parallel to the optimal hyperplane (figure 1(a)):

    w·x_i + b ≥ 1     for y_i = +1, i = 1, 2, ..., k    (1)

    w·x_i + b ≤ −1    for y_i = −1, i = 1, 2, ..., k    (2)
Figure 1. The optimal separating hyperplane between (a) separable samples and (b)
non-separable data samples.
where w = (w_1, ..., w_n) is a vector of n elements. Inequalities (1) and (2) can be
combined into a single inequality:

    y_i [w·x_i + b] ≥ 1,    i = 1, ..., k    (3)

As shown in figure 1, the optimal separating hyperplane is the one that separates
the data with maximum margin. This hyperplane can be found by minimizing the
norm of w, or equivalently the following function:

    F(w) = ½(w·w)    (4)

under the inequality constraint (3).
The saddle point of the following Lagrangean gives the solution to this
optimization problem:

    L(w, b, α) = ½(w·w) − Σ_{i=1..k} α_i { y_i [w·x_i + b] − 1 }    (5)

where the α_i ≥ 0 are Lagrange multipliers (Sundaram 1996). The solution to this
optimization problem requires that the gradient of L(w, b, α) with respect to w and b
vanishes, giving the following conditions:

    w = Σ_{i=1..k} y_i α_i x_i    (6)

    Σ_{i=1..k} α_i y_i = 0    (7)
By substituting (6) and (7) into (5), the optimization problem becomes: maximize

    L(α) = Σ_{i=1..k} α_i − ½ Σ_{i=1..k} Σ_{j=1..k} α_i α_j y_i y_j (x_i·x_j)    (8)

under the constraints α_i ≥ 0, i = 1, ..., k.
Given an optimal solution α⁰ = (α⁰_1, ..., α⁰_k) to (8), the solution w⁰ to (5) is a linear
combination of training samples:

    w⁰ = Σ_{i=1..k} y_i α⁰_i x_i    (9)

According to Kuhn–Tucker theory (Sundaram 1996), only points that satisfy
the equalities in (1) and (2) can have non-zero coefficients α⁰_i. These points lie on
the two parallel hyperplanes and are called support vectors (figure 1). Let x⁰(1) be
a support vector of one class and x⁰(−1) one of the other; then the constant b⁰ can be
calculated as follows:

    b⁰ = ½ [w⁰·x⁰(1) + w⁰·x⁰(−1)]    (10)

The decision rule that separates the two classes can be written as:

    f(x) = sign( Σ_{support vectors} y_i α⁰_i (x_i·x) − b⁰ )    (11)
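As a concrete illustration of equations (6)–(11), consider a toy problem in which each class contributes a single support vector; the multipliers can then be found analytically rather than by quadratic programming. The sample points and values below are invented for illustration only; this is a sketch of the algebra, not the paper's implementation.

```python
import numpy as np

# One support vector per class (toy data, chosen for symmetry).
x_pos = np.array([1.0, 1.0])    # class y = +1
x_neg = np.array([-1.0, -1.0])  # class y = -1

# With two points, equations (6) and (7) force a common multiplier alpha,
# and the margin constraints give alpha = 2 / ||x_pos - x_neg||^2.
alpha = 2.0 / np.sum((x_pos - x_neg) ** 2)   # = 0.25 here
w = alpha * (x_pos - x_neg)                  # equation (6): w = (0.5, 0.5)

# Equation (10): b0 = 1/2 [w.x0(1) + w.x0(-1)] (= 0 in this symmetric case).
b0 = 0.5 * (w @ x_pos + w @ x_neg)

def f(x):
    """Equation (11): sign of the kernel-weighted sum over support vectors."""
    s = alpha * (+1) * (x_pos @ x) + alpha * (-1) * (x_neg @ x)
    return int(np.sign(s - b0))
```

Points on the positive side of the hyperplane (here, the line x + y = 0) are assigned +1 and the rest −1, matching the sign of w·x − b⁰.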
2.2. Dealing with non-separable cases
An important assumption in the above solution is that the data are separable in
the feature space. It is easy to check that there is no optimal solution if the data
cannot be separated without error. To resolve this problem, a penalty value C for
misclassification errors and positive slack variables ξ_i are introduced (figure 1(b)).
These variables are incorporated into constraints (1) and (2) as follows:

    w·x_i + b ≥ 1 − ξ_i     for y_i = +1    (12)

    w·x_i + b ≤ −1 + ξ_i    for y_i = −1    (13)

    ξ_i ≥ 0,    i = 1, ..., k    (14)

The objective function (4) then becomes:

    F(w, ξ) = ½(w·w) + C (Σ_{i=1..k} ξ_i)^l    (15)

where C is a preset penalty value for misclassification errors. If l = 1, the solution to this
optimization problem is similar to that of the separable case.
2.3. Support vector machines
To generalize the above method to non-linear decision functions, the support
vector machine implements the following idea: it maps the input vector x into a
high-dimensional feature space H and constructs the optimal separating hyperplane
in that space. Suppose the data are mapped into the high-dimensional space H through
a mapping function Φ:

    Φ: R^n → H    (16)

A vector x in the input space is represented as Φ(x) in the high-dimensional
space H. Since the only way in which the data appear in the training problem (8)
is in the form of dot products of two vectors, the training algorithm in the
high-dimensional space H would depend on the data only through dot products,
i.e. on functions of the form Φ(x_i)·Φ(x_j). Now, if there is a kernel function K such
that

    K(x_i, x_j) = Φ(x_i)·Φ(x_j)    (17)

we would only need to use K in the training program without knowing the explicit
form of Φ. The same trick can be applied to the decision function (11), because there
too the data appear only in the form of dot products. Thus, if a kernel
function K can be found, we can train and use a classifier in the high-dimensional
space without knowing the explicit form of the mapping function. The optimization
problem (8) can be rewritten as:

    L(α) = Σ_{i=1..k} α_i − ½ Σ_{i=1..k} Σ_{j=1..k} α_i α_j y_i y_j K(x_i, x_j)    (18)

and the decision rule expressed in equation (11) becomes:

    f(x) = sign( Σ_{support vectors} y_i α⁰_i K(x_i, x) − b⁰ )    (19)

A kernel that can be used to construct an SVM must meet Mercer's condition
(Courant and Hilbert 1953). The following two types of kernels meet this condition
and will be considered in this study (Vapnik 1995): the polynomial kernels,

    K(x_1, x_2) = (x_1·x_2 + 1)^p    (20)

and the radial basis functions (RBF),

    K(x_1, x_2) = exp(−c ‖x_1 − x_2‖²)    (21)
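The two kernel families in equations (20) and (21) can be sketched in a few lines of Python with NumPy; the function and parameter names below are illustrative, not from the paper:

```python
import numpy as np

def poly_kernel(x1, x2, p):
    """Polynomial kernel, equation (20): K(x1, x2) = (x1 . x2 + 1)^p."""
    return (np.dot(x1, x2) + 1.0) ** p

def rbf_kernel(x1, x2, c):
    """RBF kernel, equation (21): K(x1, x2) = exp(-c * ||x1 - x2||^2)."""
    d = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return np.exp(-c * np.sum(d ** 2))
```

With p = 1 the polynomial kernel reduces to an offset linear kernel, and rbf_kernel(x, x, c) is always 1, reflecting that a point is maximally similar to itself.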
2.4. From binary classifier to multi-class classifier
In the above theoretical development, the SVM was developed as a binary
classifier, i.e. one SVM can only separate two classes. Strategies are needed to adapt
this method to multi-class cases. Two simple strategies have been proposed to adapt
the SVM to N-class problems (Gualtieri and Cromp 1998). One is to construct a
machine for each pair of classes, resulting in N(N−1)/2 machines. When applied to
a test pixel, each machine gives one vote to the winning class, and the pixel is labelled
with the class having the most votes. The other strategy is to break the N-class case
into N two-class cases, in each of which a machine is trained to classify one class
against all others. When applied to a test pixel, a value measuring the confidence
that the pixel belongs to a class can be calculated from equation (19), and the pixel
is labelled with the class for which it has the highest confidence value
(Vapnik 1995). Although the two strategies were not compared here, the second was
used in this study because it requires training only N SVM machines for an N-class
case, whereas for the same classification the first strategy requires training N(N−1)/2
SVM machines.
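The one-against-all labelling step can be sketched as follows; the class names and confidence values are invented for illustration, standing in for the outputs of equation (19):

```python
# Hypothetical per-class confidence values from N one-against-all machines
# for a single test pixel (assumed numbers, for illustration only).
confidences = {"forest": 1.3, "non-forest": -0.2, "water": 0.4}

# Label the pixel with the class of highest confidence.
label = max(confidences, key=confidences.get)

# Machine counts for an N-class problem: N one-against-all machines
# versus N * (N - 1) / 2 pairwise machines (6 versus 15 when N = 6).
n = 6
one_vs_rest, pairwise = n, n * (n - 1) // 2
```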
With the second strategy each SVM machine is constructed to separate one class
from all other classes. An obvious problem with this strategy is that in constructing
each SVM machine, the sizes of the two classes concerned can be highly unbalanced,
because one of them is the aggregation of N−1 classes. For data samples that cannot
be separated without error, a classifier may not be able to find a boundary between
two highly unbalanced classes. For example, a classifier may not be able to find a
boundary between the two classes shown in figure 2, because it probably makes the
fewest errors by labelling all pixels belonging to the smaller class with the larger one.
To avoid this problem the samples of the smaller class are replicated such that the
two classes have approximately the same size. Similar tricks were employed in
constructing decision tree classifiers for highly unbalanced classes
(DeFries et al. 1998).

Figure 2. An example of highly unbalanced training samples in a two-dimensional space
defined by two arbitrary variables, features 1 and 2. A classifier might incur more
errors by drawing boundaries between the two classes than by labelling pixels of the
smaller class with the larger one.
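The replication step can be sketched in a few lines of Python with NumPy; the class sizes below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
small = rng.normal(0.0, 1.0, size=(20, 2))   # minority class samples
large = rng.normal(3.0, 1.0, size=(200, 2))  # majority (N-1 classes pooled)

# Replicate the minority samples until the two classes are about the same size.
reps = max(1, round(len(large) / len(small)))
balanced_small = np.tile(small, (reps, 1))
```

Replication changes the weight each minority sample carries in the objective (15) without altering the class's distribution in feature space.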
3. Data and experimental design
3.1. Data and preprocessing
A spatially degraded Thematic Mapper (TM) image and a corresponding reference
map were used in this evaluation study. The TM image, acquired over eastern
Maryland on 14 August 1985, has a spatial resolution of 28.5 m. The six spectral
bands (bands 1–5 and 7) of the TM image were converted to top-of-atmosphere
(TOA) reflectance according to Markham and Barker (1986). Atmospheric correction
was not necessary because the image was quite clear within the study area. Three
broad cover types (forest, non-forest land and water) were delimited from this
image, giving a land cover map with the same spatial resolution as the TM image.
This three-class scheme was selected to ensure that the derived land cover map
achieved high accuracy at this resolution. Confused pixels were labelled
according to aerial photographs and field visits covering the study area.
Both the TM image and the derived land cover map were degraded to a spatial
resolution of 256.5 m with a degrading ratio of 9:1, i.e. each degraded pixel
corresponds to 9 by 9 TM pixels. The main reason for evaluating the classifiers using
degraded data is that a highly reliable reference land cover map with a reasonable
number of classes can be generated at the degraded resolution. The image was
degraded using a simulation programme embedding models of the point spread
functions (PSF) of the TM and MODIS sensors (Barker and Burelhach 1992). By
considering the PSF of both sensor systems, the simulation programme gives more
realistic images than spatial averaging (Justice et al. 1989). Overlaying the 256.5 m
grids on the 28.5 m land cover map and calculating the proportions of forest,
non-forest land and water within each 256.5 m grid gave proportion images of forest,
non-forest land and water at the 256.5 m resolution. A land cover map at the 256.5 m
resolution was developed by reclassifying the proportion images according to the class
definitions given in table 1. These definitions were based on the IGBP classification
scheme (Belward and Loveland 1996, DeFries et al. 1998). Class names were
chosen to match the definitions used in this study.
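The proportion-image and reclassification steps can be sketched as follows. The fine-resolution class codes (0 = forest, 1 = non-forest land, 2 = water) are assumed here purely for illustration; the thresholds follow the table 1 definitions:

```python
import numpy as np

def block_proportions(labels, block=9):
    """Fraction of each fine class within each block x block window (9:1 ratio)."""
    h9, w9 = labels.shape[0] // block, labels.shape[1] // block
    props = np.zeros((3, h9, w9))
    for c in range(3):
        mask = (labels == c).astype(float)
        props[c] = mask[:h9 * block, :w9 * block] \
            .reshape(h9, block, w9, block).mean(axis=(1, 3))
    return props

def reclassify(tree, water):
    """Apply the table 1 class definitions to one coarse pixel's proportions."""
    if water > 0.7:
        return 6          # water
    if water > 0.2:
        return 5          # land-water mix
    if tree > 0.6:
        return 1          # closed forest
    if tree > 0.3:
        return 2          # open forest
    if tree > 0.1:
        return 3          # woodland
    return 4              # non-forest land
```

Checking the water proportion first enforces the "water ≤ 20%" condition shared by the four tree-cover classes.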
3.2. Experimental design
Many factors affect the performance of a classifier, including the selection of
training and testing data samples as well as input variables (Gong and Howarth
1990, Foody et al. 1995). Because the impact of testing data selection on accuracy
assessment has been investigated in many works (e.g. Genderen and Lock 1978,
Stehman 1992), only the selection of training samples and the selection of input
variables were considered in this study. In order to avoid biases in the confidence
level of accuracy estimates due to inappropriately sampled testing data (Fitzpatrick-Lins
1981, Dicks and Lo 1990), the accuracy measure of each test was estimated
from all pixels not used as training data.

Table 1. Definition of land cover classes for the Maryland data set.

Code  Cover type       Definition
1     Closed forest    tree cover > 60%, water ≤ 20%
2     Open forest      30% < tree cover ≤ 60%, water ≤ 20%
3     Woodland         10% < tree cover ≤ 30%, water ≤ 20%
4     Non-forest land  tree cover ≤ 10%, water ≤ 20%
5     Land-water mix   20% < water ≤ 70%
6     Water            water > 70%
3.2.1. Training data selection
Training data selection is one of the major factors determining to what degree
the classification rules can be generalized to unseen samples (Paola and Schowengerdt
1995). A previous study showed that this factor could be more important for
obtaining accurate classifications than the selection of the classification algorithm
(Hixson et al. 1980). To assess the impact of training data size on the different
classification algorithms, the selected algorithms were tested using training data of
varying sizes. Specifically, the four algorithms were trained using approximately 2, 4,
6, 8, 10 and 20% of the pixels of the entire image.
With data sizes fixed, training pixels can be selected in many ways. A commonly
used sampling method is to identify and label small patches of homogeneous pixels
in an image (Campbell 1996). However, adjacent pixels tend to be spatially correlated
or have similar values (Campbell 1981). Training samples collected this way
underestimate the spectral variability of each class and are likely to give degraded
classifications (Gong and Howarth 1990). A simple method to minimize the effect of
spatial correlation is random sampling (Campbell 1996). Two random sampling
strategies were investigated in this experiment. One is called equal sample rate (ESR),
in which a fixed percentage of pixels is randomly sampled from each class as
training data. The other is called equal sample size (ESS), in which a fixed number
of pixels is randomly sampled from each class as training data. In both strategies
the total number of training samples is approximately the same as that calculated
according to the predefined 2, 4, 6, 8, 10 and 20% sampling rates for the whole
data set.
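The two sampling strategies can be sketched as follows; the toy label map and sampling rate are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
labels = rng.integers(0, 6, size=10_000)  # toy class map, 6 classes

def equal_sample_rate(labels, rate, rng):
    """ESR: randomly sample a fixed percentage of pixels from each class."""
    idx = []
    for c in np.unique(labels):
        pool = np.flatnonzero(labels == c)
        n = max(1, round(rate * pool.size))
        idx.append(rng.choice(pool, size=n, replace=False))
    return np.concatenate(idx)

def equal_sample_size(labels, total, rng):
    """ESS: randomly sample the same number of pixels from each class."""
    classes = np.unique(labels)
    per_class = total // classes.size
    idx = []
    for c in classes:
        pool = np.flatnonzero(labels == c)
        idx.append(rng.choice(pool, size=min(per_class, pool.size), replace=False))
    return np.concatenate(idx)
```

ESR preserves the class proportions of the image, while ESS gives rare classes the same representation as common ones; both draw at random within each class, avoiding the spatial-correlation problem of homogeneous patches.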
3.2.2. Selection of input variables
The six TM spectral bands roughly correspond to six MODIS bands at 250 m
and 500 m resolutions (Barnes et al. 1998). Only the red (TM band 3) and near-infrared
(NIR, TM band 4) bands are available at 250 m resolution. The other four
TM bands are available at 500 m resolution. Because these four bands contain
information that is complementary to the red and NIR bands (Townshend 1984,
Toll 1985), not having them at 250 m resolution may limit the ability to derive land
cover information at this resolution. Two sets of tests were performed to evaluate
the impact of not having the four TM bands on land cover characterization at the
250 m resolution. In the first set only the red band, the NIR band and the normalized
difference vegetation index (NDVI) were used as input to the classifiers, while in the
second set the other four bands were also included. NDVI is calculated from the red
and NIR bands as follows:

    NDVI = (NIR − red) / (NIR + red)    (22)
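Equation (22) is computed per pixel; a minimal NumPy sketch with invented reflectance values:

```python
import numpy as np

red = np.array([0.10, 0.08])   # toy red reflectances
nir = np.array([0.50, 0.40])   # toy NIR reflectances

ndvi = (nir - red) / (nir + red)   # equation (22), values in [-1, 1]
```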
Table 2 summarizes the training conditions under which the four classification
algorithms were evaluated.
Table 2. Training data conditions under which the classification algorithms were tested.

Sampling method      Sample size (% of entire image)   No. of input variables   Training case no.
Equal sample size    2                                  3 / 7                    1 / 2
                     4                                  3 / 7                    3 / 4
                     6                                  3 / 7                    5 / 6
                     8                                  3 / 7                    7 / 8
                     10                                 3 / 7                    9 / 10
                     20                                 3 / 7                    11 / 12
Equal sample rate    2                                  3 / 7                    13 / 14
                     4                                  3 / 7                    15 / 16
                     6                                  3 / 7                    17 / 18
                     8                                  3 / 7                    19 / 20
                     10                                 3 / 7                    21 / 22
                     20                                 3 / 7                    23 / 24
3.2.3. Cross validation
In the above experiment only one training data set was sampled from the image
at each training size level. In order to evaluate the stability of the selected classifiers
and for the results to be statistically valid, cross validations were performed at two
training data size levels: 6% of the pixels, representing a relatively small training size,
and 20%, representing a relatively large training size. At each size level ten sets of
training samples were randomly selected from the image using the equal sample rate
(ESR) method. As will be discussed in §6.1, this method gave slightly higher accuracies
than the ESS. On each training data set the four classification algorithms were
trained using three and seven variables.
3.3. Methods for performance assessment
The criteria for evaluating the performance of classification algorithms include
accuracy, speed, stability and comprehensibility, among others. Which criterion or
group of criteria to use depends on the purpose of the evaluation. As the
criterion most relevant to all parties and all purposes, accuracy was selected as the
primary criterion in this assessment. Speed and stability are also important factors
in algorithm selection, and these were considered as well. Two widely used accuracy
measures, overall accuracy and the kappa coefficient, were used in this study
(Rosenfield and Fitzpatrick-Lins 1986, Congalton 1991, Janssen and Wel 1994). The
overall accuracy has the advantage of being directly interpretable as the proportion
of pixels classified correctly (Janssen and Wel 1994, Stehman 1997), while the
kappa coefficient allows a statistical test of the significance of the difference
between two algorithms (Congalton 1991).
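Both measures can be computed directly from a confusion (error) matrix; a minimal sketch with an invented two-class matrix (the significance test additionally requires the kappa variance, which is omitted here):

```python
import numpy as np

def overall_accuracy(cm):
    """Proportion of correctly classified pixels: trace / total."""
    return np.trace(cm) / cm.sum()

def kappa(cm):
    """Kappa coefficient: observed agreement corrected for chance agreement."""
    n = cm.sum()
    p_observed = np.trace(cm) / n
    p_chance = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n ** 2
    return (p_observed - p_chance) / (1.0 - p_chance)

cm = np.array([[50, 5],    # rows: reference class; columns: mapped class
               [10, 35]])  # (toy counts, for illustration only)
```

For this matrix the overall accuracy is 0.85, while kappa is lower (about 0.69) because it discounts the agreement expected by chance from the marginal totals.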
4. Impact of kernel configuration on the performance of the SVM
According to the theoretical development of the SVM presented in §2, the kernel
function plays a major role in locating complex decision boundaries between classes.
By mapping the input data into a high-dimensional space, the kernel function
converts non-linear boundaries in the original data space into linear ones in the
high-dimensional space, which can then be located using an optimization algorithm.
Therefore the selection of the kernel function and of appropriate values for the
corresponding kernel parameters, referred to as the kernel configuration, may affect
the performance of the SVM.
4.1. Polynomial kernels
The parameter to be predefined for the polynomial kernels is the polynomial
order p. Following previous studies (Cortes and Vapnik 1995), p values of 1 to 8
were tested for each of the 24 training cases. Rapid increases in computing time as
p increases limited experiments with higher p values. Kernel performance is measured
using the overall agreement between a classification and a reference map, i.e. the
overall accuracy (Stehman 1997). Figure 3 shows the impact of p on kernel performance.
In general, the linear kernel (p = 1) performed worse than non-linear kernels, which is
expected because boundaries between many classes are more likely to be non-linear.
With three variables as the input, there are obvious trends of improved accuracy as
p increases (figure 3(c) and (d)). Such trends are also observed in training cases with
seven input variables when p increases from 1 to 4 (figure 3(a) and (b)). This
observation contrasts with the studies of Cortes and Vapnik (1995), in which no obvious
trend was observed when the polynomial order p increased from 2 to higher values.
This is probably because the number of input variables used in this study is quite
different from those used in previous studies: the data set used in this experiment
has only a few variables, while those used in previous studies had hundreds.
The differences between the observations of this experiment and those of previous
studies suggest that the polynomial order p has different impacts on kernel performance
when different numbers of input variables are used. With large numbers of input
variables, complex non-linear decision boundaries can still be mapped into linear
ones using relatively low-order polynomial kernels. However, if a data set has only
a few variables, it is necessary to try high-order polynomial kernels in order to
achieve optimal performance with a polynomial SVM.
4.2. RBF kernels
The parameter to be preset for the RBF kernel defined in equation (21) is
c. In previous studies c values of around 1 were used (Vapnik 1995, Joachims 1998b).

Figure 3. Performance of polynomial kernels as a function of polynomial order p (training
data size is % of the pixels of the image). (a) Equal sample size, 7 variables; (b) equal sample
rate, 7 variables; (c) equal sample size, 3 variables; (d) equal sample rate, 3 variables.
For this specific data set, c values between 1 and 20 gave reasonable results (figure 4).
A comparison between figure 3 and figure 4 reveals that the performance of the RBF
kernel is less affected by c than that of the polynomial kernel by p. With seven input
variables (figure 4(a) and (b)), the overall accuracy changed only slightly when c
varied between 1 and 20. With three input variables, however, the impact is more
significant. Figure 4(c) and (d) show obvious trends of increased performance as
c increased from 1 to 7.5. For most training cases the overall accuracy changed only
slightly when c increased beyond 7.5.
The impact of a kernel parameter on kernel performance can be illustrated using
an experiment performed on arbitrary data samples collected in a two-dimensional
space. Figure 5 shows the data samples of two classes and the decision boundaries
between the two classes as located by polynomial and RBF kernels. Notice that
although the decision boundaries located by all non-linear kernels (all polynomial
kernels with p > 1 and all RBF kernels) are similar, for this specific set of samples
the shape of the decision boundary is adjusted slightly and misclassification errors
are reduced gradually as p increases from 3 to 12 for the polynomial kernel
(figure 5(a)), or as c decreases from 1 to 0.1 for the RBF kernel (figure 5(b)). With
appropriate kernel parameter values both the polynomial (p = 12) and RBF (c = 0.1)
kernels classified this arbitrary data set without error, though the decision boundaries
defined by the two types of kernels are not exactly the same. How well these decision
boundaries can be generalized to unseen samples depends on the distribution of the
unseen data samples.

Figure 4. Performance of RBF kernels as a function of c (training data size is % of the pixels
of the image). (a) Equal sample size, 7 variables; (b) equal sample rate, 7 variables; (c) equal
sample size, 3 variables; (d) equal sample rate, 3 variables.
As will be discussed in §6, classification accuracy is affected by training sample
size and the number of input variables. Figures 3 and 4 show that most SVM kernels
gave higher accuracies with a larger training size and more input variables. With
three input variables, however, most SVM kernels gave unexpectedly higher
accuracies on the training case with 2% of the pixels sampled using the equal sample
size (ESS) method than on several larger training data sets selected using the same
sampling method (figures 3(c) and 4(c)). This is probably because the SVM defines
decision boundaries between classes using support vectors rather than statistical
attributes, which are sample size dependent (figure 5). Although a larger training
data set has a better chance of including the support vectors that define the actual
decision boundaries, and hence should give higher accuracies, there are occasions
when a smaller training data set includes such support vectors while larger ones do
not. In §6.1 we will show that the other three classifiers did not have such abnormally
high accuracies on this training case (see figure 8(c) later).
Figure 5. Impact of kernel configuration on the decision boundaries and misclassification
errors of the SVM. Empty and solid circles represent two arbitrary classes. Circled
points are support vectors. Checked points represent misclassification errors. Red and
blue represent high-confidence areas for class one (empty circles) and class two (solid
circles) respectively. Optimal separating hyperplanes are highlighted in white.

5. Comparative performances of the four classifiers
The previous section illustrated the impact of kernel parameter settings on the
accuracy of the SVM. Similarly, the performance of the other classification algorithms
may also be affected by their parameter settings. For example, the performance of the
NNC is influenced by the network structure (e.g. Sui 1994, Paola and Schowengerdt
1997), while that of the DTC is affected by the degree of pruning (Breiman et al. 1984,
Quinlan 1993). In this experiment the NNC took a three-layer (input, hidden and
output) network structure, which is considered sufficient for classifying multispectral
imagery (Paola and Schowengerdt 1995). The numbers of units in the first and last
layers were set to
the numbers of input variables and output classes respectively. There is no guideline
for determining the number of hidden units. In this experiment it was determined
according to the number of input variables: three hidden layer configurations were
tested on each training case, with the number of hidden units equal to one, two and
three times the number of input variables. A major issue in pruning a classification
tree is when to stop, so as to produce a tree that generalizes well to unseen data
samples. Too simple a tree may not be able to fully exploit the explanatory power of
the data, while too complex a tree may generalize poorly. Yet there is no practical
guideline that guarantees a 'perfect' tree that is neither too simple nor too complex.
In this experiment a wide range of pruning degrees was tested.
Because the impacts of algorithm parameters differ in nature between algorithms,
it is impossible to account for such differences in evaluating the comparative
performances of the algorithms. To avoid this problem, the best performance of
each algorithm on each training case is reported in the following comparison. The
performances were evaluated in terms of algorithm accuracy, stability and speed.
5.1. Classification accuracy
The accuracy of the classifications was measured using the overall accuracy. The
significance of accuracy differences was tested using the kappa statistic according
to Congalton et al. (1983) and Hudson and Ramm (1987). Figure 6 shows the overall
accuracies of the four algorithms on the 24 training cases. Table 3 gives the
significance values of accuracy differences between the four algorithms. Table 4 gives
the mean and standard deviation of the overall accuracies of classifications developed
through cross validation at two training size levels: 6% and 20% of the pixels of the
image. Several patterns can be observed from figure 6 and tables 3 and 4, as follows.
(1) Generally the SVM was more accurate than DTC or the MLC. It gave
significantly higher accuracies than the MLC in 18 out of 20 training cases
(the MLC could not run on four training cases due to insufficient training
samples) and than DTC in 14 of 24 training cases. In all remaining training
cases neither the MLC nor DTC generated significantly better results than
the SVM. The SVM also gave significantly better results than NNC in six of
the 12 training cases with seven input variables, and, though insignificantly,
gave higher accuracies than NNC in five of the remaining six training cases.
On average when seven variables were used, the overall accuracy of the SVM
was 1–2% higher than that of NNC, and 2–4% higher than those of DTC
and the MLC (table 4). When only three variables were used, the average
overall accuracies of the SVM were about 1–2% higher than those of DTC
and the MLC. These observations are in general agreement with previous
works in which the SVM was found to be more accurate than either NNC
or DTC (Vapnik 1995, Joachims 1998b). This is expected because, as discussed
in §2, the SVM is designed to locate an optimal separating hyperplane, while
the other three algorithms may not be able to locate this separating
hyperplane. Statistically, the optimal separating hyperplane located by the
SVM should generalize to unseen samples with the least error among all
separating hyperplanes.
(2) Unexpectedly, however, the SVM did not give significantly higher accuracies
than NNC in any of the 12 training cases with three input variables. On the
contrary, it was significantly less accurate than NNC in three of those 12
training cases. The average overall accuracies of the SVM were slightly lower
than those of NNC (table 4). The lower accuracies of the SVM relative to
NNC on data with three variables are probably due to the inability of the
SVM to transform non-linear class boundaries in the original space into
linear ones in a high-dimensional space. According to the algorithm
development detailed in §2, the applicability of the SVM to non-linear decision
boundaries depends on whether the decision boundaries can be transformed
into linear ones by mapping the input data into a high-dimensional space.
With only three input variables, the SVM might have less success in
transforming complex decision boundaries in the original input space into linear
ones in a high-dimensional space. The complex network structure of NNC,
however, might be able to approximate complex decision boundaries even
when the data contain very few variables, and therefore give better
comparative performance than the SVM. The comparative performance of the
SVM on data sets with very few variables should be further investigated
because data sets with such few variables were not considered in previous
studies (Cortes and Vapnik 1995, Joachims 1998b).

Figure 6. Overall accuracies of classifications developed using the four classifiers. Y-axis is
overall accuracy (%). X-axis is training data size (% of the pixels of the image).
(a) Equal sample size, 7 variables; (b) equal sample rate, 7 variables; (c) equal sample
size, 3 variables; (d) equal sample rate, 3 variables.
Table 3. Significance values (Z) of differences between the accuracies of the four classifiers.
[The table lists the Z values for six pairwise comparisons (SVM vs. NNC, SVM vs. DTC,
SVM vs. MLC, DTC vs. NNC, DTC vs. MLC and NNC vs. MLC) at training data sizes
of 2, 4, 6, 8, 10 and 20% under each of the four training conditions (equal sample size or
equal sample rate, with seven or three input variables); the individual entries are not
recoverable here.]

Notes.
1. Differences significant at the 95% confidence level (Z > 1.96) are highlighted in bold face.
A positive value indicates better performance of the first classifier, while a negative
one indicates better performance of the second classifier.
2. NA indicates that the MLC did not work due to insufficient training samples for
certain classes and no comparison was made.
Table 4. Mean and standard deviation (s) of the overall accuracies (%) of classifications
developed using ten sets of training samples randomly selected from the Maryland
data set.

                                      SVM           NNC           DTC           MLC
Training condition                Mean    s     Mean    s     Mean    s     Mean    s
Training size=20%, 7 variables    75.62   0.19  74.02   0.81  73.31   0.65  71.76   0.79
Training size=6%, 7 variables     74.20   0.60  72.10   1.31  71.82   0.94  70.92   1.04
Training size=20%, 3 variables    66.41   0.39  66.82   0.91  65.92   0.52  64.59   0.62
Training size=6%, 3 variables     65.49   1.20  65.97   0.79  64.45   0.58  63.95   0.97
(3) Of the other three algorithms, NNC gave significantly higher results than
DTC in ten of the 12 training cases with three input variables and in three
of the 12 training cases with seven input variables. Again NNC showed better
comparative performance on training cases with three variables than on
training cases with seven variables. DTC did not give significantly better
results than NNC on any of the remaining training cases. Both NNC and
DTC were more accurate than the MLC. NNC had significantly higher
accuracies than the MLC in 18 of 20 training cases while DTC did so
in eight of 20 training cases. The MLC did not have significantly higher
accuracies than NNC or DTC on any of the remaining training cases.
(4) The accuracy differences of the four algorithms on the data set used in this
study were generally small. However, many of them were statistically
significant.
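The kappa-based significance test used above can be sketched as follows. This is a minimal illustration, not the programme used in the study: it keeps only the leading term of the delta-method variance of kappa (the full formula in Hudson and Ramm (1987) has additional terms), and the example confusion matrices are arbitrary.

```python
def kappa(cm):
    """Kappa coefficient and an approximate large-sample variance from a
    confusion matrix given as a list of rows (first-order variance only)."""
    n = sum(sum(row) for row in cm)
    po = sum(cm[i][i] for i in range(len(cm))) / n          # observed agreement
    rows = [sum(row) for row in cm]
    cols = [sum(row[j] for row in cm) for j in range(len(cm))]
    pe = sum(r * c for r, c in zip(rows, cols)) / (n * n)   # chance agreement
    k = (po - pe) / (1 - pe)
    var = po * (1 - po) / (n * (1 - pe) ** 2)               # leading term only
    return k, var

def z_statistic(cm1, cm2):
    """Z value for the difference between two independent kappa estimates;
    |Z| > 1.96 indicates significance at the 95% confidence level."""
    k1, v1 = kappa(cm1)
    k2, v2 = kappa(cm2)
    return (k1 - k2) / (v1 + v2) ** 0.5
```

For the two-class matrices [[45, 5], [10, 40]] and [[40, 10], [10, 40]] the kappa values are 0.70 and 0.60, and Z is about 0.93, i.e. the difference is not significant at the 95% level.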
5.2. Algorithm stability and speed
The standard deviation of the overall accuracy of an algorithm estimated in cross
validation is a quantitative measure of its relative stability (table 4). Figure 7 shows
the variations in the accuracies of the four classifiers. Both table 4 and figure 7 reveal
that the stabilities of the algorithms differed greatly and were affected by training
data size and the number of input variables. In general, the overall accuracies of the
algorithms were more stable when trained using 20% of the pixels than using 6%,
especially when seven variables were used (figures 7(a) and (b)). The SVM gave far
more stable overall accuracies than the other three algorithms when trained using
20% of the pixels with seven variables. It also gave more stable overall accuracies
than the other three algorithms when trained using 6% of the pixels with seven
variables (figure 7(b)) and using 20% of the pixels with three variables (figure 7(c)).
But when trained using 6% of the pixels with three variables, it gave overall accuracies
in a wider range than the other three algorithms (figure 7(d)). Of the other three
algorithms, DTC gave slightly more stable overall accuracies than NNC or the MLC,
both of which gave overall accuracies in wider ranges in all cases.
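The stability estimate described above can be mimicked on synthetic data: train on ten random training subsets and report the mean and standard deviation of the resulting accuracies. The sketch below uses a simple nearest-centroid classifier as a stand-in for the four classifiers compared in the study; the data, classifier and sizes are illustrative assumptions.

```python
import random
import statistics

random.seed(0)

# Synthetic two-class data standing in for the image pixels.
pixels = [((random.gauss(0, 1), random.gauss(0, 1)), 0) for _ in range(500)] + \
         [((random.gauss(3, 1), random.gauss(3, 1)), 1) for _ in range(500)]

def centroid(points):
    return tuple(sum(v) / len(points) for v in zip(*points))

def train(sample):
    # Nearest-centroid "classifier": one centroid per class.
    return {c: centroid([x for x, y in sample if y == c]) for c in (0, 1)}

def accuracy(model, data):
    def predict(x):
        return min(model, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, model[c])))
    return sum(predict(x) == y for x, y in data) / len(data)

# Ten random training sets of 20% of the "image"; accuracy assessed on the rest.
accs = []
for _ in range(10):
    sample = random.sample(pixels, 200)
    held_out = [p for p in pixels if p not in sample]
    accs.append(accuracy(train(sample), held_out))

mean_acc, std_acc = statistics.mean(accs), statistics.pstdev(accs)
```

The standard deviation std_acc plays the role of the stability measure in table 4: a smaller value means the classifier is less sensitive to which training pixels happen to be drawn.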
The training speeds of the four classifiers were substantially different. In all
training cases, training the MLC and DTC took no more than a few minutes
on a SUN Ultra 2 workstation, while training NNC and the SVM took hours and
days, respectively. Furthermore, the training speeds of the above algorithms were
affected by many factors, including the numbers of training samples and input
variables, the noise level in the training data set, and the algorithm parameter
settings. This is especially the case for the SVM and NNC. Many studies have
demonstrated that the training speed of NNC depends on network structure,
momentum rate, learning rate and convergence criteria (Paola and Schowengerdt
1995). The training of the SVM was affected by training data size, kernel parameter
settings and class separability. Generally, when the training data size was doubled,
the training time more than doubled. Training the SVM to classify two highly mixed
classes could take several times longer than training it to classify two separable
classes. For the SVM programme used in this study, polynomial kernels, especially
high-order kernels, took far more time to train than RBF kernels.

Figure 7. Boxplots of the overall accuracies of classifications developed using ten sets of
training samples randomly selected from the Maryland data set. (a) Training size=20%
of the pixels of the image, number of input variables=7. (b) Training size=6% of the
pixels of the image, number of input variables=7. (c) Training size=20% of the pixels
of the image, number of input variables=3. (d) Training size=6% of the pixels of the
image, number of input variables=3.
6. Impacts of non-algorithm factors
6.1. Impact of training sample selection
Training sample selection includes two parts: training data size and selection
method. Reorganizing the numbers in figure 6 shows the impact of training data size
on algorithm performance (figure 8). As expected, increases in training data size
generally led to improved performances. While the increases in overall accuracy were
not monotonic as training data size increased, larger training data sets (>6% of the
image) generally gave better results than smaller ones (<6%).

Figure 8. Impact of training data size on the performances of the classifiers. Y-axis is overall
accuracy (%). Training data size is % of the pixels of the image. (a) Equal sample size,
7 variables; (b) equal sample rate, 7 variables; (c) equal sample size, 3 variables;
(d) equal sample rate, 3 variables.
One of the goals of this experiment was to determine the minimum training data
size for sufficient training of an algorithm. The obvious increases in overall accuracy
as training data size increased from 2% to 6% indicate that, for this test data set,
training pixels amounting to less than 6% of the entire image are insufficient for
training the four algorithms. Beyond 6%, however, it is hard to tell when an algorithm
is trained adequately. When seven variables were used and the training samples were
selected using the equal sample rate (ESR) method (figure 8(b)), the largest training
data set (20% of the pixels) gave the best results. For the other training cases,
however, the best performance of an algorithm was often achieved with training
pixels amounting to less than 20% of the image (figure 8(a), (c), (d)). Hepner et al.
(1990) considered a training data size of a 10 by 10 block for each class as the
minimum data size for training NNC. Zhuang et al. (1994) suggested that training
data sets of approximately 5–10% of an image were needed to train a neural network
classifier adequately. The results of this experiment suggest that the minimum number
of samples for adequately training an algorithm may depend on the algorithm
concerned, the number of input variables, the method used to select the training
samples, and the size and spatial variability of the study area.
The impact of the two sampling methods for selecting training data, equal
sample size (ESS) and equal sample rate (ESR), on classification accuracy was
assessed using the kappa statistic. Table 5 shows that the two sampling methods did
give significantly different accuracies for some training cases. For most training cases
slightly higher accuracies were achieved when the training samples were selected
using the ESR method. Considering the ESR method's disadvantage of undersampling
or even totally missing rare classes, the sampling rate of very rare classes
should be increased when this method is employed.
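The two sampling schemes can be sketched as follows; the class inventory is illustrative, not the Maryland data set.

```python
import random

random.seed(1)

# Illustrative class inventory with one rare class.
labels = ["forest"] * 800 + ["water"] * 150 + ["land-water mix"] * 50

def equal_sample_size(labels, n_per_class):
    """ESS: the same number of training samples drawn from every class."""
    out = []
    for c in set(labels):
        pool = [l for l in labels if l == c]
        out += random.sample(pool, min(n_per_class, len(pool)))
    return out

def equal_sample_rate(labels, rate):
    """ESR: the same fraction of each class, so rare classes yield few samples."""
    out = []
    for c in set(labels):
        pool = [l for l in labels if l == c]
        out += random.sample(pool, round(rate * len(pool)))
    return out

ess = equal_sample_size(labels, 40)
esr = equal_sample_rate(labels, 0.06)
```

Under ESR at a 6% rate, the rare class contributes only round(0.06 × 50) = 3 samples while the dominant class contributes 48, illustrating the undersampling risk noted above; ESS gives every class the same 40 samples.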
6.2. Impact of input variables
It is evident from figures 6 and 8 that substantial improvements were achieved
when the classifications were developed using seven variables instead of three.
The average improvements in overall accuracy for the SVM, NNC, DTC
and the MLC were 8.8%, 5.8%, 8.0% and 5.9% respectively when training samples
were selected using the ESS method, and 8.1%, 6.1%, 7.6% and 7.3% when training
samples were selected using the ESR method. Figure 9 shows two SVM classifications
developed using three and seven variables. They were developed from the training
data set consisting of 20% of the pixels of the image selected using the ESR method.
A visual inspection of the two classifications reveals that using the four additional
TM bands led to substantial improvements in discriminating between the four land
classes (closed forest, open forest, woodland and non-forest land). Table 6 gives the
number of pixels classified correctly in the two classifications. Its last row shows
that the relative increases in the number of pixels classified correctly for the four land
classes are much higher than those for the water and land–water mix classes.
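The relative increases in the last row of table 6 follow directly from the per-class counts:

```python
# Per-class agreement with the reference map (pixels), from table 6.
three = {"closed forest": 1317, "open forest": 587, "woodland": 376,
         "non-forest land": 612, "land-water mix": 276, "water": 974}
seven = {"closed forest": 1533, "open forest": 695, "woodland": 447,
         "non-forest land": 752, "land-water mix": 291, "water": 982}

# Relative increase (%) when the number of input variables rose from 3 to 7.
increase = {c: round(100 * (seven[c] - three[c]) / three[c], 1) for c in three}
```

The four land classes gain 16.4–22.9%, against only 5.4% and 0.8% for the land–water mix and water classes.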
It should be noted that the improvements in classification accuracy achieved by
using more variables were substantially higher than those achieved by choosing
better classification algorithms or by increasing training data size, underlining the
importance of using as much information as possible in land cover classification.
Table 5. Significance values (Z) of differences between classifications developed from training
samples selected using the equal sample size (ESS) and equal sample rate (ESR)
methods. [The table lists, for each algorithm (SVM, DTC, NNC, MLC), the Z values
for the 3-band and 7-band cases at sample rates of 2, 4, 6, 8, 10 and 20%; the
individual entries are not recoverable here.]

Note. Differences significant at the 95% confidence level (Z > 1.96) are highlighted in bold
face. Positive Z values indicate higher accuracies for the ESS method while negative ones
indicate higher accuracies for the ESR method.
Figure 9. SVM classifications developed for the study area in eastern Maryland, USA,
using three and seven variables from the training data set consisting of 20% training
pixels selected using the equal sample rate (ESR) method. The classifications cover
an area of 22.5 km by 22.5 km. (a) Classification developed using three variables.
(b) Classification developed using seven variables.
Table 6. Number of pixels classified correctly in the two classifications shown in figure 9 and
per-class improvement due to using seven instead of three variables in the classification.

Per-class agreement (number of pixels) between a classification and the reference map:

Classification     Closed  Open              Non-forest  Land–water
developed using    forest  forest  Woodland  land        mix         Water
Three variables     1317    587     376       612         276         974
Seven variables     1533    695     447       752         291         982

Relative increase (%) in per-class agreement when the number of input variables
increased from 3 to 7:
                    16.4    18.4    18.9      22.9        5.4         0.8
Many studies have demonstrated the usefulness of the two mid-infrared bands of
the TM sensor in discriminating between vegetation types (e.g. DeGloria 1984,
Townshend 1984), yet these two bands will not be available at 250 m resolution on
the MODIS instrument (Running et al. 1994). Results from this experiment show
that the loss of discriminatory power due to not having the two mid-infrared bands
at 250 m resolution could not be fully compensated for by using better classification
algorithms or by increasing training data size. Whether the lost information can be
fully compensated for by incorporating spatial and temporal information needs to
be further investigated.
7. Summary and conclusions
The support vector machine (SVM) is a machine learning algorithm based on
statistical learning theory. In previous studies it had been found competitive with
the best available machine learning algorithms for handwritten digit recognition and
text categorization. In this study, an experiment was performed to evaluate the
comparative performances of this algorithm and three other popular classifiers (the
maximum likelihood classifier (MLC), neural network classifiers (NNC) and decision
tree classifiers (DTC)) in land cover classification. In addition to the comparative
performances of the four classifiers, the impacts of the configuration of SVM kernels
on its performance and of the selection of training data and input variables on all
four classifiers were also evaluated.
The SVM uses kernel functions to map non-linear decision boundaries in the original
data space into linear ones in a high-dimensional space. Results from this experiment
revealed that the kernel type and kernel parameters affect the shape of the decision
boundaries located by the SVM and thus influence its performance. For polynomial
kernels, better accuracies were achieved on data with three input variables as the
polynomial order p increased from 1 to 8, suggesting the need for high-order
polynomial kernels when the input data have very few variables. When seven
variables were used in the classification, improved accuracies were achieved as p
increased from 1 to 4; further increases in p had little impact on classification
accuracy. For RBF kernels the accuracy increased slightly when c increased from
1 to 7.5. No obvious trend of improvement was observed when c increased from
5 to 20. However, an experiment using arbitrary data points revealed that
misclassification error is a function of c.
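The kernel mapping referred to above can be made concrete for a polynomial kernel of order 2 on two variables: K(x, z) = (x·z + 1)² equals an ordinary inner product after an explicit map into a six-dimensional space, so a boundary that is quadratic in the original space becomes linear there. A small numerical check (the sample points are arbitrary):

```python
import math

def poly_kernel(x, z, p=2):
    """Polynomial kernel K(x, z) = (x.z + 1)**p evaluated in the input space."""
    return (sum(a * b for a, b in zip(x, z)) + 1) ** p

def phi(x):
    """Explicit feature map for p = 2 on two variables: six dimensions."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2)

x, z = (0.5, -1.2), (2.0, 0.3)
lhs = poly_kernel(x, z)                                # kernel in input space
rhs = sum(a * b for a, b in zip(phi(x), phi(z)))       # inner product after mapping
```

Here lhs and rhs agree (both equal 2.6896 for these points), which is the identity that lets the SVM work with the high-dimensional space without ever computing phi explicitly.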
Of the four algorithms evaluated, the MLC had lower accuracies than the three
non-parametric algorithms. The SVM was more accurate than DTC in 22 out of 24
training cases. It also gave higher accuracies than NNC when seven variables were
used in the classification. This observation is in agreement with several previous
studies. The higher accuracies of the SVM should be attributed to its ability to locate
an optimal separating hyperplane. As shown in figure 1, statistically, the optimal
separating hyperplane found by the SVM algorithm should generalize to unseen
samples with fewer errors than any other separating hyperplane that might be found
by other classifiers. Unexpectedly, however, NNC were more accurate than the SVM
when only three variables were used in the classification. This is probably because
the SVM had less success in transforming non-linear class boundaries in a very
low-dimensional space into linear ones in a high-dimensional space. On the other hand,
the complex network structure of NNC might be more efficient in approximating
non-linear decision boundaries even when the data have only very few variables.
Generally, the absolute differences in classification accuracy among the four classifiers
were small; however, many of the differences were statistically significant.
In terms of algorithm stability, the SVM gave more stable overall accuracies
than the other three algorithms except when trained using 6% of the pixels with three
variables. Of the other three algorithms, DTC gave slightly more stable overall
accuracies than NNC or the MLC, both of which gave overall accuracies in wide
ranges. In terms of training speed, the MLC and DTC were much faster than the
SVM and NNC. While the training speed of NNC depended on network structure,
momentum rate, learning rate and convergence criteria, that of the SVM was affected
by training data size, kernel parameter settings and class separability.
All four classifiers were affected by the selection of training samples. It was not
possible to determine the minimum number of samples for sufficiently training an
algorithm from the results of this experiment. However, the initial trends of
improved classification accuracy for all four classifiers as training data size increased
emphasize the necessity of having adequate training samples in land cover
classification. Feature selection is another factor affecting classification accuracy.
Substantial increases in accuracy were achieved when all six TM spectral bands and
the NDVI were used instead of only the red band, the NIR band and the NDVI. The
four additional TM bands improved the discrimination between land classes. The
improvements due to the inclusion of the four TM bands exceeded those due to the
use of better classification algorithms or increased training data size, underlining the
need to use as much information as possible in deriving land cover classifications
from satellite images.
Acknowledgments
This study was made possible through a NSF grant (BIR9318183) and a contract
from the National Aeronautics and Space Administration (NAS596060). The SVM
programme used in this study was made available by Mr Thorsten Joachims.
References
Atkinson, P. M., and Tatnall, A. R. L., 1997, Neural networks in remote sensing.
International Journal of Remote Sensing, 18, 699–709.
Barker, J. L., and Burelhach, J. W., 1992, MODIS image simulation from Landsat
TM imagery. In Proceedings ASPRS/ACSM/RT, Washington, DC, April 22–25, 1992
(Washington, DC: ASPRS), pp. 156–165.
Barnes, W. L., Pagano, T. S., and Salomonson, V. V., 1998, Prelaunch characteristics of
the Moderate Resolution Imaging Spectroradiometer (MODIS) on EOS-AM1. IEEE
Transactions on Geoscience and Remote Sensing, 36, 1088–1100.
Belward, A., and Loveland, T., 1996, The DIS 1 km land cover data set. Global Change
NewsLetter, 27, 7–9.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J., 1984, Classification and
Regression Trees (Belmont, CA: Wadsworth International Group).
Brodley, C. E., and Utgoff, P. E., 1995, Multivariate decision trees. Machine Learning,
19, 45–77.
Burges, C. J. C., 1998, A tutorial on support vector machines for pattern recognition. Data
Mining and Knowledge Discovery, 2, 121–167.
Campbell, J. B., 1981, Spatial correlation effects upon accuracy of supervised classification
of land cover. Photogrammetric Engineering and Remote Sensing, 47, 355–363.
Campbell, J. B., 1996, Introduction to Remote Sensing (New York: The Guilford Press).
Congalton, R., 1991, A review of assessing the accuracy of classifications of remotely sensed
data. Remote Sensing of Environment, 37, 35–46.
Congalton, R. G., Oderwald, R. G., and Mead, R. A., 1983, Assessing Landsat classification
accuracy using discrete multivariate analysis statistical techniques. Photogrammetric
Engineering and Remote Sensing, 49, 1671–1678.
Cortes, C., and Vapnik, V., 1995, Support vector networks. Machine Learning, 20, 273–297.
Courant, R., and Hilbert, D., 1953, Methods of Mathematical Physics (New York: John
Wiley).
DeFries, R. S., Hansen, M., Townshend, J. R. G., and Sohlberg, R., 1998, Global land
cover classifications at 8 km spatial resolution: the use of training data derived from
Landsat imagery in decision tree classifiers. International Journal of Remote Sensing,
19, 3141–3168.
DeGloria, S., 1984, Spectral variability of Landsat-4 Thematic Mapper and Multispectral
Scanner data for selected crop and forest cover types. IEEE Transactions on Geoscience
and Remote Sensing, GE-22, 303–311.
Dicks, S. E., and Lo, T. H. C., 1990, Evaluation of thematic map accuracy in a land-use and
land-cover mapping program. Photogrammetric Engineering and Remote Sensing, 56,
1247–1252.
Fitzpatrick-Lins, K., 1981, Comparison of sampling procedures and data analysis for a
land-use and land-cover map. Photogrammetric Engineering and Remote Sensing, 47,
343–351.
Foody, G. M., McCulloch, M. B., and Yates, W. B., 1995, The effect of training set size
and composition on artificial neural network classification. International Journal of
Remote Sensing, 16, 1707–1723.
Friedl, M. A., and Brodley, C. E., 1997, Decision tree classification of land cover from
remotely sensed data. Remote Sensing of Environment, 61, 399–409.
Genderen, V. J. L., and Lock, B. F., 1978, Remote sensing: statistical testing of thematic
map accuracy. Remote Sensing of Environment, 7, 3–14.
Gong, P., and Howarth, P. J., 1990, An assessment of some factors influencing multispectral
land-cover classification. Photogrammetric Engineering and Remote Sensing, 56,
597–603.
Gualtieri, J. A., and Cromp, R. F., 1998, Support vector machines for hyperspectral remote
sensing classification. In Proceedings of the 27th AIPR Workshop: Advances in Computer
Assisted Recognition, Washington, DC, Oct. 27, 1998 (Washington, DC: SPIE),
pp. 221–232.
Hall, F. G., Townshend, J. R., and Engman, E. T., 1995, Status of remote sensing algorithms
for estimation of land surface state parameters. Remote Sensing of Environment, 51,
138–156.
Hansen, M., DeFries, R. S., Townshend, J. R. G., and Sohlberg, R., 2000, Global land
cover classification at 1 km spatial resolution using a classification tree approach.
International Journal of Remote Sensing, 21, 1331–1364.
Hansen, M., Dubayah, R., and DeFries, R., 1996, Classification trees: an alternative to
traditional land cover classifiers. International Journal of Remote Sensing, 17,
1075–1081.
Hepner, G. F., Logan, T., Ritter, N., and Bryant, N., 1990, Artificial neural network
classification using a minimal training set: comparison to conventional supervised
classification. Photogrammetric Engineering and Remote Sensing, 56, 469–473.
Hixson, M., Scholz, D., Fuhs, N., and Akiyama, T., 1980, Evaluation of several schemes
for classification of remotely sensed data. Photogrammetric Engineering and Remote
Sensing, 46, 1547–1553.
Hudson, W. D., and Ramm, C. W., 1987, Correct formulation of the Kappa coefficient of
agreement. Photogrammetric Engineering and Remote Sensing, 53, 421–422.
Janssen, L. L. F., and Wel, F., 1994, Accuracy assessment of satellite derived land cover
data: a review. Photogrammetric Engineering and Remote Sensing, 60, 419–426.
Joachims, T., 1998a, Making large scale SVM learning practical. In Advances in Kernel
Methods—Support Vector Learning, edited by B. Scholkopf, C. Burges and A. Smola
(New York: MIT Press).
Joachims, T., 1998b, Text categorization with support vector machines—learning with
many relevant features. In Proceedings of European Conference on Machine Learning,
Chemnitz, Germany, April 10, 1998 (Berlin: Springer), pp. 137–142.
Justice, C. O., Markham, B. L., Townshend, J. R. G., and Kennard, R. L., 1989, Spatial
degradation of satellite data. International Journal of Remote Sensing, 10, 1539–1561.
Lippman, R. P., 1987, An introduction to computing with neural nets. IEEE ASSP Magazine,
4, 2–22.
Markham, B. L., and Barker, J. L., 1986, Landsat MSS and TM post-calibration dynamic
ranges, exoatmospheric reflectances and at-satellite temperatures. EOSAT Landsat
Technical Notes, 1, 3–8.
Pao, Y.-H., 1989, Adaptive Pattern Recognition and Neural Networks (New York:
Addison-Wesley).
Paola, J. D., and Schowengerdt, R. A., 1995, A review and analysis of backpropagation
neural networks for classification of remotely sensed multi-spectral imagery.
International Journal of Remote Sensing, 16, 3033–3058.
Paola, J. D., and Schowengerdt, R. A., 1997, The effect of neural network structure on
a multispectral land-use/land cover classification. Photogrammetric Engineering and
Remote Sensing, 63, 535–544.
Quinlan, J. R., 1993, C4.5 Programs for Machine Learning (San Mateo, CA: Morgan
Kaufmann Publishers).
Rosenfield, G. H., and Fitzpatrick-Lins, K., 1986, A coefficient of agreement as a measure
of thematic classification accuracy. Photogrammetric Engineering and Remote Sensing,
52, 223–227.
Running, S. W., Justice, C. O., Salomonson, V., Hall, D., Barker, J., Kaufmann, Y. J.,
Strahler, A. H., Huete, A. R., Muller, J. P., Vanderbilt, V., Wan, Z. M.,
Teillet, P., and Carneggie, D., 1994, Terrestrial remote sensing science and algorithms
planned for EOS/MODIS. International Journal of Remote Sensing, 15, 3587–3620.
Safavian, S. R., and Landgrebe, D., 1991, A survey of decision tree classifier methodology.
IEEE Transactions on Systems, Man, and Cybernetics, 21, 660–674.
Sellers, P. J., Meeson, B. W., Hall, F. G., Asrar, G., Murphy, R. E., Schiffer, R. A.,
Bretherton, F. P., et al., 1995, Remote sensing of the land surface for studies of
global change: models—algorithms—experiments. Remote Sensing of Environment,
51, 3–26.
Stehman, S. V., 1992, Comparison of systematic and random sampling for estimating the
accuracy of maps generated from remotely sensed data. Photogrammetric Engineering
and Remote Sensing, 58, 1343–1350.
Stehman, S. V., 1997, Selecting and interpreting measures of thematic classification accuracy.
Remote Sensing of Environment, 62, 77–89.
Sui, D. Z., 1994, Recent applications of neural networks for spatial data handling. Canadian
Journal of Remote Sensing, 20, 368–380.
Sundaram, R. K., 1996, A First Course in Optimization Theory (New York: Cambridge
University Press).
Swain, P. H., and Davis, S. M. (editors), 1978, Remote Sensing: the Quantitative Approach
(New York: McGraw-Hill).
Toll, D. L., 1985, Effect of Landsat Thematic Mapper sensor parameters on land cover
classification. Remote Sensing of Environment, 17, 129–140.
Townshend, J. R. G., 1984, Agricultural land-cover discrimination using Thematic Mapper
spectral bands. International Journal of Remote Sensing, 5, 681–698.
Townshend, J. R. G., 1992, Land cover. International Journal of Remote Sensing, 13,
1319–1328.
Vapnik, V. N., 1995, The Nature of Statistical Learning Theory (New York: Springer-Verlag).
Vapnik, V. N., 1998, Statistical Learning Theory (New York: Wiley).
Wang, F., 1990, Fuzzy supervised classification of remote sensing images. IEEE Transactions
on Geoscience and Remote Sensing, 28, 194–201.
Zhuang, X., Engel, B. A., Lozano-Garcia, D. F., Fernandez, R. N., and Johannsen, C. J.,
1994, Optimization of training data required for neuro-classification. International
Journal of Remote Sensing, 15, 3271–3277.