In-sample and Out-of-sample Model Selection and Error Estimation for Support Vector Machines

Authors: Davide Anguita, Alessandro Ghio, Luca Oneto, Sandro Ridella
Affiliation: DITEN Department, University of Genova, Via Opera Pia 11A, I-16145 Genova, Italy (email: {Davide.Anguita, Alessandro.Ghio, Luca.Oneto, Sandro.Ridella}@unige.it)
Submitted to: IEEE Transactions on Neural Networks and Learning Systems

Disclaimer: The following article has been accepted for publication by the IEEE Transactions on Neural Networks and Learning Systems, to which the copyright is transferred. The authors distribute this copy for personal use to potentially interested parties. No commercial use may be made of the article or the work described in it, nor is re-distribution of such work allowed.

Abstract—In-sample approaches to model selection and error estimation of Support Vector Machines (SVMs) are not as widespread as out-of-sample methods, where part of the data is removed from the training set for validation and testing purposes, mainly because their practical application is not straightforward and the latter provide, in many cases, satisfactory results. In this paper, we survey some recent and not-so-recent results of the data-dependent Structural Risk Minimization (SRM) framework and propose a proper re-formulation of the SVM learning algorithm, so that the in-sample approach can be effectively applied. The experiments, performed both on simulated and real-world datasets, show that our in-sample approach compares favorably to out-of-sample methods, especially in cases where the latter provide questionable results. In particular, when the number of samples is small compared to their dimensionality, as in the classification of microarray data, our proposal can outperform conventional out-of-sample approaches such as Cross Validation, Leave-One-Out, and the Bootstrap.

Index Terms—Structural Risk Minimization, Model Selection, Error Estimation, Statistical Learning Theory, Support Vector Machine, Cross Validation, Leave One Out, Bootstrap

I. INTRODUCTION

Model selection addresses the problem of tuning the complexity of a classifier to the available training data, so as to avoid either under- or overfitting [1]. These problems affect most classifiers because, in general, their complexity is controlled by one or more hyperparameters, which must be tuned separately from the training process in order to achieve optimal performance. Some examples of tunable hyperparameters are the number of hidden neurons or the amount of regularization in Multi Layer Perceptrons (MLPs) [2], [3], and the margin/error trade-off or the value of the kernel parameters in Support Vector Machines (SVMs) [4], [5], [6]. Strictly related to this problem is the estimation of the generalization error of a classifier: in fact, the main objective of building an optimal classifier is to choose both its parameters and hyperparameters so as to minimize its generalization error, and to compute an estimate of this value for predicting the classification performance on future data. Unfortunately, despite the large amount of work on this important topic, the problem of model selection and error estimation for SVMs is still open and the subject of extensive research [7], [8], [9], [10], [11], [12].
Among the several methods proposed for this purpose, it is possible to identify two main approaches: out-of-sample and in-sample methods. The first are favored by practitioners because they work well in many situations and allow the application of simple statistical techniques for estimating the quantities of interest. Some examples of out-of-sample methods are the well-known k-Fold Cross Validation (KCV), the Leave-One-Out (LOO), and the Bootstrap (BTS) [13], [14], [15]. All these techniques rely on a similar idea: the original dataset is resampled, with or without replacement, to build two independent datasets called, respectively, the training and validation (or estimation) sets. The first one is used for training a classifier, while the second one is exploited for estimating its generalization error, so that the hyperparameters can be tuned to achieve its minimum value. Note that both error estimates computed on the training and validation sets are, obviously, optimistically biased; therefore, if a generalization estimate of the final classifier is desired, it is necessary to build a third independent set, called the test set, by nesting two of the resampling procedures mentioned above. Unfortunately, this additional splitting of the original dataset results in a further shrinking of the available learning data and contributes to a further increase of the computational burden. Furthermore, after the learning and model selection phases, the user is left with several classifiers (e.g. k classifiers in the case of KCV), each one with possibly different values of the hyperparameters, and combining them, or retraining a final classifier on the entire dataset, can lead to unexpected results [16].

Despite these drawbacks, when a reasonably large amount of data is available, out-of-sample techniques work well. However, there are several settings where their use has been questioned by many researchers [17], [18], [19]. In particular, the main difficulties arise in the small-sample regime or, in other words, when the size of the training set is small compared to the dimensionality of the patterns. A typical example is the case of microarray data, where fewer than a hundred samples, composed of thousands of genes, are often available [20]. In these cases, in-sample methods would represent the obvious choice for performing the model selection phase: in fact, they exploit the whole set of available data for both training the model and estimating its generalization error, thanks to the application of rigorous statistical procedures. Despite their unquestionable advantages with respect to out-of-sample methods, their use is not widespread: one of the reasons is the common belief that in-sample methods are very useful for gaining deep theoretical insights on the learning process or for developing new learning algorithms, but are not suitable for practical purposes. The SVM itself, which is one of the most successful classification algorithms of the last decade, stems from the well-known Vapnik's Structural Risk Minimization (SRM) principle [5], [21], which represents the seminal approach to in-sample methods.
However, SRM is not able, in practice, to estimate the generalization error of the trained classifier or to select its optimal hyperparameters [22], [23]. Similar principles are equally interesting from a theoretical point of view, but seldom useful in practice [21], [24], [25], as they are overly pessimistic. In past years, some proposals have heuristically adapted the SRM principle to in-sample model selection purposes, with some success, but they had to give up its theoretical rigor, thus compromising its applicability [26].

We present in this work a new method for applying a data-dependent SRM approach [27] to model selection and error estimation, by exploiting new results in the field of Statistical Learning Theory (SLT) [28]. In particular, we describe an in-sample method for applying the data-dependent SRM principle to a slightly modified version of the SVM. Our approach is general, but it is particularly effective in performing model selection and error estimation in the small-sample setting: in these cases, it is able to outperform out-of-sample techniques. The novelty of our approach is the exploitation of new results on the Maximal Discrepancy and Rademacher Complexity theory [28], which do not give up any theoretical rigor while achieving good performance in practice. Our purpose is not to claim the general superiority of in-sample methods over out-of-sample ones, but to explore the advantages and disadvantages of both approaches, in order to understand why and when they can be successfully applied. For this reason, a theoretically rigorous analysis of out-of-sample methods is also presented. Finally, we show that the proposed in-sample method allows using a conventional quadratic programming solver for SVMs to control the complexity of the classifier. In other words, even if we make use of a modified SVM to allow for the application of the in-sample approach, any well-known optimization algorithm, for example the Sequential Minimal Optimization (SMO) method [29], [30], can be used for performing the training, model selection and error estimation phases, as in the out-of-sample cases.

The paper is organized as follows: Section II details the classification problem framework and describes the in-sample and out-of-sample general approaches. Section III and Section IV survey old and new statistical tools, which are the basis for the subsequent analysis of out-of-sample and in-sample methods. Section V proposes a new method for applying the data-dependent SRM approach to the model selection and error estimation of a modified Support Vector Machine and also details an algorithm for exploiting conventional SVM-specific Quadratic Programming solvers. Finally, Section VI shows the application of our proposal to real-world small-sample problems, along with a comparison to out-of-sample methods.

II. THE CLASSIFICATION PROBLEM FRAMEWORK

We consider a binary classification problem, with an input space X ⊆ ℜ^d and an output space Y = {−1, +1}. We assume that the data (x, y), with x ∈ X and y ∈ Y, are random variables distributed according to an unknown distribution P, and we observe a sequence of n independent and identically distributed (i.i.d.) pairs D_n = {(x_1, y_1), ..., (x_n, y_n)}, sampled according to P. Our goal is to build a classifier or, in other words, to construct a function f : X → Y, which predicts Y from X.
Obviously, we need a criterion to choose f, therefore we measure the expected error performed by the selected function on the entire data population, i.e. the risk:

  L(f) = E_{(X,Y)} ℓ(f(x), y),    (1)

where ℓ(f(x), y) is a suitable loss function, which measures the discrepancy between the prediction of f and the true Y, according to some user-defined criteria. Some examples of loss functions are: the hard loss,

  ℓ_I(f(x), y) = \begin{cases} 0 & y f(x) > 0 \\ 1 & \text{otherwise,} \end{cases}    (2)

which is an indicator function that simply counts the number of misclassified samples; the hinge loss, which is used by the SVM algorithm and is a convex upper bound of the previous one [4]:

  ℓ_H(f(x), y) = \max(0, 1 − y f(x));    (3)

the logistic loss, which is used for obtaining probabilistic outputs from a classifier [31]:

  ℓ_L(f(x), y) = \frac{1}{1 + e^{y f(x)}};    (4)

and, finally, the soft loss [32], [33]:

  ℓ_S(f(x), y) = \min( 1, \max( 0, \frac{1 − y f(x)}{2} ) ),    (5)

which is a piecewise linear approximation of the former and a clipped version of the hinge loss.

Let us consider a class of functions F. The optimal classifier f^* ∈ F is:

  f^* = \arg\min_{f ∈ F} L(f).    (6)

Since P is unknown, we cannot directly evaluate the risk, nor find f^*. The only available option is to devise a method for selecting F, and consequently f^*, based on the available data and, possibly, some a-priori knowledge. Note that the model selection problem consists, generally speaking, in the identification of a suitable F: in fact, the hyperparameters of the classifier affect, directly or indirectly, the function class in which the learning algorithm searches for the, possibly optimal, function [21], [34]. The Empirical Risk Minimization (ERM) approach suggests estimating the true risk L(f) by its empirical version:

  \hat{L}_n(f) = \frac{1}{n} \sum_{i=1}^{n} ℓ(f(x_i), y_i),    (7)

so that

  f_n^* = \arg\min_{f ∈ F} \hat{L}_n(f).    (8)
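As a concrete illustration (ours, not part of the original paper), the losses of Eqs. (2)-(5) and the empirical risk of Eq. (7) take only a few lines of NumPy; here scores stands for the real-valued outputs f(x_i):

```python
import numpy as np

def hard_loss(scores, y):
    # Eq. (2): 0 if y*f(x) > 0, 1 otherwise
    return (y * scores <= 0).astype(float)

def hinge_loss(scores, y):
    # Eq. (3): convex upper bound of the hard loss
    return np.maximum(0.0, 1.0 - y * scores)

def logistic_loss(scores, y):
    # Eq. (4): sigmoid-shaped loss, small when y*f(x) is large and positive
    return 1.0 / (1.0 + np.exp(y * scores))

def soft_loss(scores, y):
    # Eq. (5): hinge loss clipped to [0, 1]
    return np.clip((1.0 - y * scores) / 2.0, 0.0, 1.0)

def empirical_risk(loss, scores, y):
    # Eq. (7): average loss over the n available samples
    return float(loss(scores, y).mean())
```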
Unfortunately, \hat{L}_n(f) typically underestimates L(f) and can lead to severe overfitting because, if the class of functions is sufficiently large, it is always possible to find a function that perfectly fits the data but shows poor generalization ability. For this reason, it is necessary to perform a model selection step, by selecting an appropriate F, so as to avoid classifiers that are prone to overfit the data.

A typical approach is to study the random variable L(f) − \hat{L}_n(f), which represents the generalization bias of the classifier f. In particular, given a user-defined confidence value δ, the objective is to bound the probability that the true risk exceeds the empirical one:

  P[ L(f) ≥ \hat{L}_n(f) + ε ] ≤ δ,    (9)

leading to bounds of the following form, which hold with probability (1 − δ):

  L(f) ≤ \hat{L}_n(f) + ε.    (10)

Eq. (10) can be used to select an optimal function class and, consequently, an optimal classifier, by minimizing the term on the right side of the inequality.

The out-of-sample approach suggests using an independent dataset, sampled from the same data distribution that generated the training set, so that the bound is valid, for any classifier, even after it has learned the D_n set. Given m additional samples D_m = {(x_1, y_1), ..., (x_m, y_m)} and a classifier f_n, which has been trained on D_n, its generalization error can be upper bounded, in probability, according to:

  L(f_n) ≤ \hat{L}_m(f_n) + ε,    (11)

where the bound holds with probability (1 − δ). Then the model selection phase can be performed by varying the hyperparameters of the classifier until the right side of Eq. (11) reaches its minimum. In particular, let us suppose we consider several function classes F_1, F_2, ..., indexed by different values of the hyperparameters; then the optimal classifier f_n^* = f_{n,i^*}^* is the result of the following minimization process:

  i^* = \arg\min_i [ \hat{L}_m(f_{n,i}^*) + ε ],    (12)

where

  f_{n,i}^* = \arg\min_{f ∈ F_i} \hat{L}_n(f).    (13)

Note that, if we are interested in estimating the generalization error of f_n^*, we need to apply the bound of Eq. (11) again, but using some data (i.e. the test set) that has not been involved in this procedure. It is also worth mentioning that the partition of the original dataset into training and validation (and possibly test) sets can affect the tightness of the bound, due to lucky or unlucky splittings, and, therefore, its effectiveness. This is a major issue for out-of-sample methods and several heuristics have been proposed in the literature for dealing with this problem (e.g. stratified sampling or topology-based splitting [35]), but they will not be analyzed here, as they are outside the scope of this paper.
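Coming back to Eqs. (12)-(13), the selection loop can be sketched as follows (our illustration, assuming scikit-learn's SVC as the underlying learner; the additive ε term is instantiated here by the simple Hoeffding deviation sqrt(ln(1/δ)/(2m)), which is valid for any loss bounded in [0, 1]):

```python
import numpy as np
from sklearn.svm import SVC

def select_hyperparameter(X_tr, y_tr, X_val, y_val, C_grid, delta=0.05):
    """For each C, train on D_n (Eq. (13)) and score the classifier with the
    validation bound of Eq. (12); return the C attaining the minimum."""
    m = len(y_val)
    eps = np.sqrt(np.log(1.0 / delta) / (2.0 * m))    # same eps for all classes
    best_C, best_bound = None, np.inf
    for C in C_grid:
        clf = SVC(kernel="linear", C=C).fit(X_tr, y_tr)
        L_val = np.mean(clf.predict(X_val) != y_val)  # hard-loss validation error
        if L_val + eps < best_bound:
            best_C, best_bound = C, L_val + eps
    return best_C, best_bound
```

Since ε is identical for every class in this simple instantiation, the selection reduces to minimizing the validation error; the additive term matters when the returned value is used as an error estimate, for which, as remarked above, a separate test set is needed.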
Eq. (11) can be rewritten as

  L(f_n) ≤ \hat{L}_n(f_n) + ( \hat{L}_m(f_n) − \hat{L}_n(f_n) ) + ε,    (14)

clearly showing that the out-of-sample approach can be considered as a penalized ERM, where the penalty term takes into account the discrepancy between the classifier performance on the training and the validation set. This formulation also explains other approaches to model selection, for example the early stopping procedure, which is widely used in neural network learning [36]. In fact, Eq. (14) suggests stopping the learning phase when the performance of the classifier on the training and validation sets begins to diverge.

The in-sample approach, instead, targets the use of the same dataset for learning, model selection and error estimation, without resorting to additional samples. In particular, this approach can be summarized as follows: a learning algorithm takes as input the data D_n and produces a function f_n and an estimate of the error \hat{L}_n(f_n), which is a random variable depending on the data themselves. As we cannot aprioristically know which function will be chosen by the algorithm, we consider uniform deviations of the error estimate:

  L(f_n) − \hat{L}_n(f_n) ≤ \sup_{f ∈ F} ( L(f) − \hat{L}_n(f) ).    (15)

Then, the model selection phase can be performed according to a data-dependent version of the Structural Risk Minimization (SRM) framework [27], which suggests choosing a possibly infinite sequence {F_i, i = 1, 2, ...} of model classes of increasing complexity, F_1 ⊆ F_2 ⊆ ... (Figure 1), and minimizing the empirical risk in each class with an added penalty term, which, in our case, gives rise to bounds of the following form:

  L(f_n) ≤ \hat{L}_n(f_n) + \sup_{f ∈ F} ( L(f) − \hat{L}_n(f) ).    (16)

[Fig. 1: The Structural Risk Minimization principle.]

From Eq. (16) it is clear that the price to pay for avoiding the use of additional validation/test sets is the need to take into account the behavior of the worst possible classifier in the class, while the out-of-sample approach focuses on the actual learned classifier.

Applying Eq. (16) for model selection and error estimation purposes is a straightforward operation, at least in theory. A very small function class is selected and its size (i.e. its complexity) is increased, by varying the hyperparameters, until the bound reaches its minimum, which represents the optimal trade-off between under- and overfitting and, therefore, identifies the optimal classifier:

  f_n^* = \arg\min_{f ∈ F_i, F_i ∈ \{F_1, F_2, ...\}} [ \hat{L}_n(f) + \sup_{f ∈ F_i} ( L(f) − \hat{L}_n(f) ) ].    (17)

Furthermore, after the procedure has been completed, the value of Eq. (16) provides, by construction, a probabilistic upper bound on the error rate of the selected classifier.

A. The need for a strategy for selecting the class of functions

From the previous analysis, it is obvious that the class of functions F, in which the classifier f_n^* is searched, plays a central role in the successful application of the in-sample approach. However, in the conventional data-dependent SRM formulation, the function space is arbitrarily centered and this choice severely influences the sequence F_i. Its detrimental effect on Eq. (16) can be clearly understood through the example of Fig. 2, where we suppose to know the optimal classifier f^*.

[Fig. 2: Hypothesis spaces with different centroids.]

In this case, in fact, the hypothesis space F_{i^*}, which includes f_n^*, is characterized by a large penalty term and, thus, f_n^* is greatly penalized with respect to other models. Furthermore, the bound of Eq. (16) becomes very loose, as the penalty term takes into account the entire F_{i^*} class, so the generalization error of the chosen classifier is greatly overestimated. The main problem lies in the fact that f_n^* is 'far' from the aprioristically chosen centroid f_0 of the sequence F_i. On the contrary, if we were able to define a sequence of function classes F'_i, i = 1, 2, ..., centered on a function f'_0 sufficiently close to f^*, the penalty term would be noticeably reduced and we would be able to improve both the SRM model selection and error estimation phases, by choosing a model close to the optimal one. We argue that this is one of the main reasons why the in-sample approach has not been considered effective so far, and that this line of research should be better explored if we are interested in building better classification algorithms and, at the same time, more reliable performance estimates.

In the recent literature, the data-dependent selection of the centroid has been theoretically approached, for example, in [27], but only a few authors have proposed methods for dealing with this problem [37]. One example is the use of Localized Rademacher Complexities [38] or, in other words, the study of penalty terms that take into account only the classifiers with low empirical error: although this approach is very interesting from a theoretical point of view, its application in practice is not evident. A more practical approach has been proposed by Vapnik and other authors [21], [39], introducing the concept of Universum, i.e. a dataset composed of samples that do not belong to any class represented in the training set. However, no generalization bounds, like the ones that will be presented here, have been proposed for this approach.

III. A THEORETICAL ANALYSIS OF OUT-OF-SAMPLE TECHNIQUES

Out-of-sample methods are favored by practitioners because they work well in many real cases and are simple to implement. Here we present a rigorous statistical analysis of two well-known out-of-sample methods for model selection and error estimation: the k-fold Cross Validation (KCV) and the Bootstrap (BTS). The philosophy of the two methods is similar: part of the data is left out from the training set and is used for estimating the error of the classifier that has been found during the learning phase.
The splitting of the original training data is repeated several times, in order to average out unlucky cases; therefore the entire procedure produces several classifiers (one for each data splitting). Note that, from the point of view of our analysis, it is not statistically correct to select one of the trained classifiers to perform the classification of new samples because, in this case, the samples of the validation (or test) set would no longer be i.i.d. For this reason, every time a new sample is received, the user should randomly select one of the classifiers, so that the error estimation bounds, one for each trained classifier, can be safely averaged.

A. Bounding the true risk with out-of-sample data

Depending on the chosen loss function, we can apply different statistical tools to estimate the classifier error. When dealing with the hard loss, we are considering sums of Bernoulli random variables, so we can use the well-known one-sided Clopper-Pearson bound [40], [41]. Given t = m \hat{L}_m(f_n) misclassifications, and defining p = \hat{L}_m(f_n) + ε, the errors follow a Binomial distribution:

  B(t; m, p) = \sum_{j=0}^{t} \binom{m}{j} p^j (1 − p)^{m−j},    (18)

so we can bound the generalization error by computing the inverse of the Binomial tail:

  ε^*(\hat{L}_m, m, δ) = \max_{ε} { ε : B(t; m, \hat{L}_m + ε) ≥ δ }    (19)

and, therefore, with probability (1 − δ):

  L(f_n) ≤ \hat{L}_m(f_n) + ε^*(\hat{L}_m, m, δ).    (20)

More explicit bounds, albeit less sharp, are available in the literature and allow gaining better insight into the behavior of the error estimate. Among them are the so-called Empirical Chernoff bound [40] and the more recent Empirical Bernstein bound [42], [43], which is valid for bounded random variables, including Bernoulli ones:

  L(f_n) ≤ \hat{L}_m(f_n) + s \sqrt{ \frac{2 \ln(2/δ)}{m} } + \frac{7 \ln(2/δ)}{3(m − 1)},    (21)

where s = \sqrt{ \hat{L}_m(f_n) (1 − \hat{L}_m(f_n)) }. This bound can be easily related to Eq. (11) and clearly shows that the classifier error decays at a rate between O(1/m) and O(1/\sqrt{m}), depending on the performance of the trained classifier on the validation dataset.

The hinge loss, unfortunately, gives rise to bounds that decay at a much slower rate than the previous one. In fact, as noted in [44] almost fifty years ago, when dealing with a positive unbounded random variable, the Markov inequality cannot be improved, as there are some distributions for which equality is attained. Therefore, the assumption that the loss function is bounded becomes crucial to obtain any improvement over the Markov bound. On the contrary, the soft and the logistic losses show a behavior similar to the hard loss: in fact, Eq. (21) can be used in these cases as well. In this paper, however, we propose to use a tighter bound, which was conceived by Hoeffding in [44] and has been neglected in the literature, mainly because it cannot be put in closed form. With our notation, the bound is:

  P[ L(f) − \hat{L}_m(f) > ε ] ≤ [ ( \frac{\hat{L}_m(f)}{\hat{L}_m(f) + ε} )^{\hat{L}_m(f) + ε} ( \frac{1 − \hat{L}_m(f)}{1 − \hat{L}_m(f) − ε} )^{1 − \hat{L}_m(f) − ε} ]^m.    (22)

By equating the right part of Eq. (22) to δ and solving it numerically, we can find the value ε^*, which can be inserted in Eq. (20), as a function of δ, m and \hat{L}_m(f).
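A minimal sketch of this numerical inversion (our illustration, assuming SciPy): hoeffding_eps inverts Eq. (22) by root-finding on its logarithm, while clopper_pearson_eps computes the ε^* of Eq. (19) through the standard Beta-quantile form of the one-sided Clopper-Pearson limit:

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import beta

def hoeffding_eps(L_hat, m, delta):
    """Invert Eq. (22): smallest eps making its right side equal delta.
    Assumes 0 < L_hat < 1; the right side decreases monotonically in eps."""
    def log_rhs(eps):
        p, q = L_hat + eps, 1.0 - L_hat - eps
        return m * (p * np.log(L_hat / p) + q * np.log((1.0 - L_hat) / q))
    hi = 1.0 - L_hat - 1e-9
    if log_rhs(hi) > np.log(delta):   # bound saturates before reaching delta:
        return 1.0 - L_hat            # only the trivial statement L(f) <= 1 holds
    return brentq(lambda e: log_rhs(e) - np.log(delta), 1e-9, hi)

def clopper_pearson_eps(L_hat, m, delta):
    """One-sided Clopper-Pearson upper limit of Eqs. (19)-(20)."""
    t = int(round(m * L_hat))         # number of errors on the validation set
    if t == m:
        return 0.0                    # upper limit is 1, so eps = 1 - L_hat
    return beta.ppf(1.0 - delta, t + 1, m - t) - L_hat
```

For example, hoeffding_eps(0.1, 100, 0.05) and clopper_pearson_eps(0.1, 100, 0.05) return values of comparable magnitude, in line with the tightness remark below.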
Note that the inequality of Eq. (22) is, in practice, as tight as the one derived from the application of the Clopper-Pearson bound [45], [46], is numerically more tractable, and is valid for both hard and soft losses, so it will be used in the rest of the paper when dealing with the out-of-sample approach.

B. k-fold Cross Validation

The KCV technique consists in splitting a dataset into k independent subsets and using, in turn, all but one set to train the classifier, while the remaining set is used to estimate the generalization error. When k = n this becomes the well-known Leave-One-Out (LOO) technique, which is often used in the small-sample setting, because all the samples, except one, are used for training the model [14].

If our target is to perform both the model selection and the error estimation of the final classifier, a nested KCV is required, where (k − 2) subsets are used, in turn, for the training phase, one is used as a validation set to optimize the hyperparameters, and the last one as a test set to estimate the generalization error. Note that O(k^2) training steps are necessary in this case. To guarantee the statistical soundness of the KCV approach, one of the k trained classifiers must be randomly chosen before classifying a new sample. This procedure is seldom used in practice because, usually, one retrains a final classifier on the entire training data: however, as pointed out by many authors, we believe that this heuristic procedure is the one to blame for unexpected and inconsistent results of the KCV technique in the small-sample setting.

If the nested KCV procedure is applied, so as to guarantee the independence of the training, validation and test sets, the generalization error can be bounded by [40], [44], [46]:

  L(f_n^*) ≤ \frac{1}{k} \sum_{j=1}^{k} [ \hat{L}^j_{n/k}(f^*_{n,j}) + ε^*( \hat{L}^j_{n/k}, \frac{n}{k}, δ ) ],    (23)

where \hat{L}^j_{n/k}(f^*_{n,j}) is the error performed by the j-th optimal classifier on the corresponding test set, composed of n/k samples, and f_n^* is the randomly selected classifier. It is interesting to note that, for the LOO procedure, n/k = 1, so the bound becomes useless, in practice, for any reasonable value of the confidence δ. This is another hint that the LOO procedure should be used with care, as this result raises a strong concern about its reliability, especially in the small-sample setting, which is the elective setting for LOO.

C. Bootstrap

The BTS method is a pure resampling technique: at each j-th step, a training set, with the same cardinality as the original one, is built by sampling the patterns with replacement. The remaining data, which consist, on average, of approximately 36.8% of the original dataset, are used to compose the validation set. The procedure is then repeated several times (N_B ∈ [1, \binom{2n−1}{n}]) in order to obtain statistically sound results [13]. As for the KCV, if the user is interested in performing the error estimation of the trained classifiers, a nested Bootstrap is needed, where the sampling procedure is repeated twice in order to create both a validation and a test set. If we suppose that the j-th test set consists of m_j patterns then, after the model selection phase, we will be left with N_B different models, for which the average generalization error can be expressed as:

  L(f_n^*) ≤ \frac{1}{N_B} \sum_{j=1}^{N_B} [ \hat{L}^j_{m_j}(f^*_{m_j,j}) + ε^*( \hat{L}^j_{m_j}, m_j, δ ) ].    (24)

As can be seen by comparing Eqs. (23) and (24), the KCV and the BTS are equivalent, except for the different resampling of the original dataset.
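Both protocols share the same skeleton; the following sketch (our illustration; train_fn and eps_fn are hypothetical placeholders for, e.g., an SVM trainer and the ε^* of Eq. (22)) shows the BTS estimate of Eq. (24):

```python
import numpy as np

def bootstrap_error_bound(X, y, train_fn, eps_fn, n_boot=100, delta=0.05, seed=0):
    """Eq. (24): resample D_n with replacement, train on the bootstrap sample,
    evaluate on the out-of-bag points and average the per-replicate bounds."""
    rng = np.random.default_rng(seed)
    n, bounds = len(y), []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)        # sampling with replacement
        oob = np.setdiff1d(np.arange(n), idx)   # ~36.8% of D_n on average
        clf = train_fn(X[idx], y[idx])
        L_oob = np.mean(clf.predict(X[oob]) != y[oob])
        bounds.append(L_oob + eps_fn(L_oob, len(oob), delta))
    return float(np.mean(bounds))
```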
IV. A THEORETICAL ANALYSIS OF IN-SAMPLE TECHNIQUES

As detailed in the previous sections, the main objective of in-sample techniques is to upper bound the supremum on the right side of Eq. (16), so we need a bound which holds simultaneously for all functions in a class. The Maximal Discrepancy and the Rademacher Complexity are two different statistical tools which can be exploited for such purposes. The Rademacher Complexity of a class of functions F is defined as:

  \hat{R}(F) = E_σ \sup_{f ∈ F} \frac{2}{n} \sum_{i=1}^{n} σ_i ℓ(f(x_i), y_i),    (25)

where σ_1, ..., σ_n are n independent random variables for which P(σ_i = +1) = P(σ_i = −1) = 1/2. An upper bound of L(f) in terms of \hat{R}(F) was proposed in [47] and the proof is mainly an application of the following result, known as McDiarmid's inequality:

Theorem 1: [48] Let Z_1, ..., Z_n be independent random variables taking values in a set Z, and assume that g : Z^n → ℜ is a function satisfying

  \sup_{z_1,...,z_n, \hat{z}_i} | g(z_1, ..., z_n) − g(z_1, ..., \hat{z}_i, ..., z_n) | < c_i.

Then, for any ε > 0,

  P{ g(z_1, ..., z_n) − E{g(z_1, ..., z_n)} ≥ ε } < e^{ −2ε² / \sum_{i=1}^{n} c_i² },
  P{ E{g(z_1, ..., z_n)} − g(z_1, ..., z_n) ≥ ε } < e^{ −2ε² / \sum_{i=1}^{n} c_i² }.

In other words, the theorem states that, if replacing the i-th coordinate z_i by any other value changes g by at most c_i, then the function is sharply concentrated around its mean. Using McDiarmid's inequality, it is possible to bound the supremum of Eq. (16), thanks to the following theorem [47]. We detail here a simplified proof, which also corrects some of the errors that appear in [47].

Theorem 2: Given a dataset D_n, consisting of n patterns x_i ∈ X, given a class of functions F and a loss function ℓ(·,·) ∈ [0, 1], then

  P[ \sup_{f ∈ F} ( L(f) − \hat{L}_n(f) ) ≥ \hat{R}(F) + ε ] ≤ 2 \exp( \frac{−2nε²}{9} ).    (26)

Proof: Let us consider a ghost sample D'_n = {(x'_i, y'_i)}, composed of n patterns generated from the same probability distribution as D_n. To simplify the notation we define ℓ'_i = ℓ(f(x'_i), y'_i) and ℓ_i = ℓ(f(x_i), y_i). The following upper bound holds:

  E_{(X,Y)} \sup_{f ∈ F} [ L(f) − \hat{L}_n(f) ]    (27)
  = E_{(X,Y)} \sup_{f ∈ F} [ E_{(X',Y')}[\hat{L}'_n(f)] − \hat{L}_n(f) ]    (28)
  ≤ E_{(X,Y)} E_{(X',Y')} \sup_{f ∈ F} [ \hat{L}'_n(f) − \hat{L}_n(f) ]    (29)
  = E_{(X,Y)} E_{(X',Y')} \sup_{f ∈ F} \frac{1}{n} \sum_{i=1}^{n} (ℓ'_i − ℓ_i)    (30)
  = E_{(X,Y)} E_{(X',Y')} E_σ \sup_{f ∈ F} \frac{1}{n} \sum_{i=1}^{n} σ_i (ℓ'_i − ℓ_i)    (31)
  ≤ 2 E_{(X,Y)} E_σ \sup_{f ∈ F} \frac{1}{n} \sum_{i=1}^{n} σ_i ℓ_i    (32)
  = E_{(X,Y)} \hat{R}(F),    (33)

from which we obtain:

  E_{(X,Y)} \sup_{f ∈ F} [ L(f) − \hat{L}_n(f) ] ≤ E_{(X,Y)} \hat{R}(F).    (34)

For the sake of simplicity, let us define \hat{S}(F) = \sup_{f ∈ F} [ L(f) − \hat{L}_n(f) ]. Then, by using McDiarmid's inequality, we know that \hat{S}(F) is sharply concentrated around its mean:

  P[ \hat{S}(F) ≥ E_{(X,Y)} \hat{S}(F) + ε ] ≤ e^{−2nε²},    (35)

because the loss function is bounded. Therefore, combining these two results, we obtain:

  P[ \hat{S}(F) ≥ E_{(X,Y)} \hat{R}(F) + ε ]    (36)
  ≤ P[ \hat{S}(F) ≥ E_{(X,Y)} \hat{S}(F) + ε ] ≤ e^{−2nε²}.    (37)

We are interested in bounding L(f) with \hat{R}(F), so we can write:

  P[ \hat{S}(F) ≥ \hat{R}(F) + ε ]    (38)
  ≤ P[ \hat{S}(F) ≥ E_{(X,Y)} \hat{S}(F) + aε ] + P[ E_{(X,Y)} \hat{R}(F) ≥ \hat{R}(F) + (1 − a)ε ]    (39)
  ≤ e^{−2na²ε²} + e^{−\frac{n}{2}(1 − a)²ε²},    (40)

where, in the last step, we applied McDiarmid's inequality again. By setting a = 1/3 we have:

  P[ \hat{S}(F) ≥ \hat{R}(F) + ε ] ≤ 2 e^{−2nε²/9}.    (41)

The previous theorem allows us to obtain the main result by fixing a confidence δ and solving Eq. (26) with respect to ε, so as to obtain the following explicit bound, which holds with probability (1 − δ):

  L(f_n) ≤ \hat{L}_n(f_n) + \hat{R}(F) + 3 \sqrt{ \frac{\log(2/δ)}{2n} }.    (42)
The approach based on the Maximal Discrepancy is similar to the previous one and provides similar results. For the sake of brevity we refer the reader to [32], [49] for the complete proofs and to [32] for a comparison of the two approaches: here we give only the final results. Let us split D_n in two halves, D^{(1)}_{n/2} and D^{(2)}_{n/2}, and compute the corresponding empirical errors:

  \hat{L}^{(1)}_{n/2}(f) = \frac{2}{n} \sum_{i=1}^{n/2} ℓ(f(x_i), y_i),    (43)
  \hat{L}^{(2)}_{n/2}(f) = \frac{2}{n} \sum_{i=n/2+1}^{n} ℓ(f(x_i), y_i).    (44)

Then, the Maximal Discrepancy \hat{M} of F is defined as

  \hat{M}(F) = \max_{f ∈ F} [ \hat{L}^{(1)}_{n/2}(f) − \hat{L}^{(2)}_{n/2}(f) ]    (45)

and, under the same hypotheses of Theorem 2, the following bound holds, with probability (1 − δ):

  L(f) ≤ \hat{L}_n(f) + \hat{M}(F) + 3 \sqrt{ \frac{\log(2/δ)}{2n} }.    (46)

A. The in-sample approach in practice

The theoretical analysis of the previous section does not clarify how the in-sample techniques can be applied in practice and, in particular, how they can be used to develop effective model selection and error estimation phases for SVMs. The first problem, analogously to the case of the out-of-sample approach, is related to the boundedness requirement on the loss function, which is not satisfied by the SVM hinge loss. A recent result appears to be promising in generalizing McDiarmid's inequality to the case of (almost) unbounded functions [50], [51]:

Theorem 3: [50] Let Z_1, ..., Z_n be independent random variables taking values in a set Z and assume that a function g : Z^n → [−A, A] ⊆ ℜ satisfies

  \sup_{z_1,...,z_n, \hat{z}_i} | g(z_1, ..., z_n) − g(z_1, ..., \hat{z}_i, ..., z_n) | < c_n   ∀i

on a subset G ⊆ Z^n with probability 1 − δ_n, while ∀ {z_1, ..., z_n} ∈ \bar{G}, ∃ z'_i ∈ Z such that

  c_n < | g(z_1, ..., z_n) − g(z_1, ..., z'_i, ..., z_n) | ≤ 2A,

where \bar{G} ∪ G = Z^n. Then, for any ε > 0,

  P{ |g − E[g]| ≥ ε } ≤ 2 e^{ −ε² / (8 n c_n²) } + \frac{2 A n δ_n}{c_n}.

In other words, the theorem states that, if g satisfies the same conditions of Theorem 1 with high probability (1 − δ_n), then the function is (almost) concentrated around its mean. Unfortunately, the bound is exponential only if it is possible to show that δ_n decays exponentially, which requires introducing some constraints on the probability distribution generating the data. As we are working in the agnostic case, where no hypothesis on the data is assumed, this approach is outside the scope of this paper, albeit it opens some interesting research cases, for example when some additional information on the data is available.

The use of the soft loss function, instead, which is bounded in the interval [0, 1] and will be adapted to the SVM in the following sections, allows us to apply the bound of Eq. (42). By noting that the soft loss satisfies the following symmetry property:

  ℓ(f(x), y) = 1 − ℓ(f(x), −y),    (47)

it can be shown that the Rademacher Complexity can be easily computed by learning a modified dataset. In fact, let us define I^+ = {i : σ_i = +1} and I^− = {i : σ_i = −1}; then:

  \hat{R}(F) = E_σ \sup_{f ∈ F} \frac{2}{n} \sum_{i=1}^{n} σ_i ℓ_i    (48)
  = 1 + E_σ \sup_{f ∈ F} \frac{2}{n} [ \sum_{i ∈ I^+} (ℓ(f_i, y_i) − 1) − \sum_{i ∈ I^−} ℓ(f_i, y_i) ]    (49)
  = 1 + E_σ \sup_{f ∈ F} [ − \frac{2}{n} \sum_{i ∈ I^+} ℓ(f_i, −y_i) − \frac{2}{n} \sum_{i ∈ I^−} ℓ(f_i, y_i) ]    (50)
  = 1 + E_σ \sup_{f ∈ F} [ − \frac{2}{n} \sum_{i=1}^{n} ℓ(f_i, −σ_i y_i) ]    (51)
  = 1 − E_σ \inf_{f ∈ F} \frac{2}{n} \sum_{i=1}^{n} ℓ(f_i, σ_i),    (52)

where the last step follows because −σ_i y_i is distributed as σ_i. In other words, the Rademacher Complexity of the class F can be computed by learning the original dataset, but where the labels have been randomly flipped. Analogously, it can be proved that the Maximal Discrepancy of the class F can be computed by learning the original dataset, where the labels of the samples in D^{(2)}_{n/2} have been flipped [28].

The second problem that must be addressed is finding an efficient way to compute the quantities \hat{R} and \hat{M}, avoiding the computation of E_σ[·], which would require N = 2^n training phases. A simple approach is to adopt a Monte-Carlo estimation of this quantity, by computing:

  \hat{R}_k(F) = \frac{1}{k} \sum_{j=1}^{k} \sup_{f ∈ F} \frac{2}{n} \sum_{i=1}^{n} σ_i^j ℓ(f(x_i), y_i),    (53)

where 1 ≤ k ≤ N is the number of Monte-Carlo trials. The effect of computing \hat{R}_k instead of \hat{R} can be made explicit by noting that the Monte-Carlo trials can be modeled as a sampling without replacement from the N possible label configurations. Then, we can apply any bound for the tail of the hypergeometric distribution, for example Serfling's bound [52], to write:

  P[ \hat{R}(F) ≥ \hat{R}_k(F) + ε ] ≤ e^{ − \frac{2kε²}{1 − \frac{k−1}{N}} }.    (54)

We know that

  E_{(X,Y)} \hat{S}(F) ≤ E_{(X,Y)} \hat{R}(F)    (55)

and, moreover,

  P[ \hat{S}(F) ≥ \hat{R}_k(F) + aε ]    (56)
  ≤ P[ \hat{S}(F) ≥ E_{(X,Y)} \hat{S}(F) + a_1 ε ] + P[ E_{(X,Y)} \hat{R}(F) ≥ \hat{R}(F) + a_2 ε ] + P[ \hat{R}(F) ≥ \hat{R}_k(F) + a_3 ε ]    (57)
  ≤ e^{−2na_1²ε²} + e^{−\frac{n}{2}a_2²ε²} + e^{ − \frac{2ka_3²ε²}{1 − \frac{k−1}{N}} },    (58)

where a = a_1 + a_2 + a_3. So, by setting a_1 = 1/4, a_2 = 1/2 and a_3 = \frac{1}{4} \sqrt{ \frac{n(1 − \frac{k−1}{N})}{k} }, we have that:

  P[ \hat{S}(F) ≥ \hat{R}_k(F) + \frac{3 + \sqrt{ \frac{n(1 − \frac{k−1}{N})}{k} }}{4} ε ] ≤ 3 e^{−nε²/8}.    (59)

Then, with probability (1 − δ):

  L(f) ≤ \hat{L}_n(f) + \hat{R}_k(F) + ( 3 + \sqrt{ \frac{n(1 − \frac{k−1}{N})}{k} } ) \sqrt{ \frac{\ln(3/δ)}{2n} },    (60)

which recovers the bound of Eq. (42), for k → N, up to some constants. The Maximal Discrepancy approach results in a very similar bound (the proofs can be found in [32]):

  L(f) ≤ \hat{L}_n(f) + \frac{1}{k} \sum_{j=1}^{k} \hat{M}^{(j)}(F) + 3 \sqrt{ \frac{\log(2/δ)}{2n} },    (61)

which holds with probability (1 − δ) and where k is the number of random shuffles of D_n before splitting it in two halves. Note that, in this case, the confidence term does not depend on k: this is a consequence of retaining the information provided by the labels y_i, which is lost in the Rademacher Complexity approach.
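A Monte-Carlo sketch of Eqs. (52)-(53) and of the Maximal Discrepancy of Eq. (45) follows (our illustration: a standard hinge-loss linear SVM from scikit-learn is used as a stand-in for the soft-loss learner of Section V, so the returned values are only approximations of the true suprema):

```python
import numpy as np
from sklearn.svm import LinearSVC

def soft_loss(scores, y):
    # Eq. (5), bounded in [0, 1]
    return np.clip((1.0 - y * scores) / 2.0, 0.0, 1.0)

def rademacher_mc(X, y, C, k=10, seed=0):
    """Eq. (53) via Eq. (52): train on randomly drawn Rademacher labels and
    measure how well they can be fitted."""
    rng = np.random.default_rng(seed)
    n, vals = len(y), []
    for _ in range(k):
        sigma = rng.choice([-1, 1], size=n)     # random label configuration
        clf = LinearSVC(C=C).fit(X, sigma)      # approximates the inf over F
        vals.append(1.0 - 2.0 * soft_loss(clf.decision_function(X), sigma).mean())
    return float(np.mean(vals))

def maximal_discrepancy_mc(X, y, C, k=10, seed=0):
    """Eq. (61): shuffle D_n, flip the labels of one half (Eq. (45), see [28])
    and measure how well the modified dataset can be fitted."""
    rng = np.random.default_rng(seed)
    n, vals = len(y), []
    for _ in range(k):
        perm = rng.permutation(n)
        Xs, ys = X[perm], y[perm].copy()
        ys[n // 2:] *= -1                       # flip the second half
        clf = LinearSVC(C=C).fit(Xs, ys)
        vals.append(1.0 - 2.0 * soft_loss(clf.decision_function(Xs), ys).mean())
    return float(np.mean(vals))
```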
V. THE APPLICATION OF THE IN-SAMPLE APPROACH TO THE SUPPORT VECTOR MACHINE

Let us consider the training set D_n and the input space X ⊆ ℜ^d. We map our input space into a feature space X' ⊆ ℜ^D with the function φ : ℜ^d → ℜ^D. Then, the SVM classifier is defined as:

  f(x) = w · φ(x) + b,    (62)

where the weights w ∈ ℜ^D and the bias b ∈ ℜ are found by solving the following primal convex constrained quadratic programming (CCQP) problem:

  \min_{w,b,ξ} \frac{1}{2} ||w||² + C e^T ξ    (63)
  s.t. y_i (w · φ(x_i) + b) ≥ 1 − ξ_i
       ξ_i ≥ 0,

where e_i = 1 ∀i ∈ {1, ..., n} [4]. By introducing n Lagrange multipliers α_1, ..., α_n, it is possible to write the problem of Eq. (63) in its dual form, for which efficient solvers have been developed throughout the years:

  \min_{α} \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} α_i α_j y_i y_j K(x_i, x_j) − \sum_{i=1}^{n} α_i    (64)
  s.t. 0 ≤ α_i ≤ C
       \sum_{i=1}^{n} y_i α_i = 0,

where K(x_i, x_j) = φ(x_i) · φ(x_j) is a suitable kernel function. After solving problem (64), the Lagrange multipliers can be used to define the SVM classifier in its dual form:

  f(x) = \sum_{i=1}^{n} y_i α_i K(x_i, x) + b.    (65)
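As a sanity check of Eqs. (64)-(65) (our illustration, not part of the paper), the dual expansion can be reproduced from any off-the-shelf solver; in scikit-learn, dual_coef_ stores the products y_i α_i for the support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# toy data: two Gaussian blobs with labels in {-1, +1}
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (20, 2)), rng.normal(+1, 1, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Eq. (65): f(x) = sum_i y_i alpha_i K(x_i, x) + b, over the support vectors only
K = X @ clf.support_vectors_.T                 # linear kernel K(x, x_i)
f_manual = K @ clf.dual_coef_.ravel() + clf.intercept_[0]
assert np.allclose(f_manual, clf.decision_function(X))
```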
Problem (63) is also known as the Tikhonov formulation of the SVM, because it can be seen as a regularized ill-posed problem. The hyperparameter C in problems (63) and (64) is tuned during the model selection phase and indirectly defines the set of functions F. Then, any out-of-sample technique can be applied to estimate the generalization error of the classifier, and the optimal value of C can be chosen accordingly. Unfortunately, this formulation suffers from several drawbacks:

• the hypothesis space F is not directly controlled by the hyperparameter C, but only indirectly through the minimization process;
• the loss function of the SVM is not bounded, which represents a problem for out-of-sample techniques as well, because the optimization is performed using the hinge loss, while the error estimation is usually computed with the hard loss;
• the function space is centered in an arbitrary way with respect to the optimal (unknown) classifier.

It is worthwhile to write the SVM optimization problem as [21]:

  \min_{w,b,ξ} \sum_{i=1}^{n} ξ_i    (66)
  s.t. ||w||² ≤ ρ    (67)
       y_i (w · φ(x_i) + b) ≥ 1 − ξ_i    (68)
       ξ_i ≥ 0,    (69)

which is the equivalent Ivanov formulation of problem (63) for some value of the hyperparameter ρ. From Eq. (67) it is clear that ρ explicitly controls the size of the function space F, which is centered in the origin and consists of the set of linear classifiers with margin greater than or equal to 2/√ρ. In fact, as ρ is increased, the set of functions is enriched with classifiers of smaller margin and, therefore, of greater classification (and overfitting) capability.

A possibility for centering the space F in a different point is to translate the weights of the classifiers by some constant value, so that Eq. (67) becomes ||w − w_0||² ≤ ρ. By applying this idea to the Ivanov formulation of the SVM and substituting the hinge loss with the soft loss, we obtain the following new optimization problem:
Regardless of the formulation, the optimization problem is non–convex, so we must resort to methods that are able to find an approximate suboptimal solution, like the Peeling technique [32], [58] or the ConCave–Convex Procedure (CCCP) [33]. In particular, the CCCP, which is synthesized in Algorithm 1, suggests breaking the objective function of Eq. (71) in its (72) yi (w · φ(xi ) + b) + yi f0 (xi ) ≥ 1 − ξi ξi ≥ 0 n min θ=θ (t) Jconcave (θ) Jconvex (θ) where f0 is the classifier, which has been selected as the center of the function space F, and λ is a normalization constant. Note that f0 can be either a linear classifier f0 (x) = w0 · φ(x) + b0 or a non-linear one (e.g. analogously to what shown in [53]) but, in general, can be any a–priori and auxiliary information, which helps in relocating the function space closer to the optimal classifier. In this respect, f0 can be considered as a hint, a concept introduced in [54] in the context of neural networks, which must be defined independently from the training data. The normalization constant λ weights the amount of hints that we are keen to accept in searching for our optimal classifier: if we set λ = 0, we obtain the conventional Ivanov formulation for SVM, while for larger values of λ the hint is weighted even more than the regularization process itself. The sensitivity analysis of the SVM solution with respect to the variations of λ is an interesting issue that would require a thorough study, so we are not address it here. In any case, as we are working in the agnostic case, we equally weight the hint and the regularized learning process, thus we choose λ = 1 in this work. The previous optimization problem can be re-formulated in its dual form and solved by general–purpose convex programming algorithms [21]. However, we show here that it can also be solved by conventional SVM learning algorithms, if we rewrite it in the usual Tikhonov formulation: w,b,ξ,η =θ (t) convex and concave parts: ηi = min(2, ξi ) X 1 ηi kwk2 + C 2 i=1 9 C 0 n n X X 1 (t) ∆i yi (w · φ(xi ) + b) ξi + kwk2 + C 2 i=1 i=1 (77) yi (w · φ(xi ) + b) + yi f0 (xi ) ≥ 1 − ξi ξi ≥ 0. As a last issue, it is worth noting that the dual formulation of the previous problem can be obtained, by introducing n lagrange multipliers βi : min β n n n X 1 XX (yi f0 (xi ) − 1) βi βi βj yi yj K(xi , xj ) + 2 i=1 j=1 i=1 (t) (t) − ∆i ≤ βi ≤ C − ∆i n X yi βi = 0, (78) i=1 which can be solved by any SVM–specific algorithm like, for example, the well–known Sequential Minimal Optimization algorithm [29], [30]. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. X, NO. X, MAY 2012 A. The method in a nutshell In this section, we briefly summarize the method, which allows to apply the in-sample approach to the SVM model selection and error estimation problems. As a first step, we have to identify a centroid f0 : for this purpose, possible a-priori information can be exploited, else in [53] a method to identify a hint in a data-dependent way is suggested. Note that f0 can be either a linear or a non-linear SVM classifier and, in principle, can be even computed by exploiting a kernel that differs from the one used during the learning phase. Once the sequence of classes of functions is centered, we explore its hierarchy according to the Structural Risk Minimization principle: ideally by looking for the optimal hyperparameter ρ ∈ (0, +∞), similarly to the search for the optimal C in conventional SVMs. For every value of ρ, i.e. 
for every class of functions, Problem (70) is solved by exploiting the procedure previously presented in Section V and either the Rademacher Complexity (Eq. (60)) or the Maximal Discrepancy (Eq. (61)) bounds are computed. Finally, F and, therefore, the corresponding classifier are chosen, for which the value of the estimated generalization error is minimized. Note that this value, by construction, is a statistically valid estimate. VI. E XPERIMENTAL R ESULTS We describe in this section two sets of experiments. The first one is built by using relatively large datasets that allow us to simulate the small–sample setting. Each dataset is sampled by extracting a small amount of data to build the training sets and exploiting the remaining data as a good representative of the entire sample population. The rationale behind this choice is to build some toy problems, but based on real–world data, so to better explore the performance of our proposal in a controlled setting. Thank to this approach, the experimental results can be easily interpreted and the two approaches, in–sample vs. out– of–sample, easily compared. The second set, instead, targets the classification of microarray data, which consists of true small–sample datasets. A. The simulated small–sample setting We consider the well–known MNIST [59] dataset consisting of 62000 images, representing the numbers from 0 to 9: in particular, we consider the 13074 patterns containing 0’s and 1’s, allowing us to deal with a binary classification problem. We build a small–sample dataset by randomly sampling a small number of patterns, varying from n = 10 to n = 400, which is a value much smaller then the dimensionality of the data d = 28 × 28 = 784, while the remaining 13074 − n images are used as a reference set. In order to build statistically relevant results, the entire procedure is repeated 30 times during the experiments. We also consider a balanced version of the DaimlerChrysler dataset [60], where half of the 9800 images, of d = 36 × 18 = 648 pixels, contains the picture of a pedestrian, while the other half contains only some general background or other objects. 10 These two datasets target different objectives: the MNIST dataset represents an easy classification problem, in the sense that a low classification error, well below 1%, can be easily achieved; on the contrary, the DaimlerChrysler dataset is a much more difficult problem, because the samples from each class are quite overlapped, so the small–sample setting makes this problem even more difficult to solve. By analyzing these two opposite cases, it is possible to gain a better insight on the performance of the various methods. In all cases, we use a linear kernel φ(x) = x, as the training data are obviously linearly separable (d > n) and the use of a nonlinear transformation would further complicate the interpretation of the results. In Tables I–VIII, the results obtained with the different methods are reported. Each column refers to a different approach: • • • • • RC and MD are the in–sample procedures using, respectively, the Rademacher Complexity and Maximal Discrepancy approaches, with f0 = 0; RCf and MDf are similar to the previous cases, but 30% of the samples of the training set are used for finding a hint f0 (x) = w0 · x + b0 by learning a linear classifier on them (refer to [53] for further details); KCV is the k-fold Cross Validation procedure, with k = 10; LOO is the Leave–One–Out procedure; BTS is the Bootstrap technique with NB = 100. 
VI. EXPERIMENTAL RESULTS

We describe in this section two sets of experiments. The first one is built by using relatively large datasets that allow us to simulate the small-sample setting. Each dataset is sampled by extracting a small amount of data to build the training sets, while exploiting the remaining data as a good representative of the entire sample population. The rationale behind this choice is to build some toy problems, but based on real-world data, so as to better explore the performance of our proposal in a controlled setting. Thanks to this approach, the experimental results can be easily interpreted and the two approaches, in-sample vs. out-of-sample, easily compared. The second set, instead, targets the classification of microarray data, which consists of true small-sample datasets.

A. The simulated small-sample setting

We consider the well-known MNIST dataset [59], consisting of 62000 images representing the numbers from 0 to 9: in particular, we consider the 13074 patterns containing 0's and 1's, allowing us to deal with a binary classification problem. We build a small-sample dataset by randomly sampling a small number of patterns, varying from n = 10 to n = 400, which is much smaller than the dimensionality of the data, d = 28 × 28 = 784, while the remaining 13074 − n images are used as a reference set. In order to build statistically relevant results, the entire procedure is repeated 30 times during the experiments. We also consider a balanced version of the DaimlerChrysler dataset [60], where half of the 9800 images, of d = 36 × 18 = 648 pixels, contain the picture of a pedestrian, while the other half contains only some general background or other objects.

These two datasets target different objectives: the MNIST dataset represents an easy classification problem, in the sense that a low classification error, well below 1%, can be easily achieved; on the contrary, the DaimlerChrysler dataset is a much more difficult problem, because the samples from each class are quite overlapped, so the small-sample setting makes this problem even more difficult to solve. By analyzing these two opposite cases, it is possible to gain better insight into the performance of the various methods. In all cases, we use a linear kernel φ(x) = x, as the training data are obviously linearly separable (d > n) and the use of a nonlinear transformation would further complicate the interpretation of the results.

In Tables I-VIII, the results obtained with the different methods are reported. Each column refers to a different approach:

• RC and MD are the in-sample procedures using, respectively, the Rademacher Complexity and Maximal Discrepancy approaches, with f_0 = 0;
• RCf and MDf are similar to the previous cases, but 30% of the samples of the training set are used for finding a hint f_0(x) = w_0 · x + b_0 by learning a linear classifier on them (refer to [53] for further details);
• KCV is the k-fold Cross Validation procedure, with k = 10;
• LOO is the Leave-One-Out procedure;
• BTS is the Bootstrap technique, with N_B = 100.

For in-sample methods, the model selection is performed by searching for the optimal hyperparameter ρ ∈ [10^{−6}, 10^{3}] among 30 values, equally spaced on a logarithmic scale, while for the out-of-sample approaches the search is performed by varying C in the same range.

Tables I and II show the error rate achieved by each selected classifier on the reference sets, using the soft loss for computing the error. In particular, the in-sample methods exploit the soft loss for the learning phase, which, by construction, also includes the model selection and error estimation phases. The out-of-sample approaches, instead, use the conventional hinge loss for finding the optimal classifier and the soft loss for model selection. When the classifiers have been found, according to the respective approaches, their performance is verified on the reference dataset, so as to check whether a good model has been selected, and the achieved misclassification rate is reported in the tables. All the figures are in percentage and the best values are highlighted. As can be easily seen, the best approaches are RCf and MDf, which consistently outperform the out-of-sample ones. It is also clear that centering the function space in a more appropriate point, thanks to the hint f_0, improves the ability of the procedure to select a better classifier, with respect to in-sample approaches without hints. This is a result of the shrinking of the function space, which directly affects the tightness of the generalization error bounds. As a last remark, it is possible to note that RC and MD often select the same classifier, with a slight superiority of RC when dealing with difficult classification problems. This is also an expected result, because MD makes use of the label information, which is misleading if the samples are not well separable [49].
The quality of the error estimation improves for KCV, which uses one tenth of the data, and even more for BTS, which selects, on average, a third of the data for performing the estimation. In any case, by comparing the results in Table V with the ones in Table I, it is clear that even out–of–sample methods are overly pessimistic: in the best case (BTS, with n = 400), the generalization error is overestimated by a factor greater than 4. This result seems to be in contrast with the common belief that out–of–sample methods provide a good estimation of the generalization error, but they are not surprising because, most of the times, when the generalization error of a classifier is reported in the literature, the confidence term (i.e. the second term on the right side of Eq. (23)) is usually neglected and only its average performance is disclosed (i.e. the first term on the right side of Eq. (23)). The results on the two datasets provide another interesting insight on the behavior of the two approaches: it is clear that out–of–sample methods exploit the distribution of the samples of the test set, because they are able to identify the intrinsic difficulty of the classification problem; in–sample methods, instead, do not possess this kind of information and, therefore, maintain a pessimistic approach in all cases, which is not useful for easy classification problems, like MNIST. This is confirmed also by the small difference in performance of the two approaches on the difficult DaimlerChrysler problem. On the other hand, the advantage of having a test set, for out–of– sample methods, is overcome by the need of reducing the size of the training and validation sets, which causes the methods to choose a worse performing classifier. This is related to the well–known issue of the optimal splitting of the data between 11 TABLE IX: Human gene expressions datasets. Dataset Brain Tumor 1 [62] Brain Tumor 2 [62] Colon Cancer 1 [63] Colon Cancer 2 [64] DLBCL [62] Duke Breast Cancer [65] Leukemia [66] Leukemia 1 [62] Leukemia 2 [62] Lung Cancer [62] Myeloma [67] Prostate Tumor [62] SRBCT [62] d 5920 10367 22283 2000 5469 7129 7129 5327 11225 12600 28032 10509 2308 n 90 50 47 62 77 44 72 72 72 203 105 102 83 training and test sets, which is still an open problem. Finally, Tables VII and VIII show the error estimation of the out–of–sample methods using the hard loss. In this case, the in–sample methods cannot be applied, because it is not possible to perform the learning with this loss. As expected, the error estimation improves, respect to the previous case, except for the LOO method, which is not able to provide consistent results. The improvement respect to the case of the soft loss is due to the fact that we are now working in a parametric setting (e.g. the errors are distributed according to a Binomial distribution), while the soft loss gives rise to a non–parametric estimation, which is a more difficult problem. In synthesis, the experiments clearly show that in–sample methods with hints are more reliable for model selection than out–of–sample ones and that the Boostrap appears to be the best approach to perform the generalization error estimation of the trained classifier. B. Microarray small–sample datasets The last set of experiments deals with several Human Gene Expression datasets (Table IX), where all the problems are converted, where needed, to two classes by simply grouping some data. 
In this kind of setting, a reference set of reasonable size is not available, so we reproduce the methodology used in [61], which consists of generating five different training/test pairs using a cross validation approach. The same procedures of Section VI-A are used in order to compare the different approaches to model selection and error estimation, and the results are reported in Tables X-XIII.

Table X shows the error rate obtained on the reference sets using the soft loss, where the in-sample methods outperform out-of-sample ones most of the time (8 vs. 5). The interesting fact is the large improvement of the in-sample methods with hints with respect to the analogous versions without hints. Providing some a-priori knowledge for selecting the classifier space appears to be very useful and, in some cases (e.g. the Brain Tumor 2, Colon Cancer 2 and DLBCL datasets), makes it possible to solve problems that no other method, in-sample without hints or out-of-sample, can deal with. In Table XI, analogously, the misclassification rates on the reference sets using the hard loss, which favors out-of-sample methods, are reported. In this case, the three out-of-sample methods globally outperform the in-sample ones, but none of them, considered singularly, is consistently better than the in-sample ones.

Finally, Tables XII and XIII show the error estimation using the soft and hard loss, respectively. The Bootstrap provides better estimates than all other methods but, unfortunately, it suffers from two serious drawbacks: the estimates are very loose and, in some cases (e.g. the Brain Tumor 2, Colon Cancer 2 and DLBCL datasets), the estimation is not consistent, as it underestimates the actual classifier error rate. This is an indication that, in the small-sample setting, where the test data is very scarce, both in-sample and out-of-sample methods are of little use for estimating the generalization error of a classifier; however, while out-of-sample methods cannot be improved, because they work in a parametric setting where the Clopper-Pearson bound is the tightest possible, in-sample methods could lead to better estimation, as they allow for further improvements, both in the theoretical framework and in their practical application.

C. Some notes on the computational effort

The proposed approach addresses the problem of model selection and error estimation for SVMs. Though general, this approach gives benefits when we deal with small-sample problems (d ≫ n), like the Gene Expression datasets, where only few samples are available (n ≈ 100). In this framework, the computational complexity and the computational cost of the proposed method are not a critical issue, because the time needed to perform the procedure is small. In fact, as an example, we report in Table XIV the computational time (in seconds, measured on an Intel Core i5 2.3 GHz architecture; the source code is written in Fortran 90) needed to perform the different in-sample and out-of-sample procedures on the MNIST dataset, where it is worth noting that the learning procedures always require less than one minute to conclude. Similar results are obtained using the other datasets of this paper.

VII. CONCLUSION

We have detailed a complete methodology for applying two in-sample approaches, based on the data-dependent Structural Risk Minimization principle, to the model selection and error estimation of Support Vector Classifiers. The methodology is theoretically justified and obtains good results in practice.
C. Some notes on the computational effort

The proposed approach addresses the problem of model selection and error estimation of SVMs. Though general, this approach gives its greatest benefits when dealing with small–sample problems (d ≫ n), like the Gene Expression datasets, where only a few samples are available (n ≈ 100). In this framework, the computational cost of the proposed method is not a critical issue, because the time needed to perform the procedure is small. As an example, we report in Table XIV the computational time (in seconds, measured on an Intel Core i5 2.3 GHz architecture, with source code written in Fortran90) needed to perform the different in–sample and out–of–sample procedures on the MNIST dataset; it is worth noting that the learning procedures always complete in less than one minute. Similar results are obtained on the other datasets of this paper.

VII. CONCLUSION

We have detailed a complete methodology for applying two in–sample approaches, based on the data–dependent Structural Risk Minimization principle, to the model selection and error estimation of Support Vector Classifiers. The methodology is theoretically justified and obtains good results in practice. At the same time, we have shown that in–sample methods can be comparable to, or even better than, the more widely used out–of–sample methods, at least in the small–sample setting. A key step toward improving their adoption is our proposal for transforming the in–sample learning problem from the Ivanov formulation to the Tikhonov one, so that it can be easily approached by conventional SVM solvers.
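For the reader's convenience, the two formulations can be sketched in generic SVM notation (a sketch only: it omits the hint–based centering of the classifier space used in this paper, whose exact formulation appears in the earlier sections):

    % Ivanov-style problem: minimize the empirical risk over a
    % hypothesis space whose size is explicitly bounded:
    \min_{w,b} \sum_{i=1}^{n} \xi_i
    \quad \text{s.t.} \quad y_i (w \cdot \phi(x_i) + b) \ge 1 - \xi_i,
    \quad \xi_i \ge 0, \quad \|w\|^2 \le \rho^2

    % Tikhonov-style (conventional SVM) problem: the size constraint
    % becomes a regularization term, weighted by the hyperparameter C:
    \min_{w,b} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
    \quad \text{s.t.} \quad y_i (w \cdot \phi(x_i) + b) \ge 1 - \xi_i,
    \quad \xi_i \ge 0

By convexity, for every value of ρ there exists a value of C for which the two problems share the same solution, which is what allows the in–sample procedure to be carried out with off–the–shelf SVM solvers.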
We believe that our analysis opens new perspectives on the application of the data–dependent SRM theory to practical problems, by showing that the common misconception about its poor practical effectiveness is greatly exaggerated. The SRM theory is just a different and sophisticated statistical tool that needs to be used with some care and that, we hope, will be further improved in the future, both by building sharper theoretical bounds and by finding more clever ways to exploit hints for centering the classifier space.

REFERENCES

[1] I. Guyon, A. Saffari, G. Dror, and G. Cawley, "Model selection: beyond the Bayesian/frequentist divide," The Journal of Machine Learning Research, vol. 11, pp. 61–87, 2010.
[2] S. Geman, E. Bienenstock, and R. Doursat, "Neural networks and the bias/variance dilemma," Neural Computation, vol. 4, no. 1, pp. 1–58, 1992.
[3] P. Bartlett, "The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network," IEEE Transactions on Information Theory, vol. 44, no. 2, pp. 525–536, 1998.
[4] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
[5] V. Vapnik, "An overview of statistical learning theory," IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 988–999, 1999.
[6] Y. Shao, C. Zhang, X. Wang, and N. Deng, "Improvements on twin support vector machines," IEEE Transactions on Neural Networks, no. 99, pp. 962–968, 2011.
[7] D. Anguita, S. Ridella, and F. Rivieccio, "K-fold generalization capability assessment for support vector classifiers," in Proceedings of the IEEE International Joint Conference on Neural Networks, 2005.
[8] B. Milenova, J. Yarmus, and M. Campos, "SVM in Oracle Database 10g: removing the barriers to widespread adoption of support vector machines," in Proceedings of the 31st International Conference on Very Large Data Bases, 2005.
[9] Z. Xu, M. Dai, and D. Meng, "Fast and efficient strategies for model selection of Gaussian support vector machine," IEEE Transactions on Cybernetics, vol. 39, no. 5, pp. 1292–1307, 2009.
[10] T. Glasmachers and C. Igel, "Maximum likelihood model selection for 1-norm soft margin SVMs with multiple parameters," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1522–1528, 2010.
[11] K. De Brabanter, J. De Brabanter, J. Suykens, and B. De Moor, "Approximate confidence and prediction intervals for least squares support vector regression," IEEE Transactions on Neural Networks, no. 99, pp. 110–120, 2011.
[12] M. Karasuyama and I. Takeuchi, "Nonlinear regularization path for quadratic loss support vector machines," IEEE Transactions on Neural Networks, no. 99, pp. 1613–1625, 2011.
[13] B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap. Chapman & Hall, 1993.
[14] R. Kohavi, "A study of cross-validation and bootstrap for accuracy estimation and model selection," in Proceedings of the International Joint Conference on Artificial Intelligence, 1995.
[15] F. Cheng, J. Yu, and H. Xiong, "Facial expression recognition in JAFFE dataset based on Gaussian process classification," IEEE Transactions on Neural Networks, vol. 21, no. 10, pp. 1685–1690, 2010.
[16] D. Anguita, A. Ghio, S. Ridella, and D. Sterpi, "K-fold cross validation for error rate estimate in support vector machines," in Proceedings of the International Conference on Data Mining, 2009.
[17] T. Clark, "Can out-of-sample forecast comparisons help prevent overfitting?" Journal of Forecasting, vol. 23, no. 2, pp. 115–139, 2004.
[18] D. Rapach and M. Wohar, "In-sample vs. out-of-sample tests of stock return predictability in the context of data mining," Journal of Empirical Finance, vol. 13, no. 2, pp. 231–247, 2006.
[19] A. Isaksson, M. Wallman, H. Goransson, and M. Gustafsson, "Cross-validation and bootstrapping are unreliable in small sample classification," Pattern Recognition Letters, vol. 29, no. 14, pp. 1960–1965, 2008.
[20] U. M. Braga-Neto and E. R. Dougherty, "Is cross-validation valid for small-sample microarray classification?" Bioinformatics, vol. 20, no. 3, pp. 374–380, 2004.
[21] V. N. Vapnik, Statistical Learning Theory. Wiley-Interscience, 1998.
[22] K. Duan, S. Keerthi, and A. Poo, "Evaluation of simple performance measures for tuning SVM hyperparameters," Neurocomputing, vol. 51, pp. 41–59, 2003.
[23] D. Anguita, A. Boni, S. Ridella, F. Rivieccio, and D. Sterpi, "Theoretical and practical model selection methods for support vector classifiers," in Support Vector Machines: Theory and Applications, L. Wang, Ed., 2005, pp. 159–180.
[24] T. Poggio, R. Rifkin, S. Mukherjee, and P. Niyogi, "General conditions for predictivity in learning theory," Nature, vol. 428, no. 6981, pp. 419–422, 2004.
[25] B. Scholkopf and A. J. Smola, Learning with Kernels. The MIT Press, 2001.
[26] V. Cherkassky, X. Shao, F. Mulier, and V. Vapnik, "Model complexity control for regression using VC generalization bounds," IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 1075–1089, 1999.
[27] J. Shawe-Taylor, P. Bartlett, R. Williamson, and M. Anthony, "Structural risk minimization over data-dependent hierarchies," IEEE Transactions on Information Theory, vol. 44, no. 5, pp. 1926–1940, 1998.
[28] P. Bartlett, S. Boucheron, and G. Lugosi, "Model selection and error estimation," Machine Learning, vol. 48, no. 1, pp. 85–113, 2002.
[29] J. C. Platt, "Fast training of support vector machines using sequential minimal optimization," in Advances in Kernel Methods: Support Vector Learning, 1999.
[30] C. Lin, "Asymptotic convergence of an SMO algorithm without any assumptions," IEEE Transactions on Neural Networks, vol. 13, no. 1, pp. 248–250, 2002.
[31] J. C. Platt, "Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods," in Advances in Large Margin Classifiers, 1999.
[32] D. Anguita, A. Ghio, N. Greco, L. Oneto, and S. Ridella, "Model selection for support vector machines: Advantages and disadvantages of the machine learning theory," in Proceedings of the International Joint Conference on Neural Networks, 2010.
[33] R. Collobert, F. Sinz, J. Weston, and L. Bottou, "Trading convexity for scalability," in Proceedings of the 23rd International Conference on Machine Learning, 2006.
[34] M. Anthony, Discrete Mathematics of Neural Networks: Selected Topics. Society for Industrial Mathematics, 2001.
[35] M. Aupetit, "Nearly homogeneous multi-partitioning with a deterministic generator," Neurocomputing, vol. 72, no. 7–9, pp. 1379–1389, 2009.
[36] C. Bishop, Pattern Recognition and Machine Learning. Springer New York, 2006.
[37] D. Anguita, A. Ghio, L. Oneto, and S. Ridella, "The impact of unlabeled patterns in Rademacher complexity theory for kernel classifiers," in Proceedings of Neural Information Processing Systems (NIPS), 2011, pp. 1009–1016.
[38] P. Bartlett, O. Bousquet, and S. Mendelson, "Local Rademacher complexities," The Annals of Statistics, vol. 33, no. 4, pp. 1497–1537, 2005.
[39] J. Weston, R. Collobert, F. Sinz, L. Bottou, and V. Vapnik, "Inference with the Universum," in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 1009–1016.
[40] J. Langford, "Tutorial on practical prediction theory for classification," Journal of Machine Learning Research, vol. 6, no. 1, p. 273, 2006.
[41] C. Clopper and E. Pearson, "The use of confidence or fiducial limits illustrated in the case of the binomial," Biometrika, vol. 26, no. 4, p. 404, 1934.
[42] J. Audibert, R. Munos, and C. Szepesvári, "Exploration–exploitation tradeoff using variance estimates in multi-armed bandits," Theoretical Computer Science, vol. 410, no. 19, pp. 1876–1902, 2009.
[43] A. Maurer and M. Pontil, "Empirical Bernstein bounds and sample variance penalization," in Proceedings of the International Conference on Learning Theory, 2009.
[44] W. Hoeffding, "Probability inequalities for sums of bounded random variables," Journal of the American Statistical Association, vol. 58, no. 301, pp. 13–30, 1963.
[45] V. Bentkus, "On Hoeffding's inequalities," The Annals of Probability, vol. 32, no. 2, pp. 1650–1673, 2004.
[46] D. Anguita, A. Ghio, L. Ghelardoni, and S. Ridella, "Test error bounds for classifiers: A survey of old and new results," in Proceedings of the IEEE Symposium on Foundations of Computational Intelligence, 2011, pp. 80–87.
[47] P. Bartlett and S. Mendelson, "Rademacher and Gaussian complexities: Risk bounds and structural results," The Journal of Machine Learning Research, vol. 3, pp. 463–482, 2003.
[48] C. McDiarmid, "On the method of bounded differences," Surveys in Combinatorics, vol. 141, no. 1, pp. 148–188, 1989.
[49] D. Anguita, A. Ghio, L. Oneto, and S. Ridella, "Maximal discrepancy vs. Rademacher complexity for error estimation," in Proceedings of the European Symposium on Artificial Neural Networks, 2011, pp. 257–262.
[50] S. Kutin, "Extensions to McDiarmid's inequality when differences are bounded with high probability," TR-2002-04, University of Chicago, Tech. Rep., 2002.
[51] E. Ordentlich, K. Viswanathan, and M. Weinberger, "Denoiser-loss estimators and twice-universal denoising," in Proceedings of the IEEE International Symposium on Information Theory, 2009.
[52] R. Serfling, "Probability inequalities for the sum in sampling without replacement," The Annals of Statistics, vol. 2, no. 1, pp. 39–48, 1974.
[53] D. Anguita, A. Ghio, L. Oneto, and S. Ridella, "Selecting the hypothesis space for improving the generalization ability of support vector machines," in Proceedings of the International Joint Conference on Neural Networks, 2011, pp. 1169–1176.
[54] Y. Abu-Mostafa, "Hints," Neural Computation, vol. 7, no. 4, pp. 639–671, 1995.
[55] T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu, "The entire regularization path for the support vector machine," The Journal of Machine Learning Research, vol. 5, pp. 1391–1415, 2004.
[56] M. Kloft, U. Brefeld, S. Sonnenburg, P. Laskov, K. Muller, and A. Zien, "Efficient and accurate lp-norm multiple kernel learning," in Advances in Neural Information Processing Systems (NIPS), vol. 22, no. 22, pp. 997–1005, 2009.
[57] D. Anguita, A. Ghio, L. Oneto, and S. Ridella, "In-sample model selection for support vector machines," in Proceedings of the International Joint Conference on Neural Networks, 2011, pp. 1154–1161.
[58] D. Anguita, A. Ghio, and S. Ridella, "Maximal discrepancy for support vector machines," Neurocomputing, vol. 74, pp. 1436–1443, 2011.
[59] L. Bottou, C. Cortes, J. Denker, H. Drucker, I. Guyon, L. Jackel, Y. LeCun, U. Muller, E. Sackinger, P. Simard et al., "Comparison of classifier methods: a case study in handwritten digit recognition," in Proceedings of the 12th IAPR International Conference on Pattern Recognition, Computer Vision and Image Processing, 1994.
[60] S. Munder and D. Gavrila, "An experimental study on pedestrian classification," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 11, pp. 1863–1868, 2006.
[61] A. Statnikov, C. Aliferis, I. Tsamardinos, D. Hardin, and S. Levy, "A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis," Bioinformatics, vol. 21, no. 5, pp. 631–643, 2005.
[62] A. Statnikov, I. Tsamardinos, and Y. Dosbayev, "GEMS: a system for automated cancer diagnosis and biomarker discovery from microarray gene expression data," International Journal of Medical Informatics, vol. 74, no. 7–8, pp. 491–503, 2005.
[63] N. Ancona, R. Maglietta, A. Piepoli, A. D'Addabbo, R. Cotugno, M. Savino, S. Liuni, M. Carella, G. Pesole, and F. Perri, "On the statistical assessment of classifiers using DNA microarray data," BMC Bioinformatics, vol. 7, no. 1, pp. 387–399, 2006.
[64] U. Alon, N. Barkai, D. Notterman, K. Gish, S. Ybarra, D. Mack, and A. Levine, "Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays," Proceedings of the National Academy of Sciences of the United States of America, vol. 96, no. 12, pp. 6745–6767, 1999.
[65] M. West, C. Blanchette, H. Dressman, E. Huang, S. Ishida, R. Spang, H. Zuzan, J. Olson, J. Marks, and J. Nevins, "Predicting the clinical status of human breast cancer by using gene expression profiles," Proceedings of the National Academy of Sciences of the United States of America, vol. 98, no. 20, pp. 11462–11490, 2001.
[66] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander, "Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring," Science, vol. 286, no. 5439, pp. 531–537, 1999.
[67] D. Page, F. Zhan, J. Cussens, M. Waddell, J. Hardin, B. Barlogie, and J. Shaughnessy Jr, "Comparative data mining for microarrays: A case study based on multiple myeloma," poster presentation at the International Conference on Intelligent Systems for Molecular Biology, August 2002.
    n     MDf            RCf            MD             RC             KCV            LOO            BTS
    10    8.46 ± 0.97    8.98 ± 1.12    12.90 ± 0.83   13.20 ± 0.86   10.70 ± 0.88   10.70 ± 0.88   13.40 ± 0.76
    20    5.10 ± 0.67    5.10 ± 0.67    8.39 ± 1.11    8.93 ± 1.20    6.96 ± 0.70    6.69 ± 0.71    9.37 ± 0.62
    40    3.05 ± 0.23    3.05 ± 0.23    6.26 ± 0.16    6.26 ± 0.16    4.56 ± 0.27    4.31 ± 0.26    5.93 ± 0.26
    60    2.36 ± 0.23    2.36 ± 0.23    5.95 ± 0.12    5.95 ± 0.12    3.42 ± 0.27    3.25 ± 0.29    4.40 ± 0.25
    80    1.96 ± 0.14    1.96 ± 0.14    5.61 ± 0.07    5.61 ± 0.07    2.94 ± 0.18    2.79 ± 0.17    3.61 ± 0.17
    100   1.63 ± 0.11    1.63 ± 0.11    5.26 ± 0.29    5.36 ± 0.21    2.42 ± 0.14    2.35 ± 0.17    3.15 ± 0.14
    120   1.44 ± 0.11    1.44 ± 0.11    4.98 ± 0.40    4.98 ± 0.40    2.17 ± 0.14    2.09 ± 0.17    2.86 ± 0.15
    150   1.27 ± 0.09    1.27 ± 0.09    3.71 ± 0.58    4.41 ± 0.53    1.89 ± 0.12    1.85 ± 0.15    2.43 ± 0.14
    170   1.20 ± 0.08    1.20 ± 0.08    2.71 ± 0.42    3.59 ± 0.57    1.74 ± 0.11    1.65 ± 0.11    2.18 ± 0.12
    200   1.08 ± 0.09    1.08 ± 0.09    2.25 ± 0.21    2.75 ± 0.47    1.53 ± 0.09    1.44 ± 0.09    1.98 ± 0.09
    250   0.92 ± 0.05    0.92 ± 0.05    2.07 ± 0.03    2.07 ± 0.03    1.34 ± 0.06    1.27 ± 0.06    1.67 ± 0.08
    300   0.81 ± 0.07    0.81 ± 0.07    2.02 ± 0.04    2.02 ± 0.04    1.18 ± 0.08    1.11 ± 0.09    1.48 ± 0.09
    400   0.70 ± 0.06    0.70 ± 0.06    1.93 ± 0.02    1.93 ± 0.02    0.98 ± 0.06    0.92 ± 0.07    1.24 ± 0.07

TABLE I: MNIST dataset: error on the reference set, computed using the soft loss.

    n     MDf            RCf            MD             RC             KCV            LOO            BTS
    10    37.40 ± 3.38   37.90 ± 3.52   42.80 ± 2.91   44.80 ± 2.54   37.10 ± 2.58   37.10 ± 2.58   39.00 ± 2.63
    20    31.50 ± 2.02   31.70 ± 2.00   37.70 ± 2.43   37.90 ± 2.38   32.00 ± 1.16   31.40 ± 1.11   34.60 ± 1.34
    40    28.00 ± 0.76   28.00 ± 0.75   33.10 ± 0.64   33.10 ± 0.64   29.10 ± 0.81   29.50 ± 0.80   30.70 ± 0.68
    60    26.60 ± 0.51   26.60 ± 0.50   31.60 ± 0.49   31.70 ± 0.46   27.60 ± 0.53   27.60 ± 0.75   28.90 ± 0.55
    80    25.70 ± 0.50   25.70 ± 0.50   30.60 ± 0.48   30.90 ± 0.47   26.80 ± 0.59   26.90 ± 0.78   28.20 ± 0.48
    100   25.20 ± 0.71   25.20 ± 0.71   30.20 ± 0.46   30.40 ± 0.49   25.70 ± 0.60   26.00 ± 0.78   27.90 ± 0.69
    120   23.80 ± 0.43   23.80 ± 0.43   29.70 ± 0.40   29.80 ± 0.39   24.60 ± 0.42   24.60 ± 0.53   26.60 ± 0.49
    150   22.90 ± 0.38   22.90 ± 0.37   28.90 ± 0.37   29.40 ± 0.34   23.80 ± 0.45   23.70 ± 0.43   25.40 ± 0.34
    170   22.40 ± 0.35   22.40 ± 0.35   27.90 ± 0.33   28.70 ± 0.41   23.30 ± 0.38   23.20 ± 0.52   25.00 ± 0.34
    200   21.80 ± 0.39   21.90 ± 0.38   27.90 ± 0.33   28.20 ± 0.37   22.70 ± 0.40   22.60 ± 0.44   24.70 ± 0.38
    250   21.30 ± 0.39   21.30 ± 0.39   27.10 ± 0.23   27.30 ± 0.21   21.80 ± 0.34   21.60 ± 0.39   23.40 ± 0.30
    300   20.50 ± 0.40   20.50 ± 0.41   27.00 ± 0.30   27.10 ± 0.24   21.00 ± 0.33   20.80 ± 0.35   22.50 ± 0.32
    400   19.60 ± 0.29   19.60 ± 0.29   26.10 ± 0.35   26.30 ± 0.32   20.00 ± 0.27   20.00 ± 0.30   21.50 ± 0.25

TABLE II: DaimlerChrysler dataset: error on the reference set, computed using the soft loss.
    n     MDf            RCf            MD             RC             KCV            LOO            BTS
    10    2.33 ± 0.94    2.55 ± 1.04    2.69 ± 0.67    2.78 ± 0.66    2.77 ± 0.83    2.77 ± 0.83    3.21 ± 0.67
    20    1.16 ± 0.31    1.16 ± 0.31    1.41 ± 0.43    1.49 ± 0.43    1.58 ± 0.42    1.91 ± 0.63    1.79 ± 0.34
    40    0.47 ± 0.05    0.47 ± 0.05    0.72 ± 0.09    0.72 ± 0.09    0.90 ± 0.27    1.72 ± 0.56    0.78 ± 0.12
    60    0.47 ± 0.10    0.47 ± 0.10    0.73 ± 0.10    0.73 ± 0.10    0.85 ± 0.26    1.15 ± 0.36    0.66 ± 0.12
    80    0.39 ± 0.05    0.39 ± 0.05    0.65 ± 0.07    0.65 ± 0.07    0.73 ± 0.17    1.26 ± 0.30    0.56 ± 0.09
    100   0.30 ± 0.04    0.30 ± 0.04    0.58 ± 0.06    0.59 ± 0.06    0.53 ± 0.14    0.76 ± 0.22    0.40 ± 0.05
    120   0.28 ± 0.03    0.28 ± 0.03    0.56 ± 0.05    0.56 ± 0.05    0.57 ± 0.12    0.79 ± 0.18    0.43 ± 0.06
    150   0.29 ± 0.04    0.29 ± 0.04    0.45 ± 0.07    0.51 ± 0.06    0.61 ± 0.19    0.77 ± 0.22    0.37 ± 0.05
    170   0.30 ± 0.05    0.30 ± 0.05    0.35 ± 0.05    0.43 ± 0.07    0.49 ± 0.09    0.58 ± 0.11    0.37 ± 0.06
    200   0.28 ± 0.04    0.28 ± 0.04    0.31 ± 0.04    0.37 ± 0.06    0.50 ± 0.11    0.61 ± 0.13    0.34 ± 0.04
    250   0.25 ± 0.02    0.25 ± 0.02    0.28 ± 0.02    0.28 ± 0.02    0.40 ± 0.07    0.50 ± 0.10    0.32 ± 0.03
    300   0.25 ± 0.04    0.25 ± 0.04    0.27 ± 0.02    0.27 ± 0.02    0.39 ± 0.08    0.48 ± 0.11    0.28 ± 0.04
    400   0.20 ± 0.02    0.20 ± 0.02    0.26 ± 0.01    0.26 ± 0.01    0.29 ± 0.06    0.36 ± 0.07    0.24 ± 0.02

TABLE III: MNIST dataset: error on the reference set, computed using the hard loss.

    n     MDf            RCf            MD             RC             KCV            LOO            BTS
    10    33.60 ± 4.53   34.60 ± 4.91   31.40 ± 4.01   31.70 ± 4.00   32.30 ± 3.54   32.30 ± 3.54   35.30 ± 3.56
    20    27.10 ± 2.53   27.10 ± 2.52   27.20 ± 1.05   27.30 ± 1.09   26.70 ± 1.06   25.70 ± 0.61   29.40 ± 1.41
    40    23.80 ± 0.75   23.80 ± 0.75   26.00 ± 0.63   26.00 ± 0.63   24.70 ± 0.88   24.40 ± 0.80   26.00 ± 0.78
    60    23.10 ± 0.54   23.10 ± 0.54   25.90 ± 0.79   26.00 ± 0.80   23.50 ± 0.85   23.20 ± 0.78   24.40 ± 0.53
    80    22.20 ± 0.54   22.30 ± 0.54   25.00 ± 0.51   25.20 ± 0.55   22.70 ± 0.73   22.80 ± 0.84   23.50 ± 0.48
    100   22.00 ± 0.75   22.00 ± 0.75   24.20 ± 0.49   24.20 ± 0.48   21.80 ± 0.77   22.00 ± 0.74   23.40 ± 0.72
    120   20.90 ± 0.50   20.90 ± 0.52   24.10 ± 0.55   24.30 ± 0.50   20.80 ± 0.51   21.00 ± 0.71   22.30 ± 0.51
    150   20.10 ± 0.42   20.10 ± 0.40   23.70 ± 0.48   24.00 ± 0.49   19.80 ± 0.46   20.30 ± 0.78   21.50 ± 0.40
    170   20.00 ± 0.43   20.00 ± 0.43   23.10 ± 0.36   23.60 ± 0.41   19.80 ± 0.51   19.90 ± 0.63   21.10 ± 0.40
    200   19.40 ± 0.41   19.40 ± 0.41   22.70 ± 0.49   23.00 ± 0.51   19.20 ± 0.44   19.10 ± 0.48   20.80 ± 0.40
    250   19.10 ± 0.39   19.10 ± 0.39   22.60 ± 0.43   22.70 ± 0.43   18.60 ± 0.44   18.50 ± 0.42   19.70 ± 0.32
    300   18.50 ± 0.40   18.50 ± 0.41   22.40 ± 0.36   22.50 ± 0.31   17.80 ± 0.31   17.80 ± 0.37   19.00 ± 0.29
    400   17.90 ± 0.32   17.90 ± 0.32   21.40 ± 0.55   21.60 ± 0.57   17.00 ± 0.25   16.90 ± 0.34   18.20 ± 0.27

TABLE IV: DaimlerChrysler dataset: error on the reference set, computed using the hard loss.
    n     MDf            RCf            MD             RC             KCV            LOO   BTS
    10    –              –              –              –              –              –     78.09 ± 1.56
    20    –              –              –              –              85.60 ± 0.76   –     54.56 ± 0.98
    40    77.00 ± 0.00   77.00 ± 0.00   –              –              62.19 ± 0.43   –     32.34 ± 0.58
    60    62.90 ± 0.00   62.90 ± 0.00   83.60 ± 0.52   85.00 ± 0.46   48.28 ± 0.31   –     23.87 ± 0.41
    80    54.50 ± 0.00   54.50 ± 0.00   72.70 ± 0.45   73.90 ± 0.41   39.16 ± 0.25   –     18.21 ± 0.31
    100   48.70 ± 0.00   48.70 ± 0.00   65.30 ± 0.30   66.50 ± 0.32   32.66 ± 0.19   –     15.71 ± 0.24
    120   44.50 ± 0.00   44.50 ± 0.00   59.50 ± 0.33   60.90 ± 0.34   28.41 ± 0.18   –     13.81 ± 0.22
    150   39.80 ± 0.00   39.80 ± 0.00   54.90 ± 0.26   55.10 ± 0.27   24.02 ± 0.17   –     11.29 ± 0.21
    170   37.40 ± 0.00   37.40 ± 0.00   51.80 ± 0.22   52.20 ± 0.23   21.39 ± 0.11   –     10.01 ± 0.17
    200   34.40 ± 0.00   34.40 ± 0.00   47.80 ± 0.19   48.40 ± 0.19   18.66 ± 0.12   –     9.00 ± 0.14
    250   30.80 ± 0.00   30.80 ± 0.00   43.30 ± 0.18   43.20 ± 0.17   15.44 ± 0.11   –     7.13 ± 0.14
    300   28.10 ± 0.00   28.10 ± 0.00   39.70 ± 0.17   39.60 ± 0.16   13.40 ± 0.10   –     6.31 ± 0.16
    400   24.40 ± 0.00   24.40 ± 0.00   34.80 ± 0.15   34.90 ± 0.16   10.23 ± 0.08   –     5.03 ± 0.08

TABLE V: MNIST dataset: error estimation using the soft loss.

    n     MDf            RCf            MD             RC             KCV            LOO   BTS
    10    –              –              –              –              –              –     98.11 ± 4.96
    20    –              –              –              –              –              –     77.76 ± 3.80
    40    88.30 ± 1.92   88.40 ± 1.84   –              –              84.05 ± 2.03   –     63.48 ± 2.18
    60    71.80 ± 1.48   72.10 ± 1.50   –              –              73.58 ± 2.02   –     54.50 ± 1.87
    80    64.40 ± 1.07   64.80 ± 1.11   94.50 ± 0.89   94.30 ± 0.87   68.70 ± 1.13   –     51.50 ± 1.45
    100   58.90 ± 1.06   58.60 ± 1.08   87.90 ± 1.08   87.80 ± 1.01   63.54 ± 1.31   –     48.10 ± 1.37
    120   52.20 ± 0.95   52.20 ± 0.88   82.00 ± 0.81   82.30 ± 0.78   58.93 ± 1.19   –     44.52 ± 1.02
    150   48.80 ± 0.92   48.70 ± 0.89   77.60 ± 0.85   76.90 ± 0.83   55.36 ± 1.36   –     41.83 ± 1.17
    170   44.50 ± 0.88   44.60 ± 0.93   73.90 ± 0.74   73.50 ± 0.74   51.12 ± 0.74   –     39.04 ± 0.94
    200   42.30 ± 0.76   42.30 ± 0.78   71.10 ± 0.81   70.90 ± 0.82   48.64 ± 0.90   –     37.84 ± 0.99
    250   36.70 ± 0.66   36.70 ± 0.65   66.20 ± 0.65   65.60 ± 0.64   44.61 ± 0.82   –     34.79 ± 0.82
    300   35.40 ± 0.58   35.30 ± 0.60   62.80 ± 0.68   62.40 ± 0.64   42.22 ± 0.74   –     33.05 ± 0.77
    400   30.80 ± 0.75   30.80 ± 0.75   58.30 ± 0.74   57.80 ± 0.71   37.97 ± 0.84   –     30.83 ± 0.84

TABLE VI: DaimlerChrysler dataset: error estimation using the soft loss.

    n     KCV            LOO   BTS
    10    95.00 ± 1.99   –     57.65 ± 2.48
    20    77.64 ± 0.76   –     34.15 ± 0.90
    40    52.71 ± 0.32   –     18.63 ± 0.45
    60    39.30 ± 0.25   –     12.79 ± 0.33
    80    31.23 ± 0.24   –     9.74 ± 0.23
    100   25.89 ± 0.17   –     7.86 ± 0.19
    120   22.09 ± 0.13   –     6.59 ± 0.17
    150   18.10 ± 0.15   –     5.30 ± 0.15
    170   16.16 ± 0.09   –     4.69 ± 0.11
    200   13.91 ± 0.10   –     4.00 ± 0.11
    250   11.29 ± 0.09   –     3.21 ± 0.09
    300   9.50 ± 0.10    –     2.68 ± 0.09
    400   7.22 ± 0.07    –     2.02 ± 0.08

TABLE VII: MNIST dataset: error estimation using the hard loss.

    n     KCV            LOO   BTS
    10    95.00 ± 6.88   –     80.75 ± 7.42
    20    77.64 ± 3.83   –     64.80 ± 4.73
    40    75.14 ± 2.18   –     52.41 ± 2.78
    60    58.18 ± 2.27   –     42.17 ± 2.20
    80    59.97 ± 1.44   –     40.29 ± 1.53
    100   50.69 ± 1.67   –     38.98 ± 1.62
    120   43.81 ± 1.45   –     35.52 ± 1.28
    150   43.98 ± 1.52   –     32.94 ± 1.50
    170   39.56 ± 1.02   –     31.09 ± 1.09
    200   34.37 ± 0.91   –     29.71 ± 1.10
    250   32.96 ± 0.97   –     27.68 ± 0.99
    300   31.90 ± 0.89   –     26.27 ± 0.82
    400   27.47 ± 0.86   –     24.43 ± 0.83

TABLE VIII: DaimlerChrysler dataset: error estimation using the hard loss.
    Dataset              MDf             RCf             MD             RC             KCV            LOO            BTS
    Brain Tumor 1        14.00 ± 6.37    14.00 ± 7.66    33.30 ± 0.02   33.30 ± 0.02   17.10 ± 1.94   15.70 ± 1.80   18.90 ± 3.29
    Brain Tumor 2        5.78 ± 2.35     5.78 ± 3.34     76.20 ± 0.12   76.20 ± 0.12   73.90 ± 0.30   73.90 ± 0.27   75.60 ± 0.14
    Colon Cancer 1       27.40 ± 14.78   27.40 ± 13.97   45.30 ± 0.05   45.30 ± 0.05   30.10 ± 3.01   29.00 ± 1.60   29.80 ± 4.14
    Colon Cancer 2       25.50 ± 8.59    25.50 ± 7.92    67.70 ± 0.04   67.70 ± 0.04   67.70 ± 0.07   67.70 ± 0.07   67.70 ± 0.05
    DLBCL                19.10 ± 2.03    19.10 ± 3.40    58.00 ± 0.04   58.00 ± 0.04   57.90 ± 0.06   57.90 ± 0.05   58.00 ± 0.03
    Duke Breast Cancer   33.40 ± 5.07    33.40 ± 5.61    50.00 ± 0.06   50.00 ± 0.06   27.90 ± 1.61   26.40 ± 3.00   31.60 ± 3.08
    Leukemia             14.20 ± 5.05    14.20 ± 5.59    31.20 ± 0.04   31.20 ± 0.04   20.90 ± 3.61   20.60 ± 3.20   23.30 ± 3.29
    Leukemia 1           17.40 ± 4.66    17.40 ± 5.16    38.80 ± 5.07   42.20 ± 3.33   18.20 ± 3.46   17.40 ± 4.15   21.20 ± 4.24
    Leukemia 2           11.30 ± 4.34    11.30 ± 4.08    31.10 ± 0.03   31.10 ± 0.03   9.56 ± 3.08    9.59 ± 2.78    10.80 ± 3.71
    Lung Cancer          11.60 ± 2.93    11.60 ± 3.35    30.20 ± 0.01   30.20 ± 0.00   11.20 ± 1.53   10.60 ± 1.63   13.10 ± 1.38
    Myeloma              8.60 ± 2.00     8.60 ± 2.14     28.40 ± 0.00   28.40 ± 0.00   9.06 ± 0.65    8.63 ± 0.66    11.50 ± 0.60
    Prostate Tumor       17.80 ± 4.65    17.80 ± 3.82    25.00 ± 4.25   25.60 ± 3.96   12.40 ± 3.71   11.80 ± 3.40   15.30 ± 3.93
    SRBCT                10.20 ± 4.47    10.20 ± 4.49    31.50 ± 0.01   31.50 ± 0.01   10.90 ± 1.33   10.50 ± 1.55   13.60 ± 1.51

TABLE X: Human gene expression datasets: error on the reference set, computed using the soft loss.

    Dataset              MDf             RCf             MD              RC              KCV             LOO             BTS
    Brain Tumor 1        5.56 ± 4.52     5.56 ± 4.52     33.30 ± 0.00    33.30 ± 0.00    5.56 ± 4.52     8.89 ± 7.28     5.56 ± 7.82
    Brain Tumor 2        2.86 ± 4.50     2.86 ± 4.50     2.86 ± 4.50     2.86 ± 4.50     2.86 ± 4.50     2.86 ± 4.50     2.86 ± 4.50
    Colon Cancer 1       18.20 ± 16.50   18.20 ± 16.50   54.50 ± 0.00    54.50 ± 0.00    12.70 ± 17.50   16.40 ± 15.50   20.00 ± 13.60
    Colon Cancer 2       18.60 ± 11.00   18.60 ± 11.00   42.90 ± 0.00    42.90 ± 0.00    15.70 ± 6.87    17.10 ± 9.36    17.10 ± 4.50
    DLBCL                8.57 ± 4.58     8.57 ± 4.58     33.30 ± 0.00    33.30 ± 0.00    7.62 ± 6.24     8.57 ± 7.14     8.57 ± 4.58
    Duke Breast Cancer   21.70 ± 10.90   21.70 ± 10.90   48.30 ± 12.50   48.30 ± 12.50   6.67 ± 8.02     10.00 ± 10.50   8.33 ± 9.58
    Leukemia             7.50 ± 6.01     5.00 ± 6.01     31.20 ± 0.00    31.20 ± 0.00    5.00 ± 3.21     2.50 ± 3.94     6.25 ± 5.08
    Leukemia 1           1.00 ± 2.57     1.00 ± 2.57     49.00 ± 2.57    50.00 ± 0.00    2.00 ± 3.15     1.00 ± 2.57     1.00 ± 2.57
    Leukemia 2           3.00 ± 3.15     4.00 ± 4.81     40.00 ± 0.00    40.00 ± 0.00    5.00 ± 4.07     5.00 ± 4.07     5.00 ± 4.07
    Lung Cancer          5.96 ± 3.19     5.96 ± 3.19     34.00 ± 0.00    34.00 ± 0.00    7.23 ± 2.19     5.96 ± 3.19     7.23 ± 2.19
    Myeloma              0.00 ± 0.00     0.00 ± 0.00     28.00 ± 0.00    28.00 ± 0.00    0.00 ± 0.00     0.00 ± 0.00     0.00 ± 0.00
    Prostate Tumor       10.90 ± 4.67    10.90 ± 4.67    34.50 ± 12.00   35.50 ± 10.10   10.90 ± 4.67    10.90 ± 4.67    10.90 ± 4.67
    SRBCT                1.74 ± 2.74     1.74 ± 2.74     39.10 ± 0.00    39.10 ± 0.00    2.61 ± 4.47     3.48 ± 6.52     5.22 ± 8.21

TABLE XI: Human gene expression datasets: error on the reference set, computed using the hard loss.
    Dataset              MDf            RCf            MD             RC             KCV            LOO   BTS
    Brain Tumor 1        61.10 ± 1.25   61.10 ± 1.25   89.50 ± 0.00   90.20 ± 0.00   58.30 ± 0.78   –     31.03 ± 0.54
    Brain Tumor 2        81.70 ± 0.49   81.50 ± 0.02   96.10 ± 0.06   –              82.40 ± 0.04   –     39.50 ± 0.07
    Colon Cancer 1       92.60 ± 2.84   92.50 ± 2.91   –              –              85.40 ± 0.76   –     55.18 ± 0.35
    Colon Cancer 2       74.90 ± 2.36   75.10 ± 2.55   82.70 ± 0.06   91.10 ± 0.06   72.10 ± 0.01   –     32.44 ± 0.02
    DLBCL                66.30 ± 1.02   66.30 ± 1.02   66.90 ± 0.08   74.10 ± 0.07   59.10 ± 0.02   –     23.70 ± 0.02
    Duke Breast Cancer   –              –              –              –              85.30 ± 2.57   –     60.27 ± 4.02
    Leukemia             68.20 ± 0.96   67.60 ± 1.34   –              –              66.10 ± 1.22   –     36.21 ± 1.61
    Leukemia 1           70.70 ± 1.43   70.90 ± 1.44   –              –              70.00 ± 1.45   –     41.80 ± 1.52
    Leukemia 2           68.30 ± 0.43   69.00 ± 1.23   99.90 ± 0.01   99.20 ± 0.01   63.40 ± 0.86   –     33.41 ± 1.55
    Lung Cancer          39.10 ± 0.01   39.10 ± 0.01   70.60 ± 0.00   70.20 ± 0.00   42.00 ± 0.66   –     15.68 ± 0.81
    Myeloma              54.50 ± 0.00   54.50 ± 0.00   82.70 ± 0.01   82.60 ± 0.01   51.10 ± 0.36   –     22.77 ± 0.77
    Prostate Tumor       56.90 ± 1.39   56.90 ± 1.39   –              –              63.50 ± 0.66   –     36.83 ± 1.08
    SRBCT                63.40 ± 0.63   63.50 ± 0.75   97.70 ± 0.00   97.60 ± 0.00   58.20 ± 0.64   –     31.10 ± 1.83

TABLE XII: Human gene expression datasets: error estimation using the soft loss.

    Dataset              KCV            LOO   BTS
    Brain Tumor 1        34.04 ± 2.03   –     22.05 ± 1.46
    Brain Tumor 2        56.49 ± 1.29   –     20.50 ± 1.68
    Colon Cancer 1       56.49 ± 5.14   –     49.29 ± 9.54
    Colon Cancer 2       46.43 ± 4.11   –     38.65 ± 3.85
    DLBCL                41.43 ± 1.71   –     21.21 ± 3.54
    Duke Breast Cancer   60.79 ± 2.40   –     45.08 ± 3.90
    Leukemia             41.43 ± 1.60   –     21.21 ± 1.92
    Leukemia 1           43.79 ± 1.60   –     22.70 ± 3.63
    Leukemia 2           43.79 ± 2.57   –     22.70 ± 0.99
    Lung Cancer          17.47 ± 0.88   –     13.00 ± 1.16
    Myeloma              31.23 ± 0.00   –     9.74 ± 0.18
    Prostate Tumor       47.07 ± 1.64   –     20.00 ± 1.14
    SRBCT                39.30 ± 2.57   –     19.91 ± 1.68

TABLE XIII: Human gene expression datasets: error estimation using the hard loss.

    n     MDf          RCf          MD           RC           KCV         LOO          BTS
    10    0.1 ± 0.1    0.1 ± 0.1    0.1 ± 0.1    0.1 ± 0.1    0.0 ± 0.1   0.0 ± 0.1    0.0 ± 0.1
    20    0.3 ± 0.1    0.4 ± 0.1    0.3 ± 0.1    0.3 ± 0.1    0.0 ± 0.1   0.1 ± 0.1    0.0 ± 0.1
    40    0.7 ± 0.2    0.6 ± 0.2    0.6 ± 0.1    0.7 ± 0.1    0.1 ± 0.1   0.3 ± 0.1    0.0 ± 0.1
    60    1.1 ± 0.3    1.1 ± 0.2    1.1 ± 0.1    1.2 ± 0.1    0.1 ± 0.1   0.8 ± 0.1    0.0 ± 0.1
    80    2.3 ± 0.2    2.2 ± 0.4    2.0 ± 0.1    1.9 ± 0.2    0.3 ± 0.1   2.0 ± 0.3    0.1 ± 0.1
    100   2.4 ± 0.2    2.6 ± 0.3    2.7 ± 0.2    2.7 ± 0.2    0.7 ± 0.1   2.9 ± 0.7    0.2 ± 0.2
    120   4.5 ± 0.4    3.9 ± 0.3    4.2 ± 0.3    4.0 ± 0.2    1.1 ± 0.3   5.1 ± 0.4    0.5 ± 0.2
    150   11.4 ± 0.4   10.9 ± 0.4   10.1 ± 0.3   9.7 ± 0.2    1.7 ± 0.3   9.3 ± 0.4    0.7 ± 0.1
    170   17.8 ± 0.4   17.1 ± 0.3   12.8 ± 0.3   11.9 ± 0.2   2.4 ± 0.3   13.4 ± 0.4   0.9 ± 0.2
    200   25.1 ± 0.3   24.8 ± 0.4   21.3 ± 0.3   20.5 ± 0.2   2.9 ± 0.2   18.1 ± 1.1   1.4 ± 0.3
    250   27.4 ± 0.4   28.9 ± 0.4   25.2 ± 0.2   25.9 ± 0.3   4.1 ± 0.4   27.6 ± 1.2   1.9 ± 0.3
    300   48.1 ± 0.4   47.2 ± 0.4   36.1 ± 0.2   36.8 ± 0.2   4.9 ± 0.3   39.3 ± 1.1   2.3 ± 0.3
    400   58.9 ± 0.6   59.4 ± 0.4   44.2 ± 0.2   44.1 ± 0.2   6.3 ± 0.3   59.7 ± 1.3   4.1 ± 0.4

TABLE XIV: MNIST dataset: computational time (in seconds) required by the different in–sample and out–of–sample procedures.