Universidad de Granada
Departamento de Ciencias de la Computación e Inteligencia Artificial
Programa de Doctorado en Ciencias de la Computación y Tecnología Informática

Algoritmos evolutivos de codificación real para el problema de generación de prototipos en aprendizaje supervisado y semi-supervisado basado en instancias

Doctoral Thesis
Isaac Triguero Velázquez
Granada, March 2014

Publisher: Editorial de la Universidad de Granada. Author: Isaac Triguero Velázquez. D.L.: GR 374-2015. ISBN: 978-84-9083-269-1.

Universidad de Granada

Algoritmos evolutivos de codificación real para el problema de generación de prototipos en aprendizaje supervisado y semi-supervisado basado en instancias

A dissertation submitted by Isaac Triguero Velázquez to qualify for the degree of Doctor in Computer Science. March 2014.

Advisors: Francisco Herrera Triguero and Salvador García López
Departamento de Ciencias de la Computación e Inteligencia Artificial

The dissertation entitled "Algoritmos evolutivos de codificación real para el problema de generación de prototipos en aprendizaje supervisado y semi-supervisado basado en instancias", submitted by Isaac Triguero Velázquez to qualify for the degree of doctor, has been carried out within the Official Doctoral Programme in "Ciencias de la Computación y Tecnología Informática", at the Departamento de Ciencias de la Computación e Inteligencia Artificial of the Universidad de Granada, under the supervision of Dr. Francisco Herrera Triguero and Dr. Salvador García López.

By signing this doctoral thesis, the candidate and the thesis advisors guarantee that the work has been carried out by the candidate under the supervision of the advisors and that, to the best of our knowledge, the rights of other authors to be cited have been respected whenever their results or publications have been used.

Granada, March 2014. The doctoral candidate: Isaac Triguero Velázquez. The advisors: Francisco Herrera Triguero and Salvador García López.

This doctoral thesis has been developed with the funding of the predoctoral grant attached to the research project of excellence P10-TIC-6858 of the Junta de Andalucía. It has also been supported by projects TIN2008-06681-C06-01 and TIN2011-28488 of the Ministerio de Ciencia e Innovación.

Dedicated to the memory of my father: D. Gabriel Triguero

Acknowledgments

This thesis is especially dedicated to the memory of my father, D. Gabriel Triguero, because even though he could not be present during its course, he was its driving force and, for me, the motivation I needed to carry it out. I would also like to thank and dedicate this work to my mother, my brother, my nephews, my uncles and aunts, my cousins and, finally, my grandparents, may they rest in peace. All of them have always supported me throughout my education. This dissertation is by and for you.

From the academic point of view, I would first like to thank my thesis advisors, Francisco Herrera and Salvador García, for all the effort, dedication and interest they have devoted to me. I am convinced that without them, without their many pieces of advice and without the knowledge they have passed on to me, it would have been impossible to finish this doctoral thesis. For me, Paco and Salva are synonymous with quality research, and I hope to keep carrying out this activity alongside them. I would also like to thank José Manuel Benítez for his support; although not directly involved in my doctoral thesis, he has certainly influenced my education during this period.
Thanks to him, my favorite film will always be "Hércules", and I am still waiting for the new "Ulises" one. I also owe a lot to my comrades-in-arms. First of all, to Joaquín and his "pedantic english", for all the advice and ideas that have undoubtedly been very important for this thesis. To my classmates Vicky, Álvaro and José Antonio; to the seniors of the group, the Alcalá brothers, Alberto, Julián, etc.; and to the not-so-senior ones, Nacho, M. Cobo, Cristóbal (Jaén), Christoph, Fran, Michela, etc. From here, I would also like to thank and encourage the kids of the future: Dani, Pablo, Sergio, Sara, Juanan, Rosa, Alber, M. Parra, Lala, etc. I would also like to thank Jaume for all his support during my research stay at the University of Nottingham. I also thank German and Nicola for the good moments we spent running through the "hilly" Notts.

I cannot forget my friends in this message of gratitude: Álvaro (again, yes), his jokes, and our many Red Bulls and Sundays shared in the Orquídeas building. My lifelong friends, whom I consider part of my family, Trujillo and Emilio. My good friend Manolo, training and racing partner, whom I hope will write a dissertation like this one in a few years. Finally, all my friends from Atarfe and my hobby and stage companions.

I wanted to leave for the end a very special thank you to the person who knows first-hand the work this doctoral thesis has entailed, as if she had done it herself. Without her affection, her understanding and her motivation during all these years I would never have been able to do it. Thank you, Marta!

My gratitude also goes to all those people who, despite not being mentioned, have been no less important in the completion of this dissertation. THANK YOU ALL.

Table of Contents

I PhD dissertation
1. Introduction
Introducción
2. Preliminaries
  2.1 Nearest neighbor classification: Data reduction approaches
  2.2 Evolutionary algorithms
  2.3 Semi-supervised classification
  2.4 Big data
3. Justification
4. Objectives
5. Summary
  5.1 Prototype Generation for Supervised Learning
    5.1.1 A review on Prototype Generation
    5.1.2 New Prototype Generation Methods based on Differential Evolution
    5.1.3 Integrating Prototype Selection and Feature Weighting within Prototype Generation
    5.1.4 Enabling Prototype Reduction Models to deal with Big Data Classification
  5.2 Self-labeling with prototype generation/selection for semi-supervised classification
    5.2.1 A Survey on Self-labeling Semi-Supervised Classification
    5.2.2 New Self-labeling Approaches Aided by Prototype Generation/Selection Models
6. Discussion of results
  6.1 Prototype Generation for supervised learning
    6.1.1 A review on Prototype Generation
    6.1.2 New Prototype Generation Methods based on Differential Evolution
    6.1.3 Integrating Prototype Selection and Feature Weighting within Prototype Generation
    6.1.4 Enabling Prototype Reduction Models to deal with Big Data Classification
  6.2 Self-labeling with Prototype Generation/Selection for Semi-Supervised Classification
    6.2.1 A Survey on Self-labeling Semi-Supervised Classification
    6.2.2 New Self-labeling Approaches Aided by Prototype Generation/Selection Models
7. Concluding Remarks
Conclusiones
8. Future Work

II Publications: Published and Submitted Papers
1. Prototype generation for supervised classification
  1.1 A Taxonomy and Experimental Study on Prototype Generation for Nearest Neighbor Classification
  1.2 IPADE: Iterative Prototype Adjustment for Nearest Neighbor Classification
  1.3 Differential Evolution for Optimizing the Positioning of Prototypes in Nearest Neighbor Classification
  1.4 Integrating a Differential Evolution Feature Weighting scheme into Prototype Generation
  1.5 MRPR: A MapReduce Solution for Prototype Reduction in Big Data Classification
2. Self-labeling with prototype generation/selection for semi-supervised classification
  2.1 Self-Labeled Techniques for Semi-Supervised Learning: Taxonomy, Software and Empirical Study
  2.2 On the Characterization of Noise Filters for Self-Training Semi-Supervised in Nearest Neighbor Classification
  2.3 SEG-SSC: A Framework based on Synthetic Examples Generation for Self-Labeled Semi-Supervised Classification
Bibliografía

Chapter I
PhD dissertation

1. Introduction

In recent years, there has been a rapid advance in technology and communications, led by the great expansion of the Internet. As a consequence, there is an increasing need to process and classify large quantities of data. This is especially important in a wide variety of fields, such as astronomy, geology, medicine or the interpretation of the human genome, in which the amount of available information has increased considerably.
However, the real value of data lies in the possibility of extracting valuable knowledge for making decisions or for exploring and understanding the phenomenon that produced the data. Otherwise, we would end up in the situation described by the expression "the world is becoming data rich but knowledge poor" [Bra07]. The processing and collection of these data, manually or semi-automatically, becomes impossible when the size of the data and the number of dimensions or parameters grow excessively. Nowadays, it is common to find databases with millions of records and thousands of dimensions, so only computers can automate this process. Therefore, automatic procedures are required to acquire valuable knowledge from data.

Data Mining consists of solving real problems by analyzing the data presented in them. In the literature, it is described as the science and technology of exploring data, aiming to discover unknown patterns already present in them. Many people regard Data Mining as a synonym of the Knowledge Discovery in Databases (KDD) process, while others view Data Mining as the main step of KDD [TSK05, WFH11, HKP11]. There are several definitions of the KDD process. For example, in [FPSS96] the authors define it as "the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data". In [Fri97], the KDD process is considered an automatic exploratory analysis of large databases.

A key aspect that characterizes the KDD process is the way in which it is divided into stages, according to the agreement of several important researchers in the topic. There are several ways to make this division, each with advantages and weaknesses [Han12]. In this thesis, we adopt a hybridization widely used in recent years that organizes these stages into the following four steps:

• Problem Definition: Selection of the relevant data that form the problem according to the prior knowledge provided by the experts, and definition of the goals pursued by the end-user. It also includes the comprehension of both the selected data and the associated expert knowledge in order to achieve a high degree of reliability.

• Data Gathering and Preparation: This stage includes operations for data cleaning, data integration, data transformation and data reduction. The first consists of the removal of noisy and inconsistent data. The second tries to combine multiple data sources into a single one. The third transforms and consolidates the data into forms that are appropriate for performing data mining tasks. Finally, the data reduction process includes the selection and extraction of both features and examples in a database. This phase aims to ease the development of the following stages.

• Model Building and Evaluation: This is the process in which methods are used to extract valid data patterns. Firstly, this step includes the choice of the most suitable data mining task, such as classification, regression, clustering or association, and the choice of the data mining algorithm itself, belonging to one of the previous families. Secondly, it involves the adaptation of the selected algorithm to the addressed problem, tuning essential parameters and applying validation procedures. Finally, it comprises estimating and interpreting the mined patterns based on different measures of interest.
• Knowledge Deployment: This last stage involves the description of the discovered patterns so that they are useful for the end-user.

Figure 1: The KDD process

Figure 1 shows the KDD process and reveals the four main stages mentioned previously. It is worth mentioning that all the stages are interconnected, showing that the KDD process is actually a self-organized scheme in which each stage has repercussions on the remaining stages.

As commented above, the main phase of the KDD process is also known as Data Mining. This discipline focuses on a narrower objective (with respect to the whole KDD process) that consists of the identification of patterns and the prediction of relationships from data. It is noteworthy that the success of a data mining technique does not rely only on its performance. These techniques are sensitive to the quality of the information provided. Thus, the higher the quality of the data, the better the decisions the generated models will be able to make. In this sense, preprocessing techniques are necessary in the KDD process before applying data mining techniques.

Data mining techniques are commonly categorized as descriptive or predictive methods. The former type is devoted to discovering interesting patterns among data. The latter aims to predict the behavior of a model through the analysis of the data. Both descriptive and predictive data mining processes are conducted by machine learning algorithms [MD01, Alp10, WFH11]. The main objective of machine learning tools is to induce knowledge from problems that do not admit a straightforward and efficient algorithmic solution, or that are informally or vaguely defined. Depending on the information available to perform a machine learning task, three different problems can be considered:

• Supervised learning: In this problem, the values of the target variable/s and a set of input variables are known. The aim is to establish a relation between the input variables and the output variable/s in order to predict the output variable/s of new examples (whose target value is unknown).

– In classification [DHS00] the values of the target variable/s are discrete and a finite number of values (known as labels or classes) are available; for instance, the different types of a disease, such as hepatitis A, B and C.

– In regression [CM98] the values of the target variable/s are continuous (real-coded); for example, temperature, electric consumption, weight, etc.

• Unsupervised learning: The values of the target variable/s are unknown. The aim lies in the description of relations and patterns in the data. The most common categories of unsupervised machine learning algorithms are clustering and association.

– In clustering [Har75], the process consists of splitting the data into several groups, with the examples belonging to each group being as similar as possible to each other.

– Association [AIS93] is devoted to identifying relations in transactional data.

• Semi-supervised learning: This type of problem lies between the two previous learning paradigms: machine learning algorithms are provided with some examples for which the values of the target variable/s are known (generally a very reduced number) and many others for which the values of the target variable/s are unknown. In this paradigm, we can consider the prediction of target variables as a classification or regression problem, as well as the description of the data as a clustering or an association process.
Thus, it is considered an extension of unsupervised and supervised learning, obtained by including additional information typical of the other learning paradigm [ZG09].

In this thesis, we will focus on both supervised and semi-supervised classification. A classification method is defined as a technique that learns how to categorize examples or elements into several predefined classes. Broadly speaking, a classifier learns a model from an input data set (denoted as the training set), and then the model is applied to predict the class value of a given set of examples (the test set) that have not been used in the learning process. To measure the performance of a classifier, several criteria can be used:

• Accuracy: It measures the confidence of the learned classification model. It is usually estimated as the percentage of test examples correctly classified over the total.

• Efficiency: Time spent by the model when classifying a test example.

• Interpretability: Clarity and credibility, from the human point of view, of the classification model.

• Learning time: Time required by the machine learning algorithm to build the classification model.

• Robustness: Minimum number of examples needed to obtain a precise and reliable classification model.

In the specialized literature, there are different approaches to perform classification tasks in both supervised and semi-supervised contexts with successful results: for example, statistical techniques, discriminant functions, neural networks, decision trees, support vector machines and so on. However, as stated before, these techniques may be useless when the input data are impure, leading to the extraction of wrong models. Therefore, the preprocessing of the data becomes one of the most relevant stages to enable data mining methods to obtain better and more accurate information [ZZY03]. Data preparation can generate a smaller data set than the original one, improving the efficiency of the data mining tool; moreover, the preparation yields high quality data, which may result in high quality models.

Among the data preparation strategies [Pyl99], data reduction methods aim to simplify data in order to enable data mining algorithms to be applied not only in a faster way, but also in a more accurate way, by removing noisy and redundant data. From the perspective of attributes or variables, the best-known data reduction processes are feature selection, feature weighting and feature generation [LM07]. Taking into consideration the instance space, we can highlight instance reduction methods [GDCH12, TDGH12]. Feature selection consists of choosing a representative subset of features from the original feature space, while feature generation creates new features to describe the data. From a similar point of view, feature weighting schemes assign a weight to each feature of the domain of the problem to modify the way in which distances between examples are computed [PV06]. This technique can be viewed as a generalization of feature selection algorithms: it provides a soft approximation of the relevance degree of each feature by assigning a real value as a weight, so different features can receive different treatments.

An instance reduction technique is devoted to finding the best reduced set that represents the original training data with a smaller number of instances. Its main purposes are to speed up the classification process and to reduce both the storage requirements and the sensitivity to noisy examples. This methodology can be divided into Instance Selection (IS) [MFV02, GCH08] and Instance Generation or abstraction (IG), depending on how the reduced set is created [Koh90, LKL02]. The former attempts to choose an appropriate subset of the original training data, while the latter can also build new artificial instances to better adjust the decision boundaries of the classes. In this manner, the IG process fills some regions in the domain of the problem which have no representative examples in the original data set. Most instance reduction techniques have been focused on enhancing the Nearest Neighbor (NN) classifier [CH67]. In the specialized literature, when IS or IG are applied to instance-based learning algorithms, they are commonly referred to as Prototype Selection (PS) and Prototype Generation (PG), respectively [DGH10a]. The NN rule is a nonparametric instance-based algorithm [AKA91] for pattern classification [DHS00, HTF09]. It belongs to the lazy learning family of methods, which refers to those methods that predict the class label from the raw training data and do not provide a learning model.
This methodology can be divided into Instance Selection (IS) [MFV02, GCH08] and Instance Generation or abstraction (IG) depending on how it creates the reduced set [Koh90, LKL02]. The former attempts to choose an appropriate subset of the original training data, while the latter can also build new artificial instances to better adjust the decision boundaries of the classes. In this manner, the IG process fills some regions in the domain of the problem, which have no representative examples in the original dataset. Most of the instance reduction techniques have been focused on enhancing the Nearest Neighbor (NN) classifier [CH67]. In the specialized literature, when IS or IG are applied to instance-based learning algorithms, they are commonly referred as Prototype Selection (PS) and Prototype Generation (PG), respectively [DGH10a]. The NN rule is a nonparametric instance-based algorithm [AKA91] for pattern classification [DHS00, HTF09]. It belongs to the lazy learning family of methods, which refers to those methods that predicts the class label from raw training data and does 1. Introduction 5 not provide a learning model. The NN algorithm predicts the class of a test sample according to a concept of similarity [CGG+ 09] between examples (commonly Euclidean distance). Thus, the predicted class a given test sample is set equal to the most frequent class among its nearest training samples. Despite its simplicity, the NN rule has demonstrated itself to be one of the most important and effective technique in data mining and pattern recognition, being considered one of the top ten methods in data mining in [WK09]. However, the NN classifier also suffers from several problems that PS and PG techniques have been trying to alleviate. Four main weaknesses could be mentioned: • High storage requirements: It needs all the examples of the training set to classify a test example. • High computational cost: Each classification implies the computation of similarities between the test sample and all the examples of the training set. • Low tolerance to noisy instances: All the training data are supposed to be relevant, so that, noisy data may induce to incorrect classifications. • Wrong suppositions about the input training examples: The NN rule makes predictions over existing data, assuming that input data perfectly delimits the decision boundaries between classes. PS technique are limited to address the first three weaknesses, but it also assumes that the best representative examples can be obtained from a subset of the original data, whereas PG methods generate new representative examples if needed, thus tackling also the fourth weakness mentioned above. In the literature, there was no a complete categorization for PG methods and they were frequently confused with PS. A considerable number of PG methods have been proposed and some of them are rather unknown. The absence of a focused taxonomy on PG produces that new algorithms are usually compared with only a subset of the complete family of PG methods and, in most of the studies, no rigorous analysis has been carried out. Among the great number of existing PS and PG techniques we can highlight as the most promising approaches those that are based on Evolutionary Algorithms (EAs) [ES08]. EAs have been widely used in many different data mining problems [Fre02, PF09] acting as optimization strategies. EAs are a set of modern meta heuristics used successfully in many applications with great complexity. 
Their success in solving difficult problems has been the engine of a field known as Evolutionary Computation [GJ05]. These techniques are domain independent, which makes them ideal for applications where domain knowledge is difficult to provide. Moreover, they have the ability to explore large search spaces, consistently finding good solutions. Given that the PS and PG problems can be seen as combinatorial and optimization problems, EAs have been used to solve them with successful results [CHL03, NL09]. Concretely, PS can be expressed as a binary space search problem, whereas PG is expressed as a continuous space search problem. Until now, existing evolutionary PG techniques did not take into consideration the selection of the most appropriate number of prototypes per class when applying the optimization process, which was their main drawback.

Very recently, the term big data has been coined to refer to the challenges and advantages derived from collecting and processing vast amounts of data [Mar13]. Formally, it is defined as the quantity of data that exceeds the processing capabilities of a given system. It is attracting much attention in data mining because the knowledge extraction process from big data has become a very difficult task for most classical and advanced techniques. The main challenges are to deal with the increasing scale of data at the level of instances and features, and with the growing complexity of the problem. Nowadays, with the availability of cloud platforms [PBA+08], we have sufficient processing units at our disposal to extract valuable knowledge from massive data. Therefore, the adaptation of data mining techniques to emerging technologies, such as distributed computation, will be a mandatory task to overcome their limitations. As such, data reduction techniques should enable data mining algorithms to address big data problems with greater ease; on the other hand, these methods are also affected by the increase in size and complexity of data sets, and they may be unable to provide a preprocessed data set in an acceptable time. Enhancing the scalability of data reduction techniques is becoming a challenging topic.

In semi-supervised classification, the main problem is the lack of labeled examples. This problem has been addressed by several approaches with different assumptions about the characteristics of the input data [BC01, FUS08, Joa99]. Among them, self-labeled techniques do not make any specific assumptions about the input data. These techniques follow an iterative procedure, aiming to obtain an enlarged labeled data set by labeling the most confident unlabeled data within a supervised framework (a minimal self-training sketch is given after the list below). They accept that their own predictions tend to be correct [Yar95, BM98]. A wide variety of self-labeling methods have been presented with successful applications [LZ07, JP08]. However, in the literature, there was no taxonomy of methods stating their main benefits and drawbacks. These methods present two main weaknesses:

• The addition of noisy examples to the enlarged labeled set, especially in the early stages of the self-labeling process, may lead to building wrong models. Hence, reducing the size of the unlabeled set by detecting noisy examples becomes an important task.

• They are limited by the number of labeled points and their distribution when identifying reliable unlabeled examples. This problem is much more pronounced when the labeled ratio is greatly reduced and the labeled examples do not minimally represent the domain.
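The following is a minimal self-training sketch of the iterative procedure described above, assuming a generic probabilistic classifier; the k-NN base learner, the confidence threshold and all names are illustrative placeholders rather than any specific method studied in this thesis.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def self_training(X_lab, y_lab, X_unlab, threshold=0.9, max_iter=10):
    """Minimal self-training loop: iteratively move the most confident
    unlabeled examples (with their predicted labels) into the labeled set."""
    model = KNeighborsClassifier(n_neighbors=3)
    X_l, y_l, X_u = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    for _ in range(max_iter):
        if len(X_u) == 0:
            break
        model.fit(X_l, y_l)
        proba = model.predict_proba(X_u)      # class probabilities per unlabeled example
        conf = proba.max(axis=1)              # confidence of each prediction
        mask = conf >= threshold              # keep only the most confident ones
        if not mask.any():
            break
        y_new = model.classes_[proba.argmax(axis=1)]
        X_l = np.vstack([X_l, X_u[mask]])     # enlarge the labeled set
        y_l = np.concatenate([y_l, y_new[mask]])
        X_u = X_u[~mask]                      # shrink the unlabeled set
    model.fit(X_l, y_l)
    return model
```

Both weaknesses listed above are visible in this loop: a wrong but confident prediction pollutes the enlarged labeled set permanently, and if the initial labeled points are unrepresentative, no unlabeled example may ever exceed the threshold.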
To the best of our knowledge, until the writing of this thesis, the application of PS and PG techniques was limited to supervised classification tasks, and they had not been applied within semi-supervised approaches. However, their utilization can help to alleviate the previous problems by reducing the number of noisy examples in the unlabeled set and by introducing newly generated labeled examples.

The present thesis is developed in two main parts: prototype generation for (1) supervised and (2) semi-supervised learning. It is very important to note that in the first part of this thesis PG acts as a pure data reduction technique, while in the second part PG methods work as an unlabeled data reduction algorithm as well as a generator of new labeled examples.

• In the former part, a deep study of the PG field will be performed, determining which families of methods are more promising and what their main advantages and drawbacks are. Then, we will design new PG techniques based on evolutionary algorithms, determining the best proportion of examples per class, to improve current approaches. After that, we will combine PG with other data reduction approaches to increase classification accuracy. Finally, we will develop a cloud-based framework that enables prototype reduction techniques to be applied to big data problems.

• The latter part of this thesis is devoted to semi-supervised classification. In particular, we will perform a survey of those semi-supervised learning methods that are based on self-labeling. Then, we will develop new algorithms to overcome their main drawbacks following two perspectives: (a) reducing the number of noisy examples in the unlabeled set and (b) generating synthetic data with PG techniques in order to diminish the influence of the lack of labeled examples.

After this introductory section, the next section (Section 2.) is devoted to describing in detail the four main related areas: data reduction for NN classification (Section 2.1), evolutionary algorithms (Section 2.2), semi-supervised classification (Section 2.3) and big data (Section 2.4). All of them are fundamental areas for defining and describing the proposals presented as results of this thesis. Next, the justification of this memory is given in Section 3., describing the open problems addressed. The objectives pursued when tackling them are described in Section 4. Section 5. presents a summary of the works that compose this memory. A joint discussion of results is provided in Section 6., showing the connection between the objectives and how each of them has been reached. A summary of the conclusions drawn is provided in Section 7. Finally, in Section 8. we point out several open lines of future work derived from the results achieved.

The second part of the memory consists of eight journal publications, organized into two main sections: supervised and semi-supervised learning. These publications are the following:

• Prototype generation for supervised classification:

– A Taxonomy and Experimental Study on Prototype Generation for Nearest Neighbor Classification.
– IPADE: Iterative Prototype Adjustment for Nearest Neighbor Classification.
– Differential Evolution for Optimizing the Positioning of Prototypes in Nearest Neighbor Classification.
– Integrating a Differential Evolution Feature Weighting Scheme into Prototype Generation.
– MRPR: A MapReduce Solution for Prototype Reduction in Big Data Classification.

• Self-labeling with prototype generation/selection for semi-supervised classification:

– Self-Labeled Techniques for Semi-Supervised Learning: Taxonomy, Software and Empirical Study.
– On the Characterization of Noise Filters for Self-Training Semi-Supervised in Nearest Neighbor Classification.
– SEG-SSC: A Framework based on Synthetic Examples Generation for Self-Labeled Semi-Supervised Classification.
2. Preliminaries

In this section we describe all the background information involved in this thesis. Firstly, Section 2.1 presents the NN classifier and the data reduction techniques that improve its performance. Secondly, Section 2.2 describes the use of EAs in data mining, detailing those EAs used in this thesis. Thirdly, Section 2.3 gives a snapshot of semi-supervised classification. Finally, Section 2.4 provides information about the big data problem, its main characteristics and some solutions that have been proposed.

2.1 Nearest neighbor classification: Data reduction approaches

The NN algorithm was proposed in [FH51] by Fix and Hodges in 1951 as a nonparametric classifier. However, its popularity increased after most of its main properties were described by Cover and Hart [CH67] in 1967. Its extension to the k nearest neighbors (k-NN) is nowadays considered one of the most important data mining techniques [WK09]. As a nonparametric classifier (as opposed to parametric ones), it does not assume any specific distribution or structure in the data. It is based on a very intuitive approach to classifying new examples: similarity between examples. Patterns that are similar, in some sense, should be assigned to the same class. The naive implementation of this rule has no learning phase, in that it uses all the training set objects in order to classify new incoming data. Thus, it belongs to the lazy learning family of methods [AKA91], in contradistinction to eager learning models [Mit97] that build a model during the learning (training) phase. Its theoretical properties guarantee that its probability of error is bounded above by twice the Bayes error probability for all distributions [GDCH12].

In supervised classification, a given data set is typically divided into training and test partitions, obtaining a training set composed of N samples and a test set with M samples. Each sample is formed by an attribute vector that contains the information describing the example. The standard k-NN algorithm uses the k nearest (most similar) examples of the training set to predict the class of a test pattern. Figure 3 presents a simple example of how the decision is taken depending on the number of neighbors used.

Figure 3: An illustrative example of the NN algorithm on a problem with two classes, red circles and green triangles. The blue square represents a test example that should be classified according to its nearest neighbors: taking into consideration the 3 nearest neighbors it is classified as a red circle, whereas using the 5 nearest neighbors it would be marked as a green triangle.

Despite its well-known performance, it suffers several shortcomings, such as the necessity of storing the full training set when performing a classification, the high computational cost, the low tolerance to noise (especially when k is set to 1) and the fact that the NN classifier focuses exclusively on existing data.
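As a concrete illustration of the decision rule just described, here is a minimal k-NN sketch; it assumes NumPy arrays and the Euclidean distance, and all names are illustrative.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    """Classify x_test by majority vote among its k nearest training samples,
    with the Euclidean distance as the similarity concept."""
    dists = np.linalg.norm(X_train - x_test, axis=1)  # distance to every training sample
    nearest = np.argsort(dists)[:k]                   # indices of the k nearest neighbors
    votes = Counter(y_train[i] for i in nearest)      # count class labels among them
    return votes.most_common(1)[0][0]                 # predict the most frequent class
```

As in Figure 3, the prediction can change with k: `knn_predict(X, y, x, k=3)` and `knn_predict(X, y, x, k=5)` may return different classes for the same test point.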
These weaknesses have been widely studied in the literature, giving rise to multiple solutions, such as different similarity measures [PV06], optimization of the choice of the k parameter [Gho06] (Figure 3 showed an example of the variability of the decision according to this value), the design of fast and approximate approaches [GCB97] and the reduction of the training set [GDCH12]. Among these solutions, this thesis focuses on data preparation [Pyl99] for NN classifiers via data reduction. The aim of these techniques is to reduce the size of the training set in order to improve the performance of the classifier with respect to efficiency and storage requirements. This field comprises different techniques; we highlight the following alternatives:

• Feature selection (FS): It reduces the data by removing irrelevant or redundant features [LM07]. In particular, the goal of feature selection is to find a minimum set of attributes such that the resulting probability distribution of the output attributes (or classes) is as close as possible to the original distribution obtained using all attributes. It increases generality and efficiency thanks to the reduction in the number of attributes per instance to process. Moreover, it enables the NN classifier to deal with high-dimensional problems.

• Feature weighting (FW): This does not select a subset of features; instead, it modifies the way in which similarity between instances is computed, according to feature weights. Thus, the aim of these techniques is to determine the degree of importance of each feature. A good review of this topic can be found in [WAM97].

• Feature generation (FG): These techniques are also called feature extraction [GGNZ06]. Their objective is to find new features that better describe the training data as a function of the original ones. Therefore, in feature extraction, apart from the removal of attributes, subsets of attributes can be merged or can contribute to the creation of artificial substitute attributes. Linear and nonlinear transformations and statistical techniques such as principal component analysis [Jol02] are classical examples of these techniques.

• Instance selection (IS): It consists of choosing the most representative instances in the training data [GDCH12]. Its focus is to find the best reduced set of instances from the original training set, one that does not contain noisy or irrelevant examples. These techniques select the best subset by using rules and/or heuristics (even metaheuristics). When applied to improve the NN classifier, they are usually denoted prototype selection (PS).

• Instance generation (IG): IG is considered an extension of IS in which, in addition to selecting data, these techniques are able to generate new artificial examples [LKL02]. The main objective is to represent the original training data with the lowest number of instances. As before, instance generation techniques applied to NN classification are named prototype generation (PG). Figure 4 illustrates an example of PG.

Figure 4: Illustrative PG example, extracted from [Koh90]. Subfigure (A) represents the original training data; the centroid of each class (red and blue) and the decision boundary are depicted. Subfigure (B) shows a heavily reduced selection/generation of points with a smooth decision boundary, similar to the original one.
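To illustrate how FS and FW in the list above differ in the way similarity is computed, the following minimal sketch (illustrative names, assuming NumPy) uses a weighted Euclidean distance; a binary weight vector recovers feature selection, while continuous weights grade the relevance of each attribute:

```python
import numpy as np

def weighted_distance(x, y, w):
    """Euclidean distance in which w grades the relevance of each feature."""
    return np.sqrt(np.sum(w * (x - y) ** 2))

x = np.array([1.0, 5.0, 0.2])
y = np.array([2.0, 3.0, 0.1])

w_fs = np.array([1.0, 0.0, 1.0])   # FS as a special case: feature 2 discarded
w_fw = np.array([0.9, 0.1, 0.5])   # FW: continuous degrees of importance
print(weighted_distance(x, y, w_fs), weighted_distance(x, y, w_fw))
```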
Among these techniques, we will focus on PG and FW as the natural extensions of PS and FS, respectively. The main difference between these kinds of techniques is the way in which the problem is defined. On the one hand, PS and FS are binary search problems: the selection can be represented as a binary array in which each element takes the value 1 if the corresponding instance/feature is currently selected by the algorithm and 0 otherwise. Therefore, there are a total of 2^N possible subsets, where N is the number of instances/features of the data set. On the other hand, PG and FW are real-valued search problems, so they may be represented with real-valued arrays or matrices. Hence, they provide more general frameworks, allowing modifications of the internal values that represent each example or attribute. However, the search space becomes much more complex than in the binary case. Given the complexity of PG models, some works use a previous PS step [KO03a]. We will use the term Prototype Reduction (PR) to refer to the problems of instance selection or generation for the NN rule.

Both the PS and PG methodologies have been widely studied. More than 50 PS methods have been proposed; generally, they can be categorized into three kinds of methods: condensation [Har68], edition [Wil72] and hybrid models [GCH08]. A complete review of this topic can be found in [GDCH12]. PG techniques can be divided into several families depending on the main heuristic operation followed: positioning adjustment [NL09], class re-labeling [SBM+03], centroid-based [FHA07] and space-splitting [Sán04]. More information about PS and PG approaches can be found at the SCI2S thematic public website on Prototype Reduction in Nearest Neighbor Classification: Prototype Selection and Prototype Generation, http://sci2s.ugr.es/pr/.

FS methods have commonly been categorized by the mechanism followed to assess the quality of a given subset of features, with three main categories: filter [GE03], wrapper [KJ97] and embedded methods [SIL07]. In [WAM97], FW techniques were categorized along several dimensions: their weight learning bias, the weight space (binary or continuous), the representation of the features, their generality and the degree of employment of domain-specific knowledge. A wide number of FW techniques are available in the literature, both classical and recent (for example, [PV06, FI08]). The best-known group is the family of Relief-based algorithms [KR92], with ReliefF [Kon94] as a reference algorithm for tackling the FW problem.
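The two kinds of search space can be contrasted with a short schematic (toy data, made up purely for illustration):

```python
import numpy as np

# Original training set: N = 4 instances, 2 features each
X = np.array([[0.1, 0.2], [0.4, 0.8], [0.9, 0.5], [0.3, 0.7]])

# PS / FS: a binary chromosome, one bit per instance/feature.
# 1 = kept, 0 = discarded; the search space holds 2^N such subsets.
ps_solution = np.array([1, 0, 1, 0])
selected = X[ps_solution.astype(bool)]

# PG / FW: a real-valued matrix/vector whose entries can be tuned freely,
# e.g. two prototypes whose coordinates are adjusted during the search.
pg_solution = np.array([[0.25, 0.50], [0.60, 0.60]])
```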
2.2 Evolutionary algorithms

Evolutionary algorithms (EAs) are techniques inspired by natural computation [GJ05, PF09] that have arisen as very competitive methods in the last decade [ES08]. They are a set of metaheuristics designed to tackle search and optimization problems. Many different evolutionary approaches have been published, all sharing a common way of working: evolving a set of solutions by applying different operators that modify those solutions until the process reaches a stopping criterion.

The benefits of using EAs come from their flexibility and their fit to the objective target, in combination with robust behavior. Nowadays, EAs are considered very adaptable tools for solving complex optimization and search problems. EAs have several features that make them very attractive for the data mining process [Fre02]. For example, they are able to deal with both binary and real-valued problems, as long as these can be formulated as search problems. These are the main reasons why they are used for upgrading and tuning many different data mining algorithms. Recently, a great number of works have developed new data mining techniques using EAs, applying them to different data mining tasks such as feature extraction, feature selection, classification and clustering [EVH10, KM99]. The main role of EAs in most of these approaches is optimization: they are used to improve the robustness and accuracy of some of the traditional data mining techniques.

EAs have also been applied to improve the NN classifier, acting as PS, PG, FS or FW algorithms. The first attempts in PS correspond to the papers [Kun95, KB98], aiming to optimize the accuracy obtained as well as the reduction achieved [CHL03]. Advanced works in evolutionary PS are [GCH08, DGH10b]. In terms of FS, a wide variety of evolutionary methods have also been proposed, acting as wrapper methods [KJ97, OLM04]. Both PS and FS algorithms encode the solutions as binary arrays that represent the selection performed. EAs for PG are based on the positioning adjustment of prototypes, which is a suitable methodology to optimize the position of a set of prototypes. Several proposals have been presented on this topic, such as an artificial immune model [Gar08] and particle swarm optimization [CGI09]. Many works in FW are related to clustering [GB08]; however, FW for NN classification can also be modeled with evolutionary approaches [DTGH12]. Evolutionary PG and FW approaches follow a real-coded representation.

Different kinds of EAs have been developed over the years, such as genetic algorithms, genetic programming, evolution strategies, evolutionary programming, differential evolution, cultural algorithms and co-evolutionary algorithms. Among them, in this thesis we will focus on differential evolution as a modern real-coded algorithm [SP97, PSL05]. Differential evolution (DE) follows the general procedure of an EA, searching for a global optimum in a D-dimensional real parameter space. It works through a cycle of stages, performed over a given number of generations. Figure 5 presents the general structure of the DE algorithm.

Figure 5: Main stages of the differential evolution scheme.

DE starts with a population of NP candidate solutions, commonly called individuals and represented as vectors. Initially, the population should cover the entire search space as much as possible. In most problems, this is achieved by uniformly randomizing the individuals within the search space constrained by the prescribed minimum and maximum bounds of each variable. After initialization, DE applies the mutation operator to generate a mutant vector with respect to each individual in the current population. For each individual (denoted the "target vector"), its associated mutant vector is created through a differential mutation operation that is scaled by a parameter F. The method of creating this mutant vector is what differentiates one DE scheme from another.
One of the simplest forms of DE mutation works as follows: three distinct individuals (all different from the target vector) are randomly sampled from the current population; the difference of two of these three vectors is scaled by the parameter F, and the scaled difference is added to the third one, yielding the mutant vector. To enhance the diversity of the population, the crossover operator comes into play after generating the mutant vector. The DE algorithm can use three kinds of crossover schemes, known as binomial, exponential and arithmetic crossover. This operator is applied to each pair formed by a target vector and its corresponding mutant vector, generating a new vector (the "trial vector"), and is controlled by a crossover rate parameter CR. To keep the population size constant over the subsequent generations, the DE algorithm performs a selection of the most promising vectors: it determines whether the target or the trial vector survives to the next generation. If the new trial vector yields a solution equal to or better than the target vector, it replaces the corresponding target vector in the next generation; otherwise the target is retained in the population. Therefore, the population always improves or retains the same fitness values, but never deteriorates. This one-to-one selection procedure is kept fixed in most DE algorithms.

The success of DE in solving a specific problem crucially depends on choosing the appropriate mutation strategy and its associated control parameter values (F and CR), which determine the convergence speed. Hence, a fixed selection of these parameters can produce slow and/or premature convergence depending on the problem. Thus, researchers have investigated parameter adaptation mechanisms to improve the performance of the basic DE algorithm [QHS09, DACK09, ZS09]. A good review on DE can be found in [DS11].
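The scheme just described corresponds to the classical DE/rand/1/bin variant; a minimal sketch follows (assuming NumPy; the sphere function is only a placeholder fitness, and the parameter values are illustrative):

```python
import numpy as np

def de_rand_1_bin(fitness, bounds, NP=20, F=0.5, CR=0.9, generations=100, seed=0):
    """Minimization with the classical DE/rand/1/bin scheme."""
    rng = np.random.default_rng(seed)
    low, high = bounds
    D = low.size
    pop = rng.uniform(low, high, size=(NP, D))        # uniform initialization
    fit = np.array([fitness(x) for x in pop])
    for _ in range(generations):
        for i in range(NP):                           # pop[i] is the target vector
            candidates = [j for j in range(NP) if j != i]
            r1, r2, r3 = rng.choice(candidates, 3, replace=False)
            mutant = pop[r1] + F * (pop[r2] - pop[r3])    # differential mutation
            mutant = np.clip(mutant, low, high)
            mask = rng.random(D) < CR                 # binomial crossover...
            mask[rng.integers(D)] = True              # ...forcing one mutant gene
            trial = np.where(mask, mutant, pop[i])
            f_trial = fitness(trial)
            if f_trial <= fit[i]:                     # one-to-one selection
                pop[i], fit[i] = trial, f_trial
    best = int(np.argmin(fit))
    return pop[best], fit[best]

# Toy usage: minimize the sphere function in 5 dimensions
best_x, best_f = de_rand_1_bin(lambda x: float(np.sum(x ** 2)),
                               (np.full(5, -5.0), np.full(5, 5.0)))
```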
2.3 Semi-supervised classification

Nowadays, the use of unlabeled data in conjunction with labeled data is a growing field in different research lines, ranging from bioinformatics to Web mining. In these areas it is easier to obtain unlabeled than labeled data, because collecting unlabeled data requires less effort, expertise and time. In this context, traditional supervised approaches are limited to using labeled data to build a model. Semi-Supervised Learning (SSL) is the learning paradigm concerned with the design of models in the presence of both labeled and unlabeled data. Essentially, SSL methods use unlabeled samples to either modify or reprioritize the hypothesis obtained from labeled samples alone. SSL is an extension of unsupervised and supervised learning that includes additional information typical of the other learning paradigm. Depending on the main objective of the methods, we can divide SSL into Semi-Supervised Classification (SSC) [CSZ06] and semi-supervised clustering [Ped85]. The former focuses on enhancing supervised classification by minimizing errors in the labeled examples while remaining compatible with the input distribution of the unlabeled instances. The latter, also known as constrained clustering, aims to obtain better-defined clusters than those obtained from unlabeled data alone. In this thesis, we focus on SSC.

SSC can be categorized into two slightly different settings [CW11], denoted transductive and inductive learning. On the one hand, transductive learning concerns the problem of predicting the labels of the unlabeled examples, given in advance, by taking both labeled and unlabeled data together into account to train a classifier. On the other hand, inductive learning considers the given labeled and unlabeled data as the training examples, and its objective is to predict unseen data. Many different approaches have been suggested and studied for classifying with unlabeled data in SSC. Existing SSC algorithms are usually classified depending on the conjectures they make about the relation between the labeled and unlabeled data distributions. Broadly speaking, they are based on the manifold and/or cluster assumptions. The manifold assumption is satisfied if the data lie approximately on a manifold of lower dimensionality than the input space. The cluster assumption states that similar examples should have the same label, so classes are well separated and do not cut through dense unlabeled data. Following Zhu and Goldberg's book [ZG09], we group the methods into the four following methodologies:

• Graph-based: These methods represent the SSC problem as a graph min-cut problem [BC01], following the manifold assumption. Labeled and unlabeled examples constitute the graph nodes, and the similarity measurements between nodes correspond to the graph edges. The graph construction determines the behavior of this kind of algorithm [Joa03, BNS06]. These methods usually assume label smoothness over the graph. Their main characteristics are that they are nonparametric, discriminative and transductive in nature. Advanced proposals can be found in [XWT11, WJC13].

• Generative models and cluster-then-label methods: The first attempts to deal with unlabeled data correspond to this area (based on the cluster assumption). It includes those methods that assume a joint probability model p(x, y) = p(y)p(x|y), where p(x|y) is an identifiable mixture distribution, for example a Gaussian mixture model. Hence, they follow a determined parametric model using both unlabeled and labeled data. Cluster-then-label methods are closely related to generative models: instead of using a probabilistic model, they apply a previous clustering step to the whole data set and then label each cluster with the help of the labeled data. Recent advances on these topics are [FUS08, TH10].

• Semi-Supervised Support Vector Machines (S3VM): S3VM is an extension of standard Support Vector Machines (SVM) to unlabeled data, and it also implements the cluster assumption. This methodology is also known as transductive SVM, although it learns an inductive rule defined over the search space. Advanced works on S3VM are [CSK08, AC10].

• Self-labeled methods: These form an important family of methods in SSC. They are not intrinsically geared to learning in the presence of both labeled and unlabeled data; rather, they use unlabeled points within a supervised learning paradigm. These techniques aim to obtain one (or several) enlarged labeled set(s), based on their most reliable predictions. Thus, these models do not make any specific assumptions about the input data, but they accept that their own predictions tend to be correct. Some authors state that self-labeling is likely to work when the classes form well-separated clusters [ZG09] (cluster assumption). In this thesis, we will focus on self-labeled methods. The major benefits of this family of methods are its simplicity and its wrapper nature.
The former refers to the ease of implementation and applicability; the latter means that any kind of classifier can be used regardless of its complexity, which is very important depending on the problem tackled. As a caveat, the addition of wrongly labeled examples during the self-labeling process can lead to even worse performance; several mechanisms have been proposed to reduce this problem [LZ05].

A preeminent work with this philosophy is the self-training paradigm designed by Yarowsky [Yar95]. In self-training, a supervised classifier is initially trained with the labeled set L. Then it is retrained with its own most confident predictions, enlarging its labeled training set. Thus, it is defined as a wrapper method for SSC. This idea was later extended by Blum and Mitchell [BM98] with the method known as co-training, which consists of two classifiers trained on two sufficient and redundant sets of attributes. This requirement implies that each subset of features should be able to perfectly define the frontiers between classes. The method then follows a mutual teaching procedure that works as follows: each classifier labels the examples it predicts most confidently from its point of view, and these are added to the L set of the other classifier. It is also known that its usefulness is constrained by the imposed requirement [DLM01], which is not satisfied in many real applications. Nevertheless, this method has become an example for recent models, thanks to the idea of using the agreement (or disagreement) of multiple classifiers and the mutual teaching approach. A good study of when co-training works can be found in [DLZ10].

Due to the success of co-training, and given its relatively limited applicability, many works have proposed improving standard co-training by eliminating the established conditions. In [GZ00], the authors proposed a multi-learning approach in which two different supervised learning algorithms are used without splitting the feature space; they showed that this mechanism divides the instance space into a set of equivalence classes. Later, the same authors proposed a faster and more precise alternative, named Democratic co-learning (Democratic-Co) [ZG04], which is also based on multi-learning. As an alternative that requires neither sufficient and redundant views nor several supervised learning algorithms, Zhou and Li [ZL05] presented the Tri-Training algorithm, which attempts to determine the most reliable unlabeled data through the agreement of three classifiers (same learning algorithm). They then proposed the Co-Forest algorithm [LZ07] as a similar approach that uses Random Forest. A further similar approach is Co-Bagging [HS10, HSP10], where confidence is estimated from the local accuracy of committee members. Other recent self-labeled approaches are [YC10, HYGL10, SZ11, HGG13]. In summary, all of these recent schemes work on the hypothesis that several weak classifiers, learned with a small number of instances, can produce better generalizations than a single weak classifier. These methods are also known as disagreement-based models, motivated in part by the empirical success of ensemble learning. The term disagreement-based was recently coined by Zhou and Li [ZL10].
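The self-training loop summarized above can be sketched as follows (a hedged outline assuming a scikit-learn-style estimator with fit/predict_proba; using the maximum predicted class probability as the confidence score is one common choice, not the only one):

```python
import numpy as np

def self_training(clf, X_lab, y_lab, X_unlab, per_iter=10, max_iter=20):
    """Grow the labeled set with the classifier's own most confident predictions."""
    X_l, y_l, X_u = X_lab, y_lab, X_unlab
    for _ in range(max_iter):
        if len(X_u) == 0:
            break
        clf.fit(X_l, y_l)                      # retrain on the enlarged labeled set
        proba = clf.predict_proba(X_u)
        conf = proba.max(axis=1)               # confidence = highest class probability
        top = np.argsort(conf)[-per_iter:]     # indices of the most confident points
        y_new = clf.classes_[proba[top].argmax(axis=1)]
        X_l = np.vstack([X_l, X_u[top]])       # enlarge L with self-labeled examples
        y_l = np.concatenate([y_l, y_new])
        X_u = np.delete(X_u, top, axis=0)
    return clf.fit(X_l, y_l)
```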
2.4 Big data

The rapid development of information technologies brings new challenges in collecting and analyzing vast amounts of data. The size of data sets is growing exponentially because, nowadays, numerous sources (such as mobile devices, software logs, cameras, and so on) gather new information, and we also have better techniques to collect it. As a numerical example, in 2010 Facebook had 21 PetaBytes of internal warehouse data, with 12 TB of new data added every day and 800 TB of compressed data scanned daily [TSA+10]. The term big data is used to refer to data sets that are so large that their collection and processing become very difficult for most data processing techniques. Formally, the big data problem can be defined as the quantity of data that exceeds the processing capabilities of a given system [MCD13] in terms of time and/or memory consumption. The big data problem has also been described with four terms: Volume, Velocity, Variety and Veracity (the 4Vs model [Lan01, LdRBH14]). Volume and velocity refer to the amount of data that must be processed or stored and how quickly it must be analyzed. Variety relates to the different types of data and their structure. Finally, veracity is associated with data integrity and the trust in the information used to make decisions.

Big data applications are attracting much attention in a wide variety of areas, such as industry, medicine and financial businesses, because these areas have progressively acquired a lot of raw data. With the availability of cloud platforms [PBA+08], they can take advantage of these massive data sets by extracting valuable information. However, analysis and knowledge extraction from big data become very difficult tasks for most classical and advanced data mining and machine learning tools. Therefore, data mining techniques should be adapted to the emerging technologies to overcome their limitations.

Among other solutions, the MapReduce framework [DG08, DG10], in conjunction with its distributed file system [GGL03], originally introduced by Google, offers a simple but robust environment for processing large data sets over a cluster of machines. This scheme is currently favored in data mining over other parallelization schemes, such as MPI (Message Passing Interface) [SO98], because of its fault-tolerant mechanism, which is crucial for time-consuming jobs, and because of its simplicity.

MapReduce is a parallel programming paradigm designed to process or generate large data sets regardless of the underlying hardware or software. Based on functional programming, this model works in two different steps, the map phase and the reduce phase, each with key-value (<k, v>) pairs as input and output. The map phase takes each <k, v> pair and generates a set of intermediate <k, v> pairs. Then, MapReduce merges all the values associated with the same intermediate key into a list (known as the shuffle phase). The reduce phase takes that list as input to produce the final values. Figure 6 depicts a flowchart of the MapReduce framework.

Figure 6: MapReduce flowchart.

In a MapReduce program, all map and reduce operations run in parallel. First, all map functions run independently. Meanwhile, reduce operations wait until their respective maps are finished; then they process different keys concurrently and independently. Note that the inputs and outputs of a MapReduce job are stored in an associated distributed file system that is accessible from any computer of the cluster used.

An illustrative example of how MapReduce works is finding the average cost per year from a big list of cost records. Each record may be composed of a variety of values, but it at least includes the year and the cost. The map function extracts from each record the pair <year, cost> and emits it as its output. The shuffle stage groups the <year, cost> pairs by year, creating a list of costs per year, <year, list(cost)>. Finally, the reduce phase computes the average of all the costs contained in the list of each year.
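The year/cost example maps directly onto the three stages; the following sketch simulates them in plain Python on a single machine (a didactic stand-in, not Hadoop code):

```python
from collections import defaultdict

records = [(2010, 120.0), (2011, 80.0), (2010, 60.0), (2011, 40.0)]

# Map phase: emit one <year, cost> pair per record
mapped = [(year, cost) for (year, cost) in records]

# Shuffle phase: group all values sharing the same intermediate key
grouped = defaultdict(list)
for year, cost in mapped:
    grouped[year].append(cost)          # <year, list(cost)>

# Reduce phase: collapse each list into the final value
averages = {year: sum(costs) / len(costs) for year, costs in grouped.items()}
print(averages)                          # {2010: 90.0, 2011: 60.0}
```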
Different implementations of the MapReduce framework are possible [DG08], depending on the available cluster architecture. Some implementations of MapReduce are Mars [HFL+08], Phoenix [TYK11] and Apache Hadoop [Whi12, Pro13a]. We will focus on the Hadoop implementation because of its performance, open source nature, ease of installation and its distributed file system (Hadoop Distributed File System, HDFS). A Hadoop cluster follows a master-slave architecture, in which one master node manages an arbitrary number of slave nodes. The HDFS replicates file data across multiple storage nodes that can access the data concurrently. In such a cluster, a certain percentage of slave nodes may be out of order temporarily. For this reason, Hadoop provides a fault-tolerant mechanism: when one node fails, Hadoop automatically restarts the task on another node.

In the specialized literature, several recent proposals have focused on parallelizing machine learning tools with the MapReduce approach [ZMH09, SFJ12]. For example, some classification techniques [HDW+11, PR12, CLL13] have been implemented within the MapReduce paradigm. They have shown that distributing the data and the processing under a cloud computing infrastructure is very useful for speeding up the knowledge extraction process. In fact, there is a growing open source project, called Apache Mahout [Pro13b], that collects distributed and scalable machine learning algorithms implemented on top of Hadoop. Nowadays, it supplies implementations of several specific techniques, such as k-means for clustering, a naive Bayes classifier and collaborative filtering.

Data reduction techniques such as PG, PS, FS or FW should help data mining algorithms to address big data problems; however, these methods are also affected by the increase in the size and complexity of data sets, and they are unable to provide a preprocessed data set in a reasonable time. Therefore, they should also be adapted to the new technologies.

3. Justification

As explained in the previous sections, instance generation methods are very useful tools for reducing the size of training data sets in order to improve the performance of data mining techniques (e.g., prototype generation for the nearest neighbor rule). Advanced evolutionary techniques are optimization models that may provide a potential enhancement for generating appropriate sets of representative examples. To adopt instance/prototype generation models as outstanding data reduction techniques, the following issues should be taken into consideration:

• In supervised classification, a great number of prototype generation techniques have been proposed in the literature. Some of them are based on evolutionary techniques, with promising results.
However, this field remains relatively unexplored: it is frequently confused with prototype selection models, and its main drawbacks are not well known. To tackle the design of new prototype reduction models, we consider that:

– First of all, it is necessary to have a deep knowledge of the current state of the art, analyzing in detail the main strengths and weaknesses of these models and comparing their performance theoretically and empirically. After that, the main drawbacks of existing prototype generation techniques should be addressed. These issues could be solved through several strategies based on evolutionary algorithms.

– Another interesting direction is the hybridization of different data reduction algorithms into a single algorithm. In this sense, the performance of prototype generation techniques could be improved even further by taking the feature space into consideration or by relying on the simplicity of prototype selection methods to accelerate the convergence of these techniques.

– Finally, it is known that data mining and data reduction techniques lack the scaling-up capabilities needed to tackle big data problems. Therefore, the study and design of scalable mining algorithms will be needed.

• Until now, prototype generation techniques have focused on supervised contexts, in which all the available training data are labeled, aiming to find the smallest reduced set that represents the original labeled set. However, they can also be useful in other fields, such as self-labeling semi-supervised classification, where labeled data are sparse and scattered, to provide new synthetic data or to identify reliable examples. Thus, the following issues should be addressed:

– Self-labeling semi-supervised learning is a growing field which permits us to tackle the shortage of labeled examples using supervised models. A thorough study of this family of algorithms is essential in order to discern their advantages and disadvantages.

– The previous study will allow us to understand how self-labeling methods can be improved with the aid of prototype generation algorithms. The generation of synthetic data and the detection of noisy examples in the field of semi-supervised learning may be a good way to fill out labeled data regions and avoid the introduction of noisy data.

All these issues relate to the main topic of this thesis: the development of new prototype generation models for supervised and semi-supervised classification through evolutionary approaches.

4. Objectives

After studying the current state of all the areas described in the previous sections, it is possible to focus on the actual objectives of this thesis. They include the research and analysis of the background fields described before, and the development of advanced models for prototype generation in supervised and semi-supervised contexts based on the most promising properties of each field. More specifically, two main objectives motivate the present thesis: the analysis, design and implementation of evolutionary prototype generation techniques for (1) supervised and (2) semi-supervised learning. In what follows, we elaborate the sub-objectives that form each one.

• Prototype generation for supervised classification.

– To study the current state of the art in prototype generation. A theoretical and empirical study of the state of the art in the field of prototype generation, in order to categorize existing trends and discover the strengths and weaknesses of each family of methods.
To the best of our knowledge, there is currently no general categorization of this kind of technique that provides an overview of the methods proposed in the literature. The goal of this study is to provide guidelines about the application of these techniques, allowing a broad readership to differentiate between techniques and make appropriate decisions about the most suitable method for a given type of problem. Moreover, this step will be our starting point for the next objectives.

– To provide new evolutionary prototype generation models. After analyzing the latest works published on prototype generation, our objective is to develop new evolutionary prototype generation models that overcome the known issues and limitations of the current state of the art. The aim is therefore to design more accurate models with higher reduction rates by using better evolutionary approaches. To do so, we will rely on the success of the differential evolution algorithm in real-coded problems.

– To combine the previous prototype generation models with other data reduction approaches. The models previously designed can be improved even further if other data reduction techniques, such as prototype selection and feature weighting, are considered, establishing a cooperation between different data preprocessing tasks within a single algorithm. To address this objective, we will study two possibilities: the first is to design hybrid prototype selection and generation algorithms, whereas the second focuses on the combination with an evolutionary feature weighting approach.

– To enable prototype reduction models to be applied to big data sets. Given that the application of prototype reduction techniques is not feasible on big data sets in terms of runtime and memory consumption, we aim to develop new algorithmic strategies, based on the emerging cloud technologies, that allow them to be applied without major algorithmic modifications of the original prototype reduction proposals.

• Self-labeling with prototype generation and selection for semi-supervised classification.

– To review the state of the art in self-labeling semi-supervised classification techniques. A survey of the state of the art of self-labeling semi-supervised algorithms, in order to gain a full understanding of their capabilities. At the time of writing this thesis, there is no taxonomy of this kind of method. Our goal is to analyze their main strengths and drawbacks in order to discover how prototype generation algorithms can be useful in this field.

– To develop new self-labeling approaches with the aid of prototype generation or selection models. Given the complex scenario of semi-supervised learning, in which the number of labeled examples is very small, we will focus on the application of prototype generation and selection algorithms to alleviate its main drawbacks. Two research lines will be established: the first is related to the removal of noisy labeled and unlabeled examples that can be wrongly added during the self-labeling process, while the second is based on the generation of new synthetic labeled data.

5. Summary

This thesis is composed of eight works, organized into two main parts. Each part is devoted to pursuing one of the objectives, and its respective sub-objectives, described above.

• Prototype Generation for Supervised Learning:

– A Review on Prototype Generation.

– New Prototype Generation Methods based on Differential Evolution.
– Integrating Prototype Selection and Feature Weighting within Prototype Generation.

– Enabling Prototype Reduction Models to deal with Big Data Classification.

• Self-labeling with Prototype Generation/Selection for Semi-Supervised Classification:

– A Survey on Self-labeling Semi-Supervised Classification.

– New Self-labeling Approaches Aided by Prototype Generation/Selection Models.

This section presents a summary of the different proposals of this dissertation according to the two pursued objectives (Section 5.1 and Section 5.2, respectively). In each section, we describe the associated publications and their main contents.

5.1 Prototype Generation for Supervised Learning

This subsection encloses all the works related to the first part of this thesis, devoted to the study and development of PG algorithms for supervised learning. Section 5.1.1 summarizes the review performed on PG. Section 5.1.2 presents the proposed evolutionary model for PG. Then, Section 5.1.3 briefly explains the proposed schemes to integrate PG with other data reduction approaches. Finally, Section 5.1.4 presents a big data solution for prototype reduction approaches.

5.1.1 A review on Prototype Generation

The NN rule has been shown to perform well in many different classification and pattern recognition tasks. However, this rule suffers from several shortcomings: time response, noise sensitivity, high storage requirements and dependence on the existing data to make predictions. Several approaches have been suggested and studied in order to tackle these drawbacks. Among them, prototype reduction models reduce the training data used for classification. The PS process consists of choosing a subset of the original training data, whereas PG builds new artificial prototypes to increase the accuracy of NN classification. PS has been widely studied in the literature [GDCH12]; however, although they relate to different problems, PG algorithms are commonly confused with PS. Moreover, at the time of writing this thesis, there was no general categorization or taxonomy for these kinds of methods, even though more than 24 techniques had been proposed. For these reasons, we performed an exhaustive survey on this topic. From a theoretical point of view, we proposed a taxonomy based on the main characteristics of these methods. From an empirical point of view, we conducted a wide experimental study measuring their performance in terms of accuracy and reduction capabilities.

We identified the main characteristics of PG methods. They include the type of reduction performed (incremental, decremental, fixed or mixed), the kind of resulting generated set (condensed, edited or hybrid), the generation mechanism (positioning adjustment [NL09], class re-labeling [SBM+03], centroid-based [FHA07] and space-splitting [Sán04]) and the way in which they evaluate the search (filter, semi-wrapper and wrapper). According to these characteristics, we classified the methods into several families, from the generation heuristic followed down to the reduction type. Moreover, we explained some criteria for comparing these kinds of methods, as well as related and advanced work. The experimental study involved a great number of problems (59), differentiating between small/large data sets, numerical/nominal/mixed data sets and binary/multi-class problems.
Finally, we included a visualization section to illustrate the way PG methods work on a 2-dimensional data set.

The journal article associated to this part is:

• I. Triguero, J. Derrac, S. García, F. Herrera, A Taxonomy and Experimental Study on Prototype Generation for Nearest Neighbor Classification. IEEE Transactions on Systems, Man, and Cybernetics–Part C: Applications and Reviews 42 (1) (2012) 86–100, doi: 10.1109/TSMCC.2010.2103939.

5.1.2 New Prototype Generation Methods based on Differential Evolution

The family of positioning adjustment of prototypes stands out as a successful trend within the PG methodology. The aim of these techniques is to correct the position of a subset of prototypes from the initial set by using a (real-coded) optimization procedure. Many proposals belong to this family, such as learning vector quantization [Koh90] and its successive improvements [LMYW05, KO03b], genetic algorithms [FI04] and particle swarm optimization [NL09, CGI09]. Most existing positioning adjustment techniques start with an initial set of prototypes and try to improve the classification accuracy by adjusting it. Two initialization schemes are commonly used:

• The number of representative instances for each class is proportional to their number in the input data.

• All the classes are represented by the same number of prototypes.

This initialization process becomes their main drawback, because this parameter can be very dependent on the problem tackled. Some PG approaches [FI04, LMYW05] automatically compute the number of prototypes to be retained, but in complex domains they need to retain many prototypes. To address these limitations, we proposed a novel evolutionary procedure to automatically find the smallest reduced set able to achieve suitable classification accuracy over different types of problems. This method follows an iterative prototype adjustment scheme with an incremental approach: at each step, an optimization procedure is used to adjust the position of the prototypes, and the method adds new prototypes if needed. As a second contribution of this work, we adopted the differential evolution technique [SP97, PSL05] as the optimizer. Specifically, we used a self-adaptive differential evolution algorithm named SFLSDE [NT09] to avoid the convergence problems related to fixed parameters. Our proposal is denoted Iterative Prototype Adjustment based on Differential Evolution (IPADE). Among other characteristics of our evolutionary proposal, we would like to highlight the way in which it codifies the individuals: each individual in the population encodes a single prototype, so that the whole population forms the resulting reduced set. To contrast the behavior of our proposal, we conducted experiments on a great number of real-world data sets; the classification accuracy and reduction rate of our approach were investigated, and its performance was compared with classical and recent PG models.
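To convey the shape of this iterative scheme, the skeleton below sketches an adjust-then-add loop. It is only an outline of the idea: initial, optimize, accuracy and add_prototype are placeholder callables, and the actual IPADE algorithm (with SFLSDE as the optimizer and its own addition and stopping criteria) is specified in the associated publication.

```python
def iterative_prototype_adjustment(initial, train, optimize, accuracy,
                                   add_prototype, max_iter=50):
    """Skeleton of an IPADE-style loop: adjust positions, then grow if needed."""
    reduced = initial(train)                # e.g., one prototype per class
    reduced = optimize(reduced, train)      # DE-based positioning adjustment
    best_acc = accuracy(reduced, train)
    for _ in range(max_iter):
        candidate = add_prototype(reduced, train)   # new prototype for a weak class
        candidate = optimize(candidate, train)
        cand_acc = accuracy(candidate, train)
        if cand_acc > best_acc:             # keep the addition only if it helps
            reduced, best_acc = candidate, cand_acc
    return reduced
```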
The journal article associated to this part is:

• I. Triguero, S. García, F. Herrera, IPADE: Iterative Prototype Adjustment for Nearest Neighbor Classification. IEEE Transactions on Neural Networks 21 (12) (2010) 1984-1990, doi: 10.1109/TNN.2010.2087415.

An extension of this work was presented in the following conference paper:

• I. Triguero, S. García, F. Herrera, Enhancing IPADE Algorithm with a Different Individual Codification. 6th International Conference on Hybrid Artificial Intelligence Systems (HAIS2011), Wroclaw, Poland, 23-25 May 2011, LNAI 6679, pp. 262–270.

This extension consisted of a new individual codification that allowed the IPADE approach to further improve its accuracy. Concretely, it codified a complete reduced set in each individual. We denoted this algorithm IPADECS.

5.1.3 Integrating Prototype Selection and Feature Weighting within Prototype Generation

The hybridization of techniques has become a very useful tool in the development of new advanced data reduction models. It is common to use classical PS methods in early or late stages of a PG algorithm as mechanisms for removing noisy or redundant prototypes. For example, some PG methods implement the ENN or DROP algorithms as early filtering processes [LKL02, Sán04], and in [KO03b] a hybridization method based on LVQ3 post-processing of conventional prototype reduction approaches was proposed. The two approaches reviewed here are managed by hybrid algorithms, combining the efforts of several data preprocessing approaches at once.

In the first proposal, we combine PG with a previous PS step, aiming to improve the performance of positioning adjustment algorithms. Several issues motivate the combination of PS and PG:

• PS algorithms assume that the best representative examples can be obtained from a subset of the original data, whereas PG methods generate new representative examples if needed.

• PG methods address a more complex problem than PS, so finding a promising solution entails a higher cost for positioning adjustment methods.

• Determining the number of instances per class is not straightforward for PG approaches.

To hybridize both methodologies, we perform a preliminary PS stage before the adjustment process in order to initialize a subset of prototypes. With this idea, we mitigate the complexity of positioning adjustment methods, because we provide a promising initial solution to the PG technique. Note also that PS methods are not forced to select a predetermined number of prototypes of each class; they select the most suitable number of prototypes per class. In addition, if the selected prototypes can be tuned in the search space, the main drawback associated with PS is also overcome. To understand how the proposed hybrid model can improve the classification accuracy of isolated PS and PG methods, we analyzed the combination of several PS algorithms (DROP3 [WM00], ICF [BM02] and SSMA [GCH08]) with several generation algorithms (LVQ3 [Koh90], PSO [NL09] and a proposed differential evolution algorithm). Moreover, we also analyzed several adaptive differential evolution schemes, such as SADE [QHS09], DEGL [DACK09], JADE [ZS09] and SFLSDE [NT09], and a wide variety of mutation/crossover operators for PG. As a result, we found that the model composed of SSMA and SFLSDE was the best performing approach (denoted SSMA-SFLSDE).

In the second proposal, we developed a hybrid FW approach with PG and PS. To do so, we first designed a differential evolution FW scheme that is also based on the self-adaptive SFLSDE algorithm [NT09]. The aim of this FW algorithm is to provide a set of optimal feature weights for a given set of prototypes, maximizing the accuracy obtained with the NN classifier. It is then hybridized with two different prototype reduction approaches, IPADECS and the hybrid SSMA-SFLSDE, denoting the resulting hybrid models IPADECS-DEFW and SSMA-DEPGFW, respectively. Note that the hybridization process differs between both models, given that IPADECS is a pure PG algorithm whereas SSMA-SFLSDE combines PS and PG. The two-stage cooperation behind the first proposal is sketched below.
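A minimal sketch of that two-stage pipeline, with select and adjust as placeholder components (the thesis instantiates them with, e.g., SSMA for the selection and an adaptive DE scheme for the adjustment):

```python
def hybrid_ps_pg(train_X, train_y, select, adjust):
    """PS provides a promising initial subset; PG then tunes its positions."""
    # Stage 1: prototype selection picks the instances to keep, deciding
    # on its own how many prototypes each class deserves.
    proto_X, proto_y = select(train_X, train_y)
    # Stage 2: a real-coded optimizer (e.g., differential evolution) adjusts
    # the coordinates of the selected prototypes in the search space.
    proto_X = adjust(proto_X, proto_y, train_X, train_y)
    return proto_X, proto_y
```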
As an additional study, we also analyzed the scaling-up problem of prototype reduction when dealing with large data sets (up to 300,000 instances). To tackle this problem, we proposed several strategies, based on stratification [CHL05], to apply the proposed hybrid approach to large problems. To test the proposed hybrid scheme in comparison with other FW approaches and with isolated prototype reduction methods, we performed a wide experimental study with many different data sets.

The journal articles associated to this part are:

• I. Triguero, S. García, F. Herrera, Differential Evolution for Optimizing the Positioning of Prototypes in Nearest Neighbor Classification. Pattern Recognition 44 (4) (2011) 901-916, doi: 10.1016/j.patcog.2010.10.020.

• I. Triguero, J. Derrac, S. García, F. Herrera, Integrating a Differential Evolution Feature Weighting scheme into Prototype Generation. Neurocomputing 97 (2012) 332-343, doi: 10.1016/j.neucom.2012.06.009.

5.1.4 Enabling Prototype Reduction Models to deal with Big Data Classification

Nowadays, analyzing and extracting knowledge from large-scale data sets is a very challenging task. Although data reduction techniques should help data mining algorithms to tackle big data problems, these methods are also affected by the increase in the size and complexity of data sets, and they are unable to provide a preprocessed data set in a reasonable time. Hence, a new class of scalable data reduction methods that embraces the huge storage and processing capacity of cloud platforms is required. With these issues in mind, in this part of the thesis we focused on developing several strategies to give prototype reduction methods the capacity to deal with big data problems.

Several solutions had previously been developed to enable data reduction techniques to deal with this problem. For prototype reduction, we can find a data-level approach based on a distributed partitioning model that maintains the class distribution (also called stratification). It splits the original training data into several subsets that are addressed individually and then joins each partial reduced set into a global solution. This approach had been used for PS in [CHL05, DGH10b]. We extended it to PG in the following conference paper:

• I. Triguero, J. Derrac, S. García, F. Herrera, A Study of the Scaling up Capabilities of Stratified Prototype Generation. Third World Congress on Nature and Biologically Inspired Computing (NABIC'11), Salamanca (Spain), pp. 304-309, October 19-21, 2011.

This scheme provided promising results in enabling PG and PS methods to be applied to large data sets. However, two main problems arise when the size of the data grows considerably:

• A stratified partitioning process cannot be carried out when the data set is so big that it occupies all the available RAM memory.

• This scheme does not take into account that joining each partial solution into a global one may generate a reduced set with redundant or noisy instances that damage the classification performance.

Aiming to handle both drawbacks, we proposed a novel distributed partitioning framework relying on the success of the MapReduce approach [DG08]. We denoted this framework "MapReduce for Prototype Reduction" (MRPR). The map and reduce phases were carefully designed to perform a proper data reduction process.
Specifically, the map phase splits the original training set into several subsets, each of which is addressed individually by applying the prototype reduction technique. In the reduce stage, we integrate the multiple partial solutions (reduced sets of prototypes) into a single one; to do this, we propose an iterative filtering and fusion of prototypes as part of the reduce phase. We analyzed different strategies, of varying computational effort, for the integration of the partial solutions generated by the mappers. We studied the training and test accuracy, runtime and reduction capabilities of prototype reduction techniques under the proposed framework. Several variations of the proposed model were investigated with different numbers of mappers and four big data sets of up to 5.7 million instances. As prototype reduction techniques, we focused on the previously proposed hybrid SSMA-SFLSDE algorithm, as well as two PS methods (FCNN [Ang07] and DROP3 [WM00]) and two PG methods (LVQ3 [Koh90] and RSP3 [Sán04]).
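A single-machine schematic of this partition/reduce/fuse flow is shown below; it is not the Hadoop implementation itself, and reduce_method and fuse are placeholders for the prototype reduction technique and the filtering/fusion reducer:

```python
import numpy as np

def mrpr_schematic(X, y, reduce_method, fuse, n_mappers=4, seed=0):
    """Split the data, reduce each split independently, fuse the partial sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))            # shuffle before partitioning
    partial_sets = []
    for chunk in np.array_split(order, n_mappers):
        # map phase: each mapper reduces its own split independently
        partial_sets.append(reduce_method(X[chunk], y[chunk]))
    # reduce phase: filtering/fusion of the partial reduced sets into one
    return fuse(partial_sets)
```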
The journal article associated to this part is:

• I. Triguero, D. Peralta, J. Bacardit, S. García, F. Herrera, MRPR: A MapReduce Solution for Prototype Reduction in Big Data Classification. Submitted to Neurocomputing.

5.2 Self-labeling with prototype generation/selection for semi-supervised classification

This subsection presents a summary of the works related to the second part of this thesis, analyzing and designing new self-labeling approaches with PG algorithms.

5.2.1 A Survey on Self-labeling Semi-Supervised Classification

Self-labeled methods are appropriate tools for tackling problems with large amounts of unlabeled data and a small quantity of labeled data. Unlike other semi-supervised learning approaches, this kind of technique does not make any specific assumptions about the characteristics of the input data. Based on traditional supervised classification algorithms, self-labeled methods follow an iterative procedure in which they accept that the predictions performed by supervised methods tend to be correct. These techniques aim to obtain an enlarged labeled set, based on their most confident predictions, to classify unlabeled data. In the literature [ZG09], self-labeled techniques are typically divided into self-training [Yar95] and co-training [BM98]. In the former, a classifier is trained with an initial small number of labeled examples and then retrained with its own most confident predictions, enlarging its labeled training set. The latter splits the feature space into two conditionally independent views [DLZ10], training one classifier on each view, with each classifier teaching the other its most confidently predicted examples. Multiview learning for semi-supervised classification is a generalization of co-training that requires neither explicit feature splits nor the iterative mutual-teaching procedure [ZL05, LZ07]. However, these concepts are scattered and frequently confused in the literature, and there was no general categorization focused on self-labeled techniques. General SSL surveys can be found [Zhu05], but they are not exclusively focused on self-labeled techniques, nor especially on studying the similarities among them. For these reasons, we performed a survey of self-labeled methods. Firstly, we proposed a taxonomy based on the main characteristics presented in them. Secondly, we conducted an exhaustive study involving a large number of data sets with different ratios of labeled data (10%, 20%, 30% and 40%), aiming to measure their performance in terms of transductive and inductive classification capabilities. In addition, we tested the performance of the best performing methods over 9 high-dimensional data sets obtained from the book of Chapelle [CSZ06]. In this study we included different base classifiers, such as NN, C4.5 [Qui93], Naive Bayes [JL01] and SVM [Vap98, Pla99]. Furthermore, a comparison with the supervised learning context was performed, analyzing how far self-labeled techniques are from traditional supervised learning.

The journal article associated to this part is:

• I. Triguero, S. García, F. Herrera, Self-Labeled Techniques for Semi-Supervised Learning: Taxonomy, Software and Empirical Study. Knowledge and Information Systems, in press (2014).

5.2.2 New Self-labeling Approaches Aided by Prototype Generation/Selection Models

To the best of our knowledge, prototype reduction techniques had not previously been used to improve the performance of self-labeling methods. However, their capabilities to generate new data (PG) and to detect noisy samples (via PS) make them an interesting line to exploit. The two approaches proposed in this part of the thesis aim to exploit the abilities of PG and PS techniques in the field of semi-supervised learning.

The first contribution is devoted to studying the influence of noisy data on one of the most used self-labeling approaches: the self-training algorithm [Yar95]. This simple approach perfectly exemplifies the problem of self-labeling techniques, which can make erroneous predictions if noisy examples are labeled and incorporated into the training set. This problem is especially important in the initial stages of the algorithm. In [LZ05], the authors proposed adding a statistical filter to the self-training process, naming the resulting algorithm SETRED. Nevertheless, this method does not perform well in many domains: using a particular filter that has been designed and tested under different conditions is not straightforward. Although the aim of any filter is to remove potentially noisy examples, correct examples and examples containing valuable information may also be removed. Therefore, detecting truly noisy examples is a challenging task. In the self-training approach, the number of available labeled data and the induced noisy examples are two decisive factors when filtering noise. The aim of this work was to study in depth the integration of different noise filters into the self-training process, further analyzing recent proposals in order to establish their suitability with respect to self-training and to distinguish the most relevant features of filters. We distinguish two types of noise detection mechanisms: local and global. Local methods are those techniques in which the removal decision is based on a local neighborhood of instances [WM00]. Global methods create different models from the training data; mislabeled examples can be considered noisy depending on the hypothesis agreement of the classifiers used. As such, both methodologies can be considered edition-based algorithms for data reduction.
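As an example of a local, edition-based filter, Wilson's editing rule (ENN) [Wil72] removes every instance whose k nearest neighbors disagree with its label; a compact sketch (assuming NumPy):

```python
import numpy as np
from collections import Counter

def enn_filter(X, y, k=3):
    """Wilson's ENN: drop instances whose k nearest neighbors disagree."""
    keep = []
    for i in range(len(X)):
        dists = np.linalg.norm(X - X[i], axis=1)
        neighbors = np.argsort(dists)[1:k + 1]        # exclude the instance itself
        vote = Counter(y[neighbors]).most_common(1)[0][0]
        keep.append(vote == y[i])                     # keep only if neighbors agree
    keep = np.array(keep)
    return X[keep], y[keep]
```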
In our experiments, we focused on the NN rule as the base classifier and on ten different noise filters, involving a wide variety of data sets with different ratios of labeled data: 10%, 20%, 30% and 40%.

In the second contribution, we designed a framework, named SEG-SSC, to improve the classification performance of any given self-labeled method by using synthetic labeled data. In our previous survey, we had detected that self-labeled techniques are limited by the number of labeled points and their distribution when identifying reliable unlabeled examples. This problem is even more pronounced when the labeled ratio is greatly reduced and the labeled examples do not minimally represent the domain. Moreover, most of the advanced models use diversity mechanisms, such as bootstrapping [Bre96], to provide differences between the hypotheses learned by the multiple classifiers. However, these mechanisms may provide a performance similar to classical self-training or co-training approaches if the number of labeled data is insufficient to achieve different learned hypotheses. The aim of this work is to alleviate these weaknesses by using new synthetic labeled examples to introduce diversity into multiple classifier approaches and to fill out the labeled data distribution. The principal aspects of the proposed framework are:

• Introducing diversity to the multiple classifiers used, by means of more (new) labeled data.

• Filling out the labeled data distribution with the aid of unlabeled data.

• Being applicable to any kind of self-labeled method.

In our empirical studies, we applied this scheme to four recent self-labeled methods that belong to different families. We tested their capabilities with a large number of data sets and a very reduced labeled ratio. A study on high-dimensional data sets, extracted from the book by Chapelle et al. [CSZ06] and the BBC News web page [BBC14], has also been included.

The journal articles associated to this part are:

• I. Triguero, José A. Sáez, J. Luengo, S. García, F. Herrera, On the Characterization of Noise Filters for Self-Training Semi-Supervised in Nearest Neighbor Classification. Neurocomputing 132 (2014) 30-41, doi: 10.1016/j.neucom.2013.05.055.

• I. Triguero, S. García, F. Herrera, SEG-SSC: A Framework based on Synthetic Examples Generation for Self-Labeled Semi-Supervised Classification. Submitted to IEEE Transactions on Cybernetics.

6. Discussion of results

The following subsections summarize and discuss the results obtained in each specific stage of the thesis.

6.1 Prototype Generation for supervised learning

This subsection discusses the main results obtained for the first objective of this thesis.

6.1.1 A review on Prototype Generation

Classical and recent approaches for PG were thoroughly analyzed in this review. As a result, we highlighted the basic and advanced features of these techniques by designing a taxonomy of methods. This allowed us to establish several guidelines about which families of methods are more promising, which are less exploited and which are more susceptible to improvement. The extensive experimental study carried out compared the performance of current PG approaches on many different problems. To validate and support the results obtained, we used a statistical analysis based on nonparametric tests. The best methods of each category were highlighted.
The results of this comparison have shown the potential of the positioning adjustment family of methods, which obtains a very good trade-off between accuracy, reduction rate and runtime. Specifically, we observed that evolutionary positioning adjustment algorithms, such as the PSO algorithm [NL09], stood out as the most accurate. We also stressed that this family is commonly based on a fixed reduction type, which could be its main drawback. The concrete choice of a PG method will nevertheless depend on the problem tackled, but the results offered in this work can help to reduce the set of candidates. As an additional consequence of this paper, a complete software package of PG techniques has been developed for the KEEL platform [AFSG+ 09]. Moreover, all the data sets prepared for this work have been included in the KEEL data set repository [AFFL+ 11]. Both contributions allow future PG proposals to conduct rigorous analyses, comparing against the state of the art over a great number of problems. All the results, source code and data sets can be found at the following web page: http://sci2s.ugr.es/pgtax.

6.1.2 New Prototype Generation Methods based on Differential Evolution

In this part of the thesis, we have presented a new data reduction technique called IPADE which iteratively learns the most adequate number of prototypes per class and their respective positioning for the NN classifier, acting as a PG method. The proposed technique uses a real parameter optimization procedure based on self-adaptive differential evolution, which allowed us to adjust the positioning of the prototypes at each step of the algorithm. The large experimental study performed, with its respective statistical evaluation, shows that IPADE is a suitable method for PG in NN classification. The results showed that IPADE significantly outperforms all the comparison algorithms with respect to classification accuracy and reduction rate. The balance that this method achieves between accuracy and reduction rate is noteworthy. Given its incremental nature, this algorithm stands out especially for its reduction power. The complete set of results can be found at http://sci2s.ugr.es/ipade/. In the further extension, IPADECS, we obtained more accurate results by changing the individual codification, which allowed us to apply new mutation and crossover operators that resulted in better convergence speed and, therefore, better accuracy.

6.1.3 Integrating Prototype Selection and Feature Weighting within Prototype Generation

Two hybrid models have been developed to improve the performance of PG algorithms. The first model combined PS and PG into a single algorithm, showing that the good relation between PS and PG yields hybrid algorithms that find very promising solutions. The proposed hybrid models are able to tackle several drawbacks of the isolated methods. Concretely, we have analyzed the use of positioning adjustment algorithms as an optimization procedure after a previous PS stage. The wide experimental study performed has allowed us to justify the behavior of hybrid algorithms when dealing with small and large data sets. These results have been contrasted with several non-parametric statistical procedures, which have reinforced the conclusions.
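As an illustration of the positioning-adjustment idea on which these hybrids rely, the following is a bare-bones DE/rand/1/bin sketch: each individual encodes the coordinates of all prototypes, and the fitness is the 1NN training accuracy obtained with them. It is a deliberately minimal stand-in, not the self-adaptive SFLSDE scheme actually employed; all parameter values and names are illustrative.

```python
import numpy as np

def nn1_accuracy(protos, labels, X, y):
    """Fitness: training accuracy of the 1NN rule using the prototypes."""
    d = ((X[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2)
    return (labels[d.argmin(axis=1)] == y).mean()

def de_adjust(protos, labels, X, y, pop=20, iters=100, F=0.5, CR=0.9, seed=0):
    """DE/rand/1/bin over the flattened prototype coordinates (a sketch)."""
    rng = np.random.default_rng(seed)
    base = protos.reshape(-1)
    # initialize the population as small perturbations of the seed prototypes
    P = base + 0.05 * rng.standard_normal((pop, base.size))
    fit = np.array([nn1_accuracy(p.reshape(protos.shape), labels, X, y)
                    for p in P])
    for _ in range(iters):
        for i in range(pop):
            a, b, c = rng.choice([j for j in range(pop) if j != i], 3,
                                 replace=False)
            mutant = P[a] + F * (P[b] - P[c])
            cross = rng.random(base.size) < CR
            cross[rng.integers(base.size)] = True   # force at least one gene
            trial = np.where(cross, mutant, P[i])
            f = nn1_accuracy(trial.reshape(protos.shape), labels, X, y)
            if f >= fit[i]:                          # greedy one-to-one selection
                P[i], fit[i] = trial, f
    return P[fit.argmax()].reshape(protos.shape)
```

A previous PS stage (e.g., SSMA) would simply supply protos and labels, which is the essence of the hybridization discussed above.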
We concluded that the LVQ3 algorithm [Koh90] does not produce optimal positioning in most cases, whereas PSO [NL09] and the proposed differential evolution scheme result in excellent accuracy rates in comparison with isolated PS and PG methods. In terms of the previous PS stage, we especially emphasize the use of SSMA [GCH08] in combination with one of the two mentioned optimization approaches, which also achieves high reduction rates in the final set of prototypes obtained. As part of this work, we also analyzed several self-adaptive differential evolution algorithms and a multitude of mutation/crossover operators for PG, studying their convergence capabilities. Among the analyzed approaches, we observed that the SFLSDE algorithm [NT09] found a great balance between exploration and exploitation during the evolutionary process.

The second model introduced a novel data reduction technique which exploits the cooperation between FW and prototype reduction to improve the classification performance of NN, its storage requirements and its running time. A self-adaptive differential evolution algorithm has been used to optimize the feature weights and the positioning of the prototypes for the NN algorithm, acting as an FW scheme and a PG method, respectively. The experimental study performed allowed us to contrast the behavior of these hybrid models when dealing with a wide variety of data sets with different numbers of instances and features. These hybrid models have been able to overcome isolated prototype reduction methods due to the fact that FW changes the way in which distances between prototypes are measured, so that the adjustment of prototypes can be more refined. In the comparison between the proposed IPADECS-DEFW and SSMA-DEPGFW, we observed that in terms of reduction rate IPADECS-DEFW is the best performing hybrid model; however, in terms of accuracy rate, SSMA-DEPGFW usually obtains more accurate results. Moreover, the proposed stratified procedure proved to be a useful tool to tackle the scaling-up problem.

6.1.4 Enabling Prototype Reduction Models to deal with Big Data Classification

A MapReduce solution for prototype reduction methods, denominated MRPR, has been developed. The proposed scheme has proven to be a suitable tool to apply these methods over big classification data sets with excellent results. Otherwise, these techniques would be limited to tackling small or medium problems of no more than several thousand examples, due to memory and runtime restrictions. We have taken advantage of cloud environments, making use of the Apache Hadoop implementation [Pro13a] to develop a MapReduce framework for prototype reduction. The MapReduce paradigm has offered a simple, transparent and efficient environment in which to parallelize the prototype reduction computation. Among the three different reduce types investigated (Join, Filtering and Fusion), we have found that a reducer based on the fusion of prototypes obtains reduced sets with higher reduction rates and accuracy. The designed framework enables prototype reduction techniques to be applied to data sets with an unlimited number of instances without major algorithmic modifications, simply by using more computers if needed. It also guarantees that the objectives of prototype reduction models are maintained, so that high reduction rates are reached without significant accuracy loss.
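The overall structure just described can be summarized functionally: the map phase applies a prototype reduction method independently to disjoint splits of the training set, and the reduce phase combines the partial prototype sets, either by plain concatenation (join) or by merging them (fusion). The sketch below is a single-machine analogue of that flow written for illustration; reduce_fn, the fusion threshold and all names are hypothetical, and it is not the actual Hadoop implementation.

```python
import numpy as np

def mrpr_like(X, y, reduce_fn, n_splits=8, seed=0):
    """Single-machine analogue of a map/reduce prototype reduction flow.

    reduce_fn : callable (X_i, y_i) -> (protos_i, labels_i), i.e. any
                prototype reduction method run independently per split.
    """
    rng = np.random.default_rng(seed)
    splits = np.array_split(rng.permutation(len(X)), n_splits)
    # --- map phase: reduce each disjoint split on its own ---
    partial = [reduce_fn(X[s], y[s]) for s in splits]
    # --- reduce phase ("join"): concatenate the partial prototype sets ---
    P = np.vstack([p for p, _ in partial])
    L = np.concatenate([l for _, l in partial])
    return P, L

def fusion(P, L, thr=0.1):
    """Toy 'fusion' reducer: average groups of same-class prototypes lying
    closer than thr, shrinking the joined set further."""
    out_P, out_L, used = [], [], np.zeros(len(P), dtype=bool)
    for i in range(len(P)):
        if used[i]:
            continue
        close = np.linalg.norm(P - P[i], axis=1) < thr
        group = (~used) & (L == L[i]) & close
        used |= group
        out_P.append(P[group].mean(axis=0))
        out_L.append(L[i])
    return np.array(out_P), np.array(out_L)
```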
Some guidelines about which prototype reduction methods are more suitable for the proposed model are provided. The experimental study carried out has shown that MRPR obtains very competitive results with different prototype reduction algorithms. Its application has resulted in a very large reduction of the storage requirements and classification time of the NN rule when dealing with big data sets.

6.2 Self-labeling with Prototype Generation/Selection for Semi-Supervised Classification

This subsection presents a discussion of the results achieved in the field of self-labeling semi-supervised learning with the aid of PG and PS techniques.

6.2.1 A Survey on Self-labeling Semi-Supervised Classification

An overview of the growing field of self-labeled methods has been performed, classifying existing models according to their main features. As a result, we have designed a taxonomy evaluating the main properties of these methods. This study has allowed us to make some remarks and give guidelines for non-experts and researchers on this topic. We have identified the strengths of every family of methods, indicating which families may be improved. We have detected the main characteristics of self-labeling approaches. They include the addition mechanism (incremental, batch, amending), the number of classifiers and learning algorithms (single or multiple), and the type of view (single or multiple). According to these characteristics we have classified them into several families, starting from the type of view down to the addition mechanism. Moreover, some other properties have been remarked upon, such as the kind of confidence measure (simple, agreement and combination), the type of teaching (self- and mutual teaching) and the stopping criteria. To compare these techniques, we defined four criteria: transductive and inductive accuracy, influence of the number of labeled instances, noise tolerance and time requirements. The empirical study performed allows us to highlight several methods from among the whole set. In both transductive and inductive settings, those methods that use multiple classifiers and a single view, such as TriTraining [ZL05], Democratic-Co [GZ00] and Co-Bagging [HS10], have shown to be the best-performing methods. The experiments conducted with high-dimensional data sets and a very reduced labeled ratio show that much more work is needed for self-labeled techniques to deal with these problems. Moreover, a semi-supervised learning module has been developed for the KEEL software, integrating the analyzed methods and data sets. The developed software allows the reader to reproduce the experiments carried out and to use it as an SSL framework in which to implement new methods. It could be a useful tool for performing experimental analyses in an easier and more effective way. A web site with all the complementary material is available at http://sci2s.ugr.es/SelfLabeled, including this work's basic information, the source code of the analyzed algorithms, all the data sets involved and the complete results obtained.

6.2.2 New Self-labeling Approaches Aided by Prototype Generation/Selection Models

Two different approaches have been studied in order to improve the performance of self-labeling methods with prototype reduction techniques. In the first attempt, we analyzed the characteristics of a wide variety of noise filters, of a different nature, to improve the self-training approach.
We included some PS algorithms (denoted local approaches) and some ensemble-based models (denoted global methods). The experimental analysis performed allowed us to distinguish which characteristics of filtering techniques report a better behavior on the transductive and inductive problems. We have checked that global filters (the CF [GBLG99] and IPF [KR07] algorithms) stand out as the best performing family of filters, showing that the hypothesis agreement of several classifiers for detecting noisy examples remains robust when the ratio of available labeled data is reduced. Most local approaches need more labeled data to perform better. The use of these filters has resulted in a better performance than that achieved by the previously proposed self-training methods, SETRED and SNNRCE. Hence, the use of global filters is highly recommended in this field, and can be useful for further work with other semi-supervised approaches and other base classifiers. A web page with all the complementary material is available at http://sci2s.ugr.es/SelfTraining+Filters, including this paper's basic information, all the data sets created and the complete results obtained for each algorithm.

In our second contribution, we developed a framework called SEG-SSC to improve the performance of any self-labeled semi-supervised classification method. It is focused on the idea of generating synthetic examples with PG techniques in order to diminish the drawbacks occasioned by the absence of labeled examples, which deteriorates the efficiency of this family of methods. The wide experimental study carried out has allowed us to investigate the behavior of the proposed scheme on a high number of data sets with a varied number of instances and features. Within the proposed framework, the four self-labeled techniques used have been able to overcome the original self-labeled methods due to the fact that the addition of new labeled data implies a better diversity of multi-classifier approaches and fills out the distribution of labeled data. Thus, our proposal becomes a suitable tool for enhancing self-labeled methods.

7. Concluding Remarks

In this thesis, we have addressed several problems pursuing a common objective: the analysis, design and implementation of evolutionary PG algorithms. Two main research lines compose this dissertation: PG for supervised and for semi-supervised classification. In the first research line, our initial objective was to obtain a full understanding of the PG field in order to improve the performance of the NN classifier. To do so, we carried out a theoretical and empirical review of PG methods, focused on characterizing the traits of the related techniques. Based upon the lessons learned in this study, we proposed an iterative prototype adjustment model based on differential evolution. This technique, named IPADE (and its extension IPADECS), aims to generate the smallest reduced set of prototypes that represents the original training data. Depending on the problem tackled, this technique has been able to determine the most appropriate number of prototypes per class, providing an adequate trade-off between accuracy and reduction rates. In the experiments conducted, this model has significantly outperformed the current state of the art of PG methods. Thus, it has become a useful tool to improve the classification performance of the NN algorithm.
Aiming to further improve the performance of PG models, we have developed hybrid data reduction techniques that combine PG with PS and FW. Our first attempts were based on the combination of a previous PS stage with an optimization of the positioning of the prototypes via PG. This model has been shown to perform better than isolated PS and PG algorithms. Among the different hybridizations analyzed, the SSMA-SFLSDE algorithm was the best performing approach. The second hybrid model has taken the feature space into consideration. To do this, we designed an evolutionary FW scheme that has been integrated within the previous models (IPADECS and SSMA-SFLSDE). The proposed hybrid models have been able to overcome isolated prototype reduction methods, changing the way in which distances between prototypes are measured with the proposed FW scheme. Moreover, we have dealt with the big data problem for prototype reduction methods by proposing a novel framework based on MapReduce, named MRPR. The designed model distributes the processing of prototype reduction techniques among a set of computing elements, combining the resulting sets of prototypes in different ways. As a result, the MRPR framework has enabled prototype reduction models to be applied over big data problems with an acceptable runtime and without accuracy loss. It is our last contribution to the state of the art of PG in supervised learning.

The second part of this dissertation has been devoted to the field of self-labeling semi-supervised classification. Once again, we performed a survey of the specialized literature in order to categorize self-labeled methods, identifying their essential issues. We have observed that these methods are currently far from the performance that could be obtained if all the training instances were labeled. We also concluded that these techniques have a strong dependence on the available labeled data. To alleviate these limitations, we have made use of prototype reduction models. As a first contribution to the field of self-labeling, we performed an exhaustive study to characterize the behavior of noise filters within a self-training approach. With this study, we have reduced the number of mislabeled examples that are added to the labeled set, avoiding the learning of wrong models. It has allowed us to remove noisy data from the unlabeled data with edition-based models. Furthermore, we have identified the most appropriate filtering techniques to perform this task. Finally, we have proposed a framework called SEG-SSC that uses synthetic examples to improve the performance of self-labeling methods. This model utilizes PG techniques to create synthetic examples in order to fill out the labeled data distribution, and it has enhanced the performance of self-labeling approaches through the incorporation of the generated data at different stages of the learning process.

Conclusions

In this thesis, different problems have been addressed in pursuit of a common objective: the analysis, design and implementation of evolutionary prototype generation algorithms. Two research lines make up this dissertation: prototype generation for supervised and for semi-supervised classification. In the first research line, our initial objective was to obtain a complete understanding of the prototype generation field in order to improve the performance of the nearest neighbor classifier.
To this end, a theoretical and experimental review of prototype generation methods was carried out, focused on characterizing the traits of these techniques. Based on the lessons learned in this study, an iterative prototype adjustment model based on the differential evolution algorithm was proposed. This technique, called IPADE (and its extension IPADECS), aims to generate the smallest possible reduced set of prototypes that represents the original training set. Depending on the problem addressed, this technique has been able to determine the most adequate number of examples per class, providing a good balance between accuracy and reduction rate. In the experiments carried out, the proposed model significantly improved upon the state of the art in prototype generation. It has thus become a very useful tool for improving the performance of the nearest neighbor classifier. With the objective of further improving the performance of prototype generation models, hybrid data reduction techniques have been developed that combine prototype generation with prototype selection and feature weighting. Our first attempts were based on the combination of a preliminary instance selection stage with a stage that adjusts the positioning of these instances via prototype generation. Among the different hybridizations studied, the SSMA-SFLSDE model stood out as the best performing algorithm. The second hybrid model took the feature space into consideration. To do this, an evolutionary feature weighting scheme was designed and integrated into the previous models (IPADECS and SSMA-SFLSDE). The proposed hybrid models have been able to outperform isolated prototype reduction methods, changing the way in which distances between prototypes are measured with the designed feature weighting approach. In addition, the problem caused by very large amounts of data (big data) in prototype reduction algorithms has been addressed through the development of a MapReduce-based model, named MRPR. The designed model distributes the processing performed by prototype reduction techniques over a set of processors, combining the resulting sets of prototypes in different ways. As a result, the MRPR model has given prototype reduction techniques the capability of being executed over large data sets in an acceptable time and without loss of accuracy. This has been the last contribution to the state of the art of prototype generation in supervised learning. The second part of this thesis has been devoted to the field of semi-supervised classification with self-labeling. Once again, a complete study of the specialized literature was performed in order to characterize self-labeled methods, identifying their main problems. It has been observed that these methods are currently far from the performance that would be obtained if all the training examples were labeled. We also concluded that these techniques have a strong dependence on the available labeled data. To reduce the influence of these limitations, prototype reduction models have been used.
As a first contribution to the field of self-labeling, an exhaustive study was performed to characterize the behavior of noise filtering algorithms within a self-training approach. With this study, the number of mislabeled examples added to the labeled set has been reduced, avoiding the learning of incorrect models. This has made it possible to remove noise from the unlabeled data with edition-based models. In addition, we have identified which kinds of filters are most appropriate for this task. Finally, a scheme called SEG-SSC has been proposed that uses synthetic examples to improve the performance of self-labeling methods. This method utilizes prototype generation techniques to create synthetic examples in order to fill out the distribution of the labeled data. This model has improved the performance of self-labeling approaches through the incorporation of the generated data at different stages of the learning process.

8. Future Work

The results achieved in this PhD thesis may open new lines of future research on different challenging problems. In what follows, we present some research lines that can be addressed starting from the current studies.

Prototype reduction for fuzzy nearest neighbor classification: Besides data reduction techniques, there are several approaches for improving the performance of NN classification. One of the trends that has arisen in the last decades is the introduction of fuzzy set theory [Zad65] within the mechanics of the NN rule, giving a class membership to each instance of the training set. These techniques have been shown to perform very well, improving the accuracy of the standard NN [DGH14]. However, these techniques can be improved even further if data reduction techniques are applied. Several fuzzy versions of classical PS techniques have been proposed (such as [YC98] or [ZLZ11]). However, there is still large potential for improving fuzzy NN algorithms if advanced PG or PS algorithms are developed for this problem.

Fuzzy self-labeling approaches: Continuing with fuzzy set theory, we can find some recent fuzzy models for semi-supervised learning [MPJ11, PGDB13]. However, this topic is currently in its infancy and much more research is needed. The way in which fuzzy models can establish the class membership of an example may become an interesting approach to measure confidence at the time of labeling an unlabeled example. The underlying idea used in [DGH14] for fuzzy NN can easily be extended to self-labeled techniques.

Extending SEG-SSC to other semi-supervised learning schemes: There are many possible variations of our proposed SEG-SSC scheme that could be interesting to explore as future work. In our opinion, the use of generation techniques with self-labeled techniques is not only a new way to improve the capabilities of this family of techniques, but could also be useful for most existing semi-supervised learning algorithms, such as graph-based methods, generative models or semi-supervised support vector machines.

New big data approaches: As we commented in Section 2.4, the problem of big data affects classical and advanced data mining and data reduction techniques [BBBE11]. Therefore, the ideas established in this thesis for prototype reduction in large-scale data sets could be extended to many different fields.
For example, we could tackle the following topics:

• Feature selection/weighting: These techniques are also affected by the increase in the size of data sets. They are not able to select features, or to find appropriate weights for them, in problems in which the number of instances is very high. Using the ideas learned in the proposed MapReduce framework, we could extend these techniques to large-scale problems by carefully developing the reduce phase.
• Classification methods: Apart from the development of scalable data reduction methods, the adaptation of standard classification methods to tackle big data problems may be very interesting.
• Semi-supervised learning: Given that unlabeled data are quite easy to obtain, it is possible to find problems in which we have very little labeled data and large amounts of unlabeled data. This offers a very complex scenario, in which the learning process may not be feasible in time, but the use of these unlabeled points is highly recommended. Therefore, the design of scalable semi-supervised models is becoming an interesting topic.

Tackling other related learning paradigms with prototype reduction models: Besides standard classification problems, there are many other challenges in machine learning which are of great interest to the research community [YW06]. We consider that the lessons learned in this thesis can be used to explore new problems such as:

• One-class classification: This problem deals with situations in which not all of the classes are available at the training step [ZYY+ 14]. It assumes that the classifier is built on the basis of samples coming only from a single class, while it must discriminate between the known examples and new, unseen examples (known as outliers) that do not meet the assumptions about the concept. Applying prototype reduction models to this topic may be useful to reduce the computational complexity and the noise sensitivity of these models. In fact, there are some preliminary studies on prototype reduction for one-class classification [CdO11].
• Multi-label classification: There are many real-world applications where one instance can be assigned to multiple classes. As an example, consider the problem of assigning functions to proteins, where one protein could be labeled with multiple functions. This problem, known as multi-label learning [TK07], increases the complexity of the classification process. The adaptation of prototype reduction models to this scenario is a very challenging and interesting task.
• Semi-supervised multi-label classification: A significant challenge is to classify examples with multiple labels by using a small number of labeled samples. This problem can be tackled by combining semi-supervised learning and multi-label learning. The extension of the ideas proposed in self-labeled learning, or the application of data reduction models, could be useful to tackle this problem. There are just a few works that use data reduction in this field (for instance [LYG+ 10]).

Chapter II
Publications: Published and Submitted Papers

1. Prototype generation for supervised classification

The journal paper associated with this part is:

1.1 A Taxonomy and Experimental Study on Prototype Generation for Nearest Neighbor Classification

• I. Triguero, S. Garcı́a, F. Herrera, A Taxonomy and Experimental Study on Prototype Generation for Nearest Neighbor Classification.
IEEE Transactions on Systems, Man, and Cybernetics–Part C: Applications and Reviews 42 (1) (2012) 86–100, doi: 10.1109/TSMCC.2010.2103939.
– Status: Published.
– Impact Factor (JCR 2012): 2.548
– Subject Category: Computer Science, Artificial Intelligence. Ranking 17 / 115 (Q1).
– Subject Category: Computer Science, Cybernetics. Ranking 2 / 21 (Q1).
– Subject Category: Computer Science, Interdisciplinary Applications. Ranking 16 / 100 (Q1).

A Taxonomy and Experimental Study on Prototype Generation for Nearest Neighbor Classification

Isaac Triguero, Joaquı́n Derrac, Salvador Garcı́a, and Francisco Herrera

Abstract—The nearest neighbor (NN) rule is one of the most successfully used techniques for resolving classification and pattern recognition tasks. Despite its high classification accuracy, this rule suffers from several shortcomings in time response, noise sensitivity, and high storage requirements. These weaknesses have been tackled by many different approaches, including a good and well-known solution that we can find in the literature, which consists of the reduction of the data used for the classification rule (training data). Prototype reduction techniques can be divided into two different approaches, which are known as prototype selection and prototype generation (PG) or abstraction. The former process consists of choosing a subset of the original training data, whereas PG builds new artificial prototypes to increase the accuracy of the NN classification. In this paper, we provide a survey of PG methods specifically designed for the NN rule. From a theoretical point of view, we propose a taxonomy based on the main characteristics presented in them. Furthermore, from an empirical point of view, we conduct a wide experimental study that involves small and large datasets to measure their performance in terms of accuracy and reduction capabilities. The results are contrasted through nonparametric statistical tests. Several remarks are made to understand which PG models are appropriate for application to different datasets.

Index Terms—Classification, learning vector quantization (LVQ), nearest neighbor (NN), prototype generation (PG), taxonomy.

Manuscript received March 8, 2010; revised August 22, 2010 and October 25, 2010; accepted December 27, 2010. Date of publication February 4, 2011; date of current version December 16, 2011. This work was supported by the Spanish Ministry of Science and Technology under Project TIN2008-06681-C06-01. The work of I. Triguero was supported by a scholarship from the University of Granada. The work of J. Derrac was supported by an FPU scholarship from the Spanish Ministry of Education and Science. This paper was recommended by Associate Editor M. Last. I. Triguero, J. Derrac, and F.
Herrera are with the Department of Computer Science and Artificial Intelligence, Research Center on Information and Communications Technology, University of Granada, 18071 Granada, Spain (e-mail: [email protected]; [email protected]; herrera@decsai.ugr.es). S. Garcı́a is with the Department of Computer Science, University of Jaén, 23071 Jaén, Spain (e-mail: [email protected]). Digital Object Identifier 10.1109/TSMCC.2010.2103939

I. INTRODUCTION

The nearest neighbor (NN) algorithm [1] and its derivatives have been shown to perform well, as a nonparametric classifier, in machine-learning and data-mining (DM) tasks [2]–[4]. It is included in a more specific field of DM known as lazy learning [5], which refers to the set of methods that predict the class label from raw training data and do not build learning models. Although NN is a simple technique, it has demonstrated itself to be one of the most interesting and effective algorithms in DM [6] and pattern recognition [7], and it has been considered one of the top ten methods in DM [8]. A wide range of new real problems have been stated as classification problems [9], [10], where NN has been a great support for them, for instance, [11] and [12]. The most intuitive approach to pattern classification is based on the concept of similarity [13]–[15]; obviously, patterns that are similar, in some sense, have to be assigned to the same class. The classification process involves partitioning samples into training and testing categories. Let $x_p$ be a training sample from the $n$ available samples in the training set. Let $x_t$ be a test sample, $\omega$ be the true class of a training sample, and $\hat{\omega}$ be the predicted class for a test sample ($\omega, \hat{\omega} = 1, 2, \ldots, \Omega$). Here, $\Omega$ is the total number of classes. During the training process, we use only the true class $\omega$ of each training sample to train the classifier, while during testing, we predict the class $\hat{\omega}$ of each test sample. With the 1NN rule, the predicted class of test sample $x_t$ is set equal to the true class $\omega$ of its nearest neighbor $nn_t$, where $nn_t$ is a nearest neighbor to $x_t$ if $d(nn_t, x_t) = \min_i \{d(nn_i, x_t)\}$. For kNN, the predicted class of test sample $x_t$ is set equal to the most frequent true class among the $k$ nearest training samples. This forms the decision rule $D: x_t \rightarrow \hat{\omega}$.

Despite its high classification accuracy, it is well known that NN suffers from several drawbacks [4]. Four weaknesses could be mentioned as the main causes that prevent the successful application of this classifier. The first one is the necessity of high storage requirements in order to retain the set of examples that defines the decision rule. Furthermore, the storage of all of the data instances also leads to high computational costs during the calculation of the decision rule, caused by multiple computations of similarities between the test and training samples. Regarding the third one, NN (especially 1NN) presents low tolerance to noise, because it considers all data relevant even when the training set may contain incorrect data. Finally, NN makes predictions over existing data, and it assumes that the input data perfectly delimit the decision boundaries among classes. Several approaches have been suggested and studied in order to tackle the aforementioned drawbacks [16]. The research on similarity measures to improve the effectiveness of NN (and other related techniques based on similarities) is very extensive in the literature [15], [17], [18]. Other techniques reduce overlapping between classes [19] based on local probability centers, thus increasing the tolerance to noise. Researchers also investigate distance functions that are suitable for use under high-dimensionality conditions [20]. A successful technique that simultaneously tackles the computational complexity, storage requirements, and noise tolerance of NN is based on data reduction [21], [22].
These techniques aim to obtain a representative training set with a lower size compared to the original one and with a similar or even higher classification accuracy for new incoming data. In the literature, these are known as reduction techniques [21], instance selection [23]–[25], prototype selection (PS) [26], and prototype generation (PG) [22], [27], [28] (the latter also known as prototype abstraction methods [29], [30]). Although the PS and PG problems are frequently confused and considered to be the same problem, they relate to different problems. PS methods concern the identification of an optimal subset of representative objects from the original training data by discarding noisy and redundant examples. PG methods, by contrast, besides selecting data, can generate and replace the original data with new artificial data [27]. This process allows them to fill regions in the domain of the problem that have no representative examples in the original data. Thus, PS methods assume that the best representative examples can be obtained from a subset of the original data, whereas PG methods generate new representative examples if needed, thus also tackling the fourth weakness of NN mentioned earlier. The PG methods that we study in this survey are those specifically designed to enhance NN classification. Nevertheless, many other techniques could be used for the same goal as PG methods but are out of the scope of this survey. For instance, clustering techniques allow us to obtain a representative subset of prototypes or cluster centers, but they are obtained for more general purposes. A very good review of clustering can be found in [31]. Nowadays, there is no general categorization for PG methods. In the literature, a brief taxonomy for prototype reduction schemes was proposed in [22]. It includes both PS and PG methods and compares them in terms of classification accuracy and reduction rate. In that paper, the authors divide prototype reduction schemes into creative (PG) and selecting (PS) methods, but it is not exclusively focused on PG methods and, especially, on studying the similarities among them. Furthermore, a considerable number of PG algorithms have been proposed, and some of them are rather unknown. The first approach we can find in the literature, called PNN [32], is based on merging prototypes. One of the most important families of methods is that based on learning vector quantization (LVQ) [33]. Other methods are based on splitting the dimensional space [34], and even evolutionary algorithms and particle swarm optimization [35] have been used to tackle this problem [36], [37]. Because of the absence of a focused taxonomy in the literature, we have observed that newly proposed algorithms are usually compared with only a subset of the complete family of PG methods and, in most of the studies, no rigorous analysis has been carried out. These are the reasons that motivate the global purpose of this paper, which can be divided into three objectives.
1) To propose a new and complete taxonomy based on the main properties observed in the PG methods. The taxonomy will allow us to know the advantages and drawbacks from a theoretical point of view.
2) To make an empirical study that analyzes the PG algorithms in terms of accuracy, reduction capabilities, and time complexity. Our goal is to identify the best methods in each family, depending on the size and type of the datasets, and to stress the relevant properties of each one.
3) To illustrate through graphical representations the trend of generation performed by the schemes studied, in order to justify the results obtained in the experiments.

The experimental study will include a statistical analysis based on nonparametric tests, and we will conduct experiments that involve a total of 24 PG methods and 59 small- and large-size datasets. The graphical representations of selected data will be done by using a two-dimensional (2-D) dataset called banana, with moderate complexity features. This paper is organized as follows. A description of the properties and an enumeration of the methods, as well as related and advanced work on PG, are given in Section II. Section III presents the taxonomy proposed. In Section IV, we describe the experimental framework, and Section V examines the results obtained in the empirical study and presents a discussion of them. Graphical representations of generated data by PG methods are illustrated in Section VI. Finally, Section VII concludes the paper.

II. PROTOTYPE GENERATION: BACKGROUND

PG builds new artificial examples from the training set; a formal specification of the problem is the following. Let $x_p$ be an instance, where $x_p = (x_{p1}, x_{p2}, \ldots, x_{pm}, x_{p\omega})$, with $x_p$ belonging to a class $\omega$ given by $x_{p\omega}$, in an $m$-dimensional space in which $x_{pi}$ is the value of the $i$th feature of the $p$th sample. Then, let us assume that there is a training set $TR$, which consists of $n$ instances $x_p$, and a test set $TS$ composed of $s$ instances $x_t$, with $x_{t\omega}$ unknown. The purpose of PG is to obtain a prototype generated set $TG$, which consists of $r$, $r < n$, prototypes, which are either selected or generated from the examples of $TR$. The prototypes of the generated set are determined so as to represent efficiently the distributions of the classes and to discriminate well when used to classify the training objects. Their cardinality should be sufficiently small to reduce both the storage and the evaluation time spent by an NN classifier. In this paper, we will focus on the use of the NN rule, with $k = 1$, to classify the examples of $TR$ and $TS$ by using $TG$ as reference.

This section presents an overview of the PG problem. Three main topics will be discussed in the following.
1) In Section II-A, the main characteristics, which will define the categories of the taxonomy proposed in this paper, will be outlined. They refer to the type of reduction, resulting generation set, generation mechanisms, and evaluation of the search. Furthermore, some criteria to compare PG methods are established.
2) In Section II-B, we briefly enumerate all the PG methods proposed in the literature. The complete and abbreviated names will be given together with the proposed reference.
3) Finally, Section II-C explores other areas related to PG and gives an interesting summary of advanced work in this research field.

A. Main Characteristics in Prototype Generation Methods

This section establishes different properties of PG methods that will be necessary for the definition of the taxonomy in the following section. The issues discussed here include the type of reduction, resulting generation set, generation mechanisms, and evaluation of the search. Finally, some criteria will be set in order to compare the PG methods.
1) Type of Reduction: PG methods search for a reduced set $TG$ of prototypes to represent the training set $TR$; there is a variety of schemes by which the size of $TG$ can be established.
a) Incremental: An incremental reduction starts with an empty reduced set $TG$, or with only some representative prototypes from each class. Then, a succession of additions of new prototypes or modifications of earlier prototypes occurs. One important advantage of this kind of reduction is that these techniques can be faster and need less storage during the learning phase than nonincremental algorithms. Furthermore, this type of reduction allows the technique to adequately establish the number of prototypes required for each dataset. Nevertheless, it could obtain adverse results due to the requirement of a high number of prototypes to adjust $TR$, thus producing overfitting.
b) Decremental: The decremental reduction begins with $TG = TR$, and then the algorithm starts to reduce $TG$ or modify the prototypes in $TG$. It can be accomplished by following different procedures, such as merging, moving or removing prototypes, and relabeling classes. One advantage observed in decremental schemes is that all training examples are available for examination when making a decision. On the other hand, a shortcoming of these kinds of methods is that they usually present a high computational cost.
c) Fixed: It is common to use a fixed reduction in PG. These methods establish the final number of prototypes for $TG$ using a previously defined user parameter related to the percentage of retention of $TR$. This is the main drawback of this approach, apart from the fact that it is very dependent on each dataset tackled. However, these techniques can focus solely on increasing the classification accuracy.
d) Mixed: A mixed reduction begins with a preselected subset $TG$, obtained either by random selection with fixed reduction or by the run of a PS method; then, additions, modifications, and removals of prototypes are done in $TG$. This type of reduction combines the advantages of those previously seen, allowing several rectifications to solve the problem of fixed reduction. However, these techniques are prone to overfitting the data, and they usually have a high computational cost.

2) Resulting Generation Set: This factor refers to the resulting set generated by the technique, i.e., whether the final set will retain border points, central points, or both.
a) Condensation: This set includes the techniques that return a reduced set of prototypes closer to the decision boundaries, also called border points. The reason behind retaining border points is that internal points do not affect the decision boundaries as much as border points and, thus, can be removed with relatively little effect on classification. The idea behind these methods is to preserve the accuracy over the training set, but the generalization accuracy over the test set can be negatively affected. Nevertheless, the reduction capability of condensation methods is normally high, because border points are fewer than internal points in most of the data.
b) Edition: These schemes instead seek to remove or modify border points. They act over points that are noisy or do not agree with their NNs, thus leaving smoother decision boundaries behind. However, such algorithms do not remove internal points that do not necessarily contribute to the decision boundaries.
The effect obtained is related to the improvement of generalization accuracy on test data, although the reduction rate obtained is lower.
c) Hybrid: Hybrid methods try to find the smallest set $TG$ that maintains or even increases the generalization accuracy on test data. To achieve this, they allow modifications of internal and border points based on specific criteria followed by the algorithm. The NN classifier is highly adaptable to these methods, obtaining great improvements even with a very small reduced set of prototypes.

3) Generation Mechanisms: This factor describes the different mechanisms adopted in the literature to build the final $TG$ set.
a) Class relabeling: This generation mechanism consists of changing the class labels of samples from $TR$ that could be suspected of containing errors and of belonging to other classes. Its purpose is to cope with all types of imperfections in the training set (mislabeled, noisy, and atypical cases). The effect obtained is closely related to the improvement in generalization accuracy on the test data, although the reduction rate is kept fixed.
b) Centroid based: These techniques are based on generating artificial prototypes by merging a set of similar examples. The merging process is usually made from the computation of averaged attribute values over a selected set, yielding the so-called centroids. The identification and selection of the set of examples are the main concerns of the algorithms that belong to this category. These methods can obtain a high reduction rate, but they are also associated with accuracy losses.
c) Space splitting: This set includes the techniques based on different heuristics to partition the feature space, along with several mechanisms to define new prototypes. The idea consists of dividing $TR$ into some regions, which will be replaced with representative examples establishing the decision boundaries associated with the original $TR$. This mechanism works at the space level, because the partitions are found in order to discriminate, as well as possible, a set of examples from the others, whereas centroid-based approaches work at the data level, mainly focusing on the optimal selection of a set of examples to be treated. The reduction capabilities of these techniques usually depend on the number of regions that are needed to represent $TR$.
d) Positioning adjustment: The methods that belong to this family aim to correct the position of a subset of prototypes from the initial set by using an optimization procedure. New positions of the prototypes can be obtained by using the movement idea in the $m$-dimensional space, adding or subtracting some quantities to the attribute values of the prototypes. This mechanism is usually associated with a fixed or mixed type of reduction.

4) Evaluation of Search: The NN rule itself is an appropriate heuristic to guide the search of a PG method. The decisions made by the heuristic must have an evaluation measure that allows the comparison of different alternatives. The evaluation-of-search criterion depends on whether or not NN is used in such an evaluation.
a) Filter: We refer to filter techniques when they do not use the NN rule during the evaluation phase. Different heuristics are used to obtain the reduced set. They can be faster than NN-based evaluation, but the accuracy obtained could be worse.
b) Semiwrapper: NN is used on partial data to determine the criteria for making a certain decision. Thus, NN performance can be measured over localized data, which will contain most of the prototypes influenced by the decision. It is an intermediate approach, where a tradeoff between efficiency and accuracy is expected.
c) Wrapper: In this case, the NN rule fully guides the search by using the complete training set with the leave-one-out validation scheme. The conjunction of these two factors allows us to obtain a good estimator of generalization accuracy, thus obtaining better accuracy over test data. However, each decision involves a complete computation of the NN rule over the training set, and the evaluation phase can be computationally expensive.

5) Criteria to Compare PG Methods: When comparing PG methods, there are a number of criteria that can be used to weigh the relative strengths and weaknesses of each algorithm. These include storage reduction, noise tolerance, generalization accuracy, and time requirements.
1) Storage reduction: One of the main goals of PG methods is to reduce storage requirements. Another goal, closely related to this, is to speed up classification. A reduction in the number of stored instances will typically yield a corresponding reduction in the time it takes to search through these examples and classify a new input vector.
2) Noise tolerance: Two main problems may occur in the presence of noise. The first is that very few instances will be removed, because many instances are needed to maintain the noisy decision boundaries. Second, the generalization accuracy can suffer, especially if noisy instances are retained instead of good instances, or if they are not relabeled with the correct class.
3) Generalization accuracy: A successful algorithm will often be able to significantly reduce the size of the training set without significantly reducing the generalization accuracy.
4) Time requirements: Usually, the learning process is carried out just once on a training set; therefore, this may not seem a very important evaluation criterion. However, if the learning phase takes too long, it can become impractical for real applications.

B. Prototype Generation Methods

More than 25 PG methods have been proposed in the literature. This section is devoted to enumerating and designating them according to the standard followed in this paper. For more details on their implementations, the reader can visit the URL http://sci2s.ugr.es/pgtax. Implementations of the algorithms in Java can be found in the KEEL software [38].

[Table I: PG methods reviewed.]

Table I presents an enumeration of the PG methods reviewed in this paper. The complete name, abbreviation, and reference are provided for each one. In the case of there being more than one method in a row, they were proposed together, and the best performing method (indicated by the respective authors) is depicted in bold. We will use the best representative method of each proposed paper; therefore, only the methods in bold, when more than one method is proposed, will be compared in the experimental study.

C. Related and Advanced Work

Nowadays, much research to enhance NN through data preprocessing is common and highly demanded. PG could represent a feasible and promising technique to obtain the expected
results, which justifies its relationship to other methods and problems. This section provides a brief review of other topics closely related to PG and describes other interesting work and future trends that have been studied over the past few years.
1) Prototype selection: With the same objectives as PG, storage reduction and classification accuracy improvement, these methods are limited to selecting examples from the training set. More than 50 methods can be found in the literature. In general, three kinds of methods are usually differentiated, based on edition [54], condensation [55], or hybrid models [21], [56]. Advanced proposals can be found in [24] and [57]–[59].
2) Instance and rule learning hybridizations: This includes all the methods that simultaneously use instances and rules in order to compute the classification of a new object. If the values of the object are within the range of a rule, its consequent predicts the class; otherwise, if no rule matches the object, the most similar rule or instance stored in the database is used to estimate the class. Similarity is viewed as the closest rule or instance based on a distance measure. In short, these methods can generalize an instance into a hyperrectangle or rule [60], [61].
3) Hyperspherical prototypes: This area [62] studies the use of hyperspheres to cover the training patterns of each class. The basic idea is to cluster the space into several objects, each of them corresponding to only one class, and the class of the nearest object is assigned to the test example.
4) Weighting: This task consists of applying weights to the instances of the training set, thus modifying the distance measure between them and any other instance. This technique can be integrated with PS and PG methods [16], [63], [64], [65], [66] to improve the accuracy in classification problems and to avoid overfitting. A complete review dealing with this topic can be found in [67].
5) Distance functions: Several distance metrics have been used with NN, especially when working with categorical attributes [68]. Many different distance measures try to optimize the performance of NN [15], [64], [69], [70], and they have successfully increased the classification accuracy. Advanced work is based on adaptive distance functions [71].
6) Oversampling: This term is frequently used in learning with imbalanced classes [72], [73], and is closely related to undersampling [74]. Oversampling techniques replicate and generate artificial examples that belong to the minority classes in order to strengthen the presence of minority samples and to increase the performance over them. SMOTE [75] is the most well-known oversampling technique, and it has been shown to be very effective in many domains of application [76].

III. PROTOTYPE GENERATION: TAXONOMY

The main characteristics of the PG methods have been described in Section II-A, and they can be used to categorize the PG methods proposed in the literature. The type of reduction, resulting generation set, generation mechanisms, and evaluation of the search constitute a set of properties that define each PG method. This section presents the taxonomy of PG methods based on these properties.

[Fig. 1: Prototype generation map.]

In Fig. 1, we show the PG map with the representative methods proposed in each paper, ordered in time. We refer to representative methods as those which are preferred by the authors or have reported the best results in the corresponding proposal paper.
Some interesting remarks can be drawn from Fig. 1.
1) Only two class-relabeling methods have been proposed as PG algorithms. The reason is that both methods obtain great accuracy results with this approach, but the underlying concept of these methods does not achieve high reduction rates, which is one of the most important objectives of PG. Furthermore, it is important to point out that both algorithms are based on decremental reduction and have noise-filtering purposes.
2) The condensation techniques constitute a wide group. They usually use a semiwrapper evaluation with any type of reduction. This is considered a classic idea, due to the fact that, in recent years, hybrid models have been preferred over condensation techniques, with few exceptions. ICPL2 was the first PG method with a hybrid approach, combining edition and condensation stages.
3) Recent efforts in proposing positioning adjustment algorithms are notable for mixed reduction. Most of the methods following this scheme are based on LVQ, and the recent approaches try to alleviate the main drawback of fixed reduction.
4) There are many efforts in centroid-based techniques because they have reported a great synergy with the NN rule, since the first algorithm, PNN. Furthermore, many of them are based on simple and intuitive heuristics, which allow them to obtain a reduced set with high-quality accuracy. By contrast, those with decremental and mixed reduction are slow techniques.
5) Wrapper evaluation appeared a few years ago and is only present in hybrid approaches. This evaluation search is intended to optimize a selection without taking into account computational costs.

[Fig. 2: Prototype generation hierarchy.]

Fig. 2 illustrates the categorization following a hierarchy based on this order: generation mechanisms, resulting generation set, type of reduction, and, finally, evaluation of the search. The properties studied here can help to understand how the PG algorithms work. In the following sections, we will establish which methods perform best for each family, considering several performance metrics within a wide experimental framework.

IV. EXPERIMENTAL FRAMEWORK

In this section, we show the factors and issues related to the experimental study. We provide the measures employed to evaluate the performance of the algorithms (see Section IV-A), details of the problems chosen for the experimentation (see Section IV-B), the parameters of the algorithms (see Section IV-C), and, finally, the statistical tests employed to contrast the results obtained (see Section IV-D).

A. Performance Measures for Standard Classification

In this study, we deal with multiclass datasets. In these domains, two measures are widely used because of their simplicity and successful application.
B. Datasets

In the experimental study, we selected 59 datasets from the University of California, Irvine (UCI) repository [79] and the KEEL dataset repository1 [38]. Table II summarizes the properties of the selected datasets. It shows, for each dataset, the number of examples (#Ex.), the number of attributes (#Atts.), the number of numerical (#Num.) and nominal (#Nom.) attributes, and the number of classes (#Cl.). The datasets are grouped into two categories depending on their size: small datasets have fewer than 2000 instances, and large datasets have more than 2000 instances. The datasets considered are partitioned using the tenfold cross-validation (10-fcv) procedure.

TABLE II. SUMMARY DESCRIPTION FOR CLASSIFICATION DATASETS
TABLE III. CONFUSION MATRIX FOR AN Ω-CLASS PROBLEM

C. Parameters

Many different method configurations have been established by the authors in each paper for the PG techniques. In our experimental study, we have used the parameters defined in the reference where they were originally described, assuming that the values of the parameters were optimally chosen. The configuration parameters, which are common to all problems, are shown in Table IV. Note that some PG methods have no parameters to be fixed; therefore, they are not included in this table.

In most of the techniques, Euclidean distance is used as the similarity function to decide which neighbors are closest. Furthermore, to avoid problems with a large number of attributes and distances, all datasets have been normalized between 0 and 1. This normalization process allows all the PG methods to be applied to each dataset, independent of the types of attributes.

1 http://sci2s.ugr.es/keel/datasets.

D. Statistical Tests for Performance Comparison

In this paper, we use hypothesis-testing techniques to provide statistical support for the analysis of the results [80], [81]. Specifically, we use nonparametric tests, because the initial conditions that guarantee the reliability of the parametric tests may not be satisfied, causing the statistical analysis to lose credibility with these parametric tests. These tests are suggested in the studies presented in [80] and [82]–[84], where their use in the field of machine learning is highly recommended. The Wilcoxon test [82], [83] is adopted considering a level of significance of α = 0.1. More information about statistical tests and the results obtained can be found on the web site associated with this paper (http://sci2s.ugr.es/pgtax).
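As an illustration of how such a pairwise comparison can be run, the sketch below applies the Wilcoxon signed-rank test, as implemented in scipy.stats, to the per-dataset accuracies of two hypothetical methods. The accuracy values are made-up placeholders, not results from this study.

    from scipy.stats import wilcoxon

    # Accuracies of two PG methods over the same datasets (illustrative numbers).
    acc_method_a = [0.81, 0.77, 0.92, 0.68, 0.85, 0.74, 0.90, 0.79]
    acc_method_b = [0.78, 0.76, 0.90, 0.70, 0.83, 0.71, 0.88, 0.80]

    stat, p_value = wilcoxon(acc_method_a, acc_method_b)
    alpha = 0.1  # level of significance used throughout the study
    if p_value < alpha:
        print(f"Significant difference (p = {p_value:.3f})")
    else:
        print(f"No significant difference detected (p = {p_value:.3f})")

Because the test is nonparametric, it only requires the paired differences to be meaningfully ranked, which suits accuracy results whose distribution across datasets is unknown.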
E. Other Considerations

We want to outline that the implementations are based only on the descriptions and specifications given by the respective authors in their papers. No advanced data structures or enhancements for improving the efficiency of the PG methods have been used. All methods are available in the KEEL software [38].

V. ANALYSIS OF RESULTS

This section presents the average results collected in the experimental study and some discussion of them; the complete results can be found on the web page associated with this paper. The study is divided into two parts: analysis of the results obtained over small datasets (see Section V-A) and over large datasets (see Section V-B). Finally, a global analysis is added in Section V-C.

TABLE IV. PARAMETER SPECIFICATION FOR ALL THE METHODS EMPLOYED IN THE EXPERIMENTATION

A. Analysis and Empirical Results of Small-Size Datasets

Table V presents the average results obtained by the PG methods over the 40 small-size datasets. Red. denotes the reduction rate achieved; train Acc. and train Kap. present the accuracy and kappa obtained on the training data, respectively; tst Acc. and tst Kap. present the accuracy and kappa obtained over the test data. Finally, Time denotes the average time elapsed in seconds to finish a run of a PG method. The algorithms are ordered from best to worst for each type of result. Algorithms highlighted in bold are those which obtain the best result in their corresponding family, according to the first level of the hierarchy in Fig. 2.

Fig. 3 depicts the opposition between the two objectives: reduction and test accuracy. Each algorithm located inside the graphic gets its position from the average values of each measure evaluated (the exact position corresponds to the beginning of the name of the algorithm). Across the graphic, there is a line that represents the threshold of test accuracy achieved by the 1NN algorithm without preprocessing. Note that in Fig. 3(a), the names of some PG methods overlap; hence, Fig. 3(b) shows this overlapping zone. To complete the set of results, the web site associated with this paper contains the results of applying the Wilcoxon test to all possible comparisons among all the PG methods considered in small datasets.

Observing Table V, Fig. 3, and the Wilcoxon test, we can point out some interesting facts as follows.
1) Some classical algorithms are at the top in accuracy and kappa rate. For instance, GENN, GMCA, and MSE obtain better results than other recent methods over test data. However, these techniques usually have a poor associated reduction rate. We can observe this in the Wilcoxon test, where classical methods significantly outperform other recent approaches in terms of accuracy and kappa rates. However, in terms of the Acc. ∗ Red. and Kap. ∗ Red. measures, these methods typically do not outperform recent techniques.
2) PSO and ENPC can be stressed from the positioning adjustment family as the best performing methods.
Each of them belongs to a different subfamily: fixed and mixed reduction, respectively. PSO focuses on improving the classification accuracy, and it obtains good generalization capability. On the other hand, ENPC has overfitting as its main drawback, which is clearly discernible from Table V. In general, LVQ-based approaches obtain worse accuracy rates than 1NN, but the reduction rate they achieve is very high. MSE and HYB are the most outstanding techniques belonging to the subgroup of condensation and positioning adjustment.
3) With respect to class-relabeling methods, GENN obtains better accuracy/kappa rates but worse reduction rates than Depur. However, the statistical test informs us that GENN does not outperform the Depur algorithm in terms of accuracy and kappa rate. Furthermore, when the reduction rate is taken into consideration, i.e., when the statistical test is based on the Acc. ∗ Red. and Kap. ∗ Red. measures, the Depur algorithm clearly outperforms GENN.
4) The decremental approaches belonging to the centroids family require high computation times but usually offer good reduction rates. MCA and PNN tend to overfit the data, but GMCA obtains excellent results.
5) In the whole centroids family, two methods deserve particular mention: ICPL2 and GMCA. Both generate a reduced prototype set with good accuracy rates on test data. The other approaches, based on fixed and incremental reduction, are less appropriate for improving the effectiveness of 1NN, but they are very fast and offer much reduced generated sets.
6) Regarding space-splitting approaches, several differences can be observed. RSP3 is an algorithm based on Chen's algorithm, but it tries to avoid drastic changes in the form of the decision boundaries, and it produces a good tradeoff between reduction and accuracy. Although the POC algorithm is a relatively modern technique, it does not obtain great results. We can explain these results by the fact that the α parameter is very sensitive for each dataset. Furthermore, it is quite slow when tackling datasets with more than two classes.

TABLE V. AVERAGE RESULTS OBTAINED BY THE PG METHODS OVER SMALL DATASETS

Fig. 3. Accuracy in test versus reduction in small datasets. (a) All PG methods. (b) Zoom over the overlapping reduction-rate zone.

7) The best methods in accuracy/kappa rates for each of the families are PSO, GENN, ICPL2, and RSP3, respectively, and five methods outperform 1NN in accuracy.
8) In general, hybrid methods obtain the best results in terms of accuracy and reduction rate.
9) Usually, there is no difference between the rankings obtained with accuracy and kappa rates, except for some concrete algorithms. For example, we can observe that 1NN obtains a lower ranking with the kappa measure; this probably indicates that 1NN benefits from random hits.

Furthermore, on the web site associated with this paper, we provide an analysis of the results depending on the type of attributes of the datasets. We show the results in accuracy/kappa rate for all PG methods, differentiating between numerical, nominal, and mixed datasets. In numerical and nominal datasets, all attributes must be numerical and nominal, respectively, whereas mixed datasets are those with both numerical and nominal attributes. Observing these tables, we want to outline different properties of the PG methods.
1) In general, there is no difference in performance between numerical, nominal, and mixed datasets, except for some concrete algorithms. For example, in mixed datasets, we can see that a class-relabeling method, GENN, is at the top because it does not modify the attributes. However, in numerical datasets, PSO is the best performing method, indicating that the positioning adjustment strategy is usually well adapted to numerical datasets.
2) In fact, comparing these tables, we observe that some representative techniques of the positioning adjustment family, such as PSO, MSE, and ENPC, have an accuracy/kappa rate close to 1NN. However, over nominal and mixed datasets, their accuracy rates decrease.
3) The ICPL2 and GMCA techniques obtain good accuracy/kappa rates independent of the type of input data.

Finally, we perform a study depending on the number of classes of the datasets. On the web site associated with this paper, we show the average results in accuracy/kappa rate, differentiating between binary and multiclass datasets. We can draw several conclusions from the results collected, which are as follows.
1) Eight techniques outperform 1NN in accuracy when they tackle binary datasets. However, over multiclass datasets, only three techniques are able to overcome 1NN.
2) Centroid-based techniques usually perform well when dealing with multiclass datasets. For instance, we can highlight the MCA, SGP, PNN, ICPL2, and GMCA techniques, which increase their respective rankings with multiclass datasets.
3) The GENN and ICPL2 techniques obtain good accuracy/kappa rates independent of the number of classes.
4) PSCSA behaves well with binary datasets. However, over multiclass datasets, its performance decreases.
5) Some methods present significant differences between the accuracy and kappa measures when dealing with binary datasets. We can stress MSE, Depur, Chen, and BTS3 as techniques penalized by the kappa measure.

B. Analysis and Empirical Results of Large-Size Datasets

This section presents the study and analysis of large-size datasets. The goal of this study is to analyze the effect of scaling up the data on PG methods. For time complexity reasons, several algorithms cannot be run over large datasets. PNN, MCA, GMCA, ICPL2, and POC are extremely slow techniques, and their time complexity quickly increases when the data scale up or contain more than five classes.

TABLE VI. AVERAGE RESULTS OBTAINED BY THE PG METHODS OVER LARGE DATASETS

Table VI shows the average results obtained, and Fig. 4 illustrates the comparison between the accuracy and reduction rates of the PG methods over large-size datasets. Finally, the web site associated with this paper contains the results of applying the Wilcoxon test over all possible comparisons among all the PG methods considered in large datasets. These tables allow us to highlight some observations of the results obtained as follows.
1) Only the GENN approach outperforms 1NN in accuracy/kappa rate.
2) Some methods present clear differences when dealing with large datasets. For instance, we can highlight the PSO and RSP3 techniques. The former may suffer from a lack of convergence: the performance obtained on training data is only slightly higher than that obtained by 1NN, which may be a sign that more iterations are needed to tackle large datasets.
On the other hand, the techniques based on space partitioning present some drawbacks when the data scale up and are made up of more attributes. This is the case with RSP3.
3) In general, LVQ-based methods do not work well when the data scale up.

Fig. 4. Accuracy in test versus reduction in large datasets. (a) All PG methods considered over large datasets. (b) Zoom over the overlapping reduction-rate zone.

4) BTS3 stands out as the best centroid-based method over large-size datasets, because the best performing ones over small datasets were also the most time-complex and cannot be run here.
5) Although ENPC overfits the data, it is the best performing method considering the tradeoff between accuracy/kappa and reduction rates. PSO can also be stressed as a good candidate for this type of dataset.
6) There are no significant differences between the accuracy and kappa rankings when dealing with large datasets.

Again, we differentiate between numerical, nominal, and mixed datasets. Complete results can be found on the web site associated with this paper. Observing these results, we want to outline different properties of PG methods over large datasets. Note that there is only one dataset with mixed attributes; for this reason, we focus this analysis on the differences between numerical and nominal datasets.
1) When only numerical datasets are taken into consideration, three algorithms outperform the 1NN rule: GENN, PSO, and ENPC.
2) Over nominal large datasets, no PG method outperforms 1NN.
3) MixtGauss and AMPSO are highly conditioned by the type of input data, preferring numerical datasets. By contrast, RSP3 is better adapted to nominal datasets.

Finally, we again analyze the behavior of the PG techniques depending on the number of classes, in this case over large datasets. The web site associated with this paper presents the results. Observing these results, we can point out several comments.
1) Over binary large datasets, four algorithms outperform 1NN. However, when the PG techniques tackle multiclass datasets, no PG method overcomes 1NN.
2) When dealing with large datasets, there are no important differences between the accuracy and kappa rankings on binary datasets.
3) Class-relabeling methods perform well independent of the number of classes.

C. Global Analysis

This section gives a global view of the results obtained. As a summary, we want to outline several remarks on the use of PG, because the choice of a certain method depends on various factors.
1) Several PG methods can be emphasized according to the test accuracy/kappa obtained: PSO, ICPL2, ENPC, and GENN. In terms of reduction capabilities, PSCSA and AVQ obtain the best results, but they offer poor accuracy rates. Taking into consideration the computational cost, we can consider DSM, LVQ3, and VQ to be the fastest algorithms.
2) Edition schemes usually outperform the 1NN classifier, but the number of prototypes in the resulting set is too high. This fact could be prohibitive over large datasets because there is no significant reduction. Furthermore, other PG methods have shown that it is possible to preserve high accuracy with a better reduction rate.
3) A high reduction rate serves no purpose if there is no minimum guarantee of performance accuracy. This is the case for PSCSA and AVQ.
Nevertheless, MSE offers excellent reduction rates without losing performance accuracy.
4) For the tradeoff between reduction and accuracy rate, PSO has reported the best results over small-size datasets. When dealing with large datasets, the ENPC approach seems to be the most appropriate one.
5) A good reduction–accuracy balance is difficult to achieve with a fast algorithm. Considering this restriction, we could say that RSP3 yields generated sets with a good tradeoff among reduction, accuracy, and time complexity.

VI. VISUALIZATION OF RESULTING DATA SETS: A CASE STUDY BASED ON THE BANANA DATASET

This section illustrates the subsets resulting from some of the PG algorithms considered in this study. To do this, we focus on the banana dataset, which contains 5300 examples in the complete set. It is an artificial dataset of two classes, composed of three well-defined clusters of instances of class −1 and two clusters of class 1. Although the borders among the clusters are clear, there is a high overlap between both classes. The complete dataset is illustrated in Fig. 5(a).

Fig. 5. Generated data sets for the banana dataset. (a) Banana original (0.8751, 0.7476). (b) GENN (0.0835, 0.8826, 0.7626). (c) LVQ3 (0.9801, 0.8370, 0.6685). (d) Chen (0.9801, 0.8792, 0.7552). (e) RSP3 (0.8962, 0.8755, 0.7482). (f) BTS3 (0.9801, 0.8557, 0.7074). (g) SGP (0.9961, 0.6587, 0.3433). (h) PSO (0.9801, 0.8819, 0.7604). (i) ENPC (0.7485, 0.8557, 0.7086).

The pictures of the sets generated by some PG methods can help to visualize and understand how they work and the results obtained in the experimental study. The reduction rate and the accuracy and kappa values on test data registered in the experimental study are specified for each one. For the original dataset, the two values indicated correspond to the accuracy and kappa with 1NN.
1) Fig. 5(b) depicts the data generated by the GENN algorithm. It belongs to the edition approaches, and the generated subset differs only slightly from the original dataset. Samples found within the class boundaries can either be removed or be relabeled. It is noticeable that the clusters of different classes are a little more separated.
2) Fig. 5(c) shows the resulting subset of the classical LVQ3 condensation algorithm. Most of the points are moved to define the class boundaries, but a few interior points are also used. The accuracy and kappa decrease with respect to the original, as is usually the case with condensation algorithms.
3) Fig. 5(d) and (e) represents the sets generated by the Chen and RSP3 methods, respectively. These methods are based on a space-splitting strategy, but the first one requires the specification of the final size of the generated set, while the latter does not. We can see that the Chen method generates prototypes keeping a homogeneous distribution of points in the space. RSP3 was proposed to fix some problems observed in the Chen method, but on this concrete dataset it is worse in accuracy/kappa rates than its ancestor. However, the reduction type of Chen's method is fixed, and it is very dependent on the dataset tackled.
4) Fig. 5(f) and (g) represents the sets of data generated by the BTS3 and SGP methods. Both techniques are cluster-based and present very high reduction rates over this dataset.
SGP does not work well on this dataset because it promotes the removal of prototypes and uses an incremental order, which does not allow it to choose the most appropriate decision. BTS3 uses a fixed reduction type; thus, it focuses on improving accuracy rates, but its generation mechanisms are not well suited to this type of dataset.
5) Fig. 5(h) and (i) illustrates the sets of data generated by the PSO and ENPC methods. They are wrapper and hybrid methods of the position-adjusting family and iterate many times to obtain an optimal reallocation of prototypes. PSO requires the final size of the selected subset as a parameter, and this parameter is heavily conditioned by the complexity of the dataset addressed. In the banana case, keeping 2% of the prototypes seems to work well. On the other hand, ENPC can adjust the number of prototypes required to fit a specific dataset. In the case study presented, we can see that it obtains sets similar to those obtained by the Chen approach, because it also fills the regions with a homogeneous distribution of generated prototypes. In the decision boundaries, the density of prototypes is increased, which may produce quite noisy samples for the further classification of the test data. This explains its poor behavior on this problem with respect to PSO, the lower reduction rate achieved, and the decrease in accuracy/kappa rates with regard to the original dataset classified with 1NN.

We have seen the resulting datasets of condensation, edition, and hybrid methods and of different generation mechanisms with some representative PG methods. Although the methods can be categorized into a specific family, they do not follow a specific behavior pattern, since some of the condensation techniques may generate interior points (as in LVQ3), others clusters of data (RSP3), or even points with a homogeneous distribution in space (Chen or ENPC). Nevertheless, the visual characteristics of the generated sets are also of interest and can help to guide the choice of a PG method.

VII. CONCLUSION

In this paper, we have provided an overview of the PG methods proposed in the literature. We have identified their basic and advanced characteristics. Furthermore, existing work and related fields have been reviewed. Based on the main characteristics studied, we have proposed a taxonomy of the PG methods. The most important methods have been empirically analyzed over small and large classification datasets. To illustrate and strengthen the study, some graphical representations of the selected data subsets have been drawn, and a statistical analysis based on nonparametric tests has been employed. Several remarks and guidelines can be suggested.
1) A researcher who needs to apply a PG method should know the main characteristics of these kinds of methods in order to choose the most suitable one. The taxonomy proposed and the empirical study can help a researcher to make this decision.
2) When proposing a new PG method, a rigorous analysis should be carried out to compare it with the most well-known approaches and with those that fit the basic properties of the new proposal. To do this, the taxonomy and the analysis of influence in the literature can help guide a future proposal toward the correct methods.
3) This paper helps nonexperts in PG methods to differentiate between them, to make an appropriate decision about their application, and to understand their behavior.
4) It is important to know the main advantages of each PG method. In this paper, many PG methods have been empirically analyzed, but a specific conclusion cannot be drawn regarding the best performing method. This choice depends on the problem tackled, but the results offered in this paper can help to reduce the set of candidates.

REFERENCES

[1] T. M. Cover and P. E. Hart, "Nearest neighbor pattern classification," IEEE Trans. Inf. Theory, vol. IT-13, no. 1, pp. 21–27, Jan. 1967.
[2] E. Alpaydin, Introduction to Machine Learning, 2nd ed. Cambridge, MA: MIT Press, 2010.
[3] V. Cherkassky and F. Mulier, Learning From Data: Concepts, Theory and Methods, 2nd ed. New York: Interscience, 2007.
[4] I. Kononenko and M. Kukar, Machine Learning and Data Mining: Introduction to Principles and Algorithms. West Sussex: Horwood, 2007.
[5] E. K. Garcia, S. Feldman, M. R. Gupta, and S. Srivastava, "Completely lazy learning," IEEE Trans. Knowl. Data Eng., vol. 22, no. 9, pp. 1274–1285, Sep. 2010.
[6] A. N. Papadopoulos and Y. Manolopoulos, Nearest Neighbor Search: A Database Perspective. New York: Springer-Verlag, 2004.
[7] G. Shakhnarovich, T. Darrell, and P. Indyk, Eds., Nearest-Neighbor Methods in Learning and Vision: Theory and Practice. Cambridge, MA: MIT Press, 2006.
[8] X. Wu and V. Kumar, Eds., The Top Ten Algorithms in Data Mining (Chapman & Hall/CRC Data Mining and Knowledge Discovery Series). Boca Raton, FL: CRC, 2009.
[9] A. Shintemirov, W. Tang, and Q. Wu, "Power transformer fault classification based on dissolved gas analysis by implementing bootstrap and genetic programming," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 39, no. 1, pp. 69–79, Jan. 2009.
[10] P. G. Espejo, S. Ventura, and F. Herrera, "A survey on the application of genetic programming to classification," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 40, no. 2, pp. 121–144, Mar. 2009.
[11] S. Magnussen, R. McRoberts, and E. Tomppo, "Model-based mean square error estimators for k-nearest neighbour predictions and applications using remotely sensed data for forest inventories," Remote Sens. Environ., vol. 113, no. 3, pp. 476–488, 2009.
[12] M. Govindarajan and R. Chandrasekaran, "Evaluation of k-nearest neighbor classifier performance for direct marketing," Expert Syst. Appl., vol. 37, no. 1, pp. 253–258, 2009.
[13] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. New York: Wiley, 2001.
[14] Y. Chen, E. Garcia, M. Gupta, A. Rahimi, and L. Cazzanti, "Similarity-based classification: Concepts and algorithms," J. Mach. Learning Res., vol. 10, pp. 747–776, 2009.
[15] K. Weinberger and L. Saul, "Distance metric learning for large margin nearest neighbor classification," J. Mach. Learning Res., vol. 10, pp. 207–244, 2009.
[16] F. Fernández and P. Isasi, "Local feature weighting in nearest prototype classification," IEEE Trans. Neural Netw., vol. 19, no. 1, pp. 40–53, Jan. 2008.
[17] E. Pekalska and R. P. Duin, "Beyond traditional kernels: Classification in two dissimilarity-based representation spaces," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 38, no. 6, pp. 729–744, Nov. 2008.
[18] P. Cunningham, "A taxonomy of similarity mechanisms for case-based reasoning," IEEE Trans. Knowl. Data Eng., vol. 21, no. 11, pp. 1532–1543, Nov. 2009.
[19] B. Li, Y. W. Chen, and Y. Chen, "The nearest neighbor algorithm of local probability centers," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 1, pp. 141–154, Feb. 2008.
[20] C.-M. Hsu and M.-S. Chen, "On the design and applicability of distance functions in high-dimensional data space," IEEE Trans. Knowl. Data Eng., vol. 21, no. 4, pp. 523–536, Apr. 2009.
[21] D. R. Wilson and T. R. Martinez, "Reduction techniques for instance-based learning algorithms," Mach. Learning, vol. 38, no. 3, pp. 257–286, 2000.
[22] S. W. Kim and B. J. Oommen, "A brief taxonomy and ranking of creative prototype reduction schemes," Pattern Anal. Appl., vol. 6, pp. 232–244, 2003.
[23] H. Brighton and C. Mellish, "Advances in instance selection for instance-based learning algorithms," Data Mining Knowl. Discov., vol. 6, no. 2, pp. 153–172, 2002.
[24] E. Marchiori, "Class conditional nearest neighbor for large margin instance selection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 2, pp. 364–370, Feb. 2010.
[25] N. García-Pedrajas, "Constructing ensembles of classifiers by means of weighted instance selection," IEEE Trans. Neural Netw., vol. 20, no. 2, pp. 258–277, Feb. 2009.
[26] E. Pekalska, R. P. W. Duin, and P. Paclík, "Prototype selection for dissimilarity-based classifiers," Pattern Recognit., vol. 39, no. 2, pp. 189–208, 2006.
[27] M. Lozano, J. M. Sotoca, J. S. Sánchez, F. Pla, E. Pekalska, and R. P. W. Duin, "Experimental study on prototype optimisation algorithms for prototype-based classification in vector spaces," Pattern Recognit., vol. 39, no. 10, pp. 1827–1838, 2006.
[28] H. A. Fayed, S. R. Hashem, and A. F. Atiya, "Self-generating prototypes for pattern classification," Pattern Recognit., vol. 40, no. 5, pp. 1498–1509, 2007.
[29] W. Lam, C. K. Keung, and D. Liu, "Discovering useful concept prototypes for classification based on filtering and abstraction," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 8, pp. 1075–1090, Aug. 2002.
[30] J. S. Sánchez, "High training set size reduction by space partitioning and prototype abstraction," Pattern Recognit., vol. 37, no. 7, pp. 1561–1564, 2004.
[31] R. Xu and D. Wunsch, "Survey of clustering algorithms," IEEE Trans. Neural Netw., vol. 16, no. 3, pp. 645–678, May 2005.
[32] C.-L. Chang, "Finding prototypes for nearest neighbor classifiers," IEEE Trans. Comput., vol. C-23, no. 11, pp. 1179–1184, Nov. 1974.
[33] T. Kohonen, "The self-organizing map," Proc. IEEE, vol. 78, no. 9, pp. 1464–1480, Sep. 1990.
[34] C. H. Chen and A. Jóźwik, "A sample set condensation algorithm for the class sensitive artificial neural network," Pattern Recognit. Lett., vol. 17, no. 8, pp. 819–823, Jul. 1996.
[35] R. Kulkarni and G. Venayagamoorthy, "Particle swarm optimization in wireless-sensor networks: A brief survey," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., to be published. DOI: 10.1109/TSMCC.2010.2054080.
[36] F. Fernández and P. Isasi, "Evolutionary design of nearest prototype classifiers," J. Heurist., vol. 10, no. 4, pp. 431–454, 2004.
[37] L. Nanni and A. Lumini, "Particle swarm optimization for prototype reduction," Neurocomputing, vol. 72, no. 4–6, pp. 1092–1097, 2008.
[38] J. Alcalá-Fdez, L. Sánchez, S. García, M. J. del Jesus, S. Ventura, J. M. Garrell, J. Otero, C. Romero, J. Bacardit, V. M. Rivas, J. C. Fernández, and F. Herrera, "KEEL: A software tool to assess evolutionary algorithms for data mining problems," Soft Comput., vol. 13, no. 3, pp. 307–318, 2008.
[39] J. Koplowitz and T. Brown, "On the relation of performance to editing in nearest neighbor rules," Pattern Recognit., vol. 13, pp. 251–255, 1981.
[40] S. Geva and J. Sitte, "Adaptive nearest neighbor pattern classification," IEEE Trans. Neural Netw., vol. 2, no. 2, pp. 318–322, Mar. 1991.
[41] Q. Xie, C. A. Laszlo, and R. K. Ward, "Vector quantization technique for nonparametric classifier design," IEEE Trans. Pattern Anal. Mach. Intell., vol. 15, no. 12, pp. 1326–1330, Dec. 1993.
[42] Y. Hamamoto, S. Uchimura, and S. Tomita, "A bootstrap technique for nearest neighbor classifier design," IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 1, pp. 73–79, Jan. 1997.
[43] R. Odorico, "Learning vector quantization with training count (LVQTC)," Neural Netw., vol. 10, no. 6, pp. 1083–1088, 1997.
[44] C. Decaestecker, "Finding prototypes for nearest neighbour classification by means of gradient descent and deterministic annealing," Pattern Recognit., vol. 30, no. 2, pp. 281–288, 1997.
[45] J. C. Bezdek, T. R. Reichherzer, G. Lim, and Y. Attikiouzel, "Multiple prototype classifier design," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 28, no. 1, pp. 67–79, Feb. 1998.
[46] R. Mollineda, F. Ferri, and E. Vidal, "A merge-based condensing strategy for multiple prototype classifiers," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 32, no. 5, pp. 662–668, Oct. 2002.
[47] J. S. Sánchez, R. Barandela, A. I. Marqués, R. Alejo, and J. Badenas, "Analysis of new techniques to obtain quality training sets," Pattern Recognit. Lett., vol. 24, no. 7, pp. 1015–1022, 2003.
[48] S. W. Kim and B. J. Oommen, "Enhancing prototype reduction schemes with LVQ3-type algorithms," Pattern Recognit., vol. 36, pp. 1083–1093, 2003.
[49] C.-W. Yen, C.-N. Young, and M. L. Nagurka, "A vector quantization method for nearest neighbor classifier design," Pattern Recognit. Lett., vol. 25, no. 6, pp. 725–731, 2004.
[50] J. Li, M. T. Manry, C. Yu, and D. R. Wilson, "Prototype classifier design with pruning," Int. J. Artif. Intell. Tools, vol. 14, no. 1–2, pp. 261–280, 2005.
[51] T. Raicharoen and C. Lursinsap, "A divide-and-conquer approach to the pairwise opposite class-nearest neighbor (POC-NN) algorithm," Pattern Recognit. Lett., vol. 26, no. 10, pp. 1554–1567, 2005.
[52] A. Cervantes, I. M. Galván, and P. Isasi, "AMPSO: A new particle swarm method for nearest neighborhood classification," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 39, no. 5, pp. 1082–1091, Oct. 2009.
[53] U. Garain, "Prototype reduction using an artificial immune model," Pattern Anal. Appl., vol. 11, no. 3–4, pp. 353–363, 2008.
[54] D. L. Wilson, "Asymptotic properties of nearest neighbor rules using edited data," IEEE Trans. Syst., Man, Cybern., vol. SMC-2, no. 3, pp. 408–421, Jul. 1972.
[55] P. E. Hart, "The condensed nearest neighbor rule," IEEE Trans. Inf. Theory, vol. IT-14, no. 3, pp. 515–516, May 1968.
[56] S. García, J.-R. Cano, E. Bernadó-Mansilla, and F. Herrera, "Diagnose of effective evolutionary prototype selection using an overlapping measure," Int. J. Pattern Recognit. Artif. Intell., vol. 28, no. 8, pp. 1527–1548, 2009.
[57] S. García, J. R. Cano, and F. Herrera, "A memetic algorithm for evolutionary prototype selection: A scaling up approach," Pattern Recognit., vol. 41, no. 8, pp. 2693–2709, 2008.
[58] H. A. Fayed and A. F. Atiya, "A novel template reduction approach for the k-nearest neighbor method," IEEE Trans. Neural Netw., vol. 20, no. 5, pp. 890–896, May 2009.
[59] J. Derrac, S. García, and F. Herrera, "IFS-CoCo: Instance and feature selection based on cooperative coevolution with nearest neighbor rule," Pattern Recognit., vol. 43, no. 6, pp. 2082–2105, 2010.
[60] P. Domingos, "Unifying instance-based and rule-based induction," Mach. Learning, vol. 24, no. 2, pp. 141–168, 1996.
[61] O. Luaces and A. Bahamonde, "Inflating examples to obtain rules," Int. J. Intell. Syst., vol. 18, pp. 1113–1143, 2003.
[62] H. A. Fayed, S. R. Hashem, and A. F. Atiya, "Hyperspherical prototypes for pattern classification," Int. J. Pattern Recognit. Artif. Intell., vol. 23, no. 8, pp. 1549–1575, 2009.
[63] D. Wettschereck, D. W. Aha, and T. Mohri, "A review and empirical evaluation of feature weighting methods for a class of lazy learning algorithms," Artif. Intell. Rev., vol. 11, no. 1–5, pp. 273–314, 1997.
[64] R. Paredes and E. Vidal, "Learning weighted metrics to minimize nearest-neighbor classification error," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 7, pp. 1100–1110, Jul. 2006.
[65] M. Z. Jahromi, E. Parvinnia, and R. John, "A method of learning weighted similarity function to improve the performance of nearest neighbor," Inf. Sci., vol. 179, no. 17, pp. 2964–2973, 2009.
[66] C. Vallejo, J. Troyano, and F. Ortega, "InstanceRank: Bringing order to datasets," Pattern Recognit. Lett., vol. 31, no. 2, pp. 133–142, 2010.
[67] C. G. Atkeson, A. W. Moore, and S. Schaal, "Locally weighted learning," Artif. Intell. Rev., vol. 11, pp. 11–73, 1997.
[68] D. R. Wilson and T. R. Martinez, "Improved heterogeneous distance functions," J. Artif. Intell. Res., vol. 6, no. 1, pp. 1–34, 1997.
[69] R. D. Short and K. Fukunaga, "Optimal distance measure for nearest neighbor classification," IEEE Trans. Inf. Theory, vol. IT-27, no. 5, pp. 622–627, Sep. 1981.
[70] C. Gagné and M. Parizeau, "Coevolution of nearest neighbor classifiers," Int. J. Pattern Recognit. Artif. Intell., vol. 21, no. 5, pp. 921–946, 2007.
[71] J. Wang, P. Neskovic, and L. Cooper, "Improving nearest neighbor rule with a simple adaptive distance measure," Pattern Recognit. Lett., vol. 28, no. 2, pp. 207–213, 2007.
[72] Y. Sun, A. K. C. Wong, and M. S. Kamel, "Classification of imbalanced data: A review," Int. J. Pattern Recognit. Artif. Intell., vol. 23, no. 4, pp. 687–719, 2009.
[73] H. He and E. A. Garcia, "Learning from imbalanced data," IEEE Trans. Knowl. Data Eng., vol. 21, no. 9, pp. 1263–1284, Sep. 2009.
[74] S. García and F. Herrera, "Evolutionary under-sampling for classification with imbalanced data sets: Proposals and taxonomy," Evol. Comput., vol. 17, no. 3, pp. 275–306, 2009.
[75] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," J. Artif. Intell. Res., vol. 16, pp. 321–357, 2002.
[76] N. V. Chawla, D. A. Cieslak, L. O. Hall, and A. Joshi, "Automatically countering imbalance and its empirical relationship to cost," Data Mining Knowl. Discov., vol. 17, no. 2, pp. 225–252, 2008.
[77] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. San Francisco, CA: Morgan Kaufmann, 2005.
[78] A. Ben-David, "A lot of randomness is hiding in accuracy," Eng. Appl. Artif. Intell., vol. 20, pp. 875–885, 2007.
[79] A. Asuncion and D. Newman. (2007). UCI Machine Learning Repository. [Online]. Available: http://www.ics.uci.edu/mlearn/MLRepository.html
[80] S. García, A. Fernández, J. Luengo, and F. Herrera, "A study of statistical techniques and performance measures for genetics-based machine learning: Accuracy and interpretability," Soft Comput., vol. 13, no. 10, pp. 959–977, 2009.
[81] D. Sheskin, Handbook of Parametric and Nonparametric Statistical Procedures, 2nd ed. London, U.K.: Chapman & Hall, 2006.
[82] J. Demšar, "Statistical comparisons of classifiers over multiple data sets," J. Mach. Learning Res., vol. 7, pp. 1–30, 2006.
[83] S. García and F. Herrera, "An extension on 'Statistical comparisons of classifiers over multiple data sets' for all pairwise comparisons," J. Mach. Learning Res., vol. 9, pp. 2677–2694, 2008.
[84] S. García, A. Fernández, J. Luengo, and F. Herrera, "Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power," Inf. Sci., vol. 180, pp. 2044–2064, 2010.

Isaac Triguero received the M.Sc. degree in computer science from the University of Granada, Granada, Spain, in 2009, where he is currently working toward the Ph.D. degree with the Department of Computer Science and Artificial Intelligence. His current research interests include data mining, data reduction, and evolutionary algorithms.

Joaquín Derrac received the M.Sc. degree in computer science from the University of Granada, Granada, Spain, in 2008, where he is currently working toward the Ph.D. degree with the Department of Computer Science and Artificial Intelligence. His current research interests include data mining, data reduction, lazy learning, and evolutionary algorithms.

Salvador García received the M.Sc. and Ph.D. degrees in computer science from the University of Granada, Granada, Spain, in 2004 and 2008, respectively. He is currently an Assistant Professor with the Department of Computer Science, University of Jaén, Jaén, Spain. His research interests include data mining, data reduction, data complexity, imbalanced learning, statistical inference, and evolutionary algorithms.

Francisco Herrera received the M.Sc. and Ph.D. degrees in mathematics from the University of Granada, Granada, Spain, in 1988 and 1991, respectively. He is currently a Professor with the Department of Computer Science and Artificial Intelligence, University of Granada. He has authored or coauthored more than 150 papers in international journals. He is a coauthor of the book Genetic Fuzzy Systems: Evolutionary Tuning and Learning of Fuzzy Knowledge Bases (Hackensack, NJ: World Scientific, 2001). He has coedited five international books and 20 special issues in international journals on different soft computing topics. He is an Associate Editor of the journals IEEE TRANSACTIONS ON FUZZY SYSTEMS, Information Sciences, Mathware and Soft Computing, Advances in Fuzzy Systems, Advances in Computational Sciences and Technology, and the International Journal of Applied Metaheuristics Computing. He is also an Area Editor of the journal Soft Computing (in the area of genetic algorithms and genetic fuzzy systems) and a member of several journal editorial boards, such as Fuzzy Sets and Systems, Applied Intelligence, Knowledge and Information Systems, Information Fusion, Evolutionary Intelligence, the International Journal of Hybrid Intelligent Systems, and Memetic Computation. His research interests include computing with words and decision making, data mining, data preparation, instance selection, fuzzy-rule-based systems, genetic fuzzy systems, knowledge extraction based on evolutionary algorithms, memetic algorithms, and genetic algorithms.

1.2 IPADE: Iterative Prototype Adjustment for Nearest Neighbor Classification
• I. Triguero, S. García, F. Herrera, IPADE: Iterative Prototype Adjustment for Nearest Neighbor Classification. IEEE Transactions on Neural Networks 21 (12) (2010) 1984–1990, doi: 10.1109/TNN.2010.2087415.
– Status: Published.
– Impact Factor (JCR 2010): 2.633
– Subject Category: Computer Science, Artificial Intelligence. Ranking 17 / 108 (Q1).
– Subject Category: Computer Science, Hardware & Architecture. Ranking 3 / 48 (Q1).
– Subject Category: Computer Science, Theory & Methods. Ranking 8 / 97 (Q1).
– Subject Category: Engineering, Electrical & Electronic. Ranking 22 / 247 (Q1).

IPADE: Iterative Prototype Adjustment for Nearest Neighbor Classification

Isaac Triguero, Salvador García, and Francisco Herrera

Manuscript received June 10, 2010; revised September 3, 2010; accepted October 9, 2010. This work was supported by TIN2008-06681-C06-01. I. Triguero and F. Herrera are with the CITIC-UGR, Department of Computer Science and Artificial Intelligence, University of Granada, Granada 18071, Spain (e-mail: [email protected]; [email protected]). S. García is with the Department of Computer Science, University of Jaén, Jaén 23071, Spain (e-mail: [email protected]). Digital Object Identifier 10.1109/TNN.2010.2087415

Abstract—Nearest prototype methods are a successful trend in many pattern classification tasks. However, they present several shortcomings, such as time response, noise sensitivity, and storage requirements. Data reduction techniques are suitable to alleviate these drawbacks. Prototype generation is an appropriate process for data reduction, which allows the fitting of a dataset for nearest neighbor (NN) classification. This brief presents a methodology to learn iteratively the positioning of prototypes using real-parameter optimization procedures. Concretely, we propose an iterative prototype adjustment technique based on differential evolution. The results obtained are contrasted with nonparametric statistical tests and show that our proposal consistently outperforms previously proposed methods, thus becoming a suitable tool for enhancing the performance of the NN classifier.

Index Terms—Classification, differential evolution, nearest neighbor, prototype generation.

I. INTRODUCTION

Classification is one of the most important tasks in machine learning and data mining [1], [2]. Most machine learning methods build a model during the learning process, known as eager learning methods [3], but there are some approaches for which the algorithm does not need a model. These algorithms are known as lazy learning methods [4]. The nearest neighbor (NN) algorithm [5] and its derivatives belong to the family of lazy learning. It has proved itself to perform well for classification problems in many domains [2], [6] and is considered one of the top ten methods in data mining [7]. NN is a nonparametric classifier, which requires the storage of the entire training set and classifies unseen cases by finding the class labels of the instances closest to them. In order to determine how close two instances are, several distance or similarity measures have been proposed [8]–[10]. The effectiveness and simplicity of NN may be affected by several weaknesses, such as high computational cost, high storage requirements, and sensitivity to noise. Furthermore, NN makes predictions over existing data and assumes that the input data perfectly delimit the decision boundaries among classes.

Several approaches have been suggested and studied to tackle the drawbacks mentioned above.
For instance, weighting schemes [11], [12] have been widely used to improve the results of the NN classifier. A successful technique that simultaneously tackles the computational complexity, storage requirements, and noise sensitivity of NN is data reduction. These techniques aim to obtain a representative training set with a smaller size than the original one and with similar or even higher classification accuracy for new incoming data. Apart from feature selection [13], data reduction can be divided into two different approaches, known as prototype selection [14], [15] and prototype generation (PG) or abstraction [16], [17]. The former consists of choosing a subset of the original training data, while PG can also build new artificial prototypes to better adjust the decision boundaries between classes in NN classification.

In the specialized literature, a great number of PG techniques have been proposed. Since the first approach, PNN, based on merging prototypes [18], and the divide-and-conquer-based schemes [19], many other PG proposals have been considered, for instance, Mixt_Gauss [20], ICPL [17], and RSP [21]. Positioning adjustment of prototypes is another perspective within the PG methodology. It aims to correct the position of a subset of prototypes from the initial set by using an optimization procedure. Many proposals belong to this family, such as learning vector quantization (LVQ) [22] and its successive improvements [23], [24], genetic algorithms [25], and particle swarm optimization (PSO) [26], [27].

Many existing positioning adjustment techniques start with an initial set of prototypes and try to improve the classification accuracy by adjusting it. Two initialization schemes are commonly used, as illustrated in the sketch after this list.
1) The number of representative instances for each class is proportional to their number in the input data.
2) All the classes are represented by the same number of prototypes.
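The following small sketch (our own illustration, with hypothetical names) contrasts the two schemes by computing how many prototypes each class would receive; rounding drift in the proportional scheme is ignored for simplicity.

    import numpy as np

    def prototypes_per_class(y, total_prototypes, scheme="proportional"):
        """Number of prototypes assigned to each class under the two
        common initialization schemes described above (sketch only)."""
        classes, counts = np.unique(y, return_counts=True)
        if scheme == "proportional":
            # At least one prototype per class, otherwise proportional to frequency.
            per_class = np.maximum(1, np.round(total_prototypes * counts / counts.sum()))
        else:  # "equal": the same number of prototypes for every class
            per_class = np.full(len(classes), total_prototypes // len(classes))
        return dict(zip(classes, per_class.astype(int)))

Either way, the total number of prototypes is a user-supplied parameter, which is precisely the dependence on the problem that the text below identifies as the main drawback.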
This initialization process is their main drawback, due to the fact that this parameter can be very dependent on the problem tackled. Some PG approaches [23], [25] compute the number of prototypes to be retained automatically, but in complex domains they need to retain many prototypes.

We propose a novel procedure to automatically find the smallest reduced set that achieves suitable classification accuracy over different types of problems. This method follows an iterative prototype adjustment scheme with an incremental approach. At each step, an optimization procedure is used to adjust the position of the prototypes, and the method adds new prototypes if needed. As a second contribution of this brief, we adopt the differential evolution (DE) [28], [29] technique as the optimizer. Our proposal will be denoted "iterative prototype adjustment based on differential evolution" (IPADE).

In experiments on 50 real-world benchmark datasets, the classification accuracy and reduction rate of our approach are investigated, and its performance is compared with classical and recent PG models. The rest of this brief is organized as follows. Section II describes the background of PG and DE. Section III explains the proposed IPADE algorithm. Section IV discusses the experimental framework and presents the analysis of results. Finally, in Section V we summarize our conclusions.

II. BACKGROUND

This section covers the background information necessary to define and describe our proposal. Section II-A presents the background on PG. Section II-B shows the main characteristics of DE.

A. PG

PG is an important technique in data reduction. It has been widely applied to instance-based classifiers and can be defined as the application of instance construction algorithms over a dataset to improve the classification accuracy of an NN classifier. More specifically, PG can be defined as follows. Let $x_p$ be an instance, $x_p = (x_{p1}, x_{p2}, \ldots, x_{pm}, x_{p\omega})$, with $x_p$ belonging to a class $\omega$, given by $x_{p\omega}$, of the possible classes, in an $m$-dimensional space in which $x_{pi}$ is the value of the $i$th feature of the $p$th sample. Furthermore, let $x_t$ be an instance, $x_t = (x_{t1}, x_{t2}, \ldots, x_{tm}, x_{t\psi})$, with $x_t$ belonging to a class $\psi$, which is unknown, of the possible classes. Then, let us assume that there is a training set TR consisting of $n$ instances $x_p$ and a test set TS composed of $s$ instances $x_t$. The purpose of PG is to obtain a prototype generated set GS, which consists of $r$, $r < n$, prototypes $p_u$, where $p_u = (p_{u1}, p_{u2}, \ldots, p_{um}, p_{u\omega})$, generated from the examples of TR. The prototypes of the generated set are determined so as to represent efficiently the distributions of the classes and to discriminate well when used to classify the training objects. Their cardinality should be sufficiently small to reduce both the storage and evaluation time spent by an NN classifier.
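A generated set is used exactly like a training set under the NN rule. The sketch below (illustrative names; Euclidean distance, as throughout this brief) labels each test example with the class of its nearest prototype in GS.

    import numpy as np

    def nn_classify(GS_X, GS_y, X_test):
        """1NN rule: label each test example with the class of its
        closest prototype in the generated set GS (Euclidean distance)."""
        preds = []
        for x in X_test:
            d = np.linalg.norm(GS_X - x, axis=1)   # distance to every prototype
            preds.append(GS_y[np.argmin(d)])       # class of the nearest one
        return np.array(preds)

Because GS contains r << n prototypes, each classification costs O(r · m) distance computations instead of O(n · m) over the original training set, which is the storage and time saving that PG pursues.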
The PG approaches can be divided into several families depending on the main heuristic operation followed. The first approach found in the literature, called PNN [18], belongs to the family of methods that merge prototypes of the same class in successive iterations, generating centroids. Other well-known methods are based on a divide-and-conquer scheme, separating the $m$-dimensional space into two or more subspaces with the purpose of simplifying the problem at each step [19]. Recent advances that follow a similar operation include Mixt_Gauss [20], an adaptive PG algorithm considered within the framework of mixture modeling by Gaussian distributions while assuming statistical independence of features, and the RSP3 technique [21], which tries to avoid drastic changes in the form of the decision boundaries associated with TR, the main shortcoming observed in the classical approach [19].

One of the most important families of methods is based on adjusting the position of the prototypes, which can be viewed as an optimization process. The main algorithm belonging to this family is LVQ [22]. LVQ can be understood as an artificial neural network in which a neuron corresponds to a prototype and a weight-based competition is carried out in order to locate each neuron in a concrete place of the $m$-dimensional space so as to increase the classification accuracy. The third version of this algorithm, LVQ3, reported the best results. Several approaches have been proposed that modify the basic LVQ, for instance, LVQPRU [23], which extends LVQ by using a pruning step to remove noisy instances, or the HYB algorithm [24], which constitutes a hybridization of several prototype reduction techniques. Specifically, HYB combines support vector machines (SVMs) with LVQ3 and executes a search in order to find the most promising parameters of LVQ3. As a positioning adjustment technique, a genetic algorithm called ENPC was proposed for PG in [25]. This algorithm executes different operators in order to find the most suitable positions of the prototypes. PSO was proposed for PG in [26] and [27]; these proposals also belong to the positioning adjustment category of methods. The main difference between them is the type of codification of the particles. The PSO approach proposed in [26] codifies a complete solution GS per particle, whereas AMPSO [27] encodes each prototype of GS in a single particle. AMPSO has been shown to be more effective than PSO [26].

B. DE

DE follows the general procedure of an evolutionary algorithm. It starts with a population of NP candidate solutions, the so-called individuals. The generations in DE are denoted by $G = 0, 1, \ldots, G_{max}$. It is usual to denote each individual as a $D$-dimensional vector $X_{i,G} = \{x^1_{i,G}, \ldots, x^D_{i,G}\}$, called a "target vector." After initialization, DE applies the mutation operator to generate a mutant vector $V_{i,G} = \{v^1_{i,G}, \ldots, v^D_{i,G}\}$ with respect to each individual $X_{i,G}$ in the current population. The method of creating this mutant vector is what differentiates one DE scheme from another. We focus on DE/Rand/1, which generates the mutant vector as follows:

$$V_{i,G} = X_{r_1,G} + F \cdot (X_{r_2,G} - X_{r_3,G}). \qquad (1)$$

After the mutation phase, the crossover operation is applied to each pair of the target vector $X_{i,G}$ and its corresponding mutant vector $V_{i,G}$ to generate a new trial vector, which we denote $U_{i,G}$. There are three kinds of crossover operators, known as "binomial," "exponential," and "arithmetic." Specifically, we focus on the well-known DE/CurrentToRand/1 strategy [30], which generates the trial vector $U_{i,G}$ by linearly combining the target vector $X_{i,G}$ and the corresponding mutant vector $V_{i,G}$ as follows:

$$U_{i,G} = X_{i,G} + K \cdot (V_{i,G} - X_{i,G}). \qquad (2)$$

Now, incorporating (1) in (2) and simplifying, we obtain

$$U_{i,G} = X_{i,G} + K \cdot (X_{r_1,G} - X_{i,G}) + F \cdot (X_{r_2,G} - X_{r_3,G}). \qquad (3)$$

The indices $r_1$, $r_2$, and $r_3$ are mutually exclusive integers randomly generated within the range [1, NP], which are also different from the base index $i$. The scaling factor $F$ is a positive control parameter for scaling the difference vectors. $K$ is a random number from [0, 1]. When the trial vector has been generated, we must decide which individual between $X_{i,G}$ and $U_{i,G}$ should survive in the population of the next generation $G + 1$. If the new trial vector yields an equal or better solution than the target vector, it replaces the corresponding target vector in the next generation; otherwise, the target is retained in the population.
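The trial-vector construction in (3) can be sketched in a few lines of NumPy. The function name and the default F are illustrative assumptions; K is drawn uniformly from [0, 1] as stated above.

    import numpy as np

    def current_to_rand_1(pop, i, F=0.5, rng=None):
        """Build a trial vector for individual i with DE/CurrentToRand/1,
        Eq. (3): U = X_i + K*(X_r1 - X_i) + F*(X_r2 - X_r3)."""
        rng = np.random.default_rng(rng)
        candidates = [j for j in range(len(pop)) if j != i]   # exclude base index i
        r1, r2, r3 = rng.choice(candidates, size=3, replace=False)
        K = rng.random()                                      # K uniform in [0, 1]
        return pop[i] + K * (pop[r1] - pop[i]) + F * (pop[r2] - pop[r3])

Note that, unlike binomial or exponential crossover, this arithmetic recombination produces trial vectors that are rotation-invariant linear combinations of population members, which is why it suits the continuous repositioning of prototypes.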
The success of the DE algorithm in solving a specific problem crucially depends on the appropriate choice of its associated control parameter values, which determine the convergence speed. A fixed selection of these parameters can produce slow and/or premature convergence, depending on the problem. Thus, researchers have investigated parameter adaptation mechanisms to improve the performance of the basic DE algorithm. One of the most successful adaptive DE algorithms is SFLSDE [31]. It uses two local search algorithms in the scale-factor space to find the appropriate parameters for a given $X_{i,G}$.

III. IPADE

In this section, we present and describe the IPADE approach in depth. IPADE follows an iterative scheme in which it determines the most appropriate number of prototypes per class and their best positioning. Concretely, IPADE is divided into three different stages: initialization (Section III-A), optimization (Section III-B), and addition of prototypes (Section III-C). Fig. 1 shows the pseudocode of the proposed model. In the following, we describe the most significant instructions, enumerated from 1 to 26.

1:  GS = Initialization(TR)
2:  DE_Optimization(GS, TR)
3:  AccuracyGlobal = Evaluate(GS, TR)
4:  registerClass[0..Ω] = optimizable
5:  while AccuracyGlobal <> 1.0 and not all classes are non-optimizable do
6:    lessAccuracy = ∞
7:    for i = 1 to Ω do
8:      if registerClass[i] == optimizable then
9:        AccuracyClass[i] = Evaluate(GS, examples of class i in TR)
10:       if AccuracyClass[i] < lessAccuracy then
11:         lessAccuracy = AccuracyClass[i]
12:         targetClass = i
13:       end if
14:     end if
15:   end for
16:   GStest = GS ∪ RandomExampleForClass(TR, targetClass)
17:   DE_Optimization(GStest, TR)
18:   accuracyTest = Evaluate(GStest, TR)
19:   if accuracyTest > AccuracyGlobal then
20:     AccuracyGlobal = accuracyTest
21:     GS = GStest
22:   else
23:     registerClass[targetClass] = non-optimizable
24:   end if
25: end while
26: return GS

Fig. 1. IPADE algorithm—basic structure.

A. Initialization

A random selection (stratified or not) of examples from TR may not be the most adequate procedure to initialize GS. Instead, IPADE iteratively learns prototypes in order to find the most appropriate structure of GS. Instruction 1 generates the initial solution GS. In this step, GS must represent each class with one prototype and should cover the entire search space as much as possible. For this reason, each class distribution is represented by its respective centroid. This initialization was satisfactorily used by the approaches proposed in [16] and [20]. The centroid of a class does not completely cover the region of the class and does not avoid misclassifications. Thus, instruction 2 applies the first optimization stage using the initial GS composed of the centroid of each class. The optimization stage must modify the prototypes of GS using the idea of movement in the $m$-dimensional space, adding or subtracting some quantities to the attribute values of the prototypes. It is important to point out that we normalize all attributes of the dataset to the [0, 1] range.

B. DE Optimization for IPADE

In this section, we explain our proposal to apply the underlying idea of the DE algorithm to the PG problem as a positioning adjustment of prototypes scheme. First of all, it is necessary to define the solution codification.
In the proposed DE algorithm, each individual in the population encodes a single prototype without the class label; as such, the dimension of the individuals is equal to the number of attributes of the specific problem. An individual classifies an example of TR when it is the closest prototype (in terms of Euclidean distance) to that example.

The DE algorithm uses the prototypes p_u of GS, provided by the IPADE algorithm, as the initial population. Next, mutation and crossover operators guide the optimization of the positioning of each p_u in the m-dimensional space. It is important to point out that these operators only modify the attributes of the prototypes of GS. Hence, the class value remains unchangeable throughout the evolutionary cycle.

We focus on the well-known DE/CurrentToRand/1 strategy [30] to generate the trial prototypes p'_u because it has reportedly shown the best behavior. It can be viewed as

p'_u = p_u + K · (p_{r1} − p_u) + F · (p_{r2} − p_{r3}).    (4)

The examples p_{r1}, p_{r2}, and p_{r3} are randomly extracted from TR and belong to the same class as p_u. In the hypothetical case that TR does not contain enough prototypes of the class of p_u, i.e., there are not at least three prototypes of this class in TR, we artificially generate the necessary number of new prototypes p_{rj}, 1 ≤ j ≤ 3, with the same class label as p_u, using small random perturbations:

p_{rj} = (p_{u1} + rand[−0.1, 0.1], p_{u2} + rand[−0.1, 0.1], ..., p_{um} + rand[−0.1, 0.1], p_{uω}).

After applying this operator, we check whether any values have fallen out of the range [0, 1]. If a computed value is greater than 1, we truncate it to 1, and if it is lower than 0, we establish it at 0.

After the mutation process over all the prototypes of GS, we obtain a trial solution GS', which is constituted by the trial prototypes p'_u. The selection operator decides which solution, GS or GS', should survive for the next iteration. The 1NN rule guides this operator to obtain the corresponding fitness value. We try to maximize this value, so the selection operator can be viewed as follows:

GS = GS'  if accuracy(GS') ≥ accuracy(GS),
     GS   otherwise.    (5)

In order to guarantee a high-quality solution, we use the ideas established in [31] to obtain a self-adaptive algorithm.

Instruction 3 evaluates the accuracy of the initial solution, measured by classifying the examples of TR with the prototypes of GS using the NN rule.
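The following is a minimal sketch of the per-prototype movement of Eq. (4), including the same-class sampling, the random-perturbation fallback, and the truncation to [0, 1]. The names move_prototype, TR_X and TR_y, as well as the fixed F and K values, are illustrative assumptions; the actual method draws these parameters self-adaptively following [31].

# Minimal sketch of Eq. (4), assuming attributes normalized to [0, 1].
import numpy as np

rng = np.random.default_rng(0)

def move_prototype(p_u, label, TR_X, TR_y, F=0.7, K=0.5):
    """Return the trial prototype p'_u = p_u + K*(p_r1 - p_u) + F*(p_r2 - p_r3)."""
    same = TR_X[TR_y == label]
    if len(same) >= 3:
        r1, r2, r3 = same[rng.choice(len(same), size=3, replace=False)]
    else:
        # Not enough examples of this class: pad with small random
        # perturbations of p_u itself, as described in the text.
        r1, r2, r3 = (p_u + rng.uniform(-0.1, 0.1, size=p_u.shape)
                      for _ in range(3))
    trial = p_u + K * (r1 - p_u) + F * (r2 - r3)
    return np.clip(trial, 0.0, 1.0)   # truncate out-of-range attribute values

# Toy usage: one prototype of class 0, a tiny normalized training set.
TR_X = rng.random((10, 4))
TR_y = np.array([0, 0, 0, 1, 1, 1, 0, 1, 0, 1])
print(move_prototype(TR_X[0].copy(), 0, TR_X, TR_y))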
This addition forces the repositioning of the prototypes of GStest by again using the optimization process (instruction 17) and its corresponding evaluation of predictive accuracy (instruction 18).

After this process, we have to ensure that the new positioning of the prototypes of GStest, generated by the optimizer, has produced a successful improvement of the accuracy rate with respect to the previous GS. If the global accuracy of GStest is less than the accuracy of GS, IPADE does not add this prototype to GS and this class is registered as non-optimizable. Otherwise, GS = GStest.

The stopping criterion is satisfied when the accuracy rate is 1.0 or all the classes are registered as non-optimizable. The algorithm returns GS as the smallest reduced set that is able to classify TR appropriately.

IV. EXPERIMENTAL FRAMEWORK AND ANALYSIS OF RESULTS

This section presents the experimental framework (Section IV-A) and the comparative study between our proposal and other PG techniques (Section IV-B).

A. Experimental Framework

In this section, we show the issues related to the experimental study. In order to compare the performance of the algorithms, we use four measures: accuracy [1], [32]; the reduction rate, measured as

reduction rate = 1 − size(GS)/size(TR);    (6)

Acc·Red, measured as accuracy · reduction rate; and execution time.¹

We use 50 datasets² from the KEEL dataset repository³ [33], [34]. These datasets contain between 100 and 20 000 instances, and the number of attributes ranges from 2 to 60. The datasets considered are partitioned using the 10-fold cross-validation (10-fcv) procedure.

Many different configurations are established by the authors of each paper for the different techniques. We focus this experimentation on the recommended parameters proposed by their respective authors, assuming that these values were chosen optimally. However, we have carried out a preliminary study for each method on the number of iterations performed, with 300, 500, and 1000 iterations on all the datasets. This parameter can be very sensitive to the problem tackled: an excessive number of iterations may produce overfitting on some problems, whereas a lower number of iterations may not be enough to tackle other datasets. For this reason, we present the results of the best performing number of iterations for each method and dataset. The complete set of results can be found on the associated web site (http://sci2s.ugr.es/ipade/).

The configuration parameters of IPADE and the methods used in the comparison are shown in Table I. In this table, the values of the parameters Fl, Fu, iterSFGSS, and iterSFHC of the IPADE algorithm are the recommended values established in [31]. Furthermore, Euclidean distance is used as the similarity function, and the stochastic methods have been run three times per partition. Implementations of the algorithms can be found on the associated web site or in the KEEL software tool [33].

B. Analysis of Results

In this section, we analyze the results obtained. Specifically, we check the performance of the IPADE model against seven other PG techniques.
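As a small illustration of the evaluation measures just described, the helper below computes the reduction rate of Eq. (6) and the combined Acc·Red score; the numbers in the usage example are made up.

# Helpers for the measures of Section IV-A: Eq. (6) and Acc*Red.
def reduction_rate(size_gs, size_tr):
    return 1.0 - size_gs / size_tr

def acc_red(accuracy, size_gs, size_tr):
    return accuracy * reduction_rate(size_gs, size_tr)

# e.g. 12 prototypes kept from 1000 training examples, 0.91 accuracy
print(reduction_rate(12, 1000))   # 0.988
print(acc_red(0.91, 12, 1000))    # approximately 0.899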
¹Reduction rate and execution time information can be found on the web page.
²Datasets: abalone, appendicitis, australian, balance, banana, bands, breast, bupa, car, chess, cleveland, coil2000, contraceptive, crx, dermatology, ecoli, flare-solar, german, glass, haberman, hayes-roth, heart, hepatitis, housevotes, iris, led7digit, lymphography, magic, mammographic, marketing, monks, newthyroid, page-blocks, pima, ring, saheart, satimage, segment, sonar, spectheart, splice, tae, thyroid, tic-tac-toe, titanic, twonorm, wine, wisconsin, yeast, zoo.
³Available at http://sci2s.ugr.es/keel/datasets.

TABLE I
PARAMETER SPECIFICATION FOR ALL THE METHODS EMPLOYED IN THE EXPERIMENTATION

Algorithm   | Parameters
IPADE       | Iterations of basic DE = 300/500/1000, iterSFGSS = 8, iterSFHC = 20, Fl = 0.1, Fu = 0.9
RSP3        | Subset choice = diameter
Mixt_Gauss  | Reduction rate = 0.95
ENPC        | Iterations = 300/500/1000
AMPSO       | Iterations = 300/500/1000, C1 = 1.0, C2 = 1.0, C3 = 0.25, Vmax = 1, W = 0.1, X = 0.5, Pr = 0.1, Pd = 0.1
LVQPRU      | Iterations = 300/500/1000, α = 0.1, WindowWidth = 0.5
HYB         | Search_Iter = 300/500/1000, Optimal_Iter = 1000, α = 0.1, I = 0, F = 0.5, Initial_Window = 0, Final_Window = 0.5, δ = 0.1, δ_Window = 0.1, Initial Selection = SVM
LVQ3        | Iterations = 300/500/1000, α = 0.1, WindowWidth = 0.2, ε = 0.1

Fig. 2. Accuracy results over 50 datasets (scatterplot of the accuracy of IPADE versus that of 1NN, RSP3, LVQ3, Mixt_Gauss, LVQPRU, HYB, AMPSO, and ENPC, with the y = x line for reference).

Fig. 3. Acc·Red results over 50 datasets (scatterplot of the Acc·Red of IPADE versus that of the same algorithms).

Fig. 4. Map of convergence over three different datasets (accuracy curves for cleveland, thyroid, and car; reduction rates in brackets).

In the scatterplot of Fig. 2, each point compares IPADE to a second algorithm on a single dataset. The x-axis position of the point is the accuracy of IPADE, and the y-axis position is the accuracy of the comparison algorithm. Therefore, points below the y = x line correspond to datasets for which IPADE performs better than the second algorithm. In order to test the reduction capabilities of PG methods in comparison with IPADE, each point of Fig. 3 shows the Acc·Red obtained on a single dataset.

Fig. 4 shows a graphical representation of the convergence of the IPADE model over three different datasets. The graph shows a line representing the accuracy rate at each step and its corresponding reduction rate (in brackets). The x-axis represents the number of iterations of the main loop of IPADE, and the y-axis represents the accuracy rate currently achieved.

Tables II and III present the statistical analysis conducted by nonparametric multiple comparison procedures for accuracy and Acc·Red, respectively.
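For reference, the following is a sketch of the Friedman aligned-ranks computation behind the FA columns of Tables II and III, assuming a (datasets × algorithms) matrix of scores where higher is better; the random demo data stands in for the real results, and the Holm adjustment of p-values is omitted.

# Sketch of the Friedman aligned (FA) ranking on illustrative data.
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(0)
results = rng.random((50, 4))             # 50 datasets, 4 algorithms

# Align each dataset (block) by subtracting its mean across algorithms.
aligned = results - results.mean(axis=1, keepdims=True)
# Rank all aligned observations jointly; negate so that the best
# (largest) value receives rank 1, matching "lower ranking is better".
ranks = rankdata(-aligned, method='average').reshape(aligned.shape)
fa_ranking = ranks.mean(axis=0)           # average FA rank per algorithm
print(fa_ranking)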
More specifically, we have used the Friedman aligned (FA) procedure [35], [36] to compute the set of rankings that represent the effectiveness associated with each algorithm (second column). Both tables are ordered from the best to the worst ranking. In addition, the third column shows the adjusted p-value obtained with Holm's test (HAPV) [35]. Note that IPADE is established as the control algorithm because it has obtained the best FA ranking. Using a level of significance α = 0.01, IPADE is significantly better than the rest of the methods, considering both the accuracy and Acc·Red measures. More information about these tests and other statistical procedures can be found at http://sci2s.ugr.es/sicidm/. For the sake of simplicity, we only include the graphical and statistical results achieved; the complete results can be found at the web page associated with this brief.

TABLE II
AVERAGE RANKINGS OF THE ALGORITHMS (FA + HAPV) FOR THE ACCURACY MEASURE

Algorithm   | Accuracy FA | Accuracy HAPV
IPADE       | 109.63      | —
LVQPRU      | 199.66      | 6.4064×10⁻⁴
LVQ3        | 203.22      | 6.4064×10⁻⁴
RSP3        | 231.80      | 7.9160×10⁻⁶
1NN         | 236.53      | 4.2657×10⁻⁶
AMPSO       | 248.11      | 5.0703×10⁻⁷
ENPC        | 259.13      | 5.4223×10⁻⁸
HYB         | 268.14      | 5.6781×10⁻⁹
Mixt_Gauss  | 272.02      | 3.4239×10⁻⁶

TABLE III
AVERAGE RANKINGS OF THE ALGORITHMS (FA + HAPV) FOR THE Acc·Red MEASURE

Algorithm   | Acc·Red FA | Acc·Red HAPV
IPADE       | 53.83      | —
LVQ3        | 125.92     | 0.0055
Mixt_Gauss  | 169.18     | 2.2324×10⁻⁵
LVQPRU      | 170.38     | 2.2324×10⁻⁵
AMPSO       | 182.54     | 2.9965×10⁻⁶
ENPC        | 267.91     | 9.3280×10⁻¹⁶
RSP3        | 275.32     | 9.9684×10⁻¹⁷
HYB         | 362.57     | 1.5321×10⁻³¹
1NN         | 421.83     | 1.5321×10⁻⁴⁴

Looking at Tables II and III and Figs. 2-4, we highlight the following observations.

1) Fig. 2 shows that the proposed IPADE outperforms, on average, the rest of the PG techniques with the parameter setting established. The most competitive algorithms with respect to IPADE, in terms of the accuracy measure, are LVQ3 and LVQPRU: in this figure, most of the LVQ3 and LVQPRU points are close to the y = x line. However, the statistical test confirms that IPADE significantly outperforms these methods.

2) The tradeoff between accuracy and reduction rate is an important factor because the efficiency of the NN classifier depends on the resulting number of prototypes of GS. Fig. 3 shows that achieving this balance between accuracy and reduction rate is a difficult task. IPADE is the best performing method considering this balance: in Fig. 3, there are more points under the y = x line than in Fig. 2. Furthermore, Table III also supports this statement, showing smaller p-values when the reduction rate is considered.

3) Observing the map of convergence of Fig. 4, we can highlight the DE algorithm as a promising optimizer because it is able to reach highly accurate results very quickly. This implies that the IPADE scheme needs only a small number of iterations.

V. CONCLUSION

In this brief, we have presented a new data reduction technique called IPADE, which iteratively learns the most adequate number of prototypes per class and their respective positioning for the NN algorithm, acting as a PG method. This technique uses a real-parameter optimization procedure based on DE to adjust the positioning of the prototypes at each step. The large experimental study performed shows that IPADE is a suitable method for PG in NN classification.
Furthermore, because IPADE is a heuristic optimization approach, future work could use this technique for building an ensemble of classifiers.

REFERENCES

[1] E. Alpaydin, Introduction to Machine Learning, 2nd ed. Cambridge, MA: MIT Press, 2010.
[2] I. Kononenko and M. Kukar, Machine Learning and Data Mining: Introduction to Principles and Algorithms. Chichester, U.K.: Horwood Publishing Ltd., 2007.
[3] T. M. Mitchell, Machine Learning. New York: McGraw-Hill, 1997.
[4] E. K. García, S. Feldman, M. R. Gupta, and S. Srivastava, "Completely lazy learning," IEEE Trans. Knowl. Data Eng., vol. 22, no. 9, pp. 1274–1285, Sep. 2010.
[5] T. Cover and P. Hart, "Nearest neighbor pattern classification," IEEE Trans. Inform. Theory, vol. 13, no. 1, pp. 21–27, Jan. 1967.
[6] A. N. Papadopoulos and Y. Manolopoulos, Nearest Neighbor Search: A Database Perspective. New York: Springer-Verlag, 2004.
[7] X. Wu and V. Kumar, The Top Ten Algorithms in Data Mining. London, U.K.: Chapman & Hall, 2009.
[8] D. R. Wilson and T. R. Martinez, "Improved heterogeneous distance functions," J. Artif. Intell. Res., vol. 6, no. 1, pp. 1–34, Jan. 1997.
[9] F. Fernández and P. Isasi, "Local feature weighting in nearest prototype classification," IEEE Trans. Neural Netw., vol. 19, no. 1, pp. 40–53, Jan. 2008.
[10] N. García-Pedrajas, "Constructing ensembles of classifiers by means of weighted instance selection," IEEE Trans. Neural Netw., vol. 20, no. 2, pp. 258–277, Feb. 2009.
[11] C. G. Atkeson, A. W. Moore, and S. Schaal, "Locally weighted learning," Artif. Intell. Rev., vol. 11, nos. 1–5, pp. 11–73, Feb. 1997.
[12] R. Paredes and E. Vidal, "Learning prototypes and distances: A prototype reduction technique based on nearest neighbor error minimization," Pattern Recognit., vol. 39, no. 2, pp. 180–188, Feb. 2006.
[13] H. Liu and H. Motoda, Feature Extraction, Construction and Selection: A Data Mining Perspective. Norwell, MA: Kluwer, 2001.
[14] H. Liu and H. Motoda, Instance Selection and Construction for Data Mining. Norwell, MA: Kluwer, 2001.
[15] H. Fayed and A. Atiya, "A novel template reduction approach for the k-nearest neighbor method," IEEE Trans. Neural Netw., vol. 20, no. 5, pp. 890–896, May 2009.
[16] H. A. Fayed, S. R. Hashem, and A. F. Atiya, "Self-generating prototypes for pattern classification," Pattern Recognit., vol. 40, no. 5, pp. 1498–1509, May 2007.
[17] W. Lam, C.-K. Keung, and D. Liu, "Discovering useful concept prototypes for classification based on filtering and abstraction," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 8, pp. 1075–1090, Aug. 2002.
[18] C.-L. Chang, "Finding prototypes for nearest neighbor classifiers," IEEE Trans. Comput., vol. 23, no. 11, pp. 1179–1184, Nov. 1974.
[19] C. H. Chen and A. Jóźwik, "A sample set condensation algorithm for the class sensitive artificial neural network," Pattern Recognit. Lett., vol. 17, no. 8, pp. 819–823, Jul. 1996.
[20] M. Lozano, J. M. Sotoca, J. S. Sánchez, F. Pla, E. Pekalska, and R. P. W. Duin, "Experimental study on prototype optimisation algorithms for prototype-based classification in vector spaces," Pattern Recognit., vol. 39, no. 10, pp. 1827–1838, Oct. 2006.
[21] J. S. Sánchez, "High training set size reduction by space partitioning and prototype abstraction," Pattern Recognit., vol. 37, no. 7, pp. 1561–1564, 2004.
[22] T. Kohonen, "The self-organizing map," Proc. IEEE, vol. 78, no. 9, pp. 1464–1480, Sep. 1990.
[23] J. Li, M. T. Manry, C. Yu, and D. R. Wilson, "Prototype classifier design with pruning," Int. J.
Artif. Intell. Tools, vol. 14, nos. 1–2, pp. 261–280, 2005.
[24] S.-W. Kim and B. J. Oommen, "Enhancing prototype reduction schemes with LVQ3-type algorithms," Pattern Recognit., vol. 36, no. 5, pp. 1083–1093, May 2003.
[25] F. Fernández and P. Isasi, "Evolutionary design of nearest prototype classifiers," J. Heuristics, vol. 10, no. 4, pp. 431–454, Jul. 2004.
[26] L. Nanni and A. Lumini, "Particle swarm optimization for prototype reduction," Neurocomputing, vol. 72, nos. 4–6, pp. 1092–1097, Jan. 2009.
[27] A. Cervantes, I. M. Galván, and P. Isasi, "AMPSO: A new particle swarm method for nearest neighborhood classification," IEEE Trans. Syst., Man, Cybern., Part B: Cybern., vol. 39, no. 5, pp. 1082–1091, Oct. 2009.
[28] R. Storn and K. Price, "Differential evolution – A simple and efficient heuristic for global optimization over continuous spaces," J. Global Optim., vol. 11, no. 4, pp. 341–359, Dec. 1997.
[29] K. V. Price, R. M. Storn, and J. A. Lampinen, Differential Evolution: A Practical Approach to Global Optimization (Natural Computing Series), G. Rozenberg, T. Bäck, A. E. Eiben, J. N. Kok, and H. P. Spaink, Eds. New York: Springer-Verlag, 2005.
[30] K. V. Price, An Introduction to Differential Evolution. London, U.K.: McGraw-Hill, 1999.
[31] F. Neri and V. Tirronen, "Scale factor local search in differential evolution," Memetic Comput., vol. 1, no. 2, pp. 153–171, 2009.
[32] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. San Mateo, CA: Morgan Kaufmann, 2005.
[33] J. Alcalá-Fdez, L. Sánchez, S. García, M. J. del Jesus, S. Ventura, J. M. Garrell, J. Otero, C. Romero, J. Bacardit, V. M. Rivas, J. C. Fernández, and F. Herrera, "KEEL: A software tool to assess evolutionary algorithms for data mining problems," Soft Comput., vol. 13, no. 3, pp. 307–318, Oct. 2009.
[34] J. Alcalá-Fdez, A. Fernández, J. Luengo, J. Derrac, S. García, L. Sánchez, and F. Herrera, "KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework," J. Multiple-Valued Logic Soft Comput., 2010, to be published.
[35] S. García, A. Fernández, J. Luengo, and F. Herrera, "Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power," Inform. Sci., vol. 180, no. 10, pp. 2044–2064, May 2010.
[36] S. García, A. Fernández, J. Luengo, and F. Herrera, "A study of statistical techniques and performance measures for genetics-based machine learning: Accuracy and interpretability," Soft Comput., vol. 13, no. 10, pp. 959–977, Apr. 2009.

1. Prototype generation for supervised classification

1.3 Differential Evolution for Optimizing the Positioning of Prototypes in Nearest Neighbor Classification

• I. Triguero, S. García, F. Herrera, Differential Evolution for Optimizing the Positioning of Prototypes in Nearest Neighbor Classification. Pattern Recognition 44 (4) (2011) 901-916, doi: 10.1016/j.patcog.2010.10.020.
– Status: Published.
– Impact Factor (JCR 2011): 2.292
– Subject Category: Computer Science, Artificial Intelligence. Ranking 18 / 111 (Q1).
– Subject Category: Engineering, Electrical & Electronic. Ranking 35 / 245 (Q1).
Pattern Recognition 44 (2011) 901–916
Contents lists available at ScienceDirect
Pattern Recognition
journal homepage: www.elsevier.com/locate/pr

Differential evolution for optimizing the positioning of prototypes in nearest neighbor classification

Isaac Triguero a,∗, Salvador García b, Francisco Herrera a

a Department of Computer Science and Artificial Intelligence, CITIC-UGR (Research Center on Information and Communications Technology), University of Granada, 18071 Granada, Spain
b Department of Computer Science, University of Jaén, 23071 Jaén, Spain

ARTICLE INFO

Article history: Received 2 June 2010; received in revised form 6 September 2010; accepted 24 October 2010.
Keywords: Differential evolution, Prototype generation, Prototype selection, Evolutionary algorithms, Classification.

ABSTRACT

Nearest neighbor classification is one of the most used and well known methods in data mining. Its simplest version has several drawbacks, such as low efficiency, high storage requirements and sensitivity to noise. Data reduction techniques have been used to alleviate these shortcomings. Among them, prototype selection and generation techniques have been shown to be very effective. Positioning adjustment of prototypes is a successful trend within the prototype generation methodology. Evolutionary algorithms are adaptive methods based on natural evolution that may be used for searching and optimization. Positioning adjustment of prototypes can be viewed as an optimization problem, and thus it can be solved using evolutionary algorithms. This paper proposes a differential evolution based approach for optimizing the positioning of prototypes. Specifically, we provide a complete study of the performance of four recent advances in differential evolution. Furthermore, we show the good synergy obtained by the combination of a prototype selection stage with an optimization of the positioning of prototypes previous to nearest neighbor classification. The results are contrasted with non-parametrical statistical tests and show that our proposals outperform previously proposed methods. © 2010 Elsevier Ltd. All rights reserved.

1. Introduction

The nearest neighbor (NN) algorithm [1] and its derivatives have been shown to perform well for classification problems in many domains [2,3]. These algorithms are also known as instance-based learning [4] and belong to the lazy learning family of methods [5]. The extended version of NN to k neighbors is considered one of the most influential data mining algorithms [6] and it has attracted much attention and research effort in recent years [7–10].

The NN classifier requires that all of the data instances are stored; unseen cases are classified by finding the class labels of the closest instances to them. In order to determine how close two instances are, several distances or similarity measures have been proposed [11–13] and this issue is continually under review [14,15]. Despite its simplicity and high classification accuracy, the NN classifier suffers from several drawbacks, such as high computational cost, high storage requirements and sensitivity to noise.

Data reduction processes are very useful in data mining to improve and simplify the models extracted by the algorithms [16].

∗Corresponding author. Tel.: +34 958 240598; fax: +34 958 243317.
E-mail addresses: [email protected] (I. Triguero), [email protected] (S. García), [email protected] (F. Herrera).

0031-3203/$ - see front matter © 2010 Elsevier Ltd. All rights reserved.
doi:10.1016/j.patcog.2010.10.020

In NN, apart from feature selection [17,18], two main data reduction techniques have been used with promising results: prototype selection (PS) and prototype generation (PG) [19,20]. The former is limited to selecting a subset of instances from the original training set. Typically, three types of PS methods are known: condensation [21], edition [22] and hybrid methods [23]. Condensation methods try to remove examples which are redundant or irrelevant, meaning that these examples do not contribute anything to the classification task. Edition methods, by contrast, focus on removing noisy examples, which are those examples that induce classification errors. Finally, hybrid methods combine both approaches. In the specialized literature, a wide number of PS techniques have been proposed. Since the first approaches for data condensation and edition, CNN [21] and ENN [22], many other proposals of PS have become well-known in this field, for example, the IBL methods [4], the DROP family of methods [19] and ICF [23]. Recent approaches to PS are introduced in [24–26].

Regarding PG methods, also known as prototype abstraction methods [27], they are not only able to select data, but can also modify them, allowing interpolations, movements of instances and artificial generation of new data. Well-known methods for PG are PNN [28], learning vector quantization (LVQ) [29], Chen's algorithm [30], ICPL [27], HYB [31] and MixtGauss [32]. A good study of PS and PG can be found in [33].

Evolutionary algorithms (EAs) [34] have been successfully used in different data mining problems [35,36]. Given that the PS and PG problems can be seen as combinatorial and optimization problems, EAs have been used to solve them with excellent results [37–40]. PS can be expressed as a binary space search problem and, as far as we know, the best evolutionary model proposed for PS is based on memetic algorithms [41] and is called SSMA [38]. PG is expressed as a continuous space search problem. EAs for PG are based on the positioning adjustment of prototypes, which is a suitable method to optimize the position of prototypes; however, it usually depends upon an initial subset of prototypes extracted from the training set. Several proposals have been presented on this topic, such as ENPC [42] and PSCSA [43].

Particle swarm optimization (PSO) [44,45] and differential evolution (DE) [46,47] are two effective evolutionary optimization techniques for continuous spaces. In fact, PSO has been satisfactorily used for prototype adjustment [39,40]. The first attempts at using DE for PG can be found in [48]. In that contribution, we did a preliminary study on the use of DE, concluding that the classic DE scheme offers competitive results compared to other PG approaches on small size data sets.

The first contribution of this paper is the use of the DE algorithm [47] for prototype adjustment. The specialized literature on DE collects several advanced schemes: SADE [49], OBDE [50], DEGL [51], JADE [52] and SFLSDE [53]. We will study the mentioned proposals for PG, except OBDE. This last one, as the authors state, may not be used for problems where basic knowledge is available. It constantly seeks the opposite solution to the one evaluated in the search process and, in PG, this behavior does not make sense.
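For context, the following is a one-line sketch of the "opposite solution" used by opposition-based schemes such as OBDE, assuming attributes normalized to an interval [a, b]: each candidate x is mirrored to a + b − x. When prototypes are placed using prior knowledge of the training data, this mirrored point has no meaningful interpretation, which is why OBDE is excluded from the comparison.

# Opposition of a candidate solution over [a, b] (illustrative values).
def opposite(x, a=0.0, b=1.0):
    return [a + b - xi for xi in x]

print(opposite([0.2, 0.9, 0.5]))   # [0.8, 0.1, 0.5]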
The remaining proposals will be compared with evolutionary and non-evolutionary PG algorithms, and we will analyze the behavior of each DE algorithm on this problem.

It is common to use classical PS methods in pre- or post-processing stages of a PG algorithm as mechanisms for removing noisy or redundant prototypes. For example, some PG methods implement the ENN or DROP algorithms as early filtering processes [27,54] and, in [31], a hybridization method based on LVQ3 post-processing of conventional prototype reduction approaches is proposed. The second contribution of this paper follows a similar idea to that presented in [31], but it is extended so that PS methods can be hybridized with any positioning adjustment of prototypes method. Specifically, we study the use of the LVQ3, PSO and DE algorithms for optimizing the positioning of prototypes after a PS process. We will see that LVQ3 does not produce optimal positioning in most cases, whereas PSO and DE result in excellent accuracy rates in comparison with isolated PS methods and PG methods. We especially emphasize the use of SSMA in combination with one of the two mentioned optimization approaches to also achieve high reduction rates in the final set of prototypes obtained.

As we have stated before, the use of DE algorithms for the PG problem motivates the global purpose of this paper, which can be divided into three objectives:

• To make an empirical study analyzing the DE algorithms for the PG problem in terms of accuracy and reduction capabilities. Our goal is to identify the best DE methods and stress the relevant properties of each one when they tackle the PG problem.
• To understand how positioning adjustment techniques can improve the classification accuracy of PS and PG methods with the use of hybridization models.
• To check the behavior and scaling-up capabilities of DE approaches and hybrid approaches for PG when tackling large size data sets.

The experimental study includes a statistical analysis based on non-parametric tests, and we conduct experiments involving a total of 56 small and large size data sets.

In order to organize this paper, Section 2 describes the background of PS, PG and DE. Section 3 explains the DE algorithm proposed for tackling the position adjustment problem. Section 4 presents the framework of the hybridization proposed. Section 5 discusses the experimental framework and Section 6 presents the analysis of results. Finally, in Section 7 we summarize our conclusions.

2. Background

This section covers the background information necessary to define and describe our proposals. Section 2.1 presents the background on PS and PG. Next, Section 2.2 shows the main characteristics of DE, and the most recent advances proposed in the literature are presented in Section 2.3.

2.1. PS and PG algorithms

This section presents the definition and notation for both the PS and PG problems. A formal specification of the PS problem is the following: Let x_p be an example, where x_p = (x_{p1}, x_{p2}, ..., x_{pD}, ω), with x_p belonging to a class ω given by x_{pω}, and a D-dimensional space in which x_{pi} is the value of the i-th feature of the p-th sample. Then, let us assume that there is a training set TR which consists of n instances x_p, and a test set TS composed of t instances x_q, with ω unknown. Let SS ⊆ TR be the subset of selected samples resulting from the execution of a PS algorithm; then we classify a new pattern x_q from TS by the NN rule acting over SS.
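As a minimal illustration of this notation, the sketch below classifies a query x_q by the NN rule acting over a selected subset SS, using Euclidean distance; SS_X, SS_y and the query values are illustrative.

# 1NN classification over a selected subset SS.
import numpy as np

def nn_classify(x_q, SS_X, SS_y):
    """Return the class of the nearest instance in SS (Euclidean distance)."""
    dists = np.linalg.norm(SS_X - x_q, axis=1)
    return SS_y[np.argmin(dists)]

SS_X = np.array([[0.1, 0.2], [0.8, 0.9], [0.4, 0.5]])
SS_y = np.array([0, 1, 0])
print(nn_classify(np.array([0.7, 0.8]), SS_X, SS_y))   # -> 1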
The purpose of PG is to obtain a prototype generated set GS, which consists of r, r < n, prototypes, which are either selected or generated from the examples of TR. The prototypes of the generated set are determined so as to represent efficiently the distributions of the classes and to discriminate well when used to classify the training objects. Their cardinality should be sufficiently small to reduce both the storage and the evaluation time spent by an NN classifier.

Both evolutionary and non-evolutionary approaches to PS and PG will be analyzed in the experimental study. A brief description of the methods compared will be detailed in Section 5.

2.2. Differential evolution

Differential evolution follows the general procedure of an EA. DE starts with a population of NP candidate solutions, so-called individuals. The initial population should cover the entire search space as much as possible. In some problems this is achieved by uniformly randomizing individuals, but in other problems, such as that considered in this paper, basic knowledge of the problem is available and the use of other initialization mechanisms is more effective. The subsequent generations in DE are denoted by G = 0, 1, ..., Gmax. It is usual to denote each individual as a D-dimensional vector X_{i,G} = {x^1_{i,G}, ..., x^D_{i,G}}, called a "target vector".

2.2.1. Mutation operation

After initialization, DE applies the mutation operator to generate a mutant vector V_{i,G} with respect to each individual X_{i,G} in the current population. For each target X_{i,G}, at generation G, its associated mutant vector is V_{i,G} = {v^1_{i,G}, ..., v^D_{i,G}}. The method of creating this mutant vector is that which differentiates one DE scheme from another. Six of the most frequently referenced strategies are listed below:

"DE/Rand/1": V_{i,G} = X_{r1,G} + F · (X_{r2,G} − X_{r3,G})    (1)
"DE/Best/1": V_{i,G} = X_{best,G} + F · (X_{r1,G} − X_{r2,G})    (2)
"DE/RandToBest/1": V_{i,G} = X_{i,G} + F · (X_{best,G} − X_{i,G}) + F · (X_{r1,G} − X_{r2,G})    (3)
"DE/Best/2": V_{i,G} = X_{best,G} + F · (X_{r1,G} − X_{r2,G}) + F · (X_{r3,G} − X_{r4,G})    (4)
"DE/Rand/2": V_{i,G} = X_{r1,G} + F · (X_{r2,G} − X_{r3,G}) + F · (X_{r4,G} − X_{r5,G})    (5)
"DE/RandToBest/2": V_{i,G} = X_{i,G} + F · (X_{best,G} − X_{i,G}) + F · (X_{r1,G} − X_{r2,G}) + F · (X_{r3,G} − X_{r4,G})    (6)

The indices r^i_1, r^i_2, r^i_3, r^i_4 and r^i_5 are mutually exclusive integers randomly generated within the range [1, NP], which are also different from the base index i. These indices are randomly generated once for each mutation. The scaling factor F is a positive control parameter for scaling the difference vectors. X_{best,G} is the best individual of the population in terms of fitness.

2.2.2. Crossover operator

After the mutation phase, a crossover operation is applied to increase the potential diversity of the population. The DE algorithm can use three kinds of crossover schemes, known as "binomial", "exponential" and "arithmetic" crossovers.
This operator is applied to each pair of the target vector X_{i,G} and its corresponding mutant vector V_{i,G} to generate a new trial vector that we denote U_{i,G}; the mutant vector exchanges its components with the target vector X_{i,G}. We will focus on the binomial crossover scheme, which is performed on each component whenever a randomly picked number between 0 and 1 is less than or equal to the crossover rate (CR). The CR is a user-specified constant within the range [0, 1), which controls the fraction of parameter values copied from the mutant vector. This scheme may be outlined as

u^j_{i,G} = v^j_{i,G} if rand(0, 1) ≤ CR or j = j_rand; x^j_{i,G} otherwise, for j = 1, 2, ..., D,    (7)

where rand(0, 1) ∈ [0, 1) is a uniformly distributed random number, and j_rand ∈ {1, 2, ..., D} is a randomly chosen index, which ensures that U_{i,G} gets at least one component from V_{i,G}.

Finally, we describe the arithmetic crossover, which generates the trial vector U_{i,G} as

U_{i,G} = X_{i,G} + K · (V_{i,G} − X_{i,G}),    (8)

where K is the combination coefficient, which is usually taken in the interval [0, 1]. This strategy is known as "DE/CurrentToRand/1".

2.2.3. Selection operator

When the trial vector has been generated, we must decide which individual between X_{i,G} and U_{i,G} should survive in the population of the next generation G + 1. The selection operator is described as follows:

X_{i,G+1} = U_{i,G} if f(U_{i,G}) is better than f(X_{i,G}); X_{i,G} otherwise,    (9)

where f() is the fitness function to be minimized. If the new trial vector yields a solution equal to or better than the target vector, it replaces the corresponding target vector in the next generation; otherwise the target is retained in the population. Therefore, the population always improves or retains the same fitness values, but never deteriorates. This one-to-one selection procedure is generally kept fixed in most DE algorithms.

2.3. Advanced proposals for DE

The success of DE in solving a specific problem crucially depends on choosing the appropriate mutation strategy and its associated control parameter values (F and CR), which determine the convergence speed. Hence, a fixed selection of these parameters can produce slow and/or premature convergence depending on the problem. Thus, researchers have investigated parameter adaptation mechanisms to improve the performance of the basic DE algorithm. Now, we describe four of the newest and best DE algorithms proposed in the literature.

2.3.1. Self-adaptive differential evolution (SADE)

SADE [49] was proposed by Qin et al. to alleviate the expensive trial-and-error search for the most adequate parameters and mutation strategy. It simultaneously implements four mutation strategies (Eqs. (1), (5), (6) and (8)), the so-called candidate pool. For each target vector X_{i,G} in the current population, we have to decide which strategy is selected. Initially, the probability with respect to each strategy is 1/S, where S is the number of strategies. SADE adapts the probability of generating offspring by each strategy based on their success ratios in the past LP generations. Specifically, it introduces success and failure memories to store the number of trial vectors U_{i,G} that enter the next generation and the number of discarded U_{i,G}. In SADE, the mutation factors F_i are independently generated at each generation according to a normal distribution N(0.5, 0.3). Since the proper choice of CR can lead to successful optimization, the range of CR values is gradually adjusted according to the previous values, using a memory that stores the CR values with respect to each strategy, and a normal distribution.

2.3.2. Adaptive differential evolution with optional external archive (JADE)

JADE [52] was proposed by Zhang and Sanderson and is based on a new mutation strategy and parameter adaptation. The new strategy is called DE/RandToPBest with an optional archive, which is created to resolve the premature convergence of greedy strategies such as DE/RandToBest/k¹ and DE/Best/k.
The authors call p the percentage (per unit) of individuals that are considered in the mutation strategy. A mutation vector with DE/RandToPBest/1 with archive is generated as follows:

V_{i,G} = X_{i,G} + F_i · (X^p_{best,G} − X_{i,G}) + F_i · (X_{r1,G} − X̃_{r2,G}),    (10)

where X_{i,G}, X_{r1,G} and X^p_{best,G} are selected from P (the current population), while X̃_{r2,G} is randomly chosen from the union P ∪ A, where A is the archive of inferior solutions stored from recent explorations. The archive is initially empty. Then, after each generation, the solutions that fail in the selection process are stored in the archive. When the archive size exceeds a certain threshold, some solutions are randomly removed from A.

Furthermore, this algorithm proposes a parameter adaptation in which the mutation factor F_i of each individual is independently generated according to a Cauchy distribution [55,56] with location parameter μ_F and scale parameter 0.1, and then it is truncated to 1 if F_i ≥ 1 or regenerated if F_i ≤ 0. At each generation, the crossover rate CR_i is generated according to a normal distribution N(μ_CR, 0.1), and then truncated to [0, 1].

¹It is also known as DE/CurrentToBest/1.

2.3.3. Differential evolution using a neighborhood-based mutation operator (DEGL)

DEGL [51] is also motivated by DE/RandToBest/1. It proposes a new mutation model based on neighborhoods. The authors define two kinds of neighborhood, called "local" and "global" neighborhoods, and hence propose two kinds of mutation operator. The local neighborhood is not necessarily local in the sense of geographical nearness or similar fitness values. These mutation operators are combined into one in the following manner. For each member of the population, a local trial vector is created by employing the best (fittest) vector in the neighborhood as

L_{i,G} = X_{i,G} + F · (X_{Lbest_i,G} − X_{i,G}) + F · (X_{p,G} − X_{q,G}),    (11)

where Lbest_i is the best vector in the local neighborhood of X_{i,G}, and p, q are the indices of two random vectors extracted from the local neighborhood. Similarly, the global trial vector is created as

g_{i,G} = X_{i,G} + F · (X_{gbest_i,G} − X_{i,G}) + F · (X_{r1,G} − X_{r2,G}),    (12)

where gbest_i is the best vector in the current population, and r1 and r2 are randomly chosen in the interval [1, NP]. To combine both operators, a new parameter known as the "scalar weight" ω ∈ (0, 1) is used, with the following expression:

V_{i,G} = ω · g_{i,G} + (1 − ω) · L_{i,G}.    (13)

As with other adaptive methods, different schemes for adaptation are proposed.
Three kinds of adaptation for the new ω parameter are introduced: a deterministic linear or exponential increment, a random value for each vector, and a self-adaptive weight factor scheme. However, DEGL does not provide an adaptive control parameter for F and CR.

2.3.4. Scale factor local search in differential evolution (SFLSDE)

Scale factor local search in differential evolution was proposed by Neri and Tirronen [53]. This self-adaptive algorithm was inspired by memetic algorithms. In order to guarantee a high quality solution, SFLSDE uses two local search algorithms in the scale factor space to find the appropriate parameters for a given X_{i,G}. Specifically, it follows two different approaches: scale factor golden section search (SFGSS) and scale factor hill-climb (SFHC). Both are based on changing the scale factor value and calculating the fitness value of the trial vector U_{i,G} after the mutation and crossover phases. SFLSDE follows the typical DE scheme, but at each iteration five random numbers are generated (rand1, ..., rand5) and used to determine the corresponding trial vector U_{i,G}. The values of the parameters are as follows:

F_i = SFGSS               if rand5 < τ3,
      SFHC                if τ3 ≤ rand5 < τ4,
      F_l + F_u · rand1   if rand5 > τ4 and rand2 < τ1,
      F_i                 otherwise,    (14)

CR_i = rand3   if rand4 < τ2,
       CR_i    otherwise,    (15)

where τ_k, k ∈ {1, 2, 3, 4}, are constant threshold values. In [53], the authors only use the DE/Rand/1/Bin mutation strategy; in the experimental study, we incorporate other mutation strategies.

Fig. 1. DE algorithm basic structure.

3. Differential evolution for prototype generation

In this section we explain the proposal to apply the underlying idea of DE to the PG problem as a positioning adjustment of prototypes scheme. Fig. 1 shows the pseudo-code of the model proposed with the DE/Rand/1 mutation strategy and binomial crossover. In the following we describe the most significant instructions, enumerated from 1 to 34.

First of all, it is necessary to define the solution codification. In the proposed DE algorithm, each individual X_{i,G} in the population encodes a complete solution; that is, a reduced set of prototypes is encoded sequentially in each individual. The number of prototypes encoded in each individual defines its individual size, denoted r as previously; a user parameter sets this value r. It is necessary to point out that this parameter r is different from the parameter D explained in Section 2.2. The dimensionality D corresponds to the number of input attributes of the problem. Following the notation used in Section 2.2, X_{i,G} defines the target vector, but in our case the target vector can be represented as a matrix. Table 1 describes the structure of an individual. Furthermore, each prototype p_j, 1 ≤ j ≤ r, of an individual X_{i,G} has a class x_{pω,j}. This class value remains unchangeable by the DE operators throughout the evolutionary cycle, and it is fixed from the beginning of the process. The number of prototypes evolved for each class is assigned in the initialization stage.

3.1. Initialization

DE begins with a population of NP individuals X_{i,G}. Given that this problem provides some knowledge based on the initial arrangement of training samples, instruction 3 initializes each individual X_{i,G} by choosing r random prototypes from TR. The initialization process ensures that every class has at least one representative prototype. Specifically, we use an initial random stratified selection of prototypes which guarantees that the number of prototypes encoded in each individual for each class is proportional to the number of them in TR. There must be at least one prototype for each class encoded in the individuals. Fig. 2 shows an example. It is important to point out that every solution must have the same structure: they must have the same number of prototypes per class, and the classes must have the same arrangement in the matrix X_{i,G}. Following the example of Fig. 2, each individual X_{i,0} should contain four prototypes, in the following order: three prototypes of Class 0 and one of Class 1.

Table 1. Encoding of a set of prototypes in an individual X_{i,G} for the DE algorithm.
              Attribute 1   Attribute 2   ...   Attribute D   Class
Prototype 1   x_{p1,1}      x_{p2,1}      ...   x_{pD,1}      x_{pω,1}
Prototype 2   x_{p1,2}      x_{p2,2}      ...   x_{pD,2}      x_{pω,2}
...           ...           ...           ...   ...           ...
Prototype r   x_{p1,r}      x_{p2,r}      ...   x_{pD,r}      x_{pω,r}

3.2. Mutation and crossover operators

The mutation and crossover strategies explained in Section 2 have been implemented. In instructions 9–14, the algorithm selects three or five random individuals, depending on the mutation strategy, and then generates the mutant matrix V_{i,G} with respect to each individual X_{i,G} in the current population. The operations of addition, subtraction and scalar product are carried out as for ordinary matrices. This is the justification for the individuals having the same structure: for the mutation operator to make sense, the operators must act over the same attributes and over prototypes of the same class in all cases.

After applying this operator, it is necessary to check that the mutant matrix V_{i,G} has been generated with correct values for all features of the prototypes, i.e. to check that the values are in the correct range. Instruction 15 normalizes all attributes of the data set to the [0, 1] range, so this procedure only needs to check whether there are values out of the range [0, 1]. If a computed value is greater than 1, we truncate it to 1, and if it is lower than 0, we establish it at 0.

Our previous work [48] indicates that the binomial crossover operator has more suitable behavior for the PG problem than the rest of the operators. The new trial matrix is generated by using Eq. (7), and instructions 16–21 show this operation. In PG, instead of interchanging attribute values, the mutant matrix V_{i,G} exchanges its prototypes with the target X_{i,G} to generate a new trial matrix U_{i,G}.

3.3. Selection operator

This operator must decide which individual between X_{i,G} and U_{i,G} should survive in the population of the next generation G + 1 (instructions 23–26). The NN rule, with k = 1 (1NN), guides this operator. The instances in TR are classified with the prototypes encoded in X_{i,G} or U_{i,G} by the 1NN rule with a leave-one-out validation scheme, and their corresponding fitness values are measured as the accuracy() obtained, which represents the number of successful hits (correct classifications) relative to the total number of classifications. We try to maximize this value, so the selection operator can be viewed as follows:

X_{i,G+1} = U_{i,G} if accuracy(U_{i,G}) ≥ accuracy(X_{i,G}); X_{i,G} otherwise.    (16)

In case of a tie between the accuracy values, we select U_{i,G} in order to give the mutated individual the opportunity to enter the population. Finally, instructions 27–30 check whether the selected individual obtains the best fitness in the population, and instruction 34 returns the best individual found during the evolutionary process.

Fig. 2. Initialization process for an individual X_{i,0} in the Appendicitis data set. TR contains 95 examples. If we establish the reduction rate (RR) at 0.95 and let Z = 1 − RR, then r = Z · 95 examples = 4 prototypes (truncating this value). Appendicitis is composed of two classes, with 76 and 19 examples respectively. Hence, the individual X_{i,0} should contain Z · 76 = 3 prototypes of Class 0 and Z · 19 = 0 prototypes of Class 1; we ensure that Class 1 has at least one prototype.
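The following sketch reproduces the stratified computation of the Fig. 2 example, assuming only the class sizes and the reduction rate are known; prototypes_per_class is a hypothetical helper name, not part of the original method's code.

# Stratified per-class prototype counts of Section 3.1 (Fig. 2 example).
def prototypes_per_class(class_sizes, reduction_rate):
    z = 1.0 - reduction_rate
    # Truncate z * class size, but guarantee at least one prototype per class.
    return [max(1, int(z * s)) for s in class_sizes]

# Appendicitis: 95 examples, classes of 76 and 19, RR = 0.95
print(prototypes_per_class([76, 19], 0.95))   # -> [3, 1]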
4. Hybridizations of prototype selection and generation methods

This section presents the hybridization model that we propose. Section 4.1 enumerates the arguments that justify hybridization. Section 4.2 explains how to construct the hybrid model.

4.1. Motivation

As we stated before, PS and PG relate to different problems. The main drawback of PS methods is that they assume that the best representative examples can be obtained from a subset of the original data, whereas PG methods generate new representative examples if needed. Specifically, positioning adjustment methods aim to correct the position of a subset of prototypes from the initial set by using an optimization procedure. However, positioning adjustment methods are not free of drawbacks:

• They relate to a more complex problem than PS, i.e. the search space can be more difficult to explore.
• As a result of the above, finding a promising solution by using positioning adjustment methods entails a higher cost than a PS method.
• Positioning adjustment methods usually initialize the generated set GS with a fixed number of random prototypes from TR, which are then modified in successive iterations. This characteristic is one of the weaknesses of these methods because this parameter can be very dependent on the specific problem. In principle, a practitioner must know the exact number of prototypes which will compose the final solution for each problem; moreover, the proportion of prototypes between classes should be estimated in order to obtain good solutions. Thus, two schemes of initialization are commonly used: the number of representative instances for each class is proportional to the number of them in the input data, or all the classes are represented with the same number of prototypes.

As we have seen, the appropriate choice of the number of prototypes per class has not been addressed by positioning adjustment techniques.

4.2. Hybrid model

Random selection (stratified or not) of prototypes from TR may not be the most adequate procedure to initialize GS. Instead, we can use a PS algorithm prior to the adjustment process to initialize a subset of prototypes. Making use of this idea, we mitigate the first and second drawbacks stated before, as most of the effort performed by positioning adjustment is made over a localized search area given by a PS solution.
We also tackle the third weakness, because the heuristic of the PS methods is not forced to select a predetermined number of prototypes of each class; it selects the most suitable number of prototypes per class. In addition, if the prototypes selected by a PS method can be tuned in the search space, the main drawback associated with PS is also overcome.

To hybridize PS and positioning adjustment methods, two different methods of initialization of the positioning adjustment algorithm will be used, depending on the type of codification of the solution:

• Complete solution per individual: This corresponds to the case where each individual of the population encodes a complete GS (i.e., the codification used by DE and PSO). The SS must be inserted once as one of the individuals of the population, initializing the rest of the individuals as the standard procedure does.
• Others: This is the case where the complete GS is optimized (i.e., the scheme used by LVQ3). The resulting SS of the PS method is used by the positioning adjustment procedure as the initial set.

When each individual of the evolutionary algorithm encodes a complete GS, it helps to alleviate the complexity of the optimization procedure, because there is a promising initial individual in the population. The operators used by DE and PSO benefit from the presence of this individual. Furthermore, this type of codification tries to avoid getting stuck at a local optimum by initializing the rest of the individuals with random solutions extracted from TR, keeping the same structure as the SS selected by the PS method, as in the example given in Section 3.1. Fig. 3 shows the two different hybrid models. Specifically, Fig. 3(a) presents the scheme to hybridize a PS method with DE and PSO, and Fig. 3(b) shows the hybridization process with LVQ3.

Fig. 3. Hybrid model. (a) PSO and DE approaches; (b) LVQ approach.

5. Experimental framework

In this section, we show the factors and issues related to the experimental study. We provide the measures employed to evaluate the performance of the algorithms (Section 5.1), details of the problems chosen for the experimentation (Section 5.2), an enumeration of the algorithms used for comparison with their respective parameters (Section 5.3) and, finally, the statistical tests employed to contrast the results obtained (Section 5.4).

5.1. Performance measures for standard classification

In this work, we deal with multi-class data sets. In these domains, two measures are widely used for measuring the effectiveness of classifiers because of their simplicity and successful application: accuracy and Cohen's kappa rate. Furthermore, the reduction rate will be used as the classification efficiency measure. They are explained as follows:

• Accuracy: the number of successful hits (correct classifications) relative to the total number of classifications. It has been by far the most commonly used metric for assessing the performance of classifiers for years [57,58].
• Cohen's kappa (kappa rate): an alternative measure to the classification rate, since it compensates for random hits [59]. In contrast to accuracy, kappa evaluates the portion of hits that can be attributed to the classifier itself (i.e., not to mere chance), relative to all the classifications that cannot be attributed to chance alone. Cohen's kappa ranges from −1 (total disagreement) through 0 (random classification) to 1 (perfect agreement). For multi-class problems, kappa is a very useful, yet simple, meter for measuring a classifier's accuracy while compensating for random successes.
• Reduction rate: One of the main goals of the PG and PS methods is to reduce storage requirements. Another goal closely related to this is to speed up classification. A reduction in the number of stored instances will typically yield a corresponding reduction in the time it takes to search through these examples and classify a new input vector.

Note that the accuracy and kappa measures are applied over the training data with a leave-one-out validation scheme.

5.2. Data sets

In the experimental study, we selected 56 data sets from the UCI repository [60] and the KEEL-dataset repository² [61]. Table 2 summarizes the properties of the selected data sets. It shows, for each data set, the number of examples (#Ex.), the number of attributes (#Atts.), and the number of classes (#Cl.). The data sets are grouped into two categories depending on their size: small data sets have fewer than 2000 instances and large data sets have more than 2000 instances. The data sets considered are partitioned using the 10-fold cross-validation (10-fcv) [62,63] procedure.

²http://sci2s.ugr.es/keel/datasets
In K-fold cross-validation (K-fcv), the original sample is randomly partitioned into K subsamples. Of the K subsamples, a single subsample is retained as the validation data for testing the model, and the remaining K−1 subsamples are used as training data. The cross-validation process is then repeated K times (the folds), with each of the K subsamples used exactly once as the validation data. The K results from the folds are averaged to produce a single estimation.

Table 2. Summary description for classification data sets.

Data set          #Ex.   #Atts.  #Cl.
Abalone           4174   8       28
Appendicitis      106    7       2
Australian        690    14      2
Autos             205    25      6
Balance           625    4       3
Banana            5300   2       2
Bands             539    19      2
Breast            286    9       2
Bupa              345    6       2
Car               1728   6       4
Chess             3196   36      2
Cleveland         297    13      5
Coil2000          9822   85      2
Contraceptive     1473   9       3
crx               125    15      2
Dermatology       366    33      6
Ecoli             336    7       8
Flare-solar       1066   9       2
German            1000   20      2
Glass             214    9       7
Haberman          306    3       2
Hayes-roth        133    4       3
Heart             270    13      2
Hepatitis         155    19      2
Housevotes        435    16      2
Iris              150    4       3
Led7digit         500    7       10
Lymphography      148    18      4
Magic             19020  10      2
Mammographic      961    5       2
Marketing         8993   13      9
Monks             432    6       2
Movement_libras   360    90      15
Newthyroid        215    5       3
Pageblocks        5472   10      5
Penbased          10992  16      10
Pima              768    8       2
Saheart           462    9       2
Satimage          6435   36      7
Segment           2310   19      7
Sonar             208    60      2
Spambase          4597   57      2
Spectheart        267    44      2
Splice            3190   60      3
Tae               151    5       3
Texture           5500   40      11
Thyroid           7200   21      3
Tic-tac-toe       958    9       2
Titanic           2201   3       2
Twonorm           7400   20      2
Vehicle           846    18      4
Vowel             990    13      11
Wine              178    13      3
Wisconsin         683    9       2
Yeast             1484   8       10
Zoo               101    16      7

5.3. Comparison algorithms and parameters

Several methods, evolutionary and non-evolutionary, have been selected to perform an exhaustive study of the capabilities of our proposals. Those methods are as follows:

• 1NN: The 1NN rule is used as a baseline limit of performance.

Prototype selection methods:

• DROP3: This combines an edition stage with a decremental approach in which the algorithm checks all the instances in order to find those which should be deleted from GS [19].
• ICF: This method follows an iterative procedure in which the instances susceptible to removal from GS are determined based on reachability and coverage properties of the instances [23].
• SSMA: This memetic algorithm makes use of a local search or meme specifically developed for the prototype selection problem. This interweaving of the global and local search phases allows the two to influence each other [38].

Prototype generation methods:

• LVQ3: Learning vector quantization can be understood as a special case of artificial neural network in which a neuron corresponds to a prototype and a weight-based competition is carried out in order to locate each neuron in a concrete place of the m-dimensional space to increase the classification accuracy [29]. It will be used as an optimizer in the proposed hybrid models.
• MixtGauss: This is an adaptive PG method considered in the framework of mixture modeling by Gaussian distributions, while assuming statistical independence of features. The prototypes are chosen as the mean vectors of the optimized Gaussians, whose mixtures are fit to model each of the classes [32].
• HYB: This constitutes a hybridization of several prototype reduction techniques. Concretely, HYB combines support vector machines with LVQ3 and executes a search in order to find the most appropriate parameters of LVQ3 [31].
• RSP3: This technique is based on Chen's algorithm [30].
Table 2
Summary description for classification data sets.

Data set           #Ex.    #Atts.  #Cl.
Abalone            4174    8       28
Appendicitis       106     7       2
Australian         690     14      2
Autos              205     25      6
Balance            625     4       3
Banana             5300    2       2
Bands              539     19      2
Breast             286     9       2
Bupa               345     6       2
Car                1728    6       4
Chess              3196    36      2
Cleveland          297     13      5
Coil2000           9822    85      2
Contraceptive      1473    9       3
crx                125     15      2
Dermatology        366     33      6
Ecoli              336     7       8
Flare-solar        1066    9       2
German             1000    20      2
Glass              214     9       7
Haberman           306     3       2
Hayes-roth         133     4       3
Heart              270     13      2
Hepatitis          155     19      2
Housevotes         435     16      2
Iris               150     4       3
Led7digit          500     7       10
Lymphography       148     18      4
Magic              19020   10      2
Mammographic       961     5       2
Marketing          8993    13      9
Monks              432     6       2
Movement_libras    360     90      15
Newthyroid         215     5       3
Pageblocks         5472    10      5
Penbased           10992   16      10
Pima               768     8       2
Saheart            462     9       2
Satimage           6435    36      7
Segment            2310    19      7
Sonar              208     60      2
Spambase           4597    57      2
Spectheart         267     44      2
Splice             3190    60      3
Tae                151     5       3
Texture            5500    40      11
Thyroid            7200    21      3
Tic-tac-toe        958     9       2
Titanic            2201    3       2
Twonorm            7400    20      2
Vehicle            846     18      4
Vowel              990     13      11
Wine               178     13      3
Wisconsin          683     9       2
Yeast              1484    8       10
Zoo                101     16      7

5.3. Comparison algorithms and parameters

Several methods, evolutionary and non-evolutionary, have been selected to perform an exhaustive study of the capabilities of our proposals. These methods are the following:

1NN: The 1NN rule is used as a baseline limit of performance.

Prototype selection methods:

DROP3: This combines an edition stage with a decremental approach in which the algorithm checks all the instances in order to find those which should be deleted from GS [19].

ICF: This method follows an iterative procedure in which the instances susceptible to removal from GS are determined on the basis of the reachability and coverage properties of each instance [23].

SSMA: This memetic algorithm makes use of a local search, or meme, specifically developed for the prototype selection problem. This interweaving of the global and local search phases allows the two to influence each other [38].

Prototype generation methods:

LVQ3: Learning vector quantization can be understood as a special case of artificial neural network in which a neuron corresponds to a prototype, and a weight-based competition is carried out in order to locate each neuron at a concrete place of the m-dimensional space so as to increase the classification accuracy [29]. It will be used as an optimizer in the proposed hybrid models.

MixtGauss: This is an adaptive PG method considered in the framework of mixture modeling by Gaussian distributions, assuming statistical independence of features. The prototypes are chosen as the mean vectors of the optimized Gaussians, whose mixtures are fitted to model each of the classes [32].

HYB: This constitutes a hybridization of several prototype reduction techniques. Concretely, HYB combines support vector machines with LVQ3 and executes a search in order to find the most appropriate parameters of LVQ3 [31].

RSP3: This technique is based on Chen's algorithm [30]. The main difference between them is that in Chen's algorithm any subset containing a mixture of instances belonging to different classes can be chosen to be divided, whereas in RSP3 [54] the subset with the highest overlapping degree is the one picked to be split. This process tries to avoid drastic changes in the shape of the decision boundaries associated with TR, which are the main shortcoming of Chen's algorithm.

ENPC: This follows a genetic scheme with five operators which focus their attention on defining regions in the search space [42].

PSO: This adjusts the position of an initial set with the PSO rules, attempting to minimize the classification error [39]. We will use it as an optimizer in the proposed hybrid models.

PSCSA: This is based on an artificial immune system [64], using the clonal selection algorithm to find the most appropriate position for a prototype set [43].

Many different configurations have been established by the authors of each paper for the different techniques. We focus this experimentation on the recommended parameters proposed by the respective authors, assuming that the values of the parameters were optimally chosen. The configuration parameters, which are common for all problems, are shown in Table 3. Note that some methods have no parameters to be fixed, so they are not included in this table. In all of the techniques, the Euclidean distance is used as the similarity function, and the stochastic methods have been run three times per partition.

5.4. Statistical tools for analysis

In this paper, we use hypothesis testing techniques to provide statistical support for the analysis of the results [65,66]. Specifically, we use non-parametric tests, because the initial conditions that guarantee the reliability of parametric tests may not be satisfied, causing a statistical analysis based on them to lose credibility. These tests are suggested in the studies presented in [67,68,65,69], where their use in the field of machine learning is highly recommended.

Throughout the study, we perform several non-parametric tests. The Wilcoxon test [67,68] will be used to perform multiple pairwise comparisons between the different schemes of our proposals. It will be adopted considering a level of significance of α = 0.1. Furthermore, in order to perform multiple comparisons between our proposals and the rest of the techniques considered, we will use the Friedman Aligned-Ranks test [70] to detect statistical differences among a group of results, and the Holm post-hoc test [71] to find out which algorithms are distinctive among the 1×n comparisons performed [69]. A complete description of these statistical tests can be found in Appendix A. More information about these tests and other statistical procedures can be found at http://sci2s.ugr.es/sicidm/.

6. Analysis of results

In this section, we analyze the results obtained from the different experimental studies. Specifically, our aims are:

To compare the different DE schemes to each other and to several classical and recent prototype reduction techniques for 1NN-based classification over small data sets (Section 6.1).
To test the performance of our DE schemes when the size of the problems is increased (Section 6.2).
To show the convergence process of basic and advanced DE algorithms (Section 6.3).
To analyze the benefits of hybrid models over small data sets (Section 6.4).
To check whether the performance of hybrid models is maintained with large data sets (Section 6.5).
6.1. Analysis and results of DE schemes over small size data sets

This study is divided into two parts. First, in Section 6.1.1 we compare the different schemes of DE and identify the best alternatives for the positioning adjustment of prototypes. The Wilcoxon test will be used to support this analysis [67]. Next, Section 6.1.2 presents a comparative study of the best DE methods against other classical and recent PG techniques. In this case, the Friedman Aligned-Ranks test for multiple comparisons will be used in association with the Holm post-hoc test [69]. We have used a total of 40 small data sets of the general framework for both experiments.

6.1.1. Results of DE schemes over small data sets

This experiment focuses on comparing the differences in performance of the DE methods under the experimental framework stated previously. Table 4 shows the average results (and their standard deviations, ±) obtained over small data sets, in training and test data, by six different mutation strategies for the basic DE, two configurations of the SADE parameters, one configuration for JADE, four different schemes for DEGL and, finally, SFLSDE tested with two mutation strategies. The best result in each column is highlighted in bold. The Wilcoxon test is conducted to compute, for each method and with a level of significance of α = 0.1, the number of algorithms outperformed by it and the number of algorithms with no detected differences in performance. Specifically, the column denoted by '+' reflects the number of methods outperformed by the method in the row, and the column '+=' shows the number of methods with similar or worse performance than the method in the row.

Observing Table 4, we can point out some interesting facts:

The choice of an adequate mutation strategy seems to be an important factor influencing the results obtained. When the perturbation process is based on the selection of random individuals to generate a new solution, it may suffer from a lack of exploitation capability; conversely, when the best individual guides the search, exploration capabilities are reduced. The RandToBest strategies have reported the best results because they strike a good balance between exploration (random individuals) and exploitation (best individual).

Advanced proposals such as JADE and DEGL, which are completely motivated by the RandToBest strategy, clearly outperform the basic DE techniques based on the Rand and Best strategies. SADE probably loses accuracy in the iterations that only execute the Rand and Best strategies.

The DEGL algorithm with an exponential increment of the weight factor has reported the best kappa rate in test, and the statistical test shows that it also
overcomes more methods, supported with a level of significance of α = 0.1, in terms of accuracy rate. The other DEGL variants obtain similar results, except for the random approach, which is probably affected by a lack of convergence.

SFLSDE with the RandToBest strategy achieves the best average results in accuracy. This technique combines the best mutation strategy with two local searches, which allow it to find the most suitable parameters during the evolution process.

Looking at accuracy, kappa rate and the statistical test, three methods deserve particular mention: DEGL exponential, SFLSDE Rand and SFLSDE RandToBest. We will use these methods in the comparison with other PG methods.

Table 3
Parameter specification for all the methods employed in the experimentation.

Algorithm  Parameters
SSMA       Population = 30, Eval = 10 000, Cross = 0.5, Mutation = 0.001
LVQ3       Iterations = 500, alpha = 0.1, WindowWidth = 0.2, epsilon = 0.1, Reduction Rate = 0.95/0.99
HYB        Search_Iter = 200, Optimal_Iter = 1000, alpha = 0.1, I_epsilon = 0, F_epsilon = 0.5, Initial_Window = 0, Final_Window = 0.5, delta = 0.1, delta_Window = 0.1, Initial Selection = SVM
RSP3       Subset Choice = Diameter
ENPC       Iterations = 250
PSO        SwarmSize = 40, Iterations = 500, C1 = 1, C2 = 3, Vmax = 0.25, Wstart = 1.5, Wend = 0.5, Reduction Rate = 0.95/0.99
PSCSA      HyperMutation Rate = 2, Clonal Rate = 10, Mutation Rate = 0.01, Stim_Threshold = 0.89, a = 0.4
DE         PopulationSize = 40, Iterations = 500, F = 0.5, CR = 0.9, Reduction Rate = 0.95/0.99
SADE       PopulationSize = 40, Iterations = 500, Learning Period = 50 and 100, Reduction Rate = 0.95/0.99
JADE       PopulationSize = 40, Iterations = 500, p = 0.05, c = 0.1, Reduction Rate = 0.95/0.99
DEGL       PopulationSize = 40, Iterations = 500, F = 0.8, CR = 0.9, WeightFactor = 0.0, WeightScheme = Exponential, Adaptive, Random and Linear, Reduction Rate = 0.95/0.99
SFLSDE     PopulationSize = 40, Iterations = 500, iterSFGSS = 8, iterSFHC = 20, Fl = 0.1, Fu = 0.9, Reduction Rate = 0.95/0.99

Note: The parameter reduction rate of fixed-reduction algorithms has been established at 0.95 for small data sets and 0.99 for large ones.

Table 4
Results of different DE models over small data sets.

Algorithm                 Acc. Train       Acc. Test        Kappa Train      Kappa Test       Acc Tst +/+=  Kappa Tst +/+=
DE/Rand/1/Bin             0.7679 ± 0.1536  0.7268 ± 0.1670  0.5816 ± 0.2216  0.4947 ± 0.2579  0 / 13        0 / 10
DE/Best/1/Bin             0.8005 ± 0.1275  0.7393 ± 0.1464  0.6355 ± 0.1986  0.5155 ± 0.2468  0 / 10        0 / 10
DE/RandToBest/1/Bin       0.8279 ± 0.1192  0.7524 ± 0.1460  0.6859 ± 0.1887  0.5384 ± 0.2453  0 / 14        1 / 13
DE/Best/2/Bin             0.8285 ± 0.1174  0.7434 ± 0.1445  0.6854 ± 0.1875  0.5212 ± 0.2463  1 / 13        2 / 13
DE/Rand/2/Bin             0.7962 ± 0.1456  0.7348 ± 0.1563  0.6295 ± 0.2187  0.5061 ± 0.2484  0 / 12        0 / 7
DE/RandToBest/2/Bin       0.8231 ± 0.1250  0.7567 ± 0.1426  0.6735 ± 0.2031  0.5406 ± 0.2484  0 / 14        1 / 14
SADE LP 50                0.8195 ± 0.1563  0.7513 ± 0.1452  0.6708 ± 0.1563  0.5335 ± 0.2482  1 / 12        2 / 13
SADE LP 100               0.8243 ± 0.1232  0.7502 ± 0.1435  0.6776 ± 0.1979  0.5324 ± 0.2432  0 / 12        0 / 12
JADE                      0.8209 ± 0.1219  0.7541 ± 0.1417  0.6708 ± 0.1974  0.5417 ± 0.2415  0 / 14        3 / 14
DEGL EXP                  0.8144 ± 0.1563  0.7597 ± 0.1394  0.6728 ± 0.1563  0.5529 ± 0.2328  6 / 14        4 / 14
DEGL ADAP                 0.8211 ± 0.1563  0.7525 ± 0.1401  0.6606 ± 0.1563  0.5351 ± 0.2436  1 / 14        1 / 14
DEGL RANDOM               0.8146 ± 0.1563  0.7488 ± 0.1392  0.6615 ± 0.1936  0.5329 ± 0.2335  0 / 13        2 / 12
DEGL LINEAR               0.8187 ± 0.1563  0.7550 ± 0.1404  0.6689 ± 0.1930  0.5437 ± 0.2371  1 / 13        2 / 13
SFLSDE/Rand/1/Bin         0.8347 ± 0.1563  0.7582 ± 0.1563  0.6960 ± 0.1948  0.5461 ± 0.1563  2 / 14        4 / 14
SFLSDE/RandToBest/1/Bin   0.8411 ± 0.1563  0.7619 ± 0.1563  0.7079 ± 0.1563  0.5516 ± 0.1563  2 / 14        2 / 13
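To make the mutation strategies compared in Table 4 concrete, the following minimal sketch builds one trial vector with the RandToBest/1/Bin scheme; it is a generic DE illustration under our own naming (and assumes a minimization fitness), not the exact PG implementation evaluated here:

    import numpy as np

    def rand_to_best_1_bin(pop, fitness, i, F=0.5, CR=0.9, rng=None):
        """Trial vector for individual i under DE/RandToBest/1/Bin."""
        if rng is None:
            rng = np.random.default_rng()
        n, d = pop.shape
        best = pop[np.argmin(fitness)]               # exploitation: best individual
        r1, r2 = rng.choice([j for j in range(n) if j != i], size=2, replace=False)
        mutant = pop[i] + F * (best - pop[i]) + F * (pop[r1] - pop[r2])  # exploration: random pair
        cross = rng.random(d) < CR                   # binomial crossover mask
        cross[rng.integers(d)] = True                # force at least one mutated gene
        return np.where(cross, mutant, pop[i])

The balance discussed above is visible in the mutant expression: the (best - pop[i]) term pulls towards the best individual, while the (pop[r1] - pop[r2]) term injects randomness.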
6.1.2. Comparison with other PG techniques over small data sets

In this section, we compare the three best DE models identified above (SFLSDE Rand, SFLSDE RandToBest and DEGL exponential) with the other seven PG methods. Table 5 shows the average results collected. In this case, we add the reduction rate, which is an important measure for comparing the methods. Table 6 presents the rankings obtained by the Friedman Aligned (FA) procedure with the accuracy measure. In this table, algorithms are ordered from the best to the worst ranking. Furthermore, the third column shows the adjusted p-value obtained with Holm's test (Holm APV) [69]. Note that SFLSDE RandToBest is established as the control algorithm because it has obtained the best FA ranking. Holm's test uses the same level of significance as the Wilcoxon test, α = 0.1. Algorithms highlighted in bold are those which are outperformed at this level of significance.

Table 5
Comparison between the three best DE models and other PG approaches over small data sets.

Algorithm                 Acc. Train       Acc. Test        Kappa Train      Kappa Test       Reduction
1NN                       0.7369 ± 0.1654  0.7348 ± 0.1664  0.4985 ± 0.2910  0.4918 ± 0.2950  0.0000 ± 0.0000
MixtGauss                 0.7138 ± 0.1545  0.6932 ± 0.1668  0.4888 ± 0.2473  0.4546 ± 0.2680  0.9552 ± 0.0084
LVQ3                      0.6931 ± 0.1560  0.6763 ± 0.1662  0.4421 ± 0.2458  0.4114 ± 0.1563  0.9488 ± 0.0083
HYB                       0.8309 ± 0.0154  0.7153 ± 0.1651  0.6988 ± 0.2573  0.4790 ± 0.1563  0.4278 ± 0.1563
RSP3                      0.7924 ± 0.1373  0.7325 ± 0.1591  0.6112 ± 0.2420  0.5004 ± 0.2861  0.7329 ± 0.1185
ENPC                      0.8247 ± 0.1477  0.7167 ± 0.1597  0.6800 ± 0.2532  0.4818 ± 0.2936  0.7220 ± 0.1447
PSO                       0.8238 ± 0.1274  0.7501 ± 0.1409  0.6791 ± 0.1950  0.5332 ± 0.2402  0.9491 ± 0.0072
PSCSA                     0.6787 ± 0.1835  0.6682 ± 0.1874  0.4461 ± 0.2466  0.4231 ± 0.2540  0.9858 ± 0.1563
DEGL EXP                  0.8144 ± 0.1563  0.7597 ± 0.1394  0.6728 ± 0.1563  0.5529 ± 0.2328  0.9483 ± 0.1563
SFLSDE/Rand/1/Bin         0.8347 ± 0.1563  0.7582 ± 0.1563  0.6960 ± 0.1948  0.5461 ± 0.1563  0.9481 ± 0.1563
SFLSDE/RandToBest/1/Bin   0.8411 ± 0.1563  0.7619 ± 0.1563  0.7079 ± 0.1563  0.5516 ± 0.1563  0.9481 ± 0.1563

Table 6
Average rankings of the algorithms over small data sets (Friedman Aligned-Ranks + Holm's test).

Algorithm                 FA ranking   Holm APV
SFLSDE/RandToBest/1/Bin   131.4625     -
SFLSDE/Rand/1/Bin         138.3        1.0
DEGL EXP                  139.4        1.0
PSO                       158.275      1.0
RSP3                      225.3999     0.0044
1NN                       226.0        0.0044
HYB                       258.3625     4.85 × 10^-5
ENPC                      268.15       1.07 × 10^-5
Mixt_Gauss                269.6875     9.33 × 10^-6
PSCSA                     286.1875     4.75 × 10^-7
LVQ3                      324.275      1.19 × 10^-10

Observing Tables 5 and 6, we can make some interesting comments:

DE methods significantly outperform the other PG techniques, except for PSO, in accuracy and kappa rate. PSO is clearly the most competitive PG algorithm with respect to DE. PSO has the same type of solution codification as DE and a similar evolutionary scheme, but the advanced proposals of the DE algorithm usually obtain better average results.

We could also stress RSP3, HYB and ENPC as competitive algorithms with respect to DE. However, as the table shows, they obtain good performance over the training data but do not report great test results, and therefore they suffer from higher overfitting than the DE algorithms.

In terms of reduction capabilities, DE has been fixed at 0.95. Only MixtGauss and PSCSA obtain better reduction rates, but they offer lower accuracy/kappa rates. DE outperforms the rest of the methods with similar or lower reduction rates.

Now, we focus our attention on the first statement. Holm's test has reported no significant differences between DE and PSO. PSO probably benefits from the multiple comparison test, because it significantly outperforms the rest of the PG techniques. For this reason, we check the comparison between PSO and DE with the Wilcoxon test. Table 7 shows the p-values obtained. As we can see, the advanced DE proposals always outperform the PSO algorithm with a level of significance of α = 0.1.

Table 7
Results of the Wilcoxon test compared with PSO over small data sets.

Comparison                    p-Value
DEGL EXP vs. PSO              0.0106
SFLSDE Rand vs. PSO           0.0986
SFLSDE RandToBest vs. PSO     0.0408
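A pairwise comparison of this kind can be reproduced with any statistical package; the sketch below applies the Wilcoxon signed-ranks test to two vectors of per-data-set accuracies (the values are illustrative placeholders, not results from Table 5):

    from scipy.stats import wilcoxon

    acc_de  = [0.76, 0.81, 0.69, 0.74, 0.80, 0.72]   # hypothetical test accuracies
    acc_pso = [0.74, 0.80, 0.70, 0.71, 0.78, 0.70]   # of two methods, per data set
    stat, p_value = wilcoxon(acc_de, acc_pso)
    print(p_value < 0.1)   # reject the null hypothesis at alpha = 0.1?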
6.2. Analysis and results of DE schemes over large size data sets

This section presents the study and analysis over large size data sets. The goal of this study is to analyze the effect of scaling up the data on the DE methods. Again, we divide this section into two stages. First, in Section 6.2.1 we look for the best DE method over large data sets. Next, Section 6.2.2 compares the results with other PG methods.

6.2.1. Results of DE schemes over large data sets

In order to test the performance of the DE methods, we have established a high reduction rate of 0.99. Table 8 presents the comparative study. Again, we use the Wilcoxon test to differentiate between the proposals.

Table 8
Results of different DE models over large data sets.

Algorithm                 Acc. Train       Acc. Test        Kappa Train      Kappa Test       Acc Tst +/+=  Kappa Tst +/+=
DE/Rand/1/Bin             0.7831 ± 0.0055  0.7798 ± 0.2075  0.5709 ± 0.2713  0.5639 ± 0.2761  0 / 12        0 / 12
DE/Best/1/Bin             0.8025 ± 0.0038  0.7881 ± 0.2069  0.5883 ± 0.2849  0.5605 ± 0.2894  0 / 8         0 / 8
DE/RandToBest/1/Bin       0.8124 ± 0.0032  0.7968 ± 0.2087  0.6115 ± 0.2845  0.5815 ± 0.2888  2 / 10        2 / 10
DE/Best/2/Bin             0.8183 ± 0.0045  0.7988 ± 0.2086  0.6224 ± 0.2859  0.5843 ± 0.2897  1 / 11        1 / 11
DE/Rand/2/Bin             0.7888 ± 0.0068  0.7838 ± 0.2088  0.5803 ± 0.2691  0.5686 ± 0.2753  0 / 10        0 / 10
DE/RandToBest/2/Bin       0.8243 ± 0.0045  0.8088 ± 0.2113  0.6377 ± 0.2920  0.6073 ± 0.2928  6 / 14        6 / 14
SADE LP 50                0.8107 ± 0.0036  0.7966 ± 0.2063  0.6178 ± 0.2722  0.5918 ± 0.2748  2 / 13        2 / 13
SADE LP 100               0.8070 ± 0.0032  0.7941 ± 0.2070  0.6030 ± 0.2793  0.5789 ± 0.2829  0 / 12        0 / 12
JADE                      0.8136 ± 0.0110  0.8020 ± 0.2058  0.6204 ± 0.2803  0.5969 ± 0.2830  6 / 13        6 / 13
DEGL EXP                  0.8076 ± 0.0032  0.7951 ± 0.2058  0.6044 ± 0.2783  0.5792 ± 0.2829  0 / 11        0 / 11
DEGL ADAP                 0.8069 ± 0.0041  0.7946 ± 0.2074  0.6027 ± 0.2811  0.5761 ± 0.2883  1 / 10        1 / 10
DEGL RANDOM               0.8069 ± 0.0037  0.7938 ± 0.2079  0.6025 ± 0.2799  0.5772 ± 0.2844  0 / 8         0 / 8
DEGL LINEAR               0.8088 ± 0.0031  0.7961 ± 0.2058  0.6052 ± 0.2810  0.5811 ± 0.2831  1 / 11        1 / 11
SFLSDE/Rand/1/Bin         0.8327 ± 0.0046  0.8181 ± 0.2074  0.6541 ± 0.2840  0.6243 ± 0.2925  9 / 14        9 / 14
SFLSDE/RandToBest/1/Bin   0.8341 ± 0.0030  0.8154 ± 0.2072  0.6556 ± 0.2879  0.6184 ± 0.2910  11 / 14       11 / 14

We can make several observations from these results:

Some models present important differences when tackling large data sets. We can stress JADE as a good algorithm when the size of the data set is higher.
With large data sets, this algorithm overcomes most of the advanced proposals, except for SFLSDE, which remains the best advanced DE model.

The number of difference vectors perturbed by the mutation operator does seem to be an important factor influencing the final result obtained when dealing with large data sets.

SFLSDE/Rand/1/Bin has reported the best average results in accuracy and kappa rate. However, the statistical test shows that SFLSDE RandToBest behaves better than the Rand strategy. As we can observe in Table 8, SFLSDE/Rand/1/Bin is able to overcome nine methods and SFLSDE RandToBest eleven. The rest of the advanced proposals are not able to overcome SFLSDE.

When dealing with large data sets, the statistical test detects greater differences between the methods. Concretely, we can observe that SFLSDE RandToBest outperforms a total of 11 methods out of 14 with a level of significance of 0.1.

We select both SFLSDE algorithms for the next comparison, as they have, in general, reported the best results.

6.2.2. Comparison with other PG techniques over large data sets

In this section, we compare the two best DE models for large data sets (the SFLSDE models) with the same algorithms as in Section 6.1.2. Table 9 shows the average results obtained, and Table 10 displays the FA ranking and the adjusted p-value obtained with Holm's test.

Table 9
Comparison between the two best DE models and other PG approaches over large data sets.

Algorithm                 Acc. Train       Acc. Test        Kappa Train      Kappa Test       Reduction
1NN                       0.8197 ± 0.0023  0.8072 ± 0.0100  0.6195 ± 0.0229  0.5948 ± 0.0181  0.0000 ± 0.0000
MixtGauss                 0.7534 ± 0.0141  0.7505 ± 0.2315  0.4913 ± 0.3251  0.4860 ± 0.3255  0.9514 ± 0.0001
LVQ3                      0.6840 ± 0.0057  0.6767 ± 0.2680  0.4409 ± 0.2926  0.4264 ± 0.2962  0.9899 ± 0.0011
HYB                       0.7888 ± 0.0234  0.7618 ± 0.2168  0.5992 ± 0.2790  0.5567 ± 0.3153  0.5727 ± 0.2903
RSP3                      0.7922 ± 0.2545  0.7556 ± 0.2708  0.6299 ± 0.3266  0.5597 ± 0.3397  0.8100 ± 0.1369
ENPC                      0.8809 ± 0.1610  0.7986 ± 0.2188  0.7613 ± 0.2497  0.6170 ± 0.2949  0.8205 ± 0.1919
PSO                       0.8022 ± 0.0055  0.8049 ± 0.2136  0.6177 ± 0.2887  0.5948 ± 0.2880  0.9899 ± 0.0011
PSCSA                     0.6730 ± 0.2190  0.6707 ± 0.2205  0.3900 ± 0.2376  0.3842 ± 0.2824  0.9988 ± 0.0017
SFLSDE/Rand/1/Bin         0.8388 ± 0.0039  0.8249 ± 0.2205  0.6570 ± 0.3036  0.6281 ± 0.3131  0.9901 ± 0.0002
SFLSDE/RandToBest/1/Bin   0.8414 ± 0.0028  0.8236 ± 0.2199  0.6598 ± 0.3079  0.6240 ± 0.3131  0.9901 ± 0.0002

Table 10
Average rankings of the algorithms over large data sets (Friedman Aligned-Ranks + Holm's test).

Algorithm                 FA ranking   Holm APV
SFLSDE/RandToBest/1/Bin   47.0313      -
SFLSDE/Rand/1/Bin         48.0938      0.9482
1NN                       65.1875      0.7359
PSO                       66.0625      0.7359
ENPC                      71.875       0.5174
RSP3                      86.9063      0.0746
HYB                       92.6875      0.0319
Mixt_Gauss                94.375       0.0269
LVQ3                      113.5313     3.9323 × 10^-4
PSCSA                     119.25       9.3584 × 10^-5

Observing Tables 9 and 10, we can summarize that:

Most of the PG methods present clear differences when dealing with large data sets. For instance, ENPC improves on the ranking it obtained over small data sets. Together with PSO, it is the most competitive PG technique with respect to the DE models, and Holm's test supports this statement. However, SFLSDE usually obtains better average results.

DE methods significantly overcome the rest of the PG techniques. Specifically, in accuracy and kappa rate, SFLSDE Rand exceeds by 0.02 the average results obtained by the best PG technique (PSO).

In order to improve the efficiency of the 1NN rule when tackling large data sets, the reduction rate becomes more important. The FA ranking indicates that only the SFLSDE models are able to outperform 1NN with a high reduction rate (0.99), to a greater extent than the rest of the PG methods.

Again, we use the Wilcoxon test to check whether the DE models are able to outperform the most competitive PG algorithms. Specifically, we carry out a study with ENPC and PSO. Table 11 presents the results. The Wilcoxon test shows that ENPC is outperformed with α = 0.1.
However, this hypothesis is rejected with PSO, although the p-value is smaller than the adjusted p-value of Holm's test.

Table 11
Results of the Wilcoxon test compared with PSO and ENPC over large data sets.

Comparison                     p-Value
SFLSDE Rand vs. PSO            0.1981
SFLSDE RandToBest vs. PSO      0.1928
SFLSDE Rand vs. ENPC           0.0183
SFLSDE RandToBest vs. ENPC     0.0214

6.3. Analysis of convergence

One of the most important issues in the development of any EA is the analysis of the convergence of its population. If the EA does not evolve over time, it will not be able to obtain suitable solutions. Fig. 4 shows a graphical representation of the convergence capabilities of the DE models; specifically, the best basic DE (DE/RandToBest/2/Bin), SADE (LP = 50), JADE, DEGL exponential and SFLSDE with RandToBest. To perform this analysis we have selected the Bupa small data set. The graphic shows, for each algorithm, a line representing the fitness value of the best individual of each population; the X-axis represents the number of iterations carried out, and the Y-axis represents the fitness value currently achieved.

Fig. 4. Map of convergence: Bupa data set. (Convergence analysis over 500 iterations, plotting the best fitness value for DE, SADE, JADE, DEGL and SFLSDE.)

As we can see in the graphic, SADE and DEGL quickly find promising solutions, and then they spend more than 300 iterations without an improvement. In contrast, SFLSDE and DE are slower to converge, which usually allows them to find a better solution at the end of the process. They strike a good balance between exploration and exploitation during the evolution.

6.4. Analysis and results of hybrid models over small size data sets

This section shows the average results obtained by our hybrid models when they are applied to small data sets. Table 12 collects the average results and the Wilcoxon test. The abilities of the hybrid models are shown, and their performance is compared with that of the basic components that take part in them. The results achieved in this part of the study allow us to conclude the following:
Hybrid models always outperform the basic algorithms upon which they are based. The good synergy between PG and PS methods is clearly demonstrated by the obtained results.

We selected three PS methods with different reduction rates (ICF 0.7107, DROP3 0.8202 and SSMA 0.9553). A priori, a lower reduction rate should allow better accuracy results to be obtained. In terms of accuracy/kappa rates, we can observe how the hybrid models DROP3+SFLSDE and ICF+SFLSDE probably produce overfitting on the training data, because they do not present good generalization capabilities, obtaining lower accuracy/kappa rates in the test results.

SFLSDE is the best performing method in comparison with the basic PS and PG algorithms. Furthermore, when it is applied as the optimizer in the hybrid models, it achieves the best accuracy/kappa rates. As stated before, it can sometimes produce overfitting. However, as we can see from the test results of SSMA+SFLSDE, when a high-reduction PS method is applied, SFLSDE is very effective. We can extrapolate this statement to LVQ3 and PSO, which do not produce overfitting over the resulting set selected by SSMA.

Although LVQ3 does not offer competitive results on its own, when it is used to optimize a PS solution it is able to improve the position of the prototypes appropriately. For instance, as we can see with SSMA+LVQ3 in comparison with SSMA, LVQ3 works properly when it starts from a good solution.

Although PSO and DE outperform LVQ3 as optimizers, an advantage of LVQ3 is that it is faster. Table 13 shows the average runtime of the optimizers over small data sets (these results have been obtained with an Intel Core i7 920 CPU at 2.67 GHz). As we can see, the learning time of LVQ3 is clearly lower than that of PSO and DE, which codify a complete solution per individual.

With the same reduction rate, PSO outperforms LVQ3, and it is more effective when used in the hybrid models. Nevertheless, in view of the obtained results and in comparison with the DE algorithms, PSO is probably affected by a lack of convergence because of the absence of an adaptive process to improve its own parameters during the evolution.

Table 13
Average runtime of the optimizer algorithms over small data sets.

Algorithm   Runtime
SFLSDE      40.5483
PSO         42.3168
LVQ3        0.2316

Table 12
Hybrid models with small data sets.

Algorithm                        Acc. Train       Acc. Test        Kappa Train      Kappa Test       Reduction        Acc Tst +/+=  Kappa Tst +/+=
DROP3                            0.7527 ± 0.1240  0.7011 ± 0.1497  0.5498 ± 0.2139  0.4544 ± 0.2651  0.8202 ± 0.0148  0 / 5         1 / 6
ICF                              0.7118 ± 0.1343  0.6784 ± 0.1505  0.4797 ± 0.2364  0.4175 ± 0.2644  0.7107 ± 0.1369  0 / 4         0 / 4
SSMA                             0.8207 ± 0.1335  0.7581 ± 0.1518  0.6685 ± 0.2089  0.5455 ± 0.2685  0.9554 ± 0.0343  7 / 15        6 / 15
LVQ3                             0.6931 ± 0.1560  0.6763 ± 0.1662  0.4421 ± 0.2458  0.4114 ± 0.1563  0.9488 ± 0.0083  0 / 4         0 / 2
PSO                              0.8238 ± 0.1274  0.7501 ± 0.1409  0.6791 ± 0.1950  0.5332 ± 0.2402  0.9491 ± 0.0072  6 / 15        5 / 14
SFLSDE/RandToBest/1/Bin          0.8411 ± 0.1563  0.7619 ± 0.1563  0.7079 ± 0.1563  0.5516 ± 0.1563  0.9481 ± 0.1563  9 / 14        7 / 13
DROP3 + LVQ3                     0.7666 ± 0.1197  0.7027 ± 0.1471  0.5705 ± 0.2200  0.4553 ± 0.2703  0.8202 ± 0.0809  0 / 5         1 / 5
DROP3 + PSO                      0.8645 ± 0.0970  0.7501 ± 0.1349  0.7474 ± 0.1688  0.5286 ± 0.2588  0.8202 ± 0.0809  5 / 9         6 / 13
DROP3 + SFLSDE/RandToBest/1/Bin  0.8711 ± 0.0958  0.7620 ± 0.1348  0.7605 ± 0.1687  0.5488 ± 0.2572  0.8202 ± 0.0809  6 / 13        6 / 14
ICF + LVQ3                       0.7384 ± 0.1189  0.6865 ± 0.1413  0.5229 ± 0.2191  0.4282 ± 0.2590  0.7107 ± 0.1369  0 / 5         0 / 5
ICF + PSO                        0.8677 ± 0.0980  0.7523 ± 0.1398  0.7526 ± 0.1678  0.5318 ± 0.2593  0.7107 ± 0.1369  5 / 9         3 / 8
ICF + SFLSDE/RandToBest/1/Bin    0.8738 ± 0.0991  0.7618 ± 0.1401  0.7642 ± 0.1730  0.5462 ± 0.2695  0.7107 ± 0.1369  7 / 14        7 / 15
SSMA + LVQ3                      0.8347 ± 0.1100  0.7704 ± 0.1267  0.6842 ± 0.2087  0.5619 ± 0.2657  0.9554 ± 0.0343  9 / 14        6 / 15
SSMA + PSO                       0.8617 ± 0.1007  0.7770 ± 0.1267  0.7376 ± 0.1852  0.5727 ± 0.2515  0.9554 ± 0.0343  7 / 15        9 / 15
SSMA + SFLSDE/RandToBest/1/Bin   0.8651 ± 0.1010  0.7845 ± 0.1256  0.7407 ± 0.1891  0.5836 ± 0.2524  0.9554 ± 0.0343  10 / 15       10 / 15
1NN                              0.7369 ± 0.1654  0.7348 ± 0.1664  0.4985 ± 0.2910  0.4918 ± 0.2950  0.0000 ± 0.0000  2 / 11        4 / 10

6.5. Analysis and results of hybrid models over large size data sets

In this section we check whether the performance of the hybrid models is maintained when dealing with large data sets. Table 14 shows this experiment. We briefly summarize some interesting facts:

When dealing with large data sets, the reduction rate must be taken into consideration as one of the main parameters to improve the efficiency of the 1NN rule. The DE model has been fixed with a high reduction rate (0.99), and the advanced proposal SFLSDE outperforms the rest of the basic PG and PS techniques, which are far from achieving this reduction rate.

SSMA was proposed to cover a drawback of the conventional evolutionary PS methods: their lack of convergence when facing large problems. We can observe that it is the best PS method, and its performance is improved when we hybridize it with an optimization procedure. SSMA provides a promising solution which enables any optimization process, including LVQ3 (which does not offer great results in combination with ICF and DROP3), to converge quickly.

Table 14
Hybrid models with large data sets.

Algorithm                        Acc. Train       Acc. Test        Kappa Train      Kappa Test       Reduction        Acc Tst +/+=  Kappa Tst +/+=
DROP3                            0.7744 ± 0.0232  0.7472 ± 0.2256  0.5592 ± 0.3079  0.5119 ± 0.3248  0.9061 ± 0.0569  1 / 5         1 / 5
ICF                              0.6781 ± 0.1863  0.6621 ± 0.2016  0.4202 ± 0.3079  0.3940 ± 0.3143  0.8037 ± 0.1654  0 / 2         0 / 2
SSMA                             0.8493 ± 0.0021  0.8196 ± 0.2220  0.6725 ± 0.3134  0.6221 ± 0.3177  0.9844 ± 0.0100  9 / 14        9 / 14
LVQ3                             0.6840 ± 0.0057  0.6767 ± 0.2680  0.4409 ± 0.2926  0.4264 ± 0.2962  0.9899 ± 0.0011  0 / 4         0 / 4
PSO                              0.8022 ± 0.0055  0.8049 ± 0.2136  0.6177 ± 0.2887  0.5948 ± 0.2880  0.9899 ± 0.0011  3 / 13        3 / 13
SFLSDE/RandToBest/1/Bin          0.8414 ± 0.0028  0.8236 ± 0.2199  0.6598 ± 0.3079  0.6240 ± 0.3131  0.9901 ± 0.0002  7 / 12        7 / 12
DROP3 + LVQ3                     0.7730 ± 0.2166  0.7412 ± 0.2382  0.5661 ± 0.3140  0.5106 ± 0.3308  0.9061 ± 0.0569  3 / 6         3 / 6
DROP3 + PSO                      0.8386 ± 0.1913  0.7974 ± 0.2236  0.6496 ± 0.2911  0.5768 ± 0.3130  0.9061 ± 0.0569  5 / 9         5 / 9
DROP3 + SFLSDE/RandToBest/1/Bin  0.8538 ± 0.1868  0.8152 ± 0.2260  0.6843 ± 0.2875  0.6173 ± 0.3135  0.9061 ± 0.0569  6 / 14        6 / 14
ICF + LVQ3                       0.6851 ± 0.1796  0.6641 ± 0.1997  0.4297 ± 0.3042  0.3960 ± 0.3133  0.8302 ± 0.1386  0 / 3         0 / 3
ICF + PSO                        0.8397 ± 0.1950  0.8046 ± 0.2259  0.6444 ± 0.3050  0.5846 ± 0.3226  0.8302 ± 0.1386  5 / 12        5 / 12
ICF + SFLSDE/RandToBest/1/Bin    0.8367 ± 0.1924  0.8145 ± 0.2260  0.6434 ± 0.2933  0.6082 ± 0.3151  0.8302 ± 0.1386  6 / 13        6 / 13
SSMA + LVQ3                      0.8534 ± 0.1998  0.8244 ± 0.2197  0.6793 ± 0.3120  0.6312 ± 0.3163  0.9847 ± 0.0099  6 / 15        6 / 15
SSMA + PSO                       0.8576 ± 0.2007  0.8241 ± 0.2225  0.6970 ± 0.3030  0.6384 ± 0.3092  0.9847 ± 0.0099  8 / 15        8 / 15
SSMA + SFLSDE/RandToBest/1/Bin   0.8635 ± 0.1979  0.8291 ± 0.2213  0.7056 ± 0.3005  0.6442 ± 0.3105  0.9847 ± 0.0099  11 / 15       11 / 15
1NN                              0.8197 ± 0.0023  0.8072 ± 0.0100  0.6195 ± 0.0229  0.5948 ± 0.0181  0.0000 ± 0.0000  3 / 15        3 / 15

7. Conclusions

In this work, we have presented differential evolution and its recent advanced proposals as a data reduction technique. Specifically, it has been used to optimize the positioning of the prototypes for the nearest neighbor algorithm, acting as a prototype generation method.

The first aim of this paper was to determine which of the proposed DE algorithms works properly in tackling the PG problem. Specifically, we have studied the different mutation strategies recognized in the literature, and the recent approaches to adapting the parameters of this evolutionary algorithm in order to find a good balance between exploration and exploitation.
The second contribution of this paper shows the good relation between PS and PG in obtaining hybrid algorithms that allow us to find very promising solutions. Hybrid models are able to tackle several drawbacks of the isolated PS and PG methods. Concretely, we have analyzed the use of positioning adjustment algorithms as an optimization procedure after a previous PS stage. Our DE model is an appropriate optimizer, which has reported the best results in terms of accuracy and reduction rate.

The wide experimental study performed has allowed us to justify the behavior of DE algorithms when dealing with small and large data sets. The results have been contrasted with several nonparametric statistical procedures, which have reinforced the conclusions drawn.

Acknowledgement

Supported by the Spanish Ministry of Science and Technology under Project TIN2008-06681-C06-01.

Appendix A. Friedman Aligned Ranks and adjusted p-values

The Friedman test is based on n sets of ranks, one set for each data set in our case, and the performances of the algorithms analyzed are ranked separately for each data set. Such a ranking scheme allows for intra-set comparisons only, since inter-set comparisons are not meaningful. When the number of algorithms for comparison is small, this may pose a disadvantage. In such cases, comparability among data sets is desirable, and we can employ the method of aligned ranks [70].

In this technique, a value of location is computed as the average performance achieved by all algorithms in each data set. Then the difference between the performance obtained by an algorithm and this value of location is calculated. This step is repeated for all algorithms and data sets. The resulting differences, called aligned observations, which keep their identities with respect to the data set and the combination of algorithms to which they belong, are then ranked from 1 to kn relative to each other. The ranking scheme is then the same as that employed by a multiple comparison procedure which uses independent samples, such as the Kruskal-Wallis test [72]. The ranks assigned to the aligned observations are called aligned ranks. The Friedman Aligned Ranks test statistic can be written as

T = \frac{(k-1)\left[\sum_{j=1}^{k} \hat{R}_{\cdot j}^{2} - \frac{kn^{2}}{4}(kn+1)^{2}\right]}{\frac{kn(kn+1)(2kn+1)}{6} - \frac{1}{k}\sum_{i=1}^{n} \hat{R}_{i\cdot}^{2}}    (17)

where \hat{R}_{i\cdot} is the rank total of the i-th data set and \hat{R}_{\cdot j} is the rank total of the j-th algorithm. The test statistic T is compared for significance with a chi-square distribution with k-1 degrees of freedom. Critical values can be found in Table A3 of [66]. Furthermore, the p-value can be computed through normal approximations [73].
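A direct transcription of Eq. (17) is sketched below for an n × k matrix of results (data sets × algorithms); it is our own illustration of the computation, not code from the paper (here higher results receive higher aligned ranks):

    import numpy as np
    from scipy.stats import rankdata, chi2

    def friedman_aligned_ranks(results):
        """Friedman Aligned Ranks statistic T (Eq. (17)) and its chi-square p-value."""
        n, k = results.shape
        aligned = results - results.mean(axis=1, keepdims=True)  # aligned observations
        ranks = rankdata(aligned.ravel()).reshape(n, k)          # ranks 1..kn
        Rj = ranks.sum(axis=0)                                   # rank total per algorithm
        Ri = ranks.sum(axis=1)                                   # rank total per data set
        num = (k - 1) * ((Rj ** 2).sum() - (k * n ** 2 / 4.0) * (k * n + 1) ** 2)
        den = k * n * (k * n + 1) * (2 * k * n + 1) / 6.0 - (Ri ** 2).sum() / k
        T = num / den
        return T, chi2.sf(T, k - 1)                              # k-1 degrees of freedom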
If the null hypothesis is rejected, we can proceed with a post-hoc test. In this study, we use the Holm post-hoc procedure. We focus on the comparison between a control method, which is usually the proposed method, and a set of algorithms used in the empirical study. This set of comparisons is associated with a set, or family, of hypotheses, all of which are related to the control method. Any of the post-hoc tests is suitable for application to non-parametric tests working over a family of hypotheses. The test statistic for comparing the i-th and j-th algorithms depends on the main non-parametric procedure used; in this case, it depends on the Friedman Aligned Ranks test. Since the set of related rankings is converted to absolute rankings, the expression for computing the test statistic in Friedman Aligned Ranks is the same as that used by the Kruskal-Wallis test [72,74]:

z = \left(\hat{R}_{i} - \hat{R}_{j}\right) \bigg/ \sqrt{\frac{k(n+1)}{6}}    (18)

where \hat{R}_{i} and \hat{R}_{j} are the average rankings, computed by the Friedman Aligned Ranks method, of the algorithms compared.

In statistical hypothesis testing, the p-value is the probability of obtaining a result at least as extreme as the one actually observed, assuming that the null hypothesis is true. It is a useful and interesting datum for many consumers of statistical analysis. A p-value provides information about whether a statistical hypothesis test is significant or not, and it also indicates something about 'how significant' the result is: the smaller the p-value, the stronger the evidence against the null hypothesis. Most importantly, it does this without committing to a particular level of significance.

When a p-value is considered in a multiple comparison, it reflects the probability error of a certain comparison, but it does not take into account the remaining comparisons belonging to the family. If one is comparing k algorithms and in each comparison the level of significance is α, then in a single comparison the probability of not making a Type I error is (1-α), and the probability of not making a Type I error in the k-1 comparisons is (1-α)^(k-1). Hence the probability of making one or more Type I errors is 1-(1-α)^(k-1). For instance, if α = 0.05 and k = 10 this is 0.37, which is rather high.

One way to solve this problem is to report adjusted p-values (APVs), which take into account the fact that multiple tests are conducted. An APV can be compared directly with any chosen significance level α. We recommend the use of APVs because they provide more information in a statistical analysis.

The z value is in all cases used to find the corresponding probability (p-value) from the table of the normal distribution N(0,1), which is then compared with an appropriate level of significance α [66, Table A1]. The post-hoc tests differ in the way they adjust the value of α to compensate for multiple comparisons. Next, we define the Holm procedure and explain how to compute the APVs.

The notation used in the computation of the APVs is as follows: indexes i and j each correspond to a concrete comparison or hypothesis in the family of hypotheses, according to an incremental order of their p-values; index i always refers to the hypothesis whose APV is being computed, and index j refers to another hypothesis in the family; p_j is the p-value obtained for the j-th hypothesis; and k is the number of classifiers being compared.

The Holm procedure adjusts the value of α in a step-down manner. Let p_1, p_2, ..., p_{k-1} be the ordered p-values (smallest to largest), so that p_1 ≤ p_2 ≤ ... ≤ p_{k-1}, and let H_1, H_2, ..., H_{k-1} be the corresponding hypotheses. The Holm procedure rejects H_1 to H_{i-1} if i is the smallest integer such that p_i > α/(k-i). Holm's step-down procedure starts with the most significant p-value. If p_1 is below α/(k-1), the corresponding hypothesis is rejected and we are allowed to compare p_2 with α/(k-2). If the second hypothesis is rejected, the test proceeds with the third, and so on. As soon as a certain null hypothesis cannot be rejected, all the remaining hypotheses are retained as well.

Holm APV_i: min{v; 1}, where v = max{(k-j)p_j : 1 ≤ j ≤ i}.
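The APV expression above can likewise be transcribed directly; the helper below is our own sketch (the names are assumptions), taking the p-values of the k-1 comparisons against the control method:

    def holm_apv(pvalues):
        """Holm adjusted p-values: APV_i = min{ max{(k-j)p_j : j <= i}, 1 }."""
        k = len(pvalues) + 1                     # k classifiers -> k-1 hypotheses
        order = sorted(range(len(pvalues)), key=lambda j: pvalues[j])
        apv, running_max = [0.0] * len(pvalues), 0.0
        for rank, j in enumerate(order, start=1):        # rank 1 = smallest p-value
            running_max = max(running_max, (k - rank) * pvalues[j])
            apv[j] = min(running_max, 1.0)
        return apv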
References

[1] T.M. Cover, P.E. Hart, Nearest neighbor pattern classification, IEEE Transactions on Information Theory 13 (1) (1967) 21-27.
[2] A.N. Papadopoulos, Y. Manolopoulos, Nearest Neighbor Search: A Database Perspective, Springer, 2004.
[3] I. Kononenko, M. Kukar, Machine Learning and Data Mining: Introduction to Principles and Algorithms, Horwood Publishing Limited, 2007.
[4] D.W. Aha, D. Kibler, M.K. Albert, Instance-based learning algorithms, Machine Learning 6 (1) (1991) 37-66.
[5] E.K. Garcia, S. Feldman, M.R. Gupta, S. Srivastava, Completely lazy learning, IEEE Transactions on Knowledge and Data Engineering 22 (9) (2010) 1274-1285.
[6] X. Wu, V. Kumar (Eds.), The Top Ten Algorithms in Data Mining, Chapman & Hall/CRC Data Mining and Knowledge Discovery, 2009.
[7] B.M. Steele, Exact bootstrap k-nearest neighbor learners, Machine Learning 74 (3) (2009) 235-255.
[8] P. Chaudhuri, A.K. Ghosh, H. Oja, Classification based on hybridization of parametric and nonparametric classifiers, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (7) (2009) 1153-1164.
[9] Y.-C. Liaw, M.-L. Leou, C.-M. Wu, Fast exact k nearest neighbors search using an orthogonal search tree, Pattern Recognition 43 (6) (2010) 2351-2358.
[10] J. Derrac, S. García, F. Herrera, IFS-CoCo: instance and feature selection based on cooperative coevolution with nearest neighbor rule, Pattern Recognition 43 (6) (2010) 2082-2105.
[11] D.R. Wilson, T.R. Martinez, Improved heterogeneous distance functions, Journal of Artificial Intelligence Research 6 (1997) 1-34.
[12] R. Paredes, E. Vidal, Learning weighted metrics to minimize nearest-neighbor classification error, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (7) (2006) 1100-1110.
[13] M.Z. Jahromi, E. Parvinnia, R. John, A method of learning weighted similarity function to improve the performance of nearest neighbor, Information Sciences 179 (17) (2009) 2964-2973.
[14] K.Q. Weinberger, L.K. Saul, Distance metric learning for large margin nearest neighbor classification, Journal of Machine Learning Research 10 (2009) 207-244.
[15] Y. Chen, E.K. Garcia, M.R. Gupta, A. Rahimi, L. Cazzanti, Similarity-based classification: concepts and algorithms, Journal of Machine Learning Research 10 (2009) 747-776.
[16] D. Pyle, Data Preparation for Data Mining, The Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann, 1999.
[17] H. Liu, H. Motoda, Feature Extraction, Construction and Selection: A Data Mining Perspective, Kluwer Academic Publishers, 2001.
[18] Y. Li, B.-L. Lu, Feature selection based on loss-margin of nearest neighbor classification, Pattern Recognition 42 (9) (2009) 1914-1921.
[19] D.R. Wilson, T.R. Martinez, Reduction techniques for instance-based learning algorithms, Machine Learning 38 (3) (2000) 257-286.
[20] H.A. Fayed, S.R. Hashem, A.F. Atiya, Self-generating prototypes for pattern classification, Pattern Recognition 40 (5) (2007) 1498-1509.
[21] P.E. Hart, The condensed nearest neighbor rule, IEEE Transactions on Information Theory 14 (1968) 515-516.
[22] D.L. Wilson, Asymptotic properties of nearest neighbor rules using edited data, IEEE Transactions on Systems, Man and Cybernetics 2 (3) (1972) 408-421.
[23] H. Brighton, C. Mellish, Advances in instance selection for instance-based learning algorithms, Data Mining and Knowledge Discovery 6 (2) (2002) 153-172.
[24] E.
Marchiori, Hit miss networks with applications to instance selection, Journal of Machine Learning Research 9 (2008) 997-1017.
[25] H.A. Fayed, A.F. Atiya, A novel template reduction approach for the k-nearest neighbor method, IEEE Transactions on Neural Networks 20 (5) (2009) 890-896.
[26] E. Marchiori, Class conditional nearest neighbor for large margin instance selection, IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (2) (2010) 364-370.
[27] W. Lam, C.K. Keung, D. Liu, Discovering useful concept prototypes for classification based on filtering and abstraction, IEEE Transactions on Pattern Analysis and Machine Intelligence 14 (8) (2002) 1075-1090.
[28] C.-L. Chang, Finding prototypes for nearest neighbor classifiers, IEEE Transactions on Computers 23 (11) (1974) 1179-1184.
[29] T. Kohonen, The self-organizing map, Proceedings of the IEEE 78 (9) (1990) 1464-1480.
[30] C.H. Chen, A. Jóźwik, A sample set condensation algorithm for the class sensitive artificial neural network, Pattern Recognition Letters 17 (8) (1996) 819-823.
[31] S.W. Kim, B.J. Oommen, A brief taxonomy and ranking of creative prototype reduction schemes, Pattern Analysis and Applications 6 (2003) 232-244.
[32] M. Lozano, J.M. Sotoca, J.S. Sánchez, F. Pla, E. Pekalska, R.P.W. Duin, Experimental study on prototype optimisation algorithms for prototype-based classification in vector spaces, Pattern Recognition 39 (10) (2006) 1827-1838.
[33] J.C. Bezdek, L.I. Kuncheva, Nearest prototype classifier designs: an experimental study, International Journal of Intelligent Systems 16 (2001) 1445-1473.
[34] A.E. Eiben, J.E. Smith, Introduction to Evolutionary Computing, Springer-Verlag, Berlin, 2003.
[35] A.A. Freitas, Data Mining and Knowledge Discovery with Evolutionary Algorithms, Springer-Verlag, Berlin, 2002.
[36] G.L. Pappa, A.A. Freitas, Automating the Design of Data Mining Algorithms: An Evolutionary Computation Approach, Natural Computing, Springer, 2009.
[37] J.-R. Cano, F. Herrera, M. Lozano, Using evolutionary algorithms as instance selection for data reduction in KDD: an experimental study, IEEE Transactions on Evolutionary Computation 7 (6) (2003) 561-575.
[38] S. García, J.R. Cano, F. Herrera, A memetic algorithm for evolutionary prototype selection: a scaling up approach, Pattern Recognition 41 (8) (2008) 2693-2709.
[39] L. Nanni, A. Lumini, Particle swarm optimization for prototype reduction, Neurocomputing 72 (4-6) (2008) 1092-1097.
[40] A. Cervantes, I.M. Galván, P. Isasi, AMPSO: a new particle swarm method for nearest neighborhood classification, IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics 39 (5) (2009) 1082-1091.
[41] N. Krasnogor, J. Smith, A tutorial for competent memetic algorithms: model, taxonomy, and design issues, IEEE Transactions on Evolutionary Computation 9 (5) (2005) 474-488.
[42] F. Fernández, P. Isasi, Evolutionary design of nearest prototype classifiers, Journal of Heuristics 10 (4) (2004) 431-454.
[43] U. Garain, Prototype reduction using an artificial immune model, Pattern Analysis and Applications 11 (3-4) (2008) 353-363.
[44] J. Kennedy, R. Eberhart, Particle swarm optimization, in: Proceedings of the IEEE International Conference on Neural Networks, 1995, pp. 1942-1948.
[45] R. Poli, J. Kennedy, T. Blackwell, Particle swarm optimization, Swarm Intelligence 1 (1) (2007) 33-57.
[46] R. Storn, K.V.
Price, Differential evolution - a simple and efficient heuristic for global optimization over continuous spaces, Journal of Global Optimization 11 (10) (1997) 341-359.
[47] K.V. Price, R.M. Storn, J.A. Lampinen, Differential Evolution: A Practical Approach to Global Optimization, Natural Computing Series, 2005.
[48] I. Triguero, S. García, F. Herrera, A preliminary study on the use of differential evolution for adjusting the position of examples in nearest neighbor classification, in: Proceedings of the IEEE Congress on Evolutionary Computation, 2010, pp. 630-637.
[49] A.K. Qin, V.L. Huang, P.N. Suganthan, Differential evolution algorithm with strategy adaptation for global numerical optimization, IEEE Transactions on Evolutionary Computation 13 (2) (2009) 398-417.
[50] S. Rahnamayan, H. Tizhoosh, M. Salama, Opposition-based differential evolution, IEEE Transactions on Evolutionary Computation 12 (1) (2008) 64-79.
[51] S. Das, A. Abraham, U.K. Chakraborty, A. Konar, Differential evolution using a neighborhood-based mutation operator, IEEE Transactions on Evolutionary Computation 13 (3) (2009) 526-553.
[52] J. Zhang, A.C. Sanderson, JADE: adaptive differential evolution with optional external archive, IEEE Transactions on Evolutionary Computation 13 (5) (2009) 945-958.
[53] F. Neri, V. Tirronen, Scale factor local search in differential evolution, Memetic Computing 1 (2) (2009) 153-171.
[54] J.S. Sánchez, High training set size reduction by space partitioning and prototype abstraction, Pattern Recognition 37 (7) (2004) 1561-1564.
[55] H.A. David, H.N. Nagaraja, Order Statistics, third ed., Wiley, 2003.
[56] T.J. Rothenberg, F.M. Fisher, C.B. Tilanus, A note on estimation from a Cauchy sample, Journal of the American Statistical Association 59 (306) (1966) 460-463.
[57] E. Alpaydin, Introduction to Machine Learning, second ed., MIT Press, Cambridge, MA, 2010.
[58] I.H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, second ed., Morgan Kaufmann, San Francisco, 2005.
[59] A. Ben-David, A lot of randomness is hiding in accuracy, Engineering Applications of Artificial Intelligence 20 (2007) 875-885.
[60] A. Asuncion, D. Newman, UCI machine learning repository, 2007. URL: http://www.ics.uci.edu/~mlearn/MLRepository.html.
[61] J. Alcalá-Fdez, L. Sánchez, S. García, M.J. del Jesus, S. Ventura, J.M. Garrell, J. Otero, C. Romero, J. Bacardit, V.M. Rivas, J.C. Fernández, F. Herrera, KEEL: a software tool to assess evolutionary algorithms for data mining problems, Soft Computing 13 (3) (2009) 307-318.
[62] P.A. Devijver, J. Kittler, Pattern Recognition: A Statistical Approach, Prentice-Hall, London, 1982.
[63] R. Nisbet, J. Elder, G. Miner, Handbook of Statistical Analysis and Data Mining Applications, Elsevier, 2009.
[64] L.N. de Castro, J. Timmis, Artificial Immune Systems: A New Computational Intelligence Approach, Springer, 2002.
[65] S. García, A. Fernández, J. Luengo, F. Herrera, A study of statistical techniques and performance measures for genetics-based machine learning: accuracy and interpretability, Soft Computing 13 (10) (2009) 959-977.
[66] D.J. Sheskin, Handbook of Parametric and Nonparametric Statistical Procedures, fourth ed., Chapman & Hall/CRC, 2006.
[67] J. Demšar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research 7 (2006) 1-30.
[68] S. García, F.
Herrera, An extension on "statistical comparisons of classifiers over multiple data sets" for all pairwise comparisons, Journal of Machine Learning Research 9 (2008) 2677-2694.
[69] S. García, A. Fernández, J. Luengo, F. Herrera, Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power, Information Sciences 180 (2010) 2044-2064.
[70] J. Hodges, E. Lehmann, Rank methods for combination of independent experiments in analysis of variance, Annals of Mathematical Statistics 33 (1962) 482-497.
[71] S. Holm, A simple sequentially rejective multiple test procedure, Scandinavian Journal of Statistics 6 (1979) 65-70.
[72] W.H. Kruskal, W.A. Wallis, Use of ranks in one-criterion variance analysis, Journal of the American Statistical Association 47 (1952) 583-621.
[73] M. Abramowitz, Handbook of Mathematical Functions, With Formulas, Graphs, and Mathematical Tables, Dover Publications, 1974.
[74] W.W. Daniel, Applied Nonparametric Statistics, Duxbury Thomson Learning, 1990.

Isaac Triguero Velázquez received the M.Sc. degree in Computer Science from the University of Granada, Granada, Spain, in 2009. He is currently a Ph.D. student in the Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain. His research interests include data mining, data reduction and evolutionary algorithms.

Salvador García López received the M.Sc. and Ph.D. degrees in Computer Science from the University of Granada, Granada, Spain, in 2004 and 2008, respectively. He is currently an Assistant Professor in the Department of Computer Science, University of Jaén, Jaén, Spain. His research interests include data mining, data reduction, data complexity, imbalanced learning, statistical inference and evolutionary algorithms.

Francisco Herrera Triguero received the M.Sc. degree in Mathematics in 1988 and the Ph.D. degree in Mathematics in 1991, both from the University of Granada, Spain. He is currently a Professor in the Department of Computer Science and Artificial Intelligence at the University of Granada. He has published more than 150 papers in international journals. He is coauthor of the book "Genetic Fuzzy Systems: Evolutionary Tuning and Learning of Fuzzy Knowledge Bases" (World Scientific, 2001). As editorial activities, he has co-edited five international books and 20 special issues in international journals on different Soft Computing topics. He acts as associate editor of the journals: IEEE Transactions on Fuzzy Systems, Information Sciences, Mathware and Soft Computing, Advances in Fuzzy Systems, Advances in Computational Sciences and Technology, and International Journal of Applied Metaheuristics Computing. He currently serves as area editor of the journal Soft Computing (area of genetic algorithms and genetic fuzzy systems), and he serves as a member of several journal editorial boards, among others: Fuzzy Sets and Systems, Applied Intelligence, Knowledge and Information Systems, Information Fusion, Evolutionary Intelligence, International Journal of Hybrid Intelligent Systems, Memetic Computation. His current research interests include computing with words and decision making, data mining, data preparation, instance selection, fuzzy rule based systems, genetic fuzzy systems, knowledge extraction based on evolutionary algorithms, memetic algorithms and genetic algorithms.
1.4 Integrating a Differential Evolution Feature Weighting scheme into Prototype Generation

• I. Triguero, J. Derrac, S. García, F. Herrera, Integrating a Differential Evolution Feature Weighting scheme into Prototype Generation. Neurocomputing 97 (2012) 332-343, doi: 10.1016/j.neucom.2012.06.009.
  – Status: Published.
  – Impact Factor (JCR 2012): 1.634
  – Subject Category: Computer Science, Artificial Intelligence. Ranking 37 / 115 (Q2).

Integrating a differential evolution feature weighting scheme into prototype generation

Isaac Triguero (a), Joaquín Derrac (a), Salvador García (b), Francisco Herrera (a)

(a) Department of Computer Science and Artificial Intelligence, CITIC-UGR (Research Center on Information and Communications Technology), University of Granada, 18071 Granada, Spain
(b) Department of Computer Science, University of Jaén, 23071 Jaén, Spain

Article history: Received 23 November 2011; received in revised form 13 March 2012; accepted 1 June 2012; communicated by M. Bianchini; available online 1 July 2012.

Abstract: Prototype generation techniques have arisen as very competitive methods for enhancing the nearest neighbor classifier through data reduction. Within the prototype generation methodology, the methods of adjusting the prototypes' positioning have shown an outstanding performance. Evolutionary algorithms have been used to optimize the positioning of the prototypes with promising results. However, these results can be improved even further if other data reduction techniques, such as prototype selection and feature weighting, are considered. In this paper, we propose a hybrid evolutionary scheme for data reduction, incorporating a new feature weighting scheme within two different prototype generation methodologies. Specifically, we focus on a self-adaptive differential evolution algorithm in order to optimize feature weights and the placement of the prototypes. The results are contrasted with nonparametric statistical tests, showing that our proposal outperforms previously proposed methods, thus showing itself to be a suitable tool in the task of enhancing the performance of the nearest neighbor classifier.

Keywords: Differential evolution; Prototype generation; Prototype selection; Feature weighting; Nearest neighbor; Classification

1. Introduction

The design of classifiers can be considered one of the most important tasks in machine learning and data mining [1,2]. Most machine learning methods build a model during the learning process and are known as eager learning methods [3], but there are some approaches in which the algorithm does not need a model; these are known as lazy learning methods [4]. The Nearest Neighbor (NN) rule [5] is a simple and effective supervised classification technique which belongs to the lazy learning family of methods. NN is a nonparametric classifier, which requires all training data instances to be stored. Unseen cases are classified by finding the class labels of the closest instances to them. The extended version of NN to k neighbors (kNN) is considered one of the most influential data mining algorithms [6], and it has attracted much attention and research in recent years [7,8].
However, NN may have several disadvantages, such as a high computational cost, high storage requirements and sensitivity to noise, which can affect its performance. Furthermore, NN makes predictions over existing data, and it assumes that the input data perfectly delimits the decision boundaries among classes.

Many approaches have been proposed to improve the performance of the NN rule. One way to simultaneously tackle the computational complexity, storage requirements and noise sensitivity of NN is based on data reduction [9]. These techniques try to obtain a reduced version of the original training data, with the double objective of removing noisy and irrelevant data. Considering the feature space, we can highlight Feature Selection (FS) [10-13] and feature generation/extraction [14] as the main techniques. FS consists of choosing a representative subset of features from the original feature space, while feature generation creates new features to describe the data. From the perspective of the instances, data reduction can be divided into Prototype Selection (PS) [15-17] and Prototype Generation (PG) [18,19]. The former consists of choosing an appropriate subset of the original training data, while the latter can also build new artificial prototypes to better adjust the decision boundaries between classes in NN classification. In this way, PG does not assume that the input data perfectly defines the decision boundaries among classes.

Another way to improve the performance of NN is the employment of weighting schemes. Feature Weighting (FW) [20] is a well-known technique which consists of assigning a weight to each feature of the domain of the problem to modify the way in which distances between examples are computed [21]. This technique
Typically, positioning adjustment methods [28,36,30] focus on the placement process and do not take into consideration the selection of the most appropriate number of prototypes per class. Recently, two different approaches have been proposed to tackle this problem. First, in [37,38], the problem is addressed by an iterative addition process that determines which classes need more prototypes to be represented; this algorithm is denoted IPADECS. Second, in [34], the algorithm SSMA-DEPG is presented, in which a preliminary PS stage is applied to provide an appropriate choice of the number of prototypes per class. For these techniques, growth in the size of the data set is a crucial problem. It has been addressed in PS and PG by using stratification techniques [39,40], which split the data set into various parts to make the application of a prototype reduction technique easier, using a mechanism to join the solutions of each part into a global solution.

The aim of this work is to propose a hybrid approach which combines these two PG methodologies with FW to enhance the NN rule, addressing its main drawbacks. In both schemes, the most promising feature weights and locations of the prototypes are generated by the SFLSDE algorithm [41], acting as an FW and a PG method respectively. Evolutionary PG methods usually tend to overfit the training data within a small number of iterations. For this reason we apply, during the evolutionary optimization process, an FW stage to modify the fitness function of the PG method and determine the relevance of each feature. The hybridization of the PG and FW problems is the main contribution of this paper, which can be divided into three objectives:

– To propose a new FW technique based on a self-adaptive DE. To the best of our knowledge, DE has not yet been applied to the FW problem.
– To carry out an empirical study to analyze the hybridization models in terms of classification accuracy. Specifically, we will analyze whether the integration of an FW stage with PG methods improves the quality of the resulting reduced sets.
– To check the behavior of these hybrid approaches when dealing with huge data sets, developing a stratified model with the proposed hybrid scheme.

To test the behavior of these approaches, the experimental study will include a statistical analysis based on nonparametric statistical tests [42]. We conduct experiments involving a total of 46 classification data sets with different properties.

In order to organize this paper, Section 2 describes the background of PS, PG, FW, DE and stratification. Section 3 explains the hybridization algorithms proposed. Section 4 discusses the experimental framework and Section 5 presents the analysis of results. Finally, in Section 6 we summarize our conclusions.

2. Background

This section covers the background information necessary to define and describe our proposals. Section 2.1 presents a formal definition of the PS and PG problems. Section 2.2 describes the main characteristics of FW. Section 2.3 explains the DE technique. Finally, Section 2.4 details the characteristics of the stratification procedure.

2.1. PS and PG problems

This section presents the definition and notation for both the PS and PG problems. A formal specification of the PS problem is the following: let $x_p$ be an example, $x_p = (x_{p1}, x_{p2}, \ldots, x_{pD}, \omega)$, where $x_p$ belongs to a class $\omega$ and lies in a $D$-dimensional space in which $x_{pi}$ is the value of the $i$-th feature of the $p$-th sample. Then, let us assume that there is a training set $TR$ which consists of $n$ instances $x_p$, and a test set $TS$ composed of $t$ instances $x_q$ with $\omega$ unknown. Let $SS \subseteq TR$ be the subset of selected samples resulting from the execution of a PS algorithm; then we classify a new pattern $x_q$ from $TS$ by the NN rule acting over $SS$. The purpose of PG is to obtain a prototype generated set $GS$, which consists of $r$ ($r < n$) prototypes, which are either selected or generated from the examples of $TR$. The prototypes of the generated set are determined so as to represent the class distributions efficiently and to discriminate well when used to classify the training objects. Their cardinality should be sufficiently small to reduce both the storage and the evaluation time spent by an NN classifier.

Both methodologies have been widely studied in the specialized literature. More than 50 PS methods have been proposed; in general, they can be categorized into three kinds of methods: condensation [43], edition [44] and hybrid models [27]. A complete review of this topic is given in [17]. Regarding PG techniques, they can be divided into several families depending on the main heuristic operation followed: positioning adjustment [30], class re-labeling [45], centroid-based [18] and space-splitting [46]. A recent PG review is given in [19]. More information about PS and PG approaches can be found at the SCI2S thematic public website on Prototype Reduction in Nearest Neighbor Classification: Prototype Selection and Prototype Generation (http://sci2s.ugr.es/pr/).

2.2. Feature weighting

The aim of FW methods is to reduce the sensitivity of the NN rule to redundant, irrelevant or noisy features by modifying its distance function with weights. These modifications allow us to perform more robust classification tasks, increasing in this manner the global accuracy of the classifier. The best known distance or dissimilarity measure for the NN rule is the Euclidean distance (Eq. (1)), where $x_p$ and $x_q$ are two examples and $D$ is their number of features. We use it throughout this study as it is simple, easy to optimize, and has been widely used in the field of instance-based learning [47]:

EuclideanDistance$(x_p, x_q) = \sqrt{\sum_{i=1}^{D} (x_{pi} - x_{qi})^2}$    (1)

FW methods often extend this equation by applying a different weight $W_i$ to each feature, which modifies the way in which the distance measure is computed (Eq. (2)):

FWDist$(x_p, x_q) = \sqrt{\sum_{i=1}^{D} W_i \cdot (x_{pi} - x_{qi})^2}$    (2)

This technique has been widely used in the literature. As far as we know, the most complete study performed can be found in [20], in which a review of several FW methods for lazy learning algorithms is presented (most of them applied to improve the performance of the NN rule). In this review, FW techniques were categorized along several dimensions: their weight learning bias, the weight space (binary or continuous), the representation of features, their generality, and the degree to which they employ domain-specific knowledge. A wide number of FW techniques are available in the literature, both classical (see [20]) and recent (for example, [21,35]).
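To make Eqs. (1) and (2) concrete, here is a minimal Python sketch of the weighted distance and its effect; the function name is our own and the example values are illustrative, not taken from the paper:

import numpy as np

def fw_dist(xp, xq, w):
    """Weighted Euclidean distance of Eq. (2); w[i] = 0 effectively
    removes feature i, while w[i] = 1 leaves it untouched."""
    return np.sqrt(np.sum(w * (xp - xq) ** 2))

xp, xq = np.array([0.2, 0.9]), np.array([0.3, 0.1])
print(fw_dist(xp, xq, np.ones(2)))            # plain Euclidean distance, Eq. (1)
print(fw_dist(xp, xq, np.array([1.0, 0.2])))  # second feature down-weighted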
The best known group of FW techniques is the family of Relief-based algorithms. The Relief algorithm [48] (originally an FS method) has been widely studied and modified, producing some interesting versions of the original approach [49]. Some of them are based on ReliefF [50], which constitutes the first step in the development of Relief-based methods as FW techniques [51]. Finally, it is worth noting that some approaches deal simultaneously with the FW and FS tasks, for instance inside a Tabu Search procedure [52] or by managing ensemble-based approaches [53].

2.3. Differential evolution

DE follows the general procedure of an evolutionary algorithm [33]. DE starts with a population of NP solutions, so-called individuals. The initial population should cover the entire search space as much as possible. In some problems this is achieved by uniformly randomizing the individuals, but in other problems, such as the PG problem, basic knowledge of the problem is available and the use of other initialization mechanisms is more effective. The subsequent generations are denoted by $G = 0, 1, \ldots, G_{max}$. In DE, it is common to denote each individual as a $D$-dimensional vector $X_{i,G} = \{x^1_{i,G}, \ldots, x^D_{i,G}\}$, called a "target vector". After initialization, DE applies the mutation operator to generate a mutant vector $V_{i,G} = \{v^1_{i,G}, \ldots, v^D_{i,G}\}$ with respect to each individual $X_{i,G}$ in the current population. The method of creating this mutant vector is what differentiates one DE scheme from another. In this work, we focus on the RandToBest/1 strategy, which generates the mutant vector as follows:

$V_{i,G} = X_{i,G} + F \cdot (X_{best,G} - X_{i,G}) + F \cdot (X_{r_1^i,G} - X_{r_2^i,G})$    (3)

The indices $r_1^i$, $r_2^i$ are mutually exclusive integers randomly generated within the range $[1, NP]$, which are also different from the base index $i$. The scaling factor $F$ is a positive control parameter for scaling the difference vectors. After the mutation phase, the crossover operation is applied to each pair of target vector $X_{i,G}$ and corresponding mutant vector $V_{i,G}$ to generate a new trial vector, denoted $U_{i,G}$. We focus on the binomial crossover scheme, which copies a component from the mutant vector whenever a randomly picked number between 0 and 1 is less than or equal to the crossover rate (CR); CR thus controls the fraction of parameter values copied from the mutant vector. Then, we must decide which individual should survive into generation $G+1$. The selection operator is described as follows:

$X_{i,G+1} = \begin{cases} U_{i,G} & \text{if } \mathcal{F}(U_{i,G}) \text{ is better than } \mathcal{F}(X_{i,G}) \\ X_{i,G} & \text{otherwise} \end{cases}$

where $\mathcal{F}$ is the fitness function to be minimized. If the new trial vector yields a solution equal to or better than the target vector, it replaces the corresponding target vector in the next generation; otherwise the target is retained in the population. Therefore, the population either improves or retains the same fitness values, but never deteriorates. This one-to-one selection procedure is kept fixed in most DE algorithms. The success of DE in solving a specific problem crucially depends on choosing the appropriate mutation strategy and its associated control parameter values (F and CR), which determine the convergence speed. Hence, a fixed selection of these parameters can produce slow and/or premature convergence depending on the problem. Thus, researchers have investigated parameter adaptation mechanisms to improve the performance of the basic DE algorithm [54–56].
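The following short Python sketch illustrates one generation of the DE scheme just described (RandToBest/1 mutation of Eq. (3), binomial crossover, one-to-one selection) on a toy minimization problem. It is a didactic sketch under our own naming, not the SFLSDE implementation used in the paper:

import numpy as np

rng = np.random.default_rng(0)

def de_generation(pop, fitness, F=0.5, CR=0.9):
    """One DE generation: RandToBest/1 mutation (Eq. (3)),
    binomial crossover, and one-to-one survivor selection."""
    NP, D = pop.shape
    fit = np.array([fitness(x) for x in pop])
    best = pop[np.argmin(fit)]
    new_pop = pop.copy()
    for i in range(NP):
        r1, r2 = rng.choice([j for j in range(NP) if j != i], size=2, replace=False)
        mutant = pop[i] + F * (best - pop[i]) + F * (pop[r1] - pop[r2])
        cross = rng.random(D) <= CR
        cross[rng.integers(D)] = True   # guarantee at least one mutant component
        trial = np.where(cross, mutant, pop[i])
        if fitness(trial) <= fit[i]:    # keep the better of target and trial
            new_pop[i] = trial
    return new_pop

# Toy usage: minimize the sphere function in 5 dimensions.
pop = rng.random((20, 5))
for _ in range(100):
    pop = de_generation(pop, lambda x: float(np.sum(x ** 2)))
print(min(np.sum(pop ** 2, axis=1)))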
One of the most successful adaptive DE algorithms is the Scale Factor Local Search in Differential Evolution (SFLSDE) proposed in [41]. This method was established as the best DE technique for PG in [34].

2.4. Stratification for prototype reduction schemes

When performing data reduction, the scaling up problem appears as the number of training examples grows beyond the capacity of the prototype reduction algorithms, harming their effectiveness and efficiency. This is a crucial problem which must be overcome in most practical applications of data reduction methods. In order to avoid it, in this work we consider the use of the stratification strategy, initially proposed in [39] for PS and in [40] for PG. This strategy splits the training data into disjoint strata with equal class distribution. The initial data set $D$ is divided into two sets, $TR$ and $TS$, as usual (for example, a tenth of the data for $TS$ and the rest for $TR$ in a 10-fold cross-validation). Then, $TR$ is divided into $t$ disjoint strata of equal size, $TR_1, TR_2, \ldots, TR_t$, maintaining the class distribution within each subset. In this manner, the subsets $TR$ and $TS$ can be represented as follows:

$TR = \bigcup_{j=1}^{t} TR_j, \qquad TS = D \setminus TR$    (4)

Then, a prototype reduction method is applied to each $TR_j$, obtaining a reduced set $RS_j$ for each partition. In the PS and PG stratification procedures, the final reduced set is obtained by joining every $RS_j$ obtained, and it is denoted the Stratified Reduced Set (SRS):

$SRS = \bigcup_{j=1}^{t} RS_j$    (5)

Once the SRS has been obtained, it is ready to be used by an NN classifier to classify the instances of TS. The stratification procedure itself adds little to the overall runtime: splitting the training data into strata, and joining the results once the prototype reduction method has been applied, is not time-consuming, as it does not require any kind of additional processing. Thus, the time needed for the stratified execution is almost the same as that taken by the execution of the prototype reduction method in each stratum, which is significantly lower than the time spent if no stratification is applied, given that the time complexity of PS and PG methods is usually $O(N^2)$ or higher. The prototypes present in TR are independent of each other, so distributing the data into strata will not degrade their representation capabilities as long as the class distribution is maintained. The number of strata, which should be fixed empirically, determines their size. By using a proper number it is possible to greatly reduce the training set size, which allows us to avoid the drawbacks caused by the scaling up problem.

3. Hybrid evolutionary models integrating feature weighting and prototype generation

In this section we describe in depth the proposed hybrid approaches and their main components. First, we present the proposed FW scheme based on DE (Section 3.1). Next, as established previously, we design a hybrid model combining the proposed FW scheme with each of the two most effective PG methodologies, IPADECS [37] (Section 3.2) and SSMA-DEPG [34] (Section 3.3). Finally, we develop a stratified model for our hybrid proposals (Section 3.4).
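A minimal sketch of the stratified split of Eq. (4), assuming scikit-learn is available; StratifiedKFold preserves the class distribution in each stratum, which is exactly the property the procedure relies on:

import numpy as np
from sklearn.model_selection import StratifiedKFold

def stratify(X_train, y_train, t):
    """Split TR into t disjoint strata TR_1..TR_t with (approximately)
    equal class distribution, as in Eq. (4)."""
    skf = StratifiedKFold(n_splits=t, shuffle=True, random_state=1)
    # Each stratum is the *test* fold of one split, so the strata are
    # disjoint and their union is the whole of TR.
    return [(X_train[idx], y_train[idx]) for _, idx in skf.split(X_train, y_train)]

# A prototype reduction method would now run on each stratum, and the
# resulting reduced sets RS_j would be joined into the SRS of Eq. (5).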
3.1. Differential evolution for feature weighting

As stated before, FW can be viewed as a continuous space search problem in which we want to determine the most appropriate weight for each feature in order to enhance the NN rule. Specifically, we propose the use of a DE procedure to obtain the weights which allow a given reduced set to increase the classification performance over TR. We denote this algorithm DEFW. DEFW starts with a population of NP individuals $X_{i,G}$. In order to encode a weight vector in a DE individual, the algorithm uses a real-valued vector containing $D$ elements, corresponding to the $D$ attributes, whose values range in the interval [0,1]. This means that each individual $X_{i,G}$ in the population encodes a complete solution to the FW problem. Following the ideas established in [54,55,41], the initial population should cover the entire search space as much as possible, so individuals are uniformly randomized within the defined range. After the initialization process, DEFW enters a loop in which the mutation and crossover operators, explained in Section 2.3, guide the optimization of the feature weights by generating new trial vectors $U_{i,G}$. After applying these operators, we check whether any values fall outside the range $[\theta, 1]$. If a computed value is greater than 1, we truncate it to 1. Furthermore, based on [48], if a value is lower than a threshold $\theta$, we consider the corresponding feature irrelevant and set its weight to 0. In our experiments, $\theta$ has been fixed empirically to 0.2. Finally, the selection operator must decide which generated trial vectors should survive into the population of the next generation $G+1$. For our purposes, the NN rule guides this operator. The instances in TR are classified with the prototypes of the given reduced set, but the distance measure for the NN rule is modified according to Eq. (2), where the weights $W_i$ are obtained from $X_{i,G}$ and $U_{i,G}$. Their corresponding fitness values are measured as the accuracy() obtained, i.e., the number of successful hits (correct classifications) relative to the total number of classifications. We try to maximize this value, so the selection operator can be viewed as follows:

$X_{i,G+1} = \begin{cases} U_{i,G} & \text{if accuracy}(ReducedSet, U_{i,G}) \geq \text{accuracy}(ReducedSet, X_{i,G}) \\ X_{i,G} & \text{otherwise} \end{cases}$    (6)

In the case of a tie between the accuracy values, we select $U_{i,G}$ in order to give the mutated individual the opportunity to enter the population. In order to overcome the limitations of fixed parameter selection (F and CR), we use the ideas established in [41] to implement a self-adaptive DE scheme.

3.2. IPADECS-DEFW: hybridization with IPADECS

The IPADECS algorithm [37] follows an iterative prototype adjustment scheme with an incremental approach. At each step, an optimization procedure is used to adjust the position of the prototypes, adding new ones if needed. The aim of this algorithm is to determine the most appropriate number of prototypes per class and to adjust their positioning during the evolutionary process. Specifically, IPADECS uses the SFLSDE technique as an optimizer, with a complete solution encoded per individual. At the end of the process, IPADECS returns the best GS found. The hybrid model combining IPADECS and DEFW can basically be described as an IPADECS stage followed by a DEFW stage that determines the best weights. Fig. 1 shows the pseudo-code of this hybrid scheme:

1: Weights[1..D] = 1.0
2: bestWeights[1..D] = Weights[1..D]
3: GS = IPADECS(Weights)
4: Accuracy = EvaluateWithWeights(GS, TR, Weights)
5: for i = 1 to MAXITER do
6:   newWeights[1..D] = DEFW(GS, Weights)
7:   GSaux = IPADECS(newWeights)
8:   Accuracytrial = EvaluateWithWeights(GSaux, TR, newWeights)
9:   if Accuracytrial > Accuracy then
10:    Accuracy = Accuracytrial
11:    GS = GSaux
12:  end if
13:  Weights = newWeights
14: end for
15: return GS, Weights

Fig. 1. Hybridization of IPADECS and DEFW.

The algorithm proceeds as follows. Initially, we perform an IPADECS run in which all the features have a relevance degree of 1.0 (Instructions 1–4). Then, the algorithm enters a loop in which we try to find the most appropriate weights and placement of the prototypes:

– Instruction 6 performs a DE optimization of the feature weights, so that the best GS obtained from the IPADECS algorithm is used to determine the appropriate weights, as described in Section 3.1. Furthermore, the current weights are inserted as one of the individuals of the FW population. In this way, we ensure that the FW scheme does not degrade the performance of the GS obtained with IPADECS, thanks to the selection operator used in DEFW.
– Next, Instruction 7 generates a new GS with IPADECS, but in this case the optimization process takes the new weights into consideration when calculating the distances between prototypes (see Eq. (2)). The underlying idea is that IPADECS should generate a different GS because the distance measure has changed, and therefore the continuous search space has been modified.
– After this process, we check whether the new positioning of the prototypes, GSaux, together with its respective weights, improves the accuracy rate with respect to the previous GS. If the computed accuracy of the new GSaux with its respective weights is greater than the best accuracy found, we save GSaux as the current GS (Instructions 8–12).
– Instruction 13 stores the obtained weights so that they can be used in the next iteration.

After a previously fixed number of iterations, the hybrid model returns the best GS and its respective best feature weights.
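A small Python sketch of the DEFW-specific details on top of a generic DE loop: the $[\theta,1]$ repair of trial weights and the accuracy-driven selection of Eq. (6), with ties resolved in favor of the trial vector. Names such as repair_weights are our own, and the accuracy function assumes a weighted 1NN evaluation of TR against the reduced set:

import numpy as np

THETA = 0.2  # threshold fixed empirically in the paper

def repair_weights(w):
    """Clamp trial weights to [0, 1]; weights below THETA mark the
    feature as irrelevant and are zeroed (Relief-inspired rule)."""
    w = np.minimum(w, 1.0)
    w[w < THETA] = 0.0
    return w

def accuracy(reduced_X, reduced_y, X_tr, y_tr, w):
    """Fraction of TR instances correctly classified by weighted 1NN
    over the reduced set, i.e. the fitness used in Eq. (6)."""
    hits = 0
    for x, y in zip(X_tr, y_tr):
        d = np.sqrt((w * (reduced_X - x) ** 2).sum(axis=1))
        hits += int(reduced_y[np.argmin(d)] == y)
    return hits / len(y_tr)

def select(x_w, u_w, fitness_of):
    """Selection of Eq. (6): the trial survives on ties (>=), so mutated
    individuals get the chance to enter the population."""
    return u_w if fitness_of(u_w) >= fitness_of(x_w) else x_w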
3.3. SSMA-DEPGFW: hybridization with SSMA-DEPG

The SSMA-DEPG approach [34] uses a PS algorithm prior to the adjustment process to initialize a subset of prototypes, finding a promising selection of prototypes per class. Specifically, the SSMA algorithm [27] is applied. This is a memetic algorithm which makes use of a local search, or meme, specifically developed for the PS problem. The interweaving of the global and local search phases allows the two to influence each other. The resulting SS is inserted as one of the individuals of the population of the SFLSDE algorithm, which in this case acts as a PG method. SFLSDE then performs mutation and crossover operations to generate new trial solutions. Again, the NN rule guides the selection operator, so SSMA-DEPG returns the location of the prototypes which best increases the classification rate. Fig. 2 outlines the hybrid model. To hybridize FW with SSMA-DEPG, the method proceeds as follows:

– First, an SSMA stage is applied to determine the number of prototypes per class (Instruction 1). Next, the rest of the individuals are randomly generated by extracting prototypes from TR while keeping the same structure as the SS selected by the PS method; thus they must have the same number of prototypes per class, and the classes must have the same arrangement in the matrix $X_{i,G}$. At this stage, the relevance degree of every feature is set to 1.0. Then, Instruction 4 determines the best classification accuracy obtained in the population of NP individuals. After this, the hybrid model enters a cooperative loop between FW and SFLSDE.
– The proposed FW method is applied with the best GS found up to that moment. Once again, the current weights are inserted as one of the individuals of the FW population (Instruction 6).
– Then, a new optimization stage is applied to all the individuals of the population, with the obtained weights modifying the distance measure between prototypes.

Finally, the method returns the best GS with its appropriate feature weights, ready to be used as a reference set by the NN classifier.

1: GS[1] = SSMA()
2: Generate GS[2..NP] randomly with the prototype distribution of GS[1]
3: Weights[1..D] = 1.0
4: Determine the best GS
5: for i = 1 to MAXITER do
6:   Weights[1..D] = DEFW(GS[best], Weights)
7:   GS[1..NP] = SFLSDE(GS[1..NP], Weights)
8:   Determine the best GS
9: end for
10: return GS[best], Weights

Fig. 2. Hybridization of SSMA-DEPG and DEFW.

3.4. A stratified scheme for hybrid FW and PG methods

Since the direct application of these hybrid methods to huge data sets should be avoided due to their computational cost, we propose the use of a stratification procedure to mitigate this drawback, and thus develop an approach suitable for huge problems. The PS and PG stratified models join every resulting set $RS_j$ obtained from the application of these techniques to each stratum $TR_j$. Nevertheless, in the proposed hybrid scheme we obtain, for each stratum, a generated reduced set together with its respective feature weights. To develop a stratified method, we study two different strategies (a brief sketch follows this list):

– Join procedure: In this variant, the SRS is again generated as the union of the $RS_j$. However, the weight of each feature is recalculated by applying the DEFW algorithm, in this case using the SRS as the given reduced set. The stratified method returns the SRS and the obtained weights to classify the instances of TS.
– Voting rule: This approach applies a majority voting rule. Each stratum's $RS_j$, with its respective weights, is used to compute a candidate class for each instance of TS. The final class is assigned via majority voting over the classes computed per stratum. In our implementation, ties are broken randomly.
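A compact sketch of the two stratified combination strategies, reusing the illustrative nn_classify helper from the earlier sketch; the DEFW run that would produce the recalculated join weights is elided, and the sqrt scaling exploits the fact that the weighted distance of Eq. (2) equals a plain Euclidean distance over rescaled features:

from collections import Counter
import numpy as np

def join_predict(strata_sets, joint_weights, x_query, nn_classify):
    """Join procedure: concatenate every RS_j into the SRS and classify
    with a single weighted 1NN rule."""
    Xs, ys = zip(*strata_sets)
    srs_X, srs_y = np.vstack(Xs), np.concatenate(ys)
    s = np.sqrt(joint_weights)  # sqrt(w)-scaled features give Eq. (2)
    return nn_classify(srs_X * s, srs_y, x_query * s)

def voting_predict(strata_sets, strata_weights, x_query, nn_classify, rng):
    """Voting rule: each (RS_j, W_j) pair votes with its own weighted
    1NN prediction; ties are broken randomly."""
    votes = Counter(
        nn_classify(rs_X * np.sqrt(w), rs_y, x_query * np.sqrt(w))
        for (rs_X, rs_y), w in zip(strata_sets, strata_weights))
    top = max(votes.values())
    return rng.choice([c for c, n in votes.items() if n == top])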
4. Experimental framework

In this section, we present the main characteristics of the experimental study. Section 4.1 introduces the data sets used in this study. Section 4.2 summarizes the algorithms used for comparison, with their respective parameters. Finally, Section 4.3 describes the statistical tests applied to contrast the results obtained.

4.1. Data sets

In this study, we have selected 40 classification data sets for the main experimental study. These are well-known problems in the area, taken from the KEEL data set repository (http://sci2s.ugr.es/keel/datasets) [57]. Table 1 summarizes the properties of the selected data sets, showing for each one the number of examples (#Ex.), the number of attributes (#Atts.), and the number of classes (#Cl.). The data sets considered contain between 100 and 20 000 instances, and the number of attributes ranges from 2 to 85. They are partitioned using the 10-fold cross-validation (10-fcv) procedure, and their values are normalized in the interval [0,1] to equalize the influence of attributes with different ranges. Instances with missing values were discarded before running the methods over the data sets. Furthermore, we perform an additional experiment applying our hybrid models to six huge data sets, which contain more than 20 000 instances. Table 2 shows their characteristics, including the exact number of strata (#Strata) and instances per stratum (#Instances/Stratum).

Table 1. Summary description of the classification data sets.

Data set | #Ex. | #Atts. | #Cl.
Abalone | 4174 | 8 | 28
Banana | 5300 | 2 | 2
Bands | 539 | 19 | 2
Breast | 286 | 9 | 2
Bupa | 345 | 6 | 2
Chess | 3196 | 36 | 2
Cleveland | 297 | 13 | 5
Coil2000 | 9822 | 85 | 2
Contraceptive | 1473 | 9 | 3
Crx | 125 | 15 | 2
Dermatology | 366 | 33 | 6
Flare-solar | 1066 | 9 | 2
German | 1000 | 20 | 2
Glass | 214 | 9 | 7
Haberman | 306 | 3 | 2
Hayes-roth | 133 | 4 | 3
Heart | 270 | 13 | 2
Housevotes | 435 | 16 | 2
Iris | 150 | 4 | 3
Led7digit | 500 | 7 | 10
Lym | 148 | 18 | 4
Magic | 19 020 | 10 | 2
Mammographic | 961 | 5 | 2
Marketing | 8993 | 13 | 9
Monks | 432 | 6 | 2
Newthyroid | 215 | 5 | 3
Nursery | 12 690 | 8 | 5
Pima | 768 | 8 | 2
Ring | 7400 | 20 | 2
Saheart | 462 | 9 | 2
Spambase | 4597 | 57 | 2
Spectfheart | 267 | 44 | 2
Splice | 3190 | 60 | 3
Tae | 151 | 5 | 3
Thyroid | 7200 | 21 | 3
Titanic | 2201 | 3 | 2
Twonorm | 7400 | 20 | 2
Wisconsin | 683 | 9 | 2
Yeast | 1484 | 8 | 10
Zoo | 101 | 16 | 7

Table 2. Summary description of the huge classification data sets.

Data set | #Ex. | #Atts. | #Cl. | #Strata | #Instances/stratum
Adult | 48 842 | 14 | 2 | 10 | 4884
Census | 299 285 | 41 | 2 | 60 | 4990
Connect-4 | 67 557 | 42 | 3 | 14 | 4826
Fars | 100 968 | 29 | 8 | 20 | 5048
Letter | 20 000 | 16 | 26 | 4 | 5000
Shuttle | 58 000 | 9 | 7 | 12 | 4833

4.2. Comparison algorithms and parameters

In order to perform an exhaustive study of the capabilities of our proposals, we have selected some of the main models proposed in the PS, PG and FW literature. In addition, the NN rule with k = 1 (1NN) has been included as a performance baseline. Apart from SSMA, IPADECS and SSMA-DEPG, which have been explained above, the remaining methods are described as follows:

– TSKNN: A Tabu Search based method for simultaneous FS and FW, which encodes in its solutions the current set of selected features (binary codification), the current set of weights assigned to the features, and the best value of k found for the kNN classifier. Furthermore, this method uses fuzzy kNN [58] to avoid ties in the classification process [52].
– ReliefF: The first Relief-based method adapted to perform the FW process [50]. The weights computed by the original Relief algorithm are not binarized to {0,1}; instead, they are employed directly as final weights for the kNN classifier. This method was rated the best performance-based FW method in [20].
– GOCBR: A genetic algorithm designed to perform PS and FW simultaneously in the same chromosome. Weights are represented by binary chains, thus preserving the binary codification of the chromosomes. It has been applied successfully to several real-world applications [59].

Many different configurations are established by the authors of each paper for the different techniques. We focus this experimentation on the parameters recommended by the respective authors, assuming that their values were chosen optimally. The configuration parameters, which are common to all problems, are shown in Table 3. In all of the techniques, the Euclidean distance is used as the similarity function, and the stochastic methods have been run three times per partition. Note that the values of the parameters Fl, Fu, iterSFGSS and iterSFHC remain constant in all the DE optimizations, and are the recommended values established in [41]. Implementations of the algorithms can be found in the KEEL software tool [57].

Table 3. Parameter specification for all the methods used in the experimentation.

Algorithm | Parameters
SSMA | Population = 30, Evaluations = 10 000, Crossover Probability = 0.5, Mutation Probability = 0.001
SSMA-DEPG | PopulationSFLSDE = 40, IterationsSFLSDE = 500, iterSFGSS = 8, iterSFHC = 20, Fl = 0.1, Fu = 0.9
IPADECS | Population = 10, Iterations of basic DE = 500, iterSFGSS = 8, iterSFHC = 20, Fl = 0.1, Fu = 0.9
TSKNN | Evaluations = 10 000, M = 10, N = 2, P = ceil(sqrt(#Features))
ReliefF | K value for contributions = best in [1,20]
GOCBR | Evaluations = 10 000, Population = 100, Crossover Probability = 0.7, Mutation Probability = 0.1
SSMA-DEPGFW | MAXITER = 20, PopulationSFLSDE = 40, IterationsSFLSDE = 50, PopulationDEFW = 25, IterationsDEFW = 200, iterSFGSS = 8, iterSFHC = 20, Fl = 0.1, Fu = 0.9
IPADECS-DEFW | MAXITER = 20, PopulationIPADECS = 10, Iterations of basic DE = 50, PopulationDEFW = 25, IterationsDEFW = 200, iterSFGSS = 8, iterSFHC = 20, Fl = 0.1, Fu = 0.9

4.3. Statistical tools for analysis

Hypothesis testing techniques provide a way to statistically support the results obtained in the experimental study, identifying the most relevant differences found between the methods [60]. To this end, nonparametric tests are preferred over parametric ones, since the initial conditions that guarantee the reliability of the latter may not be satisfied, causing the statistical analysis to lose credibility. We focus on the Friedman Aligned-ranks (FA) test [42] as a tool for contrasting the behavior of each of our proposals. Its application allows us to detect the existence of significant differences between methods. Then, post hoc procedures such as Holm's or Finner's determine which algorithms are distinctive among the 1 × n comparisons performed. Furthermore, we use the Wilcoxon Signed-Ranks test in those cases in which we analyze differences between pairs of methods not marked as significant by the previous tests. More information about these tests, and about other statistical procedures specifically designed for use in the field of Machine Learning, can be found at the SCI2S thematic public website on Statistical Inference in Computational Intelligence and Data Mining (http://sci2s.ugr.es/sicidm/); a small illustrative sketch of such an analysis is included just before Section 5.1.

5. Analysis of results

In this section, we analyze the results obtained from the different experimental studies. Specifically, our aims are:

– To compare the proposed hybrid schemes to each other over the 40 data sets (Section 5.1).
– To test the performance of these models in comparison with previously proposed methods (Section 5.2).
– To check whether the performance of the hybrid models is maintained on huge data sets using the proposed stratified model (Section 5.3).
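As announced in Section 4.3, the sketch below shows how this kind of nonparametric analysis can be approximated with SciPy. scipy.stats provides the standard Friedman and Wilcoxon tests; the aligned-ranks variant and the Holm/Finner post hoc corrections used in the paper would require extra code or another library, so this is an approximation of the pipeline of [42], not an exact reproduction. The accuracy matrix holds toy numbers, not results from the paper:

import numpy as np
from scipy import stats

# Toy accuracy matrix: rows = data sets, columns = algorithms.
acc = np.array([
    [25.61, 25.47, 25.66],
    [89.94, 89.70, 89.55],
    [95.56, 94.52, 90.61],
    [71.80, 71.20, 71.40],
])

# Friedman test over k related samples (plain version; the aligned-ranks
# variant ranks the values after subtracting the row means).
chi2, p_friedman = stats.friedmanchisquare(*acc.T)
print(f"Friedman p-value: {p_friedman:.4f}")

# Pairwise Wilcoxon signed-ranks test between two algorithms.
w_stat, p_wilcoxon = stats.wilcoxon(acc[:, 0], acc[:, 1])
print(f"Wilcoxon p-value: {p_wilcoxon:.4f}")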
5.1. Comparison of the proposed hybrid schemes

We focus this experiment on comparing the two hybrid schemes in terms of accuracy and reduction capabilities. Fig. 3 shows a star plot in which the test accuracy of IPADECS-DEFW and SSMA-DEPGFW is presented for each data set, making it easy to see how both algorithms behave in the same domains.

Fig. 3. Accuracy rate comparison.

The reduction rate is defined as

Reduction Rate = 1 - size(GS)/size(TR)    (7)

It has a strong influence on the efficiency of the solutions obtained, due to the cost of the final classification process performed by the 1NN classifier. Fig. 4 illustrates a star plot representing the reduction rate obtained on each data set for both hybrid models. These star plots represent performance as the distance from the center; hence a larger area indicates better average performance. The plots allow us to visualize the performance of the algorithms comparatively for each problem and overall.

Fig. 4. Reduction rate comparison.

Fig. 5 shows a graphical comparison between these methods considering both objectives simultaneously (accuracy and reduction rate), by means of a relative movement diagram [61]. The idea of this diagram is to represent the results of the two methods on each data set with an arrow. The arrow starts at the coordinate origin, and the coordinates of its tip are given by the difference between the reduction rates (x-axis) and the accuracies (y-axis) of IPADECS-DEFW and SSMA-DEPGFW, in this order. Furthermore, numerical results will be presented later in Tables 5 and 6.

Fig. 5. Accuracy/reduction rates comparison for IPADECS-DEFW vs SSMA-DEPGFW (ΔReduction on the x-axis, ΔAccuracy on the y-axis).

Apart from these figures, we use the Wilcoxon test to statistically compare our proposals on both measures. Table 4 collects the results of its application to the accuracy and reduction rates, showing the rankings R+ and R- achieved and the associated p-value.

Table 4. Results of the Wilcoxon signed-ranks test comparing the hybrid schemes.

Comparison | R+ | R- | p-Value
Accuracy rate: SSMA-DEPGFW vs IPADECS-DEFW | 442 | 338 | 0.4639
Reduction rate: IPADECS-DEFW vs SSMA-DEPGFW | 802 | 18 | 4.602 · 10^-10

Observing Figs. 3–5 and Table 4, we can make the following comments:

– Fig. 3 shows that both proposals present similar behavior in many domains. Nevertheless, SSMA-DEPGFW obtains the best average result in 24 of the 40 data sets. The Wilcoxon test confirms this: there are no significant differences between the two approaches, and R+ is greater for SSMA-DEPGFW.
– In terms of reduction capabilities, IPADECS-DEFW is shown to be the best performing hybrid model. As the Wilcoxon test reports, it obtains significant differences with respect to SSMA-DEPGFW.
– In Fig. 5, we observe that most of the arrows point to the right side of the plot. This means that IPADECS-DEFW obtains a higher reduction rate in the problems addressed. Moreover, there is a similar number of arrows pointing up-right and down-left, indicating that the accuracy of both methods is similar. Hence, we can state that IPADECS-DEFW finds the best trade-off between accuracy and reduction rate.

5.2. Comparison with previously proposed methods

In this subsection we perform a comparison between the two proposed hybrid models and the comparison methods established above.
We analyze the results obtained in terms of accuracy on test data and reduction rate. Table 5 shows the test accuracy results for each method considered in the study. For each data set, the mean accuracy (Acc) and the standard deviation (SD) are computed; the best result for each column is highlighted in bold, and the last row presents the average over all data sets. Table 6 presents the reduction rates achieved. Reduction rates are only shown for those methods which perform a relevant reduction of the instances of TR. In this table we can observe that the methods based on SSMA obtain the same average reduction rate. This is because SSMA is used to obtain the appropriate number of prototypes per class, determining the reduction capabilities of the hybrid models SSMA-DEPG and SSMA-DEPGFW at the outset. IPADECS and IPADECS-DEFW obtain slightly different reduction rates because they use the same PG approach but change the fitness function with weights.

To verify the performance of each of our proposals, we have divided the nonparametric statistical study into two parts. First, we compare IPADECS-DEFW and SSMA-DEPGFW with the rest of the comparison methods separately (excluding the other proposal) in terms of test accuracy. Tables 7 and 8 present the results of the FA test for IPADECS-DEFW and SSMA-DEPGFW respectively. In these tables, the computed FA rankings, which represent the associated effectiveness, are presented in the second column. Both tables are ordered from the best (lowest) to the worst (highest) ranking. The third column shows the adjusted p-value (APV) with Holm's test, and the fourth column shows the APV with Finner's test. Note that IPADECS-DEFW and SSMA-DEPGFW are established as control algorithms because they obtained the best FA ranking in their respective studies. The APVs highlighted in bold correspond to methods outperformed by the control at a significance level of α = 0.1.

In this study, we have observed that the hybrid schemes perform well on large data sets (those with more than 2000 instances). We therefore select the large data sets from Table 1 and compare the weighted and unweighted proposals. Fig. 6 shows this comparison. The x-axis position of each point is the accuracy of the original proposal on a single data set, and the y-axis position is the accuracy of the weighted algorithm. Therefore, points above the y = x line correspond to data sets for which the new proposals perform better than the original algorithm. Given Fig. 6 and the results shown before, we can make the following analysis:

– SSMA-DEPGFW and IPADECS-DEFW achieve the best average results. It is important to note that the two hybrid models clearly outperform the methods upon which they are based. The good synergy between PG and FW methods is demonstrated by the results obtained. Specifically, if we focus on the data sets with a large number of features (see Splice, Chess, etc.), we can state that, in general, the hybridization of PG with a DEFW scheme is useful for increasing the classification accuracy obtained. Furthermore, Fig. 6 shows that the proposed weighted algorithms are able to overcome, in most cases, the original proposal when dealing with large data sets.
– Both SSMA-DEPGFW and IPADECS-DEFW achieve the lowest (best) ranking in the comparison. The p-value of the FA test is lower than 10^-5 in both cases, meaning that significant differences have been detected between the methods of the experiment.
Table 5. Accuracy test obtained, as mean Acc with SD in parentheses. Columns: 1NN; SSMA; SSMA-DEPG; SSMA-DEPGFW; IPADECS; IPADECS-DEFW; TSKNN; ReliefF; GOCBR.

Abalone | 19.91 (1.60) | 26.09 (1.41) | 25.66 (1.71) | 25.61 (1.34) | 22.21 (2.34) | 25.47 (2.39) | 24.65 (1.43) | 14.71 (1.85) | 20.75 (1.32)
Banana | 87.51 (1.03) | 89.64 (0.89) | 89.55 (1.14) | 89.94 (1.16) | 84.09 (4.38) | 89.70 (0.97) | 89.51 (0.84) | 68.53 (2.76) | 87.87 (0.87)
Bands | 63.09 (4.65) | 59.02 (8.98) | 69.78 (6.08) | 67.00 (6.55) | 67.15 (5.91) | 69.97 (5.89) | 73.67 (8.33) | 70.15 (6.38) | 71.45 (6.15)
Breast | 65.35 (6.07) | 73.79 (4.05) | 70.32 (7.51) | 69.63 (7.64) | 70.91 (7.15) | 71.00 (8.52) | 72.02 (6.45) | 62.47 (9.71) | 67.14 (8.14)
Bupa | 61.08 (6.88) | 62.79 (8.47) | 66.00 (7.80) | 67.41 (7.96) | 65.67 (8.48) | 67.25 (5.27) | 62.44 (7.90) | 56.46 (4.37) | 61.81 (6.31)
Chess | 84.70 (2.36) | 90.05 (1.67) | 90.61 (2.18) | 95.56 (1.69) | 80.22 (3.81) | 94.52 (1.03) | 95.94 (0.40) | 96.09 (0.57) | 87.48 (1.15)
Cleveland | 53.14 (7.45) | 54.78 (6.29) | 56.15 (6.76) | 55.80 (6.11) | 52.49 (4.48) | 54.14 (6.20) | 56.43 (6.84) | 55.10 (8.62) | 52.80 (5.75)
Coil2000 | 89.63 (0.77) | 94.00 (0.12) | 94.00 (0.12) | 94.00 (0.12) | 94.04 (0.09) | 94.02 (0.09) | 94.03 (0.05) | 94.02 (0.06) | 91.75 (0.40)
Contraceptive | 42.77 (3.69) | 48.14 (5.93) | 48.74 (4.46) | 50.17 (3.35) | 48.54 (4.67) | 54.79 (3.61) | 42.70 (0.22) | 39.99 (6.05) | 43.38 (3.65)
Crx | 79.57 (5.12) | 84.78 (4.90) | 85.65 (4.46) | 85.65 (4.83) | 85.22 (4.80) | 85.07 (4.54) | 86.23 (3.90) | 80.43 (3.62) | 84.20 (3.91)
Dermatology | 95.35 (3.45) | 95.10 (5.64) | 95.37 (4.04) | 94.02 (4.31) | 96.18 (3.01) | 96.73 (2.64) | 96.47 (4.01) | 95.92 (2.77) | 96.46 (2.98)
Flare-solar | 55.54 (3.20) | 65.47 (3.97) | 66.14 (3.42) | 66.95 (3.48) | 66.23 (3.13) | 65.48 (3.25) | 67.16 (4.07) | 57.60 (3.51) | 65.20 (3.13)
German | 70.50 (4.25) | 73.20 (4.69) | 71.90 (3.11) | 72.10 (5.13) | 71.80 (3.25) | 71.40 (4.27) | 71.40 (2.20) | 69.30 (1.42) | 70.30 (5.37)
Glass | 73.61 (11.91) | 68.81 (8.19) | 71.98 (9.47) | 73.64 (8.86) | 69.09 (11.13) | 71.45 (11.94) | 76.42 (13.21) | 80.65 (12.04) | 67.67 (14.10)
Haberman | 66.97 (5.46) | 73.17 (3.75) | 71.53 (6.38) | 73.18 (2.61) | 74.45 (6.40) | 71.53 (4.92) | 74.15 (5.07) | 63.34 (8.42) | 68.94 (6.34)
Hayes-roth | 35.70 (9.11) | 56.18 (13.39) | 75.41 (10.57) | 76.41 (10.49) | 77.05 (7.67) | 75.52 (12.11) | 54.36 (11.56) | 80.20 (10.67) | 67.49 (10.55)
Heart | 77.04 (8.89) | 83.70 (10.10) | 82.22 (8.25) | 85.19 (8.11) | 83.70 (9.83) | 80.74 (9.19) | 81.48 (6.42) | 78.15 (9.72) | 76.67 (8.77)
Housevotes | 92.16 (5.41) | 92.39 (4.99) | 93.55 (5.36) | 94.24 (3.78) | 92.64 (3.71) | 94.00 (4.76) | 95.16 (3.34) | 94.00 (3.48) | 92.83 (6.27)
Iris | 93.33 (5.16) | 96.00 (4.42) | 94.00 (4.67) | 94.67 (4.99) | 94.67 (4.00) | 94.67 (4.00) | 94.00 (4.67) | 94.00 (5.54) | 94.00 (3.59)
Led7digit | 40.20 (9.48) | 34.00 (6.69) | 71.40 (4.90) | 71.80 (4.77) | 72.40 (3.88) | 71.20 (4.66) | 10.80 (3.12) | 63.20 (5.53) | 69.80 (4.42)
Lym | 73.87 (8.77) | 83.03 (13.95) | 80.29 (15.48) | 81.76 (9.83) | 78.41 (9.31) | 80.66 (14.74) | 74.54 (8.95) | 70.43 (22.52) | 79.34 (9.46)
Magic | 80.59 (0.90) | 82.03 (0.75) | 82.31 (0.65) | 83.24 (0.96) | 80.23 (1.47) | 83.17 (1.01) | 83.25 (0.68) | 76.68 (5.46) | 80.66 (0.71)
Mammographic | 73.68 (5.59) | 81.27 (5.32) | 81.27 (5.48) | 81.86 (6.03) | 79.71 (4.41) | 83.67 (5.55) | 82.62 (4.76) | 70.76 (4.28) | 78.67 (3.84)
Marketing | 27.38 (1.34) | 30.87 (1.63) | 31.39 (0.70) | 31.90 (1.34) | 30.69 (1.11) | 31.94 (1.39) | 24.05 (1.33) | 26.45 (1.91) | 27.19 (1.47)
Monks | 77.91 (5.42) | 96.79 (3.31) | 95.44 (3.21) | 98.86 (1.53) | 91.20 (4.76) | 96.10 (2.48) | 100.00 (0.00) | 100.00 (0.00) | 79.21 (7.15)
Newthyroid | 97.23 (2.26) | 96.30 (3.48) | 97.68 (2.32) | 96.73 (3.64) | 98.18 (3.02) | 97.71 (3.07) | 93.48 (2.95) | 97.25 (4.33) | 94.87 (4.50)
Nursery | 82.67 (0.92) | 85.58 (1.17) | 85.38 (1.09) | 92.99 (0.76) | 64.79 (4.58) | 85.10 (1.42) | 82.67 (0.88) | 78.94 (32.09) | 83.53 (1.05)
Pima | 70.33 (3.53) | 74.23 (4.01) | 74.89 (5.81) | 73.23 (5.43) | 76.84 (4.67) | 71.63 (7.35) | 75.53 (5.85) | 70.32 (5.65) | 70.59 (4.88)
Ring | 75.24 (0.82) | 92.86 (1.03) | 93.49 (1.05) | 93.45 (0.64) | 89.70 (1.03) | 91.22 (0.94) | 84.23 (1.17) | 73.08 (1.11) | 74.54 (0.48)
Saheart | 64.49 (3.99) | 71.66 (3.46) | 70.35 (5.10) | 69.47 (4.36) | 70.36 (3.07) | 71.21 (3.37) | 68.22 (11.35) | 60.83 (9.15) | 66.45 (14.20)
Spambase | 89.45 (1.17) | 88.28 (1.72) | 89.84 (0.97) | 88.69 (2.12) | 90.89 (0.95) | 92.50 (1.38) | 92.54 (1.21) | 60.58 (0.08) | 89.82 (1.48)
Spectfheart | 69.70 (6.55) | 74.20 (8.69) | 79.02 (7.31) | 79.68 (10.73) | 80.54 (4.25) | 77.93 (4.70) | 76.01 (10.12) | 78.30 (11.92) | 74.99 (6.87)
Splice | 74.95 (1.15) | 73.32 (1.63) | 78.37 (4.44) | 82.51 (4.80) | 79.78 (3.99) | 88.53 (1.99) | 71.72 (1.72) | 78.24 (1.30) | 74.20 (1.54)
Tae | 40.50 (8.43) | 53.17 (12.66) | 56.54 (15.86) | 58.38 (11.80) | 57.71 (11.11) | 58.33 (12.04) | 30.54 (2.56) | 49.12 (3.77) | 55.00 (3.98)
Thyroid | 92.58 (0.81) | 94.14 (0.74) | 94.58 (0.55) | 96.93 (2.39) | 93.99 (0.36) | 94.28 (0.58) | 95.87 (0.61) | 92.57 (0.26) | 92.85 (0.73)
Titanic | 60.75 (6.61) | 73.51 (2.47) | 78.96 (2.30) | 78.83 (2.22) | 78.19 (2.92) | 79.01 (2.11) | 77.78 (2.79) | 61.33 (7.90) | 78.83 (2.22)
Twonorm | 94.68 (0.73) | 96.34 (0.74) | 96.92 (0.79) | 96.50 (0.72) | 97.66 (0.69) | 97.76 (0.72) | 96.96 (0.87) | 94.65 (1.01) | 94.93 (1.25)
Wisconsin | 95.57 (2.59) | 96.57 (2.65) | 96.14 (2.12) | 96.42 (2.23) | 96.42 (1.94) | 96.28 (1.72) | 96.00 (3.61) | 96.28 (2.14) | 97.14 (3.30)
Yeast | 50.47 (3.91) | 57.55 (1.66) | 58.09 (2.14) | 56.88 (1.60) | 57.35 (3.13) | 59.17 (3.61) | 55.86 (12.99) | 51.55 (4.97) | 53.44 (6.50)
Zoo | 92.81 (6.57) | 85.33 (9.73) | 95.33 (6.49) | 95.83 (9.72) | 96.33 (8.23) | 96.67 (6.83) | 66.25 (8.07) | 96.83 (2.78) | 96.17 (5.16)
Average | 70.80 (20.06) | 75.20 (18.98) | 77.66 (17.24) | 78.43 (17.55) | 76.44 (17.46) | 78.29 (17.29) | 73.68 (22.09) | 72.46 (19.78) | 74.51 (17.89)

– Holm's procedure states that the differences of IPADECS-DEFW over 1NN, ReliefF, GOCBR, TSKNN and SSMA are significant (α = 0.1). Finner's procedure goes further, also highlighting the difference over IPADECS (Finner APV = 0.0926). In the case of SSMA-DEPGFW the results are similar: the differences over 1NN, ReliefF, GOCBR, TSKNN and SSMA are marked as significant by Holm's test (α = 0.1), and Finner's test again also highlights the difference over IPADECS (Finner APV = 0.0643).
– These results suggest that our proposals, SSMA-DEPGFW and IPADECS-DEFW, significantly improve on all the comparison methods considered except SSMA-DEPG. The multiple comparison test applied does not detect significant differences between the three best methods. Hence, we study this last case carefully, applying a pairwise comparison between our proposals and SSMA-DEPG. Specifically, we focus on the Wilcoxon test, which gives further insight into the comparison of this method with our proposals. Table 9 shows the results of its application, comparing SSMA-DEPGFW and IPADECS-DEFW with SSMA-DEPG. The results obtained suggest that SSMA-DEPG is outperformed by the new proposals at the α = 0.1 level. Although this result is not as strong as the differences found by Holm's and Finner's procedures, it still supports the existence of a significant improvement of SSMA-DEPGFW and IPADECS-DEFW over SSMA-DEPG.

5.3. Analyzing scaling up capabilities: a stratified model

In this study, we select the hybrid model IPADECS-DEFW, as the best trade-off between accuracy and reduction rate, to implement a stratified model, considering the two strategies explained in Section 3.4. The performance of this method is analyzed using the six huge data sets taken from the KEEL data set repository (see Table 2). To check the performance of the proposed stratified models, we perform a comparison with the stratified versions of IPADECS and SSMA-DEPG proposed in [40]. Furthermore, the behavior of 1NN has also been analyzed as a baseline for this study. For all the techniques, we used the same setup as in the former study, setting the strata size as close as possible to 5000 instances. Table 2 shows the exact number of strata and instances per stratum. Table 10 shows the accuracy test results for each method considered in this study.
For each data set, the mean accuracy (Acc) and the standard deviation (SD) are computed; the best result for each column is highlighted in bold, and the last row presents the average over all the huge data sets. Table 11 collects the reduction rate achieved by each method. In this table, both IPADECS-DEFW variants obtain the same average reduction rate.

Table 6. Reduction rates obtained. The three SSMA-based methods (SSMA, SSMA-DEPG, SSMA-DEPGFW) share identical per-data-set values and are shown as a single column.

Data set | SSMA / SSMA-DEPG / SSMA-DEPGFW | IPADECS | IPADECS-DEFW
Abalone | 0.9749 | 0.9886 | 0.9882
Banana | 0.9900 | 0.9981 | 0.9981
Bands | 0.9567 | 0.9872 | 0.9866
Breast | 0.9790 | 0.9820 | 0.9829
Bupa | 0.9417 | 0.9848 | 0.9842
Chess | 0.9782 | 0.9981 | 0.9981
Cleveland | 0.9710 | 0.9600 | 0.9600
Coil2000 | 0.9999 | 0.9997 | 0.9997
Contraceptive | 0.9672 | 0.9926 | 0.9922
Crx | 0.9844 | 0.9929 | 0.9929
Dermatology | 0.9663 | 0.9806 | 0.9806
Flare-solar | 0.9955 | 0.9969 | 0.9969
German | 0.9686 | 0.9940 | 0.9920
Glass | 0.9237 | 0.9393 | 0.9393
Haberman | 0.9840 | 0.9904 | 0.9891
Hayes-roth | 0.9006 | 0.9436 | 0.9436
Heart | 0.9716 | 0.9853 | 0.9831
Housevotes | 0.9826 | 0.9849 | 0.9849
Iris | 0.9630 | 0.9748 | 0.9748
Led7digit | 0.9693 | 0.9747 | 0.9747
Lym | 0.9504 | 0.9594 | 0.9594
Magic | 0.9808 | 0.9996 | 0.9996
Mammographic | 0.9895 | 0.9938 | 0.9938
Marketing | 0.9825 | 0.9961 | 0.9961
Monks | 0.9750 | 0.9910 | 0.9910
Newthyroid | 0.9700 | 0.9835 | 0.9835
Nursery | 0.9396 | 0.9992 | 0.9992
Pima | 0.9780 | 0.9916 | 0.9916
Ring | 0.9902 | 0.9956 | 0.9956
Saheart | 0.9735 | 0.9931 | 0.9911
Spambase | 0.9805 | 0.9971 | 0.9971
Spectfheart | 0.9696 | 0.9817 | 0.9817
Splice | 0.9679 | 0.9947 | 0.9947
Tae | 0.9139 | 0.9558 | 0.9558
Thyroid | 0.9982 | 0.9992 | 0.9989
Titanic | 0.9960 | 0.9990 | 0.9987
Twonorm | 0.9952 | 0.9993 | 0.9993
Wisconsin | 0.9932 | 0.9951 | 0.9951
Yeast | 0.9681 | 0.9858 | 0.9858
Zoo | 0.9010 | 0.9086 | 0.9086
Average | 0.9695 | 0.9842 | 0.9840

Table 7. Average FA rankings of IPADECS-DEFW and the rest of the comparison methods (p-value of the FA test: 9.915 · 10^-6).

Algorithm | FA ranking | Holm APV | Finner APV
IPADECS-DEFW | 97.6750 | – | –
SSMA-DEPG | 106.8875 | 0.6561 | 0.6561
IPADECS | 133.9000 | 0.1599 | 0.0926
SSMA | 149.0250 | 0.0392 | 0.0182
TSKNN | 153.1375 | 0.0294 | 0.0128
GOCBR | 190.4875 | 0 | 0
ReliefF | 209.0500 | 0 | 0
1NN | 243.8375 | 0 | 0

Table 8. Average FA rankings of SSMA-DEPGFW and the rest of the comparison methods (p-value of the FA test: 9.644 · 10^-6).

Algorithm | FA ranking | Holm APV | Finner APV
SSMA-DEPGFW | 93.8875 | – | –
SSMA-DEPG | 108.0750 | 0.4929 | 0.4929
IPADECS | 133.5250 | 0.1107 | 0.0643
SSMA | 149.5250 | 0.0215 | 0.0100
TSKNN | 155.2250 | 0.0121 | 0.0053
GOCBR | 190.8000 | 0 | 0
ReliefF | 208.7250 | 0 | 0
1NN | 244.2375 | 0 | 0

Fig. 6. Accuracy results over large data sets: original vs weighted version (accuracy of the original proposal on the x-axis, accuracy of the weighted proposal on the y-axis, with the y = x diagonal as reference; series: SSMA-DEPG vs SSMA-DEPGFW and IPADECS vs IPADECS-DEFW).

Table 9. Results of the Wilcoxon signed-ranks test.

Comparison | R+ | R- | p-Value
SSMA-DEPGFW vs SSMA-DEPG | 586.5 | 233.5 | 0.0469
IPADECS-DEFW vs SSMA-DEPG | 521 | 259 | 0.0682

In Table 10, we observe that the IPADECS-DEFW model with the join procedure obtains the best average accuracy result. The Wilcoxon test has been conducted to compare this method with the rest; Table 12 shows the results of its application. As in the former study, IPADECS-DEFW obtains slightly lower reduction power than IPADECS, as can be seen in Table 11. Observing these tables, we can conclude that, with an appropriate stratification procedure, the idea of combining PG and FW is also applicable to huge data sets, obtaining good results. The Wilcoxon test supports this statement, showing that IPADECS-DEFW with the join procedure is able to significantly outperform IPADECS, SSMA-DEPG and 1NN at a level of α = 0.1.

Table 10. Accuracy test results on the huge data sets, as mean Acc with SD in parentheses.

Data set | 1NN | IPADECS | SSMA-DEPG | IPADECS-DEFW (Join) | IPADECS-DEFW (Voting rule)
Adult | 0.7960 (0.0035) | 0.8263 (0.0032) | 0.8273 (0.0098) | 0.8335 (0.0077) | 0.8313 (0.0031)
Census | 0.9253 (0.0010) | 0.9439 (0.0005) | 0.9460 (0.0009) | 0.9477 (0.0007) | 0.9428 (0.0300)
Connect-4 | 0.6720 (0.0036) | 0.6569 (0.0009) | 0.6794 (0.0061) | 0.6847 (0.0058) | 0.6624 (0.0045)
Fars | 0.7466 (0.0034) | 0.7439 (0.0218) | 0.7625 (0.0036) | 0.7676 (0.0039) | 0.7536 (0.0033)
Letter | 0.9592 (0.0002) | 0.9420 (0.0082) | 0.9053 (0.0082) | 0.9632 (0.0121) | 0.9699 (0.0075)
Shuttle | 0.9993 (0.0004) | 0.9941 (0.0021) | 0.9967 (0.0021) | 0.9967 (0.0008) | 0.9967 (0.0015)
Average | 0.8497 (0.0020) | 0.8512 (0.0061) | 0.8529 (0.0051) | 0.8656 (0.0052) | 0.8595 (0.0083)

Table 11. Reduction rate results on the huge data sets.

Data set | IPADECS | SSMA-DEPG | IPADECS-DEFW (Join) | IPADECS-DEFW (Voting rule)
Adult | 0.9986 | 0.9882 | 0.9986 | 0.9986
Census | 0.9994 | 0.9973 | 0.9987 | 0.9987
Connect-4 | 0.9990 | 0.9822 | 0.9981 | 0.9981
Fars | 0.9968 | 0.9808 | 0.9957 | 0.9957
Letter | 0.9924 | 0.9805 | 0.9901 | 0.9901
Shuttle | 0.9986 | 0.9981 | 0.9971 | 0.9971
Average | 0.9975 | 0.9878 | 0.9964 | 0.9964

Table 12. Results obtained by the Wilcoxon test for the algorithm IPADECS-DEFW (Join).

IPADECS-DEFW (Join) vs | R+ | R- | p-Value
1NN | 20.0 | 1.0 | 0.0625
IPADECS | 21.0 | 0.0 | 0.0312
SSMA-DEPG | 15.0 | 0.0 | 0.0625
IPADECS-DEFW (Voting rule) | 12.0 | 3.0 | 0.1775

6. Conclusions

In this paper, we have introduced a novel data reduction technique which exploits the cooperation between FW and PG to improve the classification performance of NN, its storage requirements and its running time. A self-adaptive DE algorithm has been used to optimize the feature weights and the positioning of the prototypes for the nearest neighbor algorithm, acting as an FW scheme and a PG method, respectively. The proposed DEFW scheme has been incorporated within two of the most promising PG methods. These hybrid models are able to overcome isolated PG methods because FW changes the way in which distances between prototypes are measured, and therefore the adjustment of prototypes can be refined further. Furthermore, we have proposed a stratified procedure specifically designed to deal with huge data sets. The wide experimental study performed has allowed us to contrast the behavior of these hybrid models when dealing with a wide variety of data sets with different numbers of instances and features. The proposed stratified procedure has shown that this technique is useful for tackling the scaling up problem. The results have been contrasted with several nonparametric statistical procedures, which have supported the conclusions drawn. As future work, we consider that this methodology could be extended to different learning algorithms, such as support vector machines or decision trees, following the guidelines given in similar studies on training set selection [62–64].
Acknowledgments

Supported by the Research Projects TIN2011-28488 and TIC6858. J. Derrac holds an FPU scholarship from the Spanish Ministry of Education and Science.
Isaac Triguero Velázquez received the M.Sc. degree in Computer Science from the University of Granada, Granada, Spain, in 2009. He is currently a Ph.D. student in the Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain. His research interests include data mining, semi-supervised learning, data reduction and evolutionary algorithms.

Joaquín Derrac Rus received the M.Sc. degree in Computer Science from the University of Granada, Granada, Spain, in 2008. He is currently a Ph.D. student in the Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain. His research interests include data mining, data reduction, statistical inference and evolutionary algorithms.

Salvador García López received the M.Sc. and Ph.D. degrees in Computer Science from the University of Granada, Granada, Spain, in 2004 and 2008, respectively. He is currently an Assistant Professor in the Department of Computer Science, University of Jaén, Jaén, Spain. He has published more than 25 papers in international journals and has co-edited two special issues of international journals on different data mining topics. His research interests include data mining, data reduction, data complexity, imbalanced learning, semi-supervised learning, statistical inference and evolutionary algorithms.

Francisco Herrera Triguero received the M.Sc. in Mathematics in 1988 and the Ph.D. in Mathematics in 1991, both from the University of Granada, Spain. He is currently a Professor in the Department of Computer Science and Artificial Intelligence at the University of Granada. He has published more than 200 papers in international journals. He is a coauthor of the book "Genetic Fuzzy Systems: Evolutionary Tuning and Learning of Fuzzy Knowledge Bases" (World Scientific, 2001). He currently acts as Editor in Chief of the international journal "Progress in Artificial Intelligence" (Springer) and serves as area editor of the journal Soft Computing (area of evolutionary and bioinspired algorithms) and the International Journal of Computational Intelligence Systems (area of information systems). He acts as associate editor of the journals IEEE Transactions on Fuzzy Systems, Information Sciences, Advances in Fuzzy Systems, and International Journal of Applied Metaheuristics Computing, and he serves as a member of several journal editorial boards, among others: Fuzzy Sets and Systems, Applied Intelligence, Knowledge and Information Systems, Information Fusion, Evolutionary Intelligence, International Journal of Hybrid Intelligent Systems, Memetic Computation, and Swarm and Evolutionary Computation. He received the following honors and awards: ECCAI Fellow 2009, 2010 Spanish National Award on Computer Science ARITMEL to the "Spanish Engineer on Computer Science", and the International Cajastur "Mamdani" Prize for Soft Computing (Fourth Edition, 2010). His current research interests include computing with words and decision making, data mining, data preparation, instance selection, fuzzy rule based systems, genetic fuzzy systems, knowledge extraction based on evolutionary algorithms, memetic algorithms and genetic algorithms.

1.5 MRPR: A MapReduce Solution for Prototype Reduction in Big Data Classification

• I. Triguero, D. Peralta, J. Bacardit, S. García, F. Herrera, MRPR: A MapReduce Solution for Prototype Reduction in Big Data Classification. Neurocomputing. – Status: Submitted.
MRPR: A MapReduce Solution for Prototype Reduction in Big Data Classification

Isaac Triguero (a), Daniel Peralta (a), Jaume Bacardit (b), Salvador García (c), Francisco Herrera (a)

(a) Department of Computer Science and Artificial Intelligence, CITIC-UGR (Research Center on Information and Communications Technology), University of Granada, 18071 Granada, Spain
(b) School of Computing Science, Newcastle University, NE1 7RU, Newcastle, UK
(c) Department of Computer Science, University of Jaén, 23071 Jaén, Spain

Abstract

In the era of big data, analyzing and extracting knowledge from large-scale data sets is a very interesting and challenging task. The application of standard data mining tools to such data sets is not straightforward. Hence, a new class of scalable mining methods that embraces the huge storage and processing capacity of cloud platforms is required. In this work, we propose a novel distributed partitioning methodology for prototype reduction techniques in nearest neighbor classification. These methods aim at representing original training data sets as a reduced number of instances.
Their main purposes are to speed up the classification process and reduce the storage requirements and sensitivity to noise of the nearest neighbor rule. However, the standard prototype reduction methods cannot cope with very large data sets. To overcome this limitation, we develop a MapReduce-based framework to distribute the functioning of these algorithms through a cluster of computing elements, proposing several algorithmic strategies to integrate multiple partial solutions (reduced sets of prototypes) into a single one. The proposed model enables prototype reduction algorithms to be applied over big data classification problems without significant accuracy loss. We test the speeding up capabilities of our model with data sets of up to 5.7 million instances. The results show that this model is a suitable tool to enhance the performance of the nearest neighbor classifier with big data.

Keywords: Big data, Mahout, Hadoop, Prototype reduction, Prototype generation, Nearest neighbor classification

1. Introduction

The term big data is increasingly being used to refer to the challenges and advantages derived from collecting and processing vast amounts of data [1]. Formally, it is defined as the quantity of data that exceeds the processing capabilities of a given system [2] in terms of time and/or memory consumption. It is attracting much attention in a wide variety of areas, such as industry, medicine or financial businesses, which have progressively acquired large amounts of raw data. Nowadays, with the availability of cloud platforms [3], these areas could take advantage of such massive data sets by extracting valuable information. However, the analysis and knowledge extraction process from big data becomes a very difficult task for most of the classical and advanced data mining and machine learning tools [4, 5]. Data mining techniques should be adapted to the emerging technologies [6, 7] to overcome their limitations.

In this sense, the MapReduce framework [8, 9], in conjunction with its distributed file system [10], originally introduced by Google, offers a simple but robust environment for tackling the processing of large data sets over a cluster of machines. This scheme is currently preferred in data mining over other parallelization schemes, such as MPI (Message Passing Interface) [11], because of its fault-tolerant mechanism, which is crucial for time-consuming jobs, and because of its simplicity. In the specialized literature, several recent proposals have focused on the parallelization of machine learning tools based on the MapReduce approach [12, 13]. For example, some classification techniques, such as [14, 15, 16], have been implemented within the MapReduce paradigm. They have shown that the distribution of the data and its processing under a cloud computing infrastructure is very useful for speeding up the knowledge extraction process.

Data reduction techniques [17] emerged as preprocessing algorithms that aim to simplify and clean the raw data, enabling data mining algorithms to be applied not only in a faster way, but also in a more accurate way, by removing noisy and redundant data.
From the perspective of the attribute space, the most well-known data reduction processes are feature selection and feature extraction [18]. In the instance space, we highlight instance reduction methods, which are usually divided into instance selection [19] and instance generation or abstraction [20]. Advanced models that tackle both problems simultaneously can be found in [21, 22, 23]. In principle, these techniques should help data mining algorithms to address big data problems; however, they are themselves affected by the increase in the size and complexity of data sets and become unable to provide a preprocessed data set in a reasonable time.

This work is focused on Prototype Reduction (PR) techniques [20], which are instance reduction methods that aim to improve the classification capabilities of the Nearest Neighbor rule (NN) [24]. These techniques may select instances from the original data set, or build new artificial prototypes, to form a resulting set of prototypes that better adjusts the decision boundaries between classes in NN classification. PR techniques have proved to be very competitive at reducing the computational cost and high storage requirements of the NN algorithm, and also at improving its classification performance [25, 26, 27].

Large-scale data cannot be tackled by standard data reduction techniques because their runtime becomes impractical. Several solutions have been developed to enable data reduction techniques to deal with this problem. For PR, we can find a data-level approach that is based on a distributed partitioning model that maintains the class distribution (also called stratification). This splits the original training data into several subsets that are individually addressed and then joins each partial reduced set into a global solution. This approach has been used for instance selection [28, 29] and generation [30] with promising results. However, two main problems appear when we increase the data set size:

• A stratified partitioning process cannot be carried out when the size of the data set is so big that it occupies all the available RAM memory.

• This scheme does not consider that joining each partial solution into a global one could generate a reduced set with redundant or noisy instances that may damage the classification performance.

In this work, we propose a new distributed framework for PR, based on the stratification procedure, which handles the drawbacks mentioned above. To do so, we rely on the success of the MapReduce framework, carefully designing the map and reduce tasks to perform a proper PR process. Concretely, the map phase corresponds to the splitting procedure and the application of the PR technique. The reduce stage performs a filtering or fusion of prototypes to avoid the introduction of harmful prototypes into the resulting preprocessed data set. We will denote this framework "MapReduce for Prototype Reduction" (MRPR).

The idea of splitting the data into several subsets, and processing them separately, fits the MapReduce philosophy better than other parallelization schemes for two reasons: firstly, each subset is processed independently, so no data exchange between nodes is needed to proceed [31]; secondly, the computational cost of each chunk can be so high that a fault-tolerant mechanism is mandatory. For the reduce stage we study three different strategies, of varying computational effort, for the integration of the partial solutions generated by the mappers.
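To make this design concrete before detailing it, the following minimal Python sketch simulates the intended data flow in a single process. It is only an illustration under stated assumptions: prototype_reduction, filter_set and fuse_set are hypothetical placeholders for a PR technique and the two reducer variants, and the real system runs on a Hadoop cluster rather than sequentially.

# Sequential simulation of the MRPR data flow; not actual Hadoop code.
def mrpr(TR, m, reduce_type, prototype_reduction, filter_set, fuse_set):
    # Map phase: split the training set into m disjoint chunks and apply
    # the PR technique to each chunk independently.
    chunks = [TR[j::m] for j in range(m)]
    RS = []
    for TR_j in chunks:  # each iteration plays the role of one mapper
        RS_j = prototype_reduction(TR_j)
        # Reduce stage: aggregate each partial reduced set as it arrives.
        RS.extend(RS_j)  # plain 'join'
        if reduce_type == "filtering":
            RS = filter_set(RS)  # remove noisy prototypes
        elif reduce_type == "fusion":
            RS = fuse_set(RS)    # merge redundant prototypes
    return RS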
Developing a distributed partitioning scheme based on MapReduce for PR motivates the global purpose of this work, which can be divided into three objectives:

• To enable PR techniques to deal with big data classification problems.

• To analyze and illustrate the scalability of the proposed scheme in terms of classification accuracy and runtime.

• To study how PR techniques enhance the NN rule when dealing with big data.

To test the performance of our model, we will conduct experiments on big data sets focusing on an advanced PR technique, called SSMA-SFLSDE, which was recently proposed in [27]. Moreover, some additional experiments with other PR techniques will also be carried out. The experimental study includes an analysis of the training and test accuracy, runtime and reduction capabilities of PR techniques under the proposed framework. Several variations of the proposed model will be investigated with different numbers of mappers and four data sets of up to 5.7 million instances.

The rest of the paper is organized as follows. In Section 2, we provide some background material about PR and MapReduce. In Section 3, we describe the MapReduce implementation proposed for PR and discuss which PR methods are candidates to be adapted to this framework. We present and discuss the empirical results in Section 4. Finally, Section 5 summarizes the conclusions of the paper.

2. Background

In this section we provide some background information about the topics used in this paper. Section 2.1 presents the PR problem and its weaknesses in dealing with big data. Section 2.2 introduces the MapReduce paradigm and the implementation used in this work.

2.1. Prototype reduction and big data

This section defines the PR problem, its current trends and the drawbacks of tackling big data with PR techniques.

A formal notation of the PR problem is the following: let TR be a training data set and TS a test set, formed by n and t samples, respectively. Each sample x_p is a tuple (x_{p1}, x_{p2}, ..., x_{pD}, ω) in a D-dimensional space, where x_{pf} is the value of the f-th feature of the p-th sample and ω is the class it belongs to, given by x_{pω}. For the TR set the class ω is known, while it is unknown for TS.

The purpose of PR is to provide a reduced set RS consisting of rs prototypes, with rs < n, which are either selected or generated from the examples of TR. The prototypes of RS should be calculated to efficiently represent the distributions of the classes and to discriminate well when they are used to classify the training objects. The size of RS should be sufficiently reduced to deal with the storage and evaluation time problems of the NN classifier.

As we stated above, PR is usually divided into approaches that are limited to selecting instances from TR, known as prototype selection, and approaches that may generate artificial examples (prototype generation). Both strategies have been deeply studied in the literature. Most of the recent proposals are based on evolutionary algorithms to select [32, 33] or generate [25, 26] an appropriate RS. Furthermore, there is a hybrid approach between prototype selection and generation in [27]. Recent reviews about these topics are [19] and [20]. More information about PR can be found at the SCI2S thematic public website on Prototype Reduction in Nearest Neighbor Classification: Prototype Selection and Prototype Generation (http://sci2s.ugr.es/pr/).
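To illustrate the notation, here is a minimal 1-NN sketch in Python where a reduced set RS plays the role of the training data; this is a toy stand-in for the NN rule, not the implementation used in the experiments.

import math

def nn_classify(x, RS):
    # RS is a list of (features, label) prototypes; the query x receives
    # the label of its nearest prototype under the Euclidean distance.
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    _, label = min(RS, key=lambda prototype: dist(x, prototype[0]))
    return label

# With rs << n prototypes, classifying the t examples of TS costs
# O(t * rs * D) operations instead of the O(t * n * D) required with TR.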
Despite the promising results shown by PR techniques with small and medium data sets, they lack the scalability needed to address big TR data sets (from tens of thousands of instances onwards [29]). The main problems found in dealing with large-scale data are:

• Runtime: The complexity of PR models is O((n · D)^2) or higher, where n is the number of instances and D the number of features. Although these techniques are only applied once to a TR, if this process takes too long, their application could become impractical for real applications.

• Memory consumption: Most PR methods need to store in main memory many partial calculations, intermediate solutions, and/or the entire TR. When TR is too big, these requirements could easily exceed the available RAM memory.

As we will see in further sections, these weaknesses motivate the use of distributed partitioning procedures, which divide the TR into disjoint subsets that can be managed by PR methods [28].

2.2. MapReduce

MapReduce is a parallel programming paradigm [8, 9] designed to process or generate large data sets. It allows us to tackle big data sets over a computer cluster regardless of the underlying hardware or software. It is characterized by its high transparency for programmers, which allows applications to be parallelized in an easy and comfortable way. Based on functional programming, this model works in two different steps: the map phase and the reduce phase. Each one has key-value (<k, v>) pairs as input and output. Both phases are defined by the programmer. The map phase takes each <k, v> pair and generates a set of intermediate <k, v> pairs. Then, MapReduce merges all the values associated with the same intermediate key into a list (known as the shuffle phase). The reduce phase takes that list as input for producing the final values. Figure 1 depicts a flowchart of the MapReduce framework.

In a MapReduce program, all map and reduce operations run in parallel. First of all, all map functions are run independently. Meanwhile, reduce operations wait until their respective maps are finished. Then, they process different keys concurrently and independently. Note that the inputs and outputs of a MapReduce job are stored in an associated distributed file system that is accessible from any computer of the used cluster.

Figure 1: Flowchart of the MapReduce framework

An illustrative example of the way MapReduce works is finding the average cost per year in a big list of cost records. Each record may be composed of a variety of values, but it at least includes the year and the cost. The map function extracts from each record the pairs <year, cost> and transmits them as its output. The shuffle stage groups the <year, cost> pairs by their corresponding year, creating a list of costs per year, <year, list(cost)>. Finally, the reduce phase computes the average of all the costs contained in the list of each year.

Different implementations of the MapReduce framework are possible [8], depending on the available cluster architecture. Some implementations of MapReduce are Mars [34], Phoenix [35] and Apache Hadoop [36, 37]. In this paper we will focus on the Hadoop implementation because of its performance, open source nature, installation facilities and its distributed file system (Hadoop Distributed File System, HDFS).
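Returning to the average-cost example, the following minimal Python sketch simulates the three stages with made-up records (the real Hadoop API is Java-based; this only illustrates the data flow):

from collections import defaultdict

records = [(2010, 120.0), (2011, 80.0), (2010, 60.0), (2011, 100.0)]

# Map: emit a <year, cost> pair for each record.
pairs = [(year, cost) for (year, cost) in records]

# Shuffle: group all values associated with the same intermediate key,
# producing <year, list(cost)> pairs.
grouped = defaultdict(list)
for year, cost in pairs:
    grouped[year].append(cost)

# Reduce: average the costs contained in the list of each year.
averages = {year: sum(costs) / len(costs) for year, costs in grouped.items()}
print(averages)  # {2010: 90.0, 2011: 90.0}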
A Hadoop cluster is formed by a master-slave architecture, where one master node manages an arbitrary number of slave nodes. The HDFS replicates file data in multiple storage nodes that can access the data concurrently. In such a cluster, a certain percentage of the slave nodes may be temporarily out of order. For this reason, Hadoop provides a fault-tolerant mechanism, so that when one node fails, it automatically restarts the task on another node.

As we commented above, the MapReduce approach can be useful for many different tasks. In terms of data mining, it offers a propitious environment to successfully speed up these kinds of techniques. In fact, there is a growing open source project, called Apache Mahout [38], that collects distributed and scalable machine learning algorithms implemented on top of Hadoop. Nowadays, it supplies an implementation of several specific techniques, such as k-means for clustering, a naive Bayes classifier and collaborative filtering. We base our implementations on this library.

3. MRPR: MapReduce for prototype reduction

In this section we present the proposed MapReduce approach for PR. Firstly, we discuss the motivation that justifies our proposal (Section 3.1). Then, we detail the proposed model in depth (Section 3.2). Finally, we comment on which PR methods can be implemented within the proposed framework depending on their main characteristics (Section 3.3).

3.1. Motivation

As mentioned before, PR methods decrease their performance when dealing with large amounts of instances. The distribution and parallelization of the workload in different sub-processes may ease the problems previously enumerated (runtime and memory consumption). To tackle this challenge we have to create an efficient and flexible PR design that takes advantage of parallelization schemes and cloud-enabled infrastructures. The designed framework should enable PR techniques to be applied to data sets with an arbitrarily large number of instances, without major algorithmic modifications, just by using more computers. Furthermore, this model should guarantee that the objectives of PR models are maintained; that is, it should provide high reduction rates without significant accuracy loss.

In our previous work [30], a distributed partitioning approach was proposed to alleviate these issues. This model splits the training set TR into d disjoint subsets (TR_1, TR_2, ..., TR_d) with equal class distribution and size. Then, a PR model is applied to each TR_j, obtaining a resulting reduced set RS_j. Finally, all the RS_j (1 ≤ j ≤ d) are merged into a final reduced set, called RS, which is used to classify the instances of TS with the NN rule. This partitioning process has shown to perform well in medium size domains. However, it has some limitations:

• Maintaining the proportion of examples per class of TR within each subset TR_j cannot be accomplished when the size of the data set does not fit in the main memory. Hence, this strategy cannot scale to data sets of arbitrary size.

• Joining all the partial reduced sets RS_j into a final RS may lead to the introduction of noisy and/or redundant examples. Each resulting RS_j tries to represent, with the minimum number of instances, a proportion of the entire TR. Thus, when the size of TR is very large, the instances contained in some TR_j subsets may be located very near to each other in the D-dimensional space. Therefore, the final RS may enclose unnecessary instances to represent the training data. The likelihood of this issue increases with the number of partitions.
Moreover, it is important to note that this distributed model was not implemented within any parallel environment that ensures high scalability and fault tolerance. These weaknesses motivate the design of a parallel PR system based on cloud technologies.

In [30], we compared some relevant PR methods with the distributed partitioning model. We concluded that the best performing approach was the SSMA-SFLSDE model [27]. In our experiments, we will mainly focus on this PR model (although other models will be investigated).

3.2. Parallelizing PR with MapReduce

This section explains how to parallelize PR techniques following a MapReduce procedure. Section 3.2.1 details the map phase and Section 3.2.2 presents the reduce stage. At the end of the section, Figure 3 illustrates a high-level scheme of the proposed parallel system MRPR.

3.2.1. Map phase

Suppose a training set TR, of a given size, stored in the HDFS as a single file. The first step of MRPR is devoted to splitting TR into a given number of disjoint subsets. From a Hadoop perspective, the TR file is composed of h HDFS blocks that are accessible from any computer of the cluster, independently of its size. Let m be the number of map tasks (a user-defined parameter). Each map task (Map_1, Map_2, ..., Map_m) will form an associated TR_j, where 1 ≤ j ≤ m, with the instances of one of the chunks into which the training set file is divided. It is noteworthy that this partitioning process is performed sequentially, so that Map_j corresponds to the j-th data chunk of h/m HDFS blocks. Thus, each map will process approximately the same number of instances.

Under this scheme, if the partitioning procedure were applied directly over TR, the class distribution of each subset TR_j could be biased by the original distribution of instances in the file. As we stated before, a proper stratified partitioning cannot be carried out if the size of TR does not fit in the main memory. In order to develop a scheme easily scalable to any number of instances, we previously randomize the entire file. This operation is not time-consuming in comparison with the application of the PR technique and needs to be applied only once. It does not ensure that every class is represented proportionally to its number of instances in TR. However, probabilistically, each chunk should include approximately a number of instances of class ω according to the probability of belonging to this class in the original TR.

When each map has formed its corresponding TR_j, a PR step is performed using TR_j as the input training data. This step generates a reduced set RS_j. Note that PR techniques may consume different computational times even when they are applied to data sets of similar characteristics; this mainly depends on the stopping criteria of each PR model. Nevertheless, MapReduce starts the reduce phase as soon as the first mapper has finished. Figure 2 contains the pseudo-code of the map function. This function is basically the application of the PR technique to each training partition. As each map finishes its processing, its results are forwarded to a single reduce task.

3.2.2. Reduce phase

The reduce phase consists of the iterative aggregation of all the RS_j into a single set RS. Figure 2 shows the pseudo-code of the implemented reduce function. Initially RS = ∅. To do so, we propose different alternatives:

• Join: This simple option, based on stratification, concatenates all the RS_j sets into a final reduced set RS.
Instruction 7 indicates how the reduce function progressively joins all the RS_j as the mappers finish their processing. This type of reducer implements the same strategy used in the distributed partitioning procedure that we previously proposed [30]. As such, this joining process does not guarantee that the resulting RS is free of irrelevant or even harmful instances, but it is included as a baseline.

• Filtering: This alternative explores the idea of a filtering stage that removes noisy instances during the formation of RS. It is based on those prototype selection methods belonging to the edition family [19]. This kind of method commonly relies on simple heuristics that discard points that are noisy or do not agree with their neighbors, supplying smoother decision boundaries for the NN classifier. In general, edition schemes enhance the generalization capabilities by performing a slight reduction of the original training set. These characteristics are very appropriate for the current stage of our framework: the map phase has already reduced each partition to a subset of representative instances, so when aggregating them into a single RS set we do not aim to reduce RS further, but to remove noisy instances, if any. Therefore, the reduce function iteratively applies a filtering to the current RS; that is, as the mappers end their execution, the reduce function is run and the next RS is computed as the filtered set obtained from its current content and the new RS_j. This is described in instructions 8-10 of Figure 2.

• Fusion: In this variant we aim to eliminate redundant prototypes. To accomplish this objective we rely on the success of centroid-based methods for prototype generation [20]. These techniques reduce a prototype set by merging similar examples [39]. Since in this step we have to fuse all the RS_j into a single set, these methods can be very useful to generate a final set without redundant or very similar prototypes. As in the previous scheme, the fusion phase is progressively applied during the creation of RS. Instructions 11-13 of Figure 2 show how the fusion phase is applied in the MapReduce framework.

1:  function MAP(number of split j)
2:      Constitute TR_j with the instances of split j
3:      RS_j = PrototypeReduction(TR_j)
4:      return RS_j
5:  end function
6:  function REDUCE(RS_j, typeOfReducer)      ⊲ Initially RS = ∅
7:      RS = RS ∪ RS_j
8:      if typeOfReducer == Filtering then
9:          RS = Filtering(RS)
10:     end if
11:     if typeOfReducer == Fusion then
12:         RS = Fusion(RS)
13:     end if
14:     return RS
15: end function

Figure 2: Map and reduce functions

As we have explained, MRPR uses one single reducer, which is run every time a mapper is completed. With the adopted strategy, using a single reducer is computationally less expensive than using more than one, and it decreases the MapReduce overhead (especially the network overhead) [40].

In summary, Figure 3 outlines the way of working of the MRPR framework, differentiating between the map and reduce phases. It emphasizes how the single reducer works and how it forms the final RS. The resulting RS will be used as the training set for the NN rule to classify the unseen data of the TS set.

Figure 3: MRPR scheme
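As a concrete illustration of the filtering reducer, the following minimal Python sketch discards every prototype whose class disagrees with the majority class of its k nearest neighbors, in the spirit of edition methods such as ENN. It is a simplified sketch, not the exact implementation used later in the experiments.

import math
from collections import Counter

def edition_filter(RS, k=3):
    # RS is a list of (features, label) prototypes. A prototype is kept
    # only if it agrees with the majority class of its k nearest neighbors.
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    kept = []
    for i, (x, y) in enumerate(RS):
        others = [RS[j] for j in range(len(RS)) if j != i]
        neighbors = sorted(others, key=lambda p: dist(x, p[0]))[:k]
        majority = Counter(label for _, label in neighbors).most_common(1)[0][0]
        if majority == y:
            kept.append((x, y))
    return kept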
3.3. Which PR methods are more suitable for the MRPR framework?

In this subsection we explain which kinds of PR techniques fit the proposed MRPR framework in its respective stages. In the map phase, the main prototype reduction process is carried out by a PR technique. Then, depending on the selected reduce type, we should select a filtering or a fusion PR technique to combine the resulting reduced sets. In what follows, we discuss which PR techniques are more appropriate for these stages and how to combine them.

All PR algorithms take a training set (in our case TR_j) as input and return a reduced set RS_j. Therefore, all of them could be implemented in the map phase of MRPR according to the description given above. However, depending on their characteristics (reduction, accuracy and runtime), we should take into consideration the following aspects to select a proper PR algorithm:

• A very accurate PR technique is desirable. However, in many PR techniques this implies a low reduction rate, and a resulting RS with an excessive number of instances can negatively influence the time needed by the reduce phase.

• The runtime consumption of a PR algorithm will determine the necessary number of mappers into which the TR set of a given problem should be divided. Depending on the problem tackled, a very high number of mappers may result in subsets TR_j that are not representative of the original TR.

According to [19, 20], there are six main PR families: edition [41], condensation [42], hybrid approaches [43], positioning adjustment [25], centroid-based [44] and space splitting [45]. Although there are differences between the methods of each family, most of them behave in a similar way. With these notes in mind, we can state the following general recommendations:

• Edition-based methods are focused on cleaning the training set by removing noisy data. Thus, these methods are usually very fast and accurate, but they obtain a very low reduction rate. To implement these methods in our framework we recommend the use of a very fast reduce phase: for instance, a simple join scheme, a filtering reducer with the ENN method [41] or a fusion reducer based on PNN [39].

• Condensation, hybrid and space splitting approaches commonly offer a good trade-off between reduction, accuracy and runtime. Their reduction rate is normally around 60-80%, so, depending on the problem addressed, the reducer should have a moderate time consumption. For example, we recommend the use of ENN [41] or Depur [46] for filtering reducers and GMCA [44] for fusion.

• Positioning adjustment techniques may offer a very high reduction rate, which may even be adjustable as a user-defined parameter. These techniques can provide very accurate results in a relatively moderate runtime. To implement them we suggest the inclusion of very accurate reducers, such as ICPL [47] for fusion, because the high reduction rate will allow them to be applied in a fast way.

• Centroid-based algorithms are very accurate, with a moderate reduction power, but (in general) very time-consuming. Although their implementation in the map phase is feasible and could be useful in some problems, we assume that their use should be limited to the later stage (reduce phase).

As general suggestions to combine PR techniques in the map and reduce phases, we can establish the following rules:

• High reduction rates in the map phase permit very accurate reducers.

• Low reduction rates in the map phase need fast reducers (join, filtering or a fast fusion).

As commented in the previous section, we propose the use of edition-based methods for the filtering reduce type and centroid-based algorithms to fuse prototypes. In our experiments, we will focus on a simple but effective edition technique: the edited nearest neighbor (ENN) [41].
This algorithm removes an instance from a set of prototypes if it does not agree with the majority of its k nearest neighbors. As algorithms to fuse prototypes, we will use the ICPL2 method presented in [47] as the more accurate option and the GMCA model [44] for a faster reduce phase. The ICPL2 model integrates several prototypes by identifying borders and merging those instances that are not located on these borders; it stands out as the best performing model of the centroid-based family in [20]. The GMCA approach merges prototypes based on hierarchical clustering, providing a good trade-off between accuracy and the runtime needed.

4. Experimental study

In this section we present the questions raised in the experimental study and the results obtained. Section 4.1 describes the performance measures used to evaluate the MRPR model. Section 4.2 details the hardware and software support used in our experiments. Section 4.3 shows the parameters of the involved algorithms and the data sets chosen. Section 4.4 presents and discusses the results achieved. Finally, Section 4.5 includes additional experiments using different PR techniques within the MRPR model.

4.1. Performance measures

In this work we study the performance of a parallel PR system to improve the NN classifier. Hence, we need several types of measures to characterize the abilities of the proposed approach and its variants. In the following, we briefly describe the considered measures:

• Accuracy: It counts the number of correct classifications with respect to the total number of instances classified [4, 48]. In our experiments we will compute training and test classification accuracy.

• Reduction rate: It measures the reduction of storage requirements achieved by a PR algorithm:

ReductionRate = 1 − size(RS)/size(TR)    (1)

Reducing the number of stored instances in the TR set will yield a reduction in the time needed to classify a new input sample.

• Runtime: We will quantify the total time spent by MRPR to generate the RS, including all the computations performed by the MapReduce framework.

• Test classification time: It refers to the time needed to classify all the instances of TS with a given TR. For PR, it is directly related to the reduction rate.

• Speed up: It checks the efficiency achieved by a parallel system in comparison with the sequential version of the algorithm; that is, it measures the relation between the runtimes of the sequential and parallel versions. If the calculation is executed in c processing cores and is considered fully parallelizable, the maximum theoretical speed up would be equal to the number of used cores, according to Amdahl's law [49]. With a MapReduce parallelization scheme, each map corresponds to a single core, so the number of used mappers determines the maximum attainable speed up. However, due to the magnitude of the data sets used, we cannot run the sequential version of the selected PR technique (SSMA-SFLSDE), because its execution is extremely slow. For this reason, we will take the runtime with the minimum number of mappers as the reference time to calculate the speed up. Therefore, the speed up will be computed as:

Speedup = (parallel time with minimum number of mappers) / (parallel time)    (2)

For instance, if the runtime with the minimum number of mappers is 14419 s and the runtime with 256 mappers is 2231 s (the fusion reducer on the PokerHand problem, Table 4), the reported speed up is 14419/2231 ≈ 6.5.

4.2. Hardware and software used

The experiments have been carried out on twelve nodes in a cluster: the master node and eleven compute nodes.
Each one of these compute nodes has the following features:

• Processors: 2 x Intel Xeon CPU E5-2620
• Cores: 6 per processor (12 threads)
• Clock speed: 2.00 GHz
• Cache: 15 MB
• Network: Gigabit Ethernet (1 Gbps)
• Hard drive: 2 TB
• RAM: 64 GB

The master node works as the user interface and hosts both Hadoop master processes: the NameNode and the JobTracker. The NameNode handles the HDFS, coordinating the slave machines by means of their respective DataNode processes and keeping track of the files and the replications of each HDFS block. The JobTracker is the MapReduce framework master process that manages the TaskTrackers of each compute node. Its responsibilities are maintaining the load balance and the fault tolerance in the system, ensuring that all nodes get their part of the input data chunk and reassigning the parts that could not be executed.

The specific details of the software used are the following:

• MapReduce implementation: Hadoop 2.0.0-cdh4.4.0, MapReduce 1 runtime (Classic), Cloudera's open-source Apache Hadoop distribution [50].
• Maximum map tasks: 128.
• Maximum reduce tasks: 1.
• Machine learning library: Mahout 0.8.
• Operating system: CentOS 6.4.

Note that the total number of cores of the cluster is 132. However, the maximum number of map tasks is limited to 128, and the number of reduce tasks to one.

4.3. Data sets and methods

In this experimental study we will use four big classification data sets taken from the UCI repository [51]. Table 1 summarizes their main characteristics. For each data set, we show the number of examples (#Examples), the number of attributes (#Dimension) and the number of classes (#ω).

Table 1: Summary description of the used big data classification problems

Data set                               #Examples   #Dimension   #ω
PokerHand                              1025010     10           10
KddCup 1999 (DOS vs. normal classes)   4856151     41           2
Susy                                   5000000     18           2
RLCP                                   5749132     4            2

These data sets have been partitioned using a 5-fold cross-validation (5-fcv) scheme. This means that each data set is split into 5 folds, each one containing 20% of its examples. For each fold, a PR algorithm is run over the examples presented in the remaining folds (that is, in the training partition, TR). Then, the resulting RS is tested on the current fold using the NN rule. Test partitions are kept aside during the PR phase in order to analyze the generalization capabilities provided by the generated RS. Because of the randomness of some of the operations that these algorithms perform, they have been run three times per partition.

Aiming to investigate the effect of the number of instances in our MRPR scheme, we create three different versions of the KDD Cup data set by randomly selecting 10%, 50% and 100% of the instances of the original data set. We will denote these versions as Kddcup (10%), Kddcup (50%) and Kddcup (100%).

The number of instances of a data set and the number of mappers used in our scheme are directly related. Table 2 shows the approximate number of instances per chunk, that is, the size of each TR_j for MRPR, according to the number of mappers established.

Table 2: Approximate number of instances in each TR_j subset according to the number of mappers used

                   Number of mappers
Data set           64      128     256     512     1024
PokerHand          12813   6406    3203    1602    801
Kddcup (10%)       6070    3035    1518    759     379
Kddcup (50%)       30351   15175   7588    3794    1897
Kddcup (100%)      60702   30351   15175   7588    3794
Susy               62469   31234   15617   7809    3904
RLCP               71862   35931   17965   8983    4491
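The figures in Table 2 follow directly from the partitioning scheme: under 5-fcv, each TR contains roughly 4/5 of the examples of the data set, and these are split evenly among the mappers. A quick Python sketch, with sizes taken from Table 1 (the output matches Table 2 up to small deviations caused by the randomized splitting):

# Approximate size of each TR_j: (4/5 of the data set) / number of mappers.
sizes = {"PokerHand": 1025010, "Susy": 5000000, "RLCP": 5749132}
for name, n in sizes.items():
    row = [round(0.8 * n / m) for m in (64, 128, 256, 512, 1024)]
    print(name, row)
# PokerHand -> [12813, 6406, 3203, 1602, 801], as in Table 2.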
When the number of instances per chunk exceeds twenty thousand, the execution of the PR technique is not feasible in terms of time; therefore, we are unable to carry out the experiments for those configurations.

As we stated before, we will focus on the hybrid SSMA-SFLSDE algorithm [27] to test the MRPR model. However, in Section 4.5, we will conduct some additional experiments with other PR techniques. Concretely, we will use LVQ3 [52] and RSP3 [45] as pure prototype generation algorithms, as well as DROP3 [43] and FCNN [53] as prototype selection algorithms. Furthermore, we will use the ENN algorithm [41] as the edition method for the filtering-based reducer. For the fusion-based reducer, we will apply a very accurate centroid-based technique called ICPL2 [47] when SSMA-SFLSDE and LVQ3 are run in the map phase; this choice is motivated by the high reduction ratio of these positioning adjustment methods. For RSP3, DROP3 and FCNN we will rely on a faster fusion method known as GMCA [44].

Table 3: Parameter specification for all the methods involved in the experimentation

Algorithm         Parameters
MRPR              Number of mappers = 64/128/256/512/1024, Number of reducers = 1, Type of reduce = Join/Filtering/Fusion
SSMA-SFLSDE       PopulationSFLSDE = 40, IterationsSFLSDE = 500, iterSFGSS = 8, iterSFHC = 20, Fl = 0.1, Fu = 0.9
ICPL2 (Fusion)    Filtering method = RT2
ENN (Filtering)   Number of neighbors = 3, Euclidean distance
NN                Number of neighbors = 1, Euclidean distance
LVQ3              Iterations = 100, alpha = 0.1, WindowWidth = 0.2, epsilon = 0.1
RSP3              Subset choice = Diameter
DROP3             Number of neighbors = 3, Euclidean distance
FCNN              Number of neighbors = 3, Euclidean distance
GMCA (Fusion)     Number of neighbors = 1, Euclidean distance

In addition, the NN classifier has been included as a baseline of performance. Table 3 presents all the parameters involved in our experimental study. These parameters have been fixed according to the recommendations of the corresponding authors of each algorithm. Note that our research is not devoted to optimizing the accuracy obtained with a PR method on a specific problem; we focus our experiments on the analysis of the behavior of the proposed parallel system. To do so, we will study the influence of the number of mappers and the type of reduce on the accuracy achieved and the runtime needed. In some of the experiments we will use a higher number of mappers than the available map tasks (128). In these cases, the Hadoop system queues the remaining tasks and dispatches them as soon as a map task finishes its processing.

A brief description of the PR methods used follows:

• SSMA-SFLSDE: This algorithm is a hybridization of prototype selection and generation. First, a prototype selection step is performed based on the memetic algorithm SSMA [32], which makes use of a local search specifically developed for prototype selection. This initial step allows us to find a promising selection of prototypes per class. Then, the resulting RS is inserted as one of the individuals of the population of an adaptive differential evolution algorithm [54, 55], which acts as a prototype generation model to adjust the positioning of the selected prototypes.

• LVQ3: This method combines strategies that "punish" or "reward" the positioning of a prototype in order to adjust an initial set of prototypes, whose size is a user-defined parameter. Therefore, it is included in the positioning adjustment family.
• RSP3: This technique tries to avoid drastic changes in the form of the decision boundaries associated with TR by splitting it into different subsets according to the highest overlapping degree [45]. As such, it belongs to the family of space-splitting PR techniques.

• DROP3: This model combines a noise-filtering stage and a decremental approach that removes from the original TR set those instances that are considered harmful within their nearest neighbors. It is included in the family of hybrid edition and condensation PR techniques.

• FCNN: Following an incremental methodology, this algorithm starts by introducing into the resulting RS the centroid of each class. Then, prototypes contained in TR are added according to the nearest neighbor of each centroid. It belongs to the condensation-based family.

4.4. Exhaustive evaluation of the MRPR framework for the SSMA-SFLSDE method

This section presents and analyzes the results collected in the experimental study with the SSMA-SFLSDE method from two different points of view:

• Firstly, we study the accuracy and reduction results obtained with the three implemented reducers of the MRPR model, checking the performance achieved in comparison with the NN rule (Section 4.4.1).

• Secondly, we analyze the scalability of the proposed approach in terms of runtime and speed up (Section 4.4.2).

Tables 4, 5, 6 and 7 summarize all the results obtained on the considered data sets. They show the training/test accuracy, runtime and reduction rate obtained by the SSMA-SFLSDE algorithm within our MRPR framework, depending on the number of mappers (#Mappers) and the reduce type. For each one of these measures, average (Avg.) and standard deviation (Std.) results are presented (from the 5-fcv experiment).

Table 4: Results obtained for the PokerHand problem

Reduce type  #Mappers  Training Avg (Std)  Test Avg (Std)   Runtime Avg (Std)       Reduction rate Avg (Std)  Classification time (TS)
Join         64        0.5158 (0.0007)     0.5102 (0.0008)  13236.6012 (147.8684)   97.5585 (0.0496)          1065.1558
Filtering    64        0.5212 (0.0008)     0.5171 (0.0014)  13292.8996 (222.3406)   98.0714 (0.0386)          848.0034
Fusion       64        0.5201 (0.0011)     0.5181 (0.0015)  14419.3926 (209.9481)   99.1413 (0.0217)          374.8814
Join         128       0.5111 (0.0005)     0.5084 (0.0011)  3943.3628 (161.4213)    97.2044 (0.0234)          1183.6378
Filtering    128       0.5165 (0.0007)     0.5140 (0.0007)  3949.2838 (135.4213)    97.7955 (0.0254)          920.8190
Fusion       128       0.5157 (0.0012)     0.5139 (0.0006)  4301.2796 (180.5472)    99.0250 (0.0119)          419.6914
Join         256       0.5012 (0.0010)     0.4989 (0.0010)  2081.0662 (23.6610)     96.5655 (0.0283)          1451.1200
Filtering    256       0.5045 (0.0010)     0.5024 (0.0006)  2074.0048 (25.4510)     97.2681 (0.0155)          1135.2452
Fusion       256       0.5161 (0.0004)     0.5151 (0.0007)  2231.4050 (14.3391)     98.8963 (0.0045)          478.8326
Join         512       0.5066 (0.0007)     0.5035 (0.0009)  1101.8868 (16.6405)     96.2849 (0.0487)          1545.4300
Filtering    512       0.5114 (0.0010)     0.5091 (0.0005)  1101.2614 (13.0263)     97.1122 (0.0370)          1472.6066
Fusion       512       0.5088 (0.0008)     0.5081 (0.0009)  1144.8080 (18.3065)     98.7355 (0.0158)          925.1834
Join         1024      0.4685 (0.0008)     0.4672 (0.0008)  598.2918 (11.6175)      95.2033 (0.0202)          2132.7362
Filtering    1024      0.4649 (0.0009)     0.4641 (0.0010)  585.4320 (8.4529)       96.2073 (0.0113)          1662.5460
Fusion       1024      0.5052 (0.0003)     0.5050 (0.0009)  601.0838 (7.4914)       98.6249 (0.0157)          1345.6998
NN           –         0.5003 (0.0007)     0.5001 (0.0011)  –                       –                         48760.8242

Table 5: Results obtained for the Kddcup (100%) problem

Reduce type  #Mappers  Training Avg (Std)  Test Avg (Std)   Runtime Avg (Std)      Reduction rate Avg (Std)  Classification time (TS)
Join         256       0.9991 (0.0003)     0.9993 (0.0003)  8536.4206 (153.7057)   99.9208 (0.0007)          1630.8426
Filtering    256       0.9991 (0.0003)     0.9991 (0.0003)  8655.6950 (148.6363)   99.9249 (0.0009)          1308.1294
Fusion       256       0.9994 (0.0000)     0.9994 (0.0000)  8655.6950 (148.6363)   99.9279 (0.0008)          1110.4478
Join         512       0.9991 (0.0001)     0.9992 (0.0001)  4614.9390 (336.0808)   99.8645 (0.0010)          5569.8084
Filtering    512       0.9989 (0.0001)     0.9989 (0.0001)  4941.7682 (44.8844)    99.8708 (0.0013)          5430.4020
Fusion       512       0.9992 (0.0001)     0.9993 (0.0001)  5018.0266 (62.0603)    99.8660 (0.0006)          2278.2806
Join         1024      0.9990 (0.0002)     0.9991 (0.0002)  2620.5402 (186.5208)   99.7490 (0.0010)          5724.4108
Filtering    1024      0.9989 (0.0000)     0.9989 (0.0001)  3103.3776 (15.4037)    99.7606 (0.0011)          4036.5422
Fusion       1024      0.9991 (0.0002)     0.9991 (0.0002)  3191.2468 (75.9777)    99.7492 (0.0010)          4247.8348
NN           –         0.9994 (0.0001)     0.9993 (0.0001)  –                      –                         2354279.8650
Moreover, the average classification time in the TS is computed as the time needed to classify all the instances of TS with the corresponding RS generated by MRPR. Furthermore, we compare these results with the accuracy and the test classification time achieved by the NN classifier, which uses the whole TR set to classify all the instances of TS. In these tables, average accuracies higher than or equal to the one obtained with the NN algorithm are highlighted in bold; the best overall results, in the training and test phases, are stressed in italic.

Table 6: Results obtained for the Susy problem

Reduce type  #Mappers  Training Avg (Std)  Test Avg (Std)   Runtime Avg (Std)       Reduction rate Avg (Std)  Classification time (TS)
Join         256       0.6953 (0.0005)     0.7234 (0.0004)  69153.3210 (4568.5774)  97.4192 (0.0604)          30347.0420
Filtering    256       0.6941 (0.0001)     0.7282 (0.0003)  66370.7020 (4352.1144)  97.7690 (0.0046)          24686.3550
Fusion       256       0.6870 (0.0002)     0.7240 (0.0002)  69796.7260 (4103.9986)  98.9068 (0.0040)          11421.6820
Join         512       0.6896 (0.0012)     0.7217 (0.0003)  26011.2780 (486.6898)   97.2050 (0.0052)          35067.5140
Filtering    512       0.6898 (0.0002)     0.7241 (0.0003)  28508.2390 (484.5556)   97.5609 (0.0036)          24867.5478
Fusion       512       0.6810 (0.0002)     0.7230 (0.0002)  30344.2770 (489.8877)   98.8337 (0.0302)          12169.2180
Join         1024      0.6939 (0.0198)     0.7188 (0.0417)  13524.5692 (1941.2683)  97.1541 (0.5367)          45387.6154
Filtering    1024      0.6826 (0.0005)     0.7226 (0.0006)  14510.9125 (431.5152)   97.3203 (0.0111)          32568.3810
Fusion       1024      0.6757 (0.0004)     0.7208 (0.0008)  15562.1193 (327.8043)   98.7049 (0.0044)          12135.8233
NN           –         0.6899 (0.0001)     0.7157 (0.0001)  –                       –                         1167200.3250

Table 7: Results obtained for the RLCP problem

Reduce type  #Mappers  Training Avg (Std)  Test Avg (Std)   Runtime Avg (Std)       Reduction rate Avg (Std)  Classification time (TS)
Join         256       0.9963 (0.0000)     0.9963 (0.0000)  29549.0944 (62.4140)    98.0091 (0.0113)          10534.0450
Filtering    256       0.9963 (0.0000)     0.9963 (0.0000)  29557.2276 (62.7051)    98.0091 (0.0113)          10750.9012
Fusion       256       0.9963 (0.0000)     0.9963 (0.0000)  26814.9270 (1574.4760)  98.6291 (0.0029)          10271.0902
Join         512       0.9962 (0.0001)     0.9962 (0.0000)  10093.9022 (61.6980)    97.9911 (0.0019)          11767.8596
Filtering    512       0.9962 (0.0001)     0.9962 (0.0000)  10916.6962 (951.5328)   97.9919 (0.0016)          11689.1144
Fusion       512       0.9962 (0.0001)     0.9963 (0.0000)  11326.7812 (85.6898)    98.3012 (0.0036)          10856.8888
Join         1024      0.9960 (0.0001)     0.9960 (0.0001)  5348.4346 (20.6944)     97.9781 (0.0010)          10930.7026
Filtering    1024      0.9960 (0.0001)     0.9960 (0.0001)  5328.0388 (14.8981)     97.9781 (0.0010)          11609.2740
Fusion       1024      0.9960 (0.0001)     0.9960 (0.0001)  5569.2214 (16.5025)     98.2485 (0.0015)          10653.3659
NN           –         0.9946 (0.0001)     0.9946 (0.0001)  –                       –                         769706.2186

4.4.1. Analysis of accuracy and reduction capabilities

This section is focused on comparing the resulting accuracy and reduction rates of the different versions of MRPR. Figure 4 depicts the test accuracy achieved according to the number of mappers in the data sets considered.
It represents the average accuracy depending on the reduce type utilized. The average accuracy of the NN rule is presented as a horizontal line, y = AverageAccuracy, to show the accuracy differences between using the whole TR or a generated RS as the training data set. In addition, Figure 5 plots the reduction rates attained by each type of reduce on the considered problems, showing the average reduction rate with 256 mappers.

Figure 4: Test accuracy results. (a) PokerHand; (b) Kddcup (100%); (c) Susy; (d) RLCP.

Figure 5: Reduction rate achieved with 256 mappers.

According to these graphics and tables, we can make several observations:

• Since, within the MRPR framework, a PR algorithm does not have the full information about the addressed problem, it is expected that the accuracy obtained decreases as the number of available instances in the training set used is reduced. How the accuracy is reduced depends crucially on the specific problem tackled and its complexity. However, this behavior could be generalizable and extensible to most problems, because there will be a minimum number of instances below which the performance decreases drastically. Observing the previous tables and graphics, we can see that in the case of the PokerHand problem the performance is markedly deteriorated when the problem is divided into 1024 subsets (mappers), in both the training and test phases. In the Susy data set, the accuracy gradually deteriorates as the number of mappers is incremented. For the Kddcup (100%) and RLCP problems, the performance is very slightly reduced when the number of mappers is increased (on the order of three or four ten-thousandths).

• Nevertheless, it is important to highlight that although the accuracy of the PR algorithm may gradually decrease, it is not very far from that achieved with the NN rule. In fact, it can even be higher, as happens in the PokerHand, Susy and RLCP problems. This situation occurs because PR techniques remove noisy instances from the TR set that damage the classification performance of the NN rule. Moreover, PR models typically smooth the decision boundaries between classes, which usually results in an improvement of the generalization capabilities (test accuracy).

• When tackling large-scale problems, the reduction rate of a PR technique becomes much more important, under the premise that the accuracy does not deteriorate much. A high reduction rate implies a significant decrease in the computational time spent to classify new instances. As we commented before, the accuracy obtained by our model does not drop dramatically when the number of mappers is augmented. The same behavior is found in terms of reduction capabilities.
This number also influences the reduction rate achieved, because the lack of information about the whole problem may degrade the reduction capabilities of PR techniques. However, in general, the reduction rates obtained are very high, representing the original problem with less than 5% of the total number of instances, which allows us to classify the TS very quickly.

• Independently of the number of mappers and the type of reduce, there are no remarkable differences between the results of the training and test phases. The partitioning process slightly reduces accuracy and reduction rate because of the lack of the whole information. By contrast, this mechanism helps to avoid the overfitting problem, that is, the overlearning of the training set.

• Comparing the different reduce types, we can check that in general the fusion approach outperforms the other kinds of reducers in most of the data sets, resulting in better training and test accuracy. It is noteworthy that in the case of the PokerHand data set, when the other types of reducers decrease their performance, the fusion reducer is able to preserve its accuracy with 1024 mappers. We can also observe that the filtering reducer provides higher accuracy results than the join approach in the PokerHand and Susy problems, while its results are very similar for the Kddcup (100%) and RLCP sets.

• A quick glance at Figure 5 reveals that the fusion scheme always reports the highest reduction rate, followed by the filtering scheme. Besides promoting a higher reduction rate, the fusion reducer has also shown the best accuracy. Therefore, merging the resultant RSj sets with a fusion or a filtering process provides better accuracy and reduction rates than a simple joining phase.

• Considering the results provided by the NN rule and the whole TR, Figure 4 shows that, in terms of accuracy, the MRPR model with the fusion scheme outperforms the NN rule in the PokerHand, Susy and RLCP problems. A very similar behavior is reached for the Kddcup (100%) data set. Moreover, the reduction rate attained by the MRPR model implies a much lower test classification time. For example, we can see in Table 4 that we can perform the classification of the PokerHand data set up to 130 times faster than the NN classifier when the fusion method and 64 mappers are used. A similar improvement is achieved in the Susy and RLCP problems. However, for the Kddcup (100%) data set this improvement is much more accentuated, and classifying the test set can be approximately 2120 times faster (using the fusion reducer and 256 mappers). These results demonstrate and exemplify the necessity of applying PR techniques to large-scale problems.

4.4.2. Analysis of the scalability

In this part of the experimental study we concentrate on the analysis of the runtime and speed up of the MRPR model. As defined in Section 4.3, we divided the Kddcup problem into three sets with different numbers of instances, aiming to study the influence of the number of instances in the same problem. Figure 6 draws the average runtime (obtained in the 5-fcv experiment) according to the number of mappers used in the problem considered. Moreover, Figure 7 depicts the speed up achieved by MRPR with the fusion reducer. Note that, as we clarified in Section 4.1, the speed up has been computed using the runtime with the minimum number of mappers (minMaps) as the reference time. Therefore, the speed up does not represent the gain obtained with respect to the number of cores. In this chart, the speed up of MRPR with minMaps in each data set is set as 1. Since the complexity of SSMA-SFLSDE is O((n · D)^2), we cannot expect a quadratic speed up, because the proposed scheme is focused on the number of instances. Furthermore, it is very important to remember that, in the cluster used, the maximum number of mappers available at the same time is 128, and the rest of the tasks are queued.
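As a worked example of this reference-time definition, take the fusion-reducer runtimes reported in Table 7 for RLCP, where minMaps = 256:

$$ \mathrm{speedup}(m) = \frac{T(\mathrm{minMaps})}{T(m)}, \qquad \mathrm{speedup}_{\mathrm{RLCP}}(1024) = \frac{26814.9270}{5569.2214} \approx 4.81. $$

Quadrupling the number of mappers thus yields slightly more than the ideal factor of 4; this is consistent with the quadratic cost of the PR technique on each (now smaller) TRj subset, rather than with any gain in physical cores.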
Figure 8 presents an average runtime comparison between the results obtained in the three versions of the Kddcup problem. It shows, for each set, its average runtime with 256, 512 and 1024 mappers of the MRPR approach using the reducer based on fusion. Given these figures and the previous tables, we want to outline the following comments:

• Despite the better performance shown by the filtering and fusion reducers in comparison with the joining scheme, all the reduce alternatives spend very similar runtimes to generate a final RS. This means that, although the fusion and filtering reducers require extra computations with regard to the join approach, we take advantage of the way MapReduce works, so that the reduce stage is being executed while the mappers are still finishing. In this way, most of the extra calculations needed by the filtering and fusion approaches are performed before all the mappers have finished.

[Figure 6: Average runtime obtained by MRPR. Four sub-figures, (a) PokerHand, (b) Kddcup (100%), (c) Susy and (d) RLCP, showing average runtime (s) versus number of mappers for the Join, Filtering and Fusion reduce types.]

• In Figure 7, we can observe different tendencies depending on the data set used. This is due to the fact that these problems have different numbers of features, which also determine the complexity of the PR technique. For this reason, it is easier to obtain a higher speed up with PokerHand than, for instance, with the Kddcup problem, because PokerHand has a smaller number of characteristics. The same behavior is shown in the Susy and RLCP problems: with a similar number of instances, a slightly better speed up is achieved with RLCP. In addition, according to this figure, with the same resources (128 mappers) MRPR is able to accelerate the processing of PR techniques by dividing the TR set into a higher number of subsets. As we checked in the previous section, these speed ups do not result in a significant accuracy loss.

[Figure 7: Speed up achieved by MRPR with the fusion reducer (runtime speed up versus number of mappers for PokerHand, Kddcup (10%), Kddcup (50%), Kddcup (100%), RLCP and Susy).]

[Figure 8: Runtime comparison on the three versions of the Kddcup problem, using MRPR with the fusion reducer (average runtime (s) with 256, 512 and 1024 mappers).]

• Figure 8 illustrates the increment of the average runtime when the size of the same problem is increased. In problems with quadratic complexity, we could expect that, with the same number of mappers, this increment should also be quadratic. In this figure, we can see that the increment of runtime is much smaller than a quadratic increment. For example, for 512 mappers, MRPR spends 2571.0068 seconds in Kddcup (50%) and 8655.6950 seconds for the full problem. As we can see in Table 2, the approximate number of instances in each TRj subset for Kddcup (100%) is double that of Kddcup (50%) with 512 mappers. Therefore, its computational cost is not incremented quadratically.
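A back-of-the-envelope estimate makes this comparison concrete. Assuming the O((n · D)^2) complexity stated above and a fixed number of mappers m, doubling the problem size doubles each TRj, so the per-mapper (and hence total) cost would be expected to quadruple:

$$ \frac{T_{100\%}}{T_{50\%}} \approx \left(\frac{2n/m}{n/m}\right)^{2} = 4 \quad\Longrightarrow\quad T_{100\%}^{\mathrm{quadratic}} \approx 4 \times 2571.0068 \approx 10284 \ \mathrm{s}, $$

whereas the reported runtime for the full problem is 8655.6950 s, a growth factor of roughly 3.4 instead of 4.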
4.5. Experiments on different PR techniques

In this section we perform some additional experiments using four different PR techniques within the proposed MRPR framework. In these experiments, the number of mappers has been fixed to 64 and we focus on the PokerHand problem. Table 8 shows the results obtained. Figure 9 presents a comparison across the four techniques within MRPR: Figure 9a depicts the test accuracy obtained by the four techniques using the three reduce types, and Figure 9b shows the time needed to classify the test set. In both plots, the results of the NN rule are presented as a baseline. As before, the results that are better than the NN rule are stressed in bold, and the overall best ones are highlighted in italics.

Table 8: Results obtained for the PokerHand problem with 64 mappers.

PR technique (Reduce type) | Training Avg. (Std.) | Test Avg. (Std.) | Runtime Avg. (Std.) | Reduction rate Avg. (Std.) | Classification time (TS)
LVQ3 (Join) | 0.4686 (0.0005) | 0.4635 (0.0014) | 15.3526 (0.8460) | 97.9733 (0.0001) | 841.5352
LVQ3 (Filtering) | 0.4892 (0.0007) | 0.4861 (0.0013) | 17.7602 (0.1760) | 98.6244 (0.0101) | 487.0822
LVQ3 (Fusion) | 0.4932 (0.0010) | 0.4918 (0.0012) | 83.7830 (4.8944) | 99.3811 (0.0067) | 273.4192
FCNN (Join) | 0.4883 (0.0008) | 0.4889 (0.0010) | 39.8196 (2.1829) | 17.7428 (0.0241) | 28232.4110
FCNN (Filtering) | 0.5185 (0.0006) | 0.5169 (0.0005) | 5593.4358 (23.1895) | 47.3255 (0.0310) | 19533.5424
FCNN (Fusion) | 0.6098 (0.0002) | 0.4862 (0.0006) | 3207.8540 (37.2208) | 72.5604 (0.0080) | 9854.8956
DROP3 (Join) | 0.5073 (0.0004) | 0.5044 (0.0014) | 69.5268 (2.5605) | 77.0352 (0.0141) | 8529.0618
DROP3 (Filtering) | 0.5157 (0.0005) | 0.5124 (0.0013) | 442.9670 (2.6939) | 81.2203 (0.0169) | 8139.5878
DROP3 (Fusion) | 0.5390 (0.0004) | 0.5011 (0.0005) | 198.1450 (5.2750) | 92.3467 (0.0043) | 1811.0866
RSP3 (Join) | 0.6671 (0.0003) | 0.5145 (0.0007) | 219.2912 (2.8126) | 53.0566 (0.0554) | 17668.5268
RSP3 (Filtering) | 0.6491 (0.0003) | 0.5173 (0.0008) | 1898.5854 (10.8303) | 58.8459 (0.0280) | 17181.5448
RSP3 (Fusion) | 0.5786 (0.0004) | 0.5107 (0.0010) | 1448.4272 (60.5462) | 84.3655 (0.0189) | 5741.6588
NN | 0.5003 (0.0007) | 0.5001 (0.0011) | – | – | 48760.8242

[Figure 9: Results obtained by MRPR with different PR techniques. (a) PokerHand: test accuracy; (b) PokerHand: classification time (s); both per method (DROP3, FCNN, LVQ3, RSP3) and reduce type, with the NN rule as baseline.]

Observing these results, we can see that the MRPR model works appropriately with these techniques. Nevertheless, we can point out several differences in comparison with the results obtained with SSMA-SFLSDE:

• Since LVQ3 is a positioning adjustment method with a high reduction rate, we observe a similar behavior between this technique and SSMA-SFLSDE within the MRPR model. Note that this algorithm has also been run with ICLP2 as the fusion method. We can highlight that the filtering and fusion reduce schemes greatly improve the performance of LVQ3 in accuracy and reduction rates.

• In the previous section we observed that the filtering and fusion stages provide a greater reduction rate than the join scheme.
In this section, we can see that for FCNN, DROP3 and RSP3 this effect is even more accentuated, due to the fact that these techniques have a lower reduction power than SSMA-SFLSDE and LVQ3. Therefore, the filtering and fusion algorithms become more important with these techniques in order to achieve a high reduction ratio.

• The runtime needed by the filtering and fusion schemes crucially depends on the reduction rate of the technique used. For example, the FCNN method initially provides a very reduced reduction rate (around 18%), so that the runtime of the filtering and fusion reducers is greater than the time needed by the join reducer. However, as commented before, the application of these reducers increases the reduction rate, resulting in a faster classification time.

• As commented previously, we have used a fusion reducer based on GMCA when FCNN, DROP3 and RSP3 are applied. It is noteworthy that this fusion approach has resulted in a faster runtime in comparison with the filtering scheme. Nevertheless, as we expected, the performance reached with this fusion reducer, in terms of accuracy, is lower than that obtained with ICLP2 in combination with SSMA-SFLSDE.

• Comparing the results obtained with these techniques and SSMA-SFLSDE, we can observe that the best test accuracy is obtained with RSP3 and the filtering scheme (0.5173), with a medium reduction ratio (58.8459%). However, the SSMA-SFLSDE algorithm was able to achieve a higher test accuracy (0.5181) using the fusion reducer with a very high reduction rate (99.1413%).

5. Concluding remarks

In this paper we have developed a MapReduce solution for prototype reduction, denominated MRPR. The proposed scheme enables these kinds of techniques to be applied to big classification data sets with promising results. Otherwise, these techniques would be limited to tackling small or medium problems that do not contain more than several thousand examples, due to memory and runtime restrictions. The MapReduce paradigm has offered a simple, transparent and efficient environment to parallelize the prototype reduction computation. Three different reduce types have been investigated, Join, Filtering and Fusion, aiming to provide more accurate preprocessed sets. We have found that a reducer based on fusion of prototypes permits us to obtain reduced sets with higher reduction rates and accuracy performance.

The experimental study carried out has shown that MRPR obtains very competitive results. We have tested its behavior with different kinds of PR techniques, analyzing the accuracy, the reduction rate and the computational cost obtained. In particular, we have studied two prototype selection methods (FCNN and DROP3), two prototype generation techniques (LVQ3 and RSP3) and the hybrid SSMA-SFLSDE algorithm. The main achievements of MRPR have been:

• It has allowed us to apply PR techniques to large-scale problems.
• There are no significant accuracy and reduction losses, with very good speed up.
• Its application has resulted in a very large reduction of the storage requirements and classification time of the NN rule when dealing with big data sets.

As future work, we consider the study of new frameworks that enable PR techniques to deal with both large-scale and high-dimensional data sets.

Acknowledgment

Supported by the Research Projects TIN2011-28488, P10-TIC-6858 and P11-TIC-7765. D. Peralta holds an FPU scholarship from the Spanish Ministry of Education and Science (FPU12/04902).

References
[1] V. Marx, The big challenges of big data, Nature 498 (7453) (2013) 255–260.
[2] M. Minelli, M. Chambers, A. Dhiraj, Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses (Wiley CIO), 1st Edition, Wiley Publishing, 2013.
[3] D. Plummer, T. Bittman, T. Austin, D. Cearley, D. Smith, Cloud computing: Defining and describing an emerging phenomenon, Technical report, Gartner (2008).
[4] E. Alpaydin, Introduction to Machine Learning, 2nd Edition, MIT Press, Cambridge, MA, 2010.
[5] M. Woźniak, M. Graña, E. Corchado, A survey of multiple classifier systems as hybrid systems, Information Fusion 16 (2014) 3–17.
[6] S. Sakr, A. Liu, D. Batista, M. Alomari, A survey of large scale data management approaches in cloud environments, IEEE Communications Surveys and Tutorials 13 (3) (2011) 311–336.
[7] J. Bacardit, X. Llorà, Large-scale data mining using genetics-based machine learning, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 3 (1) (2013) 37–61.
[8] J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, Communications of the ACM 51 (1) (2008) 107–113.
[9] J. Dean, S. Ghemawat, MapReduce: a flexible data processing tool, Communications of the ACM 53 (1) (2010) 72–77.
[10] S. Ghemawat, H. Gobioff, S.-T. Leung, The Google file system, in: Proceedings of the nineteenth ACM Symposium on Operating Systems Principles, SOSP '03, 2003, pp. 29–43.
[11] M. Snir, S. Otto, MPI-The Complete Reference: The MPI Core, MIT Press, 1998.
[12] W. Zhao, H. Ma, Q. He, Parallel k-means clustering based on MapReduce, in: M. Jaatun, G. Zhao, C. Rong (Eds.), Cloud Computing, Vol. 5931 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2009, pp. 674–679.
[13] A. Srinivasan, T. Faruquie, S. Joshi, Data and task parallelism in ILP using MapReduce, Machine Learning 86 (1) (2012) 141–168.
[14] Q. He, C. Du, Q. Wang, F. Zhuang, Z. Shi, A parallel incremental extreme SVM classifier, Neurocomputing 74 (16) (2011) 2532–2540.
[15] I. Palit, C. Reddy, Scalable and parallel boosting with MapReduce, IEEE Transactions on Knowledge and Data Engineering 24 (10) (2012) 1904–1916.
[16] G. Caruana, M. Li, Y. Liu, An ontology enhanced parallel SVM for scalable spam filter training, Neurocomputing 108 (2013) 45–57.
[17] D. Pyle, Data Preparation for Data Mining, The Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann, 1999.
[18] H. Liu, H. Motoda (Eds.), Computational Methods of Feature Selection, Chapman & Hall/CRC Data Mining and Knowledge Discovery Series, Chapman & Hall/CRC, 2007.
[19] S. García, J. Derrac, J. Cano, F. Herrera, Prototype selection for nearest neighbor classification: Taxonomy and empirical study, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (3) (2012) 417–435.
[20] I. Triguero, J. Derrac, S. García, F. Herrera, A taxonomy and experimental study on prototype generation for nearest neighbor classification, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 42 (1) (2012) 86–100.
[21] J. Derrac, S. García, F. Herrera, IFS-CoCo: Instance and feature selection based on cooperative coevolution with nearest neighbor rule, Pattern Recognition 43 (6) (2010) 2082–2105.
[22] J. Derrac, C. Cornelis, S. García, F. Herrera, Enhancing evolutionary instance selection algorithms by means of fuzzy rough set based feature selection, Information Sciences 186 (1) (2012) 73–92.
[23] N. García-Pedrajas, A. de Haro-García, J. Pérez-Rodríguez, A scalable approach to simultaneous evolutionary instance and feature selection, Information Sciences 228 (2013) 150–174.
[24] T. M. Cover, P. E. Hart, Nearest neighbor pattern classification, IEEE Transactions on Information Theory 13 (1) (1967) 21–27.
[25] L. Nanni, A. Lumini, Particle swarm optimization for prototype reduction, Neurocomputing 72 (4-6) (2008) 1092–1097.
[26] I. Triguero, S. García, F. Herrera, IPADE: Iterative prototype adjustment for nearest neighbor classification, IEEE Transactions on Neural Networks 21 (12) (2010) 1984–1990.
[27] I. Triguero, S. García, F. Herrera, Differential evolution for optimizing the positioning of prototypes in nearest neighbor classification, Pattern Recognition 44 (4) (2011) 901–916.
[28] J. R. Cano, F. Herrera, M. Lozano, Stratification for scaling up evolutionary prototype selection, Pattern Recognition Letters 26 (7) (2005) 953–963.
[29] J. Derrac, S. García, F. Herrera, Stratified prototype selection based on a steady-state memetic algorithm: a study of scalability, Memetic Computing 2 (3) (2010) 183–199.
[30] I. Triguero, J. Derrac, S. García, F. Herrera, A study of the scaling up capabilities of stratified prototype generation, in: Proceedings of the Third World Congress on Nature and Biologically Inspired Computing (NaBIC'11), 2011, pp. 304–309.
[31] W.-Y. Chen, Y. Song, H. Bai, C.-J. Lin, E. Chang, Parallel spectral clustering in distributed systems, IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (3) (2011) 568–586.
[32] S. García, J. R. Cano, F. Herrera, A memetic algorithm for evolutionary prototype selection: A scaling up approach, Pattern Recognition 41 (8) (2008) 2693–2709.
[33] N. García-Pedrajas, J. Pérez-Rodríguez, Multi-selection of instances: A straightforward way to improve evolutionary instance selection, Applied Soft Computing 12 (11) (2012) 3590–3602.
[34] B. He, W. Fang, Q. Luo, N. K. Govindaraju, T. Wang, Mars: A MapReduce framework on graphics processors, in: Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, PACT '08, ACM, New York, NY, USA, 2008, pp. 260–269.
[35] J. Talbot, R. M. Yoo, C. Kozyrakis, Phoenix++: Modular MapReduce for shared-memory systems, in: Proceedings of the Second International Workshop on MapReduce and Its Applications, ACM, New York, NY, USA, 2011, pp. 9–16. doi:10.1145/1996092.1996095.
[36] T. White, Hadoop: The Definitive Guide, 3rd Edition, O'Reilly Media, Inc., 2012.
[37] Apache Hadoop Project, Apache Hadoop (2013). URL http://hadoop.apache.org/
[38] Apache Mahout Project, Apache Mahout (2013). URL http://mahout.apache.org/
[39] C.-L. Chang, Finding prototypes for nearest neighbor classifiers, IEEE Transactions on Computers 23 (11) (1974) 1179–1184.
[40] C.-T. Chu, S. Kim, Y.-A. Lin, Y. Yu, G. Bradski, A. Ng, K. Olukotun, Map-reduce for machine learning on multicore, in: Advances in Neural Information Processing Systems, 2007, pp. 281–288.
[41] D. L. Wilson, Asymptotic properties of nearest neighbor rules using edited data, IEEE Transactions on Systems, Man and Cybernetics 2 (3) (1972) 408–421.
[42] P. E. Hart, The condensed nearest neighbor rule, IEEE Transactions on Information Theory 14 (1968) 515–516.
[43] D. R. Wilson, T. R. Martinez, Reduction techniques for instance-based learning algorithms, Machine Learning 38 (3) (2000) 257–286.
[44] R. Mollineda, F. Ferri, E. Vidal, A merge-based condensing strategy for multiple prototype classifiers, IEEE Transactions on Systems, Man and Cybernetics B 32 (5) (2002) 662–668.
[45] J. S. Sánchez, High training set size reduction by space partitioning and prototype abstraction, Pattern Recognition 37 (7) (2004) 1561–1564.
[46] J. S. Sánchez, R. Barandela, A. I. Marqués, R. Alejo, J. Badenas, Analysis of new techniques to obtain quality training sets, Pattern Recognition Letters 24 (7) (2003) 1015–1022.
[47] W. Lam, C. K. Keung, D. Liu, Discovering useful concept prototypes for classification based on filtering and abstraction, IEEE Transactions on Pattern Analysis and Machine Intelligence 14 (8) (2002) 1075–1090.
[48] I. H. Witten, E. Frank, Data Mining: Practical machine learning tools and techniques, 2nd Edition, Morgan Kaufmann, San Francisco, 2005.
[49] G. M. Amdahl, Validity of the single processor approach to achieving large scale computing capabilities, in: Proc. Spring Joint Comput. Conf., ACM, 1967, pp. 483–485.
[50] Cloudera, Cloudera distribution including Apache Hadoop (2013). URL http://www.cloudera.com
[51] A. Frank, A. Asuncion, UCI machine learning repository (2010). URL http://archive.ics.uci.edu/ml
[52] T. Kohonen, The self-organizing map, Proceedings of the IEEE 78 (9) (1990) 1464–1480.
[53] F. Angiulli, Fast nearest neighbor condensation for large data sets classification, IEEE Transactions on Knowledge and Data Engineering 19 (11) (2007) 1450–1464.
[54] K. V. Price, R. M. Storn, J. A. Lampinen, Differential Evolution: A Practical Approach to Global Optimization, Natural Computing Series, 2005.
[55] F. Neri, V. Tirronen, Scale factor local search in differential evolution, Memetic Computing 1 (2) (2009) 153–171.

2. Self-labeling with prototype generation/selection for semi-supervised classification

The journal papers associated with this part are:

2.1 Self-Labeled Techniques for Semi-Supervised Learning: Taxonomy, Software and Empirical Study

• I. Triguero, S. García, F. Herrera, Self-Labeled Techniques for Semi-Supervised Learning: Taxonomy, Software and Empirical Study. Knowledge and Information Systems, in press (2014), doi: 10.1007/s10115-013-0706-y.
– Status: Accepted for publication.

Knowl Inf Syst, DOI 10.1007/s10115-013-0706-y
SURVEY PAPER

Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study

Isaac Triguero · Salvador García · Francisco Herrera
Received: 14 May 2013 / Revised: 21 August 2013 / Accepted: 5 November 2013
© Springer-Verlag London 2013

Abstract Semi-supervised classification methods are suitable tools to tackle training sets with large amounts of unlabeled data and a small quantity of labeled data. This problem has been addressed by several approaches with different assumptions about the characteristics of the input data. Among them, self-labeled techniques follow an iterative procedure, aiming to obtain an enlarged labeled data set, in which they accept that their own predictions tend to be correct. In this paper, we provide a survey of self-labeled methods for semi-supervised classification. From a theoretical point of view, we propose a taxonomy based on the main characteristics presented in them. Empirically, we conduct an exhaustive study that involves a large number of data sets, with different ratios of labeled data, aiming to measure their performance in terms of transductive and inductive classification capabilities.
The results are contrasted with nonparametric statistical tests. Note is then taken of which self-labeled models are the best-performing ones. Moreover, a semi-supervised learning module has been developed for the Knowledge Extraction based on Evolutionary Learning software, integrating analyzed methods and data sets.

Keywords Learning from unlabeled data · Semi-supervised learning · Self-training · Co-training · Multi-view learning · Classification

I. Triguero · F. Herrera
Department of Computer Science and Artificial Intelligence, Research Center on Information and Communications Technology (CITIC-UGR), University of Granada, 18071 Granada, Spain
e-mail: [email protected]
F. Herrera e-mail: [email protected]
S. García
Department of Computer Science, University of Jaén, 23071 Jaén, Spain
e-mail: [email protected]

1 Introduction

The semi-supervised learning (SSL) paradigm [1] has attracted much attention in many different fields, ranging from bioinformatics to Web mining, where it is easier to obtain unlabeled than labeled data because it requires less effort, expertise and time. In this context, traditional supervised learning [2] is limited to using labeled data to build a model [3]. Nevertheless, SSL is a learning paradigm concerned with the design of models in the presence of both labeled and unlabeled data. Essentially, SSL methods use unlabeled samples to either modify or reprioritize the hypothesis obtained from labeled samples alone. SSL is an extension of unsupervised and supervised learning, including additional information typical of the other learning paradigm. Depending on the main objective of the methods, we can divide SSL into semi-supervised classification (SSC) [4] and semi-supervised clustering [5,6]. The former focuses on enhancing supervised classification by minimizing errors in the labeled examples, but it must also be compatible with the input distribution of unlabeled instances. The latter, also known as constrained clustering, aims to obtain better-defined clusters than those obtained from unlabeled data alone. We focus on SSC.

SSC can be categorized into two slightly different settings [7], denoted transductive and inductive learning. On the one hand, transductive learning concerns the problem of predicting the labels of the unlabeled examples, given in advance, by taking both labeled and unlabeled data together into account to train a classifier. On the other hand, inductive learning considers the given labeled and unlabeled data as the training examples, and its objective is to predict unseen data. In this paper, we address both settings in order to carry out an extensive analysis of the performance of the studied methods.

Many different approaches have been suggested and studied in order to classify using unlabeled data in SSC. They usually make different assumptions related to the link between the distribution of unlabeled and labeled data. Generative models [8] learn a joint probability model that depends on the assumption that the data follow a determined parametric model. There are also other algorithms, such as transductive inference for support vector machines [9], that assume that the classes are well separated and do not cut through dense unlabeled data. Alternatively, SSC can also be viewed as a graph min-cut problem [10]: if two instances are connected by a strong edge, their labels are likely to be the same. In this case, the graph construction determines the behavior of this kind of algorithm [11].
In addition, there are recent studies that address multiple assumptions in one model [7,12].

A successful methodology to tackle the SSC problem is based on traditional supervised classification algorithms [2]. These techniques aim to obtain one (or several) enlarged labeled set(s), based on their most confident predictions, to classify unlabeled data. We denote these algorithms self-labeled techniques. In the literature [1], self-labeled techniques are typically divided into self-training and co-training. Self-training [13,14] is a simple and effective SSL methodology that has been successfully applied in many real instances. In the self-training process, a classifier is trained with an initial small number of labeled examples, aiming to classify unlabeled points. Then it is retrained with its own most confident predictions, enlarging its labeled training set (a schematic sketch of this loop is given below). This model does not make any specific assumptions about the input data, but it accepts that its own predictions tend to be correct. The standard co-training [15] methodology assumes that the feature space can be split into two different conditionally independent views and that each view is able to predict the classes perfectly [16–18]. It trains one classifier on each specific view, and then the classifiers teach each other the most confidently predicted examples. Multi-view learning [19] for SSC is usually understood to be a generalization of co-training, without requiring explicit feature splits or the iterative mutual-teaching procedure [20–22]. However, these concepts are sparse and frequently confused in the literature.
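The self-training loop just described can be summarized schematically. The sketch below is only an illustration (the ConfidenceClassifier interface and the helper names are hypothetical, not the implementations evaluated in this paper): train on the labeled set, label the unlabeled points, move the most confident predictions into the labeled set and retrain.

```java
import java.util.*;

// Schematic self-training loop (illustration only, hypothetical interfaces).
final class SelfTraining {
    interface ConfidenceClassifier {
        void train(List<double[]> x, List<Integer> y);
        int predict(double[] x);
        double confidence(double[] x);   // probability of the predicted class
    }

    static void run(ConfidenceClassifier h,
                    List<double[]> xL, List<Integer> yL,   // labeled set L
                    List<double[]> xU,                     // unlabeled set U
                    int perIteration) {
        while (!xU.isEmpty()) {
            h.train(xL, yL);                               // retrain on the enlarged labeled set
            // Rank the unlabeled instances by the classifier's own confidence.
            xU.sort(Comparator.<double[]>comparingDouble(h::confidence).reversed());
            int k = Math.min(perIteration, xU.size());
            for (int i = 0; i < k; i++) {                  // accept the k most confident predictions
                double[] x = xU.remove(0);
                xL.add(x);
                yL.add(h.predict(x));                      // the prediction becomes the label
            }
        }
        h.train(xL, yL);                                   // final learned hypothesis
    }
}
```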
Our goal is to identify the best methods in each family, depending on the ratio of labeled data, and to stress the relevant properties of each one. • To provide an open-source SSL module for the Knowledge Extraction based on Evolutionary Learning (KEEL) software tool [26]. This is a research tool for solving data mining problems which contains a great number of machine learning algorithms. The source code of the analyzed algorithms and a wide variety of data sets are available in this module. We will conduct experiments involving a total of 55 UCI/KEEL classification data sets with different ratios of labeled data: 10, 20, 30 and 40 %. The experimental study will include a statistical analysis based on nonparametric statistical tests [27,28]. Then, we will test the performance of the best-performing methods over nine data sets obtained from the book of Chapelle [4]. These problems have a great number of features and a very reduced number of labeled data. A Web site with all the complementary material is available at http://sci2s. ugr.es/SelfLabeled, including this paper’s basic information, the source code of the analyzed algorithms, all the data sets involved and the complete results obtained. The rest of the paper is organized as follows: Sect. 2 provides a description of the properties and an enumeration of the methods, as well as related and advanced work on SSC. Section 3 presents the taxonomy proposed. In Sect. 4 we describe the experimental framework, and Sect. 5 examines the results obtained and presents a discussion on them. In Sect. 6 we summarize our conclusions. “Appendix” shows details and characteristics of the SSL software module. 123 I. Triguero et al. 2 Self-labeled techniques: background The SSC problem can be defined as follows: Let x p be an example where x p = (x p1 , x p2 , . . . , x p D , ω), with x p belonging to a class ω and a D-dimensional space in which x pi is the value of the ith feature of the pth sample. Then, let us assume that there is a labeled set L which consists of n instances x p with ω known. Furthermore, there is an unlabeled set U which consists of m instances xq with ω unknown, let m > n. The L ∪ U set forms the training set (denoted as T R). The purpose of SSC is to obtain a robust learned hypothesis using T R instead of L alone, which can be applied in two slightly different settings: transductive and inductive learning. Transductive learning is described as the application of an SSC technique to classify all the m instances xq of U correctly. The class assignment should represent the distribution of the classes efficiently, based on the input distribution of unlabeled instances and the L instances. Let T S be a test set composed of t unseen instances xr with ω unknown, which has not been used at the training stage of the SSC technique. The inductive learning phase consists of correctly classifying the instances of T S based on the previously learned hypothesis. This section presents an overview of self-labeled techniques. Three main topics will be discussed: • In Sect. 2.1, the common characteristics in self-labeled techniques which will define the categories of the taxonomy proposed in this paper will be outlined. • In Sect. 2.2, we briefly enumerate all the self-labeled techniques proposed in the literature. The complete and abbreviated name will be given together with the proposal reference. • Finally, Sect. 2.3 explores other areas related to self-labeled techniques and provides a summary of advanced work in this research field. 
2.1 Common characteristics in self-labeled techniques

This section provides a framework, establishing the different properties of self-labeled techniques, for the definition of the taxonomy in the next section. Other issues that influence some of the methods are also presented in this section, although they are not involved in the proposed taxonomy. Finally, some criteria will be set in order to compare self-labeled methods.

2.1.1 Main properties of self-labeled techniques

Self-labeled methods search iteratively for one or several enlarged labeled set(s) (EL) of prototypes to efficiently represent the TR. For simplicity in reading, in what follows we restrict the description of these properties to one EL. The taxonomy proposed will be based on the following characteristics:

• Addition mechanism: There are a variety of schemes in which an EL is formed.
– Incremental: A strictly incremental approach begins with an enlarged labeled set EL = L and adds, step by step, the most confident instances of U if they fulfill certain criteria (see the sketch after this list). In this case, the algorithm crucially depends on the way in which it determines the confidence predictions of each unlabeled instance, that is, the probability of belonging to each class. Under such a scheme, the order in which the instances are added to the EL determines the learning hypotheses and therefore the following stages of the algorithm. One of the most important aspects of this approach is the number of instances added in each iteration. On the one hand, this number could be defined as a constant parameter and/or be independent of the classes of the instances. On the other hand, it can be chosen as a value proportional to the number of instances of each class in L. In our experiments, we implement the latter, as suggested in [15]. This is the simplest and most intuitive way of addressing the SSL problem, and it often corresponds to classical self-labeled approaches. One advantage of this approach is that it can be faster during the learning phase than nonincremental algorithms. Nevertheless, the main disadvantage is that strictly incremental algorithms can add instances with erroneous class-label predictions; hence, if this occurs, the learned hypotheses could produce a low performance in the transductive and inductive phases.
– Batch: Another way to generate an EL set is in batch mode. This involves deciding whether each instance meets the addition criteria before adding any of them to the EL; then, all those that do meet the criteria are added at once. In this sense, batch techniques do not assign a definitive class to each unlabeled instance during the learning process, and they can reprioritize the hypotheses obtained from labeled samples. Batch processing suffers from increased time complexity over incremental algorithms.
– Amending: Amending models appeared as a solution to the main drawback of the strictly incremental approach. In this approach, the algorithm starts with EL = L and can iteratively add or remove any instance that meets the specific criterion. This mechanism allows rectifications of already performed operations, and its main advantage is that it makes it easy to achieve a good accuracy-suited EL set of instances. As in the incremental approach, its behavior can also depend on the number of instances added per iteration. Typically, these methods have been designed to avoid the introduction of noisy instances into EL at each iteration [14,29]. However, under the rubric of the amending model, many other proposals can be developed. For instance, incremental and batch mode algorithms in combination with a prior or later cleaning phase of noisy instances are considered to be amending models. Despite its flexibility, this scheme usually has high computational demands compared to incremental and batch algorithms.
• Single-classifier versus multi-classifier: Self-labeled techniques can use one or more classifiers during the enlarging phase of the labeled set. As we stated before, all of these methods follow a wrapper methodology, using classifier(s) to establish the possible class of unlabeled instances. In a single-classifier model, each unlabeled instance is assigned to the most probable class given by the single classifier used, which implies that these probabilities must be explicitly measured. There are different ways in which these confidences are computed. For example, in probabilistic models such as naive Bayes, the confidence predictions can usually be measured as the output probability in prediction, while other methods, such as nearest neighbor [30], can approximate confidence in terms of distance. In general, the main advantage of single-classifier methods is their simplicity, allowing us to compute confidence probabilities faster. Multi-classifier methods combine the learned hypotheses of several classifiers to predict the class of unlabeled instances. The underlying idea of using multi-classifiers in SSL is that several weak classifiers, learned with a small number of instances, can produce better generalization capabilities than only one weak classifier. These methods are motivated, in part, by the empirical success of ensemble methods. Two different approaches are commonly used to calculate the confidence predictions in multi-classifier methods: agreement of classifiers and combination of the probabilities obtained by single classifiers. These techniques usually obtain a more accurate precision than single-classifier models. Another effect of using several classifiers in self-labeled techniques is their greater complexity. In both approaches, different confidence measures have been analyzed in the literature; they will be explained in Sect. 2.1.2.

• Single-learning versus multi-learning: Apart from the number of classifiers, a key concern is whether they are constituted by the same (single) or different (multiple) learning algorithms. The number of different learning algorithms used can also determine the confidence predictions of unlabeled data. In a multi-learning approach, the confidence predictions are computed as the integration of a group of different kinds of learners to boost the classification performance. It works under the hypothesis that different learning techniques present different behaviors, exploiting the classification bias between them, which generates locally different models. Multi-learning methods are closely linked with multi-classifier models: a multi-learning method is itself a multi-classifier method in which the different classifiers come from different learning methods. Hence, the general properties explained above for multi-classifiers also extrapolate to multi-learning. A specific drawback of this approach is the choice of the most adequate learning approaches. By contrast, a single-learning approach can be linked to both single and multi-classifiers. With a single-learning algorithm, the confidence prediction measurement is relaxed with regard to the rest of the properties, the type of view and the number of classifiers. The goal of this approach, which has been shown to perform well in several domains, is simplicity [20].
• Single-view versus multi-view: This characteristic refers to the way in which the input feature space is taken into consideration by the self-labeled technique. In a multi-view self-labeled technique, L is split into two or more subsets of the features $L_{sub_k}$ of dimension $M$ ($M < D$), by projecting each instance $x_p$ of L onto the selected $M$-dimensional subspace. Multi-view requires redundant and conditionally independent views, provided that the class attribute and each subset are sufficient to train a good classifier. Therefore, the performance of these kinds of techniques depends on the division procedure followed. One important advantage of this kind of view is that, if this assumption is met [31], the multi-view approach makes fewer generalization errors, using the hypothetical agreement of classifiers. Nevertheless, it could also lead to adverse results, because the compatibility and independence of the obtained feature subsets are strong assumptions, and many real data sets cannot satisfy these hypotheses. Otherwise, a single-view self-labeled technique does not make any specific assumptions about the input data; hence, it uses the complete L to train. Most of the self-labeled techniques adhere to this, due to the fact that in many real-world data sets the feature input space cannot be appropriately divided into conditionally independent views. In the literature, many methods have been erroneously classified as multi-view or co-training methods without the requirement of sufficient and redundant views. This term is frequently confused with single-view methods that use multi-classifiers. Note that multi-view methods need, by definition, several classifiers; however, a single-view method could use single or multi-classifiers.
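As referenced in the incremental item above, the per-class proportional addition step can be sketched as follows (illustration only; the helper names are hypothetical, and the per-class amounts follow the suggestion of [15] mentioned above):

```java
import java.util.*;

// Schematic incremental addition step (illustration only): the number of
// unlabeled instances accepted per class is proportional to the class
// distribution observed in the original labeled set L, as suggested in [15].
final class IncrementalAddition {

    /** candidates: per-class lists of unlabeled instances, already sorted by
     *  decreasing confidence; classCountsInL: class frequencies in L;
     *  batchSize: total number of instances to self-label in this iteration. */
    static Map<Integer, List<double[]>> selectPerClass(
            Map<Integer, List<double[]>> candidates,
            Map<Integer, Integer> classCountsInL,
            int batchSize) {
        int total = classCountsInL.values().stream().mapToInt(Integer::intValue).sum();
        Map<Integer, List<double[]>> accepted = new HashMap<>();
        for (Map.Entry<Integer, Integer> e : classCountsInL.entrySet()) {
            // Quota proportional to the class frequency in L (at least one).
            int quota = Math.max(1, Math.round(batchSize * e.getValue() / (float) total));
            List<double[]> pool = candidates.getOrDefault(e.getKey(), List.of());
            accepted.put(e.getKey(), pool.subList(0, Math.min(quota, pool.size())));
        }
        return accepted;   // these instances are moved from U into EL
    }
}
```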
2.1.2 Other properties

We may remark upon other properties that explore how self-labeled methods work. Although they influence the operation, and hence the results obtained, with some of these techniques, we have not included them in the proposed taxonomy for the sake of simplicity and usability.

• Confidence measures: An inaccurate confidence measure leads to adding mislabeled examples to the EL, which implies a performance degradation of the self-labeling process. The previously explained characteristics define, in a global manner, the way in which the confidence predictions are estimated. Nevertheless, the specific combination of these characteristics leads to different ideas for establishing these probabilities.
– Simple: As we stated before, the probability predictions can be extracted from the learning model used. This characteristic is essentially presented in single-classifier methods. For example, probabilistic models return, for each unlabeled instance, an associated probability of belonging to each class. In decision tree algorithms [32], the confidence probabilities can be estimated as the accuracy of the leaf that makes the prediction. Instance-based learning approaches can estimate probabilities in terms of the dissimilarity between instances.
– Agreement and combination: Multi-classifier methods can approximate their confidence probabilities based on the predictions obtained by each classifier. One way to calculate them is the hypothetical agreement of the classifiers used; typically, a majority voting rule is applied, in which ties are commonly solved by a random process (a minimal sketch is given after this list). By contrast, there are other methods that generate their confidence predictions as the aggregation of the probabilities obtained by each classifier, in order to find a final confidence value for each unlabeled instance. Furthermore, it would also be possible to combine the agreement of classifiers with the calculated probabilities, developing a hybrid confidence prediction model; not much research is currently underway with regard to this latter scheme. When considering these schemes, some questions should be taken into account depending on the number of different learning algorithms used. In a single-learning proposal, the method should retain multiple different hypotheses and combine (by agreement or combination) their decisions during the computation of the confidence probabilities. In this framework, a mandatory step is to generate new labeled sets based on the original L. For this purpose, self-labeled techniques usually apply bootstrapping techniques, such as resampling [33]; nevertheless, there are other proposals, such as the tenfold cross-validation scheme used to create new labeled sets in [34]. The effect obtained is related to the improvement of generalization capabilities with respect to the use of a single classifier. However, in SSL the labeled set tends to be small, and the diversity obtained by bootstrap sampling is limited, due to the fact that the obtained bootstraps are very similar. In multi-learning approaches, diversity is obtained from the different learning algorithms; thus, they do not generate new labeled sets.

• Self-teaching versus mutual-teaching: Independently of the learning procedure used, one of these characteristics appears in multi-classifier methods. In a mutual-teaching approach, the classifiers teach each other their most confident predicted examples. Each classifier Ci has an associated EL_i, which is initialized in different ways. At each stage, all the classifiers are trained with their respective EL_i, and then each EL_i is increased with the most confident examples obtained as the hypothesis combination (or agreement) of the remaining classifiers. Under this scheme, a classifier Ci does not use its own predictions to increase its EL_i; however, if Ci is unable to detect some interesting unlabeled instances as target examples, the rest of the classifiers may teach different hypotheses to Ci. To construct the final learned hypothesis, two approaches can be followed: join or voting procedures. The join procedure consists of forming a complete EL through the combination of each EL_i without repetitions. With a voting procedure, the final hypothesis is obtained by applying a majority voting rule using all the EL_i. By contrast, the self-teaching property refers to those multi-classifiers that maintain a single EL. In this case, the combination of hypotheses is blended to form a unique EL. As far as we know, there is no proposed model that combines mutual-teaching and self-teaching.

• Stopping criteria: This is related to the mechanism used to stop the self-labeling process. It is an important factor, because it determines the size of the EL formed and therefore the learned hypothesis. Three main approaches can be found in the literature. Firstly, in classical approaches such as self-training, the self-labeling process is repeated until all the instances from U have been added to EL; if erroneous unlabeled instances are added to the EL, they can damage the classification performance. Secondly, in [15], the authors suggest choosing examples from a smaller pool, instead of the whole unlabeled set, to form the EL, establishing a limited number of iterations. This approach has been successfully applied to many self-labeled methods, outperforming the overall classification rate of the previous stopping criterion. However, the maximum number of iterations is usually prefixed, and it is not adaptive to the size of the data set used. Thirdly, the termination criterion can be satisfied when the classifiers used in the self-labeling process do not change the learned hypotheses. This criterion limits the number of unlabeled instances added to EL; however, it does not ensure that erroneous unlabeled instances will not be added to EL.
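The agreement-based estimate mentioned above admits a very small sketch (illustration only): the confidence assigned to an unlabeled instance is the fraction of committee members that vote for the majority class, with ties broken randomly.

```java
import java.util.*;

// Majority-vote confidence (illustration only): given the class predicted by
// each classifier of the committee for one unlabeled instance, return the
// majority class and the number of classifiers that agree with it; the
// confidence is then votes / predictions.length.
final class AgreementConfidence {
    static final Random RNG = new Random();

    /** predictions[i] = class assigned by the i-th classifier. */
    static int[] majorityAndVotes(int[] predictions, int numClasses) {
        int[] votes = new int[numClasses];
        for (int p : predictions) votes[p]++;
        List<Integer> best = new ArrayList<>();
        int max = -1;
        for (int c = 0; c < numClasses; c++) {
            if (votes[c] > max) { max = votes[c]; best.clear(); best.add(c); }
            else if (votes[c] == max) best.add(c);
        }
        int winner = best.get(RNG.nextInt(best.size()));   // random tie-breaking
        return new int[] { winner, max };
    }
}
```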
2.1.3 Criteria to compare self-labeled methods

When comparing self-labeled methods, there are a number of criteria that can be used to compare the relative strengths and weaknesses of each algorithm. These include transductive and inductive accuracy, influence of the number of labeled instances, noise tolerance and time requirements.

• Transductive and inductive accuracy: A successful self-labeling algorithm will often be able to appropriately enlarge the labeled set, increasing the transductive and inductive accuracy.
• Influence of the number of labeled instances: The number of labeled instances that a self-labeled technique manages determines the learned hypothesis. A good technique should be able to learn an appropriate hypothesis with the lowest possible number of labeled instances.
• Noise tolerance: Noisy instances can be harmful if they are added to the labeled set, as they bias the classification of unlabeled data toward incorrect classes, which could make the enlarged labeled set in the next iteration even noisier. This problem may especially occur in the initial stages of the process, and it can also be more harmful with a reduced number of labeled instances. Two types of noisy instances may appear during the self-labeling process. The first is caused by the distribution of the instances of L in their respective classes, which can lead the classifier to label some instances erroneously. Second, there may be outliers within the original unlabeled data; these can be detected, avoiding their labeling and inclusion in L.
• Time requirements: The learning process is usually carried out just once on a training set. If the objective of a specific application is related to transductive learning, this learning process becomes more important, as it is responsible for managing unlabeled data. For inductive learning, it does not seem to be a very important evaluation criterion, because test instances are classified based on the learned model. However, if the learning phase takes too long, it can become impractical for real applications.

2.2 Self-labeled methods

We have performed a deep search of the specialized literature, identifying the most relevant self-labeled algorithms. At the time of writing, more than 20 methods have been proposed. This section is devoted to enumerating and designating them according to the standard followed in this paper.
For more details on their implementations, the reader can visit the URL http://sci2s.ugr.es/SelfLabeled. Implementations of most of the algorithms in Java can be found in the KEEL software [35] (see "Appendix").

Table 1 presents an enumeration of the self-labeled methods reviewed in this paper. The complete name, abbreviation and reference are provided for each one. In the case of there being more than one method in a row, they were proposed together or by the same authors, and the best-performing method (indicated by the respective authors) is depicted in bold. We will test some representative methods; therefore, only the methods in bold will be compared in the experimental study.

Table 1: Self-labeled methods reviewed (complete name | abbreviated name | reference).
Standard self-training | Self-Training | [13]
Standard co-training | Co-Training | [15]
Statistical co-learning | Statistical-Co | [34]
ASSEMBLE | ASSEMBLE | [36]
Democratic co-learning | Democratic-Co | [37]
Self-training with editing | SETRED | [14]
Tri-training | TriTraining | [20]
Tri-training with editing | DE-TriTraining | [38]
Co-forest | CoForest | [21]
Random subspace method for co-training | Rasco | [39]
Co-training by committee: AdaBoost | Co-Adaboost | [40,41]
Co-training by committee: bagging | Co-Bagging | [40,41]
Co-training by committee: RSM | Co-RSM | [40,41]
Co-training by committee: tree-structured ensemble | Co-Tree | [42]
Co-training with relevant random subspaces | Rel-Rasco | [43]
Classification algorithm based on local cluster centers | CLCC | [44]
Ant-based semi-supervised classification | APSSC | [45]
Self-training nearest neighbor rule using cut edges | SNNRCE | [46]
Robust co-training | R-Co-Training | [17]
Adaptive co-forest editing | ADE-CoForest | [47]
Co-training with NB and SVM classifiers | Co-NB-SVM | [18]

2.3 Related and advanced work

Nowadays, the use of unlabeled data in conjunction with labeled data is a growing field in different research lines. Self-labeled techniques form a feasible and promising group of methods that make use of both kinds of data, and they are closely related to other methods and problems. This section provides a brief review of other topics related to self-labeled techniques and describes other interesting work and future trends which have been studied in the last few years.

With the same objective as self-labeled techniques, we group the three following methodologies according to Zhu and Goldberg's book [1]:

• Generative models and cluster-then-label methods: The first attempts to deal with unlabeled data correspond to this area. It includes those methods that assume a joint probability model p(x, y) = p(y)p(x|y), where p(x|y) is an identifiable mixture distribution, for example, a Gaussian mixture model. Hence, it follows a determined parametric model [48], using both unlabeled and labeled data. Cluster-then-label methods are closely related to generative models: instead of using a probabilistic model, they apply a previous clustering step to the whole data set, and then they label each cluster with the help of the labeled data. Recent advances in these topics are [8,49].

• Graph-based: This represents the SSC problem as a graph min-cut problem [10]. Labeled and unlabeled examples constitute the graph nodes, and the similarity measurements between nodes correspond to the graph edges. The graph construction determines the behavior of this kind of algorithm [50,51]. These methods usually assume label smoothness over the graph. Their main characteristics are that they are nonparametric, discriminative and transductive in nature. Advanced proposals can be found in [11,52].
• Semi-supervised support vector machines (S3VM): S3VM is an extension of standard support vector machines (SVM) to unlabeled data [53]. This approach implements the cluster assumption for SSL, that is, examples in the same data cluster have similar labels, so classes are well separated and do not cut through dense unlabeled data. This methodology is also known as transductive SVM, although it learns an inductive rule defined over the search space. Advanced works on S3VM are [54–56].

Regarding other problems connected with self-labeled techniques, we briefly describe the following topics:

• Semi-supervised clustering: This problem, also known as constrained clustering, aims to obtain better-defined clusters than the ones obtained from unlabeled data [57]. Labeled data are used to define pairwise constraints between examples: must-links and cannot-links. The former establishes the examples that must be in the same cluster, and the latter refers to those examples that cannot be in the same cluster [58]. A brief review of this topic can be found in [59].

• Active learning: With the same objective as SSL, avoiding the cost of data labeling, active learning [60] tries to select the most important examples from a pool of unlabeled data. These examples are queried by an expert and are then labeled with the appropriate class, aiming to minimize effort and time consumption. Many active learning algorithms select as queries the examples with maximum label ambiguity or least confidence. Several hybrid methods between self-labeled techniques and active learning [41,61–63] have been proposed and show that active learning queries maximize the generalization capabilities of SSL.
3 Self-labeled techniques: taxonomy

The main characteristics of the self-labeled methods have been described in Sect. 2.1.1, and they can be used to categorize the self-labeled methods proposed in the literature. The type of view, the number of learning algorithms, the number of classifiers and the addition mechanism constitute a set of properties that define each method. This section presents the taxonomy of self-labeled techniques based on these properties.

Figure 1 illustrates the categorization, following a hierarchy based on this order: type of view, number of learning algorithms, number of classifiers and addition mechanism. Considering this figure and the year of publication of each analyzed method, some interesting observations about existing and nonexisting proposals can be made:

• The number of single-classifier methods is smaller than that of multi-classifier methods. They constitute four of the 15 methods proposed in the literature. Although these methods may obtain great results, they do not have a refined confidence prediction mechanism because, in general, they are limited to extracting the most confident predictions from the learner used. Nevertheless, two of the most recent approaches belong to this category.

• Amending models appeared a few years ago. Most of the recent research efforts are focused on these kinds of models because they have reported a great synergy with the iterative scheme presented in self-labeled methods. In different ways, they remove those instances that are harmful to the classification task in order to alleviate the main drawback of the incremental addition mechanism.

• Only two multi-learning approaches have been proposed for self-labeled algorithms, and the most recent was published in 2004. In our opinion, more research is required in this area. For instance, there is no amending model that avoids introducing noisy instances into the enlarged labeled set. Similarly, no amending approaches have been designed for the family of methods that uses multiple views.

• Standard co-training has been widely used in many real applications [72]. However, there is a reduced number of advanced multi-view approaches which, for example, use multi-learning or amending addition schemes.

Fig. 1 Self-labeled techniques hierarchy

The properties studied here can help us to understand how self-labeled algorithms work. In the following sections, we will establish which methods perform best for each family, considering several performance metrics within a wide experimental framework.

4 Experimental framework

This section describes all the properties and issues related to the experimental framework followed in this paper. We provide the measures used to observe differences in the performance of the algorithms (Sect. 4.1), the main characteristics of the problems used (Sect. 4.2), the parameters of the algorithms and the base classifiers used (Sect. 4.3) and, finally, a brief description of the nonparametric statistical tests used to contrast the results obtained (Sect. 4.4).

4.1 Performance measures

Two performance measures are commonly applied because of their simplicity and successful application when multi-class classification problems are dealt with. As with standard classification methods, the performance of SSC algorithms can be measured in terms of accuracy [1] and Cohen's kappa rate [73]. They are briefly explained as follows:

• Accuracy: It is the number of successful hits (correct classifications) relative to the total number of classifications. It has been by far the most commonly used metric for assessing the performance of classifiers for years [2,74].
• Cohen's kappa (kappa rate): It evaluates the portion of hits that can be attributed to the classifier itself, excluding random hits, relative to all the classifications that cannot be attributed to chance alone. Given the observed agreement P_o (the accuracy) and the agreement expected by chance P_e, it is computed as κ = (P_o − P_e)/(1 − P_e). Cohen's kappa ranges from −1 (total disagreement) through 0 (random classification) to 1 (perfect agreement). For multi-class problems, kappa is a very useful, yet simple, meter for measuring a classifier's accuracy while compensating for random successes.

Both metrics will be adopted to measure the efficacy of self-labeled methods in the transductive and inductive phases.

4.2 Data sets

The experimentation is based on 55 standard classification data sets taken from the UCI repository [75] and the KEEL data set repository¹ [35]. Table 2 summarizes the properties of the selected data sets. It shows, for each data set, the number of examples (#Examples), the number of attributes (#Features) and the number of classes (#Classes). The data sets considered contain between 100 and 19,020 instances, the number of attributes ranges from 2 to 90, and the number of classes varies between 2 and 28.

These data sets have been partitioned using the tenfold cross-validation procedure; that is, each data set has been split into ten folds, each one containing 10 % of its examples. For each fold, an algorithm is trained with the examples contained in the rest of the folds (training partition) and then tested with the current fold. Test partitions are kept aside to evaluate the performance of the learned hypothesis.

Each training partition is divided into two parts: labeled and unlabeled examples. Following the recommendation established in [46], in the division process we do not maintain the class proportions in the labeled and unlabeled sets, since the main aim of SSC is to exploit unlabeled data for better classification results. Hence, we use a random selection of examples that will be marked as labeled instances, and the class labels of the remaining instances will be removed. We ensure that every class has at least one representative instance. In order to study the influence of the amount of labeled data, we take different ratios when dividing the training set. In our experiments, four ratios are used: 10, 20, 30 and 40 %. For instance, assuming a data set that contains 1,000 examples, when the labeled rate is 10 %, 100 examples are put into L with their labels, while the remaining 900 examples are put into U without their labels. In summary, this experimental study involves a total of 220 data sets (55 data sets × 4 labeled rates). A sketch of this division process is given at the end of this subsection.

Apart from these data sets, the best methods will also be tested on nine high-dimensional problems. These data sets have been extracted from the book of Chapelle [4]. To analyze transductive and inductive capabilities, they have also been partitioned using the methodology explained above, except for the number of labeled data: we use two splits of the training partitions, with 10 and 100 labeled examples, respectively. The remaining instances are marked as unlabeled points. Table 3 presents the main characteristics of these data sets. All the data sets created can be found on the Web site associated with this paper.²

¹ http://sci2s.ugr.es/keel/datasets.
² http://sci2s.ugr.es/SelfLabeled.
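The division process just described can be sketched as follows. This is our own illustrative code, not the KEEL implementation used in the experiments; split_labeled_unlabeled and its arguments are hypothetical names, and NumPy is assumed.

import numpy as np

def split_labeled_unlabeled(y_train, labeled_ratio, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y_train)
    # One guaranteed labeled representative per class ...
    labeled = [rng.choice(np.where(y_train == c)[0]) for c in np.unique(y_train)]
    # ... then fill up to the requested ratio at random.
    rest = np.setdiff1d(np.arange(n), labeled)
    n_extra = max(int(labeled_ratio * n) - len(labeled), 0)
    labeled = np.concatenate([labeled, rng.choice(rest, n_extra, replace=False)])
    unlabeled = np.setdiff1d(np.arange(n), labeled)  # class labels removed here
    return labeled.astype(int), unlabeled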
Table 2 Summary description of the original data sets

Data set          #Examples  #Features  #Classes
abalone           4,174      8          28
appendicitis      106        7          2
australian        690        14         2
autos             205        25         6
banana            5,300      2          2
breast            286        9          2
bupa              345        6          2
chess             3,196      36         2
cleveland         297        13         5
coil2000          9,822      85         2
contraceptive     1,473      9          3
crx               125        15         2
dermatology       366        33         6
ecoli             336        7          8
flare-solar       1,066      9          2
german            1,000      20         2
glass             214        9          7
haberman          306        3          2
heart             270        13         2
hepatitis         155        19         2
housevotes        435        16         2
iris              150        4          3
led7digit         500        7          10
lymphography      148        18         4
magic             19,020     10         2
mammographic      961        5          2
marketing         8,993      13         9
monks             432        6          2
movement_libras   360        90         15
mushroom          8,124      22         2
nursery           12,690     8          5
pageblocks        5,472      10         5
penbased          10,992     16         10
phoneme           5,404      5          2
pima              768        8          2
ring              7,400      20         2
saheart           462        9          2
satimage          6,435      36         7
segment           2,310      19         7
sonar             208        60         2
spambase          4,597      55         2
spectheart        267        44         2
splice            3,190      60         3
tae               151        5          3
texture           5,500      40         11
tic-tac-toe       958        9          2
thyroid           7,200      21         3
titanic           2,201      3          2
twonorm           7,400      20         2
vehicle           846        18         4
vowel             990        13         11
wine              178        13         3
wisconsin         683        9          2
yeast             1,484      8          10
zoo               101        17         7

Table 3 Summary description of high-dimensional data sets

Data set  #Examples  #Features  #Classes
bci       400        117        2
coil      1,500      241        6
coil2     1,500      241        2
digit1    1,500      241        2
g241c     1,500      241        2
g241n     1,500      241        2
secstr    83,679     315        2
text      1,500      11,960     2
usps      1,500      241        2

4.3 Parameters and base classifiers

In this subsection we show the configuration parameters of all the methods used in this study. The selected values are common to all problems, and they were chosen according to the recommendations of the corresponding authors of each algorithm; these are also the default parameter settings included in the KEEL software [26], which we used to develop our experiments. The approaches analyzed should be as general and as flexible as possible. A good choice of parameters facilitates better performance over different data sources, but their operation should allow good enough results to be obtained even when the parameters are not optimized for a specific data set. This is the main purpose of this experimental survey: to show the generalization in performance of each self-labeled technique. The configuration parameters of all the methods are specified in Table 4.

Some of the self-labeled methods have been designed with one or more specific base classifier(s). In this study, these algorithms maintain their original classifier(s). However, the interchange of the base classifier is allowed in the other approaches, specifically: Self-Training, Co-Training, TriTraining, DE-TriTraining, Rasco and Rel-Rasco. In this study, we select four classic and well-known classifiers in order to find differences in performance among these self-labeled methods: K-nearest neighbor, C4.5, naive Bayes and SVM. All of the selected base classifiers were considered among the ten most influential data mining algorithms in [76]. A brief description of the base classifiers and their associated confidence prediction computation follows:

• K-nearest neighbor (KNN): This is one of the simplest and most effective methods based on dissimilarities among a set of instances. It belongs to the lazy learning family of methods [77], which do not build a model during the learning process. With this method, confidence predictions can be approximated by the distance to the currently labeled set.

• C4.5: This is a well-known decision tree algorithm [32].
It induces classification rules in the form of decision trees for a given training set. The decision tree is built with a top-down scheme, using the normalized information gain (difference in entropy) that results from choosing an attribute to split the data. The attribute with the highest normalized information gain is the one used to make the decision. Confidence predictions are obtained from the accuracy of the leaf that makes the prediction, where the accuracy of a leaf is the percentage of correctly classified training examples among the training instances it covers.

• Naive Bayes (NB): Its aim is to construct a rule that allows us to assign future objects to a class, assuming independence of the attributes when probabilities are estimated. For continuous data, we follow the typical assumption that the continuous values associated with each class follow a Gaussian distribution [78]. The extraction of probabilities is straightforward, as this method explicitly computes the probability of belonging to each class for the given test instance.

• Support vector machines (SVM): It maps the original input space into a higher-dimensional feature space using a certain kernel function [79]. In the new feature space, the SVM algorithm searches for the optimal separating hyperplane with maximal margin, in order to minimize an upper bound of the expected risk instead of the empirical risk. Specifically, we use the SMO training algorithm, proposed in [80], to obtain the SVM base classifiers. Using a logistic model, we can take the probability estimate from SMO [80] as the confidence for the predicted class.

Table 4 Parameter specification for all the base learners and self-labeled methods used in the experimentation

KNN: number of neighbors = 3, Euclidean distance
C4.5: confidence level c = 0.25, minimum number of item-sets per leaf i = 2, prune after the tree building
NB: no parameters specified
SMO: C = 1.0, tolerance parameter = 0.001, epsilon = 1.0 × 10^−12, kernel type = polynomial, polynomial degree = 1, fit logistic models = true
Self-Training: MAX_ITER = 40
Co-Training: MAX_ITER = 40, initial unlabeled pool = 75
Democratic-Co: classifiers = 3NN, C4.5, NB
SETRED: MAX_ITER = 40, threshold = 0.1
TriTraining: no parameters specified
DE-TriTraining: number of neighbors k = 3, minimum number of neighbors = 2
CoForest: number of RandomForest classifiers = 6, threshold = 0.75
Rasco: MAX_ITER = 40, number of views/classifiers = 30
Co-Bagging: MAX_ITER = 40, committee members = 3, ensemble learning = Bagging, pool U = 100
Rel-Rasco: MAX_ITER = 40, number of views/classifiers = 30
CLCC: number of RandomForest classifiers = 6, threshold = 0.75, manipulative beta parameter = 0.4, initial number of clusters = 4, running frequency z = 10, best center sets = 6, optional step = true
APSSC: spread of the Gaussian = 0.3, evaporation coefficient = 0.7, MT = 0.7
SNNRCE: threshold = 0.5
ADE-CoForest: number of RandomForest classifiers = 6, threshold = 0.75, number of neighbors k = 3, minimum number of neighbors = 2
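As an illustration of how the confidence estimates described above drive self-labeling, the following sketch performs one incremental self-training step with a probabilistic base classifier. It assumes the scikit-learn interface (here NB, in the spirit of [78]); it is a simplified illustration with our own names, not the KEEL implementation evaluated in this study.

import numpy as np
from sklearn.naive_bayes import GaussianNB

def self_training_step(X_lab, y_lab, X_unlab, n_add=10):
    model = GaussianNB().fit(X_lab, y_lab)
    proba = model.predict_proba(X_unlab)
    conf = proba.max(axis=1)                  # confidence of the predicted class
    best = np.argsort(conf)[::-1][:n_add]     # most confident predictions
    y_new = model.classes_[proba[best].argmax(axis=1)]
    # Incremental addition mechanism: move the selected points, with their
    # predicted labels, into the enlarged labeled set.
    X_lab = np.vstack([X_lab, X_unlab[best]])
    y_lab = np.concatenate([y_lab, y_new])
    X_unlab = np.delete(X_unlab, best, axis=0)
    return X_lab, y_lab, X_unlab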
4.4 Statistical test for performance comparison

Statistical analyses are highly recommended in the field of data mining to find significant differences between the results obtained by the studied methods. We consider the use of nonparametric tests, according to the recommendations made in [27,81], where a set of simple, safe and robust nonparametric tests for statistical comparisons of classifiers is presented. In these studies, the use of nonparametric tests is preferred to parametric ones, since the initial conditions that guarantee the reliability of the latter may not be satisfied, causing the statistical analysis to lose credibility.

The Wilcoxon signed-ranks test [28,82] will be adopted to conduct pairwise comparisons between all the methods considered in this study. Considering the ratio of the number of data sets to the number of methods compared throughout this paper, we fix the significance level α = 0.1 for all comparisons. Furthermore, we will use the Friedman test [83] in the global analysis to analyze differences between the methods considered outstanding with a multiple comparison analysis. The Bergmann–Hommel procedure [84], highlighted as the most powerful post hoc test in [81], is applied to find out which algorithms are distinctive in n × n comparisons. Any interested reader can find more information about these tests at http://sci2s.ugr.es/sicidm/, together with software for applying them.

4.5 Other considerations

We want to stress that the implementations are based only on the descriptions and specifications given by the respective authors in their papers. No advanced data structures or enhancements for improving the suitability of self-labeled methods have been applied.

5 Analysis of results

This section presents the average results collected in the experimental study and some discussion of them. Due to the extent of the experimental analysis carried out, we report the complete tables of results on the Web page associated with this paper (see footnote 2). This study is divided into two parts: analysis of the results obtained in transductive learning (Sect. 5.1) and in inductive learning (Sect. 5.2), considering different ratios of labeled data. A global discussion and the identification of outstanding methods are added in Sect. 5.3. Some representative methods will also be tested on high-dimensional data sets with a small labeled ratio (Sect. 5.4). Finally, a comparison with the supervised learning paradigm is performed in Sect. 5.5.

5.1 Transductive results

As claimed before, the main objective of transductive learning is to predict the true class labels of the unlabeled data used to train. Hence, a good exploitation of unlabeled data can lead to successful results. Within the framework of transductive learning, we have analyzed which are the best or most appropriate proposals attending to their characteristics, as explained before.

Table 5 presents the average accuracy results obtained in the transductive phase. Specifically, it shows the overall results of the analyzed algorithms over the 55 data sets with 10, 20, 30 and 40 % of labeled data, respectively. For those methods that work with different classifiers, we have tested various base classifiers, specified between brackets. Acc presents the average accuracy obtained. The algorithms are ordered from the best to the worst accuracy obtained. Even though kappa results are not reported in the paper, we want to check the gain (or loss) of each method in the established ranking of algorithms (in terms of the accuracy measure) when the kappa measure is taken into account. For this purpose, K shows the oscillation of each method in the classification order established with accuracy with respect to the position obtained with kappa.
This information reveals whether or not a certain algorithm benefits from random hits in comparison with the rest of the methods. Complete kappa results can be found on the associated Web site. Furthermore, in this table, we highlight those methods whose performance is within 5 % of the range between the best and the worst method, that is, above value_best − 0.05 · (value_best − value_worst). We use boldface for the accuracy measure and italics for kappa. These should be considered as a set of outstanding methods, regardless of their specific position in the table.

Table 5 Average results obtained by self-labeled methods in the transductive phase

Figure 2 depicts a star plot representing the average transductive accuracy obtained by each method for the four labeled ratios considered. This star plot presents the performance as the distance from the center; therefore, a larger area indicates a better average performance. This illustration allows us to easily visualize the average performance of the algorithms comparatively, for each labeled ratio and in general. Figure 3 presents the same results in a bar chart, aiming to compare the specific accuracy values.

Apart from the average results, we use the Wilcoxon test to statistically compare self-labeled methods at the different labeled ratios. Table 6 summarizes all possible comparisons involved in the Wilcoxon test between all the methods considered, using accuracy results. Again, kappa results and the individual comparisons are exhibited on the aforementioned Web site, where a detailed report of statistical results can be found. This table presents, for each method in the rows, the number of self-labeled methods outperformed using the Wilcoxon test under the column represented by the "+" symbol. The column with the "±" symbol indicates the number of wins and ties obtained by the method in the row. The maximum value in each column is highlighted in bold.

Fig. 2 Labeled ratio comparison (star plot): transductive phase
Fig. 3 Labeled ratio comparison (bar chart): transductive phase
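The pairwise comparisons summarized in Table 6 follow the scheme below: one Wilcoxon signed-ranks test per pair of methods over the per-data-set accuracies, at α = 0.1. The sketch assumes SciPy; the function and array names are our own.

from scipy.stats import wilcoxon

def significantly_better(acc_a, acc_b, alpha=0.1):
    # acc_a, acc_b: one accuracy value per data set for methods A and B.
    stat, p = wilcoxon(acc_a, acc_b)
    return p < alpha   # True if the paired difference is significant at alpha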
Table 6 Wilcoxon test summary results: transductive accuracy (+ / ± at 10, 20, 30 and 40 % labeled ratios)

Method                  10 %     20 %     30 %     40 %
                        +   ±    +   ±    +   ±    +   ±
Self-Training (KNN)     13  31   12  31   10  26   10  21
Self-Training (C4.5)    17  31   18  31   19  30   18  31
Self-Training (NB)      4   10   8   11   2   10   0   10
Self-Training (SMO)     9   29   11  30   14  33   17  33
Co-Training (KNN)       7   22   9   18   5   19   3   19
Co-Training (C4.5)      13  30   17  31   21  32   21  33
Co-Training (NB)        9   25   10  25   4   22   6   19
Co-Training (SMO)       12  31   15  34   26  34   24  34
Democratic-Co           27  34   28  34   27  34   26  34
SETRED                  17  33   15  32   12  29   11  29
TriTraining (KNN)       15  31   12  28   8   22   6   19
TriTraining (C4.5)      25  34   26  34   28  34   27  34
TriTraining (NB)        10  28   13  28   10  27   8   23
TriTraining (SMO)       6   28   12  30   9   30   16  33
DE-TriTraining (KNN)    13  33   12  28   9   25   9   28
DE-TriTraining (C4.5)   14  30   14  28   15  28   15  30
DE-TriTraining (NB)     9   21   10  19   4   20   4   22
DE-TriTraining (SMO)    12  31   14  31   15  30   13  31
CoForest                21  34   20  34   21  34   24  34
Rasco (KNN)             0   1    0   1    0   2    0   6
Rasco (C4.5)            4   10   2   8    5   26   9   29
Rasco (NB)              4   14   4   8    3   14   1   10
Rasco (SMO)             2   3    2   6    2   13   4   17
Co-Bagging (KNN)        14  32   16  31   13  30   14  27
Co-Bagging (C4.5)       24  34   26  34   28  34   11  31
Co-Bagging (NB)         7   21   10  24   7   23   6   21
Co-Bagging (SMO)        8   31   13  33   14  30   20  33
Rel-Rasco (KNN)         0   1    0   1    0   2    0   6
Rel-Rasco (C4.5)        4   10   2   8    4   26   9   29
Rel-Rasco (NB)          5   13   4   8    3   14   1   10
Rel-Rasco (SMO)         2   4    2   6    2   10   2   14
CLCC                    3   19   2   9    0   8    0   4
APSSC                   4   19   9   16   3   15   1   14
SNNRCE                  22  34   19  33   17  33   19  33
ADE-CoForest            11  31   11  29   6   25   6   28

With the results presented in the above tables and graphics, we outline some comments related to the properties observed, pointing out the best-performing methods in terms of transductive capabilities:

• Considering the influence of the labeled ratio, Fig. 2 shows that, as could be expected, most of the algorithms obtain a regular increment in their transductive accuracy when the number of labeled data is increased. By contrast, we can observe how several methods are highly affected by the labeled ratio. For instance, multi-view methods obtain higher accuracy and kappa rankings when the labeled ratio is increased.

• In Table 5, we can also point out that two techniques are always at the top in accuracy and kappa rate independently of the labeled ratio: Democratic-Co and TriTraining (C4.5). They are also noteworthy as the methods that always obtain a transductive performance within 5 % of the range between the best and the worst method, in both accuracy and kappa measures. Moreover, Co-Training (SMO) and Co-Bagging (C4.5) are considered outstanding methods at most of the labeled ratios, mainly in the kappa measure. We can observe the validity of these statements in the Wilcoxon test, which confirms the averaged results obtained in accuracy and kappa rates.

• For those methods whose wrapper classifier can be set, we can find significant differences in their transductive behavior. In general, C4.5 offers the best transductive accuracy results in most of the techniques. In depth, classical self-training and co-training approaches obtain better results when C4.5 or SMO is used as the base classifier. Tri-training also works well with C4.5 and with KNN. The edited version of tri-training and Co-Bagging present a good synergy with C4.5. Finally, the use of Rasco and Rel-Rasco is more appropriate when NB or C4.5 is established as the base classifier. The classical co-training with SMO or C4.5 as base classifier can be stressed as the best-performing method of the multi-view family.
• Rasco and Rel-Rasco are based on the idea of using random feature subspaces (relevant or not) to construct different learned hypotheses. This idea performs well in several domains, as shown in the corresponding papers. Nevertheless, to deal with as wide a variety of data sets as in this experimental study, a more accurate feature selection methodology should be used to enhance their performance.

• Regarding single-view algorithms, several subfamilies deserve particular mention. In general, incremental approaches obtain the best results in accuracy and kappa rates. The results obtained by CoForest show it to be the best batch model; moreover, CoForest is at least statistically similar to the rest of the methods in transductive capabilities. Among amending approaches, we can highlight SNNRCE as an important method which is able to clearly outperform, in both measures, the Self-Training (KNN) method on which it is based.

• Usually, there is no significant difference (K) between the rankings obtained with accuracy and kappa rates, except for some specific algorithms. For example, we can observe that DE-TriTraining usually obtains a lower ranking with the kappa measure, which probably indicates that it benefits from random hits. In contrast, other algorithms, such as APSSC and Co-Training (SMO), improve their rankings when the kappa measure is used.

Furthermore, we perform an analysis of the results depending on the number of classes. On the Web site associated with this paper, we show the rankings obtained in accuracy and kappa for all the methods, differentiating between binary and multi-class data sets. Figure 4 displays a summary representation of this study. It shows, for each method and labeled ratio, the difference between the ranking obtained on binary problems and the ranking achieved on multi-class data sets. Hence, positive bars indicate that the method performs better on multi-class problems, and negative bars show that the method obtains a higher ranking over binary domains. We can analyze several details from the results collected, which are as follows:

Fig. 4 Differences between rankings in binary and multi-class domains: transductive phase

• When the transductive analysis is divided into binary and multi-class data sets, we find wide differences with respect to the ranking previously obtained. Democratic-Co and TriTraining (C4.5) continue to lead the ranking if we take into consideration only binary data sets. Nevertheless, on multi-class problems they are not the best-performing methods, although they maintain a good behavior.

• Single-classifier methods with an amending addition scheme, such as SETRED and SNNRCE, are noteworthy as the best-performing methods for dealing with multi-class problems.

• Many differences appear in this study depending on the base classifier. In contrast to the previous analysis, in which C4.5 was in most cases highlighted as the best base classifier, C4.5 now stands out when tackling binary data sets, whereas for multi-class data sets we can highlight the KNN rule as the most adequate base classifier for most of the self-labeled techniques. Specifically, Self-Training, Co-Training, Co-Bagging and Tri-Training with KNN obtain higher accuracy and kappa rates in comparison with the rest of the base classifiers.
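Before turning to the inductive results, the two evaluation settings used throughout this study can be summarized in code: transductive accuracy is measured on the unlabeled instances seen during training, inductive accuracy on the held-out test partition. This is a minimal sketch under scikit-learn conventions; evaluate is a hypothetical helper of our own.

from sklearn.metrics import accuracy_score

def evaluate(model, X_unlab, y_unlab_true, X_test, y_test):
    # Transductive: accuracy on the unlabeled instances used during training.
    transductive = accuracy_score(y_unlab_true, model.predict(X_unlab))
    # Inductive: accuracy on the held-out test partition.
    inductive = accuracy_score(y_test, model.predict(X_test))
    return transductive, inductive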
5.2 Inductive results (test phase)

In contrast to transductive learning, the aim of inductive learning is to classify unknown examples. In this way, inductive learning probes the generalization capabilities of the self-labeled methods, checking whether the previously learned hypotheses are appropriate or not. Table 7 shows the average results obtained, and Figs. 5 and 6 illustrate the comparison between labeled ratios with a star plot and a bar chart, respectively. Finally, a summary of the results of applying the Wilcoxon test to all the techniques on test data is presented in Table 8. These results allow us to highlight some differences in the generalization capabilities of the analyzed methods:

Table 7 Average results obtained by self-labeled methods in the inductive phase

• Some methods present clear differences when dealing with the inductive phase. For instance, the amending model SNNRCE obtains a lower generalization accuracy/kappa, and it is outperformed by the other member of its family, SETRED. It may suffer from an excessive elimination rule for candidate instances to be incorporated into the labeled set. On the other hand, classical self-training and co-training methods are shown to perform well in the inductive phase, and they are at the top in accuracy and kappa rate.

• TriTraining (C4.5) and Democratic-Co remain at the top of the rankings established by the accuracy measure, in conjunction with Co-Bagging (C4.5), which is a very competitive method in the test phase. In Table 7, we observe that the kappa ranking penalizes the Democratic-Co algorithm and considers other methods to be outstanding. For example, the classical Co-Training (SMO) is positioned as one of the most important methods when the kappa rate is taken into consideration. The Wilcoxon test supports these ideas, showing that, at most labeled ratios, TriTraining (C4.5) is the best proposal, to the detriment of Democratic-Co.

• It is worth mentioning that, in general, those methods based on C4.5 or SMO as base classifier(s) obtain a higher number of outperformed methods (+) and, consequently, a higher number of wins and ties (±) than in the transductive phase. Hence, these methods present good generalization capabilities, and their use is recommended for inductive tasks.

Fig. 5 Labeled ratio comparison (star plot): inductive phase
Fig. 6 Labeled ratio comparison (bar chart): inductive phase
Table 8 Wilcoxon test summary results: inductive accuracy (+ / ± at 10, 20, 30 and 40 % labeled ratios)

Method                  10 %     20 %     30 %     40 %
                        +   ±    +   ±    +   ±    +   ±
Self-Training (KNN)     11  30   11  25   9   28   6   24
Self-Training (C4.5)    19  32   22  32   21  32   23  33
Self-Training (NB)      2   11   8   13   1   9    1   9
Self-Training (SMO)     9   29   10  28   11  34   17  32
Co-Training (KNN)       6   26   10  23   4   22   5   21
Co-Training (C4.5)      11  30   23  32   21  33   22  34
Co-Training (NB)        10  27   10  25   4   24   3   23
Co-Training (SMO)       11  32   14  33   21  34   25  34
Democratic-Co           26  34   26  34   25  34   26  34
SETRED                  15  34   14  29   10  29   11  29
TriTraining (KNN)       13  34   10  23   3   20   2   16
TriTraining (C4.5)      28  34   31  34   24  34   27  34
TriTraining (NB)        10  29   11  28   9   28   4   20
TriTraining (SMO)       9   31   12  33   12  33   18  32
DE-TriTraining (KNN)    11  31   11  26   8   26   7   25
DE-TriTraining (C4.5)   12  29   13  29   11  28   12  28
DE-TriTraining (NB)     9   25   10  25   3   19   2   20
DE-TriTraining (SMO)    11  30   17  31   16  32   14  29
CoForest                19  34   18  34   20  34   20  34
Rasco (KNN)             0   1    0   1    0   5    0   11
Rasco (C4.5)            2   10   2   8    4   28   11  29
Rasco (NB)              3   12   2   7    4   17   2   15
Rasco (SMO)             2   9    2   8    3   21   3   23
Co-Bagging (KNN)        12  31   18  31   17  32   16  29
Co-Bagging (C4.5)       27  34   28  34   27  34   15  32
Co-Bagging (NB)         8   24   10  28   4   25   4   20
Co-Bagging (SMO)        7   31   16  33   15  33   24  34
Rel-Rasco (KNN)         0   1    0   1    0   3    0   10
Rel-Rasco (C4.5)        2   10   2   8    4   28   10  29
Rel-Rasco (NB)          3   14   3   8    4   17   1   14
Rel-Rasco (SMO)         2   10   2   8    3   21   3   22
CLCC                    2   20   2   9    0   2    0   2
APSSC                   2   20   9   16   2   17   1   15
SNNRCE                  16  34   11  27   5   23   6   22
ADE-CoForest            7   30   10  28   3   23   4   27

• In this phase, the ratio of labeled instances is a more relevant issue for the performance of all the algorithms. In comparison with the transductive results, there are greater differences between the results obtained for each method, comparing, as extreme cases, the 10 and 40 % ratios. As we can see in Fig. 5, the star plot shows abrupt changes, mainly from 10 to 20 % of labeled instances.

Aside from these tables, Fig. 7 collects box plot representations for each labeled ratio. In this figure, we select a subset of methods whose performances are of interest; in particular, the most promising alternatives based on the best base classifier choice from the previous study. Box plots have been shown to be an effective tool in data reporting because they allow the graphical representation of the performance of algorithms, indicating important characteristics such as the median, extreme values and spread of values about the median in the form of quartiles (Q1 and Q3).

Fig. 7 Box plot of inductive accuracy. a 10 % labeled ratio, b 20 % labeled ratio, c 30 % labeled ratio, d 40 % labeled ratio

• The box plots show which results are more robust in the sense that the boxes are more compact. In general, Self-Training and Co-Bagging present the smallest box plots at all the labeled ratios. By contrast, other methods such as CoForest and Rasco are less robust, as was previously reflected in the average results and the Wilcoxon test.

• Median results also help us to identify promising algorithms which perform well in many domains. It is interesting that Co-Bagging shows the best median value at most of the labeled ratios in comparison with Democratic-Co and TriTraining (C4.5), which were pointed out as the best-performing methods.

Fig. 8 Differences between rankings in binary and multi-class domains: inductive phase

Again, we differentiate between binary and multi-class data sets.
Figure 8 depicts a summary graphic; the Web site associated with this paper presents the complete results. Observing these results, we can make several comments:

• Over binary data sets, we find a significant increment in the ranking obtained by Rasco (C4.5) and Rel-Rasco (C4.5). In general, C4.5-based models lead the rankings established in accuracy and kappa rates.

• When only multi-class data sets are taken into consideration, as in the transductive phase, we also observe that TriTraining, SETRED and Self-Training based on KNN reach the top positions.

5.3 Global analysis

This section provides a global perspective on the obtained results. As a summary, we outline several remarks attending to the previous studies:

• In general, those self-labeled methods based on C4.5 or SMO have been shown to perform well on binary data sets. In contrast, the KNN rule is shown to work better on multi-class domains. Regarding the NB classifier, the continuous-distribution version used [78] has not reported competitive results in comparison with the rest of the classifiers. It would probably obtain better performance if a discretization process were used for continuous attributes [85,86].

• Regarding the type of data sets, we can also claim that single-classifier models usually outperform those based on multi-classifiers when dealing with multi-class problems. By contrast, multi-classifier methods show better behavior than single-classifier ones in tackling binary data sets.

• Eight self-labeled methods have been emphasized as outstanding according to the accuracy/kappa obtained, at least at one of the labeled ratios of the transductive or inductive phases: TriTraining (C4.5), TriTraining (SMO), Democratic-Co, Co-Bagging (C4.5), Co-Bagging (SMO), Co-Training (SMO), Self-Training (C4.5) and Self-Training (SMO).

Now, we focus our attention on these eight outstanding methods, performing a multiple comparison statistical test between them. To analyze the global capabilities of these algorithms independently of labeled ratios, the statistical test is conducted over all 220 data sets (55 data sets × 4 labeled rates). Table 9 presents the results of the Friedman test, carried out considering the accuracy results obtained in the transductive and inductive phases. The computed Friedman rankings represent the associated effectiveness of each method in both phases. Algorithms are ordered from the best (lowest) to the worst (highest) ranking.

Table 9 Average Friedman rankings of outstanding algorithms in the transductive and inductive phases

Transductive                        Inductive
TriTraining (C4.5)    3.9477        TriTraining (C4.5)    3.9455
Democratic-Co         4.1205        Democratic-Co         4.2295
Co-Bagging (C4.5)     4.2409        Co-Bagging (C4.5)     4.2727
Co-Training (SMO)     4.3409        Co-Training (SMO)     4.4159
Self-Training (C4.5)  4.6136        Self-Training (C4.5)  4.4727
Self-Training (SMO)   4.7636        TriTraining (SMO)     4.8023
Co-Bagging (SMO)      4.9432        Co-Bagging (SMO)      4.8955
TriTraining (SMO)     5.0295        Self-Training (SMO)   4.9659
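The rankings in Table 9 come from the Friedman test applied to the 220-row accuracy matrix (one row per data set and labeled rate, one column per method). A sketch with SciPy follows; note that SciPy provides the Friedman statistic but not the Bergmann–Hommel post hoc procedure, for which we refer to the software available at http://sci2s.ugr.es/sicidm/. The helper name is our own.

import numpy as np
from scipy.stats import friedmanchisquare, rankdata

def friedman_rankings(results):
    # results: (n_cases x n_methods) accuracy matrix.
    stat, p = friedmanchisquare(*results.T)              # one sample per method
    avg_ranks = rankdata(-results, axis=1).mean(axis=0)  # lower rank = better
    return stat, p, avg_ranks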
Table 10 provides information about the state of retainment or rejection of all the hypotheses, comparing the outstanding methods in the transductive and inductive phases. It shows the adjusted p value (APV) obtained with Bergmann–Hommel's procedure for the 28 established comparisons. The table is set out so that each row contains a hypothesis in which the first algorithm mentioned (left side of the comparison) outperforms the second one (right side). The hypotheses are ordered from the most to the least significant differences. The APVs highlighted in bold correspond to hypotheses whose left method outperforms the right method at the α = 0.1 level of significance.

Table 10 Multiple comparison test: Bergmann–Hommel's APVs

Figure 9 outlines two graphs for the transductive and inductive statistical results, respectively. The x-axis presents those methods that are not outperformed by any other algorithm. For each of them, the y-axis collects the methods that they outperform according to the Bergmann–Hommel test.

Fig. 9 Graphical comparison with Bergmann–Hommel's test. a Transductive results, b inductive results

Thus, observing the multiple comparison test and Fig. 9, we can highlight TriTraining (C4.5), Democratic-Co, Co-Bagging (C4.5) and Co-Training (SMO) as methods that appear on the x-axis in both Fig. 9a, b. Hence, they are not statistically outperformed by any algorithm. Self-Training (C4.5) is also highlighted as a non-outperformed method in the inductive phase.

5.4 Experiments on high-dimensional data sets with a small labeled ratio

The aim of this section is to check the behavior of self-labeled methods when they deal with high-dimensional data and a reduced labeled ratio. To do this, we focus our attention on four of the best methods highlighted above: Democratic-Co, TriTraining (C4.5), Co-Bagging (C4.5) and Co-Training (SMO). The data sets used were provided by Chapelle in [4], and their main characteristics were described in Sect. 4.2.

To illustrate the complexity of these problems, Fig. 10 depicts an example of one partition of the g241n problem. This graph presents a two-dimensional projection (obtained with PCA [87]) of the problem and of the 10 and 100 labeled data points used with the self-labeled methods.

Fig. 10 Two-dimensional projections of g241n. Red crosses class +1, blue circles class −1. a g241n problem, b g241n: 10 labeled data, c g241n: 100 labeled data (color figure online)
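A projection like that of Fig. 10 can be reproduced along the following lines, assuming scikit-learn and matplotlib; plot_projection is our own illustrative helper, and y is assumed to be numerically coded.

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_projection(X, y, labeled_idx):
    # Project the whole problem onto its first two principal components.
    X2 = PCA(n_components=2).fit_transform(X)
    plt.scatter(X2[:, 0], X2[:, 1], c=y, alpha=0.3)       # all instances
    plt.scatter(X2[labeled_idx, 0], X2[labeled_idx, 1],   # labeled subset
                c=y[labeled_idx], edgecolors='black')
    plt.show()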
Tables 11 and 12 show the accuracy results obtained in the transductive and inductive phases with 10 and 100 labeled data, respectively. To measure the goodness of self-labeled techniques on this kind of data, we compare their results with those obtained by the base classifiers. Therefore, C4.5, KNN, SMO and NB have been trained with the available labeled examples to predict the class of the rest of the unlabeled ones. Note that the information used by these techniques corresponds to the initial stage of all the self-labeled schemes. However, it is also known that, depending on the problem, unlabeled data can lead to worse performance [1]; hence, the inclusion of these baselines shows whether self-labeled techniques are appropriate for these high-dimensional problems. The baseline results are presented in Tables 13 and 14.

Table 11 High-dimensional data sets: self-labeled performance with 10 labeled data (TRS = transductive accuracy, TST = test accuracy)

Data set  Democratic-Co    TriTraining (C4.5)  Co-Bagging (C4.5)  Co-Training (SMO)
          TRS     TST      TRS     TST         TRS     TST        TRS     TST
bci       0.5014  0.4875   0.5134  0.5050      0.5043  0.5100     0.5146  0.5250
coil      0.8328  0.8333   0.7864  0.7753      0.7792  0.7727     0.8239  0.8300
coil2     0.8050  0.8027   0.6897  0.7040      0.7139  0.7173     0.7601  0.7673
digit1    0.5945  0.6020   0.5337  0.5273      0.5261  0.5113     0.7372  0.7520
g241c     0.5290  0.5220   0.5169  0.5040      0.5213  0.5007     0.5925  0.6067
g241n     0.5043  0.5020   0.5029  0.5033      0.4990  0.4993     0.5340  0.5253
secstr    0.5719  0.5718   0.5097  0.5085      0.5298  0.5283     0.5281  0.5141
text      0.5604  0.5533   0.5272  0.5167      0.5190  0.5180     0.4993  0.5000
usps      0.8050  0.8027   0.6897  0.7040      0.7139  0.7173     0.7601  0.7673
Average   0.6338  0.6308   0.5855  0.5831      0.5896  0.5861     0.6389  0.6431

Table 12 High-dimensional data sets: self-labeled performance with 100 labeled data

Data set  Democratic-Co    TriTraining (C4.5)  Co-Bagging (C4.5)  Co-Training (SMO)
          TRS     TST      TRS     TST         TRS     TST        TRS     TST
bci       0.5027  0.5450   0.5588  0.5625      0.5604  0.5550     0.6573  0.6500
coil      0.8635  0.8773   0.8372  0.8393      0.8439  0.8480     0.9110  0.9047
coil2     0.8557  0.8413   0.7972  0.8033      0.8064  0.8333     0.8063  0.7927
digit1    0.9370  0.9347   0.8208  0.8600      0.8072  0.8127     0.9158  0.9173
g241c     0.6033  0.5213   0.5689  0.5413      0.5685  0.5660     0.7334  0.7453
g241n     0.5420  0.5053   0.5792  0.6067      0.5696  0.5733     0.7320  0.7313
secstr    0.5917  0.5915   0.5436  0.5421      0.5573  0.5500     0.5476  0.5515
text      0.6608  0.6667   0.6728  0.6800      0.6920  0.7333     0.6797  0.6735
usps      0.8557  0.8413   0.8184  0.8000      0.7904  0.7913     0.8063  0.7927
Average   0.7125  0.7027   0.6885  0.6928      0.6884  0.6959     0.7544  0.7510

In these tables, we can appreciate that self-labeled techniques do not fit adequately to these kinds of problems. They are not able to significantly outperform the baseline techniques, which in some cases achieve better performance. When only ten labeled points are used, the base classifiers perform equal to or better than most of the self-labeled techniques. If the number of labeled points is increased to 100, we observe that some self-labeled techniques, such as TriTraining (C4.5) and Co-Bagging (C4.5), perform better than their base classifier. However, they are not really competitive with the results obtained by KNN or SMO with 100 labeled points.

Figure 11 illustrates an example of the evolution of the transductive and inductive accuracy of Democratic-Co during the self-labeling process. With a self-labeling approach, it is expected that, as the iterations go by, the accuracy should increase.
Table 13 High-dimensional data sets: baseline performance with 10 labeled data

Data set  C4.5             KNN              SMO              NB
          TRS     TST      TRS     TST      TRS     TST      TRS     TST
bci       0.5189  0.5175   0.4889  0.4725   0.5194  0.5000   0.4966  0.5000
coil      0.7945  0.7913   0.7938  0.7900   0.8398  0.8413   0.6921  0.6913
coil2     0.6997  0.7107   0.7767  0.7867   0.7794  0.7867   0.7676  0.7673
digit1    0.5353  0.5220   0.7738  0.7773   0.6889  0.6673   0.6748  0.6787
g241c     0.5175  0.4973   0.5466  0.5653   0.6020  0.6140   0.5446  0.5587
g241n     0.5160  0.5187   0.5431  0.5333   0.5091  0.5060   0.5048  0.5000
secstr    0.5209  0.5215   0.5121  0.5127   0.5155  0.5155   0.5232  0.5220
text      0.5034  0.5064   0.5143  0.5102   0.5201  0.5167   0.4936  0.4986
usps      0.6997  0.7107   0.7767  0.7867   0.7794  0.7867   0.7676  0.7673
Average   0.5895  0.5885   0.6362  0.6372   0.6393  0.6371   0.6072  0.6093

Table 14 High-dimensional data sets: baseline performance with 100 labeled data

Data set  C4.5             KNN              SMO              NB
          TRS     TST      TRS     TST      TRS     TST      TRS     TST
bci       0.5569  0.5525   0.5204  0.5600   0.6581  0.6500   0.5212  0.5275
coil      0.8226  0.8220   0.9422  0.9387   0.9182  0.9113   0.7684  0.7653
coil2     0.7762  0.7860   0.9245  0.9160   0.8386  0.8300   0.8624  0.8593
digit1    0.7726  0.7800   0.9361  0.9373   0.9146  0.9080   0.9365  0.9453
g241c     0.5446  0.5433   0.5919  0.5973   0.7405  0.7533   0.7202  0.7300
g241n     0.5377  0.5267   0.6286  0.6380   0.7371  0.7400   0.6877  0.6780
secstr    0.5298  0.5284   0.5156  0.5152   0.5240  0.5257   0.5340  0.5346
text      0.5054  0.5049   0.5064  0.5177   0.5196  0.5206   0.5003  0.4945
usps      0.7762  0.7860   0.9245  0.9160   0.8386  0.8300   0.8624  0.8593
Average   0.6469  0.6478   0.7211  0.7262   0.7433  0.7410   0.7103  0.7104

Fig. 11 Transductive (TRS) and inductive (TST) accuracy of Democratic-Co in the g241n problem with 10 and 100 labeled data

Nevertheless, in these problems we find that this expectation is not satisfied in most cases. In the plot, we can see that in intermediate iterations the accuracy deteriorates. This means that the estimation of the most confident examples is erroneous and much harder to obtain in these domains. Therefore, these results exemplify the difficulty of these problems when a very reduced number of labeled data is used. In our opinion, more research is required to give self-labeled techniques the ability to deal with high-dimensional problems with a very reduced labeled ratio.

5.5 How far removed is semi-supervised learning from the traditional supervised learning paradigm?

This section is devoted to checking how far removed the accuracy obtained with self-labeled methods in the SSL context is from that of supervised learning. It is clear that SSL implies a more complex problem than standard supervised learning: in SSL, algorithms are provided with a smaller number of labeled examples from which to learn a correct hypothesis. Specifically, the performance obtained with self-labeled methods is theoretically upper-bounded by that of the traditional supervised learning algorithms used as base classifiers. To contrast this idea, we compare the inductive (test) results obtained with the best methods highlighted in the previous section against the C4.5 and SMO classifiers. These classifiers are trained with completely labeled training sets, using the same tenfold cross-validation scheme to compute the test accuracy results. The complete results of this study are available on the associated Web site.

Fig. 12 Differences in average test results between outstanding models and C4.5 and SMO
Figure 12 draws a graphical comparison between the outstanding inductive methods and the C4.5 and SMO classifiers. For each self-labeled method, the average result obtained is shown at each labeled ratio. The average result of C4.5 and SMO is represented as a line y = average result, to show the differences between the self-labeled methods and these classifiers. As we can observe in this figure, with a reduced number of labeled examples (10 %), self-labeled techniques are far removed from the results obtained with base classifiers that use a completely labeled training set. Although an increment in the labeled ratio does not produce a proportional increase in the performance obtained with self-labeled techniques, it indicates that from a 20 % labeled ratio onward, they offer an acceptable classification performance. As an extreme case, Co-Training (SMO) does not perform well with 10 % of labeled data, yet it shows a great improvement when the labeled ratio is augmented, approaching the SMO classifier trained with all labeled training examples.

6 Concluding remarks and global guidelines

The present paper provides a complete overview of the self-labeled methods proposed in the literature. We have analyzed the basic and advanced features presented in them. Furthermore, existing and related work have also been reviewed. Based on the main properties studied, we have proposed a taxonomy of self-labeled methods. The most important methods have been empirically analyzed in both transductive and inductive settings. In order to strengthen this experimental study, we have conducted statistical analyses based on nonparametric tests, which help us to characterize the capabilities of each method and support the conclusions drawn. Several remarks can be made and guidelines suggested:

• This paper helps nonexperts in self-labeled methods to differentiate between them, to make an appropriate decision about their application and to understand their behavior.

• A researcher who needs to apply a self-labeled method should know the main characteristics of these kinds of methods in order to choose the most suitable one, depending on the type of problem. The taxonomy proposed and the empirical study can help a researcher to make this decision.

• It is important to know the main advantages of each self-labeled method. In this paper, many methods have been empirically analyzed, but a specific conclusion cannot be drawn regarding the best-performing method. This choice depends on the problem tackled, but the results offered in this paper could help to reduce the set of candidates.

• SSC is a growing field, and more research studies should be conducted. In this paper, several guidelines about unexplored and promising families have been described.

• When proposing a new self-labeled method, rigorous analyses should be considered, comparing it with the most well-known approaches and with those that fit the basic properties of the new proposal, in terms of both transductive and inductive learning. To do this, the taxonomy and the proposed experimental framework can help guide a future proposal toward the correct method.

• The empirical study allows us to highlight several methods from among the whole set. In both transductive and inductive settings, TriTraining (C4.5), Democratic-Co, Co-Bagging (C4.5) and Co-Training (SMO) are shown to be the best-performing methods. Furthermore, in the inductive phase, the classical Self-Training with C4.5 as base classifier is also remarkable as an outstanding method.
• The experiments conducted with high-dimensional data sets and a very reduced labeled ratio show that much more work is needed in the field of self-labeled techniques to deal with these problems.

• The developed software (see "Appendix") allows the reader to reproduce the experiments carried out and to use it as an SSL framework to implement new methods. It could be a useful tool for conducting experimental analyses in an easier and more effective way.

Acknowledgments This work is supported by the Research Projects TIN2011-28488, TIC-6858 and P11-TIC-7765.

7 Appendix

As a consequence of this work, we have developed a complete SSL framework which has been integrated into the Knowledge Extraction based on Evolutionary Learning (KEEL) tool³ [26]. This research tool is an open-source software, written in Java, that supports data management and the design of experiments. Until now, KEEL has paid special attention to the implementation of supervised and unsupervised learning, clustering, pattern mining and so on. Nevertheless, it did not offer support for SSL, so we have integrated a new SSL module into this software. The main characteristics of this module are as follows:

• All the data sets involved in the experimental study have been included in this module and can be used for new experiments. These data sets are composed of three files for each partition: training, transductive and test. The first is composed of labeled and unlabeled instances (marked as "unlabeled"), the transductive partition contains the real class of the unlabeled instances, and the last collects the test instances. These data sets are included in the KEEL data set repository and are static, ensuring that further experiments carried out will no longer be dependent on particular data partitions.

• It allows the design of SSL experiments which generate all the XML scripts and a JAR program for running them, creating a zip file for an off-line run. The SSL module is designed for experiments containing multiple data sets and algorithms connected among themselves to obtain the desired experimental setup. The parameter configuration of the methods is also customizable, as well as the number of executions, validation scheme and so on. Figure 13 shows a snapshot of an experiment with three analyzed self-labeled methods and the customization of the parameters of the APSSC algorithm. Note that every method can also be executed apart from the KEEL tool with an appropriate configuration file.

• Special care has been taken to allow a researcher to use this module to assess the relative effectiveness of his or her own procedures. Guidelines on how to integrate a method into KEEL can be found in [35].

The KEEL version with the SSL module is available on the associated Web site.

Fig. 13 A snapshot of the semi-supervised learning module for KEEL

³ http://www.keel.es.

References

1. Zhu X, Goldberg AB (2009) Introduction to semi-supervised learning, 1st edn. Morgan and Claypool, San Rafael, CA
2. Witten IH, Frank E, Hall MA (2011) Data mining: practical machine learning tools and techniques, 3rd edn. Morgan Kaufmann, San Francisco
3. Zhu Y, Yu J, Jing L (2013) A novel semi-supervised learning framework with simultaneous text representing. Knowl Inf Syst 34(3):547–562
4. Chapelle O, Schlkopf B, Zien A (2006) Semi-supervised learning, 1st edn. The MIT Press, Cambridge, MA
5. Pedrycz W (1985) Algorithms of fuzzy clustering with partial supervision. Pattern Recognit Lett 3:13–20
6. Zhao W, He Q, Ma H, Shi Z (2012) Effective semi-supervised document clustering via active learning with instance-level constraints. Knowl Inf Syst 30(3):569–587
7. Chen K, Wang S (2011) Semi-supervised learning via regularized boosting working on multiple semi-supervised assumptions. IEEE Trans Pattern Anal Mach Intell 33(1):129–143
8. Fujino A, Ueda N, Saito K (2008) Semisupervised learning for a hybrid generative/discriminative classifier based on the maximum entropy principle. IEEE Trans Pattern Anal Mach Intell 30(3):424–437
9. Joachims T (1999) Transductive inference for text classification using support vector machines. In: Proceedings of the 16th international conference on machine learning. Morgan Kaufmann, pp 200–209
10. Blum A, Chawla S (2001) Learning from labeled and unlabeled data using graph mincuts. In: Proceedings of the eighteenth international conference on machine learning, pp 19–26
11. Wang J, Jebara T, Chang S-F (2013) Semi-supervised learning using greedy max-cut. J Mach Learn Res 14(1):771–800
12. Mallapragada PK, Jin R, Jain A, Liu Y (2009) Semiboost: boosting for semi-supervised learning. IEEE Trans Pattern Anal Mach Intell 31(11):2000–2014
13. Yarowsky D (1995) Unsupervised word sense disambiguation rivaling supervised methods. In: Proceedings of the 33rd annual meeting of the association for computational linguistics, pp 189–196
14. Li M, Zhou ZH (2005) SETRED: self-training with editing. In: Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics), vol 3518 LNAI, pp 611–621
15. Blum A, Mitchell T (1998) Combining labeled and unlabeled data with co-training. In: Proceedings of the annual ACM conference on computational learning theory, pp 92–100
16. Du J, Ling CX, Zhou ZH (2010) When does co-training work in real data? IEEE Trans Knowl Data Eng 23(5):788–799
17. Sun S, Jin F (2011) Robust co-training. Int J Pattern Recognit Artif Intell 25(07):1113–1126
18. Jiang Z, Zhang S, Zeng J (2013) A hybrid generative/discriminative method for semi-supervised classification. Knowl-Based Syst 37:137–145
19. Sun S (2013) A survey of multi-view machine learning. Neural Comput Appl 23(7–8):2031–2038
20. Zhou ZH, Li M (2005) Tri-training: exploiting unlabeled data using three classifiers. IEEE Trans Knowl Data Eng 17:1529–1541
21. Li M, Zhou ZH (2007) Improve computer-aided diagnosis with machine learning techniques using undiagnosed samples. IEEE Trans Syst Man Cybern A Syst Hum 37(6):1088–1098
22. Sun S, Shawe-Taylor J (2010) Sparse semi-supervised learning using conjugate functions. J Mach Learn Res 11:2423–2455
23. Zhu X (2005) Semi-supervised learning literature survey. Technical report 1530, Computer Sciences, University of Wisconsin-Madison
24. Chawla N, Karakoulas G (2005) Learning from labeled and unlabeled data: an empirical study across techniques and domains. J Artif Intell Res 23:331–366
25. Zhou Z-H, Li M (2010) Semi-supervised learning by disagreement. Knowl Inf Syst 24(3):415–439
26. Alcalá-Fdez J, Sánchez L, García S, del Jesus MJ, Ventura S, Garrell JM, Otero J, Romero C, Bacardit J, Rivas VM, Fernández JC, Herrera F (2009) KEEL: a software tool to assess evolutionary algorithms for data mining problems. Soft Comput 13(3):307–318
27. Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
28.
García S, Fernández A, Luengo J, Herrera F (2010) Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power. Inf Sci 180:2044–2064 29. Triguero I, Sáez JA, Luengo J, García S, Herrera F (2013) On the characterization of noise filters for self-training semi-supervised in nearest neighbor classification, Neurocomputing (in press) 30. Cover TM, Hart PE (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory 13(1):21–27 31. Dasgupta S, Littman ML, McAllester DA (2001) Pac generalization bounds for co-training. In: Dietterich TG, Becker S, Ghahramani Z (eds) Advances in neural information processing systems. Neural information processing systems: natural and synthetic, vol 14. MIT Press, Cambridge, pp 375–382 32. Quinlan JR (1993) C4.5 programs for machine learning. Morgan Kaufmann Publishers, San Francisco, CA 33. Efron B, Tibshirani RJ (1993) An Introduction to the bootstrap. Chapman & Hall, New York 34. Goldman S, Zhou Y (2000) Enhancing supervised learning with unlabeled data. In: Proceedings of the 17th international conference on machine learning. Morgan Kaufmann, pp 327–334 35. Alcalá-Fdez J, Fernandez A, Luengo J, Derrac J, García S, Sánchez L, Herrera F (2011) KEEL datamining software tool: data set repository, integration of algorithms and experimental analysis framework. J Multiple-Valued Logic Soft Comput 17(2–3):255–277 36. Bennett K, Demiriz A, Maclin R (2002) Exploiting unlabeled data in ensemble methods. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, pp 289–296 37. Zhou Y, Goldman S (2004) Democratic co-learning. In: IEEE international conference on tools with artificial intelligence, pp 594–602 38. Deng C, Guo M (2006) Tri-training and data editing based semi-supervised clustering algorithm. In: Gelbukh A, Reyes-Garcia C (eds) MICAI 2006: advances in artificial intelligence, vol 4293 of lecture notes in computer science. Springer, Berlin, pp 641–651 39. Wang J, Luo S, Zeng X (2008) A random subspace method for co-training. In: IEEE international joint conference on computational intelligence, pp 195–200 40. Hady M, Schwenker F (2008) Co-training by committee: a new semi-supervised learning framework. In: IEEE international conference on data mining workshops, ICDMW ’08, pp 563–572 41. Hady M, Schwenker F (2010) Combining committee-based semi-supervised learning and active learning. J Comput Sci Technol 25:681–698 42. Hady M, Schwenker F, Palm G (2010) Semi-supervised learning for tree-structured ensembles of rbf networks with co-training. Neural Netw 23:497–509 43. Yaslan Y, Cataltepe Z (2010) Co-training with relevant random subspaces. Neurocomputing 73(10– 12):1652–1661 44. Huang T, Yu Y, Guo G, Li K (2010) A classification algorithm based on local cluster centers with a few labeled training examples. Knowl-Based Syst 23(6):563–571 45. Halder A, Ghosh S, Ghosh A (2010) Ant based semi-supervised classification. In: Proceedings of the 7th international conference on swarm intelligence, ANTS’10, Springer, Berlin, Heidelberg, pp 376–383 46. Wang Y, Xu X, Zhao H, Hua Z (2010) Semi-supervised learning based on nearest neighbor rule and cut edges. Knowl-Based Syst 23(6):547–554 47. Deng C, Guo M (2011) A new co-training-style random forest for computer aided diagnosis. J Intell Inf Syst 36:253–281. doi:10.1007/s10844-009-0105-8 48. 
Nigam K, Mccallum A, Thrun S, Mitchell T (2000) Text classification from labeled and unlabeled documents using em. Mach Learn 39(2):103–134 49. Tang X-L, Han M (2010) Semi-supervised Bayesian artmap. Appl Intell 33(3):302–317 50. Joachims T (2003) Transductive learning via spectral graph partitioning. In: Proceedings of twentieth international conference on machine learning, vol 1, pp 290–297 51. Belkin M, Niyogi P, Sindhwani V (2006) Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. J Mach Learn Res 7:2399–2434 52. Xie B, Wang M, Tao D (2011) Toward the optimization of normalized graph Laplacian. IEEE Trans Neural Netw 22(4):660–666 53. Burges C (1998) A tutorial on support vector machines for pattern recognition. Data Min Knowl Discov 2(2):121–167 54. Chapelle O, Sindhwani V, Keerthi SS (2008) Optimization techniques for semi-supervised support vector machines. J Mach Learn Re. 9:203–233 123 I. Triguero et al. 55. Adankon M, Cheriet M (2010) Genetic algorithm-based training for semi-supervised svm. Neural Comput Appl 19:1197–1206 56. Tian X, Gasso G, Canu S (2012) A multiple kernel framework for inductive semi-supervised svm learning. Neurocomputing 90:46–58 57. Sugato B, Raymond JM (2003) Comparing and unifying search-based and similarity-based approaches to semi-supervised clustering. In: Proceedings of the ICML-2003 workshop on the continuum from labeled to unlabeled data in machine learning and data mining, pp 42–49 58. Yin X, Chen S, Hu E, Zhang D (2010) Semi-supervised clustering with metric learning: an adaptive kernel method. Pattern Recognit 43(4):1320–1333 59. Grira N, Crucianu M, Boujemaa N (2004) Unsupervised and semi-supervised clustering: a brief survey. In: A review of machine learning techniques for processing multimedia content. Report of the MUSCLE European network of excellence FP6 60. Freund Y, Seung HS, Shamir E, Tishby N (1997) Selective sampling using the query by committee algorithm. Mach Learn 28:133–168 61. Muslea I, Minton S, Knoblock C (2002) Active + semi-supervised learning = robust multi-view learning. In: Proceedings of ICML-02, 19th international conference on machine learning, pp 435–442 62. Zhang Q, Sun S (2010) Multiple-view multiple-learner active learning. Pattern Recognit 43(9):3113–3119 63. Yu H (2011) Selective sampling techniques for feedback-based data retrieval. Data Min Knowl Discov 22(1–2):1–30 64. Belhumeur P, Hespanha J, Kriegman D (1997) Eigenfaces vs. fisherfaces: recognition using class specific linear projection. IEEE Trans Pattern Anal Mach Intell 19(7):711–720 65. Hastie T, Tibshirani R, Friedman J (2001) The elements of statistical learning: data mining, inference and prediction, 2nd edn. Springer, Berlin 66. Song Y, Nie F, Zhang C, Xiang S (2008) A unified framework for semi-supervised dimensionality reduction. Pattern Recognit 41(9):2789–2799 67. Li Y, Guan C (2008) Joint feature re-extraction and classification using an iterative semi-supervised support vector machine algorithm. Mach Learn 71:33–53 68. Liu H, Motoda H (eds) (2007) Computational methods of feature selection. Chapman &Hall/CRC data mining and knowledge discovery series. Chapman & Hall/CRC, Boca Raton, FL 69. Zhao J, Lu K, He X (2008) Locality sensitive semi-supervised feature selection. Neurocomputing 71(10– 12):1842–1849 70. Gregory PA, Gail AC (2010) Self-supervised ARTMAP. Neural Netw 23:265–282 71. Cour T, Sapp B, Taskar B (2011) Learning from partial labels. J Mach Learn Res 12:1501–1536 72. 
Author Biographies

Isaac Triguero received the M.Sc. degree in Computer Science from the University of Granada, Granada, Spain, in 2009. He is currently a Ph.D. student in the Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain. His research interests include data mining, data reduction, biometrics, evolutionary algorithms and semi-supervised learning.

Salvador García received the M.Sc. and Ph.D. degrees in Computer Science from the University of Granada, Granada, Spain, in 2004 and 2008, respectively. He is currently an Associate Professor in the Department of Computer Science, University of Jaén, Jaén, Spain. He has published more than 30 papers in international journals. As editorial activities, he has co-edited two special issues in international journals on different data mining topics. His research interests include data mining, data reduction, data complexity, imbalanced learning, semi-supervised learning, statistical inference and evolutionary algorithms.

Francisco Herrera received his M.Sc. in Mathematics in 1988 and Ph.D. in Mathematics in 1991, both from the University of Granada, Spain. He is currently a Professor in the Department of Computer Science and Artificial Intelligence at the University of Granada.
He has published more than 230 papers in international journals. He is coauthor of the book "Genetic Fuzzy Systems: Evolutionary Tuning and Learning of Fuzzy Knowledge Bases" (World Scientific, 2001). He currently acts as Editor in Chief of the international journal "Progress in Artificial Intelligence" (Springer). He acts as area editor of the International Journal of Computational Intelligence Systems and associate editor of the journals IEEE Transactions on Fuzzy Systems, Information Sciences, Knowledge and Information Systems, Advances in Fuzzy Systems, and International Journal of Applied Metaheuristics Computing; and he serves as a member of several journal editorial boards, among others: Fuzzy Sets and Systems, Applied Intelligence, Information Fusion, Evolutionary Intelligence, International Journal of Hybrid Intelligent Systems, Memetic Computation, and Swarm and Evolutionary Computation. He received the following honors and awards: ECCAI Fellow 2009, the 2010 Spanish National Award on Computer Science ARITMEL to the "Spanish Engineer on Computer Science", the International Cajastur "Mamdani" Prize for Soft Computing (Fourth Edition, 2010), the IEEE Transactions on Fuzzy Systems Outstanding 2008 Paper Award (bestowed in 2011), and the 2011 Lotfi A. Zadeh Prize Best Paper Award of the International Fuzzy Systems Association. His current research interests include computing with words and decision making, bibliometrics, data mining, biometrics, data preparation, instance selection, fuzzy rule-based systems, genetic fuzzy systems, knowledge extraction based on evolutionary algorithms, memetic algorithms and genetic algorithms.

2. Self-labeling with prototype generation/selection for semi-supervised classification

2.2 On the Characterization of Noise Filters for Self-Training Semi-Supervised in Nearest Neighbor Classification

• I. Triguero, José A. Sáez, J. Luengo, S. García, F. Herrera, On the Characterization of Noise Filters for Self-Training Semi-Supervised in Nearest Neighbor Classification. Neurocomputing 132 (2014) 30-41, doi: 10.1016/j.neucom.2013.05.055.
  – Status: Published.
  – Impact Factor (JCR 2014): Not available.
  – Current Impact Factor of the Journal (JCR 2012): 1.634
  – Subject Category: Computer Science, Artificial Intelligence. Ranking 37 / 115 (Q2).

On the characterization of noise filters for self-training semi-supervised in nearest neighbor classification

Isaac Triguero (a,*), José A. Sáez (a), Julián Luengo (b), Salvador García (c), Francisco Herrera (a)

(a) Department of Computer Science and Artificial Intelligence, CITIC-UGR (Research Center on Information and Communications Technology), University of Granada, 18071 Granada, Spain
(b) Department of Civil Engineering, LSI, University of Burgos, 09006 Burgos, Spain
(c) Department of Computer Science, University of Jaén, 23071 Jaén, Spain

Article history: received 22 October 2012; received in revised form 18 February 2013; accepted 30 May 2013; available online 12 November 2013.

Abstract. Semi-supervised classification methods have received much attention as suitable tools to tackle training sets with large amounts of unlabeled data and a small quantity of labeled data. Several semi-supervised learning models have been proposed with different assumptions about the characteristics of the input data.
Among them, the self-training process has emerged as a simple and effective technique, which does not require any specific hypotheses about the training data. Despite its effectiveness, the self-training algorithm usually makes erroneous predictions, mainly at the initial stages, if noisy examples are labeled and incorporated into the training set. Noise filters are commonly used to remove corrupted data in standard classification. In 2005, Li and Zhou proposed the addition of a statistical filter to the self-training process. Nevertheless, in this approach, the filtering method has to deal with a reduced number of labeled instances and the erroneous predictions this may induce. In this work, we analyze the integration of a wide variety of noise filters into the self-training process in order to distinguish the most relevant features of filters. We focus on the nearest neighbor rule as a base classifier and ten different noise filters. We provide an extensive analysis of the performance of these filters considering different ratios of labeled data. The results are contrasted with nonparametric statistical tests that allow us to identify relevant filters, and their main characteristics, in the field of semi-supervised learning.

Keywords: noise filters; noisy data; self-training; semi-supervised learning; nearest neighbor classification.

1. Introduction

The construction of classifiers can be considered one of the most important and challenging tasks in machine learning and data mining [1]. Supervised classification, which has attracted much attention and research effort [2], aims to build classifiers using a set of labeled data. By contrast, in many real-world tasks, unlabeled data are easier to obtain than labeled ones, because they require less effort, expertise and time. In this context, semi-supervised learning (SSL) [3] is a learning paradigm concerned with the design of classifiers in the presence of both labeled and unlabeled data. SSL extends unsupervised and supervised learning by including additional information typical of the other learning paradigm. Depending on the main objective of the methods, SSL encompasses several settings, such as semi-supervised classification (SSC) [4] and semi-supervised clustering [5]. The former focuses on enhancing supervised classification by minimizing errors in the labeled examples, but it must also be compatible with the input distribution of unlabeled instances. The latter, also known as constrained clustering [6], aims to obtain better-defined clusters than the ones obtained from unlabeled data. There are other SSL settings, including regression with labeled and unlabeled data, or dimensionality reduction [7], which seeks a faithful low-dimensional mapping or selection of high-dimensional data in an SSL context. We focus on SSC. SSC can be categorized into two slightly different settings [8], denoted as transductive and inductive learning.
On the one hand, transductive learning concerns the problem of predicting the labels of the unlabeled examples, given in advance, by taking both labeled and unlabeled data into account to train a classifier. On the other hand, inductive learning considers the given labeled and unlabeled data as the training examples, and its objective is to predict unseen data. In this paper, we address both settings to carry out an extensive analysis of the performance of the studied methods. Many different approaches have been proposed to classify using unlabeled data in SSC. They usually make different assumptions related to the link between the distribution of unlabeled and labeled data. Generative models [9] assume a joint probability model $p(x, y) = p(y)\,p(x \mid y)$, where $p(x \mid y)$ is an identifiable mixture distribution, for example a Gaussian mixture model [10]. The standard co-training [11] methodology assumes that the feature space can be split into two different conditionally independent views and that each view is able to predict the classes perfectly [12–15]. It trains one classifier on each specific view, and then the classifiers teach each other the most confidently predicted examples. Multi-view learning [16,17] can be viewed as a generalization of co-training, without requiring explicit feature splits or the iterative mutual-teaching procedure. Instead, it focuses on the explicitly hypothetical agreement of several classifiers [18]. There are also other algorithms, such as transductive inference for support vector machines [19,20], that assume that the classes are well separated and do not cut through dense unlabeled data. Alternatively, SSC can also be viewed as a graph min-cut problem [21]. If two instances are connected by a strong edge, their labels are likely to be the same. In this case, the graph construction determines the behavior of this kind of algorithm [22]. In addition, there are recent studies which address multiple assumptions in one model [8]. Self-training [23,24] is a simple and effective SSL methodology which has been successfully applied in many real-world problems [25,26]. In the self-training process, a classifier is first trained with an initially small number of labeled examples, aiming to classify unlabeled points. Then it is retrained with its own most confident predictions, enlarging its labeled training set. This model does not make any specific assumptions about the input data, but it accepts that its own predictions tend to be correct. However, this idea can lead to erroneous predictions if noisy examples are classified as the most confident examples and incorporated into the labeled training set. In [27], the authors proposed the addition of a statistical filter [28] to the self-training process, naming this algorithm SETRED. Nevertheless, this method does not perform well in many domains. The use of a particular filter which has been designed and tested under different conditions is not straightforward. Although the aim of any filter is to remove potentially noisy examples, both correct examples and examples containing valuable information may also be removed. Thus, detecting true noisy examples is a challenging task, because the success of filtering methods depends on several factors [29], such as the kind and nature of data errors, the quantity of noise removed, or the capability of the classifier to deal with the loss of useful information related to the filtering.
In the self-training approach, the number of available labeled data and the induced noisy examples are two decisive factors when filtering noise. Hence, the performance of the combination of filtering techniques and self-training relies heavily on the filtering method chosen. So much so that the inclusion or absence of a single prototype in the labeled training set can alter the following stages of the self-training approach, especially in the early steps. For these reasons, the selection and analysis of the most suitable filtering method for self-training is mandatory in order to diminish the influence of noisy data. Filtering techniques follow different approaches to determine whether an example could be noisy or not. We distinguish two types of noise detection mechanism: local and global. We call local methods those techniques in which the removal decision is based on a local neighborhood of instances [30,31]. Global methods create different models from the training data; mislabeled examples can be considered noisy depending on the hypothesis agreement of the classifiers used [32,33]. It is necessary to mention that there are other related approaches in which unlabeled data are used to identify mislabeled training data [34,35]. In this work we study in depth the integration of different noise filters, and we further analyze recent proposals in order to establish their suitability with respect to the self-training process. We adopt the Nearest Neighbor (NN) rule [36] as the base classifier, which has been highlighted as one of the most influential techniques in data mining [37]. For each filtering family, the most representative noise filters are tested. The analysis of the behavior of noise filters in self-training motivates the global purpose of this paper, which pursues three objectives:

- To determine which characteristics of noise filters are the most appropriate to be included in the self-training process.
- To perform an empirical study analyzing the transductive and inductive capabilities of the filtered and non-filtered self-training algorithm.
- To check the behavior of this approach when dealing with data sets with different ratios of labeled data.

We conduct experiments involving a total of 60 classification data sets with different ratios of labeled data: 10%, 20%, 30% and 40%. In order to test the behavior of noise filters, the experimental study includes a statistical analysis based on nonparametric statistical tests [38]. A web page with all the complementary material is available at http://sci2s.ugr.es/SelfTraining+Filters, including this paper's basic information, all the data sets created, and the complete results obtained for each algorithm. The rest of the paper is organized as follows: Section 2 defines the SSC problem and the self-training approach. Section 3 explains how to combine self-training with noise filters. Section 4 introduces the filtering algorithms used. Section 5 presents the experimental framework, and Section 6 discusses the analysis of the results obtained. Finally, in Section 7 we summarize our conclusions.

2. Background: semi-supervised learning via the self-training approach

This section provides the necessary information to understand the proposed integration of noise filters into the self-training process. Section 2.1 defines the SSC problem. Then, Section 2.2 presents the self-training approach used to address the SSC problem.
2.1. Semi-supervised classification

This section presents the definition and notation for the SSC problem. A specification of this problem follows. Let $x_p$ be an example, where $x_p = (x_{p1}, x_{p2}, \ldots, x_{pD}, \omega)$, with $x_p$ belonging to a class $\omega$ and a $D$-dimensional space in which $x_{pi}$ is the value of the $i$-th feature of the $p$-th sample. Then, let us assume that there is a labeled set $L$ which consists of $n$ instances $x_p$ with $\omega$ known. Furthermore, there is an unlabeled set $U$ which consists of $m$ instances $x_q$ with $\omega$ unknown, with $m \ge n$. The set $L \cup U$ forms the training set $TR$. The purpose of SSC is to obtain a robust learned hypothesis using $TR$ instead of $L$ alone, which can be applied in two slightly different settings: transductive and inductive learning. Transductive learning is described as the application of an SSC technique to classify all the $m$ instances $x_q$ of $U$ with their correct class. The class assignment should represent the distribution of the classes efficiently, based on the input distribution of unlabeled instances and the $L$ instances. Let $TS$ be a test set composed of $t$ unseen instances $x_r$ with $\omega$ unknown, which has not been used at the training stage of the SSC technique. The inductive learning phase consists of correctly classifying the instances of $TS$ based on the previously learned hypothesis.

2.2. Self-training

The self-training approach is a wrapper methodology characterized by the fact that the learning process uses its own predictions to teach itself. This process is also known as bootstrapping or self-teaching [39]. In general, self-training can be used either for inductive or for transductive learning, depending on the nature of the classifier. Self-training follows an iterative procedure in which a classifier is trained using labeled data to predict the labels of unlabeled data, in order to obtain an enlarged labeled set L. Fig. 1 outlines the pseudo-code of the self-training methodology.

[Fig. 1. Self-training pseudo-code.]

In the following we describe the most significant instructions, enumerated from 1 to 22. First of all, it is necessary to determine the number of unlabeled instances which will be added to L in each iteration. Note that this parameter can be a constant, or it can be chosen as a value proportional to the number of instances of each class in L, as Blum and Mitchell suggest in [11]. We apply this idea in our implementations to determine the number of prototypes per class which will be added to L in each iteration (Instructions 1–11). Then, the algorithm enters a loop to enlarge the labeled set L (Instructions 14–20). Instruction 15 calculates the confidence predictions of all the unlabeled instances as the probability of belonging to each class. The way in which the confidence predictions are measured is dependent on the type of classifier used. Unlike probabilistic models such as Naive Bayes, whose confidence predictions can be measured as the output probability in prediction, the NN rule has no explicitly measured confidence for an instance. For the NN rule, the algorithm approximates confidence in terms of distance; hence, the most confident unlabeled instance is defined as the closest unlabeled instance to any labeled one (as defined in [27,3]). Next, Instruction 16 creates a set L′ consisting of the most confident unlabeled data of each class, keeping the proportion of instances per class previously computed. L′ is labeled with its predictions and added to L (Instruction 17). Instruction 18 removes the instances of L′ from U. In the original description of the self-training approach [23], this process was repeated until all the instances from U had been added to L. However, following [11], we have established a limit on the number of iterations, MAXITER (Instruction 14). Hence, a pool of unlabeled examples smaller than U is used in our implementations. Finally, the obtained L set is used to classify the U set for transductive learning and the TS set for inductive learning.
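To make the loop concrete, the following is a minimal Python sketch of the procedure just described, under simplifying assumptions of our own: a fixed number of instances is added per iteration instead of the per-class proportion of [11], and confidence is the distance-based criterion used for the NN rule. All names are illustrative, not those of the actual implementation.

    import numpy as np

    def self_training_nn(L_X, L_y, U, max_iter=40, per_iter=10):
        # Iteratively move the most confident unlabeled instances into L,
        # labeling each one with the class of its nearest labeled neighbor.
        L_X, L_y, U = L_X.copy(), L_y.copy(), U.copy()
        for _ in range(max_iter):
            if len(U) == 0:
                break
            # distance from every unlabeled instance to every labeled one
            d = np.linalg.norm(U[:, None, :] - L_X[None, :, :], axis=2)
            nearest = d.argmin(axis=1)                   # closest labeled instance
            chosen = d.min(axis=1).argsort()[:per_iter]  # most confident first
            new_labels = L_y[nearest[chosen]]            # 1-NN prediction
            L_X = np.vstack([L_X, U[chosen]])
            L_y = np.concatenate([L_y, new_labels])
            U = np.delete(U, chosen, axis=0)
        # the enlarged L then classifies U (transduction) and TS (induction)
        return L_X, L_y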
3. Combining self-training and filtering methods

In this section we explain the combination of self-training with noise filters in depth. As mentioned above, the goal of the self-training process is to find the most adequate class label for unlabeled data, with the aim of enlarging the L set. However, in SSC, the number of initial labeled examples tends to be too small to train a classifier with good generalization capabilities. In this scenario, noisy instances can be harmful if they are added to the labeled set, as they bias the classification of unlabeled data towards incorrect classes, which could make the enlarged labeled set in the next iteration even noisier. This problem is especially likely to occur in the initial stages of the process. Two types of noisy instances may appear during the self-labeling process:

- The first one is caused by the distribution of classes in L, which can lead the classifier to label some instances erroneously.
- There may be outliers within the original unlabeled data. This second kind can be detected, avoiding its labeling and its inclusion in L.

These ideas motivate the treatment of noisy data during the self-training process. Filtering methods have been commonly used to deal with noisy data. Nevertheless, most of the proposed filtering schemes have been designed within the supervised classification framework. Hence, the number of labeled data can determine the way in which a filter decides whether an example is noisy or not. If incorrect examples are appropriately detected and removed during the labeling process, the generalization capabilities of the classifier are expected to improve. The filtering technique should be applied at each iteration, after L′ is formed, in order to detect both types of noisy instances. The identification of noisy examples is performed using the set L ∪ L′ as a training set. If an example of L′ is annotated as a possible noisy example, it is removed from L′ and will not be added to L. Nevertheless, this instance is still removed from U. Note that this scheme does not try to relabel suspicious examples, and thereby avoids the introduction of new noise into the training set [27]. Fig. 2 shows a case study of filtered self-training in a two-class problem.

[Fig. 2. Example of the labeling process with editing.]

In this figure, we can observe the first iteration of the process and how the selection of the most confident examples can fail due to the distribution of the given labeled instances. One example of Class 2 has been selected as one of the most confident instances of Class 1. A filtering technique is needed to remove incorrectly labeled instances in order to avoid erroneous labeling in subsequent iterations.
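A sketch of one iteration of this combined scheme follows. Here noise_filter is any of the filters of Section 4, returning the indices it flags in the set it receives, and most_confident is a hypothetical helper standing for Instructions 15–16 of Fig. 1; both are our own illustration, assuming NumPy arrays throughout.

    import numpy as np

    def filtered_self_training_step(L_X, L_y, U, noise_filter, per_iter=10):
        # One iteration of self-training with editing (Section 3): the filter
        # inspects L union L' and vetoes suspicious members of L'.
        cand, cand_y = most_confident(L_X, L_y, U, per_iter)  # hypothetical helper
        TR_X = np.vstack([L_X, U[cand]])
        TR_y = np.concatenate([L_y, cand_y])
        noisy = set(noise_filter(TR_X, TR_y))
        # candidate j of L' sits at position len(L_X) + j inside TR
        clean = [j for j in range(len(cand)) if len(L_X) + j not in noisy]
        L_X = np.vstack([L_X, U[cand][clean]])
        L_y = np.concatenate([L_y, cand_y[clean]])
        # flagged candidates also leave U, but they are never relabeled
        U = np.delete(U, cand, axis=0)
        return L_X, L_y, U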
4. Filtering methods

This section describes the filters adopted in our study. Filtering methods are preprocessing mechanisms to detect and eliminate noisy examples in the training set. The separation of noise detection and learning has the advantage that noisy examples do not influence the model building design [40]. As we explained before, each method considers an example to be harmful depending on its nature. Broadly speaking, we can categorize filters into two different types: local (Section 4.1) and global filters (Section 4.2). In all descriptions, we use $TR$ to refer to the training set, $FS$ to the filtered set, and $ND$ to the noisy data identified in the training set (initially, $ND = \emptyset$).

4.1. Local filters

These methods create neighborhoods of instances to detect suspicious examples. Most of them are based on the distance between prototypes to determine their similarity. The best-known distance measure for these filters is the Euclidean distance (Eq. (1)). We will use it throughout this study, since it is simple, easy to optimize, and has been widely used in the field of instance-based learning [41]:

$$\mathrm{EuclideanDistance}(x_p, x_q) = \sqrt{\sum_{i=1}^{D} (x_{pi} - x_{qi})^{2}} \qquad (1)$$

Here we offer a brief description of the local filtering methods studied:

- Edited Nearest Neighbor (ENN) [42]: This algorithm starts with FS = TR. Then each instance in FS is removed if it does not agree with the majority of its k nearest neighbors (a minimal sketch of this filter is given at the end of this subsection).
- All kNN (AllKNN) [43]: The All kNN technique is an extension of ENN. Initially, FS = TR. Then the NN rule is applied k times; in each execution, the NN rule varies the number of neighbors considered between 1 and k. If an instance is misclassified by the NN rule, it is registered as removable from FS. Then all those that meet the criterion are removed at once.
- Relative Neighborhood Graph Edition (RNGE) [44]: This technique builds an undirected proximity graph $G = (V, E)$, in which each vertex corresponds to an instance from TR. There is a set of edges $E$, such that $(x_i, x_j) \in E$ if and only if $x_i$ and $x_j$ satisfy some neighborhood relation (Eq. (2)). In this case, we say that these instances are graph neighbors. The graph neighbors of a given point constitute its graph neighborhood. The edition scheme discards those instances misclassified by their graph neighbors (by the usual voting criterion):

$$(x_i, x_j) \in E \iff d(x_i, x_j) \le \max\bigl(d(x_i, x_k), d(x_j, x_k)\bigr), \quad \forall\, x_k \in TR,\; k \ne i, j. \qquad (2)$$

- Modified Edited Nearest Neighbor (MENN) [45]: This algorithm starts with FS = TR. Then each instance $x_p$ in FS is removed if it does not agree with all of its $k + l$ nearest neighbors, where $l$ is the number of instances in FS which are at the same distance as the last neighbor of $x_p$. Furthermore, MENN works with a prefixed number of pairs $(k, k')$: $k$ is employed as the number of neighbors used to perform the editing process, and $k'$ is employed to validate the edited set FS obtained. The best pair found is employed as the final reference set. If two or more sets are found to be optimal, then both are used in the classification of the test instances, and a majority rule is used to decide the output of the classifier in this case.
- Nearest Centroid Neighbor Edition (NCNEdit) [46]: This algorithm defines the neighborhood taking into account not only the proximity of prototypes to a given example, but also their symmetrical distribution around it. Specifically, it calculates the k nearest centroid neighbors (k NCNs). These k neighbors can be searched for through an iterative procedure [47] in the following way: (1) the first neighbor of $x_p$ is also its nearest neighbor, $x_q^1$; (2) the $i$-th neighbor, $x_q^i$, $i \ge 2$, is chosen such that the centroid of this and the previously selected neighbors, $x_q^1, \ldots, x_q^i$, is the closest to $x_p$. The NCN editing algorithm is a slight modification of ENN, which consists of discarding from FS every example misclassified by the k NCN rule.
- Cut Edges Weight Statistic (CEWS) [28]: This method generates a graph with a set of vertices V = TR and a set of edges E connecting each vertex to its nearest neighbor. An edge connecting two vertices that have different labels is denoted as a cut edge. If an example is located in a neighborhood with too many cut edges, it should be considered as noise. A statistical procedure is then applied to decide, from the weights of the cut edges, which examples are labeled as noisy. This is the filtering method used in the first self-training approach with edition [27] (SETRED).
- Edited Nearest Neighbor Estimating Class Probabilistic and Threshold (ENNTh) [48]: This method applies a probabilistic NN rule, in which the class of an instance is decided as a weighted probability of the classes of its nearest neighbors (each neighbor has the same a priori probability, and its associated weight is the inverse of its distance). The editing process starts with FS = TR and deletes from FS every prototype misclassified by this probabilistic rule. Furthermore, it defines a threshold for the NN rule, so that instances with an assigned probability lower than the established threshold are not considered.
- Multiedit [49,50]: This method starts with FS = ∅ and a new set R defined as R = TR. The technique splits R into nf blocks $R_1, \ldots, R_{nf}$ (nf > 2). For each instance of the block $R_i$, it applies a k NN rule with $R_{(i+1) \bmod nf}$ as the training set. All misclassified instances are discarded, and the remaining instances constitute the new TR. This process is repeated while at least one instance is discarded.
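As an illustration of the local family, here is a minimal sketch of ENN, the simplest of the filters above; the array-based interface and function name are ours, not those of the KEEL implementations used in the experiments, and X, y are assumed to be NumPy arrays.

    import numpy as np
    from collections import Counter

    def enn_filter(X, y, k=3):
        # Edited Nearest Neighbor: an instance is flagged as noisy when it
        # disagrees with the majority label of its k nearest neighbors.
        noisy = []
        for i in range(len(X)):
            d = np.linalg.norm(X - X[i], axis=1)  # Euclidean distance, Eq. (1)
            d[i] = np.inf                         # exclude the instance itself
            neighbors = np.argsort(d)[:k]
            majority_label = Counter(y[neighbors]).most_common(1)[0][0]
            if majority_label != y[i]:
                noisy.append(i)
        return np.array(noisy, dtype=int)         # candidate set ND; FS = TR \ ND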
4.2. Global filters

We denote as global filters those methods which apply a classifier to several subsets of TR in order to detect problematic examples. These methods use different methodologies to divide TR; they then create models over the generated subsets and use different heuristics to determine noisy examples. From here on, nf is the number of folds into which the training data are partitioned by the filtering method.

Classification Filter (CF) [32]: The main steps of this filtering algorithm are the following (a minimal sketch follows these steps):

1. Split the current training data set TR using an nf-fold cross-validation scheme.
2. For each of these nf parts, a learning algorithm is trained on the other nf − 1 parts, resulting in nf different classifiers. Here, C4.5 is used as the learning algorithm [51].
3. These nf resulting classifiers are then used to tag each instance in the excluded part as either correct or mislabeled, by comparing the training label with that assigned by the classifier.
4. The misclassified examples from the previous step are added to ND.
5. Remove the noisy examples from the training set: FS ← TR \ ND.
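The sketch below uses scikit-learn's DecisionTreeClassifier as a stand-in for C4.5; this substitution, and all names, are assumptions of ours for illustration (the experiments use the actual C4.5 algorithm).

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.tree import DecisionTreeClassifier  # stand-in for C4.5

    def classification_filter(X, y, nf=5, seed=0):
        # CF: each instance is tagged by a classifier trained on the other
        # nf - 1 folds; the misclassified instances form ND.
        noisy = []
        for train_idx, test_idx in KFold(nf, shuffle=True,
                                         random_state=seed).split(X):
            model = DecisionTreeClassifier(random_state=seed)
            model.fit(X[train_idx], y[train_idx])
            wrong = model.predict(X[test_idx]) != y[test_idx]
            noisy.extend(test_idx[wrong])
        return np.sort(np.array(noisy, dtype=int))   # ND; FS = TR \ ND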
Build a model with the C4.5 algorithm over each of these nf subsets and use them to evaluate the whole current training data set TR. 3. Add to ND the noisy examples identified in TR using a voting scheme. 4. Remove the noisy examples from the training set: FS’TR\ND. Two voting schemes can be used to identify noisy examples: consensus and majority. The former removes an example if it is misclassified by all the classifiers, whereas the latter removes an instance if it is misclassified by more than half of the classifiers. Furthermore, a noisy instance should be misclassified by the model which was induced in the subset containing that instance. In our experimentation we consider the majority scheme in order to detect most of the potentially noisy examples. 5. Experimental framework This section describes the experimental study carried out in this paper. We provide the measures used to determine the performance of the algorithms (Section 5.1), the characteristics of the problems used for the experimentation (Section 5.2), an enumeration of the algorithms used with their respective parameters (Section 5.3) and finally a description of the nonparametric statistical tests applied to contrast the results obtained (Section 5.4). 5.1. Performance measures Two measures are widely used for measuring the effectiveness of classifiers: accuracy [1,2] and Cohen's kappa rate [52]. They are briefly explained as follows: Accuracy: It is the number of successful hits (correct classifica- tions) relative to the total number of classifications. It has been by far the most commonly used metric for assessing the performance of classifiers for years [1,2]. Cohen's kappa (Kappa rate): It evaluates the portion of hits that can be attributed to the classifier itself, excluding random hits, relative to all the classifications that cannot be attributed to chance alone. Cohen's kappa ranges from 1 (total disagreement) through 0 (random classification) to 1 (perfect agreement). For multi-class problems, kappa is a very useful, yet Table 1 Summary description of the original data sets. Data set #Ex. #Atts. #Cl. Data set #Ex. #Atts. #Cl. abalone appendicitis australian autos balance banana bands breast bupa chess coil2000 contraceptive crx dermatology ecoli flare german glass haberman heart hepatitis ionosphere housevotes iris led7digit letter lym magic mammograph marketing 4174 106 690 205 625 5300 539 286 345 3196 9822 1473 125 366 336 1066 1000 214 306 270 155 351 435 150 500 20,000 148 19,020 961 8993 8 7 14 25 4 2 19 9 6 36 85 9 15 33 7 9 20 9 3 13 19 33 16 4 7 16 18 10 5 13 28 2 2 6 3 2 2 2 2 2 2 3 2 6 8 2 2 7 2 2 2 2 2 3 10 10 4 2 2 9 monks movement mushroom nursery pageblocks penbased phoneme pima PostOper ring saheart satimage segment sonar spambase spectheart splice tae texture thyroid tic-tac-toe titanic twonorm vehicle vowel wdbc wine wisconsin yeast zoo 432 360 8124 12,690 5472 10,992 5404 768 90 7400 462 6435 2310 208 4597 267 3190 151 5500 7200 958 2201 7400 846 990 569 178 683 1484 101 6 90 22 8 10 16 5 8 8 20 9 36 19 60 55 44 60 5 40 21 9 3 20 18 13 30 13 9 8 16 2 15 2 5 5 10 2 2 3 2 2 7 7 2 2 2 3 3 11 3 2 2 2 4 11 2 3 2 10 7 simple, meter for measuring a classifier's accuracy while compensating for random successes. 5.2. Data sets The experimentation is based on 60 standard classification data sets taken from the KEEL-data set repository1 [53,54]. Table 1 summarizes the properties of the selected data sets. 
5.2. Data sets

The experimentation is based on 60 standard classification data sets taken from the KEEL data set repository (http://sci2s.ugr.es/keel/datasets.php) [53,54]. Table 1 summarizes the properties of the selected data sets, showing, for each data set, the number of examples (#Ex.), the number of attributes (#Atts.) and the number of classes (#Cl.).

Table 1. Summary description of the original data sets.

Data set        #Ex.    #Atts.  #Cl.    Data set        #Ex.    #Atts.  #Cl.
abalone         4174    8       28      monks           432     6       2
appendicitis    106     7       2       movement        360     90      15
australian      690     14      2       mushroom        8124    22      2
autos           205     25      6       nursery         12,690  8       5
balance         625     4       3       pageblocks      5472    10      5
banana          5300    2       2       penbased        10,992  16      10
bands           539     19      2       phoneme         5404    5       2
breast          286     9       2       pima            768     8       2
bupa            345     6       2       post-operative  90      8       3
chess           3196    36      2       ring            7400    20      2
coil2000        9822    85      2       saheart         462     9       2
contraceptive   1473    9       3       satimage        6435    36      7
crx             125     15      2       segment         2310    19      7
dermatology     366     33      6       sonar           208     60      2
ecoli           336     7       8       spambase        4597    55      2
flare           1066    9       2       spectfheart     267     44      2
german          1000    20      2       splice          3190    60      3
glass           214     9       7       tae             151     5       3
haberman        306     3       2       texture         5500    40      11
heart           270     13      2       thyroid         7200    21      3
hepatitis       155     19      2       tic-tac-toe     958     9       2
ionosphere      351     33      2       titanic         2201    3       2
housevotes      435     16      2       twonorm         7400    20      2
iris            150     4       3       vehicle         846     18      4
led7digit       500     7       10      vowel           990     13      11
letter          20,000  16      10      wdbc            569     30      2
lym             148     18      4       wine            178     13      3
magic           19,020  10      2       wisconsin       683     9       2
mammographic    961     5       2       yeast           1484    8       10
marketing       8993    13      9       zoo             101     16      7

The data sets considered in this study contain between 100 and 20,000 instances, the number of attributes ranges from 2 to 85, and the number of classes varies between 2 and 28. Their values are normalized in the interval [0,1] to equalize the influence of attributes with different range domains when using the NN rule. These data sets have been partitioned using the ten-fold cross-validation procedure. Each training partition is divided into two parts: labeled and unlabeled examples. Following the recommendation established in [24], in the division process we do not maintain the class proportion in the labeled and unlabeled sets, since the main aim of SSC is to exploit unlabeled data for better classification results. Hence, we use a random selection of examples that are marked as labeled instances, and the class labels of the remaining instances are removed. We ensure that every class has at least one representative instance. In order to study the influence of the amount of labeled data, we take different ratios when dividing the training set. In our experiments, four ratios are used: 10%, 20%, 30% and 40%. For instance, assuming a data set which contains 1000 examples, when the labeled rate is 10%, 100 examples are put into L with their labels, while the remaining 900 examples are put into U without their labels. In summary, this experimental study involves a total of 240 data sets (60 data sets × 4 labeled rates). Note that test partitions are kept aside to evaluate the performance of the learned hypothesis. All the data sets created can be found on the web page associated with this paper.
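The division just described can be sketched as follows; this is our own illustration, with labels removed at random, no class-proportion constraint, and at least one labeled representative per class.

    import numpy as np

    def split_labeled_unlabeled(y, labeled_rate=0.1, seed=0):
        # Random L/U division of one training partition (Section 5.2).
        rng = np.random.default_rng(seed)
        n, classes = len(y), np.unique(y)
        n_labeled = max(int(round(labeled_rate * n)), len(classes))
        # one guaranteed labeled representative per class
        labeled = [rng.choice(np.where(y == c)[0]) for c in classes]
        rest = np.setdiff1d(np.arange(n), labeled)
        extra = rng.choice(rest, size=n_labeled - len(labeled), replace=False)
        labeled = np.sort(np.concatenate([labeled, extra]))
        unlabeled = np.setdiff1d(np.arange(n), labeled)
        return labeled, unlabeled   # index sets for L and U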
5.3. Algorithms used and parameters

Apart from the original self-training proposal, two of the main variants of this algorithm proposed in the literature are SETRED [27] and SNNRCE [24]. The former corresponds to the first attempt to use a particular filter (CEWS [28]) during the self-training process; hence, we will consider SETRED as equivalent to self-training with a CEWS filter. The latter algorithm is a recent approach which introduces several steps into the original self-training approach, such as a relabeling stage and a relative graph-based neighborhood to determine the confidence level during the labeling process. We include these proposals in the experimental study as comparative techniques. Table 2 shows the configuration parameters, which are common to all problems, of the comparison techniques and the filters used with self-training.

Table 2. Parameter specification for all the methods employed in the experimentation.

Algorithm      Parameters
SelfTraining   MAX_ITER = 40
ENN            number of neighbors = 3
AllKNN         number of neighbors = 3
RNGE           order of the graph = 1st order
MENN           number of neighbors = 3
NCNEdit        number of neighbors = 3
CEWS           threshold = 0.1
ENNTh          noise threshold = 0.7
Multiedit      number of sub-blocks = 3
CF             number of partitions n = 5; base algorithm: C4.5
IPF            number of partitions n = 5; filter type: majority; iterations for stop criterion i = 3; examples removed pct. p = 1%; base algorithm: C4.5
SNNRCE         threshold = 0.5

We focus this experimentation on the recommended parameters proposed by the respective authors, assuming that the choice of parameter values was made optimally. For those filtering methods which are based on the NN rule, we have set the number of nearest neighbors to k = 3. In filtering algorithms, a value k > 1 may be convenient when the interest lies in protecting the classification task against noisy instances, as Wilson and Martinez suggested in [30]. In all of the techniques, we use the Euclidean distance. Since CEWS, Multiedit, CF, IPF and SNNRCE are stochastic methods, they have been run three times per partition. Implementations of the algorithms can be found on the web site associated with this paper.

5.4. Statistical tools for analysis

The use of hypothesis testing methods to support the analysis of results is highly recommended in the field of machine learning. The aim of these techniques is to identify the most relevant differences found between the methods [55,56]. To this end, the use of nonparametric tests is preferred to parametric ones, since the initial conditions that guarantee the reliability of the latter may not be satisfied, causing the statistical analysis to lose credibility [57]. Throughout the empirical study, we will focus on the use of the Friedman Aligned-Ranks (FAR) test [38] as a tool for contrasting the behavior of each proposal. Its application will allow us to highlight the existence of significant differences between methods. The Finner test is applied as a post hoc procedure to find out which algorithms present significant differences. More information about these tests and other statistical procedures can be found at http://sci2s.ugr.es/sicidm/.
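For intuition, the aligned-ranks computation underlying the FAR test can be sketched as follows. This is a simplification of our own: ties are ignored, the test statistic and p-values of the full procedure are omitted, and larger values are assumed to mean better performance.

    import numpy as np

    def friedman_aligned_ranks(results):
        # results: (n_data_sets x n_algorithms) matrix of accuracy/kappa values.
        # Observations are 'aligned' by subtracting each data set's mean; all
        # n * k aligned values are then ranked jointly (rank 1 = best), and
        # the mean rank per algorithm is returned (lower = better).
        aligned = results - results.mean(axis=1, keepdims=True)
        flat = (-aligned).ravel()       # negate so the best value ranks first
        ranks = np.empty(flat.size)
        ranks[np.argsort(flat)] = np.arange(1, flat.size + 1)
        return ranks.reshape(results.shape).mean(axis=0)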
6. Analyzing the integration of noise filters in the self-training approach

In this section, we analyze the results obtained in our experimental study. In particular, our aims are as follows:

- To compare the transductive capabilities achieved with the different kinds of filters under different ratios of labeled data (Section 6.1).
- To study how filtering techniques help the self-training methodology within the generalization process (Section 6.2).
- To analyze the behavior of the best filtering techniques on several data sets (Section 6.3).
- To present a global analysis of the results obtained in terms of the properties of the filtering methods (Section 6.4).

Due to the extent of the experimental analysis carried out, we report the complete experimental results on the web page associated with this paper; in this section we present summary figures and the statistical tests conducted. Tables 3–6 tabulate the information of the statistical analysis performed by nonparametric multiple comparison procedures over 10%, 20%, 30% and 40% of labeled data, respectively. In these tables, filtering methods have been sorted according to their family, starting from classic to more recent methods. In each table, we carry out a total of four statistical tests for the accuracy and kappa measures, differentiating between the transductive and test phases. The rankings computed according to the FAR test [38] represent the effectiveness associated with each algorithm. The best (lowest) ranking obtained in each FAR test is marked with '*', which determines the control algorithm for the post hoc test. Together with each FAR ranking, we present the adjusted p-value of Finner's test (Finner APV) based on the control algorithm; APVs below the α = 0.1 level of significance correspond to methods outperformed by the control. In these tables, we include as a baseline the NN rule trained only with labeled data (NN-L), to determine the goodness of the SSC techniques. Note that this technique corresponds to the initial stage of all the self-training schemes. However, it is also known that, depending on the problem, unlabeled data can lead to worse performance [3]; hence, the inclusion of NN-L shows whether the self-training scheme is an outstanding methodology for SSC.

Table 3. Average rankings of the algorithms (Friedman aligned ranks + Finner test) over the 10% labeled rate.

Algorithm               Transductive phase                       Test phase
                        FAR(Acc)   APV     FAR(Kappa)  APV      FAR(Acc)   APV     FAR(Kappa)  APV
SelfTraining-ENN        416.5580   0.0041  390.2920    0.0154   421.3917   0.0006  397.9667    0.0018
SelfTraining-AllKNN     405.1160   0.0080  395.7580    0.0125   392.8583   0.0060  368.2917    0.0125
SelfTraining-RNGE       321.7920   0.3526  293.7170    0.7979   374.0583   0.0155  342.4333    0.0532
SelfTraining-MENN       364.0000   0.0762  408.7833    0.0066   371.4833   0.0165  405.9417    0.0013
SelfTraining-NCNEdit    402.9254   0.0080  353.3750    0.1134   424.4250   0.0006  395.1250    0.0020
SelfTraining-CEWS       330.2333   0.2775  294.9000    0.7979   379.4667   0.0120  367.8167    0.0125
SelfTraining-ENNTh      363.2833   0.0762  397.6667    0.0125   354.0833   0.0387  379.2000    0.0063
SelfTraining-Multiedit  377.2250   0.0398  413.2080    0.0061   370.5500   0.0165  403.9167    0.0013
SelfTraining-CF         281.6174*  –       282.8255*   –        268.6083*  –       261.4167*   –
SelfTraining-IPF        314.1333   0.4293  297.7164    0.7805   355.0750   0.0387  341.5000    0.0532
SelfTraining            432.5500   0.0015  381.6421    0.0243   464.9833   0.0000  452.3417    0.0000
SNNRCE                  345.4421   0.1577  445.5000    0.0005   388.3083   0.0072  454.2000    0.0000
NN-L                    721.6250   0.0000  721.1176    0.0000   511.2083   0.0000  506.3500    0.0000

Table 4. Average rankings of the algorithms (Friedman aligned ranks + Finner test) over the 20% labeled rate.

Algorithm               Transductive phase                       Test phase
                        FAR(Acc)   APV     FAR(Kappa)  APV      FAR(Acc)   APV     FAR(Kappa)  APV
SelfTraining-ENN        414.1167   0.0012  396.7750    0.0066   434.2333   0.0087  420.5583    0.0235
SelfTraining-AllKNN     424.6750   0.0006  405.3250    0.0045   431.6667   0.0087  415.0333    0.0235
SelfTraining-RNGE       312.8417   0.3315  306.2333    0.5485   316.0083   0.7926  319.8167    0.7618
SelfTraining-MENN       363.8083   0.0358  412.0750    0.0031   378.9333   0.1075  395.9500    0.0530
SelfTraining-NCNEdit    393.1583   0.0051  353.2250    0.0968   397.6000   0.0419  372.9000    0.1453
SelfTraining-CEWS       358.9417   0.0391  350.5750    0.1006   402.0917   0.0366  383.9083    0.0926
SelfTraining-ENNTh      360.5083   0.0391  398.9083    0.0064   355.1500   0.2630  367.0833    0.1731
SelfTraining-Multiedit  396.5667   0.0045  432.3583    0.0008   376.4000   0.1097  400.1250    0.0476
SelfTraining-CF         270.9667*  –       279.6167*   –        305.1917*  –       307.3500*   –
SelfTraining-IPF        297.7083   0.5156  287.8333    0.8417   350.3833   0.2927  343.1667    0.4105
SelfTraining            485.0083   0.0000  416.2333    0.0027   482.1667   0.0002  439.4583    0.0079
SNNRCE                  545.1667   0.0000  599.0000    0.0000   436.1000   0.0087  494.4333    0.0001
NN-L                    453.0333   0.0000  438.3417    0.0007   410.5750   0.0248  416.7167    0.0235

Table 5. Average rankings of the algorithms (Friedman aligned ranks + Finner test) over the 30% labeled rate.

Algorithm               Transductive phase                       Test phase
                        FAR(Acc)   APV     FAR(Kappa)  APV      FAR(Acc)   APV     FAR(Kappa)  APV
SelfTraining-ENN        394.2000   0.0384  381.1417    0.1484   419.4167   0.0052  405.9500    0.0126
SelfTraining-AllKNN     381.6917   0.0700  382.6500    0.1484   388.0500   0.0280  378.4333    0.0467
SelfTraining-RNGE       341.9833   0.3260  324.3333    0.6993   327.0500   0.4380  293.5667*   –
SelfTraining-MENN       335.3250   0.3541  357.0083    0.2833   393.7167   0.0248  391.3500    0.0232
SelfTraining-NCNEdit    403.8500   0.0247  379.5167    0.1484   425.5667   0.0039  396.8750    0.0205
SelfTraining-CEWS       378.8083   0.0714  364.6333    0.2285   408.9500   0.0098  411.3167    0.0126
SelfTraining-ENNTh      338.2333   0.3440  355.2167    0.2833   393.0417   0.0248  393.1583    0.0231
SelfTraining-Multiedit  373.8500   0.0829  400.2833    0.0670   373.5167   0.0607  411.1667    0.0126
SelfTraining-CF         302.7250   0.8561  314.1583    0.8555   293.1833*  –       296.8750    0.9359
SelfTraining-IPF        295.2667*  –       306.6667*   –        320.5833   0.5054  311.7000    0.6911
SelfTraining            437.8167   0.0021  409.8417    0.0477   451.6500   0.0014  429.3500    0.0039
SNNRCE                  644.0250   0.0000  653.7667    0.0000   450.2833   0.0014  519.5583    0.0000
NN-L                    448.7250   0.0011  447.2833    0.0038   431.4917   0.0031  437.2000    0.0029

Table 6. Average rankings of the algorithms (Friedman aligned ranks + Finner test) over the 40% labeled rate.

Algorithm               Transductive phase                       Test phase
                        FAR(Acc)   APV     FAR(Kappa)  APV      FAR(Acc)   APV     FAR(Kappa)  APV
SelfTraining-ENN        375.1500   0.2488  359.7333    0.4135   404.8000   0.0909  394.6000    0.0718
SelfTraining-AllKNN     359.7917   0.2681  352.1083    0.4401   380.9417   0.2558  378.9333    0.1449
SelfTraining-RNGE       331.3083   0.5635  324.2500    0.8323   322.9083*  –       308.5917*   –
SelfTraining-MENN       363.6417   0.2559  384.6250    0.2378   350.3083   0.5620  363.1917    0.2380
SelfTraining-NCNEdit    368.7833   0.2488  354.9667    0.4401   426.7917   0.0454  403.9000    0.0485
SelfTraining-CEWS       375.0833   0.2488  379.7667    0.2450   418.6333   0.0548  427.5667    0.0227
SelfTraining-ENNTh      351.8833   0.3315  367.7417    0.3477   352.9083   0.5620  362.3667    0.2380
SelfTraining-Multiedit  340.5667   0.4534  354.6833    0.4401   353.1917   0.5620  377.9667    0.1449
SelfTraining-CF         322.5083   0.6813  323.7500    0.8323   343.9083   0.6097  339.9000    0.4466
SelfTraining-IPF        305.6167*  –       314.1167*   –        371.1250   0.3389  354.7583    0.2818
SelfTraining            448.5083   0.0021  423.7333    0.0305   453.4750   0.0090  424.6833    0.0227
SNNRCE                  676.8667   0.0000  680.7000    0.0000   477.7833   0.0020  523.0667    0.0000
NN-L                    456.7917   0.0014  456.3250    0.0033   419.7250   0.0548  416.9750    0.0250
6.1. Transductive results

As we stated before, the main objective of transductive learning is to predict the true class labels of the unlabeled data used to train. Hence, a good exploitation of unlabeled data can lead to successful results. Observing Tables 3–6, we can make the following analysis:

- Considering 10% of labeled instances, the FAR procedure highlights the global filter CF as the best performer in terms of transductive learning. With this filter, self-training is able to significantly outperform 8 of the 12 comparison techniques for the accuracy and kappa measures. The IPF filter, which also belongs to the global family of filters, can be stressed as an excellent filter at this labeled ratio. Furthermore, we can highlight RNGE and CEWS as the most competitive local filters in comparison with CF. The comparison technique SNNRCE is outperformed in terms of the kappa measure; by contrast, considering the accuracy measure, it is not overcome at a significance level of α = 0.1. This fact indicates that SNNRCE benefits from random hits.
- When the number of labeled instances is increased to 20%, we observe a clear improvement in terms of accuracy and kappa rate for all the studied methods. Again, the global filters obtain the two best rankings, and the CF filter stands out as the best-performing method. The number of techniques outperformed has also increased. RNGE is the most relevant local filtering technique in comparison with CF and IPF.
- When the data set has a relatively higher number of labeled instances (30% or 40%), both local and global filtering techniques display similar behavior. This is because all the filtering techniques are able to detect noisy examples more easily with a representative number of labeled data. Nevertheless, the IPF filter stands out with the best ranking in both the kappa and accuracy measures for high labeled ratios. Note that it is able to obtain better results than standard self-training or the baseline NN-L, which shows the usefulness of the filtering process with a greater number of labeled data.
6.2. Inductive results (test phase)

In contrast to transductive learning, the aim of inductive learning is to classify unknown examples. In this way, inductive learning proves the generalization capabilities of the analyzed methods, checking whether the previously learned hypotheses are appropriate or not. Apart from Tables 3–6, we include four figures representing the accuracy obtained by the methods at the different labeled ratios. Figs. 3 and 4 illustrate the test accuracy obtained on each data set with 10% and 40% of labeled instances, respectively.

[Fig. 3. Test accuracy per data set over 10% of labeled data.]

For the sake of simplicity, the figures for 20% and 30% of labeled instances and their corresponding accuracy tables can be found on the associated web page. The aim of these figures is to determine on which data sets the original self-training algorithm is outperformed. For this reason, we take the standard self-training as the baseline method to be overcome. For better visualization, on the x-axis the data sets are ordered from the maximum to the minimum accuracy obtained by the standard self-training, and the y-axis position is the test accuracy of each algorithm. Self-training without any filter is drawn as a line.
Therefore, points above this line correspond to data sets for which the other proposals perform better than the original algorithm.

[Fig. 4. Test accuracy per data set over 40% of labeled data.]

Finally, Table 7 summarizes the main differences of each method from the basic self-training algorithm at each labeled ratio. In this table, we present the number of data sets in which the obtained accuracy and kappa rates of each technique are strictly greater (Wins) and greater than or equal to (Ties+Wins) those of the baseline.

Table 7. Comparison of each method with the basic self-training approach.

Algorithm                          10%          20%          30%          40%
                                   Acc   Kappa  Acc   Kappa  Acc   Kappa  Acc   Kappa
SelfTraining-ENN        Wins       24    26     29    28     24    23     25    23
                        Ties+Wins  40    41     49    43     45    42     49    45
SelfTraining-AllKNN     Wins       24    28     28    26     25    24     25    22
                        Ties+Wins  40    43     45    42     43    41     47    42
SelfTraining-RNGE       Wins       27    32     38    33     34    35     29    28
                        Ties+Wins  42    45     51    47     53    52     52    49
SelfTraining-MENN       Wins       26    26     30    30     27    25     26    26
                        Ties+Wins  40    40     44    44     40    39     46    43
SelfTraining-NCNEdit    Wins       20    28     29    27     24    23     23    23
                        Ties+Wins  38    46     52    47     48    47     52    50
SelfTraining-ENNTh      Wins       23    28     32    28     26    23     27    27
                        Ties+Wins  38    42     48    44     42    38     45    44
SelfTraining-CEWS       Wins       26    26     24    22     19    22     20    16
                        Ties+Wins  49    49     47    45     43    45     44    40
SelfTraining-Multiedit  Wins       24    24     31    28     28    24     31    27
                        Ties+Wins  40    40     45    42     45    42     49    43
SelfTraining-CF         Wins       34    35     36    33     29    30     27    27
                        Ties+Wins  45    47     50    46     49    46     51    49
SelfTraining-IPF        Wins       15    28     32    30     30    29     24    24
                        Ties+Wins  31    44     48    46     50    47     46    45
SNNRCE                  Wins       29    30     31    26     28    20     24    18
                        Ties+Wins  29    31     32    28     29    21     24    19

Taking into account these figures, the previous statistical tests and Table 7, we can make some summarizing comments:

- When the inductive learning problem is addressed with 10% of labeled data, there are significant and interesting differences between the methods. Concretely, the global CF filter is highlighted as the most promising filter. The Finner procedure signals the differences of this filter from the rest of the comparison techniques in both the accuracy and kappa measures. Fig. 3 also corroborates this statement, because the majority of its points are above the baseline.
- With 20% of labeled instances, CF is still the most suitable algorithm in terms of generalization capabilities. Depending on the measure used, accuracy or kappa, the CF filter is able to obtain a significant improvement over different filters. Nevertheless, IPF, ENNTh and RNGE are established as the best-performing filters which are not statistically outperformed in either the accuracy or the kappa measure. Table 7 shows that these methods obtain a similar number of victories and ties over the standard self-training.
- Using 30%, two main methods from different families, CF and RNGE, are the two best algorithms. CF is noteworthy in terms of test accuracy, whereas RNGE is established as the control algorithm in terms of the kappa measure. In both cases, the global filter IPF is still a robust filter, not clearly outperformed by the control ones.
CF is noteworthy in terms of test accuracy, whereas RNGE is established as the control algorithm in terms of the kappa measure. In both cases, the global filter IPF is still a robust filter, not clearly outperformed by the control ones.
• With 40% of labeled examples, RNGE works appropriately, but it does not obtain a distinctive difference from most of the filters. By contrast, this method shows significant differences in comparison with NN-L, standard self-training and SNNRCE.

The figures reveal that, when there is an increment in the labeled ratio, a notable number of points lie closer to the baseline used. This means that when a considerable number of known data are available, the basic self-training algorithm is able to work well, avoiding misclassifications. Table 7 also confirms this idea, because the number of data sets in which basic self-training is outperformed decreases with 30% and 40% of labeled data.

6.3. Analyzing the behavior of the noise filters

The noise detection capabilities of the filtering techniques determine the behavior of the self-training filtering scheme. In this subsection, we analyze how many instances are detected as noisy during the self-labeling process. To perform this analysis, we have selected two data sets with 10% of labeled data (we choose nursery and yeast as illustrative data sets in which the filters reach the same number of iterations in the self-training approach) and the three best filters (CF, IPF and RNGE).

Table 8 collects the total number of removed instances (#RI) at the end of the self-training filtered process for each method. Furthermore, the improvement in average test results with respect to the standard self-training is also shown (Impr.).

Table 8
Removed instances for each filtering method.

Data set   SelfTraining-CF     SelfTraining-IPF    SelfTraining-RNGE
           #RI     Impr.       #RI     Impr.       #RI     Impr.
Nursery    1313    0.0794      1196    0.0805      1129    0.0781
Yeast      284     0.0411      156     0.0303      201     0.0317

In addition, Fig. 5 shows a graphical representation of the number of removed instances in each self-training iteration for both the nursery and yeast data sets. The x-axis represents the iterations carried out, and the y-axis represents the number of instances removed in each iteration.

[Fig. 5. Filtering process: number of detected noisy instances per iteration. (a) Nursery data set and (b) yeast data set.]

As we can observe in Table 8 and Fig. 5, each method works in a different way, annotating different instances as noisy during the self-training process. For instance, the CF filter tends to remove a greater number of instances than IPF and RNGE. Nevertheless, Table 8 shows that this fact does not imply that its improvement capabilities are better. In both Fig. 5(a) and (b), the global filters, CF and IPF, present similar trends. We can see that this kind of filtering technique detects analogous proportions of noisy instances across the iterations. However, the local filter, RNGE, shows a behavior different from that of the global filters.

6.4. Global analysis

This section provides a global perspective on the obtained results. As a summary, we outline several points about the self-training filtered approach in general and about the characteristics of the noise filters that are more appropriate for improving the self-training process:
• In all the studies performed, one of the self-training filtered approaches always obtains the best results. Independently of the labeled ratio and the established baselines, NN-L and self-training without a filter are clearly outperformed by at least one of these methods.
This fact highlights the good synergy between noise filters and the self-training process. As we stated before, the inclusion of one erroneous example in the labeled data can alter the following stages and, therefore, the generalization capabilities of the final classifier.
• Inductive results highlight the generalization abilities and the usefulness of using self-training in conjunction with filtering techniques. Nevertheless, there are significant differences depending on the selected filter. Comparing the transductive and test phases, we can state that, in general, the SSC methods used differ widely when tackling the inductive phase. This shows the need for mechanisms, such as appropriate filtering methodologies, to find robust learned hypotheses which allow the classifiers to predict unseen cases.
• In general, an increment in the labeled ratio implies a lower benefit from the filtering techniques. This supports the view that the use of SSC methods is more appropriate in the presence of a lower number of labeled examples. However, even in these cases, the analyzed self-training filtered algorithms can still be helpful, as has been shown in the reported results.
• The working procedure of a filter is an important factor in both transductive and inductive learning. In the conducted experimental study, we have observed that global methods are more robust in most of the experiments, independently of the labeled ratio. These methods assume that label errors are independent of the particular classifier learned from the data; hence, collecting predictions from different classifiers can provide a better estimation of mislabeled examples than collecting information from a single classifier only (a minimal sketch of this voting idea follows this list). This idea performs well under semi-supervised learning conditions, especially when the number of labeled data is reduced. Local approaches also help to improve the self-training process; however, they are more useful when a higher number of labeled data is available. This implies that the idea of constructing a local neighborhood to determine whether an instance should be considered as noise is not the most appropriate way to deal with SSC.
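As an illustration of this global, agreement-based idea, the following minimal sketch flags as noisy those instances that are misclassified by a majority of heterogeneous classifiers, in the spirit of ensemble filters such as CF. The concrete classifiers, the scikit-learn API and the fold count are illustrative assumptions, not the exact CF or IPF implementations.

import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def majority_vote_filter(X, y, n_folds=5):
    # An instance is flagged as noisy when a majority of heterogeneous
    # classifiers, evaluated with cross-validated predictions, misclassify it.
    learners = [KNeighborsClassifier(n_neighbors=3),
                DecisionTreeClassifier(random_state=0),
                GaussianNB()]
    wrong_votes = np.zeros(len(y))
    for clf in learners:
        predictions = cross_val_predict(clf, X, y, cv=n_folds)
        wrong_votes += (predictions != y)
    return wrong_votes <= len(learners) / 2  # True = kept as clean

# Hypothetical usage on an enlarged labeled set (X_enlarged, y_enlarged):
# clean_mask = majority_vote_filter(X_enlarged, y_enlarged)
# X_clean, y_clean = X_enlarged[clean_mask], y_enlarged[clean_mask]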
7. Conclusions

In this paper, we have analyzed the characteristics of a wide variety of noise filters, of a different nature, to improve the self-training approach in SSC. Most of these filters had previously been studied from a traditional supervised learning perspective; however, the filtering process can be more difficult in semi-supervised learning due to the reduced number of labeled instances. The experimental analysis performed, supported from a statistical point of view, has allowed us to distinguish which characteristics of the filtering techniques report a better behavior when addressing the transductive and inductive problems. We have checked that the global filters (the CF and IPF algorithms) stand out as the best performing family of filters, showing that the hypothesis agreement of several classifiers remains robust when the ratio of available labeled data is reduced. Most of the local approaches need more labeled data to perform better.

The use of these filters has resulted in a better performance than that achieved by the previously proposed self-training methods, SETRED and SNNRCE. Thus, the use of global filters is highly recommended in this field, and it can be useful for further work with other SSC approaches and other base classifiers. As future work, we consider the design of new global filters for SSC that use fuzzy rough set models [58,59].

Acknowledgments

Supported by the Research Projects TIN2011-28488 and P11-TIC-7765. J.A. Sáez holds an FPU scholarship from the Spanish Ministry of Education and Science.

References

[1] E. Alpaydin, Introduction to Machine Learning, 2nd ed., MIT Press, Cambridge, MA, 2010.
[2] I.H. Witten, E. Frank, M.A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd ed., Morgan Kaufmann, San Francisco, 2011.
[3] X. Zhu, A.B. Goldberg, Introduction to Semi-Supervised Learning, 1st ed., Morgan and Claypool, 2009.
[4] O. Chapelle, B. Schölkopf, A. Zien, Semi-Supervised Learning, 1st ed., The MIT Press, 2006.
[5] W. Pedrycz, Algorithms of fuzzy clustering with partial supervision, Pattern Recognition Lett. 3 (1985) 13–20.
[6] N. Seliya, T. Khoshgoftaar, Software quality analysis of unlabeled program modules with semisupervised clustering, IEEE Trans. Syst. Man Cybern. Part A: Syst. Humans 37 (2) (2007) 201–211.
[7] L. Faivishevsky, J. Goldberger, Dimensionality reduction based on non-parametric mutual information, Neurocomputing 80 (2012) 31–37.
[8] K. Chen, S. Wang, Semi-supervised learning via regularized boosting working on multiple semi-supervised assumptions, IEEE Trans. Pattern Anal. Mach. Intell. 33 (1) (2011) 129–143.
[9] A. Fujino, N. Ueda, K. Saito, Semisupervised learning for a hybrid generative/discriminative classifier based on the maximum entropy principle, IEEE Trans. Pattern Anal. Mach. Intell. 30 (3) (2008) 424–437.
[10] M. Belkin, P. Niyogi, V. Sindhwani, Manifold regularization: a geometric framework for learning from labeled and unlabeled examples, J. Mach. Learn. Res. 7 (2006) 2399–2434.
[11] A. Blum, T. Mitchell, Combining labeled and unlabeled data with co-training, in: Proceedings of the Annual ACM Conference on Computational Learning Theory, 1998, pp. 92–100.
[12] Z. Yu, L. Su, L. Li, Q. Zhao, C. Mao, J. Guo, Question classification based on co-training style semi-supervised learning, Pattern Recognition Lett. 31 (13) (2010) 1975–1980.
[13] J. Du, C.X. Ling, Z.-H. Zhou, When does co-training work in real data? IEEE Trans. Knowl. Data Eng. 23 (5) (2010) 788–799.
[14] Y. Yaslan, Z. Cataltepe, Co-training with relevant random subspaces, Neurocomputing 73 (10–12) (2010) 1652–1661.
[15] J. Xu, H. He, H. Man, DCPE co-training for classification, Neurocomputing 86 (2012) 75–85.
[16] Z.-H. Zhou, M. Li, Tri-training: exploiting unlabeled data using three classifiers, IEEE Trans. Knowl. Data Eng. 17 (2005) 1529–1541.
[17] M. Li, Z.-H. Zhou, Improve computer-aided diagnosis with machine learning techniques using undiagnosed samples, IEEE Trans. Syst. Man Cybern. Part A: Syst. Humans 37 (6) (2007) 1088–1098.
[18] S. Sun, J. Shawe-Taylor, Sparse semi-supervised learning using conjugate functions, J. Mach. Learn. Res. 11 (2010) 2423–2455.
[19] T. Joachims, Transductive inference for text classification using support vector machines, in: Proceedings of the 16th International Conference on Machine Learning, Morgan Kaufmann, 1999, pp. 200–209.
[20] X. Tian, G. Gasso, S.
Canu, A multiple kernel framework for inductive semisupervised SVM learning, Neurocomputing 90 (2012) 46–58. [21] A. Blum, S. Chawla, Learning from labeled and unlabeled data using graph mincuts, in: Proceedings of the 18th International Conference on Machine Learning, 2001, pp. 19–26. [22] A. Mantrach, N. Van Zeebroeck, P. Francq, M. Shimbo, H. Bersini, M. Saerens, Semi-supervised classification and betweenness computation on large, sparse, directed graphs, Pattern Recognition 44 (6) (2011) 1212–1224. [23] D. Yarowsky, Unsupervised word sense disambiguation rivaling supervised methods, in: Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, 1995, pp. 189–196. [24] Y. Wang, X. Xu, H. Zhao, Z. Hua, Semi-supervised learning based on nearest neighbor rule and cut edges, Knowl. Based Syst. 23 (6) (2010) 547–554. [25] Y. Li, H. Li, C. Guan, Z. Chin, A self-training semi-supervised support vector machine algorithm and its applications in brain computer interface, in: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing—Proceedings, 2007, pp. 385–388. [26] U. Maulik, D. Chakraborty, A self-trained ensemble with semisupervised SVM: an application to pixel classification of remote sensing imagery, Pattern Recognition 44 (3) (2011) 615–623. [27] M. Li, Z.-H. Zhou, SETRED: self-training with editing, in: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2005, pp. 611–621. [28] F. Muhlenbach, S. Lallich, D. Zighed, Identifying and handling mislabelled instances, J. Intell. Inf. Syst. 39 (2004) 89–109. [29] X. Wu, X. Zhu, Mining with noise knowledge: error-aware data mining, IEEE Trans. Syst. Man Cybern. Part A: Syst. Humans 38 (4) (2008) 917–932. [30] D.R. Wilson, T.R. Martinez, Reduction techniques for instance-based learning algorithms, Mach. Learn. 38 (3) (2000) 257–286. [31] S. García, J. Derrac, J. Cano, F. Herrera, Prototype selection for nearest neighbor classification: taxonomy and empirical study, IEEE Trans. Pattern Anal. Mach. Intell. 34 (3) (2012) 417–435. [32] D. Gamberger, R. Boskovic, N. Lavrac, C. Groselj, Experiments with noise filtering in a medical domain, in: Proceedings of the 16th International Conference on Machine Learning, 1999, pp. 143–151. [33] T.M. Khoshgoftaar, P. Rebours, Improving software quality prediction by noise filtering techniques, J. Comput. Sci. Technol. 22 (2007) 387–396. [34] D. Guan, W. Yuan, Y.-K. Lee, S. Lee, Nearest neighbor editing aided by unlabeled data, Inf. Sci. 179 (13) (2009) 2273–2282. [35] D. Guan, W. Yuan, Y.-K. Lee, S. Lee, Identifying mislabeled training data with the aid of unlabeled data, Appl. Intell. 35 (2011) 345–358. [36] T.M. Cover, P.E. Hart, Nearest neighbor pattern classification, IEEE Trans. Inf. Theory 13 (1) (1967) 21–27. [37] X. Wu, V. Kumar (Eds.), The Top Ten Algorithms in Data Mining, Chapman & Hall/CRC Data Mining and Knowledge Discovery, 2009. [38] S. García, A. Fernández, J. Luengo, F. Herrera, Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power, Inf. Sci. 180 (2010) 2044–2064. [39] N. Chawla, G. Karakoulas, Learning from labeled and unlabeled data: an empirical study across techniques and domains, J. Artif. Intell. Res. 23 (2005) 331–366. [40] D. Gamberger, N. Lavrac, S. 
Dzeroski, Noise detection and elimination in data preprocessing: experiments in medical domains, Appl. Artif. Intell. 14 (2000) 205–223. [41] D.W. Aha, D. Kibler, M.K. Albert, Instance-based learning algorithms, Mach. Learn. 6 (1) (1991) 37–66. [42] D.L. Wilson, Asymptotic properties of nearest neighbor rules using edited data, IEEE Trans. Syst. Man Cybern. 2 (3) (1972) 408–421. [43] I. Tomek, An experiment with the edited nearest-neighbor rule, IEEE Trans. Syst. Man Cybern. 6 (6) (1976) 448–452. I. Triguero et al. / Neurocomputing 132 (2014) 30–41 [44] J.S. Sánchez, F. Pla, F.J. Ferri, Prototype selection for the nearest neighbour rule through proximity graphs, Pattern Recognition Lett. 18 (1997) 507–513. [45] K. Hattori, M. Takahashi, A new edited k-nearest neighbor rule in the pattern classification problem, Pattern Recognition 33 (3) (2000) 521–528. [46] J. Sánchez, R. Barandela, A. Marques, R. Alejo, J. Badenas, Analysis of new techniques to obtain quality training sets, Pattern Recognition Lett. 24 (7) (2003) 1015–1022. [47] B.B. Chaudhuri, A new definition of neighborhood of a point in multidimensional space, Pattern Recognition Lett. 17 (1) (1996) 11–17. [48] F. Vázquez, J. Sánchez, F. Pla, A stochastic approach to Wilson's editing algorithm, in: Proceedings of the 2nd Iberian Conference on Pattern Recognition and Image Analysis, 2005, pp. 35–42. [49] P.A. Devijver, J. Kittler, On the edited nearest neighbor rule, in: Proceedings of the Fifth International Conference on Pattern Recognition, 1980, pp. 72–80. [50] F.J. Ferri, J.V. Albert, E. Vidal, Consideration about sample-size sensitivity of a family of edited nearest-neighbor rules, IEEE Trans. Syst. Man Cybern. 29 (4) (1999) 667–672. [51] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Francisco, CA, USA, 1993. [52] A. Ben-David, A lot of randomness is hiding in accuracy, Eng. Appl. Artif. Intell. 20 (2007) 875–885. [53] J. Alcalá-Fdez, L. Sánchez, S. García, M.J. del Jesus, S. Ventura, J.M. Garrell, J. Otero, C. Romero, J. Bacardit, V.M. Rivas, J.C. Fernández, F. Herrera, KEEL: a software tool to assess evolutionary algorithms for data mining problems, Soft Comput. 13 (3) (2009) 307–318. [54] J. Alcalá-Fdez, A. Fernandez, J. Luengo, J. Derrac, S. García, L. Sánchez, F. Herrera, KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework, J. Multiple-Valued Logic Soft Comput. 17 (2–3) (2010) 255–277. [55] S. García, A. Fernández, J. Luengo, F. Herrera, A study of statistical techniques and performance measures for genetics-based machine learning: accuracy and interpretability, Soft Comput. 13 (10) (2009) 959–977. [56] D.J. Sheskin, Handbook of Parametric and Nonparametric Statistical Procedures, 5th ed., Chapman & Hall/CRC, 2011. [57] S. García, F. Herrera, An extension on “Statistical comparisons of classifiers over multiple data sets” for all pairwise comparisons, J. Mach. Learn. Res. 9 (2008) 2677–2694. [58] N. Mac Parthalain, R. Jensen, Fuzzy-rough set based semi-supervised learning, in: 2011 IEEE International Conference on Fuzzy Systems (FUZZ), 2011, pp. 2465–2472. [59] J. Derrac, N. Verbiest, S. García, C. Cornelis, F. Herrera, On the use of evolutionary feature selection for improving fuzzy rough set based prototype selection, Soft Comput. 17 (2013) 223–238. Isaac Triguero Velázquez received the M.Sc. degree in Computer Science from the University of Granada, Granada, Spain, in 2009. He is currently a Ph.D. 
student in the Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain. His research interests include data mining, data reduction, evolutionary algorithms and semi-supervised learning. José Antonio Sáez Muñoz received his M.Sc. in Computer Science from the University of Granada, Granada, Spain, in 2009. He is currently a Ph.D. student in the Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain. His research interests include the study of the impact of noisy data in classification, data preprocessing, fuzzy rule-based systems and imbalanced learning. 41 Julián Luengo Martín received his M.Sc. in Computer Science and Ph.D. from the University of Granada, Granada, Spain, in 2006 and 2011 respectively. He is currently an Assistant Professor in the Department of Civil Engineering, University of Burgos, Burgos, Spain. His research interests include machine learning and data mining, data preparation in knowledge discovery and data mining, missing values, data complexity and semi-supervised learning. Salvador García López received his M.Sc. and Ph.D. degrees in Computer Science from the University of Granada, Granada, Spain, in 2004 and 2008, respectively. He is currently an Associate Professor in the Department of Computer Science, University of Jaén, Jaén, Spain. He has published more than 30 papers in international journals. As edited activities, he has coedited two special issues in international journals on different Data Mining topics. His research interests include data mining, data reduction, data complexity, imbalanced learning, semi-supervised learning, statistical inference and evolutionary algorithms. Francisco Herrera Triguero received his M.Sc. in Mathematics in 1988 and Ph.D. in Mathematics in 1991, both from the University of Granada, Spain. He is currently a Professor in the Department of Computer Science and Artificial Intelligence at the University of Granada. He has had more than 200 papers published in international journals. He is a coauthor of the book Genetic Fuzzy Systems: Evolutionary Tuning and Learning of Fuzzy Knowledge Bases (World Scientific, 2001). He currently acts as an Editor in Chief of the international journal Progress in Artificial Intelligence (Springer) and serves as an Area Editor of the Journal Soft Computing (area of evolutionary and bioinspired algorithms) and International Journal of Computational Intelligence Systems (area of information systems). He acts as an Associated Editor of the journals: IEEE Transactions on Fuzzy Systems, Information Sciences, Advances in Fuzzy Systems, and International Journal of Applied Metaheuristics Computing; and he serves as a Member of several journal editorial boards, among others: Fuzzy Sets and Systems, Applied Intelligence, Knowledge and Information Systems, Information Fusion, Evolutionary Intelligence, International Journal of Hybrid Intelligent Systems, Memetic Computation, Swarm and Evolutionary Computation. He received the following honors and awards: ECCAI Fellow 2009, 2010 Spanish National Award on Computer Science ARITMEL to the Spanish Engineer on Computer Science, and International Cajastur Mamdani Prize for Soft Computing (Fourth Edition, 2010), the 2011 IEEE Transactions on Fuzzy Systems Outstanding Paper Award and the 2011 Lotfi A. Zadeh Prize Best paper Award of the International Fuzzy Systems Association. 
His current research interests include computing with words and decision making, data mining, bibliometrics, data preparation, instance selection, fuzzy rule based systems, genetic fuzzy systems, knowledge extraction based on evolutionary algorithms, memetic algorithms and genetic algorithms.

2. Self-labeling with prototype generation/selection for semi-supervised classification

2.3 SEG-SSC: A Framework based on Synthetic Examples Generation for Self-Labeled Semi-Supervised Classification

• I. Triguero, S. García, F. Herrera, SEG-SSC: A Framework based on Synthetic Examples Generation for Self-Labeled Semi-Supervised Classification. IEEE Transactions on Cybernetics
– Status: Submitted.

Journal: IEEE Transactions on Cybernetics
Manuscript ID: CYB-E-2013-08-0823.R1
Manuscript Type: Regular Paper
Date Submitted by the Author: n/a
Complete List of Authors: Triguero, Isaac (University of Granada, Computer Science and Artificial Intelligence); García, Salvador (University of Jaén, Computer Science); Herrera, Francisco (University of Granada, Dept. of Computer Science and Artificial Intelligence)
Keywords: self-labeled methods, co-training, synthetic examples, semi-supervised classification

SEG-SSC: A Framework based on Synthetic Examples Generation for Self-Labeled Semi-Supervised Classification

Isaac Triguero, Salvador García, and Francisco Herrera

This work was supported by the Research Projects TIN2011-28488, P10-TIC-6858 and P11-TIC-7765. I. Triguero and F. Herrera are with the Department of Computer Science and Artificial Intelligence of the University of Granada, CITIC-UGR, Granada, Spain, 18071. E-mails: {triguero, herrera}@decsai.ugr.es. Salvador García is with the Department of Computer Science of the University of Jaén, Jaén, Spain, 23071. E-mail: [email protected].

Abstract—Self-labeled techniques are semi-supervised classification methods that address the shortage of labeled examples via a self-learning process based on supervised models. They progressively classify unlabeled data and use them to modify the hypothesis learned from labeled samples. Most relevant proposals are currently inspired by boosting schemes to iteratively enlarge the labeled set. Despite their effectiveness, these methods are constrained by the number of labeled examples and their distribution, which in many cases is sparse and scattered. The aim of this work is to design a framework, named SEG-SSC, to improve the classification performance of any given self-labeled method by using synthetic labeled data. These are generated via an oversampling technique and a positioning adjustment model that use both labeled and unlabeled examples as reference. Next, these examples are incorporated in the main stages of the self-labeling process. The principal aspects of the proposed framework are: (a) introducing diversity to the multiple classifiers used by means of more (new) labeled data; (b) fulfilling the labeled data distribution with the aid of unlabeled data; and (c) being applicable to any kind of self-labeled method. In our empirical studies, we have applied this scheme to four recent self-labeled methods, testing their capabilities with a large number of data sets. We show that this framework significantly improves the classification capabilities of self-labeled techniques.

Index Terms—Self-labeled methods, co-training, synthetic examples, semi-supervised classification.

I. INTRODUCTION

Having a multitude of unlabeled data and few labeled ones occurs quite often in many practical applications such as medical diagnosis, spam filtering, bioinformatics, etc. In this scenario, learning appropriate hypotheses with traditional supervised classification methods [1] is not straightforward, because they can only exploit labeled data. Nevertheless, Semi-Supervised Classification (SSC) [2], [3], [4] approaches also utilize unlabeled data to improve the predictive performance, modifying the learned hypothesis obtained from labeled examples alone.

With SSC we may pursue two different objectives: transductive and inductive classification [5]. The former is devoted to predicting the correct labels of a set of unlabeled examples that is also used during the training phase. The latter refers to the problem of predicting unseen data by learning from labeled and unlabeled data as training examples. In this work, we will analyze both settings.

Existing SSC algorithms are usually classified depending on the conjectures they make about the relation between the labeled and unlabeled data distributions. Broadly speaking, they are based on the manifold and/or cluster assumption. The manifold assumption is satisfied if data lie approximately on a manifold of lower dimensionality than the input space [6]. The cluster assumption states that similar examples should have the same label. Graph-based models [7] are the most common approaches to implementing the manifold assumption [8]. As examples of models based on the cluster assumption, we can find generative models [9] or semi-supervised support vector machines [10]. Recent studies have addressed multiple assumptions in one model [11], [5], [12].

Self-labeled techniques are SSC methods that do not make any specific suppositions about the input data [13]. These models use unlabeled data within a supervised framework via a self-training process. First attempts correspond to the self-training algorithm [14], which iteratively enlarges the labeled training set by adding the most confident predictions of the supervised classifier used. The standard co-training [15] methodology splits the feature space into two different conditionally independent views. Then, it trains one classifier on each specific view, and the classifiers teach each other the most confidently predicted examples. Advanced approaches do not require explicit feature splits or the iterative mutual-teaching procedure imposed by co-training, as they are commonly based on disagreement-based classifiers [16], [17], [18]. These models have been successfully applied to many real applications such as image classification [19], shadow detection [20], computer-aided diagnosis [21], etc.

Self-labeled techniques are limited by the number of labeled points and their distribution when identifying reliable unlabeled examples. This problem is more pronounced when the labeled ratio is greatly reduced and the labeled examples do not minimally represent the domain. Moreover, most of the advanced models use some diversity mechanism, such as bootstrapping [22], to provide differences between the hypotheses learned by the multiple classifiers. However, these mechanisms may provide a performance similar to classical self-training or co-training approaches if the number of labeled data is insufficient to achieve different learned hypotheses.

The aim of this work is to alleviate these weaknesses by using new synthetic labeled examples to introduce diversity to multiple classifier approaches and to fulfill the labeled data distribution. A complete motivation for the use of synthetic labeled examples is discussed in Section III-A.

We propose a framework applicable to any self-labeled method that incorporates synthetic examples in the self-learning process. We will denote this framework "Synthetic Examples Generation for Self-labeled Semi-supervised Classification" (SEG-SSC). It is composed of two main parts: generation and incorporation.
• The generation process consists of an oversampling technique and a later adjustment of the positioning of the examples. It is initially inspired by the SMOTE algorithm [23] to generate new synthetic examples, for all the classes, based on both the small labeled set and the unlabeled data. Then, this process is refined using a positioning adjustment of prototypes model [24] based on a differential evolution algorithm [25].
• New labeled points are then included in two of the main steps of a self-labeling method: the initialization phase and the update of the labeled training set, so that new examples are introduced in a progressive manner during the self-labeling process.

An extensive experimental analysis is carried out to check the performance of the proposed framework. We apply the SEG-SSC scheme to four recent self-labeled techniques that have different characteristics, comparing the performance obtained with that of the original proposals. We conduct experiments over 55 standard classification data sets extracted from the KEEL and UCI repositories [26], [27] and 11 high dimensional data sets from the book by Chapelle et al. [2]. The results will be contrasted with nonparametric statistical tests [28].

The remainder of this paper is organized as follows. Section II defines the SSC problem and sums up the classical and current self-labeled approaches. Then, Section III presents the proposed framework, explaining its motivation and the details of its implementation. Section IV describes the experimental setup and discusses the results obtained. Finally, Section V summarizes the conclusions drawn in this work.
II. SELF-LABELED SEMI-SUPERVISED CLASSIFICATION

This section provides the definition of the SSC problem (Section II-A) and briefly describes the most relevant self-labeled approaches proposed in the literature (Section II-B).

A. Semi-supervised classification

A formal description of the SSC problem is as follows. Let xp be an example, where xp = (xp1, xp2, ..., xpD, ω), with xp belonging to a class ω and to a D-dimensional space in which xpi is the value of the i-th feature of the p-th sample. Let us assume that there is a labeled set L consisting of n instances xp with ω known, and an unlabeled set U consisting of m instances xq with ω unknown, where m > n. The set L ∪ U forms the training set TR. Moreover, there is a test set TS composed of t unseen instances xr with ω unknown, which has not been used at the training stage. The aim of SSC is to obtain a robust learned hypothesis using TR instead of L alone. It can be applied in two slightly different settings. On the one hand, transductive learning is devoted to classifying all the m instances xq of U with their correct class. The class assignment should represent the distribution of the classes efficiently, based on the input distribution of L and U. On the other hand, the inductive learning phase consists of correctly classifying the instances of TS based on the previously learned hypothesis.

B. Self-labeled techniques: previous work

Self-labeled techniques form an important family of methods in SSC [3]. They are not intrinsically geared to learning in the presence of both labeled and unlabeled data, but they use unlabeled points within a supervised learning paradigm. These techniques aim to obtain one (or several) enlarged labeled set(s), based on the most reliable predictions. Thus, these models do not make any specific assumptions about the input data, but they accept that their own predictions tend to be correct. Some authors state that self-labeling is likely to work when the classes form well-separated clusters [3] (the cluster assumption).

The major benefits of this family of methods are its simplicity and its wrapper methodology. The former is related to the ease of implementation and applicability. The latter means that any kind of classifier can be used regardless of its complexity, which is very important depending on the problem tackled. As a caveat, the addition of wrongly labeled examples during the self-labeling process can lead to an even worse performance. Several mechanisms have been proposed to reduce this problem [29].

A preeminent work with this philosophy is the self-training paradigm designed by Yarowsky [14]. In self-training, a supervised classifier is initially trained with the L set. Then it is retrained with its own most confident predictions, enlarging its labeled training set. Thus, it is defined as a wrapper method for SSC (a minimal sketch of this loop is given below).
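The following sketch assumes a probabilistic scikit-learn classifier and an illustrative confidence threshold, neither of which is prescribed by the original formulation; it is one possible rendering of the generic loop, not the authors' code.

import numpy as np
from sklearn.base import clone
from sklearn.naive_bayes import GaussianNB

def self_training(base_clf, X_l, y_l, X_u, confidence=0.9, max_iter=40):
    # Iteratively enlarge the labeled set with the most confident predictions.
    X_l, y_l, X_u = X_l.copy(), y_l.copy(), X_u.copy()
    for _ in range(max_iter):
        if len(X_u) == 0:
            break
        model = clone(base_clf).fit(X_l, y_l)
        proba = model.predict_proba(X_u)
        confident = proba.max(axis=1) >= confidence  # most confident predictions
        if not confident.any():
            break  # no reliable predictions left
        pseudo_labels = model.classes_[proba[confident].argmax(axis=1)]
        X_l = np.vstack([X_l, X_u[confident]])
        y_l = np.concatenate([y_l, pseudo_labels])
        X_u = X_u[~confident]
    return clone(base_clf).fit(X_l, y_l)  # final classifier on the enlarged set

# Hypothetical usage with a naive Bayes base classifier:
# final_model = self_training(GaussianNB(), X_l, y_l, X_u)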
This idea was later extended by Blum and Mitchell [15] with the method known as co-training, which consists of two classifiers that are trained on two sufficient and redundant sets of attributes. This requirement implies that each subset of features should be able to perfectly define the frontiers between classes. The method then follows a mutual teaching procedure that works as follows: each classifier labels the most confidently predicted examples from its point of view, and these are added to the L set of the other classifier. It is also known that its usefulness is constrained by the imposed requirement [30], which is not satisfied in many real applications. Nevertheless, this method has become a reference for recent models thanks to the idea of using the agreement (or disagreement) of multiple classifiers and the mutual teaching approach. A good study of when co-training works can be found in [31].

Due to the success of co-training and its relatively limited applicability, many works have proposed improving standard co-training by eliminating the established conditions. In [32], the authors proposed a multi-learning approach in which two different supervised learning algorithms are used without splitting the feature space. They showed that this mechanism divides the instance space into a set of equivalence classes. Later, the same authors proposed a faster and more precise alternative, named Democratic co-learning (Democratic-Co) [33], which is also based on multi-learning.

As an alternative, which requires neither sufficient and redundant views nor several supervised learning algorithms, Zhou and Li [34] presented the Tri-Training algorithm, which attempts to determine the most reliable unlabeled data from the agreement of three classifiers (same learning algorithm). They then proposed the Co-Forest algorithm [21] as a similar approach that uses Random Forest [35]. A further similar approach is Co-Bagging [36], [37], where the confidence is estimated from the local accuracy of committee members. Other recent self-labeled approaches are [38], [39], [40], [41], [42].

In summary, all of these recent schemes work on the hypothesis that several weak classifiers, learned with a small number of instances, can produce better generalizations than only one weak classifier. These methods are also known as disagreement-based models, which are motivated, in part, by the empirical success of ensemble learning. The term disagreement-based was recently coined by Zhou and Li [17].
III. SYNTHETIC EXAMPLES GENERATION FOR SELF-LABELED METHODS

In this section we present the SEG-SSC framework. Firstly, Section III-A enumerates the arguments that justify our proposal. Secondly, Section III-B explains how to generate useful synthetic examples in a semi-supervised scenario. Finally, Section III-C describes the SEG-SSC framework, emphasizing when synthetic data should be used.

A. Motivation: Why add synthetic examples?

The most important weakness of self-labeling models can occur when erroneously labeled examples are added to the labeled training set. This will incorrectly modify the learned model, which may lead to the addition of further wrong examples in successive iterations. Why does this situation occur?
• There may be outliers in the original unlabeled set. This problem can be avoided if they are detected and not included in the labeled training set. For this problem, there are several solutions in the literature, such as edition schemes [29], [43], [44] or other mechanisms [32]. Recent models, such as Tri-Training [34] or Co-Forest [21], establish some criteria to compensate for the negative influence of noise by augmenting the labeled training set with sufficient new labeled data.
• Independently of the number of unlabeled examples, self-labeled methods can be limited by the distribution of the labeled input data. If the available labeled instances do not represent a reliable domain of the problem, the estimation of confidence predictions may be complicated, because the supervised classifiers used do not have enough information to establish coherent hypotheses. Furthermore, it is even more difficult if these labeled points are very close to the decision boundaries. Figure 1 shows an example with the appendicitis problem [27]. This picture presents a two-dimensional projection (obtained with PCA [45]) of the problem and a partition with 10% of labeled examples. As we can observe, not only is the problem not well represented by the labeled points, but some of the nearest unlabeled points to the two labeled examples of class 1 (blue circles) belong to class 0 (red crosses). This fact can affect the confidence estimations that a self-labeled method obtains from the base classifier.
• A greatly reduced labeled ratio may produce a lack of diversity in self-labeling methods with more than one classifier. As we established above, multiple classifier approaches work as a combination of several weak classifiers. However, if there are only a few labeled data, it is very difficult to obtain different hypotheses, and therefore the classifiers become identical. For example, the Tri-Training algorithm is based on a bootstrapping approach [22]. This re-sampling technique creates new labeled sets for each classifier by modifying the original L. In general, this operation yields labeled sets different from the original one, but the differences are not significant in the case of small labeled data sets and the existence of outliers in the sample. As a consequence, it can lead to biased examples which do not accurately represent the domain of the problem. Although multi-learning approaches attempt to achieve diversity by using different kinds of learning models, a reduced number of instances usually damages their performance because the models are too weak.

[Fig. 1. Two-dimensional projections of Appendicitis. (a) Appendicitis problem; (b) Appendicitis problem: 10% labeled points (Democratic-Co: 0.7590 accuracy). Red circles, class 0. Blue squares, class 1. White triangles, unlabeled.]

The first limitation has already been addressed in the literature with different mechanisms [46]. However, the last two issues are currently open problems. In order to ease both situations, mainly induced by the shortage of labeled points, we introduce new labeled data into the self-labeling process. To do this, we rely on the success of oversampling approaches in imbalanced domains [47], [48], [49], [50], with the difference that we deal with all the classes of the problem. Nevertheless, the use of synthetic data in self-labeling methods is not straightforward and must be carefully performed. The aim of using an oversampling method is to effectively reinforce the decision regions between classes. To do so, we will be aided by the distribution of the unlabeled data in conjunction with the labeled ones, because focusing only on labeled examples may generate noisy instances when the second issue explained above happens. The effectiveness of this idea will be empirically checked in Section IV.

B. Generation of synthetic examples

To generate new labeled data in an SSC context, we perform certain operations on the available data, using both the labeled and unlabeled sets. Algorithm 1 outlines the pseudocode of the proposed oversampling technique. This method is initially based on the SMOTE algorithm proposed in [23], which was designed for imbalanced domains [51] and is limited to oversampling the minority class. In our proposal, we use the underlying idea of SMOTE as an initialization procedure to generate new examples of all the classes. Furthermore, the resulting synthetic set of prototypes is then readjusted with a positioning adjustment of prototypes scheme [24]. Therefore, this mechanism is divided into two phases: initialization and adjustment of prototypes.

Algorithm 1: Generation of synthetic examples
1: Input: Labeled set L, Unlabeled set U, Oversampling factor f, Number of neighbors k
2: Output: OverSampled set
3: OverSampled = ∅
4: TR = L ∪ U
5: ratio = (f · #TR) / #L
6: Randomize TR
7: for i = 1 to NumberOfClasses do
8:   PerClass[i] = getFromClass(L, i)
9:   for j = 1 to #PerClass[i] do
10:    Generated = 0
11:    repeat
12:      neighbors[1..k] = compute the k nearest neighbors of PerClass[i][j] in TR
13:      nn = random number between 1 and k
14:      Sample = PerClass[i][j]
15:      Nearest = TR[neighbors[nn]]
16:      for m = 1 to NumberOfAttributes do
17:        dif = Nearest[m] − Sample[m]
18:        gap = random number between 0 and 1
19:        Synthetic[m] = Sample[m] + gap · dif
20:      end for
21:      OverSampled = OverSampled ∪ Synthetic
22:      Generated++
23:    until Generated ≥ ratio
24:  end for
25: end for
26: OverSampled = DE_adjustment(OverSampled, L)
27: return OverSampled

Initialization: We start from the L and U sets, as well as from a user-defined oversampling factor f and a number k of nearest neighbors. We generate a set of synthetic prototypes, OverSampled, which is initialized as empty (Instruction 3). The ratio of synthetic examples to be generated is computed according to f and the proportion of labeled examples in the training set TR (see Instructions 4 and 5). Furthermore, to prevent the influence of the order of labeled and unlabeled instances when computing distances, the TR set is randomized (Instruction 6). Next, the algorithm enters a loop (Instructions 7–25) to proportionally oversample each class, using its own labeled samples as the base. Thus, we extract from L the set of examples PerClass that belong to the current class (Instruction 8). Each one will serve as a base prototype and will be oversampled as many times as the previously computed ratio indicates (Instructions 11–23). New synthetic examples are located along the line segments joining the base prototype and any of its k nearest neighbors (randomly chosen). To face the SSC scenario, the nearest neighbors are not searched for only in the L set, but in the whole TR set (Instruction 12). In this way, we try to avoid the negative effects of the second weakness of self-labeled techniques explained before. Following the idea of SMOTE, synthetic examples are initially generated as the difference between an existing sample and one of its nearest neighbors (Instruction 17). Then, this difference is scaled by a random number in the range [0, 1] and added to the base example (Instructions 18 and 19). It is noteworthy that the class value of the generated example is the same as that of the considered base sample. The generated prototypes are iteratively stored in OverSampled until the stopping condition is satisfied (an illustrative rendering of this generation step, instantiated with the parameter values of Table IV, is given in Section IV-A).

Adjustment of prototypes: Can we use this process to improve the distribution of the labeled input data? The answer depends on the specific problem and partition used. Although the generation algorithm provides more labeled examples, which may be very useful in many domains, they are not totally confident; they may suffer from the same problem as the self-labeling approaches and their confidence predictions. It is well known that SMOTE can generate noisy data [52], which are usually eliminated with edition schemes. Because we are not interested in removing synthetic data, we apply an evolutionary adjustment process to the OverSampled set (Instruction 26), based on the differential evolution algorithm used in [53]. Differential evolution [25] follows the general procedure of an evolutionary algorithm [54]. It starts with a set of candidate solutions, the so-called individuals, which evolve during a determined number of generations through different operators (mutation, crossover and selection), aiming to minimize/maximize a fitness function. For our purposes, this algorithm is adapted in the following way:
• Each individual encodes a single prototype. The process consists of the optimization of the location of all the individuals of the population.
• The mutation and crossover operators guide the optimization of the positioning of the prototypes. These operators only modify the attributes of the prototypes of the OverSampled set, keeping the class value unchanged throughout the evolutionary cycle. We focus on the DE/CurrentToRand/1 strategy to generate new prototypes [55].
• Then, we obtain a new set of synthetic prototypes that has to be evaluated to decide whether or not it is better than the current set. To make this decision, we use the most reliable data we have, that is, the labeled data L: the generated data should be able to correctly classify L. To check this, the nearest neighbor rule is used as the base classifier to obtain the corresponding fitness value, which we try to maximize. The stopping criterion is reached when the generated data perfectly classify L, or when a given number of iterations has been performed. More details can be found in Section III.B of reference [53].

It is worth mentioning that this optimization process is only applied in those cases in which the former oversampling approach generates synthetic data that are not able to classify L. We thereby endow our model with greater robustness. Figure 2 shows an example of a resulting set of oversampled prototypes in the appendicitis problem. We can observe that, in comparison with Figure 1, the available labeled data points better represent the domain of the problem.

[Fig. 2. Example of data generation in the Appendicitis problem: 10% labeled + initial synthetic data. Red circles, class 0. Blue squares, class 1. White triangles, unlabeled. Red stars, synthetic class 0. Blue pentagons, synthetic class 1. (SEG-SSC+Democratic-Co: 0.8072 accuracy.)]

C. Self-labeling with synthetic data

In this subsection, we describe the SEG-SSC framework in depth. With the generation method presented, we obtain new labeled data that can be directly used to improve the generalization capabilities of self-labeled approaches. Nevertheless, the aim of this framework is to be as flexible as possible, so that it can be applied to different self-labeled algorithms. Although each method proceeds in a different way, they either share some operations or perform very similar ones. Therefore, we explain how to incorporate synthetic examples into the self-learning process in order to address the limitations on the distribution of labeled data and the lack of diversity in multiple classifier methods.

In general, self-labeled methods use a set of N classifiers Ci, where i ∈ [1, N], to predict the class of unlabeled instances. Each Ci has an associated labeled set Li that is iteratively enlarged. In what follows, we describe the three main operations that support our proposal. For clarity, Figure 3 depicts a flowchart of the proposed scheme, outlining its more general operations and way of working.

[Fig. 3. SEG-SSC flowchart.]

• Initialization of classifiers: In current approaches, Li is initially formed from the available data in L. Depending on the particular method, they may use the same labeled data for each Li or apply bootstrapping to introduce diversity. As we showed before, both alternatives can lead to a lack of diversity when more than one classifier is used. To solve this, we promote the generation of different synthetic examples for each classifier Ci; in this way, the generation mechanism is applied a total of N times. Because the L data are the most confident examples, we ensure that they belong to each Li, in conjunction with the synthetic examples. Note that the generation method involves some randomness, so different executions generate distinct synthetic points. This ensures diversity between the Li sets.
• Self-labeling stage: After the initialization phase, each classifier is trained with its respective Li. Then, the learned hypotheses are used to classify the unlabeled points, determining the most reliable examples. There are several ways to perform this operation: single-classifier approaches extract the confidence from the base classifier, while multiple-classifier approaches calculate confidence predictions in terms of the agreement or combination of hypotheses. Independently of the procedure followed, each classifier obtains a set L′i that will be used to enlarge Li. At this point, there are two possibilities: self-teaching or mutual teaching. The former uses its own predictions to augment its Li. With a mutual teaching approach, a classifier Cj teaches its confident predictions to the rest of the classifiers, that is, it increases Li for all i ≠ j. When all the Li have been increased, a new oversampling stage is performed for each Li, using its prototypes and the remaining unlabeled examples. The resulting Li sets are ready to be used in the next iteration.
• Final classification: The stopping criterion depends on the specific self-labeled method used; it is usually defined as a given number of iterations or as the learned hypotheses of the classifiers no longer changing. When it is satisfied, not all the unlabeled instances need to have been added to one of the Li sets. For this reason, the resulting Li sets have to be used to classify the remaining instances of U and the TS set.

As such, this scheme is applicable to any self-labeling method and should provide better generalization capabilities to all of them. To test the proposed framework, we have applied these ideas to four self-labeling approaches: Democratic-Co [33], Tri-Training [34], Co-Forest [21] and Co-Bagging [36], [37]. These models have different characteristics, such as distinct mechanisms to determine confident examples (agreement or combination), different teaching schemes, the use of different learning algorithms or a different initialization scheme. Table I summarizes the main properties of these models. We modify these models by adding synthetic examples, as explained above, to get an idea of how flexible our framework is. The modified versions of these algorithms will be denoted SEG-SSC+Democratic-Co, SEG-SSC+Tri-Training, SEG-SSC+Co-Forest and SEG-SSC+Co-Bagging.

TABLE I. Main characteristics of the selected self-labeled methods.

Algorithm      Initialization  Classifiers                    Teaching scheme   Confidence rule
Democratic-Co  Simple          Different learning algorithms  Self-teaching     Weighted majority
Tri-Training   Bootstrapping   Same learning algorithms       Mutual-teaching   Majority
Co-Forest      Bootstrapping   Same learning algorithms       Self-teaching     Majority
Co-Bagging     Simple          Same learning algorithms       Self-teaching     Majority

IV. EXPERIMENTAL SETUP AND ANALYSIS OF RESULTS

This section presents all of the issues related to the experimental framework used in this work, together with the analysis of results. Section IV-A describes the main properties of the data sets used and the parameters of the selected algorithms. Section IV-B presents and analyzes the results obtained on standard classification data sets. Finally, Section IV-C studies the behavior of the proposed framework when dealing with high dimensional problems.

A. Data sets and parameters

The experimentation is based on 55 standard classification data sets taken from the UCI repository [27] and the KEEL-dataset repository [26] (available at http://sci2s.ugr.es/keel/datasets), and on 11 high dimensional problems extracted from the book by Chapelle et al. [2] and from the BBC News web page [56]. Tables II and III summarize the properties of the selected data sets. They show, for each data set, the number of examples (#Ex.), the number of attributes (#D.) and the number of classes (#ω.). The standard classification data sets considered contain between 100 and 19,000 instances, the number of attributes ranges from 2 to 90 and the number of classes varies between 2 and 28. However, the 11 high dimensional data sets contain between 400 and 83,679 instances, and their number of features oscillates between 117 and 11,960.

TABLE II. Summary description of the standard classification data sets.

Data set       #Ex.   #D.  #ω.    Data set          #Ex.   #D.  #ω.
abalone        4174   8    28     movement_libras   360    90   15
appendicitis   106    7    2      mushroom          8124   22   2
australian     690    14   2      nursery           12690  8    5
autos          205    25   6      pageblocks        5472   10   5
banana         5300   2    2      penbased          10992  16   10
breast         286    9    2      phoneme           5404   5    2
bupa           345    6    2      pima              768    8    2
chess          3196   36   2      ring              7400   20   2
cleveland      297    13   5      saheart           462    9    2
coil2000       9822   85   2      satimage          6435   36   7
contraceptive  1473   9    3      segment           2310   19   7
crx            125    15   2      sonar             208    60   2
dermatology    366    33   6      spambase          4597   55   2
ecoli          336    7    8      spectfheart       267    44   2
flare-solar    1066   9    2      splice            3190   60   3
german         1000   20   2      tae               151    5    3
glass          214    9    7      texture           5500   40   11
haberman       306    3    2      tic-tac-toe       958    9    2
heart          270    13   2      thyroid           7200   21   3
hepatitis      155    19   2      titanic           2201   3    2
housevotes     435    16   2      twonorm           7400   20   2
iris           150    4    3      vehicle           846    18   4
led7digit      500    7    10     vowel             990    13   11
lymphography   148    18   4      wine              178    13   3
magic          19020  10   2      wisconsin         683    9    2
mammographic   961    5    2      yeast             1484   8    10
marketing      8993   13   9      zoo               101    17   7
monks          432    6    2

TABLE III. Summary description of the high dimensional data sets.

Data set   #Ex.   #D.    #ω.  Reference
bci        400    117    2    [2]
coil       1500   241    6    [2]
coil2      1500   241    2    [2]
digit1     1500   241    2    [2]
g241c      1500   241    2    [2]
g241n      1500   241    2    [2]
secstr     83679  315    2    [2]
text       1500   11960  2    [2]
usps       1500   241    2    [2]
bbc        2225   9636   5    [56]
bbcsport   737    4613   5    [56]

All the data sets have been partitioned using the 10-fold cross-validation procedure; that is, each data set has been split into 10 folds, each one containing 10% of the examples of the data set. For each fold, an algorithm is trained with the examples contained in the rest of the folds (the training partition) and then tested on the current fold. Note that the test partitions are kept aside to assess the performance of the learned hypothesis.

Each training partition is then divided into two parts: labeled and unlabeled examples. Following the recommendation established in [40], in the division process we do not maintain the class proportions in the labeled and unlabeled sets, since the main aim of SSC is to exploit unlabeled data to obtain better classification results. Hence, we use a random selection of examples that are marked as labeled instances, and the class label of the rest of the instances is removed. We ensure that every class has at least one representative instance. For the standard classification data sets we have taken a labeled ratio of 10%. For the high dimensional data sets, we use two splits of the training partitions, with 10 and 100 labeled examples, respectively. In both cases, the remaining instances are marked as unlabeled points.
Regarding the parameters of the algorithms, the selected values are fixed for all problems, and they have been chosen according to the recommendations of the corresponding authors of each algorithm. From our point of view, the approaches analyzed should be as general and as flexible as possible. It is known that a good choice of parameters boosts their performance over different data sources, but their way of working should offer good enough results even when the parameters are not optimized for a specific data set. This is the main purpose of this experimental setup: to show how the proposed framework can improve the efficacy of self-labeled techniques. Table IV specifies the configuration parameters of all the methods. Because these algorithms carry out some random operations during the labeling process, they have been run three times per partition.

TABLE IV. Parameter specification for the base learners and the self-labeled methods used in the experimentation.

Algorithm      Parameters
KNN            Number of neighbors = 3, Euclidean distance
C4.5           Confidence level c = 0.25, minimum number of item-sets per leaf i = 2, prune after the tree building
Democratic-Co  Classifiers = 3NN, C4.5, NB
Tri-Training   No parameters specified
Co-Forest      Number of RandomForest classifiers = 6, threshold = 0.75
Co-Bagging     MAX_ITER = 40, committee members = 3, ensemble learning = Bagging, pool U = 100
SEG-SSC        Oversampling factor = 0.25, number of neighbors = 5; differential evolution parameters: iterations = 100, iterSFGSS = 8, iterSFHC = 20, Fl = 0.1, Fu = 0.9

In this table, we also present the parameters involved in our framework: the oversampling factor, the number of neighbors and the parameters needed for the differential evolution optimization. They could also be adjusted for each problem; however, with the same aim of being as flexible as possible, we have fixed these values empirically in previous experiments. The parameters used for the differential evolution optimization are the same as those established in [53], except for the number of iterations, which has been reduced. We decrease this value because, under this framework, the reference set used by differential evolution contains a smaller number of instances than in the case of supervised learning.

The Co-Forest and Democratic-Co algorithms were designed and tested with determined base classifiers. In this study, these algorithms maintain their classifiers. However, the interchange of base classifiers is allowed in the Tri-Training and Co-Bagging approaches. In these cases, we will test two base classifiers, the K-Nearest Neighbor [57] and the C4.5 [58] algorithms.
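To complement Algorithm 1 of Section III-B, the following is a minimal Python rendering of the generation phase, instantiated with the oversampling factor and neighborhood size of Table IV. It is an illustrative sketch under simplifying assumptions (Euclidean distance, unique rows) rather than the authors' implementation, and it omits the differential evolution adjustment of Instruction 26.

import numpy as np

def generate_synthetic(X_l, y_l, X_u, f=0.25, k=5, seed=None):
    # SMOTE-like initialization of Algorithm 1 over TR = L union U: every
    # labeled example serves as the base of `ratio` synthetic points
    # interpolated towards neighbors searched in the whole training set.
    rng = np.random.default_rng(seed)
    TR = np.vstack([X_l, X_u])
    TR = TR[rng.permutation(len(TR))]            # Instruction 6: randomize TR
    ratio = max(1, int(f * len(TR) / len(X_l)))  # Instruction 5
    synthetic_X, synthetic_y = [], []
    for label in np.unique(y_l):                 # oversample all the classes
        for sample in X_l[y_l == label]:
            # Instruction 12: neighbors are searched in TR, not only in L.
            distances = np.linalg.norm(TR - sample, axis=1)
            neighbors = np.argsort(distances)[1:k + 1]  # skip the point itself
            for _ in range(ratio):
                nearest = TR[rng.choice(neighbors)]
                gap = rng.random()               # scaling factor in [0, 1]
                synthetic_X.append(sample + gap * (nearest - sample))
                synthetic_y.append(label)        # class of the base sample
    # Instruction 26 (the differential evolution adjustment) is omitted here.
    return np.array(synthetic_X), np.array(synthetic_y)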
The Co-Forest and Democratic-Co algorithms were designed and tested with determined base classifiers, and in this study these algorithms maintain their classifiers. However, the interchange of the base classifier is allowed in the Tri-Training and Co-Bagging approaches. In these cases, we test two base classifiers: the K-Nearest Neighbor [57] and the C4.5 [58] algorithms. A brief description of these base classifiers and their associated confidence prediction computation is given as follows:

• K-Nearest Neighbor (KNN): This is an instance-based learning algorithm that belongs to the lazy learning family of methods [59]. As such, it does not build a model during the learning process and is based on dissimilarities among a set of instances. Those self-labeled methods that need to estimate confidence predictions from this classifier can approximate them in terms of distance from the currently labeled set.

• C4.5: This is a decision tree algorithm [58] that induces classification rules for a given training set. The decision tree is built with a top-down scheme, using the normalized information gain (difference in entropy) that is obtained from choosing an attribute for splitting the data. The attribute with the highest normalized information gain is the one used to make the decision. Confidence predictions can be obtained from the accuracy of the leaf that makes the prediction. The accuracy of a leaf is the percentage of correctly classified training examples out of the total number of training instances covered by that leaf.
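As a hedged sketch of the two confidence heuristics just described (our own minimal formulas, not the exact ones used in the paper):

    import numpy as np

    def knn_confidence(X_labeled, x, k=3):
        # Distance-based confidence for a KNN prediction: the closer x
        # lies to the currently labeled set, the more reliable its label.
        # The monotone mapping 1 / (1 + mean distance) is illustrative.
        d = np.linalg.norm(X_labeled - x, axis=1)  # Euclidean distances
        nearest = np.sort(d)[:k]
        return 1.0 / (1.0 + nearest.mean())

    def c45_leaf_confidence(correct_in_leaf, covered_by_leaf):
        # Accuracy of the predicting leaf: correctly classified training
        # examples over the training instances covered by that leaf.
        return correct_in_leaf / covered_by_leaf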
B. Experiments on standard classification data sets

In this subsection we compare the modified versions of the selected self-labeled methods (within SEG-SSC) with the original ones, focusing on the results obtained on the 55 standard classification data sets with a labeled ratio of 10%. We analyze the transductive and inductive accuracy capabilities of these methods. Both results are presented in Tables V and VI, respectively. In these tables, we specify the base classifier between brackets for the Tri-Training and Co-Bagging algorithms. Aside from these tables, Figure 4 depicts two box plot representations of the results obtained in the transductive and inductive settings, respectively. With these box plots we show a graphical comparison of the performance of the algorithms, indicating their most important characteristics, such as the median, the extreme values and the spread of values about the median in the form of quartiles (Q1 and Q3).

TABLE V. TRANSDUCTIVE ACCURACY RESULTS OVER STANDARD CLASSIFICATION DATA SETS.
Average accuracy over the 55 data sets. Original proposals: Democratic-Co 73.0475, Tri-Training (KNN) 71.4651, Tri-Training (C4.5) 72.5611, Co-Forest 71.1002, Co-Bagging (KNN) 71.0558, Co-Bagging (C4.5) 72.3462. SEG-SSC versions: Democratic-Co 74.3477, Tri-Training (KNN) 73.0708, Tri-Training (C4.5) 73.6259, Co-Forest 71.7279, Co-Bagging (KNN) 73.6177, Co-Bagging (C4.5) 73.3147.

TABLE VI. INDUCTIVE ACCURACY RESULTS OVER STANDARD CLASSIFICATION DATA SETS.
Average accuracy over the 55 data sets. Original proposals: Democratic-Co 73.0362, Tri-Training (KNN) 71.7576, Tri-Training (C4.5) 72.9669, Co-Forest 71.2424, Co-Bagging (KNN) 70.9667, Co-Bagging (C4.5) 72.7782. SEG-SSC versions: Democratic-Co 74.7488, Tri-Training (KNN) 72.0157, Tri-Training (C4.5) 73.9217, Co-Forest 71.4700, Co-Bagging (KNN) 73.7715, Co-Bagging (C4.5) 73.5845.

Fig. 4. Box plot of (a) transductive and (b) inductive accuracy rates. The boxes contain 50% of the data (Q1 to Q3), blue points are the median values and the lines extend to the most extreme values.

Observing these tables and the figure, we can appreciate differences between each of the original proposals and the improvement achieved by the addition of synthetic examples. Nevertheless, the use of hypothesis testing methods is mandatory in order to contrast the results of a new proposal with several comparison methods. The aim of these techniques is to identify the most relevant differences found between methods, which is highly recommended in the data mining field [28]. To do this, we focus on the Wilcoxon signed-ranks test [60] because it establishes a pairwise comparison between methods. In this way, we can see whether there are significant differences between the original and modified versions. More information about this test and other statistical procedures can be found at http://sci2s.ugr.es/sicidm/.

Table VII collects the results of the application of the Wilcoxon signed-ranks test to the transductive and inductive accuracy rates. It shows the R+ and R- rankings achieved and the associated p-values. Adopting a level of significance of α = 0.1, we mark with an asterisk those comparisons in which SEG-SSC significantly outperforms the original algorithm.

TABLE VII. RESULTS OF THE WILCOXON SIGNED-RANKS TEST ON TRANSDUCTIVE AND INDUCTIVE PHASES.
Comparison: transductive phase (R+, R-, p-value); test phase (R+, R-, p-value).
SEG-SSC+Democratic-Co vs. Democratic-Co: (1086, 454, 0.0080)*; (1057, 428, 0.0067)*
SEG-SSC+Tri-Training (KNN) vs. Tri-Training (KNN): (1371, 169, 0.0000)*; (995, 545, 0.0588)*
SEG-SSC+Tri-Training (C4.5) vs. Tri-Training (C4.5): (1083, 457, 0.0086)*; (966, 574, 0.0997)*
SEG-SSC+Co-Forest vs. Co-Forest: (1132, 353, 0.0008)*; (863, 622, 0.2970)
SEG-SSC+Co-Bagging (KNN) vs. Co-Bagging (KNN): (1201, 339, 0.0003)*; (1204, 336, 0.0003)*
SEG-SSC+Co-Bagging (C4.5) vs. Co-Bagging (C4.5): (966, 574, 0.0997)*; (943, 597, 0.1460)

With these results we can make the following analysis:

• In Tables V and VI we can see that our framework provides a great improvement in accuracy to the self-labeled techniques in most of the data sets, and it rarely reduces their performance level significantly. On average, the versions that use synthetic examples always outperform the algorithms upon which they are based, in both the transductive and inductive phases. In general, the average improvement achieved for one algorithm in the transductive setting is more or less maintained in the inductive test, which shows the robustness of the models. Co-Bagging (KNN) seems to be the algorithm that benefits most when it uses synthetic instances; by contrast, SEG-SSC does not significantly increase the average performance of Co-Forest. Comparing all the algorithms, the best performing approach is SEG-SSC+Democratic-Co. It is known that the performance of self-labeled algorithms depends firstly on the general abilities of their base classifiers [61]. We notice that C4.5 is a better base classifier than KNN for the Tri-Training philosophy, whereas the Co-Bagging algorithm performs in a similar way with both classifiers. As expected, the results obtained with our framework are also affected by the base classifier. At this point, we can see that those algorithms that are based on KNN offer a greater average improvement.

• In Figure 4, the size of the boxes is related to the robustness of the algorithms. Thus, we observe that, in many cases, SEG-SSC finds more compact boxes than the original algorithms. In the cases in which the boxes have more or less the same size, we can see that they are higher in the plot. Median results also help us to identify algorithms that perform well in many domains. Thus, we observe again that most of the median values of the modified versions are higher than those of the original proposals. Taking into account median values, SEG-SSC+Tri-Training (C4.5) may be considered the best model.

• According to the Wilcoxon signed-ranks test, all the SEG-SSC methods significantly overcome their original proposals in terms of transductive learning, supporting the previous conclusions. However, in the inductive phase, we find that Co-Forest and Co-Bagging (C4.5) have not been significantly improved. Even so, they report higher R+ rankings than the original models, which means that they still perform slightly better.
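For reference, the pairwise comparison reported in Table VII can be reproduced, assuming SciPy as tooling (our choice, not necessarily the paper's), with a few lines:

    from scipy.stats import wilcoxon

    def compare_methods(acc_seg_ssc, acc_original, alpha=0.1):
        # Pairwise Wilcoxon signed-ranks test over per-data-set
        # accuracies (two sequences of length 55, one entry per data set).
        statistic, p_value = wilcoxon(acc_seg_ssc, acc_original)
        return p_value, p_value < alpha  # significant at level alpha?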
C. Experiments on high dimensional problems with small labeled ratio

This subsection is devoted to studying the behavior of the proposed framework when it is applied to high dimensional data with a very reduced labeled ratio. Most of the considered data sets (9 of 11) were provided in the book by Chapelle et al. [2], in which the studies were performed using only 10 and 100 labeled instances. We perform a similar study, with the difference that we also investigate the inductive abilities of the models.
Furthermore, the BBC and BBCsport data sets have also been analyzed in a semi-supervised context with a small number of labeled instances [62].

In the scatterplots of Figure 5 we depict the transductive and inductive accuracy results obtained with 10 and 100 labeled data. In these plots, the x-axis position of each point is the accuracy of the original self-labeled method on a single data set, and the y-axis position is the accuracy of the modified algorithm. Therefore, points above the y = x line correspond to data sets for which the new proposals perform better than the original algorithm.

Fig. 5. High dimensional data sets: transductive and inductive accuracy results. Four scatterplots compare each original method against its SEG-SSC version (x axis: accuracy of the original proposals; y axis: accuracy with the proposed framework; y = x line for reference): (a) transductive and (b) inductive comparison with 10 labeled points; (c) transductive and (d) inductive comparison with 100 labeled points.
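A minimal sketch of how such a comparison plot can be produced, assuming Matplotlib (our tooling choice) and two aligned per-data-set accuracy sequences, is:

    import matplotlib.pyplot as plt

    def yx_scatter(acc_original, acc_seg_ssc, label):
        # Scatter one method against its SEG-SSC version; points above
        # the y = x diagonal are data sets where the modified method wins.
        plt.scatter(acc_original, acc_seg_ssc, s=20, label=label)
        lo = min(min(acc_original), min(acc_seg_ssc))
        hi = max(max(acc_original), max(acc_seg_ssc))
        plt.plot([lo, hi], [lo, hi], "k--", label="y = x")
        plt.xlabel("Accuracy of the original proposals")
        plt.ylabel("Accuracy with the proposed framework")
        plt.legend()
        plt.show()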
Table VIII tabulates the average results obtained in the 11 data sets considered, including the transductive and inductive phases for both the 10 and 100 labeled splits.

TABLE VIII. HIGH DIMENSIONAL DATA SETS: AVERAGE RESULTS OBTAINED IN TRANSDUCTIVE (TRS) AND INDUCTIVE (TST) PHASES.
Method: #L = 10 (TRS, TST); #L = 100 (TRS, TST).
Democratic-Co: (56.6987, 56.5331); (70.8563, 70.1286)
SEG-SSC+Democratic-Co: (58.4330, 58.8520); (73.4449, 73.5080)
Tri-Training (KNN): (56.1988, 55.9070); (66.5833, 65.4899)
SEG-SSC+Tri-Training (KNN): (57.1121, 56.0891); (70.1754, 65.1796)
Tri-Training (C4.5): (53.1233, 53.3841); (68.7295, 69.1031)
SEG-SSC+Tri-Training (C4.5): (58.7192, 58.3297); (71.7133, 71.1483)
Co-Forest: (55.5031, 55.6181); (68.0144, 67.1261)
SEG-SSC+Co-Forest: (58.0919, 57.6709); (70.1887, 69.2605)
Co-Bagging (KNN): (56.8976, 56.3126); (65.4552, 66.1272)
SEG-SSC+Co-Bagging (KNN): (57.2811, 57.4786); (70.4770, 71.0570)
Co-Bagging (C4.5): (53.8810, 53.8852); (68.7148, 69.7143)
SEG-SSC+Co-Bagging (C4.5): (57.8010, 58.1651); (72.4768, 71.9971)

Given Figure 5 and Table VIII, we can make the following comments:

• In all the plots of Figure 5, most of the points are above the y = x line, which means that, with the proposed framework, the self-labeled techniques perform better than the original algorithms. Differentiating between 10 and 100 available labeled points, we can see that when we have 100 labeled examples there are more points above this line, in both the transductive and inductive phases. We do not discern great differences between the performance obtained in the two learning phases, which shows that the hypotheses learned with the available labeled and unlabeled data were appropriate.

• Table VIII shows that, on average, the proposed scheme obtains a better performance level than the original methods in most cases, independently of the learning phase and the number of labeled data considered. Attending to the difference between transductive and inductive results, we observe that, in general, SEG-SSC increments both proportionally. Nevertheless, there are significant differences between the results obtained with 10 and 100 labeled points.
• With these results in mind, we can see the good synergy between synthetic examples and self-labeled techniques in these domains, but what are the main differences with respect to the results obtained in the previous subsection? We observe great differences between those algorithms that use KNN as a base classifier and those that use C4.5. With standard classification data sets, we ascertained that C4.5 was the best base classifier for Tri-Training and that it performs similarly to KNN for Co-Bagging. These statements are maintained in these domains, where C4.5 performs better. In this study, SEG-SSC+Democratic-Co may be highlighted as the best performing model, obtaining the highest transductive and inductive accuracy results with both 10 and 100 labeled examples.

V. CONCLUDING REMARKS

In this paper we have developed a novel framework called SEG-SSC to improve the performance of any self-labeled semi-supervised classification method. It is focused on the idea of using synthetic examples in order to diminish the drawbacks occasioned by the absence of labeled examples, which deteriorates the efficiency of this family of methods. The proposed self-labeled scheme with synthetic examples has been incorporated into four well-known self-labeled techniques, which have been modified by introducing the necessary elements to follow the designed framework. These models are able to overcome the original self-labeled methods due to the fact that the addition of new labeled data implies a better diversity of the multiple classifier approaches and fulfills the distribution of labeled data.

The wide experimental study carried out has allowed us to investigate the behavior of the proposed scheme with a high number of data sets with a varied number of instances and features. The results have been statistically compared, supporting the assertion that our proposal is a suitable tool for enhancing self-labeled methods.

There are many possible variations of our proposed semi-supervised scheme that could be interesting to explore as future work. In our opinion, the use of oversampling techniques with self-labeled techniques is not only a new way to improve the capabilities of this family of techniques, but could also be useful for most of the existing semi-supervised learning algorithms.

REFERENCES

[1] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, 3rd ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2011.
[2] O. Chapelle, B. Schölkopf, and A. Zien, Semi-Supervised Learning, 1st ed. The MIT Press, 2006.
[3] X. Zhu and A. B. Goldberg, Introduction to Semi-Supervised Learning, 1st ed. Morgan and Claypool, 2009.
[4] F. Schwenker and E. Trentin, “Pattern classification and clustering: A review of partially supervised learning approaches,” Pattern Recognition Letters, vol. 37, pp. 4–14, 2014.
[5] K. Chen and S. Wang, “Semi-supervised learning via regularized boosting working on multiple semi-supervised assumptions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 129–143, 2011.
[6] G. Wang, F. Wang, T. Chen, D.-Y. Yeung, and F. Lochovsky, “Solution path for manifold regularized semisupervised classification,” IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 42, no. 2, pp. 308–319, 2012.
[7] A. Blum and S. Chawla, “Learning from labeled and unlabeled data using graph mincuts,” in Proceedings of the Eighteenth International Conference on Machine Learning, 2001, pp. 19–26.
[8] J. Wang, T. Jebara, and S.-F. Chang, “Semi-supervised learning using greedy max-cut,” Journal of Machine Learning Research, vol. 14, no. 1, pp. 771–800, 2013.
[9] A. Fujino, N. Ueda, and K. Saito, “Semisupervised learning for a hybrid generative/discriminative classifier based on the maximum entropy principle,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 3, pp. 424–437, 2008.
[10] T. Joachims, “Transductive inference for text classification using support vector machines,” in Proceedings of the 16th International Conference on Machine Learning. Morgan Kaufmann, 1999, pp. 200–209.
[11] P. K. Mallapragada, R. Jin, A. Jain, and Y. Liu, “SemiBoost: Boosting for semi-supervised learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 11, pp. 2000–2014, 2009.
[12] Q. Wang, P. Yuen, and G. Feng, “Semi-supervised metric learning via topology preserving multiple semi-supervised assumptions,” Pattern Recognition, vol. 46, no. 9, pp. 2576–2587, 2013.
[13] I. Triguero, S. García, and F. Herrera, “Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study,” Knowledge and Information Systems, pp. 1–40, 2014, in press, doi: 10.1007/s10115-013-0706-y.
[14] D. Yarowsky, “Unsupervised word sense disambiguation rivaling supervised methods,” in Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, 1995, pp. 189–196.
[15] A. Blum and T. Mitchell, “Combining labeled and unlabeled data with Co-Training,” in Proceedings of the Annual ACM Conference on Computational Learning Theory, 1998, pp. 92–100.
[16] K. Bennett, A. Demiriz, and R. Maclin, “Exploiting unlabeled data in ensemble methods,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002, pp. 289–296.
[17] Z.-H. Zhou and M. Li, “Semi-supervised learning by disagreement,” Knowledge and Information Systems, vol. 24, no. 3, pp. 415–439, 2010.
[18] G. Jin and R. Raich, “Hinge loss bound approach for surrogate supervision multi-view learning,” Pattern Recognition Letters, vol. 37, pp. 143–150, 2014.
[19] U. Maulik and D. Chakraborty, “A self-trained ensemble with semisupervised SVM: An application to pixel classification of remote sensing imagery,” Pattern Recognition, vol. 44, no. 3, pp. 615–623, 2011.
[20] A. Joshi and N. Papanikolopoulos, “Learning to detect moving shadows in dynamic environments,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 11, pp. 2055–2063, 2008.
[21] M. Li and Z.-H. Zhou, “Improve computer-aided diagnosis with machine learning techniques using undiagnosed samples,” IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, vol. 37, no. 6, pp. 1088–1098, 2007.
[22] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, pp. 123–140, 1996.
[23] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
[24] I. Triguero, S. García, and F. Herrera, “Differential evolution for optimizing the positioning of prototypes in nearest neighbor classification,” Pattern Recognition, vol. 44, no. 4, pp. 901–916, 2011.
[25] K. V. Price, R. M. Storn, and J. A. Lampinen, Differential Evolution: A Practical Approach to Global Optimization, ser. Natural Computing Series. Springer, 2005.
[26] J. Alcalá-Fdez, A. Fernández, J. Luengo, J. Derrac, S. García, L. Sánchez, and F. Herrera, “KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework,” Journal of Multiple-Valued Logic and Soft Computing, vol. 17, no. 2-3, pp. 255–277, 2011.
[27] A. Frank and A. Asuncion, “UCI machine learning repository,” 2010. [Online]. Available: http://archive.ics.uci.edu/ml
[28] S. García, A. Fernández, J. Luengo, and F. Herrera, “Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power,” Information Sciences, vol. 180, pp. 2044–2064, 2010.
[29] M. Li and Z.-H. Zhou, “SETRED: Self-training with editing,” in Lecture Notes in Computer Science, vol. 3518, 2005, pp. 611–621.
[30] S. Dasgupta, M. L. Littman, and D. A. McAllester, “PAC generalization bounds for co-training,” in Advances in Neural Information Processing Systems 14, 2001, pp. 375–382.
[31] J. Du, C. X. Ling, and Z.-H. Zhou, “When does co-training work in real data?” IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 5, pp. 788–799, 2010.
[32] S. Goldman and Y. Zhou, “Enhancing supervised learning with unlabeled data,” in Proceedings of the 17th International Conference on Machine Learning. Morgan Kaufmann, 2000, pp. 327–334.
[33] Y. Zhou and S. Goldman, “Democratic co-learning,” in Proceedings of the IEEE International Conference on Tools with Artificial Intelligence, 2004, pp. 594–602.
[34] Z.-H. Zhou and M. Li, “Tri-training: Exploiting unlabeled data using three classifiers,” IEEE Transactions on Knowledge and Data Engineering, vol. 17, pp. 1529–1541, 2005.
[35] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[36] M. Hady and F. Schwenker, “Combining committee-based semi-supervised learning and active learning,” Journal of Computer Science and Technology, vol. 25, pp. 681–698, 2010.
[37] M. Hady, F. Schwenker, and G. Palm, “Semi-supervised learning for tree-structured ensembles of RBF networks with co-training,” Neural Networks, vol. 23, pp. 497–509, 2010.
[38] Y. Yaslan and Z. Cataltepe, “Co-training with relevant random subspaces,” Neurocomputing, vol. 73, no. 10-12, pp. 1652–1661, 2010.
[39] T. Huang, Y. Yu, G. Guo, and K. Li, “A classification algorithm based on local cluster centers with a few labeled training examples,” Knowledge-Based Systems, vol. 23, no. 6, pp. 563–571, 2010.
[40] Y. Wang, X. Xu, H. Zhao, and Z. Hua, “Semi-supervised learning based on nearest neighbor rule and cut edges,” Knowledge-Based Systems, vol. 23, no. 6, pp. 547–554, 2010.
[41] S. Sun and Q. Zhang, “Multiple-view multiple-learner semi-supervised learning,” Neural Processing Letters, vol. 34, no. 3, pp. 229–240, 2011.
[42] A. Halder, S. Ghosh, and A. Ghosh, “Aggregation pheromone metaphor for semi-supervised classification,” Pattern Recognition, vol. 46, no. 8, pp. 2239–2248, 2013.
[43] M.-L. Zhang and Z.-H. Zhou, “CoTrade: Confident co-training with data editing,” IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 41, no. 6, pp. 1612–1626, 2011.
[44] I. Triguero, J. A. Sáez, J. Luengo, S. García, and F. Herrera, “On the characterization of noise filters for self-training semi-supervised in nearest neighbor classification,” Neurocomputing, 2013, in press, doi: 10.1016/j.neucom.2013.05.055.
[45] I. T. Jolliffe, Principal Component Analysis. Berlin; New York: Springer-Verlag, 1986.
[46] C. Deng and M. Guo, “A new co-training-style random forest for computer aided diagnosis,” Journal of Intelligent Information Systems, vol. 36, pp. 253–281, 2011.
[47] Y. Sun, A. K. C. Wong, and M. S. Kamel, “Classification of imbalanced data: A review,” International Journal of Pattern Recognition and Artificial Intelligence, vol. 23, no. 4, pp. 687–719, 2009.
[48] H. He and E. Garcia, “Learning from imbalanced data,” IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263–1284, 2009.
[49] S. García, J. Derrac, I. Triguero, C. J. Carmona, and F. Herrera, “Evolutionary-based selection of generalized instances for imbalanced classification,” Knowledge-Based Systems, vol. 25, no. 1, pp. 3–12, 2012.
[50] H. Zhang and M. Li, “RWO-Sampling: A random walk over-sampling approach to imbalanced data classification,” Information Fusion, 2014, in press, doi: 10.1016/j.inffus.2013.12.003.
[51] V. López, A. Fernández, S. García, V. Palade, and F. Herrera, “An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics,” Information Sciences, vol. 250, pp. 113–141, 2013.
[52] G. E. A. P. A. Batista, R. C. Prati, and M. C. Monard, “A study of the behaviour of several methods for balancing machine learning training data,” SIGKDD Explorations, vol. 6, no. 1, pp. 20–29, 2004.
[53] I. Triguero, S. García, and F. Herrera, “IPADE: Iterative prototype adjustment for nearest neighbor classification,” IEEE Transactions on Neural Networks, vol. 21, no. 12, pp. 1984–1990, 2010.
[54] A. E. Eiben and J. E. Smith, Introduction to Evolutionary Computing. Berlin: Springer-Verlag, 2003.
[55] S. Das and P. Suganthan, “Differential evolution: A survey of the state-of-the-art,” IEEE Transactions on Evolutionary Computation, vol. 15, no. 1, pp. 4–31, 2011.
[56] “BBC datasets,” 2014. [Online]. Available: http://mlg.ucd.ie/datasets/bbc.html
[57] T. M. Cover and P. E. Hart, “Nearest neighbor pattern classification,” IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, 1967.
[58] J. R. Quinlan, C4.5: Programs for Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers, 1993.
[59] D. W. Aha, D. Kibler, and M. K. Albert, “Instance-based learning algorithms,” Machine Learning, vol. 6, no. 1, pp. 37–66, 1991.
[60] F. Wilcoxon, “Individual comparisons by ranking methods,” Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945.
[61] Z. Jiang, S. Zhang, and J. Zeng, “A hybrid generative/discriminative method for semi-supervised classification,” Knowledge-Based Systems, vol. 37, pp. 137–145, 2013.
[62] W. Li, L. Duan, I. Tsang, and D. Xu, “Co-labeling: A new multi-view learning approach for ambiguous problems,” in Proceedings of the IEEE International Conference on Data Mining (ICDM), 2012, pp. 419–428.

Isaac Triguero received the M.Sc. degree in Computer Science from the University of Granada, Granada, Spain, in 2009. He is currently a Ph.D. student in the Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain. His research interests include data mining, data reduction, biometrics, evolutionary algorithms and semi-supervised learning.

Salvador García received the M.Sc. and Ph.D. degrees in Computer Science from the University of Granada, Granada, Spain, in 2004 and 2008, respectively. He is currently an Associate Professor in the Department of Computer Science, University of Jaén, Jaén, Spain. He has published more than 40 papers in international journals. He has co-edited two special issues in international journals on different data mining topics and is a member of the editorial board of the Information Fusion journal. His research interests include data mining, data reduction, data complexity, imbalanced learning, semi-supervised learning, statistical inference and evolutionary algorithms.

Francisco Herrera received his M.Sc. in Mathematics in 1988 and Ph.D. in Mathematics in 1991, both from the University of Granada, Spain. He is currently a Professor in the Department of Computer Science and Artificial Intelligence at the University of Granada. He has published more than 240 papers in international journals. He is coauthor of the book “Genetic Fuzzy Systems: Evolutionary Tuning and Learning of Fuzzy Knowledge Bases” (World Scientific, 2001). He currently acts as Editor in Chief of the international journals “Information Fusion” (Elsevier) and “Progress in Artificial Intelligence” (Springer). He acts as area editor of the International Journal of Computational Intelligence Systems and associate editor of the journals IEEE Transactions on Fuzzy Systems, Information Sciences, Knowledge and Information Systems, Advances in Fuzzy Systems, and International Journal of Applied Metaheuristics Computing; and he serves as a member of several journal editorial boards, among others: Fuzzy Sets and Systems, Applied Intelligence, Information Fusion, Evolutionary Intelligence, International Journal of Hybrid Intelligent Systems, Memetic Computation, and Swarm and Evolutionary Computation. He received the following honors and awards: ECCAI Fellow 2009; the 2010 Spanish National Award on Computer Science ARITMEL to the “Spanish Engineer on Computer Science”; the International Cajastur “Mamdani” Prize for Soft Computing (Fourth Edition, 2010); the IEEE Transactions on Fuzzy Systems Outstanding 2008 Paper Award (bestowed in 2011); and the 2011 Lotfi A. Zadeh Prize Best Paper Award of the International Fuzzy Systems Association. His current research interests include computing with words and decision making, bibliometrics, data mining, biometrics, data preparation, instance selection, fuzzy rule based systems, genetic fuzzy systems, knowledge extraction based on evolutionary algorithms, memetic algorithms and genetic algorithms.