Improving Data Mining Results by Taking Advantage of the Data Warehouse Dimensions: A Case Study in Outlier Detection

Mohammad Nozari Zarmehri
INESC TEC, Faculdade de Engenharia, Universidade do Porto (FEUP), Porto, Portugal

Carlos Soares
INESC TEC, Faculdade de Engenharia, Universidade do Porto (FEUP), Porto, Portugal

Abstract—As more data becomes available, describing entities in finer detail, companies are becoming interested in developing many specific data mining models (e.g. a sales prediction model for each individual product) instead of a single model for the entire population. The choice of the level of granularity at which models are obtained is usually based on some domain-level knowledge. We argue that the schema of the data warehouse used in business intelligence solutions, which typically precede the development of data mining solutions, can be used for that purpose. We also argue that for different parts of the problem (e.g. different products), the best granularity level may vary. Our analysis is based on the results obtained on the problem of detecting errors in foreign trade transactions. We apply outlier detection methods to data aggregated at different levels of the product taxonomy.

I. INTRODUCTION

Data mining (DM) projects often work on the raw data obtained from transactional systems, such as Enterprise Resource Planning (ERP) and Customer Relationship Management (CRM) systems. The data is collected from those systems, prepared and then analyzed, typically following a methodology such as CRISP-DM [1]. The knowledge extracted from the data is usually integrated into the business processes of the company (e.g. selection of customers for direct mailing or detection of fraud in telecom calls).

Additionally, many companies also have Business Intelligence (BI) systems that collect and organize the data from the same systems [2]. The data is used to compute metrics that represent the performance of the business processes (e.g. sales or new customers acquired). The metrics are stored in a Data Warehouse (DW) and are organized into dimensions (e.g., product taxonomy and store location). These dimensions have different granularity levels (e.g., store location can be analyzed at the street, city, region and country level). Tools such as dashboards, reports and OnLine Analytical Processing (OLAP) enable decision makers to gain a better understanding of their business.

There are two main differences between BI and DM systems. In BI, the knowledge is extracted by manual inspection, while in DM, knowledge extraction is automatic. The second difference is that BI systems are targeted at middle and upper level management, while DM models may be integrated into processes at all levels of the company. We note that although this distinction generally holds, there are naturally examples that contradict it.

DM models can be generated at different levels of granularity (e.g. customer, product, store). Traditionally, DM algorithms were applied at the global level. For instance, a single model¹ was learned to make sales predictions for all products. However, as more data is collected and the data characterizes objects at a finer level, there is a growing interest in more specific models that represent groups or individual entities [3]. For instance, we may learn a direct marketing model for each segment of customers or a sales prediction model for each individual product.
Assuming that the data available is sufficient to generate models at different levels of granularity, the choice is usually based on some domain-level knowledge. For instance, in an application to detect errors in foreign trade transactions, approaches have been attempted in which the modeling is done at the global level or at the model level [4], [5]. In this paper we argue that the DW schema, namely the hierarchy associated with its dimensions, can be used to support the selection of the granularity at which the DM models should be developed. Furthermore, we argue that for different parts of the problem (e.g. different products), the best granularity level may vary.

The data and the results previously obtained with it are described in Section II. Section III describes our proposal for outlier detection, aggregating the transactions at different levels of the product hierarchy. We then present the evaluation of the proposal in Section IV. Finally, Section V concludes the paper.

¹ For simplicity, we include multiple-model approaches, such as ensemble learning, when we refer to a "single model".

II. BACKGROUND

In this section, we start by describing the data set of foreign trade transactions used in this work. Then, we describe the previous results obtained on that data set.

A. Foreign Trade Dataset

The data used in this paper is the foreign trade data collected by the Portuguese Institute of Statistics (INE). Users from Portuguese companies fill in forms about import/export transactions with other EU countries, containing several pieces of information about those transactions. They may insert incorrect data when filling in the form. Some of the common errors are declaring the cost in Euro rather than kEuro, the weight in grams instead of kilos, and associating a transaction with the wrong item.

At INE, the experts extend the data set with a few other attributes. The most important of those attributes, in terms of error detection, is the Total Cost / Weight, which represents the cost per kilo. The experts typically detect errors by analyzing this value first and then, eventually, other fields. The data set contains the following information [4]:

- Row number (RN)
- Origin (O): type of method used to insert the data into the data set (1: Disk, 2: Manual, 3: Web)
- In/Out (IO): the flow (1: arrival, 2: dispatch)
- Lot number (L): group of transactions that were shipped together
- Document number (DN): official document number
- Operator ID (OID): company responsible for the transaction
- Month (M): month of the transaction
- Line number (LN): line number in the document
- Country (C): country of origin/destination
- Product code (PC): code of the product
- Weight of the traded goods (W)
- Total cost (TC)
- Type (import/export): (1: Import, 2: Export)
- Total cost/weight (TCW)
- Average Weight/Month (AvgWM): average weight of the transactions made, in the same month, for the product of the current transaction
- Standard Deviation of Weight/Month (SDWM): standard deviation of the weights used to compute AvgWM
- Score (S): normalized distance of the Total cost/weight value to the average value [4]
- Transaction number (TN): number of transactions made, in the same month, for the product of the current transaction
- Error (E): target value (1: error, 0: normal transaction)

A small sample of the data set is presented in Table I. The complete data set covers the months 1, 2, 3, 5, 6, 8, 9 and 10 of 1998 and months 1 and 2 of 1999.
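To illustrate how the derived attributes relate to the raw fields, the following R sketch computes TCW, AvgWM, SDWM and TN for a hypothetical data frame `trades` with the raw columns W, TC, PC and M. This is not the authors' code; it only mirrors the attribute descriptions above and assumes that SDWM is the standard deviation of the weights underlying AvgWM.

  # Hypothetical reconstruction of the derived attributes (base R only).
  trades$TCW   <- trades$TC / trades$W                    # total cost per unit of weight
  grp          <- interaction(trades$PC, trades$M, drop = TRUE)  # one group per product and month
  trades$AvgWM <- ave(trades$W, grp, FUN = mean)          # average weight of the group
  trades$SDWM  <- ave(trades$W, grp, FUN = sd)            # its standard deviation
  trades$TN    <- ave(trades$W, grp, FUN = length)        # number of transactions in the group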
B. Previous Results

The success criterion originally defined by the experts for the automated system is to detect at least 90% of the errors by selecting less than 50% of the transactions. In [4], four different outlier detection algorithms were applied: box plot [6], Fisher's clustering algorithm [7], Knorr & Ng's cell-based algorithm [8], and C5.0 [9]. The best results were obtained with C5.0. A scoring approach was used, which orders the transactions according to their probability of being an outlier. The model obtained with C5.0 was able to identify 90% of the errors by analysing 40% of the transactions. Fisher's algorithm, using the Euclidean distance and detecting 6 clusters, identified 75% of the errors by selecting 49% of the transactions.

The approach based on clustering was further explored more recently [10]. In that work, several hierarchical clustering methods were studied. The results show that the performance of the algorithm depends on the distance function used; the method using the Canberra distance function showed the best performance.

III. METHODOLOGY

In this section, we describe the outlier detection algorithm used to predict the erroneous transactions in the foreign trade dataset.

A. Data Preparation and Exploration

The first step was data preparation, in which the data was cleaned by removing zeros in the main fields (weight and cost) and each month was evaluated at each level. Grouping the data according to higher levels of the product taxonomy increases the number of transactions in each group (Fig. 1).

Fig. 1. Average number of observations for each level in each month

Figure 2 shows a sample of the hierarchy existing in the product codes starting with "11".

Fig. 2. An example of the hierarchical structure within the data set

TABLE I. SMALL EXAMPLE OF THE DATA SET

RN  O  IO  L     DN     OID   M  LN  C   PC        W     TC     TCW    AvgWM   SDWM      S          TN  E
1   2  2   1001  10001  1727  1  1   11  85411650  10    7      739    2.435   3315.93   0.5114699  75  0
2   2  2   1001  10001  1727  1  4   3   49794616  2000  4497   2248   21.711  48369.10  0.4023847  12  0
3   2  2   1001  10001  1727  1  5   11  49794616  25    22     892    21.711  48369.10  0.4304191  12  0
4   2  2   1001  10001  1727  1  8   11  60786650  2     2      1117   27.904  17155.10  1.5614917  4   0
5   2  2   1001  10001  1727  1  10  11  60770400  1     1      1640   0.6697  418.278   2.3197148  7   0
6   2  2   1001  10002  5293  1  1   11  83134627  35    299    8572   15.754  8061.60   0.8908902  3   0
7   2  2   1001  10002  5293  1  2   11  83137027  50    609    12137  10.429  6026.43   0.2832658  50  0
8   2  2   1001  10002  5293  1  3   11  83788501  343   3685   10735  9.493   5167.60   0.2403615  33  0
9   2  2   1001  10002  5293  1  4   11  83784016  1891  19824  10482  13.201  7144.85   0.3806161  9   0
10  2  2   1001  10002  5293  1  5   11  83780035  1644  16492  10033  9.558   4724.45   0.1005249  40  0

B. Error Detection Method

Finally, for each group of products, the outlier detection algorithm is applied. We used the DMwR package [11] within R [12]. The lofactor function is used to obtain local outlier factors: using the LOF algorithm, it produces a vector with the local outlier factor of each case. The number of neighbours used in the calculation of the local outlier factors was k = 5. Afterwards, the outlier factors are sorted in increasing order. The first (lower) quartile (Q1) is the median of the first half of the data, the second quartile is the median, and the third (upper) quartile (Q3) is the median of the second half of the data [6]. Equation 1 defines the Interquartile Range (IQR):

IQR = Q3 − Q1    (1)

Factors that are substantially larger or smaller than the others are referred to as outliers. The potential outliers are those that fall below the threshold in Equation 2 or above the threshold in Equation 3:

Q1 − 1.5 × IQR    (2)

Q3 + 1.5 × IQR    (3)

However, for our purposes, the only values we are interested in are the largest ones, representing the transactions that may have more impact on the foreign trade statistics computed from these data.
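The core of this step can be sketched in a few lines of R. The snippet below is a simplified reconstruction, not the authors' exact script; it assumes `grp_data` is a data frame holding the numeric fields of the transactions in one product group.

  # Local outlier factors for one group of transactions, as described above.
  library(DMwR)                                 # provides lofactor() [11]
  scores <- lofactor(grp_data, k = 5)           # LOF score of each transaction, k = 5 neighbours

  # Quartile-based threshold on the outlier factors (Equations 1 and 3).
  q1  <- quantile(scores, 0.25, na.rm = TRUE)
  q3  <- quantile(scores, 0.75, na.rm = TRUE)
  iqr <- q3 - q1
  flagged <- which(scores > q3 + 1.5 * iqr)     # only unusually large factors are of interest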
By applying this method to the data aggregated at different levels of the product taxonomy, we are able to investigate our hypothesis, namely that better results can be obtained with more data.

C. Evaluation

The goal of the outlier detection method is to meet the criteria introduced by the experts. Accordingly, suitable metrics, described below, are selected.

1) Metrics: The metrics selected for evaluating the model are recall [13] and effort. Recall is the fraction of relevant instances that are retrieved (Eq. 4). Table II shows the values used in Eq. 4: tp is the number of errors that are predicted correctly; if the method fails to predict an error, it is counted as fn. Conversely, tn is the number of error-free transactions that are correctly predicted as such; if the method predicts an error-free transaction as erroneous, it is counted as fp.

Recall = tp / (tp + fn)    (4)

TABLE II. DEFINITION OF EVALUATION METRICS

                        Ground truth
                        True    False
Predictions   True      tp      fp
              False     fn      tn

Besides this common and well-known metric for model evaluation, we define another metric to capture the cost of manually analyzing the outliers. Effort is the proportion of transactions that were selected by the method for manual inspection (Eq. 5). The effort should be as low as possible and not higher than 50%, as indicated earlier.

Effort = (total number of transactions predicted as outliers) / (total number of transactions)    (5)

2) Methodology to obtain results for different levels of aggregation: The outlier detection method is applied separately to the transactions aggregated at each level of the product hierarchy. To ensure that the results are comparable, the evaluation is also done at each aggregation level.
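To make this procedure concrete, the following sketch shows how the transactions could be grouped at one level of the product taxonomy (2, 4, 6 or 8 digits of the product code) and how recall (Eq. 4) and effort (Eq. 5) could then be computed. The function `evaluate_level`, the `detect_outliers` helper (standing in for the LOF/IQR step above) and the data frame `trades` are hypothetical names, not part of the original implementation.

  # Evaluate the outlier detection at one aggregation level (digits = 2, 4, 6 or 8).
  evaluate_level <- function(trades, digits, detect_outliers) {
    level_key <- substr(as.character(trades$PC), 1, digits)      # prefix of the product code
    flagged <- unlist(lapply(split(seq_len(nrow(trades)), level_key), function(idx) {
      idx[detect_outliers(trades[idx, ])]                         # indices flagged within the group
    }))
    tp     <- sum(trades$E[flagged] == 1)
    recall <- tp / sum(trades$E == 1)                             # Equation (4)
    effort <- length(flagged) / nrow(trades)                      # Equation (5)
    c(recall = recall, effort = effort)
  }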
IV. RESULTS

A. Experimental Results for Each Month

To find the outliers, the labeled field (Error) is not used; after detecting them, we use this field to evaluate the model. As an example, the result for January 1998, for goods imported to Portugal with product codes starting with "11" (level 1, where all these transactions are grouped together), is shown in Figure 3, which presents the histogram of this level with the density of the products for different outlier scores. Most transactions have a very low outlier score, below 2, and a few transactions have high outlier scores; these are the actual outliers.

Fig. 3. The performance metrics (level 1) for product codes starting with 11 in January 1998

Figures 4(a) and 5(a) show the percentage of product codes for which the best (lowest) effort is obtained at each level of the product taxonomy in January and June of 1998, respectively. Similar plots for the recall measure are shown in Figures 4(b) and 5(b). According to the graphs, the best results are obtained at different levels of the taxonomy for different products. Although the results confirm our hypothesis that aggregation is generally useful, they also show that 1) this is not always true, because for some of the products the best results are obtained individually, and 2) the best results are obtained at different levels of aggregation. In other words, for some products the best results are obtained at the product level (8 digits) while, for other products, the best results are obtained by aggregating transactions at the first (2 digits), second (4 digits) or third (6 digits) level of the taxonomy.

Fig. 4. The obtained results for February 1998: (a) best effort, (b) best recall

Fig. 5. The obtained results for June 1998: (a) best effort, (b) best recall

B. Average Results for Each Level

In the previous section, we illustrated our approach with results for two months. Here, we present the results on the complete data set (Section II-A), analyzed according to the different levels of the product code hierarchy used for data aggregation. Figures 6 and 7 show the average results (both effort and recall) for different months.

Grouping the transactions of February (Figure 6) at the top level, the model can predict more than 70% of the errors by selecting just 10% of the transactions. Despite a very significant reduction in effort, the recall is not good enough to be accepted by the experts. On the other hand, for level 2 (four digits), the model can predict more than 90% of the errors by analyzing just 12% of the transactions, which is acceptable to the experts. So, for this month, the best results are obtained by grouping the products by the first four digits of the product codes.

Fig. 6. The average results for each level in February 1998: (a) Import, (b) Export

In June, the results show that grouping by two and by four digits of the product code leads to an effort of just 10%. However, this is not acceptable due to the very low recall (less than 90%). The best results for this month are obtained at the third level (six digits) of the product codes.

Fig. 7. Performance of the outlier detection vs. number of observations in June 1998: (a) Import, (b) Export

In summary, decisions taken at the second (four digits) and first (two digits) levels require a lower analysis effort. But to obtain acceptable results, the recall should be higher than 90%, which sometimes indicates that the aggregation should be done at the six-digit level.

V. CONCLUSION

In this paper we have investigated the effect of aggregating data at different levels of a product taxonomy on the performance of an outlier detection method applied to the problem of identifying erroneous foreign trade transactions collected by the Portuguese Institute of Statistics (INE). The approach is tested on 10 months of data and the results are evaluated in terms of recall and the effort involved in the manual analysis of the selected transactions.

The results generally confirm our hypothesis that the larger the number of observations used at the higher levels of aggregation, the better the results obtained. However, the results also show that, depending on the product, the best results can be obtained at different levels of aggregation. In fact, in some cases, the best results are obtained at the lowest level, where there is no grouping. These results indicate that different levels should be selected for different products. One approach that can be followed is to use metalearning to map the characteristics of the data to the ideal level of aggregation [14].

Additionally, we also intend to investigate our methodology on different data sets, such as vehicular networks. The study of the performance of other machine learning algorithms, instead of the outlier detection method used here, is another interesting line of future work.

REFERENCES

[1] P. Chapman, J. Clinton, R. Kerber, T. Khabaza, T. Reinartz, C. Shearer, and R. Wirth, CRISP-DM 1.0: Step-by-Step Data Mining Guide. SPSS, 2000.
[2] R. Kimball, M. Ross, W. Thornthwaite, J. Mundy, and B. Becker, The Data Warehouse Lifecycle Toolkit. Wiley, 2011.
[3] F. Soulié-Fogelman, "Data Mining in the Real World: What Do We Need and What Do We Have?," in Proceedings of the KDD Workshop on Data Mining for Business Applications (R. Ghani and C. Soares, eds.), pp. 44–48, 2006.
[4] C. Soares, P. Brazdil, J. Costa, V. Cortez, and A. Carvalho, "Error Detection in Foreign Trade Data Using Statistical and Machine Learning Algorithms," in Proceedings of the 3rd International Conference and Exhibition on the Practical Application of Knowledge Discovery and Data Mining (N. Mackin, ed.), London, UK, pp. 183–188, 1999.
[5] L. Torgo and C. Soares, "Resource-bounded Outlier Detection Using Clustering Methods," in Data Mining for Business Applications (C. Soares and R. Ghani, eds.), Frontiers in Artificial Intelligence and Applications, pp. 84–98, IOS Press, 2010.
[6] J. S. Milton, J. J. Corbet, and P. M. McTeer, Introduction to Statistics. McGraw-Hill, 1997.
[7] W. D. Fisher, "On Grouping for Maximum Homogeneity," Journal of the American Statistical Association, vol. 53, p. 789, Dec. 1958.
[8] E. M. Knorr and R. T. Ng, "Algorithms for Mining Distance-Based Outliers in Large Datasets," in Proceedings of the 24th International Conference on Very Large Data Bases, VLDB '98, San Francisco, CA, USA, pp. 392–403, Morgan Kaufmann Publishers Inc., 1998.
[9] R. Quinlan, "C5.0: An Informal Tutorial," 1998.
[10] A. Loureiro, L. Torgo, and C. Soares, "Outlier Detection Using Clustering Methods: A Data Cleaning Application," in Proceedings of the Data Mining for Business Workshop (D. C. Soares C Moniz L, ed.), Oporto, Portugal, pp. 57–62, 2005.
[11] L. Torgo, Data Mining with R: Learning with Case Studies. Chapman and Hall/CRC, 2010.
[12] R Core Team, R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2013.
[13] D. M. W. Powers, "Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness & Correlation," Tech. Rep. SIE-07-001, School of Informatics and Engineering, Flinders University, Adelaide, Australia, 2007.
[14] P. Brazdil, C. Giraud-Carrier, C. Soares, and R. Vilalta, Metalearning: Applications to Data Mining. Cognitive Technologies, Springer, 2009.