Improving Data Mining Results by Taking Advantage of the Data Warehouse Dimensions: A Case Study in Outlier Detection

Mohammad Nozari Zarmehri
INESC TEC
Faculdade de Engenharia, Universidade do Porto (FEUP)
Porto, Portugal
Email: [email protected]

Carlos Soares
INESC TEC
Faculdade de Engenharia, Universidade do Porto (FEUP)
Porto, Portugal
Email: [email protected]
Abstract—As more data becomes available, describing entities
with finer detail, companies are becoming interested in developing
many specific data mining models (e.g. a sales prediction model
for each individual product) instead of a single model for the
entire population. The choice of the level of granularity to obtain
models is usually based on some domain-level knowledge. We
argue that the schema of the data warehouse used in business
intelligence solutions, which typically precede the development
of data mining solutions, can be used for that purpose. We
also argue that for different parts of the problem (e.g. different
products), the best granularity level may vary. Our analysis is
based on the results obtained on the problem of detecting errors
in foreign trade transactions. We apply outlier detection methods
on data aggregated at different levels of the product taxonomy.
I. INTRODUCTION
Data mining (DM) projects often work on the raw data
obtained from the transactional systems, such as the Enterprise Resource Planning (ERP) and Customer Relationship
Management (CRM) systems. The data is collected from those
systems, prepared and then analyzed, typically using a methodology such as CRISP-DM [1]. The knowledge extracted from
the data is usually integrated in the business processes of the
company (e.g. selection of customers for direct mailing or
detection of fraud in telecom calls).
Additionally, many companies also have Business Intelligence (BI) systems that collect and organize the data from
the same systems [2]. The data is used to compute metrics
that represent the performance of the business processes (e.g.
sales or new customers acquired). The metrics are stored in
a Data Warehouse (DW) and are organized into dimensions
(e.g., product taxonomy and store location). These dimensions
have different granularity levels (e.g., store location can be
analyzed at the street, city, region and country level). Tools
such as dashboards, reports and OnLine Analytical Processing
(OLAP) enable decision makers to gain a better understanding
of their business.
There are two main differences between BI and DM systems. In BI, the knowledge is extracted by manual inspection,
while in DM, knowledge extraction is automatic. The second
difference is that BI systems are targeted for middle and upper
level management, while DM models may be integrated into
processes at all the levels of company. We note that although
this distinction generally holds, there are naturally examples
that contradict it.
DM models can be generated at different levels of granularity (e.g. customer, product, store). Traditionally, DM algorithms were applied at the global level. For instance, a
single model1 was learned to make sales predictions for all
products. However, as more data is collected and the data
characterizes objects at a finer level, there is a growing interest
in more specific models that represent groups or individual
entities [3]. For instance, we may learn a direct marketing
model for each customer segment or a sales prediction model
for each individual product. Assuming that the data available is
sufficient to generate models at different levels of granularity,
the choice is usually based on some domain-level knowledge.
For instance, in an application to detect errors in foreign
trade transactions, approaches have been attempted where the
modeling is done at the global level or at the model level [4],
[5].
In this paper we argue that the DW schema, namely the
hierarchy associated with the dimensions, can be used to
support the selection of the granularity that should be used
to develop the DM models. Furthermore, we argue that for
different parts of the problems (e.g. different products), the
best granularity level may vary.
The description of the data and the results previously obtained
with it are given in Section II. Section III describes our
proposal for outlier detection, aggregating the transactions at
different levels of the product hierarchy. We then present the
evaluation of the proposal in Section IV-A. Finally, Section V
concludes the paper.
1 For simplicity, we include multiple model approaches, such as ensemble
learning, when we refer to “single model”.
II. BACKGROUND
In this section, we start by describing the data set of foreign
trade transactions used in this work. Then, we describe the
previous results obtained on that data set.
A. Foreign Trade Dataset
The data used in this paper is the foreign trade data
collected by the Portuguese Institute of Statistics (INE). Users
from Portuguese companies fill in forms about import/export
transactions with other EU countries, containing several pieces
of information about those transactions. They may insert
incorrect data when filling in the form. Some of the
common errors are declaring the cost in Euro rather than
kEuro; the weight in grams instead of kilos; and associating
a transaction with the wrong item. At INE, the experts extend
the data set with a few other attributes. The most important of
those attributes, in terms of error detection, is the Total Cost /
Weight. This attribute represents the cost per kilo. The experts
typically detect errors by analyzing this value first and then,
if necessary, other fields. The data set contains the following
information [4]:
Row number (RN)
Origin (O) : Type of method used to insert the data into
the data set (1: Disk, 2: Manual, 3: Web)
In/Out (IO) : The flow (1: arrival, 2: dispatch)
Lot number (L) : Group of transactions that were shipped
together
Document number (DN) : Official document number
Operator ID (OID) : Company that is responsible for the
transaction
Month (M) : Month of the transaction
Line number (LN) : Line number in the document
Country (C) : Country of origin/destination
Product code (PC) : Code of the product
Weight of the traded goods (W)
Total cost (TC)
Type (import/export) : (1: Import, 2: Export)
Total cost/weight (TCW)
Average Weight/Month (AvgWM) : Average weight of the transactions made in the same month for the same product as the current transaction
Standard Deviation of Weight/Month (SDWM) : Standard deviation of AvgWM
Score (S) : Normalized distance of the Total cost/weight value to the average value [4]
Transaction number (TN) : Number of transactions made in the same month for the same product as the current transaction
Error (E) : target value (1: error, 0: normal transaction).
A small sample of the data set is presented in Table I. It
contains one transaction from each of the months 1, 2, 3, 5,
6, 8, 9, 10 in 1998 and 1, 2 in 1999.
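To make the derived attributes concrete, the following R sketch illustrates how statistics of this kind could be computed per product and month from the raw fields. The data frame d, its column names (following the abbreviations above), the unit scaling of TC and W, and the exact normalization used by the INE experts and in [4] are assumptions of this sketch, not a description of the original system.

# Hypothetical sketch: derived attributes per product (PC) and month (M).
# 'd' is assumed to be a data frame holding the raw transactions.
d$TCW <- d$TC / d$W                              # cost per kilo (up to unit scaling)
grp <- interaction(d$PC, d$M, drop = TRUE)       # same product, same month
d$TN <- ave(d$TCW, grp, FUN = length)            # number of transactions in the group
d$AvgWM <- ave(d$W, grp, FUN = mean)             # average weight in the group
d$S <- abs(d$TCW - ave(d$TCW, grp, FUN = mean)) /
  ave(d$TCW, grp, FUN = sd)                      # normalized distance to the group average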
B. Previous results
The success criterion originally defined by the experts for the
automated system is to select less than 50% of the transactions
while capturing at least 90% of the errors.
In [4], four different outlier detection algorithms were
applied: box plot [6], Fisher’s clustering algorithm [7], Knorr
& Ng’s cell-based algorithm [8], and C5.0 [9]. The best results
were obtained by C5.0. A scoring approach was used, which
orders the transactions according to the probability of being
an outlier. The model obtained with C5.0 was able to identify
90% of the errors by analysing 40% of the transactions.
Fisher's algorithm, using the Euclidean distance and 6 clusters,
detected 75% of the errors by selecting 49% of the
transactions.
The approach based on clustering was further explored more
recently [10]. In that work, several hierarchical clustering
methods were evaluated. The results show that the performance of the algorithm
depends on the distance function used, with the Canberra
distance yielding the best performance.
III. METHODOLOGY
In this section, we describe the outlier detection algorithm
used for detecting erroneous transactions in the
foreign trade dataset.
A. Data Preparation and Exploration
The first step was data preparation, in which the data was
cleaned by removing transactions with zeros in the main fields (weight and cost).
Each month was then evaluated separately at each aggregation level.
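As an illustration, the cleaning and per-month splitting described above could be done as follows in R; the data frame d and the column names W, TC and M follow the abbreviations of Section II-A and are assumptions of this sketch.

# Drop transactions with zeros in the main fields (weight and cost)
d <- d[d$W > 0 & d$TC > 0, ]
# Evaluate each month separately
by_month <- split(d, d$M)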
Grouping the data according to higher levels of the product
taxonomy increases the number of transactions in each group
(Fig. 1).
Fig. 1. Average number of observations for each level in each month
Figure 2 shows a sample of the hierarchy existing in the
product codes starting with "11".
TABLE I
SMALL EXAMPLE OF THE DATA SET

RN  O  IO  L     DN     OID   M  LN  C   PC        W     TC     TCW    AvgWM   SDWM      S          TN  E
1   2  2   1001  10001  1727  1  1   11  85411650  10    7      739    2.435   3315.93   0.5114699  75  0
2   2  2   1001  10001  1727  1  4   3   49794616  2000  4497   2248   21.711  48369.10  0.4023847  12  0
3   2  2   1001  10001  1727  1  5   11  49794616  25    22     892    21.711  48369.10  0.4304191  12  0
4   2  2   1001  10001  1727  1  8   11  60786650  2     2      1117   27.904  17155.10  1.5614917  4   0
5   2  2   1001  10001  1727  1  10  11  60770400  1     1      1640   0.6697  418.278   2.3197148  7   0
6   2  2   1001  10002  5293  1  1   11  83134627  35    299    8572   15.754  8061.60   0.8908902  3   0
7   2  2   1001  10002  5293  1  2   11  83137027  50    609    12137  10.429  6026.43   0.2832658  50  0
8   2  2   1001  10002  5293  1  3   11  83788501  343   3685   10735  9.493   5167.60   0.2403615  33  0
9   2  2   1001  10002  5293  1  4   11  83784016  1891  19824  10482  13.201  7144.85   0.3806161  9   0
10  2  2   1001  10002  5293  1  5   11  83780035  1644  16492  10033  9.558   4724.45   0.1005249  40  0
Fig. 2. An example of the hierarchical structure existing within the data set
B. Error detection method
Finally, for each group of products, the outlier detection
algorithm is applied. We used the DMwR package [11] within
R [12]. The lofactor function, which implements the LOF
algorithm, produces a vector of local outlier factors, one for
each case. The number of neighbours used in the calculation
of the local outlier factors was k=5.
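A minimal sketch of this step in R is shown below. It assumes the transactions of one product group are in a data frame g and that the LOF scores are computed on the numeric fields W, TC and TCW; the exact set of attributes used is not stated above and is an assumption of this sketch.

library(DMwR)                               # provides lofactor() [11]
num <- scale(g[, c("W", "TC", "TCW")])      # standardize the numeric attributes (assumed)
lof <- lofactor(num, k = 5)                 # one local outlier factor per transaction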
Afterwards, the outlier factors are sorted in increasing order. The first
(lower) quartile (Q1) is the median of the first half of the data,
the second quartile is the median, and the third (upper) quartile
(Q3) is the median of the second half of the data [6]. The
Interquartile Range (IQR) is given by Eq. 1:

IQR = Q3 − Q1    (1)
Factors that are substantially larger or smaller than
the other factors are referred to as outliers. The potential
outliers are the values that fall below the boundary in Eq. 2 or above
the boundary in Eq. 3. However, for our purposes, we are only interested
in the largest values, which represent the transactions
that may have more impact on the foreign trade statistics
computed from these data.

Q1 − 1.5 ∗ IQR    (2)

Q3 + 1.5 ∗ IQR    (3)
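In R, the cut-off of Eqs. 1-3 applied to the LOF scores obtained above could look like the following sketch; quantile() is used here as an approximation of the textbook quartile definition given in the text.

q1 <- quantile(lof, 0.25)                   # first quartile of the LOF scores
q3 <- quantile(lof, 0.75)                   # third quartile
iqr <- q3 - q1                              # Eq. 1
flagged <- which(lof > q3 + 1.5 * iqr)      # transactions above the upper boundary (Eq. 3)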
By applying this method to the data aggregated at different
levels of the product taxonomy, we are able to investigate our
hypothesis, namely that better results can be obtained with
more data.
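The sketch below illustrates, under the same assumptions as the previous snippets, how the detection could be repeated at each level of the product taxonomy by truncating the product code to 2, 4, 6 or 8 digits. The helper name detect_at_level and the guard against very small groups are ours, not part of the original method.

library(DMwR)

detect_at_level <- function(d, level, k = 5) {
  # Taxonomy prefix: 2, 4, 6 or 8 digits (PC is assumed to be an integer code)
  key <- substr(sprintf("%08d", d$PC), 1, 2 * level)
  lapply(split(d, key), function(g) {
    if (nrow(g) <= k) return(g[0, ])        # too few transactions to compute LOF
    lof <- lofactor(scale(g[, c("W", "TC", "TCW")]), k = k)
    q <- quantile(lof, c(0.25, 0.75))
    g[lof > q[2] + 1.5 * (q[2] - q[1]), ]   # flagged transactions in this group
  })
}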
C. Evaluation
The goal of the outlier detection method is to meet the
criteria defined by the experts. Accordingly, appropriate metrics,
described next, are selected.
1) Metrics: The metrics selected for evaluating the model
are recall [13] and effort. Recall is the fraction of relevant
instances that are retrieved (Eq. 4). Table II shows the values
used in Eq. 4: tp is the number of errors that are predicted
correctly; if the method cannot detect an error, it
is counted as fn. On the other hand, tn is the number of
errorless transactions that are predicted as such, while an
errorless transaction predicted as erroneous is counted as fp.
Besides this common and well-known metric for model
evaluation, we define another metric to capture the cost
of manually analyzing the outliers. Effort is the proportion
of transactions that were selected by the method for manual
inspection (Eq. 5). The effort should be as low as possible and
not higher than 50%, as indicated earlier.
Recall = tp / (tp + fn)    (4)

Effort = (Total number of transactions predicted as outliers) / (Total number of transactions)    (5)

TABLE II
DEFINITION OF EVALUATION METRICS

                        Ground Truth
                        True    False
Predictions   True      tp      fp
              False     fn      tn

2) Methodology to obtain results for different levels of
aggregation: The outlier detection method is used separately
for the transactions aggregated at each level of the product
hierarchy. To ensure that results are comparable, the evaluation
is also done at each aggregation level.
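A sketch of how the two metrics could be computed in R for one aggregation level is given below; flagged is assumed to be the vector of row indices selected by the detection step and E the ground-truth Error field of Section II-A.

recall <- function(flagged, d) {
  tp <- sum(d$E[flagged] == 1)              # errors that were flagged
  fn <- sum(d$E == 1) - tp                  # errors that were missed
  tp / (tp + fn)                            # Eq. 4
}
effort <- function(flagged, d) {
  length(flagged) / nrow(d)                 # Eq. 5
}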
IV. RESULTS

A. Experimental Results for each month

To find the outliers, the labeled field (Error) is not used. After detecting them, we use this field to evaluate the model. As
an example, the result for January 1998, for goods imported
into Portugal with product codes starting with "11" (level
1, where all such transactions are grouped together), is shown in
Figure 3.

Figure 3 shows the histogram for this level with the density
of the products with different outlier scores. Most transactions
have a very low outlier score, below 2, and only a few
transactions have high outlier scores; these are the actual outliers.

Fig. 3. The performance metrics (level 1) for product codes starting with 11 in January 1998

Figures 4(a) and 5(a) show the percentage of the product
codes for which the best effort (the lowest effort) is obtained
at each level of the product taxonomy in January and June
of 1998, respectively. Similar plots for the recall measure are
shown in Figures 4(b) and 5(b). According to the graphs, the
best results are obtained at different levels of the taxonomy for
different products. Although the results confirm our hypothesis
that aggregation is generally useful, they also show that 1) this
is not always true, because for some of the products the best
results are obtained individually, and 2) the best results are
obtained at different levels of aggregation. In other words, for
some products the best results are obtained at the product
level (8 digits) while, for other products, the best results are
obtained by aggregating transactions at the first (2 digits), the
second (4 digits) or the third (6 digits) level of the taxonomy.

Fig. 4. The obtained results for February 1998: (a) Best effort, (b) Best recall

Fig. 5. The obtained results for June 1998: (a) Best effort, (b) Best recall

B. Average results for each level

In the previous section, we illustrated our approach with
results for two months. Here, we present the results on the
complete data set (Section II-A), analyzed according to the
different levels of the hierarchy in the product codes used
for data aggregation.

Figures 6 and 7 show the average results (both effort and
recall) for different months. Grouping the transactions of
February (Figure 6) at the top level, the model can
predict more than 70% of the errors by selecting just 10%
of the transactions. Despite a very significant reduction in the
effort, the recall is not good enough to be accepted by the
experts. On the other hand, for level 2 (four digits), the model
can predict more than 90% of the errors by analyzing just
12% of the transactions, which is acceptable to the experts.
So, for this month, the best results are obtained by grouping
the products by the first four digits of the product codes.

In June, the results show that grouping by two and four
digits of the product code leads to an effort of just 10%.
However, it is not acceptable due to the very low recall (less than
90%). The best results for this month are obtained at the third
level (six digits) of the product codes.

In summary, the groupings at the second (four digits) and
the first (two digits) levels require lower effort for analysis.
However, to obtain acceptable results, the recall should
be higher than 90%, which sometimes indicates that the
aggregation that should be used is at the six-digit level.

Fig. 6. The average results for each level in February 1998: (a) Import, (b) Export

Fig. 7. Performance of the outlier detection vs. number of observations in June 1998: (a) Import, (b) Export

V. CONCLUSION

In this paper we have investigated the effect of aggregating
data at different levels of a product taxonomy on the performance
of an outlier detection method applied to the problem
of identifying erroneous foreign trade transactions collected
by the Portuguese Institute of Statistics (INE). The approach
is tested on 10 months of data and the results are evaluated in
terms of recall and the effort involved in the manual analysis
of the selected transactions.

The results generally confirm our hypothesis that the larger
the number of observations used at the higher levels of aggregation,
the better the results obtained. However, the results
also show that, depending on the product, the best results can
be obtained at different levels of aggregation. In fact, in some
cases, the best results are obtained at the lowest level, where
there is no grouping.

These results indicate that different levels should be selected
for different products. One approach that can be followed is
to use metalearning to map the characteristics of the data to
the ideal level of aggregation [14].

Additionally, we also intend to investigate our methodology
on different data sets, such as vehicular networks. The study of
the performance of other machine learning algorithms, instead
of the outlier detection method used here, is another interesting
line of future work.
REFERENCES
[1] P. Chapman, J. Clinton, R. Kerber, T. Khabaza, T. Reinartz, C. Shearer,
and R. Wirth, CRISP-DM 1.0: Step-by-Step Data Mining Guide.
SPSS, 2000.
[2] R. Kimball, M. Ross, W. Thornthwaite, J. Mundy, and B. Becker, The
data warehouse lifecycle toolkit. Wiley, 2011.
[3] F. Soulié-Fogelman, “Data Mining in the real world: What do we need
and what do we have?,” in Proceedings of the KDD Workshop on Data
Mining for Business Applications (R. Ghani and C. Soares, eds.), pp. 44–
48, 2006.
[4] C. Soares, P. Brazdil, J. Costa, V. Cortez, and A. Carvalho, “Error
Detection in Foreign Trade Data using Statistical and Machine Learning
Algorithms,” in Proceedings of the 3rd International Conference and
Exhibition on the Practical Application of Knowledge Discovery and
Data Mining (N. Mackin, ed.), (London, UK), pp. 183–188, 1999.
[5] L. Torgo and C. Soares, “Resource-bounded Outlier Detection Using
Clustering Methods,” in Data Mining for Business Applications (C. Soares
and R. Ghani, eds.), Frontiers in Artificial Intelligence and Applications,
pp. 84–98, IOS Press, 2010.
[6] J. S. Milton, P. M. McTeer, and J. J. Corbet, Introduction to Statistics. McGraw-Hill,
1997.
[7] W. D. Fisher, “On Grouping for Maximum Homogeneity,” Journal of
the American Statistical Association, vol. 53, p. 789, Dec. 1958.
[8] E. M. Knorr and R. T. Ng, “Algorithms for Mining Distance-Based
Outliers in Large Datasets,” in Proceedings of the 24th International
Conference on Very Large Data Bases, VLDB ’98, (San Francisco, CA,
USA), pp. 392–403, Morgan Kaufmann Publishers Inc., 1998.
[9] R. Quinlan, “C5.0: An Informal Tutorial,” 1998.
[10] A. Loureiro, L. Torgo, and C. Soares, “Outlier Detection Using Clustering Methods: a Data Cleaning Application,” in Proceedings of the Data
Mining for Business Workshop (C. Soares, L. Moniz, and C. Duarte, eds.), (Oporto,
Portugal), pp. 57–62, 2005.
[11] L. Torgo, Data Mining with R: Learning with Case Studies. Chapman
and Hall/CRC, 2010.
[12] R Core Team, R: A Language and Environment for Statistical Comput-
ing. R Foundation for Statistical Computing, Vienna, Austria, 2013.
[13] D. M. W. Powers, “Evaluation: From Precision, Recall and F-Factor
to ROC, Informedness, Markedness & Correlation,” Tech. Rep. SIE-07-001, School of Informatics and Engineering, Flinders University,
Adelaide, Australia, 2007.
[14] P. Brazdil, C. Giraud-Carrier, C. Soares, and R. Vilalta, Metalearning:
Applications to Data Mining. Cognitive Technologies, Springer, 2009.