System filtracji reklam internetowych (Internet advertisement filtering system)
Academic Year 2012/2013
Warsaw University of Technology
Faculty of Electronics and Information Technology
Electrical and Computer Engineering
BACHELOR OF SCIENCE THESIS
Krzysztof Kamiński
Comparative study of machine-learning algorithms for
the detection of ads on web pages
Supervisor
prof. dr hab. inż. Mieczysław Muraszkiewicz
Evaluation: .............................................
.................................................................
Signature of the Head
of Examination Committee
Electrical and Computer Engineering
Date of Birth: 1989.07.07
Starting Date of Studies: 2008.10.01
Curriculum Vitae
I was born on 7 July 1989 in Warsaw. After completing primary school and middle school, I attended the 25th Józef Wybicki High School in Warsaw. In October 2008, I started studying at the Faculty of Electronics and Information Technology at Warsaw University of Technology, majoring in Electronics and Information Technology.
.......................................................
Signature of the Student
Bachelor of Science Examination
Examination was held on: ................................................................................................... 2013
With the result: ............................................................................................................................
Final Result of the Studies: .........................................................................................................
Suggestions and Remarks of the B.Sc. Examination Committee: ..............................................
......................................................................................................................................................
......................................................................................................................................................
SUMMARY
This thesis involves the analysis of learning algorithms for detecting advertising content on the Internet. The work covers the analysis of ads appearing in the context of web pages and finding a good representation of the problem by defining appropriate attributes that describe an advertisement. A further part of the thesis is the implementation of the chosen classifiers and their testing, followed by the selection of the best methods and optimal parameters. The thesis was carried out within the project "Smart AdBlocker" of the Institute of Computer Science, Warsaw University of Technology, for the T-Mobile company.
Keywords: machine learning, text mining, advertisement filtering
TOPIC:
Comparative study of learning algorithms for the detection of advertisements on web pages.
ABSTRACT
This thesis concerns the analysis of learning algorithms for the purpose of detecting advertising content on web pages. The research covers the analysis of ads appearing in the context of web pages and finding a good representation of the problem by defining appropriate attributes describing the ads. The next part of the work is the implementation of suitable classifiers and the execution of tests. The result of the work is a proposal of solutions, with a choice of the best methods and optimal parameters. The thesis was carried out within the "Smart AdBlocker" project of the Institute of Computer Science of the Warsaw University of Technology for the T-Mobile company.
Keywords: machine learning, text mining, advertisement filtering
I would like to thank my co-workers on the project,
P. Szczepański, A. Wiśniewski and M. Januszewski,
for their help, patience and excellent work.
Table of contents
1. Problem understanding
2. Machine learning and URLs
3. Selected algorithms with parameters
  3.1. Naive Bayes classifier
  3.2. Bayesian networks
  3.3. Support vector machines
  3.4. AdaBoost
  3.5. Artificial Neural Network
  3.6. k-nearest neighbor algorithm
  3.7. Decision tree
  3.8. Random forest
4. Preprocessing
  4.1. Dataflow
  4.2. Constructing dataset
  4.3. Feature extractors
  4.4. Feature selection
  4.5. Additional transformations
5. System description
  5.1. Task generation
  5.2. Server/Database
6. Results and assessment method
  6.1. Classifier accuracy assessment method
  6.2. Results
    6.2.1. 100 features
    6.2.2. 200 features
    6.2.3. 500 features
    6.2.4. 1000 features
    6.2.5. 2000 features
    6.2.6. 5000 features
7. Conclusions and future progress
  7.1. Conclusions
  7.2. Summary of results
  7.3. Future progress
Bibliography
1. Problem understanding
At the beginning, the Internet was a network whose purpose was the exchange of information, but today it is one of the most important elements of the global economy. With its increase in popularity, people started to realize that, besides its other features, the Internet can be a source of income, and after a short time commercials became a common part of almost every website and one of the most efficient ways of earning money for web owners. At the beginning, ads were small due to limitations such as Internet connection speed, but also because of the unspoken ethics concerning online advertising. Nowadays it looks like most web owners see their website as a business more than a source of information, putting up more and more adverts which are bigger, brighter and harder to close than before.
Taking into consideration that Internet devices are becoming more mobile, like cell phones or tablets, and the direction in which online advertising is going, this is starting to become a huge problem. With increasing advert sizes, the page load time increases remarkably on limited mobile Internet connections; but that is not the only reason: small screens cannot deal with large, aggressive ads, and, most importantly, mobile users pay data charges for their Internet connection, meaning they have to pay for every unwanted advertisement.
The common way of blocking online advertising is the blacklisting method [1]; unfortunately, this method has a few holes. The agencies specializing in online advertising operate on a wide range of domains, which makes this method of blocking ads insufficient, due to the frequent need for updates. Another problem is inflexibility, because of the inability to set user preferences about what kinds of ads should be blocked.
2. Machine learning and URLs
The goal of machine learning is to find a hypothesis h: X → C, where X is a set of entities and C is the set of classes. The result is the closest possible hypothesis to the concept c: X → C. In the problem of categorizing advertisements, only two classes are possible, meaning C = {content, advert}.
Feature selection is the process of reducing the feature space in which the entities are represented. This is necessary not only to reduce the algorithm's time and space complexity but also to significantly improve its performance [2]. The two main problems with a huge feature space are overfitting and the curse of dimensionality. The first issue leads to a situation where our hypothesis cannot correctly identify new entities, because it is too strongly shaped by the training set. The second problem refers to a huge space in which unimportant features cause quite similar objects to lie at a long distance from each other.
A uniform resource locator, in short URL, is a character string referencing a specified Internet resource, and usually one web page consists of many parts, each labeled by a URL. An attempt at fast classification of web pages using URLs was made by Devi, Rajaram and Selvakuberan [4]. However, they used only a simple method to extract features from each entity and did not perform a feature selection process, and thus obtained insufficient results.
3. Selected algorithms with parameters
3.1. Naive Bayes classifier
Naive Bayes is the simplest instance of a probabilistic classifier and is based on Bayes' theorem. The classifier considers the probability of each attribute separately. The final decision is made by taking the product of all conditional probabilities that an entity possessing the features f1, f2, …, fn belongs to class C. The classification process comes down to the problem of maximizing the following function:
$$\mathrm{classify}(f_1, f_2, \ldots, f_n) = \operatorname*{argmax}_{c}\; p(C = c) \prod_{i=1}^{n} p(F_i = f_i \mid C = c)$$
This algorithm has already been studied and tested for solving the URL classification problem in the literature [5]; according to that paper, Naive Bayes showed very good performance.
This classifier requires discrete attributes, and the methods differ in the way the vector of attributes is discretized and how segments are created. Creation of the segments can be done using unsupervised or supervised methods. The unsupervised method is a basic division of the value range into segments of equal size, without checking the contents. The supervised method is Multi-Interval discretization. This algorithm uses the entropy minimization heuristic to discretize the range of a continuous-valued attribute into multiple intervals.
Three different configurations were used (a minimal sketch of all three follows the list):
- The density of the values of a given parameter is assumed to be a normal distribution, and segments of equal size are created by the unsupervised method.
- The density of the values of a parameter is determined by a kernel estimator, and segments of equal size are again created by the unsupervised method. According to the literature [5], this method of estimating density should show better results than the normal distribution.
- Discretization of attributes is done by the supervised method: the Multi-Interval discretization method introduced by U. Fayyad and K. Irani [6] was used.
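A minimal sketch of these three configurations, assuming the WEKA 3.x Java API used by the test platform described in section 5.1 (the class and setter names below are WEKA's; the rest is illustrative):

```java
import weka.classifiers.bayes.NaiveBayes;

public class NaiveBayesConfigs {
    public static NaiveBayes[] configurations() {
        // 1) Default: normal-distribution density, equal-size unsupervised segments.
        NaiveBayes normal = new NaiveBayes();

        // 2) Kernel density estimator instead of a single Gaussian per attribute.
        NaiveBayes kernel = new NaiveBayes();
        kernel.setUseKernelEstimator(true);

        // 3) Supervised Multi-Interval (Fayyad & Irani) discretization of attributes.
        NaiveBayes supervised = new NaiveBayes();
        supervised.setUseSupervisedDiscretization(true);

        return new NaiveBayes[] { normal, kernel, supervised };
    }
}
```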
3.2. Bayesian networks
Bayesian networks are statistical models representing the correlation between features by directed acyclic graphs. Each node represents a random variable in the Bayesian sense, such as parameters or characteristics, and conditional dependencies are represented by edges between nodes. Unconnected nodes are variables which are conditionally independent of each other. Each node is associated with a probability function which takes the set of values of the node's parent variables as input and returns the probability of the node's variable.
Bayesian networks have a few important advantages:
- They handle situations where some data entries are missing.
- The conjunction of Bayesian statistical methods and Bayesian networks gives a good approach for avoiding overfitting of the data.
The process of using this algorithm is divided into two parts: learning the network structure and the probability tables, followed by classifying instances by maximizing functions based on conditional probability.
This algorithm can be customized by choosing different combinations of two main parameters:
1. The approximation of the conditional probability distribution, which can be done by different estimators.
2. The way of learning the structure, by choosing the score measure and search algorithm. Among all approaches, three were chosen (a sketch follows the list):
- Local score metrics - allow scoring the whole network by scoring each individual node. The purpose of this approach is to search for the optimal network structure.
- Conditional independence tests - eliminate correlation between independent variables to calculate the relations between features.
- Global score metrics - perform the classification task and compare the accuracy of the results to estimate the value of a network structure.
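A sketch of the three structure-learning approaches, again assuming the WEKA 3.x API; K2 and ICSSearchAlgorithm are WEKA's stock search algorithms, chosen here for illustration:

```java
import weka.classifiers.bayes.BayesNet;
import weka.classifiers.bayes.net.estimate.SimpleEstimator;
import weka.classifiers.bayes.net.search.ci.ICSSearchAlgorithm;

public class BayesNetConfigs {
    public static BayesNet localScore() {
        BayesNet bn = new BayesNet();
        bn.setEstimator(new SimpleEstimator());          // probability table estimator
        bn.setSearchAlgorithm(                           // scores the network node by node
            new weka.classifiers.bayes.net.search.local.K2());
        return bn;
    }

    public static BayesNet conditionalIndependence() {
        BayesNet bn = new BayesNet();
        bn.setEstimator(new SimpleEstimator());
        bn.setSearchAlgorithm(new ICSSearchAlgorithm()); // CI-test based structure search
        return bn;
    }

    public static BayesNet globalScore() {
        BayesNet bn = new BayesNet();
        bn.setEstimator(new SimpleEstimator());
        bn.setSearchAlgorithm(                           // scores whole structures by
            new weka.classifiers.bayes.net.search.global.K2()); // estimated accuracy
        return bn;
    }
}
```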
3.3. Support vector machines
Support vector machines, in short SVM, are a supervised learning model associated with data analysis and pattern recognition algorithms, intended for classification and regression analysis. At the base level, an SVM is a non-probabilistic binary linear classifier, consisting of the application of a simple linear method to the data in a high-dimensional feature space. What is important, the method itself does not necessarily need any high-dimensional space computations; this feature of the algorithm makes it fast while also adaptive.
SVMs use a kernel function k which defines an implicit mapping Φ of the input data into a high-dimensional feature space:
$$k(x, x') = \langle \Phi(x), \Phi(x') \rangle$$
If Φ: X → H, then the function k returns the inner product ⟨Φ(x), Φ(x')⟩ between the images of two data points x, x' in the feature space. The feature space is where the learning phase takes place. This computation is often referred to as the "kernel trick".
The above example is the simplest, linear kernel function, which keeps the original data format. From all kernel functions, 3 others were chosen:
- the polynomial kernel, where γ is a scale, c₀ is an offset and d is the degree:
$$k_{\mathrm{polynomial}}(x, x') = (\gamma \langle x, x' \rangle + c_0)^d$$
- the Gaussian Radial Basis Function (RBF) kernel:
$$k_{\mathrm{rbf}}(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$$
- the hyperbolic tangent kernel, where γ is a scale and c₀ is an offset:
$$k_{\mathrm{ht}}(x, x') = \tanh(\gamma \langle x, x' \rangle + c_0)$$
During the classification phase, the SVM uses a hyperplane to separate the different classes of data:
$$\langle w, \phi(x) \rangle + b = 0$$
corresponding to the decision function
$$f(x) = \mathrm{sign}(\langle w, \phi(x) \rangle + b)$$
By solving a standard constrained quadratic optimization problem, it can be shown that, in terms of classification performance, this hyperplane has the maximal margin of separation between the two classes.
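A sketch of the kernel setup with WEKA's SMO implementation; an assumption here is that WEKA's stock PolyKernel and RBFKernel suffice, while the hyperbolic tangent kernel has no stock implementation:

```java
import weka.classifiers.functions.SMO;
import weka.classifiers.functions.supportVector.PolyKernel;
import weka.classifiers.functions.supportVector.RBFKernel;

public class SvmKernels {
    public static SMO polynomial(double degree) {
        SMO smo = new SMO();
        PolyKernel poly = new PolyKernel();
        poly.setExponent(degree);   // d = 1 reduces to the plain linear kernel
        smo.setKernel(poly);
        return smo;
    }

    public static SMO rbf(double gamma) {
        SMO smo = new SMO();
        RBFKernel kernel = new RBFKernel();
        kernel.setGamma(gamma);     // gamma in exp(-gamma * ||x - x'||^2)
        smo.setKernel(kernel);
        return smo;
    }
    // The hyperbolic tangent kernel would require a custom subclass of
    // weka.classifiers.functions.supportVector.Kernel.
}
```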
3.4. AdaBoost[7]
AdaBoost stands for Adaptive Boosting, which is a meta-classification algorithm. It is a computational method that conjoins a few weak classifiers, as a linear combination, into one classifier with improved performance.
While processing the data, it weights all training samples and iteratively builds new sub-classifiers, each followed by an error rate calculation. Depending on the result it acts differently: if the error rate is above a predefined threshold it stops, and if not, it increases the weights of the misclassified samples, which results in focusing more on those samples.
The final step of the process consists of assigning each newly built classifier a voting power dependent on its error rate, and the loop breaks after a predefined number of iterations. The result is a weighted vote of all the built sub-classifiers.
The most important argument of this classifier is the number of iterations. With an increasing number of iterations the error rate decreases, but the estimation time grows rapidly.
AdaBoost will be tested as a meta-classifier using the decision stump method, due to the time and space complexity of the other algorithms, which makes them inefficient in real-life classification.
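A minimal sketch of this setup, assuming WEKA's AdaBoostM1 and DecisionStump classes; the iteration count is the trade-off knob discussed above:

```java
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.trees.DecisionStump;

public class BoostingConfig {
    public static AdaBoostM1 boostedStumps(int iterations) {
        AdaBoostM1 booster = new AdaBoostM1();
        booster.setClassifier(new DecisionStump()); // the weak base classifier
        booster.setNumIterations(iterations);       // accuracy vs. time trade-off
        return booster;
    }
}
```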
3.5. Artificial Neural Network
An artificial neural network is an implementation of biological neural networks as a mathematical model, motivated by their advantageous real-life characteristics such as learning ability, adaptivity, massive parallelism, distributed representation and computation, fault tolerance and low energy consumption.
Artificial neurons take inputs, acting as synapses, which are multiplied by weights and then processed by a function determining the activation of the neuron; another function computes the output. An artificial neural network is a combination of such neurons used to process information. Depending on the weights, the computation result will be different for every neuron.
FIGURE 1 ARTIFICIAL NEURON CONCEPT
From among all versions of artificial neural systems, four were chosen:
- Learning Vector Quantization
- Self-Organizing Map
- Feed-Forward Artificial Neural Network
- Artificial Immune Recognition System
Those networks consist of a few layers: weights are assigned to the inputs in order to create the hidden layer (a set of linear regressions), and these are then combined into additional layers. The error over the training data is computed by transforming the weighted sums of the data, and the results are used by the algorithm to adjust the network weights so as to minimize the error. A sketch of the stock feed-forward variant follows the figure below.
FIGURE 2 ARTIFICIAL NEURON LAYERS (inputs, hidden layer, outputs)
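Of the four variants, only the feed-forward network has a stock WEKA implementation (MultilayerPerceptron); the sketch below shows it with illustrative parameter values, while the other three variants come from add-on packages:

```java
import weka.classifiers.functions.MultilayerPerceptron;

public class FeedForwardConfig {
    public static MultilayerPerceptron network() {
        MultilayerPerceptron mlp = new MultilayerPerceptron();
        mlp.setHiddenLayers("10");  // a single hidden layer with 10 neurons
        mlp.setLearningRate(0.3);   // step size used to adjust the weights
        mlp.setTrainingTime(500);   // number of backpropagation epochs
        return mlp;
    }
}
```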
3.6. k-nearest neighbor algorithm
k-NN is an object classification method based on the closest training examples. Its specialty is performing discriminant analysis when reliable parametric estimates of the probability densities are unknown. When an object with an unknown class is presented for evaluation, the algorithm uses its "k" closest neighbors, and the class is assigned depending on the neighbors' answers.
Vectors can represent instances of our subject, and the distances between the corresponding vectors can be used as class similarity metrics (a sketch of the classifier setup follows the list):
- Euclidean - a measure of distance in multidimensional space:
$$d = \sqrt{\sum_{i=1}^{k} (p_i - q_i)^2}$$
- Manhattan - a measure of distance along grid lines (the sum of absolute coordinate differences):
$$d = \sum_{i=1}^{k} |p_i - q_i|$$
- Chebyshev - a measure using the feature (word) with the largest weight difference for comparison:
$$d = \max_i |p_i - q_i|$$
- Edit - a measure of the distance between two strings, given by the Levenshtein distance:
$$\mathrm{lev}_{a,b}(i, j) = \begin{cases} 0 & i = j = 0 \\ i \text{ or } j & (j = 0 \text{ and } i > 0) \text{ or } (i = 0 \text{ and } j > 0) \\ \min \begin{cases} \mathrm{lev}_{a,b}(i-1, j) + 1 \\ \mathrm{lev}_{a,b}(i, j-1) + 1 \\ \mathrm{lev}_{a,b}(i-1, j-1) + 1_{(a_i \neq b_j)} \end{cases} & \text{otherwise} \end{cases}$$
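A sketch of the k-NN setup with pluggable distance functions, assuming WEKA's IBk classifier and its stock distance classes (weka.core also provides an EditDistance, but it applies to string attributes):

```java
import weka.classifiers.lazy.IBk;
import weka.core.ChebyshevDistance;
import weka.core.DistanceFunction;
import weka.core.EuclideanDistance;
import weka.core.ManhattanDistance;
import weka.core.neighboursearch.LinearNNSearch;

public class KnnConfigs {
    public static IBk withDistance(int k, DistanceFunction distance) throws Exception {
        IBk knn = new IBk(k);                 // the k closest neighbors vote
        LinearNNSearch search = new LinearNNSearch();
        search.setDistanceFunction(distance); // plug in one of the metrics above
        knn.setNearestNeighbourSearchAlgorithm(search);
        return knn;
    }

    public static IBk[] variants(int k) throws Exception {
        return new IBk[] {
            withDistance(k, new EuclideanDistance()),
            withDistance(k, new ManhattanDistance()),
            withDistance(k, new ChebyshevDistance())
        };
    }
}
```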
3.7. Decision tree
The decision tree is one of the basic concepts in machine learning classification. While the leaves represent the classes, all interior nodes correspond to input features of the instances. The most challenging problem is finding the optimal tree. Our algorithm uses techniques taken from information theory, such as entropy and information gain, to build the tree structure that best corresponds to the dataset (a sketch follows the figure below).
FIGURE 3 DECISION TREE EXAMPLE
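A minimal sketch using WEKA's J48, an implementation of the C4.5 algorithm, which selects splits by entropy-based information gain ratio (the parameter values are WEKA's defaults, written out for clarity):

```java
import weka.classifiers.trees.J48;

public class DecisionTreeConfig {
    public static J48 tree() {
        J48 j48 = new J48();            // C4.5-style entropy/information-gain splits
        j48.setConfidenceFactor(0.25f); // confidence threshold used for pruning
        j48.setMinNumObj(2);            // minimum number of instances per leaf
        return j48;
    }
}
```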
3.8. Random forest[8]
Random forest is a set of decision trees, where the output is the mode of the individual trees' class outputs. The main advantages of this algorithm are its efficiency on large databases and its accuracy as a machine learning algorithm, and also that it should not become overfitted. In comparison to AdaBoost, which also uses other classifiers as a base, this algorithm yields more favorable error rates and is more robust to noise in the data.
The performance can be influenced by three parameters (a sketch follows the list):
- The number of trees in the forest – with an increase in the number of trees the precision grows, but the time complexity increases significantly.
- The number of features in a tree - determines the number of features in the randomly chosen subset used in each tree.
- The depth of a tree - this number describes how many levels each tree can possess. This parameter mainly influences the algorithm's running time, but unfortunately limiting it also decreases performance.
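A sketch of the three parameters in WEKA's RandomForest (setter names as in the WEKA 3.6/3.7 era contemporary with this thesis; the argument values are illustrative):

```java
import weka.classifiers.trees.RandomForest;

public class RandomForestConfig {
    public static RandomForest forest(int trees, int features, int depth) {
        RandomForest rf = new RandomForest();
        rf.setNumTrees(trees);       // more trees: higher precision, more time
        rf.setNumFeatures(features); // size of the random feature subset per tree
        rf.setMaxDepth(depth);       // levels per tree; 0 means unlimited
        return rf;
    }
}
```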
4. Preprocessing
4.1. Dataflow
The process of training set generation critically affects the performance of the classifier, so in order to find a very good solution, this part needs to be handled very carefully.
FIGURE 4 PREPROCESSING DATAFLOW: constructing dataset → feature extraction → feature selection → additional transformations
4.2. Constructing dataset
Successful testing and use of a machine learning algorithm needs both a training and a test dataset. The learning dataset is basically a list of URLs, each of which is labeled as to whether it is an advertisement or not.
The list of URLs was collected using data logs supplied by the T-Mobile company, but also through manual usage of a browser connected through a local proxy server set up to collect every requested URL address.
For the purpose of creating the learning set, a supervising Mentor module was implemented. It uses extended and heavily modified regular-expression-based rules based on the AdBlock Plus filter definition language, which is used by popular blacklist-based advertisement blocking software. It applies ad filters based on the most popular blacklists to classify any given URL.
4.3. Feature extractors
The majority of data-mining algorithms treat as input data a big matrix of numeric values. Each row is composed of a single sample of data and each column represents a feature of a given sample. Taking this fact into consideration, the feature extractors need to transform the textual form of the input URL into a sequence of numeric values.
Based on URLs, many useful features can be extracted, such as the length of the whole address or of each segment, and also which segments a given URL is composed of. Another approach is to create a dictionary of all words used in all URLs and count their occurrences.
URLs are made of sequences of alphanumeric strings, and those sequences create tokens; all special and whitespace characters are discarded, due to the fact that those symbols do not guarantee an increase in accuracy and can also affect the performance of the system.
Each single URL possesses a small number of tokens compared to the whole set of generated tokens. This causes most feature values to be set to zero, which allows us to store the dataset as a sparse matrix, to save both memory and processing time.
URL = PROTOCOL.USERINFO.HOST:PORT/PATH/QUERY
Example URL: "http://support.google.com/ads/?hl=en"
- Textual length of URL (total and per segment) – A URL is composed of segments such as PROTOCOL, USERINFO, HOST, PORT, PATH and QUERY. The length of each segment can be treated as a characteristic of the URL type; notably, the segment called QUERY is very long in most online advertisements.
SegmentLength$Total = 29, SegmentLength$Host = 18
- Segment presence – Some URLs have missing segments; in most common addresses PORT is missing, and usually if a URL contains USERINFO it is not an advert.
SegmentMissing$Host = false, SegmentMissing$UserInfo = true
- Token occurrences – There are many words in each URL and some of them may occur more than once.
Token: com = 1, Token: google = 1, Token: ads = 1
- Token occurrences by segment – Each segment can be composed of many words, and some of them may occur more than once.
Token: Host$com = 1, Token: Path$ads = 1
- Sequential n-grams – There are some phrases in URLs that may indicate whether the given address contains an advertisement. Those phrases are created from words situated nearby, such as "ad" and "blocking"; together these create the phrase "ad blocking", which probably indicates that the URL is not an advertisement.
Ngram: com>google = 1, Ngram: google>support = 1
- Full token n-grams – Information can be stored in many places in a URL; by combining all words in the whole URL it can be stated whether a given example is an advertisement.
Ngram: com>support = 1, Ngram: com>hl = 1
- Token count (total and per segment) – Most URLs containing advertisements possess a very long QUERY, so it is necessary to count tokens inside segments.
TokenCount$Total = 7, TokenCount$Host = 3, TokenCount$Query = 2
- Numeric tokens count (total and per segment) – Numeric values can be found inside a URL, and even those can have meaning. For example, many advertisements containing pictures have their size stated as a number inside the QUERY (a sketch of a simple extractor follows).
NumericCount$Total = 0, NumericCount$Host = 0
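A simplified, self-contained sketch of such an extractor in plain Java; this is illustrative only, not the thesis' actual extractor code, and the feature names follow the convention above:

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

public class UrlFeatureExtractor {

    /** Splits a URL into alphanumeric tokens and counts a few of the features above. */
    public static Map<String, Integer> extract(String url) throws Exception {
        URI uri = new URI(url);
        Map<String, Integer> f = new LinkedHashMap<String, Integer>();
        String host = uri.getHost();
        f.put("SegmentLength$Host", host == null ? 0 : host.length());
        f.put("SegmentMissing$UserInfo", uri.getUserInfo() == null ? 1 : 0);
        // Tokens are maximal alphanumeric runs; special characters are discarded.
        for (String token : url.split("[^A-Za-z0-9]+")) {
            if (token.length() == 0) continue;
            bump(f, "Token: " + token.toLowerCase());
            if (token.matches("[0-9]+")) bump(f, "NumericCount$Total");
        }
        return f;
    }

    private static void bump(Map<String, Integer> f, String key) {
        Integer old = f.get(key);
        f.put(key, old == null ? 1 : old + 1);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extract("http://support.google.com/ads/?hl=en"));
    }
}
```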
4.4. Feature selection
During the phase of feature extraction a great amount of data is created; however, only a minority carries information which is useful for the dataset. Features that provide too little or no information can be treated as noise, especially for classifiers which are vulnerable to a big number of features: their training time increases tremendously, making their overall performance insufficient. Therefore, after the initial creation of the feature dataset, the useless information must be cut.
Feature selection heuristics try to estimate the usefulness of features during the classification process, allowing the dataset to be reduced greatly. Estimation can be done by referring to a single feature or to a whole subset of features. Single-feature algorithms are faster; however, they do not take into consideration the correlation between features, which makes their results worse than those of whole-subset algorithms, which discard redundant information. Unfortunately, with the increase in performance, the time consumption also increases.
Due to the fact that each experiment consumes a big amount of time, it was decided to use a distributed system in order to perform many tests simultaneously. This approach gives the opportunity to test a wide range of classifiers in a much shorter time.
For the tests, two feature selection methods were used (a sketch of the two-pass setup follows the formulas). Initially the dataset was filtered using the information gain heuristic, which estimates the usefulness of a feature f by calculating the reduction of class value entropy over the whole dataset X:
$$IG(X, f) = H(X) - H(X \mid f)$$
After filtering out most of the superfluous features, a second pass with the Correlation Feature Selection (CFS) heuristic is performed. This method evaluates subsets of features, rewarding high correlation with the class value and penalizing correlations between the features considered in the subset. Starting with an empty set, features are added greedily until a given size is reached. The exact formula for calculating the CFS usefulness of a set S_k containing k features, where r_cf is the mean class-to-feature correlation and r_ff is the mean feature-to-feature correlation, is:
$$CFS(S_k) = \frac{k \cdot \overline{r_{cf}}}{\sqrt{k + k(k-1) \cdot \overline{r_{ff}}}}$$
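A sketch of this two-pass setup using WEKA's attribute selection filters (InfoGainAttributeEval with a Ranker, then CfsSubsetEval with a greedy forward search); the pass size is an assumed parameter:

```java
import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.GreedyStepwise;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class TwoPassSelection {
    public static Instances select(Instances data, int firstPassSize) throws Exception {
        // Pass 1: rank single features by information gain, keep the best ones.
        AttributeSelection ig = new AttributeSelection();
        ig.setEvaluator(new InfoGainAttributeEval());
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(firstPassSize);
        ig.setSearch(ranker);
        ig.setInputFormat(data);
        Instances reduced = Filter.useFilter(data, ig);

        // Pass 2: CFS rewards class correlation, penalizes feature inter-correlation.
        AttributeSelection cfs = new AttributeSelection();
        cfs.setEvaluator(new CfsSubsetEval());
        cfs.setSearch(new GreedyStepwise()); // greedy forward subset search
        cfs.setInputFormat(reduced);
        return Filter.useFilter(reduced, cfs);
    }
}
```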
4.5. Additional transformations
The feature selection process chooses some optimal subspace of all features. In this subspace many instances become indistinguishable. To deal with this, the datafile created after the feature selection process consists of groups of instances, and each group is weighted by the number of identical instances it represents. This can significantly improve the training and testing time of our algorithms, without affecting classification performance.
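A sketch of this grouping, assuming WEKA Instances and per-instance weights; matching rows by their textual form is an illustrative shortcut, not the thesis' actual implementation:

```java
import java.util.HashMap;
import java.util.Map;
import weka.core.Instance;
import weka.core.Instances;

public class InstanceGrouping {

    /** Collapses identical rows into one instance weighted by its multiplicity. */
    public static Instances groupDuplicates(Instances data) {
        Instances grouped = new Instances(data, 0);  // empty copy of the header
        Map<String, Instance> seen = new HashMap<String, Instance>();
        for (int i = 0; i < data.numInstances(); i++) {
            Instance inst = data.instance(i);
            String key = inst.toString();            // identical rows print identically
            Instance kept = seen.get(key);
            if (kept == null) {
                grouped.add(inst);                   // add() stores a copy
                seen.put(key, grouped.instance(grouped.numInstances() - 1));
            } else {
                kept.setWeight(kept.weight() + 1.0); // one more identical instance
            }
        }
        return grouped;
    }
}
```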
5. System description
FIGURE 5 BASIC VIEW OF THE TEST PLATFORM: a Generator adds tasks to the Server (addTask); Clients fetch tasks (getNextTask), store results (storeResult) and retrieve them (getTaskResults)
Each experiment is a very time-consuming process; to resolve this problem, a distributed system was created. This solution created the opportunity to test a wider range of classifier combinations.
5.1. Task generation
All classification algorithms can have a large number of possible parameter combinations, and due to this fact a module was developed with the purpose of processing a set of tests into a list of precisely defined test strings ready to be used with the WEKA libraries.
The task structure can define almost any combination of test argument ranges. Unfortunately, every set of additional options exponentially increases the usage of RAM, and there exists a limitation in the physical memory of the task-injecting system. In order to deal with this problem it is essential to focus only on the most probable argument values.
5.2. Server/Database
During the testing phase our platform generated a huge number of possible configurations for the classifiers. To deal with that, it was decided to create a connection between the platform and a database. For that purpose the JDBC API was used; this technology provides methods for querying and updating data.
The information to store was divided into two parts. First are the "tasks" - the test cases for each configuration of every classifier meant to be used. Second, the "task_results" - the composition of results for the respective tasks. For each part a database table was created.
The first table, named "tasks", is assembled from 8 columns:
- taskid - the primary key of the table; holds the identification number of the task. It is later used as a foreign key in the results table.
- classifier - stores the name of the classifier meant to be used.
- classifieroptions - holds a combination of parameters for the given classifier.
- datafilename - keeps the name of the training data file.
- folds - stores the number of parts into which we divide the training data for the classifier's evaluation. The classifier uses (folds – 1) parts of the data for training, then tests what it learned on the remaining one; the process is repeated "folds" times.
- decisionindex - determines which attribute in the set points out whether the given instance is an ad or not.
- resultscount - keeps the number of results stored for the given task.
- workers - stores the number of clients which are working on the given task.
The second table, named "tasks_results", consists of 8 columns:
- taskid - part of the primary key of the table and also a foreign key connected to the tasks table's taskid column; holds the identification number of the task whose result is stored.
- resultno - the second part of the primary key; stores the number of the stored result for the given task.
- truepositives - keeps the number of correctly identified adverts.
- falsepositives - keeps the number of incorrectly identified adverts.
- truenegatives - keeps the number of correctly identified instances of legitimate content.
- falsenegatives - keeps the number of incorrectly classified instances of legitimate content.
- traintime - stores the time spent on training the classifier.
- testtime - stores the time spent on testing the given case.
Truepositives, falsepositives, truenegatives and falsenegatives are later used to compute the sensitivity and specificity of a given classification with the specified options. This solution enables the user to simultaneously run testing on many computers, which significantly reduces the testing time and allows us to obtain more precise results.
To operate on the database, the following functions were designed (a sketch of two of them follows the list):
- addTask - performs the operation of adding a task to the "tasks" table.
- getNextTask - imports a task from the "tasks" table to be used for computation, and at the same time updates the number of workers working on the given task. Tasks with no results and a smaller number of workers have priority.
- storeResult - exports the computed results into the "tasks_results" table and increases the results counter in the "tasks" table.
- getTaskResults - imports all results computed by a given classifier from the "tasks_results" table, for comparison purposes. The overloaded implementation of this method takes the pattern of a classifier and its options as an argument and extracts the suitable results.
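A sketch of two of these functions over plain JDBC; the table and column names come from the description above, while the SQL dialect (e.g. LIMIT) and the prioritization query are assumptions:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class TaskStore {
    private final Connection conn;

    public TaskStore(Connection conn) { this.conn = conn; }

    /** Picks the highest-priority task and bumps its worker counter. */
    public int getNextTask() throws SQLException {
        PreparedStatement pick = conn.prepareStatement(
            "SELECT taskid FROM tasks ORDER BY resultscount ASC, workers ASC LIMIT 1");
        ResultSet rs = pick.executeQuery();
        if (!rs.next()) return -1;                   // no pending tasks
        int taskId = rs.getInt("taskid");
        PreparedStatement bump = conn.prepareStatement(
            "UPDATE tasks SET workers = workers + 1 WHERE taskid = ?");
        bump.setInt(1, taskId);
        bump.executeUpdate();
        return taskId;
    }

    /** Stores one confusion matrix and increments the task's result counter. */
    public void storeResult(int taskId, int resultNo, long tp, long fp, long tn,
                            long fn, long trainTime, long testTime) throws SQLException {
        PreparedStatement ins = conn.prepareStatement(
            "INSERT INTO tasks_results VALUES (?, ?, ?, ?, ?, ?, ?, ?)");
        ins.setInt(1, taskId);     ins.setInt(2, resultNo);
        ins.setLong(3, tp);        ins.setLong(4, fp);
        ins.setLong(5, tn);        ins.setLong(6, fn);
        ins.setLong(7, trainTime); ins.setLong(8, testTime);
        ins.executeUpdate();
        PreparedStatement cnt = conn.prepareStatement(
            "UPDATE tasks SET resultscount = resultscount + 1 WHERE taskid = ?");
        cnt.setInt(1, taskId);
        cnt.executeUpdate();
    }
}
```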
FIGURE 6 DATABASE DIAGRAM: table tasks (taskid, classifier, classifieroptions, datafilename, folds, decisionindex, resultscount, workers) related 1-to-many to table tasks_results (taskid, resultno, truepositives, falsepositives, truenegatives, falsenegatives, traintime, testtime)
6. Results and assessment method
Before the presentation of the results, the method of assessment needs to be described. Due to the fact that a failure in classifying web content is more costly than a failure in classifying an advertisement, an additional parameter is needed. It will be assumed that the cost of misclassifying an ad is 4 times smaller than that of misclassifying web content.
6.1. Classifier accuracy assessment method
During the phase of constructing the dataset, each element was labeled as an advert or as web content; hence, after the tests it can be stated exactly how many misclassifications appeared. Accordingly, 4 parameters can be distinguished, which will be used later as a base for the accuracy assessment:
- True Positives (TP) - states how many adverts were correctly classified.
- True Negatives (TN) - shows the number of correctly classified web content instances.
- False Positives (FP) - tells the number of misclassified web content instances.
- False Negatives (FN) - says how many advertisements were misclassified.
The most basic way to estimate the performance of the classifier, called Accuracy, is to determine the percentage of instances correctly classified:
$$Accuracy = \frac{TP + TN}{TP + FP + TN + FN}$$
A variation of the previously stated Accuracy, with the addition of the misclassification coefficient α = 4, is called the cost-sensitive accuracy:
$$CSAccuracy = \frac{\alpha \cdot TP + TN}{\alpha \cdot TP + \alpha \cdot FP + TN + FN}$$
Precision is the estimation of how many true advertisements were detected out of all instances classified as an online advertisement:
$$Precision = \frac{TP}{TP + FP}$$
Recall indicates the ratio of how many advertisements were correctly classified:
$$Recall = \frac{TP}{TP + FN}$$
F-measure is a combination of Recall and Precision:
$$F\text{-}measure = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$$
A variation of the previous F-measure takes into consideration the misclassification coefficient α = 4:
$$CSF\text{-}measure = \frac{(1 + \alpha^2) \cdot Precision \cdot Recall}{Precision + \alpha^2 \cdot Recall}$$
There exists another aspect of the classifiers - the training time and the testing time - however, those parameters are not crucial: from the point of view of the real-time system, neither the training time nor the testing time measured here is a limiting factor. A sketch of the metric computation follows.
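The formulas above translate directly into code; a small, illustrative sketch in plain Java:

```java
public class Assessments {
    static final double ALPHA = 4.0; // misclassification cost coefficient

    static double csAccuracy(long tp, long fp, long tn, long fn) {
        return (ALPHA * tp + tn) / (ALPHA * tp + ALPHA * fp + tn + fn);
    }

    static double csFMeasure(long tp, long fp, long fn) {
        double precision = tp / (double) (tp + fp);
        double recall = tp / (double) (tp + fn);
        // Cost-sensitive F-measure with alpha = 4, as defined above.
        return ((1 + ALPHA * ALPHA) * precision * recall)
                / (precision + ALPHA * ALPHA * recall);
    }
}
```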
TABLE 1 ASSESSMENTS
Assessment type | Argumentation | Importance
Accuracy | Basic accuracy does not consider cost-sensitivity, and because advertisements are about 15% of downloaded data, this assessment can give a high rating to algorithms that cannot detect any adverts. | Low
CSAccuracy | Compared to basic accuracy this assessment includes cost-sensitivity, but it can still give a high mark to algorithms that cannot detect any adverts. | Medium
Precision | Because misclassified web content carries a high cost, this assessment can give a high rating to those classifiers which classify web content correctly. | Medium
Recall | Basic ratio of how many adverts were correctly classified; does not include cost-sensitivity. | Low
F-measure | Combination of precision and recall, without cost-sensitivity included. | Low
CSF-measure | Compared to F-measure this assessment includes cost-sensitivity, and it is the most precise assessment due to the maximization of cost-sensitivity. | High
Train Time | The time consumed for training is not crucial to the process of real-time classification; this process can run in the background and does not need to be performed often. This time factor can vary when classification is performed in a different environment. | Low
Test Time | This process is more important than train time, but still not crucial, because in this experiment the test time was about 10 seconds for 5000 features, which is not much. This time factor can vary when classification is performed in a different environment. | Low
6.2. Results
The results are divided into parts by the number of features in the datafile, which grows from 100 up to 5000. The presented results contain only the efficient solutions for each datafile size; the omitted solutions are inefficient or their time complexity makes them impractical.
Each part consists of 3 charts: first, all assessment types are presented, excluding the time factors, which are presented in the last chart; in between is the comparison between the numbers of false positives and false negatives.
6.2.1. 100 features
From the first run, only the 8 classifier configurations which were the most efficient, with acceptable training and testing times, were kept.
The first chart shows that Accuracy and Precision were quite high, except for AdaBoost, which, combined with its Recall, gives a low level of Cost-sensitive F-measure, making this configuration the first candidate to be excluded from further research. Other cases worthy of being suspected of inefficiency are the Bayes Network and the Neural Network: their Recall is lower than 5% while their Precision is 100%, which means those classifiers did not classify any web content as an advert, but at the same time they classified almost all adverts wrongly, as can be seen in the second chart from the numbers of False Positives and False Negatives. The rest of the configurations are promising, and they can be expected to improve with the number of features.
Due to the fact that this task is cost-sensitive, for the following parts Accuracy and F-measure will be omitted, because they can be confusing.
FIGURE 7 100 FEATURES A) - Accuracy, Cost-sensitive Accuracy, Precision, Recall, F-measure and Cost-sensitive F-measure per classifier configuration
FIGURE 8 100 FEATURES B) - numbers of False Negatives and False Positives per classifier configuration
The third figure shows each algorithm's time performance. It can be seen that the Bayes Network, AdaBoost and the Neural Network not only give the worst classification results, but also the worst training and testing times.
FIGURE 9 100 FEATURES C) - Train Time and Test Time [ms, logarithmic scale] per classifier configuration
6.2.2. 200 features
The main task of this test is to check whether the Bayes Network, AdaBoost and the Neural Network improved.
FIGURE 10 200 FEATURES A) - Cost-sensitive Accuracy, Precision, Recall and Cost-sensitive F-measure per classifier configuration
From the graph it can be seen that none of the previously listed classifier combinations improved; another noteworthy occurrence is the improvement of the kNN algorithm in Recall, from 28% up to 40%. Contrary to expectations, the performance of the other algorithms decreased.
FIGURE 11 200 FEATURES B) - numbers of False Negatives and False Positives per classifier configuration
FIGURE 12 200 FEATURES C) - Train Time and Test Time [ms, logarithmic scale] per classifier configuration
Conclusions:
After this stage of the experiment, it can be stated that classifier performance does not need to grow with the number of features. The performance of the Bayesian Network, AdaBoost and the Neural Network did not improve; those classifiers will therefore be omitted in the remaining tests due to inefficiency.
- Bayesian Network - this solution tries to build a model of the correlation between features. The exponential increase in time complexity makes it impossible to build the optimal model, and for this reason approximation methods were used. Those methods are slow and their performance is low; consequently, this solution is not proper for this task.
- AdaBoost - as a meta-classifier it needs another classifier as an input to improve upon. Due to the time complexity, it was tested based on the decision stump classifier. As shown in the graphs, this solution is neither accurate nor fast; consequently, it is not proper for this task.
- Neural Network - this solution is one of the most popular classification methods; unfortunately, from the tests it can be seen that it is not efficient for this size of dataset, and consequently it cannot be used for URL classification.
6.2.3. 500 features
FIGURE 13 500 FEATURES A) - Cost-sensitive Accuracy, Precision, Recall and Cost-sensitive F-measure for Naive Bayes, SVM, kNN, Decision Tree and Random Forest
In contrast to the previous case, the Naive Bayes classifier noted an increase in the numbers of False Negatives and False Positives, which led to a decrease in its overall performance. The other classifiers noted an increase in all parameters.
FIGURE 14 500 FEATURES B) - numbers of False Negatives and False Positives for Naive Bayes, SVM, kNN, Decision Tree and Random Forest
Contrary to the expectation that both the test time and the train time would continue to increase with the number of features in the dataset, the test times of Naive Bayes and the Decision Tree did not. The three remaining classifiers behaved as expected, meaning an increase in time complexity.
FIGURE 15 500 FEATURES C) - Train Time and Test Time [ms, logarithmic scale] for Naive Bayes, SVM, kNN, Decision Tree and Random Forest
6.2.4. 1000 features
FIGURE 16 1000 FEATURES A) - Cost-sensitive Accuracy, Precision, Recall and Cost-sensitive F-measure for Naive Bayes, SVM, kNN, Decision Tree and Random Forest
The Naive Bayes False Negatives number decreased, resulting in an increase in Cost-sensitive Accuracy and Recall; unfortunately, the number of False Positives increased by 250, leading to a decrease in Precision by more than 10% and in Cost-sensitive F-measure by 1%.
The performance of the SVM, kNN and Random Forest classifiers decreased, due to increased numbers of False Negatives and False Positives.
The only classifier noting an increase in overall performance was the Decision Tree.
FIGURE 17 1000 FEATURES B) - numbers of False Negatives and False Positives for Naive Bayes, SVM, kNN, Decision Tree and Random Forest
It is expected that the train time and test time of a classifier should increase with the size of the dataset, which appears to be true for Naive Bayes, SVM and Random Forest.
The kNN classifier noted a big increase in test time, but its train time decreased from 790 ms down to 6 ms. In contrast to kNN, the Decision Tree algorithm's train time increased more than 2 times, but unexpectedly its test time remained the same as in the previous case.
FIGURE 18 1000 FEATURES C) - Train Time and Test Time [ms, logarithmic scale] for Naive Bayes, SVM, kNN, Decision Tree and Random Forest
6.2.5. 2000 features
FIGURE 19 2000 FEATURES A) - Cost-sensitive Accuracy, Precision, Recall and Cost-sensitive F-measure for Naive Bayes, SVM, kNN, Decision Tree and Random Forest
The Naive Bayes classifier's False Positives number decreased, resulting in an improvement in Cost-sensitive Accuracy and Cost-sensitive F-measure, while the number of False Negatives is exactly the same as for 1000 features.
The results of the SVM are a very interesting case: from the view of the False Positives number, it can be seen that this algorithm decreased that number to 0, meaning no web content was classified as an advertisement. Unfortunately, its number of misclassified adverts is the biggest among the remaining classifiers, which is the reason for the decrease in this classifier's overall performance.
The kNN improved its performance by decreasing the numbers of both False Positives and False Negatives.
As for the Decision Tree algorithm, its performance decreased in every aspect.
The Random Forest's overall performance increased, because of a reduction of the False Negatives number by 32.5%; unfortunately, the number of False Positives increased from 46 up to 56.
FIGURE 20 2000 FEATURES B) - numbers of False Negatives and False Positives for Naive Bayes, SVM, kNN, Decision Tree and Random Forest
As expected, both time parameters increased for all classification algorithms.
FIGURE 21 2000 FEATURES C) - Train Time and Test Time [ms, logarithmic scale] for Naive Bayes, SVM, kNN, Decision Tree and Random Forest
At this point, the obtained performance results are very good, especially for Random Forest, whose Cost-sensitive Accuracy was higher than 98% and whose Cost-sensitive F-measure was almost 95%. The following experiment on the 5000-feature dataset will show that a bigger training set only worsens the results; this part will therefore be treated as the most optimal.
6.2.6. 5000 features
FIGURE 22 5000 FEATURES A) - Cost-sensitive Accuracy, Precision, Recall and Cost-sensitive F-measure for Naive Bayes, SVM, kNN, Decision Tree and Random Forest
All classifiers noted a big increase in the time parameters, accompanied by a performance decrease; the only exception to the decrease in performance is the Naive Bayes algorithm, but unfortunately its results are still much lower than those of the other classifiers.
FIGURE 23 5000 FEATURES B) - numbers of False Negatives and False Positives for Naive Bayes, SVM, kNN, Decision Tree and Random Forest
FIGURE 24 5000 FEATURES C) - Train Time and Test Time [ms, logarithmic scale] for Naive Bayes, SVM, kNN, Decision Tree and Random Forest
7. Conclusions and future progress
7.1. Conclusions
During the last two parts of the experiments it was experimentally shown that a bigger set only worsens the results; consequently, the most optimal solution for this experiment was obtained for the 2000-feature dataset.
- Naive Bayes - This algorithm did not return very good results: it could only detect 40% of adverts, with a 19% chance that legitimate content would be classified as an advert.
- SVM - This solution shows the worst Recall, only 25%, meaning that only one out of four advertisements was correctly classified. However, it is the only solution that does not classify any web content as an advert (Precision = 100%).
- kNN - This classifier combination's results are the 2nd best. It correctly classifies almost 60% of advertisements, and only 3 out of 100 web content instances are misclassified. The biggest disadvantage of this classifier is its time requirement: its Test Time is almost 1 minute for the 2000-feature set, hence real-time classification of voluminous data streams can be insufficient.
- Decision Tree - The only disadvantage of this classifier is the misclassification of much web content; otherwise, it shows very good overall performance.
- Random Forest - This algorithm misclassified only 710 instances out of 550000, where 3 out of 100 web content instances were classified incorrectly, and it detects nearly 72% of all advertisements. This classifier seems to be the best for this task.
FIGURE 25 CSF-MEASURE - Cost-sensitive F-measure of Naive Bayes, SVM, kNN, Decision Tree and Random Forest versus the number of features (100 to 5000)
FIGURE 26 CS ACCURACY - Cost-sensitive Accuracy of Naive Bayes, SVM, kNN, Decision Tree and Random Forest versus the number of features (100 to 5000)
The Random Forest classifier proved to be the best solution for classifying online advertisements based on URLs. For the set of 55000 instances the following configuration was used:
- Number of trees in the forest: 200
- Number of randomly chosen features: 50
- Maximum depth of each tree: 50
7.2. Summary of results
It was shown that the performance of each classifier is strongly dependent on the number of features constructing the dataset.
- Naive Bayes - Of all tested classifiers this one is the simplest, and it showed its best performance for 100 features. This performance could be the result of neglecting the correlation between features, where the set of 100 features consists mostly of important, pairwise independent features.
- SVM - The reason for the performance decrease of this classifier for sets bigger than 500 features can be the inability to linearly separate the bigger space.
- kNN and Random Forest - The set of 2000 features was the most optimal for those solutions. A bigger set can contain unimportant features and cause the curse of dimensionality.
- Decision Tree - Unfortunately this method has a tendency to overfit in a bigger space, which can be seen for sets bigger than 1000 features, where its best results are shown.
7.3. Future progress
During the development process a few important problems were encountered.
- The feature selection process is very important for classifier performance: it allows finding an effective solution considering only a subspace of all features and computationally involving only weighted groups of instances instead of the raw dataset. Hence, if the application has to work in real life after the collection of huge datasets, there is a need to develop fast-working feature selection methods.
- Another problem is that the blacklisting method does not create a perfect training set, and human interference is needed to correct this set. Unfortunately, rating a dataset consisting of more than 25 thousand URLs takes an enormous amount of time.
Other future work to focus on:
- Categorization of advertisements by their content, with the purpose of blocking only the types of advertisement the user does not want.
- Addition of new feature selection and classification algorithms.
- Categorization of not only advertisements, but also any other web content. This task would not be cost-sensitive; hence, from the view of the results for the 1000-feature dataset, the best algorithm would still be the Decision Tree - it has the lowest number of misclassified instances and the highest scores in Accuracy and F-measure of all the tests.
FIGURE 27 1000 FEATURES, ALL MEASURES - Accuracy, Cost-sensitive Accuracy, Precision, Recall, F-measure and Cost-sensitive F-measure for Naive Bayes, SVM, kNN, Decision Tree and Random Forest
Bibliography
[1] M. W. Berry, J. Kogan, Text Mining: Applications and Theory, John Wiley and Sons, 2010.
[2] I. Guyon, A. Elisseeff, An Introduction to Variable and Feature Selection, Journal of Machine Learning Research, 2003.
[3] M. Indra Devi, R. Rajaram, K. Selvakuberan, Machine Learning Techniques for Automated Web Page Classification using URL Features, ICCIMA '07, 2007.
[4] M. Indra Devi, R. Rajaram, K. Selvakuberan, Machine Learning Techniques for Automated Web Page Classification using URL Features, ICCIMA '07, 2007.
[5] G. John, P. Langley, Estimating Continuous Distributions in Bayesian Classifiers, UAI '95, 1995.
[6] U. Fayyad, K. Irani, Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning, IJCAI '93, 1993.
[7] T. Verma, J. Pearl, An Algorithm for Deciding if a Set of Observed Independencies Has a Causal Explanation, UAI '92, 1992.
[8] L. Breiman, Random Forests, Machine Learning, 2001.
List of figures and tables
Figure 1 Artificial neuron concept
Figure 2 Artificial neuron layers
Figure 3 Decision tree example
Figure 4 Preprocessing dataflow
Figure 5 Basic view of test platform
Figure 6 Database diagram
Figure 7 100 features a)
Figure 8 100 features b)
Figure 9 100 features c)
Figure 10 200 features a)
Figure 11 200 features b)
Figure 12 200 features c)
Figure 13 500 features a)
Figure 14 500 features b)
Figure 15 500 features c)
Figure 16 1000 features a)
Figure 17 1000 features b)
Figure 18 1000 features c)
Figure 19 2000 features a)
Figure 20 2000 features b)
Figure 21 2000 features c)
Figure 22 5000 features a)
Figure 23 5000 features b)
Figure 24 5000 features c)
Figure 25 CSF-measure
Figure 26 CS Accuracy
Figure 27 1000 features, all measures
Table 1 Assessments