System filtracji reklam internetowych (Internet advertisement filtering system)
Transcription
Academic Year 2012/2013
Warsaw University of Technology
Faculty of Electronics and Information Technology
Electrical and Computer Engineering

BACHELOR OF SCIENCE THESIS

Krzysztof Kamiński

Comparative study of machine-learning algorithms for the detection of ads on web pages

Supervisor: prof. dr hab. inż. Mieczysław Muraszkiewicz

Evaluation: .............................................
Signature of the Head of Examination Committee: .................................................................

Electrical and Computer Engineering
Date of Birth: 1989.07.07
Starting Date of Studies: 2008.10.01

Curriculum Vitae
I was born on 7 July 1989 in Warsaw. After completing primary school, I attended the 25th Józef Wybicki High School in Warsaw. In October 2008 I began studying at the Faculty of Electronics and Information Technology of the Warsaw University of Technology, majoring in Electronics and Information Technology.
.......................................................
Signature of the Student

Bachelor of Science Examination
Examination was held on: ................................................................................................... 2013
With the result: ............................................................................................................................
Final Result of the Studies: .........................................................................................................
Suggestions and Remarks of the B.Sc. Examination Committee: ..............................................

SUMMARY
This thesis analyses learning algorithms for the detection of advertising content on the Internet. The work covers an analysis of the advertisements appearing in the context of web pages and the search for a good representation of the problem by defining appropriate attributes that describe an advertisement. The next part of the thesis is the implementation of suitable classifiers and their testing. The result of the thesis is a proposal of solutions, with a choice of the best methods and optimal parameters. The thesis was carried out within the project "Smart AdBlocker" of the Institute of Computer Science, Warsaw University of Technology, for the T-Mobile company.
Keywords: machine learning, text mining, advertisement filtration

TOPIC: A comparative study of learning algorithms for detecting advertisements on web pages.

ABSTRACT
This thesis concerns the analysis of learning algorithms for the purpose of detecting advertising content on web pages. The research covers an analysis of the advertisements appearing in the context of web pages and finding a good representation of the problem by defining appropriate attributes that describe advertisements. The next part of the work is the implementation of suitable classifiers and the execution of tests. The result of the work is a proposal of solutions, with a selection of the best methods and optimal parameters. The thesis was carried out within the "Smart AdBlocker" project of the Institute of Computer Science, Warsaw University of Technology, for the T-Mobile company.
Keywords: machine learning, text mining, advertisement filtering

I would like to thank my co-workers on the project, P. Szczepański, A. Wiśniewski and M. Januszewski, for their help, patience and excellent work.
Table of contents
1. Problem understanding
2. Machine learning and URLs
3. Selected algorithms with parameters
   3.1. Naive Bayes classifier
   3.2. Bayesian networks
   3.3. Support vector machines
   3.4. AdaBoost
   3.5. Artificial Neural Network
   3.6. k-nearest neighbor algorithm
   3.7. Decision tree
   3.8. Random forest
4. Preprocess
   4.1. Dataflow
   4.2. Constructing dataset
   4.3. Feature extractors
   4.4. Feature selection
   4.5. Additional transformations
5. System description
   5.1. Task generation
   5.2. Server/Database
6. Results and assessment method
   6.1. Classifier accuracy assessment method
   6.2. Results
      6.2.1. 100 features
      6.2.2. 200 features
      6.2.3. 500 features
      6.2.4. 1000 features
      6.2.5. 2000 features
      6.2.6. 5000 features
7. Conclusions and future progress
   7.1. Conclusions
   7.2. Summary of results
   7.3. Future progress
Bibliography
1. Problem understanding

In the beginning the Internet was a network whose purpose was the exchange of information, but today it is one of the most important elements of the global economy. With its rise in popularity, people began to realize that, among its other uses, the Internet can be a source of income; within a short time commercials became a common part of almost every website and one of the most effective ways for web owners to earn money. At first the ads were small, owing to limitations such as connection speed, but also to the unspoken ethics of online advertising. Nowadays most web owners appear to treat their website as a business rather than a source of information, placing more and more adverts that are bigger, brighter and harder to close than before. Considering that Internet devices are becoming more mobile, such as cell phones and tablets, the direction in which online advertising is going is starting to become a serious problem. Large adverts remarkably increase page load time on limited mobile connections; small screens cannot cope with large, aggressive ads; and, most importantly, mobile users pay data charges for their Internet connection, which means they pay for every unwanted advertisement.

The common way of blocking online advertising is the blacklisting method [1]; unfortunately, this method has a few holes. The agencies specializing in online advertising operate on a wide range of domains, which makes this way of blocking ads insufficient, because the lists need frequent updating. Another problem is inflexibility: there is no way to set a user preference about what kind of ad should be blocked.

2. Machine learning and URLs

The goal of machine learning is to find a hypothesis $h: X \to C$, where $X$ is a set of entities and $C$ is the set of classes. The result is a hypothesis as close as possible to the concept $c: X \to C$. In the problem of categorizing advertisements only two classes are possible, i.e. $C = \{\mathit{content}, \mathit{advert}\}$.

Feature selection is the process of reducing the space of features describing the entities. This is necessary not only to reduce the time and space complexity of an algorithm but also to significantly improve its performance [2]. The two main problems with a huge feature space are overfitting and the curse of dimensionality. The first issue leads to a situation in which our hypothesis cannot correctly identify new entities, because it is shaped too strongly by the training set. The second problem refers to a huge space in which unimportant features cause quite similar objects to lie at a long distance from each other.

A uniform resource locator, in short URL, is a character string referencing a specified Internet resource, and usually one web page consists of many parts labeled by URLs. An attempt to classify web pages quickly using URLs was made by Devi, Rajaram and Selvakuberan [4]. However, they used only a simple method to extract features from each entity and did not perform a feature selection process, and thus obtained insufficient results.
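To keep the above framing concrete, the sketch below expresses the hypothesis $h: X \to C$ for the two-class advertisement problem as a small Java interface. The names and the toy rule are illustrative only and are not taken from the thesis implementation.

```java
/** Sketch of the learning framing h : X -> C with C = {content, advert}. */
public class HypothesisSketch {
    enum Label { CONTENT, ADVERT }

    /** A hypothesis maps an entity (here: a URL string) to a class. */
    interface Hypothesis { Label classify(String url); }

    public static void main(String[] args) {
        // A deliberately naive hypothesis: flag URLs whose path contains "ads".
        Hypothesis h = url -> url.contains("/ads/") ? Label.ADVERT : Label.CONTENT;
        System.out.println(h.classify("http://support.google.com/ads/?hl=en")); // ADVERT
    }
}
```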
3. Selected algorithms with parameters

3.1. Naive Bayes classifier

Naive Bayes is the simplest instance of a probabilistic classifier and is based on Bayes' theorem. The classifier considers the probability of each attribute separately: the final decision is made from the product of the conditional probabilities that an instance possessing features $f_1, f_2, \dots, f_n$ belongs to class $C$. The classification therefore comes down to maximizing the following function:

$$\mathit{classify}(f_1, f_2, \dots, f_n) = \arg\max_{c} \; p(C = c) \prod_{i=1}^{n} p(F_i = f_i \mid C = c)$$

This algorithm has already been studied and tested on the URL feature classification problem in the literature [5]; according to that paper, Naive Bayes showed very good performance. The classifier requires discrete attributes, and the methods differ in how the vector of attributes is discretized and how the segments are created. Segments can be created using unsupervised or supervised methods. The unsupervised method is a basic division of the value range into segments of equal size, without inspecting the contents. The supervised method is Multi-Interval discretization; this algorithm uses the entropy minimization heuristic to discretize the range of a continuous-valued attribute into multiple intervals. Three different configurations were used:

- The density of the values of a given parameter is a normal distribution, and segments of equal size are created by the unsupervised method.
- The density of the values of a parameter is determined by a kernel estimator, and segments of equal size are again created by the unsupervised method. According to the literature [5], this method of estimating the density should show better results than the normal distribution.
- Discretization of the attributes is done by the supervised method, namely the Multi-Interval discretization introduced by U. Fayyad and K. Irani [6].
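A minimal sketch of the maximization above, computed in log space for numerical stability. The feature probabilities are assumed to be already estimated, and all names and numbers are illustrative; the thesis itself used the WEKA implementations.

```java
/** Naive Bayes scorer for the argmax in Section 3.1 (sketch; binary features). */
public class NaiveBayesSketch {
    /** Returns true (advert) iff log p(advert) + sum_i log p(f_i|advert)
     *  exceeds the same score computed for the content class. */
    static boolean classify(boolean[] f, double pAdvert,
                            double[] pFeatureGivenAdvert, double[] pFeatureGivenContent) {
        double scoreAd = Math.log(pAdvert), scoreContent = Math.log(1 - pAdvert);
        for (int i = 0; i < f.length; i++) {
            // p(F_i = f_i | C = c): use the complement when the feature is absent.
            scoreAd      += Math.log(f[i] ? pFeatureGivenAdvert[i]  : 1 - pFeatureGivenAdvert[i]);
            scoreContent += Math.log(f[i] ? pFeatureGivenContent[i] : 1 - pFeatureGivenContent[i]);
        }
        return scoreAd > scoreContent;
    }

    public static void main(String[] args) {
        boolean[] f = { true, false };                        // e.g. token "ads" present, "hl" absent
        double[] pAd = { 0.8, 0.3 }, pContent = { 0.1, 0.4 }; // illustrative estimates
        System.out.println(classify(f, 0.15, pAd, pContent)); // true -> advert
    }
}
```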
3.2. Bayesian networks

Bayesian networks are statistical models representing the correlations between features by directed acyclic graphs. Each node represents a random variable in the Bayesian sense, such as a parameter or a characteristic, and conditional dependencies are represented by the edges between nodes. Unconnected nodes are variables which are conditionally independent of each other. Each node is associated with a probability function that takes a set of values of the node's parent variables as input and returns the probability of the node's variable. Bayesian networks have a few important advantages:

- They handle situations where some data entries are missing.
- The conjunction of Bayesian statistical methods and Bayesian networks gives a good approach to avoiding overfitting of the data.

The process of using this algorithm is divided into two parts: learning the network structure and the probability tables, followed by classifying instances by maximizing functions based on conditional probability. The algorithm can be customized by choosing different combinations of two main parameters:

1. The approximation of the conditional probability distribution, which can be done by different estimators.
2. The way of learning the structure, by choosing a score measure and a search algorithm.

Among all approaches, three were chosen:

- Local score metrics - allow scoring the whole network by scoring each individual node. The purpose of this approach is to search for the optimal network structure.
- Conditional independence tests - eliminate correlations between independent variables in order to calculate the relations between features.
- Global score metrics - perform the classification task and compare the accuracy of the results to estimate the value of a network structure.

3.3. Support vector machines

Support vector machines, in short SVM, are a supervised learning model associated with data analysis and pattern recognition algorithms, intended for classification and regression analysis. At the base level, an SVM is a non-probabilistic binary linear classifier, consisting of the application of a simple linear method to the data in a high-dimensional feature space. Importantly, the method itself does not necessarily require any computations in the high-dimensional space, which makes the algorithm fast while remaining adaptive. An SVM uses a kernel function $k$ which defines an implicit mapping $\Phi$ of the input data into a high-dimensional feature space:

$$k(x, x') = \langle \Phi(x), \Phi(x') \rangle$$

If $\Phi: X \to H$, then the function $k$ returns the inner product $\langle \Phi(x), \Phi(x') \rangle$ between the images of two data points $x, x'$ in the feature space, which is where the learning phase takes place. This computation is often referred to as the "kernel trick". The above example is the simplest, linear kernel function, which keeps the original data format. Out of all kernel functions, three others were chosen:

- the polynomial kernel, where $\gamma$ is a scale, $c_0$ is an offset and $d$ is the degree:
$$k_{\mathit{polynomial}}(x, x') = (\gamma \langle x, x' \rangle + c_0)^d$$
- the Gaussian Radial Basis Function (RBF) kernel:
$$k_{\mathit{rbf}}(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$$
- the hyperbolic tangent kernel, where $\gamma$ is a scale and $c_0$ is an offset:
$$k_t(x, x') = \tanh(\gamma \langle x, x' \rangle + c_0)$$

During the classification phase, the SVM uses a hyperplane to separate the different classes of data,
$$\langle w, \phi(x) \rangle + b = 0$$
corresponding to the decision function
$$f(x) = \mathrm{sign}(\langle w, \phi(x) \rangle + b).$$
By solving a standard constrained quadratic optimization problem, it can be shown that, in terms of classification performance, this hyperplane has the maximal margin of separation between the two classes.
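The three non-linear kernels above, written out directly as a sketch in plain Java; the parameter values in main are arbitrary.

```java
/** The polynomial, RBF and hyperbolic tangent kernels of Section 3.3 (sketch). */
public class KernelSketch {
    static double dot(double[] x, double[] y) {
        double s = 0;
        for (int i = 0; i < x.length; i++) s += x[i] * y[i];
        return s;
    }
    static double polynomial(double[] x, double[] y, double gamma, double c0, int d) {
        return Math.pow(gamma * dot(x, y) + c0, d);       // (gamma <x,x'> + c0)^d
    }
    static double rbf(double[] x, double[] y, double gamma) {
        double sq = 0;
        for (int i = 0; i < x.length; i++) sq += (x[i] - y[i]) * (x[i] - y[i]);
        return Math.exp(-gamma * sq);                     // exp(-gamma ||x - x'||^2)
    }
    static double tanhKernel(double[] x, double[] y, double gamma, double c0) {
        return Math.tanh(gamma * dot(x, y) + c0);         // tanh(gamma <x,x'> + c0)
    }
    public static void main(String[] args) {
        double[] a = {1, 0, 2}, b = {0.5, 1, 1};
        System.out.printf("poly=%.3f rbf=%.3f tanh=%.3f%n",
                polynomial(a, b, 0.5, 1, 2), rbf(a, b, 0.5), tanhKernel(a, b, 0.5, 1));
    }
}
```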
3.4. AdaBoost

AdaBoost stands for Adaptive Boosting, which is a meta-classification algorithm. It is a computational method that joins several weak classifiers, as a linear combination, into one classifier with improved performance. While processing the data it weights all training samples and iteratively builds a new sub-classifier, followed by an error rate calculation. Depending on the result it acts differently: if the error rate is above a predefined threshold it stops; if not, it increases the weights of the misclassified samples, which results in focusing more on those samples. The final step of the process assigns each newly built sub-classifier a voting power dependent on its error rate, and the loop breaks after a predefined number of iterations. The result is a weighted vote of all the built sub-classifiers.

The most important argument of this classifier is the number of iterations. With an increasing number of iterations the error rate decreases, but the estimation time grows rapidly. AdaBoost will be tested as a meta-classifier with the decision stump method, because the time and space complexity of other base algorithms makes them inefficient in real-life classification.

3.5. Artificial Neural Network

An artificial neural network is an implementation of biological neural networks as a mathematical model, motivated by their real-life advantages such as learning ability, adaptivity, massive parallelism, distributed representation and computation, fault tolerance and low energy consumption. Artificial neurons take inputs, acting as synapses, which are multiplied by weights and then processed by a function determining the activation of the neuron; another function computes the output. An artificial neural network is a combination of such neurons assembled to process information. Depending on the weights, the computation result will be different for every neuron.

FIGURE 1 ARTIFICIAL NEURON CONCEPT [diagram: weighted inputs, activation function, output]

Out of all versions of artificial neuron systems, four were chosen:

- Learning Vector Quantization
- Self-Organizing Map
- Feed-Forward Artificial Neural Network
- Artificial Immune Recognition System

These networks consist of a few layers: weights are assigned to the inputs in order to create the hidden layer (a set of linear regressions), and those are then combined into additional layers. The error over the training data is computed by transforming the weighted sums of the data, and the results are used by the algorithm to adjust the network weights so as to minimize the error.

FIGURE 2 ARTIFICIAL NEURON LAYERS [diagram: inputs, hidden layer, outputs]

3.6. k-nearest neighbor algorithm

k-NN is an object classification method based on the closest training examples. Its specialty is performing discriminant analysis when reliable parametric estimates of the probability densities are unknown. When an object with an unknown class is presented for evaluation, the algorithm uses its "k" closest neighbors, and a class is assigned depending on the neighbors' answers. Instances of our subject can be represented as vectors, and the distance between the corresponding vectors can be used as a class similarity metric:

- Euclidean - a measure of distance in multidimensional space:
$$d = \sqrt{\sum_{i=1}^{k} (p_i - q_i)^2}$$
- Manhattan - the sum of the absolute differences along each dimension:
$$d = \sum_{i=1}^{k} |p_i - q_i|$$
- Chebyshev - a measure using the coordinate (word) with the largest weight difference for the comparison:
$$d = \max_i |p_i - q_i|$$
- Edit - a measure of the distance between two strings, the Levenshtein distance (a code sketch of this recurrence is given at the end of this chapter):
$$\mathit{lev}_{a,b}(i, j) = \begin{cases} 0 & i = j = 0 \\ i \text{ or } j & (j = 0 \text{ and } i > 0) \text{ or } (i = 0 \text{ and } j > 0) \\ \min \begin{cases} \mathit{lev}_{a,b}(i-1, j) + 1 \\ \mathit{lev}_{a,b}(i, j-1) + 1 \\ \mathit{lev}_{a,b}(i-1, j-1) + 1_{(a_i \neq b_j)} \end{cases} & \text{otherwise} \end{cases}$$

3.7. Decision tree

A decision tree is one of the basic concepts in machine learning classification. The leaves represent the classes, while all interior nodes correspond to input features of the instances. The most challenging problem is finding the optimal tree. Our algorithm uses techniques taken from information theory, such as entropy and information gain, to build the tree structure that best corresponds to the dataset.

FIGURE 3 DECISION TREE EXAMPLE

3.8. Random forest [8]

A random forest is a set of decision trees, where the output is the mode of the individual trees' class outputs. The main advantages of this algorithm are its efficiency on large databases, its accuracy as a machine learning algorithm, and the fact that it should not become overfitted. Compared with AdaBoost, which also uses other classifiers as a base, this algorithm yields more favorable error rates and is more robust to noise in the data. Its performance can be influenced by the tree parameters:

- The number of trees in the forest - with an increase in the number of trees the precision grows, but the time complexity increases significantly.
- The number of features per tree - determines the size of the randomly chosen feature subset in each tree.
- The depth of the trees - describes how many levels each tree can possess. This parameter mainly influences the algorithm's running time, but the performance can also decrease.
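As promised in Section 3.6, a compact dynamic-programming sketch of the edit-distance recurrence, in plain Java:

```java
/** Dynamic-programming form of the Levenshtein recurrence from Section 3.6 (sketch). */
public class EditDistance {
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;   // deletions only
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;   // insertions only
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int subst = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,   // deletion
                                            d[i][j - 1] + 1),  // insertion
                                   d[i - 1][j - 1] + subst);   // substitution
            }
        return d[a.length()][b.length()];
    }
    public static void main(String[] args) {
        System.out.println(levenshtein("ads", "adds")); // 1
    }
}
```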
4. Preprocess

4.1. Dataflow

The process of training set generation critically affects the performance of the classifier, so in order to find a very good solution this part needs to be handled very carefully.

FIGURE 4 PREPROCESS [pipeline: constructing dataset → feature extraction → feature selection → additional transformations]

4.2. Constructing dataset

Successful testing and use of a machine learning algorithm needs both a training and a test dataset. The learning dataset is basically a list of URLs, each of which is labeled as advertisement or not. The list of URLs was collected using data logs supplied by the T-Mobile company, but also by manual use of a browser connected through a local proxy server set to collect every requested URL address. For the purpose of creating the learning set, a supervising Mentor module was implemented. It uses extended and heavily modified regular-expression-based rules based on the AdBlock Plus filter definition language, which is used by popular blacklist-based advertisement blocking software. It applies ad filters based on the most popular blacklists to classify any given URL.

4.3. Feature extractors

The majority of data-mining algorithms treat the input data as a big matrix of numeric values. Each row holds a single sample of data and each column represents a feature of that sample. Taking this into consideration, the feature extractors need to transform the textual form of the input URL into a sequence of numeric values (a code sketch of these extractors follows the list below). Many useful features can be extracted from URLs, such as the length of the whole address or of each segment, and also which segments a given URL is composed of. Another approach is to create a dictionary of all words used in all URLs and count their occurrences. URLs are made of sequences of alphanumeric characters, and those sequences create tokens; all special and white-space characters are discarded, because those symbols do not guarantee an increase in accuracy and can affect the performance of the system. Each single URL possesses a small number of tokens compared with the whole set of generated tokens. This causes most feature values to be zero and allows us to store the dataset as a sparse matrix, saving both memory and processing time.

URL = PROTOCOL.USERINFO.HOST:PORT/PATH/QUERY
Example URL: "http://support.google.com/ads/?hl=en"

- Textual length of URL (total and per segment) - A URL is a structure composed of segments: PROTOCOL, USERINFO, HOST, PORT, PATH, QUERY. The length of each segment can be treated as a characteristic of URL types; in particular the QUERY part of most online advertisements is very long.
SegmentLength$Total = 29, SegmentLength$Host = 18
- Segment presence - In some URLs segments are missing; in most common addresses PORT is missing, and usually if a URL contains USERINFO it is not an advert.
SegmentMissing$Host = false, SegmentMissing$UserInfo = true
- Token occurrences - There are many words in each URL, and some of them may occur more than once.
Token: com = 1, Token: google = 1, Token: ads = 1
- Token occurrences by segment - Each segment can be composed of many words, and some of them may occur more than once.
Token: Host$com = 1, Token: Path$ads = 1
- Sequential n-grams - Some phrases in URLs may indicate whether the given address contains an advertisement. These phrases are created from words situated next to each other; for example "ad" and "blocking" together create the phrase "ad blocking", which probably indicates that the URL is not an advertisement.
Ngram: com>google = 1, Ngram: google>support = 1
- Full token n-grams - Information can be stored in many places in a URL; by combining all words of the whole URL it can be stated whether a given example is an advertisement.
Ngram: com>support = 1, Ngram: com>hl = 1
- Token count (total and per segment) - Most URLs containing advertisements possess a very long QUERY, so the tokens inside segments need to be counted.
TokenCount$Total = 7, TokenCount$Host = 3, TokenCount$Query = 2
- Numeric token count (total and per segment) - Numeric values can be found inside URLs, and even these can have meaning; for example, many advertisements containing pictures carry their size inside the QUERY as a number.
NumericCount$Total = 0, NumericCount$Host = 0
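A minimal sketch of the token and segment extractors of Section 4.3, a simplification of the Smart AdBlocker extractors rather than the project code itself; feature names follow the examples above.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

/** Sketch of the URL feature extractors of Section 4.3 (illustrative only). */
public class UrlFeatures {
    public static void main(String[] args) {
        URI u = URI.create("http://support.google.com/ads/?hl=en");

        // Segment lengths and segment presence.
        System.out.println("SegmentLength$Host = " + u.getHost().length());       // 18
        System.out.println("SegmentMissing$UserInfo = " + (u.getUserInfo() == null));

        // Tokens: maximal alphanumeric runs; special characters are discarded.
        List<String> tokens = new ArrayList<>();
        for (String t : u.toString().split("[^A-Za-z0-9]+"))
            if (!t.isEmpty()) tokens.add(t.toLowerCase());
        System.out.println("TokenCount$Total = " + tokens.size());               // 7

        // Sequential 2-grams over neighbouring tokens.
        for (int i = 0; i + 1 < tokens.size(); i++)
            System.out.println("Ngram: " + tokens.get(i) + ">" + tokens.get(i + 1) + " = 1");
    }
}
```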
4.4. Feature selection

During the feature extraction phase a great amount of data is created, yet only a minority carries information useful for the dataset. Features that provide little or no information can be treated as noise; in particular, for classifiers which are vulnerable to a large number of features, the training time increases tremendously, making their overall performance insufficient. So, after the initial creation of the feature dataset, the useless information must be cut. Feature selection heuristics try to estimate the usefulness of features in the classification process, allowing the dataset to be reduced greatly. The estimation can refer to a single feature or to a whole subset of features. Single-feature algorithms are faster; however, they do not take into consideration the correlations between features, which makes their results worse than those of whole-subset algorithms, which discard redundant information. Unfortunately, the increase in performance comes with an increase in time consumption.

Because each experiment consumes a large amount of time, it was decided to use a distributed system in order to perform many tests simultaneously. This approach gives the opportunity to test a wide range of classifiers in a much shorter time.

For the tests, two feature selection methods were used. Initially the dataset was filtered using the information gain heuristic, which estimates the usefulness of a feature $f$ by calculating the reduction of the class value entropy over the whole dataset $X$:

$$IG(X, f) = H(X) - H(X \mid f)$$

After filtering out most of the superfluous features, a second pass with the Correlation Feature Selection ($CFS$) heuristic is performed. This method evaluates subsets of features, rewarding high correlation with the class value and penalizing correlations between the features considered in the subset. Starting with an empty set, features are added greedily until a given size is reached. The exact formula for the $CFS$ usefulness of a set $S_k$ containing $k$ features, where $\overline{r_{cf}}$ is the mean class-to-feature correlation and $\overline{r_{ff}}$ is the mean feature-to-feature correlation, is:

$$CFS(S_k) = \frac{k \cdot \overline{r_{cf}}}{\sqrt{k + k(k-1) \cdot \overline{r_{ff}}}}$$

A small numeric sketch of this merit function follows Section 4.5.

4.5. Additional transformations

The feature selection process chooses some optimal subspace of all features. In this subspace many instances become indistinguishable. To deal with this, the data file created after the feature selection process consists of groups of instances, each group weighted by the number of identical instances. This can significantly improve the training and testing time of our algorithms without affecting classification performance.
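The promised sketch of the CFS merit from Section 4.4, with arbitrary correlation values; it illustrates how a redundant subset (higher mean feature-to-feature correlation) is penalized.

```java
/** The CFS merit of Section 4.4 computed from mean correlations (sketch). */
public class CfsMerit {
    /** merit(S_k) = k * rcf / sqrt(k + k(k-1) * rff) */
    static double merit(int k, double meanClassFeatureCorr, double meanFeatureFeatureCorr) {
        return k * meanClassFeatureCorr
                / Math.sqrt(k + k * (k - 1) * meanFeatureFeatureCorr);
    }
    public static void main(String[] args) {
        // Same class correlation, higher redundancy, lower merit.
        System.out.println(merit(10, 0.30, 0.10)); // ~0.69
        System.out.println(merit(10, 0.30, 0.50)); // ~0.40
    }
}
```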
5. System description

FIGURE 5 BASIC VIEW OF TEST PLATFORM [diagram: a Task Generator calls addTask on the Server; Clients call getNextTask and storeResult; results are read back with getTaskResults]

Each experiment is a very time-consuming process; to resolve this problem a distributed system was created. This solution created the opportunity to test a wider range of classifier combinations.

5.1. Task generation

All classification algorithms can have a large number of possible parameter combinations; because of this, a module was developed to process a set of tests into a list of precisely defined test strings ready to be used with the WEKA libraries. The task structure can define almost any combination of test argument ranges. Unfortunately, every set of additional options exponentially increases the RAM usage, and the physical memory of the task-injecting system is limited. In order to deal with this problem it is essential to focus only on the most probable argument values.

5.2. Server/Database

During the testing phase our platform generated a huge number of possible configurations for the classifiers. To deal with that, it was decided to create a connection between the platform and a database. For this purpose the JDBC API was used; this technology provides methods for querying and updating data. The information to store was divided into two parts: first the "tasks", the test cases for each configuration of every classifier meant to be used, and second the "tasks_results", the results for the respective tasks. A database table was created for each part.

The first table, named "tasks", is assembled from 8 columns:

- taskid - primary key of the table; holds the identification number of the task. It is later used as a foreign key in the results table.
- classifier - stores the name of the classifier meant to be used.
- classifieroptions - holds a combination of parameters for a given classifier.
- datafilename - keeps the name of the training data file.
- folds - stores the number of parts into which the training data is divided for cross-validating the classifier. The classifier uses (folds - 1) parts for training and then tests what it has learned on the remaining one; the process is repeated "folds" times.
- decisionindex - determines which attribute in the set indicates whether a given instance is an ad or not.
- resultscount - keeps the number of results stored for a given task.
- workers - stores the number of clients which are working on a given task case.

The second table, named "tasks_results", consists of 8 columns:

- taskid - primary key of the table and also a foreign key referencing the taskid column of the tasks table; holds the identification number of the task whose result is stored.
- resultno - second part of the primary key; stores the sequence number of the stored result for a given task.
- truepositives - keeps the number of correctly identified adverts.
- falsepositives - keeps the number of web content instances incorrectly identified as adverts.
- truenegatives - keeps the number of correctly identified instances of legitimate content.
- falsenegatives - keeps the number of adverts incorrectly classified as legitimate content.
- traintime - stores the time spent training the classifier.
- testtime - stores the time spent testing the given case.

Truepositives, falsepositives, truenegatives and falsenegatives are later used to compute the sensitivity and specificity of a given classification with the specified options. This solution enables the user to run tests on many computers simultaneously, which significantly reduced the testing time and allowed us to obtain more precise results.
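A minimal sketch of how one result row could be written into the tasks_results table over JDBC. The connection URL, credentials and the concrete numbers are placeholders, not the project's real configuration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

/** Sketch of a storeResult-style insert over JDBC (placeholder values throughout). */
public class StoreResultSketch {
    public static void main(String[] args) throws Exception {
        try (Connection c = DriverManager.getConnection(
                "jdbc:mysql://localhost/adblocker", "user", "password")) { // placeholder DSN
            PreparedStatement ps = c.prepareStatement(
                "INSERT INTO tasks_results (taskid, resultno, truepositives, falsepositives,"
              + " truenegatives, falsenegatives, traintime, testtime) VALUES (?,?,?,?,?,?,?,?)");
            ps.setInt(1, 42); ps.setInt(2, 1);           // task id and result number
            ps.setInt(3, 1600); ps.setInt(4, 60);        // TP, FP (dummy counts)
            ps.setInt(5, 21000); ps.setInt(6, 650);      // TN, FN (dummy counts)
            ps.setLong(7, 53000); ps.setLong(8, 9800);   // train/test time in ms
            ps.executeUpdate();
        }
    }
}
```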
To operate on the database, the following functions were designed:

- addTask - performs the operation of adding a task to the "tasks" table.
- getNextTask - imports from the "tasks" table a task which will be used for computation, and at the same time updates the number of workers working on that task. Tasks with no results and a smaller number of workers have priority.
- storeResult - exports the computed results into the "tasks_results" table and increases the results counter in the "tasks" table.
- getTaskResults - imports from the "tasks_results" table all results computed by a given classifier, for comparison purposes. An overloaded implementation of this method takes a pattern of the classifier and its options as an argument and extracts the matching results.

FIGURE 6 DATABASE DIAGRAM [one-to-many relation: tasks (taskid, classifier, classifieroptions, datafilename, folds, decisionindex, resultscount, workers) → tasks_results (taskid, resultno, truepositives, falsepositives, truenegatives, falsenegatives, traintime, testtime)]

6. Results and assessment method

Before the results are presented, the method of assessment needs to be described. Because a failure in classifying web content is more costly than a failure in classifying an advertisement, an additional parameter is needed. It is assumed that the cost of misclassifying an ad is 4 times lower than the cost of misclassifying web content.

6.1. Classifier accuracy assessment method

During the phase of constructing the dataset, each element was labeled as advert or web content; hence after the tests it can be stated exactly how many misclassifications appeared. Accordingly, 4 parameters can be distinguished, which will later be used as the basis for the accuracy assessment:

- True Positives (TP) - states how many adverts were correctly classified.
- True Negatives (TN) - shows the number of correctly classified instances of web content.
- False Positives (FP) - the number of misclassified instances of web content.
- False Negatives (FN) - the number of misclassified advertisements.

The most basic way to estimate the performance of a classifier, called Accuracy, is to determine the percentage of correctly classified instances:

$$\mathit{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$

A variation of the above Accuracy, with the addition of the misclassification coefficient $\alpha = 4$, is called the cost-sensitive accuracy:

$$\mathit{CSAccuracy} = \frac{\alpha \cdot TP + TN}{\alpha \cdot TP + \alpha \cdot FP + TN + FN}$$

Precision estimates how many true advertisements were detected out of all instances classified as online advertisements:

$$\mathit{Precision} = \frac{TP}{TP + FP}$$

Recall indicates the ratio of how many advertisements were correctly classified:

$$\mathit{Recall} = \frac{TP}{TP + FN}$$

F-measure is a combination of Recall and Precision:

$$F\text{-}\mathit{Measure} = 2 \cdot \frac{\mathit{Precision} \cdot \mathit{Recall}}{\mathit{Precision} + \mathit{Recall}}$$

A variation of the previous F-measure, taking into consideration the misclassification coefficient $\alpha = 4$:

$$\mathit{CSF}\text{-}\mathit{Measure} = \frac{(1 + \alpha^2) \cdot \mathit{Precision} \cdot \mathit{Recall}}{\mathit{Precision} + \alpha^2 \cdot \mathit{Recall}}$$

There is another aspect of a classifier: its training time and testing time. From the point of view of a real-time system, however, neither the training time nor the testing time is crucial.
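The sketch below computes the cost-sensitive assessments above from a confusion matrix with α = 4; the counts are illustrative, not results from the thesis.

```java
/** Computes the assessments of Section 6.1 from a confusion matrix (sketch, alpha = 4). */
public class Assessments {
    public static void main(String[] args) {
        double tp = 1600, fp = 60, tn = 21000, fn = 650, a = 4; // illustrative counts
        double precision  = tp / (tp + fp);
        double recall     = tp / (tp + fn);
        double csAccuracy = (a * tp + tn) / (a * tp + a * fp + tn + fn);
        double csFMeasure = (1 + a * a) * precision * recall
                          / (precision + a * a * recall);
        System.out.printf("Precision=%.3f Recall=%.3f CSAcc=%.3f CSF=%.3f%n",
                precision, recall, csAccuracy, csFMeasure);
    }
}
```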
TABLE 1 ASSESSMENTS

Assessment type | Argumentation | Importance
Accuracy | Basic accuracy does not consider cost-sensitivity; since advertisements are about 15% of downloaded data, this assessment can give a high rating to algorithms that cannot detect any adverts. | Low
CSAccuracy | Compared to basic accuracy, this assessment includes cost-sensitivity, but it can still give a high mark to algorithms that cannot detect any adverts. | Medium
Precision | Since misclassified web content carries a high cost, this assessment gives a high rating to classifiers which classify web content correctly. | Medium
Recall | The basic ratio of how many adverts were correctly classified; does not include cost-sensitivity. | Low
F-measure | A combination of precision and recall, without cost-sensitivity included. | Low
CSF-Measure | Compared to the F-measure, this assessment includes cost-sensitivity; it is the most precise assessment due to the maximization of the cost-sensitivity. | High
Train Time | The time consumed for training is not crucial for real-time classification; this process can run in the background and does not need to be performed often. This time factor can vary when classification is performed in a different environment. | Low
Test Time | This process is more important than train time, but still not crucial, because in this experiment the test time was about 10 seconds for 5000 features, which is not much. This time factor can vary when classification is performed in a different environment. | Low

6.2. Results

The results are divided into parts by the number of features in the data file, growing from 100 up to 5000. Only efficient solutions are presented for each data file size; the omitted solutions are inefficient or their time complexity makes them impractical. Each part consists of 3 charts: first all assessment types excluding the time factors, then the comparison between the numbers of false positives and false negatives, and finally the time factors as the last chart.

6.2.1. 100 features

From the first run it can be seen that only 8 classifier configurations were efficient while keeping acceptable training and testing times. The first chart shows that Accuracy and Precision were quite high except for AdaBoost, which, combined with its Recall, gives a low level of cost-sensitive F-measure, making this configuration the first candidate to be excluded from further research. Other cases worthy of being suspected of inefficiency are Bayes Network and Neural Network: their Recall is lower than 5% while their Precision is 100%, meaning those classifiers did not classify any web content as an advert, but at the same time classified almost all adverts wrongly, which can be seen in the second chart from the numbers of False Positives and False Negatives. The rest of the configurations are promising, and they can be expected to improve with the number of features. Because this task is cost-sensitive, Accuracy and F-measure will be omitted in the following parts, as they can be confusing.

FIGURE 7 100 FEATURES A) [bar chart: Accuracy, Cost-sensitive Accuracy, Precision, Recall, F-measure and Cost-sensitive F-measure per classifier]

FIGURE 8 100 FEATURES B) [bar chart: False Negatives and False Positives per classifier, scale 0-2500]

The third figure shows each algorithm's time performance. It can be seen that Bayes Network, AdaBoost and Neural Network give not only the worst classification results, but also the worst training and testing times.

FIGURE 9 100 FEATURES C) [logarithmic bar chart: Train Time and Test Time in ms per classifier]

6.2.2. 200 features

The main task of this test is to check whether Bayes Network, AdaBoost and Neural Network improved.
FIGURE 10 200 FEATURES A) [bar chart: Cost-sensitive Accuracy, Precision, Recall and Cost-sensitive F-measure per classifier]

From the graph it can be seen that none of the previously listed classifier combinations improved. Another noteworthy occurrence is the improvement of the kNN algorithm in Recall, from 28% up to 40%. Contrary to expectations, the performance of the other algorithms decreased.

FIGURE 11 200 FEATURES B) [bar chart: False Negatives and False Positives per classifier, scale 0-2500]

FIGURE 12 200 FEATURES C) [logarithmic bar chart: Train Time and Test Time in ms per classifier]

Conclusions: after this stage of the experiment it can be stated that classifier performance does not have to grow with the number of features. The performance of Bayesian Network, AdaBoost and Neural Network did not improve, so those classifiers will be omitted from the remaining tests due to inefficiency.

- Bayesian Network - this solution tries to build a model of the correlations between features. The exponential increase of the time complexity makes it impossible to build the optimal model, so approximation methods were used. Those methods are slow and their performance low; this solution is therefore not suitable for this task.
- AdaBoost - as a meta-classifier it needs another classifier as input in order to improve. Due to time complexity it was tested on the decision stump classifier. As shown in the graphs, this solution is neither accurate nor fast, so it is not suitable for this task.
- Neural Network - this solution is one of the most popular classification methods; unfortunately, the tests show that it is not efficient for a dataset of this size, so it cannot be used for URL classification.

6.2.3. 500 features

FIGURE 13 500 FEATURES A) [bar chart: Cost-sensitive Accuracy, Precision, Recall and Cost-sensitive F-measure for Naive Bayes, SVM, kNN, Decision Tree and Random Forest]

In contrast to the previous case, the Naive Bayes classifier noted an increase in the numbers of False Negatives and False Positives, which led to a decrease in its overall performance. The other classifiers noted an increase in all parameters.

FIGURE 14 500 FEATURES B) [bar chart: False Negatives and False Positives per classifier, scale 0-1600]

Contrary to the expectation that both test time and train time would continue to increase with the number of features in the dataset, the test times of Naive Bayes and Decision Tree did not increase. The three remaining classifiers behaved as expected, i.e. their time complexity increased.

FIGURE 15 500 FEATURES C) [logarithmic bar chart: Train Time and Test Time in ms per classifier]

6.2.4. 1000 features

FIGURE 16 1000 FEATURES A) [bar chart: Cost-sensitive Accuracy, Precision, Recall and Cost-sensitive F-measure per classifier]

The Naive Bayes False Negatives number decreased, resulting in an increase in Cost-sensitive Accuracy and Recall; unfortunately the number of False Positives increased by 250, leading to a decrease of Precision by more than 10% and of Cost-sensitive F-measure by 1%. The performance of the SVM, kNN and Random Forest classifiers decreased, due to increased False Negatives and False Positives numbers. The only classifier noting an increase in overall performance was Decision Tree.
FIGURE 17 1000 FEATURES B) [bar chart: False Negatives and False Positives per classifier, scale 0-1800]

It is expected that the train time and test time of a classifier should increase with the size of the dataset, which holds for Naive Bayes, SVM and Random Forest. The kNN classifier noted a big increase in test time, but its train time decreased from 790 ms down to 6 ms. In contrast to kNN, the Decision Tree train time increased more than twofold, but unexpectedly its test time remained the same as in the previous case.

FIGURE 18 1000 FEATURES C) [logarithmic bar chart: Train Time and Test Time in ms per classifier]

6.2.5. 2000 features

FIGURE 19 2000 FEATURES A) [bar chart: Cost-sensitive Accuracy, Precision, Recall and Cost-sensitive F-measure per classifier]

The Naive Bayes classifier's False Positives number decreased, resulting in an improvement in Cost-sensitive Accuracy and Cost-sensitive F-measure, while the number of False Negatives is exactly the same as for 1000 features. The results of the SVM are a very interesting case: looking at the False Positives number, it can be seen that this algorithm decreased it to 0, meaning no web content was classified as an advertisement. Unfortunately its number of misclassified adverts is the biggest among the remaining classifiers, which caused a decrease of the classifier's overall performance. The kNN improved its performance by decreasing the numbers of both False Positives and False Negatives. As for the Decision Tree algorithm, its performance decreased in every aspect. The Random Forest's overall performance increased because of a reduction of the False Negatives number by 32.5%; unfortunately the number of False Positives increased from 46 up to 56.

FIGURE 20 2000 FEATURES B) [bar chart: False Negatives and False Positives per classifier, scale 0-2000]

As expected, both time parameters increased for all classification algorithms.

FIGURE 21 2000 FEATURES C) [logarithmic bar chart: Train Time and Test Time in ms per classifier]

At this point the obtained performance results are very good, especially for Random Forest, whose Cost-sensitive Accuracy was higher than 98% and Cost-sensitive F-measure almost 95%. The following experiment on the 5000-feature dataset will show that a bigger training set only worsens the results; this part will therefore be treated as the optimal one.

6.2.6. 5000 features

FIGURE 22 5000 FEATURES A) [bar chart: Cost-sensitive Accuracy, Precision, Recall and Cost-sensitive F-measure per classifier]

All classifiers noted a big increase in the time parameters accompanied by a performance decrease; the only exception to the performance decrease is the Naive Bayes algorithm, but its results are still much lower than those of the other classifiers.

FIGURE 23 5000 FEATURES B) [bar chart: False Negatives and False Positives per classifier, scale 0-2000]

FIGURE 24 5000 FEATURES C) [logarithmic bar chart: Train Time and Test Time in ms per classifier]
7. Conclusions and future progress

7.1. Conclusions

During the last two parts of the experiments it was experimentally shown that a bigger set only worsens the results; the optimal solution for this experiment was therefore obtained for the 2000-feature dataset.

- Naive Bayes - this algorithm did not return very good results; it could only detect 40% of adverts, with a 19% chance that legitimate content would be classified as an advert.
- SVM - this solution shows the worst Recall, only 25%, meaning that only one out of four advertisements was correctly classified. However, it is the only solution that does not classify any web content as an advert (Precision = 100%).
- kNN - this classifier combination's results are the 2nd best. It correctly classifies almost 60% of advertisements, and only 3 out of 100 web content instances are misclassified. The biggest disadvantage of this classifier is its time requirement: its Test Time is almost 1 minute for the 2000-feature set, hence real-time classification of voluminous data streams can be insufficient.
- Decision Tree - the only disadvantage of this classifier is the misclassification of much web content; otherwise it shows very good overall performance.
- Random Forest - this algorithm misclassified only 710 instances out of 55 000, where 3 out of 100 web content instances were classified incorrectly, and it detects nearly 72% of all advertisements. This classifier seems to be the best for this task.

FIGURE 25 CSF-MEASURE [line chart: cost-sensitive F-measure of each classifier against the number of features, 100-5000]

FIGURE 26 CS ACCURACY [line chart: cost-sensitive accuracy of each classifier against the number of features, 100-5000]

The Random Forest classifier proved to be the best solution for classifying online advertisements based on URLs. For the set of 55 000 instances the following configuration was used (a code sketch of this configuration follows below):

- Number of trees in the forest: 200
- Number of randomly chosen features: 50
- The maximum depth of each tree: 50
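The winning configuration expressed against the WEKA 3.6-era API used by the test platform; this is a sketch, and the option names may differ in later WEKA versions.

```java
import weka.classifiers.trees.RandomForest;

/** Sketch: configuring WEKA's RandomForest with the parameters of Section 7.1. */
public class BestClassifier {
    public static void main(String[] args) throws Exception {
        RandomForest rf = new RandomForest();
        rf.setOptions(new String[] {
            "-I", "200",    // number of trees in the forest
            "-K", "50",     // number of randomly chosen features per tree
            "-depth", "50"  // maximum depth of each tree
        });
        System.out.println(rf.getClass().getName() + " configured");
    }
}
```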
7.2. Summary of results

It was shown that the performance of each classifier is strongly dependent on the number of features constructing the dataset.

- Naive Bayes - out of all the tested classifiers this one is the simplest, and it showed its best performance for 100 features. This performance could be the result of neglecting the correlations between features: the set of 100 features consists mostly of important, pairwise-independent features.
- SVM - the reason for the performance decrease of this classifier for sets bigger than 500 features can be the inability to separate the bigger space linearly.
- kNN and Random Forest - the set of 2000 features was optimal for these solutions. A bigger set can contain unimportant features and cause the curse of dimensionality.
- Decision Tree - unfortunately this method has a tendency to overfit in a bigger space, which can be seen for sets bigger than 1000 features, where its best results are achieved.

7.3. Future progress

During the development process a few important problems were encountered.

- The feature selection process is very important for classifier performance: it allows an effective solution to be found by considering only a subspace of all features and by computing over weighted groups of instances instead of the raw dataset. Hence, if the application is to work in real life after the collection of huge datasets, fast-working feature selection methods need to be developed.
- Another problem is that the blacklisting method does not create a perfect training set; human intervention is needed to correct the set. Unfortunately, rating a dataset consisting of more than 25 thousand URLs takes an enormous amount of time.

Other future work to focus on:

- Categorization of advertisements by content, with the purpose of blocking only the types of advertisement the user does not want.
- Addition of new feature selection and classification algorithms.
- Categorization not only of advertisements, but of any other web content. This task would not be cost-sensitive, hence, judging from the results for the 1000-feature dataset, the best algorithm would still be Decision Tree: it has the lowest number of misclassified instances and the highest scores in Accuracy and F-measure of all the tests.

FIGURE 27 1000 ALL MEASURES [bar chart: Accuracy, Cost-sensitive Accuracy, Precision, Recall, F-measure and Cost-sensitive F-measure for Naive Bayes, SVM, kNN, Decision Tree and Random Forest at 1000 features]

Bibliography

[1] M. W. Berry, J. Kogan, Text Mining: Applications and Theory, John Wiley and Sons, 2010
[2] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, Journal of Machine Learning Research, 2003
[3] M. Indra Devi, R. Rajaram, K. Selvakuberan, Machine learning techniques for automated web page classification using URL features, ICCIMA '07, 2007
[4] M. Indra Devi, R. Rajaram, K. Selvakuberan, Machine Learning Techniques for Automated Web Page Classification using URL Features, ICCIMA '07, 2007
[5] G. John, P. Langley, Estimating Continuous Distributions in Bayesian Classifiers, Proceedings of the 11th UAI, 1995
[6] U. Fayyad, K. Irani, Multi-Interval discretization of continuous-valued attributes for classification learning, Proceedings of the 13th IJCAI, 1993
[7] T. Verma, J. Pearl, An algorithm for deciding if a set of observed independencies has a causal explanation, Proceedings of the 8th UAI, 1992
[8] L. Breiman, Random Forests, Machine Learning, 2001

List of figures and Tables

Figure 1 Artificial neuron concept
Figure 2 Artificial neuron layers
Figure 3 Decision tree example
Figure 4 Preprocess
Figure 5 Basic view of test platform
Figure 6 Database diagram
Figure 7 100 features a)
Figure 8 100 features b)
Figure 9 100 features c)
Figure 10 200 features a)
Figure 11 200 features b)
Figure 12 200 features c)
Figure 13 500 features a)
Figure 14 500 features b)
Figure 15 500 features c)
Figure 16 1000 features a)
Figure 17 1000 features b)
Figure 18 1000 features c)
Figure 19 2000 features a)
Figure 20 2000 features b)
Figure 21 2000 features c)
Figure 22 5000 features a)
Figure 23 5000 features b)
Figure 24 5000 features c)
Figure 25 CSF-measure
Figure 26 CS Accuracy
Figure 27 1000 all measures
Table 1 Assessments