Faculty of Science and Technology
MASTER'S THESIS

Study program / specialization: Computer Science
Spring semester, 2014
Open / Restricted access
Writer: Piyush Duggal (Writer's signature)
Faculty supervisor: Chunming Rong (UiS)
External supervisor: Yann Chagourin (Accenture)
Thesis title: SAP PA: What is the Inside Chemistry? Predicting the Future of Predictive Analysis
Credits (ECTS): 30
Key words: SAP, Predictive Analysis, Association, Apriori, Regression, PAL, Cluster Analysis, K-Means, SOM, ABC, Scaling
Pages:
Enclosure: CD
Stavanger, 15/06/2014
Front page for master thesis, Faculty of Science and Technology. Decision made by the Dean October 30th 2014.

How Worth is Predictive Analysis? Predicting the Future & Exploring the Inside Chemistry of SAP PA
Uncovering value in the data
Piyush Duggal
Department of Electrical and Computer Engineering, University of Stavanger
E-mail: [email protected]
Thesis submitted in partial fulfillment of the requirements for the MASTER DEGREE in Computer Science
June 14, 2014

Acronyms
BI Business Intelligence
BIT Business Intelligence Tools
DB Database
DM Data Mining
DS Data Sources
DSMS Data Stream Management System
DW Data Warehouse
EAI Enterprise Application Integration
ETL Extract, Transform and Load
HANA High Performance Analytic Appliance
KPI Key Performance Indicator
OLAP Online Analytical Processing
PA Predictive Analysis
PAL Predictive Analysis Library
PMML Predictive Modelling Markup Language
RDBMS Relational Database Management System
RT Real Time
SOM Self-Organizing Maps
SVID SAP Visual Intelligence Document
SPAR SAP Predictive Analysis Archive file

Abstract
The enormous growth in analytical data, combined with insight into the advantage of managing the future, brings predictive analysis into the picture. It has real potential to be counted among the most efficient and competitive technologies that give an edge to business operations. The possibility to predict future market conditions and to know customers' needs and behavior in advance is an area of interest for every organization. Other areas of interest are maintenance prediction, where we try to predict when and where equipment or a component will break, and fraud detection for insurance and banking companies. The SAP Predictive Analysis tool is the sum of the efforts and investments SAP has made, through support for the open source statistics language R and many built-in predictive algorithms. The tool supports the definition, visualization, processing and deployment of predictive analysis processes in a way that was never done so effectively before. Other tools, such as SAS and InfiniteInsight, have been on the market for quite some time, but SAP has now strategically come up with an impressive investment: HANA (its in-memory database) combined with PA gives it an edge over competitors, with a powerful selling case that allows business users to do predictive analysis on huge amounts of data in a user-friendly tool that still, via R support, lets expert users develop their own algorithms. Decision support systems based on predictive models are increasing in popularity as organizations collect more data than decision makers can handle manually. These predictive models can be applied to find potentially valuable patterns in the data, or to predict the outcome of some event.
This report discusses PA as a concept: why it is needed, the value it adds to a business, how analytical users can try to predict the future of their business operations, the algorithms involved, hot topics and trends, challenges and criteria for success, and more. Predictive analytics enables insight into future outcomes and trends, with a probability attached, based on information extracted from existing data sets with the help of models, pattern recognition and statistical algorithms. Software for the PA process can be deployed on-premises for enterprise users or accessed via the cloud, and there are various solutions on the market, both proprietary and based on open source technologies, for instance Angoss, IBM Predictive Analytics, KXEN, Oracle Data Mining, SAP PA and Statistica.

Acknowledgements
I would like to thank Prof. Chunming Rong and Yann Chagourin, my supervisors, for their valuable advice, support and contributions in every phase of the thesis work. My deepest gratitude goes to Chunming for his relentlessly positive suggestions and insightful comments throughout; he was always available whenever I needed help. I would like to extend my sincere thanks to Yann Chagourin, Analytics manager with Accenture with vast experience in data mining and analytics, for his support. The thesis would never have been possible without his help. I would like to thank my friends Bikash Agarwal, Tormod Lea, Raji Khushi and Marina Samohvalova for their inspiration and help. I would also like to thank all my wonderful colleagues in Accenture for their encouraging words. Last but not least, I would like to thank all my family and friends in Norway and India for helping me complete my thesis successfully. I also greatly thank God for giving me the courage and energy to finish my Master program despite my other responsibilities. Without his blessings, I would never have been able to reach the achievements I have until now.
Piyush Duggal
University of Stavanger

Preface
This thesis is submitted in partial fulfilment of the requirements for the Master of Science (M.Sc.) degree at the Department of Electrical & Computer Engineering at the University of Stavanger (UiS), Stavanger, Norway. The work can be seen as a case study trying to understand and explore the inner mechanism of the new tool from SAP called Predictive Analysis. Thanks to my supervisors and the books I referred to in order to achieve the goals set at the start of the thesis. The report covers work done from February 2014 to June 2014. It may be helpful for those thinking of moving their IT skills towards data mining and analytics, and it may provide a solid ground for students and statisticians who are curious to dive into this contribution from SAP to the computing world. Data visualization, prediction, exploration and analysis techniques are covered, with focus on the Apriori and K-means algorithms implemented in PA. Self-Organizing Maps may also be an area of interest for those who are new to analytics.
Piyush Duggal
University of Stavanger

Table of Contents
Chapter 1: Overview of Predictive Analysis
1.1 Definition
1.2 Potential and what value PA can bring
1.3 Five Kinds of Analysis for 5 questions
1.3.1 Time Series Analysis
1.3.2 Classification Analysis
1.3.3 Cluster Analysis
1.3.4 Association Analysis
1.3.5 Outlier Analysis
1.4 Predictive Analysis as a process
1.5 User's Classification
1.6 Challenges & Criteria for Success
Chapter 2: PA as a product from SAP
2.1 Intro to SAP HANA (Based on 3rd Semester Project work)
2.2 SAP HANA Predictive Analysis Library
2.3 R Integration
2.4 Interface walkthrough of SAP Predictive Analysis as a tool
2.4.1 Step 1: Accessing and viewing the Data Source
2.4.2 Preparing Data for Analysis
2.4.3 Step 3: Applying Algorithms for data analysis
2.4.4 Step 4: Running the model and viewing the Results
2.4.5 Step 5: Deploying Model in Business Application
Chapter 3: Predictive Analysis Applied
3.1 Initial Data Exploration
3.1.1 Sampling
3.1.2 Scaling
3.1.3 Binning
3.1.4 Outliers
3.2 Which Algorithm When
3.3 Challenges & Resolutions
Chapter 4: Cluster & Association Analysis Explored
4.1 Association Analysis
4.1.1 Applications of Association Analysis
4.1.2 Apriori Association Analysis
4.1.3 Apriori Association Analysis in PAL
4.1.4 Strength & Weakness with Apriori Lite
4.2 Cluster Analysis
4.2.1 Introduction & Applications of Cluster Analysis
4.2.2 ABC Analysis in PAL
4.2.3 K-Means Cluster Analysis in PAL
4.2.4 Silhouette
4.2.5 Self-Organizing Maps
Chapter 5: Conclusion
5.1 Problem Set: Burn that Churn
5.2 Results & Analysis
5.2.1 Clustering
5.2.2 Decision Tree
5.2.3 Apriori
5.2.4 Neural Network
5.3 Discussion & Issues
5.3.1 SAP PA compared to Hadoop
5.3.2 Sharing your own R component
5.3.3 Configuring HANA PAL to use with SAP PA
5.4 Future Work
5.5 Conclusion
References

Table of Figures
Figure 1: PA utilizing approaches from many disciplines
Figure 2: Competitive Advantage goes well with Analysis
Figure 3: Five main questions of PA
Figure 4: Historic points as base for future points plot
Figure 5: Classification Analysis
Figure 6: Cluster Analysis
Figure 7: Association Analysis
Figure 8: Two-dimensional plot showing an outlier
Figure 9: Steps of PA Process
Figure 10: SAP HANA Internal Architecture
Figure 11: R Integration of PA
Figure 12: Welcome Screen for PA
Figure 13: Select Input Source for PA
Figure 14: Window to search for database
Figure 15: Merge option in Step 1
Figure 17: Preparing Data for Analysis
Figure 18: Possibility to apply and configure algorithms
Figure 19: Configuring the attributes for algorithms
Figure 20: An Advanced Analysis in PA
Figure 21: Dialogue to create a new R Component for PA
Figure 22: Predict Results Grid View
Figure 23: Cluster Parallel Coordinate Chart
Figure 24: Scoring the saved model in PA
Figure 35: Share View in PA for outputs
Figure 36: Table versus Charts
Figure 37: Input & Output Systematic Sampling
Figure 38: The Sample Component in PA
Figure 39: Scaling types and their results compared
Figure 40: Normalization Component in PA
Figure 41: Input & Output tables for Binning in PAL
Figure 42: Algorithm Categories with tasks and examples
Figure 43: Four Data Sets in Anscombe's Quartet
Figure 44: Process of Overfitting the models
Figure 45: Examples of Multicollinearity
Figure 46: Apriori Principle
Figure 47: Parameter Table Definition for Apriori
Figure 48: An Example of ABC Analysis
Figure 49: ABC Analysis Input & Output tables
Figure 50: Parameter Table Definition for K-Means
Figure 51: Decision Tree Analysis of Clusters
Figure 52: Data Set Records to the Map
Figure 53: Four Clusters in the 4 * 4 Map

Chapter 1: Overview of Predictive Analysis

1.1 Definition
SAP defines its predictive analysis tool as follows: 'SAP Predictive Analysis is a statistical analysis and data mining solution that enables you to build predictive models to discover hidden insights and relationships in your data, from which you can make predictions about future events by allowing you to perform various analyses on the data, including time series forecasting, outlier detection, trend analysis, classification analysis, segmentation analysis, and affinity analysis'. In the simplest terms, it is quantitative analysis that supports making predictions, together with the steps involved in doing so.
It is a trending term in computer science but not a new topic: over the past few decades there have been many attempts to predict product sales, costs, headcount, customer churn, advertising campaign response, possible fraud and so on. One can argue over whether it is data mining or rather knowledge discovery; wherever that debate ends, it can prove to be a business-changing methodology if used to the best of its potential. It is essentially the process of finding meaningful correlations, patterns and trends by interpreting and analyzing large amounts of data stored in data repositories, using statistical and mathematical techniques or pattern recognition concepts. Thanks to inferential statistics and statistical sampling, prediction does not strictly require very large data sets; well-sampled smaller data sets can also be analyzed for correlations.

Wikipedia defines it as an area of statistical analysis in which you extract information from data to predict patterns and trends. This can then be used to predict an unknown, be it past, present or future: for example identifying fraud that has been committed, or is occurring right now, through to forecasting future sales. The heart of predictive analytics is finding the relationship between known variables and a predicted variable, using past occurrences. This relationship is then used to predict an unknown outcome. Naturally, the quality of the data analysis and the assumptions made will greatly affect the accuracy and usability of the predictions.

Predictive analysis is a blend of several quantitative analytics disciplines, and the Venn diagram below describes the contribution of these disciplines to PA. Predictive analytics gives decision makers and analysts the potential to make accurate predictions about future events based on complex statistical algorithms applied to the data under investigation. In other words, PA is a synergy of interdisciplinary methodologies and perspectives, combining useful approaches to problem solving from different professions. Statisticians see analysis methods such as inferential statistics, regression and other multivariate methods as the key concepts, operational researchers prefer simulation and optimization methods, while data miners follow the artificial intelligence and information extraction approach. No matter which approach one goes for, this will always be an analytics process that starts with data selection, acquisition and exploration using visualizations or sampling, checks the validity of the results, possibly reiterates over the whole result set, and ends with dissemination to implement improved business processes. Predictive analytics can thus be seen as a broad term describing a variety of statistical and analytical algorithms and techniques used to develop models that predict future behaviors or events.

Figure 1.1: PA utilizing approaches from many disciplines

1.2 Potential and what value PA can bring
The white paper 'The Business Value of Predictive Analytics' by IDC Research reported that an asset management firm increased its marketing offer acceptance rate by 300%; an insurance company identified fraudulent claims 30 days faster than before; a bank was able to identify 50% of fraud cases within the first hour; and a communications company increased customer satisfaction by 53%.
During the 2009 pandemic of the H1N1 influenza virus (swine flu), Google was able to leverage search term activity to predict the spread of the disease two weeks ahead of the government's reports. This knowledge enabled state and local healthcare to ensure the availability of medicine and treatment for patients. What better way to describe the advantage of knowing what may happen in the future, depending on model efficiency? Management becomes easier when you have insight into the future, provided the predictions are accurate: the better and more accurate the analysis of future events, the better the control over them. The figure below shows the competitive advantage gained as we progress from simply reporting the past to predicting the future; the advantage clearly increases considerably.

SAP Predictive Analysis was launched in late 2012 as a supplement to SAP Lumira (formerly Visual Intelligence), as a tool that allows users to run R, HANA PAL and HANA-R algorithms through a user-friendly interface. It is quite interesting to note the results of a survey by Ventana Research titled 'Predictive Analytics: Improving performance by making future more visible':
- 55% use predictive analytics to create new revenue opportunities.
- 68% of those who use predictive analytics claimed a competitive edge.
- 86% asserted that predictive analytics will have a major positive impact.
The benefit of PA is not easy to measure: in theory it is the difference between what happened when PA was used and what would have happened without it, and the latter value is unknown. Still, the fact that the market for predictive analysis software is estimated at over 2 billion dollars gives an idea of its potential, worth and relevance to business today.

Figure 1.2: Competitive Advantage goes well with Analysis

Users of PA can be data scientists, data analysts or business users. Data scientists are less than 1% of an organization's head count and generally create complex predictive models, validate predictive requirements and publish results to management. Data analysts contribute around 3% of head count and assist data scientists in transforming and enriching data sources, creating simple models and visualizing results to publish to BI tools. The remaining 97% are direct or indirect consumers of this analysis information and collaborate with each other on further business actions. Data scientists generally cover the traditional roles of data miner, statistician or data researcher and have the deep knowledge and expertise to build predictive models for analysis, data collection, validation, exploration, selection and finally prediction. Business users don't have technical knowledge and just need the output of the analysis.

1.3 Five Kinds of Analysis for 5 questions
Whatever the business or the reason to deploy predictive analysis in that business, technically PA tries to answer the following five questions, as shown in the figure below.

Figure 1.3: Five main questions of PA

Finding trends in historical data can be used to project future data by applying time series analysis, utilizing historical data points to see how they might continue. This can be applied to demand prediction or sales forecasting. Keeping track of the key influencers of an event or an outcome can be valuable for churn analysis, as we can try to follow the purchasing trend of customers.
There may also be significant segments or groups in the data which are of particular interest to us; finding them can be a key for further analysis. Are there any clear groupings in the data, or some main influencers? PA also tries to find associations or links between products by analyzing market baskets to drive recommendation engines, and lastly it asks what anomalies exist in the data and why: are they errors or genuine variations to be analyzed further? Of course PA is deployed for a large set of applications across industries, but the key questions to be investigated remain the same. They are in fact so basic to PA as a methodology that we can use them to group classes of applications: each of the five questions discussed above corresponds to one of five classes of predictive analysis that help describe the structure of the data for analysis. We classify predictive analysis applications into one of the following five classes.

1.3.1 Time Series Analysis
Time series analysis accounts for the fact that data points taken over time may have an internal structure, such as autocorrelation, trend or seasonal variation, that should be accounted for. A time series is an ordered sequence of values of a variable at equally spaced time intervals. The analysis helps us understand the underlying forces and structure that produced the observed data, build a model and proceed to forecasting, monitoring, or even feedback and feed-forward control. The intent is to discern whether there is some pattern in the values collected to date, with the intention of short-term forecasting: past data points are used as the basis for predicting future ones. This is also a major weakness, because it relies on the assumption that past behavior will be repeated, which is not always true, so the approach should be used with caution.

Figure 1.4: Historic points as base for future points plot
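To make the idea concrete, the short R sketch below fits a Holt-Winters model (the triple exponential smoothing family that PAL also offers) to a built-in monthly data set and forecasts the next twelve points. It is a minimal illustration in plain R, not the PA workflow itself, and the data set is simply R's bundled AirPassengers series.

# Minimal time series sketch: fit Holt-Winters (triple exponential smoothing)
# to monthly data and forecast the next 12 periods.
data(AirPassengers)                  # monthly airline passengers, 1949-1960
fit <- HoltWinters(AirPassengers)    # estimates level, trend and seasonal components
fc <- predict(fit, n.ahead = 12)     # point forecasts for the next year
plot(fit, fc)                        # historic points as the base for the future points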
1.3.2 Classification Analysis
This is the largest group of applications. The aim is to predict a variable using data on other variables that are believed to affect the value of the variable we want to predict. The prediction variable is also called the output or target variable, as it depends on a number of independent (input) variables. Churn analysis and target marketing are the most common uses. Together with cluster analysis, classification is one of the most common data mining techniques for finding hidden patterns in data. Classification differs from clustering: it also segments customer records into distinct segments, called classes, but unlike the cluster approach, classification analysis requires that the end user or analyst knows ahead of time how the classes are defined. A common approach to classification is to use decision trees for segmenting and partitioning records, where a record is classified by traversing the tree from the root via branches and nodes to a leaf, which represents a class instance.

The path taken through a decision tree is a rule, such as "Income < $30,000 and Age < 25 and Debt = High". Due to the sequential nature of the way a decision tree splits the records, the tree can be overly sensitive to the initial splits, so it is advisable to find the error rate of each leaf node. Because paths can be expressed as rules, measures for evaluating the usefulness of rules, such as support, confidence and lift, could also be used to evaluate the usefulness of the tree. In practice, however, these values are not used to measure the quality of a decision tree model; they fit better with Apriori-style association rules. For decision tree models one simply checks the accuracy of the model on known past data. There is also no real argument that decision trees are a better algorithm than, say, neural networks for classifying data, although they are very common: the choice depends on the fit to the data set at hand and on the demands of the client, as decision trees are easy to understand and can help with understanding data patterns, whereas neural networks are black boxes.

Figure 1.5: Classification Analysis

1.3.3 Cluster Analysis
Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar, in some sense, to each other than to those in other groups. Cluster analysis itself is not one specific algorithm, but the general task to be solved. The greater the similarity (or homogeneity) within a group, and the greater the difference between groups, the "better" or more distinct the clustering. Cluster analysis is a classification of objects from the data, where by classification we mean a labeling of objects with class (group) labels. It is distinct from pattern recognition and from the areas of statistics known as discriminant analysis and decision analysis, which seek to find rules for classifying objects given a set of pre-classified objects, and hence it can be considered an alternative to factor analysis. Because the groups are not known in advance, interpretation can be difficult: the results may not make sense in the context of the research being conducted.

Hierarchical clustering groups data over a variety of scales by creating a cluster tree, which is not a single set of clusters but a multilevel hierarchy, where clusters at one level are joined as clusters at the next level. K-means clustering is a partitioning method: it partitions the data into k mutually exclusive clusters and returns the index of the cluster to which each observation has been assigned. Unlike hierarchical clustering, k-means operates on the actual observations rather than on the larger set of dissimilarity measures, and it creates a single level of clusters. Gaussian mixture models form clusters by representing the probability density function of the observed variables as a mixture of multivariate normal densities. They use an expectation maximization (EM) algorithm to fit the data, which assigns posterior probabilities to each component density with respect to each observation. Clusters are assigned by selecting the component that maximizes the posterior probability, which is why this is often considered a soft clustering method.

Clustering helps us understand the attributes of smaller subsets more effectively. Patterns and further relationships in the data are easier to find when we focus on these clusters, and it is also possible to cluster the data in a way that lets us focus on a specific group within the data set. Cluster analysis is in effect pattern recognition without a priori knowledge of the data set. When we have groups of similar customers, based on some attributes, this can be used to improve business processes; for example, if the algorithm finds a cluster of high-value customers, it might be a good idea to target them with specific campaigns. In contrast to classification analysis, where every observation is known to belong to one of a number of groups and the objective is to predict the group of a new observation, cluster analysis tries to find the number and composition of the groups.

Figure 1.6: Cluster Analysis
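A minimal R sketch of the k-means partitioning just described is shown below. It uses R's built-in iris measurements purely as example data; PAL and PA expose the same kind of algorithm through their own interfaces.

# Minimal k-means sketch: partition observations into 3 clusters
# and inspect cluster sizes and centers.
set.seed(42)                                   # reproducible starting centroids
obs <- iris[, 1:4]                             # four numeric attributes
km <- kmeans(obs, centers = 3, nstart = 10)    # 10 random restarts, keep the best
table(km$cluster)                              # how many observations fall in each cluster
km$centers                                     # cluster centers, i.e. the group profiles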
1.3.4 Association Analysis
Given a set of transactions, we try to find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction. The purpose of association analysis is to find patterns, in particular in business processes, and to formulate suitable rules of the sort "If a customer buys product A, that customer also buys products B and C". Association is thus a data mining function that discovers the probability of the co-occurrence of items in a collection, and the relationships between co-occurring items are expressed as association rules. In transactional data, a collection of items is associated with each case. The collection could theoretically include all possible members of the collection; for example, all products could in theory be purchased in a single market-basket transaction. In practice, however, only a tiny subset of all possible items is present in a given transaction; the items in the market basket represent only a small fraction of the items available for sale in the store. The associations do not necessarily need to be products in shopping baskets; they can just as well be people in a social network, telephone calling patterns, and so on.

Figure 1.7: Association Analysis
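The R sketch below mines association rules of exactly this kind with the Apriori algorithm from the arules package (an add-on package, so this assumes it is installed); PAL ships its own Apriori and Apriori Lite implementations for the same task.

# Minimal market-basket sketch with the arules package (assumed installed).
library(arules)
data(Groceries)                                 # example transaction data shipped with arules
rules <- apriori(Groceries,
                 parameter = list(supp = 0.01, conf = 0.3))   # support and confidence thresholds
inspect(head(sort(rules, by = "lift"), 3))      # show the three strongest rules by lift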
1.3.5 Outlier Analysis
An outlier is a data point which is significantly different from the remaining data, i.e. an observation which deviates so much from the other observations as to arouse suspicion that it was generated by a different mechanism; outliers are also referred to as abnormalities, deviants or anomalies. An outlier often contains useful information about abnormal characteristics of the systems and entities that impact the data generation process. Most outlier detection algorithms output a score describing the level of "outlierness" of a data point, which can be used to rank the data points by their outlier tendency. This is a very general form of output, which retains all the information provided by a particular algorithm, but it does not provide a concise summary of the small number of data points that should be considered outliers. A second kind of output is a binary label indicating whether a data point is an outlier or not. While some algorithms directly return binary labels, outlier scores can also be converted into binary labels, typically by imposing thresholds on the scores based on their statistical distribution. A binary labeling contains less information than a scoring mechanism, but it is the final result that is often needed for decision making in practical applications. A predictive model should be capable of differentiating between outliers caused by errors in the data and genuine variations in the data. Outlier analysis is mostly applied in fraud detection, clinical trials, voting irregularities and the like.

Data sets with multiple outliers, or clusters of outliers, are subject to masking and swamping effects. Masking occurs when a cluster of outlying observations skews the mean and the covariance estimates toward it, so that the resulting distance of the outlying point from the mean is small. Swamping occurs when a group of outlying instances skews the mean and the covariance estimates toward it and away from other non-outlying instances, so that the resulting distance from these instances to the mean is large, making them look like outliers. This points to the second main usage of outlier analysis: "improving" the quality of a data set before running other algorithms on it, since algorithms that suffer from the presence of outliers, such as regression algorithms, should not be applied to it directly.

Figure 1.8: Two-dimensional plot showing an outlier
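A small R sketch of the score-then-threshold idea follows, in the spirit of the inter-quartile range (Tukey) test that PAL also provides: values lying more than 1.5 times the IQR beyond the quartiles are given a binary outlier label.

# Minimal IQR (Tukey) outlier sketch: flag values far outside the quartiles.
x <- c(10, 12, 11, 13, 12, 95, 11, 10)                        # one obvious anomaly
q <- quantile(x, c(0.25, 0.75))
iqr <- q[2] - q[1]
is_outlier <- x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr     # binary labels
x[is_outlier]                                                 # the flagged point(s)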
1.4 Predictive Analysis as a process
Like every other process, predictive analysis is a series of defined, logical steps, and can be described as follows.

1. Requirement Analysis – What is the reason behind your prediction, and what is the motivation behind this attempt to predict? What outcomes are expected, and who are the prospective participants; what are the timelines and resources? This is an important step and on average requires around 20% of the total time to come up with a good analysis of the requirements. A flaw here will always affect the final prediction output.
2. Data Identification – What are the data requirements, and what sort of data will support our prediction model best? What sources are available, and which is most reliable? Validating data after acquiring it from the various sources is also good practice. Initial data exploration is conducted, and some data transformations such as sampling, binning or rescaling may be performed in preparation for model building. It is worth noting that data selection, acquisition and preparation is the most time-consuming step in the whole process: questions such as what is needed, in what terms it should be measured, where it can be obtained from, and how good the resulting data set is determine the data to be captured and transformed. On average this step accounts for 36% of the total time spent on a PA process.
3. Model Building – Here the algorithms come into the picture; they are applied to the identified data along with chosen parameters to find the best analysis. Training the model, or testing it on one set of data and then reapplying it on another, unseen part of the data, is generally done to evaluate the algorithm fit (how good the algorithm is at solving the problem at hand) for the particular case. This also gives an idea of model performance in terms of robustness, usability and goodness. On average 20% of the time is spent in this step.
4. Deployment – Here comes the opportunity to apply the selected models in various business applications; this is sometimes also referred to as model scoring. The models may need to be integrated with business rules and fundamentals to get a better perspective of the business context. Here we also monitor model performance over time.
5. Reiterate – Predictive analysis requires iteration back to any stage of the process, so it would be wrong to define it as a single pass through a sequence of well-defined steps. Data often needs to be transformed further to refine the analysis for better results, and iteration provides the option to see a data set from multiple perspectives.

Figure 1.9: Steps of PA Process

1.5 User's Classification
We can classify PA users into diverse categories with different skill sets and domain experience, ranging from data scientists to consumers of business applications. Data scientists form the smallest group, around 0.01% of total users, and are responsible for creating predictive models, validating predictive business requirements and publishing results to management. Models built by data scientists are used by data analysts through an interface, such as a wizard, to explore and analyze data related to a particular application, say a marketing campaign. Analysts generally have functional and business domain knowledge but don't want to get engaged with the technical process of applying algorithms and creating models. They generally need guidance to understand these models, are mainly interested in their output, and are most often market researchers, campaign managers or analysts. People who just want the benefits of predictive analysis simply embedded in their business processes are classified as business users; they are only interested in analyzing the output of the algorithms for decisions. With Service Pack 6 for HANA, SAP has introduced the Application Function Modeler, a graphical interface for running advanced algorithms and accessing results from the Application Function Libraries, which helps these business users be more effective.

1.6 Challenges & Criteria for Success
Before we look at Predictive Analysis as a product from SAP and its potential, let us discuss some challenges of PA and the myths involved. It has to be understood that PA does not guarantee successful and consistent prediction of the future. In classical business organizations an enormous amount of data is collected without knowing where and when it will be used, with a "save everything, because you never know when you need something" approach. For analyzing data, however, the quality of data is more important than the quantity. It can be useful and efficient to store data with metadata defining its purpose and the decision making it may support. Correctly identifying the variable that has the biggest impact on the prediction or output is often very difficult, and the objectives of the analysis should be very clear from the start if a predictive analysis project is to be implemented successfully. There are many myths about PA in the market, propagated by innocent or biased parties. The five below are the most common misconceptions about PA, according to the SAP-published book on Predictive Analysis.
1. PA is all about algorithms.
2. PA is all about accuracy.
3. PA requires a data warehouse.
4. PA is all about vast quantities of data.
5. PA is done by predictive experts.
The first myth, that PA is all about algorithms, is not totally wrong, as algorithms form the heart of the process, but good and efficient algorithms are only part of the story. In the previous section we saw that only 20% of the PA process is devoted to generating models.
Other things at the core of PA, besides algorithms, are defining project goals, acquiring, understanding and manipulating data, analyzing, evaluating, modeling and presenting the results. Believing this myth is like driving a car that has an engine but no steering wheel, fuel or brakes.

Everyone wants the best model, and there are various measures of model quality, but that does not mean PA is always accurate. Spending continuous energy and time refining a model to squeeze out the last bit of precision always creeps in extra cost, and it is a business decision how much that extra effort is worth in return for a more accurate analysis. Predictive analysis may discover an interesting pattern in the data which proves crucial to a business decision, but this usefulness does not depend on the accuracy of the model. The usefulness of PA algorithms depends on their understandability and deployability in the business model, not solely on their accuracy.

The third myth is the assertion that we need a fully functional data warehouse before we can start a predictive analysis process. This is not true, although the PA process will certainly be more efficient and easier to implement if organizational data is relatively clean and easy to access. When planning a data warehouse, the data required for analysis should of course be considered, but management reporting, not data analysis, is the basic purpose of a data warehouse. That is why this is considered a myth: having no data warehouse cannot stop anyone from starting predictive analysis.

This is the era of Big Data, and computer memories, database sizes and performance factors are all many times bigger these days, but it is definitely a myth that we cannot predict or analyze without a vast data set. We cannot "predict" if there are no patterns in the data, whatever the amount of data. The idea with "big data" is that the more data you have, the more likely you are to see the patterns, but that presupposes the existence of patterns. PA is equally relevant and successful for very small volumes of data. As far as statistical inference is concerned, data volumes can be small yet of very high importance for the analysis. In most cases, such as churn analysis, credit risk tests or loan defaulting, even though we have thousands of records in the data set, PA still depends on just a few key variables. It is not wrong to say that analysis of bigger data sets is increasing in popularity, but analysis of small data volumes can be equally beneficial to the next business decision.

The last myth is partly true, but experts are not the only case: PA can be done by predictive experts as well as newcomers to the area. With tools like PA, the idea is that you do not need to be an expert in R, or in the inner workings of the provided algorithms, in order to build predictive models. You still need some knowledge of how to use the algorithms and of their strengths and weaknesses, but the most important part may well be the business knowledge needed to interpret the results. It depends on what has to be done and on the complexity of the task. When a prospective borrower browses a bank's website to work out the bank's rules for granting a loan, he is doing his own predictive analysis, and certainly need not be an expert. PA is best performed by someone who has relevant business domain knowledge.
Since we will soon be exploring how to use predictive analysis, it is wise to go through the common pitfalls that every user should avoid, to guard against less obvious sources of trouble. Firstly, predictive analysis makes no sense if data is simply thrown in without any thought. Rationally it is illogical to just dump all accessible data, yet some data scientists prefer to do so and rely on some intelligent, reliable algorithm to work out the important variables and ignore the irrelevant ones. This is related to myth 4 above, that PA is all about vast quantities of data. Dumping all the data can be acceptable practice if you trust an algorithm to sort the signal from the noise and to reject variables it considers irrelevant to the analysis; experienced business users, however, find this approach dangerous and counterintuitive.

Predictive analysis will of course be of no use if the user does not have at least basic to intermediate knowledge of the related business domain. Without business knowledge of the application area it is practically impossible to guide the predictive analysis process towards useful results, or to make a decision based on those results once we have them. It therefore involves teams with diverse skill sets, from business knowledge to analysis knowledge, working together; otherwise neither the results nor the variables and dependencies will be understandable to someone with no business knowledge when making the prediction.

Lack of data knowledge is another common pitfall to avoid. The approach should be to obtain detailed answers about the data, data types, authenticity, source, provider, measurements and interpretation in terms of business rules at the outset. Did the data come from a sample or a survey, and was it unbiased? Such questions have deep significance. Irrelevant data, or lack of data knowledge, can be as bad as having no data: without data knowledge we can be misled and tend to make erroneous, invalid assumptions. Some assumptions need to be verified twice, for instance whether a customer can hold multiple accounts or whether class attendance is mandatory. For legacy and outsourced data this is a difficult task, as even data experts need to be sure about these assumptions. In short, sources should always be questioned and drilled into before we finalize any assumption during data verification.

After these pitfalls and myths it is easy to summarize this section as the key success factors for any predictive analysis process. Expectations should not be set too high: data mining does not guarantee finding gold, and it depends on your expectations whether finding 10 when you promised 12 is a success or a failure. It is always advisable to steer a predictive analysis project only after agreeing the first steps of setting objectives, the business case and the desired outcomes, rather than starting the process and waiting for something to be found if we are lucky. Working as a team is also crucial, with business domain experts available at every step and data analysts supporting them and understanding their requirements.

Sensitivity analysis questions the impact of the assumptions made and verifies the effect of these assumptions on the analysis output. It is in fact a major influencer of success.
A solution can be considered unstable, or a model unhealthy, if small changes in the assumptions bring large alterations in the results.

Chapter 2: PA as a product from SAP

2.1 Intro to SAP HANA (Based on 3rd Semester Project work)
Before we discuss SAP HANA and the Predictive Analysis Library, here is a quick introduction to SAP HANA and in-memory computing basics from the report of my 3rd semester work. An in-memory database system, also known as a main memory database system, contrasts with traditional systems that rely on disk storage; it is faster because it relies on internal optimization algorithms that eliminate seek time. SAP HANA is a powerful platform providing libraries for predictive, planning, text processing, spatial and business analytics, combining data processing, application platform and database capabilities in memory. It is an innovative in-memory data platform that is deployed on premise as an appliance, in the cloud, or as a hybrid of the two. The key lies in its unique ability to converge database and application logic within an in-memory engine to perform advanced, real-time analytics.

HANA stores a table in the column store as a sequence of columns in consecutive memory locations, maximizing the spatial locality of the table columns. CPU execution speeds are high, with no internal waits for memory address operations. Data is compressed twofold, making the database less costly while allowing speedy searches and calculations. The HANA database, also called the SAP in-memory database, follows a hybrid approach and consists of two relational database engines. The column-based store arranges data in columns and is optimized to hold huge amounts of data, which can be aggregated in real time; the row-based store is more optimized for inserts and updates and stores data in rows. To achieve the desired performance, in-memory computing follows these basic concepts:
- Keep data in main memory to speed up data access.
- Minimize data movement by leveraging the columnar storage concept, compression, and performing calculations at the database level.
- Divide and conquer: leverage the multi-core architecture of modern processors and multiprocessor servers, or even scale out into a distributed landscape, to grow beyond what a single server can supply.
All standard features expected from a relational database, such as views, triggers and indexes, are supported by the HANA database engines. At table creation time the administrator can select either of the two store types, and it is always possible to convert tables from one form to the other later. Both engines share a common persistency layer, which is responsible for page management and logging. The logger saves every transaction committed on the HANA database as a log entry written to persistent storage, and log volumes use low-latency flash technology. Modeling capabilities to define in-memory transformations of analytical views from relational tables are also provided; analytical views always deliver real-time results, as the views are never materialized. In-memory computing allows the processing of massive quantities of real-time data in main memory to provide immediate results from analysis and transactions.
In order to support developers in creating applications and services directly within the new SAP HANA Extended Application Services, SAP has enhanced SAP HANA Studio to include all the necessary tools. SAP HANA Studio was already based upon Eclipse, so the Studio could be extended via an Eclipse Team Provider plug-in which sees the SAP HANA Repository as a remote source code repository, similar to Git or Perforce. In this way all development resources (HANA views, SQLScript procedures, roles, server-side logic, HTML and JavaScript content, etc.) can have their entire lifecycle managed within the SAP HANA database. These lifecycle management capabilities include versioning, language translation export/import, and software delivery/transport.

Figure: SAP HANA Internal Architecture

2.2 SAP HANA Predictive Analysis Library
More and more people are becoming aware of SAP's efforts and contributions in the area of predictive analysis, ranging from SAP HANA (the in-memory computing database) to a modern user interface for visualizing, defining and executing the whole process efficiently. SAP is recognized as a leader in big data predictive analysis by Forrester in the report 'The Forrester Wave: Big Data Predictive Analytics Solutions', thanks to innovative solutions and research contributions that provide business users with powerful predictive assets such as data preparation, predictive algorithms, developer tools and a workbench to execute, visualize and share analyses, accelerating business applications. SAP's predictive tool supports many data sources, such as HANA and data from SAP BusinessObjects, along with non-SAP solutions like Hadoop (via SAP Data Services), CSV or plain Excel files.

The foundation of these PA assets will always be the powerful predictive algorithms and the Predictive Analysis Library (PAL) in HANA, a built-in C++ library for in-database data mining and in-database statistical calculations. An enterprise-class solution is delivered by SAP Data Services for data integration, quality management, text analytics, data profiling and metadata management, and unstructured data sources are also supported through a combination of data services. PAL contains many predefined predictive analysis algorithms that execute in-database to process large data sets. The point is that data is not extracted out of SAP HANA to some other analysis environment, which reduces data movement time and allows calculations to be performed within the HANA server and database. These algorithms are called from within HANA SQLScript procedures and are generally grouped into the following classes of applications. Listed below are the algorithms provided by PAL.
a. Association Analysis
- Apriori
- Apriori Lite
b. Cluster Analysis
- ABC Classification
- DBSCAN
- K-Means
- Kohonen Self-Organizing Maps
c. Classification Analysis
- C4.5 Decision Tree Analysis
- CHAID Decision Tree Analysis
- K Nearest Neighbor (KNN)
- Multiple Linear Regression
- Polynomial Regression
- Exponential Regression
- Bi-Variate Geometric Regression
- Bi-Variate Logarithmic Regression
- Logistic Regression
- Naïve Bayes
d. Time Series Analysis
- Single Exponential Smoothing
- Double Exponential Smoothing
- Triple Exponential Smoothing
e. Outlier Detection
- Inter-Quartile Range Test (Tukey's Test)
- Variance Test
- Anomaly Detection
f. Link Prediction
- Common Neighbors
- Jaccard's Coefficient
- Adamic/Adar
- Katz
g. Data Preparation
- Sampling
- Binning
- Scaling
- Convert Categorical to Binary
Link prediction is an emerging group of algorithms for analyzing social networks by finding links between entities in the network. It is fair to consider PAL table based, because for each supported algorithm three tables are maintained: an input table containing the data for analysis, a parameter (control) table containing the parameter settings for the particular algorithm, and an output table for the results of the analysis. The SQLScript that calls PAL contains code which first generates the specific procedure, then defines the tables for input data, parameter settings and results, and finally calls the procedure. All these procedures are defined in the AFL schema, which stands for Application Function Library schema.

2.3 R Integration
R, an open source statistics language with over 3,500 packages and algorithms, is one of the most used predictive analysis tools, used by roughly 60% of data miners. Allowing the use of R from within HANA offers this breadth of algorithms for business calculations in addition to the specific algorithms defined in PAL. A high-level architecture for the SAP predictive assets and their association with R is shown in the figure below. The SAP HANA platform with PAL forms the core and provides the flexibility to involve R; HANA Studio provides the development environment, while the client tool SAP Predictive Analysis is used by business analysts and data scientists. R and SAP HANA reside on separate servers side by side. The R server takes data stored in HANA tables, which is transformed into R vectors or R data frames, the default data formats used by R. SQLScript embeds R script code, which is passed over to the R server for processing, and the results are transferred back. The results returned by the R server are again in R's vector or frame format and therefore need to be converted back into HANA tables; all these transformations and transfers are performed by the HANA platform. The SQLScript containing the R script first calls code to initiate the specific procedure and declares parameters and input and output tables before calling the procedure. Predictive analysis clearly gains huge flexibility and comprehensiveness from this R support and integration with HANA.

If you want to use PAL algorithms you should know SQLScript, just as you need knowledge of R to use the open source R algorithms and packages. SAP PA is a simple tool with a nice user interface that lets business users get the benefit of predictive analysis without knowledge of R or SQLScript, and PA's capability increases greatly with the ability to add R algorithms. With the use of R in SAP PA, the data mining capabilities can be extended with many new algorithms, and the charting and visualization capabilities are enhanced further. The prerequisite is of course that the R software is installed on the host machine with the necessary libraries and R algorithms.

Figure: R Integration for PA
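To give a feel for what the R side of such an integration can look like, the sketch below wraps a multiple linear regression in a plain R function that takes a data frame and hands back the predictions as an extra column. The function, formula and column names are hypothetical examples; how such a wrapper is registered and called is configured in the SAP tools themselves.

# Illustrative R-side component: fit a multiple linear regression and
# return the input data enriched with a prediction column.
# All names here are made-up examples, not a fixed PA interface.
regression_component <- function(df) {
  fit <- lm(Sales ~ Price + Advertising, data = df)    # multiple linear regression
  df$PredictedSales <- fitted(fit)                     # predictions travel back as a column
  df
}

# Tiny made-up data frame, only to show the calling pattern:
demo <- data.frame(Sales = c(10, 12, 15, 9),
                   Price = c(5, 4, 3, 6),
                   Advertising = c(1, 2, 3, 1))
regression_component(demo)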
R Integration for PA

2.4 Interface walkthrough of SAP Predictive Analysis as a tool

SAP Predictive Analysis (PA) is, in the simplest terms, the SAP tool that serves as the user interface for defining and executing predictive analysis processes. These processes can run on the in-database PAL in HANA, on predictive algorithms in R, or on traditional data sources such as SAP BusinessObjects, XLS or CSV files. PA has the further advantage of being fully integrated with Lumira, which eases sharing of results once data acquisition, visualization and manipulation are done. PA supports all of the analysis processes for prediction mentioned earlier in this report. The section below gives a detailed description of SAP PA as a product, with screenshots, covering data preparation, applying algorithms and deploying models. All stages, from accessing and viewing the input data, through the required data preparation, to finally applying algorithms to analyze the data, are covered in the following sections.

The first screen shown when PA is started is the Welcome screen. A simple five-step Getting Started guide is displayed, which is as simple in design as it reads:

1. Connect to a data source: select a data source.
2. Prepare your data: explore the input data.
3. Analyze your data: click on the Predict view.
4. Visualize the analysis results: click on Results.
5. Save the analysis (an optional step).

The Welcome screen also contains a collection of samples to help new users learn the product and get used to the tool.

Welcome Screen for PA

2.4.1 Step 1: Accessing and viewing the Data Source

When we select the NEW DOCUMENT button on the PA Welcome screen, a dialogue box opens offering to SELECT A SOURCE. PA supports seven different kinds of data sources, listed under the NEW DATA SOURCE column, while the RECENT DATA SOURCES column on the right lists recently acquired data sources for convenient, fast access. The unique entry in the 'New Data Source' list is SAP HANA Online, which acquires data from SAP HANA tables, views and analytic views to perform in-database predictive analysis using PAL algorithms and R integration for SAP HANA. All the other data sources in the list are non in-database and exclude the PAL algorithms and R integration for HANA, because the analysis runs outside HANA.

SAP HANA Source Input

The seven data sources appearing under New Data Source are:

1. CSV file: acquires data from a comma-separated-value file and performs in-process analysis using native PA algorithms and R integration for PA.
2. Freehand SQL: lets users create their own data provider by manually entering SQL against a target data source, then performs in-process analysis using native PA algorithms and R integration for PA.
3. SAP HANA Offline: acquires data from SAP HANA tables, views and analytic views and performs in-process analysis using native PA algorithms and R integration for PA; the model runs locally, not in HANA.
4. SAP HANA Online: acquires data from SAP HANA tables, views and analytic views to perform in-database analysis using SAP HANA PAL algorithms and R integration for SAP HANA. This is the only option in which the predictive models run inside the HANA database; with all the other sources the model runs locally.
5. MS Excel: uses a Microsoft Excel spreadsheet as a data source; after acquiring the data we can perform in-process analysis using native PA algorithms and R integration for PA.
6. Universe 3.x: acquires data from SAP BusinessObjects universes available on the XI 3.x platform and performs in-process analysis using native PA algorithms and R integration for PA.
7. Universe 4.x: acquires data from SAP BusinessObjects universes available on the BI 4.x platform and performs in-process analysis using native PA algorithms and R integration for PA.

Selecting SAP HANA Online opens a dialogue box asking for the SAP HANA connection information and the table from which to fetch data, as shown in the figure above. Once the user selects a HANA table and connects successfully, PA moves to the Prepare view, from which you can switch to the Predict view and use all of the in-database PAL algorithms, listed as shown in the figure below, together with the other data-source components in the analysis editor. Data writer and data preparation components always run in-database in HANA when SAP HANA Online is the data source, and it is worth noting that algorithms such as Apriori, K-Means, CNR Tree and Multiple Linear Regression are available in this mode through PAL and R integration for SAP HANA.

For the six data sources other than SAP HANA Online, for instance a CSV file, the data-source prompt looks like the figure below, where you browse for a data file located on your machine. Once the data is acquired, PA again takes you to the Prepare view, from which you can switch to the Predict view and use all of the native PA algorithms and the PA-supported R algorithms. All supported algorithms are listed under the Algorithms tab and can be selected according to your requirements.

SAP PA also allows you to combine data from two different data sets within the Prepare view. There are two options for combining data sets: merge and union. MERGE creates a combined table by matching a key column in the two data sets, while UNION appends the selected columns of the source data set to the target data set based on a provided identifier, provided the matching columns in the two data sets have the same data type. The figure below illustrates merging two data sets in the Prepare view of PA. Once input data has been acquired from any of the seven possible data sources, you are ready to perform initial data exploration and preparation before applying any algorithm.

Window to search for input file

Merge Data in Step 1

2.4.2 Step 2: Preparing Data for Analysis

Once data input is finished, PA moves to the Prepare view, where the data can be reviewed in a grid and rich column operations can be applied: sorting, filtering, renaming, merging, creating a geographical hierarchy, creating a time hierarchy, reformatting, or converting to a different data type. Data in the Prepare stage can be viewed in either the grid or the facets display, as shown in the figures below. In the facets view the data is shown by distinct value, rather like a horizontal bar chart aggregated by value, which is useful when a column has few distinct values.
The data manipulators available are similar in both views.

Preparing Data for Analysis

A particularly useful capability in the Prepare view is the Visualize view, backed by an extensive chart library. After accessing the data and exploring it through one or more of the many available visualizations, we get a better feel for the data and can carry out further preparation so that the PA algorithms can be applied effectively.

2.4.3 Step 3: Applying Algorithms for data analysis

After data preparation, all the components that can be added to an analysis appear in the Predict view of PA, grouped under the Algorithms, Data Preparation and Data Writers tabs. The components available depend on whether you are building an in-database analysis using HANA algorithms or an in-process analysis; the actual construction of the analysis is the same in both cases, even though the available components differ. Building an analysis is quite straightforward: select a component and drag it onto the analysis editor workspace, and it connects automatically to the component currently in focus. Alternatively, double-click the desired component instead of dragging it, which also connects it automatically to the component in focus. Each component has input and output anchors, also called connection points, which are used to connect it to other components; a data source always has only a single output connection point. Connected components pass data from predecessor to successor, i.e. the output of the predecessor component in a connection acts as the input of its successor. The structure of a component is shown in the figure below, with options to rename, run, delete or configure it. The figure also shows the different states a component can be in: 'Not Configured' means the component has been dragged onto the analysis editor workspace and must still be configured before the analysis can be run; 'Configured' means all mandatory properties are set and the analysis can be run; 'Success' is displayed after successful execution; and 'Failure' indicates that the component caused the execution of the analysis to fail.

Possibility to apply and configure Algorithms

We will now walk through both an in-database analysis, with data sourced from HANA and algorithms based on PAL, and an in-process analysis, with data sourced from a CSV file and algorithms based on R integration for PA, using the approach to building an analysis described above and with the help of screenshots.

Case 1: In-database analysis using HANA tables and PAL

We start by selecting the data in SAP HANA and then choose the Predict view in PA to run the analysis. The key point here is that with SAP HANA Online as the data source, the data does not leave SAP HANA: the whole analysis runs in-database. As an example, we run an analysis that segments (clusters) retail-store data into similar groups based on sales turnover, profit margin, staff numbers and store size, and we choose the HANA K-Means algorithm for this analysis.
Choosing a component, as explained in the previous step, is simple: just drag the component into the analysis editor workspace and connect it, as shown in the figure below.

PAL on HANA Data source

Next we configure the properties of the HANA K-Means component, as shown in the figure below, where all primary properties can be changed. We choose which variables to use in the analysis (e.g. Turnover, Size, Margin) and the number of clusters, i.e. the value of K, which in this example is set to 5. Clicking the Advanced Properties tab in the dialogue box displays the remaining control parameters for this algorithm component, as shown in the figure below; they are clearly described for business analysts, in contrast to writing SQLScript. Fields marked with an asterisk are mandatory. Default values for the advanced properties are generally displayed and can be changed; for instance, the maximum number of iterations should be lowered from the default of 100 if data volumes are very large and processing time is a crucial factor. Once the primary and advanced properties are set, the analysis is ready to run.

Configuring the attributes for algorithms

Case 2: In-process analysis using a CSV file and R integration

We start by selecting the data from a CSV file for this in-process case and then go to the Predict view. To explain this case we use the same example: clustering the retail stores into similar groups based on turnover, margin, staff numbers and store size. As in the previous case, we select the R K-Means algorithm and drag it into the analysis editor workspace to connect it. We then configure the properties of the R K-Means component, which are very similar to the editable properties in the previous case: we select the variables for the analysis and the value of K, the number of clusters to be created during the analysis. The Advanced Properties section of the dialogue box displays the remaining control parameters for the R K-Means algorithm, again described for business analysts as opposed to writing R script; the values shown are defaults and can be left untouched. The analysis is now ready to run, since we have configured the right data source with the required algorithm and the desired parameters.
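For readers who want to see what the same clustering task looks like outside the PA interface, here is a minimal R sketch of K-Means on a small, made-up store data set. The column names (Turnover, Margin, Staff, Size) and the data are illustrative assumptions; PA's K-Means components manage the equivalent computation for you, via PAL in-database or via R in-process.

# Minimal K-Means sketch in R, mirroring the PA example with K = 5.
# Assumption: hypothetical store data with the variables used above.
set.seed(42)   # make the random cluster initialisation reproducible
stores <- data.frame(
  Turnover = runif(50, 100, 1000),
  Margin   = runif(50, 5, 25),
  Staff    = sample(3:40, 50, replace = TRUE),
  Size     = runif(50, 80, 2000)
)
# Scale the variables so that no single variable dominates the distance measure,
# then fit K-Means with 5 clusters.
fit <- kmeans(scale(stores), centers = 5, iter.max = 100)
table(fit$cluster)   # cluster sizes
fit$centers          # cluster centres (on the scaled variables)

The iter.max argument plays the same role as the maximum-iterations setting in the Advanced Properties dialogue described above.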
Before we move on to the next step, which describes how to run the analyses, let us look at a more advanced and more realistic analysis than the ones built above by dragging in just two components. In the figure below, which represents a realistic analysis scenario, Stores.csv is the data source; an inter-quartile range test is run on each variable to filter out outliers before the cluster analysis is run on the data. The cluster analysis then writes the source data with the assigned cluster numbers to a database table, while the specific results for cluster one are extracted separately. The data resulting from the cluster analysis is then analyzed with a decision tree, where the target (dependent) variable is the previously derived cluster number and the independent variables are still store turnover, margin, staff and size; this gives us specific rules and patterns explaining why the clusters exist. Finally, the results are exported as a filtered subset. This is quite beneficial, because we can save the decision-tree model and reapply it to new data to predict a new store's cluster assignment. Saved models can also be exported to another application using the Predictive Model Markup Language (PMML) standard, which is explained later in this report.

A more recent feature of PA is the ability to define and run your own R algorithms from the PA analysis editor. The tool provides a GUI to add an R script as a new component, based on either R integration for PA or R integration for SAP HANA to run such scripts; there is even the capability to add custom algorithms written in C++ or Java. The figure below shows part of the wizard for writing custom R script components, which can later be included in any analysis. This integration opens SAP PA up to the thousands of algorithms in R libraries: an expert R user can write new components or algorithms which other business users can then easily embed in their own analyses.

An Advanced Analysis in PA

Dialogue to create a new R Component for PA

2.4.4 Step 4: Running the model and viewing the Results

Running an analysis is exactly the same whether it is in-database or in-process. As an example, we run the in-process analysis developed in the previous step using the CSV file and explain the PA functionality with screenshots. Once built, an analysis can be run in two ways: using the 'Run till here' option of the R K-Means component, or from the RUN ANALYSIS icon on the analysis editor toolbar. Once the run completes successfully, we can switch to the Predict Results view, which offers tabular (grid) output, algorithm-specific charts, and a default ad-hoc chart viewer for user-defined visualizations. The figure below shows, for the first few records of our cluster analysis, a new column added to each record containing its assigned cluster number, with the input data listed to the left of the new column.
Predict Results Grid View

Clicking the CHARTS option, just to the right of the Grid button, shows the cluster chart for the K-Means algorithm used, which provides four different visualizations of the results of this cluster analysis for further data exploration and a better view of the analysis. A vertical bar chart compares the clusters by size and can be switched to a horizontal bar chart or a pie chart. A cluster density chart with a distance chart is also generated, where a color scale from dark to light indicates dense to sparse clusters and a thicker connecting line means the clusters are closer together. Small clusters that appear close to each other may be candidates for merging, while small clusters distant from all others may be treated as outliers. The two charts generated at the bottom can be used to compare clusters on any specific variable in order to differentiate the properties of each cluster. For every algorithm used in an analysis, an algorithm summary is generated; in this example it is shown in the figure below, representing the output from R for the R K-Means algorithm. The summary includes the cluster centre coordinates, the within-cluster sum of squares (the sum of the squared distances between the individual records in a cluster and the cluster centre) and the size of each cluster.

Cluster Chart in PA

Under the CHARTS option a third chart is available, the Cluster Parallel Coordinates chart, in which each record is plotted as a horizontal line connecting its values on the vertical axes of the displayed variables and is color-coded by its cluster number. As with many visualizations in PA, we can drill down into specific data to examine it in more detail. Each algorithm has a default visualization, discussed in the algorithms sections of this report. We can also use the Visualize option for ad-hoc, user-defined charts; in the figure below a trellis chart is created showing each variable by cluster group for comparison. Going further, we can use the model to predict new data either within PA or in an external application.

Cluster Parallel Coordinate Chart

2.4.5 Step 5: Deploying the Model in a Business Application

Several options are available in SAP PA for deploying models:

- Scoring models in PA and exporting the results.
- Exporting the model as PMML.
- Sharing the analysis in the Share view of PA.
- Exporting and importing analyses between PA users.
- Exporting an SAP HANA PAL model from PA as a stored procedure.

The most commonly used of these options is to use PA to predict new data, i.e. to score the model. PA lets you save a model once you have built it and then make predictions with the same model on a new set of data: the saved model is run again with new data, from which the target (dependent) variable is predicted. The term 'scoring' comes from phrases such as scoring a customer's credit worthiness or their probability of churning. You can extend the analysis by adding the R-CNR Tree decision-tree algorithm to derive rules that describe why records have been assigned to specific clusters; the independent (input) variables are retail store turnover, margin, staff number and shop size, and the target (dependent) variable is the cluster number.
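As an illustration of that extension, the sketch below uses the open-source rpart package as a stand-in for the decision-tree step: it fits a classification tree that predicts the cluster label from the store variables and prints the resulting rules. The data frame and column names are hypothetical and continue the earlier K-Means sketch; in PA itself the R-CNR Tree component is configured through the same property dialogues as the other components.

# Sketch: derive rules that explain cluster membership, assuming the
# 'stores' data frame and 'fit' K-Means result from the earlier sketch.
library(rpart)

stores$Cluster <- factor(fit$cluster)           # cluster number as the target variable
tree <- rpart(Cluster ~ Turnover + Margin + Staff + Size,
              data = stores, method = "class")  # classification tree
print(tree)                                     # the split rules, e.g. "Turnover < ..."
predict(tree, newdata = stores[1:3, ], type = "class")  # score a few records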
The figure below shows the option to save a model with a model name and a corresponding description. Once a model is saved it appears under a new Saved Models tab, alongside the Algorithms, Data Preparation and Data Writers tabs, which makes it easy to build a new analysis that uses the saved model to predict a new data set: in the Prepare view, simply add the saved model to the analysis along with the new data. An alternative approach is to export the saved model from the current analysis and then create a new analysis with the new data for scoring, importing the saved model into it. The two screenshots below show the Saved Models tab, with name and description, in the PA tool and how scoring with a saved model is done.

The section above explained how to export the predictions of a model for reuse; the following section covers exporting the model as PMML. PA lets you export a saved model in the Predictive Model Markup Language (PMML), which is now regarded as an industry standard for sharing models between applications. The option to export a model as PMML appears when you right-click the saved model component in the PA analysis editor, where you can specify the output file for the XML. The figure below is a screenshot of such PMML XML generated for a saved model. The PMML, which describes the created model, can then easily be read by another application and exploited to produce predictions. Having covered exporting a model as PMML, the next section looks at sharing the analysis with other users; sharing an analysis is, of course, as important a step as developing it.

Scoring the saved model in PA

The Share view in PA provides the following functionality for data, charts, or both together: sharing charts; exporting a generated data set to a file; publishing a generated data set as an analysis view to SAP HANA; publishing a generated data set and charts to StreamWork; publishing a generated data set to SAP Lumira Cloud; and publishing a data set to a BusinessObjects information space so that it can be accessed in SAP BusinessObjects Explorer. The diagram below gives an idea of how the Customers.xlsx data can be shared along with its associated visualizations. The following section covers exporting and importing analyses between PA users.

PMML Output for decision tree

To export a model for use by another PA user, a .spar file can be generated from an SVID document, which can then be used in another SVID document once the import is successful. This is done in a few simple steps: in the Predict view of PA, choose 'Export Model', provide a name for the .spar file when prompted, and save it. The saved model in the .spar file can be reused in another SVID document by importing it from the saved .spar file. Importing a model is just as straightforward: choose 'Import Model' from the PA toolbar in the Predict view, select the path and file of the desired .spar file, and click Open; the model appears in the Saved Models tab once the import completes. SVID stands for SAP Visual Intelligence Document, Visual Intelligence being the earlier name of Lumira, a product for storing data sets along with the visualizations generated by users. Saving a model in PA automatically makes a copy and stores it in the SVID as well.
The purpose is to make models available to share with other users, or to share them on Lumira Cloud, from where they can be accessed directly in PA or Lumira. SPAR is an acronym for SAP Predictive Analysis Archive file and is the proprietary format for exporting models created in PA. At present it is used mainly for transporting models, but the plan is eventually to cover analyses and custom-created components in this format as well. This is very useful, and central to business operations, when one user creates a model and others use it, share it, or modify it for their own needs. To use a saved PA model in SAP HANA, it is also possible to export the saved model as an SAP HANA PAL model: after creating an in-database SAP HANA model and saving it in the correct format, we can export and save the model using the wizard shown in the figure below. The exported procedure, along with its associated objects (tables, types, procedures), appears under the selected schema in SAP HANA.

Share View in PA for output

Chapter 3: Predictive Analysis Applied

Everything starts from initial data exploration: historic and current data sets form the basis of all predictions. The following section discusses the importance and methods of initial data exploration, along with data preparation for predictive analysis. The quality of any analysis is directly dependent on the quality of the input data, in line with the age-old principle of garbage in, garbage out. Simply pushing algorithms onto whatever data has been collected will not give useful predictions, which makes data exploration a crucial, and often difficult, step. Prediction depends on first identifying what data might be useful, then finding out where it might be available, analyzing it, reviewing it to understand and validate it, and proposing the key elements that affect the outcome.

3.1 Initial Data Exploration

There are two types of data: qualitative and quantitative. Qualitative data, also called categorical data in statistics, is mostly expressed in natural language rather than in numbers; examples include 'the text color is red', 'the tallest in the class is Anna', 'male elephant'. These categories generally carry some structure: nominal categories have no natural ordering (race, gender, religion), while ordinal variables have categories that can be ordered in some way (small, medium, large). Numerical measurements expressed in numbers rather than natural language are quantitative data. Some numbers, however, are not continuous or measurable quantities, such as postcodes and tax codes; they can be distinguished by the fact that such numerical data cannot meaningfully be added or subtracted. Quantitative variables are either discrete, such as the number of students in a class, or continuous, such as weight, height or salary. There are further categories such as binary variables (0 or 1, on or off) and date formats, which look like numerical digits. In short, categorical, string or text variables are qualitative, while numeric variables that support algebraic operations are quantitative. The data type is crucial for any analysis: non-numeric values cannot be predicted by linear regression, and some decision trees need input data in a standard format. Data types should be settled before testing, as most statistical tests and analyses are sensitive to them. The data types in PAL generally reflect the underlying database types.
In PA there are string or varchar data types for qualitative variables and only integer and double for quantitative variables. A date data type is also supported, with facilities to convert data types and format them appropriately. PAL provides functions that convert data types; for instance, CONV2BINARYVECTOR converts categorical input data to numeric data for algorithms that accept only numeric input, such as K-Means. A second way to convert categorical variables to numeric ones is to use the Formula component with clauses such as If ('Name' == 'ANNA') Then (1). We may also need to construct new variables from existing ones; for example, the ratio of bank transactions made at the weekend to those made on weekdays may be more informative than the two values used independently. Missing values are also an important consideration in predictive analysis. There can be many reasons for missing values in a data set, such as mistakes, reluctance to provide confidential information, or values that are simply unavailable. They can be handled in different ways: ignored, substituted with values based on similar records, or interpolated from the known data values.

Everyone will agree that data is understood better when represented visually rather than through lists or tables. The figure below demonstrates this: the most frequent calling pairs and groups of callers can be seen immediately in the chart on the right, whereas the same call-traffic data in the table on the left is hard to interpret. The credit for inventing line charts and bar charts goes to William Playfair, who published the first versions of these charts in 1786, with the pie chart following some years later. Once the data has been explored and visualized, we come to the next stage, data preparation for predictive analysis: sampling, scaling and binning, covered in the following sections.

Table versus Chart

3.1.1 Sampling

Sampling is the process of creating subsets of all the data in order to draw inferences about all of it. Statisticians refer to all the data as the population, so this process is also called sampling from the population; it is used where the effort of accessing and interpreting all the data would be too high, and it also helps when some of the data is missing for any reason. Generally, sampling is done to explore the data initially from different angles before choosing one to focus on in detail, as it gives an idea of which data is actually unnecessary for the specific analysis; the assumption is that the sample is representative of all the data. Sampling used as a predictive analysis technique is called cross-validation, and we may also sample simply because the sheer amount of data makes running or training models on the complete data set too time consuming. In its simplest form it involves creating a subset of the data to build a model and then using the left-over data to test the model; this implies that the test data must be held out of the initial model building so that we can compare later and see how good our analysis and prediction are.
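A minimal R sketch of this holdout idea is shown below; the 70/30 split and the data frame are illustrative assumptions, and the modelling step is left abstract since any of the algorithms discussed in this chapter could be plugged in.

# Sketch: a simple holdout (train/test) split, assuming a data frame 'stores'
# like the one used in the earlier clustering sketches.
set.seed(1)
n          <- nrow(stores)
train_rows <- sample(seq_len(n), size = round(0.7 * n))  # 70% for training
train_set  <- stores[train_rows, ]
test_set   <- stores[-train_rows, ]                      # held-out 30% for testing
# A model is then fitted on train_set only and evaluated on test_set,
# so the accuracy estimate is not flattered by data the model has already seen.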
Very often a model performs much more poorly on new data than on the data it was built from; this is the problem of overfitting, where a model is adjusted until it is excellent on the training data but poor on the test data. Overfitting means obtaining a predictive model that is so good on the data set it was trained on that it becomes useless on other samples of the data. For example, with a data set of 100,000 rows you might take a sample of 10,000 to train a model and obtain a predicted accuracy of 97%; that looks good, but if running the model on another (control) sample gives an accuracy below 50%, the model was overfit. Sampling ranges from a simple train/test split of the data, called holdout, through to k-fold cross-validation, where the data set is divided into k subsets and the holdout method is repeated k times: in each iteration one of the k subsets is used as the test set and the remaining k-1 subsets are combined to form the training set, and the results are averaged across all k trials. A different approach is to randomly divide the data into a test set and a training set k times. In PAL there are eight sampling methods, invoked from the sampling function. Sampling is thus easy to use and several methods are available, but the underlying issue remains that you are not looking at the whole data set: what you gain in processing time, you lose in data coverage.

1. First N
2. Middle N
3. Last N
4. Every Nth
5. Simple random with replacement
6. Simple random without replacement
7. Systematic sampling
8. Stratified sampling

Simple random sampling with replacement selects N records or N% of all records at random with replacement, i.e. a selected record is returned to the data and may be selected again. Simple random sampling without replacement works the same way except that a selected record cannot be returned to the data for further selection. Systematic sampling imposes a little structure on random sampling and is sometimes called interval sampling: we first determine a sampling interval k and then step through the population in increments of k. For instance, to select a sample of 50 from a population of 600 we would use a sampling interval of 600 ÷ 50 = 12, so k is set to 12 and one record is chosen from every 12 records, giving 50 records in the sample. A random number between one and k gives the first record included in the sample and is referred to as the random start. The 25 records that make up the population used in the example below are shown in the figure. The code below shows the main elements of the PAL SQLScript for systematic sampling; the full code is available in the file SAP_HANA_PAL_SAMPLING_SYSTEMATIC_Example_SQLScript on the SAP PRESS website. The control parameter SAMPLING_METHOD is set to 6, which corresponds to systematic sampling (PAL numbers the methods from zero). With the sampling size set to 5, the output shown in the figure below contains 5 records at an interval of 5, the starting value having been chosen randomly as 2.
// The procedure generator
CALL SYSTEM.afl_wrapper_generator('SAMPLING_TEST', 'AFLPAL', 'SAMPLING', PDATA);
// The control table parameters
INSERT INTO #CONTROL_TAB VALUES ('SAMPLING_METHOD', 6, null, null);
INSERT INTO #CONTROL_TAB VALUES ('SAMPLING_SIZE', 5, null, null);
// Assume the data as in the figure below and call the procedure
CALL SAMPLING_TEST(DATA_TAB, "#CONTROL_TAB", RESULT_TAB) WITH OVERVIEW;
SELECT * FROM RESULT_TAB;

One more sampling method available in PAL is stratified sampling, which aims to identify attributes that divide a data set or population into subpopulations, or strata, in such a way that the sample selected from the population can still be considered representative of it. Stratified sampling takes samples from each stratum of the population, with the requirement that the proportion of each stratum in the sample is the same as in the population. These methods apply mainly when the population is heterogeneous but homogeneous subpopulations can be identified; when the data is homogeneous, simple random sampling is more appropriate.

Input & Output Systematic Sampling

PA allows data to be sampled both in-database in SAP HANA and on non-SAP HANA data sources, using the Sample component available in the Data Preparation tab of the Predict view.

The Sample Component in PA

3.1.2 Scaling

Data is scaled before running predictive algorithms to make sure that every variable in the model gets equal weight and emphasis as an input, which requires a common data scale for all the variables. For instance, we can scale all the input data to lie within a selected range, such as -2.5 to 2.5 or 0.4 to 2.9. Another scaling approach, close to normalization or standardization, uses a z-score: a variable is rescaled to have a mean of zero and a standard deviation of one. The calculation subtracts the mean of the variable from the value of each record, giving a standardized variable with mean zero, and divides by the standard deviation, which results in a standard deviation of one. In simple terms, a value of 2 indicates that the record lies two standard deviations above the mean, while a value of -3 indicates a record three standard deviations below the mean. Scaling, or normalization, is vital for algorithms that involve neural networks or distance measurements, such as nearest-neighbor classification and clustering, where differently scaled data (a few variables in the millions and some in the tens or hundreds) can influence the analysis considerably; hence the need for a common scale among the numeric variables. PAL supports three methods, which can be called using SCALINGRANGE:

- Min-max normalization
- Z-score normalization
- Normalization by decimal scaling

As an example of scaling, assume we have the data table DATA_TAB in HANA as shown in the figure below. To scale this data to the range A to B, the formula is (B - A) * (Xi - Min Xi) / (Max Xi - Min Xi) + A, which with A = 0 and B = 1 simplifies to (Xi - Min Xi) / (Max Xi - Min Xi). The scaling procedure can be called with the following PAL SQLScript.
// Calling the procedure
CALL SCALINGRANGE_TEST(DATA_TAB, "#CONTROL_TAB", RESULT_TAB) WITH OVERVIEW;
SELECT * FROM RESULT_TAB;

Scaling types and their results compared

The figure above shows the result of scaling the data to the range 0 to 1 by setting the parameter NEW_MAX to 1 and NEW_MIN to 0, i.e. maximum value 1 and minimum value 0. The other scaling method in PAL is z-score normalization, invoked by setting the scaling-method control parameter to 1 and then choosing one of three variants: mean and standard deviation, mean and mean absolute deviation, or median and median absolute deviation. Taking Z_SCORE_METHOD as zero with the same input table as in the previous example gives the results shown in the figure above, where the data is scaled to a mean of zero with a standard deviation of one. We can also scale with the Normalization component in PA, available in the Data Preparation tab of the Predict view; scaling can be done on both SAP HANA and non-SAP HANA data sources. The figure below gives an overview of the Normalization component in PA, which uses the built-in normalization function.

Normalization Component in PA

3.1.3 Binning

Binning is used to summarize or group data for better visualization when the data volume to be analyzed is large. Constructing a histogram, for instance, is not possible without first binning the data, and visualizing a huge number of data points requires binning first, which may then call for subsequent interactive drill-down. Binning is generally done before running a predictive algorithm as an attempt to reduce the complexity of the model. Complex models are difficult to understand, so the aim is parsimony: the simplest model with few variables. Binning of numeric data is also called discretization of continuous data, and it is important to do it effectively, otherwise it leads to overly complex models. Imagine trying to construct a decision tree on variables with a huge range of distinct numbers, where each branch of the tree considers every individual number: the result is a complex decision tree that is difficult to use.

Before discussing the binning functions available in PAL, it is worth knowing that PAL also offers three methods of smoothing. Noise is random error or variance in a measured variable; given a numerical attribute such as age, how can we 'smooth' out the data to remove the noise? Data smoothing uses an algorithm to remove noise from a data set, allowing important patterns to stand out. Random, random-walk, moving-average, simple exponential, linear exponential and seasonal exponential smoothing are common smoothing approaches in data mining. PAL's three bin-based methods are: smoothing by bin means, where the bin mean replaces all the values in the bin; smoothing by bin medians, where the bin median replaces all the values in the bin; and smoothing by bin boundaries, where each value is replaced by the closest boundary value, the boundaries being the minimum and maximum values of the bin.

Binning is also a way of finding thresholds on continuous variables. For example, a person's income might affect where the person goes on holiday, with a threshold at, say, 500K a year: those earning less stay in Norway, those earning more go abroad. If you use the continuous variable in a decision tree, the threshold is difficult to spot, but if you bin the data into 'less than 500K' and 'more than 500K' it becomes easy.
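A minimal R sketch of exactly this kind of threshold binning is shown below; the income values and the 500K cut-point are the illustrative numbers from the example above.

# Sketch: binning a continuous variable at a threshold, as in the income example.
income <- c(320, 480, 510, 750, 1200, 90, 560)   # hypothetical incomes in thousands
bins <- cut(income,
            breaks = c(-Inf, 500, Inf),
            labels = c("less than 500K", "more than 500K"))
table(bins)   # how many records fall into each bin
# cut() can also produce equal-width bins, e.g. cut(income, breaks = 4),
# which roughly corresponds to PAL's 'equal widths based on the number of bins' method.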
The problem is that we often do not know beforehand what binning strategy should be applied, i.e. how many groups, and with what rules. PAL supports the following four binning methods:

- Equal widths based on the number of bins
- Equal widths based on the bin width
- Equal number of records per bin
- Mean/standard deviation bin boundaries

Let us implement an example of this binning process on the table DATA_TAB shown in the figure below. The main elements of the SQLScript are shown in the code below, while the full code is available in the file SAP_HANA_PAL_BINNING_Example_SQLScript on the SAP PRESS website.

Input Output tables for Binning table in PAL

// The procedure generator
CALL SYSTEM.afl_wrapper_generator('BINNING_TEST', 'AFLPAL', 'BINNING', PDATA);
// The control table parameters
INSERT INTO #CONTROL_TAB VALUES ('BINNING_METHOD', 0, null, null);
INSERT INTO #CONTROL_TAB VALUES ('SMOOTH_METHOD', 0, null, null);
INSERT INTO #CONTROL_TAB VALUES ('BIN_NUMBER', 4, null, null);
// Assume the data as shown in table DATA_TAB in the figure; call the procedure
CALL BINNING_TEST(DATA_TAB, "#CONTROL_TAB", RESULT_TAB) WITH OVERVIEW;
SELECT * FROM RESULT_TAB;

In this example the binning method (numbered from zero) is equal widths based on the number of bins, here set to 4, and smoothing is done by bin means. The bin width is calculated as (max - min) / k, which in this case is (38 - 6) / 4 = 8, so the bin ranges are >=6 to <14, >=14 to <22, >=22 to <30, and >=30 to <=38. The first bin therefore receives the values 6, 12, 13 and 10, which have a mean of 10.25; the second bin contains only the value 15; the third bin holds the values 23, 24 and 25, with a mean of 24; and the last, fourth bin contains the values 30, 32 and 38, with a mean of 33.33. The results are shown in the figure above. We can choose whichever binning method appeals most in a particular scenario, usually found by trial and error; the best approach is to try all the binning and smoothing methods and observe the impact of each on the model. If the model is robust to changes in binning method, the choice does not matter much; if the model solution does vary significantly, it is always worth analyzing the reason by looking at the data in detail.

3.1.4 Outliers

Outliers in the data can significantly influence an algorithm's performance, a model's parameters and the confidence in its predictions, so a crucial part of initial data exploration is to check for outliers or unusual values in the data set, understand their cause and decide what to do about them. Some algorithms are more sensitive to outliers than others. An outlier is an observation that deviates so much from the other observations as to arouse suspicion that it was generated by a different mechanism. Outliers can easily be seen in scatter plots, although these are difficult to scale to large data volumes. The box plot, shown in the figure, is the most popular visualization for outlier detection. Box plots are an excellent tool for conveying location and variation information in data sets, particularly for detecting and illustrating location and variation changes between different groups of data. The box spans the upper and lower quartiles on the y-axis scale, with a line inside the box representing the median value.
Fences at the top and bottom of the box mark a multiple of the inter-quartile range. A single box plot can be drawn for one batch of data with no distinct groups; alternatively, multiple box plots can be drawn together to compare multiple data sets or groups within a single data set. For a single box plot the width of the box is arbitrary; for multiple box plots the width can be set proportional to the number of points in the given group or sample (some software implementations simply set all the boxes to the same width). The dots plotted outside the fences represent the outliers. Data volumes and dimensionality affect both outlier detection and its visualization. PAL offers specific algorithms for outlier detection, namely the Variance Test, the Inter-Quartile Range Test, the K Nearest Neighbor Outlier Test, and Anomaly Detection using cluster analysis.

The Inter-Quartile Range test is a simple and popular test for outlier detection and is the basis of the very useful box plot. It is also a robust test, in that the outliers do not themselves affect the statistics of the test, as opposed to the Variance test, where the outliers clearly affect the limits, given that these are measured in terms of standard deviations. That is the weakness of the Variance test, although its simplicity keeps it popular. The K Nearest Neighbor test looks for local outliers, as opposed to global outliers, which is very useful because local outliers are often harder to find, being less obvious. Its weakness is that the value of K may affect the solution, although this can be mitigated by exploring the solution with several values of K; a further weakness is that by specifying the number of outliers you ensure that you get that number, and some of them may not really be outliers.
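For illustration, the sketch below applies the inter-quartile-range rule in plain R, flagging values beyond 1.5 times the IQR from the quartiles; the data vector is made up, and 1.5 is the conventional fence factor rather than a PA setting.

# Sketch: Tukey's inter-quartile range rule for outliers on a made-up vector.
x   <- c(12, 14, 15, 15, 16, 17, 18, 19, 20, 55)   # 55 looks suspicious
q   <- quantile(x, probs = c(0.25, 0.75))
iqr <- q[2] - q[1]
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr
outliers <- x[x < lower_fence | x > upper_fence]
outliers          # returns 55
boxplot(x)        # the same rule drawn as the familiar box plot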
3.2 Which Algorithm When

Chapter 2 of this report listed the large number of algorithms provided by PAL, and an obvious question is: of all the algorithms available, which one should be used when? Deciding which algorithm will give the desired result and analysis is a daunting task for new users, and it becomes even harder when the 3,500-plus packages in R are considered. This section discusses the criteria and main factors to consider when selecting an algorithm, looks at the accuracy of an algorithm, and ends with a general set of rules for selecting an efficient, appropriate algorithm. The basic questions that drive the choice of algorithm are:

- What is the purpose of the analysis and what do you want to see as a result? For example: group the data, look for associations in the data, or predict a series of data values.
- What data do you have and what are the attributes of that data? For example: numeric, categorical, Boolean, etc.

The answers to these questions help you find the algorithm that best fits your purpose. The table below, adapted from the SAP Predictive Analysis book, lists some common tasks with the corresponding algorithm category and example algorithms, which makes it easier to pick an algorithm: if we want to look for unusual values (outliers) we can use the Variance test or the Inter-Quartile Range test, and to build a predictive model of one variable using the data of other variables we can use decision trees, neural networks or regression models. The second main factor, the kind of data and its attributes, is equally crucial, as some algorithms work only on numeric data, others only on categorical data, and others can be adapted to support both. The table can help, but we can also classify the algorithms by the five main classes of application in PA, which gives a better understanding of their purpose and use:

1. Association analysis, looking for associations or affinities in the data.
2. Segmentation or cluster analysis, segmenting or grouping the data into similar clusters.
3. Classification analysis, classifying or predicting new data based on a model built by an algorithm. This is the largest group of algorithms in PA; it predicts a variable using the data of other variables that are believed to affect the values of the variable we are trying to predict.
4. Time-series analysis, using data with an inherent periodicity to predict values for future time periods.
5. Outlier analysis, looking for unusual values in the data.

Task | Algorithm Category | Example Algorithms
Summary statistics | Descriptive statistics | Mean, median, variance…
Outlier detection | Statistical tests | Variance test, IQR test, anomaly detection…
Preparation of the data for analysis | Data preparation | Sampling, scaling, binning…
Statistical inference | Sampling theory | T tests, F tests, ANOVA…
Relationships, cause and effect | Correlation and regression | Multiple linear regression, non-linear regression…
Clustering or grouping data | Cluster analysis | ABC Analysis, K-Means, Kohonen SOMs…
Time series forecasting | Time series analysis | Exponential smoothing, regression…
Association or affinity analysis | Association analysis | Apriori
Prediction, model building | Classification analysis | Decision trees, neural networks, regression…
Social network analysis | Network analysis | Jaccard’s coefficient, common neighbors…
Optimization | Optimization | Linear and non-linear programming
Risk analysis, modelling | Simulation | Monte Carlo analysis

Algorithm Categories with tasks and examples

In association analysis, the most common and powerful algorithm is Apriori, which is discussed in detail later in this chapter. In the second group, segmentation, the most popular algorithm is K-Means, known for its simplicity. Classification analysis is the largest group of algorithms, which indicates its importance in PA; it is further sub-divided into three groups: regression algorithms, decision-tree algorithms and neural-network algorithms. Regression is essentially the fitting of a model, linear or non-linear, of the form Y = f(X1, X2, …, XN), where Y is the dependent variable and the Xi are the independent variables, so as to minimize the difference between the fitted data and the actual data; bivariate linear and non-linear, multiple linear, polynomial and logistic regression are the main regression algorithms. Decision trees recursively partition the data, starting with the most divisive split of the input variable values with respect to the target variable, and continue until one of several stopping criteria is met; the result defines the relationships between the input and target variables.
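Before turning to the remaining sub-groups, here is a minimal R sketch of the regression idea just described: fit Y as a function of several X variables and minimize the difference between fitted and actual values. The data frame and column names are the hypothetical store variables used in the earlier sketches.

# Sketch: multiple linear regression of the form Y = f(X1, X2, X3),
# assuming the hypothetical 'stores' data frame from the earlier sketches.
model <- lm(Turnover ~ Margin + Staff + Size, data = stores)
summary(model)          # coefficients, R-squared, F statistic
fitted(model)[1:5]      # fitted values for the first five stores
residuals(model)[1:5]   # the differences the fit tries to minimize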
Neural-network algorithms are closer to the way the human brain processes information. Two neural-network algorithms sourced from R, the monmlp package and the nnet package, are supported in PA. Functionally they work by simulating a large number of interconnected simple processing units arranged in layers (an input, a hidden and an output layer) with varying connection strengths, or weights. The network adapts by analyzing individual records, generating a prediction for each record and adjusting the weights whenever it makes an incorrect prediction. K Nearest Neighbor is the final sub-category of classification algorithms; it predicts or classifies objects based on their similarity or closeness to other objects, with the prediction calculated as an average classification. Time-series algorithms are significant because business applications need time-series forecasting; the data is generally constant, trending or seasonal, so smoothing goes hand in hand with it. Outlier-analysis algorithms form the last group and seek unusual values; the best-known algorithm here is the Inter-Quartile Range test, which also underlies the box plot, while the Variance test is also commonly used, following the simple idea that unusual data lies far from the average of the data. With this knowledge we can start applying algorithms on a trial-and-error basis to find the one best suited to our purpose, but it is still advisable simply to try all the algorithms in the same group and see which provides the best fit for the analysis. The table below summarizes this discussion.

Class of Problem and Algorithm Group | Input or Independent Variables | Output or Target or Dependent Variable | Algorithms
Association | Categorical | Categorical: association rules with support, confidence and lift | Apriori, Apriori Lite
Cluster | Numeric | NA: cluster groupings, cluster quality | K-Means, ABC Analysis, Kohonen SOMs
Classify - Regression | Numeric | Numeric: best-fit regression equation | Multiple Linear & Non-Linear Regression
Classify - Regression | Numeric/Categorical | Numeric/Categorical: best-fit logistic curve, probabilities of outcomes | Logistic Regression
Classify - Decision Trees | Numeric/Categorical | Numeric/Categorical: decision tree and rules with confidence level | C4.5, CHAID
Classify - Neural Networks | Numeric/Categorical | Numeric/Categorical: black-box model for prediction | Neural Network
Classify - Other | Numeric | Numeric/Categorical: classification of new data | K Nearest Neighbor
Time Series Analysis | Numeric | Numeric: best fit and projected values | Exponential Smoothing, Regression
Outlier Detection | Numeric | NA: detected outliers | IQR, Variance Test, Anomaly Detection

To check which algorithm works best for our problem, the easy and logical approach is to run all the applicable algorithms on the input data and choose the best one; but what decides what is best, and how is it measured? The answer differs for each group, as we cannot compare two algorithms from different groups. For association analysis, the choice is between Apriori and Apriori Lite. Apriori Lite, being a subset of Apriori, is restricted to finding single pre- and post-item rules; the choice therefore depends on the rule requirements and on performance, as Apriori Lite will be faster than the generic Apriori but is restricted in terms of the rules extracted from the data.
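To make the association-analysis option concrete, here is a minimal sketch using the open-source arules package, whose apriori() function implements the same family of algorithm; the transactions are made up, and the support and confidence thresholds are arbitrary illustration values rather than PA defaults.

# Sketch: market-basket style association rules with the arules package.
library(arules)

baskets <- list(c("bread", "butter"),
                c("bread", "butter", "jam"),
                c("beer", "chips"),
                c("bread", "jam"))
trans <- as(baskets, "transactions")            # convert to transaction format
rules <- apriori(trans,
                 parameter = list(supp = 0.25, conf = 0.6))
inspect(sort(rules, by = "lift"))               # rules ranked by lift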
For cluster analysis, identifying the better algorithm is harder: in ABC Analysis, for example, different values of A, B or C cannot be judged good or bad, the user can only find the values that work best, so there is no single best model. The K-Means algorithm may analyze clusters less well than Kohonen Self-Organizing Maps (Kohonen SOMs), but it is easier to understand and flexible, while with Kohonen SOMs the number of clusters is not fixed in advance. It is therefore sensible to try both K-Means and Kohonen SOMs, with varying cluster numbers, and explore the solutions in order to decide which is the most appropriate for the application. For time-series analysis, the same measures of model quality apply as for numeric predictions in classification analysis, except that the analysis is based on time periods.

For the outlier tests, the Variance test and the Inter-Quartile Range (IQR) test help to find overall (global) outliers in the data set. The Variance test is simple and well known, but the outliers themselves influence the analysis; an algorithm based on the median and quartiles, the IQR test, is therefore more popular, since an identified outlier is not affected by the other outliers. Local outliers in the data set can easily be found by the Anomaly Detection algorithm.

For classification models, it is more logical to compare algorithms according to whether their output is numeric or categorical. For numeric predictions, the residual error, the sum of squares of actual minus fitted values over all data points, is the most common measure, usually presented as the MSE (mean squared error); the RMSE (root mean squared error) scales the error back to the units of the original data. Statistical measures of goodness of fit, such as R squared, analysis of variance or the F value, are used for regression-based numeric predictions. Categorical predictions are generally evaluated with classifier confusion matrices, which show how often each category is predicted correctly and how often incorrectly; model-quality measures such as sensitivity (the true positive rate) and specificity (the true negative rate) are derived from these matrices. Gain and lift charts are plotted to compare model and algorithm performance for binary classification models.

Based on the discussion above, the following rules provide a starting point for comparing algorithms, their performance and their applications, according to the user's requirements.

If the objective is to find associations in the data:
Use Apriori if all multiple-item associations are required.
Use Apriori Lite if only single pre- and post-item rules are required.
Use Apriori Lite with sampling if performance with Apriori is too slow.

If the objective is to find clusters or segments in the data:
Use ABC Analysis if the cluster sizes are user defined.
Use K-Means if the desired number of clusters is known.
Use Kohonen SOMs if the number of clusters is unknown.

If the objective is to find outliers or unusual values:
Use the Variance and IQR tests if you are looking for global outliers.
3.3 Challenges & Resolutions

This section discusses four common difficulties faced in any predictive analysis process:
1. Cause and effect
2. Lies, damned lies and statistics
3. Model overfitting
4. Correlation between independent variables (multicollinearity)

We cannot always conclude that a relationship is a cause-and-effect relationship just because we find a good mathematical relationship between variables; not every mathematical relationship can be interpreted causally. If we plot the number of jobs in the market against the number of new cars bought in a city, we may see a mathematical relationship, but we cannot conclude that every car sold creates a new job.

The second challenge, lies and statistics, can be understood through an example known as Anscombe's quartet, which highlights the danger of looking only at statistical measures. Anscombe used four datasets with nearly identical summary statistics to demonstrate the importance of plotting data before analyzing it, and the impact of outliers: the four sets look very different from one another when plotted. The first dataset appears well behaved, with a clean, well-fitting linear model y = 3 + 0.5x, a mean of X of 9 and a mean of Y of about 7.5. The second dataset is clearly non-linear, yet strangely it yields the same fitted line y = 3 + 0.5x and the same R squared of about 0.67. The third dataset does have a linear relationship, but the regression is thrown off by an outlier; if the outlier were spotted and removed before fitting, it would be easy to obtain the correct linear model. The last dataset does not fit any kind of linear model, but a single extreme point keeps the alarm from going off and produces what looks like a reasonable fit. The lesson is that it is wise to understand and visualize data before applying any algorithm. The graphs and the four data sets used are shown below.

x (I) | y (I) | x (II) | y (II) | x (III) | y (III) | x (IV) | y (IV)
10.0 | 8.04 | 10.0 | 9.14 | 10.0 | 7.46 | 8.0 | 6.58
8.0 | 6.95 | 8.0 | 8.14 | 8.0 | 6.77 | 8.0 | 5.76
13.0 | 7.58 | 13.0 | 8.74 | 13.0 | 12.74 | 8.0 | 7.71
9.0 | 8.81 | 9.0 | 8.77 | 9.0 | 7.11 | 8.0 | 8.84
11.0 | 8.33 | 11.0 | 9.26 | 11.0 | 7.81 | 8.0 | 8.47
14.0 | 9.96 | 14.0 | 8.10 | 14.0 | 8.84 | 8.0 | 7.04
6.0 | 7.24 | 6.0 | 6.13 | 6.0 | 6.08 | 8.0 | 5.25
4.0 | 4.26 | 4.0 | 3.10 | 4.0 | 5.39 | 19.0 | 12.50
12.0 | 10.84 | 12.0 | 9.13 | 12.0 | 8.15 | 8.0 | 5.56
7.0 | 4.82 | 7.0 | 7.26 | 7.0 | 6.42 | 8.0 | 7.91
5.0 | 5.68 | 5.0 | 4.74 | 5.0 | 5.73 | 8.0 | 6.89
Four Data Sets in Anscombe's Quartet
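As a quick check (illustrative Python, independent of PA), the quartet's near-identical summary statistics can be reproduced directly from the table above:

import numpy as np

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # x values shared by sets I-III
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]          # x values of set IV
y = {
    "I":   [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    "II":  [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    "III": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    "IV":  [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
}

for name, ys in y.items():
    xs = np.array(x4 if name == "IV" else x123, dtype=float)
    ys = np.array(ys)
    slope, intercept = np.polyfit(xs, ys, 1)     # least-squares fitted line
    r = np.corrcoef(xs, ys)[0, 1]
    # Every set gives mean(y) ~ 7.5, a line ~ y = 3 + 0.5x and R^2 ~ 0.67,
    # even though the four plots look completely different.
    print(name, round(ys.mean(), 2), round(intercept, 2), round(slope, 2), round(r ** 2, 2))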
Overfitting describes a condition where the data under analysis fits a model "too well": the model describes the sample almost perfectly but is too rigid to fit any other sample, which makes it fit poorly on new data and therefore fail to serve our predictive needs. Overfitting especially needs to be watched when the sample size is small or the data is limited in some way; it is the phenomenon where the predictive model describes the relationship between the predictors in the sample only, and fails to provide valid predictions for new data. It generally arises from high expectations and the desire for accuracy, which tempt us to fit the sample data extra well by introducing too many input variables; it is often a case of a model having too many parameters compared to the number of data points. Holding back test data and analyzing the model from every angle is therefore crucial when building a predictive model that is accurate and stable over time. The figure below illustrates the idea with two graphs based on the same data points. The graph on the left is doing a decent job: it captures the general nature of the relationship between the X and Y variables. The graph on the right is clearly trying too hard to capture every subtle change in the relationship between the two variables; the model on the left will outperform the model on the right when new data points are fed in, because the right-hand model cannot generalize to data it has not seen before. To avoid overfitting, the usual advice is to use a proportion of the available data to train the model and the rest, the unseen or hold-out data, to test it. This is a key methodology in PA and an important one in classification analysis and time series analysis; a small illustration follows the figure.

Process of Overfitting the Models
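The following sketch (illustrative Python, not part of PA; the data is synthetic) shows the effect described above: a flexible high-degree polynomial fits the training sample better than a straight line but does worse on the held-out data.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = 2.0 * x + rng.normal(0, 0.2, size=x.size)   # truly linear relationship plus noise

# Hold out a third of the points for testing
test = np.zeros(x.size, dtype=bool)
test[::3] = True
x_tr, y_tr, x_te, y_te = x[~test], y[~test], x[test], y[test]

def train_and_test_rmse(degree):
    coeffs = np.polyfit(x_tr, y_tr, degree)      # fit on training data only
    pred_tr = np.polyval(coeffs, x_tr)
    pred_te = np.polyval(coeffs, x_te)
    return (np.sqrt(np.mean((y_tr - pred_tr) ** 2)),
            np.sqrt(np.mean((y_te - pred_te) ** 2)))

# Degree 1 (simple) vs degree 9 (overfitted): the overfitted model wins on the
# training sample but typically loses on the hold-out sample.
print("degree 1:", train_and_test_rmse(1))
print("degree 9:", train_and_test_rmse(9))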
'Multicollinearity' is a problem that appears when fitting a regression model or another linear model. It refers to predictors that are correlated with other predictors in the model. Unfortunately, the effects of multicollinearity can feel vague and intangible, which makes it unclear how, and even whether, it should be fixed. Statisticians define multicollinearity as a strong correlation between two or more independent variables; because of the linear relationship between such predictors, it is difficult to separate their individual effects on the dependent variable. Parameter estimates may change significantly in response to small changes in the model or the data. Multicollinearity affects the calculations for individual predictors without reducing the predictive power or reliability of the model as a whole, at least within the sample data itself: a multiple regression model with correlated predictors can still show how well the bundle of predictors predicts the outcome variable, but it will not always produce valid results about any individual predictor, or about the extent to which predictors are redundant with respect to each other. Some degree of multicollinearity is normal, but at higher levels it becomes a problem because the variance of the coefficient estimates increases, which makes the estimates very sensitive to minor changes in the model. The main sources of multicollinearity are the data collection method, constraints in the population, the model specification, and an over-fitted or over-defined model. Multicollinearity cannot be removed completely, but it can be reduced by several remedial measures, such as collecting additional or new data, re-specifying the model, ridge regression, or a data reduction technique like principal component analysis.

Examples of Multicollinearity

The figure above shows two variables X1 and X2 that are highly positively correlated; the correlation coefficient between them, computed from the data on the left, is 0.9771. Finding a model that describes the relationship between Y and the independent variables X1 and X2 is difficult, because the two predictors can hardly be distinguished, and this high degree of multicollinearity again inflates the variance of the coefficient estimates and makes them very sensitive to minor changes in the model. As mitigation, collecting more data helps but cannot solve the problem completely. Omitting one of the correlated variables is another possible approach, if you can decide which variable to drop, at the risk of ignoring the real causal variable.

Chapter 4: Cluster & Association Analysis Explored

Although all the groups of algorithms and analysis techniques supported by PA are important in their own right, with the time limit in mind I decided to go through only two classes of analysis techniques in detail instead of covering all of them briefly. It was quite interesting to go through the code behind these algorithms and their neat implementation in the SAP PA interface.

4.1 Association Analysis

As the name suggests, association analysis, also known as affinity analysis, looks for associations between objects. The output of this analysis generally takes the form of rules such as "if item A is purchased by a customer, there is a very high probability that they also purchase items B and C", "75% of those who buy comics online also buy music online", or "60% of those who have high blood pressure and are overweight have high levels of cholesterol". Following the original definition by Agrawal, the problem of association rule mining is defined as follows. Let I = {i1, i2, ..., in} be a set of n binary attributes called items. Let D = {t1, t2, ..., tm} be a set of transactions called the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form X→Y, where X, Y ⊆ I and X ∩ Y = ∅.
The sets of items (itemsets for short) X and Y are called the antecedent (left-hand side, LHS) and consequent (right-hand side, RHS) of the rule respectively. The quality of such rules can be quantified. Rule support is the number of transactions in which the rule holds true divided by the total number of transactions; the support of an itemset is the percentage of the data set that contains that itemset. Rule confidence is the related statistical measure of how reliably the left-hand side predicts the right-hand side: in our example, the number of baskets in which both A and B occur divided by the number of baskets containing A, expressed as a percentage. Lift is the confidence of the rule divided by the support of the consequent; it is the ratio of how often B is bought together with A to how often B is bought independently. This is a useful measure because it gives a better picture of the association than rule support alone: when the lift value is greater than one, B is bought with A more often than expected, otherwise it is not. We elaborate on these statistical terms and their calculation with examples later in this report. Since the calculations themselves are simple, the challenge lies in performance, because the data under analysis is usually huge. Interpreting the results is also difficult without deep business domain knowledge, as rules can easily turn out to be either trivial or apparently nonsensical associations. Association analysis is most often called market basket analysis, after its most common application of finding rules about products sold together in a supermarket. Using the data gathered from baskets, i.e. lists of products sold together, we can analyze patterns or strong relationships between products to recommend product placement in the store, suggest additional purchases to buyers, or identify unusual combinations for fraud management. Different objective measures define different association patterns with different properties and applications; for instance, the purchase of an electronic device that does not include batteries often implies the purchase of batteries or a charger.

Apriori principle: if an itemset is frequent, then all of its subsets are frequent.

4.1.1 Applications of Association Analysis

Netflix, for example, predicts movies of interest to you based on your previous ratings compared with the watching patterns of other users. Associations generally depend on finding patterns that can be evaluated through subjective arguments: a pattern is considered uninteresting for data analysis if it does not reveal unexpected information about the data or provide new knowledge that can lead to profitable actions. Including subjective knowledge in pattern evaluation requires considerable effort, knowledge from domain experts, and an extensive amount of prior information from historic data. Pattern evaluation becomes more challenging when partial associations among items within the pattern are present; for instance, some associations appear and disappear when conditioned on the values of certain items. The three basic measures are:

Support(X) = number of transactions containing the itemset X / total number of transactions
Confidence(X → Y) = Support(X ∪ Y) / Support(X)
Lift(X → Y) = Support(X ∪ Y) / (Support(X) × Support(Y))
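To make the three measures concrete, the short sketch below (illustrative Python, independent of PAL; the five baskets are invented) computes support, confidence and lift for the rule {A} → {B}:

# Five invented market baskets
baskets = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "D"},
    {"A", "B", "D"},
]

n = len(baskets)
support_A  = sum("A" in b for b in baskets) / n          # 4/5 = 0.8
support_B  = sum("B" in b for b in baskets) / n          # 4/5 = 0.8
support_AB = sum({"A", "B"} <= b for b in baskets) / n   # 3/5 = 0.6

confidence = support_AB / support_A                       # 0.75
lift = support_AB / (support_A * support_B)               # about 0.94: B is bought
                                                          # slightly more often on its own
print(support_AB, confidence, lift)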
4.1.2 Apriori Association Analysis

Apriori is an influential algorithm for finding associations in market basket or sales transaction data. It outputs Boolean association rules based on the calculation of three statistical values: support, confidence and lift. It identifies the frequent individual items and extends them to larger and larger itemsets as long as those itemsets appear sufficiently often in the data. Apriori is designed to handle databases that hold transactional data, such as lists of items bought by customers or details of website visits. The algorithm splits association rule generation into two separate steps:

1. A minimum support threshold is applied to find all frequent itemsets in the database.
2. These frequent itemsets, combined with a minimum confidence constraint, are used to form the output rules.

Let us take the example dataset shown above to illustrate the three measures. Support is the ratio of the number of baskets that support the rule, i.e. in which the combination exists, to the total number of baskets, expressed as a percentage. Note that support is bidirectional: 'if 10 then 20' has the same support as 'if 20 then 10'. Confidence is the ratio of the number of baskets in which both item 1 and item 2 exist to the number of baskets containing item 1, expressed as a percentage; unlike support, confidence is not bidirectional. Both support and confidence give an idea of a rule's validity, but there are cases where both are high and the resulting rule is still of no use. This shortcoming brings in one more measure of the strength of an association, called lift (or improvement), defined as the ratio of how often item 2 is bought when item 1 is bought to how often item 2 is bought independently. A value less than one indicates that item 2 is more often bought on its own, while a value greater than one indicates that item 2 is often bought together with item 1. The calculations are as shown in the figure; it is easy to conclude that support surfaces the most popular rules, confidence gives the most useful rules, and lift identifies rules that are both useful and popular. Now that we understand the terms used to measure and compare associations in a dataset, we can look at how Apriori association analysis is implemented.

4.1.3 Apriori Association Analysis in PAL

In the SAP Predictive Analysis Library, the function for Apriori association analysis is APRIORIRULE and the algorithm name is Apriori. The input always consists of two variables: the first is the transaction ID, representing the basket ID, and the second is the item ID, signifying the product name. The data type of both variables can be integer, varchar or char. The output comprises two tables. The first contains the association rules, with the leading items (pre-rule, or left-hand side) in the first column and the dependent items (post-rule) in the second column, together with the support, confidence and lift values; some applications combine the pre-rule and post-rule columns to display the complete rule. The second table holds the PMML definition of the Apriori model, i.e. the calculated rules and their measures, when PMML export is requested.
The following figure shows the definition of the parameter table for the Apriori algorithm.

Parameter Table Definition for Apriori

The following text shows the main components of the SQLScript for Apriori. The full code can be accessed from the file SAP_HANA_PAL_Apriori_Example_SQLScript on the SAP PRESS website.

// The procedure generator
CALL "SYSTEM".afl_wrapper_generator('PAL_APRIORI_RULE', 'AFLPAL', 'APRIORIRULE', PDATA);
// The Control Table parameters
INSERT INTO PAL_CONTROL_TAB VALUES ('MIN_SUPPORT', null, 0.01, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('MIN_CONFIDENCE', null, 0.01, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('PMML_EXPORT', 2, null, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('THREAD_NUMBER', 2, null, null);
// Assume the data has been stored in the table PAL_TRANS_TAB
// Calling the procedure
CALL PAL_APRIORI_RULE(PAL_TRANS_TAB, PAL_CONTROL_TAB, PAL_RESULT_TAB, PAL_PMMLMODEL_TAB) with overview;
SELECT * FROM PAL_RESULT_TAB;
SELECT * FROM PAL_PMMLMODEL_TAB;
// Merging Prerule & Postrule
DROP VIEW TMP_RESULT_V;
CREATE VIEW TMP_RESULT_V AS SELECT CONCAT(PRERULE, ' => ') AS PRERULE, POSTRULE, SUPPORT, CONFIDENCE, LIFT FROM PAL_RESULT_TAB;
DROP VIEW RESULT_V;
CREATE VIEW RESULT_V AS SELECT CONCAT(PRERULE, POSTRULE) AS RULES, SUPPORT, CONFIDENCE, LIFT FROM TMP_RESULT_V;
SELECT * FROM RESULT_V;

As discussed, we get two output tables, one containing the rules and their measures and one holding the model as a PMML document. The PMML output can be used to transfer the model rules to a business application such as a recommendation engine.

4.1.4 Strengths & Weaknesses with Apriori Lite

Apriori Lite can be considered an alternative to Apriori, but it is really a specific instance of the Apriori algorithm: it looks for rules with only a single pre-rule item and a single post-rule item. This makes it more efficient and faster, but it can only be applied when we seek one-to-one rules rather than all associations. Another advantage of this algorithm is the possibility of sampling the data. LITEAPRIORIRULE is the function name used to call it in the SAP Predictive Analysis Library, and its input is exactly the same as for full Apriori. The parameter table of Apriori Lite has no MAXITEMLENGTH, but it includes two extra parameters, OPTIMIZATION_TYPE and IS_RECALCULATE. The reason association analysis is so popular is its ability to produce clear results: the calculations are usually so straightforward that anyone in a management position without detailed technical know-how can understand them, which speeds up decision making. One of the biggest drawbacks of Apriori is its heaviness: the amount of computation grows exponentially with the size of the data, so Apriori Lite is an alternative when only one-to-one rules are needed. Sometimes the results are of no value, or even misleading. Yet however many weaknesses we can list for association analysis and its algorithms, in most cases we cannot use PA to its fullest without them, even though there are problem sets that are better solved with regression analysis and never touch Apriori.
4.2 Cluster Analysis

This chapter covers the concepts behind cluster analysis and how it is implemented in PAL, R and SAP PA. Cluster analysis, also referred to as segmentation analysis, is a very popular application of SAP PA. We start with the simplest of the algorithms, ABC Classification, which groups the records in a data set, based on a specific parameter, into a top X%, then a top Y%, and all the rest in Z%, totalling 100%. We then discuss a popular and efficient statistical algorithm called K-Means cluster analysis, and compare it with machine-learning clustering approaches such as self-organizing maps. The task of grouping and classifying goes back to the days of early man, who needed to distinguish edible from poisonous food and tame from wild animals, and we still do it in daily life, for example grouping students at a university by country of origin, previous study results, age or gender, whenever there is a reason to do so. Clearly we understand data better, even when the data set is large, if we break it down into groups. The attribute values that describe the objects are used to assess the dissimilarities between clusters.

4.2.1 Introduction & Applications of Cluster Analysis

Cluster analysis aims to organize and segment data into groups with similar characteristics and features, so that the data within a group closely matches the other data in the same group and differs in some respect from the data in other groups. In other words, objects within one cluster are close together and compact, while the distance between clusters is larger, so they look disparate. Cluster analysis, or clustering, can thus be defined as the job of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar, in some sense or another, to each other than to those in other groups (clusters). We may need such clustering for many reasons, for example identifying people with similar shopping patterns to find better marketing strategies, or grouping movie shows into similar categories based on viewer ratings, which makes cluster analysis intellectually satisfying, profitable, and sometimes both. Cluster analysis is a general concept rather than one particular statistical method or model, such as factor analysis or regression. There are many ways to cluster data into groups, and the choice depends on various factors and requirements; cluster analysis encompasses a variety of algorithms and methods for classifying objects of similar kinds into categories. As an output, cluster analysis finds structures in the data without knowing why they exist. It can be seen as one of the most widely used classes of predictive analysis methods, with diverse applications including criminal pattern analysis, medical research, social services, psychiatry, education, archaeology, astronomy and taxonomy, which makes it ubiquitous and significant for data analysis. Market segmentation is the most talked-about application: it helps to make better decisions by preparing different business plans and promotional offers for different groups of buyers. A nice example of the importance of clustering is the steep decrease in the number of different clothing sizes available in stores: after analyzing many body measurements, a generalized system of sizes was derived, whereby individuals are allocated to specific sizes, i.e. clusters.
4.2.2 ABC Analysis in PAL

ABC Analysis, K-Means and Self-Organizing Maps are the three cluster analysis algorithms supported by the SAP Predictive Analysis Library. ABC Analysis produces three user-defined clusters, while K-Means creates K clusters based on the data memberships. Self-organizing maps use a map, usually an M×N matrix, to map each data item to coordinates on the map; clusters emerge as multiple records get mapped to the same coordinates. ABC, which stands for Artificial Bee Colony Analysis and was first proposed by Karaboga, clusters data according to what each data item contributes to the total, in order to find the top X% of items on characteristic A, the next Y% on characteristic B, and so on. ABC classification hence gives an organization the ability to segregate units into three groups: A, the most important; B, important; and C, the least important. The intention behind classifying items into groups this way is to gain better control over, and understanding of, each item based on its group; its ease of use and simplicity make it popular. The data is first sorted in descending numeric order and then grouped into a first A%, a second B% and a remaining C%, which together total one hundred percent. One weakness of the algorithm is that it does not support more than three groups. The Artificial Bee Colony (ABC) algorithm treats the search space representing the data set as if it were a foraging environment, and each point in this search space corresponds to a food source, i.e. a solution, to be exploited by the artificial bees. This can be useful, but essentially the algorithm sorts the data on a continuous variable, answering questions such as: is my customer among the 20%, 50% or 80% highest-spending customers? The fitness of a solution is represented by the nectar amount of the food source. The algorithm uses three kinds of bees: employed bees, onlooker bees and scout bees.
Employed bees first exploit specific food sources and then forward information about the quality of those food sources to the onlooker bees. The onlooker bees receive this information and choose a particular food source to exploit based on the reported nectar quality; the more nectar a food source contains, the larger the probability that an onlooker bee will choose it [23] [24]. A "limit" parameter controls when an employed bee's food source should be abandoned. Scout bees are responsible for finding new food sources by searching the whole environment. The ABC algorithm can be described by the following steps:

1. Initialization phase: after the control parameters are set, each food source Xi,j in the environment is initialized by the scout bees. The number of food sources equals half the colony size, and the dimension D represents the number of parameters to be optimized.
2. Employed bees phase: employed bees search for food sources with more nectar, i.e. higher fitness, in the neighborhood of the food sources in their memory. When a neighboring food source is found, the employed bee calculates its fitness, a greedy selection is applied between the new and the original food source, and the better one is kept in memory. If the food source improves, the trials counter of that food source is reset to zero, otherwise it is incremented by one.
3. Onlooker bees phase: the onlooker bees, so far resting in the hive, receive the food source information from the employed bees and choose food sources probabilistically based on the reported fitness values. An onlooker bee chooses a food source according to its probability value, which may lead to several onlooker bees choosing the same food source if it has a higher fitness. Once the food sources have been selected, each onlooker bee finds a new food source in its neighborhood and computes its fitness in the same way the employed bees did, so that more onlooker bees are drawn to richer food sources.
4. Scout bees phase: finally, the trials counter of each food source is inspected. If the value exceeds the limit parameter, the food source is abandoned, the bee there becomes a scout bee, a new food source is produced randomly in the search space for it, and its trials counter is reset to zero. The first three phases are repeated until some end criterion is met, and the best food source, showing the best optimal value, is taken as the final solution.

An example is shown in the figure below, where the values of A, B and C are 20%, 30% and 50% respectively. The A segment, 20% of the total value, is accounted for by 5 items out of 70, or 7.1% of all items; the B segment, 30% of the total, is accounted for by 9 items, or 12.9% of all items; and the last segment of 50% is accounted for by 56 items, i.e. 80% of the item population. A sketch of this grouping logic follows the figure.

An Example of ABC Analysis
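The grouping in the example above can be reproduced in a few lines (illustrative Python, independent of PAL; the item values are synthetic): sort the items by value, accumulate their share of the total, and cut at the A and B thresholds.

import numpy as np

rng = np.random.default_rng(1)
values = rng.pareto(2.0, size=70) * 100      # 70 synthetic item values, heavily skewed

order = np.argsort(values)[::-1]             # sort items by contribution, descending
sorted_vals = values[order]
cum_share = np.cumsum(sorted_vals) / sorted_vals.sum()

# A covers the first 20% of total value, B up to 50%, C the remaining 50%
labels = np.where(cum_share <= 0.20, "A", np.where(cum_share <= 0.50, "B", "C"))

for g in "ABC":
    count = int((labels == g).sum())
    print(g, count, "items,", round(100.0 * count / len(values), 1), "% of all items")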
The algorithm name used to call this grouping in PAL is ABC ANALYSIS and the corresponding function name is ABC. The input table has two columns: the item or record names are contained in the first column, and the numeric values to be used for the analysis are stored in the second. The item name has data type char or varchar, while the corresponding value is always input as a double. The parameter table contains four parameters: 'PERCENT_A' (double), the interval for class A; 'PERCENT_B' (double), the interval for class B; 'PERCENT_C' (double), the interval for class C; and 'THREAD_NUMBER' (integer), the total number of threads. The values of A, B and C must always add up to 100, a check PAL performs before calling the algorithm. The output table again has two columns, one holding the item name and the other the assigned value A, B or C, as shown in the figures below.

ABC Analysis Input & Output Tables

The main elements of the SQLScript are as follows, with the control parameters set as A = 35%, B = 20% and C = 45%. The full code is available in the file SAP_HANA_PAL_ABC_Example_SQLScript on the SAP PRESS website.

// The procedure generator
Call SYSTEM.afl_wrapper_generator('PAL_ABC', 'AFLPAL', 'ABC', PDATA);
// The Control Table parameters
INSERT INTO #CONTROL_TBL VALUES ('PERCENT_A', null, 0.35, null);
INSERT INTO #CONTROL_TBL VALUES ('PERCENT_B', null, 0.20, null);
INSERT INTO #CONTROL_TBL VALUES ('PERCENT_C', null, 0.45, null);
INSERT INTO #CONTROL_TBL VALUES ('THREAD_NUMBER', 1, null, null);
// Assume the data has been stored in table TESTABCTAB
// Calling the procedure
CALL PAL_ABC (TESTABCTAB, "#CONTROL_TBL", RESULT_TBL) with overview;
SELECT * FROM RESULT_TBL;

4.2.3 K-Means Cluster Analysis in PAL

The K-Means algorithm is one of the best-known predictive analysis algorithms and is very popular for cluster analysis, as it efficiently clusters the records or observations into K clusters such that each record belongs to the cluster with the nearest mean. The algorithm works on continuous data and can be applied in many kinds of domains. Because K-Means needs initial partitions to start from, the best results can be expected when the initial partitions are already close to the final solution. The algorithm tries to identify relatively homogeneous groups of values in the given dataset based on the chosen parameters and the specified number of clusters, K. K-Means initializes itself with random cluster centroids as a starting point and, as it progresses, keeps reassigning the data objects to cluster centroids depending on the closeness between the centroids and the data objects. This reassignment procedure terminates as soon as one of the convergence criteria, such as the maximum number of iterations or the cluster results remaining unchanged over a number of loops, is met. Performance depends significantly on the random selection of the initial centroids. The K-Means clustering process consists of the following four steps, sketched in code after the list:

1. Randomly pick K centroids to give an initial partition of the dataset; choosing the value of K, and how to calculate centers and distances, is often the hardest part.
2. Assign each value in the dataset to the closest cluster centroid. "Nearest" can be measured in different ways, as there are several inter-object distance measures, and the choice can affect the assignment.
3. Recalculate the centroid of each of the K clusters to get the new means.
4. Repeat steps 2 and 3 until the exit criterion is met.
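The following minimal sketch (illustrative Python using Euclidean distance and random data, not the PAL implementation) implements the four steps above.

import numpy as np

def kmeans(data, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick K random records as the initial centroids
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each record to the nearest centroid (Euclidean distance)
        dist = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned records
        new_centroids = np.array([data[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Step 4: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

data = np.random.default_rng(1).normal(size=(60, 2))   # 60 random two-dimensional records
labels, centers = kmeans(data, k=3)
print(centers)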
In the Predictive Analysis Library, the associated function name is KMEANS and the algorithm name is KMEANS. The input table is not fixed in structure, but there is always a column for the record ID followed by columns containing the variables for the analysis, which must be numeric. This is because the clusters are calculated from inter-object distance measures, which make no sense for non-numeric data types. The table below gives the definition of the parameter table for the K-Means algorithm in PAL. The value of K is simply an integer giving the number of clusters or segments we wish to derive. The Manhattan distance, also called city-block distance, measures the distance between two points horizontally and vertically on a grid; the Euclidean distance is the unique shortest path and is the most common way to calculate the distance between two points for clustering; the Minkowski distance generalizes both the Euclidean and Manhattan distances. The maximum number of iterations defines an exit criterion, giving some control over processing time and algorithm complexity and preventing the process from running for a very long time. For initialization or seeding, the "first K records" method takes the first K records as the initial cluster centers, which gives misleading results if the data is sorted; "random with replacement" selects K random records from the data set as initial centers; and the SAP-patented method for finding K random clusters is based on a max-min approach, in which the initial center is chosen very close to the minimum point and subsequent centers are chosen from there. The threshold value indicates when the iterative process should stop; the default is 0.00001.

Name | Data Type | Description
GROUP_NUMBER | Integer | The value of K, the number of clusters.
DISTANCE_LEVEL | Integer | How the distance between an item and a cluster center is computed: Manhattan, Euclidean or Minkowski distance.
MAX_ITERATION | Integer | The maximum number of iterations.
INIT_TYPE | Integer | Center initialization method, with four options: first K records, random with replacement, random without replacement, or SAP's patented method for selecting the initial centers.
NORMALIZATION | Integer | Normalization method, with three options: none, yes for each point, or yes for each column.
EXIT_THRESHOLD | Double | The threshold (actual value) for exiting the iterations.
THREAD_NUMBER | Integer | The number of threads.
Parameter Table Definition for K-Means

To prevent very large numbers from dominating very small ones, it is good practice to normalize or standardize the data, which ensures that all variables have equal weight in the subsequent calculations. The output of K-Means comes in two tables. The first holds the results of the analysis, mapping each record in the data set to an assigned cluster number, together with the distance from that record to the center of the cluster it belongs to. The coordinates of each cluster center are listed in the second table as the Center Points rows. The distances between records and cluster centers provide a way to measure the compactness of the clusters and to identify unusual values or outliers.
The main elements of the SQLScript to call the PAL K-Means algorithm are as follows, with the chosen parameter settings; the full code is available in the file SAP_HANA_PAL_KMEANS_Example_SQLScript on the SAP PRESS website.

// The procedure generator
Call SYSTEM.afl_wrapper_generator('PAL_KMEANS', 'AFLPAL', 'KMEANS', PDATA);
// The Control Table parameters
INSERT INTO PAL_CONTROL_TAB VALUES ('GROUP_NUMBER', 4, null, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('INIT_TYPE', 4, null, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('DISTANCE_LEVEL', 2, null, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('MAX_ITERATION', 100, null, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('EXIT_THRESHOLD', null, 0.000001, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('NORMALIZATION', 0, null, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('THREAD_NUMBER', 2, null, null);
// Assume the data has been stored in table PAL_KMEANS_DATA_TAB
// Calling the procedure
CALL _SYS_AFL.PAL_KMEANS (PAL_KMEANS_DATA_TAB, PAL_CONTROL_TAB, PAL_KMEANS_RESASSIGN_TAB, PAL_KMEANS_CENTERS_TAB) with overview;
SELECT * FROM PAL_KMEANS_CENTERS_TAB;
SELECT * FROM PAL_KMEANS_RESASSIGN_TAB;

The results of running the SQLScript are shown above with the input data: the assignment of each record to a specific cluster and the data points of each cluster center.

4.2.4 Silhouette

The cluster viewer window in SAP PA displays four charts: a horizontal bar chart showing the size of each cluster, and a cluster density and distance chart with a color-coded scale from dark to light for dense to sparse clusters, where the thicker the connecting line, the closer the clusters. The two charts at the bottom allow the user to compare clusters by a chosen variable for a more customized analysis. How to choose the value of K is a key question, as it affects both the performance and the output of the analysis. In some business cases a value suggests itself from the requirement and application, such as T-shirt size groups for sale in stores, but in many cases the value of K is difficult to set before the analysis. A commonly used rule of thumb is the square root of N/2, where N is the total number of records in the dataset. This becomes unmanageable when the data set is very large: for a million records the rule suggests more than 700 clusters, which cannot be considered easy to manage. In that case data visualization is a big help, and the data can be plotted, for example, in a bubble plot. Cluster analysis is undirected data mining, as there is no target or dependent variable to be predicted, so another quantitative approach to determining the value of K is a measure of cluster quality called the silhouette; without such a measure we cannot compare the outputs of different cluster analyses. Good clusters are groups whose members are close to each other and far from the members of other clusters. The silhouette method computes, for every record, the value (b - a) / max(a, b) and averages it over all records in the dataset, where a is the average distance of the record to all other records within its own cluster (cohesion) and b is the average distance of the record to all the records in the nearest cluster it does not belong to (separation); a small sketch follows.
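The following sketch (illustrative Python, not the PAL VALIDATEKMEANS implementation) computes this average silhouette value for a given cluster assignment.

import numpy as np

def average_silhouette(data, labels):
    n = len(data)
    dist = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)   # pairwise distances
    scores = []
    for i in range(n):
        own = labels == labels[i]
        a = dist[i, own & (np.arange(n) != i)].mean()       # cohesion: own cluster
        b = min(dist[i, labels == c].mean()                 # separation: nearest other cluster
                for c in set(labels.tolist()) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two well-separated synthetic clusters give a value close to 1
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)
print(average_silhouette(data, labels))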
A value of 1 indicates that all records lie directly on their own cluster centers, while a value of -1 indicates that all records lie on the centers of some other cluster; a record that is equidistant from its own cluster center and from the center of the nearest other cluster has a silhouette coefficient of 0. These are of course idealized values, but they support a general guide: a value below 0.2 means very poor clustering, while a value of 0.5 or above is considered good and meaningful. In PAL the measure is called with the function name VALIDATEKMEANS and requires two input tables. The first table holds the data used for the cluster analysis; the second assigns a cluster number to each record. In the first table, the Record ID column can hold either integer or string values, while all the other columns, representing the attribute data, hold integer or double values. In the second table, the Record ID column holds only integers, and the second column, holding the assigned cluster number, is also an integer. The parameter table and output table for Validate K-Means are defined below. Note that the measured silhouette value increases as K approaches the number of records N, and finally becomes equal to 1 when K = N.

Name | Data Type | Description
VARIABLE_NUM | Integer | The number of variables.
THREAD_NUMBER | Integer | The number of threads.
Parameter Table Definition for Validate K-Means

Name | Data Type | Description
Result 1 | Varchar or char | Name.
Result 2 | Double | The silhouette value.
Output Table Definition for Validate K-Means

The SQLScript to call Validate K-Means in the PAL is as follows; the full code is available in the file SAP_HANA_PAL_VALIDATEKMEANS_Example_SQLScript on the SAP PRESS website.

// The procedure generator
Call SYSTEM.afl_wrapper_generator('palValidateKMeans', 'AFLPAL', 'VALIDATEKMEANS', PDATA);
// The Control Table parameters
INSERT INTO #CONTROL_TAB VALUES ('VARIABLE_NUM', 2, null, null);
INSERT INTO #CONTROL_TAB VALUES ('THREAD_NUMBER', 1, null, null);
// Calling the procedure
CALL palValidateKMeans(PAL_KMEANS_DATA_TAB, V_KMEANS_TYPE_ASSIGN, "#CONTROL_TAB", KMEANS_SVALUE_TAB) with overview;
SELECT * FROM KMEANS_SVALUE_TAB;

How to choose the initial cluster centers for the K-Means algorithm is the second key question, and it is answered by the seeding strategy applied. Generally, several different seeding strategies are tried and the solutions compared for robustness, to see whether the solution changes or remains stable. If the solution changes frequently, the clusters need to be examined again to understand the reason for the differences. This discussion assumed that the data set to be clustered is all numeric, but imagine a scenario where we have to cluster categorical data; does PA allow us to handle it? Categorical data certainly cannot be clustered on inter-object distances directly, as we cannot quantify the difference between, say, sun and moon. One approach is to convert each category into a new variable in binary format; once we have these binary variables mapped from the categorical data items, we can rescale them by multiplying by SQRT(0.5) to reduce their influence. A short sketch of this encoding is shown below.
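As an illustration of that approach (plain Python/NumPy, independent of PA; the column and category names are invented), a categorical column can be expanded into 0/1 indicator variables and scaled by sqrt(0.5) before being fed to a distance-based algorithm such as K-Means:

import numpy as np

colour = ["red", "blue", "red", "green", "blue"]     # an invented categorical variable
spend  = np.array([120.0, 80.0, 95.0, 60.0, 110.0])   # an existing numeric variable

categories = sorted(set(colour))                       # ['blue', 'green', 'red']
# One indicator (0/1) column per category, rescaled by sqrt(0.5)
indicators = np.array([[1.0 if c == cat else 0.0 for cat in categories] for c in colour])
indicators *= np.sqrt(0.5)

# Combine with the (normalized) numeric variable into one numeric matrix for clustering
spend_norm = (spend - spend.mean()) / spend.std()
features = np.column_stack([spend_norm, indicators])
print(features.shape)   # (5, 4): one numeric column plus three rescaled indicator columns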
It is also good practice to merge categories when there are many categorical variables, each with many sub-categories; alternatively, we can consider clustering approaches that are not distance based. Using decision trees to derive an analysis close to clustering, as in the figure below, is one such option.

Decision Tree Analysis of Clusters

4.2.5 Self-Organizing Maps

Self-Organizing Maps, or SOMs, also called Kohonen SOMs after their inventor Professor Teuvo Kohonen of the Academy of Finland, are a type of neural network that can be used to cluster a dataset into distinct groups. A one- or two-dimensional vector or matrix, known as the map, is used to represent multi-dimensional data in a much lower-dimensional space. Once the network is trained, records that are different appear far apart on the output map, while records that are similar appear close together. The number of records or observations captured by each cell or unit in the map shows which units are more heavily populated, indicating groupings or segments of the records and giving a sense of the appropriate number of clusters in the dataset; the value of K is not predetermined as it is in K-Means cluster analysis. SOMs are based on unsupervised learning, which means that no human intervention is needed during learning and that little needs to be known about the characteristics of the input data in advance.

The network is created from a 2D lattice of "nodes", the map, which is fully connected to the input layer. The figure below shows a small SOM network of 3 × 3 nodes connected to an input layer representing a two-dimensional vector, i.e. a two-variable dataset. Each node is assigned a specific topological position (an x, y coordinate in the lattice) and contains a vector of weights of the same dimension as the input vectors. In our example the input vectors have two dimensions/variables, so each node has a corresponding weight vector W of two dimensions: W1, W2. The lines connecting the nodes are drawn to represent adjacency; they do not signify a connection in the network sense.

3×3 SOM Connected to a 2-Variable Dataset

Training a SOM consists of the following steps, repeated over many iterations (a sketch of a single iteration is given after the list):

1. Each node in the map has its weights initialized with random values, set by PAL to lie between -0.05 and 0.05.
2. A vector is chosen from the set of training data, generally starting with the first one, and presented to the map.
3. The "winning" node, or Best Matching Unit (BMU), is found by comparing every node's weights with the input vector using a distance measure such as the Euclidean distance and taking the most similar, i.e. closest, node.
4. The radius of the neighborhood of the BMU is then calculated. This value starts large, typically chosen as the radius of the lattice, and decreases with each iteration; all nodes found within this radius are deemed to be inside the BMU's neighborhood.
5. The weights of each neighboring node found in step 4 are adjusted to make them more similar to the input vector; the closer a node is to the BMU, the more its weights are altered.
6. Steps 2 to 5 are repeated for the next vector in the data set, and the whole process continues for N iterations or until the weights stop changing.
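A minimal sketch of one training iteration (illustrative Python on a 3 × 3 map with two-dimensional inputs, not the PAL implementation; the exponential decay of the radius and learning rate is described right after this sketch):

import numpy as np

rng = np.random.default_rng(0)
rows, cols, dim = 3, 3, 2
weights = rng.uniform(-0.05, 0.05, size=(rows, cols, dim))   # step 1: random initial weights
grid = np.array([[(r, c) for c in range(cols)] for r in range(rows)], dtype=float)

def train_step(weights, x, radius, learning_rate):
    # Step 3: find the Best Matching Unit (node whose weights are closest to x)
    dist_to_x = np.linalg.norm(weights - x, axis=2)
    bmu = np.unravel_index(dist_to_x.argmin(), dist_to_x.shape)
    # Step 4: distance of every node from the BMU on the map lattice
    lattice_dist = np.linalg.norm(grid - np.array(bmu, dtype=float), axis=2)
    # Step 5: nodes inside the neighborhood move towards x, more strongly near the BMU
    influence = np.exp(-(lattice_dist ** 2) / (2 * radius ** 2))
    influence[lattice_dist > radius] = 0.0
    weights += learning_rate * influence[:, :, None] * (x - weights)
    return weights

x = np.array([0.3, 0.7])                     # step 2: one training vector
weights = train_step(weights, x, radius=1.5, learning_rate=0.5)
print(weights[..., 0].round(3))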
The figure below illustrates the size of a typical neighborhood near the start of training. The area of the neighborhood shrinks over time, which is accomplished by making the radius of the neighborhood shrink as well, using a decay function, a characteristic feature of the Kohonen SOM.

Decreasing Neighborhood Size during SOM Iterations

Eventually the neighborhood shrinks to the size of just one node, the BMU itself. The goal is to discover the underlying structure of the data. A SOM therefore has two phases: a learning phase, in which the map is built and the network organizes itself through a competitive process using the training set, and a prediction phase, in which new vectors are quickly given a location on the converged map, classifying or categorizing the new data. When a node is found to be within the neighborhood, its weight vector is adjusted as follows, otherwise it is left alone:

W(t+1) = W(t) + λ(t) × (V(t) - W(t))

where t is the iteration number and λ(t) is a small value called the learning rate, which decreases with each iteration. In other words, the new weight of the node equals the old weight W plus a fraction λ of the difference between the input vector V and the old weight. The neighborhood radius decays exponentially:

σ(t) = σ0 × exp(-t / λ)

where σ0 denotes the initial radius, i.e. the width of the lattice at iteration t = 0, and λ here denotes a time constant; the learning rate decays over time in the same exponential fashion from its initial value, which PAL sets to 0.5 by default. The effect of learning should also be proportional to the distance of a node from the BMU, in addition to the decay of the learning rate over time; at the edges of the BMU's neighborhood the learning process should have barely any effect. The amount of learning therefore fades with distance following a Gaussian decay:

Θ(t) = exp(-dist² / (2σ²(t)))

where dist is the distance of the node from the BMU. With each iteration, records from the training data set are allocated to cells on the map, and closely related records are grouped together, as shown in the figure below. Self-organizing maps differ from other artificial neural networks in that they use a neighborhood function to preserve the topological properties of the input space.

Assignment of the Data Set Records to the Map, Showing the Clusters

In the Predictive Analysis Library, this method is called with the function name SELFORGMAP and the algorithm name Self-Organizing Maps. The input table consists of an initial ID column, and all subsequent columns contain the variables to be used for the cluster analysis. These variables must be numeric, because SOMs cluster data objects using inter-object distances, which cannot be computed for non-numeric input variables. The parameter table definition for Self-Organizing Maps is shown in the table below. There are three ways to normalize the data:

1. The data stays as it is (the default); the parameter is set to 0.
2. For each variable X (x1, x2, ..., xn), the minimum and maximum values of X are found and X[i] = (X[i] - min) / (max - min) is calculated, rescaling the data between 0 (the minimum) and 1 (the maximum); the parameter is set to 1.
3. For each variable X (x1, x2, ..., xn), the values are normalized using the mean and standard deviation of X: each xi is normalized to X' by computing X' = (xi - Mean(X)) / S.D.(X); the parameter is set to 2.
Name | Data Type | Description
SIZE_OF_MAP | Integer | The self-organizing map is made up of n × n unit cells; this parameter defines the value n.
MAX_ITERATION | Integer | The maximum number of iterations.
NORMALIZATION | Integer | Normalization method, with three options: none, transform to the range (0.0, 1.0), or Z-score normalization.
THREAD_NUMBER | Integer | The number of threads.
Parameter Table Definition for Self-Organizing Maps

Self-Organizing Maps outputs two tables. The first table, the SOM map, holds the final weights corresponding to each map cell ID, along with the number of records or tuples assigned to each cell ID, i.e. the cluster size, which is stored in the last column and is always an integer; the weight vectors that simulate the original tuples are always output as doubles in the middle columns. The second output table, the SOM assign table, maps each record to its assigned cell ID, showing the membership of the clusters; the tuple ID can be an integer or a string.

Name | Data Type | Description
1st column | Integer | Unit cell ID.
Other columns except the last one | Double | The weight vectors used to simulate the original tuples.
Last column | Integer | The number of original tuples that each unit cell contains.
Output Tables Defined for Self-Organizing Maps

Below is an example of self-organizing maps in the PAL, applied to the same data set as shown in the figure above, on which we ran K-Means, which makes for an interesting comparison of the two cluster analysis algorithms. The PAL SQLScript to call Self-Organizing Maps is as follows; the full code is available in the file SAP_HANA_PAL_SELFORGMAP_Example_SQLScript on the SAP PRESS website.

// PAL set-up
Call SYSTEM.afl_wrapper_generator('PAL_SELF_ORG_MAP', 'AFLPAL', 'SELFORGMAP', PDATA);
// Preparing application data for calling procedure
INSERT INTO PAL_CONTROL_TAB VALUES ('MAX_ITERATION', 200, null, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('SIZE_OF_MAP', 4, null, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('NORMALIZATION', 0, null, null);
INSERT INTO PAL_CONTROL_TAB VALUES ('THREAD_NUMBER', 2, null, null);
// Assume the data has been stored in table PAL_SOM_DATA_TAB
// Calling the procedure
CALL PAL_SELF_ORG_MAP(PAL_SOM_DATA_TAB, PAL_CONTROL_TAB, PAL_SOM_MAP_TAB, PAL_SOM_RESASSIGN_TAB) with overview;
Select * from PAL_SOM_MAP_TAB;
Select * from PAL_SOM_RESASSIGN_TAB;

The script outputs the two tables shown below. The first output table shows the final weights assigned to each map cell ID along with the number of records or tuples corresponding to that cell ID, which represents the cluster sizes. The second table shows the cell ID assigned to each record, i.e. the cluster membership: for example, cell ID 0 has 2 records (trans_id 11 and 12), cell ID 13 has 2 records (trans_id 18 and 19), and cell ID 15 has 5 records (trans_id 0, 1, 2, 3 and 4). Generally the results are then presented in a visualization so the clustering of the data can be understood and analyzed; for instance, cell 7 has five records, cell 8 has two records, and so on. We can run the SOM on any N × N grid: the smaller N is, the closer the cells and hence the tighter the grouping that can be seen. Comparing the outputs of SOM with K-Means usually gives the same number of clusters, but in K-Means the user predetermines the value of K before running the algorithm, whereas with Kohonen SOMs we let the data suggest the value of K.
If a SOM has fewer data objects than cells, the result is a sparse output with many empty regions between the classified objects; in the opposite case, cells are forced to be shared, which is what produces the groups or clusters. The following figure shows the visualization of the results when we run the SOM on the input data in figure XX and plot the table data for 4 clusters in a 4 × 4 map.

The Four Clusters in the 4 × 4 Map

Cluster analysis certainly deserves to be called one of the most popular methods of predictive analysis, and it is attractive for business operations because of its ability to break huge amounts of data down into smaller, manageable clusters that give better understanding than analyzing the raw data directly. The most common application is market segmentation, based on the fact that focused marketing is more effective than a generic approach. The biggest strength of cluster analysis in general is that it is easy to understand, as the intention at this step is not to make predictions but to cluster existing data. Each of the cluster analysis algorithms discussed above has its own pros and cons. ABC Classification is very simple, very practical and therefore very popular, but it is limited to three groups, even when a user wants more groups, which moreover need not add up to 100%, for example the top 15%, the next 10% and the next 40%. K-Means is also easy to understand and to apply if we choose the value of K in advance; the detection is undirected and data driven. K-Means is clearly driven by the choice of K, which depends on the user's experience and skill, and a wrong choice makes the whole clustering a bad experience. The results of K-Means generally vary with the choice of distance measure and are sensitive to the initial choice of cluster centers, and we cannot simply optimize the value of K, since the method does not maximize or minimize a single target function. The input data is assumed to be numeric, but non-numeric data can be mapped to numeric values, or groups can be found in the data using approaches such as decision trees and association analysis. Self-organizing maps are used to create an ordered visualization of multi-dimensional data, which simplifies complexity and reveals meaningful relationships; they can be compared to a non-parametric regression technique that converts multi-dimensional data spaces into lower-dimensional abstractions. They are popular and beneficial because they provide quick and concise model creation even for voluminous data sets and have very good prediction accuracy thanks to a patented procedure for extracting non-linear relations. However, they work well only if clusters actually exist; they cannot find clusters for the user if there are none.

Chapter 5: Conclusion

5.1 Problem Set: Burn that Churn

To implement SAP PA and gain hands-on experience with it, I decided to run a predictive analysis on several years of telecom data and then compare the efficiency and accuracy of the predictions by applying the same model to different data sample subsets and comparing the model correctness. Huge volumes of customer data are analyzed in order to identify business factors, such as new revenue opportunities and churn reduction, that can help to make informed decisions. Customers become "churners" when they discontinue their subscription and move their business to a competitor.
Credit card issuers, insurance companies and telecommunication companies are always keen to predict churning, i.e. the process of customer turnover. If they can identify a leaving customer at the right time, they can try to retain him with better offers, as it is always cheaper to retain a current customer than to gain a new one. In this work we have only considered customer-initiated churn and ignored operator-initiated churn, because the latter is mostly caused by payment problems on the customer side and is of no interest here. All telecom service providers must be able to respond and react in a timely manner to customers' preferences in order to survive in this competitive world. They must be able to predict and prevent subscriber churn by understanding the reasons of customers who have already churned as well as the expectations of current customers. All telecom operators store enormous amounts of data about subscribers and their behavior. In this part of the thesis, this data has been analyzed in an attempt to find valuable insight for business decision makers. There are many ways to approach this objective, such as segmenting customers by location, demographics, purchase history, etc., or understanding which market campaigns and new products attracted the most customers, and more. It is not practical to understand the personal preferences of each and every subscriber, but it is possible to build visualizations and understand patterns based on the historic data in the repositories in order to predict the probability of churn in each segment. The following variables were taken into main consideration, as they appear to have the highest effect on churn:

Customer demographics, i.e. age, gender, marital status, location, etc.
Call statistics: length of calls at different times of the day, number of long-distance and local calls.
Billing information for each customer: what the customer is paying for local and long-distance calls.
Extra service information, that is, what extra plan the customer is registered on, e.g. special long-distance rates.
Complaint information: how many customer service calls are made for disputed billing, dropped calls, slow service provisioning, non-working special services, and so on.
Credit history.

5.2 Results & Analysis

I obtained the data set for this work from the Teradata Center for Customer Relationship Management at Duke University, which used it for the NCR Teradata 2003 Tournament. The data represents a major wireless telecommunication service and holds records for more than 100,000 customers who have a minimum of 6 months of service history. I found the data good enough for churn modelling and predictions to produce the results and analysis of this thesis report. The dataset was named 'tourn_1_calibration.csv' and is copied to the attached CD. Using a normal Excel program, I first removed noise by deleting all rows with missing, null or strange-looking values for the variables that appeared most likely to influence prediction and modelling. These variables from the data source are:

1. Age of handset
2. Calls to customer care in the last 3 months
3. % change in monthly outgoing calls versus the previous 3 months' average
4. Age of customer
5. Current handset price
6. Income of customer
7. Total number of months as customer
8. Credit history

The new dataset after removing noise is named tourn_1_calibration_thesis.csv and is also copied to the CD. Binning was then applied to this data set to improve the analysis results and to have the data read in groups/clusters instead of raw integer values.
The new columns I introduced to the dataset during the binning process are:

1. range_handprice, which has 3 possible values, 'low end', 'medium' and 'high end', representing the cost group of handsets. Mobile phones costing below 79 are classified as low end, those costing more than 170 as high end, while the remaining fall into the medium group.
2. range_income, which again has 3 possible values, 'low', 'medium' and 'high', representing the income bracket of users. Customers earning less than or equal to 3 on the index are classified as the low income group, those earning more than or equal to 7 as the high income group, while the remaining fall into the medium group.
3. range_agemob, which has 3 possible values, 'less than 1 year', 'around 2 years' and 'very old mobile', depending on how long the handset has been used. As the name signifies, the group called 'less than 1 year' holds all entries where the handset has been used less than 12 months; 13 to 30 months of handset use comes under the 'around 2 years' group, and everything older than 30 months is classified under the 'very old mobile' group.
4. range_churn, a new column simply representing the value of the churn flag in a readable text format for quick analysis. Churn value 0, which means the customer is still active, is shown in this new column as 'Remained', while value 1 is shown as 'Churned', stating that the customer has already churned.
5. range_custcare, a new column grouping the number of customer care calls a user has made in the last 3 months, with 3 possible values: 'less than 20', '20 to 60 calls' and 'more than 60'. The values falling under these 3 groups are easy to guess from the group names.

In the same way, all crucial columns that I expected to have a major influence on churn were mapped into new columns holding the corresponding bracket value, with the help of the Excel program. All these new columns are named range_xxx so that they can quickly be selected from the list of variables in PA. (An illustrative R sketch of this binning and sampling step is given below, just before Subsection 5.2.1.)

The process starts with importing this modified data set, with the added columns and the noise removed, into PA. In the PA interface you can acquire the data source file from the File menu: select New, choose the CSV file option and navigate to the dataset in question. Once the dataset was imported, I ran the sample function to create 4 sample datasets of 7,500 records each to make modelling faster. These 4 different samples also helped to verify the correctness of any model, by applying models repeatedly and comparing the results for the four samples individually. I named them 'churnsample1.csv', 'churnsample2.csv', 'churnsample3.csv' and 'churnsample4.csv'. All 4 sampled datasets were taken by random selection in PA and are copied to the attached CD.

The next step is to start applying algorithms to each dataset and compare the calculated, predicted value of the churn flag with its actual value. The following subsections hold the results for the 4 algorithm categories and comment on how we can get an idea of future churning behavior based on the results of these algorithms on the datasets. All the models created to obtain the results below are saved in .pmml format and are copied to the attached CD for reference. Under the Predict tab, the PA interface allows you to drag and drop the required algorithm in front of the data set; before running it you can configure the settings of each algorithm by clicking the settings icon on the algorithm itself, as shown in the figure below. Subsections 5.2.1 to 5.2.4 present the results of the analysis and the values calculated by PA on the 4 different datasets for the four different classes of algorithms.
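Although the binning above was done manually in Excel and the sampling inside PA, the same preparation can be expressed in a few lines of R. The sketch below is illustrative only: the column names hnd_price, income, eqpdays, custcare_mean and churn are assumptions standing in for the fields of the Teradata calibration file, and the thresholds are the ones listed above.

# Minimal sketch of the binning and sampling step (column names assumed)
calib <- read.csv("tourn_1_calibration_thesis.csv")

calib$range_handprice <- cut(calib$hnd_price, breaks = c(-Inf, 79, 170, Inf),
                             labels = c("low end", "medium", "high end"))
calib$range_income    <- cut(calib$income, breaks = c(-Inf, 3, 6, Inf),      # <=3 low, 4-6 medium, >=7 high
                             labels = c("low", "medium", "high"))
calib$range_agemob    <- cut(calib$eqpdays / 30, breaks = c(-Inf, 12, 30, Inf),  # eqpdays assumed = handset age in days
                             labels = c("less than 1 year", "around 2 years", "very old mobile"))
calib$range_custcare  <- cut(calib$custcare_mean, breaks = c(-Inf, 20, 60, Inf),
                             labels = c("less than 20", "20 to 60 calls", "more than 60"))
calib$range_churn     <- ifelse(calib$churn == 1, "Churned", "Remained")

# Four random samples of 7,500 records each, comparable to PA's sample function
set.seed(123)
idx <- sample(nrow(calib))
for (i in 1:4) {
  rows <- idx[((i - 1) * 7500 + 1):(i * 7500)]
  write.csv(calib[rows, ], paste0("churnsample", i, ".csv"), row.names = FALSE)
}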
Some algorithms allow you to predict the value of a particular variable, say churn in our example, by letting us set a dependent (target) variable under the configuration settings, while others do not. In the latter case we cannot assess model efficiency directly by comparing the number of correctly predicted values to the total number of values, but we can still gain some understanding of patterns and behavior through visualizations and comparisons.

5.2.1 Clustering

Algorithm used: R-K-Means (MacQueen)
Number of clusters: 5
Dependent variables: change_mou, months, no_of_cars, income, eqpdays and custcare_mean
Number of iterations: 100

Although this algorithm does not allow us to calculate a value for a specific variable based on other dependent variables, we can still find clusters and common patterns based on the independent variables. As is clear from the result sets (visualizations) below, cluster 5 is the most dense while cluster 1 is the smallest. From the cluster density & distance chart we can see a strong connection between cluster 2 and cluster 5. The cluster center representation diagram reveals that clusters 1, 3 and 4 lean mostly on the handset price variable, while the biggest one, cluster 5, is based on two values, equipment age and income bracket. The parallel coordinates chart displays calculated_churns for all 5 clusters, while the scatter matrix chart plots the chosen independent variables against each other. The summary of the cluster centers is:

   change_mou  months    hnd_price    eqpdays   income    custcare_mean
1  2391.470    19.55148  7998999023   431.2059  5.913963  1.7418900
2  3313.584    24.77475  3026621144   639.8905  5.955474  0.8334206
3  2213.295    18.43822  5997563536   465.2371  5.928161  1.1652299
4  1932.428    19.43715  9995865138   308.0222  5.407119  1.9365962
5  1896.514    16.94159  1556729838   304.7009  5.701856  1.7578339

Cluster Analysis for first dataset

5.2.2 Decision Tree

This algorithm allows us to make calculations for a chosen, or target, variable, and thus we can easily determine the efficiency and correctness of the model after running it.

Algorithm used: R-CNR Tree (Regression)
Output mode: Trend
Dependent variables: range_income, range_custcare, range_months, range_handprice, range_agemo
Target variable: churn
New output column: calculated_churn

The output chart shows the probability of churn based on the above-mentioned dependent variables. By analyzing it in detail, we can predict the customers with a high churning possibility based on rules. For example, customers with more than 20 customer care calls and more than a 23% decrease in overall revenue over the past 3 months have a 69% probability of churning, and customers with expensive handsets and more use of mobile data compared to voice calls have a 78% probability of churning; perhaps they use their handsets more for internet gaming or browsing, so only the latest handsets are of interest to them. In the result grid view this analysis also gives a calculated predicted churn value for each customer, as shown in the figure below.

Decision tree showing probability and classification of dependent variables

Calculated churn value for each customer based on dependent variables
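The R-CNR Tree component used above presumably wraps a CART-style implementation such as R's rpart package. The following minimal sketch is illustrative only: it assumes the sampled file churnsample1.csv and the column names listed in the configuration above, and it is not the exact procedure PA runs internally.

# Minimal sketch: a CART-style regression tree for churn probability with rpart
library(rpart)
sample1 <- read.csv("churnsample1.csv", stringsAsFactors = TRUE)   # hypothetical export of the first sample

fit <- rpart(churn ~ range_income + range_custcare + range_months + range_handprice + range_agemo,
             data = sample1, method = "anova")

sample1$calculated_churn <- predict(fit, sample1)   # predicted churn probability per customer
print(fit)                                          # the split rules behind those probabilities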
5.2.3 Apriori

Here, too, we cannot calculate the value of a particular variable based on other independent variables in the dataset, so we rely on visualization analysis and the rules summary. Although it is not the most suitable algorithm for this problem set, business decisions can be made better with just a small piece of new information.

Sort type: Ascending Transaction Size
Output mode: Rules
Dependent variables: range_income, range_custcare, range_months, range_handprice, range_agemo
Support: 0.1
Confidence: 0.8

The output figure below is a set of rules generated by Apriori and can be very useful for decision making if dived into. This analysis, based on 5 dependent variables, gave us a set of 37 rules. Interestingly, changing the values of support and confidence changes both the number of generated rules and the lift value of each rule.

Apriori rules output for first dataset

5.2.4 Neural Network

In my experience this algorithm fits this problem set best, and it allows us to prepare a model that can be reused with new data to predict values of the target variable. Using this approach, we train a model on training data and keep refitting it until the drop in correctness when it is applied to another dataset is no longer large. When we run this model it adds a new column to the table which, according to the model, should hold the value of the target variable (churn) based on previous patterns and behavior. This allows us to compare the real value of the target variable with the value predicted by PA.

Algorithm: R-MONMLP
Target variable: churn
Output mode: Trend
Dependent variables: income, months, custcare_mean, handprice
Hidden layer 1 neurons: 5
Predicted column name: predicted_value

The following screenshot shows the insertion of the new column predicted_value when we run this algorithm, with values 0 and 1 representing whether the customer churns or not according to this model and the configuration settings for the chosen dependent variables.

New column 'predicted_value' in the result set

We can easily find out how many of the values are predicted correctly by PA by comparing the output values to the real churn values in the dataset. The total number of correct predictions divided by the total number of rows gives the percentage correctness of the model. In the first attempt, I got the following figures for the predicted values compared to the churn values in the dataset.

Confusion matrix for predicted churn variable

That means that out of 3,780 actual churn values of one, our model predicted a one correctly 2,082 times, i.e. approximately 55% correctness. I then tried new values by trial and error, changing the combinations of variables used to configure the algorithm, until I got results and a model that were relatively better. In PA, any model with 80% or higher correctness is considered good. No model can be 100% accurate and applicable to all problem sets; it takes an understanding of the business processes and domain knowledge to plan and implement a predictive analysis solution for a given problem set.

Output model with 65% correctness

Output model with 96% correctness

This improvement in the model could also be an outcome of overfitting, so to rule out that possibility I applied the same model to the other 3 datasets. The correctness percentage of the model did not fluctuate by large values, so I considered it to be a good model. To save a model, click Save as Model under the Predict tab in the interface; once saved, models appear under the component pane as shown in the diagram below. This model is the one that gave us 96% correct values for churnsample1. Saving a model makes it easy to reapply it to other datasets, as we do not need to worry about configuring the algorithm settings again.
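The correctness figures quoted above, and those in the confusion matrices that follow, all come from the same simple calculation: comparing the predicted churn column with the actual one. A minimal R sketch of that check is shown below; the file name and column names are assumptions, standing in for a result set exported from PA.

# Minimal sketch: confusion matrix and percentage correctness for a scored sample
scored <- read.csv("churnsample1_scored.csv")          # hypothetical export containing the predicted_value column

conf_mat <- table(actual = scored$churn, predicted = scored$predicted_value)
conf_mat                                               # 2 x 2 confusion matrix

accuracy <- sum(diag(conf_mat)) / sum(conf_mat)        # correct predictions / total rows
round(100 * accuracy, 1)                               # e.g. 96.0 for the model kept above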
Saving a model with high correctness

The following figures show the results for the same model, 'neuralnetwork', when applied to the other 3 datasets.

Confusion matrix for data set 2, correctness 99%

Confusion matrix for data set 3, correctness 98%

Confusion matrix for data set 4, correctness 79%

As we do not see a big fall or deviation in correctness when this model is applied to the other 3 sample sets, we can conclude that it is a well-configured model and can use it to predict the churn variable for customers who are still active. That is the potential of modelling and PA.

5.3 Discussion & Issues

5.3.1 SAP PA compared to Hadoop

The SAP Predictive Analysis tool is, as of today, designed to predict and analyze structured data only, which means the user should have structured data in xls, csv or txt files or in HANA database tables. If the data is not in a structured format, the user cannot proceed with analysis and prediction in PA, whereas Hadoop can also be leveraged to analyze unstructured and semi-structured data. Hadoop is often used to store and process big volumes of semi-structured or binary data. It does, however, look feasible to pass results from Hadoop to PA in two steps: first Hadoop performs the initial selection and aggregation and outputs a reduced dataset; take the case of text documents stored in Hadoop, where only the list of words with their frequencies is returned to PA. Once SAP Predictive Analysis receives this information from Hadoop, it can apply data mining algorithms to the Hadoop result set. Comparing the two directly makes little sense, as they serve different purposes: Hadoop is a file system that stores a variety of data, i.e. structured, unstructured and semi-structured Big Data, while PA, being an advanced analytical tool, reads historical data from such sources and projects the predicted results for a business query, where the data can be picked from sources such as HANA, Hadoop, BW or BO. A next big step from SAP could be to offer SAP and Hadoop integration options to ease this for the customer, either by merging InfiniteInsight, which allows this Hadoop integration, into PA, or through some other innovative module.

5.3.2 Sharing your own R component

It is possible to share your own R components with colleagues or customers, and it involves only a few simple steps. If your component needs a new library, the library has to be added first under C:\users\Public\R-3.0.1\library and can then be shared. Remember to close the PA program when sharing components. In the folder C:\Users\piyush\SAP Predictive Components\RScript you can see all the R components you have created, and the same folder is used to paste a component shared with you by someone else. Rename the components from the default names given by PA to more meaningful names. After renaming, it is crucial to modify the name of the folder in the component.xml file: with the help of a simple text editor, replace the automatically created name with your new folder name. That is all that is required for your R components to be reused by your colleagues, simply by sharing the folder. The logic is encapsulated in so-called "Custom R Components", allowing even users without R skills to use them.
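To give an idea of what such a component contains: in essence it is an R function that receives the input dataset as a data frame and returns a data frame that PA displays as the result. The sketch below is illustrative only; the exact function signature and the way parameters are exposed depend on the PA version and on how the component is configured in the R extension dialog, and the column names are assumptions.

# Minimal sketch of the kind of logic wrapped in a custom R component (conventions assumed)
churn_rate_by_group <- function(dataset, group_column = "range_handprice") {
  # dataset: the input data frame handed over by PA (assumed to contain a 0/1 'churn' column)
  # returns: one row per group with its churn rate, as a data frame for PA to display
  rates <- aggregate(dataset$churn, by = list(group = dataset[[group_column]]), FUN = mean)
  names(rates)[2] <- "churn_rate"
  rates
}

# Standalone usage example outside PA:
# result <- churn_rate_by_group(read.csv("churnsample1.csv"))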
Connecting R and PA

5.3.3 Configuring HANA PAL to use with SAP PA

When connected to HANA in online mode, SAP PA lets users leverage the HANA Predictive Analysis Library through a user-friendly interface that pushes all processing operations to the HANA server. The PAL is not installed on HANA by default. To connect them, first install the AFL (Application Function Library) on the HANA server; it can be downloaded from http://service.sap.com/swdc. Log in as root on the HANA server and extract the files using SAPCAR:

SAPCAR -xvf IMDB_AFL100_60_1-10012328.SAR

Then navigate into the SAP_HANA_AFL directory created by the extraction and execute hdbinst:

~/tmp/SAP_HANA_AFL # ./hdbinst

Once installed, you can verify that the AFL installation succeeded with the commands below:

SELECT * FROM "SYS"."AFL_AREAS" WHERE SCHEMA_NAME = '_SYS_AFL' AND AREA_NAME = 'AFLPAL';
SELECT * FROM "SYS"."AFL_PACKAGES" WHERE SCHEMA_NAME = '_SYS_AFL' AND AREA_NAME = 'AFLPAL';
SELECT * FROM "SYS"."AFL_FUNCTIONS" WHERE SCHEMA_NAME = '_SYS_AFL' AND AREA_NAME = 'AFLPAL';

Add the afl_wrapper_generator and afl_wrapper_eraser procedures if they do not exist. On the HANA server, navigate to the /hanamnt/<SID>/HDB<instance_number>/exe/plugins/afl/ directory and execute the afl_wrapper_generator.sql and afl_wrapper_eraser.sql scripts as the HANA user SYSTEM. (An easy way to do this is to open the files in a text editor on the Linux server and copy the code back to HANA Studio for execution as the SYSTEM user in a SQL console.) You now have two procedures, AFL_WRAPPER_GENERATOR and AFL_WRAPPER_ERASER, which are owned by SYSTEM. Grant the EXECUTE privilege on system.afl_wrapper_generator and system.afl_wrapper_eraser to your predictive analysts. For example, if the user name is MyHANAUser, run the commands:

GRANT EXECUTE ON system.afl_wrapper_generator TO MyHANAUser;
GRANT EXECUTE ON system.afl_wrapper_eraser TO MyHANAUser;

5.4 Future Work

The traditional SAP Predictive Analysis tool allows data analysis activities to be run, but they are manual and thus repetitive and prone to human error. SAP InfiniteInsight (formerly KXEN) introduces automation into PA activities and allows users to concentrate more on business decisions; it would be interesting to dive in and explore the extra potential it adds to traditional PA. I wanted to implement HANA server connectivity and run PA predictions both on a HANA server and on a local machine to show the difference in performance in figures, but could not because of the expensive license fees for a HANA server. Another direction is developing mobile applications that integrate PA results and charts into other applications for better and faster business decision making. A predictive model, which is an equation, algorithm or set of rules needed to predict an outcome from an input dataset, can simply be a set of business rules based on past observations, and it can be developed more accurately using statistically rigorous predictions and statistical algorithms. Future work could also consider defining a standard for such models so that business communities can reuse them simply by importing them.

5.5 Conclusion

Even in normal daily life, everyone takes advantage of predictive analytics, in the form of anything from weather forecasts to insurance premiums. Predictive analytics will be used more and more as businesses understand and appreciate the business benefits that these prediction tools bring.
Predictive analytics is a subset of data mining with a specific focus on making predictions. In 2012 SAP announced the launch of SAP Predictive Analysis 1.0, the new solution in its predictive analytics portfolio and the replacement for the classical offering, SAP BO Predictive Workbench. With PA, HANA and HANA's native predictive library (PAL) enable in-database execution of predictive algorithms, i.e. the procedures run in the database layer and only the result set is exported, instead of exporting the whole dataset so the algorithms can run in the application layer. This gave SAP a leading edge and made it a big contender in the Big Data predictive analytics space. In actual terms, however, SAP's portfolio was very small. HANA no doubt brought a lot of modern and ground-breaking technologies to the game that were not available before, but in terms of actual functionality related to analytical models it was still behind its main competitors such as SAS, IBM and Tibco. Only a couple of dozen algorithms were available in PAL, which was definitely not enough to compete. The game changed the day SAP announced R integration, which means more than 3,500 algorithms became part of the library set. This HANA R integration, although powerful, still had the disadvantage of requiring a very specific set of development skills in order to deliver actual analytical models to business users. To support users with very little or almost no technical knowledge, SAP PA 1.0, in its latest version (11), allows implementing custom R functionality (i.e. algorithms that are not built into standard PA) without having to resort to developing HANA SQLScript/R procedures. Users can share their existing R scripts (just adapting them to a function model) and can now visually create their analytical models with the most complex algorithms. The PA interface and the possibility to run analysis on in-memory databases allow the analysis of very large amounts of data with better performance.

The basic working element of predictive analysis is the predictor, the variable whose measured value for an individual or entity gives an idea about future behavior. For example, an insurance company could consider age, income, credit history, insurance claims history and other demographics as predictors when issuing an insurance policy, to determine an applicant's risk factor. A predictive model is a combination of multiple such predictors, which can help forecast future possibilities or behaviors with an acceptable level of reliability when subjected to analysis. The model must be re-validated and revised as additional data becomes available, based on the collected data, the formulation of statistical models and the previous properties of the models. PA always goes hand in hand with business knowledge and statistical techniques for its full exploitation, and the prediction objective or desired insight must be clear before starting the process. It is good practice to keep multiple related predictive models ready to be run and applied to a dataset for better strategic company decisions.