Data mining of sports performance data

Leonardo De Marchi
Erasmus computing 2010/2011
Summary
There is no doubt that professional sport is now big business. Michael Jordan, who played for the
Chicago Bulls twenty years ago, is one of the most famous people in the world. At its peak the
market around him was worth more than 37 billion dollars per year, greater than the Gross Domestic
Product of many countries. Nowadays the sports market is even bigger, with sponsors, merchandise,
TV programs and so on.
If you look at how many times a world record is broken at every Olympic Games, it is clear that
training techniques are constantly improving. Nutrition, new training methods and a lot of research
push athletes to their physical limits or even further.
Our work is designed for this context. In this work we are going to study a framework to gain a
better understanding of match performances. Specifically, we are going to build a robust
structure, based on the standard methodology for data mining, to understand which statistics of a
match performance are the most important.
We will propose a way to evaluate our methodology and results, and we will try to interpret the
results to provide feedback useful for improving the training exercises.
Acknowledgements
I have so many people to say thanks to that I should probably use more pages than the rest of the
report.
For the moment I will say a big thanks to the following people for all their support during this
project; the merit for this work mostly belongs to them:
• My supervisor Derek Magee, always available to help and with a big smile that cheers me up
every time.
• My assessor Eric Atwell, for the kind and enthusiastic feedback he gave me on the interim
project report and in the progress meeting.
• Ram Mylvaganam, the tutor from Apollo, a very busy person who nevertheless found time for me
every time I asked for his help.
• My family, always there when I need them.
Chapter 1 Objective of the project
1.1 Overview
Professional sport has become very competitive in recent years; for elite athletes, conventional
training is no longer enough. Because of the economic interest in sport, the study of athletes has
become more scientific, trying to improve performance as much as possible. A good approach to this
problem is Data Mining, a set of techniques used to find patterns and make predictions in many
fields, which can be used to study sport performances in various ways.
The overall aim of this project is to take sports training and match data and extract relationships
between these using data mining and machine learning techniques.
Some of the data for this project is provided by the company Apollo and relates to this year's
training sessions of a basketball team which plays in the National Basketball Association (NBA),
in the United States. Basically, position, velocity, distance and Body Load are recorded every
tenth of a second. During the project we found that these data were not enough, and we decided to
use data from the matches as well.
For this analysis we used mostly data from the internet rather than data from Apollo; in particular
we used the statistics of all the matches of the 2010/2011 season. We considered the most common
statistics, such as points, rebounds and shooting percentages, but only for the New York Knicks.
The goal of the company is to understand how to evaluate and assess a basketball performance, so
that it will be possible to understand the key aspects which lead to a victory.
To achieve this we must understand the specific field of application, in this case sport science,
then select the appropriate Machine Learning techniques, and finally decide how to present the data.
1.2 Main Aims/Objectives
As we said, the general aim of this kind of project is to improve sports performance. As this is
the first analysis of this type of data, the company does not have a specific aim, but it wants to
improve the performance of the team.
We can see in the literature that studies in sport activity analysis aim to:
1. predict future performance
2. indicate strengths and weaknesses of training
3. measure improvement
4. enable the coach to assess the success of his training program
5. place the athlete in an appropriate training group
6. motivate the athlete
After analyzing the data and after the literature review we ended up with a specific objective:
identifying the crucial aspects of a match result. Specifically, our analysis should provide a
methodology to assess the games and to find out the most important statistics in a match. It will
consider official games and it will cover points 3 and 4 of this list, providing the basis for
points 1 and 2.
We also have the collaboration of two external experts:
1) Ram Mylvaganam, the chairman of the Apollo company, an expert in business and sport.
2) Dr John O'Hara, a professor of Sport Science at Leeds Met.
1.3 Minimum Requirements
In this project we will focus on indicating the strengths and weaknesses of training, measuring
improvement and enabling the coach to assess the success of his training program.
Therefore we need to:
1. Identify and implement an appropriate Data Mining technique.
2. Identify and implement an appropriate evaluation method for the results.
3. Identify and implement an appropriate method to visualize the data.
1.4 Further enhancements:
This work is at a very early stage, therefore it is open to various further enhancements. After the
summarization and visualization of the data, an interesting continuation of this work could be to
find a mathematical way to determine the most important factors in the final result of a match.
After that it would be possible to analyze the training data and link training with performance.
This could provide feedback to select the correct training exercises, and it would also make it
possible to predict the future performances of the athletes.
This task is not trivial because the performance depends on a lot of factors, like external
conditions or mental state, which are difficult to measure; therefore it will probably be necessary
to explore multiple methods to find the correct approach.
1.5 Methodology
To achieve our aim we decided to use an appropriate methodology for the project management,
especially designed for Data Mining analysis.
Data Mining in sports is quite a recent field, and there is only a small amount of literature on
this topic, so we looked for a more general-purpose methodology.
There exists a general framework especially designed to define all the steps needed in a generic
data mining project: the Cross Industry Standard Process for Data Mining (CRISP-DM). We are going
to use CRISP-DM as the structure for our analysis, adapting it to our problem where necessary.
CRISP-DM is a standard methodology for Data Mining; we decided to use it as it provides all we
need to ensure the project is going the right way, providing mid-term goals and dividing the
process into independent and more controllable subprocesses. Achieving the goal of each step will
ensure that the project is going in the right direction. It also provides precise indications about
how to evaluate the final result, especially with respect to the business objectives.
1.5.1 Development of the project
As we said, the structure of our analysis will be an adapted CRISP-DM methodology. In this
subsection we are going to define the main steps that we are going to develop and to motivate the
choices made in each section.
The development will follow these main steps:
1) Business Understanding
The project is proposed by an external company, therefore it is necessary to understand what they
need and to state a set of goals we want to achieve. At the beginning the goal was to predict the
future performance of an athlete given the data from training sessions. After the data visualization
phase (in the second step) and some considerations about the process behind the performance, we
decided instead to find out which match statistics are the most important. We made this decision
because we believe that behind a match performance there are a lot of “unobservable variables”,
therefore it would be tough to predict the match performance directly from the training data.
2) Data Selection
One of the crucial parts is understanding, selecting and cleaning the data. The algorithm will be
chosen based on the considerations produced at this stage, and here we also define which data we
are going to use and why.
2.1 Data processing:
The company provided some data from training sessions, but after some analysis we decided to use
mostly data from past matches, available on the internet. Therefore it is necessary to:
2.2 Select the data
There is a large amount of information about matches; it is necessary to define what can be used
and how to collect this information.
2.3 Visualize the data
To check whether there are relevant relationships and to understand the nature of the data, such as
the distribution underlying the process.
2.4 Create a database
It is better to have a flexible and comprehensive way to input the data into our model, therefore
we decided to gather all the data in a single database.
3) Create a Model
3.1 Select the model
After clarifying the goals and the data it is necessary to select appropriate models, both for the
prediction and to find out the most important statistics. For the prediction we selected the
Support Vector Machine (SVM) and to find out the most important statistics we implemented a
Sensitivity Analysis.
3.2 Implement the model
To implement this model we chose a scripting language, Python, because good libraries which
implement statistical analysis are available, and it was also possible to call Weka, a data mining
tool, and collect its output in a relatively easy way. In particular we chose to use the SVM
algorithm already implemented in Weka, which is a widely used program for data mining. To implement
the sensitivity analysis we used a Monte Carlo simulation, changing the inputs (the match
statistics) and observing the changes in the output.
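As an illustration, a Weka classifier can be launched from Python through the command line and its
output collected. The sketch below assumes a jar location and ARFF file names that are purely
illustrative (SMO is Weka's SVM for classification, SMOreg its regression variant):

# Sketch: calling Weka from Python through the command line and collecting its
# output. The jar path and ARFF file names are illustrative assumptions.
import subprocess

WEKA_JAR = "weka.jar"                  # assumed location of the Weka jar
TRAIN_ARFF = "knicks_train.arff"       # hypothetical training set
TEST_ARFF = "knicks_test.arff"         # hypothetical test set

cmd = [
    "java", "-cp", WEKA_JAR,
    "weka.classifiers.functions.SMO",  # Weka's SVM (SMOreg for a numeric target)
    "-t", TRAIN_ARFF,                  # training file
    "-T", TEST_ARFF,                   # test file
]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)                   # Weka writes the evaluation summary to stdout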
4) Results presentation
After collecting the results from the analysis it is necessary to select a method to present them.
This part is very important because it allows external people to take advantage of our work.
4.1 GUI creation
To provide better feedback the Apollo company asked to test the program for the sensitivity
analysis, therefore we decided to provide a Graphical User Interface (GUI), as using the original
command line is not user friendly. At the beginning the program was designed to analyze the data
and output the results of the analysis as comma separated values.
4.2 Data mining report
We created some graphs to visualize the results and we gave an interpretation of them.
5) Evaluation
The evaluation gives an indication of how successful the project was and whether it is possible to
continue this work or use this framework and the results in further projects. It is fundamental to
communicate neutrally how successful and useful the results are. To achieve this we analyze the
results from two different perspectives, the mathematical one and the business one.
5.1 Evaluation of the results from a mathematical point of view
To assess this project we are going to use some statistical properties of our results, which give
us an indication of the significance of the results.
5.2 Evaluation of the results from the business perspective
In this section we are going to study the impact of our project on the company's business. We did
not have time to evaluate its impact in the medium or long term, therefore we decided to produce a
questionnaire and collect the results to receive feedback.
1.6 Project schedule and progress report
After the first plan we changed the schedule substantially, twice.
The first change was on the 17th of June, when we changed direction for the data mining analysis:
we decided to perform the Sensitivity Analysis, to create a database and to use Python.
The risks in this project are quite high because there are no publications on this dataset, so
there is no precise indication about how to proceed. For this reason we decided to maintain a
back-up plan: if we completed the analysis before the 8th of August we would try to link the
training performance with the match performance. This plan was changed because Ram asked for a user
interface, to test the program on his own and provide useful feedback. In Appendix C it is possible
to find the original Gantt chart and the final one (only the part which differs from the previous
version).
The key activities are: data visualization, sensitivity analysis, exploring a method to predict the
result, and the creation of a database. The concurrent research covers all the technical material
studied in order to use the tools needed, for example the Python documentation (especially for the
NumPy library), Weka, Matlab, etc.
We also set some milestones to keep track of the analysis progress:
8th of July: database ready; this includes the scripts to acquire the data.
25th of July: presentation with the assessor; methodology chapter finished and alpha version of the
program for the sensitivity analysis completed.
8th of August: presentation of the beta program to Ram and start of the writing up.
15th of August: creation of an executable file to receive feedback from Ram.
29th of August (actual date: 30th of August): collection of the feedback and finishing the writing up.
Chapter 2:
Background
The main problem in this project is to understand how the variables affect one another. For
example, we expect that when a player grabs more offensive rebounds he will score more points, but
it could also mean that the player is playing in a different position: this could affect the
performance and could be bad for the team. In general a data mining application is not only
applying some mathematics and collecting the results; it is also necessary to understand the data.
If we have a good understanding of the data it is possible to select the appropriate technique to
process them and then “read” the results correctly [2].
We have identified three main areas of our literature review:
1. Sport science: the study of athlete’s performance.
2. Data Mining: a set of techniques which allow patterns to be extracted from data.
3. Data Visualization: the study of the visualization of data, to allow the human user to have a
better understanding of it.
Our project will touch on all these points, but in different ways; in particular we are going to
focus on the data mining part. Our problem is stochastic, which means that it is better to model
the process not with an exact function but with a probability distribution; it is time dependent,
because the present state depends on the previous states; and the input is noisy.
2.1 Sport science
Sport science is the discipline that studies sport using scientific tools to improve sporting
performance. For this project we will use only a small part of this broad field, Performance
Analysis. We are also going to have some meetings with sport scientists to ensure that we are
studying this problem from the right point of view.
The literature in sport science is not homogeneous: we have a lot of different sources, some of
them websites and formulas from companies.
This literature review has no direct impact on the final program, as for our project we decided to
use match data rather than training data; however, it is fundamental to understand the context of
our problem and also to understand why, before assessing the trainings, it is necessary to assess
the matches.
2.1.1 Heart Rate Zones
A very informative variable is the Heart Rate (HR), which is defined as the number of heart beats
per minute. It is used by doctors in the diagnosis and tracking of medical conditions. The Heart
Rate is proportional to the oxygen needed by the muscles, so it can be used to gain the maximum
efficiency from the training [6][5].
It is possible to find some very useful information for our project [25] by studying the heart rate.
It is possible to divide the HR into four zones:
• The Energy Efficient or Recovery Zone - 60% to 70%
• The Aerobic Zone - 70% to 80%
• The Anaerobic Zone - 80% to 90%
• The Red Line Zone - 90% to 100%
Being in each of these zones changes the effects of the training. Because of this we suppose that
the HR is the most informative variable that we have. We need to understand how the combination of,
and the time spent in, these HR zones can influence the training, and how it is possible to
understand the physical condition of a person.
The heart rate zones are used to prepare an effective plan for specific goals (like losing weight
or improving aerobic performance), as sketched in the example below.
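The snippet below shows how a heart-rate sample could be assigned to one of these zones. The
maximum heart rate estimate (220 minus age) is a common approximation and is an assumption here,
not a value taken from the project data:

# Sketch: assigning a heart-rate sample to one of the four zones described above.
def hr_zone(heart_rate, age):
    max_hr = 220 - age                 # rough estimate of the maximum heart rate (assumption)
    pct = heart_rate / max_hr * 100    # intensity as a percentage of the maximum
    if pct < 60:
        return "below training zones"
    elif pct < 70:
        return "Recovery Zone"
    elif pct < 80:
        return "Aerobic Zone"
    elif pct < 90:
        return "Anaerobic Zone"
    else:
        return "Red Line Zone"

print(hr_zone(150, 25))  # e.g. 150 bpm for a 25-year-old athlete -> "Aerobic Zone"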
2.1.1.1 Recovery Zone
The Recovery Zone is useful for endurance and aerobic capacity. In this zone we burn fat and
re-energise the muscles with glycogen (expended during faster workouts).
2.1.1.2 Aerobic Zone
This zone is useful to develop the cardiovascular system. It is the zone where the blood transports
oxygen to, and carbon dioxide away from, the muscles. Working in this zone increases fat burning,
the muscles can be developed and improved, and the aerobic capacity also improves.
2.1.1.3 Anaerobic Zone
Training in the Anaerobic Zone will develop the lactic acid system. Here the individual anaerobic
threshold (AT) is found. This value can be very important to determine the efficiency of the
training, but we do not have it in our data. In this zone glycogen is the main source of energy, so
the muscles produce lactic acid. When the blood cannot remove the lactic acid quickly enough, the
anaerobic threshold is reached. With correct training it is possible to increase this threshold.
2.1.1.4 Red Line Zone
Training in this zone is only possible for short consecutive periods. It effectively trains only
the fast muscle fibres and it helps to develop speed but not endurance. This zone is reserved for
interval running and the body must be fit to train effectively within it. As it pushes the body to
its limit, it is necessary to balance short periods of training in this zone with training in the
other zones.
2.1.2 Fatigue measure
To understand how an athlete performs during training, the Body Load measure has been introduced.
The Body Load is calculated as a function of multiple factors, but basically it depends on the
distance covered by the athlete during a training session and on his weight [16]. As we said
earlier, the effect of training depends on the Heart Rate, therefore it can be useful to implement
another fatigue measure which takes into account how much time the athlete spent in each particular
HR zone and also how much consecutive time he spent in that zone. This involves the study of an
appropriate visualization technique. This study was done at a very early stage of the project but
it has not been developed further because of the change in the aims.
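A possible sketch of such a measure, assuming the sensor stream has already been converted into a
sequence of zone labels (the weights per zone are illustrative assumptions, not values from the
literature):

# Sketch of a possible fatigue measure based on time spent in each HR zone.
from collections import defaultdict
from itertools import groupby

SAMPLE_DT = 0.1  # the sensor records a sample every tenth of a second

def zone_times(zones):
    """Total and longest consecutive time (in seconds) spent in each zone."""
    total = defaultdict(float)
    longest = defaultdict(float)
    for zone, run in groupby(zones):
        duration = sum(1 for _ in run) * SAMPLE_DT
        total[zone] += duration
        longest[zone] = max(longest[zone], duration)
    return total, longest

def fatigue_score(zones, weights):
    """Weighted sum of the time spent in each zone (the weights are assumptions)."""
    total, _ = zone_times(zones)
    return sum(weights.get(z, 0.0) * t for z, t in total.items())

# Illustrative weights: harder zones contribute more to the fatigue score.
weights = {"Recovery Zone": 1, "Aerobic Zone": 2, "Anaerobic Zone": 4, "Red Line Zone": 8}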
2.2 Overview of data mining
Data mining is deeply rooted in statistics. We can see it as an evolution of statistics, as a
method to find the reasons behind relations. For example, statistics, using covariance, tells us
the strength of the relationship between two variables; this is not enough to understand why the
relationship exists or the impact it may have [17]. Data mining does not look only at the numerical
relationships themselves, but investigates the data to understand their causes. It does so through
an interactive, iterative and/or exploratory analysis of the data [7]. Two examples of the
statistical part of data mining are methods like regression analysis and discriminant analysis, but
there is also a more sophisticated (and more complex to implement) branch from Artificial
Intelligence, with methods like Genetic Algorithms and Neural Networks.
2.2.1 Why data mining in sports?
Data mining as statistical analysis in sports started to become popular with the book Moneyball [2],
which describes how the low-budget Oakland A's competed with teams that were spending twice their
budget. This book is considered a milestone in sport analysis because it started to look at
statistics not as a mere collection of numbers but as a set of critical information.
We did not find any previous work exactly on our problem, because this field is relatively new, but
it is possible to understand the general approach to a Data Mining problem in sports.
Data Mining is already used in a lot of sports, especially the most popular ones like baseball, but
it is adopted in different ways and to different degrees, from football to baseball [25].
2.2.2 Criticism
Some sport specialists argue that sport is mainly based on “heart” and passion, that this is the
real component of a good performance, and that it is not measurable with statistics. The most
critical are former players and commentators [7]. We can argue that this point of view is only
partially true, mostly for two reasons:
1. It is possible to measure the physical condition of a player, and performance is generally
highly dependent on the physical condition.
2. For each player we have some expectation about his performance, even without data mining, but
with these techniques it is possible to reduce the uncertainty.
So we believe that Data Mining can help to evaluate, improve and predict sports performances.
There are also some positive examples that prove this: first of all the Oakland A's, but also the
Los Angeles Dodgers in the 2004-2005 seasons and the Red Sox in the 2003 season.
In the following sections we are going to analyze the data mining techniques used in past work on
data mining in sports and why they are or are not suitable for our purpose. The final decision
about which technique to use will be made in the Methodology chapter (Chapter 3), following the
CRISP methodology.
2.2.3 Models for Data Mining in sports
Data Mining is a very broad field with lots of different techniques. It is possible to divide them
into three main groups:
1) Statistics
2) Heuristic-based approaches
3) Machine Learning
Each of these techniques has some advantages and some disadvantages; in the following paragraphs we
are going to analyze some methods from these categories and evaluate them for application in our
project.
2.2.3.1 Statistics
Statistics are very common in many different fields, from environmental protection to quality
control analysis. A very basic form of statistics, descriptive statistics, is very widely used in
basketball and in sport in general to summarize the performance in a game. Descriptive statistics
is the basic form of analysis; it includes sample indicators like the mean, standard deviation,
maximum and minimum.
The advantage of these techniques is that they are easy to implement and they give general
indications about the data. In their basic form they can be used to get an indication of the kind
of variables we are considering, but they cannot give us enough information [8]. They help to
identify the main characteristics of the data and whether there are outliers, and this can be one
of the first steps of our analysis. In our project they are the data themselves, as we are
considering the stats from the games, so for sure we have to deal with statistics.
2.2.3.1.1 Regression analysis
Regression analysis fits a line or polynomial to the data points, trying to minimize the total
error between the model and our data. The idea behind regression analysis is to evaluate the
relationships between the data by checking whether the points follow the line of a particular
function. The most common regression is the linear one, which checks whether the points lie on the
same line; the coefficients are obtained from the points themselves.
From a regression analysis, causal relationships can sometimes be discovered between dependent and
independent variables. Most often, this type of analysis is used for prediction: if the observed
trend holds true, estimates of the dependent variable's value can be made by varying the values of
the independent variables [8].
This technique has been used successfully in the prediction of game results in football [27],
therefore we decided to use a linear model, but we are looking for a more flexible methodology,
because we know that the nature of the problem implies a lot of variance.
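A minimal example of a linear fit of this kind, using NumPy; the arrays below are made up for
illustration and are not real match data:

# Fit a line between one match statistic and the points scored.
import numpy as np

field_goal_attempts = np.array([15, 18, 20, 12, 22, 17], dtype=float)  # hypothetical
points_scored = np.array([18, 24, 27, 14, 30, 22], dtype=float)        # hypothetical

slope, intercept = np.polyfit(field_goal_attempts, points_scored, deg=1)
predicted = slope * field_goal_attempts + intercept
residuals = points_scored - predicted
print(f"points ~ {slope:.2f} * FGA + {intercept:.2f}, "
      f"sum of squared errors = {np.sum(residuals**2):.1f}")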
2.2.3.1.2 Time series analysis
Time series analysis is a set of methods [42] useful for extracting information from time series
data. A time series is a sequence of data points, usually measured at successive instants from the
same phenomenon: an example could be the records of the maximum sea level in a bay. Time series
analysis can be used to calculate the probability of reaching a specific maximum in future years.
This kind of analysis is widely used to predict future performance, especially in finance,
econometrics and signal processing. It has also been used in sports, for example to predict the
results in the English Premier League; the accuracy was quite good, with 80% of losses and 90% of
both draws and wins correctly predicted.
There are a lot of different ways to analyze time series, some of which are:
1. graphical analysis
2. autocorrelation analysis
3. spectral analysis, which spots cyclic behavior not related to seasonality
4. decomposition of time series
5. calculating the marginal distribution from the data
6. statistical models to predict future performance, like the autoregressive moving average
(ARMA), used to predict the football results of the Premier League [43].
For our analysis this is probably not the best technique, as we are looking for results at a lower
level than the final result: we want to analyze the most important factors which determine a good
result. On the other hand, we think that this technique seems well suited to studying the data from
training sessions, as those data are daily and depend mostly on the athlete's condition and less on
external factors in comparison with the match results.
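As a small example of one of the analyses listed above, the snippet below computes the sample
autocorrelation of a daily training series at a few lags; the series is synthetic and used only for
illustration:

# Simple autocorrelation check on a (synthetic) daily training-load series.
import numpy as np

def autocorrelation(series, lag):
    """Sample autocorrelation of the series at the given lag."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

daily_load = np.array([300, 320, 310, 280, 330, 340, 300, 290, 310, 325], dtype=float)
for lag in (1, 2, 3):
    print(f"lag {lag}: autocorrelation = {autocorrelation(daily_load, lag):.2f}")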
2.2.3.2 Heuristic Approaches
The second branch of data mining, heuristic approaches, differs from statistics by applying a
heuristic algorithm to the data [7]. Heuristics are used to find a good solution quickly: instead
of looking for the best solution, they reduce the domain and find a “good” one. This approach tries
to balance out the statistics by applying a human, problem-oriented perspective to problem solving,
based upon educated guesses or common sense [7]. Nowadays this kind of approach is used only for
very complex problems which require a lot of computational power.
This approach is not very common, as the computational complexity of the problem usually does not
require this compromise. However, it has been used to create training timetables which optimize the
benefits obtained by training in a specific Heart Rate zone [40].
For example, to create a timetable for a certain number of exercises, the heuristic tries a lot of
possible combinations, selecting the combination with the lowest cost. The cost is defined so as to
take undesirable solutions into account: for example, if two exercises have the same starting time
the cost will be high, because that solution reduces the number of exercises in that training
session, which we do not want. Computing an optimal solution can be very time expensive if there
are a lot of constraints, therefore in this case a heuristic approach is very useful.
Our problem is not so computationally expensive, so the heuristic approach is not necessary.
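The following sketch gives a flavour of this cost-based search: random candidate schedules are
generated and the one with the lowest clash penalty is kept. The penalty value and the slot layout
are illustrative assumptions, not details taken from [40]:

# Toy heuristic search for a training timetable: penalise clashing start times.
import random

def schedule_cost(start_times):
    """Count clashes: a high cost means exercises overlap and the session shrinks."""
    clashes = len(start_times) - len(set(start_times))
    return 100 * clashes  # arbitrary penalty per clash

def random_schedule(n_exercises, slots):
    return [random.choice(slots) for _ in range(n_exercises)]

slots = list(range(8))  # eight possible starting slots in a session (assumption)
best = min((random_schedule(5, slots) for _ in range(1000)), key=schedule_cost)
print(best, "cost:", schedule_cost(best))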
2.2.3.3 Machine learning
Machine learning uses algorithms to find the knowledge behind the data [17]. To discover this
knowledge we can use two different approaches: supervised or unsupervised learning. The choice
depends on the data available: if we already have some labelled data we can use supervised
learning, otherwise we must use unsupervised methods.
Our problem can be interpreted as either of these cases, depending on the outputs we consider, as
it is very difficult to define the concept of “good” for a performance. A team can perform poorly
but win the match, or the opposite; from this point of view our problem is an unsupervised learning
task.
On the other hand, if we consider that the main goal is to win a game, we can say that a good
performance is when our team wins, or when the difference in points between our team and the
opponent is high. In this case the statistics are the input and the difference in points is the
output; in this case it is possible to use supervised learning algorithms.
As the project is vague we have to define the problem and choose an appropriate algorithm; in the
following paragraphs we analyze the algorithms with respect to our needs.
2.2.3.3.1 Neural networks
Neural networks are another machine learning technique that can learn to classify from data
patterns. A very common type of neural network is the Back Propagation Neural Network (BPNN). This
is a supervised learning technique which modifies itself to fit the training data and can be used
to make predictions about testing data. BPNNs are built by analogy with the human brain: they
imitate synaptic activity by modeling the network of neurons and axons as perceptrons, using
weighted links between the perceptrons. These weights are used to classify new data: during
training they are modified to minimize the error between the output and the correct classification,
during testing they simply classify the data. BPNNs can also recognize complex patterns.
NNs are one of the most predominant machine learning systems in sports for learning hidden trends
in the data. They have been used, for example, by Rotshtein to predict Finland's soccer
championship (performing better than Genetic Algorithms) [14], and they have been successfully used
in football result prediction [41]. Another interesting work on soccer prediction shows quite good
results, but we do not think NNs are suitable for our project because our data are very noisy and
come from a probability distribution, which NNs do not handle well.
Moreover, we need a flexible algorithm, as our problem, performance prediction, is characterized by
high uncertainty. Our output should be a probabilistic result rather than a precise classification,
therefore NNs are not suitable for our goals.
2.2.3.3.2 Self-organizing maps (SOMs)
Self-organizing maps (SOMs) are a particular type of neural network used to represent and visualize
data with a large number of dimensions in a lower-dimensional space, usually a two-dimensional
plane. This visualization can be very useful to identify trends which in the original data are
hidden by the high number of dimensions. It is used, for example, in sport biometrics [41]. In our
project it would be possible to use this visualization technique to understand the main
relationships in the data, but we chose some simpler methods, because they are more standard and
mostly already implemented in tools; the benefit of using this technique was not enough to pay back
the effort required.
2.2.3.3.3 Bayesian statistics
Bayesian statistics is useful for computing the probability of a condition given the event that
happened (the reverse of the conditional probability). These probabilities include the prior
probability, a measure of the probability of a state condition being met before the introduction of
data; the conditional probability of seeing that data if the state is correct; the marginal
probability of the data considering all states; and the posterior probability, which measures the
belief in the state condition being met after seeing the data [8]. This methodology could be used
for our project because it is possible to retrieve information about past matches from the internet
and use it to find the probability that the athlete is in a certain physical condition. It also
helps in revealing hidden relationships in the data and in studying the effect on the other
variables. This approach was used to predict baseball results in 2004 [42].
In particular, dynamic Bayesian networks can be very useful to predict time series, therefore they
are particularly suitable for our project. In the end we did not choose this methodology, but it
could be interesting to test its performance against the performance of the chosen methodology.
[Figure: an example of a Bayesian network]
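As a small illustration of this reasoning, the snippet below applies Bayes' rule to estimate the
probability of a win after observing a high-scoring game from one player. All the probabilities are
invented for the example and are not taken from the project data:

# Bayes' rule: posterior belief in a win after seeing a high-scoring game.
prior_win = 0.55                  # prior probability of winning a game (assumed)
p_high_given_win = 0.60           # P(player scores 25+ points | win), assumed
p_high_given_loss = 0.35          # P(player scores 25+ points | loss), assumed

# Marginal probability of observing the high-scoring game.
p_high = p_high_given_win * prior_win + p_high_given_loss * (1 - prior_win)

posterior_win = p_high_given_win * prior_win / p_high
print(f"P(win | 25+ points) = {posterior_win:.2f}")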
2.2.3.3.4 Monte Carlo Simulation
Monte Carlo simulation is a very good method for imitating real-life scenarios, in particular when
they involve lots of uncertain parameters [20]. The method consists of simulating an event by
selecting some inputs, entering them in a given model and looking at the output. By running the
simulation many times it is possible to obtain a robust estimate of the result. This procedure
imitates the process of sampling from a distribution a certain number of times: the mean of the
simulated results will be a good estimate of the mean result of the process.
It is used in many fields, such as physics, finance and marketing, to calculate, for example,
complex integrals.
The main steps of the method are:
1. Create a parametric model
2. Input the data into the model
3. Collect the results
It is necessary to repeat these steps N times, where N is a number which guarantees that the
results are stable. This means that the distribution of the results will not change much when
increasing or decreasing N by a percentage (say 20%).
Using this method it is not necessary to specify every relation, which is very useful for our
problem, as it is impossible to model all the relationships in it. Moreover, our problem involves a
lot of uncertainty, therefore instead of trying to calculate an exact solution it is better to have
a probabilistic one.
The Monte Carlo simulation is based on the assumption that it is possible to estimate the
probability of an event from the number of times that it occurs. This assumption is supported by
the law of large numbers. The most evident shortcoming of this method is that the convergence to a
stable solution is very slow compared with other methods.
Another advantage of the Monte Carlo technique is that the method is quite simple: there are
refinements such as variance reduction which are complicated, but the basic algorithm is quite
simple. Our project does not need these complicated techniques, so it is possible to implement it
on our own.
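The sketch below follows this recipe in Python: the inputs are drawn from assumed Normal
distributions, pushed through a toy model of the points scored, and the run is repeated many times.
All distributions, parameter values and the model itself are illustrative, not the ones used in the
project:

# Minimal Monte Carlo sketch: sample inputs, run a toy model, collect the results.
import numpy as np

rng = np.random.default_rng(0)

def simulate_points(n_runs=10_000):
    # Assumed distributions of two inputs (field-goal attempts, free throws made).
    fga = rng.normal(loc=17, scale=3, size=n_runs)
    ft = rng.normal(loc=5, scale=2, size=n_runs)
    # Toy model of the output: roughly 45% of attempts worth 2 points, plus free throws.
    return 2 * 0.45 * fga + ft

points = simulate_points()
print(f"estimated mean points: {points.mean():.1f} +/- {points.std():.1f}")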
2.2.3.3.5 Finite state machines
A finite state machine is a mathematical model that can be used to describe any system with a
finite number of states [12]. From a particular state it is possible to move to other states with a
certain probability, depending on the input. This method has been successfully used in baseball to
find the optimal pitch [13].
This method could be used for our problem because it can model the training condition very well,
and we know that there is a certain probability of moving from one state to another. A problem with
this method is identifying the states, because they can vary over time and from one person to
another.
This method could therefore be used in our solution, but for the moment we decided to adopt other
methods; it could be interesting in further work to apply it and compare the performances.
2.2.3.3.6 Sensitivity Analysis
Sensitivity analysis is a technique used to study the impact of each variable. It is used when the
function is unknown and we want to study how the variables affect the results. This can be very
useful when it is not possible, or not convenient, to study the function directly, so it is
necessary to use a “black-box” approach: looking only at the inputs and the outputs and using the
results to understand the interesting aspects of the system. To apply this method it is necessary
to be very careful about the inputs, because the behavior of the function is unknown, therefore it
is only possible to draw conclusions within the range of the inputs [21]. This seems very useful
for our project, to understand which are the most important variables.
There are several possible procedures to perform uncertainty analysis (UA) and sensitivity analysis
(SA). The most important classes of methods are:
• Local methods, for example a simple derivative of the output with respect to an input factor.
• Sampling-based sensitivity analysis, where the values are sampled from a distribution.
• Methods based on emulators (for example Bayesian ones). Here the problem of evaluating the
output is treated as a stochastic process and it is estimated from the available data points,
generated by a computer.
• Screening methods. These are based on sampling the instances and are used to estimate a few
active factors in models with many factors. Our problem is not so large, therefore they are not
useful for our purpose.
• Variance-based methods. Here the unconditional variance V(Y) of Y is decomposed into terms due
to individual factors plus terms due to interactions among factors (if the input factors are
independent from one another) [52].
• Methods based on Monte Carlo filtering, which are sampling-based. The objective here is to
identify the “regions” of the input space where the output has a particularly high response, so it
is possible to find out which variables influence the results most, in our case which statistics
influence the result most [53].
We decided to use the last of these methods to find out which statistics influence the result most.
Here the statistics are used as an indication of the way of playing of a particular player.
Obviously a game is an interaction between the players, and there are complex relationships between
them: relationships between players of different teams, but also between players of the same team
and other factors, like the fans; it is almost impossible to model all these interactions.
It is therefore necessary to find a way to simplify this analysis: we want to understand the main
factors that reveal the behavior of the player and the key factors of the performance which lead to
a victory.
Our sensitivity analysis will perturb the statistics and then observe the result, so it will be
possible to identify the correlation between the total number of points and each variable.
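As an illustration of this perturbation scheme, the sketch below varies one statistic at a time
around a baseline and records how much a predicted score moves. The linear predictor and the
baseline values are invented stand-ins for the trained model and the real data:

# Perturb one statistic at a time and observe the spread of the predicted output.
import numpy as np

def predict_points(stats):
    # Stand-in for the real model: a fixed linear combination of the statistics.
    weights = {"FGA": 0.9, "FTA": 0.7, "ORB": 0.3, "AST": 0.4}
    return sum(weights[k] * v for k, v in stats.items())

baseline = {"FGA": 17.0, "FTA": 6.0, "ORB": 2.0, "AST": 3.0}  # invented baseline
base_prediction = predict_points(baseline)

rng = np.random.default_rng(1)
sensitivity = {}
for stat in baseline:
    deltas = []
    for _ in range(1000):
        perturbed = dict(baseline)
        perturbed[stat] += rng.normal(scale=0.1 * baseline[stat])  # roughly +/-10% noise
        deltas.append(predict_points(perturbed) - base_prediction)
    sensitivity[stat] = np.std(deltas)  # spread of the output caused by this input

for stat, s in sorted(sensitivity.items(), key=lambda kv: -kv[1]):
    print(f"{stat}: output std = {s:.2f}")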
2.2.4 State of the Art in Basketball Data Mining
As we said, different sports have different implementations of data mining. In basketball there are
a lot of descriptive statistics from the games, like average points per game, average rebounds per
game, etc.
The most common statistics are:
1) shot zones: the court is divided into 16 different zones and a lot of data is collected from
matches. With this method it is possible to find the most convenient way to place the defense.
2) player efficiency rating: gives information about the contribution of each player in a match. It
counts, for example, the points made, but also the shots attempted, penalizing players with low
percentages.
3) plus/minus rating: this statistic is calculated from the points that the team scores while that
particular player is on the court [8].
There are also some studies to predict future performance [28], but they are not used to find the
most important characteristics of this sport and improve the training. This method uses a simple
statistical simulation but it is very sensitive to the players available: if one of them is
injured, for example, the system does not work properly.
As we can see, data mining in this sport is at an early stage, therefore our project must also
adopt an exploratory methodology and we must look at more developed fields to find useful material.
2.3 Data Visualization
Data Visualisation can be defined as the study of the visual representation of data. Its main goal
is to communicate information through a graphical representation.
It is a fundamental step in a knowledge discovery project.
2.3.1 General Data Visualization
In our project data visualisation is very important, especially in the first stage, when we want to
understand the main relationships in the data and their main characteristics.
Some very simple visualization techniques are line graphs, which are very good for seeing trends.
It is possible to modify the scale of the x and y axes, for example by taking the logarithm, and
check the behavior of the data. To understand whether the data follow a particular distribution it
can be helpful to visualize the histograms.
It is possible to highlight the relationships within the data simply by using a parallel
coordinates representation. Using tools like the xmdv tool it is possible to discover some
non-trivial relationships; for example, it is possible to find out that there is correlation only
in a certain area of the graph. To use this tool we need to transform the data into the xmdv format.
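A similar parallel-coordinates view can also be produced with pandas and matplotlib, as in the
sketch below; the column names and values are illustrative, not the real match data:

# Parallel-coordinates plot of a few per-game statistics, coloured by match result.
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

df = pd.DataFrame({
    "PTS": [25, 31, 18, 22], "TRB": [9, 7, 11, 8],
    "AST": [2, 4, 1, 3], "TOV": [3, 2, 4, 2],
    "result": ["win", "win", "loss", "loss"],   # class column used for colouring
})
parallel_coordinates(df, class_column="result", color=["red", "blue"])
plt.show()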
2.5 Deriving knowledge
To be of practical value, data needs to be transformed by identifying relationships [9] or limited
to only that which is relevant to the problem at hand [10][7].
To visualize this problem it is possible to use the DIKW (Data, Information, Knowledge, Wisdom)
hierarchy.
This data transformation produces information, which can be characterized as meaningful, useful
data [11][8]. The data stops being simple data and becomes information.
The next level of the hierarchy, knowledge, can provide additional meaning by identifying patterns
or rules within the information. The first three levels can be viewed as the evaluation of our
data, but we want to understand even more: we want to be able to predict future performance.
Here comes the final level of the DIKW hierarchy: wisdom can be viewed as a grasp of the overall
situation [9], which can predict the future situation using the knowledge. It represents the
ability to use the knowledge reached in the previous steps.
[Figure: the DIKW pyramid, image from Wikipedia [15]]
As we can see in the pyramid, the final step is the most important and useful goal of the learning,
but it also requires a lot of data transformation. This level should be the ultimate goal of all
learning methodologies and it appears as the natural evolution of our work.
Therefore we decided to study the layer of the players' performances (the statistics) rather than
the link to the players' condition, as there are too many unobservable variables between these
layers.
We assume that the statistics are observed values which come from a statistical distribution. Using
some visualization tools such as ParaView and the Weka preprocess panel we can see that it is
reasonable to assume that the points and the other statistics follow a Normal distribution. If
there are more than 30 examples and there are no outliers, the Normal distribution is generally a
good approximation. For the players of our interest the data points are many more than 30, so it is
possible to use the Normal distribution and its tests.
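A quick way to check this assumption on a single statistic is a normality test, as in the sketch
below; the sample is synthetic, while in the project the values come from the database of 2010/2011
matches:

# Check the Normal-distribution assumption for a points-per-game sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
points_per_game = rng.normal(loc=25, scale=6, size=60)  # synthetic sample, n > 30

statistic, p_value = stats.normaltest(points_per_game)  # D'Agostino-Pearson test
print(f"p-value = {p_value:.2f}; the Normal assumption is "
      f"{'reasonable' if p_value > 0.05 else 'doubtful'} at the 5% level")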
Our approach will be to predict the number of points from the statistics, then change the
statistics and perform a sensitivity analysis. The sensitivity analysis is the methodology we are
going to use to identify the most significant statistics.
The steps to do that are:
1. find the most relevant variables for our analysis. This phase is composed of:
1. data visualization
2. find a good criterion to define success in basketball
3. find the variables which represent that criterion
2. find a good way to predict the output from these variables
1. find an appropriate algorithm to use
2. find the appropriate tools
3. find the most important variables
1. collect an adequate number of results
2. compare these results
3. evaluate the results
Chapter 3
Methodology
To choose the appropriate methodology we took into consideration three main factors:
1. the nature of the problem
2. the relatively short period of time in which the project had to be completed
3. the lack of previous research on this dataset
3.1 CRISP Methodology
As we are doing a data mining project we decided to apply the CRISP-DM methodology. CRISP-DM is the
acronym for Cross-Industry Standard Process for Data Mining. This methodology consists of six steps
with some feedback loops; it is a consolidated methodology especially designed to give a
general-purpose framework for a data mining analysis.
It is composed of six main steps:
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment
The process is not linear: often it will be necessary to go back to a previous step with the
feedback, so that it is possible to adjust the analysis. This is a very important aspect, as the
analysis is a process that is discovered little by little: it needs to be adjusted as hypotheses
are tested and new information is gathered. There are three main feedback loops to adjust the
model: from data understanding to business understanding, from modeling to data preparation, and
from evaluation to business understanding again; in our project we experienced the utility of these
feedbacks.
The model is flexible: sometimes it is necessary to change the order or skip some steps, but it is
a very general method which can be applied in very different contexts.
In the following chapters we are going to explain how we used these steps in our project.
3.1.1 Business Understanding
The aim of this phase is to clarify what the client really wants to accomplish. This project is
with an external company, Apollo MIT, which is providing data from the trainings, augmented with
data from the internet.
Probably the most crucial step is to translate the customer's requirements into a technical
specification, because a different background culture can lead to a result which does not match the
client's needs.
Therefore at the beginning of this project we met the chairman of the company to discuss what they
expect to learn, and we had multiple meetings to check that the project was going in the right
direction. Their long-term goal is to evaluate the performance of a player, so that it is possible
to assess the effectiveness of a training session, choose an appropriate player rotation in a game
and predict future performances.
To perform all these analyses and predictions, first of all it is necessary to know how to assess a
generic performance for a generic player. In other words, we want to define what is “good” in a
basketball game.
We obviously have the statistics, but it is not trivial to say how these statistics lead to a
victory, which are the most important, or which ones we want to work on in training to improve.
The business objective can be defined as: “understand the game characteristics that lead to a
victory”, which can be translated into the data mining objective: “find the correlation between
winning and the players' statistics, given the statistics of the 2010-2011 NBA season”.
We consider this a successful project if we can provide information about the performance of an
athlete.
3.1.2 Data Understanding
The initial data received from the company were from the trainings of two teams: Manchester United
and the New York Knicks.
After a brief analysis of these two different problems we decided to focus our attention only on
the New York Knicks' data, mainly for three reasons:
1. The game is simpler: there are only five players on the court at a time and a total of 12 in a
game.
2. The statistics are much more descriptive.
3. My personal experience: I played basketball for 11 years.
The data come from a sensor which measures the heart rate and the GPS position. It records the new
parameters every tenth of a second. A training session usually lasts more than an hour and a half,
therefore the files are two or three hundred megabytes.
The data are exported directly from the sensor and saved in a format known as comma separated
values. This format is very simple and makes it easy to use the data with different programs: the
data are saved in normal ASCII format and the values are separated by a comma. Below is an example
of these data:
[Figure: an example of the data from one athlete, anonymised]
It is possible to notice that some data are missing here: the smoothed player load, the Heart Rate
and the exertion are not present. This can be a problem and it may cause the exclusion of this
dataset from the analysis.
We did not have the full data for the whole season, but only three days for three players,
therefore it is not possible to judge the overall data quality.
We did some basic analysis with these datasets, focusing our attention in particular on one player,
the most important one, one of the most valuable players in the NBA. Using the sport science
background reading we found out that the main [...]
[To be completed with a description of the figure]
After this visualization and some discussion it was clear that data from trainings alone are not
enough. In fact, it is possible to interpret the performance of the player as the observed value of
the athlete's condition. The athlete's condition is composed of a certain number of hidden
variables; the performance is the observable part of the athlete's condition, plus other
interactions with the external world and chance.
We can see that there are two main layers:
1. game performance;
2. training performance.
Before evaluating the training it is necessary to evaluate the match data, so that it will be
possible to understand which aspects the training should try to improve.
Linking these two parts is a very complicated task because in between there are a lot of hidden
layers, therefore it is necessary to concentrate first on the second relationship in the chain
below.
[Figure: causality chain, training performance -> game performance -> game score/outcome]
3.2 Find a good criterion to define success in basketball
In this section we are going to describe how we decided on the criteria to evaluate a basketball
performance. This task seems easy, but it has a lot of issues and there is still a lot of debate
about it. Often when people talk about sports matches it is possible to hear: “The team played
well, but they lost the game”, or the opposite, but this may not be true. Maybe our concept of
“good” relies on how much we enjoy the game and not on the effectiveness of the play. There are
also some factors that we cannot observe which may determine the victory or the defeat; that is why
an accurate analysis of the evaluation method is necessary.
One of the issues is that the performance is influenced by a lot of different factors, like the
opponent's performance, the motivation of the athletes, the physical condition and also some
uncontrollable facts that can be grouped under the term “chance”. The final performance is nothing
but the outcome of all these components, so to carry out an analysis first of all we have to reduce
our domain. We also decided to account for all the external factors only by looking at the
statistics of the team of interest.
In this simulation we decided not to consider the influence of the other team on the performance of
the team of interest; however, the framework developed is open enough to consider it in further
analysis.
An interesting extension could be to add these factors to the model, for example by creating a
classification of the other teams and putting it into our model, or, even more sophisticated, by
creating player prototypes and letting them play against our team. This method could lead to more
accurate results, but there is also the possibility that by doing this we confuse our results,
because it can increase the variability of the model.
To perform the evaluation we analyzed the correlation between the mean number of points made by the
team, the mean number of points received, and the number of wins/losses in the 2010/2011 standings.
We found that the correlation between the points made and the final position is very high; the
number of points made is much more important than the number of points received, so the attack is
the most important and delicate phase.
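A check of this kind takes only a few lines; in the sketch below the team averages and win counts
are invented for illustration and are not the real 2010/2011 figures:

# Correlation between team scoring averages and wins (invented numbers).
import numpy as np

points_scored = np.array([106.5, 99.2, 103.1, 95.8, 101.4])    # mean points made
points_conceded = np.array([105.7, 96.2, 101.9, 100.1, 98.0])  # mean points received
wins = np.array([42, 46, 44, 24, 52])                          # wins in the standings

print("scored vs wins:  ", np.corrcoef(points_scored, wins)[0, 1])
print("conceded vs wins:", np.corrcoef(points_conceded, wins)[0, 1])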
4.1 Data collection
After it was decided that the project would look at the match results, the problem was to decide
how to collect the data, and which data. On the internet it is possible to find lots of information
about NBA matches: there are sports magazines, sports betting forums, websites specialized in the
NBA and so on.
The problem with these sources is that they are incomplete, highly fragmented and in different
formats. The first step is to have an idea of what information we are looking for; then it is
possible to find the correct source.
4.1.1 Data Selection
One approach is to start with the statistics and try to predict the number of points [23]. The
standard statistics seem a good starting point to predict the number of points. Therefore we
started by considering the following player-specific statistics, which are by far the most popular
in basketball analysis:
MP = Minutes Played
FG = Field Goals
FGA = Field Goal Attempts
FG% = Field Goal Percentage
3P = 3-Point Field Goals
3PA = 3-Point Field Goal Attempts
3P% = 3-Point Field Goal Percentage
FT = Free Throws
FTA = Free Throw Attempts
FT% = Free Throw Percentage
ORB = Offensive Rebounds
DRB = Defensive Rebounds
TRB = Total Rebounds
AST = Assists
STL = Steals
BLK = Blocks
TOV = Turnovers
PF = Personal Fouls
PTS = Points
A very first analysis was made to be sure that there were no very obvious relationships. We know,
for example, that if Dirk Nowitzki scores more than 40 points it is almost certain that the Dallas
Mavericks will win the match [24]. We performed a very rough analysis using ParaView to highlight
any very clear indication. Here we can see an example of the result of this analysis:
It is necessary not to visualize too many variables at the same time, otherwise it will be too
difficult to visualize them and understand the broad picture. In the example we are not visualizing
the number of points, the shooting percentage or the date of the game.
The first vertical line represents the result of the team in a particular match; the other
statistics are the statistics of a single player, in this case Amare Stoudemire, one of the main
players of the New York Knicks. All these data are public, therefore there is no need to anonymize
them.
The data which belong to the same match are connected by a line. In red we can see all the data
from the matches won by the New York Knicks. If there were a clear relationship between victories
and one of these variables, we would see the red lines particularly concentrated in a set of values
of that variable. Here we can see that the values are very spread out, and there is no evidence of
correlation. We can see that some variables are almost always in the same range, with a few
exceptions, like for example the minutes played and the offensive rebounds. These variables follow
a Gaussian distribution, as we can see from the Weka preprocess tool, and they are not very much
related to the result of the game.
[Figure: the distribution of the number of points for Amare Stoudemire in the 2010/2011 season,
represented by a histogram. The y axis represents the number of times we observe the number of
points indicated on the x axis. It is possible to see that the mean value is 25 and the
distribution follows the Gaussian model.]
The only exceptions seem to be the number of steals and the number of three-pointers made. However,
this does not give us a good indication: these events are rare for this player and they follow
another distribution, not related to the victories or the losses.
Performing this analysis on all players, the result is that the performance of a single player
cannot determine the final result of the match.
This means that to find the final result it is necessary to consider the performance of several
players at the same time. For this reason the strategy chosen to predict the result of each game is
to predict the points made by each single player and sum them, comparing the total with the result
of the other team.
After this analysis we decided to consider more elements, as we realized that the result can be
influenced by a number of factors; for example, a particular player may score more points against a
particular team.
We decided to consider many different types of data, which can be divided into three main groups:
1. NBA season games
2. New York Knicks' players' bios
3. New York Knicks' training sessions
The problem with all these data is that there are a lot of links between them, and considering them
all at the same time would potentially confuse our model. Therefore we need a method to change the
inputs dynamically, adding or deleting information. A database is commonly used to explore
multi-dimensional data, therefore we decided to create one.
4.1.2 Create the database
The amount of data available for this project is huge, therefore we decided to limit our analysis
to the last season (2010/2011). There are lots of different sources, like for example the official
NBA website or ESPN.
There are two main problems with these kinds of sources:
1. the data are not complete: it is necessary to search different web pages, and even different
websites, to collect a complete set of data;
2. there is a lot of redundant data.
As we already said, it is necessary to select the appropriate data and build our analysis on them. This is the first analysis on this dataset, so we do not know "a priori" which attributes we should use. For this reason we know that we will probably repeat the analysis changing the attributes. This means that it is necessary to visualize the data in a multidimensional space.
As I have studied in this MSc in Artificial Intelligence, when you have different sources and different views of the same data it is useful to create a unique data warehouse.
4.1.3 Why a Database?
One of the main problems that I encountered in the first phase is that there are different sources of data, with different formats. This could result in scalability problems or in a complicated program to create an appropriate input file.
These problems are very common in many fields; for example, sales companies need to integrate data from lots of different sources, such as the data from the stores, the data from the purchasing office and so on.
The most used solution in other fields is by far the database, which is an organised collection of data that can be easily accessed, organized and managed. In particular we want to organize our data in independent entities so it will be easier to manage them, to add other entities and to have a multidimensional view.
Some of these data are deliberately redundant, because we want to avoid overly complex queries to obtain relatively simple and highly demanded data. For example, it would be possible to obtain the game results simply by adding all the points of the players in that particular game, but that solution would be quite inefficient.
This solution has another advantage: this project represents the first step of a bigger process to understand how to link game results to statistics, so it will be necessary to add other information, like for example training data. With the data warehouse it will be very easy to add this information.
Another point is that a lot of data mining tools (e.g. Weka) are designed to perform direct queries to the database (this is also possible in Matlab) and then save, if necessary, the data used in another format, such as comma separated values.
Therefore we made a Python script to scrape data from web pages and create SQL queries to insert the data in the structure of our database.
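As an illustration only, a minimal sketch of such a script is shown below; it assumes the box scores have already been saved locally as CSV files, whereas the real script parsed the web pages themselves, and the file names and column order are hypothetical.

# A minimal sketch, assuming the box scores are already saved as CSV files;
# file names and column order are hypothetical placeholders.
import csv

def row_to_insert(row):
    # Turn one row of statistics into an INSERT statement for the Stats table.
    values = []
    for item in row:
        try:
            float(item)                      # numbers are inserted as they are
            values.append(str(item))
        except ValueError:
            values.append("'%s'" % item)     # names and team codes must be quoted
    return "insert into Stats values(%s);" % ",".join(values)

def csv_to_sql(csv_path, sql_path):
    # Convert a whole CSV file into a list of INSERT statements.
    output = open(sql_path, 'w')
    for row in csv.reader(open(csv_path)):
        output.write(row_to_insert(row) + "\n")
    output.close()

# Example: csv_to_sql('knicks_stats.csv', 'populate_stats.sql')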
4.1.4 The Database
We have identified four fundamental entities:
1. the teams: the list of the players
2. the players' bio: name, weight, height, date of birth
3. the players' statistics in each game: the statistics already explained, only for the New York Knicks
4. the game results: all the statistics from every game played by the New York Knicks, including the opponent statistics
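Purely as an illustration of this structure, a simplified sketch of the table definitions is shown below; the column lists are shortened and sqlite3 is used only to keep the example self-contained, while the project itself used a full SQL server.

# Simplified sketch of the four entities; column lists are illustrative
# and much shorter than the real ones.
import sqlite3

schema = """
create table Teams (team_code text, player_name text);
create table Bio   (player_name text, weight real, height real, birth_date text);
create table Stats (player_name text, game_id integer, team_code text,
                    minutes integer, field_goals_made integer, points integer);
create table Games (game_id integer, opponent text, points_for integer,
                    points_against integer, won integer);
"""

connection = sqlite3.connect('nba.db')
connection.executescript(schema)
connection.commit()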
At the beginning we had some errors because of the format: the names must be enclosed in quotation marks, and so must the dates. After doing that we checked with some simple queries that the database was correct, and it was.
Then it was necessary to connect Weka to the database; this requires the installation of a JDBC driver and some modifications to the classpath. We had some problems at this stage because of the different settings of my machine.
When this step was completed we tried to perform some queries using Weka. We realized that Weka does not handle the date attribute, so we went back to our database and changed every date value into a numeric value. We chose to start the enumeration from the first day of the season.
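For example, a date can be converted into the number of days elapsed since the first day of the season with a few lines of Python (the season start date below is only illustrative):

import datetime

SEASON_START = datetime.date(2010, 10, 26)   # illustrative start date of the 2010/2011 season

def date_to_day_number(date_string):
    # Convert a 'YYYY-MM-DD' date into the number of days since the season start.
    game_date = datetime.datetime.strptime(date_string, '%Y-%m-%d').date()
    return (game_date - SEASON_START).days

# Example: date_to_day_number('2010-12-25') -> 60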
Then we started a preliminary analysis, this time only with the teams. With this analysis we want to understand mostly two things:
1. how to group the teams
2. what are the most important statistics
It turns out that it is possible to predict the final rank position of a team simply by looking at the average number of points. This tells us that the offence is much more important than the defence, so it is a good idea to look at the number of points first, and then eventually look at other factors.
This gives a very flexible framework to conduct experiments, so it would be very easy to add more data and even different parameters, or to select only a subset of them directly with a query from Weka. After downloading the data from the internet we created a Python script to build the queries that populate the tables.
We also decided to classify the teams, as requested by the external company, and we ended up with four clusters, taking into consideration the rank, the number of points made and the number of points received.
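As an illustration of this kind of grouping, a k-means clustering on the three attributes mentioned above could look like the sketch below; the team names and numbers are invented, and scikit-learn is used only for the example, not because it was the tool used in the project.

# Illustrative only: cluster teams into four groups using rank, points scored
# and points conceded per game. The data below are invented placeholders.
from sklearn.cluster import KMeans

teams = ['TeamA', 'TeamB', 'TeamC', 'TeamD', 'TeamE', 'TeamF', 'TeamG', 'TeamH']
features = [[1, 106.5, 99.2], [4, 104.1, 100.8], [8, 101.3, 101.0], [12, 99.7, 102.4],
            [16, 98.9, 103.1], [20, 97.5, 104.6], [25, 96.2, 106.0], [30, 95.0, 107.3]]

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(features)
for team, label in zip(teams, kmeans.labels_):
    print(team, label)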
4.5 How we create a database
The first step is to produce a database, so it is necessary to select a tool to produce it. We decided to use SQL as it is widely used to manage large datasets, it is easy to use and it is well documented, as I have learned in the module "Technique for knowledge management". It is also possible to connect the database directly with Weka.
We selected MySQL mostly because it is open source and its functionality is comparable to the alternatives.
As the sources of data are not uniform, to create the database we decided to write a Python script to convert the data from the internet into a set of SQL queries, which can be executed to populate the database. We are not going to explain the process in detail as it is not a crucial part of this analysis. Here is an example of these SQL queries:
insert into Stats values('Amare_Stoudemire','1','NYK',1,'59',7,16,.438,0,0,0,5,6,.833,3,7,10,2,0,2,9,3,19);
Chapter 5
Implementation
5.1 Functionality
The main objective is to define a way to evaluate the performances of a player. This is the first priority, as for every business the golden rule is "if you can't measure it you can't control it".
This task is not easy, because it involves a lot of hidden variables, some of which are impossible to measure, like for example the player's emotional status. Therefore we decided to study the layer between the players' performances (statistics) and the player's condition.
We assume that the statistics are observed values which come from a statistical distribution. Using some visualization tools such as ParaView and the Weka preprocess form, we can see that it is reasonable to assume that the points and the other statistics follow a Normal distribution. As we can find in [], if there are more than 30 examples and there are no outliers, the normal distribution is a good approximation. For the players of our interest the data points are many more than 30, so it is possible to use the normal distribution and its tests.
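This assumption can be checked quickly in Python; the points-per-game values below are invented placeholders standing in for the values read from the database.

# Quick check of the normality assumption on a player's points per game.
import numpy
from scipy import stats

points_per_game = numpy.array([22, 27, 25, 31, 19, 24, 28, 26, 23, 30,
                               21, 25, 29, 24, 26, 27, 22, 25, 28, 23,
                               26, 24, 27, 25, 29, 23, 26, 24, 28, 25, 27, 24])

print('mean =', points_per_game.mean())
print('standard deviation =', points_per_game.std())
# D'Agostino-Pearson test: a large p-value means we cannot reject normality
statistic, p_value = stats.normaltest(points_per_game)
print('normality test p-value =', p_value)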
Our approach will be to predict the number of points from the statistics, then change the statistics and perform a sensitivity analysis. The sensitivity analysis is the methodology which we are going to use to identify the most significant statistics.
5.2 Tools Selection
Data mining is a widely used methodology, but there is not only one way to perform it. There are a lot of different tools available, so it is necessary to choose among them.
In our project we have multiple steps, therefore we need more than one tool to achieve the goal of each step. We are going to choose a tool for each step.
With the visual analysis we want to understand if there are obvious relationships between the statistics, therefore it will be useful to visualize all the variables at the same time. A solution can be the parallel-axes visualization. This visualization basically consists of representing each variable with a vertical line; each observation consists of a set of points on these vertical lines connected by a polygonal chain. We encountered this methodology in our module of "Knowledge Management", using the Xmdv Tool [29]. We encountered some problems converting our file into the Xmdv format, therefore we decided to use ParaView [30], another scientific visualization tool which implements the same functionality but makes it easier to load data from different formats.
We want to predict a value given a set of parameters using an appropriate algorithm. To do that we chose to use an existing data mining program rather than implementing it on our own, because our task requires nothing more than a standard algorithm [54], and there are a lot of programs which can fit our needs. The final choice was Weka, an open-source program already encountered in the "Machine Learning" module. This program is designed for deep data analysis; it has some visualization tools and a lot of algorithms already implemented and tested. Another important feature is that it is possible to call it from the command line. This makes it possible to call Weka from an external program, as if it were a function.
Then it is necessary to choose how to implement the sensitivity analysis. Basically our sensitivity analysis consists of running the prediction algorithm while changing the inputs and observing the results, looking for the variables which influence the final result the most. We decided to implement this part on our own, as we need to call Weka automatically, passing the parameters that we want to use. For doing this we decided to use the Python programming language. Python is a scripting language that is easy to use and has a big community of developers, therefore it is easy to find support. There are also a lot of libraries, in particular NumPy [31], a complete tool for statistical analysis. In general, scripting languages are not efficient, and in data mining efficiency is a critical requirement, but the functions in the Python libraries are very well optimized and for the learning algorithm we are going to use Weka, therefore this will not be a problem.
5.3 Modeling
In this step we are going to choose the algorithm that we are going to use to predict the number of points given the statistics.
The first fundamental choice is to decide whether to use a supervised or an unsupervised algorithm. As unsupervised learning is suited to grouping by similarity [32], to predict a result it is better to use a supervised algorithm.
As we already said in chapter 2, we know that linear models are good at predicting the scores of a match. In Weka there is a lot of choice for linear algorithms, so it is better to use one of Weka's algorithms, selecting the one which best fits our needs.
As the nature of the results is very variable, it is necessary to have a flexible algorithm, able to deal with outliers and misclassification. Based on the analysis performed in the background reading we decided to use a support vector machine (SVM); in particular, Weka implements Sequential Minimal Optimization (SMO) to train the support vector machine. Training an SVM is a Quadratic Programming (QP) problem; we are not going to explain this in detail as it is not relevant, the relevant part is that it needs at least polynomial time [35]. The SMO technique is very efficient [33]: the amount of memory needed increases linearly with the data size. Here another problem arises: Weka is implemented in Java, which means that the program runs on the Java virtual machine. In some tests we ran out of memory, as the memory for the virtual machine is set when Weka is launched, and if Weka is launched with the Graphical User Interface it is roughly 500 MB. To allocate more memory it is necessary to run Weka from the command line, specifying the size of the memory needed. For example, it is possible to allocate 1 GB by calling Weka from the shell with the command:
java -Xmx1024m ...
We ran the algorithm with the default parameters, but we specified the option "output predictions". This option is very important for our purpose because we want to collect all these results to discover the distribution underlying them.
5.4 Assessment of the model
To have a clue about the goodness of the algorithm it is possible to run it on some data and check the Weka results, which are composed of:
1. Correlation coefficient: how much linear dependence there is between the input and the output; as our model is linear, if this value is small it means that we cannot use the model, or that the data are not enough [37].
2. Mean absolute error: a measure of how good the predictions are. It is the mean difference between the prediction and the real value; in our case it is the average difference in points between the prediction and the actual result [36].
3. Root mean squared error: it also gives an indication of the error, but it is more useful for waveforms than for our model [38].
4. Relative absolute error: the absolute error normalized by the error obtained by the ZeroR algorithm, which is a simple Laplace estimator that we can see as a baseline [39].
5. Root relative squared error: the squared error normalized by the squared error obtained by the ZeroR algorithm (and then square-rooted), with ZeroR again acting as the baseline [39].
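As an illustration, these measures can be reproduced from a list of actual and predicted values with a few lines of Python (the numbers are invented placeholders; ZeroR simply predicts the mean of the actual values):

# Reproduce Weka's error measures from actual and predicted values.
import numpy

actual    = numpy.array([19.0, 25.0, 31.0, 22.0, 27.0])
predicted = numpy.array([21.0, 24.0, 28.0, 23.0, 26.0])
baseline  = numpy.full(len(actual), actual.mean())     # ZeroR predicts the mean

mae  = numpy.mean(numpy.abs(predicted - actual))
rmse = numpy.sqrt(numpy.mean((predicted - actual) ** 2))
rae  = numpy.sum(numpy.abs(predicted - actual)) / numpy.sum(numpy.abs(baseline - actual))
rrse = numpy.sqrt(numpy.sum((predicted - actual) ** 2) / numpy.sum((baseline - actual) ** 2))
correlation = numpy.corrcoef(actual, predicted)[0, 1]

print(mae, rmse, rae, rrse, correlation)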
A key activity in a data mining task is to decide which data we are going to use to train the model and which to test it.
To evaluate the model we used a 10-fold cross validation: the data is split into ten parts and each part is used once as the test set while the remaining nine are used for training.
We made the first test trying to predict the number of points made by a generic player, therefore the only attributes considered were the minutes played and the number of points made. The data came from all the players who ever played for the New York Knicks in the season 2010/2011.
The table of results shows that the data were too generic: the correlation was almost zero and the error was very high.
Therefore we decided to add more statistics: for the second test we selected more features, all of them from the "Stats" entity specified in chapter 4, the player-specific statistics.
We ignored all the attributes directly related to the number of points made, otherwise the prediction of the points made would be a mere calculation.
The statistics considered in this second test were: blocks, personal fouls, assists, offensive rebounds, turnovers, defensive rebounds, minutes played and date played.
As we can see, now the correlation is very high, but the errors are also quite high in some cases. To correctly read this result we need to think about our application domain. From my past as a player in this sport I know that the number of points can vary a lot from one match to another. It is possible to see this by simply looking at a random player's statistics: there can be a trend but it is clearly noisy, with peaks and drops.
Given this consideration the mean absolute error seems acceptable, especially with the high correlation coefficient, therefore we decided to start the analysis using this model and these statistics.
In other words, we treat the prediction as the mean of a normal distribution with a variance estimated from the training errors.
5.5 Creation of the program
Now that we have a clear idea about the theory, have selected the tools and have produced a project plan, we can start the implementation.
Before starting to code, the first step of the analysis is to check the data in Weka. To do this it is necessary to connect Weka to the database and then execute a query against it.
To connect Weka with the database there are a lot of different procedures, depending on the type of database and on the Weka version.
Connecting them is not very straightforward; we are not going into the detail of the procedure, only mentioning the main steps for PostgreSQL:
1) install the JDBC driver
2) add the driver to the system classpath
3) create a customized DatabaseUtils.props, which is a file with some parameters that allows direct queries from Weka to the database
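The customized DatabaseUtils.props typically contains at least the driver class and the connection URL (the exact keys may vary with the Weka version); a minimal sketch, with placeholder host, port and database name, could look like:

# Minimal DatabaseUtils.props sketch; host, port and database name are placeholders.
jdbcDriver=org.postgresql.Driver
jdbcURL=jdbc:postgresql://localhost:5432/nba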
We decided to look at the statistics from all the matches. To create the input file it is necessary to go to Weka, select "Open Database" from the preprocess section, enter the username and password for the database and perform the query:
select * from stats
Weka and PostgreSQL are compatible, but Weka cannot handle some variable types, like a date, therefore it is necessary to ensure that all the variables are correctly identified.
If the data are fine we can carry on with the implementation, otherwise it is necessary to change the data type in the database and perform the query again.
We have to select only the useful data identified in the previous Data Selection phase. This can be done by modifying the query or simply by removing the unwanted attributes in the preprocess section.
Then it is possible to save the result as a Weka file, a file format that our program is able to read.
This file must be placed in the local drive directory. It is also necessary to save it as "stats_all_numeric.arff"; this information is hard-coded in the program to save the user's time.
When it is launched, the program opens the file and launches Weka, building a command of the following form:
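A sketch of such a command, assuming Weka's SMOreg support vector regression implementation and the hard-coded training file described in this section (the test file name here is only a placeholder), is:
java -Xmx1024m -cp weka.jar weka.classifiers.functions.SMOreg -t stats_all_numeric.arff -T stats_test.arff -p 0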
This code basically calls the Weka SVM algorithm with some parameters and with the name of the
file that is going to be used for the training and the file for the testing.
We specified the option "output predictions" because our program collects the results from Weka and processes them.
To collect Weka's results it is necessary to study the format of the output. The output consists of four columns, respectively:
1. instance number
2. actual result
3. predicted result
4. error for that instance
Weka's output is stacked, therefore the number of characters between the columns varies a lot. Here we can find an example of the output:
The orange rectangles represent the spaces; as we can see, their number may differ a lot from one line to another. The difference between one line and another can be as much as 7 spaces, and this could be a problem when reading the output. We designed a quite flexible function to overcome this problem; however, with a different model the output can be very different, therefore it is necessary to check this part carefully.
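A minimal sketch of such a parsing function, assuming the four-column layout described above and treating any run of whitespace as a separator, is:

def parse_weka_predictions(output_lines):
    # Extract (actual, predicted) pairs from Weka's textual prediction output.
    # Any run of spaces is treated as a separator, so the varying number of
    # spaces between the columns does not matter.
    predictions = []
    for line in output_lines:
        fields = line.split()                    # splits on any whitespace
        if len(fields) < 3 or not fields[0].isdigit():
            continue                             # skip headers and empty lines
        actual = float(fields[1])
        predicted = float(fields[2])
        predictions.append((actual, predicted))
    return predictions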
5.6 Calculating the result of the game
Now that we have the prediction for each set of statistics, we want to find the mean and the standard deviation of the points made by each player.
In Weka's output there is no indication of the name of the player, therefore we created a function to link each player's name with all the instances it is connected to.
The input of this function is a list with the names of all the players we want. This list basically represents the team we want to create; the program uses this list to build a matrix with all the team members and all their statistics. Now we have, for every player, a list of all the predicted points. To calculate the mean and the standard deviation we installed a Python library very useful for statistical calculations: NumPy [46]. NumPy is available only for the 32-bit version of Python, and to preserve compatibility with the other libraries used it is necessary to use Python version 2.5.
After obtaining the statistics for each player it is possible to run the Monte Carlo simulation. From the data analysis we know that the number of points made by each player follows a Gaussian distribution. Now we know the parameters of these distributions, so it is possible to simulate the score of one player by sampling from that distribution. The result of the game will be the sum of the points made by the individual players of the two teams.
Once the result is obtained, the program keeps track of the number of points made by each team and also of the variance in the results. These values will be very useful to perform statistical analysis and evaluate the results.
From the literature review we know that it is very complicated to evaluate the effect of the variance when the variables are complex and not independent, as in our case. The tool selected to study the variance is the Monte Carlo simulation. There is a nice package available for Python to perform this analysis, NumPy, but in our case the implementation is not complex, therefore we decided to implement it on our own. Basically it simulates a game a certain number of times, keeping track of the statistics.
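As an illustration, a minimal sketch of this simulation, assuming the mean and standard deviation of the predicted points are already known for each player (all names and numbers below are invented placeholders), is:

# Simulate games by sampling each player's points from a Gaussian whose
# parameters come from the predictions.
import numpy

def simulate_games(team_a, team_b, n_games=50000):
    # team_a and team_b map player name -> (mean points, standard deviation).
    # Returns the point differences (team_a - team_b) over n_games simulations.
    differences = numpy.zeros(n_games)
    for players, sign in ((team_a, +1), (team_b, -1)):
        for mean, std in players.values():
            differences += sign * numpy.random.normal(mean, std, n_games)
    return differences

team_a = {'Player_One': (18.0, 5.0), 'Player_Two': (12.0, 4.0)}
team_b = {'Player_Three': (17.0, 6.0), 'Player_Four': (14.0, 4.5)}
diff = simulate_games(team_a, team_b)
print('mean difference =', diff.mean(), 'std =', diff.std())
print('probability that team A wins =', (diff > 0).mean())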
To perform the sensitivity analysis we basically decided to perturb the initial statistics and then predict the result with the SVM, as we said before. It is possible to specify the percentage by which to increase the original statistics. Each statistic is increased one at a time by that percentage, and with the distribution obtained for each player a Monte Carlo simulation is used to collect the result (with multiple simulations performed per statistic setting). It is also possible to specify how many of these cycles will be performed and by how much the variables should be incremented at each step.
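The overall loop can be sketched as follows; predict_distributions is a hypothetical stub standing in for the step that re-runs the Weka prediction on the perturbed statistics, and simulate_games is the routine sketched above:

# Sketch of the sensitivity-analysis loop.
def predict_distributions(team, perturbed_stat, increase):
    # Stub only: the real step re-runs the Weka model on the perturbed
    # statistics and returns {player: (mean, std)} for the given team.
    return team

def sensitivity_analysis(statistics, team_a, team_b, step_percent=1.0, n_steps=5):
    results = []
    for stat_name in statistics:
        for step in range(1, n_steps + 1):
            increase = step * step_percent / 100.0
            dist_a = predict_distributions(team_a, stat_name, increase)
            dist_b = predict_distributions(team_b, None, 0.0)    # opponent left unchanged
            diff = simulate_games(dist_a, dist_b)                # from the sketch above
            results.append((stat_name, increase, diff.mean(), diff.std()))
    return results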
All the results are stored in the "res.csv" file for further analysis. We completed the data mining analysis by creating a report to summarize our results.
As this is an analysis instrument we decided to focus on the analysis functionality and to reduce time-wasting operations as much as possible. We decided to hard-code the paths of all the files needed by the program, and also to save all the tests automatically in a folder. The path of this folder is hard-coded as well and follows this format:
*c:*\SensAnalysis\*date*\*time*
where *c* is the main disk drive, *date* is the date when the analysis is performed and *time* is the time when the analysis is performed. The program automatically creates a set of Weka files and a comma separated value file where the results are stored. The Weka files are created for three reasons:
1. they are useful because it is possible to check if the program is creating correct instances
2. it is possible to perform further analyses with Weka or with another program
3. it is possible to save every experiment and keep track of the results
The comma separated value file contains the analysis of the results obtained; it is saved in this format so that it can be opened with other programs. To create the graphs we use Matlab, in particular a function which combines the BAR and ERRORBAR Matlab functions. With this function it is possible to visualize the results as bars, with the variance shown as a segment at the top of each bar.
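Purely as an illustration, an equivalent bar-with-error-bars chart can be produced in Python with matplotlib (all numbers are invented placeholders; the project itself used Matlab):

# Equivalent bar-with-error-bars chart; numbers are placeholders.
import matplotlib.pyplot as plt

statistics = ['assists', 'blocks', 'off. rebounds', 'fouls']
mean_difference = [2.1, 1.4, 3.0, -0.5]     # mean point difference per statistic
std_difference = [3.5, 2.8, 1.2, 4.1]       # variability of that difference

plt.bar(range(len(statistics)), mean_difference, yerr=std_difference, capsize=4)
plt.xticks(range(len(statistics)), statistics)
plt.axhline(0, color='black', linewidth=0.8)
plt.ylabel('difference of points between the two teams')
plt.show()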
We decided to start only with the player statistics and then eventually add some extra information, like the player's weight and height or information about the opponent team.
Before using this as our algorithm it is necessary to check whether the results are acceptable, so we ran the SMO with the default settings, also showing the results.
We decided to remove all the statistics that give a direct indication of the number of points made, like the shooting percentage, because we want to predict the points from the remaining statistics rather than compute them.
It turns out that the results are quite good, with more than 80% accuracy and a variability of nearly 4 points (see table below), so we decided to use the SMO algorithm with these statistics.
5.7 Sensitivity Analysis Results
It is possible to look at these results and classify them from three different points of view:
1) which variables make you score more in total (without taking the variance into consideration)
2) how the variables affect the difference of points between the two teams
3) which variables are the most informative (the variables which tell first whether the result has changed significantly).
These three points of view mainly differ in how the variance is taken into consideration.
The attributes (statistics) of one team were changed systematically, one at a time, with the same change applied to each player.
We simulated 50'000 games for each variable.
The players of the first team are: 'Roger_Mason', 'Anthony_Carter', 'Ronny_Turiaf', 'Wilson_Chandler', 'Carmelo_Anthony', 'Timofey_Mozgov', 'William_Walker', 'Shelden_Williams', 'Renaldo_Balkman', 'Raymond_Felton';
and of the second team: 'Landry_Fields', 'Roger_Mason', 'Anthony_Carter', 'Ronny_Turiaf', 'Wilson_Chandler', 'Carmelo_Anthony', 'Timofey_Mozgov', 'William_Walker', 'Shelden_Williams', 'Renaldo_Balkman', 'Raymond_Felton'.
These two teams are slightly different: the second team has one more player. The analysis will find the break-even point where the first team starts to win against the second.
We show only the graphs for the most informative statistics; the rest of the graphs can be found in Appendix B.
The light blue horizontal lines represent the variance in the baseline, when the two opponents are exactly the same team, with the same statistics. The mean of the baseline is zero because the team is playing against itself, but the variance can change the result. The blue bars are the difference of points between the two teams; the black lines represent the variance of the difference of points between the two opponents. In each graph it is possible to see how the difference of points changes when each statistic is incremented.
As we can see from the graphs, assists, turnovers and blocks improve the performance but make it very variable, so variable that the result can turn into a loss.
It is also possible to observe that an increase of 1% is not enough for the first team to win over the second.
Probably the most important result is how increasing the number of fouls can make the result very unstable. We also observe that the offensive rebounds lead to a stable improvement. It is also interesting to observe that a variable very similar to the offensive rebounds, the defensive rebounds, is not so effective and does not give a clear indication.
Chapter 6
Evaluation
In the previous chapters we have already talked about some evaluation methodologies, as it was necessary to receive feedback from each step to direct our analyses and ensure we were going in the right direction. In this chapter we go into detail about the mathematical evaluation, but we are also going to evaluate our model and results from the business point of view.
It is possible to define the total output of a data mining project using the equation:
RESULTS = MODELS + FINDINGS
This equation reflects the fact that the total output of a data mining project is not just the model: it is also important to meet the objectives of the business [47].
6.1 Evaluate the results
Our project is composed of three different methodologies; here we are going to evaluate each one of them.
6.1.1 Support vector machine
Following the CRISP-DM methodology, we have already evaluated the model in section 5.4, as it was necessary to check whether the model was appropriate to predict the results.
6.1.2 Monte Carlo Simulation
The problem that can arise using a Monte Carlo simulation is that we do not know the behavior of the distribution behind it.
The problem basically consists in the fact that changing the number of loops of the simulation will change the result. For example, it is clear that if we run the simulation only two times we are not going to obtain a useful indication. There is no formula to find the right number of loops, as it depends a lot on the simulation scenario. A common methodology [28] is to find the number of loops by simply increasing it by a power of ten until the final result of the simulation is stable. We can see the behavior of the system in this graph.
The x-axis represents the number of cycles, the y-axis the number of points made.
After a certain point the result is stable; how quickly this happens depends mainly on the variance in the model. We modified this methodology a little, as we have some indication about the variance. In particular, using a t-test it is possible to find out when it is possible to say with enough confidence that a team will win. Our program automatically performs this test, analyzing the distribution of points of each team and finding the probability that these two distributions are the same distribution. In Python this test is already implemented in the SciPy package; the function name is ttest_1samp.
The t-test is used because we are using the variance obtained from the data as an estimator of the model's variance. In particular we must use the test for paired data, as we are testing a variable against the same variable slightly modified [50]. To implement this methodology we study the distribution of the difference of points and its standard deviation [50].
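A sketch of this check with SciPy, applied to the simulated point differences, is shown below; for paired data, testing with ttest_1samp whether the mean difference is zero is equivalent to the paired t-test, and the differences below are placeholders standing in for the Monte Carlo output.

# Paired t-test on the simulated point differences: testing whether their mean
# is zero is equivalent to testing whether the two teams score the same.
import numpy
from scipy import stats

differences = numpy.random.normal(1.5, 12.0, 50000)   # placeholder for simulate_games output

t_statistic, p_value = stats.ttest_1samp(differences, 0.0)
print('t =', t_statistic, 'p-value =', p_value)
# A small p-value suggests the two teams' scores come from different distributions.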
We considered it a strong indication if there is a probability greater than 5% that the points of the two teams come from different distributions.
We also consider a probability between 1% and 5% as an ambiguous result, because our dataset is not very big: we do not have more than 82 results, which is the number of games in a regular season. A probability in that range is not good enough to give a clear indication.
If they come from different distributions it is possible to identify a factor that modifies the result. The expectation is that at the beginning it is not possible to say that the data come from two different distributions, then some variables will enter the ambiguous zone and finally, if we increase the variables enough, the indication becomes stronger.
The variance will increase with the size of the team, so we performed the test with all the players of the New York Knicks in both teams, and the result was that beyond 30'000 simulated games the results are stable. We encoded 50'000 game simulations, as the result does not change after that break point and 50'000 cycles was a good compromise between speed and robustness.
6.1.3 Sensitivity Analysis
The sensitivity analysis is probably the most difficult part to assess, because it is difficult to receive feedback about the results. It would be interesting to use the results from this analysis to change the training sessions and the players' strategy, to see if there is a real impact on the games. Unfortunately this requires resources and time that are not available at this moment, but hopefully in the future this methodology will be used and it will be possible to evaluate the ultimate purpose of this work.
The only evaluation it is possible to perform is about the reliability of the results. If we look at the Data Mining report we can see that the results seem reasonable. For example, we can see that in some cases, increasing the parameters by more than 5%, the variance is quite low and this results in a win. This is the case, for example, of the offensive rebounds, which give a clear indication that increasing them makes the chances of winning very good. It is possible to explain this by simply considering that offensive rebounds are taken near the basket, so the chance of scoring is high. This fact is also confirmed by basketball strategy and by some important results. We can find confirmation of this in real games: for example, game one of the 2009-2010 finals was won by the Los Angeles Lakers because they grabbed many more rebounds than the Boston Celtics [51].
Another confirmation of the goodness of our results is given by studying the results for the minutes played. We expect that increasing the minutes played by a player will increase the points made, and this is exactly what happens in our simulation; the variance is also low, therefore the results of our model are very reliable.
Now we are going to analyze the overall results of our analysis.
6.2 From the business point of view
The data mining goal was to find a way to evaluate the performance in official games and to identify the statistics which lead to a victory. For the considerations expressed before, we think that the business goal was fully achieved, as we provide correct results and also a methodology to carry on further studies.
As the project is open to further work, we asked our expert for feedback on the conclusions. The program can simulate a wide range of games, changing the members of the teams or the impact of the parameters on the statistics, therefore we decided to give it to the expert from the company, Ram Mylvaganam. In fact the program can also be used to decide the turnover (rotation) of the players in a game, as it gives the results and the probability of winning the game.
On request we also designed a Graphical User Interface to let other people use this program. We decided to allow the user to change only a small number of parameters, the most important ones, so running a simulation is not complicated. The user interface is not refined, but it is designed to make things easy for the user. It is possible to change the weight of each parameter, so one parameter can be changed more than another. It is also possible to specify the percentage increase at each step and for how many steps to run the simulation. To create the team there is a list on the right where it is possible to select the players.
As Python is not designed to create executable files, we found a tool to do that. We are not going into the details of the procedure; the way it works is not straightforward and there can be some pitfalls depending on the Python version [52].
The overall feedback was good; the only negative point was the interface. Ram asked to change it, but unfortunately I did not have enough time. The feedback about the analysis and the program was very good, as he said that he appreciated the analytical approach used to perform the analysis, especially regarding the variance. He also said that we have created a good starting framework to stretch the system to accommodate other criteria.
Chapter 7
Conclusions and further work
7.1 Project Objectives
A number of requirements were specified in chapter 1; these were:
1. Identify and implement an appropriate data mining technique to summarize the data.
2. Identify and implement an appropriate evaluation method for the results.
3. Identify and implement an appropriate method to visualize the data.
The methodology identified and applied:
1. uses appropriate data mining techniques to predict the number of points made, namely the Support Vector Machine, and the mean and the standard deviation to summarize the data;
2. states two types of evaluation criteria, one using a statistical test and one using an expert opinion;
3. uses different methodologies to visualize the data, such as parallel axes and histograms for the inputs and bar graphs with standard deviation for the outputs.
In addition we added the following features:
• a database approach was used to make the analysis more flexible
• a flexible framework for data mining analysis in basketball has been proposed
• a sensitivity analysis has been designed to understand the most relevant statistics to win a basketball match
• a flexible program to implement the sensitivity analysis has been created
• a GUI has been created to facilitate the end user
• a visualization technique has been identified to present the results and to facilitate their evaluation
7.2 Further Work
As we said, this project is at a very early stage, therefore it is possible to develop this work in many directions.
7.2.1 Enhancements
7.2.1.1 Adding more features:
The number of features considered in this analysis is not enough to outline the complete situation of a player. An interesting analysis could be performed by adding more information, for example the players' bios or the opponents' characteristics. This information is already in our database, therefore it would not be difficult to perform this analysis with the current framework.
7.2.1.2 Modify the inputs
In the sensitivity analysis the inputs used to predict the number of points are arbitrarily modified, but they come from a statistical distribution like the points, therefore it would be interesting to study what happens if we vary the distribution of the statistics instead of increasing them by a fixed percentage. This would reflect reality in a better way, but the variance could turn out too high to obtain good results.
7.2.1.3 Connect the training with the match result
As we said in the first chapter, this research should be the first step in a broader project to understand and predict future performance from the athletes' training data. This could be a turning point for sport science, as at the moment this evaluation is based on the personal opinions of the team's coach and his staff. In this project we modelled relationships between game statistics and game outcomes. Future work would link training session statistics to game statistics to further guide training.
Evaluation:
As we said in section 6.1, it is difficult to evaluate the results of the sensitivity analysis in a real scenario; at this stage of the work it is only possible to give a mathematical judgment, which cannot guarantee a real success in the real world. It would be interesting to continue this project with an expert in sport, like John O'Hara, the sport scientist who helped us understand the data from the training sessions in the very early stage of this project. Using these results to modify the training sessions can provide a very
Appendix A
Personal Reflection
Now that this experience is nearly over I can sit and think about what this project represents to me.
For sure I can say it was a challenge, as every project is, but more than what I was expecting. This experience is not comparable with a course project: it is much more intense, but it can also be very enjoyable.
I learned a lot from this project, but I think the most important lesson for me was how to manage my time and my work. It is fundamental to write down the main tasks, break them down into elementary sub-tasks and make a good schedule. This was the hardest part for me, mainly for two reasons. Firstly, this is my first real project and I am not used to working in such a systematic way. Secondly, my project was not clearly defined at the beginning, therefore it was impossible to make a detailed plan. Nevertheless I have learned that it is important to make a plan also in this situation: you can always change the plan if you have one, but if you don't, the project will not have a direction and you will only lose time.
Another important note about the scheduling: writing up takes time! It seems obvious but believe me, it is very easy to underestimate the writing up, maybe because it is the least enjoyable part for most students. I was convinced that I had enough time for the write-up and in the end I had to make several compromises to complete the project on time. This was because of some small delays and some changes to the project.
In particular, my advice is to evaluate carefully the tools you are going to use. I had some problems because I assumed that all the Python libraries were compatible with the 64-bit version. Unfortunately they were not, and I discovered it only in the middle of the project, so to make sure that all the libraries were supported I installed an older, buggy version of Python. This domino effect made me lose quite a lot of time.
One of the best experiences was the interaction with my supervisor. Remember that your supervisor is there to give you support, guidance and sometimes hope. It is very important to be always prepared for the weekly meeting, so you can ask questions and he knows what you are doing, giving a direction to your project. The feedback is very important; remember that the project you are doing can be your work in the future, so take as much as you can from your supervisor and assessor. That's why it is important to be always prepared, but especially for the interim report and the progress meeting, the only two times you will get feedback from your assessor (before the last, final feedback).
My last advice is to enjoy the project as much as you can: it is the last part of the MSc and it will never come back. It will be painful, so it is better to choose a project you like and believe in, otherwise it will be very hard. Good luck!
Appendix B
Appendix C
Original Gantt chart
Final changes in the Gantt chart
References
Bibliography:
[1] CRISP-DM 1.0 Step-by-step data mining guide, Pete Chapman, Julian Clinton, Randy Kerber,
Thomas Khabaza , Thomas Reinartz ,Colin Shearer and Rüdiger Wirth, pag 3
[2] Moneyball: The Art of Winning an Unfair Game (ISBN 0-393-05765-8) is a book by Michael
Lewis, published in 2003, pag75/77
[3]Data Mining in Soft Computing Framework:A Survey Sushmita Mitra, Senior Member, IEEE,
Sankar K. Pal, Fellow, IEEE, and Pabitra Mitra
[4]"Taking the First Step with PDCA". 2 February 2009. Retrieved 17 March 2011.
[5] Kolata, Gina 'Maximum' Heart Rate Theory Is Challenged, 2001-04-24, New York Times
[6] Heart Rate, Wikipedia, accessed at http://en.wikipedia.org/wiki/Heart_rate, 1/5/2011
[7] SPORTS DATA MINING Robert P., Osama K. Solieman, Hsinchun Chen, Introduction
[8] Statistics, wikipedia, accessed at http://en.wikipedia.org/wiki/Statistics on 24/4/2011
[9] Barlas, I., A. Ginart, et al. 2005. Self-Evolution in Knowledgebases. IEEE AutoTestCon,
Orlando, FL.
[10]Carlisle, J. P. 2006. Escaping the Veil of Maya - Wisdom and the Organization. 39th Hawaii
International Conference on System Sciences, Koloa Kauai, HI.
[11] Bierly, P. E., E. H. Kessler, et al. 2000. Organizational Learning, Knowledge and Wisdom. Journal of Organizational Change Management 13(6): 595-618
[12] http://en.wikipedia.org/wiki/Finite-state_machine
[13] Hirotsu, N. & M. Wright 2003. A Markov Chain Approach to Optimal Pinch Hitting Strategies in a Designated
Hitter Rule Baseball Game. Journal of Operations Research 46(3): 353-371.
[14] Rotshtein, A., M. Posner, et al. 2005. Football Predictions Based on a Fuzzy Model with Genetic and Neural
Tuning. Cybernetics and Systems Analysis 41(4): 619-630.
[15] http://en.wikipedia.org/wiki/DIKW
[16] Personal communication from Catapult SpA
[17]DataSoftSystems 2009. Data Mining - History and Influences. Retrieved Sept. 2, 2009, from
http://www.datasoftsystem.com/articles/article-1380.html.
[18] Time-Critical Decision Making for Business Administration
http://home.ubalt.edu/ntsbarsh/stat-data/forecast.htm accessed on 16/6/2011
[19]Time Series http://en.wikipedia.org/wiki/Time_series accessed on 16/6/2011
[20] Monte Carlo Simulation, web source:
http://www.vertex42.com/ExcelArticles/mc/MonteCarloSimulation.html
[21] Leamer, E., (1990) Let's take the con out of econometrics, and Sensitivity analysis would help. In C. Granger (ed.),
Modelling Economic Series. Oxford: Clarendon Press 1990.
[22] Douglas Hubbard "How to Measure Anything: Finding the Value of Intangibles in Business", John Wiley & Sons,
2007
[23]"A Step by Step Introduction to Data Mining for Sports Analysis", Mikhail Golovnya, Salford Systems,2009
[24] Ram Mylvaganam, project meeting, personal communication
[25]www.brianmac.ac.uk
[26]http://ai.arizona.edu/research/sports_data/index.asp
[27]Numerical Algorithms for Predicting Sports Results, Jack David Blundell, School of Computing, Faculty of
Engineering http://www.engineering.leeds.ac.uk/e-engineering/documents/JackBlundell.pdf
[28] Prediction of physical performance using data mining (Measurement). Research Quarterly for Exercise and Sport. Lynn Fielitz and David Scott.
[29]”Xmdv Tool”, official webpage http://davis.wpi.edu/xmdv/
[30]"ParaView", official webpage http://www.paraview.org/
[31]"Most Popular Data Mining Software", accessed at http://www.the-datamine.com/bin/view/Software/MostPopularDataMiningSoftware on 2/7/2011
[32]"NumPy" official page http://numpy.scipy.org/
[33] Unsupervised Learning http://en.wikipedia.org/wiki/Unsupervised_learning
[34] Fast Training of Support Vector Machines Using Sequential Minimal Optimization, John C. Platt,
January 1998
[35] Quadratic Programming complexity
http://en.wikipedia.org/wiki/Quadratic_programming#Complexity accessed on 25/5/2011
[36] Mean absolute error http://en.wikipedia.org/wiki/Mean_absolute_error
[37] Pearson product-moment correlation coefficient
http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
[38] Root mean square http://en.wikipedia.org/wiki/Root_mean_square
[39] "Weka List help", Eibe Frank, university of Wikato
https://list.scms.waikato.ac.nz/pipermail/wekalist/2004-August/002825.html
[40] Trick, Michael A., "A Schedule-Then-Break Approach to Sports Timetabling" (2000). Tepper School of Business.
Paper 514. http://repository.cmu.edu/tepper/514
[41] The 8th Australasian Conference on Mathematics and Computers in Sport, 3-5 July 2006, Queensland, Australia, Artificial Intelligence in Sports Biomechanics: New Dawn or False Hope?
[42] Yang, T. Y. & T. Swartz 2004. A Two-Stage Bayesian Model for Predicting Winners in Major League Baseball.
Journal of Data Science 2(1): 61-73.
[43]Time series http://en.wikipedia.org/wiki/Time_series
[44] CRISP-DM 1.0 Step-by-step data mining guide, Pete Chapman, Julian Clinton, Randy Kerber, Thomas Khabaza ,
Thomas Reinartz ,Colin Shearer and Rüdiger Wirth, pag 13
[45] How do I connect to a database?, Weka Wikispaces, accessed at http://weka.wikispaces.com/How+do+I+connect+to+a+database%3F on 4/6/2010
[46] Numpy Scientific Computing Tools For Python —Numpy http://numpy.scipy.org/
[47] CRISP-DM 1.0 Step-by-step data mining guide, Pete Chapman, Julian Clinton, Randy Kerber,
Thomas Khabaza , Thomas Reinartz ,Colin Shearer and Rüdiger Wirth, pag 57-58
[48]Testing Monte Carlo Algorithmic Systems,Frank Erdman, accessible at
http://sqa.fyicenter.com/art/Testing-Monte-Carlo-Algorithmic-Systems.html
[49]t-test, wikipedia, http://en.wikipedia.org/wiki/T_test ,accessed on 7/8/2011
[50]paired t-test, wikipedia, http://en.wikipedia.org/wiki/Paired_difference_test .accessed on
7/8/2011
[51]Basket – Nba, finalissima: Lakers a tutto GASol. 102-89 sui Boston Celtics in gara-1
, Ivano Agostino, accessed on 28/8/2011 at http://www.stadiosport.it/basket-nba-lakers-a-tuttogasol-102-89-sui-boston-celtics-in-gara-1/
[52] sensitivity Analysis, wikipedia, accessed at
http://en.wikipedia.org/wiki/Sensitivity_analysis#Methodology 25/8/2011
[53] Saltelli, A., S. Tarantola, F. Campolongo, and M. Ratto (2004). Sensitivity Analysis in Practice: A Guide to
Assessing Scientific Models. John Wiley and Sons.
[54] Weka 3: Data Mining Software in Java, University of Waikato,
http://www.cs.waikato.ac.nz/ml/weka/