Data Analysis Tools

Stat 5969
Statistical Software Packages
Data Analysis Tools
•
This section of the notes is meant to introduce you to many of the tools that
are provided by Excel under the Tools/Data Analysis menu item. If your
computer does not have that tool loaded, you need to go to Tools/Add-Ins
and then check the box Analysis ToolPak. When you do so, you may be
prompted to insert your original CD to load the tools.
Tools for Summarizing Data
•
There are two principal analysis tools for summarizing data. They are
“Histogram” and “Descriptive Statistics.”
Histograms
•
We can use a spreadsheet to obtain a histogram. In the process Excel finds
the frequency distribution and then draws the plot. It also has the option of
finding an ogive. Below is the procedure.
1. To get to the Analysis Tools, select Tools/Data Analysis. This will
bring up the list of statistical methods.
2. Select the tool entitled "Histogram." The dialog box below will then
appear. All of the analysis tools in Excel provide a similar dialog box.
3. In the dialog box, specify where the data you want to analyze are located
and where you want the output to go. Specify the location of the data
either by typing the cell range or by dragging the mouse over the cells
containing the data. For now, skip the box asking for the bin range
(see below for how to use the bin range input). If your input range
includes the row that has the variable name or heading, click in the
labels box. In the box asking for the output range, type or click on the
cell reference where you want the output to begin. Do not mark the box
next to "Pareto." If you want Excel to draw the histogram, click in the
appropriate box. The “Cumulative Percentage” box will give you the
ogive. Then click OK.
•
The result of this procedure will be a frequency distribution. The first
column will show the value which defines the right (or maximum) value of
the class interval, which Excel refers to as a “bin.” The second column will
show the number of observations in the bin, and the third column will
contain the cumulative percentage of observations falling in or below the bin.
Example Output:
Bin      Frequency   Cumulative %
16.9     1           .42%
18.84    2           1.25%
20.78    13          6.67%
22.72    38          22.50%
24.66    64          49.17%
26.6     56          72.50%
28.54    26          83.33%
30.48    18          90.83%
32.42    8           94.17%
34.36    8           97.50%
36.3     1           97.92%
38.24    2           98.75%
40.18    1           99.17%
42.12    1           99.58%
44.06    0           99.58%
More     1           100.00%
[Figure: Histogram of the example data. Horizontal axis: Bin; left vertical
axis: Frequency; right vertical axis: Cumulative %.]
•
The way to interpret the frequency distribution is as follows. The first
frequency number is the number of data points that have values less than or
equal to the first bin number. The next frequency number is the number of
data points less than or equal to the second bin number, but greater than the
first bin number. For example, in what is above, there is one number in the
data set that is less than or equal to 16.9. There are 2 numbers in the data
set larger than 16.9 and less than or equal to 18.84. The other numbers are
interpreted similarly. The last bin always says “More.” The corresponding
frequency number tells us how many numbers in the data set are larger than
the second to last bin number. In the example above, 1 number in the data
set is larger than 44.06.
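The counting rule described above can be sketched in a few lines of Python. This is only an illustration of the rule, not Excel's own code; the function name frequency_table and the small data set are made up for the example.

```python
from bisect import bisect_left

def frequency_table(data, bins):
    """Count observations per bin using the rule described in the notes.

    Each bin value is an upper limit: an observation x is counted in the
    first bin whose value is >= x. Anything larger than the last bin value
    is counted in the "More" bin.
    """
    counts = [0] * (len(bins) + 1)           # the final slot is "More"
    for x in data:
        counts[bisect_left(bins, x)] += 1    # index of first bin value >= x
    labels = [str(b) for b in bins] + ["More"]
    table, running = [], 0
    for label, count in zip(labels, counts):
        running += count
        table.append((label, count, 100.0 * running / len(data)))
    return table

# Hypothetical data and bin values, chosen only to exercise the rule.
values = [16.9, 17.5, 18.84, 19.0, 20.78, 21.3, 44.06, 45.0]
bin_values = [16.9, 18.84, 20.78, 44.06]
table = frequency_table(values, bin_values)
for label, count, cum_pct in table:
    print(f"{label:>6} {count:>3} {cum_pct:7.2f}%")
```

Note that a value exactly equal to a bin value (18.84 here) is counted in that bin, not the next one.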
Minor Fixes to Excel’s Output
•
There are two things about Excel’s histogram output that I don’t like. The
first is the way it handles the first bin. It always sets the first bin value equal
to the smallest number in the data set. Hence its frequency is almost always
equal to 1. In almost every case, I choose to combine this bin with the next
one. To do so, I add the frequency of this first bin to the frequency of the
second bin, and then delete the first row of the output given by Excel. For
the example above, the first two rows of my modified frequency distribution
would look like this.
Bin      Frequency   Cumulative %
18.84    3           1.25%
20.78    13          6.67%
•
The second thing that I don’t like is that the chart that Excel automatically
constructs is actually a bar graph. To make it look more like a histogram,
we need to have no space between the bars. To remove the space, double
click on the bars of the chart, then select the Options tab, and change the
Gap width to 0, then select OK.
Selecting Your Own Bin Values
•
If you don’t like the bin values that Excel uses, you can create your own.
Below I describe the process that I would follow to do it. As you can see, it
is quite a bit longer, and my preference is to let Excel choose the bin
values.
1. First determine the number of bins. Say that the number of observations
you have is n. Then a rule for the number of bins is (2*n)^(1/3) (i.e., the
cube root of 2n). You will usually have to round this number to an
integer. The usual suggestion is to round up. For the example above,
there were 240 data points. Then (2*240)^(1/3) = 7.83. We round up to 8
to get 8 bins.
2. To find the bin width, take the range of the data (largest minus smallest),
and divide by the number of bins found in step 1 above. Again you will
want to round up to determine the actual bin width, but it is quite
subjective as to how to round (you can go to the nearest integer, tenth,
hundredth, etc.). For the example, the smallest and largest of the 240
values were 16.9 and 46. To find the interval, we use (46-16.9)/8 = 3.64.
The original data had two decimal places, so it is convenient to use two
decimal places for the bin width. To make it an “even” number, I
decided to use 3.65 as the bin width.
3. When creating the bin boundaries, I take the smallest number and add the
bin width to it to obtain the starting bin value. If you don’t like
fractions or “uneven” numbers, you can round to a neighbor that fits your
criteria for a good starting value. Excel will take the first number that
you put in the bin range, and then find how many numbers in the data set
are less than or equal to that number. Then it will take the 2nd number in
the bin range, and find how many are greater than the first bin number,
but less than or equal to the second bin number.
For the example, say my original data are in cells A2:A241 and cell A1
contains a label. In cells C2:C8 I can enter the numbers 20.5 (which is
close to 16.9 + 3.65), 24.15, 27.8, 31.45, 35.1, 38.75, 42.4 (notice I only
entered 7 numbers, even though there are 8 bins; the 8th bin will be
created by Excel and called “More”). In cell C1 I should enter some
label for the bins. The most obvious choice is to just type “Bin” in C1.
(If you check the “Labels in First Row” box, you must add a label to the
bins as well.) Now use Data Analysis from the Tools menu and select
Histogram. Input A1:A241 in the data input range. In the bin input range,
enter C1:C8. Choose the other options as normal. Then hit OK.
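The arithmetic in steps 1 through 3 can be checked with a short Python sketch. The function names suggest_bins and bin_limits are hypothetical helpers written for this illustration. The rounding precision is left as a parameter because the notes treat it as a judgment call (the sketch rounds the width up to 3.64, and the notes then choose 3.65 by hand).

```python
import math

def suggest_bins(n, smallest, largest, decimals=2):
    """Apply the cube-root rule; return (number of bins, bin width)."""
    num_bins = math.ceil((2 * n) ** (1 / 3))        # round up to an integer
    raw_width = (largest - smallest) / num_bins
    scale = 10 ** decimals
    width = math.ceil(raw_width * scale) / scale    # round up at that precision
    return num_bins, width

def bin_limits(first_limit, width, num_bins):
    """Values to type into the bin range; Excel adds the last bin ("More")."""
    return [round(first_limit + i * width, 2) for i in range(num_bins - 1)]

# Numbers from the example: 240 observations ranging from 16.9 to 46.
num_bins, width = suggest_bins(240, 16.9, 46)
print(num_bins, width)

# The notes then pick 3.65 as the width and 20.5 as the first bin value.
limits = bin_limits(20.5, 3.65, num_bins)
print(limits)
```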
•
Below is the resulting output, including the chart (after adjusting the gap
width to 0).
Bin      Frequency   Cumulative %
20.5     13          5.42%
24.15    88          42.08%
27.8     93          80.83%
31.45    28          92.50%
35.1     13          97.92%
38.75    2           98.75%
42.4     2           99.58%
More     1           100.00%
[Figure: Histogram of the example data using the custom bins 20.5 through
42.4 and More. Horizontal axis: Bin; left vertical axis: Frequency; right
vertical axis: Cumulative %.]
•
The interpretation of the frequency distribution is exactly the same as before.
Descriptive Statistics
•
To use Excel to obtain a listing of descriptive statistics, we again use the
Analysis Tools. This time, instead of selecting "Histogram," select
"Descriptive Statistics." Indicate where the data are located, and select
whether they are in rows or columns. If you want the data set to have a
descriptive title, you can include the label in the first entry above the data,
and then click the box next to "Labels." Specify where you want the output
to go. I recommend always clicking on the “Summary Statistics” box. I
also recommend checking the Confidence Level for Mean box (and filling in
the confidence level) if you are interested in confidence intervals for the
mean. I rarely use the other boxes.
•
You can do descriptive statistics on several variables at once. You just need
to be sure that the variables are next to each other in the spreadsheet, and
then refer to all the columns in the input portion of the dialog box.
•
Here is some example output.
Height
Mean                      1.2
Standard Error            0.01206
Median                    1.18
Mode                      1.23
Standard Deviation        0.04
Sample Variance           0.0016
Kurtosis                  -1.11302
Skewness                  0.50417
Range                     0.12
Minimum                   1.15
Maximum                   1.27
Sum                       13.2
Count                     11
Confidence Level(99.0%)   0.03822
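As a rough cross-check on what most of these rows mean, the sketch below computes the same quantities with Python's standard statistics module. The data values are hypothetical (they are not the Height data), and kurtosis, skewness and the confidence-level row are left out to keep the example short.

```python
import math
import statistics

# Hypothetical sample, used only to show how each row is computed.
sample = [4.1, 5.0, 5.0, 6.2, 7.3]
n = len(sample)
std_dev = statistics.stdev(sample)          # sample standard deviation (n - 1)

summary = {
    "Mean": statistics.mean(sample),
    "Standard Error": std_dev / math.sqrt(n),
    "Median": statistics.median(sample),
    "Mode": statistics.mode(sample),
    "Standard Deviation": std_dev,
    "Sample Variance": statistics.variance(sample),
    "Range": max(sample) - min(sample),
    "Minimum": min(sample),
    "Maximum": max(sample),
    "Sum": sum(sample),
    "Count": n,
}
for name, value in summary.items():
    print(f"{name:<20} {value}")
```

The Standard Error row is the standard deviation divided by the square root of the count, which matches the example output above: 0.04 divided by the square root of 11 is about 0.01206.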
Box plots
•
There is nothing built in to Excel to do box plots. I have created a template
that will do up to 4 simultaneous box plots. It is also limited to data sets of
no more than 500 observations. It has some faults, but it is not bad. The
file is called Multiple Boxplots.xls. Below is a sample of what it produces.
[Figure: Sample box plots for two groups, Automobile and Public, drawn on a
common axis running from 0 to 50.]
Covariance
•
A covariance matrix can be obtained from the spreadsheet by using the
Covariance Analysis Tool. Select Data Analysis, then Covariance. Identify
the input area, and where you would like the output to go. Indicate whether
the data are grouped by column or row, and whether labels are being used,
and then select OK. Example output is shown below.
**WARNING**
This analysis tool divides the cross products by n rather than by n-1. If you
want true sample variances and covariances, you should multiply all of the
numbers by n/(n-1).
              Day        Hour       Prep Time   Wait Time   Travel Time   Distance
Day           3.906276
Hour          0          5.271967
Prep Time     0.212155   -0.19626   1.110149
Wait Time     1.123469   -0.83906   0.310482    10.60447
Travel Time   0.193243   0.221318   0.033146    -0.29553    3.59392
Distance      0.166276   0.143933   -0.02226    -0.06857    1.799046      1.02825
•
The numbers on the diagonals are variances (except they are divided by n),
and all other numbers are covariances. The matrix is symmetric, so only
numbers on one side of the diagonal are shown.
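The n versus n-1 adjustment from the warning can be illustrated with a small Python sketch. The covariance function and the two short data lists are made up for the illustration; they are not part of the pizza data.

```python
def covariance(xs, ys, sample=False):
    """Average cross product of deviations; divide by n - 1 if sample=True."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / (n - 1 if sample else n)

# Hypothetical paired observations.
x = [1.0, 2.0, 4.0, 7.0]
y = [2.0, 3.0, 7.0, 8.0]
n = len(x)

tool_value = covariance(x, y)                # divides by n, as the tool does
sample_value = covariance(x, y, sample=True) # divides by n - 1
adjusted = tool_value * n / (n - 1)          # the correction from the warning

print(tool_value, sample_value, adjusted)
```

Multiplying the tool's number by n/(n-1) reproduces the sample covariance, which is all the warning asks you to do.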
Correlation
•
We can also use the spreadsheet to find the sample correlation matrix, and
the procedure is identical to that of finding the covariance, except that we
choose the Correlation Analysis Tool.
•
Here is the correlation matrix for the pizza example.
              Day       Hour      Prep Time   Wait Time   Travel Time   Distance
Day           1.0000
Hour          0.0000    1.0000
Prep Time     0.1019    -0.0811   1.0000
Wait Time     0.1746    -0.1122   0.0905      1.0000
Travel Time   0.0516    0.0508    0.0166      -0.0479     1.0000
Distance      0.0830    0.0618    -0.0208     -0.0208     0.9359        1.0000
•
The off-diagonal terms are the sample correlation coefficients between pairs
of variables. Excel does these computations correctly and no adjustments
are necessary.
Summarizing Qualitative Data in Tables
•
Excel has a utility called a Pivot Table that allows us to create and analyze
tabular summaries (contingency tables) of qualitative data. It can also be
used with quantitative data or combinations of quantitative and qualitative
data.
•
To use the pivot table feature, data must be entered in columns and each
column must have a title or header. Before invoking the procedure, be sure
that the cursor is in one of the cells containing a header or data.
•
To start the “wizard,” go to Data/PivotTable and PivotChart Report. In the
first step, just click on Next (the default values are what we want). In the
second step, verify that the data range shown contains all of the data that
you want to analyze, then click on Next again.
•
In step 3, click on the button called “Layout.” You will be presented with
the following dialog box (except the buttons on the right will change
according to the data set you are using).
•
At this point, click on and drag the button corresponding to the variable that
you want on the rows of your output table to the area labeled “Row,” and
the variable you want in columns to the area that says “Column.” Then
drag either of the two buttons that you just used to the “Data” area. I
recommend always dragging one of the qualitative variables’ buttons. The
button should change to say “Count of VARIABLE,” where VARIABLE is the
name of the variable that you dragged to the middle. Then click OK.
•
To complete the procedure there are a few other options you can change if
you desire, but I usually just click on Finish at this point and change options
later if the output is not what I desire. If you have used a quantitative
variable, you will likely want to group it. To do so, right click on the
variable name in the table. One item in the pop-up menu should say Group.
Choose it, and then specify how you want the variable to be grouped.
•
The pivot table can display several different types of summary measures.
The default or “normal” state is to display total counts. There may be times
that you want to display the numbers in the table as overall percentages, as
row percentages, etc. To change the display, click anywhere in the table
and go again to the Data/PivotTable and PivotChart Report menu item. You
should be at step 3 again. Click on Layout and then double click what is in
the middle of the table (it should say “Count of…”). Then select Options.
A drop down menu that says “Show Data As” will be in the middle of the
dialog box. Use the drop down menu to say how you want to display the
data. Then exit out of all of the boxes.
•
The default way that Excel lists the categories in qualitative variables is
alphabetically. You may want them listed in some kind of logical ascending
order (for example, you may want to list class standing as Freshman,
Sophomore, Junior and Senior). To tell Excel how you want the labels to
be ordered, go to the Tools menu, select options, and then click on the tab
called “Custom Lists.” Then you can type in the list items in the order you
want them (separate them with a comma or return) in the List Entries section.
Or you can import the list in the order that you want by identifying the cells
where they are listed.
•
Below is a portion of an Excel worksheet with both qualitative and
quantitative variables. It shows both a portion of the original data and the
resulting pivot table. I created a custom list in Excel as “Good, Very
Good, Excellent.”
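For readers who want to see what the pivot table is computing, here is a minimal Python sketch of a two-way count table with a custom row order and row percentages. The records, the column names "Rating" and "Class", and the crosstab function are all hypothetical; only the ordering "Good, Very Good, Excellent" comes from the notes.

```python
from collections import Counter

def crosstab(records, row_key, col_key, row_order, col_order):
    """Count records for every (row category, column category) pair."""
    counts = Counter((r[row_key], r[col_key]) for r in records)
    return [[counts[(row, col)] for col in col_order] for row in row_order]

# Hypothetical survey records with two qualitative variables.
records = [
    {"Rating": "Good", "Class": "Freshman"},
    {"Rating": "Excellent", "Class": "Senior"},
    {"Rating": "Very Good", "Class": "Freshman"},
    {"Rating": "Good", "Class": "Senior"},
    {"Rating": "Good", "Class": "Freshman"},
    {"Rating": "Excellent", "Class": "Freshman"},
]
row_order = ["Good", "Very Good", "Excellent"]   # custom list, not alphabetical
col_order = ["Freshman", "Senior"]

table = crosstab(records, "Rating", "Class", row_order, col_order)

# Same table shown as row percentages (one of the "Show Data As" choices).
row_pct = [[100.0 * cell / sum(row) for cell in row] for row in table]

for name, row in zip(row_order, table):
    print(f"{name:<10} {row}")
```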
Random Sampling
•
We can obtain a random sample from a set of data using the analysis tools.
The tool is called "Sampling." Before using the tool, I suggest including a
column in the data file that is a numbered label. After selecting Data
Analysis, choose the Sampling tool. Next indicate the location of the
numbers to be sampled from (which would be the location of the data
labels), input the first cell of the output block, choose random (rather than
periodic), then indicate how many samples you want to draw (i.e., the
sample size). Then hit OK.
•
With the above procedure, it is possible to obtain repeated items in the
sample (e.g., the same item could be drawn twice). That is why I use the
label column rather than the original data column to create the sample. That
way I can tell if I have duplicates. If I do obtain a duplicate, I simply
continue to draw more samples until I have a sufficient number of distinct
items for the desired sample size.
•
The best way I know to look for a duplicate is to sort the data. The sort
routine is under the Data menu or can be found on the tool bar.
•
To find the actual data associated with the label, we can use the function
=VLOOKUP. Suppose that my labels are in cells A2:A301 and the data
from which I want the random sample are in cells B2:B301. Suppose also
that I started the output from the Sampling tool in C2 and drew a sample of
25 (so the sampled labels are in cells C2:C26). I will also assume that I
don’t have any duplicates. Then in cell D2 I would enter the function
=VLOOKUP(C2,$A$2:$B$301,2). This function says to look for what is in
cell C2 in the first column of A2:B301 and, when the number is found, to
report back the corresponding number in the second column of A2:B301 (the
2 is what tells it to report back what is in the second column). Because the
fourth argument is omitted, VLOOKUP expects the labels in column A to be
sorted in ascending order, which numbered labels are. Then I would
copy cell D2’s contents down through cell D26.
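The whole procedure (draw labels, discard duplicates, keep drawing, then look up the data) can be sketched in Python as follows. The function draw_distinct, the seed, and the data values are assumptions made for the illustration; the dictionary lookup plays the role that VLOOKUP plays in the worksheet.

```python
import random

def draw_distinct(labels, size, rng):
    """Draw labels one at a time, skipping repeats, until size are distinct."""
    chosen = []
    while len(chosen) < size:
        pick = rng.choice(labels)     # each draw can repeat an earlier label
        if pick not in chosen:        # a duplicate is ignored; keep drawing
            chosen.append(pick)
    return chosen

# Hypothetical worksheet: labels 1..300 in one column, data in the next.
labels = list(range(1, 301))
data = {label: 100 + label for label in labels}   # made-up data values

rng = random.Random(2024)             # fixed seed so the run is repeatable
sampled_labels = draw_distinct(labels, 25, rng)

# Equivalent of VLOOKUP: fetch the data value that goes with each label.
sampled_values = [data[label] for label in sampled_labels]
print(sorted(sampled_labels))
```

In Python, random.sample(labels, 25) draws without replacement in one step; the loop above is written out only to mirror the manual procedure in the notes.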
Inference Tools
•
The majority of the tools in Excel are for statistical inference. I will discuss
how to use the tools for confidence intervals on one mean, hypothesis
tests on one and two means, analysis of variance, and regression.
Confidence Intervals
•
I have already described the Descriptive Statistics Tool, which is what we
use to do confidence intervals. The tool is useful for cases where we have
the data and we do not know the population standard deviation. Then we
use the first and last numbers in the Descriptive Statistics output to
create the confidence interval. The first number is the sample average. The
last number, which Excel calls Confidence Level(xx%) (which I consider to
be a very poor name), is the margin of error. The interval is the sample
average plus and minus the margin of error. Below I have repeated part of
the printout from above.
Height
Mean                      1.2
Standard Error            0.01206
...
Count                     11
Confidence Level(99.0%)   0.03822
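Using the numbers in this printout, the interval works out as in the short Python sketch below. The critical value 3.169 is not part of Excel's output; it is the two-sided 99% t value for 10 degrees of freedom taken from a standard t table, and is included only to show where the margin of error comes from.

```python
# Values read from the Descriptive Statistics output above.
mean = 1.2
standard_error = 0.01206
count = 11
margin = 0.03822                  # the "Confidence Level(99.0%)" row

lower = mean - margin
upper = mean + margin
print(f"99% confidence interval: ({lower:.5f}, {upper:.5f})")

# Assumption: t table value for 99% confidence with count - 1 = 10 df.
t_critical = 3.169
recomputed_margin = t_critical * standard_error
print(f"t * standard error = {recomputed_margin:.5f}")
```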
Hypothesis Test on One Mean:
This procedure is used when you do not know the population standard
deviation and you have all of the data given. Before going to the Tools
menu you need to add another column which consists only of the
hypothesized value µ0, next to each value of the original data. The easiest
way to do this is to enter µ0 once, and then use the fill down command to
put it in the rest of the cells.
Then from the Data Analysis Tools select "t-Test: Paired Two-Sample for
Means" in Excel. Variable 1 input will be the column where the original data
are located. Variable 2 input will be the column where the hypothesized
value is located. Indicate where you want the output to go, and give a level
of significance (α) value. The (hypothesized) difference should always be 0
or can be left blank. Finally, if you labeled your columns and included them
in the Variable 1 and Variable 2 input portions, then click the labels box.
Example:
Pineapple Corporation (PC) maintains that their cans have always contained
an average of 12 ounces of fruit. The production group believes that the
mean weight has changed. The drained weights in ounces for a sample of 15
cans of fruit from PC had a mean value of 12.09 and a standard deviation of
.20. Use an appropriate hypothesis test to determine if the data show
evidence of a change in mean weight. Use a significance level of .01. The
output is presented below.
t-Test: Paired Two-Sample for Means
                               Weight     Hypothesized Mean
Mean                           12.08667   12
Variance                       0.041238   0
Observations                   15         15
Pearson Correlation            #DIV/0!
Pooled Variance                0
Hypothesized Mean Difference   0
df                             14
t                              1.652907
P(T<=t) one-tail               0.060295
t Critical one-tail            2.624492
P(T<=t) two-tail               0.120591
t Critical two-tail            2.976849
•
Conclusions:
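The t statistic in this output can be reproduced from the summary rows with the usual one-sample formula, t = (sample mean - µ0) / sqrt(variance / n). The Python sketch below does that and applies the two-sided decision rule at the .01 level; it uses only numbers printed in the output, so small differences from Excel's value are due to rounding.

```python
import math

# Values read from the output above.
sample_mean = 12.08667
sample_variance = 0.041238
n = 15
mu0 = 12                          # hypothesized mean
t_critical_two_tail = 2.976849    # for alpha = .01 with 14 df
p_two_tail = 0.120591
alpha = 0.01

t_stat = (sample_mean - mu0) / math.sqrt(sample_variance / n)
reject_by_t = abs(t_stat) > t_critical_two_tail
reject_by_p = p_two_tail < alpha

print(f"t = {t_stat:.4f}, reject H0: {reject_by_t}")
```

Both criteria agree here: the statistic is inside the critical values and the two-tail p-value is larger than .01.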
Testing Two Means (with unpaired or unmatched samples)
If we want to test the relationship between two means, we have two choices:
"t-Test: Two-Sample Assuming Equal Variances" or "t-Test: Two-Sample
Assuming Unequal Variances." The choice depends on what we
believe the relationship is between the population variances of the two
groups. Whatever we decide, the procedure in Excel is identical once we
have made our choice. Variable 1 input will be the column (or row)
where the first set of data is located. Variable 2 input will be the column (or
row) where the second set of data is located. Indicate where you want the
output to go, and give a level of significance (α) value. The hypothesized
difference will usually be 0, but not always. Finally, if you labeled your
columns (or rows) and included them in the Variable 1 and Variable 2 input
portions, then click the labels box.
•
Consider the following example. A manager is interested in determining
whether the productivity of workers that work during two different shifts is
the same. To test her hypothesis, the manager randomly samples 8 workers
from each shift and records the average time (in minutes) needed to
complete a given assembly-line task, with the results given below.
Shift 1   81.2   72.6   56.8   76.9   42.5   49.6   62.8   48.2
Shift 2   56.6   58.6   45.4   39.1   42.8   65.2   40.7   49.9
From the data, can we conclude that the two shifts have the same
productivity level? It looks like the second shift completes the task in less
time, but is the difference due to sampling, or because the mean times are
really different? The output from the two procedures is given below.
t-Test: Two-Sample Assuming Equal Variances
                               Shift 1    Shift 2
Mean                           61.325     49.7875
Variance                       207.3564   89.50125
Observations                   8          8
Pooled Variance                148.4288
Hypothesized Mean Difference   0
df                             14
t Stat                         1.894011
P(T<=t) one-tail               0.039536
t Critical one-tail            1.761309
P(T<=t) two-tail               0.079072
t Critical two-tail            2.144789
t-Test: Two-Sample Assuming Unequal Variances
                               Shift 1    Shift 2
Mean                           61.325     49.7875
Variance                       207.3564   89.50125
Observations                   8          8
Hypothesized Mean Difference   0
df                             12
t Stat                         1.894011
P(T<=t) one-tail               0.041288
t Critical one-tail            1.782287
P(T<=t) two-tail               0.082575
t Critical two-tail            2.178813
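The two t statistics and their degrees of freedom can be reproduced from the shift data with the standard textbook formulas, as in the Python sketch below. The function names pooled_t and unequal_variance_t are made up for the illustration. The unequal-variance degrees of freedom use the usual Welch-Satterthwaite approximation, which gives about 12.09 here; Excel reports 12.

```python
import math
from statistics import mean, variance

shift1 = [81.2, 72.6, 56.8, 76.9, 42.5, 49.6, 62.8, 48.2]
shift2 = [56.6, 58.6, 45.4, 39.1, 42.8, 65.2, 40.7, 49.9]

def pooled_t(a, b):
    """Equal-variance t statistic and degrees of freedom."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df, pooled_var

def unequal_variance_t(a, b):
    """Unequal-variance t statistic with approximate degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

t_equal, df_equal, pooled_var = pooled_t(shift1, shift2)
t_unequal, df_unequal = unequal_variance_t(shift1, shift2)
print(t_equal, df_equal, pooled_var)
print(t_unequal, df_unequal)
```

The two t statistics are identical in this example because both samples have 8 observations; only the degrees of freedom, and therefore the p-values and critical values, differ.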
Testing Two Variances
•
To do this type of problem on the computer, go to the Data Analysis Tools,
and select "F-test Two-Sample for Variances." Variable 1 input will be the
column (or row) where the first set of data is located. Try to use the
variable with the largest sample variance as variable 1. Variable 2 input will
be the column (or row) where the second set of data is located. Give the
value of α and indicate where you want the output to go. Finally, if you
labeled your columns (or rows) and included them in the Variable 1 and
Variable 2 input portions, then click the labels box.
•
For this procedure, Excel only calculates one-sided values. If the test is
two-sided (as it usually is) you have two options. First, you can divide the
given value of α by 2, and input the result as the level of significance. The
second option is to always use the p-value criterion and for a two-sided test,
multiply the one-sided p-value by 2.
•
For the example:
F-Test: Two-Sample for Variances
                      Shift 1   Shift 2
Mean                  61.325    49.7875
Variance              207.356   89.501
Observations          8         8
df                    7         7
F                     2.317
P(F<=f) one-tail      0.145
F Critical one-tail   3.787
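The F statistic is just the ratio of the two sample variances, and the second option described above amounts to doubling the one-tail p-value. A minimal Python sketch, using the shift data and the one-tail p-value printed in the output (the significance level of .05 is an assumption for the illustration):

```python
from statistics import variance

shift1 = [81.2, 72.6, 56.8, 76.9, 42.5, 49.6, 62.8, 48.2]
shift2 = [56.6, 58.6, 45.4, 39.1, 42.8, 65.2, 40.7, 49.9]

# Larger sample variance goes in the numerator (Variable 1).
f_stat = variance(shift1) / variance(shift2)

one_tail_p = 0.145                       # read from the Excel output
two_tail_p = min(1.0, 2 * one_tail_p)    # two-sided test: double it

alpha = 0.05                             # assumed significance level
print(f"F = {f_stat:.3f}, two-sided p-value = {two_tail_p:.2f}")
print("reject equal variances:", two_tail_p < alpha)
```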
ANOVA:
•
Excel can do one and two-way analysis of variance. I only describe the
single factor case below. If you are interested in two-way ANOVA, Excel’s
help should guide you through it. It should also be very similar to what is
described below.
•
After selecting Data Analysis, choose the option called "Anova: Single
Factor" in Excel. Next specify the input block, which will contain the data
from all groups. Each group should be in its own column or row. If the
groups have differing numbers of observations, be sure that the highlighted
block includes all of them. Excel will handle the blank cells without a
problem. Indicate where to send the output, and then input a value of α.
Check the box indicating whether the groups are entered in columns or rows,
and check the label box if you have included labels in your input block.
Then start the procedure.
•
Example
Three different automatic milling machines at Castmetal, Inc. were set up to
mill the same type of part. Observations were taken at random times to find
out how many parts were being produced per hour by each machine. Only
four observations were taken on machine 3 since the inspector became ill
and had to go home before he could complete his work. These data were
entered into Excel in cells A1:C5. Can we conclude that the mean hourly
output for the three machines is different?
Machine 1   Machine 2   Machine 3
105         91          104
105         99          106
110         89          99
107         95          109
102         103
•
Below is the dialog box and output for the example.
Anova: Single-Factor

Summary
Groups      Count   Sum   Average   Variance
Machine 1   5       529   105.8     8.7
Machine 2   5       477   95.4      32.8
Machine 3   4       418   104.5     17.6667

ANOVA
Source of Variation   SS         df   MS         F          P-value    F crit
Between Groups        313.8571   2    156.9286   7.882257   0.007518   7.205699
Within Groups         219        11   19.90909
Total                 532.8571   13
•
Conclusions:
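The sums of squares and the F statistic in the ANOVA table can be reproduced directly from the machine data. The Python sketch below uses the standard single-factor formulas; it does not compute the p-value or critical value, which are read from Excel's output.

```python
groups = {
    "Machine 1": [105, 105, 110, 107, 102],
    "Machine 2": [91, 99, 89, 95, 103],
    "Machine 3": [104, 106, 99, 109],      # only four observations
}

all_values = [v for values in groups.values() for v in values]
grand_mean = sum(all_values) / len(all_values)

ss_between = 0.0
ss_within = 0.0
for values in groups.values():
    group_mean = sum(values) / len(values)
    ss_between += len(values) * (group_mean - grand_mean) ** 2
    ss_within += sum((v - group_mean) ** 2 for v in values)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)

f_crit = 7.205699                          # read from the Excel output
print(f"SS between = {ss_between:.4f}, SS within = {ss_within:.4f}")
print(f"F = {f_stat:.6f}, exceeds F crit: {f_stat > f_crit}")
```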
Regression:
•
Doing regression in Excel is very similar to using the other analysis tools.
With regression, however, having the data in the right form is more
important. First, all data should be entered in columns. Second, all
independent variables should be next to each other (i.e., in a contiguous set
of cells).
Once the data are entered correctly, select "Regression" from the Tools/
Data Analysis menu item in Excel. You will be presented with the dialogue
box shown below.
In the Input Y Range, enter the cell range referring to the column containing
the dependent variable. In the Input X Range, enter the range of cells
containing all independent variables. This is why the X variables need to be
next to each other. If your range of cells included a row of labels, click the
label box.
I never click the Constant is Zero box. In some physical systems it only
makes sense for the intercept to be 0, and this box forces it to be so. In our
examples that will never be the case. If you want a confidence interval for
the β values other than a 95% confidence interval, click in the Confidence
Level box and enter a different confidence level.
Next, indicate where you want the output to go. Finally, click on the box
next to “Residuals.” I leave all other boxes blank, because I don’t like the
way that Excel does the rest of the residual analysis or the normal probability
plot. Then hit enter.
•
Below is some sample output.
Regression Statistics
Multiple R          0.936248
R Square            0.87656
Adjusted R Square   0.874991
Standard Error      0.670277
Observations        240

Analysis of Variance
             df    Sum of Squares   Mean Square   F          Significance F
Regression   3     752.919          250.973       558.6225   7.1E-107
Residual     236   106.028          0.449271
Total        239   858.947

            Coefficients   Standard Error   t Statistic   P-value    Lower 95%   Upper 95%
Intercept   1.156832       0.229887         5.032169      9.54E-07   0.703939    1.609725
Day         -0.02521       0.022013         -1.14541      0.253183   -0.06858    0.018153
Hour        -0.00592       0.018919         -0.31297      0.754578   -0.04319    0.031351
Distance    1.754525       0.042988         40.8147       1E-109     1.669837    1.839213
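Several of the numbers in this output are tied together by standard formulas, and checking them is a useful way to learn to read the table. The Python sketch below recomputes R Square, Adjusted R Square, the Standard Error, F, and one t statistic from other entries in the output. It works only from the printed (rounded) values, so the results agree with Excel to within rounding.

```python
import math

# Values read from the Analysis of Variance table above.
ss_regression, ss_residual = 752.919, 106.028
df_regression, df_residual = 3, 236

ss_total = ss_regression + ss_residual
n = df_regression + df_residual + 1              # number of observations

r_square = ss_regression / ss_total
adjusted_r_square = 1 - (1 - r_square) * (n - 1) / df_residual

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_stat = ms_regression / ms_residual
standard_error = math.sqrt(ms_residual)          # "Standard Error" row

# Each t statistic is the coefficient divided by its standard error.
t_distance = 1.754525 / 0.042988

print(f"R Square = {r_square:.5f}, Adjusted = {adjusted_r_square:.6f}")
print(f"F = {f_stat:.2f}, Standard Error = {standard_error:.6f}")
print(f"t for Distance = {t_distance:.3f}")
```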