Data Analysis Tools
Stat 5969 Statistical Software Packages

• This section of the notes introduces many of the tools that Excel provides under the Tools/Data Analysis menu item. If your computer does not have that tool loaded, go to Tools/Add-Ins and check the box for Analysis ToolPak. When you do so, you may be prompted to insert your original CD to load the tools.

Tools for Summarizing Data

• There are two principal analysis tools for summarizing data: “Histogram” and “Descriptive Statistics.”

Histograms

• We can use a spreadsheet to obtain a histogram. Excel first finds the frequency distribution and then draws the plot. It also has the option of producing an ogive. The procedure is below.

1. To get to the Analysis Tools, select Tools/Data Analysis. This will bring up the list of statistical methods.

2. Select the tool entitled "Histogram." The dialog box below will then appear. All of the analysis tools in Excel provide a similar dialog box.

3. In the dialog box, specify where the data you want to analyze are located and where you want the output to go. Specify the location of the data either by typing the cell range or by dragging the mouse over the cells containing the data. For now, skip the box asking for the bin range (see below for how to use the bin range input). If your input range includes the row that has the variable name or heading, check the Labels box. In the box asking for the output range, type or click on the cell reference where you want the output to begin. Do not mark the box next to "Pareto." If you want Excel to draw the histogram, check the chart output box. The “Cumulative Percentage” box will give you the ogive. Then click OK.

• The result of this procedure will be a frequency distribution.
The first column shows the right endpoint (the maximum value) of each class interval, which Excel refers to as a “bin.” The second column shows the number of observations in the bin, and the third column contains the cumulative percentage of observations falling in or below the bin.

Example Output:

Bin      Frequency   Cumulative %
16.9         1           0.42%
18.84        2           1.25%
20.78       13           6.67%
22.72       38          22.50%
24.66       64          49.17%
26.6        56          72.50%
28.54       26          83.33%
30.48       18          90.83%
32.42        8          94.17%
34.36        8          97.50%
36.3         1          97.92%
38.24        2          98.75%
40.18        1          99.17%
42.12        1          99.58%
44.06        0          99.58%
More         1         100.00%

[Figure: Histogram chart. Bars show Frequency and a line shows Cumulative %, plotted against Bin.]

• The way to interpret the frequency distribution is as follows. The first frequency is the number of data points with values less than or equal to the first bin number. The next frequency is the number of data points less than or equal to the second bin number, but greater than the first bin number. For example, in the output above, there is one number in the data set that is less than or equal to 16.9. There are 2 numbers in the data set larger than 16.9 and less than or equal to 18.84. The other frequencies are interpreted similarly. The last bin always says “More.” Its frequency tells us how many numbers in the data set are larger than the last numeric bin value (the one just before “More”). In the example above, 1 number in the data set is larger than 44.06.

Minor Fixes to Excel’s Output

• There are two things about Excel’s histogram output that I don’t like. The first is the way it handles the first bin. It always sets the first bin value equal to the smallest number in the data set. Hence its frequency is almost always equal to 1.
In almost every case, I choose to combine this bin with the next one. To do so, I add the frequency of the first bin to the frequency of the second bin, and then delete the first row of the output given by Excel. For the example above, the first two rows of my modified frequency distribution would look like this.

Bin      Frequency   Cumulative %
18.84        3           1.25%
20.78       13           6.67%

• The second thing that I don’t like is that the chart Excel automatically constructs is actually a bar graph. To make it look more like a histogram, there should be no space between the bars. To remove the space, double click on the bars of the chart, select the Options tab, change the Gap width to 0, and then select OK.

Selecting Your Own Bin Values

• If you don’t like the bin values that Excel uses, you can create your own. Below I describe the process that I would follow to do it. As you can see, it is quite a bit longer, and my preference is to let Excel choose the bin values.

1. First determine the number of bins. Say that the number of observations you have is n. Then a rule for the number of bins is $(2n)^{1/3}$ (i.e., the cube root of 2n). You will usually have to round this number to an integer. The usual suggestion is to round up. For the example above, there were 240 data points. Then $(2 \cdot 240)^{1/3} = 7.83$. We round up to get 8 bins.

2. To find the bin width, take the range of the data (largest minus smallest), and divide by the number of bins found in step 1. Again you will want to round up to determine the actual bin width, but how to round is quite subjective (you can go to the nearest integer, tenth, hundredth, etc.). For the example, the smallest and largest of the 240 values were 16.9 and 46. To find the width, we use (46 - 16.9)/8 = 3.64. The original data had two decimal places, so it is convenient to use two decimal places for the bin width.
To make it an “even” number, I decided to use 3.65 as the bin width.

3. When creating the bin boundaries, I take the smallest number and add the bin width to it to obtain the starting bin value. If you don’t like fractions or “uneven” numbers, you can round to a neighbor that fits your criteria for a good starting value. Excel will take the first number that you put in the bin range and find how many numbers in the data set are less than or equal to that number. Then it will take the second number in the bin range and find how many are greater than the first bin number but less than or equal to the second bin number.

For the example, say my original data are in cells A2:A241 and cell A1 contains a label. In cells C2:C8 I can enter the numbers 20.5 (which is close to 16.9 + 3.65), 24.15, 27.8, 31.45, 35.1, 38.75, and 42.4. Notice that I only entered 7 numbers even though there are 8 bins; the 8th bin will be created by Excel and called “More.” In cell C1 I should enter some label for the bins. The most obvious choice is to just type “Bin” in C1. (If you check the “Labels in First Row” box, you must add a label to the bins as well.) Now select Data Analysis from the Tools menu and choose Histogram. Enter A1:A241 as the data input range. In the bin input range, enter C1:C8. Choose the other options as before. Then click OK.

• Below is the resulting output, including the chart (after adjusting the gap width to 0).

Bin      Frequency   Cumulative %
20.5        13           5.42%
24.15       88          42.08%
27.8        93          80.83%
31.45       28          92.50%
35.1        13          97.92%
38.75        2          98.75%
42.4         2          99.58%
More         1         100.00%

[Figure: Histogram chart with no gaps between bars. Bars show Frequency and a line shows Cumulative %, plotted against Bin.]

• The interpretation of the frequency distribution is exactly the same as before.
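The bin-selection steps can also be checked outside of Excel. The following is a minimal Python sketch of the cube-root rule, the bin width, and the "less than or equal to the bin value" counting rule described in these notes. The function names are illustrative only and the toy values are hypothetical; this is not how Excel itself is implemented.

```python
import math
from bisect import bisect_left


def suggested_bin_count(n):
    """Round the cube root of 2n up to the next whole number."""
    return math.ceil((2 * n) ** (1 / 3))


def raw_bin_width(smallest, largest, bin_count):
    """Range of the data divided by the number of bins, before any rounding."""
    return (largest - smallest) / bin_count


def bin_upper_limits(first_limit, width, bin_count, decimals=2):
    """Upper limits for every bin except the last, which Excel labels 'More'."""
    return [round(first_limit + k * width, decimals) for k in range(bin_count - 1)]


def count_into_bins(values, upper_limits):
    """A value goes in the first bin whose upper limit is >= the value.
    Anything above the last limit is counted in the final 'More' entry."""
    counts = [0] * (len(upper_limits) + 1)
    for value in values:
        counts[bisect_left(upper_limits, value)] += 1
    return counts


bin_count = suggested_bin_count(240)               # 8 bins for 240 observations
width = raw_bin_width(16.9, 46, bin_count)         # about 3.64 before rounding
limits = bin_upper_limits(20.5, 3.65, bin_count)   # the rounded choices in the notes

# Hypothetical values, used only to show how boundary cases are counted.
toy_values = [16.9, 20.5, 20.51, 24.15, 30.0, 46.0]
toy_counts = count_into_bins(toy_values, limits)
```

Here `limits` reproduces the seven numbers entered in C2:C8, and a value exactly equal to a bin value (such as 20.5) is counted in that bin rather than the next one.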
Descriptive Statistics

• To use Excel to obtain a listing of descriptive statistics, we again use the Analysis Tools. This time, instead of selecting "Histogram," select "Descriptive Statistics." Indicate where the data are located, and select whether they are in rows or columns. If you want the data set to have a descriptive title, include the label in the first entry above the data and then check the box next to "Labels." Specify where you want the output to go. I recommend always checking the “Summary Statistics” box. I also recommend checking the Confidence Level for Mean box (and filling in the confidence level) if you are interested in confidence intervals for the mean. I rarely use the other boxes.

• You can do descriptive statistics on several variables at once. You just need to be sure that the variables are next to each other in the spreadsheet, and then refer to all the columns in the input portion of the dialog box.

• Here is some example output.

Height
Mean                        1.2
Standard Error              0.01206
Median                      1.18
Mode                        1.23
Standard Deviation          0.04
Sample Variance             0.0016
Kurtosis                    -1.11302
Skewness                    0.50417
Range                       0.12
Minimum                     1.15
Maximum                     1.27
Sum                         13.2
Count                       11
Confidence Level(99.0%)     0.03822

Box plots

• There is nothing built in to Excel to do box plots. I have created a template that will do up to 4 simultaneous box plots. It is also limited to data sets of no more than 500 observations. It has some faults, but it is not bad. The file is called Multiple Boxplots.xls. Below is a sample of what it produces.

[Figure: Two horizontal box plots, labeled Automobile and Public, on a common axis running from 0 to 50.]

Covariance

• A covariance matrix can be obtained from the spreadsheet by using the Covariance Analysis Tool. Select Data Analysis, then Covariance. Identify the input area, and where you would like the output to go.
Indicate whether the data are grouped by column or row, and whether labels are being used, and then select OK. Example output is shown below.

**WARNING** This analysis tool divides the cross products by n rather than by n - 1. If you want true sample variances and covariances, you should multiply all of the numbers by $n/(n-1)$.

               Day        Hour       Prep Time   Wait Time   Travel Time   Distance
Day            3.906276
Hour           0          5.271967
Prep Time      0.212155   -0.19626   1.110149
Wait Time      1.123469   -0.83906   0.310482    10.60447
Travel Time    0.193243   0.221318   0.033146    -0.29553    3.59392
Distance       0.166276   0.143933   -0.02226    -0.06857    1.799046      1.02825

• The numbers on the diagonal are variances (except that they are divided by n), and all other numbers are covariances. The matrix is symmetric, so only the numbers on one side of the diagonal are shown.

Correlation

• We can also use the spreadsheet to find the sample correlation matrix. The procedure is identical to that for finding the covariance, except that we choose the Correlation Analysis Tool.

• Here is the correlation matrix for the pizza example.

               Day       Hour      Prep Time   Wait Time   Travel Time   Distance
Day            1.0000
Hour           0.0000    1.0000
Prep Time      0.1019    -0.0811   1.0000
Wait Time      0.1746    -0.1122   0.0905      1.0000
Travel Time    0.0516    0.0508    0.0166      -0.0479     1.0000
Distance       0.0830    0.0618    -0.0208     -0.0208     0.9359        1.0000

• The off-diagonal terms are the sample correlation coefficients between pairs of variables. Excel does these computations correctly and no adjustments are necessary.

Summarizing Qualitative Data in Tables

• Excel has a utility called a Pivot Table that allows us to create and analyze tabular summaries (contingency tables) of qualitative data. It can also be used with quantitative data or combinations of quantitative and qualitative data.
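Returning briefly to the warning in the Covariance section: the adjustment from a divide-by-n covariance to a divide-by-(n - 1) covariance, and the reason the Correlation tool needs no such adjustment, can be illustrated with a short Python sketch. The data values and function names below are hypothetical and are not taken from the pizza example.

```python
import math


def cross_product_sum(xs, ys):
    """Sum of (x - mean_x) * (y - mean_y) over all paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))


def covariance_divide_by_n(xs, ys):
    """Covariance with divisor n, as the warning says the Covariance tool uses."""
    return cross_product_sum(xs, ys) / len(xs)


def covariance_divide_by_n_minus_1(xs, ys):
    """The usual sample covariance, with divisor n - 1."""
    return cross_product_sum(xs, ys) / (len(xs) - 1)


def correlation(xs, ys, cov):
    """Correlation built from whichever covariance function is passed in."""
    return cov(xs, ys) / math.sqrt(cov(xs, xs) * cov(ys, ys))


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

excel_style = covariance_divide_by_n(x, y)            # 1.2
adjusted = excel_style * n / (n - 1)                  # multiply by n/(n-1)
sample_cov = covariance_divide_by_n_minus_1(x, y)     # 1.5

r_from_n = correlation(x, y, covariance_divide_by_n)
r_from_n_minus_1 = correlation(x, y, covariance_divide_by_n_minus_1)
```

The divisor appears once in the numerator and once (under the square root of a product) in the denominator of the correlation, so it cancels. That is why the two correlation values agree while the covariances differ by the factor n/(n - 1).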
• To use the pivot table feature, data must be entered in columns and each column must have a title or header. Before invoking the procedure, be sure that the cursor is in one of the cells containing a header or data.

• To start the “wizard,” go to Data/PivotTable and PivotChart Report. In the first step, just click on Next (the default values are what we want). In the second step, verify that the data range shown contains all of the data that you want to analyze, then click on Next again.

• In step 3, click on the button called “Layout.” You will be presented with the following dialog box (except that the buttons on the right will change according to the data set you are using).

• At this point, click on and drag the button corresponding to the variable that you want on the rows of your output table to the area labeled “Row,” and drag the variable you want in columns to the area that says “Column.” Then drag either of the two buttons that you just used to the “Data” area. I recommend always dragging one of the qualitative variables’ buttons. The button should change to say “Count of VARIABLE,” where VARIABLE is the name of the variable that you dragged to the middle. Then click OK.

• To complete the procedure there are a few other options you can change if you desire, but I usually just click on Finish at this point and change options later if the output is not what I want. If you have used a quantitative variable, you will likely want to group it. To do so, right click on the variable name in the table. One item in the pop-up menu should say Group. Choose it, and then specify how you want the variable to be grouped.

• The pivot table can display several different types of summary measures. The default or “normal” state is to display total counts. There may be times that you want to display the numbers in the table as overall percentages, as row percentages, etc.
To change the display, click anywhere in the table and go again to the Data/PivotTable and PivotChart Report menu item. You should be at step 3 again. Click on Layout and then double click what is in the middle of the table (it should say “Count of…”). Then select Options. A drop-down menu that says “Show Data As” will be in the middle of the dialog box. Use the drop-down menu to choose how you want to display the data. Then exit out of all of the boxes.

• By default, Excel lists the categories of qualitative variables alphabetically. You may want them listed in some kind of logical ascending order (for example, you may want to list class standing as Freshman, Sophomore, Junior and Senior). To tell Excel how you want the labels to be ordered, go to the Tools menu, select Options, and then click on the tab called “Custom Lists.” Then you can type in the list items in the order you want them (separate them with a comma or return) in the List Entries section. Or you can import the list in the order that you want by identifying the cells where the items are listed.

• Below is a portion of an Excel worksheet with both qualitative and quantitative variables. It shows both a portion of the original data and the resulting pivot table. I created a custom list in Excel as “Good, Very Good, Excellent.”

Random Sampling

• We can obtain a random sample from a set of data using the analysis tools. The tool is called "Sampling." Before using the tool, I suggest including a column in the data file that is a numbered label. After selecting Data Analysis, choose the Sampling tool. Next indicate the location of the numbers to be sampled from (which would be the location of the data labels), input the first cell of the output block, choose random (rather than periodic), and then indicate how many samples you want to draw (i.e., the sample size). Then click OK.
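Sampling numbered labels one draw at a time, with each draw independent of the others, means the same label can come up more than once. The Python sketch below imitates that idea: it keeps drawing labels until enough distinct ones have been collected, then looks up the data value for each label. The function name, the seed, and the made-up data values are assumptions for illustration; this is not the algorithm Excel uses internally.

```python
import random


def draw_distinct_labels(labels, sample_size, seed=None):
    """Draw labels with replacement until sample_size distinct labels appear.

    Returns every draw (repeats included) and the distinct labels in the
    order they were first drawn.
    """
    if sample_size > len(set(labels)):
        raise ValueError("sample_size exceeds the number of distinct labels")
    rng = random.Random(seed)
    all_draws = []
    distinct = []
    seen = set()
    while len(distinct) < sample_size:
        label = rng.choice(labels)
        all_draws.append(label)
        if label not in seen:       # a repeat is noted but not kept
            seen.add(label)
            distinct.append(label)
    return all_draws, distinct


# Hypothetical data: labels 1..300, each paired with a made-up data value.
labels = list(range(1, 301))
data_by_label = {label: 10.0 + 0.1 * label for label in labels}

all_draws, sampled_labels = draw_distinct_labels(labels, 25, seed=1)

# Sorting the draws puts any repeated labels next to each other.
sorted_draws = sorted(all_draws)

# A dictionary plays the role of a lookup table from label to data value.
sample_values = [data_by_label[label] for label in sampled_labels]
```

If `len(all_draws)` is larger than 25, at least one label was drawn more than once along the way.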
• With the above procedure, it is possible to obtain repeated items in the sample (e.g., the same item could be drawn twice). That is why I use the label column rather than the original data column to create the sample. That way I can tell if I have duplicates. If I do obtain a duplicate, I simply continue to draw more samples until I have a sufficient number of distinct items for the desired sample size.

• The best way I know to look for a duplicate is to sort the data. The sort routine is under the Data menu or can be found on the tool bar.

• To find the actual data associated with the label, we can use the function =VLOOKUP. Suppose that my labels are in cells A2:A301 and the data from which I want the random sample are in cells B2:B301. Suppose also that I started the output from the Sampling tool in C2 and drew a sample of 25 (so the sampled labels are in cells C2:C26). I will also assume that I don’t have any duplicates. Then in cell D2 I would enter the function =VLOOKUP(C2,$A$2:$B$301,2). This function says to look for what is in cell C2 in the first column of A2:B301 and, when the number is found, to report back the corresponding number in the second column of A2:B301 (the 2 is what tells it to report back what is in the second column). Because the numbered labels in column A are in ascending order, the default approximate match works; adding FALSE as a fourth argument forces an exact match. Then I would copy cell D2’s contents down through cell D26.

Inference Tools

• The majority of the tools in Excel are for statistical inference. I will discuss how to use the tools for confidence intervals on one mean, hypothesis tests on one and two means, analysis of variance, and regression.

Confidence Intervals

• I have already described the Descriptive Statistics Tool, which is what we use to do confidence intervals. The tool is useful for cases where we have the data and we do not know the population standard deviation. Then we use the first and last numbers in the Descriptive Statistics output to create the confidence interval.
The first number is the sample average. The last number, which Excel calls Confidence Level(xx%) (which I consider to be a very poor name), is the margin of error. Below I have repeated part of the printout from above.

Height
Mean                        1.2
Standard Error              0.01206
...
Count                       11
Confidence Level(99.0%)     0.03822

For this example, the 99% confidence interval for the mean is 1.2 ± 0.03822, or (1.16178, 1.23822).

Hypothesis Test on One Mean:

• This procedure is used when you do not know the population standard deviation and you have all of the data. Before going to the Tools menu you need to add another column that consists only of the hypothesized value µ0, next to each value of the original data. The easiest way to do this is to enter µ0 once, and then use the fill down command to put it in the rest of the cells. Then from the Data Analysis Tools select "t-Test: Paired Two-Sample for Means" in Excel. Variable 1 input will be the column where the original data are located. Variable 2 input will be the column where the hypothesized value is located. Indicate where you want the output to go, and give a level of significance (α) value. The hypothesized difference should always be 0 or can be left blank. Finally, if you labeled your columns and included the labels in the Variable 1 and Variable 2 input ranges, then check the labels box.

Example: Pineapple Corporation (PC) maintains that their cans have always contained an average of 12 ounces of fruit. The production group believes that the mean weight has changed. The drained weights in ounces for a sample of 15 cans of fruit from PC had a mean value of 12.09 and a standard deviation of .20. Use an appropriate hypothesis test to determine if the data show evidence of a change in mean weight. Use a significance level of .01. The output is presented below.
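The paired-test trick works because subtracting a constant µ0 from every observation leaves the sample variance unchanged, so the paired statistic reduces to the usual one-sample t statistic, (x̄ - µ0)/(s/√n). The Python sketch below recomputes that statistic from the summary values Excel reports for the pineapple example (mean 12.08667, variance 0.041238, 15 observations) and compares it with the two-tail critical value from the same output. The function name is illustrative, and the critical value is copied from the Excel output rather than computed.

```python
import math


def one_sample_t(sample_mean, sample_variance, n, mu_0):
    """t statistic for testing whether the population mean equals mu_0."""
    standard_error = math.sqrt(sample_variance / n)
    return (sample_mean - mu_0) / standard_error


# Summary values as reported in the Excel output for the example.
t_stat = one_sample_t(sample_mean=12.08667, sample_variance=0.041238, n=15, mu_0=12)

# Two-tail critical value for alpha = .01 with 14 degrees of freedom,
# copied from the Excel output (not computed here).
t_critical_two_tail = 2.976849

# Two-sided decision rule: reject the null hypothesis only if |t| exceeds
# the two-tail critical value.
reject_null = abs(t_stat) > t_critical_two_tail
```

The computed `t_stat` agrees with the value Excel reports (about 1.6529) up to the rounding of the inputs.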
t-Test: Paired Two-Sample for Means

                                Weight      Hypothesized Mean
Mean                            12.08667    12
Variance                        0.041238    0
Observations                    15          15
Pearson Correlation             #DIV/0!
Pooled Variance                 0
Hypothesized Mean Difference    0
df                              14
t                               1.652907
P(T<=t) one-tail                0.060295
t Critical one-tail             2.624492
P(T<=t) two-tail                0.120591
t Critical two-tail             2.976849

• Conclusions:

Testing Two Means (with unpaired or unmatched samples)

• If we want to test the relationship between two means, we have two choices: "t-Test: Two-Sample Assuming Equal Variances" or "t-Test: Two-Sample Assuming Unequal Variances." The choice depends on what we believe about the relationship between the population variances of the two groups. Whatever we decide, the procedure in Excel is identical once we have made our choice. Variable 1 input will be the column (or row) where the first set of data is located. Variable 2 input will be the column (or row) where the second set of data is located. Indicate where you want the output to go, and give a level of significance (α) value. The hypothesized difference will usually be 0, but not always. Finally, if you labeled your columns (or rows) and included the labels in the Variable 1 and Variable 2 input ranges, then check the labels box.

• Consider the following example. A manager is interested in determining whether the productivity of workers who work during two different shifts is the same. To test her hypothesis, the manager randomly samples 8 workers from each shift and records the average time (in minutes) needed to complete a given assembly-line task, with the results given below.

Shift 1   81.2   72.6   56.8   76.9   42.5   49.6   62.8   48.2
Shift 2   56.6   58.6   45.4   39.1   42.8   65.2   40.7   49.9

From the data, can we conclude that the two shifts have the same productivity level?
It looks like the second shift completes the task in less time, but is the difference due to sampling, or are the mean times really different? The output from the two procedures is given below.

t-Test: Two-Sample Assuming Equal Variances

                                Shift 1     Shift 2
Mean                            61.325      49.7875
Variance                        207.3564    89.50125
Observations                    8           8
Pooled Variance                 148.4288
Hypothesized Mean Difference    0
df                              14
t Stat                          1.894011
P(T<=t) one-tail                0.039536
t Critical one-tail             1.761309
P(T<=t) two-tail                0.079072
t Critical two-tail             2.144789

t-Test: Two-Sample Assuming Unequal Variances

                                Shift 1     Shift 2
Mean                            61.325      49.7875
Variance                        207.3564    89.50125
Observations                    8           8
Hypothesized Mean Difference    0
df                              12
t Stat                          1.894011
P(T<=t) one-tail                0.041288
t Critical one-tail             1.782287
P(T<=t) two-tail                0.082575
t Critical two-tail             2.178813

Testing Two Variances

• To do this type of problem on the computer, go to the Data Analysis Tools, and select "F-Test Two-Sample for Variances." Variable 1 input will be the column (or row) where the first set of data is located. Try to use the variable with the largest sample variance as variable 1. Variable 2 input will be the column (or row) where the second set of data is located. Give the value of α and indicate where you want the output to go. Finally, if you labeled your columns (or rows) and included the labels in the Variable 1 and Variable 2 input ranges, then check the labels box.

• For this procedure, Excel only calculates one-sided values. If the test is two-sided (as it usually is) you have two options. First, you can divide the given value of α by 2, and input the result as the level of significance. The second option is to always use the p-value criterion and, for a two-sided test, multiply the one-sided p-value by 2.
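The test statistics in the two-sample outputs can be reproduced from the shift data with a few lines of Python. The sketch below computes the pooled-variance t statistic, the unequal-variance (Welch) t statistic with its approximate degrees of freedom, the variance ratio used by the F test, and the "multiply the one-sided p-value by 2" rule. Function names are illustrative; p-values and critical values are not computed here because they need a t or F distribution, which the standard library does not provide.

```python
import math
from statistics import mean, variance   # variance() divides by n - 1

shift_1 = [81.2, 72.6, 56.8, 76.9, 42.5, 49.6, 62.8, 48.2]
shift_2 = [56.6, 58.6, 45.4, 39.1, 42.8, 65.2, 40.7, 49.9]


def pooled_t(a, b):
    """t statistic and df assuming equal population variances."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, df, pooled_var


def welch_t(a, b):
    """t statistic and approximate df without assuming equal variances."""
    v_a, v_b = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(v_a + v_b)
    df = (v_a + v_b) ** 2 / (v_a ** 2 / (len(a) - 1) + v_b ** 2 / (len(b) - 1))
    return t, df


def variance_ratio(a, b):
    """F statistic with the first sample's variance in the numerator."""
    return variance(a) / variance(b)


def two_sided_p(one_sided_p):
    """Double a one-sided p-value, capped at 1."""
    return min(1.0, 2 * one_sided_p)


t_equal, df_equal, pooled_var = pooled_t(shift_1, shift_2)
t_unequal, df_unequal = welch_t(shift_1, shift_2)
f_stat = variance_ratio(shift_1, shift_2)

# 0.145 is the one-tail p-value Excel reports for the F test in this example.
f_two_sided_p = two_sided_p(0.145)
```

With equal sample sizes the two t statistics coincide (about 1.894), which matches the output. The approximate degrees of freedom come out near 12.09, consistent with the 12 that Excel reports.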
• For the example:

F-Test: Two-Sample for Variances

                        Shift 1    Shift 2
Mean                    61.325     49.7875
Variance                207.356    89.501
Observations            8          8
df                      7          7
F                       2.317
P(F<=f) one-tail        0.145
F Critical one-tail     3.787

ANOVA:

• Excel can do one-way and two-way analysis of variance. I only describe the single factor case below. If you are interested in two-way ANOVA, Excel’s help should guide you through it. It should also be very similar to what is described below.

• After selecting Data Analysis, choose the option called "Anova: Single Factor" in Excel. Next specify the input block, which will contain the data from all groups. Each group should be in its own column or row. If the groups have differing numbers of samples, be sure the highlighted block includes all samples. Excel will handle the blank cells without a problem. Indicate where to send the output, and then input a value of α. Check the box indicating whether the groups are entered in columns or rows, and check the label box if you have included labels in your input block. Then start the procedure.

• Example: Three different automatic milling machines at Castmetal, Inc. were set up to mill the same type of part. Observations were taken at random times to find out how many parts were being produced per hour by each machine. Only four observations were taken on machine 3 since the inspector became ill and had to go home before he could complete his work. These data were entered into Excel in cells A1:C5. Can we conclude that the mean hourly output for the three machines is different?

Machine 1   Machine 2   Machine 3
105          91         104
105          99         106
110          89          99
107          95         109
102         103

• Below is the dialog box and output for the example.
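Before reading the Excel output, the single-factor ANOVA quantities can be checked by hand from the milling-machine data. The Python sketch below computes the between-groups and within-groups sums of squares, their degrees of freedom, the mean squares, and the F statistic. The function and key names are illustrative; the p-value and critical value are left to Excel because they require the F distribution.

```python
from statistics import mean

# Observations per machine, as listed in the example.
groups = {
    "Machine 1": [105, 105, 110, 107, 102],
    "Machine 2": [91, 99, 89, 95, 103],
    "Machine 3": [104, 106, 99, 109],      # only four observations
}


def one_way_anova(groups):
    """Sums of squares, degrees of freedom, mean squares, and F."""
    all_values = [value for values in groups.values() for value in values]
    grand_mean = mean(all_values)

    # Between groups: how far each group mean is from the grand mean.
    ss_between = sum(
        len(values) * (mean(values) - grand_mean) ** 2
        for values in groups.values()
    )
    # Within groups: how far each observation is from its own group mean.
    ss_within = sum(
        sum((value - mean(values)) ** 2 for value in values)
        for values in groups.values()
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "ss_total": ss_between + ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "ms_between": ms_between,
        "ms_within": ms_within,
        "f": ms_between / ms_within,
    }


anova = one_way_anova(groups)
```

Unequal group sizes need no special handling here, because each group's squared deviation is weighted by its own count.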
Anova: Single-Factor

Summary

Groups       Count   Sum   Average   Variance
Machine 1    5       529   105.8     8.7
Machine 2    5       477   95.4      32.8
Machine 3    4       418   104.5     17.6667

ANOVA

Source of Variation   SS         df   MS         F          P-value    F crit
Between Groups        313.8571   2    156.9286   7.882257   0.007518   7.205699
Within Groups         219        11   19.90909
Total                 532.8571   13

• Conclusions:

Regression:

• Doing regression in Excel is very similar to using the other analysis tools. With regression, however, having the data in the right form is more important. First, all data should be entered in columns. Second, all independent variables should be next to each other (i.e., in a contiguous set of cells). Once the data are entered correctly, select "Regression" from the Tools/Data Analysis menu item in Excel. You will be presented with the dialog box shown below. In the Input Y Range, enter the cell range referring to the column containing the dependent variable. In the Input X Range, enter the range of cells containing all independent variables. This is why the X variables need to be next to each other. If your range of cells included a row of labels, check the label box.

I never check the Constant is Zero box. In some physical systems it only makes sense for the intercept to be 0, so we can force it to be so. In our examples that will never be the case. If you want a confidence interval for the β values other than a 95% confidence interval, check the Confidence Level box and enter a different confidence level. Next, indicate where you want the output to go. Finally, check the box next to “Residuals.” I leave all other boxes blank, because I don’t like the way that Excel does the rest of the residual analysis or the normal probability plot. Then click OK.

• Below is some sample output.
Regression Statistics

Multiple R            0.936248
R Square              0.87656
Adjusted R Square     0.874991
Standard Error        0.670277
Observations          240

Analysis of Variance

              df    Sum of Squares   Mean Square   F          Significance F
Regression    3     752.919          250.973       558.6225   7.1E-107
Residual      236   106.028          0.449271
Total         239   858.947

             Coefficients   Standard Error   t Statistic   P-value    Lower 95%   Upper 95%
Intercept    1.156832       0.229887         5.032169      9.54E-07   0.703939    1.609725
Day          -0.02521       0.022013         -1.14541      0.253183   -0.06858    0.018153
Hour         -0.00592       0.018919         -0.31297      0.754578   -0.04319    0.031351
Distance     1.754525       0.042988         40.8147       1E-109     1.669837    1.839213
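Several numbers in the regression output are simple functions of the others, which gives a quick way to check a printout. The Python sketch below recomputes R Square, Adjusted R Square, the Standard Error, the F statistic, and the t statistics from the sums of squares, degrees of freedom, coefficients, and standard errors shown in the sample output. The variable names are illustrative, and small differences from Excel's values come from the rounding of the printed inputs.

```python
import math

# Values copied from the sample regression output.
ss_regression, df_regression = 752.919, 3
ss_residual, df_residual = 106.028, 236
ss_total = 858.947
n_observations = 240

coefficients = {
    "Intercept": (1.156832, 0.229887),   # (coefficient, standard error)
    "Day": (-0.02521, 0.022013),
    "Hour": (-0.00592, 0.018919),
    "Distance": (1.754525, 0.042988),
}

# R Square: the share of the total sum of squares explained by the regression.
r_square = ss_regression / ss_total

# Adjusted R Square penalizes for the number of independent variables.
adjusted_r_square = 1 - (1 - r_square) * (n_observations - 1) / df_residual

# Mean squares are sums of squares divided by their degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# "Standard Error" in the Regression Statistics block is the square root
# of the residual mean square.
standard_error = math.sqrt(ms_residual)

# F compares the regression mean square with the residual mean square.
f_stat = ms_regression / ms_residual

# Each t statistic is the coefficient divided by its standard error.
t_stats = {name: coef / se for name, (coef, se) in coefficients.items()}
```

For instance, `t_stats["Distance"]` is about 40.8, matching the t Statistic column in the output.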