STAT 350
Class: April 2, 2014
Due: April 7, 2014

Lab 8: Inference on two samples (19.5 pts. + 2 pts. BONUS)

Purposes:
1) Inference for 2-sample independent
2) Inference for 2-sample paired

Remember:
a) Please put your name, my name, course, section number (class time) and lab # on the front of the lab.
b) Label each part and put them in logical order.
c) ALWAYS include your R code and relevant output for each problem (DO NOT SPAM ME WITH OUTPUT!)

1) Inference for 2-sample independent

For two means, we will again use t.test (“proc ttest” in SAS). However, as when creating a boxplot, we need to indicate which value belongs to which of the two cases. In Example 7.14 (Data file: 7.wheat):

    Price     Month
    6.8250    July
    7.3025    July
    7.0275    July
    7.0825    July
    7.3000    July
    7.3325    September
    7.5575    September
    7.3125    September
    7.3600    September
    7.5550    September

Be aware that in SAS you must sort the data by the group variable first (Month in this example) in order for proc ttest to work correctly. Whichever software you use, be careful to determine which group is first and which is second, because the inference is always done as first - second.

The default null value (mu) is 0; therefore, you only need to specify it when it is not 0.

For pooled variance, use the ‘Pooled’ rows; for unpooled variance, use the ‘Satterthwaite’ rows. The section titled ‘Equality of Variances’ is a hypothesis test of whether the variances are the same.

For this situation, each of the samples needs to be normal. To create the plots, you need to separate the two data sets. The code to do that is provided in the script below. I did not include the code to generate the required plots.

R Learning script: (h1.R)

    prices <- read.table("wheat.txt", header = TRUE)
    prices
    # t.test is used for confidence intervals and hypothesis tests
    # conf.level = C = 1 - alpha
    # for the hypothesis test, mu is mu_0
    # the first argument is the formula: quantitative column ~ categorical column
    # the second argument is the name of the table
    # var.equal = FALSE (the variances are not equal; R calls the
    #   Satterthwaite approximation the Welch approximation)
    # paired = FALSE (2 sample independent)
    t.test(Price ~ Month, prices, conf.level = 0.95)
    # To create the histograms and QQ plots, you need to create the sets
    # of data individually. You can then create the plots as you have
    # done previously
    JulyPrice <- subset(prices, Month == "July")
    SeptPrice <- subset(prices, Month == "September")

R Learning output:

        Welch Two Sample t-test

    data:  Price by Month
    t = -3.0007, df = 6.603, p-value = 0.02136
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -0.56808314 -0.06391686
    sample estimates:
         mean in group July mean in group September
                     7.1075                  7.4235

The 95% confidence interval is (-0.56808314, -0.06391686).

For the hypothesis test with α = 0.05 (again from the 95% CI), df = 6.603, t-value = -3.0007, p-value = 0.02136.

Note: the numbers have the opposite sign from those in the example in the book because the book performs the test as September - July, and R performs the test as July - September because July is first alphabetically.

2) Inference for 2-sample paired

For paired data, we will again use t.test as before; however, we pass both variables and set paired = TRUE instead of passing a single variable. (In SAS, this corresponds to using the “paired” statement in place of the “var” statement in proc ttest.) The difference will be the first variable minus the second variable. In addition, we now have separate columns for each of the two variables instead of using a grouping variable.

The script h2.R was taken from Example 7.7 (Data file: 7.french). In addition, this example uses a one-tailed alternative hypothesis. When you are performing a directional hypothesis test, be sure to know which variable is which so the direction is appropriate.

For this situation, the difference between the samples needs to be normal.
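As a quick check of the "first variable minus second variable" rule, the following sketch uses made-up scores to show that a paired t.test gives the same result as a one-sample t.test on the difference vector. The vectors pre and post are hypothetical and are not the french.txt data. It is only an illustration under those assumptions, not the lab script.

```r
# Hypothetical scores, made up for illustration (NOT the french.txt data)
pre  <- c(30, 28, 35, 32, 27, 31, 29, 33)
post <- c(33, 29, 38, 31, 30, 35, 30, 36)

# Paired test: the differences are first argument minus second argument
paired_fit <- t.test(post, pre, paired = TRUE,
                     alternative = "greater", conf.level = 0.90)

# Equivalent one-sample test on the difference vector
d <- post - pre
one_sample_fit <- t.test(d, mu = 0,
                         alternative = "greater", conf.level = 0.90)

# Both calls should agree on the test statistic and the p-value
print(all.equal(unname(paired_fit$statistic), unname(one_sample_fit$statistic)))
print(all.equal(paired_fit$p.value, one_sample_fit$p.value))
```

Swapping the two arguments (pre, post) flips the sign of every difference, so the direction of a one-sided alternative would have to flip as well.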
I have provided the code to generate the difference vector. I did not include the code to generate the required plots.

R Learning code: (h2.R)

    language <- read.table("french.txt", header = TRUE)
    language
    # t.test is used for confidence intervals and hypothesis tests
    # conf.level = C = 1 - alpha
    # for the hypothesis test, mu is mu_0
    # var.equal = FALSE (the variances are not equal; R calls the
    #   Satterthwaite approximation the Welch approximation)
    # alternative = "greater" or "less" or "two.sided" (this is the
    #   appropriate alternative hypothesis)
    # paired = TRUE (2 sample paired)
    t.test(language$Posttest, language$Pretest, conf.level = 0.90,
           paired = TRUE, alternative = "greater")
    # the following creates the one-sample data. You will need to create the
    # histogram and QQ plot on this data set (script not included)
    normaltest <- language$Posttest - language$Pretest

R Learning output:

        Paired t-test

    data:  language$Posttest and language$Pretest
    t = 3.8649, df = 19, p-value = 0.0005216
    alternative hypothesis: true difference in means is greater than 0
    90 percent confidence interval:
     1.641153      Inf
    sample estimates:
    mean of the differences
                        2.5

Problems

All of these problems are for 2 samples; however, you will need to decide whether the samples are independent or paired. There should be only one script for each problem; that is, there should not be separate code for each of the parts. The correct answer to part a) is NOT the format of the data file or which section the problem is from.

Problem 1 (6.5 pts.) (7.44 Potential insurance fraud? Data Set: 7.FRAUD)

Insurance adjusters are concerned about the high estimates they are receiving from Jocko’s Garage. To see if the estimates are unreasonably high, each of 10 damaged cars was taken to Jocko’s and to a “trusted” garage and the estimates recorded. Here are the results:

a) Which procedure should you use (independent or paired)? Please explain your answer.
b) Examine each sample graphically (histogram and normal quantile plot), with special attention to outliers and skewness. Is use of a t procedure acceptable for these data?
c) Perform a hypothesis test to determine if there is a difference between the two garages at a significance level of 0.01. Be sure to perform the 7 steps.
d) Calculate and interpret the appropriate confidence interval.
e) Based on the answers to c) and d), is there a difference between the two garages? Why or why not?

Your submission should consist of one script for all of the parts, the plots in b), the appropriate output in parts c) and d), and the answers to all of the questions. In part d), you may either rewrite the confidence interval or just indicate where it is in the output.

Problem 2 (6.5 pts.) (7.93 Study habits. Data Set: 7.Studyhabits)

The Survey of Study Habits and Attitudes (SSHA) is a psychological test designed to measure the motivation, study habits, and attitudes toward learning of college students. These factors, along with ability, are important in explaining success in school. Scores on the SSHA range from 0 to 200. A selective private college gives the SSHA to an SRS of both male and female first-year students.

The data for the women are as follows:

Here are the scores of the men:

a) Which procedure should you use (independent or paired)? Please explain your answer.
b) Examine each sample graphically, with special attention to outliers and skewness. Is use of a t procedure acceptable for these data?
c) Most studies have found that the mean SSHA score for men is lower than the mean score in a comparable group of women. Perform the appropriate hypothesis test (7 steps) at a significance level of 0.1. (Hint: Please look at the answer key for Lab 6.)
d) Calculate and interpret the appropriate confidence bound for the mean difference between the SSHA scores of male and female first-year students at this college.
e) Based on the answers to c) and d), is the mean score for men lower than the mean score for women? Why or why not?

Your submission should consist of one script for all of the parts, the plots in b), the appropriate output in parts c) and d), and the answers to all of the questions. In part d), you may either rewrite the confidence interval or just indicate where it is in the output.

Problem 3 (6.5 pts.) (7.72 Sadness and spending. Data Set: 7.Sadness)

The “misery is not miserly” phenomenon refers to a sad person’s spending judgment going haywire. In a recent study, 31 young adults were given $10 and randomly assigned to either a sad or a neutral group. The participants in the sad group watched a video about the death of a boy’s mentor (from The Champ), and those in the neutral group watched a video on the Great Barrier Reef. After the video, each participant was offered the chance to trade $0.50 increments of the $10 for an insulated water bottle. Here are the data:

a) Which procedure should you use (independent or paired)? Please explain your answer.
b) Examine each group’s prices graphically. Is use of the t procedures appropriate for these data? Carefully explain your answer.
c) Perform the significance test at a significance level of 0.05 to determine if spending depends on whether the person is sad or not.
d) Calculate and interpret the appropriate confidence interval for the mean difference in purchase price between the two groups.
e) Based on the answers to c) and d), does spending depend on whether a person is sad or not? Why or why not?

Your submission should consist of one script for all of the parts, the plots in b), the appropriate output in parts c) and d), and the answers to all of the questions. In part d), you may either rewrite the confidence interval or just indicate where it is in the output.

Problem 4 BONUS (2 pts.)
Generate the code to calculate the power curve for a t distribution as described in Section 7.3 in the text. Generate a power curve for Example 7.14 in the book (Part 1 above).