STAT 350
class: April 2, 2014
due: April 7, 2014
Lab 8: Inference on two samples (19.5 pts. + 2 pts. BONUS)
Purposes: 1) Inference for 2 – sample independent
2) Inference for 2 – sample paired
Remember:
a) Please put your name, my name, course, section number (class time) and lab # on the front
of the lab.
b) Label each part and put them in logical order.
c) ALWAYS include your R code and relevant output for each problem (DO NOT SPAM ME
WITH OUTPUT!)
1) Inference for 2 – sample independent
For two means, we will again use t.test. However, as in creating a boxplot, we need to
indicate which value belongs to which of the two groups. In Example 7.14 (data file: 7.wheat):
Price     Month
6.8250    July
7.3025    July
7.0275    July
7.0825    July
7.3000    July
7.3325    September
7.5575    September
7.3125    September
7.3600    September
7.5550    September
Be aware that, by default, R orders the groups of the grouping variable (Month in this example)
alphabetically, so be careful to determine which group is first and which is second. R always
does the inference as first – second. The default value of mu (the hypothesized difference in
H0) is 0; therefore mu is only required if the hypothesized difference ≠ 0.
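For pooled variance, use var.equal = TRUE; for unpooled variance, use var.equal = FALSE (the
default), which SAS labels ‘Satterthwaite’ and R labels ‘Welch’. The ‘Equality of Variances’
section of the SAS output, a hypothesis test of whether the variances are the same, does not
appear in the t.test output.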
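As a minimal sketch of the pooled versus unpooled choice, the following types in the ten wheat prices listed above (rather than reading wheat.txt) and runs t.test both ways. The object names welch and pooled are chosen here for illustration only.

```r
# Wheat prices from Example 7.14, typed in directly for illustration
prices <- data.frame(
  Price = c(6.8250, 7.3025, 7.0275, 7.0825, 7.3000,
            7.3325, 7.5575, 7.3125, 7.3600, 7.5550),
  Month = rep(c("July", "September"), each = 5)
)

# Unpooled (Welch/Satterthwaite) variance: var.equal = FALSE is the default
welch <- t.test(Price ~ Month, data = prices, var.equal = FALSE)

# Pooled variance: var.equal = TRUE
pooled <- t.test(Price ~ Month, data = prices, var.equal = TRUE)

# Compare the degrees of freedom and the confidence intervals
welch$parameter
pooled$parameter
welch$conf.int
pooled$conf.int
```

Because both groups have five observations, the two t statistics coincide here; only the degrees of freedom differ, and with them the p-value and the width of the interval.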
For this situation, each of the samples needs to be normal. To create the plots, you need to
separate the data into two data sets. The code to do that is provided in the script below. I did
not include the code to generate the required plots.
Stat 350
Lab 8 R
R Learning script: (h1.R)
prices <- read.table("wheat.txt", header = TRUE)
prices
# t.test is used for confidence intervals and hypothesis tests
# conf.level = C = 1 - alpha
# mu = mu_0 for the hypothesis test (default is 0)
# the first argument is a formula: quantitative column ~ categorical column
# the second argument is the name of the table (data frame)
# var.equal = FALSE (the variances are not assumed equal; R calls the
#   Satterthwaite approximation the Welch approximation)
# paired = FALSE (2 sample independent)
t.test(Price ~ Month, data = prices, conf.level = 0.95)
# To create the histograms and QQ plots, you need to create the sets
# of data individually. You can then create the plots as you have
# done previously
JulyPrice <- subset(prices, Month == "July")
SeptPrice <- subset(prices, Month == "September")
R Learning output:
Welch Two Sample t-test
data: Price by Month
t = -3.0007, df = 6.603, p-value = 0.02136
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.56808314 -0.06391686
sample estimates:
mean in group July mean in group September
            7.1075                  7.4235
The 95% confidence interval is (-0.56808314, -0.06391686).
For the hypothesis test with α = 0.05 (again from the 95% CI),
df = 6.603, t-value = -3.0007, p-value = 0.02136.
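The reported values can be reproduced by hand. The sketch below assumes the standard unpooled (Welch/Satterthwaite) formulas and types the prices in directly; the variable names are illustrative and not part of h1.R.

```r
july <- c(6.8250, 7.3025, 7.0275, 7.0825, 7.3000)
sept <- c(7.3325, 7.5575, 7.3125, 7.3600, 7.5550)

# Squared standard error contributed by each sample: s^2 / n
v1 <- var(july) / length(july)
v2 <- var(sept) / length(sept)

# Standard error of the difference and the t statistic (first - second, mu_0 = 0)
se     <- sqrt(v1 + v2)
t_stat <- (mean(july) - mean(sept)) / se

# Satterthwaite (Welch) degrees of freedom
df <- (v1 + v2)^2 /
  (v1^2 / (length(july) - 1) + v2^2 / (length(sept) - 1))

# Two-sided p-value and 95% confidence interval
p_val <- 2 * pt(-abs(t_stat), df)
ci    <- (mean(july) - mean(sept)) + c(-1, 1) * qt(0.975, df) * se

round(c(t = t_stat, df = df, p = p_val), 4)
ci
```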
Note: the signs are opposite those in the example in the book because the book performs
the test as September - July, whereas R performs the test as July - September because July is
first alphabetically.
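If the September - July direction is preferred, one option is to set the factor levels explicitly before calling t.test. This is only a sketch, with the prices typed in directly; it relies on t.test subtracting the second level from the first, and the name flipped is illustrative.

```r
prices <- data.frame(
  Price = c(6.8250, 7.3025, 7.0275, 7.0825, 7.3000,
            7.3325, 7.5575, 7.3125, 7.3600, 7.5550),
  Month = rep(c("July", "September"), each = 5)
)

# Put September first so that the difference is September - July
prices$Month <- factor(prices$Month, levels = c("September", "July"))
levels(prices$Month)

flipped <- t.test(Price ~ Month, data = prices, conf.level = 0.95)
flipped$statistic   # same magnitude as before, opposite sign
flipped$conf.int    # endpoints change sign and swap places
```

For a one-sided alternative, the choice of which level comes first also determines whether "greater" or "less" is the appropriate setting.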
2) Inference for 2 – sample paired
For paired data, we will again use t.test like we did before; however, we pass both variables
and set paired = TRUE instead of using a formula with a grouping variable. The difference
will be the first variable in the call minus the second variable. In addition, we now
have separate columns for each of the two variables instead of using a grouping variable. The
file h2.R was taken from Example 7.7 (data file: 7.french). In addition, this example uses a
one-tailed alternative hypothesis. When you are performing a directional hypothesis test, be
sure to know which variable is which so the direction is appropriate.
For this situation, the difference between the samples needs to be normal. I have provided the
code to generate the difference vector. I did not include the code to generate the required plots.
R Learning code: (h2.R)
language <- read.table("french.txt", header = TRUE)
language
# t.test is used for confidence intervals and hypothesis tests
# conf.level = C = 1 - alpha
# mu = mu_0 for the hypothesis test (default is 0)
# var.equal = FALSE (the variances are not assumed equal; R calls the
#   Satterthwaite approximation the Welch approximation; this argument
#   has no effect when paired = TRUE)
# alternative = "greater" or "less" or "two.sided" (this is the
#   appropriate alternative hypothesis)
# paired = TRUE (2 sample paired)
t.test(language$Posttest, language$Pretest, conf.level = 0.90, paired = TRUE,
       alternative = "greater")
# the following creates the one sample data. You will need to create the
#   histogram and QQ plot on this data set (script not included)
normaltest <- language$Posttest - language$Pretest
R Learning output:
Paired t-test
data: language$Posttest and language$Pretest
t = 3.8649, df = 19, p-value = 0.0005216
alternative hypothesis: true difference in means is greater than 0
90 percent confidence interval:
1.641153      Inf
sample estimates:
mean of the differences
2.5
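A paired t-test is the same calculation as a one-sample t-test on the differences, which is why only the differences need to be checked for normality. The sketch below illustrates this with made-up scores; these are NOT the values in french.txt, and the names pre, post, and d are hypothetical.

```r
# Hypothetical scores for illustration only (not the 7.french data)
pre  <- c(32, 31, 29, 10, 30, 33, 22, 25)
post <- c(34, 31, 35, 16, 33, 36, 24, 28)

# Paired test: the difference is first argument - second argument (post - pre)
paired_res <- t.test(post, pre, paired = TRUE,
                     alternative = "greater", conf.level = 0.90)

# The same test written as a one-sample test on the differences
d <- post - pre
one_sample <- t.test(d, mu = 0, alternative = "greater", conf.level = 0.90)

paired_res$statistic
one_sample$statistic
paired_res$conf.int   # one-sided: a lower bound, with upper limit Inf
```

With alternative = "greater", the interval is a lower confidence bound, which matches the form of the output above (a finite lower limit and Inf).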
Problems
All of these problems are for 2 samples; however, you will need to decide whether the samples
are independent or paired. There should be only one script for each problem; that is, there
should not be separate code for each of the parts. The correct answer to part a) is NOT the
format of the data file or which section the problem is from.
Problem 1 (6.5 pts.) (7.44 Potential insurance fraud? Data Set: 7.FRAUD) Insurance
adjusters are concerned about the high estimates they are receiving from Jocko’s Garage. To
see if the estimates are unreasonably high, each of 10 damaged cars was taken to Jocko’s and
to a “trusted” garage and the estimates recorded. Here are the results:
a) Which procedure should you use (independent or paired)? Please explain your answer.
b) Examine each sample graphically, with special attention to outliers and skewness (histogram
and normal quantile plot). Is use of a t procedure acceptable for these data?
c) Perform a hypothesis test to determine if there is a difference between the two garages at a
significance level of 0.01. Be sure to perform the 7 steps.
d) Calculate and interpret the appropriate confidence interval.
e) Based on the answers to c) and d), is there a difference between the two garages? Why or
why not?
Your submission should consist of one script for all of the parts, the plots in b) and the appropriate
output in parts c) and d), and the answers to all of the questions. In part d), you may either
rewrite the confidence interval or just indicate where it is in the output.
Problem 2 (6.5 pts.) (7.93 Study habits. Data Set: 7.Studyhabits) The Survey of Study
Habits and Attitudes (SSHA) is a psychological test designed to measure the motivation, study
habits, and attitudes toward learning of college students. These factors, along with ability, are
important in explaining success in school. Scores on the SSHA range from 0 to 200. A selective
private college gives the SSHA to an SRS of both male and female first-year students. The data
for the women are as follows:
Here are the scores of the men:
a) Which procedure should you use (independent or paired)? Please explain your answer.
b) Examine each sample graphically, with special attention to outliers and skewness. Is use
of a t procedure acceptable for these data?
c) Most studies have found that the mean SSHA score for men is lower than the mean score in
a comparable group of women. Perform the appropriate hypothesis test (7 steps) at a
significance level of 0.1. (Hint: Please look at the answer key for Lab 6.)
d) Calculate and interpret the appropriate confidence bound for the mean difference between
the SSHA scores of male and female first-year students at this college.
e) Based on the answers to c) and d), is the mean score for men lower than the mean score for
women? Why or why not?
Your submission should consist of one script for all of the parts, the plots in b) and the appropriate
output in parts c) and d), and the answers to all of the questions. In part d), you may either
rewrite the confidence interval or just indicate where it is in the output.
Problem 3 (6.5 pts.) (7.72 Sadness and spending. Data Set: 7.Sadness) The “misery is not
miserly” phenomenon refers to a sad person’s spending judgment going haywire. In a recent
study, 31 young adults were given $10 and randomly assigned to either a sad or a neutral
group. The participants in the sad group watched a video about the death of a boy’s mentor
(from The Champ), and those in the neutral group watched a video on the Great Barrier Reef.
After the video, each participant was offered the chance to trade $0.50 increments of the $10 for
an insulated water bottle. Here are the data:
a) Which procedure should you use (independent or paired)? Please explain your answer.
b) Examine each group’s prices graphically. Is use of the t procedures appropriate for these
data? Carefully explain your answer.
c) Perform the significance test at a significance level of 0.05 to determine if the spending is
dependent on whether the person is sad or not.
d) Calculate and interpret the appropriate confidence interval for the mean difference in
purchase price between the two groups.
e) Based on the answers to c) and d), does spending depend on whether a person is sad or
not? Why or why not?
Your submission should consist of one script for all of the parts, the plots in b) and the appropriate
output in parts c) and d), and the answers to all of the questions. In part d), you may either
rewrite the confidence interval or just indicate where it is in the output.
Problem 4 BONUS (2 pts.)
Generate the code to calculate the power curve for a t distribution as described in Section 7.3 in
the text. Generate a power curve for the example 7.14 in the book (Part 1 above).