R-Day3 - jgill

Transcription

R-Day3 - jgill
R Workshop: Day 3
Yun Ju Sung
7/15/2014
5. Regression Analysis
Regression analysis is fundamental
and forms a major part of statistical
analysis!
Linear regression analysis
• The regression model: y = β0 + β1x + ε
• The values of β0 and β1 are unknown and will be estimated in a
reasonable manner from the data
• The estimated regression line is yˆ  ˆ 0  ˆ1 x
(using "hats" to denote the estimates)
• For each data point xi we have
yˆ i  ˆ0  ˆ1 xi
(called the predicted value)
• The difference between the true and
predicted value is the residual
ei  yi  yˆ i
Example: linear regression with R
• The maximum heart rate of a person is often said to be related
to age by the equation: Max rate= 220 - Age
• Suppose this is to be empirically proven and 15 people of
varying ages are tested for their maximum heart rate.
• We use lm() to fit a linear model
Example: linear regression with R
• The result of the lm function can be stored: lm.result=lm(y~x)
• We can use summary to get regression coefficients (and more):
summary(lm.result)
• The result of the lm function is of class lm and so the plot and
summary commands have been adapted.
• For several generic functions (including print, plot, and
summary), the result depends on the class of object that is
given as argument.
Prediction for new values
• One primary use of regression analysis is to make predictions
for the response value for new values of the predictor.
• We can use a regression line yˆ  ˆ0  ˆ1 x to get the predicted
value for a new value of x.
• It is a little chunk to use the predict function!
• For example, to get a prediction for 50 and 60 year old
• We need a data frame with column names matching the
predictor variable, which can be done with the data.frame
function with name=values.
• Can you get the same answer with estimated coefficients?
Assignment: Question 10
For the data set on housing prices, homedata, investigate the
relationship between old assessed value and new.
a. Use old as the predictor variable. Does the data suggest a
linear relationship?
b. Are there any outliers? What may have caused these
outliers?
b. What is the predicted new assessed value for a $75,000
house in 1970?
Interacting with a scatterplot
• When looking at a scatterplot we see the trend of the data as
•
•
•
•
well as individual data points. If one of these data points stands
out, how can it be identified?
We can use identify function to identify points on a scatterplot
by their corresponding index with each mouse click.
A template for its usage is identify (x, y, labels=…,n=…)
The value n specifies the number of points to identify.
For example, identify(BUSH,BUCHANAN,n=2,labels=County)
can be used to identify two outliers in Question 11.
Assignment: Question 11
For the florida dataset, investigate the relationship between
Bush vs. Buchanan.
a. Fit a regression line using Bush as the predictor variable.
Does the data suggest a linear relationship?
b. Are there any outliers? Identify two outliers
c. Fit another regression line after excluding two outliers. What
is the predicted value for the number of votes Buchanan
would get in Miami-Dade county based on the number of
Bush votes?
The assumptions of the linear model
• The regression model is yi=β0+β1xi+εi
• In order to make statistical inference, we need to make
assumptions about the errors εi.
• The standard assumptions are that these errors are
independent, normals with mean 0 and common variance 2
• The validity of the model can be checked graphically
• The plot function can do this if we give it the result of the
regression, as plot(lm.result, which=1:4)
Checking the validity of the linear model
• Residuals vs. fitted: Look for spread around the line y = 0 and
no obvious trend.
• Normal Q-Q plot: The residuals are normal if this graph falls
close to a straight line.
• Scale-Location plot shows the square root of the standardized
residuals. The tallest points, are the largest residuals.
• Cook's distance plot identifies points which have a lot of
influence in the regression line.
Inference about the slope β1
• Once you are satisfied that the model fits the data, statistical
inferences can be made.
• The normalized value of estimator b1 for β1
has the t-distribution with n-2 degrees of freedom
(here we use b1 instead of ̂1)
• To test H0: β1 = a (the null hypothesis) against
HA: β1 ≠ a (the alternative hypothesis)
we compute the t statistic
and finds the p-value from the t-distribution.
Inference about the slope β1
• Let’s do a test to see if the slope of -1 is correct for our heart-
rate example. This corresponds to H0: β1 = -1 vs. HA: β1 ≠ -1
• The actual p-value is twice this as the problem is two-sided. It is
unlikely that for this data the slope is -1.
• R will automatically test for the assumption β1= 0, which means
no slope. See how the p-value is included in the output of the
summary command in the column Pr(>|t|).
• Can you compute this p-value by yourself?
Inference about the intercept β0
• A statistical test for b0 can be done using
that has
also a t-distribution with n-2 degrees of freedom
• Let's check if the data supports the intercept of 220. Formally,
we test H0: β0 = 220 vs. HA: β0 < 220.
• We need to compute the value of the test statistic and then
look up the one-sided p-value.
• We would reject the value of 220 based on this p-value as well.
Assignment: Question 12
The cost of a home depends on the number of bedrooms in the
house. Suppose the following data is recorded for homes in a
given town
Price (in
300 250 400 550 317 389 425 289 389 559
thousands)
No. of
3
3
4
5
4
3
6
3
4
5
bedrooms
Make a scatterplot, and fit the data with a regression line.
b. Test the hypothesis that an extra bedroom costs $60,000
against the alternative that it costs more.
a.
Assignment: Question 13
It is well known that the more beer you drink, the more your
blood alcohol level rises. Suppose we have the following data on
student beer consumption
Beers
BAL
5
2
9
8
3
7
3
5
3
5
0.10 0.03 0.19 0.12 0.04 0.095 0.07 0.06 0.02 0.05
Make a scatterplot and fit the data with a regression line.
b. Test the hypothesis that another beer raises your BAL by 0.02
percent against the alternative that it is less.
c. Test the hypothesis that the intercept is 0 with a two-sided
alternative.
a.
6. Multivariate Data
Graphics and other simple functions
to explore multivariate data, data
with many variables
Multivariate data
• Multivariate data consists of multiple data vectors considered
as a whole.
• There are many advantages to combining them into a single
data object. This makes it easier to save our work, is convenient
for many functions, and is much more organized.
• Multivariate data can be summarized and viewed in ways that
are similar to those for bivariate and univariate data.
• As there are many more possible relationships, we can look at
all the variables simultaneously or hold some variables
constant while we look at others.
Data frame
• Often in statistics, data is presented in a tabular format similar
to a spreadsheet. The columns are for different variables. Each
row corresponds to measurements on the same subject or
experimental unit.
• R uses a data frame to store rectangular grids of data.
• Each column is a data vector containing data for one of the
variables.
• The columns need not all be of the same type (e.g., numeric,
character, or logical).
Store multivariate data in data frames
• Data frames are created with the data.frame() function
• We can use the row.names() function to set the rows names:
List
• A list is a more general type of storage than a data frame.
• Think of a list as a collection of components. Each component
can be any R object, such as a vector, a data frame, a function,
or even another list.
• A list is a very flexible type of data object containing ordered
collections of objects (components)!
• Many of the functions in R, such as lm(), have return values
that are lists, although only selected portions may be displayed
when the return value is printed.
• Data frames must have variables of the same length, whereas
lists can have variables of different length.
Store multivariate data in list
• Lists are created with list()
> x=1:2 ; y=letters[1:2] ; z=1:3
> data.frame(x,y,z) # Can’t
> list(x,y,z)
[[1]]
[1] 1 2
[[2]]
[1] "a" "b"
[[3]]
[1] 1 2 3
> list(x=x, y=y, z=z)
$x
[1] 1 2
$y
[1] "a" "b"
$z
[1] 1 2 3
Accessing data in data frames
• To access as an array, use single brackets [row, column]
• To access as a list, use either a dollar sign ($) or double bracket
Sorting a data frame by one column
• The sort() function can be used to arrange the values of a data
vector in increasing or decreasing order.
• To sort a data frame by one of its columns, sort() will not work.
• We use order() to rearrange, or permute, the row indices so
that the new data frame is sorted as desired.
• Try:
> sort(mpg)
> mpg[order(mpg)]
> mtcars[order(mpg),] # sort by miles per gallon
> mtcars[order(mpg, decreasing=TRUE), ] # best mpg first
Assignment: Question 14
Use the data set mtcars:
a. Sort the data set by weight, heaviest first.
b. Which car gets the best mileage (largest mpg)? Which gets
the worst?
c. The cars in rows c(1:3, 8:14, 18:21, 26:28, 30:32) were
imported into the United States. Compare the variable mpg
for imported and domestic cars using a boxplot. Is there a
difference?
d. Make a scatterplot of weight, wt, versus miles per gallon,
mpg. Label the points according to the number of cylinders,
cyl. Describe any trends.
Assignment: Question 15
The data set u2 (in UsingR) contains the time in seconds for
albums released by the band U2 from 1980 to 1997. The data is
stored in a list.
a. Make a boxplot of the song lengths by album. Which album
has the most spread? Are the means all similar?
b. Use sapply() to find the mean song time for each album.
Which album has the shortest mean? Repeat with the
median. Are the results similar?
c. What are the three longest songs? The unlist() function will
turn a list into a vector. First unlist the song lengths, then
sort.
Could you use a data frame to store this data?
Scatterplot matrix
• R has a built-in function for creating scatterplots of all possible
pairs of variables.
• This scatterplot matrix can be made with the pairs() function
Scatterplot with colors
• We can plot multiple relationships on the same scatterplot
180
using different plot characters or colors to distinguish the
variables.
• Example: gestation period vs. birth weight in babies data
> gestation[gestation == 999]= NA # 999 for NA
> plot(gestation, wt, pch=smoke)
> table(smoke) # values of plot characters
120
60
80
100
wt
140
160
never
yes
until pregnant
longago
unknown
150
> legend(locator(1), legend=c("never","yes","until
pregnant","longago","unknown"), pch=c(0:3,9),)
200
250
gestation
300
350
180
Example: gestation vs. weight by smoke
160
never smoked
smokes now
120
100
80
60
wt
140
> f0=wt[smoke==0]~gestation[smoke==0]
> plot(f0,xlab="gestation",ylab="weight")
> abline(lm(f0))
> f1=wt[smoke==1]~gestation[smoke==1]
> points(f1,col=2,pch=3)
> abline(lm(f1),col=2)
> legend("topleft", legend=c("never smoked","smokes now"),
pch=c(1,3), lty=1, bty="n", col=1:2)
150
200
250
gestation
300
350
Assignment: Question 16
The data set UScereal (in MASS) lists facts about specific cereals sold
in a United States supermarket.
a. Is there a relationship between manufacturer (mfr) and vitamin
type (vitamins) by shelf location (shelf)? Do you expect one?
Why?
b. Look at the relationship between calories and sugars with a
scatterplot. Identify the outliers. Are these also fat-laden cereals?
c. Now look at the relationship between calories and sugars with a
scatterplot using different size points given by cex=2*sqrt (fat).
(This is called a bubble plot. The area of each bubble is
proportional to the value of fat.) Describe any additional trend
that is seen by adding the bubbles.
Any other expected relationships between the variables?
The lattice package
• Using various colors and plotting characters, we tried to show
•
•
•
•
whether the third variable, smoke, affected the relationship
between gestation time and birth weight.
A better solution would be to create a separate scatterplot for
each level of the third variable.
These graphs can be made using the lattice graphics package.
If the package is loaded (with library (lattice)), a description of
lattice graphics is available in the help system under ?Lattice.
Let's use the data set Cars93 to illustrate.
Histogram and boxplot using lattice
• Histograms
• Boxplots
20
6
40
60
20
80
8
rotary
40
60
80
6
8
rotary
3
4
5
100
80
60
Percent of Total
40
20
0
3
4
5
100
80
60
40
20
0
20
20
40
60
80
20
Max.Price
40
60
80
40
60
80
20
Max.Price
40
60
80
The lattice package
• The basic idea is that the graphic consists of a number of
panels. Each panel corresponds to some value(s) of a
conditioning variable.
• The lattice graphing functions are called using the modelformula interface. The formulas have the format
response ~ predictor|condition
• For univariate graphs such as histograms, the response
variable, the left side, is empty.
Scatterplot: MPG and fuel rank size
• The following will plot the relationship between MPG and tank
size. We expect that cars with better mileage can have smaller
tanks. This type of plot allows us to check if it is the same for all
types of cars.
Scatterplot: MPG and fuel rank size
10
Small
15
20
25
Sporty
Van
50
45
40
35
MPG.highway
30
25
20
Compact
Large
Midsize
50
45
40
35
30
25
20
10
15
20
25
10
Fuel.tank.capacity
15
20
25
Back to gestation vs weight by smoke
150 200 250 300 350
3
9
180
160
140
120
100
80
wt
60
0
1
2
180
160
140
120
100
80
60
150 200 250 300 350
150 200 250 300 350
gestation
Assignment: Question 17
The data set survey (in MASS) contains survey information on
students.
a. Make scatterplots using xyplot() of the writing-hand size
(Wr.Hnd) versus nonwriting-hand size (NW.Hnd), broken
down by gender (Sex). Is there a trend?
b. Make boxplots using bwplot() of Pulse for the four levels of
Smoke broken down by gender Sex. Are there any trends?
Differences?
Do you expect any other linear relationships among the
variables?
Summary
Data, univariate and bivariate data
• One use of R is as a calculator, to evaluate arithmetic
expressions. Calculations can be carried out in parallel, across
all elements of a vector at once.
• The data is stored in R as a vector. Most functions can be
applied to vectors without loops!
• Exploratory data analysis aims to allow the data to speak for
themselves. It gives multiple views of the data that may
provide useful insights.
• We covered commonly-used functions for exploring data with
graphs. Histograms, density plots, stem-and-leaf displays, and
boxplots are useful for examining the distributions of individual
variables. Scatterplots are useful for looking at relationships
two at a time.
Bivariate and regression analysis
• In exploring the relationships in bivariate data, the correlation
•
•
•
•
•
coefficient provides a numeric measure.
The linear regression analysis can provide the richer and more
insightful framework.
There may be many more possibilities than regression lines. It
is useful to fit a smooth curve, to see whether it suggests
systematic departure from a line.
Model assumptions, such as normality, independence, and
constancy of the variance, should be checked.
In particular, plot residuals against fitted values and draw a plot
that checks homogeneity of variance.
Use of the function plot(), with an lm model, gives both these
plots.
Some technical notes
• Help functions include help() (help page of a known function)
•
•
•
•
and apropos() (search for function names that include a
specified character string).
The function c() (concatenate) joins vector elements into
vectors. It may be used for numeric, logical and character
vectors.
Use is.na() to identify elements that are NAs.
Commonly used generic functions include print(), plot(), and
summary(). For such functions, the result depends on the class
of object that is given as argument.
For simple forms of scatterplot, use plot() or the lattice
function xyplot().
More technical notes
• Important R data structures are vectors, factors, lists, and data
•
•
•
•
frames. Vectors may be of mode numeric, or logical, or character.
Data frames use a list structure to group columns, which must all
have the same length, together into a single R object. The different
columns may be any mix of logical, numeric, character, or factor.
Contrast data frames with matrices. All matrix elements have the
same mode. A matrix is stored as one long vector that is formatted to
appear in a row by column layout.
The R system has extensive abilities for inputting data from
rectangular files (see help(read.table)), from spreadsheets, and from
a variety of statistical package formats.
Use attach() or with() (temporary attachment) to give access to the
columns of a data frame, without the need to name the data frame
whenever a column is accessed.
Things not covered
• Random data: uniform, normal, binomial, exponential,…
• Confidence interval estimation: proportion test, t test
• Hypothesis testing
• Two sample tests
• Chi square tests: goodness of fit, independence,…
• Analysis of variance
Additional reading
• To learn more material at a level of this lecture:
• simpleR – using R for Introductory Statistics by John Verzani
• Using R for Data Analysis and Graphics by John Maindonald
• For those who need more sophisticated and challenging material:
• Data Analysis and Graphics Using R: An Example-Based
Approach by John Maindonald and W. Hohn Braun is my
favoriate!
• The manual An Introduction to R by the R core
development team is excellent for people familiar
with statistics.
Left-over Material
May be included in the next year?
Prediction for new values
• Interpretation: Changes in the predictor variable correspond to
changes in the response variable in a linear manner: a unit
change in the predictor corresponds to a change in the
response.
7. Multiple Linear Regression
Assignment: Question 8
The data set babies contains much information about babies and
their mothers for 1236 observations.
Find the correlation coefficient (both Pearson and Spearman)
between age and weight. Repeat for the relationship between
height and weight. Make scatter plots of each pair and see if
your answer makes sense.
Functions in R
• Easy to create your own functions in R
• As tasks become complex, it is a good idea to organize code
into functions that perform defined tasks
• In R, it is good practice to give default values to function
arguments
• Template for function:
Example for defining functions
• A function to simulate the t statistic from a sample of normals with mean 10
and standard deviation 5
• To make mean and standard deviation variable:
Control statements in R
Similar to c, R has a rich set of program control
options
• if … else …
• for, repeat, while loops
• next, break statements
For loops
• A for loop allows you to loop over values in a vector or list of
numbers. It is a powerful programming feature.
• To add up the numbers in the vector x
• better done with sum
Conditional expressions
• To do different things based on the value of a variable, use
conditional expressions
• For example, to get the absolute value function,
• To fix a bug, try x[x<0]<- -x[x<0] (or the built in function abs).
DAGG (page 148)