BIVARIATE CATEGORICAL DATA
Recall from section 2.1 how to represent a categorical variable numerically in a
table and graphically in a barplot. Suppose we have bivariate data, data from two variables,
which we'd like to analyze.
Tables
Suppose we are doing a technology survey of STAT 200 students. Specifically, who owns a cell
phone, and who owns a laptop? Fill in the data in this two-way contingency table.
                              Owns Laptop
                              Yes        No
Owns Cell Phone    Yes       ______     ______
                   No        ______     ______
This table displays the full distribution of data at our disposal. But if we shift our focus to
only one of the variables, independent of how the other behaves, that is known as a marginal
distribution. For a marginal distribution, we simply sum the values of the rows (or columns) for
the variable of interest.
                              Owns Laptop
                              Yes        No        Row total
Owns Cell Phone    Yes       ______     ______     ______   <- Marginal Distribution
                   No        ______     ______     ______      for Cell Phone
Column total                 ______     ______
                             Marginal Distribution for Laptop
At other times, we are interested in how a variable behaves given certain information regarding
another variable. This is known as the conditional distribution because we are studying a
variable conditioned on the behavior of another. Let’s look at the conditional distribution of
laptop ownership. That is, the distribution of laptops, conditioned on whether the subject owns a
cell phone.
                              Owns Laptop
                              Yes        No
Owns Cell Phone    Yes       ______     ______
                   No        ______     ______
Now you construct the conditional distribution of cell phone ownership, given whether the
subject owns a laptop.
                              Owns Laptop
                              Yes        No
Owns Cell Phone    Yes       ______     ______
                   No        ______     ______
Barplots
Since we are dealing with categorical variables, the barplot will be our most reliable form of
graphing. First, we choose one of the variables to be represented on the x-axis, and we list the
factors of that variable on the x-axis. Following that we have two options.
• Segmented Barplot – This plot has one main bar displaying the total frequency for each
factor on the x-axis. The frequencies of the second variable conditioned on the first are
segmented within that one bar.
• Side-by-side Barplot – This plot has one bar for each factor of the second variable above
each factor on the x-axis. The frequencies of the second variable conditioned on the first
are each drawn in their own bar and displayed side-by-side.
Construct a segmented barplot and a side-by-side barplot for the technology survey results.
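Once the table above is filled in, both plots come from the barplot function in R. Here is a sketch with made-up counts (the object name tech and all four numbers are placeholders, not survey results):
> tech = matrix(c(20, 4, 2, 1), nrow=2)
> dimnames(tech) = list(CellPhone=c('Yes', 'No'), Laptop=c('Yes', 'No'))
> barplot(tech, legend.text=TRUE)               # segmented barplot
> barplot(tech, legend.text=TRUE, beside=TRUE)  # side-by-side barplot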
Examples in R
Be sure to review 3.1.1 in the text regarding the rbind, cbind, and matrix functions to enter
your own table data. But here, we’ll use the UScereal (MASS) dataset which contains
information about cereals on a shelf in a grocery store. Load the dataset and attach the variables.
> library(MASS); data(UScereal); attach(UScereal)
Make a frequency table displaying the relationship between the mfr and shelf variables and
store it in the variable name x.
> x = table(mfr, shelf)
Are there any obvious differences between manufacturers? This may be difficult to tell in the
above table, so let’s use the margin.table function to look at the marginal distribution of the
row variable mfr. You can also see both marginal distributions using addmargins(x).
> margin.table(x, 1)
How would you look at the marginal distribution of the column variable shelf using
margin.table? Anything worth noting there?
> margin.table(x, 2)
We might be interested in making sure that the cereal companies are all getting a fair shake, and
that the brands are evenly distributed among the shelves. We must look at the conditional
distributions to help analyze that. Use the prop.table function.
> prop.table(x)
# What does this do?
Now, look at the conditional distribution of the shelf given the row variable, mfr. For each of
the given manufacturers, what is the proportion of its brands that appear on the lowest shelf?
> prop.table(x, 1)
#          G       K       N       P       Q       R
#        _____   _____   _____   _____   _____   _____
Now, look at the conditional distribution of the manufacturer variable given the column variable,
shelf number. What do you observe?
> prop.table(x, 2)
COMPARING INDEPENDENT SAMPLES
We will formalize our definition of independence later. But for now, you may think of two
samples as being independent if they are taken in such a way that knowing the distribution of
one sample doesn’t affect our knowledge of the other.
Though the samples are independent, it is still possible that they share similar features. For example,
consider the ACT scores at two distinct high schools. Do they have a similar center? Similar
spread? Similar shape? All are possible, but only answerable through statistical analysis.
In our recurring example, we’ll look at the dataset twins (UsingR) containing the IQ scores for
27 pairs of identical twins separated near birth. Are these samples independent?
Stem-and-leaf plots can be drawn side by side on the same “stem” to note general distinctions.
R does not have a built-in function for this.
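One add-on option, assuming the aplpack package is installed, is its back-to-back stem-and-leaf function:
> library(aplpack)   # assumes install.packages('aplpack') has been run
> stem.leaf.backback(Biological, Foster)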
Histograms are perhaps the most useful means of displaying univariate data, but it becomes
difficult to create and read histograms comparing two variables. Recall that the y-axis of a
histogram measures density. We can create density plots to trace the outline of a histogram.
What do you observe in the twins data?
> plot(density(Biological))
> lines(density(Foster), lty=2)
Boxplots also lend themselves well to displaying two variables graphically and side-by-side.
Does this graph support your observations from the density plots?
> boxplot(Biological, Foster, horizontal=T)
A quantile-quantile plot displays the quantiles of one dataset against the same quantiles of the
other as points. If the distributions have a similar shape, then the quantile values will be similar
at each proportion, and thus the qq-plot will be roughly a straight line. If the shapes are
different, the points will not be linear.
Before looking at the qq-plot for the twins, do you expect the points to generally fall into a line
or not?
> qqplot(Biological, Foster)
SCATTERPLOTS
A scatterplot is a graph used for showing the relationship between two variables. One variable is
assigned to the x-axis and the other is assigned to the y-axis. The convention is to call the x
variable the independent variable and the y variable the dependent variable. Usually the
independent variable is thought to influence the dependent variable. The corresponding function
in R is simply plot().
Example 1: Construct a scatter plot for the Exam 1 and Exam 2 scores of 7 students:
Student   Exam 1   Exam 2
1         55       62
2         60       50
3         70       65
4         80       70
5         85       95
6         90       80
7         100      90
[Scatterplot: Exam 1 scores (50–100) on the x-axis vs. Exam 2 scores on the y-axis.]
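To reproduce this plot in R (a sketch; the vector names exam1 and exam2 are ours):
> exam1 = c(55, 60, 70, 80, 85, 90, 100)
> exam2 = c(62, 50, 65, 70, 95, 80, 90)
> plot(exam1, exam2, xlab='Exam 1', ylab='Exam 2')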
There is a positive association between x and y when the pattern of the points slopes upward.
As x increases, y tends to increase.
There is a negative association between x and y when the pattern of the points slopes downward.
As x increases, y tends to decrease.
Is there a positive or negative association between exam 1 and exam 2 in the example above?
What about in the twins (UsingR) dataset?
> plot(Biological, Foster)
For the following pairs of variables state whether you think the association is positive, negative,
or neither.
• Height and weight
• Weight of a car and how many miles per gallon it gets
• Years of Education and Income
• Height and GPA among college students
• Temperature in Fahrenheit and temperature in Celsius.
• # right and # wrong on a test
CORRELATION
The correlation coefficient measures the strength of the linear association between two
variables X and Y. It measures how tightly points are clustered around a line, but does not
measure clustering around a curve.
The sample correlation coefficient is defined by
$$ r = \mathrm{cor}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} = \frac{1}{n-1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$
The corresponding function in R is cor( ). If any of your variables has missing values, use the
command cor(x, y, use='complete.obs').
The correlation coefficient is always between –1 and 1.
The closer the points hug a line with a positive slope, the closer r is to +1. The closer the points
hug a line with a negative slope the closer r is to -1. A correlation of 1 or -1 means you can
perfectly predict one variable knowing the other. If there is no association between X and Y then
the correlation coefficient is near 0, and the scatterplot has no ascertainable pattern.
Look back at the pairs of variables listed above. For which pair does r = 1? r = -1? r = 0?
The correlation coefficient has no units and is the same under location-scale transformations of X
and Y.
• Adding a constant to all x or y values does NOT change r.
• Multiplying all x or y values by a positive constant does NOT change r. What does
multiplying by a negative constant do?
• Interchanging all x and y values does not change r. (The correlation between height and
weight is the same as the correlation between weight and height)
• Changing units does NOT change r. (So correlation between height in inches and weight
in lbs. is the same as between height in meters and weight in kgs.)
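These facts are easy to check numerically; a quick sketch using two short made-up vectors:
> x = c(1, 2, 4, 2, 6); y = c(4, 5, 9, 5, 7)
> cor(x, y)        # baseline
> cor(x + 10, y)   # adding a constant: unchanged
> cor(3 * x, y)    # multiplying by a positive constant: unchanged
> cor(-3 * x, y)   # multiplying by a negative constant: compare the sign
> cor(y, x)        # interchanging x and y: unchanged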
Outlier Issue
Including outliers may ruin the correlation even though the rest of the points show a strong linear
relation. But outliers should only be thrown out for good reason. The correlation coefficient
should be used with caution when there are outliers.
Correlation is Not Causation
An association may be found between two variables, but that does not mean one causes the other.
• Shoe size and reading level are highly correlated among elementary school kids. Does
that mean that big feet help students learn to read?
• Surveys have shown that people who use artificial sweeteners tend to be heavier than
people who use sugar. Does that mean that sweeteners cause weight gain?
Example: Try to match the plot with the correlation coefficient to which it is nearest. The
choices for r are -1, -0.90, -0.50, 0, 0.50, 0.90, and 1.
[Eight scatterplots, labeled A) through H), showing various degrees of linear association.]
For some extra practice, go to http://www.stat.uiuc.edu/~stat100/cuwu/Games.html, and click on
the “Correlations” link to play the Guessing Correlations game.
Computing the Correlation Coefficient

1. Convert x-values and y-values to standard units: $z_x = \frac{x_i - \bar{x}}{s_x}$ and $z_y = \frac{y_i - \bar{y}}{s_y}$.
2. Multiply each $z_x$ value by its corresponding $z_y$ value.
3. The correlation coefficient is the "adjusted" average (divide by n − 1) of the products.
Calculate the correlation coefficient for scores on two quizzes for 5 students.
Quiz 1 (x)   Quiz 2 (y)   Zx       Zy       Zx * Zy
1            4            _____    _____    _____
2            5            _____    _____    _____
4            9            _____    _____    _____
2            5            _____    _____    _____
6            7            _____    _____    _____

$\bar{x} = 3$, $s_x = 2$        $\bar{y} = 6$, $s_y = 2$        Sum of products = _____

Correlation coefficient (r) = ________________
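After filling in the table by hand, you can check your answer with one line of R (the vectors are named by us):
> x = c(1, 2, 4, 2, 6); y = c(4, 5, 9, 5, 7)
> cor(x, y)   # should match your hand-computed r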
SIMPLE LINEAR REGRESSION
Recall the scatterplot from the twins (UsingR) dataset and note several important features.
[Scatterplot titled "IQ Scores for Separately Raised Identical Twins": Biological IQ on the x-axis and Foster IQ on the y-axis, both running from 70 to 130.]
• Two variables are associated with each observation/subject.
• There appears to be a rough trend or relationship between the two variables.
• This relationship is not exactly precise in that there exists substantial variation or scatter.
• We may want to summarize this relationship with a linear equation with slope and intercept terms.
• We may want to characterize the variability about this linear equation.
• We may want to know whether or not any apparent trend is statistically significant.
Linear regression is a statistical method of assessing these features. A simple linear
regression model is a summary of the relationship between a dependent variable (a.k.a.
response) Y and a single independent variable (a.k.a. predictor or covariate) X. Y is assumed to
be a random variable while, even if X is a random variable, we condition on X (that is, assume X
is fixed). Essentially, we are interested in knowing the behavior of Y given we know X.
Given X, the linear regression model assumption is simply that $Y = \beta_0 + \beta_1 X + \varepsilon$, where $\varepsilon$ is a
random variable with mean 0. This line is the population regression line, while $\beta_0$ and $\beta_1$ are the
population regression coefficients. When we estimate $\beta_0$ and $\beta_1$ based on a sample of X and Y
pairs, their estimates, $\hat{\beta}_0$ and $\hat{\beta}_1$, are called estimated regression coefficients or just regression
coefficients.
Once estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ of $\beta_0$ and $\beta_1$ have been computed, the predicted value of $y_i$ given $x_i$ is
obtained from the estimated regression line: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$, where $\hat{y}_i$ is the prediction of the true
value of $y_i$, for observation $i$, $i = 1, \ldots, n$.
Finding the Regression Line
There are several methods for finding a reliable regression line, but perhaps the most common is
the least squares method. This method chooses the line (and thus the coefficients) so that the
sum of the squared residuals is as small as possible. When this is true, the regression coefficients
are found by these equations:
$$ \hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = r \cdot \frac{s_y}{s_x}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}. $$
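These formulas are easy to verify against lm() in R; a sketch reusing the quiz scores from the correlation section:
> x = c(1, 2, 4, 2, 6); y = c(4, 5, 9, 5, 7)
> b1 = cor(x, y) * sd(y) / sd(x)
> b0 = mean(y) - b1 * mean(x)
> c(b0, b1)         # hand-built coefficients
> coef(lm(y ~ x))   # should agree with the line above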
Let’s look more closely at why we can use the correlation coefficient to estimate the y value for a
given x value. If there is a perfect correlation (that is, r = 1 or r = -1) then we can perfectly
predict y from x.
r = 1    If x goes up 1 SD from average, then y goes up 1 SD from average. All points lie
         on the SD line.
r = .8   If x increases 1 SD, then y increases only .8 SD on the average.
r = .5   If x increases 1 SD, then y increases only .5 SD on the average.
r = -.3  If x increases 1 SD, then y decreases only .3 SD on the average.
The regression line estimates the mean value for the dependent variable (y) corresponding to
each value of the independent variable (x).
A 1 SD increase in x means only an r SD increase in y. So, the slope of the regression line is $\hat{\beta}_1 = r \cdot \frac{s_y}{s_x}$.
The following points are always true for simple linear regression:
• The least squares regression line goes through the center of the point cloud.
• The regression line passes through the point $(\bar{x}, \bar{y})$.
• On average, the distance between $\hat{y}$ and $y$ is shorter than the distance between $\bar{y}$ and $y$.
Example: A large class took two exams with the following results:
AvgExam1 = 80, SDExam1 = 10
AvgExam2 = 70, SDExam2 = 15
r = .6
Find the regression equation for predicting Exam 2 score using Exam 1 score.
Residuals (Error)
The regression line estimates the average value of y for each value of x. But unless the
correlation is perfect, the actual y values differ from the predicted values. These differences are
called prediction errors or residuals.
Residual = Actual value – Predicted value
($e_i = \hat{\varepsilon}_i = y_i - \hat{y}_i$)
For any regression line, the average (and the sum) of the residuals is zero.
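You can confirm this fact on any fitted model in R (a sketch; the x and y below stand in for any paired data):
> x = c(1, 2, 4, 2, 6); y = c(4, 5, 9, 5, 7)   # any paired data will do
> fit = lm(y ~ x)
> sum(residuals(fit))    # zero, up to floating-point rounding
> mean(residuals(fit))   # likewise zero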
[Two example panels: a "Scatterplot with Regression Line" (x vs. y) and a "Plot of Residuals" (x vs. residual), each with axes running from -3 to 3.]
Calculating “spread” for regression
The standard deviation of the residuals, also called the residual standard error or Root Mean
Square Error (RMSE), is a measure of the typical spread of the data around the regression line.
Rather than finding all the residuals, then finding their variance, then finding their standard
deviation, it's much easier to use this formula:
$$ SD_{\text{residuals}} = s_e = \sqrt{1 - r^2} \cdot s_y $$
If r = ±1, we can perfectly predict y from x, so there is no prediction error.
If r = 0, then our best prediction is the average of y and the likely size of the prediction error is
just the SD of y.
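This identity can be checked in R on any dataset (a sketch, again reusing the quiz vectors):
> x = c(1, 2, 4, 2, 6); y = c(4, 5, 9, 5, 7)
> sd(residuals(lm(y ~ x)))       # SD of the residuals, computed directly
> sqrt(1 - cor(x, y)^2) * sd(y)  # same value from the formula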
Example: The HANES height-weight study on parents and children gives the following
summary statistics for men aged 18-24:
          Avg           SD
Height    70 inches     3 inches
Weight    162 pounds    30 pounds

Correlation: r = 0.5
Predicting weight from height
a) Find the regression line of interest.
b) What is the residual standard error for predicting weight from height?
c) Predict the weight for someone who is 73" tall.
d) This prediction is likely to be off by _________ lbs. or so.

Predicting height from weight
e) Find the regression line of interest.
f) What is the residual standard error for predicting height from weight?
g) Predict the height of someone who weighs 132 lbs.
h) This prediction is likely to be off by _______ inches or so.
i) What is the residual for someone who is 132 lbs. but only 65"?
Example in R
Let’s look at those identical twins again. Suppose we want to regress the variable Foster onto the
variable Biological for the purposes of prediction.
> library(UsingR)
> attach(twins)
> plot(Foster~Biological)    # previously we used plot(Biological, Foster)
> cor(Biological, Foster)
Do the plot and correlation coefficient support a simple linear regression?
The linear model function in R is simply lm(y~x) where y is your dependent variable and x is
your independent variable. What is the regression equation for Foster given Biological?
> lm(Foster~Biological)
We can store this regression, and see that there is actually much more in the model results. (The
text often stores linear models as res, which looks too much like "residual" and is thus misleading.)
What other things does the lm( ) function generate?
> model1 = lm(Foster~Biological)
> names(model1)
Remember that before we sell ourselves on this model, we want to look at the plots once more.
First, the original data scatterplot with the regression line.
> par(mfrow=c(1,2))
> plot(Foster~Biological, main='Biological vs Foster raised twins')
> abline(model1)
Second, the fitted and residual values. What two things are we looking for here? Does this plot
support the linear regression model?
> res = model1$residuals; fit = model1$fitted
> plot(res~fit, xlab='Fitted values', ylab='Residuals',
main='Fitted vs Residual values')
> abline(h=0, lty=2)
Now that we have the linear regression equation, maybe we’d like to try some prediction. Say
that a child raised by his Biological parents has an IQ of 128. What is the predicted IQ of his
identical twin raised by Foster parents? How much are we likely to be off by?
> predict(model1, data.frame(Biological=128))
> sd(res)
Example in R
Let’s look at another dataset. Load the data for kid.weights (UsingR), and then attach the
variables. Answer each question and write down any R commands you used.
1) Look at the scatterplot with height on the x-axis and weight on the y-axis. Compute the
correlation coefficient of height and weight. What does this exploratory analysis tell you about
the validity of a linear regression?
2) Find the simple linear regression equation of weight on height and save it as model.h.
3) The shortest child has a height of 12 inches and weight of 10 pounds. What is his predicted
weight? What do you think about the predicted value? What does it say about the regression
equation?
4) Look at the scatterplot with fitted values on the x-axis and residuals on the y-axis. Add a line
at y=0. What does this residual plot tell you about the validity of a linear regression?
5) Look at the scatterplot with age on the x-axis and weight on the y-axis. Find the simple linear
regression equation of weight on age and save it as model.a.
6) Compute the correlation coefficient of age and weight, and the residual standard error of
weight on age. Does age or height seem to be a better linear predictor of weight?
7) Look at the scatterplot with fitted values on the x-axis and residuals on the y-axis. Add a line
at y=0. What does this residual plot tell you about the validity of a linear regression?