BIVARIATE CATEGORICAL DATA

Recall from section 2.1 that we learned how to represent a categorical variable numerically in a table and graphically in a barplot. Suppose we have bivariate data, data from two variables, which we'd like to analyze.

Tables

Suppose we are doing a technology survey of STAT 200 students. Specifically, who owns a cell phone, and who owns a laptop? Fill in the data in this two-way contingency table.

                               Owns Laptop
                               Yes      No
  Owns Cell Phone   Yes       ____     ____
                    No        ____     ____

This table demonstrates the full distribution of data at our disposal. But if we shift our focus to only one of the variables, independent of how the other behaves, that is known as a marginal distribution. For a marginal distribution, we simply sum the values of the rows (or columns) for the variable of interest.

                               Owns Laptop
                               Yes      No
  Owns Cell Phone   Yes       ____     ____     ____  <- marginal distribution
                    No        ____     ____     ____     for Cell Phone
                              ____     ____
                         marginal distribution for Laptop

At other times, we are interested in how a variable behaves given certain information regarding another variable. This is known as the conditional distribution because we are studying a variable conditioned on the behavior of another. Let's look at the conditional distribution of laptop ownership. That is, the distribution of laptops, conditioned on whether the subject owns a cell phone.

                               Owns Laptop
                               Yes      No
  Owns Cell Phone   Yes       ____     ____
                    No        ____     ____

Now you construct the conditional distribution of cell phone ownership, given whether the subject owns a laptop.

                               Owns Laptop
                               Yes      No
  Owns Cell Phone   Yes       ____     ____
                    No        ____     ____

Barplots

Since we are dealing with categorical variables, the barplot will be our most reliable form of graphing. First, we choose one of the variables to be represented on the x-axis, and we list the factors of that variable on the x-axis. Following that, we have two options.

• Segmented barplot – This plot has one main bar displaying the total frequency for each factor on the x-axis. The frequencies of the second variable conditioned on the first are segmented in that one bar.
• Side-by-side barplot – This plot has one bar for each factor of the second variable above each factor on the x-axis. The frequencies of the second variable conditioned on the first are each drawn in their own bar and displayed side by side.

Construct a segmented barplot and a side-by-side barplot for the technology survey results. (A sketch of the R commands follows.)
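If you'd like to check your drawings in R, here is a minimal sketch. The counts in tech are hypothetical placeholders for whatever your survey produces; barplot() puts the column variable (Laptop) on the x-axis.

> # hypothetical counts: rows = Owns Cell Phone (Yes/No), columns = Owns Laptop (Yes/No)
> tech = matrix(c(20, 3, 4, 1), 2, 2, byrow=TRUE, dimnames=list(c('Yes','No'), c('Yes','No')))
> barplot(tech, legend.text=TRUE)               # segmented: rows stacked within each column's bar
> barplot(tech, beside=TRUE, legend.text=TRUE)  # side-by-side: one bar per row within each column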
Examples in R

Be sure to review 3.1.1 in the text regarding the rbind, cbind, and matrix functions to enter your own table data. But here, we'll use the UScereal (MASS) dataset, which contains information about cereals on a shelf in a grocery store. Load the dataset and attach the variables.

>

Make a frequency table displaying the relationship between the mfr and shelf variables and store it in the variable name x.

>

Are there any obvious differences between manufacturers? This may be difficult to tell in the above table, so let's use the margin.table function to look at the marginal distribution of the row variable mfr. You can also see both marginal distributions using addmargins(x).

> margin.table(x, 1)

How would you look at the marginal distribution of the column variable shelf using margin.table? Anything worth noting there?

>

We might be interested in making sure that the cereal companies are all getting a fair shake, and that the brands are evenly distributed among the shelves. We must look at the conditional distributions to help analyze that. Use the prop.table function.

> prop.table(x)    # What does this do?

Now look at the conditional distribution of shelf given the row variable, mfr. For each of the given manufacturers, what is the proportion of its brands that appear on the lowest shelf?

> prop.table(x, 1)
G: ____   K: ____   N: ____   P: ____   Q: ____   R: ____

Now look at the conditional distribution of the manufacturer variable given the column variable, shelf number. What do you observe?

> prop.table(x, 2)

COMPARING INDEPENDENT SAMPLES

We will formalize our definition of independence later. But for now, you may think of two samples as being independent if they are taken in such a way that knowing the distribution of one sample doesn't affect our knowledge of the other. Though the samples are independent, it is still possible that they share similar features. For example, consider the ACT scores at two distinct high schools. Do they have a similar center? Similar spread? Similar shape? All are possible, but only answerable through statistical analysis.

In our recurring example, we'll look at the dataset twins (UsingR), containing the IQ scores for 27 pairs of identical twins separated near birth. Are these samples independent?

Stem-and-leaf plots can be drawn side by side on the same "stem" to note general distinctions. R does not have a built-in function for this.

Histograms are perhaps the most useful means of displaying univariate data, but it becomes difficult to create and read histograms comparing two variables. Recall that the y-axis of a histogram measures density. We can create density plots to outline the behavior of a histogram. What do you observe in the twins data?

> plot(density(Biological))
> lines(density(Foster), lty=2)

Boxplots also lend themselves well to displaying two variables graphically and side by side. Does this graph support your observations from the density plots?

> boxplot(Biological, Foster, horizontal=T)

A quantile-quantile plot displays the quantiles of one dataset against the same quantiles of the other as points. If the distributions have a similar shape, then the quantile values will be similar at each proportion, and thus the qq-plot will be roughly a straight line. If the shapes are different, the points will not be linear. Before looking at the qq-plot for the twins, do you expect the points to generally fall into a line or not?

> qqplot(Biological, Foster)

SCATTERPLOTS

A scatterplot is a graph used for showing the relationship between two variables. One variable is assigned to the x-axis and the other is assigned to the y-axis. The convention is to call the x variable the independent variable and the y variable the dependent variable. Usually the independent variable is thought to influence the dependent variable. The corresponding function in R is simply plot().

Example 1: Construct a scatterplot for the Exam 1 and Exam 2 scores of 7 students:

  Student   1    2    3    4    5    6    7
  Exam 1   55   60   70   80   85   90  100
  Exam 2   62   50   65   70   95   80   90

[Scatterplot of Exam 2 (y-axis, 50 to 100) against Exam 1 (x-axis, 50 to 100).]

There is a positive association between x and y when the pattern of the points slopes upward. As x increases, y tends to increase. There is a negative association between x and y when the pattern of the points slopes downward. As x increases, y tends to decrease.
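You can reproduce Example 1 in R with a quick sketch; the vectors simply restate the table above.

> exam1 = c(55, 60, 70, 80, 85, 90, 100)
> exam2 = c(62, 50, 65, 70, 95, 80, 90)
> plot(exam1, exam2, xlab='Exam 1', ylab='Exam 2')    # look for an upward or downward trend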
Is there a positive or negative association between Exam 1 and Exam 2 in the example above? What about in the twins (UsingR) dataset?

> plot(Biological, Foster)

For the following pairs of variables, state whether you think the association is positive, negative, or neither.

• Height and weight
• Weight of a car and how many miles per gallon it gets
• Years of education and income
• Height and GPA among college students
• Temperature in Fahrenheit and temperature in Celsius
• Number right and number wrong on a test

CORRELATION

The correlation coefficient measures the strength of the linear association between two variables X and Y. It measures how tightly points are clustered around a line, but does not measure clustering around a curve. The sample correlation coefficient is defined by

$$ r = \mathrm{cor}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} = \frac{1}{n-1} \sum \frac{x_i - \bar{x}}{s_x} \cdot \frac{y_i - \bar{y}}{s_y}. $$

The corresponding function in R is cor(). If any of your variables has missing values, use the command cor(x, y, use='complete.obs').

The correlation coefficient is always between -1 and 1. The closer the points hug a line with a positive slope, the closer r is to +1. The closer the points hug a line with a negative slope, the closer r is to -1. A correlation of 1 or -1 means you can perfectly predict one variable knowing the other. If there is no association between X and Y, then the correlation coefficient is near 0, and the scatterplot has no ascertainable pattern. Look at the pairs of variables listed above. For which pair does r = 1? r = -1? r = 0?

The correlation coefficient has no units and is the same under location-scale transforms of X and Y.

• Adding a constant to all x or y values does NOT change r.
• Multiplying all x or y values by a positive constant does NOT change r. What does multiplying by a negative constant do?
• Interchanging all x and y values does not change r. (The correlation between height and weight is the same as the correlation between weight and height.)
• Changing units does NOT change r. (So the correlation between height in inches and weight in lbs. is the same as between height in meters and weight in kgs.)

Outlier Issue

Including outliers may ruin the correlation even though the rest of the points show a strong linear relation. But outliers should only be thrown out for good reason. The correlation coefficient should be used with caution when there are outliers.

Correlation Is Not Causation

An association may be found between two variables, but that does not mean one causes the other.

• Shoe size and reading level are highly correlated among elementary school kids. Does that mean that big feet help students learn to read?
• Surveys have shown that people who use artificial sweeteners tend to be heavier than people who use sugar. Does that mean that sweeteners cause weight gain?

Example: Try to match each plot with the correlation coefficient to which it is nearest. The choices for r are -1, -0.90, -0.50, 0, 0.50, 0.90, and 1.

[Eight scatterplots, labeled A) through H), appear here.]

For some extra practice, go to http://www.stat.uiuc.edu/~stat100/cuwu/Games.html, and click on the "Correlations" link to play the Guessing Correlations game.

Computing the Correlation Coefficient

1. Convert x-values and y-values to standard units: $z_x = (x_i - \bar{x})/s_x$, and similarly for y.
2. Multiply each zx value by each corresponding zy value.
3. The correlation coefficient is the "adjusted" average (divide by n - 1) of the products.

Calculate the correlation coefficient for scores on two quizzes for 5 students. (A sketch checking the arithmetic in R follows.)

  Quiz 1 (x)   Quiz 2 (y)    Zx      Zy      Zx * Zy
      1            4        ____    ____      ____
      2            5        ____    ____      ____
      4            9        ____    ____      ____
      2            5        ____    ____      ____
      6            7        ____    ____      ____

  x̄ = 3, s_x = 2          ȳ = 6, s_y = 2
  Sum of products = ____________
  Correlation coefficient (r) = ____________
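A minimal R sketch of this standard-units recipe, using the quiz scores above; cor() serves as a check on your hand computation.

> quiz1 = c(1, 2, 4, 2, 6); quiz2 = c(4, 5, 9, 5, 7)
> zx = (quiz1 - mean(quiz1)) / sd(quiz1)    # step 1: standard units
> zy = (quiz2 - mean(quiz2)) / sd(quiz2)
> sum(zx * zy) / (length(quiz1) - 1)        # steps 2 and 3: adjusted average of products
> cor(quiz1, quiz2)                         # built-in check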
SIMPLE LINEAR REGRESSION

Recall the scatterplot from the twins (UsingR) dataset and note several important features.

["IQ Scores for Separately Raised Identical Twins": scatterplot of Foster (y-axis) against Biological (x-axis), both from 70 to 130.]

• Two variables are associated with each observation/subject.
• There appears to be a rough trend or relationship between the two variables.
• This relationship is not exactly precise in that there exists substantial variation or scatter.
• We may want to summarize this relationship with a linear equation with slope and intercept terms.
• We may want to characterize the variability about this linear equation.
• We may want to know whether or not any apparent trend is statistically significant.

Linear regression is a statistical method of assessing these features. A simple linear regression model is a summary of the relationship between a dependent variable (a.k.a. response) Y and a single independent variable (a.k.a. predictor or covariate) X. Y is assumed to be a random variable while, even if X is a random variable, we condition on X (that is, assume X is fixed). Essentially, we are interested in knowing the behavior of Y given we know X.

Given X, the linear regression model assumption is simply that $Y = \beta_0 + \beta_1 X + \varepsilon$, where $\varepsilon$ is a random variable with mean 0. This line is the population regression line, while $\beta_0$ and $\beta_1$ are the population regression coefficients. When we estimate $\beta_0$ and $\beta_1$ based on a sample of X and Y pairs, their estimates, $\hat{\beta}_0$ and $\hat{\beta}_1$, are called estimated regression coefficients or just regression coefficients.

Once estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ of $\beta_0$ and $\beta_1$ have been computed, the predicted value of $y_i$ given $x_i$ is obtained from the estimated regression line: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$, where $\hat{y}_i$ is the prediction of the true value of $y_i$, for observation i, i = 1, ..., n.

Finding the Regression Line

There are several methods of finding a reliable regression line, but perhaps the most common is the least squares method. This method chooses the line (and thus the coefficients) so that the sum of the squared residuals is as small as possible. When this is true, the regression coefficients are found by these equations:

$$ \hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = r \cdot \frac{s_y}{s_x}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}. $$

Let's look more closely at why we can use the correlation coefficient to estimate the y value for a given x value. If there is a perfect correlation (that is, r = 1 or r = -1), then we can perfectly predict y from x.

  r = 1    If x goes up 1 SD from average, then y goes up 1 SD from average. All points lie on the SD line.
  r = .8   If x increases 1 SD, then y increases only .8 SD on the average.
  r = .5   If x increases 1 SD, then y increases only .5 SD on the average.
  r = -.3  If x increases 1 SD, then y decreases only .3 SD on the average.

The regression line estimates the mean value of the dependent variable (y) corresponding to each value of the independent variable (x). A 1 SD increase in x means only an r SD increase in y. So the slope of the regression line is $\hat{\beta}_1 = r \cdot \frac{s_y}{s_x}$.

The following points are always true for simple linear regression:

• The least squares regression line goes through the center of the point cloud.
• The regression line passes through the point $(\bar{x}, \bar{y})$.
• On average, the distance between $\hat{y}$ and y is shorter than the distance between $\bar{y}$ and y.

Example: A large class took two exams with the following results:

  AvgExam1 = 80,  AvgExam2 = 70,  SDExam1 = 10,  SDExam2 = 15,  r = .6

Find the regression equation for predicting Exam 2 score using Exam 1 score. (A sketch verifying the slope formula in R follows.)
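To see that the two slope expressions agree, here is a quick sketch reusing the quiz vectors quiz1 and quiz2 entered earlier; lm() is R's least squares fitter.

> b1 = cor(quiz1, quiz2) * sd(quiz2) / sd(quiz1)    # slope: r * sy / sx
> b0 = mean(quiz2) - b1 * mean(quiz1)               # intercept: ybar - slope * xbar
> c(b0, b1)
> lm(quiz2 ~ quiz1)    # least squares returns the same coefficients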
Residuals (Error)

The regression line estimates the average value of y for each value of x. But unless the correlation is perfect, the actual y values differ from the predicted values. These differences are called prediction errors or residuals:

  Residual = Actual value - Predicted value   ($e_i = \hat{\varepsilon}_i = y_i - \hat{y}_i$).

For any regression line, the average (and the sum) of the residuals is zero.

[Two panels: "Scatterplot with Regression Line" (y against x) and "Plot of Residuals" (residuals against x).]

Calculating "Spread" for Regression

The standard deviation of the residuals, also called the residual standard error or Root Mean Square Error (RMSE), is a measure of the typical spread of the data around the regression line. Rather than finding all the residuals, then finding their variance, then finding their standard deviation, it's much easier to use this formula:

$$ SD_{\mathrm{residuals}} = s_e = \sqrt{1 - r^2} \cdot s_y. $$

If r = ±1, we can perfectly predict y from x, so there is no prediction error. If r = 0, then our best prediction is the average of y, and the likely size of the prediction error is just the SD of y.

Example: The HANES height-weight study on parents and children gives the following summary statistics for men aged 18-24:

            Avg           SD
  Height    70 inches     3 inches
  Weight    162 pounds    30 pounds
  Correlation: r = 0.5

Predicting weight from height
a) Find the regression line of interest.
b) What is the residual standard error for predicting weight from height?
c) Predict the weight for someone who is 73" tall.
d) This prediction is likely to be off by ________ lbs. or so.

Predicting height from weight
e) Find the regression line of interest.
f) What is the residual standard error for predicting height from weight?
g) Predict the height of someone who weighs 132 lbs.
i) This prediction is likely to be off by ________ inches or so.
j) What is the residual for someone who is 132 lbs. but only 65"?

Example in R

Let's look at those identical twins again. Suppose we want to regress the variable Foster onto the variable Biological for the purposes of prediction.

> library(UsingR)
> attach(twins)
> plot(Foster~Biological)    # previously we used plot(Biological, Foster)
> cor(Biological, Foster)

Do the plot and correlation coefficient support a simple linear regression?

The linear model function in R is simply lm(y~x), where y is your dependent variable and x is your independent variable. What is the regression equation for Foster given Biological?

> lm(Foster~Biological)

We can store this regression, and see that there is actually much more in the model results. (The text often stores linear models as res, which is too much like "residual" and thus deceiving.) What other things does the lm() function generate?

> model1 = lm(Foster~Biological)
> names(model1)

Remember that before we sell ourselves on this model, we want to look at the plots once more. First, the original data scatterplot with the regression line.

> par(mfrow=c(1,2))
> plot(Foster~Biological, main='Biological vs Foster raised twins')
> abline(model1)

Second, the fitted and residual values. What two things are we looking for here? Does this plot support the linear regression model?

> res = model1$residuals; fit = model1$fitted
> plot(res~fit, xlab='Fitted values', ylab='Residuals', main='Fitted vs Residual values')
> abline(h=0, lty=2)

Now that we have the linear regression equation, maybe we'd like to try some prediction. Say that a child raised by his Biological parents has an IQ of 128. What is the predicted IQ of his identical twin raised by Foster parents? How much are we likely to be off by?

> predict(model1, data.frame(Biological=128))
> sd(res)
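As a side check (a sketch, not from the text): summary() of a stored model reports the residual standard error directly, and you can compare it against the $\sqrt{1 - r^2} \cdot s_y$ shortcut. The two differ slightly because summary() divides by n - 2 rather than n - 1.

> summary(model1)$sigma                             # residual standard error reported by R
> sqrt(1 - cor(Biological, Foster)^2) * sd(Foster)  # the sqrt(1 - r^2) * sy shortcut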
Example in R

Let's look at another dataset. Load the data for kid.weights (UsingR), and then attach the variables. Answer each question and write down any R commands you used. (A starter sketch for loading the data follows the questions.)

1) Look at the scatterplot with height on the x-axis and weight on the y-axis. Compute the correlation coefficient of height and weight. What does this exploratory analysis tell you about the validity of a linear regression?
2) Find the simple linear regression equation of weight on height and save it as model.h.
3) The shortest child has a height of 12 inches and a weight of 10 pounds. What is his predicted weight? What do you think about the predicted value? What does it say about the regression equation?
4) Look at the scatterplot with fitted values on the x-axis and residuals on the y-axis. Add a line at y=0. What does this residual plot tell you about the validity of a linear regression?
5) Look at the scatterplot with age on the x-axis and weight on the y-axis. Find the simple linear regression equation of weight on age and save it as model.a.
6) Compute the correlation coefficient of age and weight, and the residual standard error of weight on age. Does age or height seem to be a better linear predictor of weight?
7) Look at the scatterplot with fitted values on the x-axis and residuals on the y-axis. Add a line at y=0. What does this residual plot tell you about the validity of a linear regression?
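To get started, a minimal loading sketch; the analysis itself is left to you.

> library(UsingR)       # assuming the UsingR package is installed
> attach(kid.weights)
> names(kid.weights)    # the variables include height, weight, and age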