ANOVA and Regression Brian Healy, PhD
Objectives
ANOVA
– Multiple comparisons
Introduction to regression
– Relationship to correlation/t-test

Comments from reviews
Please fill them out because I read them
More examples and not just MS
More depth on technical details/statistical theory/equations
– First time ever!!
– I have made slides from more in-depth courses available online so that you have access to formulas for the t-test, ANOVA, etc.
Talks too fast for non-native speakers

Review
Types of data
p-value
Steps for hypothesis test
– How do we set up a null hypothesis?
Choosing the right test
– Continuous outcome variable/dichotomous explanatory variable: two-sample t-test

Steps for hypothesis testing
1) State null hypothesis
2) State type of data for explanatory and outcome variable
3) Determine appropriate statistical test
4) State summary statistics
5) Calculate p-value (stat package)
6) Decide whether to reject or not reject the null hypothesis
   • NEVER accept null
7) Write conclusion

Example
In the previous class, two groups were compared on a continuous outcome
What if we have more than two groups?
Ex. A recent study compared the intensity of structures on MRI in normal controls, benign MS patients and secondary progressive MS patients
Question: Is there any difference among these groups?
Two approaches
Compare each group to each other group using a t-test
– Problem with multiple comparisons
Complete a global comparison to see if there is any difference
– Analysis of variance (ANOVA)
– Good first step even if pairwise comparisons are eventually completed

Types of analysis-independent samples
Outcome          Explanatory    Analysis
Continuous       Dichotomous    t-test, Wilcoxon test
Continuous       Categorical    ANOVA, linear regression
Continuous       Continuous     Correlation, linear regression
Dichotomous      Dichotomous    Chi-square test, logistic regression
Dichotomous      Continuous     Logistic regression
Time to event    Dichotomous    Log-rank test

Global test-ANOVA
As a first step, we can compare across all groups at once
The null hypothesis for ANOVA is that the means in all of the groups are equal
ANOVA compares the within group variance and the between group variance
– If the patients within a group are very alike and the groups are very different, the group means are likely different

Hypothesis test
1) H0: mean_normal = mean_BMS = mean_SPMS
2) Outcome variable: continuous
   Explanatory variable: categorical
3) Test: ANOVA
4) mean_normal = 0.41; mean_BMS = 0.34; mean_SPMS = 0.30
5) Results: p=0.011
6) Reject null hypothesis
7) Conclusion: At least one of the groups is significantly different from the others

Technical aside
Our F-statistic is the ratio of the between group variance and the within group variance

$$F = \frac{s^2_{between}}{s^2_{within}}$$

$$s^2_{between} = \frac{\sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})^2}{k - 1}, \qquad s^2_{within} = \frac{(n_1 - 1)s_1^2 + \dots + (n_k - 1)s_k^2}{(n_1 - 1) + \dots + (n_k - 1)}$$

Here k is the number of groups, and n_i, \bar{x}_i and s_i^2 are the sample size, mean and variance of group i; \bar{x} is the overall mean
This ratio of variances has a known distribution (F-distribution)
If our calculated F-statistic is high, the between group variance is higher than the within group variance, meaning the differences between the groups are not likely due to chance
Therefore, the probability of the observed result or something more extreme will be low (low p-value)

[Figure: the F-distribution under the null. The small shaded region is the part of the distribution that is equal to or more extreme than the observed value. The p-value!!!]
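The F-statistic from the technical aside can be sketched in a few lines of Python. This is only an illustration: the function name `f_statistic` and the three small groups are made up for the example and are not the MRI data from the study. In practice a stat package would also report the p-value from the F-distribution.

```python
from statistics import mean, variance


def f_statistic(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)                        # number of groups
    n_total = sum(len(g) for g in groups)  # total number of observations
    grand_mean = mean([x for g in groups for x in g])

    # Between group variance: spread of the group means around the overall mean
    s2_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups) / (k - 1)

    # Within group variance: pooled sample variance of the groups
    s2_within = sum((len(g) - 1) * variance(g) for g in groups) / (n_total - k)

    return s2_between / s2_within, k - 1, n_total - k


# Made-up measurements for three groups (NOT the study data)
example_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_value, df_between, df_within = f_statistic(example_groups)
print(f_value, df_between, df_within)
```

With these made-up values the group means are far apart relative to the spread inside each group, so F is large (about 13 on 2 and 6 degrees of freedom).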
Now what
The question often becomes which groups are different
Possible comparisons
– All pairs
– All groups to a specific control
– Pre-specified comparisons
If we do many tests, we should account for multiple comparisons

Type I error
Type I error is when you reject the null hypothesis even though it is true ($\alpha = P(\text{reject } H_0 \mid H_0 \text{ is true})$)
We accept making this error 5% of the time
If we run a large experiment with 100 tests and the null hypothesis was true in each case, how many times would we expect to reject the null?

Multiple comparisons
For this problem, three comparisons
– NC vs. BMS; NC vs. SPMS; BMS vs. SPMS
If we complete each test at the 0.05 level, what is the chance that we make a type I error?
– P(reject at least 1 | H0 is true) is no longer $\alpha$
– If the three tests are independent, P(reject at least 1 | H0 is true) = 1 - P(fail to reject all three | H0 is true) = $1 - 0.95^3 = 0.143$
Inflated type I error rate
Can correct the p-value for each test to maintain the experiment-wide type I error

Bonferroni correction
The Bonferroni correction multiplies all p-values by the number of comparisons completed
– In our experiment, there were 3 comparisons, so we multiply by 3
– Any p-value that remains less than 0.05 is significant
The Bonferroni correction is conservative (it is more difficult to obtain a significant result than it should be), but it is an extremely easy way to account for multiple comparisons.
– Can be a very harsh correction with many tests

Other corrections
All pairwise comparisons
– Tukey's test
All groups to a control
– Dunnett's test
MANY others
False discovery rate

Example
For our three-group comparison, we compare each pair and get the following results from Tukey's test
Groups          Mean diff    p-value    Significant
NC vs. BMS      0.075        0.10
NC vs. SPMS     0.114        0.012      *
BMS vs. SPMS    0.039        0.60

Questions to ask yourself
What is the null hypothesis?
We would like to test the null hypothesis at the 0.05 level
If the null hypothesis is well defined prior to the experiment, the correction for multiple comparisons, if necessary, will be clear
Hypothesis generating vs. hypothesis testing

Conclusions
If you are doing a multiple group comparison, always specify before the experiment which comparisons are of interest if possible
If the null hypothesis is that all the groups are the same, test the global null using ANOVA
Complete appropriate additional comparisons with corrections if necessary
No single right answer for every situation

Types of analysis-independent samples
Outcome          Explanatory    Analysis
Continuous       Dichotomous    t-test, Wilcoxon test
Continuous       Categorical    ANOVA, linear regression
Continuous       Continuous     Correlation, linear regression
Dichotomous      Dichotomous    Chi-square test, logistic regression
Dichotomous      Continuous     Logistic regression
Time to event    Dichotomous    Log-rank test

Correlation
Is there a linear relationship between IL-10 expression and IL-6 expression?
The best graphical display for these data is a scatter plot

Correlation
Definition: the degree to which two continuous variables are linearly related
– Positive correlation: as one variable goes up, the other goes up (positive slope)
– Negative correlation: as one variable goes up, the other goes down (negative slope)
Correlation (r) ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation)
A correlation of 0 means that there is no linear relationship between the two variables

[Figure: four scatter plots. Panels: positive correlation; negative correlation; no correlation; no correlation (quadratic)]

Hypothesis test
1) H0: correlation between IL-10 expression and IL-6 expression = 0
2) Outcome variable: IL-6 expression, continuous
   Explanatory variable: IL-10 expression, continuous
3) Test: correlation
4) Summary statistic: correlation = 0.51
5) Results: p=0.011
6) Reject null hypothesis
7) Conclusion: A statistically significant correlation was observed between the two variables

Technical aside-correlation
The formal definition of the correlation is given by:

$$Corr(x, y) = \frac{Cov(x, y)}{\sqrt{Var(x)\,Var(y)}}$$

Note that this is a dimensionless quantity
This equation shows that if the covariance between the two variables is as large as the variances of the two variables allow, we have perfect correlation because all of the variability in x and y is explained by how the two variables change together

How can we estimate the correlation?
The most common estimator of the correlation is Pearson's correlation coefficient, given by:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

The hypothesis test based on this estimate assumes that both x and y are normally distributed.
Since we use the mean in the calculation, the estimate is sensitive to outliers.
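Pearson's correlation coefficient can be computed directly from the formula above. The sketch below uses a hypothetical helper, `pearson_r`, and made-up numbers rather than the IL-10/IL-6 measurements. The second data set is a quadratic relationship, which gives a correlation of zero even though y is completely determined by x.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    x_bar, y_bar = mean(x), mean(y)
    # Numerator: how x and y vary together
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Denominator: how much x and y each vary on their own
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


# Made-up data with a positive linear trend
x_linear = [1.0, 2.0, 3.0, 4.0, 5.0]
y_linear = [2.0, 4.0, 5.0, 4.0, 5.0]
r_linear = pearson_r(x_linear, y_linear)

# Made-up data with a purely quadratic relationship (y = x squared)
x_quad = [-2.0, -1.0, 0.0, 1.0, 2.0]
y_quad = [xi ** 2 for xi in x_quad]
r_quad = pearson_r(x_quad, y_quad)

print(round(r_linear, 3), round(r_quad, 3))
```

The quadratic case is a reminder that r measures only the linear part of a relationship, which is why the scatter plot should always be inspected first.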
Distribution of the test statistic
The standard error of the sample correlation coefficient is given by

$$\widehat{se}(r) = \sqrt{\frac{1 - r^2}{n - 2}}$$

The resulting distribution of the test statistic is a t-distribution with n-2 degrees of freedom, where n is the number of patients (not the number of measurements)

$$t = \frac{r - 0}{\sqrt{\frac{1 - r^2}{n - 2}}} = r \sqrt{\frac{n - 2}{1 - r^2}}$$

Regression-Everything in one place
All analyses we have done to this point can be completed using regression!!!

Quick math review
As you remember, the equation of a line is y=mx+b
For every one unit increase in x, there is an m unit increase in y
b is the value of y when x is equal to zero

Line
[Figure: the line y = 1.5x + 4]

Picture
Does there seem to be a linear relationship in the data?
Is the data perfectly linear?
Could we fit a line to this data?
[Figure: scatter plot of the data]

How do we find the best line?
Linear regression tries to find the best line (curve) to fit the data
Let's look at three candidate lines
Which do you think is the best?
What is a way to determine the best line to use?

What is linear regression?
The method of finding the best line (curve) is least squares, which minimizes the sum of the squared vertical distances from the points to the line
The equation of the line is y=1.5x + 4
[Figure: scatter plot of the data with the fitted line y = 1.5x + 4]

Example
For our investigation of the relationship between IL-10 and IL-6, we can set up a regression equation

$$IL6_i = b_0 + b_1 * IL10_i + e_i$$

b0 is the expression of IL-6 when IL-10=0 (intercept)
b1 is the change in IL-6 for every 1 unit increase in IL-10 (slope)
ei is the residual from the line

The final regression equation is

$$\widehat{IL6} = 26.4 + 0.63 * IL10$$

The coefficients mean
– the estimate of the mean expression of IL-6 for a patient with IL-10 expression=0 is 26.4 (b0)
– an increase of one unit in IL-10 expression is associated with an estimated increase of 0.63 in the mean expression of IL-6 (b1)

Tough question
In our correlation hypothesis test, we wanted to know if there was an association between the two measures
If there was no relationship between IL-10 and IL-6 in our system, what would happen to our regression equation?
– No effect means that the change in IL-6 is not related to the change in IL-10
– b1=0
Is b1 significantly different from zero?

Hypothesis test
1) H0: no relationship between IL-6 expression and IL-10 expression, b1=0
2) Outcome variable: IL-6, continuous
   Explanatory variable: IL-10, continuous
3) Test: linear regression
4) Summary statistic: b1 = 0.63
5) Results: p=0.011
6) Reject null hypothesis
7) Conclusion: A significant linear association was observed between the two variables

Wait a second!!
Let's check something
– p-value from correlation analysis = 0.011
– p-value from regression analysis = 0.011
– They are the same!!
Regression leads to the same conclusion as the correlation analysis
Other similarities as well from the models

Technical aside-Estimates of regression coefficients
Once we have solved the least squares equation, we obtain estimates for the b's, which we refer to as $\hat{b}_0$, $\hat{b}_1$

$$\hat{b}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{b}_0 = \bar{y} - \hat{b}_1 \bar{x}$$

To test if this estimate is significantly different from 0, we use the following equation, with b1 = 0 under the null hypothesis:

$$t = \frac{\hat{b}_1 - b_1}{\widehat{se}(\hat{b}_1)}$$

Assumptions of linear regression
Linearity
– Linear relationship between outcome and predictors
– $E(Y \mid X = x) = b_0 + b_1 x_1 + b_2 x_2^2$ is still a linear regression equation because each of the b's is to the first power
Normality of the residuals
– The residuals, ei, are normally distributed, $N(0, \sigma^2)$
Homoscedasticity of the residuals
– The residuals, ei, have the same variance
Independence
– All of the data points are independent
– Correlated data points can be taken into account using multivariate and longitudinal data methods

Linear regression with dichotomous predictor
Linear regression can also be used for dichotomous predictors, like sex
Last class we compared relapsing MS patients to progressive MS patients
To do this, we use an indicator variable, which equals 1 for relapsing and 0 for progressive. The resulting regression equation for expression is

$$ex_i = b_0 + b_1 * R_i + e_i$$

Interpretation of model
The meanings of the coefficients in this case are
– b0 is the estimate of the mean expression when R=0, in the progressive group
– b0 + b1 is the estimate of the mean expression when R=1, in the relapsing group
– b1 is the estimate of the mean difference in expression between the two groups (relapsing minus progressive)
The difference between the two groups is b1
If there was no difference between the groups, what would b1 equal?
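The interpretation of b0 and b1 for an indicator variable can be checked numerically with the least squares formulas from the technical aside. The sketch below is illustrative only: `least_squares` is a hypothetical helper and the expression values are made up, not the data from the previous class.

```python
from statistics import mean


def least_squares(x, y):
    """Return (b0, b1) for the simple linear regression y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Indicator variable R: 1 = relapsing, 0 = progressive (made-up patients)
relapsing = [1, 1, 1, 0, 0, 0]
expression = [12.0, 15.0, 18.0, 8.0, 10.0, 9.0]

b0, b1 = least_squares(relapsing, expression)

# Group means computed directly, for comparison with the coefficients
mean_progressive = mean([e for r, e in zip(relapsing, expression) if r == 0])
mean_relapsing = mean([e for r, e in zip(relapsing, expression) if r == 1])

print(b0, mean_progressive)                   # intercept vs. mean of the R=0 group
print(b1, mean_relapsing - mean_progressive)  # slope vs. difference in means
```

The intercept matches the mean of the progressive group and the slope matches the difference in group means, which is why this regression tests the same null hypothesis as a two-sample t-test.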
[Figure: expression by group, annotated with mean in progressive group = b0 and difference between groups = b1]

Hypothesis test
1) Null hypothesis: mean_progressive = mean_relapsing (b1=0)
2) Explanatory: group membership, dichotomous
   Outcome: cytokine production, continuous
3) Test: linear regression
4) b1 = 6.87
5) p-value = 0.199
6) Fail to reject null hypothesis
7) Conclusion: The difference between the groups is not statistically significant

T-test
As hopefully you remember, you could have tested this same null hypothesis using a two-sample t-test
Very similar result to previous class
If we had assumed equal variance for our t-test, we would have gotten the same result!!!
ANOVA results can also be obtained from regression by using more than one indicator

Multiple regression
A large advantage of regression is the ability to include multiple predictors of an outcome in one analysis
A multiple regression equation looks just like a simple regression equation.

$$Y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n + e$$

Example
Brain parenchymal fraction (BPF) is a measure of disease severity in MS
We would like to know if gender has an effect on BPF in MS patients
We also know that BPF declines with age in MS patients
Is there an effect of sex on BPF if we control for age?

[Figure: BPF by sex. Blue=males; Red=females]
[Figure: BPF versus age. Blue=males; Red=females]

Is age a potential confounder?
We know that age has an effect on BPF from previous research
We also know that male patients have a different disease course than female patients, so the age at time of sampling may also be related to sex
[Diagram: relationships among Age, Sex and BPF]

Model
The multiple linear regression model includes a term for both age and sex

$$BPF_i = b_0 + b_1 * gender_i + b_2 * age_i + e_i$$

What are the values gender_i takes on?
– gender_i=0 if the patient is female
– gender_i=1 if the patient is male

Expression
Females:
– BPF_i = b0 + b2*age_i + e_i
Males:
– BPF_i = (b0 + b1) + b2*age_i + e_i
What is different about the equations?
– Intercept
What is the same?
– Slope
This model allows an effect of gender on the intercept, but not on the change with age

Interpretation of coefficients
The meaning of each coefficient
– b0: the average BPF when age is 0 and the patient is female
– b1: the average difference in BPF between males and females, HOLDING AGE CONSTANT
– b2: the average change in BPF for a one unit increase in age, HOLDING GENDER CONSTANT
Note that the interpretation of each coefficient requires mention of the other variables in the model

Estimated coefficients
Here is the estimated regression equation

$$\widehat{BPF}_i = 0.942 + 0.017 * sex_i - 0.0026 * age_i$$

The average difference between males and females is 0.017 holding age constant
For every one unit increase in age, the mean BPF decreases 0.0026 units holding sex constant
Are either of these effects statistically significant?
– What is the null hypothesis?

Hypothesis test
1) H0: No effect of sex, controlling for age (b1=0)
2) Continuous outcome, dichotomous predictor
3) Linear regression controlling for age
4) Summary statistic: b1 = 0.017
5) p-value = 0.37
6) Since the p-value is more than 0.05, we fail to reject the null hypothesis
7) We conclude that there is no significant association between sex and BPF controlling for age

Hypothesis test
1) H0: No effect of age, controlling for sex (b2=0)
2) Continuous outcome, continuous predictor
3) Linear regression controlling for sex
4) Summary statistic: b2 = -0.0026
5) p-value = 0.004
6) Since the p-value is less than 0.05, we reject the null hypothesis
7) We conclude that there is a significant association between age and BPF controlling for sex

[Regression output, annotated: estimated effect of sex; estimated effect of age; p-value for sex; p-value for age]
[Figure: BPF versus age]

Conclusions
Although there was a marginally significant association of sex and BPF, this association was not significant after controlling for age
The significant association between age and BPF remained statistically significant after controlling for sex

What we learned (hopefully)
ANOVA
Correlation
Basics of regression
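Worked example: multiple comparisons
The type I error arithmetic from the multiple comparisons slides can be reproduced with a short sketch. The function names are hypothetical, the calculation assumes the tests are independent, and the unadjusted p-values are made up for illustration (they are not the Tukey results from the example). Capping adjusted p-values at 1 is a common convention and is an assumption here, not something stated in the slides.

```python
def familywise_error(alpha, n_tests):
    """P(at least one type I error) across n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons (capped at 1)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Three pairwise comparisons, each tested at the 0.05 level
fwe_three = familywise_error(0.05, 3)

# Expected number of false rejections in 100 tests when every null is true
expected_false_rejections = 100 * 0.05

# Made-up unadjusted p-values for three comparisons
raw_p = [0.04, 0.004, 0.2]
adjusted_p = bonferroni(raw_p)
significant = [p < 0.05 for p in adjusted_p]

print(round(fwe_three, 3), expected_false_rejections)
print(adjusted_p, significant)
```

In this made-up case the first comparison is below 0.05 before the correction but not after it, which illustrates why Bonferroni is described as conservative.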