Data Analysis and Surveying 101: Basic research methods and biostatistics
Data Analysis and Surveying 101: Basic research methods and biostatistics as they apply to the
Theresa Jackson Hughes, MPH
American College Health Association
December 2006

What we will cover today
Research Methods
  • Sampling Frame and Sampling
  • Generalizability
  • Bias
  • Reliability and Validity
  • Levels of measurement
Biostatistics
  • Statistical significance
  • Other key terms
  • Appropriate statistical tests
  • Fun examples from the Spring 2005 dataset!

Get excited! It’s data time!!!

Research Methods
“To do successful research, you don't need to know everything, you just need to know of one thing that isn't known.”
  • Arthur Schawlow
“That's the nature of research - you don't know what in hell you're doing.”
  • Harold "Doc" Edgerton
“If we knew what it was we were doing, it would not be called research, would it?”
  • Albert Einstein

What exactly is research?
“Scientific research is systematic, controlled, empirical, and critical investigation of natural phenomena guided by theory and hypotheses about the presumed relations among such phenomena.”
  • Kerlinger, 1986
Research is an organized and systematic way of finding answers to questions

Important Components of Empirical Research
Problem statement, research questions, purposes, benefits
Theory, assumptions, background literature
Variables and hypotheses
Operational definitions and measurement
Research design and methodology
Instrumentation, sampling
Data analysis
Conclusions, interpretations, recommendations

Sampling
What is your population of interest?
  • To whom do you want to generalize your results?
      All students (18 and over)
      Undergraduates only
      Greeks
      Athletes
      Other
Can you sample the entire population?

Sampling
A sample is “a smaller (but hopefully representative) collection of units from a population used to determine truths about that population” (Field, 2005)
Why sample?
  • Resources (time, money) and workload
  • Gives results with known accuracy that can be calculated mathematically
The sampling frame is the list from which the potential respondents are drawn
  • Registrar’s office
  • Class rosters
  • Must assess sampling frame errors

Types of Samples
Probability (Random) Samples
  • Simple random sample
  • Systematic random sample
  • Stratified random sample
      Proportionate
      Disproportionate
  • Cluster sample
Non-Probability Samples
  • Convenience sample
  • Purposive sample
  • Quota sample

Sample Size
Depends on expected response rate
  • Average 85% for paper
      FINAL SAMPLE DESIRED / .85 = SAMPLE
  • Average 25% for web
      FINAL SAMPLE DESIRED / .25 = SAMPLE

Size of Campus     Final Desired N
<600               All students
600-2,999          600
3,000-9,999        700
10,000-19,999      800
20,000-29,999      900
≥30,000            1,000

Bias and Error
Systematic Error or Bias: unknown or unacknowledged error created during the design, measurement, sampling, procedure, or choice of problem studied
  • Error tends to go in one direction
      Examples: Selection, Recall, Social desirability
Random Error
  • Unrelated to true measures
      Example: Momentary fatigue

Reliability and Validity
Reliability
  • The extent to which a test is repeatable and yields consistent scores
  • Affected by random error/bias
Validity
  • The extent to which a test measures what it is supposed to measure
  • A subjective judgment made on the basis of experience and empirical indicators
  • Asks “Is the test measuring what you think it’s measuring?”
  • Affected by systematic error/bias

Reliability vs. Validity
In order to be valid, a test must be reliable; but reliability does not guarantee validity.
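The arithmetic on the Sample Size slide (final sample desired divided by the expected response rate) can be sketched in a few lines of Python. The function names and the way the campus-size table is encoded are illustrative assumptions, not part of the original presentation. The sketch also assumes the size bands are contiguous (so the 900 band runs up to 29,999) and that a campus never needs to sample more students than it enrolls.

```python
import math

# Final desired N by campus size, taken from the Sample Size slide.
# Each entry is (upper bound of the enrollment band, final desired N).
# Assumption: bands are contiguous, so the 900 band ends at 29,999.
DESIRED_N_BY_ENROLLMENT = [
    (2_999, 600),
    (9_999, 700),
    (19_999, 800),
    (29_999, 900),
]

# Average response rates quoted on the slide.
RESPONSE_RATES = {"paper": 0.85, "web": 0.25}


def final_desired_n(enrollment: int) -> int:
    """Return the final desired N for a campus of the given size."""
    if enrollment < 600:
        return enrollment  # campuses under 600 survey all students
    for upper_bound, desired in DESIRED_N_BY_ENROLLMENT:
        if enrollment <= upper_bound:
            return desired
    return 1_000  # campuses of 30,000 or more


def students_to_sample(enrollment: int, mode: str) -> int:
    """Return how many students to draw so the expected number of
    responses reaches the final desired N (hypothetical helper)."""
    if enrollment < 600:
        return enrollment
    needed = math.ceil(final_desired_n(enrollment) / RESPONSE_RATES[mode])
    # Assumption: cap at enrollment, since no more students exist to draw.
    return min(needed, enrollment)


if __name__ == "__main__":
    for mode in ("paper", "web"):
        print(mode, students_to_sample(12_000, mode))
```

For a campus of 12,000 students the desired final sample is 800, so a web survey at a 25% response rate needs 800 / .25 = 3,200 invitations, while a paper survey at 85% needs far fewer.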
Levels of Measurement
Nominal
  • Gender
      Male, Female
  • Vaccinations
      Yes, No, Unsure
Ordinal
  • Personal health status
      Excellent, Very good, Good, Fair, Poor
  • Last 30 days
      Never used, Not in last 30 days, 1-2 days, 3-5 days, 6-9 days, 10-19 days, 20-29 days, All 30 days
Interval
  • Body Mass Index (BMI)
Ratio
  • Number of drinks
  • Number of sexual partners
  • Perception percentages
  • Blood alcohol concentration (BAC)

Biostatistics
“It is commonly believed that anyone who tabulates numbers is a statistician. This is like believing that anyone who owns a scalpel is a surgeon.”
  • R. Hooke
“Torture numbers, and they'll confess to anything.”
  • Gregg Easterbrook
“98% of all statistics are made up.”
  • Author Unknown

Types of Statistics
Descriptive statistics
  • Describe the basic features of data in a study
  • Provide summaries about the sample and measures
Inferential statistics
  • Investigate questions, models, and hypotheses
  • Infer population characteristics based on sample
  • Make judgments about what we observe

Descriptive Statistics
Central Tendency
  • Mode
  • Median
  • Mean
Variation
  • Range
  • Variance
  • Standard Deviation
Frequency

Descriptive Statistics Examples
Categorical Variables (Nominal/Ordinal)

Q1 Gen health
                         Frequency   Percent   Valid Percent   Cumulative Percent
Valid    1 excellent          9145      16.9            17.0                 17.0
         2 very good         23767      43.9            44.2                 61.2
         3 good              16442      30.4            30.6                 91.8
         4 fair               3737       6.9             6.9                 98.7
         5 poor                565       1.0             1.1                 99.8
         6 don't know          132        .2              .2                100.0
         Total               53788      99.4           100.0
Missing  System                323        .6
Total                        54111     100.0

Descriptive Statistics Examples
Categorical Variables (Nominal/Ordinal)

Q49 Year in school * Q46 Sex Crosstabulation
(Count, with % of Total in parentheses)
Q49 Year in school          1 female         2 male           Total
1 1st year undergrad        7366 (14.5%)     4154 (8.2%)      11520 (22.7%)
2 2nd year under            6755 (13.3%)     3678 (7.2%)      10433 (20.6%)
3 3rd year under            6195 (12.2%)     3333 (6.6%)       9528 (18.8%)
4 4th year under            5192 (10.2%)     2676 (5.3%)       7868 (15.5%)
5 5th year or more under    1380 (2.7%)       985 (1.9%)       2365 (4.7%)
6 graduate                  5088 (10.0%)     3246 (6.4%)       8334 (16.4%)
7 adult special              203 (.4%)        105 (.2%)         308 (.6%)
8 other                      266 (.5%)        145 (.3%)         411 (.8%)
Total                      32445 (63.9%)    18322 (36.1%)     50767 (100.0%)

Descriptive Statistics Examples
Continuous Variables (Interval/Ratio)

Descriptive Statistics
                              N       Range   Minimum   Maximum   Mean      Std. Deviation   Variance
Q48 Weight in pounds          51935   534     52        586       153.16    35.791           1281.031
HT_INCH Height in Inches      52017   56.00   48.00     104.00    67.2035   4.01241          16.099
Q13 How many drinks           53374   88      0         88        4.42      4.401            19.370
Q12 Hours alcohol             53326   65      0         65        2.99      2.726            7.430
BAC Blood Alcohol Content     50604   2.47    .00       2.47      .0731     .08357           .007
Valid N (listwise)            50218

Hypotheses
Null hypotheses
  • Presumed true until statistical evidence in the form of a hypothesis test indicates otherwise
      There is no effect/relationship
      There is no difference in means
Alternative hypotheses
  • Tested using inferential statistics
      There is an effect/relationship
      There is a difference in means

Alpha, Beta, Power, Effect Size
Alpha – probability of making a Type I error
  • Reject null when null is true
  • Level of significance (the cutoff the p value is compared against)
Beta – probability of making a Type II error
  • Fail to reject null when null is false
Power – probability of correctly rejecting null
  • 1 – Beta
Effect Size
  • Measure of the strength of the relationship between two variables

                       Null is true                        Null is false
Reject null            Alpha (Type I error)                1 – Beta (Power, CORRECT REJECTION)
Fail to reject null    1 – Alpha (CORRECT NONREJECTION)    Beta (Type II error)

Let’s test some hypotheses!!!

Test of the mean of one continuous variable
College students report drinking an average of 5 drinks the last time they “partied”/socialized
  • Hypotheses
      Ho: µ = 5
      HA: µ ≠ 5
  • Test: Two-tailed t-test
  • Result: Reject null

One-Sample Statistics
                   N       Mean   Std. Deviation   Std. Error Mean
How many drinks    53374   4.42   4.401            .019

One-Sample Test (Test Value = 5)
                                                                      95% Confidence Interval of the Difference
                   t         df      Sig. (2-tailed)   Mean Difference   Lower   Upper
How many drinks    -30.352   53373   .000              -.578             -.62    -.54

Test of a single proportion of one categorical variable
20% of college students report their health is excellent
  • Hypotheses
      Ho: p = .20
      HA: p < .20 (one-tailed)
  • Test: Z-test for a single proportion
  • Result: Reject null

Binomial Test
                       Category   N       Observed Prop.   Test Prop.   Asymp. Sig. (1-tailed)
Gen health   Group 1   <= 1       9145    .170             .2           .000 (a, b)
             Group 2   > 1        44643   .830
             Total                53788   1.000
a. Alternative hypothesis states that the proportion of cases in the first group < .2.
b. Based on Z Approximation.

Test of a relationship between two continuous variables
There is a relationship between the number of drinks students report drinking the last time they drank and the number of sex partners they have had within the last school year
  • Hypotheses
      Ho: ρ = 0
      HA: ρ ≠ 0
  • Test: Pearson Product Moment Correlation
  • Result: Reject null

Correlations
                                           How many drinks   Partners you had
How many drinks     Pearson Correlation    1                 .238**
                    Sig. (2-tailed)                          .000
                    N                      53374             52576
Partners you had    Pearson Correlation    .238**            1
                    Sig. (2-tailed)        .000
                    N                      52576             52896
**. Correlation is significant at the 0.01 level (2-tailed).

Test of the difference between two means
Men and women report significantly different numbers of sexual partners over the past 12 months
  • Hypotheses
      Ho: µ1 = µ2
      HA: µ1 ≠ µ2
  • Test: Independent Samples t-test OR One-way ANOVA
  • Result: Reject null

Group Statistics
                    Sex      N       Mean   Std. Deviation   Std. Error Mean
Partners you had    female   32687   1.34   2.017            .011
                    male     18474   1.82   3.627            .027

Independent Samples Test (Partners you had)
Levene's Test for Equality of Variances: F = 867.978, Sig. = .000
t-test for Equality of Means:
                                                                                                      95% Confidence Interval of the Difference
                               t         df          Sig. (2-tailed)   Mean Difference   Std. Error Difference   Lower   Upper
Equal variances assumed        -19.360   51159       .000              -.483             .025                    -.532   -.434
Equal variances not assumed    -16.704   25065.988   .000              -.483             .029                    -.540   -.426

Test of the difference between two or more means
Mean BAC reported differs across student residences
  • Hypotheses
      Ho: µ1 = µ2 = µ3 = µ4 = µ5 = µ6
      HA: µi ≠ µj for at least one pair i, j
  • Test: One-way ANOVA
  • Result: Reject null

Descriptives (Blood Alcohol Content)
                                                                           95% Confidence Interval for Mean
                            N       Mean    Std. Deviation   Std. Error   Lower Bound   Upper Bound   Minimum   Maximum
residence hall              21285   .0741   .08215           .00056       .0730         .0752         .00       1.27
frat/sorority house         781     .1127   .09278           .00332       .1062         .1193         .00       .75
other university housing    3620    .0622   .07357           .00122       .0598         .0646         .00       1.41
off campus                  18151   .0773   .08539           .00063       .0760         .0785         .00       2.47
with parents                4279    .0606   .08490           .00130       .0581         .0631         .00       1.17
other                       2266    .0579   .08296           .00174       .0545         .0613         .00       1.26
Total                       50382   .0731   .08357           .00037       .0724         .0738         .00       2.47

ANOVA (Blood Alcohol Content)
                  Sum of Squares   df      Mean Square   F        Sig.
Between Groups    3.188            5       .638          92.123   .000
Within Groups     348.695          50376   .007
Total             351.884          50381

Test of the difference between two or more means
Multiple Comparisons
Dependent Variable: Blood Alcohol Content
Games-Howell
                                                                                                          95% Confidence Interval
(I) Currently live          (J) Currently live          Mean Difference (I-J)   Std. Error   Sig.   Lower Bound   Upper Bound
residence hall              frat/sorority house         -.03865*                .00337       .000   -.0483        -.0290
residence hall              other university housing     .01190*                .00135       .000    .0081         .0157
residence hall              off campus                  -.00316*                .00085       .003   -.0056        -.0007
residence hall              with parents                 .01350*                .00141       .000    .0095         .0175
residence hall              other                        .01623*                .00183       .000    .0110         .0215
frat/sorority house         residence hall               .03865*                .00337       .000    .0290         .0483
frat/sorority house         other university housing     .05055*                .00354       .000    .0404         .0606
frat/sorority house         off campus                   .03548*                .00338       .000    .0258         .0451
frat/sorority house         with parents                 .05215*                .00356       .000    .0420         .0623
frat/sorority house         other                        .05488*                .00375       .000    .0442         .0656
other university housing    residence hall              -.01190*                .00135       .000   -.0157        -.0081
other university housing    frat/sorority house         -.05055*                .00354       .000   -.0606        -.0404
other university housing    off campus                  -.01506*                .00138       .000   -.0190        -.0111
other university housing    with parents                 .00160                 .00178       .947   -.0035         .0067
other university housing    other                        .00433                 .00213       .323   -.0017         .0104
off campus                  residence hall               .00316*                .00085       .003    .0007         .0056
off campus                  frat/sorority house         -.03548*                .00338       .000   -.0451        -.0258
off campus                  other university housing     .01506*                .00138       .000    .0111         .0190
off campus                  with parents                 .01667*                .00144       .000    .0125         .0208
off campus                  other                        .01940*                .00185       .000    .0141         .0247
with parents                residence hall              -.01350*                .00141       .000   -.0175        -.0095
with parents                frat/sorority house         -.05215*                .00356       .000   -.0623        -.0420
with parents                other university housing    -.00160                 .00178       .947   -.0067         .0035
with parents                off campus                  -.01667*                .00144       .000   -.0208        -.0125
with parents                other                        .00273                 .00217       .809   -.0035         .0089
other                       residence hall              -.01623*                .00183       .000   -.0215        -.0110
other                       frat/sorority house         -.05488*                .00375       .000   -.0656        -.0442
other                       other university housing    -.00433                 .00213       .323   -.0104         .0017
other                       off campus                  -.01940*                .00185       .000   -.0247        -.0141
other                       with parents                -.00273                 .00217       .809   -.0089         .0035
*. The mean difference is significant at the .05 level.

Test for a relationship between two categorical variables
Is there an association between being a member of a fraternity/sorority and ever being diagnosed with depression?
  • Hypotheses
      Ho: There is no association between being a member of a fraternity/sorority and ever being diagnosed with depression.
      HA: There is an association between being a member of a fraternity/sorority and ever being diagnosed with depression.
  • Test: Chi-square test for independence
  • Result: Fail to reject null

Test for a relationship between two categorical variables

Ever - Depression * Frat or sorority? Crosstabulation
                                        Frat or sorority?
                                        yes      no        Total
Ever - Depression   yes   Count          681     7692       8373
                          Expected Count 715.6   7657.4     8373.0
                    no    Count          3744    39657      43401
                          Expected Count 3709.4  39691.6    43401.0
Total                     Count          4425    47349      51774
                          Expected Count 4425.0  47349.0    51774.0

Chi-Square Tests
                                Value       df   Asymp. Sig. (2-sided)   Exact Sig. (2-sided)   Exact Sig. (1-sided)
Pearson Chi-Square              2.185 (b)   1    .139
Continuity Correction (a)       2.122       1    .145
Likelihood Ratio                2.211       1    .137
Fisher's Exact Test                                                      .141                   .073
Linear-by-Linear Association    2.185       1    .139
N of Valid Cases                51774
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 715.62.

Important Points to Remember
A significant association does not indicate causation
Statistical significance is not always the same as practical significance
Multiple factors contribute to whether your results are significant
It gets easier and easier as you practice!

Questions???
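
As a worked check on the chi-square test for independence, the depression by fraternity/sorority result can be re-derived from the four observed counts in the crosstabulation. The following is a minimal standard-library sketch, not the procedure SPSS runs internally. The function name is hypothetical, and the p value uses a shortcut that is valid only for 1 degree of freedom (a 2x2 table).

```python
import math


def chi_square_2x2(observed):
    """Pearson chi-square test of independence for a 2x2 table of counts
    (no continuity correction). Returns (chi2, p_value, expected)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    # Expected count under independence: row total * column total / N.
    expected = [[r * c / n for c in col_totals] for r in row_totals]

    chi2 = sum(
        (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
        for i in range(2)
        for j in range(2)
    )

    # With 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, expected


# Rows: ever diagnosed with depression (yes, no).
# Columns: fraternity/sorority member (yes, no).
observed = [[681, 7692], [3744, 39657]]

chi2, p_value, expected = chi_square_2x2(observed)
print(f"chi-square = {chi2:.3f}, p = {p_value:.3f}")
print(f"minimum expected count = {min(min(row) for row in expected):.2f}")
```

The statistic and p value should agree with the Pearson Chi-Square row of the SPSS output to three decimals. Because p is larger than the conventional .05 cutoff, the sketch leads to the same decision as the slide: fail to reject the null.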