Introduction to R: for Absolute Beginners
Office of Methodological & Data Sciences
Sarah Schwartz (1)
BNR 278, 12:30 pm - 3:20 pm, October 2, 2012
(1) EDUC 455, (435)797-0169, [email protected] or [email protected], http://www.cehs.usu.edu/research/omds

Sections: How to Speak R | Getting to Know Your Data | Fitting Statistical Models

Download & Install
2 pieces of free software. Video walk-through of both installations, link: HERE. Accept all defaults.

R (https://www.r-project.org/)
• Install first
• "Software Environment"
• The brain
• We won't work directly with it

RStudio (https://www.rstudio.com/)
• Install second
• "User Interface"
• The go-between for us
• Auto completes & color codes

Helpful Websites
• Tutorials by William B. King, PhD, Coastal Carolina University: http://ww2.coastal.edu/kingw/statistics/R-tutorials
• RexRepos, R Example Repository: http://www.uni-kiel.de/psychologie/rexrepos
• R-bloggers, R news & tutorials, broad coverage: http://www.r-bloggers.com
• Quick-R, accessing the power of R, includes some graphs: http://www.statmethods.net
• Psychology, using R for psychological research: http://personality-project.org/r

Outline
How to Speak R
  • Nuts & Bolts
  • Using Add-on Packages
  • How to Read in YOUR Own Data
Getting to Know Your Data
  • Numeric Summaries
  • Graphical Summaries
Fitting Statistical Models
  • Motor Trend Car Road Tests
  • Comparing Group Centers
  • Regression Models

RStudio Workspace

Other User Interfaces Exist...
R Commander (Rcmdr): http://www.rcommander.com

Basic Calculations
• prompt: in the console, command-line
• case sensitive: 'anova' is not the same as 'ANOVA'
• comment lines: use the # symbol at least once

1 + 3        #### addition
## [1] 4
16 / 2       #### division
## [1] 8
5 ^ 2        #### powers
## [1] 25
sqrt(144)    #### square root
## [1] 12
log(1.3)     #### natural logarithm
## [1] 0.2623643

Create & Remove Objects
# ALL OF THESE DO THE SAME THING
x=7
x = 7
x= 7
x = 7
x =     # Press Enter here.
7       # Press Enter again.

# TWO WAYS TO ASSIGN OBJECTS
Aval = 7      # use the equal
B.val = 15    # names: no spaces
Cval <- 10    # use an arrow
ls()          # list environment
## [1] "Aval"  "B.val" "Cval"  "x"

# YOU CAN REMOVE OBJECTS AFTER CREATING THEM
rm(B.val)     # remove from environment
ls()          # list the environment
## [1] "Aval" "Cval" "x"
Aval          # what is assigned to this?
## [1] 7
aval          # CAPS MATTER!!!
## Error in eval(expr, envir, enclos): object 'aval' not found

A double-equal tests for equivalence:
5 == 6             # are these equal?
## [1] FALSE
3 < 10             # 'less than'
## [1] TRUE
1 < 2 | 2 == 3     # '|' means 'or'
## [1] TRUE
Aval < Cval        # can test objects
## [1] TRUE

# Create a vector with "combine"
vec1 = c(1, 2, 7, 3, 2, -3)
# Are there ANY TWOs?
2 %in% vec1
## [1] TRUE
# test EACH VALUE to see if it is TWO
2 == vec1
## [1] FALSE  TRUE FALSE FALSE
## [5]  TRUE FALSE
# COUNT the number of TWOs
sum(2 == vec1)
## [1] 2

Some Possible CLASSES of R Objects
Individual VALUES:
• numeric: number values
• logical: either 'TRUE' (codes to 1) or 'FALSE' (codes to 0)
• factor: categorical levels, nominal or ordinal
• character: text, or 'string' in SPSS
Data OBJECTS:
• vector: a 1-dimensional listing of single elements
• matrix: a 2-dimensional array of elements (rows & columns)
• data.frame: like a matrix with more formatting (nice labels); columns may have different classes

x = 1:5
y = x / 3
z = x > 4
c = factor(c("m", "m" , "f", "f", "m"))
class(x)
## [1] "integer"
class(y)
## [1] "numeric"
class(z)
## [1] "logical"
class(c)
## [1] "factor"
x
## [1] 1 2 3 4 5
y
## [1] 0.3333333 0.6666667 1.0000000 1.3333333 1.6666667
z
## [1] FALSE FALSE FALSE FALSE  TRUE
c
## [1] m m f f m
## Levels: f m

Finding a Function
If you're not sure of a function's name, use 'apropos' to search for it:
apropos("round")
## [1] "round"
## [2] "round.Date"
## [3] "round.POSIXt"
apropos("mean")
##  [1] ".colMeans"
##  [2] ".rowMeans"
##  [3] "colMeans"
##  [4] "kmeans"
##  [5] "mean"
##  [6] "mean.Date"
##  [7] "mean.default"
##  [8] "mean.difftime"
##  [9] "mean.POSIXct"
## [10] "mean.POSIXlt"
## [11] "rowMeans"
## [12] "weighted.mean"
Then you can search for the name of the function in the HELP tab of RStudio (or use Google).

You can use the Help tab in RStudio to find out about a function.
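The same lookups can also be done from the console. The lines below are a minimal sketch, not part of the original slides; `sd` is used only as a convenient example of a function with a default argument, and the object names are made up for illustration.

```r
# Console alternatives to clicking through the Help tab.
# 'topic' and 'sd_defaults' are illustrative names, not from the slides.

topic <- help("sd")           # printing 'topic' (or typing ?sd) opens the help page

sd_defaults <- formals(sd)    # the arguments of sd() and their default values
names(sd_defaults)            # which arguments does sd() accept?
sd_defaults$na.rm             # the default for na.rm (missing values are kept)
```

Checking the defaults this way is handy before running a function on data that may contain missing values.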
# Ask for the function's arguments
args(round)
## function (x, digits = 0)
## NULL

round(2.4)
## [1] 2
round(2.7)
## [1] 3
ceiling(2.4)
## [1] 3
ceiling(2.7)
## [1] 3
floor(2.4)
## [1] 2
floor(2.7)
## [1] 2

Missing Values
data = c(1, 0, 2, 5, NA)
is.na(data)
## [1] FALSE FALSE FALSE
## [4] FALSE  TRUE
anyNA(data)
## [1] TRUE
1 + 0 + 2 + 5
## [1] 8
mean(data)
## [1] NA
mean(data, na.rm = TRUE)
## [1] 2
sd(data)
## [1] NA
sd(data, na.rm = TRUE)
## [1] 2.160247
Different functions have different default ways to handle missing values. Use the HELP to determine what the default is and how to change it.

R Base vs. External Packages
When you download R, you are only getting the base functions. This is a relatively small collection of functions, but it keeps R running fast.

# included in R base:
summary(data)    # basic summary statistics
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##    0.00    0.75    1.50    2.00    2.75    5.00       1
table(data)      # tabulates categoricals
## data
## 0 1 2 5
## 1 1 1 1

Packages are collections of R functions, data, and compiled code in a well-defined format. The directory where packages are stored is called the library. By only downloading and installing the packages you need, on a project-by-project basis, R uses less storage space on your hard drive and less active memory.

Hundreds of packages are available for download and installation. Many are vetted and distributed by CRAN, others are available on GitHub, or you can create & share packages on an individual level.
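To make the idea of a library concrete, the sketch below (an illustration, not from the original slides) asks R where its library directories are and which packages are already in them. The object names are assumptions chosen for readability; `stats` is used because it ships with every R installation, while `psych` is only present if it has been installed.

```r
# Where is the library (the directory that stores installed packages)?
lib_dirs <- .libPaths()

# Which packages are already installed there?
installed <- rownames(installed.packages())

# Is a package available? requireNamespace() checks without attaching it.
has_stats <- requireNamespace("stats", quietly = TRUE)   # part of every R install
has_psych <- requireNamespace("psych", quietly = TRUE)   # TRUE only if installed

lib_dirs
head(installed)
c(stats = has_stats, psych = has_psych)
```

A check like `has_psych` is one way to avoid re-downloading a package that is already on the hard drive.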
• Install: download to your computer's hard drive ONLY ONCE
• Load: activate the package's library EVERY session

# Code for installing all the
# packages in this document
install.packages(c("psych", "xlsx", "haven", "lattice", "MASS", "ggplot2", "popbio", "beeswarm"))
NOTE: when you download your first package, select a mirror (a proxy server)

The 'psych' Package
This has been developed at Northwestern University since 2005 to include functions most useful for personality, psychometric, and psychological research. The package is also meant to supplement a text on psychometric theory, a draft of which is available at http://personality-project.org/r/book.

# 'LOAD' or 'activate' the package
library(psych)

This package has a nice feature for reading in data from your clipboard:
1. Highlight the data in Excel, including the first row with variable names
2. 'Copy' the selection, moving the information to the clipboard
3. Run the code below to store it in R as an object named bfi

# International Personality Item Pool
bfi = read.clipboard.tab()

The data are personality self-report items taken from the International Personality Item Pool (http://ipip.ori.org), collected as part of the Synthetic Aperture Personality Assessment (SAPA) web-based personality assessment project (http://SAPA-project.org).

5 Items x 5 Factors
• Agreeableness
• Conscientiousness
• Extraversion
• Neuroticism
• Openness
Response Scale
1. Very Inaccurate
2. Moderately Inaccurate
3. Slightly Inaccurate
4. Slightly Accurate
5. Moderately Accurate
6. Very Accurate
Demographic
• gender
• education
• age

Investigate the Form of Your Data
class(bfi)    # you probably want a data.frame
## [1] "data.frame"
dim(bfi)      # rows (subjects) & columns (variables)
## [1] 2800   28
names(bfi)    # columns should have variable names
##  [1] "A1"  "A2"  "A3"  "A4"  "A5"     "C1"        "C2"
##  [8] "C3"  "C4"  "C5"  "E1"  "E2"     "E3"        "E4"
## [15] "E5"  "N1"  "N2"  "N3"  "N4"     "N5"        "O1"
## [22] "O2"  "O3"  "O4"  "O5"  "gender" "education" "age"
table(complete.cases(bfi))    # are the cases complete? (no missing values)
##
## FALSE  TRUE
##   564  2236

Declare Categorical Variables - GENDER
# look at the raw form: 4 ways to designate a variable
bfi[, 26]             # designate column number...
bfi[, c("gender")]    # ...or column name...
bfi["gender"]         # ...all select the same variable...
bfi$gender            # ...this is the most common
class(bfi$gender)     # the variable's "class"
## [1] "integer"
head(bfi$gender)      # look at top cases
## [1] 1 2 2 2 1 2
summary(bfi$gender)   # how does it get summarized?
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   1.000   2.000   1.672   2.000   2.000
table(bfi$gender)     # what does "table" do?
##
##    1    2
##  919 1881

# define it as categorical: FACTOR is "nominal"
bfi$gender = factor(bfi$gender, labels = c("male", "female"))
# now it's ready to go
class(bfi$gender)     # did the "class" change?
## [1] "factor"
head(bfi$gender)      # does it look different?
## [1] male   female female female male   female
## Levels: male female
summary(bfi$gender)   # is the summary the same?
##   male female
##    919   1881
levels(bfi$gender)    # this lists the LABELS
## [1] "male"   "female"

Declare Categorical Variables - EDUCATION
table(bfi$education)    # look at the raw form
##
##    1    2    3    4    5
##  224  292 1249  394  418
# define as categorical: ORDERED is "ordinal"
bfi$education = ordered(bfi$education, labels = c("<HS", "HS", "HS+ ", "degree", "grad+"))
# now it's ready to go
head(bfi$education, n = 15)
##  [1] <NA> <NA> <NA> <NA> <NA> HS+  <NA> HS   <HS  <NA> <HS  <NA> <NA> <NA> <HS
## Levels: <HS < HS < HS+  < degree < grad+
summary(bfi$education)
##    <HS     HS   HS+  degree  grad+   NA's
##    224    292   1249    394    418    223
levels(bfi$education)
## [1] "<HS"    "HS"     "HS+ "   "degree" "grad+"

bfi[1:3, ]    # specify rows (subjects) in FRONT of the comma
##       A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3 N4 N5 O1 O2 O3 O4 O5 gender
## 61617  2  4  3  4  4  2  3  3  4  4  3  3  3  4  4  3  4  2  2  3  3  6  3  4  3   male
## 61618  2  4  5  2  5  5  4  4  3  4  1  1  6  4  3  3  3  3  5  5  4  2  4  3  3 female
## 61620  5  4  5  4  4  4  5  4  2  5  2  4  4  4  5  4  5  4  2  3  4  2  5  5  2 female
##       education age
## 61617      <NA>  16
## 61618      <NA>  18
## 61620      <NA>  17
bfi[1:4, 1:7]    # specify columns (variables) AFTER the comma
##       A1 A2 A3 A4 A5 C1 C2
## 61617  2  4  3  4  4  2  3
## 61618  2  4  5  2  5  5  4
## 61620  5  4  5  4  4  4  5
## 61621  4  4  6  5  5  4  4
# ...or list the names of the variables (after comma)
bfi[1:3, c("A1", "A2", "A3","A4", "A5", "gender", "education", "age")]
##       A1 A2 A3 A4 A5 gender education age
## 61617  2  4  3  4  4   male      <NA>  16
## 61618  2  4  5  2  5 female      <NA>  18
## 61620  5  4  5  4  4 female      <NA>  17

Saving a Reduced Dataset
# suppose I'm only interested in subjects under the age of 35
table(bfi$age < 35)
##
## FALSE  TRUE
##   738  2062
# AND I only want to keep a few variables (for demo)
bfiA = bfi[bfi$age < 35, c("A1", "A2", "A3","A4", "A5", "gender", "education", "age")]
dim(bfiA)
## [1] 2062    8
# see a few lines from top and bottom
headTail(bfiA)
##        A1  A2  A3  A4  A5 gender education age
## 61617   2   4   3   4   4   male      <NA>  16
## 61618   2   4   5   2   5 female      <NA>  18
## 61620   5   4   5   4   4 female      <NA>  17
## 61621   4   4   6   5   5 female      <NA>  17
## ...   ... ... ... ... ...   <NA>      <NA> ...
## 67551   6   1   3   3   3   male      HS+   19
## 67552   2   4   4   3   5   male    degree  27
## 67556   2   3   5   2   5 female    degree  29
## 67559   5   2   2   4   4   male    degree  31

How to Read in YOUR Own Data
Before you can load your data, you need to tell R where to look.
# get the working directory
getwd()
## [1] "C:/Users/A00315273/Box Sync/Office of Research Services/OMDS/OMDS Workshops/OMDS in
Notice: you need to use forward slashes instead of backslashes
# change the working directory to YOUR COMPUTER!!!
setwd("C:/Users/A00315273/OMDSworkshop")
If the data is stored in a TEXT file (whitespace or comma delimited)...
# these functions are part of BASE R
myData = read.table("data.txt", header = TRUE)    # whitespace delimited by default
myData = read.csv("data.csv", header = TRUE)      # comma delimited

Best Practices: DataSet in Excel
Often, you may enter your data into Excel. Make sure the FIRST ROW contains the names of variables.
Names, Values, & Fields
• FIRST variable is unit identification
• NEVER use white SPACES
• AVOID symbols or punctuation: ? [ } * $ %
• USE . or _ to push words together
• KEEP it short, but meaningful
• ALWAYS use numbers over text
• LEAVE missing cells blank (not .)

Read in Data from Excel Files
[Figure: two Excel screenshots, labeled "Bad Example" and "Much Better!"]

Read in Data from Excel Files
# there's a package for that!
# "Read, write, format Excel 2007 (xlsx) files"
library(xlsx)
# read.xlsx tries to guess variable classes
# read.xlsx2 is faster on bigger datasets
myData = read.xlsx("data.xlsx",
                   sheetIndex = 1,    # or use sheetName, instead
                   header = TRUE)     # TRUE if 1st row = names
NOTE: If you are having problems with Excel datasets, try saving the sheet as a ".csv" file (comma delimited) and use the read.csv function in Base R.

Read in Data from SPSS, SAS, & Stata Files
# New package this summer...Hadley Wickham is my HERO!
library(haven)
# Currently haven can read and write:
# logical, integer, numeric, character and factors
# SPSS: Supports both sav & por files
myData = read_spss("data.sav")
myData = read_sav("data.sav")
myData = read_por("data.por")
# SAS: Supports both sas7bdat & sas7bcat files
myData = read_sas("data.sas7bdat")
# Stata
myData = read_stata("data.dta")
myData = read_dta("data.dta")
# NOTE all labeled variables are a new class: "labelled"
# ... use as_factor() to treat the variable as categorical
# ... use zap_labels() to treat the variable as continuous

Getting to Know Your Data

Mean, Standard Deviation, Etc...
# descriptives on all variables
describe(bfiA)
##            vars    n  mean   sd median trimmed  mad min max range  skew kurtosis   se
## A1            1 2053  2.52 1.42      2    2.36 1.48   1   6     5  0.73    -0.44 0.03
## A2            2 2040  4.75 1.20      5    4.92 1.48   1   6     5 -1.07     0.86 0.03
## A3            3 2048  4.57 1.31      5    4.75 1.48   1   6     5 -0.97     0.39 0.03
## A4            4 2048  4.59 1.54      5    4.81 1.48   1   6     5 -0.91    -0.29 0.03
## A5            5 2050  4.50 1.26      5    4.64 1.48   1   6     5 -0.79     0.07 0.03
## gender*       6 2062  1.66 0.47      2    1.70 0.00   1   2     1 -0.68    -1.54 0.01
## education*    7 1853  3.09 1.06      3    3.11 0.00   1   5     4 -0.04    -0.03 0.02
## age           8 2062 23.16 5.22     22   22.98 5.93   3  34    31  0.25    -0.59 0.12

Mean, Standard Deviation, Etc...
# split by a grouping variable
describeBy(bfiA, bfiA$gender)
## group: male
##            vars   n  mean   sd median trimmed  mad min max range  skew kurtosis   se
## A1            1 699  2.81 1.43      3    2.71 1.48   1   6     5  0.48    -0.75 0.05
## A2            2 691  4.46 1.30      5    4.61 1.48   1   6     5 -0.88     0.25 0.05
## A3            3 695  4.38 1.30      5    4.52 1.48   1   6     5 -0.78     0.01 0.05
## A4            4 697  4.31 1.51      5    4.45 1.48   1   6     5 -0.64    -0.62 0.06
## A5            5 695  4.35 1.33      5    4.49 1.48   1   6     5 -0.74    -0.13 0.05
## gender*       6 699  1.00 0.00      1    1.00 0.00   1   1     0   NaN      NaN 0.00
## education*    7 626  3.11 1.15      3    3.14 1.48   1   5     4 -0.04    -0.40 0.05
## age           8 699 22.83 5.04     22   22.63 4.45   3  34    31  0.27    -0.29 0.19
## ------------------------------------------------------------------
## group: female
##            vars    n  mean   sd median trimmed  mad min max range  skew kurtosis   se
## A1            1 1354  2.37 1.39      2    2.17 1.48   1   6     5  0.88    -0.16 0.04
## A2            2 1349  4.90 1.12      5    5.07 1.48   1   6     5 -1.16     1.22 0.03
## A3            3 1353  4.67 1.31      5    4.86 1.48   1   6     5 -1.10     0.68 0.04
## A4            4 1351  4.74 1.53      5    4.99 1.48   1   6     5 -1.08     0.04 0.04
## A5            5 1355  4.58 1.22      5    4.71 1.48   1   6     5 -0.80     0.14 0.03
## gender*       6 1363  2.00 0.00      2    2.00 0.00   2   2     0   NaN      NaN 0.00
## education*    7 1227  3.08 1.02      3    3.09 0.00   1   5     4 -0.04     0.21 0.03
## age           8 1363 23.32 5.31     23   23.17 5.93   9  34    25  0.23    -0.73 0.14

Cross Tabulations & χ2 test for Independence
# split by a grouping variable
# If a variable is included on the left side of the formula,
# it is assumed to be a vector of frequencies
edXgender = xtabs(~ education + gender, data = bfiA)
edXgender
##          gender
## education male female
##    <HS      71    109
##    HS       70    121
##    HS+     303    691
##    degree   84    169
##    grad+    98    137
# chi-squared test for independence
chisq.test(edXgender)
##
## Pearson's Chi-squared test
##
## data: edXgender
## X-squared = 14.746, df = 4, p-value = 0.005258

Correlation Matrix
How strong is the association between the 5 Agreeableness items?
# reduce the dataset for ease of demonstration
bfiAonly = bfi[, c("A1", "A2", "A3", "A4", "A5")]
# GET CORRELATION VALUES
cor(bfiAonly, use = "pairwise.complete.obs")
##            A1         A2         A3         A4         A5
## A1  1.0000000 -0.3401932 -0.2652471 -0.1464245 -0.1814383
## A2 -0.3401932  1.0000000  0.4850980  0.3350872  0.3900836
## A3 -0.2652471  0.4850980  1.0000000  0.3604283  0.5041411
## A4 -0.1464245  0.3350872  0.3604283  1.0000000  0.3075373
## A5 -0.1814383  0.3900836  0.5041411  0.3075373  1.0000000
round(cor(bfiAonly, use = "pairwise.complete.obs"), 3)
##        A1     A2     A3     A4     A5
## A1  1.000 -0.340 -0.265 -0.146 -0.181
## A2 -0.340  1.000  0.485  0.335  0.390
## A3 -0.265  0.485  1.000  0.360  0.504
## A4 -0.146  0.335  0.360  1.000  0.308
## A5 -0.181  0.390  0.504  0.308  1.000

Correlation Matrix with p-values
corr.test(bfiAonly, adjust = "none", method = "spearman")
## Call:corr.test(x = bfiAonly, method = "spearman", adjust = "none")
## Correlation matrix
##       A1    A2    A3    A4    A5
## A1  1.00 -0.37 -0.30 -0.16 -0.22
## A2 -0.37  1.00  0.50  0.34  0.40
## A3 -0.30  0.50  1.00  0.36  0.53
## A4 -0.16  0.34  0.36  1.00  0.31
## A5 -0.22  0.40  0.53  0.31  1.00
## Sample Size
##      A1   A2   A3   A4   A5
## A1 2784 2757 2759 2767 2769
## A2 2757 2773 2751 2758 2757
## A3 2759 2751 2774 2759 2758
## A4 2767 2758 2759 2781 2765
## A5 2769 2757 2758 2765 2784
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
##    A1 A2 A3 A4 A5
## A1  0  0  0  0  0
## A2  0  0  0  0  0
## A3  0  0  0  0  0
## A4  0  0  0  0  0
## A5  0  0  0  0  0
##
## To see confidence intervals of the correlations, print with the short=FALSE option

Correlation Matrix Visualize
A picture can be worth a thousand words
cor.plot(cor(bfiAonly, use = "pairwise.complete.obs", method = "spearman"))
[Figure: "Correlation plot" of items A1 to A5, with a color scale running from -1 to 1]

psych's All-in-One Plot
A picture can be worth a thousand words
# plots pairs of variables
pairs.panels(bfiAonly)
[Figure: pairs.panels output for items A1 to A5, one panel per pair of variables, with the pairwise correlations printed]

Histogram: Defaults vs. Options
# all defaults
hist(bfi$A1)
# better: override some defaults
hist(bfi$A1, breaks = 0.5:6.5, main = "This is Much Better", xlab = "Item A-1", col = "gray")
[Figure: two histograms of item A1, the default "Histogram of bfi$A1" and the customized "This is Much Better" with x-axis label "Item A-1"]

Histogram: Use More Code!
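The code behind this slide's figure is not part of the transcript. The lines below are only a rough sketch of how such a plot could be built in base R. They use simulated 1 to 6 responses in place of `bfi$A1` so the example runs without the psych data; the object names, the simulated proportions, and the abbreviated axis labels are all assumptions.

```r
# Simulated item responses standing in for bfi$A1 (an assumption, not the real data)
set.seed(1)
a1 <- sample(1:6, size = 500, replace = TRUE, prob = c(6, 5, 3, 3, 2, 1))

# Abbreviated versions of the six response-scale labels
resp_labels <- c("Very Inacc.", "Mod", "Slight", "Slight", "Mod", "Very Acc.")

# Draw the bars without axes, then add custom axes by hand
h <- hist(a1, breaks = 0.5:6.5, col = "gray", axes = FALSE,
          main = "Ready for Publication",
          xlab = "Agreeableness Item #1: \"Am indifferent to the feelings of others\"",
          ylab = "Frequency")
axis(side = 1, at = 1:6, labels = resp_labels, cex.axis = 0.8)   # labeled x-axis
axis(side = 2, las = 1)                                          # horizontal y-axis numbers
```

Drawing the axes separately is what allows text labels to replace the default 1 to 6 tick marks.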
[Figure: histogram titled "Ready for Publication"; x-axis "Agreeableness Item #1 (q.1146) ''Am indifferent to the feelings of others''" with ticks labeled Very, Mod, Slight (Inaccurate) through Slight, Mod, Very (Accurate); y-axis "Frequency"]

Density Plot: Continuous Distribution
# one way to put two plots on the same page
par(mfrow=c(1, 2))                        # 1 row & 2 columns
hist(bfi$age)                             # rough distribution
plot(density(bfi$age, na.rm = TRUE))      # smoothed out
[Figure: "Histogram of bfi$age" beside the density plot "density.default(x = bfi$age, na.rm = TRUE)", N = 2800, Bandwidth = 2.047]

Density Plot: AGE
[Figure: "Compare to the Normal Curve"; density and normal curves plotted over Age, y-axis "Proportion"]

Bar Plot: Categorical Distribution
par(mfrow=c(1, 2))    # 1 row & 2 columns
# one variable at a time (must give it counts!)
barplot(table(bfi$gender))
barplot(table(bfi$education))
[Figure: bar plot of gender (male, female) beside bar plot of education (<HS to grad+)]

Bar Plot: Compare 2 Categorical Distributions
[Figure: grouped bar plot titled "Synthetic Aperture Personality Assessment (SAPA)"; x-axis "Highest Level of Education" (<HS, HS, HS+, degree, grad+); y-axis "Frequency"; legend: male, female]

Boxplots: GENDER & EDUCATION
par(mfrow=c(1, 2))    # 1 row & 2 columns
# all together
boxplot(bfiA$age)
# split by education groups
boxplot(bfi$age ~ bfi$education)
[Figure: boxplot of age for all subjects in bfiA beside boxplots of age split by education group]

Boxplots: Use More Options
# reset to one plot per page
par(mfrow=c(1, 1))
# make it look better
boxplot(age ~ education, data = bfi, col = heat.colors(5), main = "Build a Better Boxplots", xlab = "Highest Education Obtained", ylab = "Age (years)")
[Figure: "Build a Better Boxplots"; x-axis "Highest Education Obtained"; y-axis "Age (years)"]

Boxplots: AGE & EDUCATION
[Figure: "Compare the Genders"; boxplots of "Age (years)" by "Highest Education Obtained", with separate boxes for male and female]

Scatterplots: Display Associations
Jitter the education level so dots don't cover each other so much.
# put 3 plots in one row/page
par(mfrow = c(1, 3))
plot(bfi$age, jitter(as.numeric(bfi$education), factor = 0.25), main = "factor = 0.25")
plot(bfi$age, jitter(as.numeric(bfi$education), factor = 1), main = "factor = 1")
plot(bfi$age, jitter(as.numeric(bfi$education), factor = 2), main = "factor = 2")
[Figure: three scatterplots of jittered education against bfi$age, titled "factor = 0.25", "factor = 1", and "factor = 2"]

Scatterplots: AGE & EDUCATION
[Figure: "Jitter the Ordinal Variable"; x-axis "Age (years)"; y-axis "Education" (<HS, HS, HS+, degree, grad+)]

Bubble Plot: Helpful with Overplotting
If you can dream of a type of plot, you can create it!
# aggregate the data
bfiAag = aggregate(bfiA, by = list(bfiA$A1, bfiA$A2), length)
# circle's area ~ number of points
symbols(bfiAag$Group.1, bfiAag$Group.2, circles = sqrt(bfiAag$A1/pi)/50, inches = FALSE, main = "Bubble Plot", xlab = "item A1", ylab = "item A2")
[Figure: "Bubble Plot" of item A2 against item A1, with circle area proportional to the number of points]

Fitting Statistical Models

Motor Trend Car Road Tests
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).
mpg    Miles/(US) gallon
cyl    Number of cylinders
disp   Displacement (cu.in.)
hp     Gross horsepower
drat   Rear axle ratio
wt     Weight (lb/1000)
qsec   1/4 mile time
vs     V/S
am     Transmission
gear   Number of forward gears
carb   Number of carburetors

Load car Package & the mtcars Data
# Load a New Package:
library(car)    # "Companion to Applied Regression" (a textbook)
# mtcars ships with base R (datasets package); this makes it active in the environment
data(mtcars)
# check out the data
dim(mtcars)
## [1] 32 11
names(mtcars)
## [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"
# set the categorical variables
mtcars$vs = factor(mtcars$vs, labels = c("v", "s"))
mtcars$am = factor(mtcars$am, labels = c("automatic", "manual"))

headTail(mtcars)
##                 mpg cyl disp  hp drat   wt  qsec   vs        am gear carb
## Mazda RX4        21   6  160 110  3.9 2.62 16.46    v    manual    4    4
## Mazda RX4 Wag    21   6  160 110  3.9 2.88 17.02    v    manual    4    4
## Datsun 710     22.8   4  108  93 3.85 2.32 18.61    s    manual    4    1
## Hornet 4 Drive 21.4   6  258 110 3.08 3.21 19.44    s automatic    3    1
## ...             ... ...  ... ...  ...  ...   ... <NA>      <NA>  ...  ...
## Ford Pantera L 15.8   8  351 264 4.22 3.17  14.5    v    manual    5    4
## Ferrari Dino   19.7   6  145 175 3.62 2.77  15.5    v    manual    5    6
## Maserati Bora    15   8  301 335 3.54 3.57  14.6    v    manual    5    8
## Volvo 142E     21.4   4  121 109 4.11 2.78  18.6    s    manual    4    2

summary(mtcars)
##       mpg             cyl             disp             hp             drat
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0   Min.   :2.760
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0   Median :3.695
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7   Mean   :3.597
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0   Max.   :4.930
##        wt             qsec       vs             am          gear            carb
##  Min.   :1.513   Min.   :14.50   v:18   automatic:19   Min.   :3.000   Min.   :1.000
##  1st Qu.:2.581   1st Qu.:16.89   s:14   manual   :13   1st Qu.:3.000   1st Qu.:2.000
##  Median :3.325   Median :17.71                         Median :4.000   Median :2.000
##  Mean   :3.217   Mean   :17.85                         Mean   :3.688   Mean   :2.812
##  3rd Qu.:3.610   3rd Qu.:18.90                         3rd Qu.:4.000   3rd Qu.:4.000
##  Max.   :5.424   Max.   :22.90                         Max.   :5.000   Max.   :8.000

Test Central Differences in 2 Independent Groups
# find the means
describeBy(mtcars$mpg, mtcars$am)
## group: automatic
##   vars  n  mean   sd median trimmed  mad  min  max range skew kurtosis   se
## 1    1 19 17.15 3.83   17.3   17.12 3.11 10.4 24.4    14 0.01     -0.8 0.88
## ------------------------------------------------------------------
## group: manual
##   vars  n  mean   sd median trimmed  mad min  max range skew kurtosis   se
## 1    1 13 24.39 6.17   22.8   24.38 6.67  15 33.9  18.9 0.05    -1.46 1.71
# view the two groups side-by-side
boxplot(mpg ~ am, data = mtcars, horizontal = TRUE)
[Figure: horizontal boxplots of mpg for automatic and manual transmissions]

Test Central Differences in 2 Independent Groups
PARAMETRIC t-test for means, assumes normality
t.test(mpg ~ am, data = mtcars)
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group automatic    mean in group manual
##                17.14737                24.39231
NON-PARAMETRIC Mann-Whitney U Test, based on ranks
wilcox.test(mpg ~ am, data = mtcars)
##
## Wilcoxon rank sum test with continuity correction
##
## data: mpg by am
## W = 42, p-value = 0.001871
## alternative hypothesis: true location shift is not equal to 0

More than Two Groups?
# plot to investigate
boxplot(drat ~ cyl, data = mtcars, main = "Between vs. Within", xlab = "Number of Cylinders", ylab = "Rear Axle Ratio", col = "light gray")
grid()
stripchart(drat ~ cyl, data = mtcars, vertical = TRUE, method = 'jitter', jitter = 0.2, cex = 1, pch = 16, col = c("red", "blue", "dark green"), add = TRUE)
# we can use another package (beeswarm offers beeswarm() as an alternative to stripchart)
library(beeswarm)
[Figure: "Between vs. Within"; boxplots of "Rear Axle Ratio" by "Number of Cylinders" (4, 6, 8) with jittered points overlaid]

ANOVA
# run the ANOVA
# NOTE: cyl is stored as a number here, so it enters as one linear term (Df = 1);
# use factor(cyl) to compare the three cylinder groups
anova1 = aov(drat ~ cyl, data = mtcars)
summary(anova1)
##             Df Sum Sq Mean Sq F value   Pr(>F)
## cyl          1  4.342   4.342   28.81 8.24e-06 ***
## Residuals   30  4.521   0.151
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# to get type III sums of squares
Anova(anova1, type = "III")
## Anova Table (Type III tests)
##
## Response: drat
##             Sum Sq Df F value    Pr(>F)
## (Intercept) 57.217  1 379.714 < 2.2e-16 ***
## cyl          4.342  1  28.814 8.245e-06 ***
## Residuals    4.521 30
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANCOVA
# add a continuous covariate
anova2 = aov(drat ~ cyl + wt, data = mtcars)
summary(anova2)
##             Df Sum Sq Mean Sq F value   Pr(>F)
## cyl          1  4.342   4.342  32.284 3.83e-06 ***
## wt           1  0.620   0.620   4.613   0.0402 *
## Residuals   29  3.900   0.134
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(anova2, type = "III")
## Anova Table (Type III tests)
##
## Response: drat
##             Sum Sq Df  F value  Pr(>F)
## (Intercept) 56.578  1 420.6933 < 2e-16 ***
## cyl          0.464  1   3.4493 0.07346 .
## wt           0.620  1   4.6129 0.04022 *
## Residuals    3.900 29
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Kruskal Wallis Test
# non-parametric version: uses ranks instead of means
kruskal.test(drat ~ cyl, data = mtcars)
##
## Kruskal-Wallis rank sum test
##
## data: drat by cyl
## Kruskal-Wallis chi-squared = 14.395, df = 2, p-value = 0.0007486

Simple Linear Regression: Fit Model
# Simple Linear Regression
linreg = lm(mpg ~ wt, data = mtcars)
slr = summary(linreg)
slr
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.5432 -2.3647 -0.1252  1.4096  6.8727
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
summary(linreg)$r.squared
## [1] 0.7528328

Simple Linear Regression: Visualize the Fit
# Plot of relationship and least squares line
plot(mtcars$wt, mtcars$mpg)
abline(linreg, col = "red")
text(x = 2, y = 12, labels = bquote(~R^2 == .(round(slr$r.squared, 3))), col = "red")
text(x = 4.75, y = 30, labels = bquote(~adj-R^2 == .(round(slr$adj.r.squared, 3))), col = "blue")
title(main = "Linear Regression")
grid()
[Figure: "Linear Regression"; scatterplot of mtcars$mpg against mtcars$wt with the fitted line, annotated R2 = 0.753 and adj-R2 = 0.745]

Introducing ggplot2
# a VERY COOL plotting package for next semester's workshop...
library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  stat_smooth(method = "lm", col = "red") +
  facet_grid(. ~ am) +
  theme_bw()
[Figure: mpg against wt with fitted lines, in two panels: automatic and manual]

Multiple Linear Regression: Fit the Model
# add several variables to the model
linreg2 = lm(mpg ~ wt + cyl + hp, data = mtcars)
summary(linreg2)
## Call:
## lm(formula = mpg ~ wt + cyl + hp, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.9290 -1.5598 -0.5311  1.1850  5.8986
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.75179    1.78686  21.687  < 2e-16 ***
## wt          -3.16697    0.74058  -4.276 0.000199 ***
## cyl         -0.94162    0.55092  -1.709 0.098480 .
## hp          -0.01804    0.01188  -1.519 0.140015
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared: 0.8431, Adjusted R-squared: 0.8263
## F-statistic: 50.17 on 3 and 28 DF, p-value: 2.184e-11

Multiple Linear Regression: Residual Diagnostics
[Figure: "Distribution of Studentized Residuals"; x-axis "sresid"; y-axis "Density"]

Logistic Regression: Fit the Model
# run the logistic regression (outcome has 2 levels)
logreg = glm(am ~ mpg, data = mtcars, family = binomial(link = "logit"))
summary(logreg)
## Call:
## glm(formula = am ~ mpg, family = binomial(link = "logit"), data = mtcars)
##
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max
## -1.5701 -0.7531 -0.4245  0.5866  2.0617
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -6.6035     2.3514  -2.808  0.00498 **
## mpg           0.3070     0.1148   2.673  0.00751 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 43.230 on 31 degrees of freedom
## Residual deviance: 29.675 on 30 degrees of freedom
## AIC: 33.675
##
## Number of Fisher Scoring iterations: 5

Logistic Regression: Visualize the Fit
[Figure: "Motor Trend Car Road Tests"; fitted probability curve, x-axis "Miles/(US) gallon", y-axis "Transmission" (Automatic vs. Manual)]

Other Generalized Regression Models
# Can do other distributions and links
poisreg = glm(carb ~ hp, data = mtcars, family = poisson(link="log"))
summary(poisreg)
## Call:
## glm(formula = carb ~ hp, family = poisson(link = "log"), data = mtcars)
##
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max
## -0.86441 -0.55608 -0.07877  0.21395  1.49103
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.148971   0.265018   0.562    0.574
## hp          0.005517   0.001387   3.977 6.97e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
##     Null deviance: 27.043 on 31 degrees of freedom
## Residual deviance: 12.279 on 30 degrees of freedom
## AIC: 105.64
##
## Number of Fisher Scoring iterations: 4
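The coefficients of the logistic and Poisson models are on the logit and log scales, so they are usually easier to read after exponentiating. The sketch below is not from the original slides. It refits both models on a fresh copy of `mtcars`, where `am` is still coded 0 = automatic and 1 = manual (which gives the same fit as the factor version used above), and the object names `odds_ratio`, `rate_ratio`, and `p_manual_25` are made up for illustration.

```r
# Refit the two models (mtcars ships with R in the datasets package).
# Assumption: am is left as 0/1 here (0 = automatic, 1 = manual).
data(mtcars)
logreg  <- glm(am ~ mpg, data = mtcars, family = binomial(link = "logit"))
poisreg <- glm(carb ~ hp, data = mtcars, family = poisson(link = "log"))

# Odds ratios: multiplicative change in the odds of a manual
# transmission for each additional mile per gallon
odds_ratio <- exp(coef(logreg))

# Rate ratios: multiplicative change in the expected number of
# carburetors for each additional unit of horsepower
rate_ratio <- exp(coef(poisreg))

# Predicted probability of a manual transmission for a car with 25 mpg
p_manual_25 <- predict(logreg, newdata = data.frame(mpg = 25), type = "response")

odds_ratio
rate_ratio
p_manual_25
```

With the estimate of 0.3070 reported above, the odds ratio for mpg is roughly 1.36, meaning the odds of a manual transmission rise by about a third for each extra mile per gallon.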