How to Speak R
Getting to Know Your Data
Introduction to R: for Absolute Beginners
Office of Methodological & Data Sciences
Sarah Schwartz
BNR 278
12:30 pm - 3:20 pm, October 2, 2012
EDUC 455, (435)797-0169, [email protected] or [email protected],
http://www.cehs.usu.edu/research/omds
Fitting Statistical Models
Download & Install 2 pieces of free software
Video walk-through of both installations link: HERE
accept all defaults
R: https://www.r-project.org/
• Install first
• “Software Environment”
• The brain
• We won’t work directly with it
RStudio: https://www.rstudio.com/
• Install second
• “User Interface”
• The go-between for us
• Auto completes & color codes
Helpful Websites
Tutorials by William B. King, PhD, Coastal Carolina University
http://ww2.coastal.edu/kingw/statistics/R-tutorials
RexRepos R Example Repository
http://www.uni-kiel.de/psychologie/rexrepos
R-bloggers R news & tutorials: broad coverage
http://www.r-bloggers.com
Quick-R Accessing the power of R includes some graphs
http://www.statmethods.net
Psychology Using R for psychological research
http://personality-project.org/r
Outline
How to Speak R
Nuts & Bolts
Using Add-on Packages
How to Read in YOUR Own Data
Getting to Know Your Data
Numeric Summaries
Graphical Summaries
Fitting Statistical Models
Motor Trend Car Road Tests
Comparing Group Centers
Regression Models
RStudio Workspace
Other User Interfaces Exist...
R Commander (Rcmdr) http://www.rcommander.com
Basic Calculations
1 + 3       # addition
## [1] 4
16 / 2      # division
## [1] 8
5 ^ 2       # powers
## [1] 25
sqrt(144)   # square root
## [1] 12
log(1.3)    # logarithm (natural log by default)
## [1] 0.2623643
prompt: in the console, command-line
case sensitive: ‘anova’ is not the same as ‘ANOVA’
comment lines: use the # symbol at least once
Create & Remove Objects
# ALL OF THESE DO THE SAME THING
x=7
x = 7
x=      7
x   =   7
x =          # Press Enter here.
7            # Press Enter again.
# TWO WAYS TO ASSIGN OBJECTS
Aval = 7     # use the equal
B.val = 15   # names: no spaces
Cval <- 10   # use an arrow
ls()         # list environment
## [1] "Aval"  "B.val" "Cval"  "x"
# YOU CAN REMOVE OBJECTS AFTER CREATING THEM
rm(B.val)    # remove from environment
ls()         # list the environment
## [1] "Aval" "Cval" "x"
Aval         # what is assigned to this?
## [1] 7
aval         # CAPS MATTER!!!
## Error in eval(expr, envir, enclos): object 'aval' not found
A double-equal tests for equivalence:
5 == 6             # are these equal?
## [1] FALSE
3 < 10             # 'less than'
## [1] TRUE
1 < 2 | 2 == 3     # '|' means 'or'
## [1] TRUE
Aval < Cval        # can test objects
## [1] TRUE
# Create a vector with "combine"
vec1 = c(1, 2, 7, 3, 2, -3)
# Are there ANY TWOs?
2 %in% vec1
## [1] TRUE
# test EACH VALUE to see if it is TWO
2 == vec1
## [1] FALSE  TRUE FALSE FALSE  TRUE FALSE
# COUNT the number of TWOs
sum(2 == vec1)
## [1] 2
Some Possible CLASSES of R Objects
Individual VALUES:
numeric number values
logical either ‘TRUE’ (codes to 1) or ‘FALSE’ (codes to 0)
factor categorical levels, nominal or ordinal
character text or ‘string’ in SPSS
Data OBJECTS:
vector a 1-dimensional listing of single elements
matrix a 2-dimensional array of elements (rows & columns)
data.frame a matrix-like table with more formatting (nice labels); columns may differ in class
x = 1:5
y = x / 3
z = x > 4
c = factor(c("m", "m", "f", "f", "m"))
class(x)
## [1] "integer"
class(y)
## [1] "numeric"
class(z)
## [1] "logical"
class(c)
## [1] "factor"
x
## [1] 1 2 3 4 5
y
## [1] 0.3333333 0.6666667 1.0000000 1.3333333 1.6666667
z
## [1] FALSE FALSE FALSE FALSE  TRUE
c
## [1] m m f f m
## 2 Levels: f ...
Finding a Function
If you’re not sure of a function’s name,
use ‘apropos’ to search for it:
apropos("round")
## [1] "round"
## [2] "round.Date"
## [3] "round.POSIXt"
Then you can search for the name of the
function in the HELP tab of
RStudio (or use Google).
apropos("mean")
## [1] ".colMeans"
## [2] ".rowMeans"
## [3] "colMeans"
## [4] "kmeans"
## [5] "mean"
## [6] "mean.Date"
## [7] "mean.default"
## [8] "mean.difftime"
## [9] "mean.POSIXct"
## [10] "mean.POSIXlt"
## [11] "rowMeans"
## [12] "weighted.mean"
You can use the Help tab in RStudio to find out about a function.
# Ask for the function's arguments
args(round)
## function (x, digits = 0)
## NULL
round(2.4)
## [1] 2
round(2.7)
## [1] 3
ceiling(2.4)
## [1] 3
ceiling(2.7)
## [1] 3
floor(2.4)
## [1] 2
floor(2.7)
## [1] 2
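The args(round) output lists a second argument, digits, with a default of 0. A minimal sketch of what changing it does, using made-up values that are not from the workshop:

```r
# digits controls how many decimal places are kept (default is 0)
two_places <- round(3.14159, digits = 2)
two_places

# an argument can also be matched by position: the second one is digits
one_place <- round(2.718, 1)
one_place

# exact halves go to the even neighbour in R, which can surprise newcomers
halves <- round(c(2.5, 3.5))
halves
```

Here two_places is 3.14, one_place is 2.7, and halves is 2 and 4.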
Missing Values
data = c(1, 0, 2, 5, NA)
is.na(data)
## [1] FALSE FALSE FALSE FALSE  TRUE
anyNA(data)
## [1] TRUE
1 + 0 + 2 + 5
## [1] 8
mean(data)
## [1] NA
mean(data, na.rm = TRUE)
## [1] 2
sd(data)
## [1] NA
sd(data, na.rm = TRUE)
## [1] 2.160247
Different functions have different default ways to handle missing values.
Use the HELP to determine what the default is and how to change it.
R Base vs. External Packages
When you download R, you are only getting the base functions. This is a
relatively small collection of functions, but it keeps R running fast.
# included in R base:
summary(data)      # basic summary statistics
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##    0.00    0.75    1.50    2.00    2.75    5.00       1
table(data)        # tabulates categoricals
## data
## 0 1 2 5
## 1 1 1 1
Packages are collections of R functions, data, and compiled code in a
well-defined format. The directory where packages are stored is called the
library.
By only downloading and installing the packages you need, on a
project-by-project basis, R uses less storage space on your hard drive and active
memory.
Thousands of packages are available for download and installation. Many are
vetted and distributed by CRAN, others are available on GitHub, or you can
create & share packages on an individual level.
Install Download to your computer’s hard drive ONLY ONCE
Load Activate the package’s library EVERY session
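# Code for installing all the
# packages in this document
# (the names must be combined with c(); a second bare argument is read as the library path)
install.packages(c("psych",
                   "xlsx",
                   "haven",
                   "lattice",
                   "MASS",
                   "ggplot2",
                   "popbio",
                   "beeswarm",
                   "car"))      # car is loaded later for Anova()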
NOTE: when you download your first package, select a mirror (a proxy server)
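The install-once, load-every-session routine can be sketched as a check for what is still missing. This is only a sketch and not part of the workshop code: the helper name missing_packages is hypothetical, and the install line is left commented out because it needs an internet connection.

```r
# Hypothetical helper (not from the workshop): which packages still need installing?
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())  # names of everything in the library
  setdiff(pkgs, installed)                     # requested but not yet installed
}

needed <- c("psych", "haven", "ggplot2")
to_get <- missing_packages(needed)
to_get

# Uncomment to download whatever is still missing (install ONLY ONCE):
# if (length(to_get) > 0) install.packages(to_get)
```

After that, library(psych) and friends still have to be run at the start of every session.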
The ‘psych’ Package
This has been developed at Northwestern University since 2005 to include
functions most useful for personality, psychometric, and psychological research.
The package is also meant to supplement a text on psychometric theory, a
draft of which is available at http://personality-project.org/r/book.
# 'LOAD' or 'activate' the package
library(psych)
This package has a nice feature for reading in data from your clipboard:
1. Highlight the data in Excel, including the first row with variable names
2. ‘Copy’ the selection, moving the information to the clipboard
3. Run the code below to store it in R as an object named bfi
# International Personality Item Pool
bfi = read.clipboard.tab()
The data are personality self-report items taken from the International Personality Item
Pool (http://ipip.ori.org), included as part of the Synthetic
Aperture Personality Assessment (SAPA) web-based personality assessment
project http://SAPA-project.org.
5 Items x 5 Factors
• Agreeableness
• Conscientiousness
• Extraversion
• Neuroticism
• Openness
Response Scale
1. Very Inaccurate
2. Moderately Inaccurate
3. Slightly Inaccurate
4. Slightly Accurate
5. Moderately Accurate
6. Very Accurate
Demographic
• gender
• education
• age
Investigate the Form of Your Data
class(bfi)     # you probably want a data.frame
## [1] "data.frame"
dim(bfi)       # rows (subjects) & columns (variables)
## [1] 2800   28
names(bfi)     # columns should have variable names
##  [1] "A1"        "A2"        "A3"        "A4"        "A5"        "C1"        "C2"
##  [8] "C3"        "C4"        "C5"        "E1"        "E2"        "E3"        "E4"
## [15] "E5"        "N1"        "N2"        "N3"        "N4"        "N5"        "O1"
## [22] "O2"        "O3"        "O4"        "O5"        "gender"    "education" "age"
table(complete.cases(bfi))     # are the cases complete? (no missing values)
##
## FALSE  TRUE
##   564  2236
Declare Categorical Variables - GENDER
# look at the raw form: 4 ways to designate a variable
bfi[, 26]              # designate column number...
bfi[, c("gender")]     # ...or column name...
bfi["gender"]          # ...same values, kept as a one-column data.frame...
bfi$gender             # ...this is the most common
class(bfi$gender)      # the variable's "class"
## [1] "integer"
head(bfi$gender)       # look at top cases
## [1] 1 2 2 2 1 2
summary(bfi$gender)    # how does it get summarized?
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   1.000   2.000   1.672   2.000   2.000
table(bfi$gender)      # what does "table" do?
##
##    1    2
##  919 1881
Declare Categorical Variables - GENDER
# define it as categorical: FACTOR is "nominal"
bfi$gender = factor(bfi$gender, labels = c("male", "female"))
# now it's ready to go
class(bfi$gender)      # did the "class" change?
## [1] "factor"
head(bfi$gender)       # does it look different?
## [1] male   female female female male   female
## Levels: male female
summary(bfi$gender)    # is the summary the same?
##   male female
##    919   1881
levels(bfi$gender)     # this gives a list of the LABELS
## [1] "male"   "female"
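When only labels are given, factor() matches them to the sorted raw codes. A small sketch on made-up codes (not the bfi data) that spells out the mapping with the levels argument, so a label cannot silently land on the wrong code:

```r
# Made-up codes in the same 1/2 coding as the gender variable
codes <- c(1, 2, 2, 2, 1, 2)

# levels = the raw codes, labels = the names they should get (same order)
gender <- factor(codes, levels = c(1, 2), labels = c("male", "female"))

gender          # prints the labels instead of the codes
table(gender)   # counts per label
```

Naming the levels explicitly is a design choice, not a requirement; it mainly helps when a code is absent from the data or the codes are not in the order you expect.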
Declare Categorical Variables - EDUCATION
table(bfi$education)     # look at the raw form
##
##    1    2    3    4    5
##  224  292 1249  394  418
# define as categorical: ORDERED is "ordinal"
bfi$education = ordered(bfi$education,
                        labels = c("<HS", "HS", "HS+ ", "degree", "grad+"))
# now it's ready to go
head(bfi$education, n = 15)
##  [1] <NA> <NA> <NA> <NA> <NA> HS+  <NA> HS   <HS  <NA> <HS  <NA> <NA> <NA> <HS
## Levels: <HS < HS < HS+  < degree < grad+
summary(bfi$education)
##    <HS     HS   HS+  degree  grad+   NA's
##    224    292   1249    394    418    223
levels(bfi$education)
## [1] "<HS"    "HS"     "HS+ "   "degree" "grad+"
bfi[1:3, ]       # specify rows (subjects) in FRONT of the comma
##       A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3 N4 N5 O1 O2 O3 O4 O5 gender
## 61617  2  4  3  4  4  2  3  3  4  4  3  3  3  4  4  3  4  2  2  3  3  6  3  4  3   male
## 61618  2  4  5  2  5  5  4  4  3  4  1  1  6  4  3  3  3  3  5  5  4  2  4  3  3 female
## 61620  5  4  5  4  4  4  5  4  2  5  2  4  4  4  5  4  5  4  2  3  4  2  5  5  2 female
##       education age
## 61617      <NA>  16
## 61618      <NA>  18
## 61620      <NA>  17
bfi[1:4, 1:7]    # specify columns (variables) AFTER the comma
##       A1 A2 A3 A4 A5 C1 C2
## 61617  2  4  3  4  4  2  3
## 61618  2  4  5  2  5  5  4
## 61620  5  4  5  4  4  4  5
## 61621  4  4  6  5  5  4  4
# ...or list the names of the variables (after comma)
bfi[1:3, c("A1", "A2", "A3","A4", "A5", "gender", "education", "age")]
##       A1 A2 A3 A4 A5 gender education age
## 61617  2  4  3  4  4   male      <NA>  16
## 61618  2  4  5  2  5 female      <NA>  18
## 61620  5  4  5  4  4 female      <NA>  17
Saving a Reduced Dataset
# suppose I'm only interested in subjects under the age of 35
table(bfi$age < 35)
##
## FALSE  TRUE
##   738  2062
# AND I only want to keep a few variables (for demo)
bfiA = bfi[bfi$age < 35,
           c("A1", "A2", "A3","A4", "A5", "gender", "education", "age")]
dim(bfiA)
## [1] 2062    8
# see a few lines from top and bottom
headTail(bfiA)
##        A1  A2  A3  A4  A5 gender education age
## 61617   2   4   3   4   4   male      <NA>  16
## 61618   2   4   5   2   5 female      <NA>  18
## 61620   5   4   5   4   4 female      <NA>  17
## 61621   4   4   6   5   5 female      <NA>  17
## ...   ... ... ... ... ...   <NA>      <NA> ...
## 67551   6   1   3   3   3   male       HS+  19
## 67552   2   4   4   3   5   male    degree  27
## 67556   2   3   5   2   5 female    degree  29
## 67559   5   2   2   4   4   male    degree  31
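Selecting rows with a logical test works cleanly here because age has no missing values. A sketch on a made-up data frame (the names toy, id and age are illustrative, not from the workshop) of what happens when the tested variable does contain an NA, and two common ways around it:

```r
# Made-up data: one age is missing
toy <- data.frame(id = 1:5, age = c(16, 40, NA, 22, 35))

young_bracket <- toy[toy$age < 35, ]         # NA < 35 is NA, so an all-NA row sneaks in
young_which   <- toy[which(toy$age < 35), ]  # which() keeps only the TRUE positions
young_subset  <- subset(toy, age < 35)       # subset() also drops the NA comparison

nrow(young_bracket)   # 3 rows: ids 1 and 4, plus a row of NAs
nrow(young_which)     # 2 rows: ids 1 and 4
```

Checking dim() after every reduction, as the slide does, is the quickest way to notice the extra rows.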
How to Read in YOUR Own Data
Before you can load your data, you need to tell R where to look.
# get the working directory
getwd()
## [1] "C:/Users/A00315273/Box Sync/Office of Research Services/OMDS/OMDS Workshops/OMDS in
Notice: you need to use forward slashes (/) instead of backslashes (\)
# change the working directory to YOUR COMPUTER!!!
setwd("C:/Users/A00315273/OMDSworkshop")
If the data is stored in a TEXT file (whitespace or comma delimited)...
# these functions are part of BASE R
myData = read.table("data.txt", header = TRUE)
myData = read.csv("data.csv", header = TRUE)
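To try read.csv without hunting for a file, one option is to write a tiny data frame to a temporary file and read it back. This is a sketch with made-up data; the names toy, tmp, id and score are illustrative only.

```r
# Made-up data with one missing value
toy <- data.frame(id = 1:3, score = c(4.5, NA, 3.2))

# write it to a temporary .csv file, then read it back in
tmp <- tempfile(fileext = ".csv")
write.csv(toy, tmp, row.names = FALSE)   # row.names = FALSE: no extra index column
myData <- read.csv(tmp, header = TRUE)   # header = TRUE: first row holds the names

str(myData)      # check classes after reading
unlink(tmp)      # tidy up the temporary file
```

The missing score is written as NA and comes back as a missing value, so the round trip keeps it.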
Best Practices: DataSet in Excel
Often, you may enter your data into Excel.
Make sure the FIRST ROW contains the names of variables.
Names, Values, & Fields
• FIRST variable is unit identification
• NEVER use white SPACES
• AVOID symbols or punctuation: ? [ } * $ %
• USE . or _ to push words together
• KEEP it short, but meaningful
• ALWAYS use numbers over text
• LEAVE missing cells blank (not .)
Read in Data from Excel Files
Bad Example
Much Better!
Read in Data from Excel Files
# there's a package for that!
# "Read, write, format Excel 2007 (xlsx) files"
library(xlsx)
# read.xlsx tries to guess variables classes
# read.xlsx2 is faster at bigger datasets
myData = read.xlsx("data.xlsx",
                   sheetIndex = 1,       # or use sheetName, instead
                   header     = TRUE)    # TRUE if 1st row = names
NOTE: If you are having problems with Excel datasets, try saving the sheet as a “.csv”
file (comma delimited) and use the read.csv function in Base R.
Read in Data from SPSS, SAS, & Stata Files
# New package this summer...Hadley Wickham is my HERO!
library(haven)
# Currently haven can read and write:
#   logical, integer, numeric, character and factors
# SPSS: Supports both sav & por files
myData = read_spss("data.sav")
myData = read_sav("data.sav")
myData = read_por("data.por")
# SAS: Supports both sas7bdat & sas7bcat files
myData = read_sas("data.sas7bdat")
# Stata
myData = read_stata("data.dta")
myData = read_dta("data.dta")
# NOTE all labeled variables are a new class: "labelled"
# ... use as_factor() to treat the variable categorical
# ... use zap_labels() to treat the variable as continuous
Mean, Standard Deviation, Etc...
# descriptives on all variables
describe(bfiA)
##            vars    n  mean   sd median trimmed  mad min max range  skew kurtosis   se
## A1            1 2053  2.52 1.42      2    2.36 1.48   1   6     5  0.73    -0.44 0.03
## A2            2 2040  4.75 1.20      5    4.92 1.48   1   6     5 -1.07     0.86 0.03
## A3            3 2048  4.57 1.31      5    4.75 1.48   1   6     5 -0.97     0.39 0.03
## A4            4 2048  4.59 1.54      5    4.81 1.48   1   6     5 -0.91    -0.29 0.03
## A5            5 2050  4.50 1.26      5    4.64 1.48   1   6     5 -0.79     0.07 0.03
## gender*       6 2062  1.66 0.47      2    1.70 0.00   1   2     1 -0.68    -1.54 0.01
## education*    7 1853  3.09 1.06      3    3.11 0.00   1   5     4 -0.04    -0.03 0.02
## age           8 2062 23.16 5.22     22   22.98 5.93   3  34    31  0.25    -0.59 0.12
Mean, Standard Deviation, Etc...
# split by a grouping variable
describeBy(bfiA, bfiA$gender)
## group: male
##            vars   n  mean   sd median trimmed  mad min max range  skew kurtosis   se
## A1            1 699  2.81 1.43      3    2.71 1.48   1   6     5  0.48    -0.75 0.05
## A2            2 691  4.46 1.30      5    4.61 1.48   1   6     5 -0.88     0.25 0.05
## A3            3 695  4.38 1.30      5    4.52 1.48   1   6     5 -0.78     0.01 0.05
## A4            4 697  4.31 1.51      5    4.45 1.48   1   6     5 -0.64    -0.62 0.06
## A5            5 695  4.35 1.33      5    4.49 1.48   1   6     5 -0.74    -0.13 0.05
## gender*       6 699  1.00 0.00      1    1.00 0.00   1   1     0   NaN      NaN 0.00
## education*    7 626  3.11 1.15      3    3.14 1.48   1   5     4 -0.04    -0.40 0.05
## age           8 699 22.83 5.04     22   22.63 4.45   3  34    31  0.27    -0.29 0.19
## ------------------------------------------------------------------
## group: female
##            vars    n  mean   sd median trimmed  mad min max range  skew kurtosis   se
## A1            1 1354  2.37 1.39      2    2.17 1.48   1   6     5  0.88    -0.16 0.04
## A2            2 1349  4.90 1.12      5    5.07 1.48   1   6     5 -1.16     1.22 0.03
## A3            3 1353  4.67 1.31      5    4.86 1.48   1   6     5 -1.10     0.68 0.04
## A4            4 1351  4.74 1.53      5    4.99 1.48   1   6     5 -1.08     0.04 0.04
## A5            5 1355  4.58 1.22      5    4.71 1.48   1   6     5 -0.80     0.14 0.03
## gender*       6 1363  2.00 0.00      2    2.00 0.00   2   2     0   NaN      NaN 0.00
## education*    7 1227  3.08 1.02      3    3.09 0.00   1   5     4 -0.04     0.21 0.03
## age           8 1363 23.32 5.31     23   23.17 5.93   9  34    25  0.23    -0.73 0.14
Cross Tabulations & χ2 test for Independence
# split by a grouping variable
# If a variable is included on the left side of the formula,
#   it is assumed to be a vector of frequencies
edXgender = xtabs(~ education + gender, data = bfiA)
edXgender
##          gender
## education male female
##    <HS      71    109
##    HS       70    121
##    HS+     303    691
##    degree   84    169
##    grad+    98    137
# chi-squared test for independence
chisq.test(edXgender)
##
## Pearson's Chi-squared test
##
## data: edXgender
## X-squared = 14.746, df = 4, p-value = 0.005258
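The object returned by chisq.test holds more than what is printed. A sketch on a made-up 2 x 2 table (not the bfi data) that pulls out the expected counts, which are worth a look because the chi-squared approximation is usually trusted only when they are not too small:

```r
# Made-up 2 x 2 table of counts (rows = group, columns = answer)
tab <- matrix(c(30, 10, 20, 40), nrow = 2,
              dimnames = list(group = c("a", "b"), answer = c("yes", "no")))

res <- chisq.test(tab, correct = FALSE)  # correct = FALSE: no continuity correction

res$expected    # counts expected if rows and columns were independent
res$statistic   # the X-squared value
res$parameter   # degrees of freedom
res$p.value
```

For this table every expected count is 20 or 30 and the statistic works out to 100/6, about 16.67, on 1 degree of freedom.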
Correlation Matrix
How strong is the association between the 5 Agreeableness Items?
# reduce the dataset for ease of demonstration
bfiAonly = bfi[, c("A1", "A2", "A3", "A4", "A5")]
# GET CORRELATION VALUES (p-values come from corr.test)
cor(bfiAonly, use = "pairwise.complete.obs")
##            A1         A2         A3         A4         A5
## A1  1.0000000 -0.3401932 -0.2652471 -0.1464245 -0.1814383
## A2 -0.3401932  1.0000000  0.4850980  0.3350872  0.3900836
## A3 -0.2652471  0.4850980  1.0000000  0.3604283  0.5041411
## A4 -0.1464245  0.3350872  0.3604283  1.0000000  0.3075373
## A5 -0.1814383  0.3900836  0.5041411  0.3075373  1.0000000
round(cor(bfiAonly, use = "pairwise.complete.obs"), 3)
##        A1     A2     A3     A4     A5
## A1  1.000 -0.340 -0.265 -0.146 -0.181
## A2 -0.340  1.000  0.485  0.335  0.390
## A3 -0.265  0.485  1.000  0.360  0.504
## A4 -0.146  0.335  0.360  1.000  0.308
## A5 -0.181  0.390  0.504  0.308  1.000
Correlation Matrix with p-values
corr.test(bfiAonly,
          adjust = "none",
          method = "spearman")
## Call:corr.test(x = bfiAonly, method = "spearman", adjust = "none")
## Correlation matrix
##       A1    A2    A3    A4    A5
## A1  1.00 -0.37 -0.30 -0.16 -0.22
## A2 -0.37  1.00  0.50  0.34  0.40
## A3 -0.30  0.50  1.00  0.36  0.53
## A4 -0.16  0.34  0.36  1.00  0.31
## A5 -0.22  0.40  0.53  0.31  1.00
## Sample Size
##      A1   A2   A3   A4   A5
## A1 2784 2757 2759 2767 2769
## A2 2757 2773 2751 2758 2757
## A3 2759 2751 2774 2759 2758
## A4 2767 2758 2759 2781 2765
## A5 2769 2757 2758 2765 2784
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
##    A1 A2 A3 A4 A5
## A1  0  0  0  0  0
## A2  0  0  0  0  0
## A3  0  0  0  0  0
## A4  0  0  0  0  0
## A5  0  0  0  0  0
##
## To see confidence intervals of the correlations, print with the short=FALSE option
Correlation Matrix Visualize
A picture can be worth a thousand words
cor.plot(cor(bfiAonly, use = "pairwise.complete.obs", method = "spearman"))
[Figure: "Correlation plot" of items A1 to A5, with a color scale running from −1 to 1]
psych’s All-in-One Plot
A picture can be worth a thousand words
# plots pairs of variables
pairs.panels(bfiAonly)
[Figure: pairs.panels plot of items A1 to A5 (each on a 1 to 6 scale), with the pairwise correlations printed in the panels]
Histogram: Defaults vs. Options
# all defaults
hist(bfi$A1)
# better with some defaults changed
hist(bfi$A1,
     breaks = 0.5:6.5,
     main   = "This is Much Better",
     xlab   = "Item A-1",
     col    = "gray")
[Figure: two histograms of item A1. Left: "Histogram of bfi$A1" (Frequency by bfi$A1). Right: "This is Much Better" (Frequency by Item A-1)]
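The breaks = 0.5:6.5 option puts one bar on each whole-number response. A sketch with made-up responses (the vector item is illustrative), using plot = FALSE so the bins can be inspected as numbers rather than drawn:

```r
# Made-up responses on a 1 to 6 scale
item <- c(1, 2, 2, 3, 3, 3, 4, 5, 6, 6)

# breaks at 0.5, 1.5, ..., 6.5 put each whole number in the middle of its own bin
h <- hist(item, breaks = 0.5:6.5, plot = FALSE)

h$mids     # bin centres: 1 2 3 4 5 6
h$counts   # how many responses fall in each bin
table(item)  # the same counts, tabulated directly
```

With the default breaks, responses can share a bin edge, which is why the default picture looks lopsided for integer data.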
Histogram: Use More Code!
[Figure: "Ready for Publication" histogram. y-axis: Frequency (0 to 1000). x-axis: Agreeableness Item #1 (q.1146), ''Am indifferent to the feelings of others'', with bars labeled from Very Inaccurate to Very Accurate]
Density Plot: Continuous Distribution
# one way to put two plots on the same page
par(mfrow=c(1, 2))                      # 1 row & 2 columns
hist(bfi$age)                           # rough distribution
plot(density(bfi$age, na.rm = TRUE))    # smoothed out
[Figure: left, "Histogram of bfi$age" (Frequency by bfi$age); right, "density.default(x = bfi$age, na.rm = TRUE)" (Density; N = 2800 Bandwidth = 2.047)]
Density Plot: AGE
[Figure: "Compare to the Normal Curve": Proportion by Age, with two curves in the legend: density and normal]
Bar Plot: Categorical Distribution
par(mfrow=c(1, 2))     # 1 row & 2 columns
# one variable at a time (must give it counts!)
barplot(table(bfi$gender))
barplot(table(bfi$education))
[Figure: two bar plots of counts. Left: gender (male, female). Right: education (<HS to grad+)]
Bar Plot: Compare 2 Categorical Distributions
[Figure: "Synthetic Aperture Personality Assessment (SAPA)": bar plot of Frequency by Highest Level of Education (<HS, HS, HS+, degree, grad+), with separate bars for male and female]
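The slide shows only the finished picture. One way a side-by-side bar plot of two categorical variables could be drawn in base R is sketched below on made-up data; this is an assumption about the approach, not the code behind the slide.

```r
# Made-up categorical data (not the bfi data)
gender    <- factor(c("male", "female", "female", "male", "female", "female"),
                    levels = c("male", "female"))
education <- factor(c("HS", "HS", "degree", "degree", "degree", "HS"),
                    levels = c("HS", "degree"))

# a two-way table of counts: rows = gender, columns = education
counts <- table(gender, education)
counts

# beside = TRUE draws the rows as neighbouring bars instead of stacking them
barplot(counts,
        beside      = TRUE,
        legend.text = TRUE,
        xlab        = "Highest Level of Education",
        ylab        = "Frequency")
```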
Boxplots: GENDER & EDUCATION
par(mfrow=c(1, 2))     # 1 row & 2 columns
# all together
boxplot(bfiA$age)
# split by education groups
boxplot(bfi$age ~ bfi$education)
[Figure: two boxplots of age. Left: all subjects in bfiA together. Right: split by education group (<HS to grad+)]
Boxplots: Use More Options
# make it look better
boxplot(age ~ education, data = bfi,
        col  = heat.colors(5),
        main = "Build a Better Boxplots",
        xlab = "Highest Education Obtained",
        ylab = "Age (years)")
# reset to one plot per page
par(mfrow=c(1, 1))
[Figure: "Build a Better Boxplots": Age (years) by Highest Education Obtained (<HS, HS, HS+, degree, grad+)]
Boxplots: AGE & EDUCATION
[Figure: "Compare the Genders": boxplots of Age (years) by Highest Education Obtained (<HS, HS, HS+, degree, grad+), male vs. female]
Scatterplots: Display Associations
Jitter the education level so dots don’t cover each other so much.
# put 3 plots in one row/page
par(mfrow = c(1, 3))
plot(bfi$age,
     jitter(as.numeric(bfi$education),
            factor = 0.25),
     main = "factor = 0.25")
plot(bfi$age,
     jitter(as.numeric(bfi$education),
            factor = 1),
     main = "factor = 1")
plot(bfi$age,
     jitter(as.numeric(bfi$education),
            factor = 2),
     main = "factor = 2")
[Figure: three scatterplots of jittered education (1 to 5) by bfi$age (0 to 80), titled "factor = 0.25", "factor = 1" and "factor = 2"]
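What jitter does to the numbers can be checked directly. A sketch on made-up ordinal codes (the vector edu is illustrative): a larger factor allows more random noise, but the noise stays small enough that rounding recovers the original level.

```r
set.seed(42)                       # jitter is random; fix the seed for repeatability
edu <- c(1, 1, 2, 3, 3, 3, 5)      # made-up ordinal codes

j_small <- jitter(edu, factor = 0.25)   # barely moved
j_large <- jitter(edu, factor = 2)      # moved more, still near the original level

max(abs(j_small - edu))   # largest shift with the small factor
max(abs(j_large - edu))   # largest shift with the large factor
round(j_large)            # rounding gives back the original codes
```

Jittering is for display only; the analysis should still use the unjittered variable.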
Scatterplots: AGE & EDUCATION
[Figure: "Jitter the Ordinal Variable": Education (<HS, HS, HS+, degree, grad+) by Age (years)]
Bubble Plot: Helpful with Overplotting
If you can dream of a type of plot, you can create it!
# aggregate the data
bfiAag = aggregate(bfiA,
                   by = list(bfiA$A1,
                             bfiA$A2),
                   length)
# circle's area ~ number of points
symbols(bfiAag$Group.1,
        bfiAag$Group.2,
        circles = sqrt(bfiAag$A1/pi)/50,
        inches  = FALSE,
        main    = "Bubble Plot",
        xlab    = "item A1",
        ylab    = "item A2")
[Figure: "Bubble Plot": item A2 by item A1 (both 1 to 6), circle size showing the number of subjects at each combination]
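The aggregate-then-plot idea can be followed on a handful of made-up points. This sketch uses the formula form of aggregate and a helper column n of ones, which is a different route from the slide's call; the names toy, a1, a2 and n are illustrative.

```r
# Made-up pairs of responses; some combinations repeat
toy <- data.frame(a1 = c(1, 1, 2, 2, 2, 6),
                  a2 = c(5, 5, 4, 4, 6, 1))
toy$n <- 1                      # each row counts as one subject

# add up the ones within every (a1, a2) combination
agg <- aggregate(n ~ a1 + a2, data = toy, FUN = sum)
agg

# radius = sqrt(count / pi), so the circle's AREA is proportional to the count
symbols(agg$a1, agg$a2,
        circles = sqrt(agg$n / pi) / 5,
        inches  = FALSE,
        xlab    = "a1",
        ylab    = "a2")
```

The divisor after the square root (5 here, 50 on the slide) only rescales the circles to suit the axes.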
Motor Trend Car Road Tests
The data was extracted from the 1974 Motor Trend US magazine, and
comprises fuel consumption and 10 aspects of automobile design and
performance for 32 automobiles (1973-74 models).
mpg Miles/(US) gallon
cyl Number of cylinders
disp Displacement (cu.in.)
hp Gross horsepower
drat Rear axle ratio
wt Weight (lb/1000)
qsec 1/4 mile time
vs V/S
am Transmission
gear Number of forward gears
carb Number of carburetors
Load car Package & the mtcars Data
# Load a New Package:
library(car)     # "Companion to Applied Regression" (a textbook)
data(mtcars)     # Make the Data Set Active in the Environment (mtcars ships with base R)
# check out the data
dim(mtcars)
## [1] 32 11
names(mtcars)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"
# set the categorical variables
mtcars$vs = factor(mtcars$vs, labels = c("v", "s"))
mtcars$am = factor(mtcars$am, labels = c("automatic", "manual"))
headTail(mtcars)
##                 mpg cyl disp  hp drat   wt  qsec   vs        am gear carb
## Mazda RX4        21   6  160 110  3.9 2.62 16.46    v    manual    4    4
## Mazda RX4 Wag    21   6  160 110  3.9 2.88 17.02    v    manual    4    4
## Datsun 710     22.8   4  108  93 3.85 2.32 18.61    s    manual    4    1
## Hornet 4 Drive 21.4   6  258 110 3.08 3.21 19.44    s automatic    3    1
## ...             ... ...  ... ...  ...  ...   ... <NA>      <NA>  ...  ...
## Ford Pantera L 15.8   8  351 264 4.22 3.17  14.5    v    manual    5    4
## Ferrari Dino   19.7   6  145 175 3.62 2.77  15.5    v    manual    5    6
## Maserati Bora    15   8  301 335 3.54 3.57  14.6    v    manual    5    8
## Volvo 142E     21.4   4  121 109 4.11 2.78  18.6    s    manual    4    2
summary(mtcars)
##       mpg             cyl             disp             hp             drat
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0   Min.   :2.760
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0   Median :3.695
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7   Mean   :3.597
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0   Max.   :4.930
##        wt             qsec       vs             am          gear            carb
##  Min.   :1.513   Min.   :14.50   v:18   automatic:19   Min.   :3.000   Min.   :1.000
##  1st Qu.:2.581   1st Qu.:16.89   s:14   manual   :13   1st Qu.:3.000   1st Qu.:2.000
##  Median :3.325   Median :17.71                         Median :4.000   Median :2.000
##  Mean   :3.217   Mean   :17.85                         Mean   :3.688   Mean   :2.812
##  3rd Qu.:3.610   3rd Qu.:18.90                         3rd Qu.:4.000   3rd Qu.:4.000
##  Max.   :5.424   Max.   :22.90                         Max.   :5.000   Max.   :8.000
Test Central Differences in 2 Independent Groups
# find the means
describeBy(mtcars$mpg, mtcars$am)
## group: automatic
##   vars  n  mean   sd median trimmed  mad  min  max range skew kurtosis   se
## 1    1 19 17.15 3.83   17.3   17.12 3.11 10.4 24.4    14 0.01     -0.8 0.88
## ------------------------------------------------------------------
## group: manual
##   vars  n  mean   sd median trimmed  mad min  max range skew kurtosis   se
## 1    1 13 24.39 6.17   22.8   24.38 6.67  15 33.9  18.9 0.05    -1.46 1.71
# view the two groups side-by-side
boxplot(mpg ~ am, data = mtcars, horizontal = TRUE)
[Figure: horizontal boxplots of mpg (axis from 10 to 30) for the automatic and manual groups]
Test Central Differences in 2 Independent Groups
PARAMETRIC t-test for means, assumes normality
t.test(mpg ~ am, data = mtcars)
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group automatic    mean in group manual
##                17.14737                24.39231
NON-PARAMETRIC Mann-Whitney U Test, based on ranks
wilcox.test(mpg ~ am, data = mtcars)
##
## Wilcoxon rank sum test with continuity correction
##
## data: mpg by am
## W = 42, p-value = 0.001871
## alternative hypothesis: true location shift is not equal to 0
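The printed test results can also be pulled out of the returned object piece by piece. A sketch using the built-in mtcars data, copied to a new object named cars (an illustrative name) so the original is untouched; the pooled-variance version is shown only for comparison with the default Welch test.

```r
cars <- mtcars   # built-in copy of the Motor Trend data
cars$am <- factor(cars$am, labels = c("automatic", "manual"))

welch  <- t.test(mpg ~ am, data = cars)                    # default: unequal variances
pooled <- t.test(mpg ~ am, data = cars, var.equal = TRUE)  # classic pooled-variance test

welch$estimate     # the two group means
welch$conf.int     # 95% CI for the difference in means
welch$p.value      # matches the printed p-value
pooled$parameter   # degrees of freedom of the pooled test (n - 2)
```

The Welch test's fractional degrees of freedom (18.332 in the output above) come from not assuming equal variances; the pooled test uses 30.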
More than Two Groups?
# plot to investigate
boxplot(drat ~ cyl,
        data = mtcars,
        main = "Between vs. Within",
        xlab = "Number of Cylinders",
        ylab = "Rear Axle Ratio",
        col  = "light gray")
grid()
stripchart(drat ~ cyl,
           data     = mtcars,
           vertical = TRUE,
           method   = 'jitter',
           jitter   = 0.2,
           cex      = 1,
           pch      = 16,
           col      = c("red",
                        "blue",
                        "dark green"),
           add      = TRUE)
# we can use another package
library(beeswarm)
[Figure: "Between vs. Within": boxplots of Rear Axle Ratio (3.0 to 5.0) by Number of Cylinders (4, 6, 8), with the individual points overlaid]
ANOVA
# run the ANOVA
anova1 = aov(drat ~ cyl, data = mtcars)
summary(anova1)
##             Df Sum Sq Mean Sq F value   Pr(>F)
## cyl          1  4.342   4.342   28.81 8.24e-06 ***
## Residuals   30  4.521   0.151
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# to get type III sums of squares
Anova(anova1, type = "III")
## Anova Table (Type III tests)
##
## Response: drat
##             Sum Sq Df F value    Pr(>F)
## (Intercept) 57.217  1 379.714 < 2.2e-16 ***
## cyl          4.342  1  28.814 8.245e-06 ***
## Residuals    4.521 30
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
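In the output above, cyl uses 1 degree of freedom because it is stored as a number, so the model fits a single slope across 4, 6 and 8 cylinders. A sketch of the alternative, wrapping cyl in factor() so the three cylinder counts are compared as separate groups; this is offered as a comparison, not as a correction of the slide's results.

```r
anova_num <- aov(drat ~ cyl, data = mtcars)          # cyl as a number: one slope
anova_fac <- aov(drat ~ factor(cyl), data = mtcars)  # cyl as 3 groups: group means

df_num <- summary(anova_num)[[1]]$Df   # 1 for cyl, 30 for residuals
df_fac <- summary(anova_fac)[[1]]$Df   # 2 for cyl (3 groups - 1), 29 for residuals

df_num
df_fac
summary(anova_fac)
```

Which version is appropriate depends on whether a straight-line trend across cylinder counts is a reasonable assumption.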
ANCOVA
# add a continuous covariate
anova2 = aov(drat ~ cyl + wt, data = mtcars)
summary(anova2)
##             Df Sum Sq Mean Sq F value   Pr(>F)
## cyl          1  4.342   4.342  32.284 3.83e-06 ***
## wt           1  0.620   0.620   4.613   0.0402 *
## Residuals   29  3.900   0.134
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(anova2, type = "III")
## Anova Table (Type III tests)
##
## Response: drat
##             Sum Sq Df  F value  Pr(>F)
## (Intercept) 56.578  1 420.6933 < 2e-16 ***
## cyl          0.464  1   3.4493 0.07346 .
## wt           0.620  1   4.6129 0.04022 *
## Residuals    3.900 29
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Kruskal Wallis Test
# non-parametric version: uses ranks instead of means
kruskal.test(drat ~ cyl, data = mtcars)
##
## Kruskal-Wallis rank sum test
##
## data: drat by cyl
## Kruskal-Wallis chi-squared = 14.395, df = 2, p-value = 0.0007486
Simple Linear Regression: Fit Model
# Simple Linear Regression
linreg = lm(mpg ~ wt, data = mtcars)
slr = summary(linreg)
slr
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.5432 -2.3647 -0.1252  1.4096  6.8727
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
summary(linreg)$r.squared
## [1] 0.7528328
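The fitted coefficients turn into predictions by plugging a weight into intercept + slope * wt. A sketch using the built-in mtcars data for a car weighing 3 (thousand lb), a value chosen only for illustration, done by hand and then with predict():

```r
linreg <- lm(mpg ~ wt, data = mtcars)
b <- coef(linreg)     # named vector: (Intercept) and wt

# by hand: intercept + slope * weight
by_hand <- unname(b["(Intercept)"] + b["wt"] * 3)

# the same thing with predict(); newdata must use the predictor's name
by_predict <- unname(predict(linreg, newdata = data.frame(wt = 3)))

by_hand
by_predict
```

Both give about 21.25 miles per gallon, that is 37.2851 - 5.3445 * 3.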
Simple Linear Regression: Visualize the Fit
# Plot of relationship and least squares line
plot(mtcars$wt, mtcars$mpg)
abline(linreg, col = "red")
text(x = 2,
     y = 12,
     labels = bquote(~R^2 ==
                     .(round(slr$r.squared, 3))),
     col = "red")
text(x = 4.75,
     y = 30,
     labels = bquote(~adj-R^2 ==
                     .(round(slr$adj.r.squared, 3))),
     col = "blue")
title(main = "Linear Regression")
grid()
[Figure: "Linear Regression": mtcars$mpg by mtcars$wt with the least squares line, annotated R2 = 0.753 and adj − R2 = 0.745]
Introducing ggplot2
# a VERY COOL plotting package for next semester's workshop...
library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point() +
stat_smooth(method = "lm", col = "red") +
facet_grid(. ~ am) +
theme_bw()
[Figure: mpg by wt with fitted regression lines, in two side-by-side panels: automatic and manual]
Multiple Linear Regression: Fit the Model
# add several variables to the model
linreg2 = lm(mpg ~ wt + cyl + hp, data = mtcars)
summary(linreg2)
##
## Call:
## lm(formula = mpg ~ wt + cyl + hp, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.9290 -1.5598 -0.5311  1.1850  5.8986
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.75179    1.78686  21.687  < 2e-16 ***
## wt          -3.16697    0.74058  -4.276 0.000199 ***
## cyl         -0.94162    0.55092  -1.709 0.098480 .
## hp          -0.01804    0.01188  -1.519 0.140015
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared: 0.8431, Adjusted R-squared: 0.8263
## F-statistic: 50.17 on 3 and 28 DF, p-value: 2.184e-11
Multiple Linear Regression: Residual Diagnostics
[Figure: "Distribution of Studentized Residuals": Density by sresid (about −2 to 3)]
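The slide shows only the plot. One way such a picture could be produced with base R is sketched below; it assumes rstudent() for the studentized residuals, which may not be the function used for the slide.

```r
# refit the multiple regression on the built-in mtcars data
linreg2 <- lm(mpg ~ wt + cyl + hp, data = mtcars)

# studentized residuals: each residual scaled by its own estimated standard error
sresid <- rstudent(linreg2)

# histogram on the density scale, with a smoothed curve on top
hist(sresid,
     freq = FALSE,
     main = "Distribution of Studentized Residuals")
lines(density(sresid))
```

A roughly bell-shaped picture centred near zero is what the normality assumption would lead you to expect.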
Logistic Regression: Fit the Model
# run the logistic regression (outcome has 2 levels)
logreg = glm(am ~ mpg,
             data   = mtcars,
             family = binomial(link = "logit"))
summary(logreg)
##
## Call:
## glm(formula = am ~ mpg, family = binomial(link = "logit"), data = mtcars)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.5701  -0.7531  -0.4245   0.5866   2.0617
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -6.6035     2.3514  -2.808  0.00498 **
## mpg           0.3070     0.1148   2.673  0.00751 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 43.230  on 31  degrees of freedom
## Residual deviance: 29.675  on 30  degrees of freedom
## AIC: 33.675
##
## Number of Fisher Scoring iterations: 5
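Logistic coefficients are on the log-odds scale, so they are easier to read after exponentiating. A sketch using the built-in mtcars data, where am is coded 0 = automatic and 1 = manual; the value mpg = 20 is chosen only for illustration.

```r
logreg <- glm(am ~ mpg,
              data   = mtcars,
              family = binomial(link = "logit"))

# exp(coefficient) = odds ratio for a 1-unit increase in the predictor
odds_ratio <- exp(coef(logreg))
odds_ratio["mpg"]

# predicted probability of a manual transmission for a car getting 20 mpg
p20 <- predict(logreg, newdata = data.frame(mpg = 20), type = "response")
p20
```

Each extra mile per gallon multiplies the odds of a manual transmission by about 1.36, and the predicted probability at 20 mpg is about 0.39.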
Logistic Regression: Visualize the Fit
[Figure: "Motor Trend Car Road Tests": Transmission (Automatic vs. Manual, on a 0.0 to 1.0 scale) by Miles/(US) gallon]
Other Generalized Regression Models
# Can do other distributions and links
poisreg = glm(carb ~ hp,
              data   = mtcars,
              family = poisson(link="log"))
summary(poisreg)
##
## Call:
## glm(formula = carb ~ hp, family = poisson(link = "log"), data = mtcars)
##
## Deviance Residuals:
##      Min        1Q    Median        3Q       Max
## -0.86441  -0.55608  -0.07877   0.21395   1.49103
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.148971   0.265018   0.562    0.574
## hp          0.005517   0.001387   3.977 6.97e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
##     Null deviance: 27.043  on 31  degrees of freedom
## Residual deviance: 12.279  on 30  degrees of freedom
## AIC: 105.64
##
## Number of Fisher Scoring iterations: 4