R Introduction Workshop Slides
Introduction to R
  http://dataservices.gmu.edu/workshop/r
  1. Create a folder with your name on the S: Drive
  2. Copy titanic.csv from R Workshop Files to that folder

History
  R ≈ S ≈ S-Plus
  Open Source, Free
  http://www.r-project.org/
  CRAN = Comprehensive R Archive Network
  download R from: cran.rstudio.com
  RStudio: www.rstudio.com

R Console
  >   prompt for new command
  +   waiting for rest of command
  ↑ [Up] to get previous command
  History: Double-click to put in Console

  Type in everything in Courier New (only):
  3+2
  32

Script Window
  or File | New … | R Script
  Use # to write comments

Objects
  nine <- 9
  nine
  three <- 3
  three
  nine / 3
  my.school <- "gmu"
  my.school

  Historical Conventions
    Use <- to assign values
    Use . to separate names
  Current Capabilities
    = is okay now in most cases
    _ is okay now in most cases
  RStudio:
    Press Alt - (minus) to insert "assignment operator"
    Press Ctrl-Enter to run the current line

RStudio Panes
  Environment
  History: Double-click to put in Console
  Other Stuff: Files, Plots, Packages, Help

Packages
  Packages must be both Installed and Loaded
  To Install: install.packages("name")
  To Load: library( name )
    or, require( name )
    or, check the box
  Confirm these are installed: dplyr, tidyr, descr, ggplot2
  Screenshot labels: Install, Loaded, Installed

Functions
  read.table( datafile, header=TRUE, sep = ",")
    read.table         Function
    datafile           Positional Argument
    header=TRUE        Named Argument
    sep = ","          Named Argument
  titanic <- read.table( datafile, header=TRUE, sep = "," )
  becomes the convenience function…
  titanic <- read.csv( datafile )
  Get help for any function with ?
  ?read.csv
  titanic <- read.csv("S:/name/titanic.csv")
  (to use \ , type \\ )

Help
  read.csv or read.table
  titanic <- read.csv("S:/name/titanic.csv")

Object Types: Vectors & Lists
  numbers <- c(101,102,103,104,105)
  numbers <- 101:105              # the same
  numbers <- c(101:104,105)       # the same
  numbers[ 2 ]
  numbers[ c(2,4,5)]
  numbers[-c(2,4,5)]
  numbers[ numbers > 102 ]
  Vector / Variable

Data Frames
  str(titanic)                    # think structure
  int / num = Numeric (Interval / Ratio)
  Factor = Categorical (Nominal / Ordinal)
  titanic$pclass
  titanic <- read.csv("S:/name/titanic.csv", as.is = "name")

Object Types
  Numbers, Strings
  Vectors, Lists
  Data
    – data.frame
    – data.table (package)
    – dplyr (package): tbl_df, tbl_dt
  History Fact: Hadley Wickham, who created dplyr, works at RStudio

titanic
  titanic
  library(dplyr)
  titanic <- tbl_df(titanic)      # tbl_df() is deprecated in current dplyr; as_tibble(titanic) replaces it
  titanic
  str(titanic)

Factors - Categorical Variables
  titanic$pclass <- factor( titanic$pclass,
      levels = c(1,2,3),                                    # current values
      labels = c("1st Class", "2nd Class", "3rd Class"),    # labels in the same order
      ordered = TRUE )                                      # ordinal variable
  levels(titanic$embarked) <- c("", "Cherbourg","Queenstown","Southampton")

NA and NULL
  Delete Variable
    titanic$sibsp <- NULL
  Set Values to Missing
    titanic$age[titanic$age == 99] <- NA
  same thing while reading in data:
    titanic <- read.csv("S:/name/titanic.csv", na.strings = "99")
  Ignore NAs Option
    na.rm = TRUE

Review: Words with Stuff
  word             (Object)
  word[ stuff ]    (Object Part)
  word( stuff )    (Function)
  "word"           (String)
  Words that are not Objects
    TRUE or T
    FALSE or F
    NA     (Not Available)
    NaN    (Not a Number)
    NULL   (Empty)
    Inf    (Infinity)

Descriptive Statistics
  summary(titanic)
  descr Package
    library(descr)
    freq(titanic$pclass)
    freq(titanic$age)
    CrossTable(titanic$pclass, titanic$survived)
    CrossTable(titanic$pclass, titanic$survived,
               prop.t = F, prop.c = F, prop.r = T,    # T is default for all
               digits = 2 )

ggplot2
  library(ggplot2)
  qplot(pclass, fill=survived, data=titanic)
  titanic$survived <- factor(titanic$survived, labels = c("Died","Survived") )
  full documentation: http://ggplot2.org/
  alternative: lattice

More with qplot
  qplot(age, data=titanic)
  qplot(age, data=titanic, fill = survived, alpha = I(0.3), position = "identity")
  qplot(age,fare,data=titanic)
  qplot(age,fare,color=survived,data=titanic)

R Markdown
  Writing with R - Knitr
    – html, pdf, docx, slides
    – Descriptions with Code
    – Descriptions with Output
  Interactive Graphs
    – Shiny
    – ggvis

dplyr for data carpentry
  select     : Choose variables
  filter     : Choose cases
  mutate     : Change values
  summarize  : Aggregate values
  group_by   : Create groups
  arrange    : Order cases
  History Fact: update of plyr for data tables

Choose Variables
  base
    titanic$name
    titanic[,"name"]
    titanic[, names(titanic) != "name"]    # titanic[,-"name"] fails: unary minus does not work on column names
    titanic[,c("age","gender")]
  dplyr
    select(titanic, name)
    select(titanic, -name)
    select(titanic, age, gender)
    select(titanic, gender:pclass)
    select(titanic, starts_with("p"))
    helpers: contains, starts_with, ends_with, matches
    also: distinct

Choose Cases
  base
    titanic[titanic$age < 5 , ]
    attach(titanic)
    titanic[age < 5 , ]
    titanic[age < 5 & is.na(age) == F , ]
    titanic[(age<5|pclass==1)& is.na(age)==F , ]
  dplyr
    filter(titanic, age < 5 )
    filter(titanic, age < 5, pclass == 1 )
    filter(titanic, age < 5 | pclass == 1 )

Change data
  base
    titanic$child <- titanic$age <= 12
    titanic$totfam <- titanic$sibsp + titanic$parch
    titanic$bigfam <- titanic$totfam > 4
  dplyr
    titanic <- mutate(titanic, child = age<=12)
    titanic <- mutate(titanic,
                      totfam = sibsp + parch,
                      bigfam = totfam > 4 )

Chaining / Piping
  %>%    RStudio: Ctrl+Shift+M    Read "then…"
  select(titanic, name, age)
    vs
  titanic %>% select(name, age)
  titanic %>%
    filter(age<5) %>%
    select(name, age)
  works anytime the 1st argument is the dataset
  History Fact: from magrittr, originally %.%

Summarize
  base
    mean(titanic$age )
    mean(titanic$age , na.rm = T )
    sd(titanic$age , na.rm = T )
  dplyr
    summarize(titanic, xbar=mean(age, na.rm=T))
    summarize(titanic, n=n(), sd=sd(sibsp))

Other functions
  dplyr: group_by, arrange, bind_rows
  tidyr: spread, gather, separate
  Other packages from Hadley Wickham: lubridate, stringr

Pivot Table
  library(dplyr)
  library(tidyr)
  titanic %>%
    group_by(pclass, gender) %>%
    summarize( pct=mean(survived) ) %>%
    spread( gender, pct )

Statistical Analysis
  http://www.ats.ucla.edu/stat/r/whatstat/whatstat.htm

Writing Models
  y ~ x                      Simple Regression
  y ~ x1 + x2 + x1 : x2      2 variables + Interaction
  y ~ x1 * x2                2 variables + Interaction
  ~ n | c                    n by each group of c

  ~    Separates Y from X (e.g., "predicted from")
  +    adds another IV
  :    adds an interaction
  *    adds another IV plus the interaction
  |    creates subsets

  t.test( fare ~ gender, data = titanic )

Analysis Objects
  tt.aov <- aov( fare ~ gender*pclass, data = titanic )
  summary(tt.aov)
  tt.glm <- glm( survived ~ pclass + gender + age + child + gender*pclass,
                 family = binomial, data = titanic )
  summary(tt.glm)

More with Analysis Objects
  plot(tt.glm)
  tt.pred <- predict(tt.glm)
  tt.resid <- residuals(tt.glm)
  plot(tt.pred, tt.resid)
    compare to
  qplot(tt.pred, tt.resid)

What now?

Analysis Environments
  R Commander
    Separate Interface
    More/Better Statistics
    www.rcommander.com
    install.packages("Rcmdr")
    library(Rcmdr)
  Deducer
    Adds to R Interface (not RStudio!)
    Easier Data Management
    www.deducer.org
    install.packages("Deducer")
    library(Deducer)

Data Mining GUI
  install.packages("rattle")
  require("rattle")
  rattle()

Tutorials
  install.packages("swirl")
  require("swirl")
  install_from_swirl("Course")
  swirl()
  http://swirlstats.com

Tutorials
  http://dataservices.gmu.edu/software/r
  http://tryr.codeschool.com/
  https://www.datacamp.com/
  Coursera, EdX, HarvardX
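
Practice without the workshop files
  A minimal sketch for trying the base-R steps from the slides when titanic.csv is not at hand. It replays the vector indexing, factor, and NA examples on a small made-up data frame. The object name toy and all of its values are invented for illustration, and tapply is used here as a base-R stand-in for the dplyr group_by + summarize step so that no packages are required.

```r
# Toy data standing in for titanic.csv (hypothetical values, not real passengers)
toy <- data.frame(
  pclass   = c(1, 3, 2, 3, 1, 3),
  age      = c(29, 99, 4, 35, 99, 2),   # 99 is used as a "missing" placeholder
  survived = c(1, 0, 1, 0, 1, 0)
)

# Vector indexing: by position, by negative position, by logical condition
numbers <- 101:105
picked  <- numbers[c(2, 4, 5)]       # elements 2, 4 and 5
dropped <- numbers[-c(2, 4, 5)]      # everything except elements 2, 4 and 5
large   <- numbers[numbers > 102]    # elements greater than 102

# Ordered factor: levels are the current values, labels follow in the same order
toy$pclass <- factor(toy$pclass,
                     levels  = c(1, 2, 3),
                     labels  = c("1st Class", "2nd Class", "3rd Class"),
                     ordered = TRUE)

# Recode the placeholder 99 to missing, then ignore NAs when averaging
toy$age[toy$age == 99] <- NA
mean_age <- mean(toy$age, na.rm = TRUE)

# Choose cases in base R, excluding rows where age is NA
young <- toy[!is.na(toy$age) & toy$age < 5, ]

# Survival rate per class (base-R analogue of group_by + summarize)
rate <- tapply(toy$survived, toy$pclass, mean)
```

  Printing picked, mean_age, young, or rate at the console shows each result; swapping toy for the real titanic data frame should work the same way as long as the column names match.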