R Crash Course
Transcription
R Crash Course
Christophe {Lalanne, Pallier}
Cogmaster 2011
http://pallier.org/cours/stats.2011/

Outline
• Introduction
• Part 1: Working with Data
  • How to represent data?
  • A simple tossing experiment
  • From vectors to matrix and dataframe
  • Import/export facilities

What is R?

"Let's not kid ourselves: the most widely used piece of software for statistics is Excel." (Brian Ripley)

R is a statistical programming language for computing with data. Why R? Not just because it is free and available on all platforms, but because it is developed by an active community of statisticians and end-users inclined to reproducible research. This means good on-line help (mailing lists, wikis, package maintainers, the StackExchange community), continuously updated packages, top-notch algorithms, etc.

How to interact with R?

You will need a "decent" text editor during this course. What do we expect from it? Basically, we want good syntax highlighting and support for sending commands to an R process; an internal help display system is a plus. The general idea will be to write R code, test it in the console, and keep the related parts of our statistical work in dedicated R files. The ultimate goal is to learn to manage a statistical project. See "How to efficiently manage a statistical analysis project?" on CrossValidated to learn more.

Emacs + ESS
http://ess.r-project.org/

RStudio
http://rstudio.org/

Additional packages

The core R packages might be sufficient in almost all the case studies we will encounter. Some recommended add-ons are listed below:
• Hmisc, a large set of utilities for importing/formatting data, power analysis, clustering, tabular output.
• rms, everything we will ever need for regression models.
• plyr, for data manipulation (reshaping, aggregating, etc.).
• ggplot2, a high-level visualization backend.
To install and update packages, the commands are install.packages and update.packages (they default to CRAN, but take a look at http://r-forge.r-project.org/).

Additional resources

• Tutorials, textbooks, etc.: see the CRAN docs section, http://cran.r-project.org/other-docs.html
• Getting help specific to R programming: the R mailing lists, http://stackoverflow.com
• Getting help specific to stats (language agnostic): http://crossvalidated.com
Please read the relevant FAQ carefully before asking questions.

Aims and scope of this course

The objective is to teach you how to do statistical analysis on your own. This means knowing a little bit of the R language itself, as well as having a good understanding of statistical theory and common techniques of data analysis. You'll need to practice a lot! This session is all about the R language, but with some incursions into statistical concepts like sampling distributions, likelihood theory, and parameter estimation.

Let's get started

R knows basic arithmetic

We can use R as a "simple" calculator.

> seq(1, 9, by=2)^2
[1]  1  9 25 49 81
> x <- seq(-2, 2)
> t(x) %*% x * 1/(length(x)-1)
     [,1]
[1,]  2.5
> rm(x)
> x <- sample(1:10, 20, replace=TRUE)
> x[ceiling(length(x)/2)]
[1] 6

Tail or Head?

Imagine a simple experiment consisting in tossing a coin.

> x <- c(0,1,1,0,1,0,1,0,0)
> x[3]
[1] 1
> x[length(x)]
[1] 0
> x[2:4]
[1] 1 1 0
> x[c(2,4)]
[1] 1 0
> x[seq(1, length(x), by=2)]
[1] 0 1 1 1 0
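Two more standard addressing idioms are worth mentioning as an aside (they are not on the original slides): negative indexes drop elements, and logical masks select them. A minimal sketch reusing the coin-toss vector above:

> x <- c(0,1,1,0,1,0,1,0,0)
> x[-1]        # a negative index drops elements
[1] 1 1 0 1 0 1 0 0
> x[x == 1]    # a logical mask keeps only the matching elements
[1] 1 1 1 1
> rev(x)[1]    # another way to reach the last element
[1] 0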
R vectors

Events were stored in a vector, which constitutes the base R object. We've already seen how to address its values with slicing patterns. What about their internal properties? There are three fundamental types: numeric, character, and logical.

> is.vector(x)
[1] TRUE
> is.numeric(x)
[1] TRUE
> cat(class(x), typeof(x), "\n")
numeric double
> x <- c(0L,1L,1L,0L,1L,0L,1L,0L,0L)
> cat(class(x), typeof(x), "\n")
integer integer

More on addressing vector elements 1

Consider the following "named" vector, x:

> x <- seq(1, 6)
> names(x) <- letters[1:6]
> y <- rep(c(T,F), 3)
> x[x == 2] == x["b"]
   b
TRUE
> x[y]
a c e
1 3 5
> which(x %in% y)
[1] 1

More on addressing vector elements 2

We'll see more dictionary-like facilities in the sequel of this tutorial. We can use boolean operators too.

> z1 <- sample(c(T,F), 6, replace=TRUE)
> z2 <- sample(c(T,F), 6, replace=TRUE)
> z2
[1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE
> z1 & z2
[1] FALSE FALSE  TRUE FALSE FALSE FALSE
> x[x < 5 & z2]
a b c
1 2 3

Missing values

Some data may be missing. R uses NA to denote them.

> xc <- x
> xc[sample(1:6, 2)] <- NA
> is.na(xc)
    a     b     c     d     e     f
 TRUE  TRUE FALSE FALSE FALSE FALSE
> which(is.na(xc))
a b
1 2
> xc[is.na(xc)] <- "."
> print(xc)
  a   b   c   d   e   f
"." "." "3" "4" "5" "6"

Sorting

It is also possible to sort vectors, and to work with ordered indexes.

> xs <- sample(x)
> sort(xs, decreasing=TRUE)
f e d c b a
6 5 4 3 2 1
> xs[order(xs)]
a b c d e f
1 2 3 4 5 6
> rank(rev(xs))
b f c a e d
2 6 3 1 5 4

Tail or Head again

A better way to simulate this process might be

> x <- sample(c(0,1), 100, rep=T)
> table(x)
x
 0  1
48 52
> head(ifelse(x==0, "T", "H"))
[1] "T" "T" "H" "T" "T" "T"
> sum(as.logical(x))
[1] 52

Is there another way to generate such artificial data?

Probability distribution 1

We could use a very simple probability law, namely the uniform distribution ("all outcomes equally probable") on the [0, 1] interval, and filter those results that are above or below its centre point. Here's a possible solution:

> x <- runif(100)
> print(length(x[x < .5]))
[1] 52
> # do it again, we'll get a different result
> x <- runif(100)
> xb <- ifelse(x < .5, 0, 1)
> table(xb)
xb
 0  1
57 43

Probability distribution 2

Look up the on-line R help:

> apropos("unif")
> help.search("unif")
> if (require(sos)) findFn("unif")

Probability distribution 3

But we know that such events each follow a Bernoulli distribution and, provided the trials are independent, the expected number of tails will follow a Binomial distribution; see

> help(rbinom)

[Figure: "Tossing a coin" — five simulated sequences of 10 trials each (success/failure), with observed success proportions 6/10, 7/10, 5/10, 6/10, and 3/10.]

The tossing experiment revisited 1

Now, we can rewrite our little experiment.

> n <- seq(10, 100, by=10)
> res <- numeric(length(n))
> for (i in seq_along(n))
+   res[i] <- table(rbinom(n[i], 1, .5))["1"]
> names(res) <- n
> print(round(res/n, 2))
  10   20   30   40   50   60   70   80   90  100
0.60 0.55 0.43 0.42 0.54 0.40 0.50 0.46 0.43 0.46

The tossing experiment revisited 2

Please note that these are the results of a single run each. Do those results allow us to draw a reliable conclusion about the expected number of heads? How about increasing n? How about trying this experiment again? How about running it 100 times and calculating the average number of tails?

The tossing experiment revisited 3

Here are two theoretical distributions:

[Figure: two theoretical density curves of the number of successes, plotted over 5 to 20 successes.]

You tossed a coin 20 times; you got 13 successes. Would you say this is a fair coin? Are those results likely to come from one of the above distributions?
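The slides leave this question open. As a hint (our own addition, not from the deck), base R can compute how surprising 13 successes out of 20 would be under a fair coin, using pbinom() for the upper-tail probability or binom.test() for an exact test:

> # P(X >= 13) when X ~ B(20, 0.5)
> pbinom(12, size=20, prob=0.5, lower.tail=FALSE)
[1] 0.131588
> # exact two-sided test of p = 0.5; by symmetry the p-value is
> # 2 * 0.131588, about 0.26, so 13 successes give no real evidence
> # against a fair coin
> binom.test(13, 20, p=0.5)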
Grouping instructions together 1

We've already run a small simulation. Let's look at the following illustration of Marsaglia's polar method:

> n <- 500
> v <- w <- z <- numeric(n)
> for (i in 1:n) {
+   t <- runif(1)
+   u <- runif(1)
+   v[i] <- 2 * t - 1
+   w[i] <- 2 * u - 1
+   a <- v[i]^2 + w[i]^2
+   z[i] <- ifelse(a <= 1, a, NA)
+ }
> round(summary(z), 2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.00    0.23    0.46    0.49    0.74    1.00  111.00

Grouping instructions together 2

How about reusing the same code several times? It would be easier to call a single command with varying arguments. For that, we need to write a function. Here is a function for the tossing experiment:

> get.tails <- function(n=5, p=0.5) sum(rbinom(n, 1, p))

with some examples of use:

> get.tails()
> get.tails(20)
> get.tails(20, 0.75)
> get.tails(p=0.75)
> replicate(10, get.tails(n=100))
> sapply(10:20, get.tails)
> as.numeric(Map("get.tails", n=10:20))

Writing your own function

A function in R is composed of a list of arguments (with or without default values) and a body, which is a series of R instructions. Usually, a value is returned, either the last statement or one enclosed in a call to return(). Here is an example (what the function actually does doesn't matter):

> odds.ratio <- function(x, alpha=.05) {
+   if (!is.table(x)) x <- as.table(x)
+   pow <- function(x, a=-1) x^a
+   or <- (x[1,1] * x[2,2]) / (x[1,2] * x[2,1])
+   or.se <- sqrt(sum(pow(x)))
+   or.ci <- exp(log(or) + c(-1,1) * qnorm(1-alpha/2) * or.se)
+   return(list(OR=or, SE=or.se, CI=or.ci))
+ }

Some of R RNGs

Below are some probability laws that we will use throughout the course. Their parameters are reported in the notation column; for each law, R provides functions for random draws (r*), the distribution function (p*), quantiles (q*), and the density (d*).

Distribution  Notation    Random   CDF      Quantile  Density
Binomial      B(n, p)     rbinom   pbinom   qbinom    dbinom
Uniform       U[a, b]     runif    punif    qunif     dunif
Gaussian      N(µ, σ²)    rnorm    pnorm    qnorm     dnorm
Student       T(ν)        rt       pt       qt        dt
Chi square    χ²(ν)       rchisq   pchisq   qchisq    dchisq

Your turn

1. Convert Marsaglia's code to a function (and simplify the code, where possible); a possible sketch follows this list.
2. Using the code related to the tossing experiment, take n = 50 and generate 100 experiments, store the results, and calculate the mean and variance.
3. Simulate a Binomial experiment (0/1), and look up the index of the first "one". Using the same n and p, reiterate a large number of times, say 1,000; check whether the variance is approximately (1 − p)/p².
4. Simulate data from a Poisson process with a large n and λ < 0.1. Compare to what you would get using a Binomial distribution of parameter λ/n.
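A possible sketch for the first exercise (our own rewrite, not from the slides): since runif() is vectorized, the explicit loop in the slide's code can disappear entirely while producing the same kind of result:

> marsaglia <- function(n=500) {
+   v <- 2 * runif(n) - 1   # vectorized: no explicit loop needed
+   w <- 2 * runif(n) - 1
+   a <- v^2 + w^2          # squared distance to the origin
+   ifelse(a <= 1, a, NA)   # points outside the unit circle become NA
+ }
> round(summary(marsaglia()), 2)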
Recap'

You should now be familiar with
• how to create and manipulate vectors, and how to access their elements;
• how to access the help system;
• how to generate random draws from a discrete or continuous distribution;
• how to run a simple experiment with a for loop;
• how to write little functions.

Beyond vectors

Sometimes we need more than a single vector to represent our data. Then, we can rely on matrices.

> x1 <- rpois(5, 3.8)
> x2 <- rpois(5, 3.8)
> rbind(x1, x2)
   [,1] [,2] [,3] [,4] [,5]
x1    3    6    6    3    2
x2    3    4    4    2   12

Or

> matrix(c(x1, x2), nrow=2, byrow=TRUE)
     [,1] [,2] [,3] [,4] [,5]
[1,]    3    6    6    3    2
[2,]    3    4    4    2   12

Working with matrix

All the ways of addressing vector values work here too.

> m <- matrix(1:16, nc=4)
> m[3,2]
[1] 7
> m[,2]
[1] 5 6 7 8
> m[c(1,3), c(2,4)]
     [,1] [,2]
[1,]    5   13
[2,]    7   15
> dim(m)  # i.e., nrow(m) x ncol(m)
[1] 4 4

Row- and colwise operations

Matrices are interesting because we can apply a lot of operations by row or column. For example,

> m <- matrix(runif(100), nc=10)
> round(apply(m, 1, mean), 2)
[1] 0.47 0.54 0.54 0.38 0.59 0.58 0.43 0.50 0.46 0.61
> apply(m, 2, function(x) round(mean(x), 2))
[1] 0.57 0.55 0.47 0.36 0.53 0.51 0.61 0.44 0.52 0.52

Matrix objects will only hold values of the same type (e.g., all character or all numeric).

Exercise: compute the following summary for each column: min, max, 1st and 3rd quartile, median, and mean. See help(summary).

Yet another famous distribution

We can generate gaussian variates (i.e., realizations from the "normal distribution") using rnorm(). Say we want to fill a 100 × 10 matrix with such values (10 draws of 100 observations).

> res <- matrix(nr=100, nc=10)
> for (i in 1:10)
+   res[,i] <- rnorm(100, mean=12, sd=2)

A more efficient way to generate those data is to use the replicate() function.

Dataframe

Dataframes are matrix-like objects (more exactly, they are lists) but they can hold vectors of different types.

> d <- data.frame(x=1:2, y=4:1, g=rep(c("m","f"),2))
> dim(d)
[1] 4 3
> str(d)
'data.frame':   4 obs. of  3 variables:
 $ x: int  1 2 1 2
 $ y: int  4 3 2 1
 $ g: Factor w/ 2 levels "f","m": 2 1 2 1

They can be used to store the results of an experiment according to certain conditions, outcomes together with subjects' characteristics, etc.

More on dataframes 1

Accessing elements of a dataframe can be done as for a matrix, e.g.

> d[2,2]
[1] 3

We can also use named columns:

> d$y
[1] 4 3 2 1
> d$g[3]
[1] m
Levels: f m

There are special operations available for this kind of R object, like

> summary(d)

More on dataframes 2

       x             y        g
 Min.   :1.0   Min.   :1.00   f:2
 1st Qu.:1.0   1st Qu.:1.75   m:2
 Median :1.5   Median :2.50
 Mean   :1.5   Mean   :2.50
 3rd Qu.:2.0   3rd Qu.:3.25
 Max.   :2.0   Max.   :4.00

This is known as the long-format representation. Compare it to, e.g.

> m <- matrix(d$y, nc=2)
> colnames(m) <- rev(levels(d$g))
> rownames(m) <- unique(d$x)

Factors 1

Factors are categorical variables allowing us to establish a distinction between statistical units, experimental conditions, etc. They induce a partition of the dataset.

> # class(d$g)
> levels(d$g)
[1] "f" "m"
> unique(d$g)
[1] m f
Levels: f m
> relevel(d$g, ref="m")
[1] m f m f
Levels: m f

Factors 2

There are many ways to generate factors:

> print(f1 <- factor(rep(1:2, length=4)))
[1] 1 2 1 2
Levels: 1 2
> print(f2a <- as.factor(rep(c("a1", "a2"), each=2)))
[1] a1 a1 a2 a2
Levels: a1 a2
> print(f2b <- gl(2, 2, labels=c("a1","a2")))
[1] a1 a1 a2 a2
Levels: a1 a2

Illustration

Here are two series of height measurements from two classes:

> d <- data.frame(height=rnorm(40, 170, 10),
+                 class=sample(LETTERS[1:2], 40, rep=T))
> d$height[sample(1:40,1)] <- 220

[Figure: height (cm) against subject number for the 40 subjects, most points between roughly 160 and 190 cm, plus one outlier at 220 cm.]

Does the tallest subject come from class A or B? Which class shows more variability with respect to height?
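One way to answer these questions without looping over classes (our own addition, assuming the data frame d built above) is a groupwise summary with tapply():

> # groupwise mean and standard deviation of height
> with(d, tapply(height, class, mean))
> with(d, tapply(height, class, sd))
> # or both at once
> with(d, tapply(height, class, function(h) c(mean=mean(h), sd=sd(h))))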
Internal datasets

R comes with a lot of example datasets. They can be loaded using the data() command. Some of them are directly available from the base packages; others can be imported from additional packages. To list all available datasets, you can try:

> data()

If you already know the name of the dataset, you can use it directly, e.g.

> data(sleep)
> help(sleep)
> data(intake, package="ISwR")

Importing data 1

R can read data in a variety of proprietary formats (SPSS, Stata, SAS) and from "connections", but also from plain tab-delimited or csv files. Here is an example with a tab-delimited dataset:

> colcan <- read.table("./data/colon-cancer.txt", header=TRUE)
> head(colcan, 3)
  stages year  n
1  local 1974  5
2  local 1975  8
3  local 1976 10
> table(colcan$stages)
   local regional
      12       12
> with(colcan, tapply(n, stages, sum))
   local regional
     112       32

Importing data 2

And here is an example with an Excel file converted to csv:

> perinatal <- read.csv("./data/perinatal.csv")
> str(perinatal)
'data.frame':   74 obs. of  5 variables:
 $ weight : Factor w/ 37 levels "<800",">4300",..: 1 36 37 3 4 5 6 7 8 9 ...
 $ infants: Factor w/ 2 levels "black","white": 1 1 1 1 1 1 1 1 1 1 ...
 $ deaths : int  533 65 40 30 29 21 19 19 20 24 ...
 $ births : int  618 131 122 131 137 143 143 165 167 219 ...
 $ prob   : num  862 496 328 229 212 ...

Read carefully the default options of read.table(), read.csv(), and read.csv2().

Writing data 1

Sometimes, it might be interesting to save data or results to disk. We can use much the same kind of commands, e.g. write.table(). However, here are some possible alternatives for saving results to a file:

> n <- 100
> d <- data.frame(age=sample(20:35, n, rep=T),
+                 y=rnorm(n, 12, 2),
+                 gender=gl(2, n/2, labels=c("male", "female")))
> cat("Mean of y is:", mean(d$y), "\n", file="out1.txt")

Check your working directory with getwd(). Using save() and load(), you can work with R objects saved as a binary image.

Writing data 2

Or,

> sink("out2.txt")
> with(d, tapply(y, gender, mean))
    male   female
11.77841 11.92955
> sink()

Or,

> capture.output(summary(d), file="out3.txt")

To check that those files were saved, you can use:

> list.files(pattern="\\.txt$")

Your turn

1. The odds.ratio() function expects a 2 × 2 table or matrix. Check that these conditions are fulfilled at the beginning of the function body.
2. Write a summary-like function including the SD in addition to the five-number summary.
3. Load the sleep dataset and summarize the difference in hours of sleep between the two groups.
4. Load the USArrests dataset and summarize all numeric variables. Reorder the dataset by number of murders. Hint: see help(do.call).
5. Create a fake dataset comprising two columns of gaussian variates and a factor. Export it as a csv file.

Recap'

You should now be familiar with
• how to work with matrix and dataframe objects;
• how R represents qualitative or (ordered) discrete variables;
• how to avoid looping for row-, col-, and groupwise computations;
• how to read external data files;
• how to write intermediate results to your HD.

Some words of caution

Be careful with floating point arithmetic (1, 2).

> .1 == .3 / 3
[1] FALSE
> any(seq(0, 1, by=.1) == .3)
[1] FALSE

Factors are internally stored as numeric values, and levels are recorded in lexicographic order.

> as.numeric(f <- factor(1:2, levels=2:1)[1])
[1] 2
> as.numeric(levels(f))[f]
[1] 1
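A standard workaround (our addition, not on the slides) is to compare floating-point numbers up to a tolerance, e.g. with all.equal(), instead of testing exact equality with ==:

> isTRUE(all.equal(.1, .3 / 3))
[1] TRUE
> # or test against an explicit tolerance
> abs(seq(0, 1, by=.1) - .3) < 1e-8
 [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE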
References

1. P. Burns. The R Inferno, 2011. URL http://www.burns-stat.com.
2. D. Goldberg. What every computer scientist should know about floating-point arithmetic. ACM Computing Surveys, 23(1), 1991. URL http://www.validlab.com/goldberg.

List of R commands

&, any, apply, apropos, c, capture.output, cat, data, data.frame, factor, function, gl, head, help, help.search, ifelse, is.na, length, levels, Map, matrix, mean, names, numeric, order, rank, rbind, rbinom, read.csv, read.table, relevel, rep, replicate, rev, rnorm, round, rpois, runif, sample, sapply, seq, sink, sort, sum, t, table, tapply, unique

Last updated: September 28, 2011. LaTeX + pdfTeX and R 2.13.1.