The aroma.affymetrix package - on a notebook
Transcription
The aroma.affymetrix package - on a notebook
The aroma.affymetrix package How to analyze huge Affymetrix data sets in R on a notebook Henrik Bengtsson - [email protected] University of California, Berkeley Sept 28, 2006 Outline Introduction to the Affymetrix platform Description of data Computer systems & software More on the design of Affymetrix arrays Examples Ongoing work & Future directions Conclusions General discussion on computation on large data sets Outline Introduction to the Affymetrix platform Description of data Computer systems & software More on the design of Affymetrix arrays Examples Ongoing work & Future directions Conclusions General discussion on computation on large data sets Affymetrix chips Hybridization of target sequences to probes Target sequence: Perfect-match (PM) probe: ...GGTTACCATCGGTAAGTACTCAATGATTA... ATCATGCGCCATTCATGAGTTACTA Hybridization of target sequences to probes Target sequence: Perfect-match (PM) probe: Mismatch (MM) probe: ...GGTTACCATCGGTAAGTACTCAATGATTA... ATCATGCGCCATTCATGAGTTACTA ATCATGCGCCATACATGAGTTACTA Scanning & Image analysis Example array: 1600x1600 cells; 65536 intensity levels (16 bits). Scanning & Image analysis Example array: 1600x1600 cells; 65536 intensity levels (16 bits). Scanned image: 9x9 (+ cell margins) pixels/cell. Scanning & Image analysis Example array: 1600x1600 cells; 65536 intensity levels (16 bits). Scanned image: 9x9 (+ cell margins) pixels/cell. Analyzed image: (mean pixel, stddev pixel, #pixels). Outline Introduction to the Affymetrix platform Description of data Computer systems & software More on the design of Affymetrix arrays Examples Ongoing work & Future directions Conclusions General discussion on computation on large data sets Amount of data per array Affymetrix chip data is stored in “CEL” files. Per cell [10 bytes]: ◮ Average pixel intensity [float = 4 bytes = 40%] ◮ Std dev pixel intensity [float = 4 bytes = 40%] ◮ # pixels [integer = 2 bytes = 20%] With an array of 1600x1600 cells this sums up to 25.6 · 106 bytes = 24.4 MB/array1 . 1 1 kB = 1024 bytes, 1Mb = 1024 kB = 1048576 bytes. Example of different Affymetrix chips Chip type Hu6800 HG U95Av2 Mapping10K Xba142 HG-U133A HG-U133B Mapping10K Xba131 Mouse 430 v2 Mapping50K Hind240 Mapping50K Xba240 Mapping250K Nsp Mapping250K Sty HuEx-1 0-st-v2 Dimension 536x536 640x640 658x658 712x712 712x712 712x712 1002x1002 1600x1600 1600x1600 2560x2560 2560x2560 2560x2560 # cells 0.29 · 106 0.41 · 106 0.43 · 106 0.51 · 106 0.51 · 106 0.51 · 106 1.00 · 106 2.56 · 106 2.56 · 106 6.55 · 106 6.55 · 106 6.55 · 106 # Units 7129 12625 10208 22283 22645 11564 45101 57299 59015 262338 238378 1432154 *) Sizes of binary CEL files; ASCII CEL files are much larger. File size 2.9MB 3.9MB 4.1MB 4.8MB 4.8MB 4.8MB 9.6MB 24.4MB 24.4MB 62.5MB 62.5MB 62.5MB Example of data sets Some public data sets: Name Affymetrix CEPH 100K Affymetrix CEPH 500K # samples 90x2 chips 48x2 chips Chip type Mapping 100K Mapping 500K Size 4.5GB 6.0GB Signals 1.8GB 2.4GB Some data sets we’ve been working on: Name Slater 100K Affymetrix CEPH 500K Broad Institute 500K Affymetrix Services Laboratory Sinclair 500K WEHI Exon # samples 22+21 chips 270x2 chips 96x2 chips 190+154 chips 19+16 chips 35x2 chips Chip type Mapping 100K Mapping 500K Mapping 500K Mapping 500K Mapping 500K Human Exon Size 1.0GB 33.8GB 11.7GB 21.1GB 2.4GB 2.2GB Signals 0.4GB 13.5GB 4.7GB 8.4GB 1.0GB 0.9GB Some large data sets we know of: Name Sanger’s 500K # samples 15,000x2 chips Chip type Mapping 500K Size 1746GB Signals 698GB “Signals” = amount of RAM required for probe intensities only. Example of data sets Some public data sets: Name Affymetrix CEPH 100K Affymetrix CEPH 500K # samples 90x2 chips 48x2 chips Chip type Mapping 100K Mapping 500K Size 4.5GB 6.0GB Signals in R 3.6GB 4.8GB Some data sets we’ve been working on: Name Slater 100K Affymetrix CEPH 500K Broad Institute 500K Affymetrix Services Laboratory Sinclair 500K WEHI Exon # samples 22+21 chips 270x2 chips 96x2 chips 190+154 chips 19+16 chips 35x2 chips Chip type Mapping 100K Mapping 500K Mapping 500K Mapping 500K Mapping 500K Human Exon Size 1.0GB 33.8GB 11.7GB 21.1GB 2.4GB 2.2GB Signals in R 0.8GB 27.0GB 9.4GB 16.8GB 2.0GB 1.8GB Some large data sets we know of: Name Sanger’s 500K # samples 15,000x2 chips Chip type Mapping 500K Size 1746GB Signals in R 1396GB “Signals” = amount of RAM required for probe intensities only. Outline Introduction to the Affymetrix platform Description of data Computer systems & software More on the design of Affymetrix arrays Examples Ongoing work & Future directions Conclusions General discussion on computation on large data sets Computer systems Operating systems: Operating system Max address space Windows XP 4GB* Windows XP 64-bit 17 billion GB Linux 32-bit 4GB Linux 64-bit 17 billion GB *) A single application can only use 2GB. Hardware: Hardware limits the amount of memory to about 32-64 GB. Department of Statistics, UC Berkeley: The main computational Linux (64-bit) server has 16 GB RAM. The R software Overview of R: ◮ A free open source application (GPL). ◮ Great community forums. ◮ Widely used. Dominant in Bioinformatics applications. Overview of the language: ◮ All floating-point values are stored as double:s. ⇒ float (4 bytes) to double (8 bytes); 200% more RAM. ◮ Functional language (no pointers/reference variables) ⇒ there often 2-3 copies of a data object at any time. ◮ A workaround is to use environments (or the R.oo package) ⇒ one copy of each data object. Main issue is memory! (not only R) Outline Introduction to the Affymetrix platform Description of data Computer systems & software More on the design of Affymetrix arrays Examples Ongoing work & Future directions Conclusions General discussion on computation on large data sets Probeset design In gene-expression analysis, the “activity” (amount of RNA transcripts) of a single gene is measured by a about 10-50 probes each quering a small fraction of the genes DNA. A probeset (aka unit): 1 2 3 PM MM 30 31 32 Probeset design In gene-expression analysis, the “activity” (amount of RNA transcripts) of a single gene is measured by a about 10-50 probes each quering a small fraction of the genes DNA. A probeset (aka unit): 1 2 3 cell PM MM 30 31 32 Probeset design In gene-expression analysis, the “activity” (amount of RNA transcripts) of a single gene is measured by a about 10-50 probes each quering a small fraction of the genes DNA. A probeset (aka unit): 1 2 3 cell PM MM probe pair 30 31 32 Probeset design In gene-expression analysis, the “activity” (amount of RNA transcripts) of a single gene is measured by a about 10-50 probes each quering a small fraction of the genes DNA. A probeset (aka unit): 1 2 3 cell PM MM probe pair 30 31 32 Most of the modelling is done probeset by probeset. Thus this the #1 way we access data. There are 10,000s probesets on each array. Probe-level modelling (PLM) Ignoring the mismatch probes, model the PMs only: 1 2 3 PM 14 15 16 Consider a given SNP with PM probes k = 1, . . . , K and samples i = 1, . . . , I . The PLM used in RMA is: log yik = αi + βk + εik with PM signal yik , chip effect αi for sample i, probe affinity (sensitivity) βk for probe k, and random error ξik . Probe-level modelling (PLM) Ignoring the mismatch probes, model the PMs only: 1 2 3 PM 14 15 16 Consider a given SNP with PM probes k = 1, . . . , K and samples i = 1, . . . , I . The PLM used in RMA is: log yik = αi + βk + εik with PM signal yik , chip effect αi for sample i, probe affinity (sensitivity) βk for probe k, and random error ξik . The PLM used by dChip (MBEI) is: yik = θi · φk · ξik . Chip-definition files (CDFs) Probesets are defined in CDF files (one per chip type), e.g. Mapping250K Nsp.CDF. A fraction of this CDF: $ SNP_A-1782949:List of 3 ..$ type : int 2 ..$ direction: int 1 ..$ groups :List of 2 .. ..$ A:List of 5 .. .. ..$ x : int [1:12] .. .. ..$ y : int [1:12] .. .. ..$ pbase: chr [1:12] .. .. ..$ tbase: chr [1:12] .. .. ..$ expos: int [1:12] .. ..$ G:List of 5 .. .. ..$ x : int [1:12] .. .. ..$ y : int [1:12] .. .. ..$ pbase: chr [1:12] .. .. ..$ tbase: chr [1:12] .. .. ..$ expos: int [1:12] 651 652 458 457 940 939 ... 1772 1772 1388 1388 221 ... "c" "g" "a" "t" ... "g" "g" "a" "a" ... 13 13 15 15 16 16 17 17 ... 651 652 458 457 940 939 ... 1771 1771 1389 1389 220 ... "c" "g" "a" "t" ... "g" "g" "a" "a" ... 13 13 15 15 16 16 17 17 ... Outline Introduction to the Affymetrix platform Description of data Computer systems & software More on the design of Affymetrix arrays Examples Ongoing work & Future directions Conclusions General discussion on computation on large data sets Allelic cross-talk calibration (genotyping chips) PMA probe: PMB probe: ATCATGCGCCATCCATGAGTTACTA ATCATGCGCCATACATGAGTTACTA Allelic cross-talk calibration (genotyping chips) ATCATGCGCCATCCATGAGTTACTA ATCATGCGCCATACATGAGTTACTA 15000 5000 0 yA 25000 PMA probe: PMB probe: 0 5000 15000 yC 25000 Allelic cross-talk calibration An affine model for cross-talk between allele A and allele B is (ignoring sample index i) is: yA,j,k a WAA WAB xA,j,k ε = A,j + + A,j,k yB,j,k aB,j WBA WBB xB,j,k εB,j,k and in vector format yj = a + Wxj + εj We estimate this robustly using unpublished work by Wirapati & Speed (2002). Allelic cross-talk calibration > path <- findCelSet("SinclairA_etal_2006") > ds <- AffymetrixCelSet$fromFiles(path) > dsC <- calibrateAllelicCrosstalk(ds) Benchmarking: # arrays Chip type 90x2 Mapping 100K 270x2 Mapping 500K 15,000x2 Mapping 500K Total time 1:26h 5:30h 13.0 days* Time/array 28s 75s 75s* Overheads (approx.): Reading: 15%, Fitting: 50%, Writing: 30%. Allelic cross-talk calibration 25000 15000 5000 yA 15000 0 5000 0 yA 25000 (699,771) 1.000 0.035 0.121 0.959 0 5000 15000 yC before 25000 0 5000 15000 yC after 25000 Quantile normalization > path <- findCelSet("SinclairA_etal_2006") > ds <- AffymetrixCelSet$fromFiles(path) > dsN <- normalizeQuantile(ds) Calculating target distribution (averaging): # arrays Chip type Total time Time/array 90x2 Mapping 100K 0:07h 2.1s 270x2 Mapping 500K 1:45h 11.8s 15,000x2 Mapping 500K 4.1 days* 11.8s* Normalizing # arrays 90x2 270x2 15,000x2 arrays to target distribution: Chip type Total time Mapping 100K 0:55h Mapping 500K 9:20h Mapping 500K 21.5 days* Time/array 18s 62s 62s* Overheads (approx.): Reading: 10%, Fitting: 65%, Writing: 25%. Fitting RMA PLM for total copy numbers > path <- findCelSet("SinclairA_etal_2006") > ds <- AffymetrixCelSet$fromFiles(path) > model <- RmaCnPlm(ds, mergeStrands=TRUE, combineAlleles=TRUE) > fit(model) Benchmarking: # arrays Chip type 22+21 Mapping 100K 19+16 Mapping 500K 90x2 Mapping 100K 270x2 Mapping 500K 15,000x2 Mapping 500K Total time 1:00h 3:20h 7:15h 2.0 days* 119 days* Time/array & unit 1.4ms 1.3ms 1.3ms 1.3ms* 1.3ms* Overheads (approx.): Reading: 20%, Fitting: 15%, Writing: 65%. Fitting multiple copy-number PLMs at once path <- findCelSet("SinclairA_etal_2006") ds <- AffymetrixCelSet$fromFiles(path) models <- list( rma = RmaCnPlm(ds, mergeStrands=TRUE), mbei = MbeiCnPlm(ds, mergeStrands=TRUE), affine = AffineCnPlm(ds, mergeStrands=TRUE) } lapply(models, fit, units=1:5000) Note: Read data is cached ⇒ average reading time scales down. Displaying results Graphical output is still under development, but... User feedback - fit()[.ProbeLevelModel] worked perfectly. Ran rma on 18 mouse 430 2 chips [1002x1002 cells, 45101 units] in 14 minutes. > gc() used (Mb) gc trigger (Mb) max used (Mb) Ncells 1063975 28.5* 2403845 64.2 2403845 64.2* Vcells 957616 7.4* 4826040 36.9 15966139 121.9* Compare to memory usage for fitPLM(): > gc() used (Mb) gc trigger (Mb) max used (Mb) Ncells 2088619 55.8* 4953636 132.3 3950498 105.5* Vcells 42847107 326.9* 90835631 693.1 90538060 690.8* :) Cheers / Ken Software robustness All transformed data and parameter estimates are stored to file immediately (in chunks). This means: ◮ If/when R crashes (it happens!), or when there is a power failure, algorithms can pick from where they were interrupted. This is automagically taken care of by aroma.affymetrix. ◮ The algorithms may be interrupted in order to temporarily release computer resources for other needs. ◮ The algorithms can be restarted on a different host. Outline Introduction to the Affymetrix platform Description of data Computer systems & software More on the design of Affymetrix arrays Examples Ongoing work & Future directions Conclusions General discussion on computation on large data sets Interfacing to Bioconductor ◮ Pre-processed data is already stored as CEL files, which can be imported to Bioconductor (and other software). ◮ Ongoing: Porting algorithms to aroma.affymetrix. Most of this can be done by simple wrappers calling existing implementations, cf. fitPLM(), crlmm(), gcrma() etc. To do: Provide an eSet interface to the data classes; ◮ 1. Extract data in memory. 2. Virtual eSet class to still work with data on file (more tricky). Parallelization Since all data and parameter estimates are kept in a shared persistent memory (the file system), multiple processes/hosts can access the data and estimates at any time. Speed up: With N parallel hosts, total time T shrinks to ≈T /N, e.g. N = 20, T = 2.0 days ⇒ T /N = 2.4 hours. One process writing, multiple reading: With this setup there are no conflicts. All readers can access the estimates as soon as they are available. Examples: ◮ Visualizers, e.g. CN plots, SNP scatter plots. ◮ Progress bars. Multiple writing processes: A file-lock mechanism for writing is required (todo). Examples: ◮ Single-array calibration and normalization methods. ◮ Modelling of data subsets, e.g. chromsome by chromsome. Outline Introduction to the Affymetrix platform Description of data Computer systems & software More on the design of Affymetrix arrays Examples Ongoing work & Future directions Conclusions General discussion on computation on large data sets Conclusions ◮ We are now capable of analyzing very large data sets. ◮ Almost all models and algorithm we work with can be performed with bounded memory constraints. Advantages of storing “intermediate” data and estimates on the file system are: ◮ ◮ ◮ ◮ ◮ Standard CEL files: data is ready to be imported in other tools. Persistent memory: can be restarted after a software failure. Parallelization: Data and estimates can be shared and process by multiple hosts simultaneously. The package can easily be extended by other developers. Outline Introduction to the Affymetrix platform Description of data Computer systems & software More on the design of Affymetrix arrays Examples Ongoing work & Future directions Conclusions General discussion on computation on large data sets Suggestions ◮ ASCII (tab-delimited) data files are orders of magnitude slower to parse than binary files. ◮ Know how to use read.table(..., colClasses=...). ◮ Understand how data is kept in memory. ◮ Understand that data in matrices in R are stored as stacked columns, that is, it is more efficient (caching) to work column by column, rather than row by row. ◮ Understand how I/O of data can be optimized: contiguous data is much faster to access than scattered data. HDF5 National Center for Supercomputing Applications, University of Illinois at Urbana ◮ File format for storing scientific data: Primary objects: data sets and groups. A dataset is essentially a multidimensional array of data elements, and a group is a structure for organizing objects in an HDF5 file. ◮ Efficient storage and I/O: Meets data management needs of scientists and engineers working in high performance, data intensive computing environments. Compressed or chunked data. Read and write data efficiently on parallel computing systems. ◮ Large user community: Engineering, scientific, and other fields, ranging from computational fluid dynamics to film making. ◮ R package hdf5: Interface to the NCSA HDF5 library. Experimental. Package R.huge ◮ Provides in memory like access to extremely large-size data living on the file system, e.g. x[1:30,56:60] and x[939220+1:20,2] <- NA. ◮ Supported dimensions: vectors, matrices, and multi-dimensional arrays. ◮ Supported data types: byte (1 byte), single (2 bytes), integer (4 bytes), float (4 bytes), and double (8 bytes). ◮ Written using plain R; easy to extend. ◮ Experimental. ◮ Most of it’s usage in aroma.affymetrix have been replaced by I/O support of CEL/CDF files in order to ease migration of data and analysis. Package R.huge - Example > x <- FileByteMatrix("x.Rmatrix", nrow=1e6, ncol=1e4) > x [1] "FileByteMatrix: Pathname: ./x.Rmatrix. Opened: TRUE. File size: 10000004268 bytes (9.3 GB). Dimensions: 1e+06x 1e+04. Number of elements: 1e+10. Bytes per cell: 1." > x[939220+1:20,2] <- 1:40 > x[939220+5:8,1:3] [,1] [,2] [,3] [1,] 0 5 0 [2,] 0 6 0 [3,] 0 7 0 [4,] 0 8 0 Acknowledments In no specific order: ◮ James Bullard, UC Berkeley. ◮ Pratyaksha Wirapati, Swiss Cancer Research Center. ◮ Ben Bolstad, UC Berkeley. ◮ Rafael Irizarry, John Hopkins University. ◮ Ken Simpson, WEHI, Melbourne, Australia. ◮ Benilton Carvalho, John Hopkins University. ◮ Terry Speed, UC Berkeley/WEHI. ◮ Jan Holst, Lund University, Sweden. ◮ Kasper D. Hansen, UC Berkeley. ◮ Jane Fridlyand, UCSF. ◮ Ola H¨ ossjer, Stockholm University, Sweden. aroma.affymetrix is available at http://www.braju.com/R/. Converting an ASCII CDF to a binary CDF Example: Human Exon array with > 1.4·106 units. > cdf <- AffymetrixCdfFile$fromChipType("HuEx-1_0-st-v2") > cdf AffymetrixCdfFile: Filename: HuEx-1_0-st-v2.text.cdf Chip type: HuEx-1_0-st-v2 Number of units: 1432154 File size: 933.84 MB > cdf2 <- convert(cdf) > cdf2 AffymetrixCdfFile: Filename: HuEx-1_0-st-v2.cdf Chip type: HuEx-1_0-st-v2 Number of units: 1432154 File size: 376.78 MB