Practical RNA-seq Analysis - Bioinformatics and Research Computing

Transcription

Practical RNA-seq Analysis - Bioinformatics and Research Computing
Practical RNA-seq Analysis
Yanmei Huang
BaRC Hot Topics – March 17, 2015
Bioinformatics and Research Computing
Whitehead Institute
http://barc.wi.mit.edu/hot_topics/
Outline
•
•
•
•
•
•
•
RNA-seq overview
Before sequencing
QC
Hot Topic on Feb 12, 2015
Mapping
Quantifying
Today’s focus
Differential expression
Presenting the results
2
RNA-seq Overview
What is RNA-seq? - High throughput sequencing of RNA-derived cDNAs
‘Regular’ RNA-seq
total RNA, mRNA, ncRNA
Ribosome profiling
ribosome protected RNA fragments
RIP-seq, CLIP-seq, PAR-CLIP, iCLIP
protein-bound RNA fragments
Purpose? - To identify and quantify (same as microarray)
Advantages? (compared to microarray)
Identify:
Single nucleotide resolution
- RNA editing
- alternative splicing
- novel transcripts
Quantify: Bigger dynamic range
More robust and repeatable (arguably)
3
Before sequencing
Experimental design
Replication (especially biological replicates)
Multiplexing
Read length
Read depth
Paired or unpaired
Stranded or unstranded
Lab practice
“Garbage in, garbage out” rule applies!
Use only high quality input RNA for library prep
4
Data analysis pipeline
Quality Control
Mapping
Quantifying
Differential expression
Presenting the results
5
Quality control
• Check quality of file of raw reads
FastQC
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
See handout for fastqc command (step 1)
• Respond to QC analysis:
– Filter poor-quality reads
– Trim poor-quality positions
– Trim adapter and/or other vector
• Check quality of file of modified reads
• See previous Hot Topic:
NGS: Quality Control and Mapping Reads (Feb 12, 2015)
6
Quality control
7
Responding to quality issues
• Method 1:
– Keep all reads as is
– Map as many as possible
• Method 2:
– Drop all poor-quality reads
– Trim poor-quality bases
– Map only good-quality bases
• Which makes more sense for your experiment?
8
Mapping Reads
Various mappers for spliced alignment:
TopHat (recommended), STAR, GSNAP, MapSplice,
PALMapper,ReadsMap, GEM, PASS, GSTRUCT, BAGET…
Pär G Engström et al. Nature Methods | VOL.10 NO.12 | DECEMBER 2013
9
Spliced alignment with tophat2
http://ccb.jhu.edu/software/tophat/manual.shtml
Command: tophat [options] <genome_index_base> <reads.fastq>
- see handout “step 2” for sample tophat commands
Commonly considered parameters
quality score encoding
Find out from fastQC result
--segment-length
Shortest length of a spliced read that can map to one side of the
junction. default: 25
--no-novel-juncs
Only look at reads across junctions in the supplied GFF file
-G <GTF file>
Map reads to virtual transcriptome (from gtf file) first.
-N
max. number of mismatches in a read, default is 2
-o/--output-dir
default = tophat_out
--library-type
(fr-unstranded, fr-firststrand, fr-secondstrand)
-I/--max-intron-length
default: 500000
10
Mapping reads
• Inspect mapping result in a genome browser
• Remap if necessary
11
Quantifying Expression
• Different ways of quantifying expression level
for different purposes
Normalized count
Sample 1
Sample 2
Gene 1
RPKM
FPKM
Gene 2
Reads (Fragments) Per Kilobase per Million mapped reads
12
Counting methods
• htseq-count (recommended)
http://www-huber.embl.de/users/anders/HTSeq/doc/count.html
– Output is raw counts
• Cufflinks
http://cufflinks.cbcb.umd.edu
– Output is FPKM and related statistics
• Bedtools (intersectBed; coverageBed)
http://code.google.com/p/bedtools/
– Output is raw counts (but may need post-processing)
13
Using htseq-count
Command:
htseq-count [options] <alignment_file> <gff_file>
htseq-count "modes"
Frequently considered options default
-r <order>
name
-s <yes/no/reverse>
yes
-t <feature type>
exon
-m <mode>
union
-f, -i, -o, -q, -h …
…
Keep in mind:
Only unique alignments are counted!!
14
Differential Expression Analysis
• Normalization
• Estimating variability (dispersion)
15
Differential Expression Analysis
• Normalization
- problem with simply normalizing to total reads:
• Variation (dispersion of data)
Total reads
200
400
16
Preferred normalization method - geometirc
(implemented in DESeq, cuffdiff)
1. Construct a "reference sample" by taking, for each gene, the geometric mean
of the counts in all samples.
2. Calculate for each gene the quotient of the counts in your sample divided
by the counts of the reference sample.
3. Take the median of all the quotients to get the relative depth of the library
and use it as the scaling factor.
1
𝑥𝑖
Gene 1
Gene 2
Gene 3
Gene 4
Sample 1
2135 (447)
600
3150
378
> cds = estimateSizeFactors(cds)
> sizeFactors(cds)
Sample 1
Sample 2
4.7772242
1.0490870
𝑛
𝑛
𝑥𝑖
𝑖=1
Sample 2
615 (586)
58
1346
187
Sample 3
0.3697529
Sample 3
128 (346)
103
68
11
Sample 4
161(288)
189
88
22
Sample 4
0.5590669
Differential Expression Analysis
Dispersion affects ability to differentiate
x1
x
x2
x1
x
x2
18
Methods for estimating dispersion
implemented by DEseq
pooled
Each replicated condition is used to build a model, then these
models are averaged to provide a single global model for all
conditions in the experiment. (Default)
per-condition
Each replicated condition receives its own model. Only
available when all conditions have replicates.
blind
All samples are treated as replicates of a single global
"condition" and used to build one model.
pooled-CR
Fit models according to modelFormula and estimate the
dispersion by maximizing a Cox-Reid adjusted profile likelihood
(CR-APL)
19
Interpreting DESeq output
Gene ID
(from GTF file)
id
Mean
norm
counts
Mean
normalized
counts
for group
Fold
change
Log2
Raw
FDR
(fold
p-value p-value
change)
baseMe baseMe baseMe
log2(YRI
YRI/CEU
pval
an
an.CEU an.YRI
/CEU)
ENSG00000213442
74.71 149.06
0.36
ENSG00000138061
83.15 159.78
6.52
ENSG00000198618
11.71
23.42
0.00
ENSG00000134184
8.17
0.00
16.34
CEU_N CEU_N YRI_NA YRI_NA
CEU_N CEU_N YRI_NA YRI_NA
padj
A07357 A11881 18502_ 19200_
A07357 A11881 18502 19200
_norm _norm norm norm
8.13E- 1.55E169
18
13
5.61E- 5.36E0.04 -4.62
184
12
08
#NAME 1.51E0.00
0.00 31
?
06
0.00021
Inf
Inf
0.11 0
1
0.00
division by 0
Normalized counts =
raw / (size factor)
Raw counts
-8.69
145
1
0
148.28 149.85
0.72
0
153
10
4
161.44 158.12
7.2
5.83
19
0
0
27.2
19.64
0
0
0
15
15
0
0
10.8
21.88
-Inf
sizeFactors (from DESeq):
CEU_NA07357 CEU_NA11881 YRI_NA18502 YRI_NA19200
1.1397535
0.9676201
1.3891488
0.6856959
20
Presenting results
• What do you want to show?
• All-gene scatterplots can be helpful to
– See level and fold-change ranges
– Identify sensible thresholds
– Hint at data or analysis problems
• Heatmaps are useful if many conditions are being compared but
only for gene subsets
• Output normalized read counts with same method used for DE
statistics
• Whenever one gene is especially important,
look at the mapped reads in a genome browser
21
Scatterplots
Standard scatterplot
MA (ratio-intensity) plot
22
Clustering and heatmap
Softwares:
Cluster 3.0:
•Log-transform
•Mean center
•Cluster
Java TreeView:
•Visualize
•Export
Illustrator
•Assemble
Liang et al. BMC Genomics 2010, 11:173
23
Further mining of the data
Find enrichment in descriptive terms, regulatory pathways and
networks:
Commercial
Ingenuity Pathway analysis
Pathway studio
GeneGo
…
Public domain
DAVID
GSEA
…
24
Summary
•
•
•
•
•
Experimental design
Quality control (fastqc)
Mapping spliced reads (tophat)
Counting gene levels (htseq-count)
Identifying "differentially expressed" genes (DESeq R
package)
• Exploring your results
Slides and exercise materials are available at
http://jura.wi.mit.edu/bio/education/hot_topics/
More practical commands and procedures available at
http://barcwiki.wi.mit.edu/wiki/SOPs
25