Title Estimation and analysis of gene expression and

Transcription

Provided by the author(s) and NUI Galway in accordance with publisher policies. Please cite the published
version when available.
Title
Author(s)
Estimation and analysis of gene expression and alternative
splicing: perspectives from development and disease
Korir, Paul Kibet
Publication
Date
2014-01-30
Item record
http://hdl.handle.net/10379/4301
Downloaded 2016-10-29T13:56:54Z
Some rights reserved. For more information, please see the item record link above.
NATIONAL UNIVERSITY OF IRELAND, GALWAY
Estimation and Analysis of Gene
Expression and Alternative Splicing:
Perspectives from Development and
Disease
by
Paul Kibet Korir
A thesis submitted in partial fulfillment for the degree of
Doctor of Philosophy
in the
College of Science
School of Mathematics, Statistics and Applied Mathematics
Supervised by Professor Cathal Seoighe
Co-supervised by Dr. Tim Downing
April 2014
Declaration of Authorship
I, Paul Kibet Korir, declare that this thesis titled, ‘Estimation and Analysis of Gene
Expression and Alternative Splicing: Perspectives from Development and Disease’ and
the work presented in it are my own. I confirm that:
This work was done wholly or mainly while in candidature for a research degree
at this University.
Where any part of this thesis has previously been submitted for a degree or any
other qualification at this University or any other institution, this has been clearly
stated.
Where I have consulted the published work of others, this is always clearly attributed.
Where I have quoted from the work of others, the source is always given. With
the exception of such quotations, this thesis is entirely my own work.
I have acknowledged all main sources of help.
Where the thesis is based on work done by myself jointly with others, I have made
clear exactly what was done by others and what I have contributed myself.
Signed:
Date:
15/04/2014
iii
“gloria Dei celare verbum et gloria regum investigare sermonem”
Proverbia 25:2 (VULGATE)
Abstract
The development of high-throughput genomics technologies has contributed substantially to the understanding of gene expression regulation. With the growing appreciation
of the importance of alternative splicing, quantitative techniques have had to keep in
step with the demand for an accurate and high resolution view of the transcriptome. In
this thesis, we use results and methods from quantitative genomics to explore how gene
regulation may be modified in development and disease.
Precise regulation of gene expression timing can be critical for some biological processes.
This is particularly the case for genes with oscillating patterns of expression. Oscillations can be brought about through negative feedback loops, with a delay between gene
activation and negative autoregulation. The time required for gene transcription contributes to the delay in gene activation and, thus, the intron content of genes involved in
negative autoregulatory loops can be functionally significant. An example of this occurs
in Hes7, in which oscillation is coupled to the formation of segmental body plans during
animal development. To identify further examples of genes in which the transcriptional
delay introduced by introns may be functionally significant, we carried out a search for
genes with conserved intron content across a diverse panel of 19 mammals and found
that the set of genes with the most extreme conservation was enriched for genes involved
in embryonic development. We found that these genes had both fewer insertions and
deletions as well as a balance between the cumulative insertions and deletions, suggesting that selection functions to prevent indels in these introns and to balance the impact
of insertions and deletions.
There has been considerable success in mapping local (cis) variants associated with
phenotypes such as disease. Many cis-acting variants that cause disease disrupt splicing.
However, mapping distant (trans) acting variants that affect splicing is a more formidable
task. By exploiting high-density transcriptome microarrays, we show that a mutation in
the splicing factor PRPF8, causally associated with retinitis pigmentosa, is a trans-acting
splicing variant. We estimate that up to 20% of all exons are mis-spliced through higher
exon inclusion in affected individuals. Characteristics of affected exons suggest that they
tend to be spliced co-transcriptionally and via the exon-defined splicing pathway.
The importance of gene expression microarrays in quantitative genomics has led to the
development of numerous algorithms to estimate gene expression from raw microarray intensities. But microarrays have several shortcomings relative to more recently
developed sequencing-based methods for measuring gene expression. We exploited the
benefits of quantitative transcriptome sequencing (RNA-Seq) by using a statistical learning approach to obtain better expression estimates from arrays, based on a high-quality
dataset for which both microarray and RNA-Seq data are available. Our analyses show
that this approach compares favourably to existing algorithms for microarray analysis, with the added advantage of providing estimates of the abundance of individual
transcript isoforms on an absolute scale.
Acknowledgements
I am, first and foremost, grateful to Almighty God, Maker of all things, for the strength
to complete this work. I count it as non-trivial that during the study period I have
retained excellent health. The words of King David echo my sentiments when he writes,
The Lord is the portion of mine inheritance and of my cup: thou maintainest my lot.
The lines are fallen unto me in pleasant places; yea, I have a goodly heritage.
I will bless the Lord, who hath given me counsel: my reins also instruct me in the night
seasons.
I have set the Lord always before me: because he is at my right hand, I shall not be
moved.
Therefore my heart is glad, and my glory rejoiceth: my flesh also shall rest in hope.
Psalm 16:5-9 (KJV)
A million thanks go to Prof. Cathal Seoighe, my main supervisor for being a superb
advisor through this journey. I greatly admire his discipline, integrity, breadth of knowledge, and above all his patience. I have learned so much from him and he has left a
deep mark on me that I hope to live up to. I also thank Dr. Tim Downing, my cosupervisor for his numerous helpful comments. I will forever remember how he patiently
read through some of the early drafts even when he was a new member of staff having
not seen most of this material before.
I cannot say thank you enough to Maureen Olembo, my fiancée and friend. I am
especially grateful for keeping me accountable, and encouraging me and praying with
me through the thick and thin of the last few years. Words are not enough, Mo!
I am also deeply indebted to the staff and fellow graduates students both in the bioinformatics research group and the wider School of Maths community. Many stimulating
discussions have passed between us and it has been this atmosphere of inquisitiveness and
curiosity that that I have deeply enjoyed. In particular, I would like to single out Paul
Geeleher, Thong Nguyen and Peter Keane for many helpful debates and discussions—not
all of which ended with agreement!
Lastly, but certainly not least, I owe an enormous debt of gratitude to my family for
always being there, in particular my sister Winnie Nasurutia for her frequent phone calls
and for the Galway Christian Fellowship for their prayers and support.
Paul Kibet Korir
Galway
January 2014
ix
Contents
Declaration of Authorship
iii
Abstract
vii
Acknowledgements
ix
List of Figures
xv
List of Tables
xxiii
1 Introduction
1
2 Introns, Splicing and Alternative Splicing
2.1 Introns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.1.1 Prevalence, Designations and Structural Characterisation
2.1.2 Origin and Evolution . . . . . . . . . . . . . . . . . . . . .
2.1.3 Functions of Introns . . . . . . . . . . . . . . . . . . . . .
2.2 Splicing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.2.1 The Spliceosome . . . . . . . . . . . . . . . . . . . . . . .
2.2.2 The Splicing Pathway . . . . . . . . . . . . . . . . . . . .
2.2.3 Transesterification Reactions in Splicing . . . . . . . . . .
2.2.4 Other Splicing Pathways . . . . . . . . . . . . . . . . . . .
2.2.5 Exon-defined Splicing . . . . . . . . . . . . . . . . . . . .
2.2.6 ‘Extreme’ Splicing . . . . . . . . . . . . . . . . . . . . . .
2.2.7 Timing of Splicing Relative to Transcription Termination
2.3 Alternative Splicing . . . . . . . . . . . . . . . . . . . . . . . . .
2.3.1 Prevalence . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.3.2 Origin and Evolution . . . . . . . . . . . . . . . . . . . . .
2.3.3 Functions of Alternative Splicing . . . . . . . . . . . . . .
2.3.4 Modulation of Alternative Splicing . . . . . . . . . . . . .
2.3.5 Stochasticity of Alternative Splicing . . . . . . . . . . . .
2.4 Measurement of Transcript Expression . . . . . . . . . . . . . . .
2.4.1 Genes . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.4.2 Transcript Isoforms . . . . . . . . . . . . . . . . . . . . . .
xi
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
5
5
6
11
12
14
15
16
18
19
22
24
27
28
30
31
33
35
39
41
41
43
Contents
xii
3 Evidence for intron length conservation in a set of mammalian genes
associated with embryonic development
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.2 Data and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.2.1 Randomization study of intron length evolution along the human
lineage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.3 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.3.1 Intronic insertions and deletions along lineages leading to human
and mouse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.3.2 Contribution of conserved sequence elements to intron length conservation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4 A mutation in a splicing factor that causes retinitis pigmentosa has
transcriptome-wide effect on mRNA splicing
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.2.1 Exon array sample preparation and data generation . . . . . . .
4.2.2 Expression quantification of microarrays . . . . . . . . . . . . . .
4.2.3 Characterisation of differentially spliced exons . . . . . . . . . . .
4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.3.1 Transcriptome-wide perturbation of splicing . . . . . . . . . . .
4.3.2 Differential splicing primarily involves higher inclusion of exons in
cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.3.3 Characterisation of differentially included exons . . . . . . . . . .
4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
47
48
49
50
51
54
56
57
a
.
.
.
.
.
.
.
59
60
62
62
63
64
64
64
.
.
.
.
66
66
70
73
5 Seq-ing improved gene expression estimates from microarrays using
machine learning
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5.2.1 GTEx data preparation . . . . . . . . . . . . . . . . . . . . . . . .
5.2.2 Selecting and tuning the best learning algorithm . . . . . . . . . .
5.2.3 Effect of training set size . . . . . . . . . . . . . . . . . . . . . . .
5.2.4 Differential gene expression (DGE) between GTEx tissues . . . . .
5.2.5 Application to archived samples . . . . . . . . . . . . . . . . . . .
5.2.6 Application to the Affymetrix tissue mixture dataset . . . . . . . .
5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5.3.1 Correlation with RNA-Seq . . . . . . . . . . . . . . . . . . . . . . .
5.3.2 Restricting to well-estimated genes . . . . . . . . . . . . . . . . . .
5.3.3 Inference of differential expression . . . . . . . . . . . . . . . . . .
5.3.4 Estimation of transcript isoform expression . . . . . . . . . . . . .
5.3.5 Application of MaLTE trained on GTEx data to independent microarray datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5.3.6 Evaluation of MaLTE applied to a controlled tissue mixture dataset
5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
75
76
77
77
78
79
79
80
81
81
82
85
87
87
88
90
91
Contents
xiii
6 Differential expression analysis on retinitis pigmentosa samples using
RNA-Seq
93
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.2 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.2.1 RNA-Seq . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.2.2 RNA-Seq data analysis . . . . . . . . . . . . . . . . . . . . . . . . 95
6.2.3 Preparation of MaLTE and RMA expression estimates . . . . . . . 95
6.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.3.1 Quality assessment of RNA-Seq data . . . . . . . . . . . . . . . . . 96
6.3.2 Identification of retinitis pigmentosa mutations from RNA-Seq data103
6.3.3 Quality control optimisation . . . . . . . . . . . . . . . . . . . . . . 104
6.3.4 Comparison of RNA-Seq to MaLTE and RMA . . . . . . . . . . . 108
6.3.5 Comparison in differential expression analysis . . . . . . . . . . . . 110
6.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
7 Conclusion
7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . .
7.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . .
7.2.1 More emphasis on isoforms rather than genes .
7.2.2 From algorithms to better quantitative assays .
7.3 Future work . . . . . . . . . . . . . . . . . . . . . . . .
7.3.1 Conservation of introns . . . . . . . . . . . . .
7.3.2 Retinitis pigmentosa transcriptome . . . . . . .
7.3.3 Further development of the MaLTE framework
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
A Supplementary results for Chapter 4
A.1 Quality Assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
A.2 Sample Permutations . . . . . . . . . . . . . . . . . . . . . . . . . . . .
A.2.1 Core Probesets . . . . . . . . . . . . . . . . . . . . . . . . . . .
A.2.2 Full Probesets . . . . . . . . . . . . . . . . . . . . . . . . . . .
A.3 Additional Analyses . . . . . . . . . . . . . . . . . . . . . . . . . . . .
A.3.1 Splice Site Scores . . . . . . . . . . . . . . . . . . . . . . . . . .
A.3.2 Constitutive versus Alternative Splicing Analyses . . . . . . . .
A.3.3 U12-type Introns . . . . . . . . . . . . . . . . . . . . . . . . . .
A.4 Number of Differentially Included Probesets . . . . . . . . . . . . . . .
A.4.1 Differential Inclusion by Probeset Class . . . . . . . . . . . . .
A.4.2 Core Probesets . . . . . . . . . . . . . . . . . . . . . . . . . . .
A.4.3 Full Probesets . . . . . . . . . . . . . . . . . . . . . . . . . . .
A.4.4 Exonic Probesets . . . . . . . . . . . . . . . . . . . . . . . . . .
A.4.5 Intronic Probesets . . . . . . . . . . . . . . . . . . . . . . . . .
A.4.6 U12-type Intronic Probesets . . . . . . . . . . . . . . . . . . . .
A.5 Transcript Cluster Identifiers (Metaprobesets) Associated with PRPF8
A.6 Associated Genes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
A.7 Associated Exons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
A.7.1 Lower Inclusion in Cases . . . . . . . . . . . . . . . . . . . . . .
B Supplementary results for Chapter 5
.
.
.
.
.
.
.
.
117
. 117
. 118
. 118
. 118
. 119
. 119
. 119
. 120
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
121
. 122
. 123
. 123
. 124
. 125
. 125
. 125
. 126
. 128
. 128
. 129
. 130
. 131
. 132
. 133
. 133
. 134
. 134
. 141
143
Contents
B.1
B.2
B.3
B.4
B.5
xiv
Effect of training set size . . . . . . . . . . . . . . . . . . . . . . . . .
OOB filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Comparison of DGE results . . . . . . . . . . . . . . . . . . . . . . .
Cross-sample correlations for transcript isoform expression estimates
Slopes of regression between RNA-Seq and array methods assessed
Brain data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
B.6 Prediction of tissue-mixture proportions . . . . . . . . . . . . . . . .
C Use
C.1
C.2
C.3
C.4
Case: MaLTE Trained with GTEx Data
Introduction . . . . . . . . . . . . . . . . . . .
Obtaining Auxiliary Data and Scripts . . . .
Working Directory Structure . . . . . . . . .
Obtaining MaLTE . . . . . . . . . . . . . . .
C.4.1 Creating the file samples.txt . . . . . .
C.4.2 Obtaining the training data . . . . . .
C.5 Transforming exon array probes to gene array
C.6 Preparing the data for training-and-testing .
C.7 Prediction . . . . . . . . . . . . . . . . . . . .
C.8 Filtering by OOB . . . . . . . . . . . . . . . .
C.9 Collating predicted gene expression values . .
C.10 Experimental features . . . . . . . . . . . . .
C.10.1 Per-gene/transcript tuning . . . . . .
C.10.2 Incorporating principal components .
Bibliography
. .
. .
. .
. .
for
. .
. .
.
.
.
.
143
144
145
147
. 148
. 149
Applied on Exon Array 151
. . . . . . . . . . . . . . . . 151
. . . . . . . . . . . . . . . . 153
. . . . . . . . . . . . . . . . 154
. . . . . . . . . . . . . . . . 154
. . . . . . . . . . . . . . . . 155
. . . . . . . . . . . . . . . . 155
probes . . . . . . . . . . . . 156
. . . . . . . . . . . . . . . . 158
. . . . . . . . . . . . . . . . 158
. . . . . . . . . . . . . . . . 159
. . . . . . . . . . . . . . . . 159
. . . . . . . . . . . . . . . . 160
. . . . . . . . . . . . . . . . 160
. . . . . . . . . . . . . . . . 160
163
List of Figures
1.1
2.1
2.2
2.3
2.4
2.5
2.6
2.7
2.8
Electron micrograph and schematic of adenoviral hexon mRNA hybridised
to single-stranded genomic DNA selected using EcoR1 restriction enzyme
(region A; see [2] for full details). Loops A, B and C are genomic DNA
absent from mRNA and form loops between the hybridised regions. Image
reproduced from [2]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1
Distribution of spliceosomal introns in eukaryotes. A wide selection of organisms are indicated together with the average number of introns per
gene. Anopheles gambiae; Arabidopsis thaliana; Aspergillus nidulans;
Bigelowiella natans; Nucleomorph; Caenorhabditis briggsae; Caenorhabditis elegans; Candida albicans; Chlamydomonas reinhardtii; Ciona intestinalis; Cryptococcus neoformans; Cryptosporidium parvum; Cyanidioschyzon merolae; Dictyostelium discoideum; Drosophila melanogaster;
Encephalitozoon cuniculi; Giardia lamblia, Guillardia theta Nucleomorph;
Homo sapiens; Leishmania major; Mus musculus; Neurospora crassa;
Oryza sativa; Paramecium aurelia; Phanerochaete chrysosporium; Plasmodium falciparum; Plasmodium yoelii; Saccharomyces cerevisiae; Schizosaccharomyces pombe; Takifugu rubripes; Thalassiosira pseudonana; Trichomonas vaginalis. Image reproduced from [10]. . . . . . . . . . . . . . . 6
Intron density and intron length in 100 eukaryotes. Image reproduced
from [9]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Structure and splicing pathway of group I introns. Image reproduced
from [13]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Structure of generic group II intron secondary structure. EBS: exonbinding site; IBS: intron-binding site; MGC: Manganese ion C; J2/3:
junction between domains 2 and 3. Image reproduced from [18]. . . . . . 9
(a) Secondary structure of tRNA with the positions of known introns
marked by numbers (in parentheses). The box at the bottom indicates the
anticodon. (b) Conserved ‘bulge-helix-bulge’ in tRNA (left) and rRNA
(right) with cleavage sites shown by arrows. Uppercase: > 85% conservation; lowercase: 60-85% conservation; unconserved indicated by dots;
Purines - R; Pyrimdines - Y. Image reproduced from [21] with permission. 10
Conserved U2-type spliceosomal motifs in five divergent species. (A) 5’
splice site (5ss). (B) branch point site (BPS) where a highly conserved
adenine is located. (C) 3’ splice site (3ss). Image reproduced from [27]. . 11
Distribution of intron lengths in various organisms. Image reproduced
from [30]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Origin of nucleus-cytosol compartmentalisation at the onset of mitochondrial origin. Image reproduced from [34] with permission. . . . . . . . . . 13
xv
List of Figures
2.9
2.10
2.11
2.12
2.13
2.14
2.15
2.16
2.17
2.18
2.19
2.20
2.21
2.22
Short nuclear RNA (snRNA) involved in splicing. (a) Sm-class snRNA;
(b) Lsm-class snRNA. Image reproduced from [46] with permission. . . . .
Splicing cascade showing recycling of spliceosomal components. Image
reproduced from [49]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Schematic depicting first and second transesterification splicing reactions.
Image reproduced from [68] with permission. . . . . . . . . . . . . . . . .
tRNA splicing cascade in archaeal introns showing ligated exons (top
right) and circularised intron (bottom right). Image reproduced from [21]
with permission. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Splicing pathways of group II introns. Image reproduced from [72]. . . . .
Various splicing reactions exhibited by group II introns. (A) Forward
and reverse splicing. (B) Hydrolytic splicing. (C) Partial reverse splicing.
Introns indicated in red; exons indicated in dark/light blue (E1 and E2).
Schematic representing (A) cis- and (B) trans-splicing. (C) Conserved
motifs and elements involved in cis-splicing (ESE: exonic splicing enhancers; 5’SS: 5’ splice site; BP: branch point; PPT: poly-pyrimidine
tract; 3’SS: 3’ splice site). (D) Schematic representing the trans-splicing
between (a) androgen-binding protein (ADP) and (b) histidine decarboxylase (HDC) transcripts to yield (c) a mature chimeric transcript. Image
reproduced from [80]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Schematic representing exon definition including transition to intron definition via exon justaposition. Image reproduced from [29] with permission.
Three examples of ultra-short introns together with their sequence. Images reproduced from [91] with permission. . . . . . . . . . . . . . . . . .
Three mechanisms of splicing in ultra-long introns. Image reproduced
from [97] with permission. . . . . . . . . . . . . . . . . . . . . . . . . . . .
α-synuclein (SNCA) gene showing 11 transcript isoforms of various lengths
and with different exon assemblies. Image from Ensembl [117]. . . . . . .
Patterns of alternative splicing. (a) Exon skipping/cassette exons - most
prevalent in higher eukaryotes but low prevalence in plants; (b & c) alternative 3’ splice sites (A3SS) and alternative 5’ splice sites (A5SS) considered as intermediates between constitutive and alternative splicing;
(d) Mutually exclusive exons; (e) Intron retention. Image reproduced
from [127] with permission. . . . . . . . . . . . . . . . . . . . . . . . . . .
An example of extreme alternative splicing in the Drosophila Dscam gene
that potentially generates 38,016 transcript isoforms through mutualexclusive splicing of four exon clusters (coloured red, blue, green, yellow).
Exonisation of Alu elements. (a) Mutations surrounding Alu elements
that could give rise to new exons. (b) Transformation of RNA secondary
structures by enzymatic editing of adenosine (A) to inosine (I), which is
recognised as guanosine (G). In the presence of two adjacent Alu elements
of opposite orientation can then create a functional 3’ss (AA to AG). Image
reproduced from [127] with permission. . . . . . . . . . . . . . . . . . . . .
xvi
16
17
19
20
21
21
22
23
25
26
29
30
31
32
List of Figures
2.23 Exon transition from constitutive to alternative. (A) Degenerate splice
signals that can form (Aa) A5SS of A3SS, (Ab) weak 5’ splice sites leading
to skipping, (Ac) introduction of an exonic splicing suppressor, or (B)
secondary RNA structure that sequester splice signals. Image reproduced
from [127] with permission. . . . . . . . . . . . . . . . . . . . . . . . . . .
2.24 Exon shuffling. Image reproduced from [127] with permission. . . . . . . .
2.25 Proteomic modulation of alternative splicing. Serine-Arginine (SR) and
heteronuclear ribonucleoproteins (hnRNP) classes of proteins are predominantly involved in influencing splicing decisions by binding to exonic/intronic splice enhancers (ESE/ISE) or silencers (ESS/ISS). Image reproduced from [114] with permission. . . . . . . . . . . . . . . . . . . . . . . .
2.26 Epigenetic modulation of alternative splicing. (a) Histone modifications
(H3K9ac) leads to an increase in RNA pol II elongation leading to skipping in neural cell adhesion molecule (NCAM ) exon 18. (b) Histone modification (H3K36me3) at the fibroblast growth factor receptor 2 (FGFR2 )
recruits polypyrimidine tract binding (PTB) protein which suppresses an
alternate exon. Images reproduced from [114] with permission. . . . . . .
2.27 Kinetic modulation of alternative splicing. (a) SR splicing factor 3 (SRSF3)
is controlled by the phosphorylation state of the RNA pol II C-terminal
domain (CTD), which then determines inclusion of alternate exons recruitment is affected. (b) The Mediator, which associates with transcription factors, recruits the splicing suppressor hnRNPL. (c) CCCTC-binding
factor (CCTF) binds to unmethylated CG-rich DNA sequences and slows
RNA pol II transcription allowing competitive splicing of a downstream
exon. (d) DBIRD facilitates exclusion of A/T-rich exons by preventing
RNA pol II slowing down at these exons. (e) Kinetic coupling following
DNA damage, which leads to hyperphosphorylation of the RNA pol II
CTD further inhibiting RNA pol II elongation. (f) SWI/SNF (a chromatin remodelling factor) recruits SAM68, which stalls poll II enhancing
exon inclusion. Image reproduced from [114] with permission. . . . . . . .
2.28 Splicing code developed for five mouse tissues. Tissue codes: CNS (C),
muscle (M), embryo (E), digestive (D), and tissue-independent mixture
(I). The code is read using the key on the top left of the image. For
example, abundant CU-rich (nPTB ) motifs (fifth row) are strongly determinative of high inclusion of the alternate exon if within 300 nt upstream.
2.29 An illustration of noisy splicing on the HERPUD1 locus. The top level
shows average expression level at each position. Green marks exonic regions and black marks intronic regions. Middle panel shows identified
exons from junction mapping RNA-Seq reads. Black bars are annotated
exons; red bars are not annotated. Bottom panel shows Ensembl transcript isoforms. Image reproduced from [169]. . . . . . . . . . . . . . . . .
2.30 A schematic showing various tools available for detecting and measuring
splicing events using RNA-Seq. Image reproduced from [175]. . . . . . . .
3.1
xvii
34
35
36
38
39
40
42
43
Phylogeny. Phylogenetic tree illustrating relationships of the 19 mammal included in this study. . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
List of Figures
3.2
3.3
4.1
4.2
4.3
4.4
5.1
5.2
xviii
Comparison of insertions and deletions. The sum of the insertions
and deletions that have taken place in genes in size class one along the
human (a) and mouse (b) lineages since divergence from their common
ancestor. Blue points, corresponding to the genes that showed as much
or more intron content conservation than Hes7 are shown against a background of grey points, corresponding to all genes in the size class. All
axes are truncated, but the truncated axes include all of the blue points.
Grey points lying with cumulative insertions or deletions greater than 200
bp for human or 2000 bp for mouse are not shown. . . . . . . . . . . . . . 55
phastCons scores across introns of the Hes7 gene . . . . . . . . . . . . . . 57
Empirical distribution of p-values. Each p-values was calculated
using pairwise t-tests for a normalised probeset between cases and controls.
Proportion of probesets indicating higher/lower inclusion in cases
for binned p-value thresholds. All bins have an equal width of ∆p =
0.05 for p-values in the range [0, 1]. Red bars symbolise the proportion of
probesets indicating higher inclusion in cases; similarly, blue bars refer to
lower inclusion in cases. The absolute height of a bar of each colour represents the proportion of probesets with higher/lower inclusion in cases.
The t-statistic was used to determine relative inclusion: t > 0 - higher
inclusion in cases; t < 0 - lower inclusion in cases. Plot only shows core
and exonic (full) probesets. . . . . . . . . . . . . . . . . . . . . . . . . . .
Distributions of intron length. Density plot of intron length (A)
upstream and (B) downstream of differentially included exons in cases
relative to controls. A similar plot for background exons (core exons) is
provided for reference. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Nucleotide content of differentially included exons. The proportion of each nucleotide in differentially included exons (A) with exonic
splicing enhancers (ESEs) present and (B) with ESEs removed. . . . . . .
65
67
68
70
Estimation accuracy. MaLTE provides an estimate of the RNA-Seq
gene expression levels from microarray probe intensities. (a) The relative
error (i.e. difference between MaLTE estimate and RNA-Seq, divided by
the RNA-Seq value) as a function of the RNA-Seq expression level. Each
point corresponds to a bin of 75 genes. The data represents all genes
but with a random subset of 10 samples for each gene. Only relative
errors below 2 and RNA-Seq values between 1 and 1000 are represented.
Low expression genes were excluded due to high stochasticity for low read
counts. A Loess regression line is shown in red, illustrating that MaLTE
slightly underestimates RNA-Seq particularly for highly-expressed genes.
(b) The distribution of relative error with percentage median error and
median absolute error displayed with the median error indicated by the
dashed red line. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Within-sample correlation with RNA-Seq. (a, b, c) Scatter plot of
gene expression for a single exemplary sample for each method against
RNA-Seq. The sample with the within-sample Pearson correlation closest
to the median over all samples was chosen. Box plots are provided to show
the range of within-sample (d) Pearson and (e) Spearman correlation
coefficients across samples in the test dataset. Median correlations are
indicated beneath in brackets. . . . . . . . . . . . . . . . . . . . . . . . . . 84
List of Figures
5.3
5.4
5.5
6.1
6.2
6.3
6.4
6.5
6.6
6.7
6.8
6.9
xix
Cross-sample correlation with RNA-Seq. For each gene, the crosssample correlation was determined between the gene expression values estimated from the microarrays using MaLTE, median-polish and PLIER.
Density plots show the distribution across genes of (a) Pearson and (b)
Spearman correlation coefficients. Mean and median values of the correlation coefficients are provided in parentheses next to the method name in
the legend. Vertical lines show mean cross-sample correlation for MaLTE
(solid), median-polish (dashed) and PLIER (dotted). . . . . . . . . . . . . 85
The effects of OOB filtering. Mean cross-sample Pearson correlation as a function of OOB correlation threshold for (a) genes and (b)
transcripts. Error bars correspond to two standard errors. Note that
transcript-level estimates are not provided by RMA and PLIER. The
black line represents the number of genes/transcripts at each level. . . . . 86
Application to archived data. MaLTE, trained using the GTEx
data, was applied to predict gene expression from published microarray
data based on brain samples for which RNA-Seq data was also available. Despite the fact that the two studies used different array platforms
(Affymetrix Human Exon 1.0 ST arrays and Affymetrix Human Gene 1.1
ST arrays for the brain and GTEx studies, respectively), MaLTE predictions exceeded the within-sample correlations obtained using medianpolish and PLIER. MaLTE predictions were based on probes shared between the two array platforms. Box plots of (a) Pearson and (b) Spearman
within correlations are shown. (c) Pearson and (d) Spearman cross sample correlations with OOB filtering. The black line represents the number
of genes/transcripts at each level. . . . . . . . . . . . . . . . . . . . . . . . 89
Plot of GC content found in each base position. . . . . . . . . . . . . . .
Plot of distribution of quality scores in each base position. . . . . . . . .
Plot of mean quality score for each short read sequence. . . . . . . . . .
Plot of nucleotide percentage at each base position. . . . . . . . . . . . .
Plot of k -mer enrichment profile across short read sequences. . . . . . .
Effect of removing duplicates on the number of mapped reads. . . . . .
First two principal components for RNA-Seq expression estimates with and without various filtering approaches. (a) No filtering applied, (b) filtering using the subset of genes that passed OOB
filtering in the GTEx data (for comparison with MaLTE), (c) combining
OOB filtering and removal of noisy genes, and (d) OOB filtering, removal
of noisy genes then quantile-normalisation. . . . . . . . . . . . . . . . . .
First two principal components for MaLTE and RMA expression
estimates. (a) MaLTE prior to filtering out noisy genes, (b) MaLTE after
filtering out noisy genes. (c and d) Corresponding plots for RMA. . . .
Density of expression (a) before and (b) after filtering noisy
genes. Both plots are truncated on the x-axis because the scale extends
to -700 indicating numerous genes at very low expression levels. Truncation results in the shown densities not being normalised. Expression
estimates (x-axis) are log-2 transformed. . . . . . . . . . . . . . . . . . .
.
.
.
.
.
.
98
99
100
101
102
103
. 105
. 106
. 107
List of Figures
6.10 Effect of quantile normalisation of RNA-Seq data on nominal
differential expression. (a and b) Differential expression using standard t-tests. (a) No quantile normalisation, (b) quantile-normalised expression estimates. (c and d) Differential expression using moderated ttests (performed with limma). (c) No quantile normalisation, (d) quantilenormalised expression estimates. Red denotes the proportion of genes
that are expressed higher in cases, while blue denotes those higher in
controls. Under the null hypothesis of no differences in expression both
proportions should be equal at all p-values. . . . . . . . . . . . . . . . .
6.11 Cross-sample and within-sample correlations. (a) Pearson crosssample correlations, (b) Spearman cross-sample correlations, (c) Pearson
within-sample correlations, (d) Spearman within-sample correlations. . .
6.12 Densities of cross-sample correlations after filtering and quantile normalisation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
6.13 Proportion of overlap with RNA-Seq to nominally significant
genes using MaLTE and RMA. (a) Standard t-tests, (b) Moderated
t-tests (performed with limma). Each bar represents a single p-value
threshold. The proportion is computed relative to the overlap between
MaLTE and RMA. Error bars indicate two standard errors. . . . . . . .
6.14 α-synuclein transcript isoforms on Ensembl. . . . . . . . . . . . . . . . .
xx
. 108
. 109
. 110
. 112
. 113
A.1 Quality assessment of exon array data. A. Hierarchical cluster of
samples. Ti are cases, Ci are controls. B. Plot of first two principal
components (PCs) for full probesets. C.,D. First two PCs for exonic and
intronic probesets, respectively. . . . . . . . . . . . . . . . . . . . . . . . .
A.2 Sample permutations for core probesets. Plots representing the
complete set (eight) of pairwise swaps between cases (Ti ) and controls
(Ci ). Only core probesets are considered here. Plots are ordered from the
top by the value of π0 . Dashed red line represents a uniform distribution;
dashed blue lines represent the value of π0 . . . . . . . . . . . . . . . . . .
A.3 Sample permutations for full probesets. Plots representing the complete set (eight) of pairwise swaps between cases (Ti ) and controls (Ci ).
Plots are ordered from the top by the value of π0 . Dashed red line represents a uniform distribution; dashed blue lines represent the value of
π0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
A.4 Distribution of splice site scores. Density plot of splice site scores
of A. 5’ splice sites and B. 3’ splice sites. The dashed line indicates a
similar plot for background exons (core exons). . . . . . . . . . . . . . . .
A.5 Proportion of probesets indicating higher/lower inclusion in cases
relative to controls for binned p-value thresholds for U12-type
introns. Each column represents all probesets (exons) for that bin. Red
represent probesets included at higher levels in cases; blue are those included at lower levels in cases. . . . . . . . . . . . . . . . . . . . . . . . . .
A.6 Illustration of differential inclusion for different classes of probesets. A. Core, B. full, C. exonic, and D. intronic probesets. Each
point represents the number of probesets presenting higher (‘up’)/lower
(‘down’) inclusion in cases relative to controls within ∆p = 0.05 p-value
bins. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
122
123
124
125
127
128
List of Figures
xxi
B.1 The effect of training size on predictions. A test set of 100 samples
was used for all training sample sizes. . . . . . . . . . . . . . . . . . . . . 143
B.2 Out-of-bag (OOB) filtering. Scatter plot of cross-sample correlation
coefficients from on OOB gene expression estimates and estimates obtained on test data for 1,000 randomly selected genes. . . . . . . . . . . . 144
B.3 Comparison of differential expression results. Comparison of the
performance of MaLTE and median-polish on the problem of detecting
differential gene expression between five heart-muscle and five skeletalmuscle samples in the GTEx dataset. Each method results in a list of
genes, ranked by the q-value from the comparison of the gene expression
level in the two groups of samples. We used the (a) cumulative Jaccard
index and (b) concordance correlation to compare the similarity as a
function of rank and the overall similarity, respectively, between lists of
genes ranked by MaLTE/median-polish and by RNA-seq. The set of
genes assessed were defined by RNA-Seq: genes with RPKM of above
one in both tissues. Bootstrap re-sampling (100 pseudo-replicates) was
used to assess the effect of sampling error for both cumulative Jaccard
index and concordance correlations. Results of bootstrapping are shown
as faint lines around the observed Jaccard index and yield the density
plots shown in (b). Log of fold change estimates for (c) 120 differentially
expressed genes common to MaLTE and median-polish and (d) MaLTE
and PLIER for six common genes. . . . . . . . . . . . . . . . . . . . . . . 145
B.4 Comparison of 10-sample DGE to 44-sample DGE. (a) Self-comparisons
(e.g. MaLTE 10-sample DGE to MaLTE 44-sample DGE, with five and
22 samples in each tissue, respectively). Forty-four-sample DGE results
are treated as putative true differential expressions at the indicated (top
right or each plot) false discover rates (FDRs). FDRs were computed
using the q-value method [256]. Comparisons are made using the Jaccard
index. (b) Comparison of each method to 44-sample DGE using RNA-Seq.146
B.5 Transcript isoform expression estimates. Densities of (a) Pearson
and (b) Spearman cross-sample correlations for transcript isoform expression estimates obtained using MaLTE. Filtered data corresponds to
transcript isoforms with rOOB > 0. . . . . . . . . . . . . . . . . . . . . . . 147
B.6 Box plots of slopes β computed by linearly regressing RNA-Seq
against array method. Dotted red line indicates unit gradient. Each
box plot represents 12 slopes for brain samples. RNA-Seq expression is
restricted to RPKM between one and 1000 because of high uncertainty
at low values and few genes at high values representing 78% of genes
quantified. Using all genes results in the same medians but wider variance
in slopes due to outlying genes. . . . . . . . . . . . . . . . . . . . . . . . . 148
B.7 Estimation of tissue mixture proportions. (a) MaLTE, (b) medianpolish (RMA), and (c) PLIER. Each plot shows comparisons of true and
estimated proportions with key statistics indicated in the legends. . . . . 149
C.1 Schematic representing a MaLTE use-case. The four text files (dark
green) are the main input to MaLTE. These files are obtained by processing
the training data using the auxiliary scripts. . . . . . . . . . . . . . . . . . 152
List of Tables
3.1
3.2
4.1
4.2
4.3
4.4
4.5
6.1
6.2
6.3
Conservation of Hes7 intron length. Intron content of orthologues
of human Hes7 (OrthoDB ID EOG45TDC4). The tarsier orthologue was
omitted. This gene had no annotated introns but the annotation appeared
to be incomplete. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Gene set enrichment analysis. Gene set enrichment analysis of genes
in size class 1 showing evidence of intron content conservation in mammals. Top twenty most significantly enriched terms are shown. . . . . . . 54
Length of introns flanking differentially included exons. Comparison of median of intron length for exons with higher (up) and lower
(down) inclusion in cases relative to controls. . . . . . . . . . . . . . . . .
Length of differentially included exons. Comparison of mean and
median lengths for exons with higher (up) and lower (down) inclusion in
cases relative to controls. . . . . . . . . . . . . . . . . . . . . . . . . . . .
Splice site strength of differentially included exons. Mean and
median splice site strength computed using MaxEntScan [260]. . . . . . . .
Combined splice site strength (5SS and 3SS) of differentially
included exons. Similar to Table 3 but using combined splice site strength.
Indicators of co-transcriptional splicing. Gene length and distance
from the 3’ end of genes for exons differentially included. . . . . . . . . . .
67
68
69
69
70
Summary of quality assessment on RP RNA-Seq data from FastQC and
Picard. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
RP-associated mutations identified using RNA-Seq data. Mutations identified in whole blood samples.(+) present, (-) absent. . . . . . . 104
Genes identified as differentially expressed at FDR < 10%. All
genes are expressed higher in RP cases. . . . . . . . . . . . . . . . . . . . 114
A.1 Number of core probesets in each bin of width ∆p = 0.05. Data
used to produce the plots in Fig. 4.2 and Fig. A.6. . . . . . . . . . . . .
A.2 Number of full probesets in each bin of width ∆p = 0.05 . . . . .
A.3 Number of exonic probesets in each bin of width ∆p = 0.05 . . .
A.4 Number of intronic probesets in each bin of width ∆p = 0.05 . .
A.5 Number of U12-type intronic probesets in each bin of width
∆p = 0.05 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
A.6 Metaprobesets along PRPF8 . The coordinates of transcript cluster
IDs along the PRPF8 gene. . . . . . . . . . . . . . . . . . . . . . . . . .
.
.
.
.
129
130
131
132
. 133
. 134
C.1 Contents of MaLTE-aux and the functions they provide . . . . . . . . . . . 154
xxiii
To my parents,
Joseph and Mary Keter,
for their courage, selfless sacrifice
and determination in the face of
persistent uncertainty.
xxv
Chapter 1
Introduction
By the summer of 1977, the race to delineate the structure of eukaryotic genes was in
full swing [1]. Eukaryotic genes, unlike their prokaryotic counterparts, presented a comparatively greater challenge to unravel. Attention then shifted to human adenoviruses
for two principal reasons: they have smaller genomes (around 35 kbp) rendering experimentation easier [1] and they are processed by the host transcriptional machinery giving
insight into how human genes, and by extension eukaryotic genes, were structured and
processed. Shortly thereafter, Susan Berget and Claire Moore working under Phillip A.
Sharp performed an experiment based on the adenoviral hexon mRNA that encodes the
major structural polypeptide of the virus [2]. Figure 1.1 shows one electron micrograph
from their paper of an RNA-DNA hybrid and its schematic.
Figure 1.1: Electron micrograph and schematic of adenoviral hexon mRNA hybridised
to single-stranded genomic DNA selected using EcoR1 restriction enzyme (region A;
see [2] for full details). Loops A, B and C are genomic DNA absent from mRNA and
form loops between the hybridised regions. Image reproduced from [2].
1
Introduction
Berget and colleagues made three key discoveries. First, they observed that the length of
RNA-DNA hybridisation depended on the genomic segment used (selected using different
restriction enzymes). They also observed that segments of single-stranded genomic DNA
formed loops between hybridised portions suggesting a process by which the mRNA was
assembled from discontinuous segments with long intragenic regions somehow removed.
Finally, they found that multiple mRNA fragments could hybridise to an individual
genomic segment as several previous experiments had reported [3]. Based on these
findings, they proposed that the multiple segments were “probably spliced to the body
of this mRNA during post-transcriptional processing”. A few weeks later, Richard J.
Roberts’ lab replicated these findings [4]. Sharp and Roberts were awarded the Nobel
Prize in Physiology or Medicine in 1993 “for their discovery of split genes” (http:
//www.nobelprize.org/nobel_prizes/medicine/laureates/1993/).
These and other experiments elucidated that eukaryotic genes were subject to an additional processing step that could explain their distinction from prokaryotic genes. In
one of the most cited articles on splicing [5], Walter Gilbert introduced the term intron
for the ‘silent’ intragenic regions of DNA found in conjunction with expressed regions
(exons). He also made several startling predictions based on the nature of the splicing
process whose extent and implications have only recently been fully appreciated. First,
he proposed that the presence of introns could significantly contribute to evolutionary
synthesis either by allowing inclusion or exclusion of significant portions of amino acid
sequence due to mutations at splicing hotspots or by tolerating splicing error as a way to
introduce novel transcripts. Second, that the dogma of ‘one-gene, one-peptide’ was untenable. Third, that the presence of large introns could accelerate recombination as well
as enable exon shuffling, and finally, that introns could serve regulatory roles promoting
cell differentiation events.
This thesis explores the estimation and analysis of gene expression and alternative splicing in the context of development and disease. It consists of three main analyses with
an additional analytical chapter that expands on work from a preceding chapter. Chapter 2 provides a broad survey of introns, splicing and alternative splicing culminating in
a brief discussion of transcription isoform expression measurement.
Chapters 3, 4 and 5 constitute the main work carried out over the period of study. In
Chapter 3, we present an analysis of intron content (defined as the total intron length
in a gene) in nineteen mammals and show that genes with conserved intron content are
preferentially associated with development. This work was based on the observation
that the intron content in some genes, such as the hairy and enhancer of split 7 (Hes7 ),
plays a crucial role in the timing and dynamics of gene induction that is important for
the correct formation of segmented body parts (somites) [6].
2
Introduction
In Chapter 4, we carried out a transcriptome-wide analysis of aberrant splicing using
high-density oligonucleotide arrays on whole blood samples from individuals affected
by autosomal dominant retinitis pigmentosa. One of the mutations that leads to this
disease is a carboxyl-terminal domain mutation in a core component of the spliceosome,
pre-mRNA processing factor 8 (PRPF8 ). We hypothesised that non-retinal tissue may
provide evidence of non-symptomatic transcriptome-wide splicing aberration due to the
ubiquity of the spliceosome.
Chapter 5 introduces a novel approach to gene- and transcript-expression quantification
that uses a machine learning algorithm to provide improved estimates of gene expression.
MaLTE, which stands for Machine Learning of Transcript Expression, uses statistical
learning between a gold-standard (such as high-throughput transcriptome sequencing
or RNA-Seq) and microarray probe fluorescence intensities to substitute conventional
summarisation algorithms. In Chapter 6, we extend the analysis of Chapter 4 using
RNA-Seq data and compare the results to MaLTE and RMA. Finally, we make concluding remarks in Chapter 7.
3
Chapter 2
Introns, Splicing and Alternative
Splicing
In the preceding chapter, we introduced the seminal discovery of introns, splicing and
alternative splicing. This event had a remarkable impact on our understanding of one
of the most important aspects of transcriptional processing, particularly in eukaryotes.
This chapter now focuses attention on these three topics and outlines some of the salient
discoveries relating to each. This account provides the context for the work in the
subsequent chapters.
This chapter is organised into four main sections. The first section covers key aspects
of introns, from their structural diversity to their functional roles. The second section is
concerned with the splicing process and the various mechanisms involved that account
for the relatively high fidelity of this process. In the third section, we focus on alternative
splicing and conclude the chapter with a brief introduction to approaches to measuring
gene expression with an emphasis on estimating transcript isoform expression.
2.1
Introns
An intron is an intr agenic region of transcribed RNA that is routinely eliminated and
excluded from the final functional transcript [5]. Excision of introns predominantly
proceeds co-transcriptionally [7] through splicing of flanking sequences called ex pressed
regions (exons). A discussion of splicing (and subsequently alternative splicing) requires
a description of introns because their structure largely determines the nature of splicing.
Additionally, an assessment of the various intron structures and the organisms in which
they occur reveals an intriguing evolutionary history that had fueled debate ever since
5
Introns, Splicing and Alternative Splicing
Figure 2.1: Distribution of spliceosomal introns in eukaryotes. A wide selection of organisms are indicated together with the average number of introns per gene. Anopheles
gambiae; Arabidopsis thaliana; Aspergillus nidulans; Bigelowiella natans; Nucleomorph;
Caenorhabditis briggsae; Caenorhabditis elegans; Candida albicans; Chlamydomonas
reinhardtii; Ciona intestinalis; Cryptococcus neoformans; Cryptosporidium parvum;
Cyanidioschyzon merolae; Dictyostelium discoideum; Drosophila melanogaster; Encephalitozoon cuniculi; Giardia lamblia, Guillardia theta Nucleomorph; Homo sapiens;
Leishmania major; Mus musculus; Neurospora crassa; Oryza sativa; Paramecium aurelia; Phanerochaete chrysosporium; Plasmodium falciparum; Plasmodium yoelii; Saccharomyces cerevisiae; Schizosaccharomyces pombe; Takifugu rubripes; Thalassiosira
pseudonana; Trichomonas vaginalis. Image reproduced from [10].
their discovery. Therefore, this section outlines key aspects of introns, commencing with
their designations and structural characterisations followed by a brief discussion of their
origin and evolution. We then proceed to outline functions of introns before launching
into a discussion on the splicing process.
2.1.1
Prevalence, Designations and Structural Characterisation
Introns are found in all the domains of life but it is in eukaryotes that they have undergone considerable evolution exhibiting the greatest complexity in vertebrates where
they constitute a large proportion of gene sequence (Fig. 2.1). The abundance and
variety of introns found also varies between organisms with unicellular organisms possessing primitive introns most of which appear to be undergoing purifying selection [8, 9]
(Fig. 2.2).
6
1 kbp
intron length
Alveolates & heterokonts
Emiliana
Naegleria
Green plants & algae
Amoebozoa
Holozoa
Fungi
Sman
Hsap
Drer
Bflo
Tgon
Vcar
Aaeg
Scer
Ehux
Ngru
Cint
Hrob
Vvin
100 bp
Tgu
Spur
Bmor
Sbic
C16
ln y=0.558x+2.7089
Tadh
Lbic
Tcas
Bbov
introns per kbp coding sequence
Ptet
1
2
3
4
5
Figure 2.2: Intron density and intron length in 100 eukaryotes. Image reproduced
from [9].
A survey of the literature reveals five distinct categories of introns designated as group I
to group III, archaeal introns and spliceosomal (or nuclear) introns. Each category has
distinct characteristics and it is the shared characteristics that indicate how introns may
have originated and subsequently proliferated into eukaryotic lineages.
Group I introns were first discovered in the protozoan ciliate Tetrahymena thermophila
[11] (Fig. 2.3). They are found in several protozoan ribosomal genes, fungal mitochondrial genes, and some phage genes [12]. In eukaryotes, group I introns are found in
ribosomal RNA genes in ribosomal DNA loci. They are exclusive to functional ribosomal RNA (rRNA) genes in eukaryotes but also occur in a wide variety of protozoans.
Group I introns lack conserved nucleotide residues. A substantial treatment of group I
introns can be found in the review by [13] and [14].
In eukaryotes, Group II introns are principally found in mitochondrial genes. They are
also found in archaea and bacteria as well as in simple eukaryotes where they ensure that
housekeeping genes are accurately transcribed by preventing invasion by other harmful
mobile DNA elements [15]. Group II introns are mobile DNA elements: upon excision
they may be reverse transcribed back into the genome in a process termed reverse
splicing. They are structurally composed of six functional domains, D1 to D6 (Fig. 2.4).
They are further classified into subgroups IIA, IIB and IIC. Subgroups IIA and IIB are
typically around 900 bp long and occur in bacteria, archaea and mitochondria, while
subgroup IIC are about half as long (∼400 bp) and are exclusively found in prokaryotes,
7
Figure 2.3: Structure and splicing pathway of group I introns. Image reproduced
from [13].
representing the most primitive lineage of group II introns. More information on group
II introns is available from the following reviews [15–17].
Group III introns were first characterised in the Euglena gracilis chloroplast genome
and are mostly found in mRNA genes of chloroplasts and other euglenoids [19]. They
were identified in conjunction with a set of group II introns but differed in several traits.
Group III introns are much shorter and more uniform in size, ranging in length from
95-110 bp having an average length of about 102 bp (less than half the length of group
II introns). They have fewer conserved splicing signals compared to group II introns.
In Euglena, the 5’ splice site is designated by the consensus sequence 5’-NTNNG as
opposed to the 5’-GTGYG- of group II introns. The 3’ splice site in group III introns
is ANNTNNNN-3’, while group II have ATTTTAT-3’. Their secondary structure does
not resemble that of group II introns consisting of a central core with six radiating,
helical domains I-VI. In Euglena, group III introns are mainly found in genes associated
8
Figure 2.4: Structure of generic group II intron secondary structure. EBS: exonbinding site; IBS: intron-binding site; MGC: Manganese ion C; J2/3: junction between
domains 2 and 3. Image reproduced from [18].
with transcription and the transcription machinery and are highly enriched for A and T
nucleotides. [19]. A substantial treatment of group III introns is provided in [20].
Archaea possess introns in their tRNA and rRNA genes. Archaeal introns share structural similarities to other intron types. The introns are typically located one nucleotide
3’ of the anticodon but may also be found in the variable arm of the pre-tRNA. In prerRNAs, they are found associated with important functional sites that correspond with
the surface of the ribosome. Several archaeal introns have been found to posses open
reading frames of about 600 nt with relatively lower GC content. Archaeal pre-tRNA and
pre-rRNA introns adopt a ‘bulge-helix-bulge’ conformation at the exon-intron junction
(Fig. 2.5) which is then recognised and cut at the bulges by a splicing endoribonuclease.
Archaeal introns are discussed at great length in a review by Lykke-Andersen et al. [21].
Spliceosomal introns are an evolutionary success story [22]. Their diversity and the
correlation of their sizes with genome complexity, particularly in metazoans, point to
the central roles they play in conferring organismal complexity [23]. They fall into two
main categories depending on the splicing pathway they are subject to: U2-type and
U12-type introns. U2-type introns are spliced by the major spliceosome and constitute
more than 99% of all spliceosomal introns in human; U12-type introns are spliced by
the minor spliceosome and make up the remainder. The U* designations stem from
the names of the spliceosomal short nuclear RNAs (snRNAs), which are rich in uridine
(U) and are core components of the spliceosome. The initial segregation into these two
classes was based on conserved dinucleotide sequences that mark the 5’ and 3’ splice sites
9
Figure 2.5: (a) Secondary structure of tRNA with the positions of known introns
marked by numbers (in parentheses). The box at the bottom indicates the anticodon.
(b) Conserved ‘bulge-helix-bulge’ in tRNA (left) and rRNA (right) with cleavage sites
shown by arrows. Uppercase: > 85% conservation; lowercase: 60-85% conservation;
unconserved indicated by dots; Purines - R; Pyrimdines - Y. Image reproduced from
[21] with permission.
(5’ss and 3’ss): U2-type introns were associated with the terminal dinucleotides GT-AG
or GT-AC while U12-type introns were found in those with AT-AC dinucleotides but there
are numerous exceptions to this rule [24, 25]. In addition to the terminal dinucleotides,
spliceosomal introns also have a branch point site (BPS) bearing the conserved sequence
YTNAY (Y = pyrimidines C and T), in which the adenine base plays a critical role as an
intermediate transition site during the splicing process (Fig. 2.6). The BPS is usually
found between 30 and 50 bp upstream of the 3’ splice site [26]. Between the BPS and
the 3’ splice site (3’SS) is a conserved sequence of pyrimidines called the polypyrimidine
tract (PPT). The PPT is a crucial docking site for auxiliary factors (U2AF heterodimer)
that facilitate recognition of both the BPS and the 3’ splice site.
Spliceosomal introns vary in length both within and across eukaryotic lineages: unicellular eukaryotic organisms such as yeast (S. cerevisiae and S. pombe have intron lengths of
about 270 bp and 40-75 bp, respectively [28]) have few and short introns; intron length
increases from simple metazoans (C. elegans: mean of ∼500 bp) and invertebrates (D.
melanogaster : ∼560 bp), and shows remarkable development in length in vertebrates
(M. musculus: ∼1300 bpt; H. sapiens: ∼1700 nt) (Fig. 2.7). The length of introns
plays a crucial role in influencing how splicing takes place with short introns preferring
cross-intron spliceosomal assembly (intron-defined) and long introns being subject to
10
Figure 2.6: Conserved U2-type spliceosomal motifs in five divergent species. (A)
5’ splice site (5ss). (B) branch point site (BPS) where a highly conserved adenine is
located. (C) 3’ splice site (3ss). Image reproduced from [27].
Figure 2.7: Distribution of intron lengths in various organisms. Image reproduced
from [30].
a transition from cross-exon assembly to cross-intron assembly (exon-defined) splicing
[29].
2.1.2
Origin and Evolution
How did introns originate? Attempts to answer this question led to two main hypotheses:
the intron-early (later renamed introns-first) and the intron-late hypotheses [9], though
hybrids are more widely accepted than the extreme versions. While debate has simmered
since their discovery, it is only within the last few months that decisive evidence may
11
finally resolve this issue. Recent empirical results that examined the function of U-rich
snRNAs in the major spliceosome suggest that the evidence may point in favour of the
intron-late theory [31].
The intron-early theory was first proposed by Doolittle [32]. It attempted to reconcile
what then appeared as, the disparate occurrence of introns in prokaryotes and eukaryotes. It postulated that the former once had introns but had since lost them due to
genome streamlining to facilitate efficient transcript processing in response to competitiveness in growth. Both domains emerged from a common ancestor having introns but
introns were retained only in eukaryotes. Furthermore, it proposed that eukaryotes were
an independent evolutionary line because neither prokaryotes nor archaea were known
to have introns at the time [33]. According to this theory, exons are modular elements
that were shuffled to generate new genes from which it followed that protein domain
boundaries should coincide with exon boundaries. The absence of evidence in support
of the exon-to-protein-domain prediction cast doubt on this theory [34]. Eventually,
prokaryotic and archaeal introns were discovered weakening support for this theory.
The dominance of introns in eukaryotes, the exon shuffling hypothesis [34], and poor
conservation of introns [35] have retained some adherents. However, this theory did not
take into account the diversity of introns with strong evolutionary relationships (between
group II and spliceosomal introns) that support the intron-late theory.
The intron-late theory was first proposed by Thomas R. Cech [36] and takes a radically
different perspective by postulating that introns survived into the eukaryotic lineage
and have since undergone various phases of evolution starting with substantial accumulation in the last eukaryotic common ancestor (LECA) to substantial intron loss in
most current eukaryotic lineages [9]. It proposes that introns may have been introduced
into eukaryotes as group II introns in the mitochondrial endosymbiont because of their
homology to spliceosomal short nuclear RNAs (snRNAs) (Fig. 2.8). The innovation of
the nucleus thereafter allowed the proliferation of introns to largely go unchallenged because transcription was separated from translation [34]. Additionally, the invention of
the spliceosome as an efficient splicing machine permitted unconstrained intron lengths
[9]. Recent evidence has conclusively determined that the U6 spliceosomal snRNA is a
ribozyme providing a strong link to group II introns [37].
2.1.3
Functions of Introns
Introns harbour regulatory sequences that play roles in various transcript processing
pathways. As we highlighted above, introns bear the conserved splicing signals necessary
for the assembly of the spliceosome. These sequences may also regulate the efficiency of
12
Figure 2.8: Origin of nucleus-cytosol compartmentalisation at the onset of mitochondrial origin. Image reproduced from [34] with permission.
13
splicing by the strength to which spliceosome components bind to them. For example,
several proteins such as polypyrimidine tract binding (PTB) protein insulate the PPT
from being accessed by U2AF. In fact, similar sequences in the downstream intron allow
PTB to sequester the central exon ensuring that it is skipped [38, 39]. Introns may
also host sequence elements that alter the speed of RNA polymerase II (RNA pol II)
transcription. An example of this is the MAZ4 elements that slow down RNA pol II
in order to ensure competitive splicing of an upstream exon, which would otherwise be
skipped [40].
The rate of gene induction is influenced by the total transcribing gene length [41], which
is dominated by the length of introns (intron content) [42]. In higher eukaryotes, introns
contribute substantially to intron length resulting in potentially very long induction
times. In particular, developmental processes have been shown to be sensitive to the
impact of introns on the timing of gene induction. One example of this occurs during
somitogenesis when the rate of gene induction dynamically determines the formation
of somites. Takashima et al. demonstrated that exclusion of the introns of the Hes7
gene resulted in abolition of oscillation in the levels of Hes7 protein and led to severe
segmental defects [6].
In eukaryotes, parental genomes undergo sexual recombination to create allelic combinations. The presence of introns reduces the probability of recombination events occurring
on exonic sequences particularly when long (as in most eukaryotes). Exon shuffling is
regarded as one of the principal means by which new exons can be incorporated as well
as propagated between genes [43].
2.2
Splicing
Splicing is the process through which intronic sequence is systematically excised and
flanking exons are ligated. It is one of several transcriptional processing steps. Splicing diversifies an organism’s transcriptome by availing linear sequence absent from the
genome [44]. For splicing to take place introns must first be identified before a transition is made into the splicing conformation. The actual splicing reaction then occurs,
releasing the excised intron as a lariat.
This section addresses several aspects of splicing paying greater attention to spliceosomal introns. It begins with an overview of the spliceosome then proceeds to the major
spliceosome (associated with U2-type introns) because it splices out more than 99% of
14
eukaryotic introns. Next, we outline the splicing pathway including the transesterification reactions involved. We omit the minor splicing pathway because it largely resembles the major splicing pathway. Thereafter, we briefly sketch other splicing pathways:
self-splicing, tRNA splicing and splicing of archaeal introns. Lastly, we address several
important topics in splicing such as exon-defined splicing, ‘extreme’ splicing (ultra-short,
ultra-long and micro-exons), and the timing of splicing relative to transcription.
2.2.1
The Spliceosome
The spliceosome is a dynamic, macromolecular complex that is systematically assembled
at splice sites to catalyse the splicing reaction. It is composed of several uridine-rich
short nuclear RNA (snRNAs) in conjunction with an elaborate proteome. The snRNAs
strongly associate with some of the proteins forming short nuclear ribonucleoproteins
(snRNPs, pronounced ‘snurps’). In addition to core snRNPs are a host of protein factors
that temporarily aid in guiding snRNPs to their destinations.
Nucleoplasmic snRNAs can further be broken down into two classes: Sm1 or Lsm [46]
(Fig. 2.9). Sm snRNAs are synthesised by a specialised form of RNA polymerase II,
have, at the 5’ end, a trimethylated guanosine cap and form a stem loop at the 3’
end. Additionally, they have binding sites for the proteins they associate with. They
translocate into the cytoplasm where the 3’ stem loop is formed then return to the
nucleus. The major spliceosome is populated by Sm-snRNAs: U1, U2, and U4; the
minor spliceosome has U11, U12 and U4atac. Both spliceosomes have U5 [46].
Lsm-snRNA are synthesised by RNA polymerase III. The ‘L’ stands for ‘like’ because
they are homologous to the Sm proteins.
They are structurally composed of a 5’
monomethylguanosine cap and a 3’ stem loop structure. Unlike the Sm-class snRNA,
Lsm-snRNAs do not leave the nucleus. The only known members are U6 and its minorspliceosome homologue U6atac. The U6 snRNA was recently shown to be a ribozyme
forming the catalytic center of the spliceosome [37].
Each of the snRNAs is found to interact, at various points in the splicing pathway,
with the precursor RNA substrate and multiple proteins [47]. U1 binds at the 5’ splice
site (5’ss) and is associated with the proteins U1-70K, U1-A, U1-C and SRPK1. U2
is recruited by the U2 auxiliary factor (U2AF) protein and binds to the BPS. It is
associated with the proteins SPF45, SPF31, CHERP as well as many others. U4 and
U6 are found base-paired to each other and weakly bound to U5 as a tri-snRNP, which
1
Named in honour of Stephanie Smith, in whom they were first identified after her diagnosis with
systemic lupus erythematosus [45].
15
Figure 2.9: Short nuclear RNA (snRNA) involved in splicing. (a) Sm-class snRNA;
(b) Lsm-class snRNA. Image reproduced from [46] with permission.
associates with the proteins PRPF8, BRR2, SNU114, PRPF28 and several others. The
U4/U6 pair associate with PRPF4, CYPH, PRPF31, SNU13 and PRPF3.
2.2.2
The Splicing Pathway
The spliceosome assembles in stages but in some organisms such as yeast it may also be
found as a complete assembly ready to be activated [48]. Stepwise assembly involves the
formation of complexes that in human have been characterised as early (E), activated
(A), B, Bact, B*, and catalytically (C) active complexes [49] (Fig. 2.10). Each stage
requires the presence of various factors that facilitate recruitment of the various components of the spliceosome. Also, some of these stages are ATP-driven with others being
ATP-independent. It culminates in two transesterification reactions that ligate the 5’
and 3’ exons releasing the intron as a lariat. The splicing pathway is determined by
the intron to be spliced. Here we concentrate on U2-type (and by extension, U12-type
due to homology) spliceosomal introns only. Splicing of other introns is discussed in
Section 2.2.4.
Some of the first detailed studies of splicing were carried out in yeast [48]. It was
observed that the association of U1 to the 5’ splice site committed a site to splicing.
This was therefore called the commitment complex. In mammalian splicing systems
this is referred to as the E complex [50, 51]. The E complex forms due to substantial
complementarity between the 5’ part of U1 and the 5’ splice site [52]. However, it has
also been shown that the absence of U1 does not prevent complete splicing but rather
16
Figure 2.10: Splicing cascade showing recycling of spliceosomal components. Image
reproduced from [49].
U1 enhances splicing efficiency. The interaction that U1 makes with the 3’ splice site
also plays an active role in complex formation [53]. U1 is also associated with the 3’
splice site of the upstream intron and has been shown to facilitate binding of U2AF at
the upstream 3’ splice site [54]. This observation led to the discovery of a mechanism
by which exons are recognised called exon definition [29] discussed in more detail in
Section 2.2.5.
The presence of greater sequence complexity at the 3’ end of eukaryotic introns (BPS,
PPT and 3’ss) delayed a detailed understanding of the assembly of spliceosomal components. Indeed, many more factors are involved at the 3’ end [47]. In an ATP-dependent
reaction and in the presence of a host of splicing factors (SF3a, SF3b and SF1 ), the
U2AF heterodimer binds to the polypyrimidine tract (PPT) [50]. U2AF is made up
of U2AF65 and U2AF35 and it is the latter that specifically binds to the 3’ splice
site [55, 56]. This binding has been shown to be influenced by enhancers (discussed in
Section 2.3.4.2) and may modulate the inclusion of the downstream exon [56]. U2 is
then recruited to the BPS immediately upstream of the PPT forming the activated (A)
complex.
U4 and U6 transiently exist as a duplex and are joined by U5 to form the U5-U4/U6
tri-snRNP. U5 is associated with PRPF8, which interacts with the 5’ splice site [49].
The recruitment of the tri-snRNP to the U1-U2 assembly constitutes the B complex.
Several helicases are now involved in destabilising U1 and U4. The U5 helicases PRP28
17
and BRR2 initiate U4 unwinding from U6. BRR2 is activated by SNU114 [57]. p68
then unwinds U1 from the 5’ss [58, 59] leading to the Bact complex. PRPF2 (an ATPdependent DEAH-box protein) triggers further rearrangements that result in the B*
complex.
The spliceosome then undergoes a series of further rearrangements facilitated by the
following factors and helicases: BRR2, PRPF2, PRPF5, PRPF16, PRPF22, PRPF28,
PRPF43, and SUB2 [60]. The spliceosome is now in the catalytically active conformation. PRPF2 briefly interacts with the spliceosome leading to further rearrangements
to prepare for the first of two transesterification reactions. This reaction joins the 5’ss
to the BPS. After the first step is complete, PRPF16 is recruited to enable structural
rearrangements for the second transesterification reaction specifically associating with
the 3’ss [61, 62]. In order for the second step to successfully complete, PRPF8, SLU7,
PRPF17, and PRPF18 are needed [63–65]. Once the splicing of adjacent exons is complete, PRPF22 and PRPF43, two other helicases, enable the spliced mRNA and lariat
intron to be released [66, 67].
2.2.3
Transesterification Reactions in Splicing
Spliceosome mediated splicing takes place through two transesterification reactions.
Transesterification (TE) reactions are transfer reactions that occur between two specific
chemical groups: esters and alcohols. A transfer reaction occurs whenever substrates
exchange functional groups. Alcohols are identified by the hydroxyl (OH) functional
group while esters are characterised by the generic formula RCO2 R’ or RCOOR’2 .
The linear structure of the RNA substrate consists of the nucleic acid bases attached
to a sugar-phosphate backbone. The first TE reaction takes place between the 2’-OH
of the BPS adenosine and the phosphodiester bond at the 5’ splice site (Fig. 2.11).
This reaction swaps the 2’ of the BPS adenosine with the 5’ss phosphodiester forming a
closed intronic loop. The upstream exon is left free but scaffolded to PRPF8 [49]. The
3’-OH at the free exon now attacks the phosphodiester bond at the 3’ss in the second
TE reaction swapping out the hydroxyl for the phosphate, completing the ligation. The
resulting exon junction now conforms to a regular phosphodiester bond. Because there
is no resulting change in the number of phosphates, TE reactions proceed without an
external energy source such as ATP or GTP [69].
2
R (and R’) refers to any alkyl functional group. They are formed from alkanes, which are saturated
hydrocarbons. Alkyl groups are formed from the removal of one hydrogen from an alkanes e.g. the
methyl group, CH2 - from methane, CH3 .
18
Figure 2.11: Schematic depicting first and second transesterification splicing reactions. Image reproduced from [68] with permission.
2.2.4
2.2.4.1
Other Splicing Pathways
tRNA splicing
The splicing of tRNA genes in archaea takes place in the presence of three enzymes:
an endonuclease, a ligase and a phosphotransferase, and requires ATP for hydrolysis.
The first step of tRNA splicing is the association of an endonuclease with the pre-tRNA,
which cleaves both splice sites resulting in the two loose exons and the intron (Fig. 2.12).
The intron is cleaved at the 5’-hydroxyl and 3’-cyclic phosphate ends. A ligase reaction
ensues, catalysed by a 90 kDa tRNA ligase in a multistep process that terminates with
both exons fused but still retaining an extra 2’-phosphate. This phosphate is then
removed by the last enzyme, a nicotinamide adenine dinucleotide phosphotransferase.
A more detailed analysis can be found in a review by Abelson et al. [70].
19
Figure 2.12: tRNA splicing cascade in archaeal introns showing ligated exons (top
right) and circularised intron (bottom right). Image reproduced from [21] with permission.
2.2.4.2
Self-splicing
Self-splicing affects group I-III introns. It involves intricate folding mechanisms that
lead to similar transesterification reactions as described in Section 2.2.3.
Group I introns are cleaved in two transesterification reactions but these involve phosphorylated guanosine in one of its forms (GMP, GDP or GTP) (Fig. 2.3B). Alternatively,
it may proceed with unphosphorylated guanosine. In the first transesterification reaction, the guanosine attacks the phosphodiester bond at 5’ss freeing the 5’ exon now
terminated by a 3’-hydroxyl. The second transesterification reaction involves the free 5’
exon attacking the phosphodiester bond at the 3’ss ligating both exons. More details of
group I self-splicing are provided in a review by Cech [71].
Self-splicing with group II introns resembles spliceosomal splicing and involves the formation of a lariat (Fig. 2.14). It proceeds in a strikingly similar fashion to spliceosomemediated splicing with the main difference being the absence of an external catalyst. An
SN 2 reaction is involved, which entails synchronous breaking and formation of the bond.
The splicing reaction is facilitated by the folding of the group II intron into the active
structure and this places the active site (a 2’-hydroxyl) in contact with the 5’ splice
site. The first reaction can either take place due to hydrolysis or transesterification.
Thereafter, the intron is rearranged so that the 3’-hydroxyl at the loose 5’ exon attacks
the 3’ splice site, ligating the exons and releasing the intron as a lariat. The released
lariat intron is still able to perform its catalytic activities to facilitate its re-entry into
the genome or other RNA by reverse transcription or retrotransposition.
20
Figure 2.13: Splicing pathways of group II introns. Image reproduced from [72].
Figure 2.14: Various splicing reactions exhibited by group II introns. (A) Forward
and reverse splicing. (B) Hydrolytic splicing. (C) Partial reverse splicing. Introns
indicated in red; exons indicated in dark/light blue (E1 and E2). Image reproduced
from [17] with permission.
The scarcity of group III introns coupled with their low sequence conservation means
that little is understood on how they mediate self-splicing [20]. They are spliced out in
a lariat after undergoing two transesterification reactions and their self-splicing shares
other similarities to group II self-splicing.
Nested splicing involves self-splicing of twintrons, group II introns that contain within
them group III introns. They undergo the self-splicing pathway by first excising the
group III intron followed by the group II intron [73].
2.2.4.3
Trans-splicing
Trans-splicing involves the ligation of exons from precursor-mRNA transcribed from different loci and has only been observed in eukaryotes (Fig. 2.15). Early characterisation
introduced two primary categories of trans-splicing: those due to a ‘spliced leader’ (SL)
found in protozoans in which a short RNA sequence (the ‘splice leader’) is added to the
5’ of the precursor mRNA, and the ‘discontinuous group II introns’ found in chloroplasts
(in plants and algae) and plant mitochondria, which involve joining independently transcribed sequences [74]. SL RNAs are short (∼45-140 bp) non-mRNAs transcribed by
RNA pol II from intronless genes and feature a hypermodified 5’ cap structure and a
21
Figure 2.15: Schematic representing (A) cis- and (B) trans-splicing. (C) Conserved
motifs and elements involved in cis-splicing (ESE: exonic splicing enhancers; 5’SS: 5’
splice site; BP: branch point; PPT: poly-pyrimidine tract; 3’SS: 3’ splice site). (D)
Schematic representing the trans-splicing between (a) androgen-binding protein (ADP)
and (b) histidine decarboxylase (HDC) transcripts to yield (c) a mature chimeric transcript. Image reproduced from [80].
splice donor site [75]. However, subsequent results have identified non-SL type transsplicing (also called genic trans-splicing) in which exons from different pre-mRNA transcripts are ligated [76]. There is evidence for trans-splicing in Drosophila [77, 78] and
human [79].
Trans-splicing may be spliceosome-mediated or self-spliced. In trypanosomes, the spliceosomal snRNPs have homologues for U2, U4 and U6 and it is thought that SL RNA plays
the role of U1 because it shares partial complementarity to the 5’ splice site similar to
U1. Splicing occurs between the splice donor on the SL RNA and the acceptor site in
the resident RNA. Unlike in conventional splicing, where the intron is excised as a lariat,
trans-splicing releases the intron as a Y-shaped structure with the junction at the BPS:
one arm of the Y is the 5’ end of the SL RNA bound to the BPS by a 2’-5’ bond from
to the first transesterification reaction.
2.2.5
Exon-defined Splicing
The intron may be regarded as the principal locus for splicing because it harbours the
majority of the conserved splice signals. Spliceosomal components assemble after the
U1 snRNP base pairs with the 5’ splice site. U1 has also been found to be a necessary
22
Figure 2.16: Schematic representing exon definition including transition to intron
definition via exon justaposition. Image reproduced from [29] with permission.
recruitment factor for U2AF, which in turn recruits U2 for the upstream intron of internal
exons [81].
This observation prompted Robberson and colleagues to investigate the effect of exon
length on its splicing to adjacent exons for internal exons. They found that exon skipping
occurred when exon length was extended beyond a length of about 300 bp. This exonlength dependent pathway is called exon definition or exon recognition [53] (Fig. 2.16).
In addition to U1 and U2 snRNPs, several other components of the spliceosome including
the tri-snRNP were found in an exon defined complex [82]. Schneider and colleagues
observed that this complex could easily be converted into a pre-catalytic B complex
conformation giving an indication of the formation of the spliceosome. However, in order
for splicing to proceed to completion, there needs to be a transition from exon definition
to intron definition. This transformation is called exon juxtaposition (Fig. 2.16) and
results in two intron defined complexes across the flanking introns [29]. Thereafter,
splicing occurs as outlined in Section 2.2.2.
A comparison between human and fruitfly showed that the transition from intron-defined
to exon-defined splicing may occur when introns are about 250 bp: intron-definition takes
place below this length [83]. The average length of internal exons in humans is roughly
140 bp [84] and the average length of introns is approximately 1700 bp [83], which means
that majority of splicing in humans takes place via the exon definition pathway [85]. The
distribution of intron lengths clearly shows a bimodal characteristic with a trough at
introns of about this length (Fig. 2.7) [27, 86]. The transition from a predominantly
intron-defined to a predominantly exon-defined splicing genome can also be observed
23
when one looks at the exon-intron structure in nematode, fruitfly, zebrafish, mouse and
human suggesting a preference for exons definition in higher organisms [23, 86]. As we
will present in Section 2.3.2, the exon-intron architecture is important in alternative
splicing.
Exon-defined splicing has two important implications. First, it implies an upper bound
on exon length for efficient splicing [84]. This may vary from organism to organism and in
human appears to range from between 300 and 500 bp. Splicing of exons that are longer
than these lengths with long flanking introns therefore becomes problematic. Bolisetty
and colleagues showed that these exons contain novel exon splicing enhancers (ESE) that
help overcome this limitation [84]. Second, it implies that there must exist a minimum
exon length. If exon lengths get too small then the components of the spliceosome
will get in the way of the splicing process (steric hindrance) [29]. An intriguing study
on minimum exon length showed that skipping occurred in human beta-globin gene
when exon length was shortened below 50 bp [87] but was recovered when the splice
sites were mutated towards the conserved splice site sequence (strong splice sites) [88].
Finally, exon definition implies a direct mechanism that the cell may use to modulate the
inclusion of exons. By blocking communication across the exon, the exon can effectively
be spliced out together with the flanking introns. Specifically, many heteronuclear RNP
proteins (such as hnRNP A1, F, H, L [89, 90]) have been found to play a critical role in
modulating the inclusion of an exon by interfering with this activity. We briefly cover
such splicing modulation in the section on alternative splicing.
2.2.6
‘Extreme’ Splicing
‘Extreme’ here refers to extremes of intron or exon lengths in relation to spliceosomal
dimensions. There are three scenarios that may be defined as ‘extremes’: ultra-short
introns, ultra-long introns and micro exons. The extremity of these features introduces
new challenges for the splicing machinery. Nevertheless, efficient splicing can still occur.
2.2.6.1
Ultra-Short Introns
The existence of conserved sequences in introns implies that introns need to have a
minimum length for splicing to take place. The distribution of intron length in human
(Fig. 2.7) clearly shows a hard limit below which splicing is impossible by the pathways
described so far. Introns shorter than 65 bp have been referred to as ultra-short introns
[91]. This in turn has consequences on the splicing pathway undertaken. The fully
assembled major spliceosome has a footprint of approximately ∼ 26 × 20 × 19.5 nm and
given that 1 bp in RNA is approximately 0.23 nm this translates to between 85 and 113
24
Figure 2.17: Three examples of ultra-short introns together with their sequence.
Images reproduced from [91] with permission.
bp of linearised RNA [91, 92]. Three examples of ultra-short introns are the 56 bp intron
in HNRNPH1, the 49 bp intron in NDOR1, and the 43 bp intron in ESRP2 genes [91]
all of which appear to be efficiently spliced (Fig. 2.17).
Splicing of ultra-short introns is ad hoc because these introns lack conserved regions
such as the polypyrimidine tract (PPT). However, all introns analysed in this study
required the U2-associated splicing factor SF3b. The longest of these introns, the 56 bp
intron of HNRNPH1 was excised with a lariat structure in vitro while two of the other
ultra-short introns harboured previously unidentified conserved regions consisting of Grich domains, which have previously been shown to be associated with intron splicing
enhancer (ISE) domains [93, 94] and also play a role by enforcing the location of boundaries [95]. Furthermore, these G-rich domains, when mutated, resulted in a moderate
splicing defect with the retention of these introns.
2.2.6.2
Ultra-Long Introns
At the extreme end of the scale are the ultra-long introns with the longest known human
intron being 740,920 bp in the heparan sulfate 6-O-sulfotransferase 3 (HS6ST3) gene
[96]. This gene appears to have a single intron and only one transcript isoform therefore
exon definition is inadmissible. Fedorova and Fedorov also pointed out that over 3000
human introns are over 50 kb, ∼1200 over 100 kb and ∼300 over 200 kb, and 9 over 500
25
Figure 2.18: Three mechanisms of splicing in ultra-long introns. Image reproduced
kb. Apart from the obvious challenge of locating ∼200 bp exons among such introns,
it is intriguing that such introns are correctly excised. Even where exon definition
might explain how tandem spliceosomes may form it still does not account for how exon
juxtaposition may take place.
There are two known ways in which long introns may be spliced: recursive splicing as
in the Drosophila gene Ubx and nested splicing as in the human gene DMD. In fruitfly,
splicing of an ultra-long intron proceeds recursively. This occurs by stepwise reuse of
one splice site (Fig. 2.18) ultimately leading to excision of the full intronic sequence
[98, 99]. In order for this to take place, these relatively long introns have splice sites
referred to as ratchetting points (RP)-sites (Fig. 2.18) leading to new intermediate splice
sites called regenerated splice (RS) sites. The RP-site takes on a dual role as donor and
acceptor and effectively simulates a zero-length exon [100]. The splicing can then be
carried out either upstream-first or downstream-first. However, a computational survey
of vertebrate introns showed that RP-sites are depleted suggesting an alternative mechanism for splicing of long introns [100]. Additionally, vertebrate long introns are enriched
with short interspersed elements (SINEs) and long interspersed elements (LINEs) that
may form secondary structures of introns, thereby reducing apparent intron length to
facilitate the transition from exon definition to intron definition [100].
Recently, it was shown that nested splicing appears to take place in the human dystrophin
(DMD) gene [97]. The seventh intron of DMD has a 110,199 bp intron. Using RTPCR, Suzuki and colleagues were unable to observe part of the intronic sequence and
instead only observed an alternative BPS between the 5’ss and the authentic BPS.
26
Furthermore, they also observed a closed lariat missing sequence from the original long
intron suggesting that the intron was modified through a nested splicing mechanism.
2.2.6.3
Micro-Exons
Splicing of extremely short exons appears to be strongly influenced by the presence of
enhancers located in either of the flanking introns or nearby exons. For example, the 61
bp exon of the chicken cardiac troponin T (cTNT) gene was found to require a 134 bp
sequence in the downstream intron [94]. This intronic splicing enhancer (ISE) contains
a G-rich motif that serves as the active site and may interact with components of the
splicing machinery. The 7 bp exon in chicken skeletal troponin I (sTNI) depends on
the presence of the upstream exon (an exonic splicing enhancer, ESE) [101]. Another
example is the 18 bp N1 src exon spliced in an enhancer-dependent manner [102–104].
2.2.7
Timing of Splicing Relative to Transcription Termination
The timing and intra-nuclear localisation of splicing has important consequences for
induction of functional transcripts [105]. Timing may be co-transcriptional, in which
it is coupled to transcription and occurs along the nascent transcript or may be posttranscriptional. However, this requires a functional interaction between the transcriptional machinery and the splicing machinery.
2.2.7.1
Coupling between Transcription and Splicing
The carboxyl-terminal domain (CTD) of RNA polymerase II is necessary for in vivo
splicing and cleavage as well as other aspects of post-transcriptional processing [106].
RNA pol II lacking a CTD led to accumulation of unspliced product as well as of readaround (RA) transcripts around a full plasmid sequence. The phosphorylation state of
the CTD of RNA pol II plays a vital role splicing in vitro: hyperphosphorylated RNA
pol II (RNA pol IIO) activates while hypophosphorylated RNA pol II (RNA pol IIA)
may inhibit splicing. This suggests that specific sites along the CTD may play a decisive
role in activating splicing both in vivo and in vitro [107, 108].
RNA pol II also takes part in splicing by influencing the early stages of spliceosome
formation. The CTD activates splicing through its interaction with two splicing factors:
U2AF65 and PRPF19 complex (PRPF19C, also called the nineteen complex, NTC ).
PRPF19C is poorly understood but it is seen to play an important role through proteinprotein interactions with other spliceosomal components [109]. U2AF65 directly binds to
27
the phosphorylated CTD, which enhances further recruitment of U2AF65 and PRPF19C
[110].
There is also evidence that the exon-defined splicing pathway exhibits coupling between
transcription and splicing via exon definition [111]. Mammalian internal exons were
spliced in vitro only when complete exon-defined complexes were formed in the presence
of RNA pol II with a recombinant CTD. Furthermore, exon-defined complexes were
stably associated with the recombinant CTD implying a direct interaction between RNA
pol II in vitro and splicing.
2.2.7.2
Co-transcriptional Splicing versus Post-transcriptional Splicing
The majority of splicing events take place co-transcriptionally [7] but there are multiple examples of post-transcriptional splicing. However, the relative extent of co- versus
post-transcriptionality of splicing as well as its predictability is unknown. Coupling
of transcription to splicing avails an avenue through which alternative splicing may be
effected either through kinetic coupling or structural coupling [112, 113]. Nevertheless, co-transcriptionality does not necessarily imply that the order of intron excision
is strictly a 5’-to-3’ affair - a ‘first-come, first-served’ model; rather, it appears that
the commitment to splicing (by formation of the U1-5’ss bond) may be decisive - a
‘first-committed, first-served’ model [114].
There are good reasons why splicing may need to take place post-transcriptionally.
For example, in anucleate platelets, splicing of interleukin-1β (IL-1β) requires external
activation. Prior to splicing, IL-1β pre-mRNA is present as unspliced transcripts [115].
This allows rapid induction of IL-1β for translation without the need for time-consuming
transcription. Vargas et al. also demonstrated post-transcriptional splicing of sex lethal
(Sxl) in Drosophila and polypyrimidine tract-binding (PTB) protein in HeLa cells by
visualising splicing in live cells [113]. They observed that constitutive splicing (in male
Sxl and canonical PTB, respectively) occurred co-transcriptionally but regulated splicing
(in female Sxl and paralogous neuronal PTB, nPTB ) resulted in intron-containing premRNA dispersed throughout the nucleoplasm indicating post-transcriptional splicing.
However, it has been argued that it may still be the case that splicing factors are cotranscriptionally recruited with splicing catalysis occurring post-transcriptionally [114].
2.3
Alternative Splicing
Splicing that results in ligation of all exons is termed constitutive and occurs predominantly in simple eukaryotes such as yeast [116]. However, in higher eukaryotes, splicing
28
134.22 kb
90.64Mb
90.66Mb
90.68Mb
Forward strand
90.70Mb
90.72Mb
90.74Mb
90.76Mb
Genes (GENCODE)
RP11-115D19.1-005 >
RP11-67M1.1-002 >
antisense
antisense
RP11-115D19.1-003 >
RP11-67M1.1-001 >
antisense
antisense
Genes (GENCODE)
< SNCA-201
protein coding
< SNCA-202
protein coding
< SNCA-005
protein coding
< SNCA-001
protein coding
< SNCA-002
protein coding
< SNCA-003
protein coding
< SNCA-008
protein coding
< SNCA-006
protein coding
< SNCA-004
protein coding
< SNCA-009
protein coding
< SNCA-007
protein coding
90.64Mb
90.66Mb
90.68Mb
90.70Mb
Reverse strand
Gene Legend
90.72Mb
90.74Mb
90.76Mb
134.22 kb
protein coding
processed transcript
merged Ensembl/Havana
Figure 2.19: α-synuclein (SNCA) gene showing 11 transcript isoforms of various
lengths and with different exon assemblies. Image from Ensembl [117].
takes place with a remarkable diversity in splicing decisions, significantly increasing the
information coding capacity of the organism’s genome. This form of splicing generates a
set of alternative transcript isoforms from individual genes and may employ previously
uncharacterised, cryptic splice sites, a process called alternative splicing. Figure 2.19
shows a gene with multiple transcript isoforms displayed Ensembl. Multiple transcript
isoforms are indicated as parallel assemblies of exons. Figure 2.20 shows the set of
splicing patterns that can be generated through alternative splicing.
Current estimates suggest that nearly all multi-exon genes in humans and a large proportion in vertebrates are alternatively spliced [118, 119]. Widespread application of
high-throughput assays that interrogate genes along their entire length such as exon arrays [120], exon junction arrays [121], tiling arrays [122] and high-throughput sequencing
[118] provide unprecedented transcriptome-wide quantification, allowing investigators to
identify aberrant and alternative splicing events. Notwithstanding these developments,
transcriptome analysis is still in its infancy [123, 124].
This section covers various themes in alternative splicing. In particular, we address five
broad topics: prevalence, origin and evolution, functions, modulation mechanisms, and
stochasticity. To keep this chapter brief, we only highlight salient points in each section
commencing with one of the most cited examples of alternative splicing.
The Down’s syndrome cell adhesion molecule (DSCAM) homologue in Drosophila exemplifies the astounding potential of alternative splicing with the capacity to generate
a possible 38,016 individual transcript isoforms. Figure 2.21 shows a schematic of this
29
Figure 2.20: Patterns of alternative splicing. (a) Exon skipping/cassette exons most prevalent in higher eukaryotes but low prevalence in plants; (b & c) alternative 3’
splice sites (A3SS) and alternative 5’ splice sites (A5SS) - considered as intermediates
between constitutive and alternative splicing; (d) Mutually exclusive exons; (e) Intron
retention. Image reproduced from [127] with permission.
gene. Dscam has 24 exons, three of which (exons 4, 6, 9 and 17) occur in clusters of
12, 48, 33 and 2 (for a total of 95) mutually exclusive exons. This gene is expressed in
neurons where it is crucial for cell recognition during axonal modeling [125]. Alternative
splicing allows a choice of exons from the clusters while the remaining exons are constitutively spliced. A similar example of a high diversity gene in human is the neurexin 3
(NRXN3 ) gene having 1,728 transcript isoforms [126].
2.3.1
Prevalence
Alternative splicing has only been observed in eukaryotes even though introns occur in
other domains of life. Alternative splicing in unicellular eukaryotes has been extensively
studied in yeast (S. cerevisiae and S. pombe) [128]. Both species differ markedly in their
transcriptomes with S. pombe showing closer resemblance to mammalian transcriptomes.
Alternative splicing in yeast takes place mainly through intron retention because almost
all intron-containing genes have only one intron.
Metazoans display alternative splicing complexity that correlates with genome size [128].
Similarly, the relationship between genome size and number of genes saturates at about
30
Figure 2.21: An example of extreme alternative splicing in the Drosophila Dscam
gene that potentially generates 38,016 transcript isoforms through mutual-exclusive
splicing of four exon clusters (coloured red, blue, green, yellow). Image reproduced
20,000 genes suggesting that increased transcriptome complexity is achieved through
increased alternative splicing [129]. Vertebrates exhibit a greater diversity in alternative
splicing compared to invertebrates [130]. One study examined the rate and characterisation of alternative splicing across vertebrates and invertebrates and found alternatively
spliced exons exhibit longer flanking introns but only for cassette exons (as opposed
to alternative 5’ss (A5SS) and alternative 3’ss (A3SS) exons) [131]. Furthermore, they
found that intron retention appears to be the rarest of all observed alternative splicing
patterns.
2.3.2
Origin and Evolution
To explain the origin and evolution of alternatively spliced exons, it is instructive to
outline their properties. Alternatively spliced exons tend to be short, have flanking splice
sites with weak conservation of splice motifs and have higher enrichment of regulatory
sequences compared to constitutively spliced exons [127]. Their sequences show higher
conservation and their lengths are preferentially divisible by three (suggesting that they
may function as cassettes) [128]. These properties of alternatively spliced exons help
shed light on how they may have emerged.
There are currently three theories that attempt to explain the emergence of alternative
splicing: de novo exonisation, exon shuffling and exon transition [127]. De novo exonisation proposes that previously non-exonic sequence may have acquired exonic status
when cryptic splice sites converted to weak splice sites. The weak splice sites resulted
in the exon being included in a fraction of transcript isoforms. One example of such a
scenario is provided by Alu elements, which are primate-specific repetitive and transposable sequence elements (Fig. 2.22). About 5% of alternatively spliced exons are known
31
Figure 2.22: Exonisation of Alu elements. (a) Mutations surrounding Alu elements
that could give rise to new exons. (b) Transformation of RNA secondary structures by
enzymatic editing of adenosine (A) to inosine (I), which is recognised as guanosine (G).
In the presence of two adjacent Alu elements of opposite orientation can then create a
functional 3’ss (AA to AG). Image reproduced from [127] with permission.
to have originated from Alu elements [132, 133]. Alu elements are also found in introns
and explain a possible route by which these exons may have emerged. Additionally, de
novo alternative exons may have emerged through tandem exon duplication [128, 134].
The duplicated exons may later be subject to different selective pressures leading to
exon specialisation [128].
Exon transition theory proposes that alternatively spliced exons may have evolved from
extant exons in two possible ways: relaxation of splicing motifs or evolution of novel
regulatory proteins. Relaxation of splicing signals (splice sites, BPS or PPT) may have
resulted from splice site mutations influencing exon inclusion. For example, 5’ss differ
between yeast and metazoans with the former having fewer base pairing to U1, while
the latter have base pairing that extends into the exon [28] and are alternatively spliced.
Furthermore, one experiment showed stronger binding of U1 to constitutive versus alternative exons [133]. The second method may involve evolution of regulatory elements
32
that may associate either with exonic, or intronic sequence to negatively influence splicing. Over time these regulatory elements (mostly protein) may then render constitutive
exons alternative [127] (Fig. 2.23).
Exon shuffling relies on how eukaryotic exons typically constitute a miniscule proportion
of genomic sequence, which elevates the likelihood for novel recombination events to
occur predominantly at intronic loci. This introduces opportunities to shuffle exons
thereby introducing phenotypic diversity. The resulting shuffled exons would have been
propagated as alternatively spliced exons to their new resident genes (Fig. 2.24) and
appears to have been predominant in higher organisms due to their abundance of intronic
sequence. Exon shuffling may therefore have appeared much later in evolution and played
an important role in distributing alternatively spliced exons as opposed to creating them
de novo [127].
2.3.3
Functions of Alternative Splicing
As the preceding section has highlighted, alternative splicing has made a substantial
contribution to the evolution of genome complexity particularly in metazoans leading
up to higher eukaryotes. In particularly, alternative splicing introduced a space-saving
means by which genomic information may be regulated [129]. However, this innovation
was introduced at the considerable cost of transcribing non-coding introns as well as
requiring the development of the sophisticated spliceosome. The ability to explore transcript space at low evolutionary risk was a useful invention for introducing new functions
[5].
Nearly 75% of all alternative splicing events occur within the coding sequence, leading
to a direct impact on protein sequence [135, 136]. These can lead to changes in binding
properties, intracellular localisation, enzymatic and signalling activities, protein stability, addition of domains that lead to protein conformational changes and changes in ion
channel properties [137].
Alternative splicing also facilitates developmental programmes and cell differentiation
programmes, which results in tissue specificity. The facility for a single genome to
encode the full range of transcripts required in the course of development and cell differentiation is largely attributable to alternative splicing programmes [138]. For example,
FOXP1 has an alternative splicing event that leads to pluripotency of embryonic stem
cells [139] while brain is a well-studied example of tissue specificity [140]. This feature
of metazoans also reflects the difference in transcriptome complexity between unicellular eukaryotes, which exhibit primitive alternative splicing (predominated by intron
33
Figure 2.23: Exon transition from constitutive to alternative. (A) Degenerate splice
signals that can form (Aa) A5SS of A3SS, (Ab) weak 5’ splice sites leading to skipping,
(Ac) introduction of an exonic splicing suppressor, or (B) secondary RNA structure
that sequester splice signals. Image reproduced from [127] with permission.
34
Figure 2.24: Exon shuffling. Image reproduced from [127] with permission.
retention) and metazoans (show increasing exon skipping dominance with genome complexity) [141].
Other roles of alternative splicing have been extensively discussed in [137] and [142]
and include gene expression regulation by introduction of premature stop codons (PTC)
leading to nonsense mediated decay (NMD) [143], modification of mRNA properties
such as stability [144] as well as the role alternative splicing plays when dysregulated in
diseases such as cancer [145–147].
2.3.4
Modulation of Alternative Splicing
There are four major ways through which alternative splicing is modulated: genetic (at
the sequence level), proteomic, epigenetic (chromatin), and kinetic (coupling to transcription). In many instances, modulation employs several means simultaneously resulting in a splicing code [148], which can be used to predict tissue-specific splicing patterns
based on the set of available modulators.
2.3.4.1
Genetic
There are two main ways in which the nucleotide sequence can serve to modulate splicing:
through sequence motifs and through sequence variants [149]. Since the nucleotide
content of genes is typically unchanged for the duration of an organism’s lifetime, the
main genetic regulatory elements are those that lead to differential splicing without
harmful phenotypes. Genetic regulators typically work in conjunction with proteomic,
epigenetic and kinetic regulators for dynamic modulation of splicing.
Sequence motifs come in two forms: (i) those that influence some dynamic of transcription either through interaction with the spliceosome or an associated factor [150, 151];
(ii) those that influence binding of external elements that have a subsequent effect on
the activity of the spliceosome [89, 152]. The main candidates for (i) are the conserved
35
Figure 2.25: Proteomic modulation of alternative splicing. Serine-Arginine (SR)
and heteronuclear ribonucleoproteins (hnRNP) classes of proteins are predominantly
involved in influencing splicing decisions by binding to exonic/intronic splice enhancers
(ESE/ISE) or silencers (ESS/ISS). Image reproduced from [114] with permission.
sequence elements themselves. In fact, it is by selectively mutating splice sites that a lot
has been learned about the various steps and factors involved in splicing and some of the
sequence variants (e.g. single nucleotide polymorphisms, tandem repeats) may lead to
deviation from the consensus. In these cases, regulation is unintended and often leads to
unwanted consequences (defective transcripts). Also, the SINEs and LINEs mentioned
in Section 2.2.6.2 in conjunction with long introns create RNA secondary structure that
may influence accessibility or efficiency in splicing. Sequence motifs in (ii) are associated
with proteomic regulators and are discussed below.
Of particular interest are sequence variants that are linked to changes in abundance of
individual transcript isoforms. These are referred to as splicing quantitative trait loci
(sQTLs). They may be found proximal to the gene (cis-sQTLs) or may operate in trans
(trans-sQTLs).
2.3.4.2
Proteomic
Proteomic modulators work in conjunction with other components of the splicing system
to positively or negatively influence splicing. Their action may directly influence the
splicing complex or may involve interaction with the pre-mRNA. Their association with
the transcript partitions them into those that interact with exonic or intronic sequence.
Consequently, many proteomic splicing modulators have extensive RNA binding domains
[89].
The most widely characterised exonic splicing enhancers (ESEs) are from the class of
serine-arginine (SR) proteins (Fig. 2.25). ESEs preferentially bind to purine-rich sequences generally found in protein-coding regions bearing an A-rich consensus sequence
36
(GAR)n . ESEs have been reported to be present in almost all exons independent of
whether they are constitutively or alternatively spliced [153]. However, individual enhancer proteins typically may have sharply divergent preferential binding motifs that
depart from this pattern [154]. SR proteins consist of an RNA binding domain (RBD)
and a repeated sequence of arginine/serine (RS) dipeptides, which appear to be interchangeable between SR ESEs [155, 156]. Some examples of proteins in this family are
ASF, SC35, SR 20, SR 30c, SR 40, SR 55 and SR 70 [157]. Long exons (>300 bp)
have been shown to be enriched for ESEs, likely to aid in the exon definition process
[84].
In addition to exonic splicing enhancers are intronic splicing enhancers (ISE), exonic
splicing silencers/suppressors (ESS), and intronic splicing silencers/suppressors (ISS).
An example of an intronic splicing silencer is the polypyrimidine tract binding (PTB)
protein which binds to the PPT and at the 5’ end of the downstream introns [158]. It
base pairs with the transcript then folds to sequester the exon away from the spliceosome.
Exonic splicing silencers are enriched in exons that are flanked by long introns likely to be
spliced by exon definition. They operate by preventing U2AF and U1 from interacting
across the exon leading to exon skipping.
2.3.4.3
Epigenetic
There is a lot of interest on the link between chromatin modifications and alternative
splicing [114, 159]. Epigenetic changes involving histone modifications can have a major
influence on splicing decisions (Fig. 2.26). This is largely due to the fact that nucleosomes are preferentially located at exons and particularly alternatively spliced exons
[29]. Additionally, in relation to RNA pol II kinetics, the speed of transcription at nucleosomes is lower than between nucleosomes favouring higher inclusion because spliceosomal components have more time to aggregate at splice sites [160, 161]. It has also been
demonstrated that some histone acetyltransferases such as Gcn5 in yeast and STAGA
in human physically interact with spliceosomal snRNPs implying that the chromatin
complex may directly influence accurate assembly of the pre-spliceosome [162, 163].
Recent work has shown a link between epigenetic modifications and splicing regulation.
In particular, methylation at CTCF sites has been linked to alternative splicing of
CD45. In this case, methylation at CTCF sites modifies the rate of RNA pol II (see
Section 2.3.4.4), influencing the amount of time available to recognise the 5’ splice site
of intron 5 [90].
37
Figure 2.26: Epigenetic modulation of alternative splicing. (a) Histone modifications
(H3K9ac) leads to an increase in RNA pol II elongation leading to skipping in neural
cell adhesion molecule (NCAM ) exon 18. (b) Histone modification (H3K36me3) at
the fibroblast growth factor receptor 2 (FGFR2 ) recruits polypyrimidine tract binding
(PTB) protein which suppresses an alternate exon. Images reproduced from [114] with
permission.
2.3.4.4
Kinetic
Coupling of transcription and splicing provides kinetic modulation: faster transcription
means that weak splice sites may be ignored leading to exon skipping; slower transcription implies longer time for the spliceosome to form and catalyse splicing (Fig. 2.27). In
this way, kinetic modulation introduces competition between splice sites conditional on
the speed of transcription. One study on co-transcriptionality of splicing in yeast found
that RNA pol II appears to slow down in the last intron to increase the likelihood of its
excision [164]. Similarly, pausing elements such as MAZ4 immediately downstream of
an alternatively spliced exon ensure its inclusion [40].
One recent example involved elucidation of a novel complex found in proximity with
RNA pol II. The DBIRD complex [165] was found to speed up transcription at A+T rich
exons, increasing the likelihood of them being skipped. A+T rich regions are associated
with RNA pol II pausing [166]. When present, DBIRD slowed down RNA pol II leading
to higher inclusion of those exons.
2.3.4.5
Combinatorial Control: The Splicing Code
The mechanisms outlined above typically act in concert implying that they combine
in various ways to exert complex control patterns. Analysis of how various factors
contribute to alternative splicing events has given rise to the notion of a splicing code
[167]. Barash et al. [148] studied the combinatorial effect of multiple splicing regulators
38
Figure 2.27: Kinetic modulation of alternative splicing. (a) SR splicing factor 3
(SRSF3) is controlled by the phosphorylation state of the RNA pol II C-terminal domain (CTD), which then determines inclusion of alternate exons recruitment is affected.
(b) The Mediator, which associates with transcription factors, recruits the splicing suppressor hnRNPL. (c) CCCTC-binding factor (CCTF) binds to unmethylated CG-rich
DNA sequences and slows RNA pol II transcription allowing competitive splicing of a
downstream exon. (d) DBIRD facilitates exclusion of A/T-rich exons by preventing
RNA pol II slowing down at these exons. (e) Kinetic coupling following DNA damage,
which leads to hyperphosphorylation of the RNA pol II CTD further inhibiting RNA pol
II elongation. (f) SWI/SNF (a chromatin remodelling factor) recruits SAM68, which
stalls poll II enhancing exon inclusion. Image reproduced from [114] with permission.
to come up with the code table shown in Figure 2.28. Their assessment centred on exon
inclusion as the main alternative splicing event. However, development of a complete
splicing code will require considerable effort given the tissue- and development-specific
natures of alternative splicing.
2.3.5
Stochasticity of Alternative Splicing
As we mentioned in Section 2.3.3, alternative splicing may allow the cell to explore
a range of transcripts prior to committing to produce them exclusively or at abundant
levels [5]. This evolutionary trial-and-error would require that splicing possesses a degree
of tolerable noise in order to generate novel transcript isoforms. Therefore, splicing is
an inherently stochastic process.
39
Figure 2.28: Splicing code developed for five mouse tissues. Tissue codes: CNS (C),
muscle (M), embryo (E), digestive (D), and tissue-independent mixture (I). The code
is read using the key on the top left of the image. For example, abundant CU-rich
(nPTB ) motifs (fifth row) are strongly determinative of high inclusion of the alternate
exon if within 300 nt upstream. Image reproduced from [148] with permission.
40
Studies on the stochasticity of the splicing process have provided evidence that even
under correct functioning, the spliceosome is tolerable to some degree of noise. For example, Melamud et al. used over 8,000 EST libraries to estimate noise levels using a
stochastic model based on the number of introns and gene expression [168]. They estimated that between 1% and 10% of all transcripts are spliced erroneously. Interestingly,
they found that the number of introns correlates with number of alternative splicing
events (per transcript) but while admitting fewer splicing errors. They also observed a
higher density of ESEs with an increase in the number of splicing reactions suggesting
selective pressure towards less splicing noise. They argue that these observations imply
that low errors are expected in the presence of many introns to ensure that useful transcripts are produced (otherwise, nothing would be produced at all). Additionally, lower
errors could also ensure lower toxicity due to accumulation of aberrant transcripts at
high expression.
Pickrell and colleagues used RNA-Seq data to assess splicing error rates and characterised over 150,000 previously unannotated splice sites conforming to consensus splice
sites (GT-AG and GC-AG) but lacking evolutionary conservation [169] (Fig. 2.29). They
estimated that the spliceosome admits about 0.7% error per human intron. They found
that longer introns tend to have a higher rate of mis-splicing and that highly expressed
genes have lower errors perhaps due to their shorter introns spliced via intron-defined
pathway unlike those with higher noise that bore resemblance to exons spliced via exondefined pathway and were enriched for ESEs.
In a recent analysis of yeast transcriptome using transcript isoform sequencing (TIFSeq), Pelechano and colleagues demonstrated remarkable transcript diversity even in an
organism with very little alternative splicing [170]. TIF-Seq involves high-throughput
sequencing of captured transcript ends revealing the diversity in transcription start and
end sites. It will be intriguing to see what results TIF-Seq (or comparable methods)
produce from human transcriptomes.
2.4
2.4.1
Measurement of Transcript Expression
Genes
The steady state abundance of products of gene transcription is an important measure
that can be used to infer the endogenous state or response of a cell under various conditions. Identifying differentially expressed genes is a powerful approach to determine
their functions. This has elevated gene expression measurement to a major biotechnological and bioinformatic challenge leading to remarkable innovation in developing high
41
Figure 2.29: An illustration of noisy splicing on the HERPUD1 locus. The top level
shows average expression level at each position. Green marks exonic regions and black
marks intronic regions. Middle panel shows identified exons from junction mapping
RNA-Seq reads. Black bars are annotated exons; red bars are not annotated. Bottom
panel shows Ensembl transcript isoforms. Image reproduced from [169].
resolution and accurate expression assays. Expression assays may be divided into quantitative and non-quantitative methods reflecting advances in the underlying technology.
Each class of methods may be further split into targeted [171] or global [172] techniques,
whereby targeted techniques probe a handful to thousands of genes while global approaches attempt to probe all genes in the genome of an organism. However, for a gene
that undergoes alternative splicing, gene expression estimates may be uninformative particularly if the overall expression remains unchanged even when the transcription profile
changes [173]. This is shifting attention towards measurement of transcript isoforms.
42
Figure 2.30: A schematic showing various tools available for detecting and measuring
splicing events using RNA-Seq. Image reproduced from [175].
2.4.2
Transcript Isoforms
As the preceding sections have developed, alternative splicing plays an important role
particularly in metazoans, giving rise to the need to identify and estimate the abundance of all transcript isoforms (Fig. 2.30). The shift towards a transcript-isoformcentric paradigm has been slow, possibly hampered by the sharp drop in the cost of
microarrays for which well-developed analytical approaches abound; on the bright side,
there has been a correspondingly sharp drop in the cost of more appropriate approaches
like high-throughput sequencing (http://www.genome.gov/sequencingcosts/) as the
technology evolves towards greater accuracy and reproducibility [174].
Whereas measurement of gene expression is often based on well-defined gene models,
transcriptome complexity routinely integrates transcript isoform discovery in addition
43
to estimating the abundance of each transcript isoform. The stochastic nature of alternative splicing makes identifying and measuring transcript isoforms especially challenging.
Therefore, a middle-ground has been devised that entails detection of alternative splicing
events [176].
This section outlines the three main aspects of transcript isoform measurement: detection
of alternative splicing events, identification of transcript isoforms, and measurement of
transcript isoform abundance. This discussion will interchangeably refer to microarray
and high-throughput sequencing methods.
2.4.2.1
Detection of Alternative Splicing Events
Detection of alternative splicing events usually involves assessing part of a gene associated with the transcript isoform of interest. Typically, this will be the exon though in
organisms that show a preponderance of intron-retention (such as plants) introns may
be used [141]. For example, consider a simple case of a gene with a cassette exon and two
transcript isoforms. The presence or absence of one or more transcript isoforms will be
indicated by the relative expression of this exon. This can be assessed by a test between
experimental groups for the normalised expression of that exon. The advent of highdensity oligonucleotide arrays has been instrumental in availing affordable, genome-wide
access to such analyses. Another method to infer splicing events is by using junction
tiling arrays [177], which check the presence of predefined splice junctions.
Some widely used measures used in alternative splicing detection (some of which are
used in later analytical chapters) are:
1. Differential inclusion. The most straightforward approach involves a direct
comparison of normalised exon expression. Exon inclusion is measured as the
exon signal normalised by the gene signal. Thereafter, a simple test can be used
to compare exon inclusion levels between two samples (e.g. using a t-test or more
advanced statistical test). Because the number of exons and hence tests is typically
large, one must control the false discovery rate. Additionally, one may employ
strict filtering to consider only exons that have strong evidence of expression. For
Affymetrix arrays, this is done using the detection above background (DABG)
metric. We apply this method to study aberrant splicing in Chapter 4.
2. Splicing index. The splicing index employs the idea of gene expression fold
change to compare the inclusion between two groups. It is computed as,
SI =
44
p1
g1
log2 pg22
log2
,
where pi is the probe set signal and gi is the metaprobe set signal in groups
i = 1, 2. Typically, absolute values greater than one suggest differential inclusion
of the targeted exon. The splicing index is normally used in conjunction with
statistical tests such as conventional t-tests or modified t-tests [178]. One limitation
of the splicing index is that it only applies to two-sample data. An extension of
the splicing index is microarray detection of alternative splicing (MIDAS), which
performs an ANOVA on the splicing index.
3. Percent/Proportion Spliced In. One powerful advantage of high-throughput
sequencing is that it is able to give a snapshot of actual splicing events by using
splice junctions. When reads are mapped back to the organism’s genome, the
presence of gaps that coincide with introns captures the set of splicing events that
occurred. This is useful in also identifying novel splice variants that are absent
from available gene models. The simplest method to discern splicing events is by
the percent/proportion spliced in (PSI) given by
Ψ=
no. of reads that include an exon
no. of reads that include OR exclude the exon
The PSI is a single-sample metric (as opposed to differential inclusion or splicing index) because it allows an assessment of alternative splicing within a single
sample.
2.4.2.2
Identification of Transcript Isoforms
Complete mapping of a complex organism’s transcriptome is a perpetual task because of
the diversity of tissues brought about by elaborate developmental programmes. Therefore, it is necessary to incorporate a search for novel transcript isoforms into the analysis.
Microarrays have limited applicability to this task making room for sequencing-based
techniques [179].
There are two ways in which a transcriptome may be assembled to unravel the set of
isoforms. In a guided approach, transcriptome assembly leverages a reference annotation
to infer novel transcripts. For example, Cufflinks uses a graph theoretic approach by
which it attempts to construct the most parsimonious assembly that the read fragments
support [180, 181].
Another method involves ab initio transcriptome assembly, where evidence for transcript
isoforms is based on structural motifs found in aligned reads (e.g. transcription start
sites (TSS), exon-intron boundaries etc). Oases accomplishes this by building an array of
hash lengths (where each hash is a k-mer) followed by a filtering step in which noisy data
45
(lacking structural motifs) are iteratively eliminated. It also considers various alternative
splicing events (e.g. cassette exons) produced from multiple primary assemblies [182].
2.4.2.3
Measurement of Transcript Isoform Abundance
Transcript quantification is currently an active area of research and new tools continue
to be developed as sequencing technology improves and sequencing costs drop. A recent survey examined the performance of 14 tools and concluded that transcriptome
assembly is still a challenging task [174]. Just as in transcriptome assembly, abundance
disambiguation may proceed either with a reference annotation or may be ab initio.
Several tools have also been developed that can infer transcript isoform abundance
from Affymetrix exon arrays, including MMBGX (based on multi-mapping probes and
Bayesian hierarchical models) [183], PUMA 3.0 [184], and SPACE [185]. However, these
have been superceded by sequence based approaches.
The majority of transcript disambiguation methods employ an expectation maximisation
method on some generative model. One of the first such models was proposed by Jiang
and Wong [186] upon which iReckon [187], IsoEM [188], IsoLasso [189], RSEM [190], among
many others have built. Others such as mGene [191] use discriminative models based on
a hidden semi-Markov SVM [192], while others like SLIDE [193] rely on sparse matrix
linear models that consider transcript structure (exon boundaries) with the possibility
of incorporating alternative transcriptomic data (e.g. EST) to augment the transcript
discovery step prior to quantification.
By far, the most popular tool is Cufflinks, which may be guided or ab initio. In both
approaches it uses a likelihood model for the alignment of fragments (determined using
a gapped aligner such as TopHat [194]) to assign abundances. It then uses numerical
optimisation to obtain the maximum likelihood transcript abundances.
In Chapter 5, we introduce a novel approach to gene and transcript expression estimation
that uses both high-throughput sequencing data and microarray data in conjunction
with a statistical learning algorithm. Using a high-quality, public data resource from
the Genotype-Tissue Expression (GTEx) project, our method learns the relationship
between probe intensities and gene expression as estimated by a gold-standard method
(e.g. RNA-Seq) then predicts gene expression from probe intensities alone.
46
Chapter 3
Evidence for intron length
conservation in a set of
mammalian genes associated with
embryonic development
The content of this chapter was published as:
Cathal Seoighe and Paul K Korir “Evidence for intron length conservation in a set of
mammalian genes associated with embryonic development.” BMC Bioinformatics, 12
(Suppl. 19):S16, 2011.
PKK carried out analyses and co-wrote the manuscript.
Abstract
Background:
We carried out an analysis of intron length conservation across a diverse
group of nineteen mammalian species. Motivated by recent research suggesting a role for
time delays associated with intron transcription in gene expression oscillations required
for early embryonic patterning, we searched for examples of genes that showed the most
extreme conservation of total intron content in mammals.
Results:
Gene sets annotated as being involved in pattern specification in the early
embryo or containing the homeobox DNA-binding domain were significantly enriched
among genes with highly conserved intron content. We used ancestral sequences reconstructed with probabilistic models that account for insertion and deletion mutations to
47
Intron Length Conservation
distinguish insertion and deletion events on lineages leading to human and mouse from
their last common ancestor. Using a randomization procedure, we show that genes containing the homeobox domain show less change in intron content than expected, given
the number of insertion and deletion events within their introns.
Conclusions:
Our results suggest selection for gene expression precision or the ex-
istence of additional development-associated genes for which transcriptional delay is
functionally significant.
3.1
Introduction
One of the salient features of eukaryotic genomes is the pervasive presence of introns,
comprising up to 95% of transcribed primary protein-coding sequences in mammals
[195]. The functions, origins and evolutionary trajectory of introns have long been of
great interest in genomics and genome evolution. Although some theories of the spread
of introns postulate that they can accumulate passively as a consequence of insufficient
purifying selection to remove them in organisms with relatively low effective population
sizes [196, 197], introns have been shown to play a number of functional roles [198].
They give rise to the possibility of alternative splicing, contributing to the diversity of
biomolecules, and they can also have a substantial impact on levels of gene expression
[199–201]. Introns contain most of the sequence features required for splicing (5’ and
3’ splice sites, branch point sites (BPS), poly-pyrimidine tracts and various intronic
splicing elements). Many introns also contain functional non-coding RNAs [198], which
can play critical roles in fine-tuning gene expression [202]. Lastly, introns have been
proposed to control the timing of gene expression by delaying transcription [203].
Negative feedback loops with a time delay can result in oscillating patterns of gene
expression, which can be exploited by living organisms as biological time-keeping devices.
This appears to be particularly important in development [203]. The hairy and enhancer
of split 7 (Hes7 ) gene is involved in the control of somite formation through oscillatory
patterns of gene expression [204]. Recently, Takashima et al. [205] investigated in
vivo the impact of removing the introns from the mouse Hes7 gene. They found that
expression of the mutant Hes7 gene occurred approximately 19 minutes earlier than for
the wild-type and that this reduction in gene expression delay resulted in abolition of
oscillations and segmentation defects.
Given that the delay associated with transcribing introns appears to play a crucial
role in the functioning of this gene, the length of the introns may be evolving under
48
Figure 3.1: Phylogeny. Phylogenetic tree illustrating relationships of the 19 mammal included in this study.
purifying selective pressure. Moreover, further examples of genes with highly conserved
intron content across species may reveal more genes for which transcriptional delay is
an important aspect of gene regulation. We investigated the conservation of the intron
content of Hes7 across a diverse set of nineteen mammals and carried out a search for
other genes for which total intron length shows evidence of evolutionary conservation.
In such cases introns are likely to play an important functional role and, in a subset,
time delays associated with transcribing introns may be a significant aspect of gene
regulation.
3.2
Data and Methods
Complete sets of gene models were downloaded from Ensembl release 62 [206] via
BioMart for 19 mammalian species. These were Homo sapiens, Bos taurus, Canis familiaris, Ornithorhynchus anatinus, Equus caballus, Erinaceus europaeus, Gorilla gorilla,
Loxodonta africana, Monodelphis domestica, Myotis lucifugus, Microcebus murinus, Mus
musculus, Oryctolagus cuniculus, Sus scrofa, Spermophilus tridecemlineatus, Tarsius
syrichta, Tursiops truncates, Vicugna pacos and Felis catus. Species were selected to
sample a broad range of mammalian evolutionary history. A phylogenetic tree of these
taxa, obtained from the interactive tree of life [207, 208] is provided for illustrative
purporses (Fig. 3.1).
49
For each Ensembl gene in each species we considered all annotated exons and calculated
the total intronic content of the gene as the sum of the gaps between non-overlapping
successive exons in the canonical transcript associated with the gene. Genes were divided
into four classes, depending on the total intron content of the gene (1–5 kb, 5–15 kb,
20–50 kb and 50–100 kb). Orthologous groups of mammalian proteins were downloaded
from OrthoDB [209]. We extracted the protein identifiers from OrthoDB corresponding
to each of the mammalian species included in the study. Where more than one protein
from a species was included in the same orthologous group, we selected one paralogue
at random. For each orthologous group, with a representative in at least half of the
mammalian species considered we determined the number of species in which the intron
content, as defined above, was within 10% of the length of the intron content of the
human gene. This was done separately for genes in different intron content classes.
Functional analysis of genes with conserved intron content was carried out using DAVID
[210, 211]. For each size class, the genes for which the intron content showed evidence
of conservation was used as the foreground set and the complete set of genes in the size
class was used as the background set.
Genomic multiple sequence ancestor alignments, inferred using the Ensembl EnredoPecan-Ortheus pipeline [212], were downloaded from the comparative genomics section
of the Ensembl FTP site. Insertions and deletions were inferred for ancestral sequences
included in these alignments using a branch transducer method that has been shown
to achieve high accuracy [213]. This allowed insertions and deletions to be placed on
branches of the mammalian phylogeny. For reconstructed insertion and deletion events
we focussed on the branches leading from the common ancestor of the euarchontoglires
(which includes rodents and primates) to humans and mouse. All insertion and deletion
events occurring within introns (at least 20 bp from exon-intron) boundaries were identified, based on Ensembl gene models. In this case, intronic indels were defined as indels
that occur within the boundaries of the gene but not within 20 bp of any annotated
exon associated with the gene.
3.2.1
Randomization study of intron length evolution along the human
lineage
For each intron size class we used the complete set of insertion and deletion events within
the introns to define the null expectation of events in that intron size class. To test for
evidence of purifying selection acting on the intron content of a gene or a set of genes
we considered all of the insertion and deletion events in the gene(s) under consideration
and for each event sampled an insertion or deletion from the total set of events in the
introns of genes in that class. Given a set of insertions and deletions randomly sampled
50
in this way we calculated the change in intron content implied by this set of insertions
and deletions (sum of insertion lengths minus the sum of the lengths of the deletions),
separately along the human and mouse lineages. This was repeated 10,000 times and in
each case the implied change in intron content in the randomized data was compared to
the change in intron content, given the actual insertions and deletions inferred to have
occurred. The number of times the absolute value of the change in intron content in the
randomized data was less than or equal to the absolute value of the change in intron
content in the observed data was used to calculate a p-value, with a separate p-value
calculated for the human and mouse lineages. These p-values indicate the probability
of observing as little or less variation in the intron content of a gene or set of genes,
given the number of insertion and deletion events that have occurred and under the
assumption that these indel events have the same distribution as events in other genes
in the same intron size class.
3.3
Results and Discussion
Delayed expression of Hes7 resulting, at least in part, from intron transcription has been
proposed to play a key role in somite formation in animal embryos by establishing an
oscillating pattern of gene expression [205, 214–216]. Given the important role played
by the length of the introns rather than their specific sequence content in this gene, we
compared the combined length of Hes7 introns across a diverse set of mammalian species
(Fig. 3.1) to determine the extent to which the intron content of this gene is conserved
over evolutionary time. For each orthologue, the sum of the distances between successive
exons in the canonical transcript (according to Ensembl) was calculated. Despite the
fact that there may be differences in gene annotation across species the intron content of
most of the orthologues was similar. Nine out of the twelve of the available orthologues
of Hes7 among the mammalian species in our dataset differed in combined intron length
from the human gene by less than 10% (Table 3.1).
51
52
Common name
Gene
Human
ENSG00000179111
Mouse
ENSMUSG00000023781
Rabbit
ENSOCUG00000002606
Dog
ENSCAFG00000016957
Cat
ENSFCAG00000015190
Horse
ENSECAG00000013278
Little brown bat
ENSMLUG00000008014
Hedgehog
ENSEEUG00000006336
Pig
ENSSSCG00000017982
Bottlenose dolphin
ENSTTRG00000010312
African elephant
ENSLAFG00000002331
Grey short-tailed opossum ENSMODG00000007704
Platypus
ENSOANG00000022456
1 Cumulative intron length
#introns
3
3
3
3
3
3
3
3
4
3
3
3
2
CIL1
1839
1846
1696
1823
1958
1804
1810
2774
1550
1827
1870
2364
2018
Table 3.1: Conservation of Hes7 intron length. Intron content of orthologues of human Hes7 (OrthoDB ID EOG45TDC4). The tarsier
orthologue was omitted. This gene had no annotated introns but the annotation appeared to be incomplete.
Species
Homo sapiens
Mus musculus
Oryctolagus cuniculus
Canis familiaris
Felis catus
Equus caballus
Myotis lucifugus
Erinaceus europaeus
Sus scrofa
Tursiops truncatus
Loxodonta africana
Monodelphis domestica
Ornithorhynchus anatinus
To determine whether intron content, and thus potentially transcription time, of Hes7
was exceptionally conserved compared to other genes and to discover additional examples
of genes for which there is evidence of intron content conservation we compared intron
content (defined as the sum of the lengths of all introns in the canonical transcript)
between human gene models from Ensembl and sets of orthologous genes, obtained from
OrthoDB [209]. Because the pattern of insertion and deletion may differ between very
large and smaller introns we restricted to genes with introns of similar lengths to Hes7
(specifically we considered genes for which the sum of the intron lengths in the canonical
transcript was between one and five kilobases). Restricting also to orthologous groups
represented in at least half of the species, only 11 other orthologous groups (from a total
of 1875 satisfying these criteria) had intron content within 10% of the human gene in as
high a proportion of the orthologues as the Hes7 group. Remarkably, of the corresponding 12 human genes (including Hes7 ), five are annotated with the gene ontology (GO)
term GO:0007389 (pattern specification process) from the biological process component
of GO, defined as any developmental process that results in the creation of defined areas
or spaces within an organism to which cells respond and eventually are instructed to
differentiate. Using the DAVID functional annotation tool [210, 211], this is the most
statistically enriched GO term (p = 9.4 × 10−5 ) and remains significant following correction for multiplicity of testing using the Benjamini-Hochberg method (FDR = 0.03).
Six of the genes were annotated with the Swissprot and Protein Information Resource
(SP-PIR) keywords term developmental protein (FDR = 0.003).
We also used a more relaxed conservation threshold and identified a larger set of 144
genes with relatively conserved intron content among the genes with between one and
five kilobases of intron. Genes for which more than 50% of the orthologues have intron content within 10% of the intron content of the human member of the group were
included in this group, again restricting to orthologue groups represented in at least
half of the species. We again found evidence for a set of genes with conserved intron
content which was very highly enriched for genes involved in development, identified
using DAVID (Table 3.2). Only the top 20 enriched terms are shown in Table 2. These
include terms corresponding to the presence of homeobox DNA-binding domains, frequently involved in developmental gene regulation. To look for examples of conserved
intron content among genes with higher intron content, we separated the human genes
into five intron content size classes (1–5 kb, 5–20 kb, 20–50 kb and 50–100 kb), to account, to some extent, for the fact that patterns of intron length evolution may differ
between genes with larger versus smaller introns. The proportion of genes from the
other size classes (other than size class one, discussed above) that showed evidence of
conserved intron content was smaller and we found no evidence for enrichment for the
default DAVID gene sets (after multiplicity correction).
53
Category
UP SEQ FEATURE
INTERPRO
SP PIR KEYWORDS
INTERPRO
GOTERM BP FAT
Term
DNA-binding region:Homeobox
Homeobox, conserved site
Homeobox
Homeobox
Regulation of transcription,
DNAdependent
INTERPRO
Homeodomain-related
GOTERM BP FAT
Regulation of RNA metabolic process
GOTERM BP FAT
Positive regulation of transcription from
RNA polymerase II promoter
GOTERM BP FAT
Positive regulation of biosynthetic process
GOTERM BP FAT
Positive regulation of cellular biosynthetic
process
GOTERM MF FAT
Transcription regulator activity
GOTERM BP FAT
Positive regulation of gene expression
GOTERM BP FAT
Positive regulation of transcription, DNAdependent
GOTERM BP FAT
Positive regulation of RNA metabolic process
GOTERM BP FAT
Positive regulation of macromolecule
biosynthetic process
GOTERM BP FAT
Positive regulation of macromolecule
metabolic process
GOTERM BP FAT
Positive regulation of nitrogen compound
metabolic process
GOTERM MF FAT
Sequence-specific DNA binding
SMART
HOX
GOTERM BP FAT
Positive regulation of transcription
1
Benjamini-Hochberg FDR corrected P -values
Count
20
20
20
19
41
P-value
2 × 10−10
9 × 10−10
2 × 10−9
5 × 10−9
7 × 10−9
FDR (BH)1
8 × 10−8
2 × 10−7
4 × 10−7
6 × 10−7
1 × 10−5
19
41
19
1 × 10−8
1 × 10−8
1 × 10−8
8 × 10−7
8 × 10−6
6 × 10−6
27
27
3 × 10−8
3 × 10−8
1 × 10−5
1 × 10−5
38
24
21
7 × 10−8
8 × 10−8
8 × 10−8
2 × 10−5
2 × 10−5
2 × 10−5
21
8 × 10−8
2 × 10−5
25
9 × 10−8
2 × 10−5
27
1 × 10−7
2 × 10−5
25
1 × 10−7
2 × 10−5
25
19
23
1 × 10−7
2 × 10−7
2 × 10−7
2 × 10−5
8 × 10−6
3 × 10−5
Table 3.2: Gene set enrichment analysis. Gene set enrichment analysis of genes
in size class 1 showing evidence of intron content conservation in mammals. Top twenty
most significantly enriched terms are shown.
3.3.1
Intronic insertions and deletions along lineages leading to human
and mouse
To investigate evolution of intron length through insertion and deletion in greater detail
we considered changes in intron length along the human and mouse lineages following
divergence from their last common ancestor. Genomic sequences at ancestral nodes of
the mammalian phylogeny were obtained from Ensembl. These sequences were inferred
using a probabilistic method that has been shown to have high accuracy for the inference
of insertion and deletion events. We mapped insertion and deletion events to introns.
Conservation of intron content could result from the absence of insertion and deletion
events or from balance between insertion and deletion. For the 11 genes that showed as
much conservation as Hes7 across the mammalian panel, we found qualitative support
for both effects. The total amount of insertion and deletion was lower in these genes but
there were also some genes with substantial insertion and deletions in balance (Fig. 3.2).
This was particularly evident in the case of indels along the lineage leading to mouse,
54
Figure 3.2: Comparison of insertions and deletions. The sum of the insertions
and deletions that have taken place in genes in size class one along the human (a)
and mouse (b) lineages since divergence from their common ancestor. Blue points,
corresponding to the genes that showed as much or more intron content conservation
than Hes7 are shown against a background of grey points, corresponding to all genes
in the size class. All axes are truncated, but the truncated axes include all of the blue
points. Grey points lying with cumulative insertions or deletions greater than 200 bp
for human or 2000 bp for mouse are not shown.
however, as these genes were selected on the basis of their conservation in a panel that
included humans and mouse this result should be considered to be illustrative only.
We also identified the complete set of genes in size class one that were annotated with
a homeobox domain by Interpro [217]. To investigate whether the change in intron
content since the common ancestor of human and mouse was less than expected, given
the inferred numbers of indels we used a randomization procedure (described in Data
and Methods). We applied this procedure to the set of homeobox domain proteins and
found that the change in intron content of these genes along both the human and mouse
lineages was far less than expected, given the number of indels inferred to have occurred
in each lineage and under the null assumption that the ratio of insertions and deletions
and size distribution of these events did not differ for these genes compared to the rest of
genes in the size class. In the case of human, the mean absolute change in intron content
was 450 bp per gene, whereas the median of this value in the randomized data was 846
55
bp per gene. The randomized values were less than or equal to the observed value in 87
of 10,000 randomizations (one-tailed p = 0.009). In mouse the mean absolute change in
intron length was just 9 bp per gene. The median of this quantity across randomizations
was 398 bp per gene. The values in the random data were lower than or equal to the
observed data in 33 of 10,000 randomizations (one-tailed p = 0.003).
3.3.2
Contribution of conserved sequence elements to intron length
conservation
From our results it appears that there is selection to prevent large changes in intron
content in specific sets of genes, notably in genes involved in development and especially
in early embryonic patterning. This selection could result from the presence of cis or
trans (e.g. non-coding RNAs) regulatory elements within introns of these genes, which
would be disrupted by insertions and even more so by deletions. However, the presence
of regulatory elements would tend to increase the intron length [218], yet we found good
evidence for classes of genes with well conserved intron content only among the genes
with lower intron content. To test the contribution of conserved functional elements
within intronic sequences on the conservation of intron content we obtained conservation
scores, calculated using phastCons, from the UCSC genome database. phastCons scores
for the intronic regions of Hes7 are shown as Figure 3.3. While there is evidence of
functional elements these are relatively few and short and seem unlikely to prevent
changes in intron length over evolutionary time. More quantitatively, we compared
phastCons scores, averaged across all intronic positions, between the 144 genes with
evidence of conserved intron content and the remainder of the genes in the smallest
intron size class. The difference in mean phastCons scores was not significant (p = 0.33,
Wilcoxon rank sum test). The difference in the proportion of conserved sites (with
phastCons scores > 0.95) was also non-significant (p = 0.69), suggesting that a greater
proportion of conserved intronic functional elements is not the cause of the intron length
conservation in these genes. However, we cannot rule out that conserved sequence
elements contribute to the conservation of the intron content.
To explain our observations, conservation acting on functional elements would, perhaps,
have to be balanced with selection for rapid induction, as rapidly induced genes have been
shown to have short introns [219]. If the conservation of intron content is a consequence
of balance between conservation of intronic functional elements and selection either for
efficiency [220] or for rapid induction, this balance appears to be particularly significant
for development-associated genes. For these genes, expression precision at the critical
developmental junctures in which patterns are established in the developing embryo
may be particularly crucial. An intriguing role for introns that has recently gained
56
Figure 3.3: phastCons scores across introns of the Hes7 gene
experimental support is in the establishment of oscillating patterns of gene expression
through the control of time delays in expression [205]. These authors attributed a delay
of approximately 19 minutes to the presence of introns. However, the introns of Hes7
are relatively short (< 2 kb) and RNA polymerase II processes nucleotides at a rate in
the order of 2 kb per minute [221, 222]. This may suggest either the existence of other
functional elements within the intron that caused the delay in transcription or specific
properties of the intron that result in an exceptionally slow transcription rate.
3.4
Conclusion
Some theories of intron evolution have focussed on the energy cost associated with introns. Reduced intron lengths in highly expressed genes were proposed to result from
selection for efficiency [220]. In support of this model, greater selective efficiency associated with larger effective population sizes of unicellular versus multicellular organisms,
is associated with shorter introns [196, 197]. Alternatively longer introns in highly and
ubiquitously expressed genes may be associated with greater regulatory complexity and
the preservation of regulatory elements in introns [218]. Selection to reduce energetic
57
costs of transcription and for rapid induction of some genes as well as selection to preserve regulatory elements in highly regulated genes may all be important factors for
intron evolution. However, comparison of genes with very conserved intron contents
carried out here, suggests that there may be a substantial number of genes for which the
size of the primary transcript is an important and conserved feature. These genes are
enriched for key developmental processes, particularly the establishment of early embryonic patterns. Since time delays in gene induction have been shown to be important
for some such genes, we propose that selection to conserve transcription time is an important factor in the evolution of intron lengths, particularly in development-associated
genes. Our analysis involved a combination of a heuristic examination of intron content
across a panel of mammalian species and a statistical randomization approach, which
does not take into account all of the factors that may affect the fixation probabilities
of insertions and deletions in introns. A better understanding of the evolution of intron
sizes through insertion and deletion requires the development of evolutionary models of
intron length evolution, though this is challenging because of the diversity of insertion
and deletion events and the difficulty in modelling the constraints imposed by functional
elements within the introns on the occurrence and size distribution of these events. If
an appropriate model of neutral evolution by insertion and deletion can be derived,
such a model could be used to identify and quantify purifying selection acting on intron
lengths and, perhaps, to discover examples of positive selection acting on changes in
intron length over a phylogeny or specific branches of a phylogenetic tree.
58
Chapter 4
A mutation in a splicing factor
that causes retinitis pigmentosa
has a transcriptome-wide effect
on mRNA splicing
The content of this chapter has been submitted as:
Paul K Korir, Lisa Roberts, Raj Ramesar and Cathal Seoighe “A mutation in a
splicing factor that causes retinitis pigmentosa has a transcriptome-wide effect on
mRNA splicing .” under review in BMC Research Notes.
PKK carried out the bioinformatic analyses and drafted the manuscript.
Abstract
Background: Substantial progress has been made in the identification of sequence elements that
control mRNA splicing and the genetic variants in these elements that alter mRNA splicing (referred
to as splicing quantitative trait loci – sQTLs). Genetic variants that affect mRNA splicing in trans
are harder to identify because their effects can be more subtle and diffuse, and the variants are not
co-located with their targets. We carried out a transcriptome-wide analysis of the effects of a mutation
in a ubiquitous splicing factor that causes retinitis pigmentosa (RP) on mRNA splicing, using exon
microarrays.
Results: Exon microarray data was generated from whole blood samples obtained from four individuals with a mutation in the splicing factor PRPF8 and four sibling controls. Although the mutation has
no known phenotype in blood, there was evidence of widespread differences in splicing between cases and
controls (affecting between 10% and 25% of exons). Most probesets with significantly different inclusion
59
Transcriptome-wide Aberrant Splicing
(defined as the expression intensity of the exon divided by the expression of the corresponding transcript)
between cases and controls had higher inclusion in cases and corresponded to exons that were shorter
than average, AT rich, located towards the 5’ end of the gene and flanked by long introns. Introns
flanking affected probesets were particularly depleted for the shortest category of introns, associated
with splicing via intron definition.
Conclusions: Our results show that a mutation in a splicing factor, with a phenotype that is
restricted to retinal tissue, acts as a trans-sQTL cluster in whole blood samples. Characteristics of the
affected exons suggest that they are spliced co-transcriptionally and via exon definition.
4.1
Introduction
Splicing sits at the intersection between transcription and translation and directly regulates both the abundance and diversity of transcripts. It is effected by the spliceosome,
a macromolecular complex composed of uridine-rich snRNAs (U1, U2, U5, and U4/U6
duplex for the U2-type spliceosome) and numerous proteins [47], which systematically
assemble at conserved sequence motifs on newly-transcribed RNA. The spliceosome undergoes a series of structural rearrangements to catalyse two transesterification reactions
ligating adjacent exons [49]. The PRPF8 protein mediates most spliceosomal interactions as it interfaces with splice sites on the transcribed RNA, snRNAs, and other
proteins. Over the years, mapping of PRPF8 protein domains has revealed numerous
splicing-related roles. For example, the 3’ fidelity region attenuates the impact of 3’
splice site (3SS) mutations [223]. Also, the C-terminal domain (CTD) of Prp8p (yeast
homologue of PRPF8 ) interacts with Snu114p and Brr2p (another protein that interacts
with the CTD) to unwind and release the U4 snRNA, which activates the spliceosome
[224–228]. Recent evidence suggests a direct role in splicing through catalysis of the
transesterification reactions [229]. It is therefore unsurprising that mutations in PRPF8
have been shown to affect splicing efficiency in yeast [230], mouse [231, 232], zebrafish
[233] and human [234].
Genome-wide association studies (GWAS) have revealed a large number of genetic variants that are associated with diseases and phenotypes but for many of these associations
the exact causal mechanism remains unknown. A large proportion of the associations
are likely to be mediated by the effects of polymorphic variants on gene expression. In
support of this view, GWAS results have been found to be enriched for expression quantitative trait loci (eQTLs) [235]. Genetic variants may also affect transcript processing.
The contribution of variants that affect mRNA splicing, in particular, is thought to be
60
substantial [236, 237]. Although many studies have assessed the impact of genetic variants acting in cis on mRNA splicing [236, 238–242] variants that act in trans have been
less extensively studied [243].
Genetic variants affecting components of the spliceosome can have trans-acting effects
on splicing. In particular, maturation defects, structural malformations and splicing
factor mutations can lead to aberrant transcripts due to mis-splicing, resulting in genetic
diseases [147, 244]. For example microencephalic osteodysplastic primordial dwarfism
type I (MOPDI), also known as Taybi-Linder Syndrome (TALS), is caused by mutations
in the minor spliceosome resulting in incomplete splicing of a small number of critical
transcripts [245, 246]. MOPDI is characterised by gross developmental retardation and
a lifespan of under one year. Mutations affecting components of the spliceosome can
also have a more restricted phenotype, such as in the case of splicing factor associated
retinitis pigmentosa (RP). RP is a broad spectrum of eye diseases featuring gradual
degeneration of rod and cone photoreceptors, causally associated with mutations in over
100 genes and may be autosomal dominant, autosomal recessive or X-linked [247, 248]. A
subset of autosomal dominant mutations are located on components of the spliceosome
PRPF3, PRPF6, PRPF8, and PRPF31 [247, 249]. The progression of the disease is
marked by night blindness and tunnel vision, and in later stages, complete blindness.
However, no adverse phenotypes in non-retinal tissues are known [243, 247].
There are two hypotheses that could explain how mutations in splicing factors result
in RP [234]. First, because the pathology of the disease is restricted to the retina,
RP may be the result of aberrant splicing of a transcript isoform that is specific to
retinal tissue. Recently, transcriptome analysis of retinal tissue in a mouse model of
RP revealed a large number of novel and/or aberrant transcripts [232]. This hypothesis
does not rule out the second possibility: that these mutations in core components of the
spliceosome increase the transcriptome-wide rate of splicing error and that retinal tissue
is particularly sensitive to this increased rate of mis-splicing. Intermediates between
these two extremes are also possible.
Here we test the hypothesis that a mutation in a splicing factor that causes RP leads to
transcriptome-wide splicing defects in a non-retinal tissue. We used exon microarrays
to profile transcript expression in whole blood samples obtained from RP-PRPF8 individuals bearing the p.H2309R mutation in the splicing factor PRPF8 [248] and paired
sibling controls. For a large proportion of exons we found evidence of differential inclusion of the exon in mature transcripts of cases compared to controls. Differentially
spliced exons were disproportionately associated with longer flanking introns and likely
to be spliced through exon, rather than intron definition.
61
4.2
Methods
4.2.1
Exon array sample preparation and data generation
This project was approved by the University of Cape Town Research Ethics Committee
(REC REF: 180/2009), which is in compliance with the guidelines of the declaration
of Helsinki. Written informed consent for participation in the study was obtained from
participants.
Blood from five individuals carrying the autosomal dominant RP mutation p.H2309R
(referred to as cases) together with that from unaffected sibling controls was collected
and preparation coordinated to ensure equal sample incubation times. Three blood
TM
samples per subject were collected into PAXgene
tubes (PreAnalytiX), according to
the manufacturer’s instructions, and incubated at room temperature. RNA extractions
were performed 16 hours (A sample), 20 hours (B sample) and 39 hours (C sample) after
TM
collection. Total RNA was extracted using the PAXgene
Blood RNA kit, following
the manufacturer’s recommended protocol. The RNA samples were not heat denatured
following elution, but immediately stored at -80◦ C.
R
The A and B RNA isolations of each sample were pooled, and the Affymetrix GeneChip
Blood RNA Concentration Kit used to concentrate the RNA. Quality control checks
to determine RNA concentration and integrity were performed for each sample, using
the Nanodrop (Thermo Scientific) and Agilent 2100 Bioanalyser (Agilent Technologies),
respectively. It was determined that one of the cases showed poor integrity of A/B
and C RNA samples leading to its exclusion together with its sibling-pair. RNA yields
of between 2.22 and 5.16µg were obtained for the remaining samples, which were of
satisfactory integrity to proceed with microarray analysis. The 260/280 ratios for the
samples ranged from 1.97 to 2.06. All samples exceeded the RNA integrity number
(RIN) quality threshold ≥ 7, as samples ranged from RIN 8.3–9.10.
Ribosomal RNA reduction was performed on the eight remaining samples using the
TM
RiboMinus
Transcriptome Isolation Kit (Human/Mouse) (Invitrogen), in accordance
with a modified Affymetrix protocol. The RNA was then processed with the Whole
Transcript (WT) Sense Target Labelling assay. The labelled samples were hybridised
R
to Affymetrix GeneChip
Human Exon Array 1.0 ST chips according to the pre-
scribed protocol. A total of 5.5µg of single stranded cDNA (generated from cRNA)
was hybridised to the arrays. Following hybridization, the arrays were scanned in the
R
Affymetrix GeneChip
Scanner 3000 7G.
Cases were labelled T1 , T2 , T3 , and T4 and corresponding sibling controls C1 , C2 , C3 ,
and C4 .
62
4.2.2
Expression quantification of microarrays
The Affymetrix GeneChip Human Exon Array consists of oligonucleotide probes clustered into sets of probes (probesets), which are further clustered into metaprobesets.
Raw exon array intensities were summarised into probesets and metaprobesets at the
core and full probeset/metaprobeset level using Affymetrix Power Tools (APT) by the
robust multi-chip averaging (RMA) algorithm [250]. Quality control of array data was
carried out according to a previously described procedure [120]. We applied hierarchical
clustering and principal components analysis to the raw intensities to identify possible
outliers. APT was also used to determine probesets and metaprobesets that were detectable above the background (DABG) [251]. We excluded from further analysis all
probesets that were detectable above background in fewer than half of the samples. We
also used the annotation files provided by Affymetrix to exclude all probesets with probes
likely to cross-hybridise. A similar procedure was applied to metaprobesets: only those
having at least half of the associated probesets expressed above background in all samples were retained [120]. Finally, we normalised each probeset intensity by subtracting
its value from its parent metaprobeset intensity since these values lie on a logarithmic
scale [252]. The number of probesets remaining after these filtering steps were 103,268
and 149,835 for the core and full sets, respectively.
All analyses were performed on the hg19 build of the human genome. Exon boundaries
and splice sites were defined based on build 66 of the Ensembl database of gene models
[253]. Exonic and intronic probesets were isolated using the R tool xmapcore, using the
database based on build 66 of the Ensembl gene model [117]. U12-type introns were
obtained from the U12DB [25] with coordinates modified to hg19 using liftOver [254].
For each probeset, log-transformed expression intensities of probesets were compared
between cases and controls. The log-transformed expression intensities were normalized
by subtracting the log expression intensities of corresponding metaprobesets from that
of probesets. We compared the expression intensity for each probeset using paired tests
first using Welch t-tests from the standard R library then using moderated paired t-tests
implemented in the limma [255] package. Correction for multiple testing was based on
the q-value method as implemented in the R package qvalue [256]. The qvalue package
was also used to estimate π0 , the expected proportion of tests consistent with the null
hypothesis.
63
4.2.3
Characterisation of differentially spliced exons
To avoid the potential for bias resulting from exons to which multiple probesets mapped
or individual probesets that mapped to multiple overlapping exons, we constructed a oneto-one mapping of probesets to exons by randomly sampling at most one exon for each
probeset and one probeset for each exon. We compared the characteristics of exons for
which the corresponding probeset was differentially included between cases and controls
using a significance threshold of 0.01. Exons with significantly higher and lower inclusion
levels in cases relative to controls were considered separately and, in each category, eight
features of the exon (5’ and 3’ splice site scores, length of up and downstream introns,
exon length, nucleotide content, distance from the 3’ end and associated gene length)
were compared between the significantly included/excluded exons and the remainder of
the exons tested. For each feature, we compared groups using a non-parametric test
(Wilcoxon rank-sum test), followed by correction for multiple testing either using the
Benjamini-Hochberg method [257] or the q-value method [256].
4.3
4.3.1
Results
Transcriptome-wide perturbation of splicing
We generated exon-level microarray expression data from four individuals with a mutation in PRPF8, a core component of the spliceosome, and sibling controls. Preprocessing
and quality control steps were carried out as described in the Methods section. Following
quality control steps and removal of probesets that were not expressed in at least 50%
of the samples, a total of 103,268 core and 149,835 full probesets remained. To test for
evidence of differential splicing we compared probeset inclusion (defined as the log of
the ratio of probeset expression intensity to the expression intensity of the corresponding metaprobeset/transcript) between cases and controls. Comparisons were carried out
separately for probesets in four overlapping categories, defined by whether the probeset
mapped to an annotated intron or exon (referred to as the exonic and intronic categories,
respectively) and by whether the probeset belonged to the core probesets or to the full
probesets. The latter two categories are an aspect of the microarray design and reflect
the confidence of the exon annotations on which the probesets were based. The core
probesets correspond to high-confidence exon annotations, while the full probesets also
target exons in computationally predicted transcripts. Intronic probesets were further
partitioned based on splice type (major and minor).
Probeset inclusion was compared between cases and sibling controls for each probeset
that passed the detection and quality control filters described above. Instead of using
64
Empirical distribution of p−values
π0 = 0.7553
0.0
0.5
density
1.0
1.5
uniform distribution
expected true null
0.0
0.2
0.4
0.6
0.8
1.0
p−value
Figure 4.1: Empirical distribution of p-values. Each p-values was calculated
using pairwise t-tests for a normalised probeset between cases and controls.
limma we used standard t-tests so that we could estimate π0 , the proportion of true null
hypotheses. The limma approach has higher statistical power to detect differences at the
individual probeset level, because it shares information across probesets but this violates
the independence assumption when estimating π0 . Consequently, the expected distribution of p-values is no longer uniform under the null hypothesis [258]. The histogram of
p-values obtained from these comparisons (Fig. 4.1) shows a large excess of small values
compared to the uniform distribution, expected under a null hypothesis of no difference
in splicing between cases and controls. Statistical methods exist to estimate π0 from a
distribution of p-values. We used the qvalue package from BioConductor [256, 259] for
this purpose. For core probesets π0 was 0.76 while full probesets had a slightly higher
value of 0.79. This suggests that for approximately one fifth of the probesets the null
hypothesis of no difference in inclusion between cases and controls does not hold. For
exonic probesets π0 was 0.77 while intronic probesets had π0 = 0.81.
It is possible that this high proportion of affected probesets inferred using the qvalue
package could be the result of a failure of the assumptions underlying the estimation
of π0 and not a consequence of differences in splicing between the case and control
groups. To investigate this possibility, we compared the estimates of π0 obtained from
65
the true case-control groups to estimates obtained when sample labels were permuted
exhaustively within sibling pairs (so that comparable paired t-tests could still be carried
out). The estimate of π0 for the correctly labelled samples was lower than for any of
the eight alternative configurations that can be obtained in this way (Supplementary
Figs. A.2 and A.3). The value of π0 was strongly anti-correlated with the number of
properly paired sibling pairs (Pearson r = −0.84, p = 0.008; Spearman ρ = −0.92,
p = 0.001), suggesting that the PRPF8 mutation does have a transcriptome-wide effect
on mRNA splicing in whole blood samples.
4.3.2
Differential splicing primarily involves higher inclusion of exons
in cases
We used limma to compute p-values and t-statistics for differential inclusion of individual probesets because this approach has been shown to have higher statistical power
than standard t-tests for small sample sizes [255]. The majority of probesets with significantly different inclusion between cases and controls had higher inclusion in cases
(Fig. 4.2). Below a p-value threshold of 0.2, the enrichment for probesets with higher
inclusion in cases was highly statistically significant (core probesets: p = 4.8 × 10−4 ,
full: p = 1.3 × 10−4 , exonic: 4.6 × 10−10 ; Supplementary Tables A.1–A.3, Supplementary
Fig. A.6). However, intronic probesets showed no significant excess of positive t-statistics
among probesets with low p-values (p = 0.24 at p-value threshold of 0.2; Supplementary
Tables A.4 and A.5). This suggests that most of the differential splicing involves higher
exon inclusion in cases. The affected probesets tended to be expressed at a lower level
than the remainder of the probesets in the meta-probeset (gene) to which they mapped.
Probesets with higher inclusion in cases had a mean expression intensity (across cases
and controls) that was, on average, less than half the mean expression intensities of
unaffected probesets (p = 9.6 × 10−112 ). The difference was much smaller for probesets
with lower inclusion in cases; these probesets had raw expression intensities that were
on average 14% lower than the average of the unaffected probesets (p = 0.004). Thus,
the affected probesets are from low inclusion exons that are typically included at higher
levels in the mutant samples compared to controls.
4.3.3
Characterisation of differentially included exons
The introns upstream of exons containing probesets with evidence of differential inclusion
between cases and controls were significantly longer than average (Table 4.1). Short
introns (< 250 bp) are thought to be excised primarily through intron definition, and
longer introns through exon definition [29, 51, 83]. These two classes of introns can be
66
0.2
0.4
0.6
0.8
exonic probesets
0.0
0.2
0.4
0.6
0.8
1.0
core probesets
0.0
proportion of probesets
1.0
bins of width ∆p = 0.05 for p in [0,1]
Figure 4.2: Proportion of probesets indicating higher/lower inclusion in
cases for binned p-value thresholds. All bins have an equal width of ∆p = 0.05 for
p-values in the range [0, 1]. Red bars symbolise the proportion of probesets indicating
higher inclusion in cases; similarly, blue bars refer to lower inclusion in cases. The
absolute height of a bar of each colour represents the proportion of probesets with
higher/lower inclusion in cases. The t-statistic was used to determine relative inclusion:
t > 0 - higher inclusion in cases; t < 0 - lower inclusion in cases. Plot only shows core
and exonic (full) probesets.
up
down
all
Upstream intron length (5’)
Median p-value
q-value
1966
0.002
0.0064
2053.5
0.05
0.1
1625
-
Downstream intron length (3’)
Median p-value
q-value
1662
0.84
0.84
1843.5
0.14
0.18
1625
-
Table 4.1: Length of introns flanking differentially included exons. Comparison of median of intron length for exons with higher (up) and lower (down) inclusion
in cases relative to controls.
seen on the density plot of intron lengths (Fig. 4.3). The density plot of introns bordering
differentially included exons had a diminished peak at short lengths (Fig. 4.3), suggesting
that aberrant splicing in cases does not affect splicing via intron definition. Instead,
longer introns were particularly abundant upstream of preferentially included exons
(Fig. 4.3). Thus, differential splicing between cases and controls appears to primarily
affect the exon definition pathway.
Exon definition occurs for short to moderate length (< 500 bp) exons since the components of the spliceosome need to associate across them [84]. Long exons tend to have
additional splicing enhancer motifs, perhaps to aid the binding of spliceosome components since they cannot associate across the length of the exon [84]. We found that
67
0.35
B
0.35
A
0.30
0.25
0.20
0.00
0.05
0.10
0.15
density
0.20
0.15
0.00
0.05
0.10
density
0.25
0.30
higher inclusion exons
lower inclusion exons
background (core) exons
4
6
8
10
12
4
log of intron length
6
8
10
12
log of intron length
Figure 4.3: Distributions of intron length. Density plot of intron length (A) upstream and (B) downstream of differentially included exons in cases relative to controls.
A similar plot for background exons (core exons) is provided for reference.
up
down
all
Median
368.88
562.81
436.5
Mean
127
166
153
p-value
7.52 × 10−18
2.44 × 10−3
-
q-value
1.50 × 10−17
2.44 × 10−3
-
Table 4.2: Length of differentially included exons. Comparison of mean and
median lengths for exons with higher (up) and lower (down) inclusion in cases relative
to controls.
exons containing probesets with evidence of higher inclusion in cases than in controls
were significantly shorter than average (Table 4.2). We also found that exons with
lower inclusion in cases were somewhat longer than other exons (Table 4.2). Exons
that showed evidence of higher inclusion in cases relative to controls were depleted of
long exons (> 500 bp), suggesting that the majority of these exons are spliced via exon
definition [53].
We used MaxEntScan [260] to calculate splice site scores at intron-exon junctions and
compared these between affected and unaffected exons. For exons with higher inclusion
in cases, the median score of 5’ splice sites (5SS) was significantly higher than background (Tables 4.3 and 4.4) but the 3’ splice sites (3SS) showed no difference in score.
68
up
down
all
Mean
8.33
8.12
8.08
Splice donors (5SS)
Median
p-value
q-value
8.76
8.86 × 10−3 3.54 × 10−2
8.68
0.75
0.75
8.68
-
Splice acceptors (3SS)
Mean Median p-value q-value
8.43
8.68
0.58
0.75
8.19
8.40
0.10
0.19
8.30
8.69
-
Table 4.3: Splice site strength of differentially included exons. Mean and
median splice site strength computed using MaxEntScan [260].
up
down
all
Sum of 5SS and 3SS scores
Mean Median p-value q-value
16.76
17.12
0.04
0.08
16.31
16.77
0.11
0.11
16.38
16.99
-
Table 4.4: Combined splice site strength (5SS and 3SS) of differentially
included exons. Similar to Table 3 but using combined splice site strength.
In contrast, for exons with lower inclusion in cases, neither the 5SS nor 3SS showed differences in scores (Table 4.3 and 4.4). The distributions of splice site scores are shown
in supplementary figure A.4.
We found that the exons with higher inclusion in cases had much higher proportions
of A and T nucleotides than background (core) exons and, correspondingly, lower proportions of G and C nucleotides (Fig. 4.4). Exons with lower inclusion in cases showed
no significant difference from background. We also obtained a list of 238 candidate
hexameric exonic splicing enhancers (ESEs) sequences [261]. For each ESE sequence,
we counted the number of times it occurred in each exon category and normalised this
by the sum of exon lengths in that category. Unsurprisingly, considering that ESEs
are A-rich (nearly 50% of bases of the 238 ESEs were A), we found that highly included
exons did indeed show evidence of ESE-enrichment (mean ESE prevalence of 4.14×10−4
against 3.70 × 10−4 for core exons, p = 0.02). We did not observe any ESE enrichment
in exons with lower inclusion in cases (mean ESE prevalence of 3.76 × 10−4 , p = 0.72).
The enrichment of ESEs in the exons with increased inclusion in cases does not explain
the A+T richness because the excess of A and T nucleotides persisted even when all instances of the 238 ESEs were removed. Indeed, the excess of ESEs may be a consequence
of the A-richness of the affected exons.
Splicing can take place co-transcriptionally (while the polymerase is still associated
with the DNA template) or post-transcriptionally [40, 112, 113, 164, 262–264]. Exons
belonging to longer genes or located far from the 3’ ends of genes are more likely than
other exons to be spliced co-transcriptionally [265]. To investigate whether the mutation
may have a greater impact on co- or post-transcriptional splicing we compared gene
length and distance from 3’ gene ends between affected exons (differentially included
69
1.0
0.6
***
***
***
0.4
***
0.0
0.2
proportion
0.8
A
T
C
G
Figure 4.4: Nucleotide content of differentially included exons. The proportion of each nucleotide in differentially included exons (A) with exonic splicing enhancers
(ESEs) present and (B) with ESEs removed.
up
down
all
Mean
100,954
103,781
91,811
Gene length
Median
p-value
53,464
2.6 × 10−7
53,741
3.9 × 10−5
42,729
-
q-value
5.2 × 10−7
3.9 × 10−5
-
Mean
51,939
58,279
46,338
Distance from 3’ end
Median
p-value
q-value
23,542
7.3 × 10−10 1.4 × 10−9
21,426
5.1 × 10−5
5.1 × 10−5
14,514
-
Table 4.5: Indicators of co-transcriptional splicing. Gene length and distance
from the 3’ end of genes for exons differentially included.
between cases and controls) and unaffected exons. We found that the mean gene length
as well as the distance from the 3’ end of the gene were both significantly greater for
affected exons (Table 4.5). This suggests that the mutation has a larger impact on
co-transcriptional splicing; the effect may even be restricted to splicing that occurs cotranscriptionally.
4.4
Discussion
Cis-acting factors have been shown to be important regulators of alternative splicing
and several previous studies have reported cis-acting genetic variants that affect mRNA
splicing [236, 238–241, 266–268]. Such cis-acting splicing quantitative trait loci (sQTLs)
70
typically affect the splicing of a single gene. Trans-acting variants, however, can have
widespread impact and when they affect ubiquitous components of the transcript processing machinery the effects may be transcriptome-wide. To date, there have been fewer
studies on trans-sQTLs than on cis-acting variants. Given that almost all human genes
are spliced and a large majority of multi-exon genes are alternatively spliced, often in
a tissue-specific manner [119], cis- and trans-acting factors that cause splicing errors
or affect the regulation of alternative splicing have the potential to have a substantial
impact on the transcriptome. Indeed, a large proportion of human genetic diseases are
likely to result from mutations that affect splicing [119, 244, 269, 270].
In this study, we applied statistical analyses that interrogated exon inclusion events
across the human transcriptome. We estimated the transcriptome-wide effect of the
PRPF8 mutation by using the π0 statistic, which in the context of multiple hypothesis
testing, is an estimate of the proportion of tests that conform to the null hypothesis.
This approach has previously been applied in transcriptome-wide studies of population
differential gene expression between human populations [271] and, more recently, in
the study of cis- and trans-expression quantitative loci [242]. The observed value of
π0 was the lowest among all permutations of the case-control labels. Nevertheless, we
acknowledge that, given the relatively small sample size in this study, this finding requires
independent validation particularly given the often noisy nature of microarray data.
Our results provide evidence that a non-synonymous mutation in the splicing factor
PRPF8 that leads to retinitis pigmentosa affects the inclusion of approximately 20% of
the exons in genes expressed in whole blood samples. Affected exons were most often
included at a higher level in the transcripts of cases than of controls. Averaging across
all samples (cases and controls), these exons had lower inclusion levels than unaffected
exons, suggesting that they are absent from some of the transcripts of the corresponding
gene. The fact that the mutated form of PRPF8 was associated with higher inclusion
of exons that appear to be skipped in some transcripts was an unexpected result. It
suggests that the mutation may affect alternative splicing (e.g. tissue-specific regulation
of alternative splice isoforms) rather than giving rise to a high rate of splicing errors
involving skipping of constitutive exons or intron retention. Exon skipping is the most
common type of alternative splicing event [272]. Because the mutation is associated with
higher inclusion levels of exons that are expressed at relatively low levels, we propose that
the PRPF8 mutation may reduce the likelihood of exon skipping. Under this model the
adverse phenotype could result from failure of a regulated exon skipping event which is
required in retina. The exon microarray platform we used also includes some probesets
that map to intronic regions. Although the number of expressed probesets mapping
to introns was far lower than for exons, we found some evidence of differential intron
retention between cases and controls. However, the proportion of affected probesets was
71
lower for intronic compared to exonic probesets and the overall effect was approximately
equally divided between increased and decreased intron inclusion.
The introns upstream of exons with higher inclusion in cases were significantly longer
than average. In fact, short introns were almost entirely absent upstream of affected
exons (Fig. 4.3). Short introns (< 250 bp) are more efficiently excised through the
assembly of spliceosomal components across the intron before pairing up (intron definition) and they generally have weaker splice sites [273]. Long introns rely on spliceosome
components first assembling across exons before juxtaposing across the target introns.
One of the most striking features of probesets with significantly different inclusion in
cases compared to controls was that they were found predominantly on shorter exons.
The majority of affected exons had increased inclusion in cases and we found that shorter
exon length applied only to these exons. In fact, probesets with lower inclusion in cases
compared to controls mapped to exons that were slightly longer than the background
unaffected exons. Exon defined splicing is most efficient when exon length is between
50 and 500 bp [29, 53, 274]. Splicing via intron definition requires short introns (< 250
bp) and, as discussed above, these were depleted among introns flanking affected exons,
particularly for the exons with higher inclusion in cases. Consequently, we propose that
the PRPF8 mutation affects splicing via exon definition, but may have no effect on
splicing via intron definition. We also observed that exons with higher inclusion in cases
tended to have stronger 5’ splice site (5SS). Interestingly, PRPF8 is known to interact
directly with the 5SS [49] (and references therein). It contacts the 5SS dinucleotide at
residues QACLK (positions 1894-1898) as part of the U5 snRNP (a constituent of the
U5·U4/U6 tri-snRNP) [275–278]. However, the RP mutation is a histidine to arginine
change at residue 2309 [248] suggesting that the mutation does not directly affect contact
with the 5SS and may, instead, affect interaction with another protein or snRNA.
On average, exons that were differentially included between cases and controls were much
further from 3’ gene ends and belonged to significantly longer genes than unaffected
exons. Both distance from the 3’ end and gene length are associated with the efficiency
of co-transcriptional splicing [265]. The majority of affected exons had higher inclusion
in cases and exons with higher inclusion in cases (but not exons with lower inclusion)
had significantly greater proportions of A and T nucleotides than unaffected exons. Arich tracts have been observed to lead to pol II pausing [166, 279], potentially increasing
the time available for mRNA splicing to occur prior to the completion of transcription.
Taking these observations together we propose that the PRPF8 mutation may primarily
or, at least, disproportionately affect splicing that takes place co-transcriptionally.
Two previous studies of the effects of mutations in three splicing factors (PRPF3, PRPF8
and PRPF31 ) linked to retinitis pigmentosa on mRNA splicing in lymphoblast cells
72
reported evidence of retained introns [231, 234]. Both of these studies investigated
splicing for a small number of introns. Only one intron showed evidence of retention
associated with the mutation in PRPF8 [234]. This was a U12-type intron (the third
intron of STK11 ). However, the corresponding probesets did not show evidence of
expression above background, thus we did not find evidence to support retention of
this intron in our dataset. Furthermore, we did not find any evidence of retention of
U12-type introns in cases (Supplementary Fig. A.5).
We observed that exons with specific structural characteristics were more likely to be
differentially-included: short exons flanked by long introns, high A+T-content, located
towards the 5 end of the gene. It is possible that these exons are more prone to microarray hybridisation noise. For example, it can be more difficult to design microarray probesets targeting short exons (short sequences limit the choices of array probes),
resulting in lower hybridisation affinity. Indeed, we observed higher variability of expression signals in affected exons compared to unaffected exons even when restricting
to controls alone (data not shown). Nevertheless, given that sample preparation and
processing was carried out independent of treatment labels, technical noise should not
be biased towards increased exon inclusion in cases. Our results suggest that a mutation
in PRPF8, implicated in RP, has a subtle effect on the inclusion of a large number of
human exons. However, further investigation using newer sequencing based technologies
(RNA-Seq) and an independent set of samples will enable the impact of the mutation
on splicing error to be confirmed and dissected in greater detail.
4.5
Conclusion
Overall, our results support the hypothesis that a mutation in the splicing factor PRPF8
that leads to retinitis pigmentosa, has a widespread impact on mRNA splicing across the
transcriptome. Given that splicing of such a large proportion of exons is effected by the
mutation in blood, it is surprising that the phenotype associated with this mutation is
restricted to the retina. However, because the differentially included probesets were from
exons with low inclusion overall and had higher inclusion in cases compared to controls,
we propose that the mutation does not lead to an increase in the rate of mis-splicing of
constitutive exons. Instead, the mutation may influence the inclusion of alternatively
spliced exons. Consequently, we suggest that the disease phenotype in retinal tissue
could result from failure to produce one or several retina-specific isoforms that require
exon skipping.
73
Availability of supporting data
Affymetrix exon array CEL files are available on GEO website under accession GSE43134
(http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE43134). This manuscript
is accompanied by supplementary information.
74
Chapter 5
Seq-ing improved gene expression
estimates from microarrays using
machine learning
The content of this chapter has been submitted as:
Paul K Korir, Paul Geeleher and Cathal Seoighe “Seq-ing improved gene expression
estimates from microarrays using machine learning.” under review in Nature Methods.
PKK implemented the algorithm, analysed the data and drafted the initial manuscript.
Paul Geeleher and Cathal Seoighe were involved in improving the manuscript.
Abstract
The quantification of gene expression by high-throughput sequencing (RNA-Seq) has several advantages
over microarrays, including better reproducibility, greater dynamic range and gene expression estimates
on an absolute, rather than a relative scale. We propose a novel approach to microarray analysis
that attains some of the advantages of RNA-Seq. The method, called Machine Learning of Transcript
Expression (MaLTE), leverages samples for which both microarray and RNA-Seq to learn the relationship
between the fluorescence intensity of sets of microarray probes and RNA-Seq transcript expression
estimates. We trained MaLTE on data released by the Genotype-Tissue Expression (GTEx) project,
consisting of Affymetrix gene arrays and RNA-Seq from over 700 samples across a broad range of
human tissues. We show that regression models learned from a subset of GTEx samples can accurately
estimate RNA-Seq values for the remainder and can be used to estimate absolute gene expression levels
from archived microarray data.
75
Machine Learning of Transcript Expression
5.1
Introduction
Initially designed for DNA mapping and sequencing-by-hybridization [280–282], microarrays were first used to quantify gene expression in the 1990s [283]. Since then, the technology has been adapted for diverse biological assays such as genotyping [284], detection
of alternative splicing [121], epigenetic modifications [285], protein-DNA interactions
[286] (and references therein), among many others [287]. Because of the importance of
gene expression analysis, much effort has been invested in developing accurate methods
to infer gene expression levels from the fluorescence intensities of microarray probes.
Typical analysis pipelines for oligonucleotide microarrays include subtraction of background signals, normalization of the signal intensities across samples and summarization
of the fluorescence intensities of probes that map to the same genomic feature (gene or
transcript) [121, 288]. Numerous approaches have been applied to obtain gene-level
expression estimates from probe intensities [289] with robust linear models (RMA, parameter estimation by median-polish) and multiplicative models (PLIER, MAS5.0, and
dChip) suggested to give the best results [290].
Although they remain in widespread use, microarrays have a number of limitations in
comparison to sequencing-based methods for the quantification of gene expression (generally referred to as RNA-Seq). Documented advantages of RNA-Seq include improved
reproducibility and dynamic range and gene expression estimates on an absolute scale
[291]. The latter attribute facilitates comparison of gene expression levels between different experiments and also enables the expression of different genes within the same
sample to be compared. Data on the expression levels of different genes within a sample can be useful, for instance in modeling regulatory networks [291]. The summarized
probe intensity values obtained from microarrays, by contrast, are on an arbitrary scale,
cannot easily be compared between experiments without renormalization and are not
well suited to the comparison of expression levels of different genes in the same sample
[292–294]. In addition, the chips are designed according to specific genome annotations
and need to be adapted when these annotations change. Finally, the dynamic range of
microarrays (a few hundred fold) is an order of magnitude lower than that of RNA-Seq
(approx. 9,000 fold depending on the coverage) making them less sensitive [295, 296].
However, because of their relatively low cost, microarrays remain in widespread use and
are particularly important for large-scale studies [297–299].
Here we investigate an alternative approach to estimating gene or transcript expression
levels from microarray probe level fluorescence intensities, which addresses several of the
limitations discussed above. We refer to this method as machine learning of transcript
expression (MaLTE). Given a set of samples for which both microarray and RNA-Seq
data are available, MaLTE uses statistical regression techniques to learn the relationship
76
between the intensities of a set of probes associated with the gene and the expression
level of the feature as estimated by RNA-Seq. The regression models can then be applied
to estimate gene expression from microarray data for which RNA-Seq data is not available. We applied MaLTE to 716 samples from a broad range of tissues profiled by the
Genotype-Tissue Expression (GTEx) project, training the regression models on a random subset of the samples and testing on the remainder. Within individual test samples
the gene expression estimates from MaLTE approximate closely the values estimated
by RNA-Seq. Cross-sample correlation between RNA-Seq and microarray expression
estimates for individual genes was also significantly higher using MaLTE compared to
existing methods to estimate gene expression from microarrays. In addition, MaLTE
can leverage transcript isoform expression estimates produced from RNA-Seq to provide
a means to estimate the expression levels of specific isoforms from the microarray data.
5.2
5.2.1
Methods
GTEx data preparation
Affymetrix Human Gene 1.1 ST microarray and RNA-Seq expression data (derived
from 837 samples across 29 tissue types and three cell lines) were downloaded from
the genotype-tissue expression (GTEx) project website (http://www.broadinstitute.
org/gtex) on 30th July 2013. For 91 of the samples, only microarray data was available
and multiple biological replicates were included for several of the cell lines. A total of
716 of the samples were used in this study.
Raw and RMA (median-polish) processed microarray data were downloaded from GEO
(accession GSE45878). Library files for Affymetrix Human Gene 1.1 ST arrays were
downloaded from the Affymetrix website (http://www.affymetrix.com). Custom CDF
files for Human Gene (HuGene11stv1 Hs ENSG.cdf) that were used in the GTEx project
[300] were also downloaded from the Brainarray resource (http://brainarray.mbni.
med.umich.edu). Probes were extracted using Affymetrix Power Tools (APT v.1.12.0)
with and without quantile-normalization and with background correction (http://www.
affymetrix.com/estore/partners_programs/programs/developer/tools/powertools.
affx). Expression estimates obtained using RMA were downloaded from GEO. Gene
expression was also estimated using PLIER [301] as implemented in APT using the
custom CDF (HuGene11stv1 Hs ENSG.cdf) by passing the following analysis string to
APT:
-a quant-norm.sketch=-1.usepm=true.bioc=true,pm-only,plier
77
We mapped genes to probes using two datasets: (i) the mapping between probes and
probe sets implied within the probe intensities file and (ii) the mapping between probe
sets and genes. The data for (ii) were constructed using the BEDTools intersect
utility. This utility takes two BED files as input: one for gene and another for probe set
genomic coordinates. Gene coordinates were determined from the Ensembl v.72 gene
model [253] while probe set coordinates were downloaded from the Affymetrix website
(http://www.affymetrix.com).
Altogether, we assessed expression estimates for 26,215 genes. These corresponded to
140,212 transcript isoforms. Quantile normalization of gene RPKM values for the 26,215
genes was performed prior to training and testing. MaLTE as well as RMA and PLIER
expression estimates were also quantile-normalized prior to downstream analyses.
5.2.2
Selecting and tuning the best learning algorithm
We used an independent dataset to select and tune the optimal learning algorithm.
We obtained RNA-Seq data [236, 302] and Affymetrix Human Exon 1.0 ST array CEL
files [303] for 53 Yoruba (YRI) and 44 European (CEU) HapMap cell lines [304]. All
individuals were unrelated. Data was downloaded from the GEO database [305] under
accessions GSE25030 (CEU RNA-Seq), GSE19480 (YRI RNA-Seq) and GSE7851 (CEU
and YRI exon array data).
We compared the performance of several learning algorithms: classification and regression trees (CART), multivariate adaptive regression splines (MARS), boosted regression
trees (BRT), random forest (RF), conditional random forest (CRF) and quantile regression random forest (QRF) [306–310]. All learning algorithms were compared to
median-polish and PLIER. All methods were applied in the R statistical computing
environment [311]. Performance was evaluated based on the mean cross-sample correlation over all genes (see Section 5.3.1 for a definition). The R multicore package
(http://cran.r-project.org/web/packages/multicore/) was used to speed up computations.
We identified CRF as the best performing algorithm and then optimized its performance
by identifying the best parameter settings. To do this, we created ten datasets, each
with 1,000 randomly selected genes. We examined the following parameters: minimum
number of features (probes) selected (F S), number of randomly sampled predictor probes
(mtry) and number of trees (ntree). We chose the optimal value of each parameter
based on the mean cross-sample Pearson correlation coefficient of out-of-bag (OOB)
estimates for positively correlated genes (rOOB > 0). We identified the optimal F S by
restricting the n probes that were most highly correlated with the response in the training
78
set. We also identified quantile regression random forest (QRF) as having equivalent
performance with the added benefit of producing prediction intervals (upper and lower
quantiles). Tuned parameters for the best algorithm were applied without modification
on the GTEx dataset.
5.2.3
Effect of training set size
We tested the effect of training set size on prediction accuracy. To do this, we generated
training sets having between 20 and 600 randomly selected samples (20, 30, 40, 50, 75,
100, 150, 200, 300, 400, 500, and 600) and evaluated the performance of each training
set on a random set of 100 test samples (Fig. B.1). Additionally, we counted the number
of genes that passed a nominal OOB filter threshold of zero (Fig. B.1c). From these
results, we decided on a training set size of approximately one-fifth of all samples for
the training set with the remainder used to test performance.
5.2.4
Differential gene expression (DGE) between GTEx tissues
We selected two GTEx tissues for differential expression analysis: heart muscle (left
ventricle) and skeletal muscle. Five samples [312] for each tissue that were among the
570 samples in the test set were randomly selected (designated 10-sample DGE). Only
genes for which expression could be estimated using all of the methods tested and with
RPKM (in the RNA-Seq data) greater than one in all samples were considered in this
assessment. Differential expression analysis was performed using standard t-tests [290].
While standard t-tests are sub-optimal in practice [313], they enable a fair comparison of
quantification methods without platform-specific optimizations. For example, the popular limma [255] method uses moderated t-tests that make distributional assumptions
that do not necessarily hold for RNA-Seq data [314]. Performance on differential expression was compared using the cumulative Jaccard index (Eqn. 5.1) [315] traversing the
list of genes sorted by q-value in steps of 10 genes (for computational feasibility) and the
concordance correlation coefficient (ρc ) (Eqn. 5.2) [316]. Here, as in the within-sample
and cross-sample tests of correlation, we treat RNA-Seq as the gold standard and the
microarray methods (MaLTE, RMA and PLIER) are evaluated based on similarity of
their DGE results to the RNA-Seq results. We repeated this computation with 100
bootstrap replicates to estimate variability (Fig. B.3).
The Jaccard index J is defined as the ratio of the intersection to the union of two sets,
J(A, B) =
79
|A ∩ B|
.
|A ∪ B|
(5.1)
The concordance correlation coefficient gives the degree of departure from the unit gradient line. Two random variables Y1 and Y2 have a concordance correlation rc given
by
rc =
2σ1 σ2
r12 = Cb r12 ,
σ12 + σ22 + (µ1 + µ2 )2
(5.2)
where µi and σi2 are the mean and variance for Yi , and r12 is the Pearson correlation
coefficient for Y1 with Y2 [316]. The value of rc was determined using gene lists from
differential expression results for each method ranked by the q-value. The q-values were
computed using the Bioconductor qvalue package [256].
We also compared the log of fold change between MaLTE and each of median-polish and
PLIER separately. First, we identified the common set of genes differentially expressed
at FDR < 1% then compared to the respective RNA-Seq log of fold change values. There
were 120 genes found in common between MaLTE and median-polish and only six genes
common between MaLTE and PLIER.
To determine the consistency of differential expression results, we repeated differential
expression analysis using 22 samples for each tissue (44-sample DGE). We then compared
the 10-sample DGE results to the 44-sample DGE results treating the latter as true
differentially expressed genes using the Jaccard index (Eqn. 5.1). We refer to these
as self-comparisons because the 10-sample DGE results were only compared to the 44sample DGE results from the same method (e.g. MaLTE 10-sample DGE was compared
to MaLTE 44-sample DGE; Fig. B.4a). A similar assessment of all methods to the
44-sample RNA-Seq DGE only was also carried out (Fig. B.4b).
5.2.5
Application to archived samples
A dataset consisting of Affymetrix Human Exon 1.0 ST array and RNA-Seq expression
estimates from a set of human brain samples was downloaded from GEO (accession
GSE26586) [317]. To apply MaLTE, trained on Affymetrix Human Gene 1.1 ST arrays,
to data from the Affymetrix Human Exon 1.0 ST arrays, we needed to map probes
shared between the two platforms using comparison spreadsheets. Spreadsheets relating
exon array to gene array meta-probe sets were downloaded from the Affymetrix website
(http://www.affymetrix.com). We used meta-probe sets classified as “Best Match” to
map exon array probes to gene array probes. A probe was converted if it was part of a
probe set that was in a meta-probe set in the “Best Match” spreadsheet and only if it
shared 100% sequence similarity between arrays. In total 425,268 of the 861,493 probes
on the gene array could be mapped to exon array probes in this way. GTEx and brain
RNA-Seq and probe intensities were then quantile-normalized together but we excluded
80
background correction because the process of transforming the exon arrays excluded several background probes, which would hamper comparison to non-background-corrected
median-polish and PLIER.
To compare performance between median-polish, PLIER and MaLTE, we redefined the
exon array meta-probe set mappings. (For a detailed description of the exon array
design please refer to [318].) We did this to ensure that the same gene-to-probe set
definitions were applied to all methods. Meta-probe sets represent the gene/transcript
level and are quantified by summarizing the estimates from the individual probe sets.
For each Ensembl gene identifier, we identified all exonic probe sets using xmapcore (now
annmap) [319]. An exonic probe set is defined as having: (i) all probes mapping to the
genome only once, and (ii) all probes falling within an exon boundary. We then used
the Ensembl gene identifiers as meta-probe set identifiers. We also constructed a text
file of ‘kill list’ probes, which consisted of probes that were excluded from the modified
exon array. Gene expression estimates were then computed using APT for custom and
core meta-probe sets with the following analysis strings:
-a quant-norm.sketch=0.usepm=true.bioc=true,pm-only,med-polish \
-a quant-norm.sketch=0.usepm=true.bioc=true,pm-only,plier
\
--kill-list <kill list file>
5.2.6
Application to the Affymetrix tissue mixture dataset
The Affymetrix tissue mixture dataset was downloaded from the Affymetrix website
(http://www.affymetrix.com). This consisted of nine mixtures of brain and heart in
varying proportions (1:0 (heart to brain), 0.95:0.5, 0.9:0.1, 0.75:0.25, 0.5:0.5 (three biological replicates), 0.25:0.75, 0.1:0.9, 0.05:0.95, and 0:1) with three technical replicates
of each. Because this dataset consisted of Affymetrix Human Gene 1.0 ST arrays, we
mapped probes to Human Gene 1.1 ST probe identifiers. The same custom library files
(described in Section 5.2.5) were used to evaluate median-polish and PLIER summarization after background correction and quantile normalization. Tissue mixture proportions
were then estimated using the R package CellMix [320] for common genes following OOB
filtering.
5.3
Results
We hypothesized that learning the relationship between single-channel microarray probe
intensities and RNA-Seq expression estimates from samples for which microarray and
81
RNA-Seq data are available could provide a means to obtain improved estimates of gene
expression from microarrays. We refer to this general approach as MaLTE. To investigate
the performance of the MaLTE approach we applied it to 716 samples from a broad range
of human tissues for which high quality Affymetrix Human Gene 1.1 ST microarray and
RNA-Seq data have been generated through the pilot phase of the GTEx project [300].
We selected approximately one-fifth (146) of the 716 unique samples at random for use
as the training set and tested the performance of MaLTE on the remaining samples.
Training on a larger number of samples did not lead to substantial improvement in
prediction accuracy (Supplementary Fig. B.1). For each feature of interest (gene or
transcript), we identified a set of probes with genomic co-ordinates that overlap the
coordinates of the feature and used quantile regression random forest, including a feature
selection step, to learn the relationship between the RNA-Seq expression estimates and
the probe intensities in the training data (see Section 5.2.2 for details of the choice
of regression algorithm). MaLTE is available as an R software package from http:
//bioinf.nuigalway.ie/MaLTE/malte.html.
5.3.1
Correlation with RNA-Seq
Given a putatively accurate measure of gene expression (here RNA-Seq), our goal is to
approximate this measure from the array probe intensities. For most genes we could
estimate the RNA-Seq expression level, given as reads per kilobase of transcript per
million mapped reads (RPKM) relatively accurately (Fig. 5.1). For the lowest expressed
genes, accuracy is limited by stochastic fluctuation in the number of reads from a given
transcript and for the highest expressed genes we underestimate expression level due to
saturation of microarray probe intensities. We also calculated the correlation between
MaLTE and RNA-Seq and compared it to the correlations with RNA-Seq obtained
from two existing widely-used methods to estimate expression from sets of microarray
oligonucleotide probe intensities (median polish [288] and PLIER [301]). Comparison
was carried out both for correlation within samples and across samples. The former
provides an indication of the agreement between the methods on the relative expression
of different genes within the same sample, while the latter is an indication of how well
the variation across samples detected by RNA-Seq is captured by the array estimates.
Gene expression levels estimated by MaLTE on the test samples showed much higher
within-sample Pearson correlation with the RNA-Seq estimates than median-polish or
PLIER (Fig. 5.2a–5.2d). Importantly, the improved within-sample Pearson correlation
is not simply a result of MaLTE rescaling the microarray expression estimates to match
the RNA-Seq data, as MaLTE also results in substantially higher within-sample rank
correlation (Fig. 5.2e). The slopes of the within-sample regression lines were close to
82
unity for MaLTE (e.g. Fig. 5.2c). In contrast, the expression estimates from the other
methods are not on the same scale as the RNA-Seq values (Fig. 5.2a and 5.2b). By
placing gene expression estimated from the arrays on the absolute scale defined by
RNA-Seq, MaLTE allows comparison of gene expression between genes on the array.
This is not possible with standard summarization techniques, such as median-polish
and PLIER [291, 293, 294].
Figure 5.1: Estimation accuracy. MaLTE provides an estimate of the RNA-Seq
gene expression levels from microarray probe intensities. (a) The relative error (i.e.
difference between MaLTE estimate and RNA-Seq, divided by the RNA-Seq value) as
a function of the RNA-Seq expression level. Each point corresponds to a bin of 75 genes.
The data represents all genes but with a random subset of 10 samples for each gene.
Only relative errors below 2 and RNA-Seq values between 1 and 1000 are represented.
Low expression genes were excluded due to high stochasticity for low read counts. A
Loess regression line is shown in red, illustrating that MaLTE slightly underestimates
RNA-Seq particularly for highly-expressed genes. (b) The distribution of relative error
with percentage median error and median absolute error displayed with the median
error indicated by the dashed red line.
83
a
c
b
e
d
Figure 5.2: Within-sample correlation with RNA-Seq. (a, b, c) Scatter plot
of gene expression for a single exemplary sample for each method against RNA-Seq.
The sample with the within-sample Pearson correlation closest to the median over all
samples was chosen. Box plots are provided to show the range of within-sample (d)
Pearson and (e) Spearman correlation coefficients across samples in the test dataset.
Median correlations are indicated beneath in brackets.
The correlation of microarray and RNA-Seq estimates of gene expression has been investigated previously by several studies [236, 321, 322]. Because not all genes vary substantially across samples, while within individual samples mRNA abundance ranges over
several orders of magnitude [323], cross-sample correlations tend to be lower than withinsample correlations. MaLTE significantly outperformed median-polish and PLIER in
cross-sample correlation (Fig. 5.3). For example, mean cross-sample Pearson correlation (r̄), in test data was 0.76 for MaLTE compared to 0.72 for median-polish (p <
1 × 10−322 , from a Wilcoxon rank sum test of cross-sample correlations) and 0.68 for
PLIER (Fig. 5.3a). Mean Spearman cross-sample correlations (ρ̄) obtained from MaLTE
were also much higher (0.69 compared to 0.64 for median-polish and 0.61 for PLIER;
Fig. 5.3b).
84
a
b
Figure 5.3: Cross-sample correlation with RNA-Seq. For each gene, the crosssample correlation was determined between the gene expression values estimated from
the microarrays using MaLTE, median-polish and PLIER. Density plots show the distribution across genes of (a) Pearson and (b) Spearman correlation coefficients. Mean
and median values of the correlation coefficients are provided in parentheses next to
the method name in the legend. Vertical lines show mean cross-sample correlation for
MaLTE (solid), median-polish (dashed) and PLIER (dotted).
5.3.2
Restricting to well-estimated genes
MaLTE does not perform equally well for all genes. For example, low values of crosssample correlation between MaLTE and RNA-Seq can be obtained for genes with low
variation in expression across samples. Such genes will typically also show poor crosssample correlation when their expression is estimated using median-polish and PLIER.
However, MaLTE has the advantage that it provides an estimate of the accuracy with
which the expression level of a given gene can be predicted. This is provided by the
85
cross-validation carried out by random forest when the gene-specific regression model is
learned from the training data [308]. Each regression tree in the forest is constructed
from a subset of the samples. The expression level of the gene in a given sample can be
estimated from the regression trees from which that sample was omitted. This is called
the out-of-bag (OOB) estimate. For example, to estimate how well MaLTE will perform
for a given gene as assessed by cross-sample correlation with RNA-Seq, we calculate the
cross-sample correlation between the OOB estimates and the RNA-Seq data from the
training samples. This provides an accurate estimate of the cross-sample correlation
in test data (Supplementary Fig. B.2). The OOB estimate can be used as a filter, so
that MaLTE returns expression estimates only for genes with a desired property, such
as high cross-sample correlation with hypothetical RNA-Seq (note the RNA-Seq data
is hypothetical here as MaLTE will typically be applied to samples for which RNA-Seq
data is not available). By thresholding on the OOB cross-sample correlation, we found
that very high values of cross-sample correlation can be achieved for a subset of genes
(Fig. 5.4a). Because genes that pass the OOB cross-sample correlation threshold are
likely to have high cross-sample variation, median-polish and PLIER also achieve higher
cross-sample correlation for these genes. However, MaLTE maintains a performance
advantage over the other methods with increasing threshold values (Fig. 5.4a).
a
b
Figure 5.4: The effects of OOB filtering. Mean cross-sample Pearson correlation
as a function of OOB correlation threshold for (a) genes and (b) transcripts. Error bars
correspond to two standard errors. Note that transcript-level estimates are not provided
by RMA and PLIER. The black line represents the number of genes/transcripts at each
level.
86
5.3.3
Inference of differential expression
For many gene expression studies, a key objective is to identify the set of genes that
are differentially expressed between groups of samples (e.g. disease versus non-disease
or treatment versus control). To assess the performance of MaLTE in differential expression analysis we compared gene expression between two different GTEx tissues:
heart muscle and skeletal muscle. RNA-Seq has been shown to have higher statistical power than microarrays for detecting differentially expressed genes [321]; therefore,
we used similarity to the set of differentially expressed genes identified by RNA-Seq
as the metric to evaluate the performance of MaLTE compared to median-polish and
PLIER (Supplementary Fig. B.3 and Fig. B.4). To limit the influence of differences in
platform-specific techniques for identifying differentially expressed genes (e.g. Cuffdiff
for RNA-Seq, limma for microarrays) we used standard t-tests, and ranked the 21,367
common genes by the q-value. We measured the agreement between the ranked lists
produced from the microarrays by alternative methods and from RNA-Seq using the
concordance correlation coefficient (ρc ) and cumulative Jaccard index (see Section 5.2.4;
Supplementary Fig. B.3a). MaLTE showed significantly higher concordance with RNASeq than median-polish or PLIER (ρc = 0.34, 0.16 and 0.12, respectively; Supplementary
Fig. B.3b).
5.3.4
Estimation of transcript isoform expression
A major advantage of RNA-Seq over microarrays is that the former can be used to discover and quantify the expression of novel transcript isoforms, resulting, for example,
from alternative splicing. Microarray platforms have been developed to assess alternative
splicing by incorporating probes designed to target individual exons or splice junctions.
Differential mRNA splicing can be detected by comparing exon inclusion rates or using
exon-level parametrized models [324–327]. It is difficult to obtain reliable estimates of
expression at the level of transcript isoforms from microarrays, although some methods
have been developed for this purpose [183, 185, 267]. MaLTE extends naturally to the
estimation of the abundance of specific isoforms by replacing gene expression as the fitted variable with transcript expression. Although it contains a lower density of probes
than Affymetrix exon microarrays, the Affymetrix Human Gene 1.1 ST array contains
multiple probe sets along human genes and can be used to measure alternative splicing
[328]. Using multiple response regression, we learned the relationship between the expression level of multiple transcript isoforms estimated from the RNA-Seq data and the
fluorescence intensity of all probes mapping to the gene. We predicted the expression
of 140,212 transcript isoforms, achieving mean cross-sample correlation with RNA-Seq
87
estimates of 0.29. Again, the OOB expression estimates enabled us to identify smaller
sets of transcript isoforms whose variation across samples can be predicted more accurately (Fig 5.4b). For example, about 25,000 transcript isoforms have a mean Pearson
cross-sample correlation of over 0.65 (Fig. 5.4b and Supplementary Fig. B.5).
5.3.5
Application of MaLTE trained on GTEx data to independent
microarray datasets
To determine whether MaLTE regression models, trained on a diverse panel of GTEx
tissues, can be applied to estimate expression from microarrays generated independently,
we downloaded a dataset, consisting of Affymetrix Human Exon 1.0 ST microarrays
and RNA-Seq data from a set of brain samples from a recent study [317] (we refer
to this as the Mazin dataset). The fact that these data were from a different array
platform (an exon rather than gene array) posed a particular challenge, requiring that we
restrict MaLTE to the subset of probes that are shared between the platforms (425,268 of
5,432,523 of the exon array probes are on the gene array). In spite of this, MaLTE again
provided considerable improvements in within-sample correlations compared to medianpolish and PLIER and similar performance in cross-sample correlation (Figs. 5.5 and
Supplementary Fig. B.6). For this comparison, all of the methods used only the set
of probes shared between the platforms because these are the only probes available to
MaLTE. Without this restriction, the cross-sample correlations obtained using medianpolish and PLIER applied to all core probe sets were, in fact, lower than for MaLTE.
This is likely to be the result of noise resulting from lower quality probe sets that are
not shared between the two platforms. Indeed, the majority of exon array probes have
been shown to contribute little to expression signals [328].
88
a
b
c
d
Figure 5.5: Application to archived data. MaLTE, trained using the GTEx
data, was applied to predict gene expression from published microarray data based on
brain samples for which RNA-Seq data was also available. Despite the fact that the
two studies used different array platforms (Affymetrix Human Exon 1.0 ST arrays and
Affymetrix Human Gene 1.1 ST arrays for the brain and GTEx studies, respectively),
MaLTE predictions exceeded the within-sample correlations obtained using medianpolish and PLIER. MaLTE predictions were based on probes shared between the two
array platforms. Box plots of (a) Pearson and (b) Spearman within correlations are
shown. (c) Pearson and (d) Spearman cross sample correlations with OOB filtering.
The black line represents the number of genes/transcripts at each level.
89
5.3.6
Evaluation of MaLTE applied to a controlled tissue mixture
dataset
Both of the evaluations of MaLTE discussed thus far involve comparison of expression
estimates learned from microarrays with RNA-Seq estimates. This is the primary evaluation because the objective of MaLTE is to use arrays to approximate RNA-Seq (or
any gold-standard measure that replaces it and for which a suitable training dataset is
available). However, we also applied MaLTE to a tissue mixture dataset, provided by
Affymetrix (http://www.affymetrix.com), for which only microarray data are available. The dataset consists of expression estimates from nine mixtures of commercially
available heart and brain total RNA, with proportions 1:0 (pure heart), 0.95:0.05, 0.9:0.1,
0.5:0.5 (three biological replicates), 0.1:0.9, 0.05:0.95 and 0:1 (pure brain). At each mixture ratio three technical replicates were conducted, for a total of 33 arrays. Tissues
were assayed using Affymetrix Human Gene 1.0 ST arrays. There were 9,455 genes that
were called differentially expressed (FDR < 0.05) between the pure heart and pure brain
samples with each of the three methods. In the absence of biological noise and measurement error we expect to find a perfect linear correlation between the tissue proportion
(i.e. brain or heart proportion) and the expression level for these genes. Because the
biological noise is shared between the methods, the Pearson correlation coefficient gives
an estimate of how consistent the gene expression measurements are across different
sample mixtures. In this test, we would expect PLIER and RMA to outperform MaLTE
because PLIER and RMA directly summarize the probe intensities, whereas MaLTE
allows more complex relationships between expression levels of different probes and the
gene expression estimate. The key advantage of MaLTE is that it provides an estimate
of the absolute expression level, whereas, although the RMA and PLIER estimates are
consistent across tissue mixtures, their relationship with actual gene expression level can
be unclear. Nonetheless, the mean absolute value of the Pearson correlation coefficient
from MaLTE was similar to that of RMA and PLIER (0.839 versus 0.874 and 0.896 for
MaLTE, RMA and PLIER, respectively).
Because MaLTE provides gene expression estimates on the absolute scale, characteristic
of RNA-Seq it has significant advantages over the other methods in certain settings. For
example, MaLTE outperformed the other methods when we applied CellMix [320], to
estimate the tissue mixture proportions from the expression data, using gene expression
deconvolution techniques [329]. The correlation between true and estimated proportions
was high for all methods (Supplementary Fig. B.7), but values estimated using MaLTE
were closest to the true proportions (the slope of the regression line was 1.02 for MaLTE,
compared to 0.86 and 0.96 for RMA and PLIER, respectively; Supplementary Fig. B.7).
Both Spearman and Pearson correlation coefficients between the true and estimated
90
proportions were also highest for the estimates from MaLTE. In all cases filtering was
applied (using OOB and removal of noisy genes; see Section 5.2.6) at the training stage
to determine which genes could be estimated accurately by MaLTE, but expression estimates for the same set of genes were provided to CellMix for each of the array estimation
techniques.
5.4
Discussion
Oligonucleotide expression microarrays frequently include multiple probes targeting genes
or parts of genes. For these arrays, a key consideration is how to summarize fluorescence
intensity signals from multiple probes in order to arrive at an estimate of the feature
(e.g. gene or exon) of interest. A wide range of strategies to evaluate the performance
of different microarray analysis tools and pipelines have been developed [289, 290]. We
propose an alternative method to estimate gene and transcript expression from microarrays as well as a very different approach to the evaluation of performance. Given a gold
standard measure of gene expression our goal is to obtain expression estimates from
the microarray data that approximate it as closely as possible. The evaluation of performance then becomes a matter of determining how closely the expression estimates
derived from the microarrays match the gold standard estimates. In principal, any highthroughput measure of gene expression can play the role of gold standard, provided a
sufficiently large and diverse set of training samples with both microarray and gold standard expression data is available. Here we use RNA-Seq because it has been shown to
have several important advantages over microarrays [295, 330]. Our goal is to estimate
gene expression from microarrays in a way that attains some of these advantages. To
achieve this, we learned the relationship between RNA-Seq transcript expression levels
and microarray probe intensities from a subset of the GTEx [300] samples, evaluating
performance on the remainder, as well as on data from a second study [317] that also
generated RNA-Seq and microarrays from the same set of samples.
Compared to two widely-used methods to estimate expression levels from microarrays,
MaLTE obtained much greater within-sample correlation with RNA-Seq estimates on
both the test GTEx data [300] and on the brain dataset [317] and comparable (for the
brain dataset) or significantly better (for the GTEx test data) cross-sample correlation.
The performance on the brain dataset was achieved in spite of the use of very different
array platforms for the training and test samples in this case that shared only a small
proportion of probes (an exon microarray for the brain dataset and a gene microarray
for GTEx). In addition to high within-sample correlation coefficients, the slopes of the
91
regression lines of MaLTE against RNA-Seq for each sample were close to one (Supplementary Fig. B.6). Taken together, this indicates that MaLTE provides an estimate of
the RNA-Seq data on the same scale as RNA-Seq. MaLTE can be applied in the context of large studies where RNA-Seq and arrays are applied to a subset of the samples
and arrays only to the remainder. The performance on the Mazin dataset suggests that
MaLTE, trained on the GTEx data, can also be applied to archived samples, despite
batch effects and, in this case, differences in array platforms. Further improvements
in performance are likely to be possible if batch effects are modeled and appropriate
methods exist for this purpose.
Tissue-specific alternative splicing can complicate the relationship between the signal
from a collection of probes and gene expression level because different transcript isoforms
may dominate in different tissues. We have found (data not shown) that the performance
of MaLTE can be further improved by first carrying out principal component analysis
on the probe intensity matrix and including a subset of the principal components as
features that MaLTE uses to predict gene expression. In this case the feature set for a
given gene includes shared (experiment-wide) principal components in addition to the
gene-specific probe intensities.
Our results show that MaLTE, trained on the GTEx dataset, can be applied to estimate
gene expression accurately from microarray data generated in other studies. There are
currently over 24,000 expression microarray datasets in the GEO database [331] including
more than 9,000 from humans. Affymetrix GeneChip gene and exon array platforms
together account for over 1,100 expression array experiments, involving over 19,000
samples. MaLTE offers an alternative approach to the analysis of these data, which will
allow the estimated gene and transcript expression levels to become more comparable
with expression estimates from RNA-Seq. Archived microarray datasets represent a rich
resource of data and, by learning the relationship between probe intensities and RNASeq expression estimates, MaLTE offers the possibility of joint analysis of data generated
using RNA-Seq and microarrays. In general, training datasets that have been assayed
using different platforms represent a Rosetta Stone for gene expression measurement,
allowing measurements from one platform to be translated to another. Due to the
study size and the breadth of sample types, GTEx serves this role for Affymetrix gene
oligonucleotide arrays and RNA-Seq.
92
Chapter 6
Differential expression analysis on
retinitis pigmentosa samples
using RNA-Seq
RNA-Seq data was sequenced at TriSeq at Trinity College Dublin.
Analysis and authorship of this chapter is by PKK.
Abstract
RNA-Seq data has become a standard quantification tool in the biologist’s toolbox and is taking over
from microarrays. Often, as with other technologies, data may be generated presenting systematic biases
introduced either during sample preparation or sequencing placing a high demand on normalisation.
Rigorous quality control can help identify and guide in salvaging affected data. We present a thorough
analysis of the RNA-Seq data that was generated from similar samples to those analysed in Chapter 4,
which had several technical artefacts. From these preliminary analyses we found that filtering out noisy
genes prior to simple quantile normalisation reduced the impact of poor coverage and high duplication.
We then performed differential gene expression analysis using RNA-Seq, MaLTE and microarray (RMA)
expression measures to identify potential aberrantly spliced genes and compared performance and results
from each.
6.1
Introduction
As we have presented in the preceding chapter, high-throughput sequencing exhibits
several desirable advantages over alternative technologies and we took advantage of
these to develop the MaLTE framework. In this chapter, we outline differential gene
93
RNA-Seq Differential Expression Analysis
expression analysis using RNA-Seq data that was generated as an extension to Chapter
4. This chapter, therefore, extends the analysis of Chapter 4 and 5 and compares
differential expression results obtained using both MaLTE and RMA (exon array) to
RNA-Seq results.
The output of an RNA-Seq experiment are short cDNA sequences called reads (usually sequenced in pairs from the end of a cDNA fragment) accompanied with per-base
quality scores. Processing RNA-Seq data involves three main steps: mapping reads
to a reference (genome, transcriptome or exome); summarisation of reads into a single measure, and normalisation to correct for library-specific differences [332]. Examples of publicly available mapping tools include BWA [333], Bowtie [334], MAQ
(http://maq.sourceforge.net/index.shtml), SOAP [335] and many others (http:
//en.wikipedia.org/wiki/List_of_sequence_alignment_software). A fraction of
the reads will have exon-exon junctions because the cDNA is derived from spliced
mRNA therefore gapped aligners such as TopHat [194] are frequently used. Summarisation and normalisation are typically performed simultaneously and can be carried out
either arithmetically (sums of reads over specified regions) such as with RPKM [322]
and counts-per-million [336] or using statistical models [337]. Despite the advantages
gained using sequencing-based quantification, there are a number of biases which may
have a severe impact on differential expression analysis. Some of the biases observed
are due to sequencing coverage [338], GC content [339], feature length [340, 341], and
read distribution biases [342, 343]. Correction for biases often entails elaborate statistical normalisation that aims to model and eliminate the bias. Normalisation can be
effected to minimise either between-library biases (such as coverage) or within-library
biases (such as transcript length).
Here we analyse RNA-Seq data that exhibited considerable technical biases in the form
of coverage and duplication. We first performed quality analysis of the RNA-Seq data
highlighting the two major technical biases. We show that excluding poorly covered
genes prior to quantile normalisation can mitigate the effects of poor coverage. We
then perform differential expression analysis using the normalised data and compare
the results to those obtained using MaLTE and RMA. From these analyses, we found
that the Parkinson’s-associated gene α-synuclein (SNCA) showed evidence of differential
expression using all three expression measures and, in conjunction with results from our
previous analysis, may be differentially spliced between cases and controls. We conclude
this chapter with a discussion on SNCA based on a survey of its role in pathology.
94
6.2
6.2.1
Materials and Methods
RNA-Seq
RNA was extracted from whole blood samples of the four sibling-control pairs, described
R
in Chapter 4, using the QuantiTect
Reverse Transcription (RT) kit (Qiagen) and cDNA
libraries were prepared according to the Illumina RNA sequencing protocol. Poly-A
containing molecules were purified, fragmented and cDNA was synthesized, followed by
end repair and adaptor ligation and purification. Paired-end sequencing was carried out
on an Illumina Genome Analyzer II producing reads of 50bp (74bp for sample T3 ). All
lanes were processed using the Illumina GA Pipeline version 1.4.0. We used one lane to
sequence the ΦX174 DNA control to calibrate the run quality scores.
6.2.2
RNA-Seq data analysis
Paired-end Illumina reads were mapped to the hg19 build of the human genome using
TopHat v.1.1.4 [194] enforcing unique read mapping (-g flag). We evaluated RP mutations based on data provided from the Online Mendelian Inheritance of Man (OMIM)
resource (http://omim.org/) and used the samtools mpileup utility [344] to assess variation present in all samples. The quality of aligned reads was evaluated using FastQC
v.0.9 (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and Picard
v.1.40 (http://picard.sourceforge.net/).
Picard was used to estimate the level
of optical duplicates and to mark PCR duplicates. Quantification of gene expression
was achieved using Cufflinks v.2.1.1 [181] using the following non-default parameters:
-u (multiread correct) and -b ref.fa (fragment-bias correct). Genes were quantified
against the Ensembl v.66 gene model [206]. We filtered out all genes having at least
four samples expressed below an FPKM of one as well as those having a mean expression in the top percentile. The remaining genes were quantile normalised using the
normalizeQuantiles() function from the limma R package [255]. Differential expression
was carried out using standard paired t-tests in R [345] and moderated paired t-tests
provided in the limma R package. All intermediate analyses (expression asymmetry
(Fig. 6.10), overlapping gene lists (Fig. 6.13), were performed using custom R scripts.
6.2.3
Preparation of MaLTE and RMA expression estimates
A detailed outline on how MaLTE expression estimates are obtained is described in
Appendix C. Briefly, gene-specific regression models were trained based on the GTEx
data described in Chapter 5, using the quantile regression forest algorithm. In order
95
to make predictions based on the RP exon array data, probe sets from the exon array
were mapped to gene array probe sets. The exon array comprises a much larger number of probe sets and exon array probe sets that do not occur on the gene array were
discarded. Predictions were filtered at an OOB correlation threshold of 0.5 and quantile normalised. RMA expression estimates were also generated from exon array data
based on a modified library consisting only of probes used by MaLTE to ensure they
were comparable. RMA expression estimates were computed using Affymetrix Power
Tools v.1.12.0 (http://www.affymetrix.com/estore/partners_programs/programs/
developer/tools/powertools.affx).
As stated in Chapter 4, RP-PRPF8 samples were designated T1−4 and corresponding
sibling controls C1−4 .
6.3
6.3.1
Results
Quality assessment of RNA-Seq data
We generated the RNA-Seq data to validate the microarray results presented in Chapter 4. However, we encountered several technical challenges with this data. This section
highlights some of these concerns.
We used FastQC and Picard to assess the quality of the RNA-Seq data. FastQC performs
several diagnostic tests on the data to evaluate read quality based on quality scores, GC
content, presence of over-represented sequences, and the overall duplication rate. We
used Picard to estimate the levels of both PCR and optical duplication. We illustrate
each quality assessment result in turn using plots for sample C1 .
Coverage is defined as the number of times a region (entire genome, transcriptome or
targeted locus) is sequenced and gives the expected number of reads covering all genomic
bases. It is approximated by the expression
C=
LN
,
G
where L is the read length, N is the number of reads and G is the genomic length [346]
(http://res.illumina.com/documents/products/technotes/technote_coverage_calculation.
pdf). Columns two and four in Table 6.1 show the number of mapped reads in millions
and the estimated coverage, respectively. It is evident that sample T3 has about onequarter of the reads present in each of the other samples and less than half their coverage
and is likely to have a negative impact on the accuracy of differential gene expression
analyses. It is important to note that sample T3 also differed in read length which
96
Sample
T1
T2
T3
T4
C1
C2
C3
C4
Mapped
reads
millions
29.05
34.72
7.97
27.65
27.77
33.45
31.40
30.56
Read
length
bp
50
50
74
50
50
50
50
50
Coverage
% GC
% dupl.
% optical
dupl.
0.48
0.58
0.20
0.46
0.46
0.56
0.52
0.51
50
48
54
52
50
51
50
50
77.98
68.37
43.61
65.08
68.16
66.68
60.62
65.08
0.011
0.0091
0.24
0.012
0.016
0.014
0.079
0.0071
Table 6.1: Summary of quality assessment on RP RNA-Seq data from FastQC and
Picard.
had a positive effect on coverage but not enough to salvage the poor coverage. This
single-sample low coverage presented one of the main technical setbacks to using this
dataset.
Overall GC content varied from a low of 48% to a high of 54%. However, the first 14
or so bases appeared to exhibit higher GC content than the rest (Fig. 6.1), which arise
due to biases introduced by random hexamer primers [343]. The pattern observed in the
first 13 bases was identical for all samples characteristic of the primer bias. Moreover,
the read quality for the first 14 bases was the highest of all bases confirming that they
were not the result of sequencing error (Fig. 6.2). In effect, using the random hexamers
non-uniformly samples cDNA fragments from the transcriptome. Therefore, trimming
the first 14 bp does not resolve this problem [343] because the underlying sampling bias
would persist in the remaining sequences. We therefore used the full read lengths because
we could not circumvent this bias. The effect of random hexamer bias was also noticeable
in the sequence content (Fig. 6.4) and the over-represented sequences (Fig. 6.5), which
were CCAGG, TGTCG, CTGGG, CAGCA, AAAAA (possible poly-A remnants) and CCAGC
(60% GC in these pentamers). Base quality had overall means Phred scores per sample
of over 36 (Fig. 6.3) reflecting base calling accuracy of over 99.9% [347].
97
Figure 6.1: Plot of GC content found in each base position.
Unfortunately, the data exhibited a high variance in duplication levels ranging from a
low of about 44% (T3 ) to over 77% (T1 ). There are two forms of undesirable duplication: optical and PCR duplication. Optical duplicates result during the base reading
phase of sequencing in which two or more fragment clusters are located very close to one
another on the flow cell resulting in identical coordinates. Therefore, the sequence of
nucleotides could be from either cluster introducing uncertainty in the inferred sequence.
Using Picard (http://picard.sourceforge.net), we found very low levels of optical
duplicates (well under 1%; column 7 of Table 6.1) and therefore ignored them. PCR
duplicates, on the other hand, are harder to identify. Defining PCR duplicates as reads
having the same sequence content and sharing 5’ coordinates is ambiguous. It ignores
the possibility that similar fragments may be sequenced independently. Nevertheless,
conditional on the gene expression level, one would expect that the proportion of duplicates arising because of PCR amplification would constitute a substantial proportion
of observed duplicates: low for highly expressed genes but may be sizeable at lower
levels. Excluding duplicates would have presented a serious challenge in further analysis (Fig. 6.6) therefore we opted to retain them and attempted to correct this through
alternative normalisation.
98
Figure 6.2: Plot of distribution of quality scores in each base position.
In summary, we observed two main biases that had the potential to severely affect
expression analyses: coverage differences (low in T3 ) and high duplication (most samples
but particularly in T1 ). Overall read quality was good indicated by the high quality
scores.
99
Figure 6.3: Plot of mean quality score for each short read sequence.
100
Figure 6.4: Plot of nucleotide percentage at each base position.
101
Figure 6.5: Plot of k -mer enrichment profile across short read sequences.
102
Figure 6.6: Effect of removing duplicates on the number of mapped reads.
6.3.2
Identification of retinitis pigmentosa mutations from RNA-Seq
data
Sequencing-based transcriptome assays have the advantage of simultaneously providing quantitative and descriptive information. We exploited this feature of RNA-Seq to
assess 20 known retinitis pigmentosa (RP) associated genes at 120 loci. To do this,
we used the samtools mpileup v.0.12.7 utility by compiling bases of reads aligned at
individual genomic coordinates [344]. We used the coordinates as specified in OMIM
diseases database for the RP mutations (http://www.omim.org). Of the 120 loci examined only 20 were covered in at least one sample and eight loci in all eight samples. The
eight loci were all found on PRPF8 and CA4, which showed that, in addition to the
PRPF8 mutation (RP13) designating the samples as case-control, there was an additional disease-associated mutation on CA4 (RP17) in samples T2 , C2 and C4 (Table 6.2).
However, C2 and C4 (both control samples) have not presented with any RP symptoms
(Lisa Roberts, personal communication, 27th February 2012).
103
Design.
RP13
RP17
Gene
PRPF8
CA4
Mutation
p.H2309R
p.R14W
T1
+
-
T2
+
+
Samples
T3 T4 C1 C2
+ +
+
C3
-
C4
+
Table 6.2: RP-associated mutations identified using RNA-Seq data. Mutations identified in whole blood samples.(+) present, (-) absent.
6.3.3
Quality control optimisation
We assessed how experimental groups clustered using principal components and observed the effect of filtering and quantile normalisation on the densities of expression.
Principal components can give a global view on the relationship between experimental
groups. They can also identify outliers and guide in quality control steps. To the best of
our knowledge, applying quantile normalisation after filtering has not been done before.
We demonstrate that it may be used to alleviate low coverage bias. First, we quantified
expression using Cufflinks as described in Materials and Methods then applied a log-2
transformation on expression estimates. Clustering was evaluated following different
filtering strategies (no filtering, using well-predicted genes using MaLTE OOB filtering,
and filtering out noisy genes). The OOB filtering step was done because the final comparison with MaLTE would be based only on common genes between RNA-Seq, MaLTE
and RMA. Because the MaLTE genes were OOB genes (see Materials and Methods),
RNA-Seq would effectively undergo this type of filtering.
Figure 6.7a shows a plot of the first two principal components (PCs) which account
for 46.49% (PC1) and 14.9% (PC2) of the variation, respectively. Sample T3 failed to
cluster with the other samples and there was no clear partitioning of samples into cases
and controls. The truncated densities of expression were also different across samples
(Figure 6.9a).
104
b
T3 ●
●
Cases
Controls
T4
T2 C1
C3
−0.40
●
−0.35
C2
C4
●
0.4
●
C4
●
●
●
−0.30
−0.25
0.2
PC1 [46.49%]
C2● C1
●
●
T4
●
0.3
0.4
T2
C3
0.5
PC1 [52.47%]
T3
●
●
●
●
0.0
C1
C4
T2
C3
C2
T4
T1
0.2
T1T3
● ●
●
0.4
0.6
0.8
1.0
●
●
●
0.350
PC1 [82.5%]
T4
C2
●
−0.4
−0.4
−0.2
PC2 [1.19%]
●
0.0 0.2 0.4 0.6
d
0.0 0.1
c
PC2 [16.83%]
T1
−0.4
T1
●
●●
●
T3
0.0
0.6
−0.2
0.2
PC2 [14.9%]
●
PC2 [14.43%]
●
0.8
1.0
a
C1
0.352
0.354
T2●
C3
C4
0.356
PC1 [96.03%]
Figure 6.7: First two principal components for RNA-Seq expression estimates with and without various filtering approaches. (a) No filtering applied,
(b) filtering using the subset of genes that passed OOB filtering in the GTEx data (for
comparison with MaLTE), (c) combining OOB filtering and removal of noisy genes,
and (d) OOB filtering, removal of noisy genes then quantile-normalisation.
Using OOB-filtered genes (Figure 6.7b), the proportion of variation of the first PCs
increased (to 52.47%), though T3 was still an outlier. Next, we removed noisy genes to
assess their effect. Typically, low expression (having an expected expression of fewer than
one fragment per kilobase of million-mapped reads, one FPKM) and high expression (top
1%) genes may be excluded to eliminate genes dominated by noise [348, 349]. Therefore,
we excluded all genes expressed at less than one FPKM in at least four samples and
those in the top percentile of mean expression across samples. Sample T3 exhibited much
higher densities (Fig. 6.9b) and was even further from the other samples. However, the
proportion of variation captured by P C1 markedly increased to 82.5% (Fig. 6.7c). The
densities of filtered genes were symmetrical about the same median and mode for all
105
samples (including T3 ) though T3 had much higher densities. Because of this, we applied
quantile normalisation to eliminate the coverage bias [349]. While quantile-normalisation
did not entirely resolve the outlier effect, it imposed equal distributions on all samples.
Figure 6.7d shows that T1 and T3 (samples with high duplication and low coverage,
respectively) were isolated relatively to the other samples. Additionally, the proportion
of variance covered by the P C1 was as high (96.03%) as exhibited by MaLTE (Figs. 6.8a
and b; over 97%) and RMA (Figs. 6.8c and d; over 93%). In summary, the combined
effect of OOB filtering, filtering out noisy genes and quantile normalisation were able to
substantially reduce the coverage and duplication bias.
a
b
T1 ●
●
●
−0.5
C2
●
C3
C1
●
−0.3545
T1
●
●
●
−0.5
0.0
C4
PC2 [1.02%]
PC2 [0.33%]
T2 ●
T3 ●
T4
●
0.5
Cases
Controls
0.5
●
0.0
●
●
−0.3535
−0.3525
●
0.345
PC1 [98.57%]
T3
C4
●
C3
C1
0.355
0.360
PC1 [97.32%]
d
0.6
c
C1 ●
●
T1
●
0.6
T3
T1
●
●
−0.358
T4
T2
●
−0.354
−0.350
●
●
C2
●
●
−0.6
C4 ●
−0.2
C3 ●
C2 ●
−0.346
●
0.345
PC1 [98.64%]
T2
T4
0.2
●
PC2 [3.13%]
0.2
−0.2
−0.6
PC2 [0.45%]
0.350
T2
●
T4
C2
●
●
C4
T3
C3
C1
0.350
0.355
0.360
0.365
PC1 [93.84%]
Figure 6.8: First two principal components for MaLTE and RMA expression
estimates. (a) MaLTE prior to filtering out noisy genes, (b) MaLTE after filtering out
noisy genes. (c and d) Corresponding plots for RMA.
To assess the effect of quantile normalisation, we performed nominal differential expression analysis using standard t-tests. Figure 6.10a shows the proportion of genes
106
expressed higher in controls across the full range of p-values in 0.05 width bins before
quantile normalisation. Figure 6.10b shows a similar plot after quantile normalisation.
The asymmetrical attribution to lower expression in cases is almost entirely abolished.
Instead, there is a modest abundance of higher expression in cases for a subset of genes
reminiscent of observations in Chapter 4 for exons (Fig. 4.2). Figures 6.10c and d show
results using moderated t-tests (limma) with a similar outcome.
0.3
0.2
0.0
0.1
Density
T1
T2
T3
T4
C1
C2
C3
C4
0.4
b
0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35
Density
a
−10
−5
0
5
10
15
−5
N = 9015 Bandwidth = 0.3784
0
5
10
N = 7455 Bandwidth = 0.2897
Figure 6.9: Density of expression (a) before and (b) after filtering noisy
genes. Both plots are truncated on the x-axis because the scale extends to -700 indicating numerous genes at very low expression levels. Truncation results in the shown
densities not being normalised. Expression estimates (x-axis) are log-2 transformed.
107
0.8
0.0
0.2
0.4
0.6
Proportion of genes
0.0
0.2
0.4
0.6
0.8
0.6
0.0
0.2
Proportion of genes
0.6
0.4
0.0
0.2
Proportion of genes
0.8
1.0
d
1.0
c
0.4
Proportion of genes
0.8
1.0
b
1.0
a
Figure 6.10: Effect of quantile normalisation of RNA-Seq data on nominal
differential expression. (a and b) Differential expression using standard t-tests.
(a) No quantile normalisation, (b) quantile-normalised expression estimates. (c and
d) Differential expression using moderated t-tests (performed with limma). (c) No
quantile normalisation, (d) quantile-normalised expression estimates. Red denotes the
proportion of genes that are expressed higher in cases, while blue denotes those higher
in controls. Under the null hypothesis of no differences in expression both proportions
should be equal at all p-values.
6.3.4
Comparison of RNA-Seq to MaLTE and RMA
We then assessed how MaLTE and RMA expression estimates derived from the exon
microarray experiment described in Chapter 4 compared to the normalised RNA-Seq
expression estimates. We first measured the degree of agreement using Pearson and
Spearman correlation coefficients as described in Chapter 5 (Figs. 5.2 and 5.3). We used
108
MaLTE genes filtered at an OOB Pearson correlation of 0.5 to exclude poorly-predicted
genes. The comparison was then restricted to 7,455 genes common to all methods.
MaLTE and RMA were almost identical in median cross-sample correlations (Figs. 6.11a,
b and Fig. 6.12) but MaLTE had significantly higher within-sample Pearson and Spearman correlations (p = 1.55 × 10−5 for both; Figs. 6.11c and d) indicative of its ability
to better estimate absolute expression estimates. Thus, even on this cross-array application, MaLTE exhibited good performance.
a
b
0.5
0.0
Correlation
0.5
−0.5
−0.5
0.0
Correlation
Spearman
1.0
1.0
Pearson
MaLTE
[0.4]
c
RMA
[0.42]
MaLTE
[0.36]
d
Spearman
0.78
0.70
0.74
Correlation
0.74
0.70
Correlation
0.78
0.82
Pearson
RMA
[0.38]
MaLTE
[0.79]
RMA
[0.72]
MaLTE
[0.82]
RMA
[0.71]
Figure 6.11: Cross-sample and within-sample correlations. (a) Pearson crosssample correlations, (b) Spearman cross-sample correlations, (c) Pearson within-sample
correlations, (d) Spearman within-sample correlations.
109
0.8
0.4
MaLTE (0.4)
RMA (0.42)
0.0
Density
1.2
Pearson cross−sample correlation
−1.0
−0.5
0.0
0.5
1.0
N = 7455 Bandwidth = 0.05856
0.8
0.4
MaLTE (0.36)
RMA (0.38)
0.0
Density
1.2
Spearman cross−sample correlation
−1.0
−0.5
0.0
0.5
1.0
N = 7455 Bandwidth = 0.05849
Figure 6.12: Densities of cross-sample correlations after filtering and quantile normalisation.
6.3.5
Comparison in differential expression analysis
Finally, we assessed how all expression measures performed in differential expression
using both standard and moderated (limma) t-tests. Standard t-tests not only assume
that all genes are independent, which is unrealistic, but have lower power, particularly
when the number of samples is small (as is often the case) [178]. The moderated t-tests
in limma can be applied to RNA-Seq (and by extension MaLTE estimates) but this
utilises a different pipeline commencing with raw read counts. To facilitate comparison
with array we therefore used limma on filtered RNA-Seq estimates having excluded the
genes dominated by noise.
110
We first estimated π0 , the expected proportion of true nulls based on the standard
(paired) t-tests. We estimated these to be 0.84, 0.57 and 0.69 for RNA-Seq, MaLTE
and RMA, respectively, all of which provide varying degrees of evidence in favour of the
existence of a set of differentially expressed genes. The RNA-Seq data had the weakest
evidence, potentially due to lower data quality.
Next, we measured the proportion of overlap with RNA-Seq for top nominally significant genes at various p-value thresholds for MaLTE and RMA (Fig. 6.13). At each
p-value threshold, we obtained the proportion of overlap between the genes identified
using each method then compared it to the size of overlap between MaLTE and RMA.
We used this to infer the relative ability to identify differentially expressed genes in
comparison to RNA-Seq. RMA exhibited consistently higher proportions of overlap
using limma (Fig. 6.13b) . The results were nearly identical for both methods with standard (Fig. 6.13a) t-tests. We also compared how log of fold change value compared to
those from RNA-Seq. We first found the common set of genes at FDR < 10% between
MaLTE and RMA then computed the corresponding correlations. We found seven genes
in common whose log of fold change values with RMA had marginally better Pearson
correlations (r = 0.82, p = 0.024) compared to MaLTE (r = 0.76, p = 0.046). Spearman
correlations were not significant.
111
Proportion of overlap with RNA-Seq
Proportion of overlap with RNA-Seq
Figure 6.13: Proportion of overlap with RNA-Seq to nominally significant
genes using MaLTE and RMA. (a) Standard t-tests, (b) Moderated t-tests (performed with limma). Each bar represents a single p-value threshold. The proportion is
computed relative to the overlap between MaLTE and RMA. Error bars indicate two
standard errors.
112
Gene Legend
Genes (GENCODE)
Genes (GENCODE)
134.22 kb
113
merged Ensembl/Havana
protein coding
Reverse strand
90.64Mb
protein coding
< SNCA-004
protein coding
< SNCA-006
protein coding
< SNCA-008
protein coding
< SNCA-003
protein coding
< SNCA-002
protein coding
< SNCA-001
protein coding
< SNCA-005
protein coding
< SNCA-202
protein coding
processed transcript
Figure 6.14: α-synuclein transcript isoforms on Ensembl.
134.22 kb
90.74Mb
protein coding
< SNCA-007
protein coding
90.76Mb
antisense
< SNCA-201
RP11-67M1.1-001 >
90.76Mb
Forward strand
antisense
< SNCA-009
90.74Mb
RP11-115D19.1-003 >
90.72Mb
90.72Mb
antisense
90.70Mb
90.70Mb
RP11-67M1.1-002 >
90.68Mb
90.68Mb
antisense
90.66Mb
90.66Mb
RP11-115D19.1-005 >
90.64Mb
Symbol
SNCA
SLC4A1
FAXDC2
OSBP2
RNA-Seq
p-value
log2 FC
1.32 × 10−5
2.31
−5
2.08 × 10
2.69
2.37 × 10−5
1.67
3.85 × 10−5
1.89
MaLTE
p-value
log2 FC
5.71 × 10−5
2.72
-
RMA
p-value
log2 FC
2.59 × 10−5
1.14
−6
3.61 × 10
1.61
1.37 × 10−4
1.18
7.12 × 10−4
0.80
Table 6.3: Genes identified as differentially expressed at FDR < 10%. All
genes are expressed higher in RP cases.
Lastly, from the genes with the highest significance, only one was significant using all
three measures at an FDR < 10%: α-synuclein (SNCA) (Table 6.3). The log-2 of
fold was estimated as 1.14 (RMA), 2.72 (MaLTE) and 2.31 (RNA-Seq). The RNA-Seq
estimates indicate that cases expressed SNCA roughly five times higher than controls:
the mean (standard deviation) in cases was 59.48 (6.01) FPKM and 12.44 (4.20) FPKM
in controls. There were three additional genes at the same FDR shared between RNASeq and RMA: solute carrier family 4 (anion exchanger) member 1 (SLC4A1 ); fatty acid
hydroxylase domain-containing 2 (FAXDC2 ); and oxysterol binding protein 2 (OXB2 ),
all of which were expressed at higher levels in cases (Table 6.3).
6.4
Discussion
The utility of RNA-Seq can easily be offset by technical biases introduced either during
library preparation or sequencing. We have presented an analysis of RNA-Seq data
based on whole blood samples from individuals with RP-PRPF8 and sibling controls.
We were able to confirm the PRPF8 variants and found another RP mutation, although
no symptoms were observed in the affected controls. Two technical problems affected the
RNA-Seq data we generated: one sample was sequenced at less than half the coverage
than the rest and another was sequenced with considerably higher duplication than
the others. While there are no standard duplication thresholds, nearly all samples had
over 60% duplication. Despite these technical biases, the filtering and normalisation
we employed appeared to reduce their effects. Our analysis also presented the potential
utility of MaLTE in an analysis similar to the brain microarray data generated by Mazin
et al. [317] in which related arrays were employed. As before, our results highlight how
MaLTE provides higher within-sample correlations but did not perform as well as RMA
on differential expression analysis.
All the genes we identified as differentially expressed were at higher levels in cases. Only
one gene, α-synuclein (SNCA) was differentially expressed at a maximum FDR of 10%
based on all measures. In Chapter 4, we argued that missplicing appeared to target
exons spliced through the exon-defined splicing pathway (Section 2.2.5) in which shorter
114
than average exons were flanked by longer than average introns. Interestingly, exon five
of SNCA was one of those identified (see Section A.7 for accession ENST00000508895:5 ).
This phase 0-0 exon is 84 bp long and flanked by a 92 kbp upstream intron and a 2,533 bp
downstream intron making it especially vulnerable to missplicing or being processed as
a cassette. It is found in seven (of 11 known) splice isoforms, five of which are translated
into 140 bp amino acid sequences (Fig. 6.14). The five transcripts encode the main
protein product of SNCA, designated SNCA140 [350]. The other two transcript isoforms
containing this exon are translated into 126 bp amino acids (SNCA126 ) involving a
frameshift deletion of exon five [350]. Of the other four protein-coding isoforms, three
exclude exon five (all designated SNCA112 because of they encode of 112-115 bp amino
acids) [351] and one has an incomplete CDS (see Ensembl transcript table for SNCA).
Aberrant splicing in this gene could therefore be the result of higher inclusion of exon
five through one of two ways. Higher inclusion could occur if more than the required
SNCA140 or SNCA126 isoforms (having exon five) are spliced. On the other hand,
higher inclusion could be the result of failure to skip exon five in a SNCA112 isoform.
Both scenarios may also occur simultaneously.
The SNCA gene has been intensively studied in relation to its role in neurodegenerative diseases. It is predominantly associated with familial autosomal dominant, juvenile
forms of Parkinsons disease (PD) [352, 353]. The SNCA mRNA and protein are expressed in the retinal pigment epithelium (RPE) and neural retinal cells, the latter
being consistent with its association with the central nervous system specifically with
presynaptic terminals [354]. The latter study was prompted by previous reports citing
instances of visual and morphological impairments in retina of PD patients [355]. It had
previously been shown that transgenic fruit flies harbouring both normal and mutant
(A30P and A53T) SNCA showed a development-associated retinal tolerance of human
SNCA only resulting in retinal degeneration in adult flies [356]. However, at present,
our results are only speculative and splicing sensitive expression profiling on RP-affected
retinal tissue will need to be done to expose any link between SNCA missplicing and
splicing factor RP.
115
Chapter 7
Conclusion
In this thesis, we have examined several questions relating to estimation and analysis
of gene expression and alternative splicing focusing on development and disease. We
began with a survey of the literature commencing with the seminal work by Sharp [2]
and Roberts [4] and ended that discussion on the pervasiveness of alternative splicing
in higher order organisms, highlighting some approaches to measurement of alternative splicing. This final chapter briefly summarises the key findings, draws concluding
remarks then anticipates future directions.
7.1
Summary
In our first analysis (Chapter 3), we presented evidence that the intron content in a
subset of genes associated with embryonic development is conserved in mammals. We
showed that homeodomain genes are enriched among the set of genes with conserved
intron content. In particular, a far higher proportion of genes that play a role in patterning processes such as somitogenesis have conserved intron content. Next, we performed
a transcriptome wide analysis of splicing using exon arrays (Chapter 4). We found that
a subset of exons exhibited higher inclusion in individuals affected with an autosomal
dominant mutation on PRPF8 that leads to retinitis pigmentosa (RP). PRPF8 is a
key splicing factor and our results support the hypothesis that this mutation may act
as a trans-sQTL. We then outlined a new approach to gene and transcript isoform expression estimation that provides RNA-Seq-like expression estimates (Chapter 5). Our
method, named MaLTE for machine learning of transcript expression, improves microarray expression estimates by learning regression forest models of gene expression between
microarray probes and RNA-Seq. From our analysis based on additional expression estimates from RNA-Seq and MaLTE, we were able to provide further evidence for our
117
Conclusion
results in Chapter 4 that showed higher expression of a subset of genes consistent with
higher exon inclusion in RP cases (Chapter 6).
7.2
7.2.1
Conclusions
More emphasis on isoforms rather than genes
It is now widely accepted that almost all multi-exon genes in humans are alternatively
spliced with most spliced in a tissue-specific manner [119]. To say that a gene is highly
expressed is, therefore, not informative as this provides little indication of the actual
transcripts involved. For example, the results presented in Chapter 6 implicated at least
one transcript isoform in SNCA of being aberrantly spliced, all of which, when viewed
alongside each other, are virtually identical (Fig. 6.14). Given the recent transcript
isoform sequencing (TIF-Seq) results from yeast that showed wide diversity in transcript
isoforms from an organism that undergoes relatively little alternative splicing [170], it
is likely that the resolution of current transcriptome sequencing merely scratches the
surface of transcriptome complexity. Isoform expression estimation will undoubtedly
take over from gene expression. This will demand advances in quantitative assays and
computational approaches that paint a better picture of alternative splicing.
In light of this, there is a noticeable shift towards new approaches that combine the
best of various technologies or incorporate new information in isoform analysis. As we
outlined in Chapter 2, some of the tools available for isoform quantification incorporate
both identification and quantification. However, one group has come up with a novel approach using hybrid sequencing: combining the long reads from PacBios pyrosequencing
for isoform discovery with conventional RNA-Seq involving the newly identified isoforms
in quantification [357]. It has also been proposed that integrating information on chromatin 3D conformation could enable a better view of transcriptional complexity [358].
The advent of single cell sequencing is also likely to introduce cell-specific features which
are typically lost when analysing cell populations [359].
7.2.2
From algorithms to better quantitative assays
One of the observation made during in the course of developing the MaLTE framework was that only a subset of predictors (identified using simple feature selection)
were required to predict expression with equivalent accuracy to widely used microarray
summarisation. Microarray design may be redundant and could benefit from a design
118
Conclusion
strategy that works in reverse: identifying optimal probes starting from known expression estimates. High density arrays, such as the exon array, currently rely on the coverage
of probes across the entire length of a gene with higher coverage being preferred [318].
However, as has been articulated by others [328], this increases the number of probes
expressed at background levels (approx. 80% in the exon array). Understandably, competition between targets as the number of probes increases is likely to be the main source
of unreliability. Therefore, higher coverage might not imply better expression resolution
but may only be beneficial in detection of splicing events. Alternatively, one study has
already demonstrated the clinical potential of isoform-specific microarrays in which a
subset of pre-mRNA splicing events of interest may be captured [360]. A MaLTE-like
framework could be useful in identifying the most informative probes for the array.
7.3
7.3.1
Future work
Conservation of introns
Our results highlight an important functional role played by intron content in mammals
by introducing delays. However, using conservative estimates for RNA pol II elongation
rate (e.g. 1.9 kbp min−1 [222]) does not full account for the 19 minute delay observed by
Takashima et al. [6]. Other avenues we did not investigate are whether the speed of RNA
pol II is varied during transcription due to other factors such as RNA pol II queues [166],
spliceosomal influences on transcription [361–363], or even epigenetic factors [364, 365].
Future work on intron conservation will require models that explicitly capture insertion
and deletion events across several lineages. The work presented here highlighted the
potential of using the Enredo-Pecan-Ortheus (EPO) pipeline [366, 367] to infer insertions
and deletions from ancestral sequences from which these can be discerned.
7.3.2
Retinitis pigmentosa transcriptome
The retinal transcriptome is complex [232]. Identifying the underlying transcriptome
aberrations involved in RP require far higher coverage than our experiment was able to
accomplish. Unfortunately, obtaining retinal tissue samples is a challenge. Nevertheless,
more work needs to be done on closely related species to study the etiology of retinal
degeneration in light of functional transcriptomics. Another alternative would be to use
induced pluripotent stem cells using non-retinal samples from individuals affected by
RP-PRPF8 (Raj Ramesar, personal communication).
119
7.3.3
Further development of the MaLTE framework
The MaLTE framework is a starting point for incorporating statistical learning techniques to bridge the gap between legacy assays such as microarrays and newer sequencingbased technologies. While forest-based methods display attractive properties, they have
a few limitations. As we pointed out in Appendix B, boosted regression trees (BRT)
performed just as well as the random forest variants we used with the exception that
BRT was more computationally intensive. In general, forests can be computationally
intensive and, depending on the number of trees used, storing the forests consumes a
large amount of memory. We were able to bypass the large storage requirements by
treating each gene independently then carrying out test data predictions immediately
after training and easily lends itself to parallel computation. Improvements can also be
made on how forest-based learners estimate the variance without sacrificing accuracy.
Resolving these limitations will make MaLTE a powerful tool in extending the benefits of
high-throughput sequencing quantification to the more affordable array-based platforms
and unlock the vast collection of archived array datasets.
There is no doubt that deeper insight into gene expression and alternative splicing will
be instrumental in furthering biological understanding. It is seminal discoveries such
as those made by Sharp and Roberts, which have the ability to radically revise on our
perspectives, that will advance this goal. The staggering diversity of tissues promises
to makes this a long-standing challenge. Ironically, it is often in the context of the
transitory effects of development or disease-causing aberrations that we learn much on
the intricacies of cellular mechanisms.
120
121
Supplements for Chapter 4
Appendix A
Supplementary results for
Chapter 4
A.1
Quality Assessment
B
540
0.5
●T1
0.0
●T3
●C4
●T2
T1
500
PC2
●T4
C1
●C3
●C1
●
T4
T3
C4
T2
C3
−0.5
●C2
C2
460
distance
580
A
●
−0.358
−0.354
Samples
−0.350
cases
controls
−0.346
PC1
C
D
0.5
0.5
●T1
●T4
C3●
●C1
0.358
0.0
●T2
0.354
●T2●
●C4
T3
−0.5
−0.5
●C2
C3●
●C2
●T3
●C4
PC2
0.0
PC2
●T4
●C1
0.350
0.346
●T1
−0.360
PC1
−0.355
−0.350
−0.345
PC1
Figure A.1: Quality assessment of exon array data. A. Hierarchical cluster of
samples. Ti are cases, Ci are controls. B. Plot of first two principal components (PCs)
for full probesets. C.,D. First two PCs for exonic and intronic probesets, respectively.
122
A.2
Sample Permutations
A.2.1
Core Probesets
(T1,T2,T3,C4) vs. (C1,C2,C3,T4)
pi_0 = 0.8473
0.8
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
pvalues
pvalues
(T1,C2,T3,T4) vs. (C1,T2,C3,C4)
pi_0 = 0.8921
(T1,T2,C3,T4) vs. (C1,C2,T3,C4)
pi_0 = 0.9779
1.0
0.4
0.0
0.0
0.4
density
0.8
0.8
0.0
density
0.4
density
1.0
0.5
0.0
density
1.5
1.2
(T1,T2,T3,T4) vs. (C1,C2,C3,C4)
pi_0 = 0.7553
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
pvalues
pvalues
(T1,T2,C3,C4) vs. (C1,C2,T3,T4)
pi_0 = 0.9929
(T1,C2,C3,C4) vs. (C1,T2,T3,T4)
pi_0 = 1.0
1.0
0.8
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
pvalues
pvalues
(T1,C2,C3,T4) vs. (C1,T2,T3,C4)
pi_0 = 1.0
(T1,C2,T3,C4) vs. (C1,T2,C3,T4)
pi_0 = 1.0
1.0
0.4
0.0
0.0
0.4
density
0.8
0.8
0.0
density
0.4
density
0.4
0.0
density
0.8
1.2
0.0
0.0
0.2
0.4
0.6
0.8
1.0
0.0
pvalues
0.2
0.4
0.6
0.8
1.0
pvalues
Figure A.2: Sample permutations for core probesets. Plots representing the
complete set (eight) of pairwise swaps between cases (Ti ) and controls (Ci ). Only
core probesets are considered here. Plots are ordered from the top by the value of π0 .
Dashed red line represents a uniform distribution; dashed blue lines represent the value
of π0 .
123
A.2.2
Full Probesets
(T1,T2,T3,C4) vs. (C1,C2,C3,T4)
pi_0 = 0.8618
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
pvalues
pvalues
(T1,C2,T3,T4) vs. (C1,T2,C3,C4)
pi_0 = 0.8981
(T1,T2,C3,T4) vs. (C1,C2,T3,C4)
pi_0 = 0.9732
1.0
0.0
0.4
density
0.8
0.4
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
(T1,T2,C3,C4) vs. (C1,C2,T3,T4)
pi_0 = 0.9839
(T1,C2,C3,C4) vs. (C1,T2,T3,T4)
pi_0 = 1.0
0.0
0.4
density
0.4
0.0
1.0
0.8
pvalues
0.8
pvalues
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
pvalues
pvalues
(T1,C2,C3,T4) vs. (C1,T2,T3,C4)
pi_0 = 1.0
(T1,C2,T3,C4) vs. (C1,T2,C3,T4)
pi_0 = 1.0
1.0
0.0
0.0
0.4
density
0.8
0.8
0.0
0.4
density
0.0
density
0.8
0.0
0.2
0.8
0.0
density
0.4
density
0.8
0.0
0.4
density
1.2
1.2
(T1,T2,T3,T4) vs. (C1,C2,C3,C4)
pi_0 = 0.7963
0.0
0.2
0.4
0.6
0.8
1.0
0.0
pvalues
0.2
0.4
0.6
0.8
1.0
pvalues
Figure A.3: Sample permutations for full probesets. Plots representing the
complete set (eight) of pairwise swaps between cases (Ti ) and controls (Ci ). Plots
are ordered from the top by the value of π0 . Dashed red line represents a uniform
distribution; dashed blue lines represent the value of π0 .
124
A.3
Additional Analyses
0.20
0.00
0.05
0.10
0.15
Density
0.20
0.15
0.00
0.05
0.10
Density
0.25
B
0.25
A
0.30
Splice Site Scores
0.30
A.3.1
0
5
10
15
0
Splice site score
5
10
15
Splice site score
Figure A.4: Distribution of splice site scores. Density plot of splice site scores
of A. 5’ splice sites and B. 3’ splice sites. The dashed line indicates a similar plot for
background exons (core exons).
A.3.2
Constitutive versus Alternative Splicing Analyses
We reasoned that exons differentially included would also be enriched for alternatively
spliced exons. However, we found no evidence of splicing bias for exons prone to alternative splicing. We tested this by constructing reference lists for constitutive and
alternative exons in lymphoblastoid (similar to the whole blood samples we used) tissue
[236] using the following procedure.
First, we aligned RNA-Seq data for 53 YRI samples to the hg19 build of the genome and
isolated all junction spanning reads which numbered about 62 million reads. We then
constructed a reference set of exons from the Ensembl gene model. Exons were defined at
the transcript level by associating each transcript ID with the exon number (e.g. ENST*:1
for the first exon in ENST*). This method produced a redundant list because many exons
overlap. Overlapping exons were eliminated using a randomisation procedure: from
125
every set of exons that shared at least one terminal coordinate (start/stop) one exon
was chosen at random.
For each internal exon we counted the number of reads that mapped to the upstream
(C1 : A) and downstream (A : C2 ) junction as well as those that mapped only to flanking
exons (skipping) (C1 : C2 ) [368]. We computed the percent-/proportion-spliced-in (PSI
or Ψ) using the expression
Ψ=
avg(C1 : A, A : C2 )
.
avg(C1 : A, A : C2 ) + C1 : C2
An exon was considered constitutive if Ψ − 2 · SE(Ψ) > 0.5 and alternative if Ψ +
2 · SE(Ψ) < 0.5. In this way we identified 19,685 constitutive and 3,078 alternative
exons. From this list of constitutive and alternative exons we tallied those present in
higher/lower inclusion in cases and compared them to similar counts in the background
set of exons using Fisher’s exact test. For exons included at higher levels in cases the
odds ratio was 1.57 (p = 0.45) while those included at lower levels had OR = 3.59
(p = 0.27) suggesting no splicing-specific influence.
A.3.3
U12-type Introns
We did not find any distinction in probeset intensities based on intron type. U2-type
introns form the majority of introns and Fig. A.6D adequately represents them. U12type introns had a less even distribution (Supplementary Fig.A.5) likely due to few
probesets associated with them. No significant bias in inclusion levels was observed. We
also compared U2-type and U12-type introns to each other based on their t-statistics.
We found no significant difference in the t-statistics (p = 0.6268, 95% CI of difference of
means: [−0.27, 0.17]) suggesting similar distributions for both.
126
0.6
0.4
0.0
0.2
No. of probesets
0.8
1.0
U12−type intronic probesets
Figure A.5: Proportion of probesets indicating higher/lower inclusion in
cases relative to controls for binned p-value thresholds for U12-type introns.
Each column represents all probesets (exons) for that bin. Red represent probesets
included at higher levels in cases; blue are those included at lower levels in cases.
127
Differential Inclusion by Probeset Class
B
●
●
●
3000
●
up
down
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
0.0
●
●
●
●
●
●
●
●
●
●
0.2
0.4
0.6
0.8
1.0
0.0
0.2
●
●
●
●
0.4
p−values
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
0.6
0.8
1.0
p−values
C
D
2500
●
●
●
●
up
down
●
No. of probesets
●
●
●
●
1500
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
up
down
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
0
0
500
●
●
200 400 600 800
●
No. of probesets
up
down
0
1000
2000
●
●
●
●
●
0
No. of probesets
●
No. of probesets
4000
A
3000
A.4.1
Number of Differentially Included Probesets
1000
A.4
0.0
0.2
0.4
0.6
0.8
1.0
0.0
p−values
0.2
0.4
0.6
0.8
1.0
p−values
Figure A.6: Illustration of differential inclusion for different classes of probesets. A. Core, B. full, C. exonic, and D. intronic probesets. Each point represents the
number of probesets presenting higher (‘up’)/lower (‘down’) inclusion in cases relative
to controls within ∆p = 0.05 p-value bins.
128
A.4.2
Core Probesets
p threshold
down
up
equal
p-value
FDR
10−87
1.61 × 10−85
0.01
473
1287
0
7.64 ×
0.05
2417
3963
0
3.61 × 10−84
3.79 × 10−83
0.1
3008
3760
0
6.36 × 10−20
4.45 × 10−19
0.15
2801
3230
0
3.52 × 10−8
1.85 × 10−7
0.2
2625
2913
0
1.2 × 10−4
4.8 × 10−4
0.25
2446
2557
0
0.12
0.36
0.3
2253
2362
0
0.11
0.36
0.35
2190
2271
0
0.23
0.58
0.4
2073
2122
0
0.46
0.81
0.45
1975
1959
0
0.81
0.99
0.5
2003
1956
0
0.46
0.81
0.55
1881
1856
0
0.69
0.99
0.6
1857
1848
0
0.90
0.99
0.65
1784
1761
0
0.71
0.99
0.7
1774
1779
0
0.95
0.99
0.75
1764
1713
0
0.40
0.81
0.8
1710
1642
0
0.25
0.58
0.85
1646
1637
0
0.89
0.99
0.9
1643
1617
0
0.66
0.99
0.95
1688
1685
0
0.97
0.99
1
1631
1629
207
0.99
0.99
Table A.1: Number of core probesets in each bin of width ∆p = 0.05. Data
used to produce the plots in Fig. 4.2 and Fig. A.6.
129
A.4.3
Full Probesets
p threshold
down
up
equal
p-value
FDR
10−7
3.054 × 10−6
0.01
1147
1400
0
5.82 ×
0.05
3825
4370
0
1.84 × 10−9
3.86 × 10−8
0.1
4028
4524
0
8.60 × 10−8
6.02 × 10−7
0.15
3534
4048
0
3.78 × 10−9
3.40 × 10−8
0.2
3275
3622
0
3.09 × 10−5
0.00013
10−5
0.00017
4.88 ×
0.25
3040
3366
0
0.3
2991
3129
0
0.080
0.1291
0.35
2685
2913
0
0.0024
0.0063
0.4
2642
2887
0
0.0010
0.0031
0.45
2470
2650
0
0.012
0.026
0.5
2399
2548
0
0.035
0.062
0.55
2391
2590
0
0.0050
0.012
0.6
2408
2378
0
0.68
0.68
0.65
2230
2306
0
0.27
0.35
0.7
2237
2352
0
0.093
0.14
0.75
2263
2337
0
0.28
0.35
0.8
2123
2155
0
0.64
0.68
0.85
2058
2150
0
0.16
0.23
0.9
2089
2143
0
0.42
0.48
0.95
2059
2032
0
0.68
0.68
1
2194
2052
25057
0.030
0.058
Table A.2: Number of full probesets in each bin of width ∆p = 0.05
130
A.4.4
Exonic Probesets
p threshold
down
up
equal
p-value
FDR
10−40
4.59 × 10−40
0.01
347
790
0
2.95 ×
0.05
1809
2862
0
6.61 × 10−54
2.059 × 10−53
0.1
2139
2769
0
2.46 × 10−19
2.55 × 10−19
0.15
1985
2338
0
8.48 × 10−8
5.29 × 10−8
0.2
1771
2160
0
5.89 × 10−10
4.59 × 10−10
0.25
1672
1872
0
0.00083
0.00037
0.3
1590
1827
0
5.37 × 10−5
2.79 × 10−5
0.35
1427
1607
0
0.0012
0.00045
0.4
1392
1558
0
0.0024
0.00082
0.45
1323
1432
0
0.040
0.010
0.5
1340
1379
0
0.47
0.085
0.55
1212
1342
0
0.011
0.0033
0.6
1243
1311
0
0.19
0.038
0.65
1190
1308
0
0.019
0.0055
0.7
1229
1266
0
0.47
0.085
0.75
1176
1186
0
0.85
0.13
0.8
1072
1136
0
0.18
0.038
0.85
1138
1157
0
0.71
0.11
0.9
1170
1136
0
0.49
0.085
0.95
1072
1098
0
0.59
0.097
1
1202
1110
649
0.058
0.014
Table A.3: Number of exonic probesets in each bin of width ∆p = 0.05
131
A.4.5
Intronic Probesets
p threshold
down
up
equal
p-value
FDR
0.01
280
304
0
0.34
0.52
0.05
914
879
0
0.42
0.52
0.1
1014
919
0
0.033
0.17
0.15
800
887
0
0.036
0.17
0.2
743
673
0
0.067
0.24
0.25
731
732
0
1
0.85
0.3
710
665
0
0.24
0.52
0.35
602
639
0
0.31
0.52
0.4
628
624
0
0.93
0.83
0.45
529
617
0
0.010
0.17
0.5
564
574
0
0.79
0.78
0.55
532
580
0
0.16
0.40
0.6
536
561
0
0.47
0.52
0.65
567
538
0
0.40
0.52
0.7
541
474
0
0.038
0.17
0.75
517
543
0
0.44
0.52
0.8
483
519
0
0.27
0.52
0.85
482
468
0
0.67
0.71
0.9
469
496
0
0.40
0.52
0.95
458
453
0
0.89
0.83
1
513
459
4235
0.089
0.26
Table A.4: Number of intronic probesets in each bin of width ∆p = 0.05
132
A.4.6
U12-type Intronic Probesets
p threshold
down
up
equal
p-value
FDR
0.01
1
2
0
1
1
0.05
9
9
0
1
1
0.1
8
7
0
1
1
0.15
9
10
0
1
1
0.2
6
4
0
0.75
1
0.25
3
6
0
0.51
1
0.3
3
6
0
0.51
1
0.35
10
4
0
0.18
1
0.4
5
6
0
1
1
0.45
6
8
0
0.79
1
0.5
7
4
0
0.55
1
0.55
6
9
0
0.61
1
0.6
3
6
0
0.51
1
0.65
6
5
0
1
1
0.7
3
3
0
1
1
0.75
6
3
0
0.51
1
0.8
5
0
0
0.063
1
0.85
3
5
0
0.73
1
0.9
3
8
0
0.23
1
0.95
5
2
0
0.45
1
1
2
7
40
0.18
1
Table A.5: Number of U12-type intronic probesets in each bin of width
∆p = 0.05
A.5
Transcript Cluster Identifiers (Metaprobesets) Associated with PRPF8
PRPF8 is located on the reverse strand of chromosome 17 between base 1553923 and
1588154 (hg19 and EnsEMBL v.66). The following table shows corresponding metaprobesets.
133
ID
chr
strand
start
end
3705896
chr17
+
1577504
1577972
3705894
chr17
+
1574104
1574222
3705898
chr17
+
1581163
1581608
3740589
chr17
-
1574051
1574859
3740625
chr17
-
1583120
1583533
3740649
chr17
-
1587369
1587620
Table A.6: Metaprobesets along PRPF8 . The coordinates of transcript cluster
IDs along the PRPF8 gene.
Only 3740589 and 3740625 passed quality control filtering but were not differentially
expressed between cases and controls.
A.6
Associated Genes
Lists of genes with higher/lower inclusion are available upon request.
A.7
Associated Exons
The list of exons analysed represent a one-to-one mapping of probesets to exons (see
Methods in the main text). The set of exons represent a random choice of all possible
exons that overlap with the exon array probeset. The number after the colon (‘:’)
represents the exon number with the transcript isoform.
ENST00000217800:7 ENST00000217909:4
ENST00000005340:4 ENST00000011700:2
ENST00000219070:4 ENST00000221973:2
ENST00000013125:29 ENST00000025301:2 ENST00000223642:40 ENST00000225308:3
ENST00000082259:9 ENST00000083182:7
ENST00000230568:2 ENST00000230901:15
ENST00000200135:5 ENST00000202017:4
ENST00000238831:3 ENST00000242057:7
ENST00000216115:4 ENST00000216492:6
ENST00000243583:13 ENST00000243662:2
134
ENST00000243903:7 ENST00000244426:9
ENST00000278568:10 ENST00000278612:15
ENST00000244769:6 ENST00000246314:2
ENST00000278616:31 ENST00000280333:45
ENST00000247400:4 ENST00000250572:6
ENST00000280612:3 ENST00000281513:18
ENST00000251108:3 ENST00000251900:7
ENST00000281772:9 ENST00000282007:8
ENST00000254821:3 ENST00000256117:6
ENST00000283351:6 ENST00000283684:15
ENST00000261755:9 ENST00000261867:6
ENST00000300417:15 ENST00000300650:2
ENST00000262134:5 ENST00000262262:2
ENST00000301336:5 ENST00000301843:9
ENST00000262352:8 ENST00000262455:2
ENST00000303372:8 ENST00000303375:20
ENST00000268130:5 ENST00000269216:5
ENST00000321169:12 ENST00000321322:4
ENST00000269445:4 ENST00000269703:4
ENST00000321370:23 ENST00000321990:18
135
ENST00000326080:6 ENST00000326080:8
ENST00000357482:6 ENST00000357563:6
ENST00000326682:2 ENST00000327087:3
ENST00000357602:10 ENST00000357604:10
ENST00000330236:2 ENST00000331768:4
ENST00000358025:94 ENST00000358032:5
ENST00000333577:2 ENST00000333807:3
ENST00000358136:3 ENST00000358499:6
ENST00000334186:8 ENST00000334318:2
ENST00000358544:23 ENST00000358552:8
ENST00000334859:3 ENST00000335387:3
ENST00000358951:20 ENST00000359142:18
ENST00000336735:3 ENST00000337006:8
ENST00000359785:7 ENST00000360222:15
ENST00000337318:3 ENST00000337679:3
ENST00000360385:2 ENST00000360822:17
ENST00000338608:6 ENST00000339497:5
ENST00000361078:14 ENST00000361246:8
ENST00000342905:6 ENST00000342981:7
ENST00000366920:22 ENST00000366922:15
ENST00000343439:8 ENST00000344492:4
ENST00000366976:4 ENST00000367079:8
ENST00000346432:5 ENST00000348141:3
ENST00000367506:13 ENST00000367585:2
ENST00000348335:4 ENST00000348616:5
ENST00000367591:9 ENST00000367996:4
ENST00000348925:2 ENST00000348959:9
ENST00000368033:5 ENST00000368132:6
ENST00000349456:2 ENST00000349995:7
ENST00000368305:2 ENST00000368331:7
ENST00000354185:3 ENST00000354241:3
ENST00000369294:3 ENST00000369556:2
ENST00000354417:3 ENST00000354558:5
ENST00000369983:8 ENST00000370343:12
ENST00000355221:2 ENST00000355674:5
ENST00000370513:3 ENST00000370631:7
ENST00000355824:4 ENST00000355867:7
ENST00000371184:4 ENST00000371189:5
ENST00000357103:2 ENST00000357321:2
ENST00000372899:36 ENST00000372901:44
136
ENST00000373558:2 ENST00000373584:3
ENST00000392529:7 ENST00000392614:5
ENST00000373620:4 ENST00000373750:5
ENST00000392677:3 ENST00000392717:8
ENST00000374879:9 ENST00000375165:4
ENST00000394822:9 ENST00000394886:8
ENST00000375221:15 ENST00000375226:101ENST00000395125:20 ENST00000395211:6
ENST00000376485:2 ENST00000376704:3
ENST00000395898:10 ENST00000395911:7
ENST00000377223:4 ENST00000377540:6
ENST00000396328:15 ENST00000396658:2
ENST00000378128:6 ENST00000378164:6
ENST00000396865:11 ENST00000397227:3
ENST00000378601:2 ENST00000378616:2
ENST00000397311:9 ENST00000397736:6
ENST00000378643:5 ENST00000378697:6
ENST00000397881:2 ENST00000398129:10
ENST00000378715:7 ENST00000378722:4
ENST00000398208:3 ENST00000398341:11
ENST00000379120:9 ENST00000379374:8
ENST00000398569:44 ENST00000398835:5
ENST00000379965:6 ENST00000380030:7
ENST00000399918:3 ENST00000400105:15
ENST00000380649:4 ENST00000380671:7
ENST00000401475:11 ENST00000401979:14
ENST00000381340:6 ENST00000381938:5
ENST00000404251:10 ENST00000404251:7
ENST00000382058:3 ENST00000382850:8
ENST00000405383:4 ENST00000405570:12
ENST00000382891:3 ENST00000383744:4
ENST00000406875:3 ENST00000409311:6
ENST00000392257:4 ENST00000392257:6
ENST00000415485:19 ENST00000415511:2
137
ENST00000415570:7 ENST00000415740:4
ENST00000440591:5 ENST00000441077:2
ENST00000415849:3 ENST00000415948:5
ENST00000441946:7 ENST00000442506:21
ENST00000416445:9 ENST00000417408:8
ENST00000442913:7 ENST00000443236:19
ENST00000419091:6 ENST00000419286:2
ENST00000444189:2 ENST00000444746:27
ENST00000420166:3 ENST00000420632:9
ENST00000445992:3 ENST00000446066:9
ENST00000421748:6 ENST00000421794:4
ENST00000447540:24 ENST00000447611:13
ENST00000422431:7 ENST00000423109:2
ENST00000448504:3 ENST00000449213:3
ENST00000424254:2 ENST00000424848:6
ENST00000450444:7 ENST00000450465:4
ENST00000426203:2 ENST00000426496:9
ENST00000452508:22 ENST00000452837:7
ENST00000432185:5 ENST00000432910:4
ENST00000460495:8 ENST00000460658:2
ENST00000433067:5 ENST00000433336:7
ENST00000461319:2 ENST00000461696:2
ENST00000433384:4 ENST00000433477:5
ENST00000461774:3 ENST00000463046:9
ENST00000434949:2 ENST00000435252:2
ENST00000464941:7 ENST00000464952:2
ENST00000435944:2 ENST00000435957:2
ENST00000465020:4 ENST00000465445:8
ENST00000437205:2 ENST00000438053:2
ENST00000465942:5 ENST00000466192:2
138
ENST00000467991:3 ENST00000468222:5
ENST00000494090:26 ENST00000495044:5
ENST00000470890:2 ENST00000470902:4
ENST00000497558:5 ENST00000497607:3
ENST00000471069:4 ENST00000471302:2
ENST00000498012:3 ENST00000498077:19
ENST00000471652:7 ENST00000471858:4
ENST00000498144:6 ENST00000498211:3
ENST00000472232:2 ENST00000472414:5
ENST00000498611:22 ENST00000498699:10
ENST00000472428:2 ENST00000472433:2
ENST00000503230:10 ENST00000503868:3
ENST00000473642:2 ENST00000473691:4
ENST00000504202:4 ENST00000504584:2
ENST00000473795:3 ENST00000474225:3
ENST00000504706:2 ENST00000504872:4
ENST00000474314:4 ENST00000474606:5
ENST00000506202:15 ENST00000506733:20
ENST00000474859:4 ENST00000475265:3
ENST00000506799:5 ENST00000506816:4
ENST00000475443:2 ENST00000475560:6
ENST00000506828:2 ENST00000507184:28
ENST00000476978:2 ENST00000477308:3
ENST00000508466:10 ENST00000508895:5
ENST00000478407:2 ENST00000478412:3
ENST00000509667:2 ENST00000509732:2
ENST00000478588:4 ENST00000478777:7
ENST00000509768:6 ENST00000510377:6
ENST00000480606:2 ENST00000480606:4
ENST00000512575:2 ENST00000512636:8
ENST00000480623:5 ENST00000480886:3
ENST00000513031:2 ENST00000513079:4
ENST00000481266:6 ENST00000481518:6
ENST00000513245:4 ENST00000513695:4
ENST00000481751:2 ENST00000481965:3
ENST00000513837:8 ENST00000514141:8
ENST00000482755:2 ENST00000482953:4
ENST00000517339:24 ENST00000518211:4
ENST00000483398:4 ENST00000483593:5
ENST00000519449:4 ENST00000520113:11
ENST00000484220:2 ENST00000485566:4
ENST00000522052:3 ENST00000522072:9
ENST00000487556:5 ENST00000487883:2
ENST00000523582:17 ENST00000523756:15
ENST00000490566:5 ENST00000490873:6
ENST00000526995:5 ENST00000526995:6
139
ENST00000527591:9 ENST00000528744:2
ENST00000543127:5 ENST00000543192:11
ENST00000529244:5 ENST00000529588:4
ENST00000543523:16 ENST00000543645:9
ENST00000530044:2 ENST00000530704:2
ENST00000543949:10 ENST00000544784:7
ENST00000532259:5 ENST00000532502:2
ENST00000545132:6 ENST00000546208:10
ENST00000532742:2 ENST00000533349:4
ENST00000546893:8 ENST00000547043:7
ENST00000534548:9 ENST00000534738:5
ENST00000547540:2 ENST00000548465:2
ENST00000535106:7 ENST00000535117:4
ENST00000549851:7 ENST00000549944:7
ENST00000536649:5 ENST00000536681:8
ENST00000552738:23 ENST00000552994:23
ENST00000539368:8 ENST00000539476:8
ENST00000557018:4 ENST00000557332:9
ENST00000539933:3 ENST00000540043:8
ENST00000560792:5 ENST00000561552:4
ENST00000540216:7 ENST00000540246:3
ENST00000562768:4 ENST00000563197:2
ENST00000540823:7 ENST00000540905:5
ENST00000565267:11 ENST00000565435:3
ENST00000541133:5 ENST00000541188:8
ENST00000565488:4 ENST00000565502:3
ENST00000542094:5 ENST00000542289:2
ENST00000569057:4
ENST00000542339:4 ENST00000542445:9
140
A.7.1
Lower Inclusion in Cases
ENST00000218436:8 ENST00000221856:3
ENST00000343516:15 ENST00000344799:7
ENST00000226432:6 ENST00000228916:7
ENST00000349874:9 ENST00000350221:8
ENST00000253110:5 ENST00000255174:3
ENST00000358552:3 ENST00000358866:2
ENST00000257198:4 ENST00000257290:9
ENST00000360003:15 ENST00000360064:20
ENST00000261292:9 ENST00000261707:9
ENST00000361731:7 ENST00000361924:17
ENST00000264345:2 ENST00000264705:2
ENST00000367492:100 ENST00000367492:49
ENST00000269554:8 ENST00000270225:5
ENST00000368169:4 ENST00000368416:3
ENST00000281172:6 ENST00000286332:5
ENST00000370336:11 ENST00000370449:10
ENST00000291547:6 ENST00000292246:5
ENST00000371494:6 ENST00000371923:6
ENST00000295294:2 ENST00000295408:5
ENST00000375180:3 ENST00000375755:5
ENST00000296932:4 ENST00000297404:2
ENST00000379922:10 ENST00000380550:20
ENST00000298527:2 ENST00000301336:4
ENST00000380968:4 ENST00000380993:22
ENST00000302174:7 ENST00000306317:7
ENST00000381480:9 ENST00000381913:14
ENST00000306821:6 ENST00000307092:4
ENST00000382040:4 ENST00000382233:8
ENST00000309930:7 ENST00000310018:2
ENST00000383702:12 ENST00000383807:9
ENST00000312584:2 ENST00000312655:2
ENST00000388781:13 ENST00000389374:10
141
ENST00000395834:3 ENST00000395896:8
ENST00000483720:2 ENST00000484327:12
ENST00000396212:5 ENST00000396231:2
ENST00000488991:5 ENST00000489796:2
ENST00000397544:5 ENST00000397843:3
ENST00000490550:3 ENST00000491910:3
ENST00000397937:2 ENST00000398052:5
ENST00000492039:4 ENST00000492917:4
ENST00000398835:4 ENST00000399039:3
ENST00000493649:4 ENST00000493991:5
ENST00000414624:3 ENST00000415253:2
ENST00000522006:6 ENST00000522335:4
ENST00000416757:2 ENST00000417700:2
ENST00000522795:2 ENST00000524842:4
ENST00000421333:6 ENST00000422516:7
ENST00000532494:3 ENST00000533706:20
ENST00000435834:3 ENST00000437283:3
ENST00000539100:2 ENST00000539625:2
ENST00000440705:3 ENST00000443968:9
ENST00000542445:22 ENST00000542835:4
ENST00000453408:9 ENST00000454345:2
ENST00000548670:2 ENST00000550053:10
ENST00000454536:4 ENST00000457316:4
ENST00000550286:12 ENST00000552994:32
ENST00000460663:6 ENST00000461039:3
ENST00000553532:5 ENST00000554017:9
ENST00000470360:9 ENST00000470597:4
ENST00000558879:2 ENST00000559982:8
ENST00000480648:7 ENST00000483465:11
142
Appendix B
Supplementary results for
Chapter 5
B.1
Effect of training set size
a
b
c
Figure B.1: The effect of training size on predictions. A test set of 100 samples
was used for all training sample sizes.
143
B.2
OOB filtering
Figure B.2: Out-of-bag (OOB) filtering. Scatter plot of cross-sample correlation
coefficients from on OOB gene expression estimates and estimates obtained on test data
for 1,000 randomly selected genes.
144
B.3
Comparison of DGE results
a
b
c
d
Figure B.3: Comparison of differential expression results. Comparison of the
performance of MaLTE and median-polish on the problem of detecting differential gene
expression between five heart-muscle and five skeletal-muscle samples in the GTEx
dataset. Each method results in a list of genes, ranked by the q-value from the comparison of the gene expression level in the two groups of samples. We used the (a)
cumulative Jaccard index and (b) concordance correlation to compare the similarity as
a function of rank and the overall similarity, respectively, between lists of genes ranked
by MaLTE/median-polish and by RNA-seq. The set of genes assessed were defined
by RNA-Seq: genes with RPKM of above one in both tissues. Bootstrap re-sampling
(100 pseudo-replicates) was used to assess the effect of sampling error for both cumulative Jaccard index and concordance correlations. Results of bootstrapping are shown
as faint lines around the observed Jaccard index and yield the density plots shown in
(b). Log of fold change estimates for (c) 120 differentially expressed genes common to
MaLTE and median-polish and (d) MaLTE and PLIER for six common genes.
145
a
b
Figure B.4: Comparison of 10-sample DGE to 44-sample DGE. (a) Selfcomparisons (e.g. MaLTE 10-sample DGE to MaLTE 44-sample DGE, with five and
22 samples in each tissue, respectively). Forty-four-sample DGE results are treated
as putative true differential expressions at the indicated (top right or each plot) false
discover rates (FDRs). FDRs were computed using the q-value method [256]. Comparisons are made using the Jaccard index. (b) Comparison of each method to 44-sample
DGE using RNA-Seq.
146
B.4
Cross-sample correlations for transcript isoform expression estimates
a
b
Figure B.5: Transcript isoform expression estimates. Densities of (a) Pearson
and (b) Spearman cross-sample correlations for transcript isoform expression estimates
obtained using MaLTE. Filtered data corresponds to transcript isoforms with rOOB > 0.
147
B.5
Slopes of regression between RNA-Seq and array methods assessed for Brain data
β
β
1.5
1.0
●
●
0.5
●
●
0.0
●
MaLTE
median−polish
PLIER
Figure B.6: Box plots of slopes β computed by linearly regressing RNASeq against array method. Dotted red line indicates unit gradient. Each box plot
represents 12 slopes for brain samples. RNA-Seq expression is restricted to RPKM
between one and 1000 because of high uncertainty at low values and few genes at high
values representing 78% of genes quantified. Using all genes results in the same medians
but wider variance in slopes due to outlying genes.
148
a
b
Prediction of tissue-mixture proportions
c
Figure B.7: Estimation of tissue mixture proportions. (a) MaLTE, (b) median-polish (RMA), and (c) PLIER. Each plot shows comparisons
of true and estimated proportions with key statistics indicated in the legends.
B.6
149
Appendix C
Use Case: MaLTE Trained with
GTEx Data Applied on Exon
Array
C.1
Introduction
Using MaLTE with the GTEx training dataset consists of three main steps illustrated in
Figure C.1.
• Obtaining the package and training data
• Preparing the data for training-and-testing
• Predicting, filtering and collating expression estimates
151
MaLTE Use-Case
OBTAIN TRAINING AND TEST DATA
Github
Google Drive
GEO
GTEx training RNA-Seq
(pre-processed)
GTEx training microarray
RP test microarray
MaLTE-aux
MaLTE
MaLTE-doc
create_samples_template.py
Microarray training
and test probe data
RNA-Seq training data
PREPARE DATA FOR TRAINING-AND-TEST
samples.txt
transform_microarrays.py
gene-probesets
prepare.data(...)
LIST
...
tt.ready
array2seq.oob(...)
LIST
LIST
PREDICT, FILTER, COLLATE
array2seq(...)
...
...
tt.seq
tt.seq.oob
oob.filter(...)
LIST
...
tt.filtered
get.predictions(...)
Legend
Python script
RP_malte.txt
MaLTE function
TT.Ready object
TT.Seq object
Text file
Figure C.1: Schematic representing a MaLTE use-case. The four text files (dark
green) are the main input to MaLTE. These files are obtained by processing the training
data using the auxiliary scripts.
152
MaLTE Use-Case
In the first step, the user needs to download a set of auxiliary data and scripts before
installing the MaLTE R package. The training (and test) data may then be downloaded
after which several preprocessing steps may be carried out depending on the type of
array involved. For example, we will need to transform the exon array to a gene array.
The transformed array data is then quantile normalized to reduce batch effects.
The second and third steps are relatively straightforward and employ only MaLTE functions as outlined from Sections C.6.
C.2
Obtaining Auxiliary Data and Scripts
The use-case presented here requires several auxiliary data and scripts to convert the
exon array to a gene array. These are all contained in the MaLTE-aux Github repository
which may be directly downloaded from https://github.com/polarise/MaLTE-aux.
Click the ‘Download’ button to get the whole directory as a zip file.
unzip MaLTE-aux-master.zip
cd MaLTE-aux
Alternatively, it is much faster to use git as follows:
git clone https://github.com/polarise/MaLTE-aux.git
cd MaLTE-aux
This directory contains pre-compiled resource files described in Table C.1. We strongly
recommend that all analysis be carried out in this directory.
The script create samples template.py compiles the resource file samples.txt by choosing
a random subset of samples (the user specifies the number) to be used for training
specified in training samples.txt. More information on how it works can be obtained by
running
./create_samples_template.py --help
The script transform microarrays.py converts one array to another using several resource
files provided in MaLTE-aux. A detailed specification of how to use this script is available
in the file transform microarrays.help.
153
MaLTE Use-Case
File
create samples template.py
training samples.txt
transform microarrays.py
transform microarrays.help
Purpose
These script and input data files are
used to generate samples.txt.
The Python script is used to transform the exon array to a gene array. Some probes are omitted in the
process. The .help file gives the explicit use of the data files below.
HuEx-1 0-st-v2.r2.dt1.hg18.full.mps.gz
HuEx-1 0-st-v2.r2.pgf.txt.gz
HuExVsHuGene BestMatch.txt.gz
HuGene-1 1-st-v1.r4.mps.gz
HuGene-1 1-st-v1.r4.pgf.txt.gz
gene probesets HuGene Ens72.txt.gz
transcript probesets HuGene Ens72.txt.gz
Data files required
form microarrays.py
by
trans-
Input data files required to run the
MaLTE function prepare.data()
Table C.1: Contents of MaLTE-aux and the functions they provide
C.3
Working Directory Structure
As all analysis will be carried out in MaLTE-aux directory, we need to create two directories for the training and test array CEL files.
mkdir GTEx_CEL
mkdir RP_CEL
C.4
Obtaining MaLTE
MaLTE may be downloaded as a pre-built or source package. Pre-built versions can be
found at https://github.com/polarise/MaLTE-packages. The latest version should
be downloaded (files are not sorted by version), ensuring that all prerequisites (see
the manual available http://bioinf.nuigalway.ie/MaLTE/malte.html) are installed
before installing MaLTE.
R CMD INSTALL MaLTE_<version>.tar.gz
The source package requires an installation of git and is obtained as followed:
git clone https://github.com/polarise/MaLTE.git
154
MaLTE Use-Case
then built with
R CMD build MaLTE
and installed using R CMD INSTALL as shown above.
C.4.1
Creating the file samples.txt
The samples.txt file is created by running
./create_samples_template.py
The names of the test samples should be appended at the bottom of this file using a
text editor. For example, to add a test sample called Test01.CEL add the following line
to samples.txt:
*NA<tab>Test01.CEL
The asterisk (‘*’) marks this sample as a test sample and ‘NA’ indicates that RNA-Seq
data is not available. Repeat this for as many samples as are present in the test data.
C.4.2
Obtaining the training data
GTEx RNA-Seq data. Gene expression levels estimated from RNA-Seq data may be
downloaded from http://www.broadinstitute.org/gtex. The downloaded data needs
to be processed to arrive at the training state by excluding extraneous information and
converting gene annotation to Ensembl (from GENCODE) then quantile-normalizing
only those genes with probe sets. Pre-processed data may be downloaded from https:
//drive.google.com/file/d/0BxLgaMV5aZahVDNjM29WeGh2NHM/edit?usp=sharing.
GTEx gene array data. CEL files for the GTEx microarray data are available as
a single zipped archive which may be downloaded from GEO archive GSE45878 into
the directory GTEx CEL. Background-corrected probes should be extracted as follows
using the Affymetrix Power Tools function apt-cel-extract, which depends on the
R
Affymetrix GeneChip
Human Gene 1.1 ST Array library files available at http://www.
affymetrix.com/estore/catalog/prod350003/AFFY/Human-Gene-ST-Array-Strips#
1_3. Select the ‘Technical Documentation’ tab to get a link to the library files.
155
MaLTE Use-Case
apt-cel-extract -o GTEx_probe_intensities.txt \
-c /path/to/HuGene-1_1-st-v1.r4.clf \
-p /path/to/HuGene-1_1-st-v1.r4.pgf \
-b /path/to/HuGene-1_1-st-v1.r4.bgp \
-a pm-gcbg GTEx_CEL/*.CEL
It may be necessary to extract background probes file using apt-dump-pgf.
apt-dump-pgf \
-p /path/to/HuGene-1_1-st-v1.r4.pgf \
-c /path/to/HuGene-1_1-st-v1.r4.clf \
--probeset-type antigenomic \
-o HuGene-1_1-st-v1.r4.bgp
Exon array data. We outline this procedure using a previously uploaded dataset
available under GEO accession GSE43134 into the directory RP CEL. The Affymetrix
R
GeneChip
Human Exon 1.0 ST Array library files are available from http://www.
affymetrix.com/estore/catalog/131452/AFFY/Human-Exon-ST-Array. Similarly, backgroundcorrected probe fluorescence intensities should be extracted as follows:
apt-cel-extract \
-o RP_probe_intensities.txt \
-c /path/to/HuEx-1_0-st-v2.r2.clf \
-p /path/to/HuEx-1_0-st-v2.r2.pgf \
-b /path/to/HuEx-1_0-st-v2.r2.antigenomic.bgp \
-a pm-gcbg RP_CEL/*.CEL
C.5
Transforming exon array probes to gene array probes
We now run transform microarrays.py using the following template
./transform_microarrays.py \
-e RP_probe_intensities.txt \
-g GTEx_probe_intensities.txt \
-c HuExVsHuGene_BestMatch.txt.gz \
-i HuGene-1_1-st-v1.r4.pgf.txt.gz \
156
MaLTE Use-Case
-f HuEx-1_0-st-v2.r2.pgf.txt.gz \
-p HuEx-1_0-st-v2.r2.dt1.hg18.full.mps.gz \
-q HuGene-1_1-st-v1.r4.mps.gz \
--huex-out BestMatch_HuEx_probe_intensities.txt.gz \
--huge-out BestMatch_HuGe_probe_intensities.txt.gz
Once the modified probe intensity files have been created they need to be quantilenormalized together with the training data as follows:
> library( limma )
> # BESTMATCH DATA
> huex.best <- read.table( "BestMatch_HuEx_probe_intensities.txt.gz",
header=TRUE,
stringsAsFactors=FALSE,
check.names=FALSE)
> huge.best <- read.table( "BestMatch_HuGe_probe_intensities.txt.gz",
header=TRUE,
stringsAsFactors=FALSE,
check.names=FALSE)
> # all raw data
> data <- cbind( huge.best[,8:ncol( huge.best )],
huex.best[,8:ncol( huex.best )] )
> # quantile normalisation only
> qnorm.data <- normalizeQuantiles( data, ties=FALSE )
> # save the quantile-normalised data for later
> # (extracting principal components)
> save( qnorm.data, file="qnorm.data.Rdata" )
> # re-insert probe metadata
> hugeex.qnorm.data <- cbind( huge.best[,1:7], qnorm.data )
> # save to file
> write.table( hugeex.qnorm.data,
file="BestMatch_GTEx_RP_probe_intensities_QN.txt", col.names=TRUE,
row.names=FALSE,
quote=FALSE,
sep="\t" )
All training data needed is now available to use MaLTE.
157
MaLTE Use-Case
C.6
Preparing the data for training-and-testing
We can now use MaLTE to prepare the data for training and testing. First, we load the
package after launching R.
> library( MaLTE )
We use the prepare.data() function which takes as arguments the samples.txt,
GTEx subset gene rpkm QN.txt, BestMatch GTEx RP probe intensities QN.txt.gz, and
gene probesets HuGene Ens72.txt.gz (available in MaLTE-aux directory).
> prepare.data( samples.fn=‘samples.txt’,
hts.fn=‘GTEx_subset_gene_rpkm_QN.txt.gz,
ma.fn=‘BestMatch_GTEx_RP_probe_intensities_QN.txt,
g2p.fn=‘gene_probesets_HuGene_Ens72.txt.gz’, raw=TRUE )
We then read in the training and test data.
> tt.ready = read.data( train.fn=‘train_data.tar.gz’,
test.fn=‘test_data.tar.gz’ )
C.7
Prediction
Prediction can either be performed on the test data training data (out-of-bag, OOB).
We outline OOB shortly. First, we need to define training-and-test parameters.
> tt.params = TT.Params() # default parameters
This sets the following values:
> tt.params
mtry
= 2
ntree
= 1000
feature.select
= TRUE
min.probes
= 15
cor.thresh
= 0
OOB
= FALSE
QuantReg
= FALSE
Tune (OOB cor.P) = NA
158
MaLTE Use-Case
Predictions are performed using the array2seq() function.
> tt.seq = array2seq( tt.ready, tt.params )
OOB predictions are performed using the array2seq.oob() function.
> tt.seq.oob = array2seq.oob( tt.ready, tt.params )
C.8
Filtering by OOB
OOB filtering is performed using the oob.filter() function, which takes a tt.seq pair
(tt.seq and tt.seq.oob) and a filtering threshold together with the correlation to use
for filtering (Pearson (default) or Spearman),
> tt.filtered = oob.filter( tt.seq, tt.seq.oob, thresh=0.5, method=‘spearman’ )
C.9
Collating predicted gene expression values
Predicted values should be collated as a data frame using the get.predictions() function. First, we get the names of test samples using the get.names() or get.test()
functions.
> test.names = get.names( samples.fn=‘samples.txt’, test=TRUE )
> test.names = get.test( samples.fn=‘samples.txt’ )
The predictions can be quantile-normalized but this may be omitted.
> df.malte.qn = get.predictions( tt.filtered, sample.names=test.names,
qnorm=TRUE )
then saved to file
> write.table( df.malte, file=‘RP_malte_OOB_0p5_QN.txt’,
col.name=TRUE, row.names=FALSE, quote=FALSE, sep=‘\t’ )
159
MaLTE Use-Case
C.10
Experimental features
C.10.1
Per-gene/transcript tuning
Set tune.cor.P=TRUE in TT.Params()
> tt.params = TT.Params( tune.cor.P=TRUE )
C.10.2
Incorporating principal components
Principal components of probe intensities may be incorporated into the prediction. The
same principal component will be used for all genes/transcript therefore they must be
incorporated into the samples.txt file to be dispatched to all genes/transcripts. We have
experimented with the first 10 principal components.
> load( "dnorm.data.Rdata" ) # load qnorm.data from before
> pc.data = princomp( qnorm.data )
# save the first 10 principal components
> pcs = pc.data$loadings[,1:10]
> samples = read.table( samples.txt’, header=TRUE )
# create new samples data frame augmented with principal components
> samples.aug = cbind( samples, pcs )
> colnames( samples.aug ) = paste( P’, 1:10, sep=‘’ )
# create a new samples file
> write.table( samples.aug, file=‘samples_aug.txt’,
col.names=TRUE, row.names=FALSE, quote=FALSE, sep=‘\t’ )
We then run prepare.data() using samples aug.txt (instead of samples.txt as before)
setting the variable PCs=TRUE. This creates two files: train PCs.txt having the training
principal components and a corresponding test PCs.txt.
> prepare.data( samples.fn=samples.fn, hts.fn=hts.fn, ma.fn=ma.fn,
g2p.fn=g2p.fn, PCs=TRUE )
Also, we need to run read.data() setting PCs.present=TRUE,
train.PCs.fn=‘train PCs.txt’ and test.PCs.fn=‘test PCs.txt’.
160
MaLTE Use-Case
> tt.ready.pc <- read.data( train.fn="train_data.txt.gz",
test.fn="test_data.txt.gz", PCs.present=TRUE,
train.PCs.fn="train_PCs.txt", test.PCs.fn="test_PCs.txt"
The principal components are now embedded in the data for each gene.
161
)
Bibliography
[1] Sambrook J, 1977. Adenovirus amazes at Cold Spring Harbor. Nature 268(5616):101.
[2] Berget S, Moore C, and Sharp P, 1977. Spliced segments at the 5’terminus of adenovirus 2 late
mRNA. Proceedings of the National Academy of Sciences 74(8):3171.
[3] Lewis J, Atkins J, Anderson C, Baum P, and Gesteland R, 1975. Mapping of late adenovirus genes
by cell-free translation of RNA selected by hybridization to specific DNA fragments. Proceedings
of the National Academy of Sciences 72(4):1344–1348.
[4] Chow LT, Gelinas RE, Broker TR, and Roberts RJ, 1977. An amazing sequence arrangement at
the 5’ ends of adenovirus 2 messenger RNA. Cell 12(1):1–8.
[5] Gilbert W, 1978. Why genes in pieces? Nature 271(5645):501.
[6] Takashima Y, Ohtsuka T, González A, Miyachi H, and Kageyama R, 2011. Intronic delay is
essential for oscillatory expression in the segmentation clock. Proceedings of the National Academy
of Sciences 108(8):3300–3305.
[7] Brugiolo M, Herzel L, and Neugebauer KM, 2013.
Counting on co-transcriptional splicing.
F1000prime Reports 5.
[8] Rodrı́guez-Trelles F, Tarrı́o R, and Ayala FJ, 2006. Origins and evolution of spliceosomal introns.
Annu. Rev. Genet. 40:47–76.
[9] Rogozin IB, Carmel L, Csuros M, Koonin EV et al., 2012. Origin and evolution of spliceosomal
introns. Biol Direct 7(11).
[10] Roy SW and Gilbert W, 2006. The evolution of spliceosomal introns: patterns, puzzles and
progress. Nature Reviews Genetics 7(3):211–221.
[11] Cech TR, Zaug AJ, and Grabowski PJ, 1981. In vitro splicing of the ribosomal RNA precursor of
tetrahymena: Involvement of a guanosine nucleotide in the excision of the intervening sequence.
Cell 27(3):487–496.
[12] Clancy S, 2008. RNA splicing: introns, exons and spliceosome. Nature Education 1(1).
[13] Hedberg A and Johansen SD, 2013. Nuclear group I introns in self-splicing and beyond. Mobile
DNA 4(1):17.
[14] Haugen P, Simon DM, and Bhattacharya D, 2005. The natural history of group I introns. Trends
in Genetics 21(2):111–119.
163
Bibliography
[15] Saldanha R, Mohr G, Belfort M, and Lambowitz AM, 1993. Group i and group ii introns. The
FASEB Journal 7(1):15–24.
[16] Marcia M, Somarowthu S, and Pyle AM, 2013. Now on display: a gallery of group ii intron
structures at different stages of catalysis. Mobile DNA 4(1):1–14.
[17] Lambowitz AM and Zimmerly S, 2011. Group II introns: mobile ribozymes that invade DNA.
Cold Spring Harbor Perspectives in Biology 3(8).
[18] Keating KS, Toor N, Perlman PS, and Pyle AM, 2010. A structural analysis of the group ii intron
active site and implications for the spliceosome. RNA 16(1):1–9.
[19] Christopher DA and Hallick RB, 1989. Euglena gracilis chloroplast ribosomal protein operon: a
new chloroplast gene for ribosomal protein l5 and description of a novel organelle intron category
designated group iii. Nucleic Acids Research 17(19):7591–7608.
[20] Doetsch NA, 2000. Group III Intron Structure and Evolutionary Analysis in Euglenoid Chloroplast
Genomes. Ph.D. thesis, University of Arizona.
[21] Lykke-Andersen J, Aagaard C, Semionenkov M, and Garrett RA, 1997. Archaeal introns: splicing,
intercellular mobility and evolution. Trends in Biochemical Sciences 22(9):326–331.
[22] Csuros M, Rogozin IB, and Koonin EV, 2011. A detailed history of intron-rich eukaryotic ancestors inferred from a global survey of 100 complete genomes. PLoS Computational Biology
7(9):e1002150.
[23] Deutsch M and Long M, 1999. Intron-exon structures of eukaryotic model organisms. Nucleic
Acids Research 27(15):3219.
[24] Levine A and Durbin R, 2001. A computational scan for U12-dependent introns in the human
genome sequence. Nucleic Acids Research 29(19):4006–4013.
[25] Alioto T, 2007. U12DB: a database of orthologous U12-type spliceosomal introns. Nucleic Acids
Research 35(suppl 1):D110–D115.
[26] Taggart AJ, DeSimone AM, Shih JS, Filloux ME, and Fairbrother WG, 2012. Large-scale mapping
of branchpoints in human pre-mRNA transcripts in vivo. Nature Structural & Molecular Biology
19(7):719–721.
[27] Lim LP and Burge CB, 2001. A computational analysis of sequence features involved in recognition
of short introns. Proceedings of the National Academy of Sciences 98(20):11193–11198.
[28] Ast G, 2004. How did alternative splicing evolve? Nature Reviews Genetics 5(10):773–782.
[29] Berget SM, 1995.
Exon recognition in vertebrate splicing.
Journal of Biological Chemistry
270(6):2411–2414.
[30] Yandell M, Mungall CJ, Smith C, Prochnik S, Kaminker J, Hartzell G, Lewis S, and Rubin GM,
2006. Large-scale trends in the evolution of gene structures within 11 animal genomes. PLoS
Computational Biology 2(3):e15.
[31] Doolittle WF, 2013. The spliceosomal catalytic core arose in the RNA world or did it? Genome
Biology 14(12):141.
[32] Doolittle WF, 1978. Genes in pieces: were they ever together? Nature 272(5654):581–582.
164
Bibliography
[33] Williams TA, Foster PG, Cox CJ, and Embley TM, 2013. An archaeal origin of eukaryotes supports
only two primary domains of life. Nature 504(7479):231–236.
[34] Martin W and Koonin EV, 2006. Introns and the origin of nucleus–cytosol compartmentalization.
Nature 440(7080):41–45.
[35] Mattick JS, 1994. Introns: evolution and function. Current Opinion in Genetics & Development
4(6):823–831.
[36] Cech TR, 1986. The generality of self-splicing RNA: relationship to nuclear mRNA splicing. Cell
44(2):207–210.
[37] Fica SM, Tuttle N, Novak T, Li NS, Lu J, Koodathingal P, Dai Q, Staley JP, and Piccirilli JA,
2013. RNA catalyses nuclear pre-mRNA splicing. Nature 503(7475):229–234.
[38] Oberstrass FC, Auweter SD, Erat M, Hargous Y, Henning A, Wenter P, Reymond L, AmirAhmady B, Pitsch S, Black DL et al., 2005. Structure of PTB bound to RNA: specific binding
and implications for splicing regulation. Science 309(5743):2054–2057.
[39] Xue Y, Zhou Y, Wu T, Zhu T, Ji X, Kwon YS, Zhang C, Yeo G, Black DL, Sun H et al., 2009.
Genome-wide analysis of PTB-RNA interactions reveals a strategy used by the general splicing
repressor to modulate exon inclusion or skipping. Molecular Cell 36(6):996–1006.
[40] Roberts GC, Gooding C, Mak HY, Smith CW, and Proudfoot NJ, 1998. Co-transcriptional
commitment to alternative splice site selection. Nucleic Acids Research 26(24):5568–5572.
[41] Castillo-Davis CI, Mekhedov SL, Hartl DL, Koonin EV, and Kondrashov FA, 2002. Selection for
short introns in highly expressed genes. Nature Genetics 31(4):415–418.
[42] Seoighe C and Korir PK, 2011. Evidence for intron length conservation in a set of mammalian
genes associated with embryonic development. BMC Bioinformatics 12(Suppl 9):S16.
[43] Kolkman JA and Stemmer WP, 2001. Directed evolution of proteins by exon shuffling. Nature
Biotechnology 19(5):423–428.
[44] Licatalosi D and Darnell R, 2010. RNA processing and its regulation: global insights into biological
networks. Nature Reviews Genetics 11(1):75–87.
[45] Tsokos GC, 2006. In the beginning was Sm. The Journal of Immunology 176(3):1295–1296.
[46] Matera AG, Terns RM, and Terns MP, 2007. Non-coding RNAs: lessons from the small nuclear
and small nucleolar RNAs. Nature Reviews Molecular Cell Biology 8(3):209–220.
[47] Hegele A, Kamburov A, Grossmann A, Sourlis C, Wowro S, Weimann M, Will CL, Pena V,
Lührmann R, and Stelzl U, 2012. Dynamic protein-protein interaction wiring of the human spliceosome. Molecular Cell 45(4):567–580.
[48] Stevens SW, Ryan DE, Ge HY, Moore RE, Young MK, Lee TD, and Abelson J, 2002. Composition
and functional characterization of the yeast spliceosomal penta-snRNP. Molecular Cell 9(1):31–44.
[49] Grainger RJ and Beggs JD, 2005. Prp8 protein: at the heart of the spliceosome. RNA 11(5):533–
557.
[50] Kramer A, 1996. The structure and function of proteins involved in mammalian pre-mRNA
splicing. Annual Review of Biochemistry 65(1):367–409.
165
Bibliography
[51] Schellenberg MJ, Ritchie DB, and MacMillan AM, 2008. Pre-mRNA splicing: a complex picture
in higher definition. Trends in Biochemical Sciences 33(6):243–246.
[52] Lerner MR, Boyle JA, Mount SM, Wolin SL, and Steitz JA, 1980. Are snRNPs involved in splicing?
Nature pages 220–224.
[53] Robberson BL, Cote GJ, and Berget SM, 1990. Exon definition may facilitate splice site selection
in RNAs with multiple exons. Molecular and Cellular Biology 10(1):84–94.
[54] Hoffman B and Grabowski P, 1992. U1 snrnp targets an essential splicing factor, u2af65, to the
3’splice site by a network of interactions spanning the exon. Genes & Development 6(12b):2554–
2568.
[55] Ruskin B, Zamore PD, and Green MR, 1988. A factor, U2AF, is required for U2 snRNP binding
and splicing complex assembly. Cell 52(2):207–219.
[56] Zuo P and Maniatis T, 1996. The splicing factor U2AF35 mediates critical protein-protein interactions in constitutive and enhancer-dependent splicing. Genes & Development 10(11):1356–1368.
[57] Pena V, Liu S, Bujnicki J, Lührmann R, and Wahl M, 2007. Structure of a Multipartite ProteinProtein Interaction Domain in Splicing Factor Prp8 and Its Link to Retinitis Pigmentosa. Molecular Cell 25(4):615–624.
[58] Liu ZR, 2002. p68 RNA helicase is an essential human splicing factor that acts at the U1 snRNA-5
splice site duplex. Molecular and Cellular Biology 22(15):5443–5450.
[59] Liu ZR, Sargueil B, and Smith CW, 1998. Detection of a novel ATP-dependent cross-linked protein
at the 5 splice site-U1 small nuclear RNA duplex by methylene blue-mediated photo-cross-linking.
Molecular and Cellular Biology 18(12):6910–6920.
[60] de la Cruz J, Kressler D, and Linder P, 1999. Unwinding RNA in Saccharomyces cerevisiae:
DEAD-box proteins and related families. Trends in Biochemical Sciences 24(5):192–198.
[61] Wang Y and Guthrie C, 1998. PRP16, a DEAH-box RNA helicase, is recruited to the spliceosome
primarily via its nonconserved N-terminal domain. RNA 4(10):1216–1229.
[62] Schneider S, Hotz HR, and Schwer B, 2002. Characterization of dominant-negative mutants of
the DEAH-box splicing factors Prp22 and Prp16. Journal of Biological Chemistry 277(18):15452–
15458.
[63] Umen J and Guthrie C, 1995. Prp16p, Slu7p, and Prp8p interact with the 3’splice site in two
distinct stages during the second catalytic step of pre-mRNA splicing. RNA 1(6):584.
[64] McPheeters DS, Schwer B, and Muhlenkamp P, 2000. Interaction of the yeast DExH-box RNA
helicase Prp22p with the 3 splice site during the second step of nuclear pre-mRNA splicing. Nucleic
Acids Research 28(6):1313–1321.
[65] James SA, Turner W, and Schwer B, 2002. How Slu7 and Prp18 cooperate in the second step of
yeast pre-mRNA splicing. RNA 8(8):1068–1077.
[66] Martin A, Schneider S, and Schwer B, 2002. Prp43 is an essential RNA-dependent ATPase required
for release of lariat-intron from the spliceosome. Journal of Biological Chemistry 277(20):17743–
17750.
166
Bibliography
[67] Schneider S, Campodonico E, and Schwer B, 2004. Motifs IV and V in the DEAH box splicing factor Prp22 are important for RNA unwinding, and helicase-defective Prp22 mutants are suppressed
by Prp8. Journal of Biological Chemistry 279(10):8617–8626.
[68] Staley JP and Guthrie C, 1998. Mechanical devices of the spliceosome: motors, clocks, springs,
and things. Cell 92(3):315–326.
[69] Berg J, Tymoczko J, and Stryer L. Biochemistry. Sixth. 2007.
[70] Abelson J, Trotta CR, and Li H, 1998.
tRNA splicing.
Journal of Biological Chemistry
273(21):12685–12688.
[71] Cech TR, 1990. Self-splicing of group I introns. Annual Review of Biochemistry 59(1):543–568.
[72] Bonen L and Vogel J, 2001. The ins and outs of group II introns. Trends in Genetics 17(6):322–331.
[73] Doetsch NA, Favreau MR, Kuscuoglu N, Thompson MD, and Hallick RB, 2001. Chloroplast
transformation in Euglena gracilis: splicing of a group III twintron transcribed from a transgenic
psbK operon. Current Genetics 39(1):49–60.
[74] Bonen L, 1993. Trans-splicing of pre-mRNA in plants, animals, and protists. The FASEB Journal
7(1):40–46.
[75] Hastings KE, 2005. SL trans-splicing: easy come or easy go? Trends in Genetics 21(4):240–247.
[76] Lasda EL and Blumenthal T, 2011.
Trans-splicing.
Wiley Interdisciplinary Reviews: RNA
2(3):417–434.
[77] Dorn R and Krauss V, 2003. The modifier of mdg4 locus in Drosophila: functional complexity is
resolved by trans splicing. Genetica 117(2-3):165–177.
[78] Goeke S, Greene EA, Grant PK, Gates MA, Crowner D, Aigaki T, and Giniger E, 2003. Alternative
splicing of lola generates 19 transcription factors controlling axon guidance in Drosophila. Nature
Neuroscience 6(9):917–924.
[79] Li H, Wang J, Mor G, and Sklar J, 2008. A neoplastic gene fusion mimics trans-splicing of RNAs
in normal human cells. Science 321(5894):1357–1361.
[80] Mayer MG and Floeter-Winter LM, 2005. Pre-mRNA trans-splicing: from kinetoplastids to mammals, an easy language for life diversity. Memorias do Instituto Oswaldo Cruz 100(5):501–513.
[81] Li Y and Blencowe BJ, 1999. Distinct factor requirements for exonic splicing enhancer function
and binding of U2AF to the polypyrimidine tract. Journal of Biological Chemistry 274(49):35074–
35079.
[82] Schneider M, Will CL, Anokhina M, Tazi J, Urlaub H, and Lührmann R, 2010. Exon definition
complexes contain the tri-snRNP and can be directly converted into B-like precatalytic splicing
complexes. Molecular Cell 38(2):223–235.
[83] Fox-Walsh K, Dou Y, Lam B, Hung S, Baldi P, and Hertel K, 2005. The architecture of pre-mRNAs
affects mechanisms of splice-site pairing. Proceedings of the National Academy of Sciences of the
United States of America 102(45):16176.
[84] Bolisetty M and Beemon K, 2012. Splicing of internal large exons is defined by novel cis-acting
sequence elements. Nucleic Acids Research .
167
Bibliography
[85] Ke S, Shang S, Kalachikov SM, Morozova I, Yu L, Russo JJ, Ju J, and Chasin LA, 2011. Quantitative evaluation of all hexamers as exonic splicing elements. Genome Research 21(8):1360–1374.
[86] Zhu J, He F, Wang D, Liu K, Huang D, Xiao J, Wu J, Hu S, and Yu J, 2010. A novel role for
minimal introns: Routing mRNAs to the Cytosol. PLoS ONE 5(4).
[87] Dominski Z and Kole R, 1991. Selection of splice sites in pre-mRNAs with short internal exons.
[88] Dominski Z and Kole R, 1992. Cooperation of pre-mRNA sequence elements in splice site selection.
[89] Black DL, 2003. Mechanisms of alternative pre-messenger RNA splicing. Annual Review of Biochemistry 72(1):291–336.
[90] Shukla S, Kavak E, Gregory M, Imashimizu M, Shutinoski B, Kashlev M, Oberdoerffer P, Sandberg
R, and Oberdoerffer S, 2011. CTCF-promoted RNA polymerase II pausing links DNA methylation
to splicing. Nature 479(7371):74–79.
[91] Sasaki-Haraguchi N, Shimada MK, Taniguchi I, Ohno M, and Mayeda A, 2012. Mechanistic insights into human pre-mRNA splicing of human ultra-short introns: Potential unusual mechanism
identifies G-rich introns. Biochemical and Biophysical Research Communications 423(2):289–294.
[92] Behzadnia N, Golas MM, Hartmuth K, Sander B, Kastner B, Deckert J, Dube P, Will CL, Urlaub
H, Stark H et al., 2007. Composition and three-dimensional EM structure of double affinitypurified, human prespliceosomal A complexes. The EMBO Journal 26(6):1737–1748.
[93] Sirand-Pugnet P, Durosay P, Brody E, and Marie J, 1995. An intronic (A/U) GGG repeat
enhances the splicing of an alternative intron of the chicken β-tropomyosin pre-mRNA. Nucleic
Acids Research 23(17):3501–3507.
[94] Carlo T, Sterner DA, and Berget SM, 1996. An intron splicing enhancer containing a G-rich repeat
facilitates inclusion of a vertebrate micro-exon. RNA 2(4):342.
[95] McCullough AJ and Berget SM, 1997. G triplets located throughout a class of small vertebrate
introns enforce intron borders and regulate splice site selection. Molecular and Cellular Biology
17(8):4562–4571.
[96] Fedorova L and Fedorov A, 2005. Puzzles of the human genome: Why do we need our introns?
Current Genomics 6(8):589–595.
[97] Suzuki H, Kameyama T, Ohe K, Tsukahara T, and Mayeda A, 2013. Nested introns in an intron:
Evidence of multi-step splicing of a large intron in the human dystrophin pre-mRNA. FEBS letters
.
[98] Hatton AR, Subramaniam V, and Lopez AJ, 1998. Generation of Alternative Ultrabithorax Isoforms and Stepwise Removal of a Large Intron by Resplicing at Exon–Exon Junctions. Molecular
Cell 2(6):787–796.
[99] Burnette JM, Miyamoto-Sato E, Schaub MA, Conklin J, and Lopez AJ, 2005. Subdivision of large
introns in Drosophila by recursive splicing at nonexonic elements. Genetics 170(2):661–674.
[100] Shepard S, McCreary M, and Fedorov A, 2009. The peculiarities of large intron splicing in animals.
PloS one 4(11):e7853.
168
Bibliography
[101] Sterner DA and Berget SM, 1993. In vivo recognition of a vertebrate mini-exon as an exon-intronexon unit. Molecular and Cellular Biology 13(5):2677–2687.
[102] Black DL, 1991. Does steric interference between splice sites block the splicing of a short c-src
neuron-specific exon in non-neuronal cells? Genes & Development 5(3):389–402.
[103] Black DL, 1992. Activation of c-src neuron-specific splicing by an unusual RNA element in vivo
and in vitro. Cell 69(5):795–807.
[104] Min H, Chan RC, and Black DL, 1995. The generally expressed hnRNP F is involved in a neuralspecific pre-mRNA splicing event. Genes & Development 9(21):2659–2671.
[105] Kornblihtt AR, 2008. Coupling transcription and alternative splicing. Advances in Experimental
Medicine and Biology 623:175.
[106] McCracken S, Fong N, Rosonina E, Yankulov K, Brothers G, Siderovski D, Hessel A, Foster S,
Shuman S, Bentley DL et al., 1997. 5’-Capping enzymes are targeted to pre-mRNA by binding
to the phosphorylated carboxy-terminal domain of RNA polymerase II. Genes & Development
11(24):3306–3318.
[107] Hirose Y, Tacke R, and Manley JL, 1999. Phosphorylated RNA polymerase II stimulates premRNA splicing. Genes & Development 13(10):1234–1239.
[108] Millhouse S and Manley JL, 2005. The C-terminal domain of RNA polymerase II functions as
a phosphorylation-dependent splicing activator in a heterologous protein. Molecular and Cellular
Biology 25(2):533–544.
[109] Tarn W, Hsu C, Huang K, Chen H, Kao HY, Lee KR, and Cheng S, 1994. Functional association
of essential splicing factor (s) with PRP19 in a protein complex. The EMBO Journal 13(10):2421.
[110] David CJ, Boyne AR, Millhouse SR, and Manley JL, 2011. The RNA polymerase II C-terminal
domain promotes splicing activation through recruitment of a U2AF65–Prp19 complex. Genes &
Development 25(9):972–983.
[111] Zeng C and Berget SM, 2000. Participation of the C-terminal domain of RNA polymerase II in
exon definition during pre-mRNA splicing. Molecular and Cellular Biology 20(21):8290–8301.
[112] Pandya-Jones A and Black D, 2009. Co-transcriptional splicing of constitutive and alternative
exons. RNA 15(10):1896–1908.
[113] Vargas D, Shah K, Batish M, Levandoski M, Sinha S, Marras S, Schedl P, and Tyagi S, 2011.
Single-Molecule Imaging of Transcriptionally Coupled and Uncoupled Splicing. Cell 147(5):1054–
1065.
[114] Kornblihtt AR, Schor IE, Alló M, Dujardin G, Petrillo E, and Muñoz MJ, 2013. Alternative
splicing: a pivotal step between eukaryotic transcription and translation. Nature Reviews Molecular
Cell Biology .
[115] Denis MM, Tolley ND, Bunting M, Schwertz H, Jiang H, Lindemann S, Yost CC, Rubner FJ,
Albertine KH, Swoboda KJ et al., 2005. Escaping the nuclear confines: signal-dependent premRNA splicing in anucleate platelets. Cell 122(3):379–391.
169
Bibliography
[116] Breitbart RE, Andreadis A, and Nadal-Ginard B, 1987. Alternative splicing: a ubiquitous mechanism for the generation of multiple protein isoforms from single genes. Annual Review of Biochemistry 56(1):467–495.
[117] Flicek P, Amode M, Barrell D, Beal K, Brent S, Chen Y, Clapham P, Coates G, Fairley S,
Fitzgerald S et al., 2011. Ensembl 2011. Nucleic Acids Research 39(suppl 1):D800–D806.
[118] Pan Q, Shai O, Lee LJ, Frey BJ, and Blencowe BJ, 2008. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nature Genetics
40(12):1413–1415.
[119] Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF, Schroth GP,
and Burge CB, 2008. Alternative isoform regulation in human tissue transcriptomes. Nature
456(7221):470–476.
[120] Lockstone HE, 2011. Exon array data analysis using Affymetrix power tools and R statistical
software. Briefings in Bioinformatics 12(6):634–644.
[121] Johnson JM, Castle J, Garrett-Engele P, Kan Z, Loerch PM, Armour CD, Santos R, Schadt EE,
Stoughton R, and Shoemaker DD, 2003. Genome-wide survey of human alternative pre-mRNA
splicing with exon junction microarrays. Science 302(5653):2141–2144.
[122] Bertone P, Stolc V, Royce TE, Rozowsky JS, Urban AE, Zhu X, Rinn JL, Tongprasit W, Samanta
M, Weissman S et al., 2004. Global identification of human transcribed sequences with genome
tiling arrays. Science 306(5705):2242–2246.
[123] Ladomery MR, Harper SJ, and Bates DO, 2007. Alternative splicing in angiogenesis: the vascular
endothelial growth factor paradigm. Cancer Letters 249(2):133–142.
[124] Marden J, 2006. Quantitative and evolutionary biology of alternative splicing: how changing
the mix of alternative transcripts affects phenotypic plasticity and reaction norms. Heredity
100(2):111–120.
[125] Hattori D, Demir E, Kim HW, Viragh E, Zipursky SL, and Dickson BJ, 2007. Dscam diversity is
essential for neuronal wiring and self-recognition. Nature 449(7159):223–227.
[126] Ullrich B, Ushkaryov YA, and Südhof TC, 1995. Cartography of neurexins: more than 1000
isoforms generated by alternative splicing and expressed in distinct subsets of neurons. Neuron
14(3):497–507.
[127] Keren H, Lev-Maor G, and Ast G, 2010. Alternative splicing and evolution: diversification, exon
definition and function. Nature Reviews Genetics 11(5):345–355.
[128] Kondrashov FA and Koonin EV, 2001. Origin of alternative splicing by tandem exon duplication.
Human Molecular Genetics 10(23):2661–2669.
[129] Lynch M and Conery JS, 2003. The origins of genome complexity. Science 302(5649):1401–1404.
[130] Alekseyenko AV, Kim N, and Lee CJ, 2007. Global analysis of exon creation versus loss and the
role of alternative splicing in 17 vertebrate genomes. RNA 13(5):661–670.
[131] Kim E, Magen A, and Ast G, 2007. Different levels of alternative splicing among eukaryotes.
Nucleic Acids Research 35(1):125–131.
170
Bibliography
[132] Schwartz S, Gal-Mark N, Kfir N, Oren R, Kim E, and Ast G, 2009. Alu exonization events reveal
features required for precise recognition of exons by the splicing machinery. PLoS Computational
Biology 5(3):e1000300.
[133] Sorek R, Lev-Maor G, Reznik M, Dagan T, Belinky F, Graur D, and Ast G, 2004. Minimal
conditions for exonization of intronic sequences: 5 splice site formation in alu exons. Molecular
Cell 14(2):221–231.
[134] Letunic I, Copley RR, and Bork P, 2002. Common exon duplication in animals and its role in
alternative splicing. Human Molecular Genetics 11(13):1561–1567.
[135] Okazaki Y, Furuno M, Kasukawa T, Adachi J, Bono H, Kondo S, Nikaido I, Osato N, Saito R,
Suzuki H et al., 2002. Analysis of the mouse transcriptome based on functional annotation of
60,770 full-length cDNAs. Nature 420(6915):563–573.
[136] Zavolan M, Kondo S, Schönbach C, Adachi J, Hume DA, Hayashizaki Y, Gaasterland T et al.,
2003. Impact of alternative initiation, splicing, and termination on the diversity of the mRNA
transcripts encoded by the mouse transcriptome. Genome Research 13(6b):1290–1300.
[137] Stamm S, Ben-Ari S, Rafalska I, Tang Y, Zhang Z, Toiber D, Thanaraj T, and Soreq H, 2005.
Function of alternative splicing. Gene 344:1–20.
[138] Patthy L, 1999. Genome evolution and the evolution of exon-shufflinga review. Gene 238(1):103–
114.
[139] Gabut M, Samavarchi-Tehrani P, Wang X, Slobodeniuc V, O’Hanlon D, Sung HK, Alvarez M,
Talukder S, Pan Q, Mazzoni EO et al., 2011. An alternative splicing switch regulates embryonic
stem cell pluripotency and reprogramming. Cell 147(1):132–146.
[140] Pan Q, Shai O, Misquitta C, Zhang W, Saltzman AL, Mohammad N, Babak T, Siu H, Hughes TR,
Morris QD et al., 2004. Revealing global regulatory features of mammalian alternative splicing
using a quantitative microarray platform. Molecular Cell 16(6):929–941.
[141] Irimia M, Rukov J, Penny D, and Roy S, 2007. Functional and evolutionary analysis of alternatively
spliced genes is consistent with an early eukaryotic origin of alternative splicing. BMC Evolutionary
Biology 7(1):188.
[142] Kelemen O, Convertini P, Zhang Z, Wen Y, Shen M, Falaleeva M, and Stamm S, 2012. Function
of alternative splicing. Gene .
[143] Lewis B, Green R, and Brenner S, 2003. Evidence for the widespread coupling of alternative
splicing and nonsense-mediated mRNA decay in humans. Proceedings of the National Academy of
Sciences 100(1):189–192.
[144] Sureau A, Gattoni R, Dooghe Y, Stevenin J, and Soret J, 2001. SC35 autoregulates its expression
by promoting splicing events that destabilize its mRNAs. The EMBO Journal 20(7):1785–1796.
[145] Ritchie W, Granjeaud S, Puthier D, and Gautheret D, 2008. Entropy measures quantify global
splicing disorders in cancer. PLoS Computational Biology 4(3):e1000011.
[146] Faustino NA and Cooper TA, 2003. Pre-mRNA splicing and human disease. Genes & Development
17(4):419–437.
171
Bibliography
[147] Tazi J, Bakkour N, and Stamm S, 2009. Alternative splicing and disease. Biochimica et Biophysica
Acta (BBA)-Molecular Basis of Disease 1792(1):14–26.
[148] Barash Y, Calarco J, Gao W, Pan Q, Wang X, Shai O, Blencowe B, and Frey B, 2010. Deciphering
the splicing code. Nature 465(7294):53–59.
[149] Chasin LA, 2008. Searching for splicing motifs. Advances in Experimental Medicine and Biology
623:85.
[150] Listerman I, Sapra AK, and Neugebauer KM, 2006. Cotranscriptional coupling of splicing factor
recruitment and precursor messenger RNA splicing in mammalian cells. Nature Structural &
Molecular Biology 13(9):815–822.
[151] Bennett M, Pinol-Roma S, Staknis D, Dreyfuss G, and Reed R, 1992. Differential binding of
heterogeneous nuclear ribonucleoproteins to mRNA precursors prior to spliceosome assembly in
vitro. Molecular and Cellular Biology 12(7):3165–3175.
[152] Zheng ZM, 2004. Regulation of alternative RNA splicing by exon definition and exon sequences
in viral and mammalian gene expression. Journal of Biomedical Science 11(3):278–294.
[153] Cartegni L, Chew SL, and Krainer AR, 2002. Listening to silence and understanding nonsense:
exonic mutations that affect splicing. Nature Reviews Genetics 3(4):285–298.
[154] Liu HX, Zhang M, and Krainer AR, 1998. Identification of functional exonic splicing enhancer
motifs recognized by individual SR proteins. Genes & Development 12(13).
[155] Chandler SD, Mayeda A, Yeakley JM, Krainer AR, and Fu XD, 1997. RNA splicing specificity
determined by the coordinated action of RNA recognition motifs in SR proteins. Proceedings of
the National Academy of Sciences 94(8):3596–3601.
[156] Wang J, Xiao SH, and Manley JL, 1998. Genetic analysis of the SR protein ASF/SF2: interchangeability of RS domains and negative control of splicing. Genes & Development 12(14):2222–2233.
[157] Zahler AM, Lane WS, Stolk JA, and Roth MB, 1992. SR proteins: a conserved family of premRNA splicing factors. Genes & Development 6(5):837–847.
[158] Wagner EJ and Garcia-Blanco MA, 2001. Polypyrimidine tract binding protein antagonizes exon
definition. Molecular and Cellular Biology 21(10):3281–3288.
[159] Luco RF, Allo M, Schor IE, Kornblihtt AR, and Misteli T, 2011. Epigenetics in alternative premRNA splicing. Cell 144(1):16–26.
[160] Hodges C, Bintu L, Lubkowska L, Kashlev M, and Bustamante C, 2009. Nucleosomal fluctuations
govern the transcription dynamics of RNA polymerase II. Science 325(5940):626–628.
[161] Churchman LS and Weissman JS, 2011. Nascent transcript sequencing visualizes transcription at
nucleotide resolution. Nature 469(7330):368–373.
[162] Gunderson FQ and Johnson TL, 2009. Acetylation by the transcriptional coactivator Gcn5 plays
a novel role in co-transcriptional spliceosome assembly. PLoS Genetics 5(10):e1000682.
[163] Cheng D, Côté J, Shaaban S, and Bedford MT, 2007. The arginine methyltransferase CARM1
regulates the coupling of transcription and mRNA processing. Molecular Cell 25(1):71–83.
172
Bibliography
[164] Oesterreich F, Bieberstein N, and Neugebauer K, 2011. Pause locally, splice globally. Trends in
Cell Biology 21(6):328–335.
[165] Close P, East P, Dirac-Svejstrup AB, Hartmann H, Heron M, Maslen S, Chariot A, Söding J,
Skehel M, and Svejstrup JQ, 2012. DBIRD complex integrates alternative mRNA splicing with
RNA polymerase II transcript elongation. Nature 484(7394):386–389.
[166] Saeki H and Svejstrup J, 2009. Stability, flexibility, and dynamic interactions of colliding RNA
polymerase II elongation complexes. Molecular Cell 35(2):191–205.
[167] Fu XD, 2004. Towards a splicing code. Cell 119(6):736–738.
[168] Melamud E and Moult J, 2009. Stochastic noise in splicing machinery. Nucleic Acids Research
37(14):4873–4886.
[169] Pickrell JK, Pai AA, Gilad Y, and Pritchard JK, 2010. Noisy splicing drives mRNA isoform
diversity in human cells. PLoS Genetics 6(12):e1001236.
[170] Pelechano V, Wei W, and Steinmetz LM, 2013. Extensive transcriptional heterogeneity revealed
by isoform profiling. Nature .
[171] Lockhart DJ and Winzeler EA, 2000. Genomics, gene expression and DNA arrays. Nature
405(6788):827–836.
[172] Duggan DJ, Bittner M, Chen Y, Meltzer P, and Trent JM, 1999. Expression profiling using cDNA
microarrays. Nature Genetics 21:10–14.
[173] Malone JH and Oliver B, 2011. Microarrays, deep sequencing and the true measure of the transcriptome. BMC Biology 9(1):34.
[174] Steijger T, Abril JF, Engström PG, Kokocinski F, Hubbard TJ, Guigó R, Harrow J, Bertone P,
Consortium R et al., 2013. Assessment of transcript reconstruction methods for RNA-seq. Nature
Methods .
[175] Eyras E, Alamancos GP, and Agirre E, 2013. Methods to Study Splicing from RNA-Seq. URL
http://figshare.com/articles/Methods_to_Study_Splicing_from_RNA_Seq/679993.
[176] Anders S, Reyes A, and Huber W, 2012. Detecting differential usage of exons from RNA-seq data.
Genome Research 22(10):2008–2017.
[177] Clark TA, Sugnet CW, and Ares M, 2002. Genomewide analysis of mRNA processing in yeast
using splicing-specific microarrays. Science 296(5569):907–910.
[178] Cui X, Hwang JG, Qiu J, Blades NJ, and Churchill GA, 2005. Improved statistical tests for
differential gene expression by shrinking variance components estimates. Biostatistics 6(1):59–75.
[179] Hiller D, Jiang H, Xu W, and Wong WH, 2009. Identifiability of isoform deconvolution from
junction arrays and RNA-Seq. Bioinformatics 25(23):3056–3059.
[180] Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ,
and Pachter L, 2010. Transcript assembly and quantification by RNA-Seq reveals unannotated
transcripts and isoform switching during cell differentiation. Nature Biotechnology 28(5):511–515.
[181] Roberts A, Pimentel H, Trapnell C, and Pachter L, 2011. Identification of novel transcripts in
annotated genomes using RNA-Seq. Bioinformatics 27(17):2325–2329.
173
Bibliography
[182] Schulz MH, Zerbino DR, Vingron M, and Birney E, 2012. Oases: robust de novo RNA-seq assembly
across the dynamic range of expression levels. Bioinformatics 28(8):1086–1092.
[183] Turro E, Lewin A, Rose A, Dallman MJ, and Richardson S, 2010. MMBGX: a method for
estimating expression at the isoform level and detecting differential splicing using whole-transcript
Affymetrix arrays. Nucleic Acids Research 38(1):e4–e4.
[184] Liu X, Gao Z, Zhang L, and Rattray M, 2013. puma 3.0: improved uncertainty propagation
methods for gene and transcript expression analysis. BMC Bioinformatics 14(1):39.
[185] Anton MA, Gorostiaga D, Guruceaga E, Segura V, Carmona-Saez P, Pascual-Montano A, Pio R,
Montuenga LM, and Rubio A, 2008. SPACE: an algorithm to predict and quantify alternatively
spliced isoforms using microarrays. Genome Biology 9(2):R46.
[186] Jiang H and Wong WH, 2009. Statistical inferences for isoform expression in RNA-Seq. Bioinformatics 25(8):1026–1032.
[187] Mezlini AM, Smith EJ, Fiume M, Buske O, Savich GL, Shah S, Aparicio S, Chiang DY, Goldenberg
A, and Brudno M, 2013. iReckon: Simultaneous isoform discovery and abundance estimation from
RNA-seq data. Genome Research 23(3):519–529.
[188] Nicolae M, Mangul S, Mandoiu II, and Zelikovsky A, 2011. Estimation of alternative splicing
isoform frequencies from RNA-Seq data. Algorithms for Molecular Biology 6(1):9.
[189] Li W, Feng J, and Jiang T, 2011. IsoLasso: a LASSO regression approach to RNA-Seq based
transcriptome assembly. Journal of Computational Biology 18(11):1693–1707.
[190] Li B and Dewey C, 2011. RSEM: accurate transcript quantification from RNA-Seq data with or
without a reference genome. BMC Bioinformatics 12(1):323.
[191] Schweikert G, Zien A, Zeller G, Behr J, Dieterich C, Ong CS, Philips P, De Bona F, Hartmann L,
Bohlen A et al., 2009. mGene: accurate SVM-based gene finding with an application to nematode
genomes. Genome Research 19(11):2133–2143.
[192] Raetsch G and Sonnenburg S, 2006. Large Scale Semi Hidden Markov SVMs .
[193] Li JJ, Jiang CR, Brown JB, Huang H, and Bickel PJ, 2011. Sparse linear modeling of nextgeneration mRNA sequencing (RNA-Seq) data for isoform discovery and abundance estimation.
Proceedings of the National Academy of Sciences 108(50):19867–19872.
[194] Trapnell C, Pachter L, and Salzberg SL, 2009. TopHat: discovering splice junctions with RNA-Seq.
Bioinformatics 25(9):1105–1111.
[195] Mattick J and Gagen M, 2001. The evolution of controlled multitasked gene networks: The role of
introns and other noncoding RNAs in the development of complex organisms. Molecular Biology
and Evolution 18(9):1611–1630.
[196] Lynch M, 2002. Intron evolution as a population-genetic process. Proceedings of the National
Academy of Sciences of the United States of America 99(9):6118–6123.
[197] Lynch M, 2006.
The origins of eukaryotic gene structure.
Molecular Biology and Evolution
23(2):450–468.
[198] Fedorova L and Fedorov A, 2003. Introns in gene evolution. Genetica 118(2-3):123–131.
174
Bibliography
[199] Tange T, Nott A, and Moore M, 2004. The ever-increasing complexities of the exon junction
complex. Current Opinion in Cell Biology 16(3):279–284.
[200] Nott A, Meislin S, and Moore M, 2003. A quantitative analysis of intron effects on mammalian
gene expression. RNA 9:607–617.
[201] Lu S and Cullen B, 2003. Analysis of the stimulatory effect of splicing on mRNA production and
utilization in mammalian cells. RNA 9:618–630.
[202] Rearick D, Prakash A, McSweeny A, Shepard S, Fedorova L, and Fedorov A, 2011. Critical
association of ncRNA with introns. Nucleic Acids Research 39(6):2357–2366.
[203] Swinburne I and Silver P, 2008. Intron delays and transcriptional timing during development.
Developmental cell 14(3):324–330.
[204] Aulehla APO, 2008. Oscillating signaling pathways during embryonic development. Current
Opinion in Cell Biology 20(6):632–637.
[205] Takashima Y, Ohtsuka T, González A, Miyachi H, and Kageyama R, 2011. Intronic delay is
essential for oscillatory expression in the segmentation clock. Proceedings of the National Academy
of Sciences of the United States of America 108(8):3300–3305.
[206] Flicek P, Amode M, Barrell D, Beal K, Brent S, Chen Y, Clapham P, Coates G, Fairley S,
Fitzgerald S et al., 2011. Ensembl 2011. Nucleic Acids Research 39(SUPPL. 1):D800–D806.
[207] Letunic I and Bork P, 2007. Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree
display and annotation. Bioinformatics 23(1):127–128.
[208] Letunic I and Bork P, 2011. Interactive Tree Of Life v2: online annotation and display of phylogenetic trees made easy. Nucleic Acids Research 39(suppl 2):W475–W478.
[209] Waterhouse R, Zdobnov E, Tegenfeldt F, Li J, and Kriventseva E, 2011. OrthoDB: The hierarchical
catalog of eukaryotic orthologs in 2011. Nucleic Acids Research 39(SUPPL. 1):D283–D288.
[210] Huang D, Sherman B, and Lempicki R, 2009. Bioinformatics enrichment tools: Paths toward the
comprehensive functional analysis of large gene lists. Nucleic Acids Research 37(1):1–13.
[211] Huang D, Sherman B, and Lempicki R, 2009. Systematic and integrative analysis of large gene
lists using DAVID bioinformatics resources. Nature Protocols 4(1):44–57.
[212] Hubbard T, Aken B, Ayling S, Ballester B, Beal K, Bragin E, Brent S, Chen Y, Clapham P, Clarke
L et al., 2009. Ensembl 2009. Nucleic Acids Research 37(SUPPL. 1):D690–D697.
[213] Paten B, Herrero J, Fitzgerald S, Beal K, Flicek P, Holmes I, and Birney E, 2008. Genome-wide
nucleotide-level mammalian ancestor reconstruction. Genome Research 18(11):1829–1843.
[214] Bessho Y, Sakata R, Komatsu S, Shiota K, Yamada S, and Kageyama R, 2001. Dynamic expression
and essential functions of hes7 in somite segmentation. Genes and Development 15(20):2642–2647.
[215] Oates A and Ho R, 2002. Hairye/(spl)-related (her) genes are central components of the segmentation oscillator and display redundancy with the delta/notch signaling pathway in the formation
of anterior segmental boundaries in the zebrafish. Development 129(12):2929–2946.
175
Bibliography
[216] Brend T and Holley S, 2009. Expression of the oscillating gene her1 is directly regulated by
hairy/enhancer of split, T-box, and suppressor of hairless proteins in the zebrafish segmentation
clock. Developmental Dynamics 238(11):2745–2759.
[217] Zdobnov EAR and Apweiler R, 2001. Interproscan - an integration platform for the signaturerecognition methods in interpro. Bioinformatics 17(9):847–848.
[218] Vinogradov A, 2006. “genome design” model: Evidence from conserved intronic sequence in
human-mouse comparison. Genome Research 16(3):347–354.
[219] Elkon R, Zlotorynski E, Zeller K, and Agami R, 2010. Major role for mRNA stability in shaping
the kinetics of gene induction. BMC Genomics 11(1).
[220] Castillo-Davis C, Mekhedov S, Hartl D, Koonin E, and Kondrashov F, 2002. Selection for short
introns in highly expressed genes. Nature Genetics 31(4):415–418.
[221] Larson D, Zenklusen D, Wu B, Chao J, and Singer R, 2011. Real-time observation of transcription
initiation and elongation on an endogenous yeast gene. Science 332(6028):475–478.
[222] Boireau S, Maiuri P, Basyuk E, De La Mata M, Knezevich A, Pradet-Balade B, Bäcker V, Kornblihtt A, Marcello A, and Bertrand E, 2007. The transcriptional cycle of hiv-1 in real-time and
live cells. Journal of Cell Biology 179(2):291–304.
[223] Umen JG and Guthrie C, 1996. Mutagenesis of the yeast gene PRP8 reveals domains governing
the specificity and fidelity of 3’ splice site selection. Genetics 143(2):723–739.
[224] van Nues RW and Beggs JD, 2001. Functional contacts with a range of splicing proteins suggest a
central role for Brr2p in the dynamic control of the order of events in spliceosomes of Saccharomyces
cerevisiae. Genetics 157(4):1451–1467.
[225] Kuhn AN, Reichl EM, and Brow DA, 2002. Distinct domains of splicing factor Prp8 mediate different aspects of spliceosome activation. Proceedings of the National Academy of Sciences
99(14):9145–9149.
[226] Brenner TJ and Guthrie C, 2005. Genetic analysis reveals a role for the C terminus of the
Saccharomyces cerevisiae GTPase Snu114 during spliceosome activation. Genetics 170(3):1063–
1080.
[227] Nielsen KH and Staley JP, 2012. Spliceosome activation: U4 is the path, stem I is the goal, and
Prp8 is the keeper. Let’s cheer for the ATPase Brr2 ! Genes & Development 26(22):2461–2467.
[228] Mozaffari-Jovin S, Wandersleben T, Santos KF, Will CL, Lührmann R, and Wahl MC, 2013.
Inhibition of RNA Helicase Brr2 by the C-Terminal Tail of the Spliceosomal Protein Prp8. Science
341(6141):80–84.
[229] Schellenberg MJ, Wu T, Ritchie DB, Fica S, Staley JP, Atta KA, LaPointe P, and MacMillan
AM, 2013. A conformational switch in PRP8 mediates metal ion coordination that promotes
pre-mRNA exon ligation. Nature Structural & Molecular Biology 20(6):728–734.
[230] Boon K, Grainger R, Ehsani P, Barrass J, Auchynnikava T, Inglehearn C, and Beggs J, 2007. prp8
mutations that cause human retinitis pigmentosa lead to a U5 snRNP maturation defect in yeast.
Nature Structural & Molecular Biology 14(11):1077–1083.
176
Bibliography
[231] Tanackovic G, Ransijn A, Thibault P, Elela S, Klinck R, Berson E, Chabot B, and Rivolta C, 2011.
PRPF mutations are associated with generalized defects in spliceosome formation and pre-mRNA
splicing in patients with retinitis pigmentosa. Human Molecular Genetics 20(11):2116–2130.
[232] Farkas M, Grant G, and Pierce E, 2012. Transcriptome Analyses to Investigate the Pathogenesis
of RNA Splicing Factor Retinitis Pigmentosa. Retinal Degenerative Diseases pages 519–525.
[233] Keightley MC, Crowhurst MO, Layton JE, Beilharz T, Markmiller S, Varma S, Hogan BM,
de Jong-Curtain TA, Heath JK, and Lieschke GJ, 2013. In vivo mutation of pre-mRNA processing factor 8 (Prpf8 ) affects transcript splicing, cell survival and myeloid differentiation. FEBS
Letters 587(14):2150–2157.
[234] Ivings L, Towns K, Matin M, Taylor C, Ponchel F, Grainger R, Ramesar R, Mackey D, and
Inglehearn C, 2008. Evaluation of splicing efficiency in lymphoblastoid cell lines from patients
with splicing-factor retinitis pigmentosa. Molecular Vision 14:2357–2366.
[235] Nicolae DL, Gamazon E, Zhang W, Duan S, Dolan ME, and Cox NJ, 2010. Trait-associated
SNPs are more likely to be eQTLs: annotation to enhance discovery from GWAS. PLoS Genetics
6(4):e1000888.
[236] Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras JB, Stephens M,
Gilad Y, and Pritchard JK, 2010. Understanding mechanisms underlying human gene expression
variation with RNA sequencing. Nature 464(7289):768–772.
[237] Fraser H and Xie X, 2009. Common polymorphic transcript variation in human disease. Genome
Research 19(4):567–575.
[238] Buchner D, Trudeau M, and Meisler M, 2003. SCNM1, a putative RNA splicing factor that
modifies disease severity in mice. Science 301(5635):967–969.
[239] Nembaware V, Wolfe K, Bettoni F, Kelso J, and Seoighe C, 2004. Allele-specific transcript isoforms
in human. FEBS Letters 577(1):233–238.
[240] Hull J, Campino S, Rowlands K, Chan M, Copley R, Taylor M, Rockett K, Elvidge G, Keating
B, Knight J et al., 2007. Identification of common genetic variation that modulates alternative
splicing. PLoS Genetics 3(6):e99.
[241] Nembaware V, Lupindo B, Schouest K, Spillane C, Scheffler K, and Seoighe C, 2008. Genome-wide
survey of allele-specific splicing in humans. BMC Genomics 9(1):265.
[242] Grundberg E, Small K, Hedman Å, Nica A, Buil A, Keildson S, Bell J, Yang T, Meduri E, Barrett
A et al., 2012. Mapping cis-and trans-regulatory effects across multiple tissues in twins. Nature
Genetics 44(10):1084–1089.
[243] Wang G and Cooper T, 2007. Splicing in disease: disruption of the splicing code and the decoding
machinery. Nature Reviews Genetics 8(10):749–761.
[244] Singh R and Cooper T, 2012. Pre-mRNA splicing in disease and therapeutics. Trends in Molecular
Medicine .
[245] He H, Liyanarachchi S, Akagi K, Nagy R, Li J, Dietrich R, Li W, Sebastian N, Wen B, Xin B et al.,
2011. Mutations in U4atac snRNA, a component of the minor spliceosome, in the developmental
disorder MOPD I. Science 332(6026):238.
177
Bibliography
[246] Edery P, Marcaillou C, Sahbatou M, Labalme A, Chastang J, Touraine R, Tubacher E, Senni F,
Bober M, Nampoothiri S et al., 2011. Association of TALS developmental disorder with defect in
minor splicing component U4atac snRNA. Science 332(6026):240.
[247] Hartong D, Berson E, and Dryja T, 2006. Retinitis pigmentosa. The Lancet 368(9549):1795–1809.
[248] McKie A, McHale J, Keen T, Tarttelin E, Goliath R, van Lith-Verhoeven J, Greenberg J, Ramesar
R, Hoyng C, Cremers F et al., 2001. Mutations in the pre-mRNA splicing factor gene PRPC8 in
autosomal dominant retinitis pigmentosa (RP13). Human Molecular Genetics 10(15):1555–1562.
[249] Tanackovic G, Ransijn A, Ayuso C, Harper S, Berson E, and Rivolta C, 2011. A missense mutation
in PRPF6 causes impairment of pre-mRNA splicing and autosomal-dominant retinitis pigmentosa.
The American Journal of Human Genetics .
[250] Irizarry R, Hobbs B, Collin F, Beazer-Barclay Y, Antonellis K, Scherf U, and Speed T, 2003.
Exploration, normalization, and summaries of high density oligonucleotide array probe level data.
Biostatistics 4(2):249–264.
[251] Affymetrix G. Quality Assessment of Exon Arrays. Affymetrix GeneChip Whitepaper Collection .
[252] Rajan P, Dalgliesh C, Carling P, Buist T, Zhang C, Grellscheid S, Armstrong K, Stockley J,
Simillion C, Gaughan L et al., 2011. Identification of Novel Androgen-Regulated Pathways and
mRNA Isoforms through Genome-Wide Exon-Specific Profiling of the LNCaP Transcriptome.
PLoS ONE 6(12):e29088.
[253] Flicek P, Amode MR, Barrell D, Beal K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fairley
S, Fitzgerald S et al., 2012. Ensembl 2012. Nucleic Acids Research 40(D1):D84–D90.
[254] Fujita P, Rhead B, Zweig A, Hinrichs A, Karolchik D, Cline M, Goldman M, Barber G, Clawson
H, Coelho A et al., 2011. The UCSC genome browser database: update 2011. Nucleic Acids
Research 39(suppl 1):D876–D882.
[255] Smyth GK, 2005. Limma: linear models for microarray data. In Bioinformatics and Computational
Biology Solutions using R and Bioconductor, pages 397–420. Springer.
[256] Storey J and Tibshirani R, 2003. Statistical significance for genomewide studies. Proceedings of
the National Academy of Sciences of the United States of America 100(16):9440.
[257] Benjamini Y and Hochberg Y, 1995. Controlling the false discovery rate: a practical and powerful
approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological)
pages 289–300.
[258] Li J, Paramita P, Choi K, Karuturi R et al., 2011. ConReg-R: Extrapolative recalibration of
the empirical distribution of p-values to improve false discovery rate estimates. Biology Direct
6(1):1–14.
[259] Gentleman R, Carey V, Bates D, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y,
Gentry J et al., 2004. Bioconductor: open software development for computational biology and
bioinformatics. Genome Biology 5(10):R80.
[260] Yeo G and Burge C, 2004. Maximum entropy modeling of short sequence motifs with applications
to RNA splicing signals. Journal of Computational Biology 11(2-3):377–394.
178
Bibliography
[261] Fairbrother W, Yeo G, Yeh R, Goldstein P, Mawson M, Sharp P, and Burge C, 2004. RESCUEESE identifies candidate exonic splicing enhancers in vertebrate exons. Nucleic Acids Research
32(suppl 2):W187–W190.
[262] Goldstrohm A, Greenleaf A, and Garcia-Blanco M, 2001. Co-transcriptional splicing of premessenger RNAs: considerations for the mechanism of alternative splicing. Gene 277(1):31–47.
[263] Reed R, 2003. Coupling transcription, splicing and mRNA export. Current Opinion in Cell Biology
15(3):326–331.
[264] Tardiff D, Lacadie S, and Rosbash M, 2006. A genome-wide analysis indicates that yeast premRNA splicing is predominantly posttranscriptional. Molecular Cell 24(6):917–929.
[265] Khodor Y, Menet J, Tolan M, and Rosbash M, 2012. Cotranscriptional splicing efficiency differs
dramatically between Drosophila and mouse. RNA 18(12):2174–2186.
[266] Pagani F and Baralle F, 2004. Genomic variants in exons and introns: identifying the splicing
spoilers. Nature Reviews Genetics 5(5):389–396.
[267] Kwan T, Benovoy D, Dias C, Gurd S, Provencher C, Beaulieu P, Hudson T, Sladek R, and
Majewski J, 2008. Genome-wide analysis of transcript isoform variation in humans. Nature
Genetics 40(2):225–231.
[268] Kwan T, Benovoy D, Dias C, Gurd S, Serre D, Zuzan H, Clark T, Schweitzer A, Staples M,
Wang H et al., 2007. Heritability of alternative splicing in the human genome. Genome Research
17(8):1210–1218.
[269] Krawczak M, Thomas N, Hundrieser B, Mort M, Wittig M, Hampe J, and Cooper D, 2006.
Single base-pair substitutions in exon–intron junctions of human genes: nature, distribution, and
consequences for mRNA splicing. Human Mutation 28(2):150–158.
[270] Roca X, Olson A, Rao A, Enerly E, Kristensen V, Borresen-Dale A, Andresen B, Krainer A, and
Sachidanandam R, 2008. Features of 5’-splice-site efficiency derived from disease-causing mutations
and comparative genomics. Genome Research 18(1):77–87.
[271] Storey JD, Madeoy J, Strout JL, Wurfel M, Ronald J, and Akey JM, 2007. Gene-expression variation within and among human populations. The American Journal of Human Genetics 80(3):502–
509.
[272] Sugnet C, Kent W, Ares Jr M, Haussler D et al., 2004. Transcriptome and genome conservation of
alternative splicing events in humans and mice. In Pac Symp Biocomput, volume 9, pages 66–77.
[273] Farlow A, Dolezal M, Hua L, and Schlötterer C, 2012. The genomic signature of splicing-coupled
selection differs between long and short introns. Molecular Biology and Evolution 29(1):21–24.
[274] Sterner D, Carlo T, and Berget S, 1996. Architectural limits on split genes. Proceedings of the
National Academy of Sciences 93(26):15081–15085.
[275] Konforti B and Konarska M, 1994. U4/U5/U6 snRNP recognizes the 5’splice site in the absence
of U2 snRNP. Genes & Development 8(16):1962–1973.
[276] Konforti B and Konarska M, 1995. A short 5’splice site RNA oligo can participate in both steps
of splicing in mammalian extracts. RNA 1(8):815.
179
Bibliography
[277] Reyes J, Kois P, Konforti B, and Konarska M, 1996. The canonical GU dinucleotide at the 5’splice
site is recognized by p220 of the U5 snRNP within the spliceosome. RNA 2(3):213–225.
[278] Reyes J, Gustafson E, Luo H, Moore M, and Konarska M, 1999. The C-terminal region of hPrp8
interacts with the conserved GU dinucleotide at the 5’splice site. RNA 5(2):167–179.
[279] Sigurdsson S, Dirac-Svejstrup A, and Svejstrup J, 2010. Evidence that transcript cleavage is
essential for RNA polymerase II transcription and cell viability. Molecular Cell 38(2):202–210.
[280] Bains W and Smith GC, 1988. A novel method for nucleic acid sequence determination. Journal
of Theoretical Biology 135(3):303–307.
[281] Dramanac R, Labat I, Brukner I, and Crkvenjakov R, 1989. Sequencing of megabase plus DNA
by hybridization: theory of the method. Genomics 4(2):114–128.
[282] Khrapko K, P Lysov Y, Khorlyn A, Shick V, Florentiev V, and Mirzabekov A, 1989. An oligonucleotide hybridization approach to DNA sequencing. FEBS Letters 256(1):118–122.
[283] Schena M, Shalon D, Davis RW, and Brown PO, 1995. Quantitative monitoring of gene expression
patterns with a complementary DNA microarray. Science 270(5235):467–470.
[284] Wang DG, Fan JB, Siao CJ, Berno A, Young P, Sapolsky R, Ghandour G, Perkins N, Winchester
E, Spencer J et al., 1998. Large-scale identification, mapping, and genotyping of single-nucleotide
polymorphisms in the human genome. Science 280(5366):1077–1082.
[285] van Steensel B and Henikoff S, 2003. Epigenomic profiling using microarrays. Biotechniques
35(2):346–357.
[286] Buck MJ and Lieb JD, 2004. ChIP-chip: considerations for the design, analysis, and application
of genome-wide chromatin immunoprecipitation experiments. Genomics 83(3):349–360.
[287] Hoheisel JD, 2006. Microarray technology: beyond transcript profiling and genotype analysis.
Nature Reviews Genetics 7(3):200–210.
[288] Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, and Speed TP, 2003. Summaries of
Affymetrix GeneChip probe level data. Nucleic Acids Research 31(4):e15–e15.
[289] Irizarry RA, Wu Z, and Jaffee HA, 2006. Comparison of Affymetrix GeneChip expression measures.
[290] Choe SE, Boutros M, Michelson AM, Church GM, and Halfon MS, 2005. Preferred analysis
methods for Affymetrix GeneChips revealed by a wholly defined control dataset. Genome Biology
6(2):R16.
[291] Fu X, Fu N, Guo S, Yan Z, Xu Y, Hu H, Menzel C, Chen W, Li Y, Zeng R et al., 2009. Estimating
accuracy of RNA-Seq and microarrays with proteomics. BMC Genomics 10(1):161.
[292] Mutch DM, Berger A, Mansourian R, Rytz A, and Roberts MA, 2002. The limit fold change
model: a practical approach for selecting differentially expressed genes from microarray data.
BMC Bioinformatics 3(1):17.
[293] Irizarry RA, Warren D, Spencer F, Kim IF, Biswal S, Frank BC, Gabrielson E, Garcia JG, Geoghegan J, Germino G et al., 2005. Multiple-laboratory comparison of microarray platforms. Nature
Methods 2(5):345–350.
180
Bibliography
[294] Seita J, Sahoo D, Rossi DJ, Bhattacharya D, Serwold T, Inlay MA, Ehrlich LI, Fathman JW,
Dill DL, and Weissman IL, 2012. Gene expression commons: an open platform for absolute gene
expression profiling. PLoS ONE 7(7):e40321.
[295] Wang Z, Gerstein M, and Snyder M, 2009. RNA-Seq: a revolutionary tool for transcriptomics.
Nature Reviews Genetics 10(1):57–63.
[296] Labaj PP, Leparc GG, Linggi BE, Markillie LM, Wiley HS, and Kreil DP, 2011. Characterization
and improvement of RNA-Seq precision in quantitative transcript expression profiling. Bioinformatics 27(13):i383–i391.
[297] Birney E, Stamatoyannopoulos JA, Dutta A, Guigó R, Gingeras TR, Margulies EH, Weng Z,
Snyder M, Dermitzakis ET, Thurman RE et al., 2007. Identification and analysis of functional
elements in 1% of the human genome by the ENCODE pilot project. Nature 447(7146):799–816.
[298] Meyer S, Fuchs TJ, Bosserhoff AK, Hofstädter F, Pauer A, Roth V, Buhmann JM, Moll I, Anagnostou N, Brandner JM et al., 2012. A seven-marker signature and clinical outcome in malignant
melanoma: a large-scale tissue-microarray study with two independent patient cohorts. PLoS
ONE 7(6):e38222.
[299] Clarke C, Doolan P, Barron N, Meleady P, O’Sullivan F, Gammell P, Melville M, Leonard M, and
Clynes M, 2011. Large scale microarray profiling and coexpression network analysis of CHO cells
identifies transcriptional modules associated with growth and productivity. Journal of Biotechnology 155(3):350–359.
[300] Lonsdale J, Thomas J, Salvatore M, Phillips R, Lo E, Shad S, Hasz R, Walters G, Garcia F, Young
N et al., 2013. The Genotype-Tissue Expression (GTEx) project. Nature Genetics 45(6):580–585.
[301] Affymetrix, 2005. Technical note: guide to probe logarithmic intensity error (PLIER) estimation. Technical report, Affymetrix Inc. URL http://www.affymetrix.com/support/technical/
technotes/plier_technote.pdf.
[302] Montgomery SB, Sammeth M, Gutierrez-Arcelus M, Lach RP, Ingle C, Nisbett J, Guigo R, and
Dermitzakis ET, 2010. Transcriptome genetics using second generation sequencing in a Caucasian
population. Nature 464(7289):773–777.
[303] Huang RS, Duan S, Bleibel WK, Kistner EO, Zhang W, Clark TA, Chen TX, Schweitzer AC,
Blume JE, Cox NJ et al., 2007. A genome-wide approach to identify genetic variants that contribute
to etoposide-induced cytotoxicity. Proceedings of the National Academy of Sciences 104(23):9758–
9763.
[304] Gibbs RA, Belmont JW, Hardenbol P, Willis TD, Yu F, Yang H, Ch’ang LY, Huang W, Liu B,
Shen Y et al., 2003. The international HapMap project. Nature 426(6968):789–796.
[305] Edgar R, Domrachev M, and Lash AE, 2002. Gene Expression Omnibus: NCBI gene expression
and hybridization array data repository. Nucleic Acids Research 30(1):207–210.
[306] Breiman L, 1993. Classification and regression trees. CRC Press.
[307] Friedman JH, 1991. Multivariate adaptive regression splines. The Annals of Statistics pages 1–67.
[308] Breiman L, 2001. Random forests. Machine Learning 45(1):5–32.
181
Bibliography
[309] Hothorn T, Hornik K, and Zeileis A, 2006. Unbiased recursive partitioning: A conditional inference
framework. Journal of Computational and Graphical Statistics 15(3).
[310] Meinshausen N, 2006. Quantile regression forests. The Journal of Machine Learning Research
7:983–999.
[311] Team RC et al., 2005. R: A language and environment for statistical computing. URL http:
//www.r-project.org.
[312] Allison DB, Cui X, Page GP, and Sabripour M, 2006. Microarray data analysis: from disarray to
consolidation and consensus. Nature Reviews Genetics 7(1):55–65.
[313] Witten D and Tibshirani R, 2007. A comparison of fold-change and the t-statistic for microarray
data analysis. Posted at http://wwwstat.stanford.edu/Btibs/ftp/FCTComparison.pdf .
[314] Kvam VM, Liu P, and Si Y, 2012. A comparison of statistical methods for detecting differentially
expressed genes from RNA-seq data. American Journal of Botany 99(2):248–256.
[315] Jaccard P, 1901. Etude comparative de la distribution florale dans une portion des Alpes et du
Jura. Impr. Corbaz.
[316] Lawrence I and Lin K, 1989. A concordance correlation coefficient to evaluate reproducibility.
Biometrics pages 255–268.
[317] Mazin P, Xiong J, Liu X, Yan Z, Zhang X, Li M, He L, Somel M, Yuan Y, Chen YPP et al., 2013.
Widespread splicing changes in human brain development and aging. Molecular Systems Biology
9(1).
[318] Affymetrix, 2005. Exon Array Whitepaper Collection,“Exon Probeset Annotations and Transcript Cluster Groupings,” rev. Sep. 27, 2005, ver. 1.0.
Inc.
Technical report, Affymetrix
URL http://www.affymetrix.com/support/technical/whitepapers/exon_probeset_
trans_clust_whitepaper.pdf.
[319] Yates T, Okoniewski MJ, and Miller CJ, 2008. X:Map: annotation and visualization of genome
structure for Affymetrix exon array analysis. Nucleic Acids Research 36(suppl 1):D780–D786.
[320] Gaujoux R and Seoighe C, 2013. CellMix: A Comprehensive Toolbox for Gene Expression Deconvolution. Bioinformatics doi:10.1093/bioinformatics/btt351.
[321] Marioni JC, Mason CE, Mane SM, Stephens M, and Gilad Y, 2008.
RNA-Seq: an assess-
ment of technical reproducibility and comparison with gene expression arrays. Genome Research
18(9):1509–1517.
[322] Mortazavi A, Williams BA, McCue K, Schaeffer L, and Wold B, 2008. Mapping and quantifying
mammalian transcriptomes by RNA-Seq. Nature Methods 5(7):621–628.
[323] Schwanhäusser B, Busse D, Li N, Dittmar G, Schuchhardt J, Wolf J, Chen W, and Selbach M,
2011. Global quantification of mammalian gene expression control. Nature 473(7347):337–342.
[324] Srinivasan K, Shiue L, Hayes JD, Centers R, Fitzwater S, Loewen R, Edmondson LR, Bryant J,
Smith M, Rommelfanger C et al., 2005. Detection and measurement of alternative splicing using
splicing-sensitive microarrays. Methods 37(4):345–359.
182
Bibliography
[325] Gardina PJ, Clark TA, Shimada B, Staples MK, Yang Q, Veitch J, Schweitzer A, Awad T, Sugnet
C, Dee S et al., 2006. Alternative splicing and differential gene expression in colon cancer detected
by a whole genome exon array. BMC Genomics 7(1):325.
[326] Bemmo A, Benovoy D, Kwan T, Gaffney D, Jensen R, and Majewski J, 2008. Gene expression
and isoform variation analysis using Affymetrix Exon Arrays. BMC Genomics 9(1):529.
[327] Laajala E, Aittokallio T, Lahesmaa R, Elo LL et al., 2009. Probe-level estimation improves the
detection of differential splicing in Affymetrix exon array studies. Genome Biology 10(7):R77.
[328] Robinson MD and Speed TP, 2007. A comparison of Affymetrix gene expression arrays. BMC
Bioinformatics 8(1):449.
[329] Shen-Orr SS, Tibshirani R, Khatri P, Bodian DL, Staedtler F, Perry NM, Hastie T, Sarwal MM,
Davis MM, and Butte AJ, 2010. Cell type–specific gene expression differences in complex tissues.
Nature Methods 7(4):287–289.
[330] Ozsolak F and Milos PM, 2010. RNA sequencing: advances, challenges and opportunities. Nature
Reviews Genetics 12(2):87–98.
[331] Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF, Soboleva A, Tomashevsky M, and Edgar R, 2007. NCBI GEO: mining tens of millions of expression profilesdatabase
and tools update. Nucleic Acids Research 35(suppl 1):D760–D765.
[332] Oshlack A, Robinson MD, Young MD et al., 2010. From RNA-seq reads to differential expression
results. Genome Biol 11(12):220.
[333] Li H and Durbin R, 2009. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25(14):1754–1760.
[334] Langmead B, Trapnell C, Pop M, Salzberg SL et al., 2009. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10(3):R25.
[335] Li R, Li Y, Kristiansen K, and Wang J, 2008. SOAP: short oligonucleotide alignment program.
[336] Law CW, Chen Y, Shi W, and Smyth GK, 2013. Voom! Precision weights unlock linear model
analysis tools for RNA-seq read counts. Technical report, Technical Report 1 May 2013, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Reseach, Melbourne, Australia.
http://www. statsci. org/smyth/pubs/VoomPreprint. pdf.
[337] Pachter L, 2011.
Models for transcript quantification from RNA-Seq.
arXiv preprint
arXiv:1104.3889 .
[338] Tarazona S, Garcı́a-Alcalde F, Dopazo J, Ferrer A, and Conesa A, 2011. Differential expression
in RNA-seq: a matter of depth. Genome Research 21(12):2213–2223.
[339] Hansen KD, Irizarry RA, and Zhijin W, 2012. Removing technical variability in RNA-seq data
using conditional quantile normalization. Biostatistics 13(2):204–216.
[340] Lee S, Seo CH, Lim B, Yang JO, Oh J, Kim M, Lee S, Lee B, Kang C, and Lee S, 2011. Accurate
quantification of transcriptome from RNA-Seq data by effective length normalization. Nucleic
Acids Research 39(2):e9–e9.
183
Bibliography
[341] Oshlack A and Wakefield MJ, 2009. Transcript length bias in RNA-seq data confounds systems
biology. Biol Direct 4(1):14.
[342] Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L et al., 2011. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biol 12(3):R22.
[343] Hansen KD, Brenner SE, and Dudoit S, 2010. Biases in Illumina transcriptome sequencing caused
by random hexamer priming. Nucleic Acids Research 38(12):e131–e131.
[344] Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R
et al., 2009. The sequence alignment/map format and SAMtools. Bioinformatics 25(16):2078–
2079.
[345] Team R et al., 2010. R: A language and environment for statistical computing. R Foundation for
Statistical Computing Vienna Austria (01/19).
[346] Lander ES and Waterman MS, 1988. Genomic mapping by fingerprinting random clones: a
mathematical analysis. Genomics 2(3):231–239.
[347] Ewing B and Green P, 1998. Base-calling of automated sequencer traces usingPhred. II. error
probabilities. Genome Research 8(3):186–194.
[348] Robinson MD, McCarthy DJ, and Smyth GK, 2010. edgeR: a Bioconductor package for differential
expression analysis of digital gene expression data. Bioinformatics 26(1):139–140.
[349] Bullard J, Purdom E, Hansen K, and Dudoit S, 2010. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics 11(1):94.
[350] Beyer K, Domingo-Sábat M, Lao JI, Carrato C, Ferrer I, and Ariza A, 2008. Identification and
characterization of a new alpha-synuclein isoform and its role in Lewy body diseases. Neurogenetics
9(1):15–23.
[351] Ueda K, Saitoh T, and Mori H, 1994. Tissue-Dependent Alternative Splicing of mRNA for NACP,
the Precursor of Non-Aβ Component of Alzheimer s Disease Amyloid. Biochemical and Biophysical
Research Communications 205(2):1366–1372.
[352] Singleton A, Farrer M, Johnson J, Singleton A, Hague S, Kachergus J, Hulihan M, Peuralinna
T, Dutra A, Nussbaum R et al., 2003. α-Synuclein locus triplication causes Parkinson’s disease.
Science 302(5646):841–841.
[353] Spillantini MG, Schmidt ML, Lee VMY, Trojanowski JQ, Jakes R, and Goedert M, 1997. αSynuclein in Lewy bodies. Nature 388(6645):839–840.
[354] Martı́nez-Navarrete GC, Martı́n-Nieto J, Esteve-Rudd J, Angulo A, Cuenca N et al., 2007. αSynuclein gene expression profile in the retina of vertebrates. Molecular Vision 13:949.
[355] Rodnitzky R, 1997. Visual dysfunction in Parkinson’s disease. Clinical Neuroscience 5(2):102–106.
[356] Feany MB and Bender WW, 2000.
A Drosophila model of Parkinson’s disease.
Nature
404(6776):394–398.
[357] Au KF, Sebastiano V, Afshar PT, Durruthy JD, Lee L, Williams BA, van Bakel H, Schadt EE,
Reijo-Pera RA, Underwood JG et al., 2013. Characterization of the human ESC transcriptome
by hybrid sequencing. Proceedings of the National Academy of Sciences 110(50):E4821–E4830.
184
Bibliography
[358] Mercer TR and Mattick JS, 2013. Understanding the regulatory and transcriptional complexity
of the genome through structure. Genome Research 23(7):1081–1088.
[359] Shapiro E, Biezuner T, and Linnarsson S, 2013. Single-cell sequencing-based technologies will
revolutionize whole-organism science. Nature Reviews Genetics 14(9):618–630.
[360] Pesson M, Eymin B, De La Grange P, Simon B, and Corcos L, 2014. A dedicated microarray for
in-depth analysis of pre-mRNA splicing events: application to the study of genes involved in the
response to targeted anticancer therapies. Molecular Cancer 13(1):9.
[361] Lin S, Coutinho-Mansfield G, Wang D, Pandit S, and Fu XD, 2008. The splicing factor SC35 has
an active role in transcriptional elongation. Nature Structural & Molecular Biology 15(8):819–826.
[362] Alexander RD, Innocente SA, Barrass JD, and Beggs JD, 2010. Splicing-dependent RNA polymerase pausing in yeast. Molecular Cell 40(4):582–593.
[363] Brody Y, Neufeld N, Bieberstein N, Causse SZ, Böhnlein EM, Neugebauer KM, Darzacq X, and
Shav-Tal Y, 2011. The in vivo kinetics of RNA polymerase II elongation during co-transcriptional
splicing. PLoS Biology 9(1):e1000573.
[364] Kim S, Kim H, Fong N, Erickson B, and Bentley DL, 2011. Pre-mRNA splicing is a determinant
of histone H3K36 methylation. Proceedings of the National Academy of Sciences 108(33):13564–
13569.
[365] De Almeida SF, Grosso AR, Koch F, Fenouil R, Carvalho S, Andrade J, Levezinho H, Gut M,
Eick D, Gut I et al., 2011. Splicing enhances recruitment of methyltransferase HYPB/Setd2 and
methylation of histone H3 Lys36. Nature Structural & Molecular Biology 18(9):977–983.
[366] Paten B, Herrero J, Beal K, Fitzgerald S, and Birney E, 2008. Enredo and Pecan: genome-wide
mammalian consistency-based multiple alignment with paralogs. Genome Research 18(11):1814–
1828.
[367] Paten B, Herrero J, Fitzgerald S, Beal K, Flicek P, Holmes I, and Birney E, 2008. Genome-wide
nucleotide-level mammalian ancestor reconstruction. Genome Research 18(11):1829–1843.
[368] Saltzman AL, Pan Q, and Blencowe BJ, 2011. Regulation of alternative splicing by the core
spliceosomal machinery. Genes & Development 25(4):373–384.
185

Title Estimation and analysis of gene expression and

Transcription

Similar documents

www.genome.ou.edu

Barbara Schoenfeld

Hand-Held Single Fiber Fusion Splicer FITEL NINJA NJ001 (PDF

PadPak AutoPad Senior

Quiz 10

FITEL S FUSION SPLICERS

Hello, and thank you for your enquiry about the horse genetics

SEX-LINKED INHERITANCE

ABCA17P - BMC Molecular Biology

Since its completion in 2003….