Functional annotation of
high-throughput data
(ChIP-seq, FAIRE-seq)
Carl HERRMANN
Université de la Méditerranée & TAGC – Inserm U928
"Peak annotation"
Does it colocalize
with other features ?
cisTargetX2.0
Is there a specific
DNA motif ?
PeakMotif
What biological
functions is it
related to ?
GREAT
1. Regulatory motif annotation
in ChIP-peaks
What's in my peaks ?
Why do we want to find motifs ?

control : is the motif of my ChIPed TF present in my peaks ?

improvement : build an enhanced motif based on hundreds of
binding sites
[Figure: Sox2 motif (mouse) built from 30 binding sites vs. 667 binding sites]
discovery :

no a priori knowledge of the bound TF in nucleosome-free regions (FAIRE)

discover co-factors in ChIP-seq data (e.g. p300)
Discovering motifs in sequences

long standing problem in bioinformatics

various motif-finding approaches [Tompa et al., Nature Biotech (2005)]


word-counting: RSA-tools, Weeder, HexaDiff

E-M: MEME

Gibbs sampling : MotifSampler,...
most were not developed for high-throughput data: are they scalable ?
we have developed an integrated, time-efficient
motif analysis workflow for ChIP-seq data
[Thomas-Chollier, Herrmann, Defrance, Sand, Thieffry, van Helden, submitted]
peak-motif: an integrated work-flow

sequence analysis
(biases, size,...)

motif discovery

word-frequencies

positional biases

motif comparison with
databases

visualization in genome
browser
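The word-frequency step can be illustrated with a minimal sketch. It is not the peak-motifs implementation: the function names, the pseudo-count and the simple frequency ratio used as a score are all assumptions made for illustration. Words are counted on both strands and compared between peak sequences and a background set.

```python
from collections import Counter

def reverse_complement(seq):
    """Return the reverse complement of a DNA word."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq))

def count_words(sequences, k=6):
    """Count k-mers, collapsing each word with its reverse complement."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if set(word) <= set("ACGT"):  # skip words containing N etc.
                counts[min(word, reverse_complement(word))] += 1
    return counts

def enriched_words(peaks, background, k=6, pseudo=1.0):
    """Rank words by relative frequency in peaks vs. background (assumed score)."""
    fg, bg = count_words(peaks, k), count_words(background, k)
    fg_total, bg_total = sum(fg.values()), sum(bg.values())
    ratios = {}
    for word, c in fg.items():
        f = (c + pseudo) / (fg_total + pseudo)
        b = (bg.get(word, 0) + pseudo) / (bg_total + pseudo)
        ratios[word] = f / b
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)

# Toy sequences (hypothetical): the word CATTGT is planted in every "peak".
peaks = ["TTCATTGTAA", "GGCATTGTCC", "AACATTGTGG"]
background = ["ACGTACGTACGT", "GGGGCCCCAAAA"]
top_word, top_ratio = enriched_words(peaks, background)[0]
```

A real workflow would replace the ratio with a proper over-representation statistic and a background model; the sketch only shows the counting logic.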
peak-motif: an integrated work-flow
[Figure: discovered motif; comparison with motif databases; positional profile and enrichment]
Time efficiency
peak-motifs can handle
full sized datasets
on a personal computer
Esrrb dataset in mouse ES
cells [Chen et al., 2008]
Case study: p300 in mouse tissues

p300 ChIP-seq in 4 different mouse
embryo (E11.5) tissues

forebrain (2759)

midbrain (2786)

limb (3839)

heart (3597)
Which transcription factors
recruit p300 in the various tissues ?
Discovered motifs
[Figure: motifs discovered in forebrain, midbrain, limb and heart]
common motifs ?
tissue specific motifs ?
Motif comparison

do an all-against-all motif comparison, using various similarity
measures

build network, identify clusters of similar motifs
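The two steps above can be sketched as follows. This is a simplified illustration, not the comparison method used in the study: it assumes motifs of equal width, uses a single similarity measure (Pearson correlation over all matrix cells), ignores offsets and reverse complements, and treats connected components of the thresholded network as clusters. The motif names and the 0.8 threshold are hypothetical.

```python
import math
from itertools import combinations

def matrix_correlation(a, b):
    """Pearson correlation between two equal-width motifs (lists of [A, C, G, T] columns)."""
    xs = [v for col in a for v in col]
    ys = [v for col in b for v in col]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def cluster_motifs(motifs, threshold=0.8):
    """Link motifs whose similarity passes the threshold; return connected components."""
    parent = {name: name for name in motifs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(motifs, 2):  # all-against-all comparison
        if matrix_correlation(motifs[a], motifs[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in motifs:
        clusters.setdefault(find(name), set()).add(name)
    return list(clusters.values())

# Hypothetical 3-column motifs discovered in three tissues.
motifs = {
    "heart_1": [[0.9, 0.03, 0.04, 0.03], [0.05, 0.05, 0.85, 0.05], [0.8, 0.1, 0.05, 0.05]],
    "limb_1": [[0.85, 0.05, 0.05, 0.05], [0.1, 0.05, 0.8, 0.05], [0.9, 0.04, 0.03, 0.03]],
    "brain_1": [[0.05, 0.85, 0.05, 0.05], [0.05, 0.05, 0.05, 0.85], [0.05, 0.05, 0.85, 0.05]],
}
clusters = cluster_motifs(motifs)
```

A cluster containing motifs from several tissues would suggest a common motif, while a singleton cluster would suggest a tissue-specific one.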
[Figure: network of similar motifs, with clusters annotated by tissue of origin (forebrain, midbrain, limb, heart, brain, several tissues, all tissues). Motif labels: Gsh2, Vsx2, GATA, Dmbx1, E-box/bHLH, Hox9, Zbtb3, Zscan4c, Sp1, Mef2]
p300 in four embryonic
mouse tissues
- heart
- limb
- forebrain
- midbrain
[Blow et al.; Visel et al.]
cardiac
tissue/cells
ChIP-seq in HL1 cardiac cell line
- Mef2
- Nkx2.5
- Srf
- Gata
[He et al. (2011)]
[Figure: motif network. Tissue labels: heart, brain, limb, midbrain, forebrain, several tissues, all tissues. Motif labels: Gsh2, Vsx2, GATA4, GATA, Dmbx1, E-box, Hox9, Zbtb3, Nkx2.5, Zscan4c, Mef2, SRF, Sp1]
What are the motifs ?
tissue specific/
common motifs
subsets of peaks
with particular
motif combinations
Are my regions of interest
specifically enriched in some features ?
2. in-vivo feature annotation
Who else is in my peak ?
"Regulatory features"

regulation is more than just a TF binding to a motif
Overview of modEncode project [Roy et al., Science 2010]
"Regulatory features"

exploit large scale in-vivo datasets (ENCODE, modEncode)

for specific regions of the genome (e.g. ChIP peaks), look for
specific enrichments in

in vivo datasets:

histone modification patterns

chromatin binding proteins

DNase hypersensitive sites

transcription factor binding sites

in silico predictions:

motifs
refine prediction of regulatory regions across cell types
and conditions

feature extraction and CRM prediction in Drosophila

355 in-vivo features


300 modEncode features (histone modification, chromatin
binding proteins, transcription factors,...)

40 BDTNP features (TFs involved in early embryogenesis)

15 mesodermal features (Furlong lab ; mesoderm TF at various
stages)
3731 PWMs from various sources (Transfac, JASPAR, PBM, …)
[C. Herrmann, B. Van de Sande, D. Potier, S. Aerts, in preparation]
[Figure labels: loci or genes; features]
Genome partitioning
1. seed regions around PhastCons peaks; extend to form partition
2. remove coding exons
3. split regions containing insulators
4. merge small regions to obtain regions ≥ 500bp
~136,000 regions
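Step 4 can be illustrated with a small sketch. The greedy left-to-right strategy, the function name and the half-open coordinates are assumptions; the slides do not say how the merging is actually done. Only directly adjacent regions are merged, so that gaps left by removed exons are never bridged.

```python
def merge_small_regions(regions, min_len=500):
    """Greedily merge adjacent regions until each is at least min_len long.

    regions: sorted, non-overlapping (start, end) half-open intervals
    on one chromosome.
    """
    merged = []
    current = None
    for start, end in regions:
        if current is not None and start == current[1]:
            current[1] = end                   # extend into the adjacent region
        else:
            if current is not None:
                merged.append(tuple(current))  # short but isolated: keep as is
            current = [start, end]
        if current[1] - current[0] >= min_len:
            merged.append(tuple(current))
            current = None
    if current is not None:
        if merged and merged[-1][1] == current[0]:
            # trailing short region: attach it to the previous region
            merged[-1] = (merged[-1][0], current[1])
        else:
            merged.append(tuple(current))
    return merged

# Toy partition (hypothetical coordinates).
partition = [(0, 200), (200, 600), (600, 1200), (1200, 1300)]
result = merge_small_regions(partition)
```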
Scoring regions
partition of non-coding genome
Feature A (e.g. H3K27ac): continuous binding density
each region is scored with the average value
ranking of all regions for feature A
[Figure: regions of the partition numbered 1 to 9 by their rank for feature A; the same ranking is built for features 1 to 5]
For each feature, a ranking of all regions is computed
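As a minimal sketch of this scoring and ranking step, assuming the feature is available as a per-base signal stored in a list and regions are half-open (start, end) intervals (both are assumptions made for illustration):

```python
def score_regions(signal, regions):
    """Average a per-base signal over each (start, end) half-open region."""
    return [sum(signal[s:e]) / (e - s) for s, e in regions]

def rank_regions(scores):
    """Return the rank of each region (1 = highest average score)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

# Toy signal for one feature over a 10 bp "genome" split into 3 regions.
signal = [0, 0, 4, 4, 1, 1, 1, 1, 9, 9]
regions = [(0, 4), (4, 8), (8, 10)]
scores = score_regions(signal, regions)
ranks = rank_regions(scores)
```

Repeating this for every feature gives one ranking of all regions per feature.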
[Figure: matrix of rankings, ~4000 features × ~136,000 regions, regions sorted by decreasing rank; the positions of the ChIP-peaks are marked in the ranking of each feature (features 1 to 5 shown)]
Which features rank my regions of interest best ?
→ List of relevant features (E-score) with highly ranked regions
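The question "which features rank my regions of interest best ?" can be sketched in code. The slides do not define the E-score, so the score below is a hypothetical stand-in and not the cisTargetX definition: each feature gets a mean-rank score for the regions of interest, and the scores are then standardized across features.

```python
import statistics

def mean_rank_score(ranks, roi):
    """1.0 if the regions of interest sit at the top of the ranking, 0.0 at the bottom.

    ranks: dict region_id -> rank (1 = best) for one feature.
    """
    n = len(ranks)
    return 1.0 - sum((ranks[r] - 1) / (n - 1) for r in roi) / len(roi)

def feature_enrichment(rankings, roi):
    """Standardize the mean-rank score across features (assumed enrichment score)."""
    raw = {f: mean_rank_score(r, roi) for f, r in rankings.items()}
    values = list(raw.values())
    mu, sd = statistics.mean(values), statistics.pstdev(values)
    z = {f: (v - mu) / sd if sd else 0.0 for f, v in raw.items()}
    return sorted(z.items(), key=lambda kv: kv[1], reverse=True)

# Toy rankings of six regions for three features (hypothetical values).
rankings = {
    "H3K27ac": {"r1": 1, "r2": 2, "r3": 3, "r4": 4, "r5": 5, "r6": 6},
    "featB": {"r1": 6, "r2": 5, "r3": 4, "r4": 3, "r5": 2, "r6": 1},
    "featC": {"r1": 3, "r2": 4, "r3": 1, "r4": 6, "r5": 2, "r6": 5},
}
regions_of_interest = ["r1", "r2"]  # e.g. regions overlapping ChIP-peaks
best = feature_enrichment(rankings, regions_of_interest)
```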
Test cases: Drosophila ChIP-seq datasets

Test case 1 : Heat-shock factor [Guertin et al., PLoS Genetics, 2010]

ChIP-seq dataset for heat-shock factor (HSF)

performed in S2 cell lines (late 20-24h embryo)

422 ChIP peaks obtained after heat-shock
HSF : output for PWMs
Motif enrichment confirms that bound regions contain HSF-like motifs
HSF : output for iVFs
Enriched features

CBP/p300 : transcriptional
co-activator

DNase hypersensitive sites in
S2 and embryo

H3K27ac in S2
HSF : output for iVFs

dMi-2 : member of a
polycomb related
deacetylase complex

H3K27ac : active chromatin
suggests a competition /
balance between histone
acetylation and polycomb
related deacetylase activity
HSF : output for iVFs

enriched feature in a
different cell type
(DNase HS sites in Kc cells)

181 highly ranked regions
→ putative binding events in
Kc cells ?
[Figure labels: S2 binding sites; Kc binding sites]
HSF : output for iVFs

HSF ChIP-chip dataset in Kc
cells
[Gonsalves et al., PLoS One 2011]

True positives among highly ranked regions
[Figure labels: 422 S2 binding sites; 996 Kc binding sites]
[Figure: bar chart of the percentage of true positives (axis 0% to 70%) for "all peaks" and "DHS Kc"; bar labels 57% and 47%]
Highly ranked DHS Kc regions
are enriched in true Kc binding
events !
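The percentage of true positives on this slide is an overlap fraction between two sets of intervals. A minimal sketch, assuming (chromosome, start, end) half-open intervals and a simple all-against-all test (fine for toy data, too slow for full datasets without an index); all coordinates below are hypothetical:

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def fraction_overlapping(regions, binding_sites):
    """Fraction of regions that overlap at least one binding site."""
    hits = sum(any(overlaps(r, s) for s in binding_sites) for r in regions)
    return hits / len(regions)

# Toy example: highly ranked regions vs. experimentally determined sites.
ranked_regions = [("2L", 100, 600), ("2L", 1000, 1500), ("3R", 100, 600), ("X", 0, 500)]
kc_sites = [("2L", 550, 650), ("3R", 600, 700), ("X", 400, 450)]
true_positive_rate = fraction_overlapping(ranked_regions, kc_sites)
```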
Negative control : HSF sites not bound
Positive set
Negative control set
PWM based features do not
discriminate truly bound from
unbound regions
Negative control : HSF sites not bound
Positive set
Negative control set
in-vivo features related to active
chromatin clearly distinguish bound
regions
Test cases: Drosophila ChIP-seq datasets
Test case 2 : embryonic TFs
[Kaplan et al., PLoS Genetics 2011 ; Zinzen et al. Nature (2009)]

40 ChIP-seq datasets (BDTNP and Furlong Lab)

early embryonic and mesodermal TFs
Test Case 2 : BDTNP/Mesoderm
run cisTargetX to do
motif enrichment

positive control
correct motif identified in
32/40 cases

prediction
overwhelming enrichment
for the Zelda motif in early vs.
late datasets
Zelda predictions correlate with
actual Zelda in-vivo binding
Percentage overlap with experimental Zelda binding
[Figure: bar chart, one bar per ChIP dataset (dataset labels not recoverable); y axis: overlap with Zelda ChIP-peaks, 0% to 100%]
Correlation E-score / Zelda overlap
[Figure: scatter plot; x axis: E-score for Zelda motif (0 to 18); y axis: overlap with Zelda ChIP-peaks (0% to 100%); R² = 0.75; labeled point: twi 2h-4h]
T Kaplan, MB Eisen
Summary of part 2


enriched features might help in …

distinguishing bound from unbound binding events
[CENTIPEDE, Chromia,...]

pointing at subsets of our peak collection (ubiquitous/tissue
specific binding sites,...)

predicting condition dependent binding events
will improve with more specific datasets (histone modifications
in particular tissues,...)
Functions ?
3. Functional annotation
of ChIP-peaks
how do we go from peaks to functions ?
Peaks → Genes → Functions

collect sets of genes

compute over-represented functional annotations


Gene Ontology

Phenotypic annotations

Biological Pathways
Typical tools

DAVID [Huang et al., NAR 2009]

Babelomics [Medina et al., NAR 2010]
Peaks → Genes → Functions
[Figure: 5kb window on either side of the gene]
Drawbacks

restricting to proximal regions discards a large number of binding
events

"nearest gene" approach introduces bias towards genes with large
intergenic regions
e.g. "multicellular organism development": 14% of the genes, but
33% of the genome is associated with it
Genes → Regions ← Peaks



Idea :

assign functional annotation to genomic regions

use statistics to avoid biases
assign to each gene a regulatory domain

basal (-5kb/+1kb from TSS)

extended (up to nearest
basal region ; max 1Mb)
each domain is annotated to the functional terms of
the corresponding gene
→ "Functional domains"
"GREAT improves functional interpretation of cis-regulatory regions"
McLean et al. Nat. Biotech. (2010)
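The regulatory domain rule above can be sketched as follows. This is an illustrative reading of the slide, not GREAT's code. It assumes genes on a single chromosome, measures the 1Mb limit from the TSS, and takes the "nearest basal region" from the sorted order of basal domains. The gene names and coordinates are hypothetical.

```python
def basal_domain(tss, strand, upstream=5000, downstream=1000):
    """Basal domain: -5kb/+1kb around the TSS, oriented by strand."""
    if strand == "+":
        return max(0, tss - upstream), tss + downstream
    return max(0, tss - downstream), tss + upstream

def regulatory_domains(genes, chrom_len, max_ext=1_000_000):
    """genes: list of (name, tss, strand) on one chromosome.

    Each basal domain is extended in both directions up to the neighbouring
    basal domain, but no further than max_ext from the TSS.
    """
    basal = sorted(basal_domain(tss, strand) + (name, tss)
                   for name, tss, strand in genes)
    domains = {}
    for i, (b_start, b_end, name, tss) in enumerate(basal):
        left_limit = basal[i - 1][1] if i > 0 else 0
        right_limit = basal[i + 1][0] if i + 1 < len(basal) else chrom_len
        # never shrink below the basal domain, even if neighbours overlap it
        start = min(b_start, max(left_limit, tss - max_ext))
        end = max(b_end, min(right_limit, tss + max_ext))
        domains[name] = (start, end)
    return domains

# Hypothetical genes on a 5 Mb chromosome.
genes = [("geneA", 10_000, "+"), ("geneB", 50_000, "-"), ("geneC", 3_000_000, "+")]
domains = regulatory_domains(genes, chrom_len=5_000_000)
```

Each domain would then inherit the functional terms of its gene, which turns the gene annotation into an annotation of genomic regions.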
Genes → Regions ← Peaks
term A
Given that 60% of the genome is annotated
to A, would I randomly expect 3 or more
peaks to fall into region A ?
p > 0.5
term B
Given that 15% of the genome is annotated
to B, would I randomly expect 3 or more
peaks to fall into region B ?
p = 0.07
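The question asked on this slide is a binomial tail probability: if each peak falls into the domains of a term with probability equal to the fraction of the genome annotated to that term, how likely are 3 or more hits? A minimal sketch follows. The slide does not state the total number of peaks, so the value of 7 used below is an assumption for illustration.

```python
from math import comb

def binomial_tail(n_peaks, k_hits, fraction):
    """P(X >= k_hits) for X ~ Binomial(n_peaks, fraction)."""
    return sum(
        comb(n_peaks, k) * fraction**k * (1 - fraction) ** (n_peaks - k)
        for k in range(k_hits, n_peaks + 1)
    )

n_peaks = 7  # assumed total number of peaks, not given in the slide
p_term_a = binomial_tail(n_peaks, 3, 0.60)  # 60% of the genome annotated to A
p_term_b = binomial_tail(n_peaks, 3, 0.15)  # 15% of the genome annotated to B
```

The same 3 hits are unremarkable for a term covering most of the genome and much less expected for a term covering a small fraction of it, which is how the test corrects for the size bias of the "nearest gene" approach.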
GREAT vs. proximal peaks

GREAT
Dataset          Best GO term                    P-val    MGI expression    P-val
p300 limb        Embryonic limb morphogenesis    1E-27    TS19 limb         7E-49
p300 forebrain   CNS development                 8E-36    TS17 forebrain    6E-41
p300 midbrain    CNS development                 1E-12    TS15 CNS          1E-14

Proximal 2kb peaks
Dataset          Best GO term                    P-val    MGI expression    P-val
p300 limb        Skeletal system development     4E-06    TS19 limb         3E-05
p300 forebrain   Forebrain development           2E-04    TS22 forebrain    3E-07
p300 midbrain    none                                     none

more specific terms with higher significance

more peaks/genes taken into account
Summary and conclusion

"Annotation of ChIP-peaks" helps …

controlling the consistency of the dataset [motifs ; features ;
functions]

putting the results in a broader biological perspective
[condition specific in-vivo features ; functions]

distinguishing subsets of binding events [co-motifs ; features]
we need ...
… HTS-era specific tools !!

… because the amount of data is different



motif discovery challenges
… because the nature of the data is different

functional annotation of peaks vs. genes

specific biases in RNA-seq functional annotation
… because the variety of available data is different

epigenomic landscape
"Garbage in, garbage out"
Original GEO peaks (Sox2)
4000 peaks
Klf4 co-factor
MACS + Peaksplitter (Sox2)
8000 peaks
URLs

PeakMotif
M. Thomas-Chollier, M. Defrance, O. Sand, D. Thieffry, J. van Helden
http://rsat.scmbb.ulb.ac.be/rsat/

cisTargetX 2.0
B. Van de Sande, D. Potier, S. Aerts
http://med.kuleuven.be/lcb/cisTargetX2

GREAT
http://great.stanford.edu/