Functional annotation of high-throughput data: ChIP-seq (FAIRE-seq)
Carl HERRMANN
Université de la Méditerranée & TAGC – Inserm U928

"Peak annotation"
- Does it colocalize with other features? cisTargetX2.0
- Is there a specific DNA motif? PeakMotif
- What biological functions is it related to? GREAT

1. Regulatory motif annotation in ChIP-peaks
What's in my peaks?

Why do we want to find motifs?
- control: is the motif of my chipped TF present in my peaks?
- improvement: build an enhanced motif based on hundreds of binding sites (Sox2, mouse: 30 BS vs. 667 BS)
- discovery:
  - no prior knowledge of the binding TF in nucleosome-free regions (FAIRE)
  - discover co-factors in ChIP-seq data (e.g. p300)

Discovering motifs in sequences
- long-standing problem in bioinformatics
- various motif-finding approaches [Tompa et al., Nature Biotech (2005)]
  - word counting: RSA-tools, Weeder, HexaDiff
  - E-M: MEME
  - Gibbs sampling: MotifSampler, ...
- most were not developed for high-throughput data: are they scalable?
- we have developed an integrated, time-efficient motif analysis workflow for ChIP-seq data [Thomas-Chollier, Herrmann, Defrance, Sand, Thieffry, van Helden, submitted]

peak-motifs: an integrated workflow
- sequence analysis (biases, size, ...)
- motif discovery: word frequencies, positional biases
- motif comparison with databases
- visualization in genome browser

peak-motifs: an integrated workflow
- discovered motif
- comparison with motif databases
- positional profile and enrichment

Time efficiency
- peak-motifs can handle full-sized datasets on a personal computer
- Esrrb dataset in mouse ES cells [Chen et al. 2008]

Case study: p300 in mouse tissues
- p300 ChIP-seq in 4 different mouse embryo (E11.5) tissues: forebrain (2759), midbrain (2786), limb (3839), heart (3597)
- Which transcription factors recruit p300 in the various tissues?

Discovered motifs
[Figure panels: Forebrain, Heart, Midbrain, Limb]
- common motifs?
- tissue-specific motifs?

Motif comparison
- do an all-against-all motif comparison, using various similarity measures
- build a network, identify clusters of similar motifs
[Figure: motif similarity network; nodes labelled by tissue (midbrain, forebrain, limb, heart)]
[Figure: motif clusters annotated as heart, brain, limb, all tissues or several tissues; cluster labels: Gsh2, Vsx2, GATA, Dmbx1, E-box/bHLH, Hox9, Zbtb3, Zscan4c, Sp1, Mef2]

p300 in four embryonic mouse tissues [Blow et al.; Visel et al.]
- heart
- limb
- forebrain
- midbrain
cardiac tissue/cells: ChIP-seq in HL1 cardiac cell line [He et al. (2011)]
- Mef2
- Nkx2.5
- Srf
- Gata
[Figure: motif clusters annotated as heart, brain, limb, all tissues or several tissues; cluster labels: Gsh2, Vsx2, GATA4, GATA, Dmbx1, E-box, Hox9, Zbtb3, Nkx2.5, Zscan4c, Mef2, SRF, Sp1]

What are the motifs?
- tissue-specific / common motifs
- subsets of peaks with particular motif combinations

Are my regions of interest specifically enriched in some features?
2. In-vivo feature annotation
Who else is in my peak?

"Regulatory features"
- regulation is more than just a TF binding to a motif
[Figure: overview of the modENCODE project [Roy et al., Science 2010]]

"Regulatory features"
- exploit large-scale in-vivo datasets (ENCODE, modENCODE)
- for specific regions of the genome (e.g. ChIP peaks), look for specific enrichments in:
  - histone modification patterns (in-vivo datasets)
  - chromatin binding proteins (in-vivo datasets)
  - DNAse hypersensitive sites (in-vivo datasets)
  - transcription factor binding sites (in-vivo datasets)
  - motifs (in-silico predictions)
- refine prediction of regulatory regions across cell types and conditions

Feature extraction and CRM prediction in Drosophila
- 355 in-vivo features
  - 300 modENCODE features (histone modification, chromatin binding proteins, transcription factors, ...)
  - 40 BDTNP features (TFs involved in early embryogenesis)
  - 15 mesodermal features (Furlong lab; mesoderm TFs at various stages)
- 3731 PWMs from various sources (Transfac, JASPAR, PBM, …)
[C. Herrmann, B. Van de Sande, D. Potier, S.
Aerts, in preparation]
[Figure labels: loci or genes; features]

Genome partitioning
1. seed regions around PhastCons peaks; extend to form partition
2. remove coding exons
3. split regions containing insulators
4. merge small regions to obtain regions ≥ 500 bp
Result: ~136,000 regions

Scoring regions
- partition of the non-coding genome
- Feature A (e.g. H3K27ac): continuous binding density
- each region is scored with the average value of the feature (average score)
- ranking of all regions for feature A
[Figure: example regions carrying ranks 1 to 9]

For each feature, a ranking of all regions is computed
[Figure: one column per feature (feature 1 to feature 5, ~4000 features in total); regions ordered by decreasing rank (~136,000 regions)]

Which features rank my regions of interest best?
- List of relevant features (E-score) with highly ranked regions
[Figure: position of ChIP-peaks within the ranking of each feature (feature 1 to feature 5)]

Test cases: Drosophila ChIP-seq datasets
Test case 1: Heat-shock factor [Guertin et al., PLoS Genetics, 2010]
- ChIP-seq dataset for heat-shock factor (HSF) performed in S2 cell lines (late 20-24h embryo)
- 422 ChIP peaks obtained after heat-shock

HSF: output for PWMs
- Motif enrichment confirms that bound regions contain HSF-like motifs

HSF: output for iVFs
Enriched features:
- CBP/p300: transcriptional co-activator
- DNAse hypersensitive sites in S2 and embryo
- H3K27ac in S2

HSF: output for iVFs
- dMi-2: member of a polycomb-related deacetylase complex
- H3K27ac: active chromatin
- suggests a competition / balance between histone acetylation and polycomb-related deacetylase activity

HSF: output for iVFs
- enriched feature in a different cell type (DNAse HS sites in Kc cells)
- 181 highly ranked regions → putative binding events in Kc cells?
[Figure: S2 binding sites vs. Kc binding sites]

HSF: output for iVFs
- HSF ChIP-chip dataset in Kc cells [Gonsalves et al., PLoS One 2011]: 996 binding sites
[Bar chart: true positives among highly ranked regions; bars: all peaks, DHS Kc; bar values shown: 47% and 57%; y-axis 0% to 70%]
[Figure: 422 S2 binding sites; 996 Kc binding sites]
- Highly ranked DHS Kc regions are enriched in true Kc binding events!

Negative control: HSF sites not bound
[Figure: positive set vs. negative control set]
- PWM-based features do not discriminate truly bound from unbound regions

Negative control: HSF sites not bound
[Figure: positive set vs. negative control set]
- in-vivo features related to active chromatin clearly distinguish bound regions

Test cases: Drosophila ChIP-seq datasets
Test case 2: embryonic TFs [Kaplan et al., PLoS Genetics 2011; Zinzen et al., Nature (2009)]
- 40 ChIP-seq datasets (BDTNP and Furlong Lab): early embryonic and mesodermal TFs

Test case 2: BDTNP/Mesoderm
- run cisTargetX to do motif enrichment
- positive control: correct motif identified in 32/40 cases
- prediction: overwhelming enrichment for the zelda motif in early vs. late datasets
- zelda predictions correlate with actual zelda in-vivo binding
[Bar chart: percentage overlap with experimental zelda binding (overlap with Zelda ChIP-peaks), one bar per dataset; y-axis 0% to 100%; dataset labels not recoverable]
[Scatter plot: correlation between E-score and Zelda overlap; x-axis: E-score for zelda motif (0 to 18); y-axis: overlap (0% to 100%); R² = 0.75; labelled point: twi 2h-4h]
T Kaplan, MB Eisen

Summary of part 2
enriched features might help in …
- distinguishing bound from unbound sites [CENTIPEDE, Chromia, ...]
- pointing at subsets of our peak collection (ubiquitous/tissue-specific binding sites, ...)
- predicting condition-dependent binding events
will improve with more specific datasets (histone modifications in particular tissues, ...)

Functions?
3. Functional annotation of ChIP-peaks
How do we go from peaks to functions?
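
One conventional answer to this question is the gene-centred route: map each peak to a nearby gene, then test the resulting gene set for over-represented annotation terms. The sketch below illustrates that idea under stated assumptions: a 5 kb window around the TSS, a plain hypergeometric test (individual tools may use variants of it), and entirely hypothetical toy genes, peaks and annotations. The function names are illustrative and do not come from any of the tools cited in the talk.

```python
from math import comb

WINDOW = 5_000  # assumption: proximal window of 5 kb on either side of the TSS


def genes_near_peaks(peaks, tss, window=WINDOW):
    """Return the genes whose TSS lies within `window` bp of at least one peak."""
    return {
        gene
        for gene, (chrom, pos) in tss.items()
        if any(c == chrom and abs(p - pos) <= window for c, p in peaks)
    }


def enrichment_pvalue(selected, annotated, all_genes):
    """Hypergeometric P(X >= k), where k is the number of selected genes carrying the term."""
    N = len(all_genes)              # genes in the genome
    K = len(annotated & all_genes)  # genes annotated with the term
    n = len(selected)               # genes selected through peaks
    k = len(selected & annotated)   # selected genes carrying the term
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return hits / comb(N, n)


# Toy data (hypothetical): 20 genes on one chromosome, TSSs 100 kb apart.
tss = {f"g{i}": ("chr1", i * 100_000) for i in range(1, 21)}
annotated = {"g1", "g2", "g3", "g4", "g5"}
peaks = [
    ("chr1", 101_000),
    ("chr1", 198_500),
    ("chr1", 304_000),
    ("chr1", 1_002_000),
    ("chr1", 1_550_000),  # distal peak: more than 5 kb from any TSS, so it is discarded
]

selected = genes_near_peaks(peaks, tss)
p_value = enrichment_pvalue(selected, annotated, set(tss))
print(sorted(selected), round(p_value, 4))
```

In this toy example the fifth peak contributes nothing to the test because it is distal to every TSS, which is the kind of loss that restricting to proximal regions implies.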
Peaks → Genes → Functions
- collect sets of genes
- compute over-represented functional annotations
  - Gene Ontology
  - Phenotypic annotations
  - Biological pathways
- typical tools: DAVID [Huang et al., NAR 2009], Babelomics [Medina et al., NAR 2010]

Peaks → Genes → Functions
[Figure: 5 kb windows on either side of the gene]
Drawbacks
- restricting to proximal regions discards a large number of binding events
- the "nearest gene" approach introduces a bias towards genes with large intergenic regions, e.g. "multicellular organism development": 14% of the genes, but 33% of the genome associated

Genes → Regions ← Peaks
Idea: assign functional annotation to genomic regions; use statistics to avoid biases
- assign to each gene a regulatory domain
  - basal (-5kb/+1kb from TSS)
  - extended (up to nearest basal region; max 1Mb)
- each domain is annotated to the functional terms of the corresponding gene → "Functional domains"
"GREAT improves functional interpretation of cis-regulatory regions" McLean et al., Nat. Biotech. (2010)

Genes → Regions ← Peaks
[Figure: genome regions annotated to term A and term B]
- Given that 60% of the genome is annotated to A, would I randomly expect 3 or more peaks to fall into region A? p > 0.5
- Given that 15% of the genome is annotated to B, would I randomly expect 3 or more peaks to fall into region B? p = 0.07

GREAT vs. proximal peaks
p300 limb
- GREAT: best GO term: Embryonic limb morphogenesis (P-val 1E-27); MGI expression: TS19 limb (P-val 7E-49)
- Proximal 2kb peaks: best GO term: Skeletal system development (P-val 4E-06); MGI expression: TS19 limb (P-val 3E-05)
p300 forebrain
- GREAT: best GO term: CNS development (P-val 8E-36); MGI expression: TS17 forebrain (P-val 6E-41)
- Proximal 2kb peaks: best GO term: Forebrain development (P-val 2E-04); MGI expression: TS22 forebrain (P-val 3E-07)
p300 midbrain
- GREAT: best GO term: CNS development (P-val 1E-12); MGI expression: TS15 CNS (P-val 1E-14)
- Proximal 2kb peaks: best GO term: none; MGI expression: none
With GREAT:
- more specific terms with higher significance
- more peaks/genes taken into account

Summary and conclusion
"Annotation of ChIP-peaks" helps …
- controlling the consistency of the dataset [motifs; features; functions]
- putting the results in a broader biological perspective [condition-specific in-vivo features; functions]
- distinguishing subsets of binding events [co-motifs; features]

We need ... HTS-era specific tools!!
- … because the amount of data is different: motif discovery challenges
- … because the nature of the data is different: functional annotation of peaks vs. genes; specific biases in RNA-seq functional annotation
- … because the variety of available data is different: epigenomic landscape

"Garbage in, garbage out"
- Original GEO peaks (Sox2): 4000 peaks
- Klf4 co-factor
- MACS + Peaksplitter (Sox2): 8000 peaks

URLs
- PeakMotif (M. Thomas-Chollier, M. Defrance, O. Sand, D. Thieffry, J. van Helden): http://rsat.scmbb.ulb.ac.be/rsat/
- cisTargetX 2.0 (B. Van de Sande, D. Potier, S. Aerts): http://med.kuleuven.be/lcb/cisTargetX2
- GREAT: http://great.stanford.edu/
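
The region-based question asked in part 3 ("given that 60% of the genome is annotated to A, would I randomly expect 3 or more peaks to fall into region A?") is a binomial tail probability. The sketch below is a minimal illustration of that calculation, not the GREAT implementation. The slides do not state the total number of peaks in the example, so `N_PEAKS = 7` is an assumption, chosen here only because it gives values in line with the p > 0.5 and p = 0.07 shown for terms A and B.

```python
from math import comb


def binomial_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): at least k of n peaks land in the annotated fraction p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


N_PEAKS = 7   # assumption: total number of peaks, not given on the slide
HITS = 3      # "3 or more peaks" fall into the regions annotated to the term

p_term_a = binomial_sf(HITS, N_PEAKS, 0.60)  # 60% of the genome annotated to term A
p_term_b = binomial_sf(HITS, N_PEAKS, 0.15)  # 15% of the genome annotated to term B

print(f"term A: p = {p_term_a:.3f}")  # about 0.90, i.e. expected by chance
print(f"term B: p = {p_term_b:.3f}")  # about 0.07
```

The same three hits are unremarkable for a term covering most of the genome and much less expected for a term covering a small fraction of it, which is how the region-based test corrects the bias towards genes with large regulatory domains.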