Interoperable metadata leads to integrative analyses
Biocuration 2015, April 25, 2015
Mike Cherry
Department of Genetics, Stanford University, School of Medicine

Information curated by SGD
• Biochemical pathways
• Cellular pathways
• Chromosomal feature annotation
• Full-text papers and abstracts
• Functional genomics
• Gene Ontology
• Gene expression
• Gene regulation
• Genetic interactions
• Mutant phenotypes
• Post-translational modification
• Protein complex
• Protein domains
• Protein interactions
• Sequence
• Strain differences
Balakrishnan, poster #96
Engel, poster #75

[Figure labels: Genes; GO term; genes by publication; genes by the number of datasets in which their expression profiles are highly correlated; genes by interaction]

Role of a Genomic Resource
[Diagram labels: Experimental data; Genomic Resource; Publications; Computational analyses; Data generation; Literature curation; Data wrangling; Data integration; Hypothesis generation; Additional integration; Used in analysis]

Google Trends
http://google.com/trends

ENCODE Assays and Elements

Questions we want to answer
1. ChIP-seq results on K562 targeting RNA-binding proteins
2. Which fastq files were used to create this integrated analysis file
3. Which version of bwa was used to process this file
4. Show experiments that have a TF bound near my gene of interest
5. Find all RNA-seq experiments completed on liver tissue or primary cells from liver

An ontology is a set of words ... with different types of relationships to each other.
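To make "words with typed relationships" concrete, here is a minimal Python sketch, not the Gene Ontology's own tooling. It stores a few terms as (child, relationship, parent) edges and derives the part_of relationships that follow from them. The edge list and the function name infer_part_of are illustrative assumptions. The composition rule used is the usual one for these relations: a chain of is_a and part_of edges with at least one part_of edge is treated as part_of.

```python
from collections import defaultdict

# Illustrative edges, assumed for this sketch: (child, relationship, parent).
EDGES = [
    ("mitochondrion", "part_of", "cell"),
    ("nucleus", "part_of", "cell"),
    ("mitochondrial chromosome", "is_a", "chromosome"),
    ("mitochondrial chromosome", "part_of", "mitochondrion"),
]


def infer_part_of(edges):
    """Return every (term, ancestor) pair connected by a chain of is_a and
    part_of edges that contains at least one part_of edge."""
    outgoing = defaultdict(list)
    for child, rel, parent in edges:
        outgoing[child].append((rel, parent))

    inferred = set()
    for start in list(outgoing):
        # Each stack entry: (current term, has the path used part_of yet?)
        stack = [(start, False)]
        seen = set()
        while stack:
            term, has_part = stack.pop()
            for rel, parent in outgoing.get(term, []):
                state = (parent, has_part or rel == "part_of")
                if state in seen:
                    continue
                seen.add(state)
                if state[1]:
                    inferred.add((start, parent))
                stack.append(state)
    return inferred


if __name__ == "__main__":
    for child, parent in sorted(infer_part_of(EDGES)):
        print(f"{child} part_of {parent}")
```

With these edges the sketch derives that the mitochondrial chromosome is part_of the cell, although no edge states it directly. It does not derive that it is part_of chromosome, because that path contains only is_a. This is why every asserted relationship has to be true: a single wrong edge propagates into every inference built on it.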
All relationships must be true because inferences can be made based on these relationships
[Diagram labels: Parent term; Child term. Terms: cell, mitochondrion, nucleus, chromosome, mitochondrial chromosome. Relationships: part_of, is_a; some part_of edges are marked X]
http://www.geneontology.org/GO.ontology.relations.shtml

Impact of using ontologies: Common ontologies = instant interoperability
[Diagram labels: circulatory system; mesoderm; heart; cardiac muscle cell; myoblast. Relationships: part_of, develops_from. Legend: Explicit relationships; Inferred relationships]
http://uberon.org/
http://cellontology.org/

Project integration using ontologies
Malladi, talk #26
[Diagram labels: Other projects; DCC; ENCODE portal (DCC)]
• OBI (for assays): http://obi-ontology.org
• EFO (for cell lines): http://www.ebi.ac.uk/efo/
• UBERON (for tissues): http://uberon.org/
• CL (for primary cells): http://cellontology.org/

Find common biosamples between ENCODE2 and REMC
356 terms
314 terms
http://genome.ucsc.edu/ENCODE/cellTypes.html
GEO characteristics: common_name, tissue_type, cell_type, lines
Labs were internally consistent
After curating biosample identifiers there are 33 in common between ENCODE2 & REMC:
• 20 UBERON
• 10 CL
• 2 EFO
• 1 NTR
217 terms
154 terms

ENCODE Project Portal
https://encodeproject.org
Davidson, poster #77

Ontology-driven searches
http://www.encodeproject.org/
Query for estradiol treated human samples
Track hubs displayed on UCSC browser

Track Hubs on the Fly
[Diagram labels: Browser pulls files from DB; Track-hub displayed; DB constructs track-hub files; User finds data to view]
Thousands of experiments (multiple files each) available from ENCODE Portal.
Primarily previous ENCODE

Construct URLs to Search ENCODE data
    curl -H 'Accept: application/json' -X GET "https://www.encodeproject.org/search/?type=experiment&assay_term_name=RNA-seq&organ_slims=lung&replicates.library.biosample.life_stage=fetal"

ENCODE standard analysis pipelines

Submission and Processing of ENCODE data
[Diagram labels: Labs; DNAnexus; Amazon S3; Metadata DB; Portal]
Chan, poster #73

TF ChIP-seq: a relatively complicated processing pipeline

How would you deploy this pipeline
• With the same versions of all the software components;
• The same parameters;
• With access to all 40+TB of ENCODE data;
To integrate or compare your results with ENCODE?

State of the Art in Pipeline Metadata & Distribution
• ENCODE at UCSC’s Genome Browser
• Materials and Methods
• Galaxy/Globus (Galaxy on the Cloud)
• Seven Bridges Genomics
• tarball of scripts
• DNAnexus

Deploy Analysis Pipelines to the Cloud
Replicable
Provenance
On the web to re-run.
Accessioned inputs
Pipeline metadata in database
Ease of use
Drop in your files
Scalable to 1000’s of runs
will be populated from the metadata database.
Re-run on the web for a few datasets.
Either way it’s exactly the same pipeline.

Input Files
Outputs plumbed to inputs
Output Files
How much did that run cost?
Software used in pipelines

How to find a region of interest

Search by Region of Interest
Find ENCODE datasets overlapping a region of interest by its genomic coordinates, or rs ID (SNP), or gene name, etc.
Figure 1 from Boyle et al., Genome Res.
22:1790-1797

Acknowledgements

SGD:
• Biocuration Scientists: Stacia Engel, Rama Balakrishnan, Maria Costanzo, Janos Demeter, Rob Nash, Marek Skrzypek, Edith Wong, Sage Hellerstedt, Kyla Dalusag
• Software Developers: Ben Hitz, Kelley Paskov, Travis Sheppard, Shuai Weng
• Systems Admins: Stuart Miyasato, Matt Simison
• Project Manager: Gail Binkley

ENCODE:
Stanford:
• Data Wranglers: Esther Chan, Jean Davidson, Venkat Malladi, Cricket Sloan, J. Seth Strattan, Marcus Ho
• Software Developers: Ben Hitz, Laurence Rowe, Nikhil Podduturi, Forrest Tanaka, Tim Dreszer
• Project Manager: Eurie Hong
UCSC:
• Jim Kent, Brian Lee
• previous members of the UCSC ENCODE DCC

ClinGen:
Stanford: Carlos Bustamante, Tam Snedden, Selina Dwight
Baylor: Sharon Plon, Aleks Milosavljevic, Ronak Patel, Xin Feng
Harvard: Heidi Rehm
UNC: Jonathan Berg

Members of the many working groups

SGD, GOC, ENCODE & ClinGen

Talk
Venkat Malladi: #26, Sunday 10:40-11:00. Ontology application and use at the ENCODE DCC

Posters ENCODE
Esther Chan: #73. Towards reproducible computational analyses: the ENCODE approach
Cricket Sloan: #74. Tracking data provenance to compare, reproduce, and interpret ENCODE results
Jean Davidson: #77. The role of the ENCODE Data Coordination Center

Posters SGD
Stacia Engel: #75. The war on disease: Homology curation at SGD to promote budding yeast as a model for eukaryotic biology
Rama Balakrishnan: #96. Collection and curation of whole genome studies of budding yeast at the Saccharomyces Genome Database (SGD)

Workshops
Workshop 2: Data Visualization & Annotation
Chairs: Rama Balakrishnan (SGD) and Monica Munoz-Torres
Workshop 3: Biocuration in big data to knowledge: new strategy, process & framework
Participant: Mike Cherry
Workshop 4: International collaboration in biocuration: projects & data/expertise sharing
Participant: Rama Balakrishnan (SGD)
Workshop 5: Genotype-2-phenotype: Curation challenges in translational & reverse translational informatics.
Participant: Stacia Engel (SGD)
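
Returning to the "Construct URLs to Search ENCODE data" slide: the same search URL can be assembled in code, which avoids the quoting and whitespace slips that are easy to make when typing a long query string by hand. The sketch below uses only the Python standard library. The endpoint, field names, and values are the ones on the slide; the helper names build_search_url and build_request are hypothetical. It builds the request object but deliberately does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint as shown on the slide.
SEARCH_ENDPOINT = "https://www.encodeproject.org/search/"


def build_search_url(filters):
    """Join the endpoint and a list of (field, value) filters into one URL."""
    return SEARCH_ENDPOINT + "?" + urlencode(filters)


def build_request(url):
    """Prepare a GET request that asks for JSON, as the curl example does."""
    return Request(url, headers={"Accept": "application/json"}, method="GET")


# Filters copied from the slide: fetal lung RNA-seq experiments.
url = build_search_url([
    ("type", "experiment"),
    ("assay_term_name", "RNA-seq"),
    ("organ_slims", "lung"),
    ("replicates.library.biosample.life_stage", "fetal"),
])
request = build_request(url)

if __name__ == "__main__":
    print(url)
```

Passing the prepared request to urllib.request.urlopen would perform the search. Because the filters are a plain list of pairs, changing the organ or assay means editing one tuple and leaving the URL string alone.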