COPO - TGAC Documentation
COPO: Collaborative Open Plant Omics

Rob Davey
Data Infrastructure and Algorithms Group Leader
[email protected]
@froggleston

Toni Etuk

Acknowledgements
• Oxford eResearch Centre: Susanna Sansone, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Alfie Abdul-Rahman, Felix Shaw
• Warwick: Jim Beynon, Katherine Denby, Ruth Bastow
• EMBL-EBI: Paul Kersey
• TGAC: Vicky Schneider, Tanya Dickie, Emily Angiolini, Matt Drew

COPO
• Recently awarded a BBSRC BBR grant
• TGAC, Univ. Oxford, Univ. Warwick, EMBL-EBI
• Supported by GARNet, iPlant, Eagle Genomics
• Empower bioscience plant researchers to:
  1. Enable standards-compliant data collection, curation and integration
  2. Enhance access to data analysis and visualisation pipelines
  3. Facilitate data sharing and publication to promote reuse
• Train plant researchers in best practice for data sharing and producing citable Research Objects

COPO
• (Good) science is founded on reproducibility
• Reproducibility depends on:
  • reducing reinvention ("friction")*
  • describing methods and data
  • maximising benefit to the researcher
• Describing methods is well established through "traditional" publishing
• Data description is sorely under-represented and under-used
• Benefits are often opaque
  • Fear of being scooped, loss of control, reputation, etc.

* http://cameronneylon.net/blog/network-enabled-research/

COPO
• What prevents plant scientists from openly depositing their data and metadata?
• Lack of interoperability between:
  • metadata annotation services
  • data repository services
  • data analysis services
  • data publishing services
• Researchers might not:
  • be aware that the services exist
  • have the expertise to use them
  • see the value in properly describing their data

COPO
• Data: Sample, Sequence, Genome, Proteome, Metabolome, Imaging
• Code: GitHub, BitBucket, Zenodo
• Analysis: Galaxy, iPlant, Bioconductor, Taverna, local code/services
• Publication: figshare, Scientific Data, Dryad, F1000, PeerJ, GigaScience
• Beyond the PDF: Utopia, GitHub
• Training: materials, examples, workshops, bootcamps

COPO
• It's not because these services don't exist!
• Clearly, barriers exist between the scientist and the service
• Infrastructure can help by:
  • wiring existing services together
  • improving access to services
  • facilitating collaboration
  • raising the profile of the benefits of open science
• How do we collaborate successfully to make this happen?
• Mapping services with Application Programming Interfaces (APIs)

COPO
• Grace signs into COPO with her ORCID iD
  • This signs her into all other services as required
• She starts a new COPO Profile
• She uploads to the COPO platform:
  • Three FASTQs (two Illumina HiSeq2500, one PacBio P6-C4) representing her velociraptor sequencing reads
• She tells COPO to push her data to a Galaxy server and run a workflow, producing:
  • An assembly of the reads from ALLPATHS-LG v51551
  • A draft automated annotation from RAST v33-1
• The interface prompts her to add metadata to her data in order to deposit them in the public repositories
  • Metadata fields will be shown based on the data, and redundant fields will be merged automatically
  • Sample name, sample organism, data type, sequencer used, software name, software version...
• She clicks "Upload", and everything is submitted

COPO
• Single sign-on (SSO), e.g. ORCID
• Deposit multi-omics data in one go
  • No context-switching between services
• Run and deposit analytical workflows
  • Describe software used, and versions
  • Pull into platforms, e.g. Galaxy, iPlant
  • Support virtualisation, e.g. iPlant Atmosphere, Docker, Amazon AWS
• Data is well-described, open, and everything has DOIs
  • Finding and integrating data is greatly improved
  • Make suggestions to users based on their data/workflows
• Programmatic access to all layers (see the sketch below)

REPRODUCIBILITY
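The "programmatic access to all layers" point can be made concrete with Galaxy's API, which covers the Galaxy leg of Grace's walkthrough. Below is a minimal sketch in Python using the BioBlend client, assuming a reachable Galaxy server that already hosts a suitable workflow; the server URL, API key, file names, and workflow name are all hypothetical, and this illustrates the kind of wiring COPO automates rather than COPO's actual implementation:

    from bioblend.galaxy import GalaxyInstance

    # Hypothetical server and key; in COPO these would be resolved
    # for the user behind single sign-on
    gi = GalaxyInstance(url="https://galaxy.example.org", key="YOUR_API_KEY")

    # A fresh history to hold the velociraptor reads
    history = gi.histories.create_history(name="velociraptor-assembly")

    # Upload the three FASTQs (placeholder file names)
    uploads = [
        gi.tools.upload_file(path, history["id"], file_type="fastqsanger")
        for path in ("hiseq_R1.fastq", "hiseq_R2.fastq", "pacbio_p6c4.fastq")
    ]
    dataset_ids = [u["outputs"][0]["id"] for u in uploads]

    # Find a pre-built workflow by its (hypothetical) name and invoke it;
    # its steps would wrap tools such as ALLPATHS-LG and RAST, and it is
    # assumed here to take exactly three input datasets, keyed by step index
    workflow = gi.workflows.get_workflows(name="assembly-and-annotation")[0]
    inputs = {str(i): {"src": "hda", "id": ds} for i, ds in enumerate(dataset_ids)}
    invocation = gi.workflows.invoke_workflow(
        workflow["id"], inputs=inputs, history_id=history["id"]
    )
    print("Invocation state:", invocation["state"])

COPO's contribution is doing this plumbing, and the matching repository submissions, behind one interface, so the researcher never handles API keys or dataset identifiers directly.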
COPO
• Not just raw/processed data is valuable
• COPO supports submission of supplementary data to figshare:
  • PDFs (posters, papers)
  • CSV/Excel
  • movies/images (size permitting)
• Zenodo/GitHub releases for code DOIs
  • Marked up with software metadata descriptors, for example ENCODE's or the Digital Curation Centre's

COPO
• What have we achieved so far?
• TGAC infrastructure to support brokering of data
  • iRODS and web server virtual machines
  • High-speed Aspera transfer links to EBI
• Prototype user interface for multi-omics data submissions
  • OAuth2 support ("sign in with" ORCID, Google, Twitter)
• Developing a JSON specification for COPO objects (a sketch closes this section)
  • Easily stored in document-based databases, e.g. MongoDB
• Interconversion between ISA formats
  • ISA-Tab (tab-delimited) to JSON, and vice versa
  • Linked Data specifications
• Community interactions
  • MetaboLights group at EBI
  • Setting up this workshop!

COPO
• COPO will:
  • Facilitate easy, relevant data description to submit data and metadata to multiple public repositories, e.g. GigaDB, Scientific Data
    • The reasons most of you are here… What are the barriers for you and your data?
  • Facilitate access to the workflows used to analyse the data
    • This will form part of another COPO workshop
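To make the "JSON specification for COPO objects" point concrete: a document-shaped object drops straight into MongoDB with no mapping layer. Here is a minimal sketch in Python using pymongo, assuming a local MongoDB instance; the field names are invented for illustration (they echo the metadata fields from Grace's walkthrough) and are not the actual COPO schema:

    from pymongo import MongoClient

    # Local MongoDB instance (assumed); deployment details will differ
    client = MongoClient("localhost", 27017)
    profiles = client["copo"]["profiles"]

    # Hypothetical COPO object: keys mirror the metadata the interface
    # prompts for (sample name, organism, sequencer, software versions)
    profile = {
        "profile_name": "Velociraptor genome",
        "orcid": "0000-0002-1825-0097",  # illustrative ORCID iD
        "samples": [
            {
                "sample_name": "VRAP-001",
                "organism": "Velociraptor mongoliensis",
                "data_type": "genomic reads",
                "sequencer": "Illumina HiSeq2500",
            }
        ],
        "software": [
            {"name": "ALLPATHS-LG", "version": "51551"},
            {"name": "RAST", "version": "33-1"},
        ],
    }

    inserted = profiles.insert_one(profile)
    print("Stored COPO profile:", inserted.inserted_id)

Because the object is plain JSON, the same structure is a natural endpoint for the ISA-Tab to JSON interconversion mentioned above.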