COPO - TGAC Documentation
COPO: Collaborative Open Plant Omics
Rob Davey
Data Infrastructure and Algorithms Group Leader
[email protected]
@froggleston
Toni Etuk
Acknowledgements
Oxford eResearch Centre
Susanna Sansone
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
Alfie Abdul-Rahman
Felix Shaw
Warwick
Jim Beynon
Katherine Denby
Ruth Bastow
EMBL-EBI
Paul Kersey
TGAC
Vicky Schneider
Tanya Dickie
Emily Angiolini
Matt Drew
COPO
• Recently awarded BBSRC BBR grant
• TGAC, Univ. Oxford, Univ. Warwick, EMBL-EBI
• Supported by GARNet, iPlant, Eagle Genomics
• Empower bioscience plant researchers to:
  1. Enable standards-compliant data collection, curation and integration
  2. Enhance access to data analysis and visualisation pipelines
  3. Facilitate data sharing and publication to promote reuse
• Train plant researchers in best practice for data sharing and producing citable Research Objects
COPO
• (Good) Science is founded on reproducibility
• Reproducibility depends on:
  • reducing reinvention (“friction”)*
  • describing methods and data
  • maximising benefit to the researcher
• Describing methods is well established through “traditional” publishing
• Data description is sorely under-represented and under-used
• Benefits are often opaque
  • Fear of being scooped, loss of control, reputation, etc.
* http://cameronneylon.net/blog/network-enabled-research/
COPO
•
What prevents plant scientists from openly depositing their data
and metadata?
•
•
Lack of interoperability between:
•
metadata annotation services
•
data repository services
•
data analysis services
•
data publishing services
Researchers might not:
•
be aware that the services exist
•
have the expertise to use them
•
see the value in properly describing their data
COPO
• Data: Sample, Sequence, Genome, Proteome, Metabolome, Imaging
• Code: GitHub, BitBucket, Zenodo
• Analysis: Galaxy, iPlant, Bioconductor, Taverna, local code/services
• Publication: figshare, Scientific Data, Dryad, F1000, PeerJ, Gigascience
• Beyond the PDF: Utopia, GitHub
• Training: Materials, examples, workshops, bootcamps
COPO
• It's not because these services don't exist!
• Clearly, barriers exist between the scientist and the service
• Infrastructure can help by:
  • wiring existing services together
  • improving access to services
  • facilitating collaboration
  • raising the profile of the benefits of open science
• How do we collaborate successfully to make this happen?
  • Mapping services with Application Programming Interfaces (APIs)
COPO
• Grace signs into COPO with her ORCID iD
• This signs her into all other services as required
• She starts a new COPO Profile
• She uploads to the COPO platform:
  • Three FASTQs (two Illumina HiSeq2500, one PacBio P6-C4) representing her velociraptor sequencing reads
• She tells COPO to push her data to a Galaxy server and run a workflow (sketched below), producing:
  • An assembly of the reads from ALLPATHS-LG v51551
  • A draft automated annotation from RAST v33-1
• The interface prompts her to add metadata to her data in order to deposit them in the public repositories
  • Metadata fields will be shown based on the data, and redundant fields will be merged automatically
  • Sample name, sample organism, data type, sequencer used, software name, software version...
• She clicks “Upload”, and everything is submitted
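For a feel of what the “push to Galaxy and run a workflow” step could look like programmatically, here is a minimal sketch using the bioblend Galaxy client. The server URL, API key, filename, and workflow name are illustrative placeholders, not part of COPO:

```python
from bioblend.galaxy import GalaxyInstance

# Hypothetical Galaxy server and API key -- placeholders for illustration.
gi = GalaxyInstance(url="https://galaxy.example.org", key="YOUR_API_KEY")

# Create a history to hold the run, then upload one of Grace's FASTQs.
history = gi.histories.create_history(name="velociraptor-assembly")
upload = gi.tools.upload_file("raptor_hiseq_R1.fastq.gz", history["id"])
dataset_id = upload["outputs"][0]["id"]

# Look up a pre-built workflow by (hypothetical) name and invoke it
# on the uploaded reads.
workflow = gi.workflows.get_workflows(name="assembly-and-annotation")[0]
invocation = gi.workflows.invoke_workflow(
    workflow["id"],
    inputs={"0": {"src": "hda", "id": dataset_id}},  # step 0 = input reads
    history_id=history["id"],
)
print("Invocation state:", invocation["state"])
```

Driving Galaxy through its API like this, rather than through its web UI, is what lets a broker platform submit data and trigger analyses without the researcher switching contexts.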
COPO
•
Single-sign on (SSO), e.g. ORCID
•
Deposit multi-omics data in one go
•
•
•
•
No context-switching between services
Run and deposit analytical workflows
•
Describe software used, versions
•
Pull into platforms, e.g. Galaxy, iPlant
•
Support virtualisation, e.g. iPlant Atmosphere, Docker, Amazon AWS
Data is well-described, open, and everything has DOIs
•
Finding and integrating data is improved greatly
•
Make suggestions to users based on their data/workflows
Programmatic access to all layers
REPRODUCIBILITY
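A minimal sketch of the ORCID single sign-on step, using ORCID's public OAuth2 endpoints. The client ID, secret, and redirect URI are hypothetical placeholders; production code would use a proper OAuth library and a `state` parameter:

```python
import requests

AUTHORIZE_URL = "https://orcid.org/oauth/authorize"
TOKEN_URL = "https://orcid.org/oauth/token"
CLIENT_ID = "APP-XXXXXXXXXXXXXXXX"                        # hypothetical app credentials
CLIENT_SECRET = "your-client-secret"                      # hypothetical
REDIRECT_URI = "https://copo.example.org/oauth/callback"  # hypothetical

# Step 1: send the user to ORCID to sign in and grant access.
login_url = (
    f"{AUTHORIZE_URL}?client_id={CLIENT_ID}&response_type=code"
    f"&scope=/authenticate&redirect_uri={REDIRECT_URI}"
)

# Step 2: ORCID redirects back with ?code=...; exchange it for a token.
def exchange_code(code: str) -> dict:
    resp = requests.post(
        TOKEN_URL,
        data={
            "client_id": CLIENT_ID,
            "client_secret": CLIENT_SECRET,
            "grant_type": "authorization_code",
            "code": code,
            "redirect_uri": REDIRECT_URI,
        },
        headers={"Accept": "application/json"},
    )
    resp.raise_for_status()
    # The response carries the user's ORCID iD alongside the access token,
    # which is what lets one login identify the researcher across services.
    return resp.json()  # keys include "access_token", "orcid", "name"
```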
COPO
• Not just raw/processed data is valuable
• COPO supports submission of supplementary data to figshare:
  • PDFs (posters, papers)
  • CSV/Excel
  • movies/images (size permitting)
• Zenodo/GitHub releases for code DOIs (see the sketch below)
  • Marked up with, for example, the ENCODE Digital Curation Center’s software metadata descriptors
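For the Zenodo route, minting a DOI for supplementary material takes only a few calls against Zenodo's documented deposition REST API. In this minimal sketch the access token, filename, and metadata values are illustrative placeholders:

```python
import requests

ZENODO = "https://zenodo.org/api"
TOKEN = {"access_token": "your-zenodo-token"}  # hypothetical personal token

# 1. Create an empty deposition.
dep = requests.post(f"{ZENODO}/deposit/depositions", params=TOKEN, json={})
dep.raise_for_status()
dep_id = dep.json()["id"]

# 2. Attach the supplementary file (a poster PDF in this example).
with open("poster.pdf", "rb") as fh:
    requests.post(
        f"{ZENODO}/deposit/depositions/{dep_id}/files",
        params=TOKEN, data={"name": "poster.pdf"}, files={"file": fh},
    ).raise_for_status()

# 3. Describe the deposit, then publish it to mint a DOI.
metadata = {"metadata": {
    "title": "Velociraptor assembly poster",   # illustrative metadata
    "upload_type": "poster",
    "description": "Supplementary poster for a COPO-brokered dataset.",
    "creators": [{"name": "Example, Grace"}],  # placeholder creator
}}
requests.put(f"{ZENODO}/deposit/depositions/{dep_id}",
             params=TOKEN, json=metadata).raise_for_status()
pub = requests.post(f"{ZENODO}/deposit/depositions/{dep_id}/actions/publish",
                    params=TOKEN)
print("Minted DOI:", pub.json()["doi"])
```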
COPO
• What have we achieved so far?
• TGAC infrastructure to support brokering of data
  • iRODS and web server virtual machines
  • High-speed Aspera transfer links to EBI
• Prototype user interface for multi-omics data submissions
  • OAuth2 support (“sign in with” ORCID, Google, Twitter)
• Developing a JSON specification for COPO objects
  • Easily stored in document-based databases, e.g. MongoDB (see the sketch below)
• Interconversion between ISA formats
  • ISA-Tab (CSV-based) to JSON, and vice versa
  • Linked Data specifications
• Community interactions
  • MetaboLights group at EBI
  • Setting up this workshop!
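To make the MongoDB point concrete, here is a minimal sketch with pymongo of how a COPO object could be stored. The field names and values are hypothetical illustrations, not the actual COPO JSON specification under development:

```python
from pymongo import MongoClient

# Hypothetical shape for a COPO profile object; illustrative fields only.
profile = {
    "profile_id": "copo-000123",
    "orcid": "0000-0002-1825-0097",  # ORCID's documented example iD
    "title": "Velociraptor genome sequencing",
    "samples": [{
        "name": "raptor-01",
        "organism": "Velociraptor mongoliensis",
        "data_type": "sequence_reads",
        "sequencer": "Illumina HiSeq2500",
    }],
    "workflows": [
        {"software": "ALLPATHS-LG", "version": "51551"},
        {"software": "RAST", "version": "33-1"},
    ],
}

client = MongoClient("mongodb://localhost:27017")
inserted = client["copo"]["profiles"].insert_one(profile)
print("Stored COPO object:", inserted.inserted_id)
# A document database keeps the nested JSON intact, so new metadata fields
# can be added per-object without schema migrations.
```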
COPO
• COPO will:
  • Facilitate easy, relevant data description to submit data and metadata to multiple public repositories
    • The reasons most of you are here…
    • What are the barriers for you and your data?
  • Facilitate access to the workflows used to analyse the data, e.g. to GigaDB, Scientific Data
    • This will form part of another COPO workshop