Rebecca Wilson, University of Bristol

DataSHIELD
Taking the analysis to the data
Dr Becca Wilson
D2K Research Group, University of Bristol
McGill University, OICR, Maelstrom Research
The Norwegian Institute of Public Health, Dept of Epidemiology
MRC Epidemiology Unit, Cambridge
Eindhoven Technical University
F1000 Research Journal
@Data2Knowledge @drbeccawilson #d2kDatashield
Rationale
Barriers to data access and analysis arise in any discipline, for a range of
reasons:
• Ethical-legal or governance restrictions on disclosing
confidential data
• Maintaining control of intellectual property
• Physical size of the data
DataSHIELD provides a flexible, modular, open-source solution
ideally placed to grow a broad user and development community
The DataSHIELD Approach
• DataSHIELD born from requirement in biomedical & social
sciences to co-analyse individual patient data from different
sources, without disclosing identity or sensitive information
• Under DataSHIELD, raw data never leaves the data provider, only
non-disclosive summary statistics returned to the researcher
• Researcher able to do the analysis themselves using R
• The analysis is taken to the data – not the data to the analysis
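The idea that only summary statistics leave each data provider can be illustrated with a minimal sketch. DataSHIELD itself is driven from R; the Python below is only a conceptual illustration, and the names SiteSummary, summarise_at_site and pooled_mean, as well as the example values, are hypothetical and not part of DataSHIELD.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class SiteSummary:
    """Aggregate that a site releases: a sum and a count, no individual rows."""
    total: float
    count: int


def summarise_at_site(values: Sequence[float]) -> SiteSummary:
    # Runs next to the data; only the sum and the count are returned.
    return SiteSummary(total=float(sum(values)), count=len(values))


def pooled_mean(summaries: List[SiteSummary]) -> float:
    # Runs on the researcher's side, using the released aggregates only.
    n = sum(s.count for s in summaries)
    if n == 0:
        raise ValueError("no observations across sites")
    return sum(s.total for s in summaries) / n


# Made-up individual-level values that stay at each (hypothetical) site.
site_a = [5.1, 6.3, 5.8, 7.0]
site_b = [6.5, 5.9]

summaries = [summarise_at_site(site_a), summarise_at_site(site_b)]
result = pooled_mean(summaries)
```

The pooled mean equals the mean that would be obtained from the combined individual-level data, yet the researcher only ever sees one sum and one count per site.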
Example Infrastructure
[Infrastructure diagram, built up over six slides; the only surviving label reads "Includes R parser"]
DataSHIELD Status
• DataSHIELD methodology and infrastructure proven
• Gaye, A. et al. (2014). DataSHIELD: taking the analysis to the
data, not the data to the analysis. International Journal of
Epidemiology
• Jones, E.M. et al. (2012). DataSHIELD – shared individual-level
analysis without sharing data: a biostatistical
perspective. Norwegian Journal of Epidemiology
DataSHIELD Status
• Current functionality (http://www.datashield.ac.uk/latest-release/):
- descriptive stats (e.g. mean)
- exploratory stats (e.g. histogram)
- contingency tables (e.g. 1D and 2D)
- modelling (survival analysis using piecewise
exponential regression, glm)
• Currently enhancing existing functions, developing further
modelling tools (glmm), and exploring other data types: genomics,
geospatial, text
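As an illustration of how a 1D contingency table might be pooled without releasing disclosive counts, here is a minimal Python sketch. It is not DataSHIELD code: the threshold MIN_CELL_COUNT, the choice to withhold the whole table when any cell is small, and the function names are all assumptions made for this example.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional

# Hypothetical threshold chosen for this sketch; not a DataSHIELD default.
MIN_CELL_COUNT = 5


def site_table(categories: Iterable[str],
               min_cell: int = MIN_CELL_COUNT) -> Optional[Dict[str, int]]:
    """Build a 1D table at the site; release nothing if any cell is small."""
    counts = Counter(categories)
    if any(c < min_cell for c in counts.values()):
        return None  # the table is withheld rather than partially released
    return dict(counts)


def pool_tables(tables: List[Optional[Dict[str, int]]]) -> Dict[str, int]:
    """Combine the tables that sites agreed to release."""
    pooled: Counter = Counter()
    for table in tables:
        if table is not None:
            pooled.update(table)
    return dict(pooled)


# Made-up categorical data held at two hypothetical sites.
released = site_table(["yes"] * 6 + ["no"] * 5)
withheld = site_table(["yes"] * 6 + ["no"] * 2)
pooled = pool_tables([released, withheld])
```

Withholding the whole table is only one possible policy; a real deployment would follow whatever disclosure level the data provider has set.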
DataSHIELD Status
• Current pilot phase in 10 European studies (www.bioshare.eu):
– Healthy Obese Project
– Environmental Core Project: effects of environmental
exposures on cardio-respiratory and mental health in
European adults
DataSHIELD Future
• Support and Training: Historically grant funded – investigating
alternative funding models
– Free: support from wiki
• http://www.datashield.ac.uk/wiki
• http://wiki.obiba.org/display/CAG/Home
– User Support: Access to support forum
• www.datashield.ac.uk/forum
• code, questions, troubleshooting error messages
• complex questions may require giving a DataSHIELD
developer access permissions to the portal in order to
replicate the error
DataSHIELD Future
– Consortium Support: for infrastructure and users
• Support implementation of Opal/DataSHIELD
• Design / develop new packages or functionality
• Opal monitoring system: users and data providers can see
– When/which Opal servers are down
– Clues as to why they are down (memory, load etc)
– Alerts sent to 2 designated people at each data provider
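The monitoring behaviour listed above could be sketched roughly as follows. This is a hypothetical Python illustration, not the Opal monitoring system: the ServerStatus fields, the memory and load limits, and the contact names are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ServerStatus:
    """Last reported state of one data provider's server (hypothetical fields)."""
    provider: str
    reachable: bool
    last_memory_used_pct: float
    last_load_avg: float


def diagnose(status: ServerStatus,
             memory_limit: float = 90.0,
             load_limit: float = 4.0) -> List[str]:
    # Turn the last reported metrics into clues; the limits are illustrative.
    clues = []
    if status.last_memory_used_pct >= memory_limit:
        clues.append("high memory use")
    if status.last_load_avg >= load_limit:
        clues.append("high load")
    return clues


def build_alerts(statuses: List[ServerStatus],
                 contacts: Dict[str, Tuple[str, str]]) -> List[dict]:
    # One alert per designated person for every server that is down.
    alerts = []
    for status in statuses:
        if status.reachable:
            continue
        clues = diagnose(status) or ["no clue recorded"]
        for person in contacts[status.provider]:
            alerts.append({"to": person,
                           "provider": status.provider,
                           "clues": clues})
    return alerts


statuses = [
    ServerStatus("site_a", True, 40.0, 0.5),
    ServerStatus("site_b", False, 95.0, 1.0),
]
contacts = {"site_a": ("a1", "a2"), "site_b": ("b1", "b2")}
alerts = build_alerts(statuses, contacts)
```

Keeping two contacts per provider mirrors the two designated people mentioned on the slide, so an alert still reaches someone if one of them is unavailable.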
DataSHIELD Future
• Training: Can provide training on:
– Introduction to R (assuming statistical knowledge) [1 day]
– Introduction to DataSHIELD (assuming R and stats knowledge)
[1-2 days]
– Developer workshop [3-4 days]
DataSHIELD Future
• Broadening use:
– Applications outside academia e.g. academic publishing,
university data repositories etc
• Different types of DataSHIELD:
– Single site DataSHIELD
– Vertical DataSHIELD
Further Info
Any Questions?
www.datashield.ac.uk
Guidelines for Data Providers
• Hardware requirements:
– Server or VM to install Opal, plus a database server (MongoDB or
MySQL) to hold the data, e.g. NCDS BioSHaRE has:
• 2 vCPU 2.6GHz
• 4GB RAM
• 20GB disk space (for db server)
– Authorise Opal to receive/send comms via web services
(through your firewall) to DataSHIELD portal
Guidelines for Data Providers
• Operations: Designate 2 people to maintain Opal and DataSHIELD.
Responsible for:
– Installing / setting up Opal and DataSHIELD
– Joining the DataSHIELD/Opal mailing list
– Maintaining software updates
– Resilience of the data service
– Transparency about the disclosure level
Guidelines for Client Portal
• Client Portal: Designate 2 people to maintain the DataSHIELD
client portal. Responsible for:
– Installing / setting up DataSHIELD client portal
– Joining the DataSHIELD/Opal mailing list
– Maintaining software updates
– Resilience of the client portal