Transforming Data into Action
UC CYBERINFRASTRUCTURE CONFERENCE

A longstanding objective of biomedical informatics has been to transform the wealth of clinical and research data into actionable knowledge and insights. New cyberinfrastructure is needed to achieve this objective, given both the vast amount of data and the new types of information.

What could we do if we had new cyberinfrastructure?

Precision medicine
• We could make the processing of clinical genomic tests faster (decreasing time from weeks to hours).
• We could combine more electronic health record data to drive real-world, continually learning predictive models for individualized screening, diagnosis, therapeutic prediction, and prognosis.

New paradigms for healthcare delivery/research
• We could create new pervasive, wireless monitoring models for patients outside of the clinical environment.
• We could support interdisciplinary sharing and reuse of large datasets across institutions.

Translational capacity
• We could seamlessly transfer state-of-the-art algorithms and techniques involving large amounts of computational power from the research environment to the hospital (direct bench-to-bedside).

Supporting Data-driven Projects

UCLA USE CASES
Both past and current projects are driving the need for high-performance computing and infrastructure to support research and clinical applications.

Project | Objective | Computational challenges
Athena Project | Predictive modeling and stratification for breast cancer screening | Machine learning algorithms; complex data; 15 TB of data (EHR, imaging, genomic, outcomes)
Depression | Determining early symptoms and genomic factors | Integrated data mining methods over large, heterogeneous datasets; 15 PB of data in ~4 years (EHR, imaging, genomic, sensors, etc.)
Autism | Uncovering genomic factors | Integrated data mining; 445 TB of data (retrospective)
Clinical genomics | Diagnostic classification | Data size; data processing (e.g., sequence alignment); machine learning methods; timely analysis
Wireless monitoring | Online real-time assessment for monitored environments/patients | High-volume, low-complexity data; network/streaming; machine learning algorithms
Text/image analysis projects | Biomarker identification and validation | Data size; data processing; machine learning algorithms; complex data

Translational Informatics Platform

COMMON NEEDS ACROSS THE USE CASES
The overarching purpose of our planned system is to support biomedical research in an environment that enables the secure sharing of clinical/research data and the seamless translation of developed computational methods back into the clinical environment.

Secure HPC environment
• Access separation between sensitive/secure data and non-sensitive research data
• Memory, storage, network
• ACL permissions
• Regular audits

Optimized research database
• Different types of queries to find data over integrated views
• Mixed environment needed (e.g., SQL vs. NoSQL systems) for retrieval and discovery

Workflow pipelines
• Data loaders (from underlying sources), batch loading
• Sophisticated user-designed sequences of pre- and post-processing of data (e.g., Kepler)

Production environment
• Clinical applications running off of the database, developed algorithms
• Runtime prioritization

HIGH-LEVEL BLOCK DIAGRAM

Data sources
• Clinical data sources
• Image archives
• Sequencing core data
• Other

Remaining diagram labels
• Uploaded raw data (e.g., BLOBs), data archives
• Pre- and post-processing algorithms (bioinformatics, NLP, image processing)
• Workflow execution
• Online analytical packages (e.g., MATLAB, R)
• Secondary database (Cassandra)
• Machine learning algorithms
• Source-specific data loaders
• Unified, web-based user querying/data retrieval interface
• Primary database (MySQL)
• Secured cloud storage in restricted network address space
• Clinical user
• Clinical data (CareConnect)
• Analytics, predictive model execution
• HPC
• Hospital computing environment/network
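To make the workflow-pipeline needs more concrete (source-specific data loaders, user-designed sequences of processing steps, and runtime prioritization), the following is a minimal Python sketch. It is an illustration under stated assumptions, not the platform's implementation: the loader functions, the `Job` class, the processing steps, and the sample records are all hypothetical, and a real deployment would read from the clinical, imaging, and sequencing sources rather than return hard-coded data.

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Record = Dict[str, object]

# Hypothetical source-specific loaders; real ones would pull from the
# underlying clinical or sequencing systems instead of returning samples.
def load_clinical() -> List[Record]:
    return [{"source": "clinical", "text": " Patient reports fatigue "}]

def load_sequencing() -> List[Record]:
    return [{"source": "sequencing", "text": "acgt"}]

LOADERS: Dict[str, Callable[[], List[Record]]] = {
    "clinical": load_clinical,
    "sequencing": load_sequencing,
}

# Hypothetical processing steps that a user could chain in any order.
def strip_whitespace(rec: Record) -> Record:
    return {**rec, "text": str(rec["text"]).strip()}

def uppercase(rec: Record) -> Record:
    return {**rec, "text": str(rec["text"]).upper()}

@dataclass(order=True)
class Job:
    priority: int  # lower number runs first; only this field is compared
    source: str = field(compare=False)
    steps: List[Callable[[Record], Record]] = field(compare=False)

def run_jobs(jobs: List[Job]) -> List[Record]:
    """Run jobs in priority order, applying each job's steps in sequence."""
    queue = list(jobs)
    heapq.heapify(queue)
    results: List[Record] = []
    while queue:
        job = heapq.heappop(queue)
        for rec in LOADERS[job.source]():
            for step in job.steps:
                rec = step(rec)
            results.append(rec)
    return results
```

For example, `run_jobs([Job(2, "sequencing", [uppercase]), Job(1, "clinical", [strip_whitespace])])` processes the clinical job first even though it was submitted second, which is the kind of ordering a production environment might use to let clinical applications take precedence over research batch loads.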