Using Big Data to Predict Veteran Suicide Risk

Overview
Patterns and Predictions (P&P) is a predictive analytics firm with a core technology that
provides unstructured and linguistics-driven prediction. It is the technology powering the
Durkheim Project’s ‘Big Data’ analytics network for the assessment of mental health risks.
Partners include Bloomberg, The Geisel School of Medicine at Dartmouth, Cloudera, and
Attivio. Funding sources include the U.S. Government’s Defense Advanced Research Projects
Agency (DARPA), and customers include Global 100 companies. The company’s principal
partner, Chris Poulin, is co-inventor of the company’s core Centiment® technology that
delivers unstructured and linguistics-driven prediction.
The Durkheim Project is named in honor of David Émile Durkheim, a French sociologist
whose 1897 publication, Suicide, defined early text analysis for suicide risk and provided
important theoretical explanations relating to societal disconnection. The project
follows its namesake’s lead in its search for what Durkheim referred to as the “qualities”
of suicide – those specific patterns and clues that point to suicide risk. The Durkheim
Project, though, has one valuable tool at its disposal that the founding sociologist didn’t
have: technology.
The Challenge
Suicide is an issue with which the US military has struggled for years. Today, the battle
against this pervasive enemy continues to rage on, with staggering – and persistently
increasing – casualties. In one of its many articles that reference the subject, Time reports that the record number of 349 military suicides in 2012 far exceeded the number of
American combat deaths in Afghanistan for the same year. The rate of military suicides is
roughly double that of adults in the general US population.1
In its Suicide Data Report, 2012, the US Department of Veterans Affairs (VA) noted that,
“Information on the characteristics and outcomes of veterans at risk for suicide is critical
to the development of improved suicide prevention programs.”2
The Durkheim Project is well-positioned to deliver on the promise of this crucial information. With its powerful array of advanced analytics, real-time predictive modeling, and
machine learning working in concert, the project seeks to identify critical correlations
between veterans’ communications and suicide risk in what Fast Company describes as
“the most vital use of ‘Big Data’ we’ve ever seen.”3
1 Time, One a Day. July 23, 2012
2 The US Department of Veterans Affairs, Suicide Data Report, 2012
3 Fast Company Labs, This May Be The Most Vital Use Of “Big Data” We’ve Ever Seen. July 12, 2012
CUSTOMER SUCCESS STORY
Key Highlights
Industries
• Government
• Healthcare and Life Sciences
Location
• Portsmouth, NH, USA
Business Application Supported
• Predictive analytics that identify risk
factors for suicide
Impact
• Accurate, linguistics-driven
correlations between real-time
communications and suicide risk
• Infrastructure delivers lower cost, better
computational throughput, and reduced
complexity of IT support
Technologies in Use
• Hadoop Platform: CDH
• Hadoop Components: Cloudera
Impala, Cloudera Search
• Servers: Cray grid, Amazon EC2
• Analytic Tools: Patterns and
Predictions Centiment®; Attivio
Big Data Scale
• Over 1TB of jobs processed per day in
real time
• Up to 100,000 active-duty service members and
veterans supported in real time
Solution
Phase One
The Durkheim Project began in 2010 with initial funding from DARPA, a research arm of
the Department of Defense (DoD), and with prior research from Dartmouth College, with
which P&P and Poulin have ties. Poulin and his specialists are key players in the project’s
multi-disciplinary team that also includes experts in artificial intelligence, medical professionals from private companies, the Geisel School of Medicine at Dartmouth, and the VA.
Phase One of the project began with a study of three cohorts, with 100 subjects each,
representing “non-psychiatric”, “psychiatric”, and “suicide positive” profiles. The
researchers developed linguistics-driven prediction models to estimate suicide risk,
generated from unstructured clinical notes.
In 2011, P&P began sourcing the technology and building out the integrated foundational
infrastructure and predictive modeling that would support the project’s extensive data
collection and analysis, once it was scaled up. Distributed technologies like Apache
Hadoop presented a logical solution for an efficient and highly scalable big data platform;
but the project required a lightweight machine learning framework that would run on
Hadoop and detect real-time risk at scale.
“Most of the big data machine learning solutions that were out there were of low
performance in accuracy, or highly complex in implementation and in integration with our
existing environment,” explained Poulin.
Cloudera’s category leadership and subject matter expertise with Hadoop and big data
led Poulin to engage Cloudera Professional Services to co-develop Bayesian counters,
a lightweight statistical model that detects risk at scale, based on Apache HBase and
CDH (Cloudera’s Distribution Including Apache Hadoop), the market-leading, 100% open
source distribution of Hadoop and related projects. The Cloudera-based framework is a
cornerstone technology of the Durkheim Project.
The tightly integrated system was “trained” by feeding in isolated statistical indicators
– keyword combinations, patterns, and other linguistic clues – determined through careful
analysis of prior data from a variety of veterans’ database sources. Once trained,
the model could identify useful clues in real data and establish a risk
“score.”
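The source does not publish the internals of the Bayesian counters, so what follows is only a minimal Python sketch of the general idea: a count-based naive Bayes classifier that tallies tokens per cohort during training and converts those counts into a risk “score” for new text. The class name, the label identifiers (borrowed from the three cohort names), the smoothing value, and the toy training phrases are all assumptions made for illustration. The production system is described as running on HBase and CDH, not in memory as shown here.

```python
import math
from collections import Counter, defaultdict


class BayesianCounter:
    """Count-based naive Bayes text scorer (illustrative, hypothetical name)."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing                # Laplace smoothing (assumed value)
        self.token_counts = defaultdict(Counter)  # label -> token -> count
        self.doc_counts = Counter()               # label -> number of training notes
        self.vocabulary = set()

    def train(self, text, label):
        """Increment the counters for one labeled note."""
        tokens = text.lower().split()
        self.token_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocabulary.update(tokens)

    def log_posteriors(self, text):
        """Return the unnormalized log posterior of each label for the text."""
        tokens = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.token_counts[label]
            total_tokens = sum(counts.values())
            log_p = math.log(n_docs / total_docs)  # class prior
            for tok in tokens:
                log_p += math.log(
                    (counts[tok] + self.smoothing)
                    / (total_tokens + self.smoothing * vocab_size)
                )
            scores[label] = log_p
        return scores

    def risk_score(self, text, risk_label):
        """Normalized posterior probability of risk_label, between 0 and 1."""
        scores = self.log_posteriors(text)
        top = max(scores.values())                 # subtract max for stability
        weights = {k: math.exp(v - top) for k, v in scores.items()}
        return weights[risk_label] / sum(weights.values())


# Toy, invented training phrases; real inputs would be clinical notes.
model = BayesianCounter()
model.train("routine checkup knee pain", "non_psychiatric")
model.train("anxious mood medication review", "psychiatric")
model.train("hopeless isolated sleep trouble", "suicide_positive")

score = model.risk_score("isolated and hopeless", "suicide_positive")
print(round(score, 3))  # 0.667
```

Because training only increments counters, a model of this shape can be updated one note at a time. That is one plausible reason a counter-based design suits a distributed store, though the source does not spell out the rationale.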
Because suicide is such a personal act, and one in which the person tends to keep up an
outward appearance of being fine, explained Poulin, “the risk signals are weaker. When you
deploy the system at scale, machine learning has to be very sensitive on that big data.”
The Phase One build and testing concluded in early 2013. It validated that the project’s
machine learning data fabric was viable, with predictive capabilities that were 65%
accurate in predicting suicide risk among a veteran control group.
Phase Two
Phase Two of the Durkheim Project launched in July 2013 and focused, with Cloudera’s
involvement, on the project’s ultimate objective of “suicidality prediction at scale” across
different types of structured and unstructured data. Facebook joined DARPA in supporting
this phase, through the promotion of content of consenting participants for the project’s
monitoring purposes.
With a target number of 100,000 veteran participants, the data most certainly will be
“big.” Those veterans who opt into the project receive a unique Facebook app and a
mobile app for either the iOS or Android system – all designed to capture posts, Tweets,
mobile uploads, and even location. Additional profile data is captured as well, including
physician information and clinical notes. To ensure compliance with various privacy
and HIPAA regulations, all captured data is stored in a secure environment behind the
medical firewall at Dartmouth’s Geisel School of Medicine.
As participants join, individual profiles are set up and made accessible via a dashboard to researchers at Geisel and to clinicians. The system assigns an overall risk score to each profile
based on the collective information and on keywords that are specific to each participant.
Applying text analytics to the large, continuously fed data pool yields a very large
number of variables that can then be compared and analyzed, resulting in a
real-time assessment of the participant’s mental health. Said Poulin, “The computational
processing to analyze that data requires a big data fabric, but the benefit is that it’s much
more informative.”
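How per-message signals are combined into an overall profile score is not described in the source. As one hedged illustration, the Python sketch below keeps a running score per participant and updates it with an exponentially weighted moving average each time a new message is scored. The `ProfileRisk` class, the `record_message` helper, the `alpha` weight, and the assumption that each message already carries a score between 0 and 1 are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ProfileRisk:
    """Running risk state for one participant (hypothetical structure)."""
    participant_id: str
    score: float = 0.0
    n_messages: int = 0

    def update(self, message_score: float, alpha: float = 0.3) -> float:
        """Blend a new per-message score into the running profile score."""
        if not 0.0 <= message_score <= 1.0:
            raise ValueError("message_score must be in [0, 1]")
        if self.n_messages == 0:
            self.score = message_score          # first message seeds the score
        else:
            # Recent messages weigh more; older ones decay geometrically.
            self.score = alpha * message_score + (1 - alpha) * self.score
        self.n_messages += 1
        return self.score


def record_message(profiles: Dict[str, ProfileRisk],
                   participant_id: str,
                   message_score: float) -> float:
    """Create the profile on first sight, then update its running score."""
    profile = profiles.setdefault(participant_id, ProfileRisk(participant_id))
    return profile.update(message_score)


profiles: Dict[str, ProfileRisk] = {}
record_message(profiles, "participant-001", 0.2)
latest = record_message(profiles, "participant-001", 0.8)
print(round(latest, 2))  # 0.38
```

A moving average is only one option. It is used here because it needs constant memory per profile and can be refreshed as each post arrives, which fits the real-time framing in the text.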
The technical rubric for the project is “maximum speed at minimum cost”, which
prompted adoption of Cloudera Search and Cloudera Impala. “The project has a very
complex workflow,” explained Poulin. “All of our machine learning is indexed, and we
actually access all of the machine learning through search interfaces, which can get
expensive. With Cloudera Search and Impala, our ingestion of data on Hadoop is promisingly efficient in terms of lower costs, better computational throughput, and reduced
complexity of IT support.”
Impact
The complexity and sensitivity of the topic of suicide, combined with the intensifying battle
that the military faces, make for a very weighty backdrop for the Durkheim Project. In
that respect, “the technology aspect of the project has almost been easier than the social
engineering,” said Poulin. “If a person is really committed to taking his own life, you need to
be both informed and gentle enough to try to help that person find an alternative outcome.”
Still in its initial phases, though, the Durkheim Project is authorized only to monitor and
analyze data. While the project has delivered statistically valid results that accurately
predict suicide risk in a control group of veterans, its critical research is restricted, at
least for the time being, to a non-interventional protocol. Poulin hopes that continuing
to scale the risk classifiers on Cloudera will help to establish the necessary
confidence in the project’s ability to assess risk in real time, so that the team can apply
for an interventional study.
Phase One of the Durkheim Project predicted suicide risk among a veteran control group with 65% accuracy, demonstrating statistical significance.
“One of the promises of big data in this case,” Poulin stated, “is that you can shorten the
distance between the people who need help and the system that can get them help. That is
our goal, and one we want to continue to move toward with Cloudera as our partner.”
About Cloudera
Cloudera is revolutionizing enterprise data management by offering the first unified
Platform for Big Data, an enterprise data hub built on Apache Hadoop. Cloudera offers
enterprises one place to store, process and analyze all their data, empowering them to
extend the value of existing investments while enabling fundamental new ways to derive
value from their data. Only Cloudera offers everything needed on a journey to an enterprise data hub, including software for business critical data challenges such as storage,
access, management, analysis, security and search. As the leading educator of Hadoop
professionals, Cloudera has trained over 40,000 individuals worldwide. Over 1200
partners and a seasoned professional services team help deliver faster time to value.
Finally, only Cloudera provides proactive and predictive support to run an enterprise
data hub with confidence. Leading organizations in every industry plus top public sector
organizations globally run Cloudera in production. www.cloudera.com.
1-888-789-1488 or 1-650-362-0488
Cloudera, Inc. 1001 Page Mill Road, Palo Alto, CA 94304, USA
© 2015 Cloudera, Inc. All rights reserved. Cloudera and the Cloudera logo are trademarks or registered trademarks of Cloudera Inc. in the USA
and other countries. All other trademarks are the property of their respective companies. Information is subject to change without notice.