Data-Driven Discovery Investigator Application
Jeffrey Heer, University of Washington
Data analysis is a complex process involving frequent shifts among exploration and confirmation, data
formats and models, as well as textual and graphical media. In an interview study of 35 analysts at 25
companies, we noted recurring issues shared by data scientists. Common workflows consist of data
discovery and acquisition; wrangling data through reformatting, cleaning and integration; profiling data to
explore its contents, identify salient features and assess data quality; modeling data to explain or predict
phenomena; and reporting findings to others. This process is highly iterative, with analysts moving back
and forth among phases, and also interactive, regularly requiring human attention and domain knowledge.
At the Interactive Data Lab, we aim to accelerate this analytic lifecycle by identifying critical bottlenecks
and developing new interactive systems for data analysis. We study the perceptual, cognitive and social
factors affecting data analysis to enable people to work with data more effectively. The goal is to improve
the efficiency and scale at which expert analysts work, and to lower barriers to entry for non-experts.
Motivating questions include: How might we enable users to transform, integrate and model data while
minimizing the need for programming? How can we support expressive and effective visualization
designs? Can we build scalable systems to query and visualize massive data sets at interactive rates? How
might we enable domain experts to guide machine learning methods to produce better models?
DATA WRANGLING
Analysts must regularly restructure data to make it palatable to databases, statistics packages and
visualization tools. In our interviews, analysts reported spending 50-80% of their time transforming data
prior to visualization or modeling. In response, our work on Data Wrangler lets analysts interactively
transform data at scale. With Wrangler, users select features in a data table to prompt automatic suggestion
of possible actions, each of which is a statement in an underlying transformation language. Wrangler ranks
suggestions using a model that integrates user input with the frequency and diversity of transforms. Visual
previews of transformation results help analysts rapidly assess viable operations. The result of this process
is not simply transformed data, but a reusable transformation program that we can compile to runtime
environments such as Python, SQL and Map-Reduce. By producing not just data but executable programs,
Wrangler enables a level of scalability that is not currently possible with other graphical tools. In its first
year of release, our online Wrangler demo received over 10,000 unique users. Given the demand and
market opportunity, we have founded a start-up company (Trifacta) to commercialize this work.
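To make the compile-to-code idea concrete, the sketch below (in TypeScript, with hypothetical operation names rather than Wrangler's actual transformation language) represents a transform program as plain data. Because each step is declarative, the same list can be interpreted directly or walked by a code generator that emits, say, equivalent Python or SQL.

```typescript
type Row = Record<string, unknown>;

// Each step is a value in a small transformation language, not imperative code.
type Transform =
  | { op: "rename"; from: string; to: string }
  | { op: "dropMissing"; column: string }
  | { op: "fill"; column: string; value: unknown };

// A reusable transform program: rename a column, drop bad rows, fill gaps.
const program: Transform[] = [
  { op: "rename", from: "yr", to: "year" },
  { op: "dropMissing", column: "year" },
  { op: "fill", column: "count", value: 0 },
];

// Interpret the program directly against an in-memory table.
function run(rows: Row[], steps: Transform[]): Row[] {
  return steps.reduce((data, step) => {
    switch (step.op) {
      case "rename":
        return data.map(row => {
          const { [step.from]: value, ...rest } = row;
          return { ...rest, [step.to]: value };
        });
      case "dropMissing":
        return data.filter(row => row[step.column] != null);
      case "fill":
        return data.map(row => ({ ...row, [step.column]: row[step.column] ?? step.value }));
    }
  }, rows);
}
```

Representing transforms as data rather than code is what lets a single interactive session scale from an in-browser sample to a cluster-side run.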
We have applied a similar strategy to other "wrangling" tasks such as modeling networks from multiple
tabular data sources (Orion) and assessing data quality using both statistical anomaly detection and
automated visualization recommendation (Profiler). Across domains, we have found that the integration
of (1) visual, direct manipulation interfaces, (2) domain specific languages for data analysis tasks, and (3)
machine learning methods to suggest operations and visualizations can simplify, accelerate and scale data
preparation tasks that would otherwise involve significant tedium and programming effort.
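As an illustration of the kind of rule such a system can automate, the sketch below flags numeric outliers with a simple z-score test and picks a default view from a column's inferred type. The threshold and the type-to-chart mapping are illustrative assumptions, not Profiler's actual models.

```typescript
// Flag values more than zThreshold standard deviations from the mean.
function flagOutliers(values: number[], zThreshold = 3): number[] {
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const sd = Math.sqrt(values.reduce((a, b) => a + (b - mean) ** 2, 0) / values.length);
  return values.filter(v => sd > 0 && Math.abs(v - mean) / sd > zThreshold);
}

// Recommend a default visualization from a column's inferred type.
function recommendChart(values: unknown[]): string {
  if (values.every(v => typeof v === "number")) return "histogram";
  if (values.every(v => v instanceof Date)) return "line chart over time";
  return "sorted bar chart of category counts";
}
```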
DATA VISUALIZATION
Once data has been acquired and structured, visualization provides a powerful medium to spot patterns,
test assumptions, form hypotheses and communicate findings. Our research group is a world-recognized
leader in data visualization. Part of our work is empirical: we conduct human subjects studies to assess the
effectiveness of visual encoding choices. We then apply these findings to create improved visualization
tools. For example, we have contributed novel visualization methods for networks, text, and genomics
data, as well as new algorithms for automated visualization design.
However, visualization techniques are of little use if inaccessible to designers and analysts. Accordingly, my
group researches architectures and tools for data visualization. We have developed a number of popular
visualization tools, the most recent being Data-Driven Documents (D3.js). D3.js provides a grammar for
authoring expressive visualizations by mapping data to the visual properties of web page elements. Since
its release in 2011, D3.js has become the de facto standard for web-based visualization. D3.js developers
number in the tens of thousands, reaching millions of end users. D3.js is now widely used in industry,
journalism (e.g., the New York Times) and science, including applications in atmospheric science, biology,
bioinformatics, chemistry, geology, oceanography, physics and sociology. However, D3.js and related tools
still require non-trivial programming skills. In ongoing work, we are investigating the design of interactive
systems that enable custom visualization design without writing code.
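The heart of D3.js is its data join, which binds data to document elements and expresses visual properties as functions of the bound data. A minimal example (the data and encodings are arbitrary):

```typescript
import * as d3 from "d3";

const values = [4, 8, 15, 23];

const svg = d3.select("body").append("svg")
  .attr("width", 260)
  .attr("height", 120);

svg.selectAll("rect")
  .data(values)                  // bind one datum per (future) rect
  .enter().append("rect")        // create an element for each datum
  .attr("x", (_d, i) => i * 65)  // horizontal position encodes index
  .attr("y", d => 120 - d * 5)   // bars grow up from the baseline
  .attr("width", 60)
  .attr("height", d => d * 5)    // bar height encodes the value
  .attr("fill", "steelblue");
```

Every visual attribute is an ordinary function of the data, which is what makes the grammar expressive enough to span everything from standard charts to bespoke graphics.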
In addition, we are developing techniques for scaling visual encoding and interaction to the large data
volumes now common in scientific and industrial practice. By binning data into overlapping, multivariate
data "tiles" and leveraging parallel query processing, our imMens system can sustain 50 frames-per-second
interactive querying over summaries of billion+ element databases. In a follow-up human-subjects experiment (manuscript in preparation), we found that reducing interactive latency significantly improves the rates of both observation and hypothesis formation during exploratory data analysis.
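The essence of the data-tile approach is binned aggregation: collapse raw records into fixed-resolution grids of counts, so that rendering and interaction costs scale with the number of bins rather than the number of records. A minimal sketch (the resolution and clamping policy are illustrative, not imMens's actual scheme):

```typescript
// Reduce (x, y) points to a bins-by-bins grid of counts.
function binCounts(
  xs: Float64Array, ys: Float64Array,
  xMin: number, xMax: number,
  yMin: number, yMax: number,
  bins = 256
): Uint32Array {
  const grid = new Uint32Array(bins * bins);
  const clamp = (b: number) => Math.max(0, Math.min(bins - 1, b));
  for (let i = 0; i < xs.length; i++) {
    const bx = clamp(Math.floor(((xs[i] - xMin) / (xMax - xMin)) * bins));
    const by = clamp(Math.floor(((ys[i] - yMin) / (yMax - yMin)) * bins));
    grid[by * bins + bx]++;
  }
  return grid; // a fixed-size summary, however many points came in
}
```

Brushing and linking then query these precomputed summaries, so response time is independent of the raw record count.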
ENABLING DATA-DRIVEN DISCOVERY
Though the projects above have been informed by data science trends in industry, our motivating insights
also resonate within the natural sciences. In conversations and collaborations with scientists, we have
observed that data acquisition, preparation and visualization remain critical concerns requiring significant
effort. Moreover, as the sciences become increasingly data-rich, traditional processes of data collection will
be augmented by data selection from a growing array of extant scientific databases. As data is repurposed
for use outside its original context of collection, issues of data discovery, transformation and quality
assessment will magnify, and underlie both scientific validity and reproducibility.
We propose to center our research efforts squarely on the challenges facing data-driven science. As in
our study of enterprise analysts, we will begin with a qualitative assessment of current data analysis
practices among our colleagues in the natural sciences. Our recent move from Stanford to UW provides
new opportunities for this work, given strong ties to domain scientists through the UW eScience Institute.
For example, for my first Data Visualization class at UW, I solicited student project ideas from my
eScience colleagues. The response was tremendous, including projects from astronomy, biology, chemistry,
physics, oceanography and seismology. We received many more projects than the current class of 60
students can field. This response indicates both strong interest and a recognized need for improved tools.
We expect scientific "payoffs" to arrive through multiple means. The first is targeted collaborations with
domain scientists. Based on our interviews, we will identify research groups with challenges that (a)
require novel tools-oriented research and (b) will produce impactful scientific findings if addressed. We
followed a similar strategy at Stanford, embedding in multiple bioinformatics and biology research groups,
leading to new visual analysis tools for population-scale genomics data. We have also collaborated with
biologists studying ant colonies, producing new findings from decades of field study data.
Second, we expect more widespread breakthroughs to result from the availability of novel analysis tools
and through the education and enculturation of a new generation of scientists. We will research interactive
tools for data preparation and visualization with a focus on scientific data, and initiate new projects for
search, discovery and visualization in existing scientific databases. We will make all developed tools freely
available as open-source software, along with requisite examples and documentation. In addition to cross-disciplinary coursework, we will develop tutorials (both in-person and online), institute a campus-wide
data seminar series, and offer "data science office hours" for data preparation and analysis help.
Moore Foundation support would be transformative for these efforts. Funding will enable our group to
focus squarely on scientific discovery and expand our efforts and outreach. In addition to more student
research assistantships, we will hire a research programmer to strengthen tool building and support. These
efforts would complement and magnify existing Moore/Sloan support to UW, for which I am not a PI.
For more about our group, research and tools, please visit http://idl.cs.washington.edu/