Data-Driven Discovery Investigator Application
Jeffrey Heer, University of Washington

Data analysis is a complex process involving frequent shifts among exploration and confirmation, data formats and models, as well as textual and graphical media. In an interview study of 35 analysts at 25 companies, we noted recurring issues shared by data scientists. Common workflows consist of data discovery and acquisition; wrangling data through reformatting, cleaning and integration; profiling data to explore its contents, identify salient features and assess data quality; modeling data to explain or predict phenomena; and reporting findings to others. This process is highly iterative, with analysts moving back and forth among phases, and also interactive, regularly requiring human attention and domain knowledge.

At the Interactive Data Lab, we aim to accelerate this analytic lifecycle by identifying critical bottlenecks and developing new interactive systems for data analysis. We study the perceptual, cognitive and social factors affecting data analysis to enable people to work with data more effectively. The goal is to improve the efficiency and scale at which expert analysts work, and to lower barriers to entry for non-experts. Motivating questions include: How might we enable users to transform, integrate and model data while minimizing the need for programming? How can we support expressive and effective visualization designs? Can we build scalable systems to query and visualize massive data sets at interactive rates? How might we enable domain experts to guide machine learning methods to produce better models?

DATA WRANGLING

Analysts must regularly restructure data to make it palatable to databases, statistics packages and visualization tools. In our interviews, analysts reported spending 50-80% of their time transforming data prior to visualization or modeling. In response, our work on Data Wrangler lets analysts interactively transform data at scale. With Wrangler, users select features in a data table to prompt automatic suggestion of possible actions, each of which is a statement in an underlying transformation language. Wrangler ranks suggestions using a model that integrates user input with the frequency and diversity of transforms. Visual previews of transformation results help analysts rapidly assess viable operations. The result of this process is not simply transformed data, but a reusable transformation program that we can compile to runtime environments such as Python, SQL and Map-Reduce (a sketch of such a program appears at the end of this section). By producing not just data but executable programs, Wrangler enables a level of scalability not currently possible with other graphical tools. In its first year of release, our online Wrangler demo received over 10,000 unique users. Given the demand and market opportunity, we have founded a start-up company (Trifacta) to commercialize this work.

We have applied a similar strategy to other "wrangling" tasks, such as modeling networks from multiple tabular data sources (Orion) and assessing data quality using both statistical anomaly detection and automated visualization recommendation (Profiler). Across domains, we have found that the integration of (1) visual, direct manipulation interfaces, (2) domain-specific languages for data analysis tasks, and (3) machine learning methods to suggest operations and visualizations can simplify, accelerate and scale data preparation tasks that would otherwise involve significant tedium and programming effort.
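To make the idea of a reusable transformation program concrete, the following is a minimal, hypothetical sketch in TypeScript. It is not Wrangler's actual transformation language; it only illustrates how a sequence of declarative transform statements can act as a program that is applied to data, inspected, or translated to another runtime. The column names and helper functions (dropEmpty, split) are invented for illustration.

```typescript
// Hypothetical sketch of a Wrangler-style transformation program.
// Each step is a statement in a small transform language; the script
// itself is a value that can be re-run, audited, or compiled elsewhere.
type Row = Record<string, string | number | null>;
type Transform = (rows: Row[]) => Row[];

// Drop rows whose given column is missing or empty (illustrative).
const dropEmpty = (col: string): Transform => rows =>
  rows.filter(r => r[col] !== null && r[col] !== "");

// Split one column into two on a delimiter (illustrative).
const split = (col: string, delim: string, into: [string, string]): Transform => rows =>
  rows.map(r => {
    const parts = String(r[col] ?? "").split(delim);
    return { ...r, [into[0]]: parts[0] ?? null, [into[1]]: parts[1] ?? null };
  });

// The transformation "program", using hypothetical column names.
const script: Transform[] = [
  dropEmpty("year"),
  split("location", ", ", ["city", "state"]),
];

// Apply the program to a table of rows.
const run = (rows: Row[]): Row[] => script.reduce((data, step) => step(data), rows);
```

Because the script is data rather than ad hoc code, the same sequence of statements could in principle be translated to SQL or a Map-Reduce job, which is the property that lets this approach scale beyond the desktop.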
DATA VISUALIZATION

Once data has been acquired and structured, visualization provides a powerful medium to spot patterns, test assumptions, form hypotheses and communicate findings. Our research group is a world-recognized leader in data visualization. Part of our work is empirical: we conduct human-subjects studies to assess the effectiveness of visual encoding choices. We then apply these findings to create improved visualization tools. For example, we have contributed novel visualization methods for networks, text, and genomics data, as well as new algorithms for automated visualization design.

However, visualization techniques are of little use if inaccessible to designers and analysts. Accordingly, my group researches architectures and tools for data visualization. We have developed a number of popular visualization tools, the most recent being Data-Driven Documents (D3.js). D3.js provides a grammar for authoring expressive visualizations by mapping data to the visual properties of web page elements (a minimal sketch of this pattern appears at the end of this section). Since its release in 2011, D3.js has become the de facto standard for web-based visualization. D3.js developers number in the tens of thousands, reaching millions of end users. D3.js is now widely used in industry, journalism (e.g., the New York Times) and science, including applications in atmospheric science, biology, bioinformatics, chemistry, geology, oceanography, physics and sociology.

However, D3.js and related tools still require non-trivial programming skills. In ongoing work, we are investigating the design of interactive systems that enable custom visualization design without writing code. In addition, we are developing techniques for scaling visual encoding and interaction to the large data volumes now common in scientific and industrial practice. By binning data into overlapping, multivariate data "tiles" and leveraging parallel query processing, our imMens system can sustain interactive querying at 50 frames per second over summaries of databases with more than a billion elements (the second sketch at the end of this section illustrates the binning idea). In a follow-up human-subjects experiment (currently in preparation), we have found that reduced interactive latency significantly improves the rates of both observation and hypothesis formation during exploratory data analysis.
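As a point of reference, the data join at the heart of D3.js looks roughly like the following. This is a minimal, self-contained example, with made-up data values, of binding data to page elements and mapping values to visual properties; it assumes a browser page with a body element and the d3 package installed.

```typescript
import * as d3 from "d3";

// Hypothetical data values for a simple bar display.
const data = [4, 8, 15, 16, 23, 42];

// The D3 data join: one div per datum, with the datum driving width.
d3.select("body")
  .selectAll("div")
  .data(data)
  .enter()
  .append("div")
  .style("width", (d: number) => `${d * 10}px`) // visual property from data
  .style("background", "steelblue")
  .style("color", "white")
  .text((d: number) => d);
```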
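The binned aggregation underlying data tiles can also be sketched briefly. The following is not the imMens implementation, only an illustration under simplified assumptions of the core idea: reduce raw records to counts over a fixed grid, so that query and rendering costs depend on the number of bins rather than the number of records.

```typescript
// Sketch of 2D binned aggregation: records in, a bins-by-bins count
// grid out. Brushing and re-rendering then touch only the grid.
interface Point { x: number; y: number; }

function binCounts(
  points: Point[],
  xMin: number, xMax: number,
  yMin: number, yMax: number,
  bins: number
): Uint32Array {
  const counts = new Uint32Array(bins * bins);
  const sx = bins / (xMax - xMin);
  const sy = bins / (yMax - yMin);
  for (const p of points) {
    // Clamp edge values into the last bin; skip out-of-range points.
    const bx = Math.min(bins - 1, Math.floor((p.x - xMin) * sx));
    const by = Math.min(bins - 1, Math.floor((p.y - yMin) * sy));
    if (bx >= 0 && by >= 0) counts[by * bins + bx] += 1;
  }
  return counts; // render as a heatmap, or query for linked views
}
```

In imMens itself, such summaries are organized as overlapping multivariate data tiles and aggregated with parallel query processing; the sketch above only conveys why interaction cost becomes independent of raw data size.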
ENABLING DATA-DRIVEN DISCOVERY

Though the projects above have been informed by data science trends in industry, our motivating insights also resonate within the natural sciences. In conversations and collaborations with scientists, we have observed that data acquisition, preparation and visualization remain critical concerns requiring significant effort. Moreover, as the sciences become increasingly data-rich, traditional processes of data collection will be augmented by data selection from a growing array of extant scientific databases. As data is repurposed for use outside its original context of collection, issues of data discovery, transformation and quality assessment will magnify, and underlie both scientific validity and reproducibility.

We propose to center our research efforts squarely on the challenges facing data-driven science. As in our study of enterprise analysts, we will begin with a qualitative assessment of current data analysis practices among our colleagues in the natural sciences. Our recent move from Stanford to UW provides new opportunities for this work, given strong ties to domain scientists through the UW eScience Institute. For example, for my first Data Visualization class at UW, I solicited student project ideas from my eScience colleagues. The response was tremendous, including projects from astronomy, biology, chemistry, physics, oceanography and seismology. We received many more projects than the current class of 60 students can field. This response indicates both strong interest and a recognized need for improved tools.

We expect scientific "payoffs" to arrive through multiple means. The first is targeted collaborations with domain scientists. Based on our interviews, we will identify research groups with challenges that (a) require novel tools-oriented research and (b) will produce impactful scientific findings if addressed. We followed a similar strategy at Stanford, embedding in multiple bioinformatics and biology research groups, leading to new visual analysis tools for population-scale genomics data. We have also collaborated with biologists studying ant colonies, producing new findings from decades of field study data. Second, we expect more widespread breakthroughs to result from the availability of novel analysis tools and through the education and enculturation of a new generation of scientists. We will research interactive tools for data preparation and visualization with a focus on scientific data, and initiate new projects for search, discovery and visualization in existing scientific databases. We will make all developed tools freely available as open-source software, along with requisite examples and documentation. In addition to cross-disciplinary coursework, we will develop tutorials (both in-person and online), institute a campus-wide data seminar series, and offer "data science office hours" for data preparation and analysis help.

Moore Foundation support would be transformative for these efforts. Funding will enable our group to focus squarely on scientific discovery and expand our efforts and outreach. In addition to more student research assistantships, we will hire a research programmer to strengthen tool building and support. These efforts would complement and magnify existing Moore/Sloan support to UW, for which I am not a PI.

For more about our group, research and tools, please visit http://idl.cs.washington.edu/