Communications of the ACM - June 2016

Transcription

COMMUNICATIONS OF THE ACM
CACM.ACM.ORG
06/2016 VOL.59 NO.06
Whitfield Diffie
Martin E. Hellman
Recipients of ACM’s A.M. Turing Award
Association for Computing Machinery
Is Internet software so different from “ordinary” software? This book practically answers this question through the presentation of a software design method based on the State Chart XML W3C standard along with Java. Web enterprise, Internet-of-Things, and Android applications, in particular, are seamlessly specified and implemented from “executable models.”

Internet software puts forward the idea of event-driven or reactive programming, as pointed out in Bonér et al.’s “Reactive Manifesto”. It tells us that reactiveness is a must. However, beyond concepts, software engineers require effective means with which to put reactive programming into practice. Reactive Internet Programming outlines and explains such means.

The lack of professional examples in the literature that illustrate how reactive software should be shaped can be quite frustrating. Therefore, this book helps to fill in that gap by providing in-depth professional case studies that contain comprehensive details and meaningful alternatives. Furthermore, these case studies can be downloaded for further investigation.

Internet software requires higher adaptation, at run time in particular. After reading Reactive Internet Programming, you will be ready to enter the forthcoming Internet era.
this.splash 2016
Sun 30 October – Fri 4 November 2016
Amsterdam
ACM SIGPLAN Conference on Systems, Programming, Languages and Applications:
Software for Humanity (SPLASH)
OOPSLA
Novel research on software development and programming
Onward!
Radical new ideas and visions related to programming and software
SPLASH-I
World class speakers on current topics in software, systems, and languages research
SPLASH-E
Researchers and educators share educational results, ideas, and challenges
DLS
Dynamic languages, implementations, and applications
GPCE
Generative programming: concepts and experiences
SLE
Principles of software language engineering, language design, and evolution
Biermann
SPLASH General Chair: Eelco Visser
SLE General Chair: Tijs van der Storm
OOPSLA Papers: Yannis Smaragdakis
SLE Papers: Emilie Balland, Daniel Varro
OOPSLA Artifacts: Michael Bond, Michael Hind
GPCE General Chair: Bernd Fischer
Onward! Papers: Emerson Murphy-Hill
GPCE Papers: Ina Schaefer
Onward! Essays: Crista Lopes
Student Research Competition: Sam Guyer, Patrick Lam
SPLASH-I: Eelco Visser, Tijs van der Storm
Posters: Jeff Huang, Sebastian Erdweg
SPLASH-E: Matthias Hauswirth, Steve Blackburn
Publications: Alex Potanin
Mövenpick Amsterdam
DLS: Roberto Ierusalimschy
Publicity and Web: Tijs van der Storm, Ron Garcia
Workshops: Jan Rellermeyer, Craig Anslow
Student Volunteers: Daco Harkes
@splashcon
2016.splashcon.org
bit.ly/splashcon16
COMMUNICATIONS OF THE ACM
Departments

5 From the President
Moving Forward
By Alexander L. Wolf

7 Cerf’s Up
Celebrations!
By Vinton G. Cerf

8 Letters to the Editor
No Backdoor Required or Expected

10 BLOG@CACM
The Solution to AI, What Real Researchers Do, and Expectations for CS Classrooms
John Langford on AlphaGo, Bertrand Meyer on Research as Research, and Mark Guzdial on correlating CS classes with laboratory results.

29 Calendar

Last Byte
112 Q&A
Finding New Directions in Cryptography
Whitfield Diffie and Martin Hellman on their meeting, their research, and the results that billions use every day.
By Leah Hoffmann

News

12 Turing Profile
The Key to Privacy
40 years ago, Whitfield Diffie and Martin Hellman introduced the public key cryptography used to secure today’s online transactions.
By Neil Savage
Watch the Turing recipients discuss their work in this exclusive Communications video. http://cacm.acm.org/videos/the-key-to-privacy

15 What Happens When Big Data Blunders?
Big data is touted as a cure-all for challenges in business, government, and healthcare, but as disease outbreak predictions show, big data often fails.
By Logan Kugler

17 Reimagining Search
Search engine developers are moving beyond the problem of document analysis, toward the elusive goal of figuring out what people really want.
By Alex Wright

20 What’s Next for Digital Humanities?
New computational tools spur advances in an evolving field.
By Gregory Mone

Viewpoints

22 Inside Risks
The Risks of Self-Auditing Systems
Unforeseen problems can result from the absence of impartial independent evaluations.
By Rebecca T. Mercuri and Peter G. Neumann

26 Kode Vicious
What Are You Trying to Pull?
A single cache miss is more expensive than many instructions.
By George V. Neville-Neil

28 The Profession of IT
How to Produce Innovations
Making innovations happen is surprisingly easy, satisfying, and rewarding if you start small and build up.
By Peter J. Denning

31 Interview
An Interview with Yale Patt
ACM Fellow Professor Yale Patt reflects on his career.
By Derek Chiou
Watch Patt discuss his work in this exclusive Communications video. http://cacm.acm.org/videos/an-interview-with-yale-patt
For the full-length video, please visit https://vimeo.com/an-interview-with-yale-patt

Association for Computing Machinery
Advancing Computing as a Science & Profession
Viewpoints (cont’d.)

37 Viewpoint
Computer Science Should Stay Young
Seeking to improve computer science publication culture while retaining the best aspects of the conference and journal publication processes.
By Boaz Barak

39 Viewpoint
Privacy Is Dead, Long Live Privacy
Protecting social norms as confidentiality wanes.
By Jean-Pierre Hubaux and Ari Juels

42 Viewpoint
A Byte Is All We Need
A teenager explores ways to attract girls into the magical world of computer science.
By Ankita Mitra

Practice

45 Nine Things I Didn’t Know I Would Learn Being an Engineer Manager
Many of the skills aren’t technical at all.
By Kate Matsudaira

48 The Flame Graph
This visualization of software execution is a new necessity for performance profiling and debugging.
By Brendan Gregg

58 Standing on Distributed Shoulders of Giants
Farsighted physicists of yore were danged smart!
By Pat Helland

Articles’ development led by queue.acm.org

Contributed Articles

62 Improving API Usability
Human-centered design can make application programming interfaces easier for developers to use.
By Brad A. Myers and Jeffrey Stylos

70 Physical Key Extraction Attacks on PCs
Computers broadcast their secrets via inadvertent physical emanations that are easily measured and exploited.
By Daniel Genkin, Lev Pachmanov, Itamar Pipman, Adi Shamir, and Eran Tromer

Review Articles

80 RandNLA: Randomized Numerical Linear Algebra
Randomization offers new benefits for large-scale linear computations.
By Petros Drineas and Michael W. Mahoney

Research Highlights

92 Technical Perspective
Veritesting Tackles Path-Explosion Problem
By Koushik Sen

93 Enhancing Symbolic Execution with Veritesting
By Thanassis Avgerinos, Alexandre Rebert, Sang Kil Cha, and David Brumley

101 Technical Perspective
Computing with the Crowd
By Siddharth Suri

102 AutoMan: A Platform for Integrating Human-Based and Digital Computation
By Daniel W. Barowy, Charlie Curtsinger, Emery D. Berger, and Andrew McGregor

About the Cover:
Whitfield Diffie (left) and Martin E. Hellman, cryptography pioneers and recipients of the 2015 ACM A.M. Turing Award, photographed at Stanford University’s Huang Center in March. Photographed by Richard Morgenstein, http://www.morgenstein.com/
COMMUNICATIONS OF THE ACM
Trusted insights for computing’s leading professionals.
Communications of the ACM is the leading monthly print and online magazine for the computing and information technology fields.
Communications is recognized as the most trusted and knowledgeable source of industry information for today’s computing professional.
Communications brings its readership in-depth coverage of emerging areas of computer science, new trends in information technology,
and practical applications. Industry leaders use Communications as a platform to present and debate various technology implications,
public policies, engineering challenges, and market trends. The prestige and unmatched reputation that Communications of the ACM
enjoys today is built upon a 50-year commitment to high-quality editorial content and a steadfast dedication to advancing the arts,
sciences, and applications of information technology.
ACM, the world’s largest educational
and scientific computing society, delivers
resources that advance computing as a
science and profession. ACM provides the
computing field’s premier Digital Library
and serves its members and the computing
profession with leading-edge publications,
conferences, and career resources.
Executive Director and CEO
Bobby Schnabel
Deputy Executive Director and COO
Patricia Ryan
Director, Office of Information Systems
Wayne Graves
Director, Office of Financial Services
Darren Ramdin
Director, Office of SIG Services
Donna Cappo
Director, Office of Publications
Bernard Rous
Director, Office of Group Publishing
Scott E. Delman
ACM COUNCIL
President
Alexander L. Wolf
Vice-President
Vicki L. Hanson
Secretary/Treasurer
Erik Altman
Past President
Vinton G. Cerf
Chair, SGB Board
Patrick Madden
Co-Chairs, Publications Board
Jack Davidson and Joseph Konstan
Members-at-Large
Eric Allman; Ricardo Baeza-Yates;
Cherri Pancake; Radia Perlman;
Mary Lou Soffa; Eugene Spafford;
Per Stenström
SGB Council Representatives
Paul Beame; Jeanna Neefe Matthews;
Barbara Boucher Owens
STAFF
EDITOR-IN-CHIEF
Moshe Y. Vardi
[email protected]
Executive Editor
Diane Crawford
Managing Editor
Thomas E. Lambert
Senior Editor
Andrew Rosenbloom
Senior Editor/News
Larry Fisher
Web Editor
David Roman
Rights and Permissions
Deborah Cotton
Art Director
Andrij Borys
Associate Art Director
Margaret Gray
Assistant Art Director
Mia Angelica Balaquiot
Designer
Iwona Usakiewicz
Production Manager
Lynn D’Addesio
Director of Media Sales
Jennifer Ruzicka
Publications Assistant
Juliet Chance
Columnists
David Anderson; Phillip G. Armour;
Michael Cusumano; Peter J. Denning;
Mark Guzdial; Thomas Haigh;
Leah Hoffmann; Mari Sako;
Pamela Samuelson; Marshall Van Alstyne
CONTACT POINTS
Copyright permission
[email protected]
Calendar items
[email protected]
Change of address
[email protected]
Letters to the Editor
[email protected]
BOARD CHAIRS
Education Board
Mehran Sahami and Jane Chu Prey
Practitioners Board
George Neville-Neil
WEBSITE
http://cacm.acm.org
AUTHOR GUIDELINES
http://cacm.acm.org/
REGIONAL COUNCIL CHAIRS
ACM Europe Council
Dame Professor Wendy Hall
ACM India Council
Srinivas Padmanabhuni
ACM China Council
Jiaguang Sun
ACM ADVERTISING DEPARTMENT
2 Penn Plaza, Suite 701, New York, NY
10121-0701
T (212) 626-0686
F (212) 869-0481
PUBLICATIONS BOARD
Co-Chairs
Jack Davidson; Joseph Konstan
Board Members
Ronald F. Boisvert; Anne Condon;
Nikil Dutt; Roch Guerin; Carol Hutchins;
Yannis Ioannidis; Catherine McGeoch;
M. Tamer Ozsu; Mary Lou Soffa; Alex Wade;
Keith Webster
Director of Media Sales
Jennifer Ruzicka
[email protected]
For display, corporate/brand advertising:
Craig Pitcher
[email protected] T (408) 778-0300
William Sleight
[email protected] T (408) 513-3408
ACM U.S. Public Policy Office
Renee Dopplick, Director
1828 L Street, N.W., Suite 800
Washington, DC 20036 USA
T (202) 659-9711; F (202) 667-1066
DIRECTOR OF GROUP PUBLISHING
Scott E. Delman
[email protected]
Media Kit [email protected]
EDITORIAL BOARD
NEWS
Co-Chairs
William Pulleyblank and Marc Snir
Board Members
Mei Kobayashi; Michael Mitzenmacher;
Rajeev Rastogi
VIEWPOINTS
Co-Chairs
Tim Finin; Susanne E. Hambrusch;
John Leslie King
Board Members
William Aspray; Stefan Bechtold;
Michael L. Best; Judith Bishop;
Stuart I. Feldman; Peter Freeman;
Mark Guzdial; Rachelle Hollander;
Richard Ladner; Carl Landwehr;
Carlos Jose Pereira de Lucena;
Beng Chin Ooi; Loren Terveen;
Marshall Van Alstyne; Jeannette Wing
PRACTICE
Co-Chair
Stephen Bourne
Board Members
Eric Allman; Peter Bailis; Terry Coatta;
Stuart Feldman; Benjamin Fried;
Pat Hanrahan; Tom Killalea; Tom Limoncelli;
Kate Matsudaira; Marshall Kirk McKusick;
George Neville-Neil; Theo Schlossnagle;
Jim Waldo
The Practice section of the CACM Editorial Board also serves as the Editorial Board of acmqueue.
CONTRIBUTED ARTICLES
Co-Chairs
Andrew Chien and James Larus
Board Members
William Aiello; Robert Austin; Elisa Bertino;
Gilles Brassard; Kim Bruce; Alan Bundy;
Peter Buneman; Peter Druschel; Carlo Ghezzi;
Carl Gutwin; Yannis Ioannidis;
Gal A. Kaminka; James Larus; Igor Markov;
Gail C. Murphy; Bernhard Nebel;
Lionel M. Ni; Kenton O’Hara; Sriram Rajamani;
Marie-Christine Rousset; Avi Rubin;
Krishan Sabnani; Ron Shamir; Yoav
Shoham; Larry Snyder; Michael Vitale;
Wolfgang Wahlster; Hannes Werthner;
Reinhard Wilhelm
RESEARCH HIGHLIGHTS
Co-Chairs
Azer Bestavros and Gregory Morrisett
Board Members
Martin Abadi; Amr El Abbadi; Sanjeev Arora;
Nina Balcan; Dan Boneh; Andrei Broder;
Doug Burger; Stuart K. Card; Jeff Chase;
Jon Crowcroft; Sandhya Dwarkadas;
Matt Dwyer; Alon Halevy; Norm Jouppi;
Andrew B. Kahng; Sven Koenig; Xavier Leroy;
Steve Marschner; Kobbi Nissim;
Steve Seitz; Guy Steele, Jr.; David Wagner;
Margaret H. Wright; Andreas Zeller
ACM Copyright Notice
Copyright © 2016 by Association for
Computing Machinery, Inc. (ACM).
Permission to make digital or hard copies
of part or all of this work for personal
or classroom use is granted without
fee provided that copies are not made
or distributed for profit or commercial
advantage and that copies bear this
notice and full citation on the first
page. Copyright for components of this
work owned by others than ACM must
be honored. Abstracting with credit is
permitted. To copy otherwise, to republish,
to post on servers, or to redistribute to
lists, requires prior specific permission
and/or fee. Request permission to publish
from [email protected] or fax
(212) 869-0481.
For other copying of articles that carry a
code at the bottom of the first or last page
or screen display, copying is permitted
provided that the per-copy fee indicated
in the code is paid through the Copyright
Clearance Center; www.copyright.com.
Subscriptions
An annual subscription cost is included
in ACM member dues of $99 ($40 of
which is allocated to a subscription to
Communications); for students, cost
is included in $42 dues ($20 of which
is allocated to a Communications
subscription). A nonmember annual
subscription is $269.
ACM Media Advertising Policy
Communications of the ACM and other
ACM Media publications accept advertising
in both print and electronic formats. All
advertising in ACM Media publications is
at the discretion of ACM and is intended
to provide financial support for the various
activities and services for ACM members.
Current advertising rates can be found
by visiting http://www.acm-media.org or
by contacting ACM Media Sales at
(212) 626-0686.
Single Copies
Single copies of Communications of the
ACM are available for purchase. Please
contact [email protected].
COMMUNICATIONS OF THE ACM
(ISSN 0001-0782) is published monthly
by ACM Media, 2 Penn Plaza, Suite 701,
New York, NY 10121-0701. Periodicals
postage paid at New York, NY 10001,
and other mailing offices.
POSTMASTER
Please send address changes to
Communications of the ACM
2 Penn Plaza, Suite 701
New York, NY 10121-0701 USA
Printed in the U.S.A.
Computer Science Teachers Association
Mark R. Nelson, Executive Director
WEB
Chair
James Landay
Board Members
Marti Hearst; Jason I. Hong;
Jeff Johnson; Wendy E. MacKay
Association for Computing Machinery
(ACM)
2 Penn Plaza, Suite 701
New York, NY 10121-0701 USA
T (212) 869-7440; F (212) 869-0481
from the president
DOI:10.1145/2933245
Alexander L. Wolf
Moving Forward
AS MY TENURE as ACM president ends, I find myself reflecting on the past two years and so I looked back at my 2014 election position statement.
[W]e must confront the reality that
what ACM contributed to the computing
profession for more than 65 years might
not sustain it in the future …
ACM was formed in 1947 by a small
group of scientists and engineers who
had helped usher in the computer age
during World War II. They saw ACM
as a means for professionals, primarily mathematicians and electrical engineers, to exchange and curate technical information about “computing
machinery.” The fact that ACM is now
home to a much broader community of
interests, with members literally spanning the globe, was likely well beyond
their imagination.
Conferences and publications remain the primary means by which our
organization sustains itself. I worried
in 2014 that revenue would eventually
fall, and that we needed to prepare. I
pointed out in a 2015 Communications
letter that conference surpluses go directly back to the SIGs, while publication surpluses are used to subsidize
the entire enterprise: allowing student
members everywhere, and reduced-rate professional members in developing regions, to receive full member
benefits; contributing an additional
$3M per year to the SIGs; and supporting in entirety our volunteer-driven efforts in education, inclusion, and public policy. The specter of open access
undercutting the library subscription
business created many uncertainties,
some of which remain to this day.
Two years on, some things are coming into better focus, giving hope that
conferences and publications will remain viable revenue sources.
As it turns out, the popularity of
our conferences continues to rise with
overall conference attendance steadily
increasing. I attribute this to the growing importance and influence of computing, and the broadening of ACM’s
constituency and audience.
We have empowered authors and
conference organizers with new open
access options. Yet the uptake of Gold
(“author pays”) open access is surprisingly slow and the growth of the
subscription business is surprisingly
robust. Perhaps most profound is the
realization that the marketable value
of ACM’s Digital Library derives not
so much from access to individual articles, as from access to the collection
and the services that leverage and enhance the collection. In other words,
ACM sells subscriptions to a collection,
so in a sense open access to articles is
not the immediate threat. Moreover,
there is a potential future business to
be built around government mandates
for open data, reproducible computation, and digital preservation generally
that takes us far beyond today’s simple
PDF artifact and collection index.
We must recognize that the nature
of community, community identity, and
“belonging” is evolving rapidly …
What is the value of being formally
associated with ACM? This seemingly
simple and fundamental question
comes up so often that the answer
should be obvious and immediate.
Twenty years ago, perhaps it was. Today, although I personally feel the value,
I struggle to articulate an answer that I
am confident will convince someone
new to the community already engaged
with others through means falling outside the traditional ACM circle.
What I do know is that remarkably
few people are aware of the important
and impactful volunteer activities beyond conferences and publications
that are supported by ACM. This seems
to be the case whether the person is
one of our more than 100,000 dues-paying members or one of the millions
of non-dues-paying participants and
beneficiaries in ACM activities.
That is why I sought to “change the
conversation” around ACM, from merely
serving as computing’s premier conference sponsor and publisher to also being
a potent and prominent force for good
in the community. My goal was to raise
awareness that ACM, as a professional
society, offers a uniquely authoritative,
respected voice, one that can amplify
the efforts of individuals in a way that
an ad hoc social network cannot. That
ACM and its assets are at the disposal
of its members and volunteer leaders to
drive its agenda forward. And that being
a member of this organization is a statement in support of that agenda. Getting
this message out is largely about how
ACM presents itself to the world through
its communication channels, which are
in the process of a long-overdue refresh.
ACM’s services and programs are
founded on three vital pillars: energetic
volunteers, dedicated HQ staff, and a
sufficient and reliable revenue stream …
The most rewarding experiences I
had as president were visits with the
many communities within the community that is ACM: SIGs, chapters, boards,
and committees. Each different, yet
bound by a commitment to excellence
that is our organization’s hallmark. Enabling those communities is a professional staff as passionate about ACM as
its members. They deserve our thanks
and respect.
As I end my term, I wish the next
president well in continuing to move
the organization forward. You have
great people to work with and an important legacy to continue.
Alexander L. Wolf is president of ACM and a professor
in the Department of Computing at Imperial College
London, UK.
Copyright held by author.
The National Academies of
SCIENCES • ENGINEERING • MEDICINE
ARL Distinguished Postdoctoral Fellowships
The Army Research Laboratory (ARL) is the nation’s premier laboratory for
land forces. The civilians working at ARL and its predecessors have had
many successes in basic and applied research. Currently, ARL scientists and
engineers are pioneering research in such areas as neuroscience, energetic
materials and propulsion, electronics technologies, network sciences, virtual
interfaces and synthetic environments and autonomous systems. They are
leaders in modeling and simulation and have high performance computing
resources on-site. They are expanding into frontier areas, including fields such
as quantum information and quantum networks.
We invite outstanding young researchers to participate in this excitement
as ARL Distinguished Postdoctoral Fellows. These Fellows will display
extraordinary ability in scientific research and show clear promise of
becoming outstanding future leaders. Candidates are expected to have
already successfully tackled a major scientific or engineering problem or to
have provided a new approach or insight evidenced by a recognized impact
in their field. ARL offers these named Fellowships in honor of distinguished
researchers and work that has been performed at Army labs.
The ARL Distinguished Postdoctoral Fellowships are three-year appointments. The annual stipend is $100,000, and the fellowship includes benefits and potential additional funding for selected proposals. Applicants must hold a Ph.D., awarded within the past three years, at the time of application. For complete application instructions and more information, visit: http://sites.nationalacademies.org/PGA/Fellowships/ARL.
Applications must be received by July 1, 2016.

Advertise with ACM!
Reach the innovators and thought leaders working at the cutting edge of computing and information technology through ACM’s magazines, websites and newsletters.
Request a media kit with specifications and pricing:
Craig Pitcher
408-778-0300 ◆ [email protected]
Bill Sleight
408-513-3408 ◆ [email protected]
World-Renowned Journals from ACM
ACM publishes over 50 magazines and journals that cover an array of established as well as emerging areas of the computing field.
IT professionals worldwide depend on ACM's publications to keep them abreast of the latest technological developments and industry
news in a timely, comprehensive manner of the highest quality and integrity. For a complete listing of ACM's leading magazines & journals,
including our renowned Transaction Series, please visit the ACM publications homepage: www.acm.org/pubs.
ACM Transactions
on Interactive
Intelligent Systems
ACM Transactions
on Computation
Theory
ACM Transactions on Interactive
Intelligent Systems (TIIS). This
quarterly journal publishes papers
on research encompassing the
design, realization, or evaluation of
interactive systems incorporating
some form of machine intelligence.
ACM Transactions on Computation
Theory (ToCT). This quarterly peer-reviewed journal has an emphasis
on computational complexity, foundations of cryptography and other
computation-based topics in theoretical computer science.
PLEASE CONTACT ACM MEMBER
SERVICES TO PLACE AN ORDER
Phone:
1.800.342.6626 (U.S. and Canada)
+1.212.626.0500 (Global)
Fax:
+1.212.944.1318
(Hours: 8:30am–4:30pm, Eastern Time)
Email:
[email protected]
Mail:
ACM Member Services
General Post Office
PO Box 30777
New York, NY 10087-0777 USA
www.acm.org/pubs
cerf’s up
DOI:10.1145/2933148
Vinton G. Cerf
Celebrations!
There is a rhythm in the affairs of the
Association for Computing Machinery and
June marks our annual celebration of award
recipients and the biennial election of new
officers. I will end my final year as
past president, Alex Wolf will begin
his first year in that role, and a new
president and other officers will take
their places in the leadership. June
also marks Bobby Schnabel’s first
appearance at our annual awards
event in his role as CEO of ACM. I am
especially pleased that two former
Stanford colleagues, Martin Hellman
and Whitfield Diffie, are receiving
the ACM A.M. Turing Award this year.
Nearly four decades have passed since
their seminal description of what has
become known as public key cryptography and in that time the technology
has evolved and suffused into much
of our online and offline lives.
In another notable celebration,
Alphabet, the holding company that
includes Google, saw its AlphaGo system from DeepMind win four of five
GO games in Seoul against a world
class human player. The complexity
of the state space of GO far exceeds
that of chess and many of us were
surprised to see how far neural networks have evolved in what seems
such a short period of time. Interestingly, the system tries to keep track
of its own confidence level as it uses
the state of the board to guide its
choices of next possible moves. We
are reminded once again how complexity arises from what seems to be
the simplest of rules.
While we are celebrating advances
in artificial intelligence, other voices
are forecasting a dark fate for humanity. Intelligent machines, once they
can match a human capacity, will go
on to exceed it, they say. Indeed, our
supercomputers and cloud-based
systems can do things that no human
can do, particularly with regard to
“big data.” Some of us, however, see
the evolution of computing capability in terms of partnership. When you
do a search on the World Wide Web
or use Google to translate from one
language to another, you are making
use of powerful statistical methods,
parsing, and semantic graphs to approximate what an accomplished
multilingual speaker might do. These
translations are not perfect but they
have been improving over time. This
does not mean, however, that the
programs understand in the deepest
cognitive sense what the words and
sentences mean. In large measure,
such translation rests on strong correlation and grammar. This is not
to minimize the utility of such programs—they enhance our ability to
communicate across language barriers. They can also create confusion
when misinterpretation of colloquialisms or other nuances interfere
with precision.
One has to appreciate, however,
the role of robotics in manufacturing in today’s world. The Tesla factory in Fremont, CA, is a marvel of
automation [a] and there are many
other examples, including the process of computer chip production
that figures so strongly in the work of
ACM’s members. Automation can be
considered an aspect of artificial intelligence if by this we mean the autonomous manipulation of the real
world. Of course, one can also argue,
as I have in the past, that stock market trading programs are robotic in
the sense they receive inputs, perform analysis, and take actions that
affect the real world (for example, our
bank accounts). Increasingly, we see
software carrying out tasks in largely
autonomous ways, including the dramatic progress made in self-driving
cars. Apart from what we usually call
artificial intelligence, it seems important to think about software that goes
about its operation with little or no
human intervention. I must confess, I
am still leery of the software that runs
the massage chairs at Google—thinking that a bug might cause the chair to
fold up while I am sitting in it!
While we celebrate the advances
made in artificial intelligence and autonomous systems, we also have an
obligation to think deeply about potential malfunctions and their consequences. This certainly persuades me
to keep in mind safety and reliability
to say nothing of security, privacy, and
usability, as we imbue more and more
appliances and devices with programmable features and the ability to communicate through the Internet.

[a] https://www.youtube.com/watch?v=TuC8drQmXjg
Copyright held by author.
Vinton G. Cerf is vice president and Chief Internet Evangelist
at Google. He served as ACM president from 2012–2014.
COMMUNICATIONS APPS
Access the latest issue, past issues, BLOG@CACM, News, and more.
Available for iPad, iPhone, and Android
Available for iOS, Android, and Windows
http://cacm.acm.org/about-communications/mobile-apps

letters to the editor
DOI:10.1145/2931085
No Backdoor Required or Expected
I WAS DISAPPOINTED by Eugene
H. Spafford’s column “The
Strength of Encryption” (Mar.
2016) in which Spafford conflated law enforcement requests for access to the contents of
specific smartphones with the prospect of the government requiring
backdoors through which any device
could be penetrated. These are separate issues. Even if the methods the
FBI ultimately used to unlock a particular Apple iPhone 5C earlier this
year are too elaborate for the hundreds of encrypted or code-protected
phones now in police custody, the
principle—that it is a moral if not legal responsibility for those with the
competence to open the phones to do
so—would still be relevant.
Unlocking an individual phone
would not legally compel a backdoor
into all Apple devices. Rather, Apple
would have to create and download
into a particular target phone only a
version of iOS that does two things—
return to requesting password entry
after a failed attempt, without invoking the standard iOS delay-andattempt-count code and allow password attempts at guessing the correct
password be submitted electronically
rather than through physical taps on
the phone’s keypad. The first is clearly trivial, and the second is, I expect,
easily achieved.
The FBI would then observe, at an
Apple facility, the modified iOS being
downloaded and be able to run multiple brute-force password attempts
against it. When the phone is eventually unlocked, the FBI would have
the former user’s correct password.
Apple could then reload the original
iOS, and the FBI could take away the
phone and the password and access
the phone’s contents without further
Apple involvement.
No backdoor would have been released. No existing encryption security
would have been compromised. Other
law-enforcement agencies, armed with
judicial orders, would likewise expect
compliance—and should receive it.
The secondary argument—that
should Apple comply, authoritarian regimes worldwide would demand
the same sort of compliance from
Apple, as well as from other manufacturers—is a straw man. Since Apple
and other manufacturers, as well as
researchers, have acknowledged they
are able to gain access to the contents
of encrypted phones, other regimes are
already able to make such demands,
independent of the outcome of any
specific case.
R. Gary Marquart, Austin, TX
Author Responds:
My column was written and published
before the FBI vs. Apple lawsuit occurred
and was on the general issue of encryption
strength and backdoors. Nowhere in it did
I mention either Apple or the FBI. I also
made no mention of “unlocking” cellphones,
iOS, or passwords. I am thus unable
to provide any reasonable response to
Marquart’s objections as to items not in it.
Eugene H. Spafford, West Lafayette, IN
The What in the GNU/Linux Name
George V. Neville-Neil’s Kode Vicious
column “GNL Is Not Linux” (Apr. 2016)
would have been better if it had ended
with the opening paragraph. Instead
Neville-Neil recapped yet again the
history of Unix and Linux, then went
off the rails, hinting, darkly, at ulterior
motives behind GPL, particularly that
it is anti-commercial. Red Hat’s billions in revenue ($1.79 billion in 2015)
should put such an assertion to rest.
The Free Software Foundation apparently has no problem with individuals or companies making money from
free software.
We do not call houses by the tools
we use to build them, as in, say, “… a
Craftsman/House, a Makita/House, or
a Home Depot/House …” in Neville-Neil’s example. But we do call a house
made of bricks a brick house in a nomenclature that causes no confusion.
Why then would it be confusing to call a system with a Linux kernel and a user space largely from the GNU project a “GNU/Linux system”? Including “GNU” in the name seems to be a problem only for people with an anti-GNU bias or misunderstanding of GPL, both of which Neville-Neil exhibited through his “supposedly” slight (in paragraph 10) intended to cast aspersions on the Hurd operating system project and the dig (as I read it) at GPLv3 for being more restrictive than GPLv2. However, in fairness, GPLv3 is more restrictive and explicit about not allowing patents to circumvent the freedoms inherent in a license otherwise granted by copyright. As Neville-Neil appeared disdainful of the GPLv2 methods of securing users’ freedoms, it is not surprising he would take a negative view of GPLv3.
Neville-Neil also suggested the “GNU/Linux” name is inappropriate, as it reflects the tools used to build the kernel. But as Richard Stallman explained in his 2008 article “Linux and the GNU System” (http://www.gnu.org/gnu/linux-and-gnu.html) to which Neville-Neil linked in his column, a typical Linux distribution includes more code from the GNU project than from the Linux kernel project. Perhaps Neville-Neil should pour himself a less-“strong beverage” and read Stallman’s article again. He may find himself much less confused by the “GNU/Linux” name.
Todd M. Lewis, Sanford, NC
Author Responds:
Lewis hints at my anti-GPL bias, though
I have been quite direct in my opposition
to any open source license that restricts
the freedoms of those using the code, as
is done explicitly by the GPLv2 licenses.
Open source means just that—open, free to
everyone, without strings, caveats, codicils,
or clawbacks. As for a strong drink and a reread of anything from Richard Stallman it
would have to be a very strong drink indeed
to induce me to do it again.
George V. Neville-Neil, Brooklyn, NY
Diversity and ‘CS for All’
Vinton G. Cerf’s Cerf’s Up column “Enrollments Explode! But diversity students are leaving …” (Apr. 2016) on diversity in computer science education
and Lawrence M. Fisher’s news story
on President Barack Obama’s “Computer Science for All” initiative made
us think Communications readers
might be interested in our experience
at Princeton University over the past
decade dramatically increasing both
CS enrollments in general and the percentage of women in CS courses. As
of the 2015–2016 academic year, our
introductory CS class was the highest-enrolled class at Princeton and included over 40% women, with the number
and percentage of women CS majors
approaching similar levels.
Our approach is to teach a CS course
for everyone, focusing outwardly on
applications in other disciplines,
from biology and physics to art and
music.1 We begin with a substantive
programming component, with each
concept introduced in the context
of an engaging application, ranging
from simulating the vibration of a
guitar string to generate sound to implementing Markov language models
to computing DNA sequence alignments. This foundation allows us to
consider the great intellectual contributions of Turing, Shannon, von
Neumann, and others in a scientific
context. We have also had success
embracing technology, moving to
active learning with online lectures.2
We feel CS is something every college
student can and must learn, no matter
what their intended major, and there
is much more to it than programming
alone. Weaving CS into the fabric of
modern life and a broad educational
experience in this way is valuable to
all students, particularly women and
underrepresented minorities. Other
institutions adopting a similar approach have had similar success.
Meanwhile, we have finally (after
25 years of development) completed
our CS textbook Computer Science, An
Interdisciplinary Approach (Addison-Wesley, 2016), which we feel can stand
alongside standard textbooks in biology, physics, economics, and other
disciplines. It will be available along
with studio-produced lectures and associated Web content (http://introcs.cs.princeton.edu) that attract more
than one million visitors per year.
Over the next few years, we will seek
opportunities to disseminate these
materials to as many teachers and
learners as possible. Other institutions
will be challenged to match our numbers, particularly percentage of women
engaged in CS. It is an exciting time.
References
1. Hulette, D. ‘Computer Science for All’ (Really). Princeton University, Princeton, NJ, Mar. 1, 2016; https://www.cs.princeton.edu/news/‘computer-science-all’-really
2. Sedgewick, R. A 21st Century Model for Disseminating Knowledge. Princeton University, Princeton, NJ; http://www.cs.princeton.edu/~rs/talks/Model.pdf
Robert Sedgewick and Kevin Wayne, Princeton, NJ
Communications welcomes your opinion. To submit a
Letter to the Editor, please limit yourself to 500 words or
less, and send to [email protected].
© 2016 ACM 0001-0782/16/06 $15.00
Coming Next Month in COMMUNICATIONS
The Rise of Social Bots
Statistics for Engineers
On the Growth
of Polyominoes
Turing’s Red Flag
Should You Upload
or Ship Big Data
to the Cloud?
Inverse Privacy
Formula-Based
Software Debugging
The Motivation for
a Monolithic Codebase
Mesa: Geo-Replicated
Online Data
Warehouse for Google’s
Advertising System
Plus the latest news about
solving graph isomorphism,
AI and the LHC, and apps
that fight parking tickets.
The Communications Web site, http://cacm.acm.org,
features more than a dozen bloggers in the BLOG@CACM
community. In each issue of Communications, we’ll publish
selected posts or excerpts.
Follow us on Twitter at http://twitter.com/blogCACM
DOI:10.1145/2911969 http://cacm.acm.org/blogs/blog-cacm
The Solution to AI,
What Real Researchers
Do, and Expectations
for CS Classrooms
John Langford on AlphaGo, Bertrand Meyer on Research as Research,
and Mark Guzdial on correlating CS classes with laboratory results.
John Langford
AlphaGo Is Not
the Solution to AI
http://bit.ly/1QSqgHW
March 14, 2016
Congratulations are in
order for the folks at Google Deepmind
(https://deepmind.com) who have
mastered Go (https://deepmind.com/alpha-go.html).
However, some of the discussion
around this seems like giddy overstatement. Wired says, “machines
have conquered the last games”
(http://bit.ly/200O5zG) and Slashdot
says, “we know now that we don’t need
any big new breakthroughs to get to true
AI” (http://bit.ly/1q0Pcmg). The truth is
nowhere close.
For Go itself, it has been well known
for a decade that Monte Carlo tree
search (MCTS, http://bit.ly/1YbLm4M;
that is, valuation by assuming randomized playout) is unusually effective in
Go. Given this, it is unclear the AlphaGo
algorithm extends to other board games
where MCTS does not work so well.
Maybe? It will be interesting to see.
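(For readers who have not seen the technique: “valuation by assuming randomized playout” means scoring a position by playing many uniformly random games to completion from it and averaging the outcomes. The Java sketch below illustrates only that rollout valuation on a made-up toy game; the interface and class names are illustrative and not taken from any system discussed here, and full MCTS additionally grows a search tree and balances exploration against exploitation.)

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Toy illustration of Monte Carlo valuation: score a position by averaging random playouts. */
public class RolloutDemo {
    interface GameState {
        boolean isTerminal();
        List<GameState> successors();   // legal moves from this position
        double payoff();                // outcome at a terminal position, from the evaluator's viewpoint
    }

    /** Hypothetical stand-in game: remove 1 or 2 stones per move; whoever takes the last stone wins. */
    static final class Stones implements GameState {
        final int left; final boolean evaluatorToMove;
        Stones(int left, boolean evaluatorToMove) { this.left = left; this.evaluatorToMove = evaluatorToMove; }
        public boolean isTerminal() { return left == 0; }
        public List<GameState> successors() {
            List<GameState> next = new ArrayList<>();
            for (int take = 1; take <= Math.min(2, left); take++)
                next.add(new Stones(left - take, !evaluatorToMove));
            return next;
        }
        // At a terminal position the side that is NOT to move took the last stone and therefore won.
        public double payoff() { return evaluatorToMove ? 0.0 : 1.0; }
    }

    /** Estimates a position's value as the mean payoff of n uniformly random playouts. */
    static double rolloutValue(GameState root, int playouts, Random rng) {
        double total = 0;
        for (int i = 0; i < playouts; i++) {
            GameState s = root;
            while (!s.isTerminal()) {
                List<GameState> moves = s.successors();
                s = moves.get(rng.nextInt(moves.size()));   // play random legal moves to the end
            }
            total += s.payoff();
        }
        return total / playouts;
    }

    public static void main(String[] args) {
        double v = rolloutValue(new Stones(7, true), 10_000, new Random());
        System.out.printf("estimated win rate for the side to move: %.2f%n", v);
    }
}
```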
Delving into existing computer
games, the Atari results (http://bit.ly/1YbLBgl, Figure 3) are very fun but obviously unimpressive on about a quarter of the games. My hypothesis for why is that their solution does only local (epsilon-greedy style) exploration rather than global exploration, so they can only learn policies addressing either very short credit assignment problems or with greedily accessible policies. Global exploration strategies are known to result in exponentially more efficient strategies in general for deterministic decision processes (1993, http://bit.ly/1YbLKjQ), Markov Decision Processes (1998, http://bit.ly/1RXTRCk), and for MDPs without modeling (2006, http://bit.ly/226J1tc).
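To make the contrast concrete, epsilon-greedy (“local”) exploration randomizes each individual action choice in isolation rather than planning exploration over whole trajectories. A minimal sketch in Java, with names and values that are illustrative rather than drawn from the papers linked above:

```java
import java.util.Random;

/** Minimal epsilon-greedy action selection: the "local exploration" referred to above. */
public class EpsilonGreedy {
    private final Random rng = new Random();

    /** Picks a random action with probability epsilon, otherwise the current greedy one. */
    int selectAction(double[] qEstimates, double epsilon) {
        if (rng.nextDouble() < epsilon) {
            return rng.nextInt(qEstimates.length);   // explore: uniform random action
        }
        int best = 0;                                // exploit: argmax of the current value estimates
        for (int a = 1; a < qEstimates.length; a++) {
            if (qEstimates[a] > qEstimates[best]) best = a;
        }
        return best;
    }

    public static void main(String[] args) {
        double[] q = {0.1, 0.5, 0.2};                // hypothetical per-action value estimates
        System.out.println("chosen action: " + new EpsilonGreedy().selectAction(q, 0.1));
    }
}
```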
The reason these strategies are not
used is because they are based on tabular learning rather than function fitting.
That is why I shifted to Contextual Bandit research (http://bit.ly/1S4iiHT) after
the 2006 paper. We have learned quite a
bit there, enough to start tackling a Contextual Deterministic Decision Process
(http://arxiv.org/abs/1602.02722), but
that solution is still far from practical.
Addressing global exploration effectively
is only one of the significant challenges
between what is well known now and
what needs to be addressed for what I
would consider a real AI.
This is generally understood by people working on these techniques but
seems to be getting lost in translation
to public news reports. That is dangerous because it leads to disappointment
(http://bit.ly/1ql1dDW). The field will
be better off without an overpromise/
bust cycle, so I would encourage people
to keep and inform a balanced view of
successes and their extent. Mastering
Go is a great accomplishment, but it is
quite far from everything.
See further discussion at
http://bit.ly/20106Ff.
Bertrand Meyer
What’s Your Research?
http://bit.ly/1QRo9Q9
March 3, 2016
One of the pleasures of
having a research activity
is that you get to visit research institutions
and ask people what they do. Typically, the
answer is “I work in X” or “I work in the application of X to Y,” as in (made-up example
among countless ones, there are many
Xs and many Ys): I work in model checking
for distributed systems. Notice the “in.”
This is, in my experience, the dominant style of answers to such a question.
I find it disturbing. It is about research
as a job, not research as research.
Research is indeed, for most researchers, a job. It was not always like that: up to
the time when research took on its modern form, in the 18th and early 19th centuries, researchers were people employed at
something else, or fortunate enough not
to need employment, who spent some of
their time looking into open problems of
science. Now research is something that
almost all its practitioners do for a living.
But a real researcher does not just
follow the flow, working “in” a certain
fashionable area or at the confluence of
two fashionable areas. A real researcher attempts to solve open problems.
This is the kind of answer I would
expect: I am trying to find a way to do A,
which no one has been able to do yet; or
to find a better way to do B, because the
current ways are deficient; or to solve the
C conjecture as posed by M; or to find out
why phenomenon D is happening; or to
build a tool that will address need E.
A researcher does not work “in” an
area but “on” a question.
This observation also defines what
it means for research to be successful.
If you are just working “in” an area, the
only criteria are bureaucratic: paper
accepted, grant obtained. They cover
the means, not the end. If you view
research as problem solving, success
is clearly and objectively testable: you
solved the problem you set out to solve,
or not. Maybe that is the reason we are
uneasy with this view: it prevents us
from taking cover behind artificial and
deceptive proxies for success.
Research is about solving problems;
at least about trying to solve a problem,
or—more realistically and modestly—
bringing your own little incremental
contribution to the ongoing quest for
a solution. We know our limits, but if
you are a researcher and do not naturally describe your work in terms of the
open problems you are trying to close,
you might wonder whether you are
tough enough on yourself.
Mark Guzdial
CS Classes Have
Different Results
than Laboratory
Experiments—
Not in a Good Way
http://bit.ly/1UUrOUu
March 29, 2016
I have collaborated with Lauren Margulieux on a series of experiments and
papers around using subgoal labeling
to improve programming education.
She has just successfully defended her
dissertation. I describe her dissertation work, and summarize some of
her earlier findings, in the blog post at
http://bit.ly/23bxRWd.
She had a paragraph in her dissertation’s methods section that I just flew
by when I first read it:
Demographic information was collected for participants’ age, gender, academic field of study, high school GPA,
college GPA, year in school, computer science experience, comfort with computers,
and expected difficulty of learning App
Inventor because they are possible predictors of performance (Rountree, Rountree, Robins, & Hannah, 2004; see Table
1). These demographic characteristics
were not found to correlate with problem
solving performance (see Table 1).
Then I realized her lack of result was
a pretty significant result.
I asked her about it at the defense.
She collected all these potential predictors of programming performance
in all the experiments. Were they
ever a predictor of the experiment
outcome? She said she once, out of
eight experiments, found a weak correlation between high school GPA
and performance. In all other cases,
“these demographic characteristics were not found to correlate with
problem solving performance” (to
quote her dissertation).
There has been a lot of research into
what predicts success in programming
classes. One of the more controversial
claims is that a mathematics background is a prerequisite for learning
programming. Nathan Ensmenger
suggests the studies show a correlation
between mathematics background
and success in programming classes,
but not in programming performance.
He suggests overemphasizing mathematics has been a factor in the decline
in diversity in computing (see http://bit.ly/1ql27jD about this point).
These predictors are particularly important today. With our burgeoning undergraduate enrollments, programs are
looking to cap enrollment using factors
like GPA to decide who gets to stay in CS
(see Eric Roberts’ history of enrollment
caps in CS at http://bit.ly/2368RmV).
Margulieux’s results suggest choosing
who gets into CS based on GPA might
be a bad idea. GPA may not be an important predictor of success.
I asked Margulieux how she might explain the difference between her experimental results and the classroom-based
results. One possibility is that there
are effects of these demographic variables, but they are too small to be seen
in short-term experimental settings. A
class experience is the sum of many experiment-size learning situations.
There is another possibility Margulieux agrees could explain the difference between classrooms and laboratory experiments: we may teach better
in experimental settings than we do
in classes. Lauren has almost no one
dropping out of her experiments, and
she has measurable learning. Everybody learns in her experiments, but
some learn more than others. The differences cannot be explained by any of
these demographic variables.
Maybe characteristics like “participants’ age, gender, academic field of
study, high school GPA, college GPA,
year in school, computer science experience, comfort with computers, and
expected difficulty of learning” programming are predictors of success in
programming classes because of how
we teach programming classes. Maybe
if we taught differently, more of these
students would succeed. The predictor
variables may say more about our teaching of programming than about the
challenge of learning programming.
Reader’s comment:
Back in the 1970s when I was looking
for my first software development job,
companies were using all sorts of tests
and “metrics” to determine who would be
a good programmer. I’m not sure any of
them had any validity. I don’t know that
we have any better predictors today. In my
classes these days, I see lots of lower-GPA
students who do very well in computer
science classes. Maybe it is how I teach.
Maybe it is something else (interest?), but
all I really know is that I want to learn
better how to teach.
—Alfred Thompson
John Langford is a Principal Researcher at Microsoft
Research New York. Bertrand Meyer is a professor at
ETH Zurich. Mark Guzdial is a professor at the Georgia
Institute of Technology.
© 2016 ACM 0001-0782/16/06 $15.00
news
Profile | DOI:10.1145/2911979 Neil Savage
The Key to Privacy
40 years ago, Whitfield Diffie and Martin E. Hellman introduced the public key cryptography used to secure today’s online transactions.
IT WAS UNUSUAL for Martin Hellman, a professor of electrical
engineering at Stanford University, to present two papers
on cryptography at the International Symposium on Information
Theory in October 1977. Under normal
circumstances, Steve Pohlig or Ralph
Merkle, the doctoral students who also
had worked on the papers, would have
given the talks, but on the advice of
Stanford’s general counsel, it was Hellman who spoke.
The reason for the caution was
that an employee of the U.S. National Security Agency, J.A. Meyer, had
claimed publicly discussing their
new approach to encryption would
violate U.S. law prohibiting the export of weapons to other countries.
Stanford’s lawyer did not agree with
that interpretation of the law, but
told Hellman it would be easier for
him to defend a Stanford employee
than it would be to defend graduate
students, so he recommended Hellman give the talk instead.
Whitfield Diffie, another student
of Hellman’s who says he was a hippie with “much more anti-societal
views then,” had not been scheduled
to present a paper at the conference,
but came up with one specifically to
thumb his nose at the government’s
claims. “This was just absolute non-
sense, that you could have laws that
could affect free speech,” Diffie says.
“It was very important to defy them.”
In the end, no one was charged with
breaking any laws, though as Hellman,
now professor emeritus, recalls, “there
was a time there when it was pretty dicey.” Instead, the researchers’ work started to move the field of cryptography into
academia and the commercial world,
where the cutting edge had belonged
almost exclusively to government researchers doing classified work.
Diffie and Hellman wrote a paper
in 1976, “New Directions in Cryptography,” introducing public key cryptography that still prevails in secure online
transactions today. As a result, they
have been named the 2015 recipients
of the ACM A.M. Turing Award.
Public key cryptography arose as the
solution to two problems, says Diffie,
former vice president and chief security
officer at Sun Microsystems. One was
the problem of sharing cryptographic
keys. It was possible to encrypt a message, but for the recipient to decrypt it,
he would need the secret key with which
it was encrypted. The sender could
physically deliver the secret key by courier or registered mail, but with millions
of messages, that quickly becomes unwieldy. Another possibility would be to
have a central repository of keys and distribute them as needed. That is still difficult, and not entirely satisfactory, Diffie says. “I was so countercultural that
I didn’t regard a call as secure if some
third party knew the key.”
Meanwhile, Diffie’s former boss,
John McCarthy, a pioneer in the field
of artificial intelligence, had written
about future computer systems in
which people could use home terminals to buy and sell things; that would
require digital signatures that could
not be copied, in order to authenticate
the transactions.
Both problems were solved with the
idea of a public key. It is possible to
generate a pair of complementary cryptographic keys. A person who wants to
receive a message generates the pair
and makes one public; then a sender
can use that public key to encrypt a
message, but only the person with the
private key, which does not have to be
sent anywhere, can decrypt it.
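To make the key-pair idea concrete, the sketch below uses the RSA support that ships with the JDK (RSA being the first practical realization of public key encryption, mentioned later in this article); the message text is made up, and a real application would rely on a vetted higher-level library and padding choices rather than this minimal demonstration:

```java
import java.nio.charset.StandardCharsets;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import javax.crypto.Cipher;

/** Minimal illustration of complementary public/private keys using the JDK's RSA provider. */
public class PublicKeyDemo {
    public static void main(String[] args) throws Exception {
        KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
        gen.initialize(2048);
        KeyPair pair = gen.generateKeyPair();               // the complementary key pair

        Cipher enc = Cipher.getInstance("RSA/ECB/PKCS1Padding");
        enc.init(Cipher.ENCRYPT_MODE, pair.getPublic());    // anyone may encrypt with the public key
        byte[] ciphertext = enc.doFinal("hello Bob".getBytes(StandardCharsets.UTF_8));

        Cipher dec = Cipher.getInstance("RSA/ECB/PKCS1Padding");
        dec.init(Cipher.DECRYPT_MODE, pair.getPrivate());   // only the private key can decrypt
        System.out.println(new String(dec.doFinal(ciphertext), StandardCharsets.UTF_8));
    }
}
```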
Hellman compares it to a box with
two combination locks, one to lock
the box and the other to unlock it. Alice, the sender, and Bob, the receiver,
each generate a pair of keys and make
one public. Alice puts a message in the
“box,” then locks it with her secret key,
guaranteeing its authenticity since
only she knows how to do that. She
then places that locked box inside a
larger one, which she locks with Bob’s
public key. When Bob gets the box, he
uses his private key to get past the outer box, and Alice’s public key to open
the inner box and see the message.
Hellman and Diffie, building on
an approach developed by Merkle,
later came up with a variation on the
scheme now called the Diffie-Hellman Key Exchange (though Hellman
argues Merkle’s name should be on
it as well). In this version, the box
has a hasp big enough for two locks.
Alice places her message in the box
and locks it with a lock to which only
she knows the combination, then
sends it to Bob. Bob cannot open
it, nor can anyone who intercepts it
en route, but he adds his own lock
and sends it back. Alice then takes
off her lock and sends the box back
to Bob with only his lock securing
it. On arrival, he can open it. In the
Internet world, that translates to a
commutative one-way function that
allows Alice and Bob to create a common key in a fraction of a second.
While an eavesdropper, in theory,
could compute the same key from
what he hears, that would take millions of years.
Ron Rivest, an Institute Professor
in the Massachusetts Institute of Technology’s Computer Science and Artificial Intelligence Laboratory, calls the
duo’s impact on the field revolutionary.
“Cryptography has really blossomed
since the publication of their paper,”
he says. “It’s become a key tool of the
information age.” Rivest, with his colleagues Adi Shamir and Leonard Adleman, developed the first practical implementation of public key encryption,
stimulated, Rivest says, by Diffie and
Hellman’s paper. Rivest, Shamir, and
Adleman were awarded the ACM A.M.
Turing Award for that work in 2002.
The Turing Award carries a $1 million prize, which Diffie and Hellman
will split. Diffie says he plans to use his
half of the award to pursue research on
the history of cryptography. Hellman
and his wife, Dorothie, will use the
money, and the attendant publicity,
to bring attention to their forthcoming book about how they transformed
an almost-failed marriage into one
in which they have reclaimed the love
they felt when they first met, and how
that same approach can be used to rescue the world from the risk posed by
nuclear weapons.
If young people want to go into the
field of cryptography, there are three
great problems for them to tackle,
Diffie says: cryptography resistant
to quantum computing; proof of the
computational complexity of cryptosystems; and homomorphic encryption that would allow computations to
be carried out on encrypted data.
Hellman encourages people to
take risks and not wait to know everything they think they should know before launching a project. “When I first
started working in cryptography, my
colleagues all told me I was crazy,” he
says. “My advice is: don’t worry about
doing something foolish.”
Martin E. Hellman (left) and Whitfield Diffie.
Watch the Turing recipients
discuss their work in this
exclusive Communications
video. http://cacm.acm.org/
videos/the-key-to-privacy
Neil Savage is a science and technology writer based in
Lowell, MA.
news
Technology | DOI:10.1145/2911975 Logan Kugler
What Happens When
Big Data Blunders?
Big data is touted as a cure-all for challenges in business, government, and
healthcare, but as disease outbreak predictions show, big data often fails.
You cannot browse technology news or dive into an
industry report without typically seeing a reference to
“big data,” a term used to
describe the massive amounts of information companies, government organizations, and academic institutions can
use to do, well, anything. The problem
is, the term “big data” is so amorphous
that it hardly has a tangible definition.
While it is not clearly defined, we
can define it for our purposes as: the
use of large datasets to improve how
companies and organizations work.
While often heralded as The Next
Big Thing That Will Cure All Ills, big
data can, and often does, lead to big
blunders. Nowhere is that more evident than in its use in forecasting the outbreak and spread of diseases.
An influenza forecasting service pioneered by Google employed big data—
and failed spectacularly to predict the
2013 flu outbreak. Data used to prognosticate Ebola’s spread in 2014 and
early 2015 yielded wildly inaccurate
results. Similarly, efforts to predict the
spread of avian flu have run into problems with data sources and interpretations of those sources.
These initiatives failed due to a combination of big data inconsistencies and
human errors in interpreting that data.
Together, those factors lay bare how big
data might not be the solution to every
problem—at least, not on its own.
Big Data Gets the Flu
Google Flu Trends was an initiative
the Internet search giant began in
2008. The program aimed to better
predict flu outbreaks using Google
search data and information from the
U.S. Centers for Disease Control and
Prevention (CDC).
The big data from online searches,
combined with the CDC’s cache of disease-specific information, represented
a huge opportunity. Many people will
search online the moment they feel a
bug coming on; they look for information on symptoms, stages, and remedies. Combined with the CDC’s insights
into how diseases spread, the knowledge of the numbers and locations of
people seeking such information could
theoretically help Google predict where
and how severely the flu would strike
next—before even the CDC could. In
fact, Google theorized it could beat CDC
predictions by up to two weeks.
The success of Google Flu Trends
would have big implications. In the last
three decades, thousands have died
from influenza-related causes, says the
CDC, while survivors can face severe
health issues because of the disease.
Also, many laid up by the flu consume
the time, energy, and resources of
healthcare organizations. Any improvement in forecasting outbreaks could
save lives and dollars.
However, over the years, Google Flu
Trends consistently failed to predict flu
cases more accurately than the CDC.
After the program failed to predict the
2013 flu outbreak, Google quietly shuttered the program.
David Lazer and Ryan Kennedy
studied why the program failed, and
found key lessons about avoiding big
data blunders.
The Hubris of Humans
Google Flu Trends failed for two reasons, say Lazer and Kennedy: big data
hubris, and algorithmic dynamics.
Big data hubris means Google researchers placed too much faith in big
data, rather than partnering big data
with traditional data collection and
analysis. Google Flu Trends was built to
map not only influenza-related trends,
but also seasonal ones. Early on, engineers found themselves weeding out
false hits concerned with seasonal, but
not influenza-related, terms—such as
those related to high school basketball
season. This, say Lazer and Kennedy,
should have raised red flags about
the data’s reliability. Instead, it was
thought the terms could simply be removed until the results looked sound.
As Lazer and Kennedy say in their
article in Science: “Elsewhere, we have
asserted that there are enormous scientific possibilities in big data. However,
quantity of data does not mean that one
can ignore foundational issues of measurement and construct validity and reliability and dependencies among data.”
In addition, Google itself turned out
to be a major problem.
The second failure condition was
one of algorithmic dynamics, or the
idea that Google Flu Trends predictions
were based on a commercial search algorithm that frequently changes based
on Google’s business goals.
Google’s search algorithms change
often; in fact, say Lazer and Kennedy,
in June and July 2012 alone, Google’s
algorithms changed 86 times as the
firm tweaked how it returned search
results in line with its business and
growth goals. This sort of dynamism
was not accounted for in Google Flu
Trends models.
“Google’s core business is improving
search and driving ad revenue,” Kennedy told Communications. “To do this, it is
continuously altering the features it offers. Features like recommended searches and specialized health searches to
diagnose illnesses will change search
prominence, and therefore Google Flu
Trends results, in ways we cannot currently anticipate.” This uncertainty skewed the data in ways even Google’s engineers did not fully understand, undermining the accuracy of the predictions.
Google is not alone: assumptions
are dangerous in other types of outbreak prediction. Just ask the organizations that tried to predict Ebola outbreaks in 2014.
Failing to Foresee Ebola
Headlines across the globe screamed
worst-case scenarios for the Ebola
outbreak of 2014. There were a few
reasons for that: it was the worst such
outbreak the world had ever seen, and
there were fears the disease could become airborne, dramatically increasing its spread. In addition, there were
big data blunders.
At the height of the frenzy, according to The Economist (http://econ.
st/1IOHYKO), the United Nations’ public health arm, the World Health Organization (WHO), predicted 20,000 cases of Ebola—nearly 54% more than the
13,000 cases reported. The CDC predicted a worst-case scenario of a whopping 1.4 million cases. In the early days
of the outbreak, WHO publicized a 90%
death rate from the disease; the reality
at that initial stage was closer to 70%.
Why were the numbers so wrong?
There were several reasons, says Aaron King, a professor of ecology at the
University of Michigan. First was the
failure to account for intervention; like
Google’s researchers, Ebola prognosticators failed to account for changing
conditions on the ground. Google’s
model was based on an unchanging
algorithm; Ebola researchers used a
model based on initial outbreak conditions. This was problematic in both
cases: Google could not anticipate
how its algorithm skewed results; Ebola fighters failed to account for safer
burial techniques and international
interventions that dramatically curbed
outbreak and death-rate numbers.
“Perhaps the biggest lesson we
learned is that there is far less information in the data typically available
in the early stages of an outbreak than
is needed to parameterize the models
that we would like to be able to fit,”
King told Communications.
That was not the only mistake made,
says King. He argues stochastic models
that better account for randomness
are more appropriate for predictions
of this kind. Ebola fighters used deterministic models that did not account
for the important random elements in
disease transmission.
“In the future, I hope we as a community get better at distinguishing information from assumptions,” King says.
Can We Ever Predict
Outbreaks Accurately?
It is an open question whether models
can be substantially improved to predict disease outbreaks more accurately.
Other companies want to better predict flu outbreaks after the failure of
Google Flu Trends—specifically avian
flu—using social media and search
platforms. Companies such as Sickweather and Epidemico Inc. use algorithms and human curation to assess
both social media and news outlets for
flu-related information.
These efforts, however, run the
same risks as previous flu and Ebola
prediction efforts. Social media platforms change, and those changes do
not always benefit disease researchers.
In fact, says King, data collection may
hold the key to better predictions.
“I suspect that our ability to respond
effectively to future outbreaks will depend more on improved data collection techniques than on improvement
in modeling technologies,” he says.
Yet even improvements in data collection might not be enough. In addition to internal changes that affect
how data is collected, researchers must
adapt their assessments of data to conditions on the ground. Sometimes, as
in the case of avian flu, not even experts
understand what to look for right away.
“The biggest challenge of the spring
2015 outbreak [of avian flu] in the United States was that poultry producers
were initially confused about the actual
transmission mechanism of the disease,” says Todd Kuethe, an agricultural
economist who writes on avian flu topics. “Producers initially believed it was
entirely spread by wild birds, but later
analysis by the USDA (U.S. Department
of Agriculture) suggested that farm-to-farm transmission was also a significant factor.”
No matter the type of data collection or the models used to analyze
it, sometimes disease conditions
change too quickly for humans or
algorithms to keep up. That might
doom big data-based disease prediction from the beginning.
“The ever-changing situation on the
ground during emerging outbreaks
makes prediction failures inevitable,
even with the best models,” concludes
Matthieu Domenech De Celles, a postdoctoral fellow at the University of
Michigan who has worked on Ebola
prediction research.

Further Reading

Lazer, D., and Kennedy, R.
(2014) The Parable of Google Flu: Traps in Big Data Analysis. Science.
http://scholar.harvard.edu/files/gking/files/0314policyforumff.pdf
Miller, K.
(2014) Disease Outbreak Warnings Via Social Media Sought By U.S. Bloomberg.
http://www.bloomberg.com/news/articles/2014-04-11/disease-outbreak-warnings-via-social-media-sought-by-u-s
Erickson, J.
(2015) Faulty Modeling Studies Led To Overstated Predictions of Ebola Outbreak. Michigan News.
http://ns.umich.edu/new/releases/22783-faulty-modeling-studies-led-to-overstated-predictions-of-ebola-outbreak
Predictions With A Purpose. The Economist.
http://www.economist.com/news/international/21642242-why-projections-ebola-west-africa-turned-out-wrong-predictions-purpose
Logan Kugler is a freelance technology writer based in
Tampa, FL. He has written for over 60 major publications.
news
Science | DOI:10.1145/2911971 Alex Wright
Reimagining Search
Search engine developers are moving beyond
the problem of document analysis, toward the elusive
goal of figuring out what people really want.
Ever since Gerard Salton of Cornell University developed the
first computerized search
engine (Salton’s Magical
Automatic Retriever of Text,
or SMART) in the 1960s, search developers have spent decades essentially
refining Salton’s idea: take a query
string, match it against a collection
of documents, then calculate a set of
relevant results and display them in a
list. All of today’s major Internet search
engines—including Google, Amazon,
and Bing—continue to follow Salton’s
basic blueprint.
Yet as the Web has evolved from a
loose-knit collection of academic papers to an ever-expanding digital universe of apps, catalogs, videos, and
cat GIFs, users’ expectations of search
results have shifted. Today, many of
us have less interest in sifting through
a collection of documents than in getting something done: booking a flight,
finding a job, buying a house, making
an investment, or any number of other highly focused tasks.
Meanwhile, the Web continues to
expand at a dizzying pace. Last year,
Google indexed roughly 60 trillion pages—up from a mere one trillion in 2008.
“As the Web got larger, it got harder
to find the page you wanted,” says Ben
Gomes, a Google Fellow and vice president of the search giant’s Core Search
team, who has been working on search
at Google for more than 15 years.
Today’s Web may bear little resemblance to its early incarnation as
an academic document-sharing tool,
yet the basic format of search results
has remained remarkably static over
the years. That is starting to change,
however, as search developers shift
focus from document analysis to the
even thornier challenge of trying to
understand the kaleidoscope of human wants and needs that underlie
billions of daily Web searches.
While document-centric search algorithms have largely focused on solving the problems of semantic analysis—identifying synonyms, spotting
spelling errors, and adjusting for
other linguistic vagaries—many developers are now shifting focus to the
other side of the search transaction:
the query itself.
By mining the vast trove of query
terms that flow through Web search
engines, developers are exploring
new ways to model the context of inbound query strings, in hopes of improving the precision and relevance
of search results.
“Before you look at the documents,
you try to determine the intent,” says
Daniel Tunkelang, a software engineer who formerly led the search team
at LinkedIn.
There, Tunkelang developed a sophisticated model for query understanding that involved segmenting
incoming queries into groups by tagging relevant entities in each query,
categorizing certain sequences of tags
to identify the user’s likely intent, and
using synonym matching to further
refine the range of likely intentions.
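A toy version of that pipeline (tag the query terms, then read an intent off the tag sequence) might look like the Python sketch below; the tag dictionary, synonym table, and intent rules are invented for illustration and are far simpler than anything LinkedIn actually runs.

# Hypothetical tag and synonym tables, invented for this sketch.
SYNONYMS = {"swe": "software engineer", "vp": "vice president"}
ENTITY_TAGS = {
    "obama": "PERSON",
    "president": "TITLE",
    "vice president": "TITLE",
    "software engineer": "TITLE",
    "google": "COMPANY",
}

def tag_query(query):
    # Segment the query and tag each recognized entity.
    terms = [SYNONYMS.get(t, t) for t in query.lower().split()]
    return [(t, ENTITY_TAGS.get(t, "KEYWORD")) for t in terms]

def likely_intent(tags):
    # Categorize the tag sequence to guess what the user wants.
    kinds = [kind for _, kind in tags]
    if kinds == ["PERSON"]:
        return "navigate-to-profile"
    if "TITLE" in kinds and "COMPANY" in kinds:
        return "job-search"
    return "keyword-search"

print(likely_intent(tag_query("Obama")))       # navigate-to-profile
print(likely_intent(tag_query("swe google")))  # job-search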
At LinkedIn, a search for “Obama”
returns a link to the president’s profile
page, while a search for “president”
returns a list of navigational shortcuts
to various jobs, people, and groups
containing that term. When the user
selects one of those shortcuts, LinkedIn picks up a useful signal about that
user’s intent, which it can then use to
return a highly targeted result set.
In a similar vein, a search for
“Hemingway” on Amazon will return a
familiar-looking list of book titles, but
a search for a broader term like “outdoors” will yield a more navigational
page with links to assorted Amazon
product categories. By categorizing
the query—distinguishing a “known
item” search from a more exploratory keyword search—Amazon tries to
adapt its results based on a best guess
at the user’s goal.
The widespread proliferation of structured data, coupled with advances in
natural language processing and the rise
of voice recognition-equipped mobile
devices, has given developers a powerful set of signals for modeling intent,
enabling them to deliver result formats
that are highly customized around particular use cases, and to invite users into
more conversational dialogues that can
help fine-tune search results over time.
Web users can see a glimpse of
where consumer search may be headed in the form of Google’s increasingly ubiquitous “snippets,” those highly
visible modules that often appear at
the top of results pages for queries
on topics like sports scores, stock
quotes, or song lyrics. Unlike previous
incarnations of Google search results,
snippets are trying to do more than
just display a list of links; they are trying to answer the user’s question.
These kinds of domain-specific
searches benefit from a kind of a priori
knowledge of user intent. Netflix, for
example, can reasonably infer most
queries have something to do with movies or TV. Yet a general-purpose search
engine like Google must work harder
to gauge the intent of a few characters’
worth of text pointed at the entire Web.
Developers are now beginning to
make strides in modeling the context of
general Web searches, thanks to a number of converging technological trends:
advances in natural language processing; the spread of location-aware, voice
recognition-equipped mobile devices; and the rise of structured data that allows search engines to extract specific
and the rise of structured data that allows search engines to extract specific
data elements that might once have remained locked inside a static Web page.
Consumer search engines also
try to derive user intent by applying natural language processing
techniques to inbound search terms.
For example, when a user enters the
phrase “change a lightbulb,” the word
“change” means “replace;” but if a
user enters “change a monitor,” the
term “change” means “adjust.”
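A crude way to picture that step is a sense table keyed on the verb's object, as in the Python fragment below; the word lists are invented for illustration and bear no relation to Google's actual models.

# Invented sense table: which reading of "change" each object suggests.
CHANGE_SENSES = {
    "replace": {"lightbulb", "tire", "battery"},
    "adjust":  {"monitor", "settings", "brightness"},
}

def sense_of_change(obj):
    # Pick the sense of the ambiguous verb from the object it governs.
    for sense, objects in CHANGE_SENSES.items():
        if obj in objects:
            return sense
    return "change"  # fall back to the literal term

print(sense_of_change("lightbulb"))  # replace
print(sense_of_change("monitor"))    # adjust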
By analyzing the interplay of query
syntax and synonyms, Google looks for
linguistic patterns that can help refine
the search result. “We try to match the
query language with the document
language,” says Gomes. “The corpus of
queries and the corpus of documents
come together to give us a deeper understanding of the user’s intent.”
Beyond the challenges of data-driven query modeling, some search engine developers are finding inspiration
by looking beyond their search logs and
turning their gaze outward to deepen
their understanding of real-life users
“in the wild.”
“Qualitative research is great to generate insight and hypotheses,” says
Tunkelang, who sees enormous potential in applying user experience (UX) research techniques to assess the extent
to which users may trust a particular set
of search results, or exploring why they
may not choose to click on a particular link in the results list. Qualitative
research can also shed light on deeper
emotional needs that may be difficult to
ascertain through data analysis alone.
At Google, the search team runs an
ongoing project called the Daily Information Needs study, in which 1,000
volunteers in a particular region receive a ping on their smartphones up
to eight times per day to report on what kind of information they are looking for that day—not just on Google, but anywhere. Insights from this study have helped Google seed the ideas for new products such as Google Now.

Milestones
Computer Science Awards, Appointments

PAPADIMITRIOU AWARDED VON NEUMANN MEDAL
IEEE has honored Christos H. Papadimitriou, C. Lester Hogan Professor in the Department of Electrical Engineering and Computer Science at the University of California, Berkeley, with the 2016 John von Neumann Medal “for providing a deeper understanding of computational complexity and its implications for approximation algorithms, artificial intelligence, economics, database theory, and biology.”
Papadimitriou, who has taught at Harvard, the Massachusetts Institute of Technology, the National Technical University of Athens, Stanford University, and the University of California at San Diego, is the author of the textbook Computational Complexity, which is widely used in the field of computational complexity theory. He also co-authored the textbook Algorithms with Sanjoy Dasgupta and Umesh Vazirani, and the graphic novel Logicomix with Apostolos Doxiadis.
The IEEE John von Neumann Medal is awarded for outstanding achievements in computer-related science and technology.

ACM CITES PERROTT FOR VISION, LEADERSHIP
ACM has named Ron Perrott of the Queen’s University Belfast/Oxford e-Research Centre recipient of the 2015 ACM Distinguished Service Award “for providing vision and leadership in high-performance computing and e-science, championing new initiatives and advocating collaboration among interested groups at both national and international levels.”
Perrott has been an effective advocate for high-performance and grid computing in Europe since the 1970s, working tirelessly and successfully with academic, governmental, and industrial groups to convince them of the importance of developing shared resources for high-performance computing at both national and regional levels.
Perrott is a Fellow of ACM, IEEE, and the British Computer Society.
Researchers at Microsoft recently
conducted an ethnographic study that
pointed toward five discrete modes of
Web search behavior:
•Respite: taking a break in the day’s
routine with brief, frequent visits to a
familiar set of Web sites;
•Orienting: frequent monitoring of
heavily-used sites like email providers
and financial services;
•Opportunistic use: leisurely visits
to less-frequented sites for topics like
recipes, odd jobs, and hobbies;
•Purposeful use: non-routine usage scenarios, usually involving time-limited problems like selling a piece of
furniture, or finding a babysitter, and
•Lean-back: consuming passive entertainment like music or videos.
Each of these modes, the authors
argue, calls for a distinct mode of onscreen interaction, “to support the
construction of meaningful journeys
that offer a sense of completion.”
As companies begin to move away
from the one-size-fits-all model of
list-style search results, they also are
becoming more protective of the underlying insights that shape their presentation of search results.
“One irony is that as marketers have
gotten more sophisticated, the amount
of data that Google is sharing with its
marketing partners has actually diminished,” says Andrew Frank, vice
president of research at Gartner. “It
used to be that if someone clicked
on an organic link, you could see the
search terms they used, but over the
past couple of years, Google has started to suppress that data.”
Frank also points to Facebook as
an example of a company that has
turned query data into a marketing
asset, by giving marketers the ability
to optimize against certain actions
without having to target against particular demographics or behaviors.
As search providers continue to try
to differentiate themselves based on
a deepening understanding of query
intent, they will also likely focus on
capturing more and more information
about the context surrounding a particular search, such as location, language,
and the history of recent search queries. Taken together, these cues will
provide sufficient fodder for increasingly predictive search algorithms.
Tunkelang feels the most interesting unsolved technical problem in
search involves so-called query performance prediction. “Search engines
make dumb mistakes and seem blissfully unaware when they are doing so,”
says Tunkelang.
“In contrast, we humans may not
always be clever, but we’re much better
at calibrating our confidence when it
comes to communication. Search engines need to get better at query performance prediction—and better at providing user experiences that adapt to it.”
Looking even further ahead, Gomes
envisions a day when search engines
will get so sophisticated at modeling
user intent that they will learn to anticipate users’ needs well ahead of time.
For example, if the system detects you
have a history of searching for Boston
Red Sox scores, your mobile phone
could greet you in the morning with
last night’s box score.
Gomes thinks this line of inquiry
may one day bring search engines to
the cusp of technological clairvoyance.
“How do we get the information to you
before you’ve even asked a question?”
Further Reading
Bailey, P., White, R.W., Liu, H., and Kumaran, G.,
Mining Historic Query Trails to Label Long
and Rare Search Engine Queries. ACM
Transactions on the Web. Volume 4 Issue 4,
Article 15 (September 2010),
http://dx.doi.org/10.1145/1841909.1841912
Lindley, S., Meek, S., Sellen, A., and Harper, R.,
‘It’s Simply Integral to What I do:’
Enquiries into how the Web is Weaved into
Everyday Life, WWW 2012,
http://research.microsoft.com/en-us/
people/asellen/wwwmodes.pdf
Salton, G.,
The SMART Retrieval System—Experiments
in Automatic Document Processing, Prentice-Hall, Inc., Upper Saddle River, NJ, 2012
Vakkari, P.,
Exploratory Searching as Conceptual
Exploration, Microsoft Research,
http://bit.ly/1N3rI3x
Alex Wright is a writer and information architect based in
Brooklyn, NY.
ACM
Member
News
A “LITTLE DIFFERENT”
CAREER TRAJECTORY
“It’s a little
different,”
says Julia
Hirschberg,
Percy K. and
Vida L.W.
Hudson
Professor of Computer Science
and Chair of the Computer
Science Department at
Columbia University, of her
career trajectory.
Hirschberg majored in
history as an undergraduate,
earning a Ph.D. in 16th century
Mexican social history at the
University of Michigan at Ann
Arbor. While teaching history at
Smith College, she discovered
artificial intelligence techniques
were useful in building social
networks of 16th century
colonists from “fuzzy” data. She
soon decided computer science
was even more exciting than
history and went back to school,
earning a doctorate in computer
science from the University of
Pennsylvania in 1985.
“None of my career decisions
have been carefully planned. You
often see opportunities you never
dreamed would be possible.”
As a result of her thesis work,
Hirschberg met researchers
at Bell Laboratories. She went
to work there in 1985, first
working in text-to-speech
synthesis, then launching the
Human-Computer Interface
Research Department in 1994,
and moving with Bell to AT&T
Laboratories.
Hirschberg started teaching
at Columbia in 2002, and
became chair of the Computer
Science Department in 2012.
Her major research area is
computational linguistics;
her current interests include
deceptive speech and spoken
dialogue systems.
“One of the things I think
of when I tell young women
about my career is that
many opportunities arise,”
Hirschberg says. “I never knew
as an undergraduate that I
would become a computer
scientist, let alone chairing a
computer science department
at Columbia. You make some
decisions, but they are not
necessarily decisions for life.”
—John Delaney
news
Society | DOI:10.1145/2911973 Gregory Mone
What’s Next for
Digital Humanities?
New computational tools spur advances in an evolving field.

In 1946, an Italian Jesuit priest
named Father Roberto Busa
conceived of a project to index the works of St. Thomas
Aquinas word by word. There
were an estimated 10 million words,
so the priest wondered if a computing machine might help. Three years
later, he traveled to the U.S. to find an
answer, eventually securing a meeting
with IBM founder Thomas J. Watson.
Beforehand, Busa learned Watson’s
engineers had already informed him
the task would be impossible, so on his
way into Watson’s office, he grabbed a
small poster from the wall that read,
“The difficult we do right away; the
impossible takes a little longer.” The
priest showed the executive his own
company’s slogan, and Watson promised IBM’s cooperation.
“The impossible” took roughly
three decades, but that initial quest
also marked the beginning of the field
now known as Digital Humanities. Today, digital humanists are applying advanced computational tools to a wide
range of disciplines, including literature, history, and urban studies. They
are learning programming languages,
generating dynamic three-dimensional
(3D) re-creations of historic city spaces,
developing new academic publishing
platforms, and producing scholarship.
The breadth of the field has led to
something of an identity crisis. In fact,
there is an annual Day of Digital Humanities (which was April 8 this year),
during which scholars publish details
online about the work they are conducting on that particular date. The
goal is to answer the question, “Just
what do digital humanists really do?”
As it turns out, there are many different answers.
Distant Reading
Digital Humanities is most frequently
associated with the computational
analysis of text, from the Bible to modern literature. One common application is distant reading, or the use of
computers to study hundreds or thousands of books or documents rather
than having a human pore over a dozen.
Consider Micki Kaufman, a Ph.D.
candidate at The Graduate Center, City
University of New York (CUNY), who decided to study the digitized correspondence of Henry Kissinger. This was no
small task; she was faced with transcripts of more than 17,500 telephone
calls and 2,200 meetings. Adding to the
challenge was the fact that some of the
materials had been redacted for national security reasons. She realized by
taking a computational approach, she
could glean insights both into the body
of documents as a whole and the missing material.
In one instance, Kaufman used a
machine-reading technique combining word collocation and frequency
analysis to scan the texts for the words
“Cambodia” and “bombing,” and to
track how far apart they appear within
the text. A statement such as “We are
| J U NE 201 6 | VO L . 5 9 | NO. 6
bombing Cambodia” would have a distance of zero, whereas the result might
be 1,000 if the terms are separated by
several pages. Kaufman noticed the
words tended to be clustered together
more often in telephone conversations, suggesting Kissinger believed
he had greater privacy on the phone,
relative to the meetings, and therefore
spoke more freely. Furthermore, the
analysis offered clues to what had been
redacted, as it turned up major gaps in
the archive—periods during which the
terms did not appear together—when
the bombing campaign was known to
be active.
Overall, Kaufman was able to study
the archive through a different lens,
and found patterns she might not have
detected through a laborious reading
of each file. “You get the long view,”
says Kaufman. “You can ask yourself
about behavioral changes and positional changes in ways that would have
required the reading of the entire set.”
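The core measurement is simple enough to sketch in a few lines of Python; the tokenizer and function name below are simplifications assumed for illustration, not Kaufman's actual code.

def term_distances(text, term_a, term_b):
    # Token positions of each term (a deliberately crude tokenizer).
    tokens = [t.strip(".,;:\"'").lower() for t in text.split()]
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    # Adjacent terms ("We are bombing Cambodia") count as distance zero.
    return [min(abs(i - j) for j in pos_b) - 1 for i in pos_a if pos_b]

print(term_distances("We are bombing Cambodia", "bombing", "cambodia"))  # [0]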
The computer-aided approach of
distant reading has also started to
move beyond texts. One example is
the work of the cultural historian Lev
Manovich, also of The Graduate Center, CUNY, who recently subjected a
dataset of 6,000 paintings by French
Impressionists to software that extracted common features in the images and grouped them together.
Manovich and his colleagues found
more than half of the paintings were
reminiscent of the standard art of the
day; Impressionist-style productions,
on the other hand, represented only a
sliver of the total works.
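As a sketch of that extract-and-group workflow, assume the per-painting visual features (brightness, hue statistics, and so on) have already been computed into a matrix; the use of k-means below, like the random stand-in data, is an assumption for illustration rather than a description of Manovich's pipeline.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.random((6000, 8))   # stand-in for 8 visual features per painting

# Group paintings with similar feature profiles.
groups = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(features)
shares = np.bincount(groups) / len(groups)
print(shares)                      # fraction of the corpus in each visual group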
A New Way of Seeing
That sort of finding would be of interest to any Impressionist historian, not
just those with a digital bent, and according to University of Georgia historian Scott Nesbit, this is a critical
distinction. Digital humanists have
their own dedicated journals and conferences, but to Nesbit this might not
be the best approach going forward. “I
don’t see Digital Humanities as its own
discipline,” he says. “We’re humanists
who use certain methods and certain
tools to try to understand what’s going
on in our discipline and ask questions
in ways we hadn’t been asking before.”
When Nesbit set out to analyze the
post-emancipation period during the
U.S. Civil War, he wanted to look at exactly how enslaved people became free,
and specifically how the movement of
the anti-slavery North’s Union Army
impacted that process. “We wanted to
come up with a way to see what emancipation actually looked like on the
ground,” he says.
Nesbit and his colleagues extracted
data from both U.S. Census results
and advertisements of slave owners looking for their freed servants.
They built a Geographic Information
System (GIS) map of the region, and
then overlaid the apparent tracks of
the freed slaves with the movements
of the Union Army at the time. What
they found surprised them: there were
the expected spikes in the number of
freed slaves escaping when the army
arrived, but these advances apparently
did not inspire everyone to seek freedom. The people fleeing north were
predominantly men; of the few advertisements seeking runaway women
that do appear during these periods,
the data suggests they escaped to the
city instead. “There are a number
of possible reasons for this,” Nesbit
says, “one of them being that running
toward a group of armed white men
might not have seemed like the best
strategy for an enslaved woman.”
This gender-based difference to the
workings of emancipation was a new
insight relevant to any historian of the
period—not just the subset who prefer digital tools. While Nesbit might
have spotted the same trend through
exhaustive research, the digital tools
made it much easier to see patterns
in the data. “It was important to visualize these in part so I could see the
spatial relationships between armies
and the actions of enslaved people,”
Nesbit says.
The art historians, architects, and
urban studies experts behind a project called Visualizing Venice hope
for similarly surprising results. This
collaboration between academics
at Duke University, the University of
Venice, and the University of Padua
generates 3D representations of specific areas within the famed city, and
how the buildings, public spaces, and
even interior designs of its structures
have changed over the centuries. The
researchers create accurate digital representations of various buildings in
their present form, using laser radar
scanning and other tools, then draw
upon historical paintings, architectural plans, civic documents, and more to
effectively roll back the clock and trace
each structure’s evolution over time.
The animations allow researchers to
watch buildings grow and change in
response to the evolving city, but they
are not just movies; they are annotated
in such a way that it is possible to click
through a feature to see the historical
document(s) on which it is based.
Beyond the
Computationally Inflected
While the goal of Visualizing Venice is
in part to produce scholarship, other
experts argue Digital Humanities also
encompass the development of tools
designed to simplify research.
The programmer and amateur art
historian John Resig, for example,
found himself frustrated at the difficulty of searching for images of his favorite
style of art, Japanese woodblock prints.
He wrote software that scours the digital archives of a museum or university
and copies relevant images and their
associated metadata to his site. Then
he applied the publicly available MatchEngine software tool, which scans these
digital reproductions for similarities
and finds all the copies of the same
print, so he could organize his collection by image. In short, he developed a
simple digital way for people to find the
physical locations of specific prints.
At first, Resig says, academics did
not take to the tool. “There was one
scholar who said, ‘That sounds useful,
but not for me, because I’m already an
expert,’” Resig recalls. “A year later,
this scholar came to me and said, ‘I’m
so glad you built this website. It saves
me so much time!’”
This type of contribution has become commonplace in the field of
Archaeology. For example, the Codifi
software platform, developed in part
by archaeologists from the University
of California, Berkeley, is designed to
reduce field researchers’ dependence
on paper, giving them an easier and
more scalable way to collect and organize images, geospatial information,
video, and more. Archaeologists also
have proven quick to explore the potential of even more advanced technologies, from 3D printers that generate
reproductions of scanned artifacts to
the possibility of using low-cost drones
equipped with various sensors as a new
way of analyzing dig sites.
Yet archaeologists who engage in
this kind of work are rarely considered digital humanists, or even digital
archaeologists. Archaeology was so quick to adopt computational tools and methods and integrate them into the practice of the discipline that the digital aspect has simply merged into the field as a whole. This might be a kind of
roadmap for digital humanists in other
disciplines to follow.
Matthew Gold, a digital humanist
at The Graduate Center, CUNY, suggests the time is right for such a shift.
“What we’re seeing now is a maturation of some of the methods, along
with an effort by digital humanists to
test their claims against the prevailing
logic in their field, so that it’s not just
computationally inflected work off to
the side,” Gold says. “The field is at an
interesting moment.”
Further Reading
Gold, M. (Ed.)
Debates in the Digital Humanities,
The University of Minnesota Press, 2016.
Berry, D.M. (Ed.)
Understanding Digital Humanities,
Palgrave Macmillan, 2012.
Nesbit, S.
Visualizing Emancipation: Mapping
the End of Slavery in the American Civil War,
in Computation for Humanity: Information
Technology to Advance Society (New York:
Taylor & Francis), 427-435.
Moretti, F.
Graphs, Maps, Trees,
New Left Review, 2003.
Visualizing Venice Video:
http://bit.ly/24f5bgJ
Gregory Mone is a Boston, MA-based science writer and
children’s novelist.
viewpoints
DOI:10.1145/2909877
Rebecca T. Mercuri and Peter G. Neumann
Inside Risks
The Risks of
Self-Auditing Systems
Unforeseen problems can result from
the absence of impartial independent evaluations.
Over two decades ago,
NIST Computer Systems
Laboratory’s Barbara Guttman and Edward Roback
warned that “the essential difference between a self-audit
and an external audit is objectivity.”6
In that writing, they were referring to
internal reviews by system management staff, typically for purposes of
risks assessment—potentially having
inherent conflicts of interest, as there
may be disincentives to reveal design
flaws that could pose security risks. In
this column, we raise attention to the
additional risks posed by reliance on
information produced by electronically self-auditing sub-components
of computer-based systems. We are
defining such self-auditing devices as
being those that display internally generated data to an independent external
observer, typically for purposes of ensuring conformity and/or compliance
with particular range parameters or degrees of accuracy.
Our recent interest in this topic was
sparked by the revelations regarding
millions of Volkswagen vehicles whose
emission systems had been internally
designed and manufactured such that
lower nitrogen dioxide levels would be
produced and measured during the
inspection-station testing (triggered by
the use of the data port) than would occur in actual driving. In our earlier writings, we had similarly warned about
voting machines potentially being set
to detect election-day operations, such
that the pre-election testing would
show results consistent with practice
ballot inputs, but the actual electionday ballots would not be tabulated accurately. These and other examples are
described further in this column.
Issues
We are not suggesting that all self-auditing systems are inherently bad. Our focus is on the risks of explicit reliance only on internal auditing, to the exclusion of any independent external oversight. It is particularly where self-auditing systems have end-to-end autonomous checking, or only human interaction with insiders, that unbiased external observation becomes unable to influence or detect flaws in the implementation and operations with respect to the desired and expected purposes.
Although many self-auditing systems suffer from a lack of sufficient
transparency and external visibility
to ensure trustworthiness, the expedience and the seeming authority of
results can inspire false confidence.
More generally, the notion of self-regulation poses the risk of degenerating
into no regulation whatsoever, which
appears to be the case with respect to
self-auditing.
By auditing, we mean systematic examination and verification of accounts,
transaction records (logs), and other documentation, accompanied by physical inspection (as appropriate), by an independent entity. In contrast, self-auditing
results are typically internally generated, but are usually based on external
inputs by users or other devices. The
self-audited aggregated results typically lack a verifiable correspondence of
the outputs with the inputs. As defined,
such systems have no trustworthy independent checks-and-balances. Worse
yet, the systems may be proprietary or
covered by trade-secret protection that
explicitly precludes external inspection and validation.
Trade secrecy is often used to maintain certain intellectual property protections—in lieu of copyright and/or
patent registration. It requires proofs
of strict secrecy controls, which are inherently difficult to achieve in existing
systems. Trade-secrecy protection can
extend indefinitely, and is often used
to conceal algorithms, processes, and
software. It can thwart detection of illicit activity or intentional alteration of
reported results.
Relying on internally generated audits creates numerous risks across a
broad range of application areas, especially where end-to-end assurance is
desired. In some cases, even internal
audits are lacking altogether. The risks
may include erroneous and compromised results, opportunities for serious
misuse, as well as confusions between
precision and accuracy.
Systemic Problems
Of course, the overall problems are
much broader than just those relating
to inadequate or inappropriately compromised internal auditing and the absence of external review.
Of considerable relevance to networked systems that should be trustworthy is a recent paper2 that exposes
serious security vulnerabilities resulting from composing implementations
of apparently correctly specified components. In particular, the authors
of that paper examine the client-side
and server-side state diagrams of the
Transport Layer Security (TLS) specification. The authors show that approximately a half-dozen different popular TLS implementations (including
OpenSSL and the Java Secure Socket
Extension JSSE) introduce unexpected
security vulnerabilities, which arise as
emergent properties resulting from
the composition of the client-side and
server-side software. This case is an example of an open source concept that
failed to detect some fundamental
flaws—despite supposed many-eyes
review. Here, we are saying the self-auditing is the open-source process
itself. This research illustrates some
of the risks of ad hoc composition, the
underlying lack of predictability that
can result, and the lack of auditing
sufficient for correctness and security.
However, their paper addresses only
the tip of the iceberg when it comes
to exploitable vulnerabilities of open
source systems.
Digital Meters
The relative inaccuracy of self-calibrated (or merely factory-set) meters is often neglected in electronic measurement and design. Self-calibration can
be considered a form of self-auditing when performed against a presumably reliable reference source. Calibration
is also highly dependent on the specific applications. For example, while
a 5% error rate may not be of tremendous concern when measuring a 5-volt
source, at higher test levels the disparity can become problematic. There is
also the error of perception that comes
with digital displays, where precision
may be misinterpreted as accuracy. Engineers have been shown to have a propensity toward overly trusting trailing
digits in a numerical read-out, when
actually analog meters can provide
less-misleading relative estimates.8
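The arithmetic behind that caution is worth a moment: a fixed relative error grows into a large absolute one as the measured value rises. (The 500-volt figure below is an arbitrary example, not one from this column.)

for true_value in (5.0, 500.0):          # volts
    absolute_error = 0.05 * true_value   # the same 5% relative error
    print(true_value, "V, +/-", absolute_error, "V")
# 5.0 V, +/- 0.25 V    -- usually tolerable
# 500.0 V, +/- 25.0 V  -- a disparity that can matter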
Many concerns are raised as we become increasingly dependent on health-monitoring devices. For example,
millions of diabetics test their blood
glucose levels each day using computerized meters. System accuracy for such
consumer-grade devices is recommended to be within 15 mg/dl as compared
with laboratory results, yet experimental data shows that in the low-blood
sugar range (<= 75 mg/dl), some 5% of
these personal-use meters will fail to
match the (presumably more stringent)
laboratory tests. Reliance on results that
show higher than actual values in the
low range (where percentages are most
critical) may result in the user’s failure to
take remedial action or seek emergency
medical attention, as appropriate. Many
users assume the meters are accurate,
and are unaware that periodic testing
should be performed using a control
solution (the hefty price of which is often not covered by health insurance). In
actuality, since the control-solution test
uses the same meter and is not a wholly
independent comparison (for example,
with respect to a laboratory test), it too
may not provide sufficient reliability to
establish confidence of accuracy.
End-to-End System Assurance
The security literature has long demonstrated that embedded testing
mechanisms in electronic systems can
be circumvented or designed to provide false validations of the presumed
correctness of operations. Proper end-to-end system design (such as with respect to Common Criteria and other
security-related standards) is intended to ferret out such problems and
provide assurances that results are being accurately reported. Unfortunately, most systems are not constructed
and evaluated against such potentially
stringent methodologies.
Yet, even if such methods were applied, all of the security issues may not
be resolved, as was concluded in a SANS
Institute 2001 white paper.1 The author
notes that the Common Criteria “can
only assist the IT security communities
to have the assurance they need and
may push the vendor and developer for
[a] better security solution. IT security
is a process, which requires the effort
from every individual and management
in every organization. It is not just managing the risk and managing the threat;
it is the security processes of Assessment, Prevention, Detection and Response; it is a cycle.” Rebecca Mercuri
also points out7 that certain requirements cannot be satisfied simultaneously (such as a concurrent need for
system integrity and user privacy along
with assuredly correct auditability),
whereas the standards fail to mitigate
or even address such design conflicts.
The Volkswagen Case
and Its Implications
Security professionals are well aware
that the paths of least resistance (such
as the opportunities and knowledge
provided to insiders) often form the
best avenues for system exploits. These
truths were underscored when Volkswagen announced in September 2015
“that it would halt sales of cars in the
U.S. equipped with the kind of diesel
motors that had led regulators to accuse the German company of illegally
[creating] software to evade standards
for reducing smog.”5
While Volkswagen’s recall appeared
at first to be voluntary, it had actually
been prompted by investigations following a March 2014 Emissions Workshop (co-sponsored by the California
Air Resources Board and the U.S. Environmental Protection Agency (EPA),
among others). There, a West Virginia
University research team working
under contract for the International
Council on Clean Transportation
(ICCT, a European non-profit) provided
results showing the self-tested data significantly underrepresented what occurred under actual driving conditions.
These revelations eventually led to a
substantial devaluation of Volkswagen
stock prices and the resignations of the
CEO and other top company officials,
followed by additional firings and layoffs. Pending class-action and fraud
lawsuits and fines promise to be costly
in the U.S. and abroad.
Ironically, the report9 was originally
intended to support the adoption of the
presumably strict U.S. emissions testing program by European regulators,
in order to further reduce the release of
nitrogen oxides into the air. Since the
university researchers did not just confine themselves to automated testing,
but actually drove the vehicles on-road,
they were able to expose anomalous
results that were as much as 40 times
what is allowed by the U.S. standard defined by the Clean Air Act. The EPA subsequently recalled seven vehicle models dating from 2009–2015, including
approximately 500,000 vehicles in the
U.S.; Germany ordered the recall of 2.4 million
vehicles. Extensive hardware and software changes are required to effect the
recall modifications. Still, the negative
environmental impacts will not be fully
abated, as the recalls are anticipated to
result in poorer gas mileage for the existing Volkswagen diesel vehicles.
Election Integrity
An application area that is particularly
rife with risks involves Direct Recording Electronic (DRE) voting systems—
which are self-auditing. These are end-to-end automated systems, with results
based supposedly entirely on users’
ballot entries. Aggregated results over
multiple voters may not have assured
correspondence with the inputs. Most
of the commercial systems today lack
independent checks and balances, and
are typically proprietary and prohibited from external validation.
Reports of voters choosing one candidate and seeing their selection displayed incorrectly have been observed
since the mid-1990s. This occurs on
various electronic balloting systems
(touchscreen or push-button). However, what happens when votes are
recorded internally (or in processing
optically scanned paper ballots) inherently lacks any independent validation.
For example, Pennsylvania certified a
system even after videotaping a vote-flipping incident during the state’s
public testing. The questionable design and development processes of
these systems—as well as inadequate
maintenance and operational setup—are known to result in improper
and unchecked screen alignment and
strangely anomalous results.
Some research has been devoted to
end-to-end cryptographic verification
that would allow voters to demonstrate
their choices were correctly recorded
and accurately counted.4 However, this
concept (as with Internet voting) enables possibilities of vote buying and
selling. It also raises serious issues of
the correctness of cryptographic algorithms and their implementation,
including resistance to compromise
of the hardware and software in which
the cryptography would be embedded.
Analogous Examples
It seems immediately obvious that the
ability to rig a system so it behaves correctly only when being tested has direct bearing on election systems. The
Volkswagen situation is a bit more
sophisticated because the emissions
system was actually controlled differently to produce appropriate readings whenever testing was detected.
Otherwise, it is rather similar to the
voting scenario, where the vendors
(and election officials) want people to
believe the automated testing actually
validates how the equipment is operating during regular operations, thus
seemingly providing some assurance
of correctness. While activation of the
Volkswagen stealth cheat relied on a
physical connection to the testing system, one might imagine a tie-in to the
known locations of emission inspection stations—using the vehicle’s GPS
system—which could similarly be applied to voting machines detecting
their polling place.
Election integrity proponents often
point to the fact that lottery tickets are
printed out by the billions each year,
while voting-system vendors seem to
have difficulty printing out paper ballots that can be reviewed and deposited
by the voter in order to establish a paper audit trail. Numerous security features on the lottery tickets are intended
to enable auditing and thwart fraud,
and are in principle rather sophisticated. While the location and time of
lottery ticket purchases are known and
recorded, this would not be possible
for elections, as it violates the secrecy
of the ballot. However, it should be
noted that insider lottery fraud is still
possible, and has been detected.
Automatic Teller Machines (ATMs)
are internally self-auditing, but this
is done very carefully—with extensive cross-checking for consistency to
ensure each transaction is correctly
processed and there are no discrepancies involving cash. There is an exhaustive audit trail. Yet, there are still
risks. For example, some ATMs have
been known to crash and return the
screen to the operating-system command level. Even more riskful is the
possible presence of insider misuse
and/or malware. Code has been discovered for a piece of malware that
targets Diebold ATMs (this manufacturer was also a legacy purveyor of voting machines). The code for this malware used undocumented features to
create a virtual ‘skimmer’ capable of
recording card details and personal
identification numbers without the
user’s knowledge, suggesting the creator may have had access to the source
code for the ATM. While this does not
directly point to an inside job, the possibility certainly cannot be ruled out.
Experts at Sophos (a firewall company)
believe this code was intended to be
preinstalled by an insider at the factory, and would hold transaction details
until a special card was entered into
the machine—at which point a list
of card numbers, PINs, and balances
would be printed out for the ne’er-do-well to peruse, and perhaps use, at
leisure. It is also possible the malware
could be installed by someone with
access to the ATM’s internal workings,
such as the person who refills the supply of money each day (especially if
that malware were to disable or alter
the audit process).
Complex Multi-Organizational
Systems
One case in which oversight was supposedly provided by corporate approval
processes was the disastrous collapse of
the Deepwater Horizon. The extraction
process in the Gulf of Mexico involved
numerous contractors and subcontractors, and all sorts of largely self-imposed monitoring and presumed safety
measures. However, as things began to
go wrong incrementally, oversight became increasingly complicated—exacerbated further by pressures of contractual time limits and remote managers.
This situation is examined in amazing
detail in a recent book on this subject.3
Conclusion
Recognition of the risks of systems
that are exclusively self-auditing is
not new. Although remediations have
been repeatedly suggested, the reality
is even worse today. We have a much
greater dependence on computer- and
network-based systems (most of which
are riddled with security flaws, potentially subject to external attacks, insider misuse, and denials of service). The
technology has not improved with respect to trustworthiness, and the total-system risks have evidently increased
significantly.
Independent verification is essential on a spot-check and routine basis. Security must be designed in, not
added on; yet, as we have seen, hacks
and exploits can be designed in as well.
Hired testers may suffer from tunnel
vision based on product objectives or
other pressures. Group mentality or
fraudulent intent may encourage cover-up of detected failure modes. Whistle-blowers attempting to overcome
inadequate self-auditing are often
squelched—which tends to suppress
reporting. Classified and trade secret
systems inherently add to the lack of
external oversight.
The bottom line is this: Lacking
the ability to independently examine
source code (much less recompile it),
validate results, and perform spot-checks on deployed devices and system
implementations, various anomalies
(whether deliberate or unintentional)
are very likely to be able to evade detection. Specific questions must be periodically asked and answered, such as:
What independent audits are being
performed in order to ensure correctness and trustworthiness? When are
these audits done? Who is responsible
for conducting these audits? Without
sufficient and appropriate assurances,
self-auditing systems may be nothing
more than a charade.
References
1. Aizuddin, A. The Common Criteria ISO/IEC 15408—
The Insight, Some Thoughts, Questions and Issues,
2001; http://bit.ly/1IVwAr8
2. Beurdouche, B. et al. A messy state of the union:
Taming the composite state machines of TLS. In
Proceedings of the 36th IEEE Symposium on Security
and Privacy, San Jose, CA (May 18–20, 2015); https://
www.smacktls.com/smack.pdf
3. Boebert, E. and Blossom, J. Deepwater Horizon: A
Systems Analysis of the Macondo Disaster. Harvard
University Press, 2016.
4. Chaum, D. Secret-ballot receipts: True voter-verifiable
elections. IEEE Security and Privacy 2, 1 (Jan./Feb. 2004).
5. Ewing, J. and Davenport, C. Volkswagen to stop sales
of diesel cars involved in recall. The New York Times
(Sept. 20, 2015).
6. Guttman, B. and Roback, E.A. An Introduction
to Computer Security: The NIST Handbook. U.S.
Department of Commerce, NIST Special Publication
800-12 (Oct. 1995).
7. Mercuri, R. Uncommon criteria. Commun. ACM 45, 1
(Jan. 2002).
8. Rako, P. What’s all this meter accuracy stuff, anyhow?
Electronic Design 16, 41 (Sept. 3, 2013).
9. Thompson, G. et al. In-use emissions testing of light-duty diesel vehicles in the United States. International
Council on Clean Transportation (May 30, 2014);
http://www.theicct.org
Rebecca Mercuri ([email protected])
is a digital forensics and computer security expert
who testifies and consults on casework and product
certifications.
Peter G. Neumann ([email protected]) is Senior
Principal Scientist in the Computer Science Lab at SRI
International, and moderator of the ACM Risks Forum.
Copyright held by authors.
viewpoints
DOI:10.1145/2909881
George V. Neville-Neil
Article development led by
queue.acm.org
Kode Vicious
What Are You
Trying to Pull?
A single cache miss is more expensive than many instructions.
Dear KV,
I have been reading some pull requests from a developer who has recently been working in code that I also have to look at from time to time. The code he has been submitting is full of strange changes he claims are optimizations. Instead of simply returning a value such as 1, 0, or -1 for error conditions, he allocates a variable and then increments or decrements it, and then jumps to the return statement. I have not bothered to check whether or not this would save instructions, because I know from benchmarking the code those instructions are not where the majority of the function spends its time. He has argued any instruction we do not execute saves us time, and my point is his code is confusing and difficult to read. If he could show a 5% or 10% increase in speed, it might be worth considering, but he has not been able to show that in any type of test. I have blocked several of his commits, but I would prefer to have a usable argument against this type of optimization.
Pull the Other One

Dear Pull,
Saving instructions—how very 1990s of him. It is always nice when people pay attention to details, but sometimes they simply do not pay attention to the right ones. While KV would never encourage developers to waste instructions, given the state of modern software, it does seem like someone already has. KV would, as you did, come out on the side of legibility over the saving of a few instructions.
It seems that no matter what advances are made in languages and compilers, there are always programmers who think they are smarter than their tools, and sometimes they are right about that, but mostly they are not. Reading the output of the assembler and counting the instructions may be satisfying for some, but there had better be a lot more proof than that to justify obfuscating code. I can only imagine a module full of code that looks like this:

if (some condition) {
    retval++;
    goto out;
} else {
    retval--;
    goto out;
}
...
out:
    return(retval);
and, honestly, I do not really want to.
Modern compilers, or even not so modern ones, play all the tricks programmers used to have to play by hand—
inlining, loop unrolling, and many
others—and yet there are still some
programmers who insist on fighting
their own tools.
When the choice is between code
clarity and minor optimizations, clarity must, nearly always, win. A lack of
clarity is the source of bugs, and it is
no good having code that is fast and
wrong. First the code must be right,
then the code must perform; that is
the priority that any sane programmer must obey. Insane programmers,
well, they are best to be avoided.
The other significant problem
with the suggested code is it violates a common coding idiom. All
languages, including computer languages, have idioms, as pointed out
at length in The Practice of Programming by Brian W. Kernighan and Rob
Pike (Addison-Wesley Professional,
1999), which I recommended to readers more than a decade ago. Let’s not
think about the fact the book is still
relevant, and that I have been repeating myself every decade. No matter
what you think of a computer language, you ought to respect its idioms
for the same reason one has to know
idioms in a human language—they
facilitate communication, which is
the true purpose of all languages, programming or otherwise. A language
idiom grows organically from the use
of a language. Most C programmers,
though not all of course, will write an
infinite loop in this way:
for (;;) {
}
or as
while (1) {
}
with an appropriate break statement
somewhere inside to handle exiting
the loop when there is an error. In fact,
checking the Practice of Programming
book, I find this is mentioned early on
(in section 1.3). For the return case,
you mention it is common to return
using a value such as 1, 0, or -1 unless
the return encodes more than true,
false, or error. Allocating a stack variable and incrementing or decrementing and adding a goto is not an idiom
I have ever seen in code, anywhere—
and now that you are on the case, I
hope I never have to.
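For contrast, here is a minimal sketch of the idiom KV is defending: return the status value directly instead of bumping a temporary and jumping to a label. The function name and values are illustrative, not taken from the column.

#include <stdio.h>

/* A hypothetical function illustrating the conventional idiom:
 * return the status value (1, 0, or -1) directly rather than
 * incrementing a temporary and jumping to a shared label. */
static int classify(int value)
{
    if (value > 0)
        return 1;
    if (value < 0)
        return -1;
    return 0;
}

int main(void)
{
    printf("%d %d %d\n", classify(42), classify(0), classify(-7));
    return 0;
}

Whether the compiler emits more or fewer instructions for either form is exactly the kind of thing KV says should be measured rather than assumed.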
Moving from this concrete bit of
code to the abstract question of when
it makes sense to allow some forms of
code trickery into the mix really depends on several factors, but mostly
on how much speedup can be derived
from twisting the code a bit to match
the underlying machine a bit more
closely. After all, most of the hand optimizations you see in low-level code,
in particular C and its bloated cousin
C++, exist because the compiler cannot recognize a good way to map what
the programmer wants to do onto the
way the underlying machine actually works. Leaving aside the fact that
most software engineers really do not
know how a computer works, and leaving aside that what most of them were
taught—if they were taught—about
computers, hails from the 1970s and
1980s before superscalar processors
and deep pipelines were a standard
feature of CPUs, it is still possible to
find ways to speed up by playing tricks
on the compiler.
The tricks themselves are not that
important to this conversation; what
is important is knowing how to measure their effects on the software.
This is a difficult and complicated
task. It turns out that simply counting instructions as your co-worker
has done does not tell you very much
about the runtime of the underlying
code. In a modern CPU the most precious resource is no longer instructions, except in a very small number of
compute-bound workloads. Modern
systems do not choke on instructions;
they drown in data. The cache effects
of processing data far outweigh the
overhead of an extra instruction or
two, or 10. A single cache miss is a
32-nanosecond penalty, or about 100
cycles on a 3GHz processor. A simple
MOV instruction, which puts a single,
constant number into a CPU’s register,
takes one-quarter of a cycle, according
to Agner Fog at the Technical University of Denmark (http://www.agner.org/optimize/instruction_tables.pdf).
That someone has gone so far as to
document this for quite a large number of processors is staggering, and
those interested in the performance
of their optimizations might well
lose themselves in that site generally
(http://www.agner.org).
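To make the arithmetic concrete, here is a small back-of-the-envelope sketch using the figures quoted above; the numbers are assumptions carried over from the column, and the program is only an illustration.

#include <stdio.h>

/* Back-of-the-envelope arithmetic using the figures quoted above:
 * a ~32 ns miss on a 3 GHz part costs roughly 32 * 3 = 96 cycles,
 * while a register MOV is quoted at about 0.25 cycles. */
int main(void)
{
    const double miss_ns    = 32.0;   /* assumed miss penalty, in ns    */
    const double clock_ghz  = 3.0;    /* cycles per nanosecond at 3 GHz */
    const double mov_cycles = 0.25;   /* per the instruction tables     */

    double miss_cycles = miss_ns * clock_ghz;
    printf("one miss ~ %.0f cycles ~ %.0f MOV-equivalents\n",
           miss_cycles, miss_cycles / mov_cycles);
    return 0;
}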
The point of the matter is that a
single cache miss is more expensive
than many instructions, so optimizing away a few instructions is not really going to win your software any
speed tests. To win speed tests you
have to measure the system, see where
the bottlenecks are, and clear them if
you can. That, though, is a subject for
another time.
KV
Related articles
on queue.acm.org
Human-KV Interaction
http://queue.acm.org/detail.cfm?id=1122682
Quality Software Costs Money—
Heartbleed Was Free
Poul-Henning Kamp
http://queue.acm.org/detail.cfm?id=2636165
The Network Is Reliable
Peter Bailis and Kyle Kingsbury
http://queue.acm.org/detail.cfm?id=2655736
George V. Neville-Neil ([email protected]) is the proprietor of
Neville-Neil Consulting and co-chair of the ACM Queue
editorial board. He works on networking and operating
systems code for fun and profit, teaches courses on
various programming-related subjects, and encourages
your comments, quips, and code snips pertaining to his
Communications column.
Copyright held by author.
viewpoints
DOI:10.1145/2909883
Peter J. Denning
The Profession of IT
How to Produce
Innovations
Making innovations happen is surprisingly easy,
satisfying, and rewarding if you start small and build up.
You have an idea for something new that could
change your company, maybe even the industry. What
do you do with your idea?
Promote it through your employer’s
social network? Put a video about it
on YouTube? Propose it on Kickstarter
and see if other people are interested?
Found a startup? These possibilities
have murky futures. Your employer
might not be interested, the startup
might fail, the video might not go viral,
the proposal might not attract followers. And if any of these begins to look
viable, it could be several years before
you know if your idea is successful. In
the face of these uncertainties, it would
be easy to give up.
Do not give up so easily. Difficulty
getting ideas adopted is a common
complaint among professionals. In
this column, I discuss why it might not
be as difficult as it looks.
The Apparent Weediness
of Adoption
Bob Metcalfe’s famous story of the
Ethernet illustrates the difficulties.2
With the provocative title “invention
is a flower, innovation is a weed” he
articulated the popular impression
that creating an idea is glamorous
and selling it is grunt work. In his account of Ethernet and the founding of
3Com to sell Ethernets, the invention
part happened in 1973–1974 at Xerox
PARC. It produced patents, seminal
academic papers, and working prototypes. The Ethernet was adopted
within Xerox systems. Metcalfe left
Xerox in 1979 to found 3Com, which
developed and improved the technology and championed it for an international standard (achieved in 1983
as IEEE 802.3). Metcalfe tells of many
hours on the road selling Ethernets to
executives who had never heard of the
technology; he often had only a short
time to convince them Ethernet was
better than their current local-network technology and they could trust
him and his company to deliver. He
did a lot of “down in the weeds” work
to get Ethernet adopted.
Metcalfe summarized his experience saying the invention part took
two years and the adoption part took
10. He became wealthy not because
he published a good paper but because he sold Ethernets for 10 years.
He found this work very satisfying
and rewarding.
Sense 21
I would like to tell a personal story that
sheds light on why adoption might be
rewarding. In 1993, I created a design
course for engineers. I called it “Designing a new common sense for engineering in the 21st century,” abbreviated “Sense 21.” The purpose of this
course was to show the students how
innovation works and how they might
be designers who can intentionally
produce innovations.
I became interested in doing this
after talking to many students and
learning about the various breakdowns they had around their aspirations for producing positive change
in their organizations and work environments. These students were
seniors and graduate students in
the age group 20–25. They all were
employed by day and took classes in
the evening. The breakdowns they
discussed with me included: suffering time crunch and information
overload, inability to interest people
in their ideas, frustration that other
“poor” ideas are selected instead of
their obviously “better” ideas, belief
that good ideas sell themselves, revulsion at the notion you have to sell
ideas, complaints that other people
do not listen, and complaints that
many customers, teammates, and
bosses were jerks. I wanted to help
these students by giving them tools
that would enable them to navigate
through these problems instead of
being trapped by them. I created the
Sense 21 course for them.
I announced to the students that the
course outcome is “produce an innovation.” That meant each of them would
find an innovation opportunity and
make it happen. To get there we would
need to understand what innovation
is—so we can know what we are to produce—and to learn some foundational
tools of communication that are vital
for making it happen.
We spent the first month learning
the basics of generating action in language—specifically speech acts and
the commitments they generate, and
how those commitments shape their
worlds.1 There is no action without a
commitment, and commitments are
made in conversations. The speech
acts are the basic moves for making
commitments. What makes this so
fundamental is there are only five kinds
of commitments (and speech acts) and
therefore the basic communication
tools are simple, universal, and powerful. With this we were challenging
the common sense that the main purpose of language is to communicate
messages and stories. We were after a
new sense: with language we make and
shape the world.
Everett Rogers, whose work on innovation has been very influential since
1962, believed communication was
essential to innovation. Paraphrasing
Rogers: “Innovation is the creation of
a novel proposal that diffuses through
the communication channels of a social network and attracts individuals to
decide to adopt the proposal.”3
The message sense of communication permeates this view: an innovation proposal is an articulation and
description of a novel idea to solve a
problem, and adoption is an individual decision made after receiving messages about the proposal.
My students struggled with this definition of innovation. They could not
see their own agency in adoption. How
do they find and articulate novel ideas?
What messages should they send, over
which channels? How do they find
and access existing channels? Should
they bring the message to prospective
adopters by commercials, email, or
personal visits? What forms of messages are most likely to influence a positive decision? How do they deal with
the markedly different kinds of receptivity to messages among early, majority, and laggard adopters? Should they
be doing something else altogether?
The definition gave no good answers
for such questions.
The alternative sense of language as
generator and shaper gave rise to a new
definition of innovation, which we used
in the course: “Innovation is adoption
of new practice in a community, displacing other existing practices.”

Calendar of Events

June 2–4
SIGMIS-CPR ’16: 2015 Computers and People Research Conference,
Washington, D.C.,
Sponsored: ACM/SIG,
Contact: Jeria Quesenberry,
Email: [email protected]

June 4–8
DIS ’16: Designing Interactive Systems Conference 2016,
Brisbane, QLD, Australia,
Sponsored: ACM/SIG,
Contact: Marcus Foth,
Email: [email protected]

June 8–10
PASC ’16: Platform for Advanced Scientific Computing Conference,
Lausanne, Switzerland,
Contact: Olaf Schenk,
Email: [email protected]

June 14–18
SIGMETRICS ’16: SIGMETRICS/PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems,
Antibes, Juan-Les-Pins, France,
Contact: Sara Alouf,
Email: [email protected]

June 18–22
ISCA ’16: The 42nd Annual International Symposium on Computer Architecture,
Seoul, Republic of Korea,
Contact: Gabriel Loh,
Email: [email protected]

June 19–23
JCDL ’16: The 16th ACM/IEEE-CS Joint Conference on Digital Libraries,
Newark, NJ,
Contact: Lillian N. Cassel,
Email: [email protected]

June 20–22
PerDis ’16: The International Symposium on Pervasive Displays,
Oulu, Finland,
Sponsored: ACM/SIG,
Contact: Vassilis Kostakos,
Email: [email protected]
The term “practice” refers to routines, conventions, habits, and other
ways of doing things, shared among
the members of a community. Practices are embodied, meaning people perform them without being aware they
are exercising a skill. Technologies are
important because they are tools that
enable and support practices. Since
people are always doing something,
adopting a new practice means giving
up an older one. This was the most demanding of all the definitions of innovation. It gives an acid test of whether
an innovation has happened.
With this formulation, student questions shifted. Who is my community?
How do I tell whether my proposal will
interest them? Is training with a new
tool a form of adoption? Who has power
to help or resist? Their questions shifted from forms of communication to
how they could engage with their community. A summary answer to their engagement questions was this process:
• Listen for concerns and breakdowns in your community
• Gather a small team
• Design and offer a new tool—a combination or adaptation of existing technologies that addresses the concern
• Mobilize your community into using the tool
• Assess how satisfied they are with their new practice
To start their project, I asked them
to find a small group of about five
people in their work environment.
This group would be their innovating
community. I did not impose much
structure on them because I wanted
them to learn how to navigate the unruly world they would be discovering. I
asked them to give progress reports to
the whole group, who frequently gave
them valuable feedback and gave me
opportunities to coach the group.
Here is an example of an incident
that helped a student—let’s call him
“Michael”—learn to listen for concerns. Michael was unhappy with me
because I declined his request to let
him use workstations in my lab for a
project unrelated to the lab. In class,
I asked Michael to repeat his request.
He did so enthusiastically and quickly fell into a confrontational mood.
He tried half a dozen different arguments on me, all variations on the
theme that I was acting unethically
or irrationally in denying his request.
None moved me. Soon the entire class
was offering suggestions to Michael.
None of that moved me either. After
about 10 minutes, Michael hissed,
“Are you just playing with me? Saying no just for spite? What’s wrong
with my request? It’s perfectly reasonable!” I said, “You have not addressed
any of my concerns.” With utter frustration, he threw his hands into the
air and exclaimed, “But I don’t even
know what you are concerned about!”
I smiled at him, leaned forward, and
said, “Exactly.”
Convulsed by a Great Aha!, Michael
turned bright red and proclaimed,
“Geez, now I get what you mean by
listening.” The other members of the
class looked startled and got it too.
Then they excitedly urged him on:
“Ask him what he is concerned about!”
This he did. Soon he proposed to fashion his project to help contribute to
the goals of the lab. I was seduced. We
closed a deal.
Could it be that finding out the concerns of your community might be as
simple as asking “What are your concerns?”
By the end of the semester we had
worked our way through the stages of
the process and coached on other fine
points. They all succeeded in producing
an innovation. In our final debriefing
they proclaimed an important discovery: innovations do not have to be big.
That was very important. The innovation stories they had learned all
their lives told them all innovations
are big world-shakers and are the
work of geniuses. However, their own
experiences told them they could produce small innovations even if they
were not geniuses. Moreover, they
saw they could increase the size of
their innovation communities over
time as they gained experience.
Getting It Done
A cursory reading of the Metcalfe story could lead you to conclude the full
Ethernet innovation took 10 years.
That is a very long time. If you believe
you will not see the fruits of your work
for 10 years, you are unlikely to undertake the work. If on the other hand
you believe your work consists of an
ongoing series of small innovations,
you will find your work enjoyable, and
after 10 years you will find it has added
up to a large innovation. This is what
Metcalfe wanted to tell us. He enjoyed
his work and found that each encounter with a new company that adopted
Ethernet was a new success and a new
small innovation.
The students said one other thing
that startled me. They said that taking
the course and doing the project was
life altering for them. The reason was
the basic tools had enabled them to
be much more effective in generating
action through all parts of their lives.
The realization that we generate action through our language is extraordinarily powerful.
If we can tell the stories and satisfying experiences of innovators doing everyday, small innovation, we will have a
new way to tell the innovation story and
lead people to more success with their
own innovations. Innovation is no ugly
weed. Like a big garden of small flowers, innovation is beautiful.
References
1. Flores, F. Conversations for Action and Collected
Essays. CreateSpace Independent Publishing
Platform, 2013.
2. Metcalfe, R. Invention is a flower, innovation is a
weed. MIT Technology Review (Nov. 1999); http://
www.technologyreview.com/featuredstory/400489/
invention-is-a-flower-innovation-is-a-weed/
3. Rogers, E. Diffusion of Innovations (5th ed. 2003). Free
Press, 1962.
Peter J. Denning ([email protected]) is Distinguished
Professor of Computer Science and Director of the
Cebrowski Institute for information innovation at
the Naval Postgraduate School in Monterey, CA, is
Editor of ACM Ubiquity, and is a past president of ACM.
The author’s views expressed here are not necessarily
those of his employer or the U.S. federal government.
Copyright held by author.
viewpoints
DOI:10.1145/2909885
Derek Chiou
Interview
An Interview
with Yale Patt
ACM Fellow Professor Yale Patt reflects on
his career in industry and academia.
Professor Yale Patt, the Ernest Cockrell, Jr. Centennial Chair in Engineering
at The University of Texas
at Austin has been named
the 2016 recipient of the Benjamin
Franklin Medal in Computer and
Cognitive Science by the Franklin
Institute. Patt is a renowned computer architect, whose research has
resulted in transformational changes
to the nature of high-performance
microprocessors, including the first
complex logic gate implemented on
a single piece of silicon. He has received ACM’s highest honors both in
computer architecture (the 1996 Eckert-Mauchly Award) and in education
(the 2000 Karl V. Karlstrom Award).
He is a Fellow of the ACM and the
IEEE and a member of the National
Academy of Engineering.
Derek Chiou, an associate professor
of Electrical and Computer Engineering at The University of Texas at Austin,
conducted an extensive interview of
Patt, covering his formative years to his
Ph.D. in 1966, his career since then,
and his views on a number of issues.
Presented here are excerpts from that
interview; the full interview is available
via the link appearing on the last page
of this interview.
DEREK CHIOU: Let’s start with the influences that helped shape you into
who you are. I have often heard you
comment on your actions as, “That’s
the way my mother raised me.” Can you
elaborate?
Yale Patt, ACM Fellow and Ernest Cockrell, Jr. Centennial Chair Professor at The University
of Texas at Austin.
YALE PATT: In my view my mother was
the most incredible human being who
ever lived. Born in Eastern Europe, with
her parents’ permission, at the age of
20, she came to America by herself. A
poor immigrant, she met and married
my father, also from a poor immigrant
family, and they raised three children.
We grew up in one of the poorer sections of Boston. Because of my mother’s insistence, I was the first from
that neighborhood to go to college. My
brother was the second. My sister was
the third.
You have often said that as far as your
professional life is concerned, she
taught you three important lessons.
That is absolutely correct. Almost
everyone in our neighborhood quit
school when they turned 16 and went
to work in the Converse Rubber factory, which was maybe 100 yards from
our apartment. She would have none
of it. She knew that in America the
road to success was education. She insisted that we stay in school and that
we achieve. An A-minus was not acceptable. “Be the best that you can be.”
That was the first lesson. The second
lesson: “Once you do achieve, your job
is to protect those who don’t have the
ability to protect themselves.” And I
have spent my life trying to do that. The
third lesson is to not be afraid to take a
stand that goes against the currents—
to do what you think is right regardless
of the flak you take. And I have certainly taken plenty of flak. Those were the
three lessons that I believe made me
into who I am. When I say that’s the way
my mother raised me, it usually has to
do with one of those three principles.
What about your father?
My father was also influential—but
in a much quieter way. We didn’t have
much money. It didn’t matter. He still
took us to the zoo. He took us to the
beach. He took me to my first baseball
game. He got me my first library card—
taught me how to read. I remember us
going to the library and getting my first
library card at the age of five. So when
I started school, I already knew how to
read. That was my father’s influence.
I understand there is a story about your
father that involves your first marathon.
Yes, the New York City Marathon.
The first time I ran it was in 1986. If
you finish, they give you a medal. I gave
it to my father. “Dad, this is for you.”
He says, “What’s this?” I said, “It’s a
medal.” “What for?” “New York City
Marathon.” “You won the New York
City Marathon?” “No, Dad. They give
you a medal if you finish the New York
City Marathon.” And then he looked
at me in disbelief. “You mean you lost
the New York City Marathon?” It was
like he had raised a loser, and I realized
that he too, in his quieter way, was also
pushing me to achieve and to succeed.
Besides your parents there were other
influences. For example, you’ve often
said Bill Linvill was the professor who
taught you how to be a professor.
Bill Linvill was incredible. He was
absolutely the professor who taught
me how to be a professor—that it’s
not about the professor, it’s about the
students. When he formed the new
Department of Engineering Economic
Systems, I asked if I could join him.
“No way,” he said. “You are a qualified
Ph.D. candidate in EE. You will get your
Ph.D. in EE, and that will open lots of
doors for you. If you join me now, you
will be throwing all that away, and I will
not let you do that. After you graduate,
if you still want to, I would love to have
you.” That was Bill Linvill. Do what is
best for the students, not what is best
for Bill Linvill.
You did your undergraduate work at
Northeastern. Why Northeastern?
Northeastern was the only school
I could afford financially, because of
the co-op plan. Ten weeks of school,
then ten weeks of work. It was a great
way to put oneself through engineering
school.
What do you think of co-op now?
I think it’s an outstanding way to
get an education. The combination
of what I learned in school and what
I learned on the job went a long way
toward developing me as an engineer.
In fact, I use that model with my Ph.D.
students. Until they are ready to devote
themselves full time to actually writing
the dissertation, I prefer to have them
spend their summers in industry. I
make sure the internships are meaningful, so when they return to campus
in the fall, they are worth a lot more
than when they left at the beginning of
the summer. The combination of what
we can teach them on campus and
what they can learn in industry produces Ph.D.’s who are in great demand
when they finish.
I understand you almost dropped out
of engineering right after your first engineering exam as a sophomore.
Yes, the freshman year was physics,
math, chemistry, English, so my first
engineering course came as a sophomore. I did so badly on my first exam
I wasn’t even going to go back and see
just how badly. My buddies convinced
me we should at least go to class and
find out. There were three problems on
the exam. I knew I got one of them. But
one of them I didn’t even touch, and
the third one I attempted, but with not
great success. It turns out I made a 40.
The one I solved I got 33 points for. The
one I didn’t touch I got 0. And the one I
tried and failed I got seven points. The
professor announced that everything
above a 25 was an A. I couldn’t believe
it. In fact, it took me awhile before I understood.
Engineering is about solving problems. You get no points for repeating
what the professor put on the blackboard. The professor gives you problems you have not seen before. They
have taught you what you need to solve
them. It is up to you to show you can.
You are not expected to get a 100, but
you are expected to demonstrate you
can think and can crack a problem that
you had not seen before. That’s what
engineering education is about.
Then you went to Stanford University
for graduate work. Why did you choose
Stanford?
My co-op job at Northeastern was
in microwaves, so it seemed a natural
thing to do in graduate school. And,
Stanford had the best program in electromagnetics.
But you ended up in computer engineering. How did that happen?
There’s a good example of how one
professor can make a difference. At
Stanford, in addition to your specialty,
they required that you take a course in
some other part of electrical engineering. I chose switching theory, which at
the time we thought was fundamental
to designing computers, and we recognized computers would be important in
the future. The instructor was a young
assistant professor named Don Epley.
Epley really cared about students, made
the class exciting, made the class challenging, was always excited to teach us
and share what he knew. By the end of
the quarter, I had shifted my program
to computers and never looked back.
The rumor is you wrote your Ph.D. thesis in one day. What was that all about?
Not quite. I made the major breakthrough in one day. As you know, when
you are doing research, at the end of
each day, you probably don’t have a
lot to show for all you did that day.
But you keep trying. I was having a dry
spell and nothing was working. But I
kept trying. I had lunch, and then I’d
gone back to my cubicle. It was maybe
2:00 in the afternoon. All of a sudden,
everything I tried worked. The more I
tried, the more it worked. I’m coming
up with algorithms, and I’m proving
theorems. And it’s all coming together,
and, my heart is racing at this point. In
fact, that’s what makes research worthwhile—those (not often) moments
when you’ve captured new knowledge,
and you’ve shown what nobody else
has been able to show. It’s an amazing
feeling. Finally I closed the loop and
put the pen down. I was exhausted; it
was noon the next day. I had worked
from 2:00 in the afternoon all the way
through the night until noon the next
day, and there it was. I had a thesis!
So you wrote your thesis in one day.
No, I made the breakthrough in one
day, which would not have happened if
it had not been for all those other days
when I kept coming up empty.
What did you do then?
I walked into my professor’s office.
He looked up from his work. I went to
the blackboard, picked up the chalk,
and started writing. I wrote for two
hours straight, put down the chalk and
just looked at him. He said, “Write it up
and I’ll sign it. You’re done.”
After your Ph.D., your first job was as an
assistant professor at Cornell University. Did you always plan on teaching?
No. I always thought: Those who
can do; those who can’t, teach. I interviewed with 10 companies, and had
nine offers. I was in the process of deciding when Fred Jelinek, a professor
at Cornell, came into my cubicle and
said, “We want to interview you at Cornell.” I said, “I don’t want to teach.” He
said, “Come interview. Maybe you’ll
change your mind.” So there I was,
this poor boy from the slums of Boston
who could not have gotten into Cornell back then, being invited to maybe
teach there. I couldn’t turn down the
opportunity to interview, so I interviewed, and I was impressed—Cornell
is an excellent school.
Now I had 10 offers. After a lot of
agonizing, I decided on Cornell. All my
friends said, “We knew you were going to decide on Cornell because that’s
what you should be—a teacher.” And
they were right! I was very lucky. If Fred
Jelinek had not stumbled into my cubicle, I may never have become a professor, and for me, it’s absolutely the most
fantastic way to go through life.
Why did you only spend a year there?
At the time, the U.S. was fighting a
war in Vietnam. I was ordered to report
to active duty in June 1967, at the end
of my first year at Cornell. I actually volunteered; I just didn’t know when my
number would come up.
Your active duty started with boot
camp. What was that like?
Boot camp was amazing. Not that I
would want to do it again, but I am glad
I did it once. It taught me a lot about
the human spirit, and the capabilities
of the human body that you can draw
on if you have to.
What happened after boot camp?
After nine weeks of boot camp, I was
assigned to the Army Research Office
for the rest of my two-year commitment. I was the program manager for
a new basic research program in computer science. I was also the Army’s
representative on a small committee
that was just beginning the implementation of the initial four-node ARPANET. I knew nothing about communication theory, but I had a Ph.D. in EE,
and had been a professor at Cornell, so
someone thought I might be useful. In
fact, it was an incredible learning experience. I had fantastic tutors: Lenny
Kleinrock and Glen Culler. Lenny had
enormous critical expertise in both
packet switching and queueing theory.
Glen was a professor at UC Santa Barbara, trained as a mathematician, but
one of the best engineers I ever met.
In fact, I give him a lot of the credit for
actually hacking code and getting the
initial network to work.
After the Army, you stayed in North
Carolina, taught at NC State, then
moved to San Francisco State to build
their computer science program. Then
you went to Berkeley. You were a visiting professor at Berkeley from 1979 to
1988. What was that like?
Berkeley was an incredible place at
that time. Mike Stonebraker was doing Ingres, Sue Graham had a strong
compiler group, Dick Karp and
Manny Blum were doing theory, Domenico Ferrari was doing distributed
UNIX, Velvel Kahan was doing IEEE
Floating Point, Dave Patterson with
Carlo Sequin had started the RISC
project, and I and my three Ph.D. students Wen-mei Hwu, Mike Shebanow,
and Steve Melvin were doing HPS. In
fact, that is where HPS was born. We
invented the Restricted Data Flow
model, showed that you could do out-of-order execution and still maintain precise exceptions, and that you
could break down complex instructions into micro-ops that could be
scheduled automatically when their
dependencies were resolved. We had
not yet come up with the needed aggressive branch predictor, but we did
lay a foundation for almost all the
cutting-edge, high-performance microprocessors that followed.
You had other Ph.D. students at Berkeley as well.
Yes, I graduated six Ph.D.’s while
I was at Berkeley—I guess a little unusual for a visiting professor. The other
three were John Swensen, Ashok Singhal, and Chien Chen. John was into numerical methods and showed that an
optimal register set should contain a
couple of very fast registers when latency is the critical issue and a large number of slow registers when throughput
is critical. Ashok and Chien worked on
implementing Prolog, which was the
focal point of the Aquarius Project, a
DARPA project that Al Despain and I
did together.
Then you went to Michigan. Two things
stand out at Michigan: first, your research in branch prediction.
We actually did a lot of research in
branch prediction during my 10 years
at Michigan, but you are undoubtedly
thinking of our first work, which I did
with my student Tse-Yu Yeh. Tse-Yu
had just spent the summer of 1990
working for Mike Shebanow at Motorola. Mike was one of my original
HPS students at Berkeley. When Tse-Yu returned to Michigan at the end
of the summer, he had some ideas
about branch prediction, based on his
interaction with Shebanow. He and I
ended up with the two-level adaptive
branch predictor which we published
in Micro in 1991. Intel was the first
company to use it. When they moved
from a five-stage pipeline on Pentium
to a 12-stage pipeline on Pentium Pro,
they could not afford the misprediction penalty they would have gotten
with their Pentium branch predictor.
So, they adapted ours. Since then,
some variation has been used by just
about everybody.
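For readers unfamiliar with the idea, the following is a heavily simplified sketch of a two-level scheme: a global history of recent branch outcomes indexing a table of 2-bit saturating counters. It is an illustration only, not the design published in Micro in 1991; the names and sizes are arbitrary.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* A heavily simplified sketch of a two-level predictor: a global
 * history register of recent branch outcomes selects one of many
 * 2-bit saturating counters. Sizes and names are arbitrary; this is
 * not the published Yeh/Patt design. */
#define HISTORY_BITS 8
#define TABLE_SIZE   (1u << HISTORY_BITS)

static uint8_t history;               /* first level: recent outcomes    */
static uint8_t counters[TABLE_SIZE];  /* second level: 2-bit counters    */

static bool predict_taken(void)
{
    return counters[history] >= 2;    /* predict taken in states 2 and 3 */
}

static void train(bool taken)
{
    uint8_t *c = &counters[history];
    if (taken && *c < 3)
        (*c)++;
    else if (!taken && *c > 0)
        (*c)--;
    history = (uint8_t)((history << 1) | (taken ? 1 : 0));
}

int main(void)
{
    /* Toy trace: a branch that repeats taken, taken, taken, not-taken. */
    const bool pattern[4] = { true, true, true, false };
    int correct = 0;
    for (int i = 0; i < 4000; i++) {
        bool actual = pattern[i % 4];
        if (predict_taken() == actual)
            correct++;
        train(actual);
    }
    printf("correct predictions: %d of 4000\n", correct);
    return 0;
}

With eight bits of history this toy version has 256 counters; real designs vary the history length, keep per-branch histories, and fold the branch address into the index.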
Michigan is also where you developed
the freshman course.
Yes, I had wanted to teach that
material to freshmen for a long time,
but always ran up against a brick wall.
Then in 1993, the faculty were complaining that students didn’t understand pointer variables and recursion
was magic. I just blurted out, “The
reason they don’t understand is they
have no idea what’s going on underneath. If we really want them to understand, then we have to start with
how the computer works.” I offered
to do it, and the faculty said okay.
Kevin Compton and I developed the
freshman course, and in fall 1995, we
taught it for the first time. In fall 1996,
it became the required first course in
computing, and we taught it to all 400
EECS freshmen.
I heard Trevor Mudge volunteered to
teach it if something happened.
Trevor said he would be willing
to teach the course if we gave him a
book. There was no book. In fact, the
course was completely different from
every freshman book on the market.
We started with the transistor as a
wall switch. Kids have been doing wall
switches since they were two years old,
so it was not difficult to teach them the
switch level behavior of a transistor.
From wall switches we made inverters,
and then NAND gates and NOR gates,
followed by muxes and decoders and
latches and memory, then a finite state
machine, and finally a computer. They
internalized the computer, bottom-up, and then wrote their first program
in the machine language of the LC-2,
a computer I invented for the course.
Programming in 0s and 1s gets old very
quickly, so we quickly moved to LC-2
assembly language.
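As an aside, the composition idea can be sketched in a few lines of C. This is only an illustration of building gates out of NAND; it is not material from the course or the textbook.

#include <stdbool.h>
#include <stdio.h>

/* A toy illustration (not course material) of the bottom-up idea:
 * start from a single primitive gate and compose the others from it. */
static bool nand_gate(bool a, bool b) { return !(a && b); }
static bool not_gate(bool a)          { return nand_gate(a, a); }
static bool and_gate(bool a, bool b)  { return not_gate(nand_gate(a, b)); }
static bool or_gate(bool a, bool b)   { return nand_gate(not_gate(a), not_gate(b)); }

int main(void)
{
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++)
            printf("a=%d b=%d  NAND=%d AND=%d OR=%d\n",
                   a, b, nand_gate(a, b), and_gate(a, b), or_gate(a, b));
    return 0;
}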
Since Trevor needed a textbook
to teach the course in the spring, I
wrote the first draft over Christmas
vacation. That’s why the freshman
textbook was born. If Trevor hadn’t
insisted, who knows? There may not
have been a freshman textbook. But
there was no other book available
because it was a complete departure
from everybody else.
You ended up co-authoring the book
with one of your Ph.D. students.
Yes, originally, it was going to be
with Kevin Compton, but Kevin ended
up not having time to do it. So I asked
Sanjay Patel, one of my Ph.D. students
who TA’d the course the first year we
offered it. We wrote the book together,
and published it as he was finishing
his Ph.D.
You left Michigan in 1999 to come to
Texas. Is there anything at Texas that
particularly stands out?
Far and away, my students and my
colleagues. I have now graduated 12
Ph.D.’s at Texas. When I came here, I
brought my Michigan Ph.D. students
with me. Two of them, Rob Chappell
and Paul Racunas, received Michigan
degrees but actually finished their research with me at UT. Two others, Mary
Brown and Francis Tseng, were early
enough in the Ph.D. program that it
made more sense for them to transfer.
Mary graduated from UT in 2005, went
to IBM, rose to be one of the key architects of their Power 8 and 9 chips, and
recently left IBM to join Apple. Francis
got his Ph.D. in 2007, and joined Intel’s
design center in Hillsboro, Oregon.
With respect to my colleagues, I
consider one of my biggest achievements that I was able to convince you
and Mattan Erez to come to Texas. The
two of you are, in a major way, responsible for building what we’ve got in the
computer architecture group at Texas.
Six of your students are professors?
That’s right. Three of them hold
endowed chairs. Wen-Mei Hwu is the
Sanders Chair at Illinois. Greg Ganger,
one of my Michigan Ph.D.’s, holds the
Jatras Chair at Carnegie Mellon, and
Onur Mutlu, one of my Texas Ph.D.’s,
holds the Strecker Chair at Carnegie Mellon. In total, I have two at Illinois,
Wen-Mei Hwu and Sanjay Patel, also
a tenured full professor, two at Carnegie Mellon, Greg Ganger and Onur
Mutlu, and two at Georgia Tech, Moin
Qureshi, and Hyesoon Kim, both associate professors.
And a number of your students are doing great in industry too.
Yes. I already mentioned Mary
Brown. Mike Shebanow has designed a
number of chips over the years, including the Denali chip at HAL and the M1
at Cyrix. He was also one of the lead
architects of the Fermi chip at Nvidia.
Mike Butler, my first Michigan Ph.D.,
was responsible for the bulldozer core
at AMD. Several of my students play key
roles at Intel and Nvidia.
You are well known for speaking your
mind on issues you care about, and
have some very strong views on many
things. Let’s start with how you feel
about the United States of America.
Quite simply, I love my country. I already mentioned that I spent two years
in the Army—voluntarily. I believe everyone in the U.S. should do two years
of service, and that nobody should be
exempt. It’s not about letting the other
guy do it. It’s about every one of us accepting this obligation. I believe in universal service. It does not have to be the
military. It can be the Peace Corps, or
Teach for America, or some other form
of service.
I also believe in immigration. That’s
another key issue in the U.S. today.
Immigration is part of the core of the
American fabric. It has contributed
enormously to the greatness of America. Some people forget that unless
you’re a Native American we all come
from immigrant stock. The Statue of
Liberty says it well: “Give me your tired,
your poor.” It is a core value of America.
I hope we never lose it.
I also believe in the Declaration
of Independence as the founding
document of America, and the Constitution as the codification of that
document. Most important are the
10 amendments Jefferson put forward that represent the essence of
America. “We hold these truths to be
self-evident,” that some rights are too
important to leave to the will of the
majority, that they are fundamental
to every human being. And that’s also
come under siege lately. Freedom of
speech, assembly, free from unlawful
search and seizure, habeas corpus,
the knowledge that the police can’t
come and pick you up and lock you up
and throw the key away. Some of this
seems to have gotten lost over the last
few years. I remain hopeful we will return to these core values, that nothing
should stand in the way of the first 10
amendments to the Constitution.
Let’s talk about your research and
teaching. Can you say something about
how you mentor your Ph.D. students in
their research?
I don’t believe in carving out a problem and saying to the student, “Here’s
your problem. Turn the crank; solve
the problem.” I have a two-hour meeting every week with all my graduate
students. My junior students are in the
room when I push back against my senior students. Initially, they are assisting my senior students so they can follow the discussion. At some point, they
identify a problem they want to work
on. Maybe during one of our meetings,
maybe during a summer internship,
whenever. I encourage them to work
on the problem. They come up with
stuff, and I push back. If they get too far
down a rat hole, I pull them back. But
I cut them a lot of slack as I let them
continue to try things. In most cases,
eventually they do succeed.
Don’t research-funding agencies require
you to do specific kinds of research?
I don’t write proposals to funding
agencies. I’ve been lucky that my research has been supported by companies. It is true that in this current
economy, money is harder to get from
companies. So if any companies are
reading this and would like to contribute to my research program and fund
my Ph.D. students, I’ll gladly accept
a check. The checks from companies
come as gifts, which means there is no
predetermined path we are forced to
travel; no deliverables we have promised. In fact, when we discover we are
on the wrong path, which often happens, we can leave it. My funding has
come almost exclusively from companies over the last 40 years so I don’t
have that problem.
There is a story about you wanting to
give your students a shovel.
As I have already pointed out, most
days nothing you try works out so when
it is time to call it a day, you have nothing to show for all your work. So I’ve often thought what I should do is give my
student a shovel and take him out in
the backyard and say, “Dig a hole.” And
he would dig a hole. And I’d say, “See?
You’ve accomplished something. You
can see the hole you’ve dug.” Because
at the end of most days, you don’t see
anything else.
The next day, the student still
doesn’t see anything, so we go to the
backyard again. “Now fill in the hole.”
So, again, he could see the results of
what he did. And that’s the way research goes day after day, until you
make the breakthrough. All those days
of no results provide the preparation
so that when the idea hits you, you can
run with it. And that’s when the heart
pounds. There is nothing like it. You’ve
uncovered new knowledge.
Can you say something about your love
for teaching?
It’s the thing I love most. These kids
come in, and I’m able to make a difference, to develop their foundation, to
see the light go on in their eyes as they
understand difficult concepts. In my
classroom, I don’t cover the material.
That’s their job. My job is to explain the
tough things they can’t get by themselves. I entertain questions. Even in
my freshman class with 400 students, I
get questions all the time. Some people
say lectures are bad. Bad lectures are
bad. My lectures are interactive—I’m
explaining the tough nuts, and the students ask questions. And they learn.
I know you have a particular dislike for
lip service instead of being real.
Being real is very important. The
kids can tell whether you’re spouting
politically correct garbage or whether
you’re speaking from the depths of your
soul. If you’re real with them, they will
cut you enormous slack so you can be
politically incorrect and it doesn’t matter to them because they know you’re
not mean spirited. They know you’re
real. And that’s what’s important.
What do you think about Texas’ seven-percent law that forces the universities to admit the student if he’s
in the top seven percent of the high
school graduating class, since many
of them are really not ready for the
freshman courses?
It is important to provide equal opportunity. In fact, my classroom is all
about equal opportunity. I don’t care
what race, I don’t care what religion,
I don’t care what gender. I welcome
all students into my classroom and I
try to teach them. The seven-percent
law admits students who come from
neighborhoods where they didn’t get
a proper high school preparation. And
this isn’t just the black or Hispanic
ghettos of Houston. It’s also rural
Texas where white kids don’t get the
proper preparation. It’s for anyone
who is at the top of the class, but has
not been prepared properly. The fact
they’re in the top of the class means
they’re probably bright. So we should
give them a chance. That’s what equal
opportunity is all about—providing
the chance. The problem is that when
we welcome them to the freshman
class, we then tell them we want them
to graduate in four years. And that’s a
serious mistake because many aren’t
yet ready for our freshman courses.
They shouldn’t be put in our freshman courses.
If we’re serious about providing
equal opportunity for these students,
then we should provide the courses to
make up for their lack of preparation,
and get them ready to take our freshman courses. And if that means it takes
a student more than four years to graduate, then it takes more than four years
to graduate. I don’t care what they
know coming in. What I care about is
what they know when they graduate. At
that point I want them to be every bit
as good as the kids who came from the
best highly prepared K–12 schools. We
can do that if we’re willing to offer the
courses to get them ready for our freshman courses.
Can you say something about your Ten
Commandments for good teaching?
On my website I have my Ten Commandments. For example, memorization is bad. The students in my
freshman course have been rewarded
all through school for their ability to
memorize, whether or not they understood anything. And now they are
freshman engineering students expecting to succeed by memorizing.
But engineering is about thinking
and problem solving, not memorizing. So I have to break them of that
habit of memorizing.
There are other commandments.
You should want to be in the classroom. You should know the material.
You should not be afraid of interruptions. If I explain them all, this interview will go on for another two or three
hours, so I should probably stop. If you
want to see my Ten Commandments,
they’re on my website (http://users.ece.utexas.edu/~patt/Ten.commandments).
There was an incident regarding your
younger sister in a plane geometry
course. What was that about?
That was a perfect example of
memorization. I was visiting my parents, and my sister, who was studying
plane geometry at the time, asked me
to look at a proof that had been marked
wrong on her exam paper. Her proof
was completely correct. All of a sudden it hit me! I had gone to the same
high school and in fact had the same
math teacher. He absolutely did not
understand geometry. But he was assigned to teach it. So what did he do?
This was before PowerPoint. The night
before, he would copy the proof from
the textbook onto a sheet of paper. In
class he would copy the proof onto the
blackboard. The students would copy
the proof into their notes. The night
before the exam, they’d memorize the
proof. On the exam he’d ask them to
prove what he had put on the board.
They had no idea what they were doing,
but they’d memorized the proof. The
result: 100% on the exam.
My sister didn’t memorize proofs.
She understood plane geometry. She
read the theorem, and came up with
a proof. It’s not the proof that was in
the book. But as you well know, there
are many ways to prove a theorem. The
teacher did not understand enough geometry to be able to recognize that even
though her proof was not the proof in
the book, her proof was correct. So she
got a zero! Memorization!
You once told me about a colleague at
Michigan who came into your office
one day after class complaining he had
given the worst lecture of his life.
Yes, a very senior professor. He
came into my office, slammed down
his sheaf of papers, “I’ve just given
the worst lecture of my life. I’m starting my lecture, and I’ve got 10 pages
of notes I need to get through. I get
about halfway through the first page,
a kid asks a question. And I think, this
kid hasn’t understood anything. So I
made the mistake of asking the class,
who else doesn’t understand this?
Eighty percent of their hands go up. I
figured there’s no point going through
the remaining 9½ pages if they don’t
understand this basic concept. I put
my notes aside, and spent the rest
of the hour teaching them what they
needed to understand in order for me
to give today’s lecture. At the end of
the lecture, I’ve covered nothing that
I had planned to cover because I spent
all the time getting the students ready
for today’s lecture. The worst day of
my life.”
I said, “Wrong! The best day of
your life. You probably gave them the
best lecture of the semester.” He said,
“But I didn’t cover the material.” I
said, “Your job is to explain the hard
things so they can cover the material
for themselves.” He adopted this approach, and from then on, he would
check regularly. And if they didn’t understand, he would explain. He never
got through all the material.
In fact, that’s another one of my Ten
Commandments. Don’t worry about
getting through all the material. Make
sure you get through the core material, but that’s usually easy to do. The
problem is that back in August when
you’re laying out the syllabus, you figure every lecture will be brilliant, every kid will come to class wide awake,
ready to learn, so everything will be
fine. Then the semester begins. Reality sets in. Not all of your lectures are
great. It’s a reality. Not all kids come
to class wide awake. It’s a reality. So
you can’t get through everything you
thought you would back in August. But
you can get through the core material.
So don’t worry about getting through
everything. And don’t be afraid to be
interrupted with questions. He adopted those commandments and ended
up with the best teaching evaluations
he had ever received.
You got your Ph.D. 50 years ago. Your
ideas have made major impact on
how we implement microprocessors.
Your students are endowed chairs at
top universities. Your students are
at the top of their fields in the companies where they work. You’ve won
just about every award there is. Isn’t it
time to retire?
Why would I want to retire? I love
what I’m doing. I love the interaction
with my graduate students in research.
I enjoy consulting for companies on
microarchitecture issues. Most of all,
I love teaching. I get to walk into a
classroom, and explain some difficult
concept, and the kids learn, the lights
go on in their eyes. It’s fantastic. Why
would I want to retire? I have been doing this now, for almost 50 years? I say
I am at my mid-career point. I hope to
be doing it for another 50 years. I probably won’t get to do it for another 50
years. But as long as my brain is working and as long as I’m excited about
walking into a classroom and teaching, I have no desire to retire.
Derek Chiou ([email protected]) is an associate
professor of Electrical and Computer Engineering at The
University of Texas at Austin and a partner hardware
architect at Microsoft Corporation.
Watch the authors discuss
their work in this exclusive
Communications video.
http://cacm.acm.org/videos/aninterview-with-yale-patt
For the full-length video, please
visit https://vimeo.com/aninterview-with-yale-patt
Copyright held by author.
viewpoints
DOI:10.1145/2832904
Boaz Barak
Viewpoint
Computer Science
Should Stay Young
Seeking to improve computer science publication culture while retaining
the best aspects of the conference and journal publication processes.
UNLIKE MOST OTHER academic
fields, refereed conferences
in computer science are generally the most prestigious
publication venues. Some
people have argued computer science
should “grow up” and adopt journals
as the main venue of publication, and
that chairs and deans should base hiring and promotion decisions on candidates' journal publication record as
opposed to conference publications.a,b
While I share a lot of the sentiments
and goals of the people critical of our
publication culture, I disagree with the
conclusion that we should transition to
a classical journal-based model similar
to that of other fields. I believe conferences offer a number of unique advantages that have helped make computer
science dynamic and successful, and
can continue to do so in the future.
First, let us acknowledge that no
peer-review publication system is perfect. Reviewers are inherently subjective and fallible, and the amount of
papers being written is too large to allow as careful and thorough review of
each submission as should ideally be
the case. Indeed, I agree with many of
the critiques leveled at computer science conferences, but also think these
critiques could apply equally well to
a Moshe Vardi, Editor’s letter, Communications
(May 2009); http://bit.ly/1UngC33
b Lance Fortnow, “Time for Computer Science
to Grow Up,” Communications (Aug. 2009);
http://bit.ly/1XQ6RrW
any other peer-reviewed publication
system. That said, there are several reasons I prefer conferences to journals:
˲˲ A talk is more informative than a
paper. At least in my area (theory), I
find I can get the main ideas of a piece
of work much better by hearing a talk
about it than by reading the paper.
The written form can be crucial when
you really need to know all the details,
but a talk is better at conveying the
high-order bits that most of us care
about. I think that our “conference
first” culture in computer science has
resulted in much better talks (on average) than those of many journal-focused disciplines.
˲˲ Deadlines make for more efficient
reviewing. As an editor for the Journal
of the ACM, I spend much time chasing
down potential reviewers for every submission. At this rate, it would have taken me decades to process the amount
of papers I handled in six months as
the program chair of the FOCS confer-
ence. In a conference you line up a set
of highly qualified reviewers (that is,
the program committee) ahead of the
deadline, which greatly reduces the administrative overhead per submission.
˲˲ People often lament the quality of
reviews done under time pressure, but
no matter how we organize our refereeing process, if X papers are being
written each year, and the community
is willing to dedicate Y hours to review
them in total, on average a paper will
always get Y/X hours of reviewer attention. I have yet to hear a complaint from
a reviewer that they would have liked to
spend a larger fraction of their time refereeing papers, but have not been able
to do so due to the tight conference
schedule. Thus, I do not expect an increase in Y if journals were to suddenly
become our main avenue of publication. If this happened, then journals
would have the same total refereeing
resources to deal with the same mass
of submissions conferences currently
do and it is unrealistic to expect review
quality would be magically higher.
˲˲ Conferences have rotating gatekeepers. A conference program committee typically changes at every iteration, and often contains young people
such as junior faculty or postdocs that
have a unique perspective and are intimately familiar with cutting-edge
research. In contrast, editorial boards
of journals are much more stable and
senior. This can sometimes be a good
thing but also poses the danger of keeping out great works that are not appealing to the particular board members.
Of course, one could imagine a journal
with a rotating board, but I think there
is a reason this configuration works
better at a conference. It is much easier
for program committee members to
judge papers in batch, comparing them
with one another, than to judge each
paper in isolation as they would in a
journal. This holds doubly so for junior
members, who cannot rely on extensive
experience when looking at individual
papers, and who benefit greatly from
the highly interactive nature of the conference decision process.
Related to the last point, it is worthwhile to mention the NIPS 2014 experiment, where the program chairs,
Corinna Cortes and Neil Lawrence,
ran a duplicate refereeing process for
10% of the submissions, to measure
the agreement in the accept/reject decisions. The overall agreement was
roughly 74% (83% on rejected submissions and 50% on accepted ones, which
were approximately one-quarter of the
total submissions) and preliminary
analysis suggests standard deviations
of about 5% and 13% in the agreement
on rejection and acceptance decisions
respectively.c These results are not
earth-shattering—prior to the experiment Cortes and Lawrence predicted
an agreement of 75% and 80% (respectively)—and so one interpretation is
they simply confirm what many of us
believe—that there is a significant subjective element to the peer review process. I see this as yet another reason to
favor venues with rotating gatekeepers.
Are conferences perfect? Not by a
long shot—for example, I have been
involved in discussionsd on how to improve the experience for participants in
one of the top theory conferences and
I will be the first to admit that some of
these issues do stem from the publication-venue role of the conferences.
The reviewing process itself can be improved as well, and a lot of it depends
on the diligence of the particular program chair and committee members.
The boundaries between conferences and journals are not that cut and
dry. A number of communities have
c See the March 2015 blog post by Neil Lawrence: http://bit.ly/1pK4Anr
d See the author’s May 2015 blog post: http://bit.
ly/1pK4LiF
been exploring journal-conference
“hybrid” models that can be of great
interest. My sense is that conferences
are better at highlighting the works that
are of broad interest to the community
(a.k.a. “reviewing” the paper), while
journals do a better job at verifying the
correctness and completeness of the
paper (a.k.a. “refereeing”), and iterating with the author to develop more
polished final results.
These are two different goals and are
best achieved by different processes. For
selecting particular works to highlight,
comparing a batch of submissions by a
panel of experts relying on many short
reviews (as is the typical case in a conference) seems to work quite well. But fewer, deeper reviews, involving a back-and-forth between author and reviewer (as is
ideally the case in a journal) are better at
producing a more polished work, and
one in which we have more confidence
in its correctness. We can try to find
ways to achieve the best of both worlds,
and make the most efficient use of the
community’s attention span and resources for refereeing. I personally like
the “integrated journal/conference”
model where a journal automatically
accepts papers that appeared in certain
conferences, jumping straight into the
revision stage, which can involve significant interaction with the author.
The advantage is that by outsourcing
the judgment of impact and interest
to the conference, the journal review
process avoids redundant work and
can be focused on the roles of verifying
correctness and improving presentation. Moreover, the latter properties are
more objective, and hence the process
can be somewhat less “adversarial” and
involve more junior referees such as students. In fact, in many cases these referees could dispense with anonymity and
get some credit in print for their work.
Perhaps the biggest drawback of
conferences is the cost in time and resources to attend them. This is even an
issue for “top tier” conferences, where
this effort at least pays off for attendees
who get to hear talks on exciting new
works as well as connect with many others in their community. But it is a greater problem for some lower-ranked conferences where many participants only
come when they present a paper, and
in such a case those papers may indeed have been better off appearing in a
journal. In fact, I wish it were acceptable for researchers’ work to “count”
even if it appeared in neither a conference nor a journal. Some papers can
be extremely useful to experts working in a specific field, but have not yet
advanced to a state where they are of
interest to the broader community.
We should think of ways to encourage
people to post such works online without spending resources on refereeing
or travel. While people often lament
the rise of the “least publishable unit,”
there is no inherent harm (and there is
some benefit) in researchers posting
the results of their work, no matter how
minor they are. The only problem is the
drain on resources when these incremental works go through the peer review process. Finally, open access is of
course a crucial issue and I do believee
both conferences and journals should
make all papers, most of which represent work supported by government
grants or non-profit institutions, freely
available to the public.
To sum up, I completely agree with
many critics of our publication culture
that we can and should be thinking of
ways to improve it. However, while doing so we should also acknowledge and
preserve the many positive aspects of
our culture, and take care to use the
finite resource of quality refereeing in
the most efficient manner.
e See the author’s December 2012 blog post:
http://bit.ly/1UcYdFF
Boaz Barak ([email protected]) is the Gordon McKay
Professor of Computer Science at the Harvard John A.
Paulson School of Engineering and Applied Sciences,
Harvard University, Cambridge, MA.
Copyright held by author.
viewpoints
DOI:10.1145/2834114
Jean-Pierre Hubaux and Ari Juels
Viewpoint
Privacy Is Dead,
Long Live Privacy
Protecting social norms as confidentiality wanes.
THE PAST FEW years have been
especially turbulent for privacy advocates. On the one
hand, the global dragnet
of surveillance agencies
has demonstrated the sweeping surveillance achievable by massively resourced government organizations.
On the other, the European Union has
issued a mandate that Google definitively "forget" information in order to
protect users.
Privacy has deep historical roots,
as illustrated by the pledge in the
Hippocratic oath (5th century b.c.),
“Whatever I see or hear in the lives of
my patients ... which ought not to be
spoken of outside, I will keep secret,
as considering all such things to be
private.”11 Privacy also has a number
of definitions. A now common one
among scholars views it as the flow
of information in accordance with
social norms, as governed by context.10 An intricate set of such norms
is enshrined in laws, policies, and
ordinary conduct in almost every
culture and social setting. Privacy in
this sense includes two key notions:
confidentiality and fair use. We argue that confidentiality, in the sense
of individuals’ ability to preserve
secrets from governments, corporations, and one another, could well
continue to erode. We call instead for
more attention and research devoted
to fair use.
To preserve existing forms of privacy against an onslaught of online
threats, the technical community is
working hard to develop privacy-enhancing technologies (PETs). PETs enable users to encrypt email, conceal
their IP addresses, avoid tracking by
Web servers, hide their geographic
location when using mobile devices,
use anonymous credentials, make untraceable database queries, and publish documents anonymously. Nearly
all major PETs aim at protecting confidentiality; we call these confidentiality-oriented PETs (C-PETs). C-PETs
can be good and helpful. But there
is a significant chance that in many
or most places, C-PETs will not save
privacy. It is time to consider adding
a new research objective to the community’s portfolio: preparedness for
a post-confidentiality world in which
many of today’s social norms regarding the flow of information are regularly and systematically violated.
Global warming offers a useful
analogy, as another slow and seemingly unstoppable human-induced
disaster and a worldwide tragedy of
commons. Scientists and technologists are developing a portfolio of
mitigating innovations in renewable
energy, energy efficiency, and carbon
sequestration. But they are also studying ways of coping with likely effects,
including rising sea levels and displacement of populations. There is a
scientific consensus that the threat
justifies not just mitigation, but preparation (for example, elevating Holland’s dikes).
The same, we believe, could be
true of privacy. Confidentiality may
be melting away, perhaps inexorably: soon, a few companies and surveillance agencies could have access
to most of the personal data of the
world’s population. Data provides information, and information is power.
An information asymmetry of this degree and global scale is an absolute
historical novelty.
There is no reason, therefore, to
think of privacy as we conceive of it today as an enduring feature of life.
Example: RFID
Radio-Frequency IDentification (RFID)
location privacy concretely illustrates
how technological evolution can undermine C-PETs. RFID tags are wireless microchips that often emit static
identifiers to nearby readers. Numbering in the billions, they in principle
permit secret local tracking of ordinary
people. Hundreds of papers proposed
C-PETs that rotate identifiers to prevent RFID-based tracking.6
Today, this threat seems quaint.
Mobile phones with multiple RF interfaces (including Bluetooth, Wi-Fi,
NFC), improvements in face recognition, and a raft of new wireless devices (fitness trackers, smartwatches,
and other devices), offer far more
effective ways to track people than
RFID ever did. They render RFID C-PETs obsolete.
This story of multiplying threat vectors undermining C-PETs’ power—and
privacy more generally—is becoming
common.
The Assault on Privacy
We posit four major trends providing the means, motive, and opportunity for the assault on privacy in its broadest sense. The adversaries include surveillance agencies and companies in
markets such as targeted advertising,
as well as smaller, nefarious players.
Pervasive data collection. As the
number of online services and always-on devices grows, potential adversaries can access a universe of personal data quickly expanding beyond
browsing history to location, financial
transactions, video and audio feeds,
genetic data4, real-time physiological
data—and perhaps eventually even
brainwaves.8 These adversaries are
developing better and better ways to
correlate and extract new value from
these data sources, especially as advances in applied machine learning
make it possible to fill in gaps in users’ data via inference. Sensitive data
might be collected by a benevolent
party for a purpose that is acceptable
to a user, but later fall into dangerous hands, due to political pressure,
a breach, or other reasons. "Secondhand" data leakage is also growing in prevalence, meaning that one
person’s action impacts another’s
private data (for example, if a friend
declares a co-location with us, or if a
blood relative unveils her genome).
The emerging Internet of Things will
make things even trickier, soon surrounding us with objects that can report on what we touch, eat, and do.16
Monetization (greed). Political philosophers are observing a drift from
what they term having a market economy to being a market society13 in which
market values eclipse non-market social norms. On the Internet, the ability
to monetize nearly every piece of information is clearly fueling this process,
which is itself facilitated by the existence of quasi-monopolies. A market-
place could someday arise that would
seem both impossible and abhorrent
today. (For example, for $10: “I know
that Alice and Bob met several times.
Give me the locations and transcripts
of their conversations.”) Paradoxically,
tools such as anonymous routing and
anonymous cash could facilitate such
a service by allowing operation from
loosely regulated territories or from no
fixed jurisdiction at all.
Adaptation and apathy. Users’
data curation habits are a complex
research topic, but there is a clear
generational shift toward more information sharing, particularly on
social networks. (Facebook has more
than one billion users regularly sharing information in ways that would
have been infeasible or unthinkable
a generation ago.) Rather than fighting information sharing, users and
norms have rapidly changed, and
convenience has trumped privacy to
create large pockets of data-sharing
apathy. Foursquare and various other
microblogging services that encourage disclosure of physical location,
for example, have led many users to
cooperate in their own physical tracking. Information overload has in any
event degraded the abilities of users
to curate their data, due to the complex and growing challenges of “secondhand” data-protection weakening
and inference, as noted previously.
Secret judgment. Traceability and
accountability are essential to protecting privacy. Facebook privacy settings
are a good example of visible privacy
practice: stark deviation from expected
norms often prompts consumer and/
or regulatory pushback.
Increasingly often, though, sensitive-data exploitation can happen
away from vigilant eyes, as the recent
surveillance scandals have revealed.
(National security legitimately demands surveillance, but its scope and
oversight are critical issues.) Decisions made by corporations—hiring,
setting insurance premiums, computing credit ratings, and so forth—
are becoming increasingly algorithmic, as we discuss later. Predictive
consumer scores are one example;
privacy scholars have argued they
constitute a regime of secret, arbitrary, and potentially discriminatory
and abusive judgment of consumers.2
A Post-Confidentiality
Research Agenda
We should prepare for the possibility of a post-confidentiality world, one
in which confidentiality has greatly
eroded and in which data flows in such
complicated ways that social norms
are jeopardized. The main research
challenge in such a world is to preserve
social norms, as we now explain.
Privacy is important for many reasons. A key reason, however, often cited in discussions of medical privacy, is
concern about abuse of leaked personal information. It is the potentially resulting unfairness of decision making,
for example, hiring decisions made on
the basis of medical history, that is particularly worrisome. A critical, defensible bastion of privacy we see in a post-confidentiality world therefore is in the
fair use of disclosed information.
Fair use is increasingly important
as algorithms dictate the fates of
workers and consumers. For example,
for several years, some Silicon Valley
companies have required job candidates to fill out questionnaires (“Have
you ever set a regional-, state-, country-, or world-record?”). These companies apply classification algorithms
to the answers to filter applications.5
This trend will surely continue, given
the many domains in which statistical predictions demonstrably outperform human experts.7 Algorithms,
though, enable deep, murky, and extensive use of information that can
exacerbate the unfairness resulting
from disclosure of private data.
On the other hand, there is hope
that algorithmic decision making can
lend itself nicely to protocols for enforcing accountability and fair use. If
decision-making is algorithmic, it is
possible to require decision-makers
to prove that they are not making use
of information in contravention of
social norms expressed as laws, policies, or regulations. For example, an
insurance company might prove it
has set a premium without taking
genetic data into account—even if
this data is published online or otherwise widely available. If input data
carries authenticated labels, then
cryptographic techniques permit the
construction of such proofs without revealing underlying algorithms,
which may themselves be company
secrets (for example, see Ben-Sasson
et al.1). Use of information flow control12 preferably enforced by software
attested to by a hardware root of trust
(for example, see McKeen et al.9) can
accomplish much the same end. Statistical testing is an essential, complementary approach to verifying fair
use, one that can help identify cases
in which data labeling is inadequate,
rendered ineffective by correlations
among data, or disregarded in a system. (A variety of frameworks exist,
for example, see Dwork et al.3)
A complementary research goal is
related to privacy quantification. To
substantiate claims about the decline
of confidentiality, we must measure
it. Direct, global measurements are
difficult, but research might look to
indirect monetary ones: The profits
of the online advertising industry per
pair of eyeballs and the “precision” of
advertising, perhaps as measured by
click-through rates. At the local scale,
research is already quantifying privacy (loss) in such settings as location-based services.14
Conclusion
There remains a vital and enduring
place for confidentiality. Particularly
in certain niches—protecting political dissent, anti-censorship in repressive regimes—it can play a societally
transformative role. It is the responsibility of policymakers and society
as a whole to recognize and meet the
threat of confidentiality’s loss, even
as market forces propel it and political leaders give it little attention. But
it is also incumbent upon the research
community to contemplate alternatives to C-PETs, as confidentiality is
broadly menaced by technology and
social evolution. If we cannot win the
privacy game definitively, we need to
defend paths to an equitable society.
We believe the protection of social
norms, especially through fair use
of data, is the place to start. While C-PETs will keep being developed and
will partially mitigate the erosion of
confidentiality, we hope to see many
“fair-use PETs” (F-PETs) proposed
and deployed in the near future.15
References
1. Ben-Sasson, E. et al. SNARKs for C: Verifying program
executions succinctly and in zero knowledge. In
Advances in Cryptology–CRYPTO, (Springer, 2013),
90–108.
2. Dixon, P. and Gellman, R. The scoring of America: How
secret consumer scores threaten your privacy and
your future. Technical report, World Privacy Forum
(Apr. 2, 2014).
3. Dwork, C. et al. Fairness through awareness. In
Proceedings of the 3rd Innovations in Theoretical
Computer Science Conference. (ACM, 2012), 214–226.
4. Erlich, Y. and Narayanan, A. Routes for breaching and
protecting genetic privacy. Nature Reviews Genetics
15, 6 (2014), 409–421.
5. Hansell, S. Google answer to filling jobs is an algorithm.
New York Times (Jan. 3, 2007).
6. Juels, A. RFID security and privacy: A research
survey. IEEE Journal on Selected Areas in
Communication 24, 2 (Feb. 2006).
7. Kahneman, D. Thinking, Fast and Slow. Farrar, Straus,
and Giroux, 2012, 223–224.
8. Martinovic, I. et al. On the feasibility of side
channel attacks with brain-computer interfaces. In
Proceedings of the USENIX Security Symposium,
(2012), 143–158.
9. McKeen, F. et al. Innovative instructions and software
model for isolated execution. In Proceedings of
the 2nd International Workshop on Hardware and
Architectural Support for Security and Privacy, Article
no. 10 (2013).
10. Nissenbaum, H. Privacy in Context: Technology, Policy,
and the Integrity of Social Life. Stanford University
Press, 2009.
11. North, M.J. Hippocratic oath translation. U.S. National
Library of Medicine, 2002.
12. Sabelfeld, A. and Myers, C. Language-based
information-flow security. IEEE Journal on Selected
Areas in Communications 21, 1 (2003), 5–19.
13. Sandel, M.J. What Money Can’t Buy: The Moral Limits
of Markets. Macmillan, 2012.
14. Shokri, R. et al. Quantifying location privacy. In
Proceedings of the IEEE Symposium on Security and
Privacy (2011), 247–262.
15. Tramèr, F. et al. Discovering Unwarranted Associations
in Data-Driven Applications with the FairTest Testing
Toolkit, 2016; arXiv:1510.02377.
16. Weber, R.H. Internet of things—New security and
privacy challenges. Computer Law and Security
Review 26, 1 (2010), 23–30.
Jean-Pierre Hubaux ([email protected])
is a professor in the Computer Communications and
Applications Laboratory at the Ecole Polytechnique
Fédérale de Lausanne in Switzerland.
Ari Juels ([email protected]) is a professor at Cornell
Tech (Jacobs Institute) in New York.
We would like to thank George Danezis, Virgil Gligor, Kévin
Huguenin, Markus Jakobsson, Huang Lin, Tom Ristenpart,
Paul Syverson, Gene Tsudik and the reviewers of this
Viewpoint for their many generously provided, helpful
comments, as well the many colleagues with whom we
have shared discussions on the topic of privacy. The views
presented in this Viewpoint remain solely our own.
Copyright held by authors.
viewpoints
DOI:10.1145/2909887
Ankita Mitra
Viewpoint
A Byte Is All We Need
A teenager explores ways to attract girls
into the magical world of computer science.
IT WAS TIME to begin teaching
my class. The children were in
their seats, laptops turned on,
ready to begin. I scanned the
doorway, hoping for one more
girl to arrive: there were nine boys in
my class and just two girls. I was conducting free coding classes, but young
girls were still reluctant to attend. As
a 15-year-old computer enthusiast,
I was baffled by this lack of interest.
A young boy arrived with his mother.
As the mother was preparing to leave,
I asked her, “If you have the time,
why don’t you stay? Maybe you could
help your son.” She agreed. I started
my class without further delay. In the
next class, the boy’s mother brought
along a friend and her daughter. Subsequent classes saw the registration
of a few more girls, friends of friends.
My message was getting across: computer science (CS) is not as difficult as
presumed—it is fun, and more importantly, it is certainly not an exclusively
male-oriented domain.
Gender Difference in Perspectives
Being enamored by CS myself, I was
disappointed to find girls shunned
this super-exciting, super-useful, and
super-pervasive discipline. I was determined to find out why, and as I started
teaching Java to middle school children I kept a close watch on how the
questions, understanding, reactions,
and study methods of girls differed
from the boys in class. The difference
I noticed immediately was the boys
were more advanced in their knowledge.
[Photo: Exposure and encouragement are key to attracting girls to CS: the author doing her part.]
It was a challenge for me to balance the boys and the girls not only
in teaching but also in their learning
perspectives. I noted that while the
boys accepted concepts unquestioningly and focused on application—
the ‘How’ of things—the girls always
wanted to know ‘Why?’ So I asked the
boys to explain the ‘why’ of things to
the girls. The boys soon learned they
did not know it all, so they attempted a
deeper understanding and in the process the girls got their answers. By the
time the session was over, both boys
and girls were equivalent in knowledge and confidence, and were keen
to collaborate in writing apps.
Dive In Early
But why was there so much disparity at the start? After a brief round of
questioning, I realized the boys had
a head start because they had started
young—just like I had. Young boys are
more attracted to computer games and
gadgets than young girls. As I have an
older brother, I had been exposed to
computer games and programming as
a small child. But what about girls with
no brothers? Girls are not aware of the
fun element in controlling computers
most often because they have not had
the opportunity to try it.
The essential difference between
the genders in the interest and knowledge in computer science stems from
exposure (or the lack thereof) at a
young age. If one goes to a store to buy
PlayStation, Nintendo, or Xbox games,
the gender imbalance is apparent. Except for a few Barbie games, there are
practically no games with young girls
as protagonists. There have been a few
attempts to create stimulating games
geared only for girls. In 1995, Brenda
Laurel started her company Purple
Moon to make video games that focused particularly on girls’ areas of
interest while retaining the action and
challenge mode. Despite extensive
research on the interests and inclinations of girls, Purple Moon failed.4 Today, the Internet has games for young
girls but most are based on cultural
biases like dressing up, cooking, nail
art, fashion designing, and shopping.
Exciting and challenging video games
continue to be male oriented, which
makes the initiation into computer
science easier and earlier for boys.
Once the boys are hooked on these games, curiosity and the wish to engineer desired results take them into the world of
programming. And that is the bit that
starts the coder’s journey.
It is a journey whose momentum
can be picked up by girls, too. Facebook COO Sheryl Sandberg says, “Encourage your daughters to play video
games.” She claims, “A lot of kids
code because they play games. Give
your daughters computer games.” In
a gaming world thirsting for young
girls’ games, there are some invigorating splashes, like the wonderfully
created game Child of Light.5 This
role-playing game not only has a little
girl as the central character but most
of its other characters (both good and
bad) are women, too. It is this kind of
game the world needs to entice girls
into the world of computer science—
a world where women can have fun,
create, and lead. The lead programmer of Child of Light, Brie Code, says,
“It can be lonely to be the only woman
or one of very few women on the team
… it is worth pushing for more diversity within the industry.”
Play to Learn—
Replace Fear with Fun
Computer games for the very young
have a vital role to play in ushering in
diversity within the industry. Fred Rogers, the American icon for children’s
entertainment and education, rightly
said, “For children, play is serious
learning.” I learned the program LOGO
when I was just four years old because
it was only a game to me—I was not
programming, I was having fun. To
be able to control the movements of
the LOGO turtle was thrilling. Today,
when I code in Java to make complex
apps I gratefully acknowledge the little
LOGO turtle that started it all. That
was 10 years ago. Today, more exciting programming languages like KIBO,
Tynker, and ScratchJr, software like The
Foos and apps like Kodable aim to make
computer science fun for small children who have not yet learned to read!
[Photo: Early interest in computer science gives boys a head start.]
This is the level at which girls have to
enter the field of CS, not hover around
the boundaries at high school. In the
21st century, computer science is as
much a part of fundamental literacy as
reading, writing, and math.
Not an Option,
Mandatory K–12 Learning
Hence, I believe CS should be made
mandatory in kindergarten and elementary school. President Obama has
stressed the need for K–12 computer
science education to "make them job-ready on day one." Learning the basics
determines choices for higher studies. Girls need a bite and taste in early
childhood in order to make informed
decisions about computer science
when they are teenagers. I was not a
teenager yet when I migrated from
India to the U.S. My private school in
India had CS as a compulsory subject
from Grade 1 onward. So when I began
attending school in the U.S., I knew I
loved CS and it definitely had to be one
of my electives. But the boys looked at
us few girls in class as if we were aliens!
Even today, in my AP computer science
class, few boys ask me for a solution if
they have a problem (I kid myself that it
is because of my age and not gender).
In India, it is cool for a girl to study
computer science in school. I basked
in virtual glory there, while in the U.S. I
found most of my female friends raised
their eyebrows, their eyes asking me,
“Why would you want to study CS?”
So I decided to flip the question
and conduct a survey to find out why
they did not want to study computer
science. I interviewed 107 girls from
the ages of 5 to 17 in the U.S., U.K., and
India. My question was: “Would you
study computer science in college?” A
whopping 82.4% of the girls said ‘No’
or ‘Maybe’. When asked why not, 78%
of them answered ‘I am afraid that I
am not smart enough to do CS.’ Other
answers included ‘I am not a big fan
of programming’/‘I am not inclined
toward the sciences, I am more creatively oriented’/‘I prefer the literary
field, writing, editing, publishing’/‘I
am too cool to be a geek’! When I asked
whether they knew any programming
language, only 14 girls out of the 107
said ‘Yes.’ Dismayed by the results, I
posed the same question to 50 boys, in
the same age group: 74.8% of them said
‘Yes’ to studying CS; 82% of all the boys
I interviewed knew more than one programming language and many of them
were less than 10 years old.
[Photo: Starting young removes fear and makes coding fun.]
My resolve was strengthened: the only way to remove the fear of CS from the minds of
girls is to catch them young and encourage curiosity before negative attitudes
are developed.
Thorns of Preconceived Notions
Once the worth of the field is realized,
one notices the crop of roses and the
thorns pale in comparison. One such
thorn is the idea of geekiness that mars
the face of computer science. Girls are
unwilling to be nerds. The misconception that a computer nerd is a socially
awkward eccentric has been obliterated by the stars in technology like Sheryl
Sandberg, Sophie Wilson, Marissa
Mayer, or the cool young female employees of Facebook, Google, and other
companies and organizations.
Another (and a more piercing) thorn
that keeps girls away from computer science is the lack of confidence in math
and science. Intensive studies indicate
spatial reasoning skills are deeply associated with mathematical and technical skills. Research also shows action
and building games vastly improve
spatial skills, math skills, divergent
problem-solving skills, and even creativity. So why should these games be
reserved only for boys? To develop spatial skills and attract girls into the tech
field, Stanford engineers Bettina Chen
and Alice Brooks created Roominate,
a wired DIY dollhouse kit, where girls
can build a house with building blocks
and circuits, design and create rooms
with walls, furniture, working fans, and
lights.6 These kinds of toys can develop
spatial perception and engender confidence in STEM fields in girls, too. “Playing an action video game can virtually
eliminate gender difference in spatial
attention and simultaneously decrease
the gender disparity in mental rotation
ability, a higher-level process in spatial
cognition.”2 The same study concludes,
“After only 10 hours of training with an
action video game, subjects realized
substantial gains in both spatial attention and mental rotation, with women
benefiting more than men. Control
subjects who played a non-action game
showed no improvement. Given that
superior spatial skills are important
in the mathematical and engineering
sciences, these findings have practical
implications for attracting men and
women to these fields.”
Programming before Reading
Inspired by studies like these, I started a
project to make CS appealing to the feminine mind. Based on my ‘Catch Them
Young’ philosophy, I am using my programming knowledge to create action
and building games for tiny tots where
the action is determined by the player.
The player decides whether the little girl
protagonist wants to build a castle or rescue a pup from evil wolves. There is not
a single letter of the alphabet used in the games, so
that even two-year-old children can play
with ease and develop their spatial reasoning even before they learn the alphabet. Using LOGO, ScratchJr, and Alice, I
have created a syllabus to enable an early
understanding of logic and creation of a
sequence of instructions, which is the
basis of all programming. I am currently
promoting this course in private elementary schools in the U.S., India, and
other countries to expose the minds of
young girls to computer science.
Computational Thinking
In a similar endeavor, schools in South
Fayette Township (near Pittsburgh, PA)
have introduced coding, robotics, computer-aided design, 3D printing, and
more, as part of the regular curriculum
from kindergarten to 12th grade to make
computational reasoning an integral
part of thinking.1 The problem-solving
approach learned in the collaborative
projects helps children apply computational thinking to the arts, humanities,
and even English writing. Girls in these
South Fayette schools are now computer
whiz kids fixing computer bugs as well
as malfunctioning hardware! They are
clear proof that computational thinking is a strength not restricted to males
alone—girls often combine it with creativity, designing, and literary skills for
even more powerful effects. Recent studies show girls often go the extra length
in creating more complex programs
than boys. As Judith Good of the school
of engineering and informatics at the
University of Sussex commented after
a workshop for boys and girls for creating computer games: “In our study, we
found more girls created more scripts
that were both more varied in terms
of the range of actions they used, and
more complex in terms of the computational constructs they contained.”3
Conclusion
The change has begun and it is just a
matter of time before girls realize
studying and working in CS is neither
fearsome, nor boring. The time to realize this truth can be further shortened if
parents and teachers can also be inducted into the realm of CS. Adult computer
education is essential for closing the
gender gap in computer science. The
role of parents and teachers is paramount in introducing curiosity and interest in young children, irrespective
of gender, and in steering them toward the
magic of CS. The huge participation
of Indian mothers in the CS stream in
Bangalore is a powerful stimulus and
one of the primary reasons behind
young girls embracing this field so
enthusiastically and successfully in India. When role models open up new vistas of the computer science world to all
children—including the very young—
only then can the unfounded fear of
girls regarding this relatively new domain be replaced by curiosity, excitement, and a desire to participate. For
computational thinking has no gender—it just has magical power.
References
1. Berdik, C. Can coding make the classroom better?
Slate 23 (Nov. 23, 2015); http://slate.me/1Sfwlwc.
2. Feng, J., Spence, I., and Pratt, J. Playing an action
video game reduces gender differences in spatial
cognition. Psychological Science 18, 10 (Oct. 2007),
850–855; http://bit.ly/1pmG8Am.
3. Gray, R. Move over boys: Girls are better at
creating computer games than their male friends.
DailyMail.com (Dec. 2, 2014); http://dailym.ai/1FMtEN1.
4. Harmon, A. With the best research and intentions,
a game maker fails. The New York Times (Mar. 22,
1999); http://nyti.ms/1V1hxEL.
5. Kaszor, D. PS4 preview: Child of Light a personal
project born within a giant game developer. Financial
Post (Nov. 12, 2013).
6. Sugar, R. How failing a freshman year physics quiz
helped 2 friends start a “Shark Tank” funded company.
Business Insider 21 (Jul. 21, 2015); http://read.
bi/1SknHep.
Ankita Mitra ([email protected]) is a 15-year-old
student at Monta Vista High School in Cupertino, CA.
Copyright held by author.
practice
DOI:10.1145/2909470
Article development led by
queue.acm.org
Many of the skills aren’t technical at all.
BY KATE MATSUDAIRA
Nine Things
I Didn’t Know
I Would
Learn Being
an Engineer
Manager
WHEN I MOVED from being an engineer to being a dev
lead, I knew I had a lot to learn. My initial thinking was
I had to be able to do thorough code reviews, design,
and architect websites, see problems before they
happened, and ask insightful technical questions.
To me that meant learning the technology and
becoming a better engineer. When
I actually got into the role (and after
doing it almost 15 years), the things
I have learned—and that have mattered the most—were not those technical details. In fact, many of the
skills I have built that made me a good
engineer manager were not technical
at all and, while unexpected lessons,
have helped me in many other areas
of my life.
What follows are some of these les-
sons, along with ideas for applying them
in your life—whether you are a manager, want to be a manager, or just want to
be a better person and employee.
1. Driving consensus. Technical
people love to disagree. I’ve found
there usually are no definitive answers
to a problem. Instead, there are different paths with different risks, and each
solution has its own pros and cons. Being able to get people to agree (without
being the dictator telling people what
to do) means learning how to get everyone on the same page.
Since meetings with lots of people
can be very contentious, one technique
that has helped me wrangle those ideas
is called multivoting. Multivoting is
helpful for narrowing a wide range of
ideas down to a few of the most important or appropriate, and it allows every
idea to receive consideration.
You can do this by first brainstorming as a team while putting all the ideas
on a whiteboard, along with the pros
and cons of each. From there you go
through a voting process until the group
arrives at what it considers to be an appropriate number of ideas for further
analysis. Organizational development
consultant Ava S. Butler explains the
multivoting process in wonderful detail
if you would like more information.1
2. Bringing out ideas (even from
quiet people). One of the challenges of
working with introverts and shy people without strong communication
skills is it can be difficult to surface
their ideas. They tend to be quiet in
meetings and keep their ideas (which
can be very good!) to themselves. Here
are a few techniques I have learned
that help me bring these people out of
their shells:
˲˲ In meetings I call on people or
do a round robin so everyone gets a
chance to talk. This way, the shy team
members are given the floor to speak
where they may have otherwise remained silent.
˲˲ In one-on-ones I have learned to
use the power of silence. I ask a question and then refrain from speaking
until the person answers—even if it is a
minute later. I had to learn to get comfortable with uncomfortable silence,
which has been a powerful technique in
uncovering what people are thinking.
˲˲ I often have everyone write their
ideas on a Post-it note and put it on
the whiteboard during team meetings.
This allows everyone’s ideas to receive
equal weight, and introverted people
are therefore encouraged to share their
thoughts.
3. Explaining tech to nontech. When
you want to rewrite code that already
works, you have to justify the change
to management. Much of the time
nontechnical people do not care about
the details. Their focus is on results.
Therefore, I have learned to look at all
my work, and the work my team does,
in a business context. For example,
does it save time, money, or generate
revenue—and then how do I best communicate that?
I frame my ideas in a context that
matters to the specific audience I am
addressing. Using analogy is one technique I have found to be quite powerful.2 Explaining an idea through
analogy allows you to consider your
audience’s perspective and talk at their
level, never above them.
4. Being a good listener. When you
manage people you really must learn to
listen. And, by the way, listening goes
way beyond paying attention to what is
said. You should also be paying attention to body language and behavior.
I like to use the example of an employee who always arrives early to work.
If that person suddenly makes a new
habit of showing up late, this could
be a cue that something is amiss. By
listening to that person’s actions, and
not just their words, you gain valuable
insight and can manage with greater
empathy and awareness.
5. Caring about appearance. When
you are in a leadership role you often
meet with people outside of your immediate co-workers who do not know
you as well. And they judge you. Plus,
studies have shown that your appearance strongly influences other people’s
perception of your intelligence, authority, trustworthiness, financial success, and whether you should be hired
or promoted.5
Growing up, I was taught by my
grandfather how to dress for the job I
wanted, not the job I currently had. As
a new manager, I put more of an effort
into my appearance, and it definitely
had a positive effect, especially when
interacting with customers and clients
outside of the organization.
I recommend emulating the people
in your organization whom you look up
to. Look at how they dress. Study how
they carry themselves. Watch how they
conduct themselves in meetings, parties, and other events. This is where you
can get your best ideas for how to dress
and communicate success. You want
your work and reputation to speak for
itself, but do not let your appearance
get in the way of that.
6. Caring about other disciplines.
The more you know about other facets
of the business, like sales and marketing, the more capable you are of making strategic decisions. The higher up
you go, the more important this is,
because you are not just running software—you are running a business.
It is also vital to understand the
needs of your customers. You could
build what you believe is an amazing
product, but it could end up being
useless to the customer if you never
took the time to fully understand their
needs. Even if you work in back-end development, caring about the end user
will make you create better solutions.
7. Being the best technologist does
not make you a good leader. If you are
managing enough people or products, you do not have time to dive into
the deep details of the technology.
Moreover, you need to learn to trust
the people on your team. It is better
to let them be the experts and shine
in meetings than to spend your time
looking over their shoulders to know
all the details.
The best skills you can have are
these:
˲˲ Ask great questions that get to the
root of the problem. This helps others
think through their challenges, uncovering issues before they arise.
˲˲ Delegate and defer so that you are
able to accomplish more while empowering those around you.
˲˲ Teach people to think for themselves. Instead of prescribing answers,
ask people what they think you would
say or tell them to do. I highly recommend David Marquet’s talk, “Greatness.”3 He reveals that while working
as a captain on a military submarine
he vowed never to give another order.
Instead, he allowed his reports to make
their own empowered decisions. This
small shift in thinking brought about
powerful change.
8. Being organized and having a
system. When you are responsible
for the work of others, you must have
checks and balances. Practicing strong
project-management skills is key. You
must have a way of keeping things organized and know what is going on,
and be able to communicate it when
things are not going as planned.
It is also important to be strategic
about your own time management. I
start each week with at least 30 minutes dedicated to looking at my top
priorities for the week, and then I
carve out the time to make progress
on these priorities. One time-management tool that has been successful for
me is time blocking, where I plan my
days in a way that optimizes my time
for my productivity (for example, I am
a much better writer in the mornings
so I make sure to do my writing then).4
This helps me optimize my time and
always know the best way to use a
spare 15 minutes.
Similarly, I have a system for keeping track of my great ideas. I keep an
Evernote where I save articles I love or
interesting ideas I come across. This
gives me a little vault of information I
can go to when I need to get inspired,
write a blog post, or come up with
something worthwhile to post on social media.
The point here is to have systems
in place. You need a way to do all the
things that are important and keep your
information and details organized.
9. Networking. If you think about
it, every job offer, promotion, and
raise was not given to you because
of the work you did. The quality of
your work may have been a factor, but
there was a person behind those decisions. It was someone who gave you
those opportunities.
If you do great work and no one
likes you, then you simply will not be
as successful. Be someone with whom
people want to work. For example,
helping others, listening intently, and
caring about the lives of the people
around you will help you profoundly. I
am always looking for ways to expand
my network, while also deepening the
relationships I have with my mentors
and friends.
I hope these ideas help you become
a better leader or employee. Pick one
or two to focus on each week, and see
where it takes you—progress is a process! I would love to hear from you, especially if you have any other ideas to add
to this list.
Related articles
on queue.acm.org
Mal Managerium: A Field Guide
Phillip Laplante
http://queue.acm.org/detail.cfm?id=1066076
Sink or Swim, Know When It’s Time to Bail
Gordon Bell
http://queue.acm.org/detail.cfm?id=966806
Adopting DevOps Practices in Quality
Assurance
James Roche
http://queue.acm.org/detail.cfm?id=2540984
References
1. Butler, A.S. Ten techniques to make decisions: #2
multivoting, 2014; http://www.avasbutler.com/
ten-techniques-to-make-decisions-2-multivoting/#.
Vtd1ZYwrIy4.
2. Gavetti, G., Rivkin, J.W. How strategists really think:
tapping the power of analogy. Harvard Business
Review (April 2005); https://hbr.org/2005/04/howstrategists-really-think-tapping-the-power-of-analogy.
3. Marquet, D. Inno-versity presents: greatness.
YouTube, 2013; https://www.youtube.com/
watch?v=OqmdLcyES_Q.
4. Matsudaira, K. Seven proven ways to get more done in
less time, 2015; http://katemats.com/7-proven-waysto-get-more-done-in-less-time/.
5. Smith, J. Here’s how clothing affects your success.
Business Insider (Aug. 19, 2014); http://www.
businessinsider.com/how-your-clothing-impacts-yoursuccess-2014-8.
Kate Matsudaira (katemats.com) is the founder of
her own company, Popforms. Previously she worked in
engineering leadership roles at companies like Decide
(acquired by eBay), Moz, Microsoft, and Amazon.
Copyright held by author.
Publication rights licensed to ACM. $15.00
practice
DOI:10.1145/2909476
Article development led by
queue.acm.org
This visualization of software execution
is a new necessity for performance
profiling and debugging.
BY BRENDAN GREGG
The
Flame
Graph
AN EVERYDAY PROBLEM in our industry is understanding
how software is consuming resources, particularly
CPUs. What exactly is consuming how much, and how
did this change since the last software version? These
questions can be answered using software profilers—
tools that help direct developers to optimize their code
and operators to tune their environment. The output of
profilers can be verbose, however, making it laborious
to study and comprehend. The flame graph provides
a new visualization for profiler output and can make
for much faster comprehension, reducing the time for
root cause analysis.
In environments where software
changes rapidly, such as the Netflix
cloud microservice architecture, it is especially important to understand profiles quickly. Faster comprehension can
also make the study of foreign software
more successful, where one’s skills, appetite, and time are strictly limited.
Flame graphs can be generated
from the output of many different software profilers, including profiles for
different resources and event types.
Starting with CPU profiling, this article
describes how flame graphs work, then
looks at the real-world problem that
led to their creation.
CPU Profiling
A common technique for CPU profiling
is the sampling of stack traces, which
can be performed using profilers such
as Linux perf_events and DTrace. The
stack trace is a list of function calls that
show the code-path ancestry. For example, the following stack trace shows
each function as a line, and the top-down ordering is child to parent:
SpinPause
StealTask::do_it
GCTaskThread::run
java_start
start_thread
Balancing considerations that include sampling overhead, profile size,
and application variation, a typical CPU
profile might be collected in the following way: stack traces are sampled at a
rate of 99 times per second (not 100, to
avoid lock-step sampling) for 30 seconds
across all CPUs. For a 16-CPU system, the
resulting profile would contain 47,520
stack-trace samples. As text, this would
be hundreds of thousands of lines.
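For readers who want to reproduce this kind of collection, the following is a minimal sketch using standard Linux perf_events options; the commands and the output file name are illustrative, not taken from the profile discussed above:

# Sample on-CPU stack traces (-g for call graphs) at 99 Hertz,
# across all CPUs (-a), for 30 seconds:
perf record -F 99 -a -g -- sleep 30
# Dump the recorded samples as text, one multiline stack per sample
# (the file name is arbitrary):
perf script > out.stacks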
Fortunately, profilers have ways to
condense their output. DTrace, for example, can measure and print unique
stack traces, along with their occurrence count. This approach is more
effective than it might sound: identical stack traces may be repeated during loops or when CPUs are in the idle
state. These are condensed into a single stack trace with a count.
Linux perf_events can condense profiler output even further: not only identical stack trace samples, but also subsets of stack traces can be coalesced.
This is presented as a tree view with
counts or percentages for each code-path branch, as shown in Figure 1.
In practice, the output summary
from either DTrace or perf_events is
sufficient to solve the problem in many
cases, but there are also cases where
the output produces a wall of text, making it difficult or impractical to comprehend much of the profile.
The Problem
The problem that led to the creation of
flame graphs was application performance on the Joyent public cloud.3 The
application was a MySQL database that
was consuming around 40% more CPU
resources than expected.
DTrace was used to sample user-
mode stack traces for the application
at 997Hz for 60 seconds. Even though
DTrace printed only unique stack traces, the output was 591,622 lines long,
including 27,053 unique stack traces.
Fortunately, the last screenful—which
included the most frequently sampled
stack traces—looked promising, as
shown in Figure 2.
The most frequent stack trace included a MySQL calc_sum_of_all_
status() function, indicating it was
Figure 1. Sample Linux perf_events tree view.

# perf report -n --stdio
[...]
# Overhead       Samples  Command      Shared Object                         Symbol
# ........  ............  .......  .................  .............................
#
    16.90%           490       dd  [kernel.kallsyms]  [k] xen_hypercall_xen_version
            |
            --- xen_hypercall_xen_version
                check_events
               |
               |--97.76%-- extract_buf
               |           extract_entropy_user
               |           urandom_read
               |           vfs_read
               |           sys_read
               |           system_call_fastpath
               |           __GI___libc_read
               |
               |--0.82%-- __GI___libc_write
               |
               |--0.82%-- __GI___libc_read
                --0.61%-- [...]

     5.83%           169       dd  [kernel.kallsyms]  [k] sha_transform
            |
            --- sha_transform
                extract_buf
                extract_entropy_user
                urandom_read
                vfs_read
                sys_read
                system_call_fastpath
                __GI___libc_read
[...]
Figure 2. MySQL DTrace profile subset.
# dtrace -x ustackframes=100 -n ‘profile-997 /execname == “mysqld”/ {
@[ustack()] = count(); } tick-60s { exit(0); }’
dtrace: description ‘profile-997 ‘ matched 2 probes
CPU
ID
FUNCTION:NAME
1 75195
:tick-60s
[...]
libc.so.1`__priocntlset+0xa
libc.so.1`getparam+0x83
libc.so.1`pthread_getschedparam+0x3c
libc.so.1`pthread_setschedprio+0x1f
mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0x9ab
mysqld`_Z10do_commandP3THD+0x198
mysqld`handle_one_connection+0x1a6
libc.so.1`_thrp_setup+0x8d
libc.so.1`_lwp_start
4884
mysqld`_Z13add_to_statusP17system_status_varS0_+0x47
mysqld`_Z22calc_sum_of_all_statusP17system_status_var+0x67
mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0x1222
mysqld`_Z10do_commandP3THD+0x198
mysqld`handle_one_connection+0x1a6
libc.so.1`_thrp_setup+0x8d
libc.so.1`_lwp_start
5530
50
COMMUNICATIO NS O F TH E AC M
| J U NE 201 6 | VO L . 5 9 | NO. 6
processing a “show status” command.
Perhaps the customer had enabled aggressive monitoring, explaining the
higher CPU usage?
To quantify this theory, the stack-trace count (5,530) was divided by the total samples in the captured profile (348,427), showing it was responsible for only 1.6% of the CPU time. This
alone could not explain the higher CPU
usage. It was necessary to understand
more of the profile.
Browsing more stack traces became
an exercise in diminishing returns, as
they progressed in order from most to
least frequent. The scale of the problem is evident in Figure 3, where the
entire DTrace output becomes a featureless gray square.
With so much output to study, solving this problem within a reasonable
time frame began to feel insurmountable. There had to be a better way.
I created a prototype of a visualization that leveraged the hierarchical nature of stack traces to combine
common paths. The result is shown
in Figure 4, which visualizes the
same output as in Figure 3. Since the
visualization explained why the CPUs
were “hot” (busy), I thought it appropriate to choose a warm palette.
With the warm colors and flame-like
shapes, these visualizations became
known as flame graphs. (An interactive version of Figure 4, in SVG
[scalable vector graphics] format is
available at http://queue.acm.org/
downloads/2016/Gregg4.svg.)
The flame graph allowed the bulk of
the profile to be understood very quickly. It showed the earlier lead, the MySQL
status command, was responsible for
only 3.28% of the profile when all stacks
were combined. The bulk of the CPU
time was consumed in MySQL join,
which provided a clue to the real problem. The problem was located and fixed,
and CPU usage was reduced by 40%.
Flame Graphs Explained
A flame graph visualizes a collection of
stack traces (aka call stacks), shown as
an adjacency diagram with an inverted
icicle layout.7 Flame graphs are commonly used to visualize CPU profiler
output, where stack traces are collected using sampling.
A flame graph has the following
characteristics:
˲˲ A stack trace is represented as a
column of boxes, where each box represents a function (a stack frame).
˲˲ The y-axis shows the stack depth,
ordered from root at the bottom to leaf
at the top. The top box shows the function that was on-CPU when the stack
trace was collected, and everything beneath that is its ancestry. The function
beneath a function is its parent.
˲˲ The x-axis spans the stack trace
collection. It does not show the passage of time, so the left-to-right ordering has no special meaning. The
left-to-right ordering of stack traces
is performed alphabetically on the
function names, from the root to the
leaf of each stack. This maximizes box merging: when identical function boxes are horizontally adjacent, they are merged (see the sketch after this list).
˲˲ The width of each function box
shows the frequency at which that
function was present in the stack traces, or part of a stack trace ancestry.
Functions with wide boxes were more
present in the stack traces than those
with narrow boxes, in proportion to
their widths.
˲˲ If the width of the box is sufficient,
it displays the full function name. If
not, either a truncated function name
with an ellipsis is shown, or nothing.
˲˲ The background color for each
box is not significant and is picked at
random to be a warm hue. This randomness helps the eye differentiate
boxes, especially for adjacent thin
“towers.” Other color schemes are
discussed later.
˲˲ The profile visualized may span a
single thread, multiple threads, multiple applications, or multiple hosts.
Separate flame graphs can be generated if desired, especially for studying
individual threads.
Figure 3. Full MySQL DTrace profile output. (The entire text output, reduced to fit, appears as a featureless gray square.)
Figure 4. Full MySQL profiler output as a flame graph. (Most of the width falls under mysqld`JOIN::exec, reached via handle_one_connection and do_command, with code paths through filesort and row_search_for_mysql; the interactive SVG is linked in the text.)
˲˲ Stack traces may be collected from
different profiler targets, and widths
can reflect measures other than sample counts. For example, a profiler
(or tracer) could measure the time
a thread was blocked, along with its
stack trace. This can be visualized as
a flame graph, where the x-axis spans
the total blocked time, and the flame
graph shows the blocking code paths.
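To make the merging and width rules above concrete, here is a minimal sketch in Python (a hypothetical illustration, not part of any flame-graph implementation). It takes profile records in the one-line-per-stack "folded" form described later under Instructions, accumulates an inclusive count for every root-to-frame path prefix, and prints one line per merged box; each count is proportional to that box's width:

from collections import Counter

def box_widths(folded_lines):
    widths = Counter()
    for line in folded_lines:
        stack, _, count = line.rpartition(" ")
        frames = stack.split(";")               # root ... leaf
        for depth in range(1, len(frames) + 1):
            # identical prefixes from different stacks merge into one box
            widths[tuple(frames[:depth])] += int(count)
    return widths

profile = ["start_thread;func_a;func_b;func_c 1",
           "start_thread;func_a;func_d 2"]
for prefix, width in sorted(box_widths(profile).items()):
    print(len(prefix) - 1, ";".join(prefix), width)

Here start_thread and func_a each merge into a single box spanning all three samples, while func_b and func_d sit side by side above func_a with widths of one and two samples.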
As the entire profiler output is visualized at once, the end user can
navigate intuitively to areas of interest.
The shapes and locations in the flame
graphs become visual maps for the execution of software.
While flame graphs use interactivity to provide additional features, these
characteristics are fulfilled by a static
flame graph, which can be shared as an
image (for example, a PNG file or printed on paper). While only wide boxes
have enough room to contain the function label text, they are also sufficient
to show the bulk of the profile.
Interactivity
Flame graphs can support interactive
features to reveal more detail, improve
navigation, and perform calculations.
The original implementation of
flame graphs4 creates an SVG image
with embedded JavaScript for interactivity, which is then loaded in a
browser. It supports three interactive
features: Mouse-over for information,
click to zoom, and search.
Mouse-over for information. On
mouse-over of boxes, an informational line below the flame graph
and a tooltip display the full function
name, the number of samples present in the profile, and the corresponding percentage for those samples in
the profile. For example, Function:
mysqld`JOIN::exec (272,959 samples, 78.34%).
This is useful for revealing the function name from unlabeled boxes. The
percentage also quantifies code paths
in the profile, which helps the user
prioritize leads and estimate improvements from proposed changes.
Click to zoom. When a box is
clicked, the flame graph zooms horizontally. This reveals more detail, and
often function names for the child
functions. Ancestor frames below the
clicked box are shown with a faded
background as a visual clue that their
widths are now only partially shown.
A Reset Zoom button is included to
return to the original full profile view.
Clicking any box while zoomed will reset the zoom to focus on that new box.
Search. A search button or keystroke (Ctrl-F) prompts the user for a
search term, which can include regular expressions. All function names
in the profile are searched, and any
matched boxes are highlighted with
magenta backgrounds. The sum of
matched stack traces is also shown on
the flame graph as a percentage of the
total profile, as in Figure 5. (An interactive version of Figure 5 in SVG format
is available at http://queue.acm.org/
downloads/2016/Gregg5.svg.)
This is useful not just for locating
functions, but also for highlighting
logical groups of functions—for example, searching for “^ext4_” to find the
Linux ext4 functions.
For some flame graphs, many different code paths may end with a function of interest—for example, spinlock functions. If this appeared in 20
or more locations, calculating their
combined contribution to the profile
would be a tedious task, involving finding then adding each percentage. The
search function makes this trivial, as
a combined percentage is calculated
and shown on screen.
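What the search feature computes can be sketched in a few lines of Python (a hypothetical illustration, not the JavaScript embedded in the SVG). Given profile records as semicolon-separated stacks, each followed by a sample count (the "folded" form covered under Instructions below), it sums the samples of every stack containing a matching frame and reports the share of the total:

import re

def matched_percentage(folded_lines, pattern):
    regex = re.compile(pattern)
    total = matched = 0
    for line in folded_lines:
        stack, _, count = line.rpartition(" ")
        n = int(count)
        total += n
        if any(regex.search(frame) for frame in stack.split(";")):
            matched += n                    # count the whole stack once
    return 100.0 * matched / total if total else 0.0

profile = ["start_thread;func_a;func_b;func_c 1",
           "start_thread;func_a;func_d 2"]
print(round(matched_percentage(profile, r"^func_d$"), 2))   # 66.67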
Instructions
There are several implementations of
flame graphs so far.5 The original implementation, FlameGraph,4 was written
in the Perl programming language and
released as open source. It makes the
generation of flame graphs a three-step
sequence, including the use of a profiler:
1. Use a profiler to gather stack
traces (for example, Linux perf_events,
DTrace, Xperf).
2. Convert the profiler output into the
“folded” intermediate format. Various
programs are included with the FlameGraph software to handle different
profilers; their program names begin
with “stackcollapse.”
3. Generate the flame graph using
flamegraph.pl. This reads the previous
folded format and converts it to an SVG
flame graph with embedded JavaScript.
The folded stack-trace format puts stack traces on a single line, with functions separated by semicolons, followed by a space and then a count. The name of the application, or the name and process ID separated by a dash, can be optionally included at the start of the folded stack trace, followed by a semicolon. This groups the application’s code paths in the resulting flame graph.
Figure 5. Search highlighting: a Linux kernel CPU flame graph searched for "tcp" (matched boxes are highlighted).
For example, a profile containing
the following three stack traces:
func_c
func_b
func_a
start_thread
func_d
func_a
start_thread
func_d
func_a
start_thread
becomes the following in the folded
format:
start_thread;func_a;func_b;func_c 1
start_thread;func_a;func_d 2
If the application name is included—for example, “java”—it would then
become:
java;start_thread;func_a;func_b;func_c 1
java;start_thread;func_a;func_d 2
This intermediate format has allowed others to contribute converters for
other profilers. There are now stackcollapse programs for DTrace, Linux
perf_events, FreeBSD pmcstat, Xperf,
SystemTap, Xcode Instruments, Intel
VTune, Lightweight Java Profiler, Java
jstack, and gdb.4
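The bundled stackcollapse programs deal with the quirks of real profiler output, but the essence of the conversion fits in a short sketch. The following Python sketch (a hypothetical, simplified converter, not one of the bundled scripts) assumes input like the three-stack example above: one frame per line, leaf first, with blank lines separating samples. It reverses each sample to root-to-leaf order, joins the frames with semicolons, and counts duplicates:

import sys
from collections import Counter

def collapse(lines):
    counts = Counter()
    frames = []
    for line in lines:
        frame = line.strip()
        if frame:                       # another frame of the current sample
            frames.append(frame)
        elif frames:                    # a blank line ends the sample
            counts[";".join(reversed(frames))] += 1
            frames = []
    if frames:                          # final sample with no trailing blank line
        counts[";".join(reversed(frames))] += 1
    return counts

if __name__ == "__main__":
    for stack, count in sorted(collapse(sys.stdin).items()):
        print(stack, count)

Fed the three stacks shown earlier, it prints the two folded lines above (without the optional application name); piping that output into flamegraph.pl would produce the corresponding SVG.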
The final flamegraph.pl program supports many customization options, including changing the flame graph’s title.
As an example, the following steps
fetch the FlameGraph software, gather
a profile on Linux (99Hz, all CPUs, 60
seconds), and then generate a flame
graph from the profile:
# git clone https://github.com/brendangregg/FlameGraph
# cd FlameGraph
# perf record -F 99 -a -g -- sleep 60
# perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > out.svg
Since the output of stackcollapse
has single lines per record, it can be
modified using grep/sed/awk if needed
before generating a flame graph.
The online flame graph documentation includes instructions for using
other profilers.4,5
Flame Graph Interpretation
Flame graphs can be interpreted as
follows:
˲˲ The top edge of the flame graph
shows the function that was running
on the CPU when the stack trace was
collected. For CPU profiles, this is the
function that is directly consuming
CPU cycles. For other profile types, this
is the function that directly led to the
instrumented event.
˲˲ Look for large plateaus along the
top edge, as these show a single stack
trace was frequently present in the
profile. For CPU profiles, this means a
single function was frequently running
on-CPU.
˲˲ Reading top down shows ancestry. A function was called by its parent, which is shown directly below it; the parent was called by its parent shown below it, and so on. A quick scan downward from a function identifies why it was called.
˲˲ Reading bottom up shows code
flow and the bigger picture. A function
calls any child functions shown above
it, which, in turn, call functions shown
above them. Reading bottom up also
shows the big picture of code flow before various forks split execution into
smaller towers.
˲˲ The width of function boxes can
be directly compared: wider boxes
mean a greater presence in the profile
and are the most important to understand first.
˲˲ For CPU profiles that employ timed
sampling of stack traces, if a function
box is wider than another, this may be
because it consumes more CPU per
function call or that the function was
simply called more often. The function-call count is not shown or known
via sampling.
˲˲ Major forks in the flame graph,
spotted as two or more large towers
atop a single function, can be useful
to study. They can indicate a logical
grouping of code, where a function
processes work in stages, each with its
own function. It can also be caused by
a conditional statement, which chooses which function to call.
Interpretation Example
As an example of interpreting a flame graph, consider the mock one shown in Figure 6. Imagine this is visualizing a CPU profile, collected using timed samples of stack traces (as is typical).
Figure 6. Example for interpretation: a mock flame graph in which a() spans the base and calls b() and h(); c() and d() stack above b(); e() and f() sit above d(), with g() above f(); and i() sits above h().
The top edge shows that function
g() is on-CPU the most; d() is wider,
but its exposed top edge is on-CPU the
least. Functions including b() and c()
do not appear to have been sampled
on-CPU directly; rather, their child
functions were running.
Functions beneath g() show its ancestry: g() was called by f(), which was
called by d(), and so on.
Visually comparing the widths of
functions b() and h() shows the b()
code path was on-CPU about four times
more than h(). The actual functions on-CPU in each case were their children.
A major fork in the code paths is visible where a() calls b() and h(). Understanding why the code does this may be a
major clue to its logical organization.
This may be the result of a conditional
(if conditional, call b(), else call h()) or
a logical grouping of stages (where a()
is processed in two parts: b() and h()).
Other Code-Path Visualizations
As was shown in Figure 1, Linux perf_
events prints a tree of code paths with
percentage annotations. This is another type of hierarchy visualization:
An indented tree layout.7 Depending
on the profile, this can sometimes sufficiently summarize the output but not
always. Unlike flame graphs, one cannot zoom out to see the entire profile
and still make sense of this text-based
visualization, especially after the percentages can no longer be read.
KCacheGrind14 visualizes code
paths from profile data using a directed
acyclic graph. This involves representing functions as labeled boxes (where the width is scaled to fit the function name) and parent-to-child relationships as arrows, and then annotating profile data on the boxes and arrows as percentages with bar-chart-like icons. Similar to the problem with perf_events, if
the visualization is zoomed out to fit a
complex profile, then the annotations
may no longer be legible.
The sunburst layout is equivalent
to the icicle layout as used by flame
graphs, but it uses polar coordinates.7
While this can generate interesting
shapes, there are some difficulties:
function names are more difficult to
draw and read from sunburst slices
than they are in the rectangular flamegraph boxes. Also, comparing two
functions becomes a matter of comparing two angles rather than two line
lengths, which has been evaluated as a
more difficult perceptual task.10
Flame charts are a similar code-path visualization to flame graphs
(and were inspired by flame graphs13).
On the x-axis, however, they show the
passage of time instead of an alphabetical sort. This has its advantages:
time-ordered issues can be identified.
However, it can greatly reduce merging, a problem exacerbated when profiling multiple threads. It could be a
useful option for understanding time
order sequences when used with flame
graphs for the bigger picture.
Challenges
Challenges with flame graphs mostly
involve system profilers and not flame
graphs themselves. There are two typical problems with profilers:
˲˲ Stack traces are incomplete. Some
system profilers truncate to a fixed
stack depth (for example, 10 frames),
which must be increased to capture the
full stack traces, or else frame merging
can fail. A worse problem is when the
software compiler reuses the frame
pointer register as a compiler optimization, breaking the typical method of
stack-trace collection. The fix requires
either a different compiled binary (for
example, using gcc’s -fno-omit-frame-pointer) or a different stack-walking
technique.
˲˲ Function names are missing. In this
case, the stack trace is complete, but
many function names are missing and
may be represented as hexadecimal addresses. This commonly happens with
JIT (just-in-time) compiled code, which
may not create a standard symbol table
for profilers. Depending on the profiler
and runtime, there are different fixes.
For example, Linux perf_events supports supplemental symbol files, which
the application can create.
At Netflix we encountered both
problems when attempting to create flame graphs for Java.6 The first
has been fixed by the addition of a
JVM (Java Virtual Machine) option, -XX:+PreserveFramePointer, which
allows Linux perf_events to capture
full stack traces. The second has been
fixed using a Java agent, perf-map-agent,11 which creates a symbol table
for Java methods.
One challenge with the Perl flame-graph implementation has been the
resulting SVG file size. For a large profile with many thousands of unique
code paths, the SVG file can be tens
of megabytes in size, which becomes
sluggish to load in a browser. The fix
has been to elide code paths that are so
thin they are normally invisible in the
flame graph. This does not affect the
big-picture view and has kept the SVG
file smaller.
Other Color Schemes
Apart from a random warm palette,
other flame-graph color schemes can
be used, such as for differentiating code
or including an extra dimension of data.
Various palettes can be selected in
the Perl flame-graph version, including “java,” which uses different hues
to highlight a Java mixed-mode flame
graph: green for Java methods, yellow
for C++, red for all other user-mode functions, and orange for kernel-mode functions. An example is shown in Figure 7.
(An interactive version of Figure 7 in SVG
format is available at http://queue.acm.
org/downloads/2016/Gregg7.svg.)
Another option is a hashing color
scheme, which picks a color based
on a hash of the function name. This
keeps colors consistent, which is helpful when comparing multiple flame
graphs from the same system.
Differential Flame Graphs
A differential flame graph shows the
difference between two profiles, A and
B. The Perl flame-graph software currently supports one method, where
the B profile is displayed and then
colored using the delta from A to B.
Red colors indicate functions that
increased, and blue colors indicate
those that decreased. A problem with
this approach is that some code paths
present in the A profile may be missing entirely in the B profile, and so will be missing from the final visualization. This could be misleading.
Figure 7. Java mixed-mode CPU flame graph (frames colored by type: Java, JVM (C++), user, and kernel code; the interactive SVG is linked in the text).
Another implementation, flamegraphdiff,2 solves this problem by using three flame graphs. The first shows
the A profile, the second shows the B
profile, and the third shows only the
delta between them. A mouse-over of
one function in any flame graph also
highlights the others to help navigation. Optionally, the flame graphs
can also be colored using a red/blue
scheme to indicate which code paths
increased or decreased.
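The underlying arithmetic can be sketched in Python (a hypothetical illustration of the idea, not the Perl implementation's actual code). Given two profiles in the folded format, it computes the per-stack change in sample count from profile A to profile B, which is the quantity the red/blue coloring encodes:

import sys

def read_folded(path):
    counts = {}
    with open(path) as f:
        for line in f:
            stack, _, count = line.rstrip("\n").rpartition(" ")
            if stack:
                counts[stack] = counts.get(stack, 0) + int(count)
    return counts

if __name__ == "__main__":
    a = read_folded(sys.argv[1])            # the "before" profile
    b = read_folded(sys.argv[2])            # the "after" profile
    for stack in sorted(set(a) | set(b)):
        delta = b.get(stack, 0) - a.get(stack, 0)
        print(stack, "%+d" % delta)         # positive: grew; negative: shrank

Stacks present only in A appear here with a negative delta; in the single-graph scheme described above they would be missing entirely, which is the problem the three-graph flamegraphdiff layout addresses.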
Other Targets
As previously mentioned, flame graphs
can visualize any profiler output. This
includes stack traces collected on
CPU PMC (performance monitoring
counter) overflow events, static tracing
events, and dynamic tracing events.
Following are some specific examples.
Stall cycles. A stall-cycle flame graph
shows code paths that commonly
block on processor or hardware resources—typically memory I/O. The
input stack traces can be collected using a PMC profiler, such as Linux perf_
events. This can direct the developer to
employ a different optimization technique to the identified code paths, one
that aims to reduce memory I/O rather
than reducing instructions.
CPI (cycles per instruction), or its
inverse, IPC (instructions per cycle), is
a measure that also helps explain the
types of CPU cycles and can direct tuning effort. A CPI flame graph shows a
CPU sample flame graph where widths
correspond to CPU cycles, but it uses
a color scale from red to blue to indicate each function’s CPI: red for a high
CPI and blue for a low CPI. This can be
accomplished by capturing two profiles—a CPU sample profile and an instruction count profile—and then using a differential flame graph to color
the difference between them.
Memory. Flame graphs can shed
light on memory growth by visualizing
a number of different memory events.
A malloc() flame graph, created by
tracing the malloc() function, visualizes code paths that allocated memory.
This can be difficult in practice, as allocator functions can be called frequently, making the cost to trace them
prohibitive in some scenarios.
Tracing the brk() and mmap() syscalls can show code paths that caused
an expansion in virtual memory for a
process, typically related to the allocation path, although this could also
be an asynchronous expansion of the
application’s memory. These are typically lower frequency, making them
more suitable for tracing.
Tracing memory page faults shows
code paths that caused an expansion
in physical memory for a process. Unlike allocator code paths, this shows
the code that populated the allocated
memory. Page faults are also typically a
lower-frequency activity.
I/O. The issuing of I/O, such as file
system, storage device, and network,
can usually be traced using system
tracers. A flame graph of these profiles
illustrates different application paths
that synchronously issued I/O.
In practice, this has revealed types
of I/O that were otherwise not known.
For example, disk I/O may be issued:
synchronously by the application, by
a file system read-ahead routine, by
an asynchronous flush of dirty data, or
by a kernel background scrub of disk
blocks. An I/O flame graph identifies
each of these types by illustrating the
code paths that led to issuing disk I/O.
Off-CPU. Many performance issues are not visible using CPU flame
graphs, as they involve time spent
while the threads are blocked, not running on a CPU (off-CPU). Reasons for a
thread to block include waiting on I/O,
locks, timers, a turn on-CPU, and waiting for paging or swapping. These scenarios can be identified by the stack
trace when the thread was descheduled. The time spent off-CPU can also
be measured by tracing the time from
when a thread left the CPU to when it
returned. System profilers commonly
use static trace points in the kernel to
trace these events.
An off-CPU time flame graph can
illustrate this off-CPU time by showing the blocked stack traces where the
width of a box is proportional to the
time spent blocked.
Wakeups. A problem found in practice with off-CPU time flame graphs is
they are inconclusive when a thread
blocks on a conditional variable. We
needed information on why the conditional variable was held by some other
thread for so long.
A wakeup time flame graph can be
generated by tracing thread wakeup
events. This includes wakeups by the other threads releasing the conditional variable, and so it sheds light on why the blocked threads were waiting. This flame-graph type can be studied along with
an off-CPU time flame graph for more
information on blocked threads.
Chain graphs. One wakeup flame
graph may not be enough. The thread
that held a conditional variable may
have been blocked on another conditional variable, held by another
thread. In practice, one thread may
have been blocked on a second, which
was blocked on a third, and a fourth.
A chain flame graph is an experimental visualization3 that begins with
an off-CPU flame graph and then adds
all wakeup stack traces to the top of
each blocked stack. By reading bottom
up, you see the blocked off-CPU stack
trace, and then the first stack trace that
woke it, then the next stack trace that
woke it, and so on. Widths correspond
to the time that threads were off-CPU
and the time taken for wakeups.
This can be accomplished by tracing all off-CPU and wakeup events
with time stamps and stack traces, and
post processing. These events can be
extremely frequent, however, and impractical to instrument in production
using current tools.
Future Work
Much of the work related to flame
graphs has involved getting different
profilers to work with different runtimes so the input for flame graphs
can be captured correctly (for example, for Node.js, Ruby, Perl, Lua, Erlang, Python, Java, golang, and with
DTrace, perf_events, pmcstat, Xperf,
Instruments, among others). There is
likely to be more of this type of work in
the future.
Another in-progress differential
flame graph, called a white/black differential, uses the single flame-graph
scheme described earlier plus an extra
region on the right to show only the
missing code paths. Differential flame
graphs (of any type) should also see
more adoption in the future; at Netflix,
we are working to have these generated
nightly for microservices: to identify
regressions and aid with performance-issue analysis.
Several other flame-graph implementations are in development, exploring different features. Netflix has
been developing d3-flame-graph,12
which includes transitions when
zooming. The hope is that this can
provide new interactivity features, including a way to toggle the merge order from bottom-up to top-down, and
also to merge around a given function.
Changing the merge order has already
proven useful for the original flamegraph.pl, which can optionally merge
top-down and then show this as an
icicle plot. A top-down merge groups
together leaf paths, such as spin locks.
Conclusion
The flame graph is an effective visualization for collected stack traces
and is suitable for CPU profiling, as
well as many other profile types. It
creates a visual map for the execution
of software and allows the user to
navigate to areas of interest. Unlike
other code-path visualizations, flame
graphs convey information intuitively using line lengths and can handle
large-scale profiles, while usually remaining readable on one screen. The
flame graph has become an essential tool for understanding profiles
quickly and has been instrumental in
countless performance wins.
Acknowledgments
Inspiration for the general layout, SVG
output, and JavaScript interactivity
came from Neelakanth Nadgir’s function_call_graph.rb time-ordered visualization for callstacks,9 which itself
was inspired by Roch Bourbonnais’s
CallStackAnalyzer and Jan Boerhout’s
vftrace. Adrien Mahieux developed
the horizontal zoom feature for flame
graphs, and Thorsten Lorenz added
a search feature to his implementation.8 Cor-Paul Bezemer researched
differential flame graphs and developed the first solution.1 Off-CPU time
flame graphs were first discussed and
documented by Yichun Zhang.15
Thanks to the many others who have
documented case studies, contributed
ideas and code, given talks, created
new implementations, and fixed profilers to make this possible. See the
updates section for a list of this work.5
Finally, thanks to Deirdré Straughan
for editing and feedback.
Related articles
on queue.acm.org
Interactive Dynamics for Visual Analysis
Ben Shneiderman
http://queue.acm.org/detail.cfm?id=2146416
The Antifragile Organization
Ariel Tseitlin
http://queue.acm.org/detail.cfm?id=2499552
JavaScript and the Netflix User Interface
Alex Liu
http://queue.acm.org/detail.cfm?id=2677720
References
1. Bezemer, C.-P. Flamegraphdiff. GitHub; http://corpaul.
github.io/flamegraphdiff/.
2. Bezemer, C.-P., Pouwelse, J., Gregg, B. Understanding
software performance regressions using differential
flame graphs. Published in IEEE 22nd International
Conference on Software Analysis, Evolution and
Reengineering (2015): http://ieeexplore.ieee.org/
xpl/login.jsp?tp=&arnumber=7081872&url=http%
3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.
jsp%3Farnumber%3D7081872.
3. Gregg, B. Blazing performance with flame graphs.
In Proceedings of the 27th Large Installation System
Administration Conference (2013); https://www.usenix.
org/conference/lisa13/technical-sessions/plenary/gregg.
4. Gregg, B. FlameGraph. GitHub; https://github.com/
brendangregg/FlameGraph.
5. Gregg, B. Flame graphs; http://www.brendangregg.
com/flamegraphs.html.
6. Gregg, B., Spier, M. Java in flames. The Netflix Tech
Blog, 2015; http://techblog.netflix.com/2015/07/java-in-flames.html.
7. Heer, J., Bostock, M., Ogievetsky, V. A tour through the
visualization zoo. acmqueue 8, 5 (2010); http://queue.
acm.org/detail.cfm?id=1805128.
8. Lorenz, T. Flamegraph. GitHub; https://github.com/
thlorenz/flamegraph.
9. Nadgir, N. Visualizing callgraphs via dtrace and ruby.
Oracle Blogs, 2007; https://blogs.oracle.com/realneel/
entry/visualizing_callstacks_via_dtrace_and.
10. Odds, G. The science behind data visualisation.
Creative Bloq, 2013; http://www.creativebloq.com/
design/science-behind-data-visualisation-8135496.
11. Rudolph, J. perf-map-agent. GitHub; https://github.
com/jrudolph/perf-map-agent.
12. Spier, M. d3-flame-graph. GitHub, 2015; https://github.
com/spiermar/d3-flame-graph.
13. Tikhonovsky, I. Web Inspector: implement flame
chart for CPU profiler. Webkit Bugzilla, 2013; https://
bugs.webkit.org/show_bug.cgi?id=111162.
14. Weidendorfer, J. KCachegrind; https://kcachegrind.
github.io/html/Home.html.
15. Zhang, Y. Introduction to off-CPU time flame graphs,
2013; http://agentzh.org/misc/slides/off-cpu-flamegraphs.pdf.
Brendan Gregg is a senior performance architect
at Netflix, where he does large-scale computer
performance design, analysis, and tuning. He was
previously a performance lead and kernel engineer
at Sun Microsystems. His recent work includes
developing methodologies and visualizations for
performance analysis.
Copyright held by author.
Publication rights licensed to ACM. $15.00
practice
DOI:10.1145/2909466
Article development led by queue.acm.org
Farsighted physicists of yore
were danged smart!
BY PAT HELLAND
Standing on Distributed Shoulders of Giants
If you squint hard enough, many of the challenges
of distributed computing appear similar to the work
done by the great physicists. Dang, those fellows
were smart!
Here, I examine some of the most important
physics breakthroughs and draw some whimsical
parallels to phenomena in the world of computing
… just for fun.
Newton Thought He Knew
What Time It Was
Isaac Newton (1642–1727) was a brilliant physicist who defined the foundations for classical mechanics, laws of
motion, and universal gravitation. He
also built the first refracting telescope,
developed a theory of color, and much
more. He was one bad dude.
Newton saw the notion of
time as constant and consistent
across the universe. Furthermore, he
assumed that gravity operated instantaneously without regard to distance. Each
object in the universe is exerting gravitational force at all times.
This is very much like what we see
in a single computer or in a tightly coupled cluster of computers that perform
consistent work in a shared transaction. Transactions have a clearly defined local notion of time. Each transaction sees its work as crisply following
a set of transactions. Time marches
forward unperturbed by distance.
When I was studying computer science (and Nixon was president), we
thought about only one computer.
There was barely any network other
than the one connecting terminals
to the single computer. Sometimes, a
tape would arrive from another computer and we had to figure out how to
understand the data on it. We never
thought much about time across
computers. It would take a few years
before we realized our perspective
was too narrow.
Einstein Had Many Watches
In 1905, Albert Einstein (1879–1955)
proposed the special theory of relativity based on two principles. First, the
laws of physics, including time, appear
to be the same to all observers. Second,
the speed of light is unchanging.
An implication of this theory is that
there is no notion of simultaneity. The
notion of simultaneity is relative to the
observer, and the march of time is also
relative to the observer. Each of these
frames of reference is separated by the
speed of light as interpreted relative to
their speed in space.
This concept has some interesting
consequences. The sun might have
blown up five minutes ago, and the
next three minutes will be lovely. When
stuff happens far away, it takes time to
find out … potentially a long time.
In computing, you cannot know
what is happening “over there.” Interacting with another system always
takes time. You can launch a message,
but you always have to wait for the answer to come back to know the result.
More and more, latency is becoming
the major design point in systems.
The time horizon for knowledge
propagation in a distributed system
is unpredictable. This is even worse
than in the physical Einstein-based
universe. At least with our sun and
the speed of light, we know we can see
what is happening at the sun as of eight
minutes ago. In a distributed system,
we have a statistical understanding of
how our knowledge propagates, but
we simply cannot know with certainty.
The other server, in its very own time
domain, may be incommunicado for a
heck of a long time.
Furthermore, in any distributed
interaction, a message may or may
not be delivered within bounded
time. Higher-level applications don’t
ever know if the protocol completed.
Figure 1 shows how the last message
delivery is not guaranteed and the
sender never knows what the receiver
knows. In any distributed protocol,
the sender of the last message cannot
tell whether it arrived. That would require another message.
Another problem is that servers and
messages live in their very own time
space. Messages sent and received
across multiple servers may have surprising reorderings. Each server and
each message lives in its own time, and
they may be relative to each other but
may offer surprises because they are
not coordinated. Some appear slower,
and some faster. This is annoying.
In Figure 2, as work flows across different times in servers and messages,
the time is disconnected and may be
slower or faster than expected. In this
case, the second message sent by A
may arrive after work caused by the first
message, traveling through C. These
problems can make your head hurt in
a similar fashion to how it hurts when
contemplating twins where one travels
close to the speed of light and time appears to slow down while the other one
stays home and ages.
You cannot do distributed agreement in bounded time. Messages get
lost. You can retry them and they will
probably get through. In a fixed period of time, however, there is a small
(perhaps very small) chance they won’t
arrive. For any fixed period of time,
there’s a chance the partner server will
be running sloooooow and not get back.
Two-phase commit cannot guarantee agreement in bounded time. Similarly, Paxos,7 Raft,8 and the other cool
agreement protocols cannot guarantee
agreement in a bounded time. These
protocols are very likely to reach agreement soon, but there’s no guarantee.4
Each lives in its own relative world and
does not know what is happening over
there … at least not yet.
According to the CAP Theorem1,5
(that is, consistency, availability, partition tolerance), if you tolerate failures of computers and/or networks,
you can have either classic database
consistency or database availability. To avoid application challenges,
most systems choose consistency
over availability.
Figure 1. Sender gets no confirmation of final message delivery. (Diagrams of request-response and fire-and-forget exchanges between server-A and server-B; in both cases, delivery of the final message is not guaranteed.)
Two-phase commit is
the anti-availability protocol.
From where I stand, Einstein made a
lot of sense. I’m not sure how you feel
about him.
Figure 2. Disconnected time may be slower or faster than expected. (Messages flowing among server-A, server-B, and server-C, each in its own time reality; the order in which messages leave A’s time need not match the order in which they enter B’s time.)
Hubble Was Increasingly Far Out
Edwin Hubble (1889–1953) was an astronomer who discovered the farther
away an object is, the faster it is receding from us. This, in turn, implies the
universe is expanding. Basically, everything is getting farther away from everything else.
In computing, we have seen an
ever-increasing amount of computation, bandwidth, and memory size.
It looks like this will continue for a
while. Latency is not decreasing too
much and is limited by the speed of
light. There are no obvious signs that
the speed of light will stop being a
constraint anytime soon. The number
of instruction opportunities lost to
waiting while something is fetched is
increasing inexorably.
Computing is like
the Hubble’s universe ...
Everything is getting farther away
from everything else.
Shared read-only data isn’t the biggest problem. With enough cache,
you can pull the stuff you need into
the sharing system. Sharing writeable
stuff is a disaster. You frequently stall
while pulling a cache line with the latest copy from a cohort’s cache. More
and more instruction opportunities
will be lost while waiting. This will only
get worse as time moves on!
Shared memory works great ...
as long as you don’t SHARE memory.
Either we figure out how to get
around that pesky speed-of-light thing,
or we are going to need to work harder
on asynchrony and concurrency.
Heisenberg Wasn’t Sure
Werner Heisenberg (1901–1976) defined the uncertainty principle, which
states that the more you know about the
location of a particle, the less you know
about its movement. Basically, you can’t
know everything about anything.
In a distributed system you have a
gaggle of servers, each of which lives in
various states of health, death, or garbage collection. The vast majority of the
time you can chat with a server and get
a crisp and timely result. Other times
you do not get a prompt answer and it’s
difficult to know if you should abandon the slacker or wait patiently. Furthermore, you don’t know if the server
got the request, did the work, and just
has not answered. Anytime a request
goes to a single system, you don’t know
when the request will be delayed.2,6
In some distributed systems, it
is essential to have an extremely
consistent and fast response time
for online users. To accomplish
this, multiple requests must be
issued, and the completion of a
subset of the requests is accepted
as happiness.
In a distributed system,
you can know where the work is done
or you can know when the work is done
but you can’t know both.
To know when a request is done
within a statistical SLA (service-level
agreement), you need to accept that
you do not know where the work will
be done. Retries of the request are the
only option to get a timely answer often
enough. Hence, the requests had better be idempotent.
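A minimal sketch of that tactic in Python, assuming the request is idempotent; the replica URLs and the fetch() body are hypothetical stand-ins, and a real system would add deadlines, retries, and cancellation:

import random
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

REPLICAS = ["http://replica-1/query", "http://replica-2/query", "http://replica-3/query"]

def fetch(url, request_id):
    # Stand-in for an idempotent network call; it must be safe for more than
    # one replica to execute it. Here we only simulate variable latency.
    time.sleep(random.uniform(0.01, 0.3))
    return "%s answered request %s" % (url, request_id)

def hedged_request(request_id, sla_seconds=0.5):
    pool = ThreadPoolExecutor(max_workers=len(REPLICAS))
    try:
        futures = [pool.submit(fetch, url, request_id) for url in REPLICAS]
        done, _ = wait(futures, timeout=sla_seconds, return_when=FIRST_COMPLETED)
        if not done:
            raise TimeoutError("no replica answered within the SLA")
        return next(iter(done)).result()    # accept the first completion
    finally:
        # Let the slower duplicates finish in the background; because the
        # request is idempotent, their late results can simply be ignored.
        pool.shutdown(wait=False)

if __name__ == "__main__":
    print(hedged_request(request_id=42))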
Schrödinger’s PUT
Erwin Schrödinger (1887–1961) was a
leading physicist of the early 20th century. While he made many substantial
contributions to the field of quantum theory,
he is most often remembered for a
thought experiment designed to show
the challenges of quantum physics.
In quantum physics, the theory, the
math, and the experimental observations show that pretty much everything
remains in multiple states until it in-
teracts with or is observed by the external world. This is known as a superposition of states that collapse when you
actually look.
To show this seems goofy, Schrödinger
proposed this quantum-level uncertainty could map to a macro-level uncertainty. Start by placing a tiny bit of
uranium, a Geiger counter, a vial of
cyanide, and a cat into a steel box. Rig
the Geiger counter to use a hammer to
break the vial of cyanide if an atom of
uranium has decayed. Since the quantum physics of uranium decay show it
is both decayed and not decayed until
you observe the state, it is clear the cat
is both simultaneously dead and alive.
Turns out many contemporary physicists think it’s not goofy … the cat would
be in both states. Go figure!
New distributed systems such as
Dynamo3 store their data in unpredictable locations. This allows prompt and
consistent latencies for PUTs as well as
self-managing and self-balancing servers. Typically, the client issues a PUT
to each of three servers, and when the
cluster is automatically rebalancing,
the destination servers may be sloshing data around. The set of servers
used as destinations may be slippery. A
subsequent GET may need to try many
servers to track down the new value. If
a client dies during a PUT, it is possible
that no servers received the new value
or that only a single server received it.
That single server may or may not die
before sharing the news. That single
server may die, not be around to answer a read, and then later pop back to
life resurrecting the missing PUT.
Therefore, a subsequent GET may
find the PUT, or it may not. There is effectively no limit to the number of places it may be hiding. There is no upper
bound on the time taken for the new
value to appear. If it does appear, it will
be re-replicated to make it stick.
While not yet observed,
a PUT does not really exist ...
it’s likely to exist but you can’t be sure.
Only after it is seen by a GET
will the PUT really exist.
Furthermore, the failure to observe
does not mean the PUT is really missing. It may be lurking in a dead or unresponsive machine. If you see the PUT
and force its replication to multiple
servers, it remains in existence with
very high fidelity. Not seeing it tells you
only that it’s likely it is not there.
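The GET-side behavior can be sketched as read repair (a minimal, hypothetical illustration in Python, not Dynamo’s actual protocol; the Replica class and integer versions are stand-ins): ask every reachable replica, keep the newest value observed, and write it back to the replicas that missed it, so a PUT that has been seen once becomes firmly replicated:

class Replica:
    def __init__(self):
        self.store = {}                 # key -> (version, value)
        self.alive = True               # a dead replica cannot be read or repaired

    def get(self, key):
        return self.store.get(key) if self.alive else None

    def put(self, key, version, value):
        if self.alive:
            current = self.store.get(key)
            if current is None or current[0] < version:
                self.store[key] = (version, value)

def get_with_repair(replicas, key):
    observed = [(r, r.get(key)) for r in replicas if r.alive]
    found = [entry for _, entry in observed if entry is not None]
    if not found:
        return None                     # the value may still be hiding on a dead replica
    version, value = max(found)         # newest (version, value) seen
    for replica, entry in observed:     # repair replicas that missed it or hold stale data
        if entry is None or entry[0] < version:
            replica.put(key, version, value)
    return value

if __name__ == "__main__":
    replicas = [Replica() for _ in range(3)]
    replicas[0].put("k", 1, "hello")              # only one replica received the PUT
    print(get_with_repair(replicas, "k"))         # observing it re-replicates it
    print(sum("k" in r.store for r in replicas))  # now 3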
Conclusion
Wow! There have been lots of brilliant
physicists, many of them not mentioned here. Much of their work has
shown us the very counterintuitive
ways the world works. Year after year,
there are new understandings and
many surprises.
In our nascent discipline of distributed systems, we would be wise to realize there are subtleties, surprises, and
bizarre uncertainties intrinsic in what
we do. Understanding, bounding, and
managing the trade-offs inherent in
these systems will be a source of great
challenge for years to come. I think it’s
a lot of fun!
Related articles
on queue.acm.org
As Big as a Barn?
Stan Kelly-Bootle
http://queue.acm.org/detail.cfm?id=1229919
Condos and Clouds
Pat Helland
http://queue.acm.org/detail.cfm?id=2398392
Testable System Administration
Mark Burgess
http://queue.acm.org/detail.cfm?id=1937179
References
1. Brewer, E.A. Towards robust distributed systems. In
Proceedings of the 19th Annual ACM Symposium on
Principles of Distributed Computing (2000).
2. Dean, J., Barroso, L.A. The tail at scale.
Commun. ACM 56, 2 (Feb. 2013), 74–80.
3. DeCandia, G., Hastorun, D., Jampani, M., Kakulapati,
G., Lakshman, A., Pilchin, A., Sivasubramanian, S.,
Vosshall, P., Vogels, W. Dynamo: Amazon’s highly
available key-value store. In Proceedings of the 21st
ACM Symposium on Operating Systems Principles
(2007), 205–220.
4. Fischer, M., Lynch, N., Paterson, M. The impossibility of
distributed consensus with one faulty process. JACM
32, 2 (Apr. 1985).
5. Gilbert, S., Lynch, N. Brewer’s conjecture and the
feasibility of consistent, available, and partition-tolerant
web services. ACM SIGACT News 33, 2 (2002).
6. Helland, P. Heisenberg was on the write track.
In Proceedings of the 7th Biennial Conference on
Innovative Data Systems Research (2015).
7. Lamport, L. The part-time parliament. ACM Trans.
Computer Systems 16, 2 (May 1998).
8. Ongaro, D., Ousterhout, J. In search of an
understandable consensus algorithm. In Proceedings
of the Usenix Annual Technical Conference (2014);
https://www.usenix.org/conference/atc14/technicalsessions/presentation/ongaro.
Pat Helland has been implementing transaction systems,
databases, application platforms, distributed systems,
fault-tolerant systems, and messaging systems since
1978. He currently works at Salesforce.
Copyright held by author.
Publication rights licensed to ACM. $15.00.
contributed articles
DOI:10.1145/2896587
Human-centered design can make
application programming interfaces
easier for developers to use.
BY BRAD A. MYERS AND JEFFREY STYLOS
Improving API Usability
Application programming interfaces (APIs),
including libraries, frameworks, toolkits, and
software development kits, are used by virtually all
code. If one includes both internal APIs (interfaces
internal to software projects) and public APIs
(such as the Java Platform SDK, the Windows .NET
Framework, jQuery for JavaScript, and Web services
like Google Maps), nearly every line of code most
programmers write will use API calls. APIs provide
a mechanism for code reuse so programmers can
build on top of what others (or they themselves)
have already done, rather than start from scratch
with every program. Moreover, using APIs is
often required because low-level access to system
resources (such as graphics, networking, and the
file system) is available only through protected APIs.
Organizations increasingly provide their internal data
on the Web through public APIs; for example, http://
www.programmableweb.com lists almost 15,000
APIs for Web services and https://www.digitalgov.
gov/2013/04/30/apis-in-government/ promotes use of
government data through Web APIs.
There is an expanding market of companies, software, and services to help
organizations provide APIs. One such
company, Apigee Corporation (http://
apigee.com/), surveyed 200 marketing
and IT executives in U.S. companies
with annual revenue of more than $500
million in 2013, with 77% of respondents rating APIs “important” to making their systems and data available
to other companies, and only 1% of
respondents rating APIs as “not at all
important.”12 Apigee estimated the total market for API Web middleware was
$5.5 billion in 2014.
However, APIs are often difficult
to use, and programmers at all levels,
from novices to experts, repeatedly
spend significant time learning new
APIs. APIs are also often used incorrectly, resulting in bugs and sometimes significant security problems.7
APIs must provide the needed functionality, but even when they do, the
design could make them unusable.
Because APIs serve as the interface between human developers and the body
of code that implements the functionality, principles and methods from human-computer interaction (HCI) can
be applied to improve usability. “Usability,” as discussed here, includes a
variety of properties, not just learnability for developers unfamiliar with an
API but also efficiency and correctness
when used by experts. This property
is sometimes called “DevX,” or developer experience, as an analogy with
“UX,” or user experience. But usability
also includes providing the appropriate functionality and ways to access it.
key insights
˽˽ All modern software makes heavy use of APIs, yet programmers can find APIs difficult to use, resulting in errors and inefficiencies.
˽˽ A variety of research findings, tools, and methods are widely available for improving API usability.
˽˽ Evaluating and designing APIs with their users in mind can result in fewer errors, along with greater efficiency, effectiveness, and security.
Researchers have shown how various human-centered techniques, including contextual inquiry field studies,
corpus studies, laboratory user studies,
and logs from field trials, can be used
to determine the actual requirements
for APIs so they provide the right functionality.21 Other research focuses on
access to that functionality, showing,
for example, software patterns in APIs
that are problematic for users,6,10,25
guidelines that can be used to evaluate
API designs,4,8 with some assessed by
automated tools,18,20 and mitigations
to improve usability when other considerations require trade-offs.15,23 As
an example, our own small lab study in
2008 found API users were between 2.4
and 11.2 times faster when a method
was on the expected class, rather than
on a different class.25 Note we are not
arguing usability should always overshadow other considerations when
designing an API; rather, API designers
should add usability as explicit design-
and-evaluation criteria so they do not
create an unusable API inadvertently,
and when they intentionally decrease
usability in favor of some other criteria,
at least to do it knowingly and provide
mitigations, including specific documentation and tool support.
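As a hypothetical illustration of the kind of design difference behind that result (sketched here in Python; the class and method names are invented, not taken from the study), the same capability can be exposed on the class a user starts from or on a separate helper class the user must first discover:

class EmailMessage:
    def __init__(self, to, subject, body):
        self.to, self.subject, self.body = to, subject, body

    def send(self):
        # Design A: the action lives on the object the user already has in hand.
        _deliver(self)

class Transport:
    # Design B: the same action lives on a separate class that the user must
    # discover, construct, and pass the message to.
    def send(self, message):
        _deliver(message)

def _deliver(message):
    print("sending %r to %s" % (message.subject, message.to))

# Design A: EmailMessage("a@example.com", "hi", "...").send()
# Design B: Transport().send(EmailMessage("a@example.com", "hi", "..."))

Both designs are functionally equivalent; the usability difference lies in where users expect to find send().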
Developers have been designing
APIs for decades, but without empirical research on API usability, many of
them have been difficult to use, and
some well-intentioned design recom-
mendations have turned out to be
wrong. There was scattered interest in
API usability in the late 1990s, with the
first significant research in the area appearing in the first decade of the 2000s,
especially from the Microsoft Visual
Studio usability group.4 This resulted
in a gathering of like-minded researchers who in 2009 created the API Usability website (http://www.apiusability.
org) that continues to be a repository
for API-usability information.
We want to make clear the various stakeholders affected by APIs.
The first is API designers, including
all the people involved in creating
the API, like API implementers and
API documentation writers. Some of
their goals are to maximize adoption
of an API, minimize support costs,
minimize development costs, and
release the API in a timely fashion.
Next is the API users, or the programmers who use APIs to help them write
their code. Their goals include being
able to quickly write error-free programs (without having to limit their
scope or features), use APIs many
other programmers use (so others
can test them, answer questions, and
post sample code using the APIs), not
needing to update their code due to
changes in APIs, and having their resulting applications run quickly and
efficiently. For public APIs, there may
be thousands of times as many API
users as there are API developers. Finally, there are the consumers of the
resulting products who may be indirectly affected by the quality of the
resulting code but who also might be
directly affected, as in, say, the case
of user-interface widgets, where API
choices affect the look and feel of the
resulting user interface. Consumers’
goals include having products with
the desired features, robustness, and
ease of use.
Motivating the Problem
One reason API design is such a challenge is there are many quality attributes on which APIs might be evaluated for the stakeholders (see Figure 1),
as well as trade-offs among them. At
the highest level, the two basic qualities of an API are usability and power.
Usability includes such attributes as
how easy an API is to learn, how productive programmers are using it, how
well it prevents errors, how simple it is
to use, how consistent it is, and how
well it matches its users’ mental models. Power includes an API’s expressiveness, or the kinds of abstractions
it provides; its extensibility (how users can extend it to create convenient
user-specific components); its “evolvability” for the designers who will
update it and create new versions; its
performance in terms of speed, memory, and other resource consumption;
and the robustness and security of its
implementation and resulting application. Usability mostly affects API
users, though error prevention also
affects consumers of the resulting
products. Power affects mostly API users and product consumers, though
evolvability also affects API designers
and, indirectly, API users to the extent
changes in the API require editing
the code of applications that use it.
Modern APIs for Web services seem
to involve such “breaking changes”
more than desktop APIs, as when, say,
migrating from v2 to v3 of the Google
Maps API required a complete rewrite
of the API users’ code. We have heard
anecdotal evidence that usability can
also affect API adoption; if an API
takes too long for a programmer to
learn, some organizations choose to
use a different API or write simpler
functionality from scratch.
Another reason for difficulty is the
design of an API requires making hundreds of design decisions at many different levels, all of which can affect
usability.24 Decisions range from the
global (such as the overall architecture
of the API, what design patterns will be
used, and how functionality will be presented and organized) down to the low
level (such as the specific name of each exported class, function, method, exception, and parameter). The enormous
size of public APIs contributes to these
difficulties; for example, the Java Platform, Standard Edition API Specification includes more than 4,000 classes
with more than 35,000 different methods, and Microsoft’s .NET Framework
includes more than 140,000 classes,
methods, properties, and fields.
Examples of Problems
All programmers are likely able to identify APIs they personally had difficulty
learning and using correctly due to us-
ability limitations.a We list several examples here to give an idea of the range
of problems. Other publications have
also surveyed the area.10,24
Studies of novice programmers
have identified selecting the right facilities to use, then understanding how to
coordinate multiple elements of APIs
as key barriers to learning.13 For example, in Visual Basic, learners wanted to
“pull” data from a dialogue box into a
window after “OK” was hit, but because
controls are inaccessible if their dialogue box is not visible in Visual Basic,
data must instead be “pushed” from
the dialogue to the window.
There are many examples of API
quirks affecting expert professional
programmers as well. For example, one
study11 detailed a number of functionality and usability problems with the
.NET socket Select() function in C#,
using it to motivate greater focus on the
usability of APIs in general. In another
study,21 API users reported difficulty
with SAP’s BRFplus API (a businessrules engine), and a redesign of the API
dramatically improved users’ success
and time to completion. A study of the
early version of SAP’s APIs for enterprise
Service-Oriented Architecture, or eSOA,1
identified problems with documentation, as well as additional weaknesses
with the API itself, including names that
were too long (see Figure 2), unclear
dependencies, difficulty coordinating
multiple objects, and poor error messages when API users made mistakes.
Severe problems with documentation
were also highlighted by a field study19
of 440 professional developers learning to use Microsoft’s APIs.
Many sources of API recommendations are available in print and online. Two of the most comprehensive
are books by Joshua Bloch (then at
Sun Microsystems)3 and by Krzysztof
Cwalina and Brad Abrams (then at Microsoft). Each offers guidelines devel-
oped over several years during creation
of such widespread APIs as the Java
Development Kit and the .NET base
libraries, respectively. However, we
have found some of these guidelines
to be contradicted by empirical evidence. For example, Bloch discussed
the many architectural advantages of
the factory pattern,9 where objects in
a class-instance object system cannot
Figure 1. API quality attributes and the stakeholders most affected by each quality.
[Diagram. Stakeholders: API Designers, API Users, Product Consumers. Usability attributes: Learnability, Simplicity, Productivity, Error-Prevention, Matching Mental Models, Consistency. Power attributes: Expressiveness, Extensibility, Evolvability, Performance, Robustness.]
a We are collecting a list of usability concerns
and problems with APIs; please send yours to
author Brad A. Myers; for a more complete list
of articles and resources on API usability, see
http://www.apiusability.org
Figure 2. Method names are so long users cannot tell which of the six methods to select in autocomplete;1 note the autocomplete menu
does not support horizontal scrolling nor does the yellow hover text for the selected item.
be created by calling new but must
instead be created using a separate
“factory” method or entirely different
factory class. Use of other patterns
(such as the singleton or flyweight
patterns)9 could also require factory
methods. However, empirical research has shown significant usability
penalties when using the factory pattern in APIs.6
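As a minimal hypothetical sketch (invented Widget classes, not taken from the cited studies) of what the two creation styles look like to an API user:

public class FactoryPatternSketch {
    // Constructor style: API users discover creation directly on the class via "new".
    static class Widget {
        private final String label;
        Widget(String label) { this.label = label; }
    }

    // Factory style: the constructor is hidden; users must know the static factory method exists.
    static class FactoryWidget {
        private final String label;
        private FactoryWidget(String label) { this.label = label; }
        static FactoryWidget create(String label) { return new FactoryWidget(label); }
    }

    public static void main(String[] args) {
        Widget w1 = new Widget("OK");                  // visible to code completion on "new Widget"
        FactoryWidget w2 = FactoryWidget.create("OK"); // requires finding create() first
    }
}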
There is also plenty of evidence
that less usable API designs affect
security. Increasing API usability often increases security. For example,
a study by Fahl et al.7 of 13,500 popular free Android apps found 8.0%
had misused the APIs for the Secure
Sockets Layer (SSL) or its successor,
the Transport Layer Security (TLS),
and were thus vulnerable to man-in-the-middle and other attacks; a
follow-on study of Apple iOS apps
found 9.7% to be vulnerable. Causes
include significant difficulties using
security APIs correctly, and Fahl et
al.7 recommended numerous changes that would increase the usability
and security of the APIs.
On the other hand, increased security in some cases seems to lower
usability of the API. For example,
Java security guidelines strongly encourage classes that are immutable,
meaning objects cannot be changed
after they are constructed.17 However, empirical research shows professionals trying to learn APIs prefer to
be able to create empty objects and
set their fields later, thus requiring
mutable classes.22 This programmer
preference illustrates that API design
involves trade-offs and how useful it
is to know what factors can influence
usability and security.
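A hypothetical sketch (invented Account classes, not from the cited studies) of the two styles in question:

public class MutabilitySketch {
    // Immutable style, as the security guidelines encourage: all fields are final and
    // must be supplied up front; the object cannot change afterward.
    static final class ImmutableAccount {
        private final String owner;
        private final String currency;
        ImmutableAccount(String owner, String currency) {
            this.owner = owner;
            this.currency = currency;
        }
    }

    // Mutable style, matching the "create empty, then set fields" preference observed
    // in the study: easier to start with, but the object can change later.
    static final class MutableAccount {
        private String owner;
        private String currency;
        void setOwner(String owner) { this.owner = owner; }
        void setCurrency(String currency) { this.currency = currency; }
    }

    public static void main(String[] args) {
        ImmutableAccount a = new ImmutableAccount("Ada", "EUR");
        MutableAccount b = new MutableAccount();
        b.setOwner("Ada");
        b.setCurrency("EUR");
    }
}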
Human-Centered Methods
If you are convinced API usability
should be improved, you might wonder
how it can be done. Fortunately, a variety of human-centered methods are
available to help answer the questions
an API designer might have.
Design phase. At the beginning of
the process, as an API is being planned,
many methods can help the API designer. The Natural Programming
Project at Carnegie Mellon University
has pioneered what we call the “natural programming” elicitation method,
where we try to understand how API users are thinking about functionality25
to determine what would be the most
natural way to provide it. The essence
of this approach is to describe the required functionality to the API users,
then ask them to write onto blank paper (or a blank screen) the design for
the API. The key goals are to understand the names API users assign to
the various entities and how users organize the functionality into different
classes, where necessary. Multiple researchers have reported trying to guess
the names of classes and methods is
the key way users search and browse
for the needed functionality,14 and we
have found surprising consistency in
how they name and organize the functionality among the classes.25 This elicitation technique also turns out to be
useful as part of a usability evaluation
of an existing API (described later), as
Code section 1. Two overloadings of the writeStartElement method in Java where
localName and namespaceURI are in the opposite order.
void writeStartElement(String namespaceURI,
                       String localName)

void writeStartElement(String prefix,
                       String localName,
                       String namespaceURI)
Code section 2. String parameters many API users are likely to get wrong.
void setShippingAddress(
    String firstName, String lastName, String street,
    String city, String state, String country,
    String zipCode, String email, String phone)
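A minimal, self-contained sketch (hypothetical code that merely echoes the signature above, not the actual Petstore class) shows why such a parameter list is error prone; the swapped city and state arguments compile without complaint.

public class ParameterOrderSketch {
    static void setShippingAddress(
            String firstName, String lastName, String street,
            String city, String state, String country,
            String zipCode, String email, String phone) {
        System.out.println(firstName + " " + lastName + ", " + city + ", " + state);
    }

    public static void main(String[] args) {
        // Compiles cleanly even though city and state are accidentally swapped.
        setShippingAddress("Ada", "Lovelace", "123 Main St",
                "PA", "Pittsburgh", "USA",
                "15213", "ada@example.com", "555-0100");
    }
}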
it helps explain the results by revealing
participants’ mental models.
Only a few empirical studies have
covered API design patterns, but they consistently show that simplifying the API and
avoiding patterns like the factory pattern will improve usability.6 Other recommendations on designs are based
on the opinions of experienced designers,3,5,11,17 though there are many
recommendations, and they are sometimes contradictory.
As described here, there is a wide
variety of evaluation methods for designs, but many of them can also be
used during the design phase as guidelines the API designer should keep in
mind. For example, one guideline that
appears in “cognitive dimensions”4
and in Nielsen’s “heuristic evaluation”16 is consistency, which applies
to many aspects of an API design.
One example of its application is that
the order of parameters should be
the same in every method. However,
javax.xml.stream.XMLStreamWriter
for Java 8 has different overloadings
for the writeStartElement method,
taking the String parameters localName and namespaceURI in the opposite order from each other,18 and, since
both are strings, the compiler is not able
to detect user errors (see code section 1).
Another Nielsen guideline is to reduce error proneness.16 It can apply
to avoiding long sequences of parameters of the same type the API user is
likely to get wrong and the compiler
will also not be able to check. For example, the class TPASupplierOrderXDE
in Petstore (J2EE demonstration software from Oracle) takes a sequence of
nine Strings (see code section 2).18
Likewise, in Microsoft’s .Net,
System.Net.Cookie has four constructors that take zero, two, three,
or four strings as input. Another application of this principle is to make
the default or example parameters
do the right thing. Fahl et al.7 reported that, by default, SSL certificate
validation is turned off when using
some iOS frameworks and libraries,
resulting in API users making the
error of leaving them unchecked in
deployed applications.
Evaluating the API Design
Following its design, a new API
should be evaluated to measure and
improve its usability, with a wide variety of user-centered methods available for the evaluation.
The easiest is to evaluate the design
based on a set of guidelines. Nielsen's
"heuristic evaluation" guidelines16
describe 10 properties an expert can
use to check any design (http://www.
nngroup.com/articles/ten-usability-heuristics/) that apply equally well to
APIs as to regular user interfaces. Here
are our mappings of the guidelines to
API designs with a general example of
how each can be applied.
Visibility of system status. It should
be easy for the API user to check the
state (such as whether a file is open
or not), and mismatches between the
state and operations should provide
appropriate feedback (such as writing
to a closed file should result in a helpful error message);
Match between system and real world.
Names given to methods and the organization of methods into classes
should match the API users’ expectations. For example, the most generic
and well-known name should be used
for the class programmers are supposed to actually use, but this is violated by Java in many places. There is
a class in Java called File, but it is a
high-level abstract class to represent
file system paths, and API users must
use a completely different class (such
as FileOutputStream) for reading
and writing (see the sketch following these guidelines);
User control and freedom. API users
should be able to abort or reset operations and easily get the API back to a
normal state;
Consistency and standards. All parts
of the design should be consistent
throughout the API, as discussed earlier;
Error prevention. The API should
guide the user into using the API correctly, including having defaults that
do the right thing;
Recognition rather than recall.
As discussed in the following paragraphs, a favorite tool of API users to
explore an API is the autocomplete
popup from the integrated development environment (IDE), so one
requirement is to make the names
clear and understandable, enabling
users to recognize which element
they want. One noteworthy violation
of this principle was an API where six
names all looked identical in auto-
complete because the names were so
long the differences were off screen,1
as in Figure 2. We also found these
names were indistinguishable when
users were trying to read and understand existing code, leading to much
confusion and errors;1
Flexibility and efficiency of use. Users should be able to accomplish their
tasks with the API efficiently;
Aesthetic and minimalist design. It
might seem obvious that a smaller
and less-complex API is likely to be
more usable. One empirical study20
found that for classes, the number
of other classes in the same package/
namespace had an influence on the
success of finding the desired one.
However, we found no correlation
between the number of elements in
an API and its usability, as long as
they had appropriate names and were
well organized.25 For example, adding
more different kinds of objects that
can be drawn does not necessarily
complicate a graphics package, and
adding convenience constructors that
take different sets of parameters can
improve usability.20 An important factor seems to be having distinct prefixes for the different method names so
they are easily differentiated by typing
a small number of characters for code
completion in the editor;20
Help users recognize, diagnose, and
recover from errors. A surprising number of APIs supply unhelpful error information or even none at all when
something goes wrong, thus decreasing usability and also possibly affecting correctness and security. Many
approaches are available for reporting
errors, with little empirical evidence
(but lots of opinions) about which is
more usable—a topic for our group’s
current work; and
Help and documentation. A key complaint about API usability is inadequate
documentation.19
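The File example under "Match between system and real world" can be made concrete with a short runnable sketch using the standard Java classes mentioned there (the file name is arbitrary):

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class FileNamingSketch {
    public static void main(String[] args) throws IOException {
        File path = new File("notes.txt");   // a File represents a path, not an open file
        // path.write(...) does not exist; writing requires a different class entirely:
        try (FileOutputStream out = new FileOutputStream(path)) {
            out.write("hello".getBytes());
        }
    }
}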
Likewise, the Cognitive Dimensions Framework provides a set of
guidelines that can be used to evaluate APIs.4 A related method is Cognitive Walkthrough2 whereby an expert
evaluates how well a user interface
supports one or more specific tasks.
We used both Heuristic Evaluation
and Cognitive Walkthrough to help
improve the NetWeaver Gateway product from SAP, Inc. Because the SAP
developers who built this tool were
using agile software-development
processes, they were able to quickly
improve the tool’s usability based on
our evaluations.8
Although a user-interface expert
usually applies these guidelines to
evaluate an API, some tools automate
API evaluations using guidelines; for
example, one tool can evaluate APIs
against a set of nine metrics, including looking for methods that are
overloaded but with different return
types, too many parameters in a row
with the same types, and consistency
of parameter orderings across different methods.18 Likewise, the API Concepts Framework takes the context of
use into account, as it evaluates both
the API and samples of code using
the API.20 It can measure a variety of
metrics already mentioned, including
whether multiple methods have the
same prefix (and thus may be annoying to use in code-completion menus)
and use the factory pattern.
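As a toy illustration of this style of automated structural check (not the tools cited above), a few lines of Java reflection can flag methods with long runs of identically typed parameters:

import java.lang.reflect.Method;
import java.util.Arrays;

public class SameTypeRunCheck {
    public static void main(String[] args) {
        for (Method m : java.awt.Graphics.class.getMethods()) {
            Class<?>[] params = m.getParameterTypes();
            int run = 1, maxRun = 1;
            for (int i = 1; i < params.length; i++) {
                run = params[i].equals(params[i - 1]) ? run + 1 : 1;
                maxRun = Math.max(maxRun, run);
            }
            if (maxRun >= 4)   // flag long runs of same-typed parameters (threshold is arbitrary)
                System.out.println(m.getName() + Arrays.toString(params));
        }
    }
}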
Among HCI practitioners, running
user studies to test a user interface
with target users is considered the
“gold standard.”16 Such user tests can
be done with APIs as well. In a think-aloud usability evaluation, target users (here, API users) attempt some
tasks (either their own or experimenter-provided) with the API typically in
a lab setting and are encouraged to
say aloud what they are thinking. This
makes clear what they are looking for
or trying to achieve and, in general,
why they are making certain choices.
A researcher might be interested in a
more formal A/B test, comparing, say,
an old vs. new version of an API (as we
previously have done6,21,25), but the insights about usability barriers are usually sufficient when they emerge from
an informal think-aloud evaluation.
Grill et al.10 described a method
where they had experts use Nielsen’s
Heuristic Evaluation to identify problems with an API and observed developers learning to use the same API in
the lab. An interesting finding was
these two methods revealed mostly
independent sets of problems with
that API.
Mitigations
When any of these methods reveals
a usability problem with an API, an
ideal mitigation would be to change
the API to fix the problem. However,
actually changing an API may not be
possible for a number of reasons. For
example, legacy APIs can be changed
only rarely since it would involve also
changing all the code that uses the
APIs. Even with new APIs, an API designer could make an explicit trade-off to decrease usability in favor of
other goals, like efficiency. For example, a factory pattern might be used in
a performance-critical API to avoid allocating any memory at all.
When a usability problem cannot be removed from the API itself,
many mitigations can be applied to
help its users. The most obvious is to
improve the documentation and example code, which are the subjects
of frequent complaints from API users in general.19 API designers can
be careful to explicitly direct users
to the solutions to the known problems. For example, the Jadeite tool
adds cross-references to the documentation for methods users expect
to exist but which are actually in a different class.23 For instance, the Java
Message class does not have a send
method, so Jadeite adds a pretend
send method to the documentation
for the Message class, telling users
to look in the mail Transport class
instead. Knowing users are confused
by the lack of this method in the Message class allows API documentation
to add help exactly where it is needed.
Tools
This kind of help can be provided even
in programming tools (such as the code
editor or IDE), not just in the documentation. Calcite15 adds extra entries into
the autocomplete menus of the Eclipse
IDE to help API users discover what additional methods will be useful in the
current context, even if they are not
part of the current class. It also highlights when the factory pattern must be
used to create objects.
Many other tools can also help
with API usability. For example,
some tools that help refactor the API
users’ code may lower the barrier for
changing an API (such as Gofix for
the Go language, http://blog.golang.
org/introducing-gofix). Other tools
help find the right elements to use
in APIs, "wizards" produce part of the needed code based on API users' answers to questions,8 and many kinds of bug checkers check for proper API use (such as http://findbugs.sourceforge.net/).
Conclusion
Since our Natural Programming group
began researching API usability in the
early 2000s, some significant shifts
have occurred in the software industry. One of the biggest is the move
toward agile software development,
whereby a minimum viable product
is quickly released and then iterated
upon based on real-world user feedback. Though it has had a positive
effect on usability overall in driving
user-centric development, it exposes
some of the unique challenges of API
design. APIs specify not just the interfaces for programmers to understand
and write code against but also for
computers to execute, making them
brittle and difficult to change. While
human users are nimble responding
to the small, gradual changes in user
interface design that result from an
agile process, code is not. This aversion to change raises the stakes for
getting the design right in the first
place. API users behave just like other
users almost universally, but the constraints created by needing to avoid
breaking existing code make the evolution, versioning, and initial release
process considerably different from
other design tasks. It is not clear how
the “fail fast, fail often” style of agile
development popular today can be
adapted to the creation and evolution of APIs, where the cost of releasing and supporting imperfect APIs or
making breaking changes to an existing API—either by supporting multiple versions or by removing support
for old versions—is very high.
We envision a future where API designers will always include usability as
a key quality metric to be optimized by
all APIs and where releasing APIs that
have not been evaluated for usability
will be as unacceptable as not evaluating APIs for correctness or robustness. When designers decide usability
must be compromised in favor of other
goals, this decision will be made knowingly, and appropriate mitigations will
be put in place. Researchers and API
designers will contribute to a body of
knowledge and set of methods and
tools that can be used to evaluate and
improve API usability. The result will
be APIs that are easier to learn and use
correctly, API users who are more effective and efficient, and resulting products that are more robust and secure
for consumers.
Acknowledgments
This article follows from more than a
decade of work on API usability by the
Natural Programming group at Carnegie Mellon University by more than
30 students, staff, and postdocs, in
addition to the authors, and we thank
them all for their contributions. We
also thank André Santos, Jack Beaton,
Michael Coblenz, John Daughtry, Josh
Sunshine, and the reviewers for their
comments on earlier drafts of this article. This work has been funded by
SAP, Adobe, IBM, Microsoft, and multiple National Science Foundation
grants, including CNS-1423054, IIS-1314356, IIS-1116724, IIS-0329090, CCF-0811610, IIS-0757511, and CCR-0324770. Any opinions, findings, and
conclusions or recommendations expressed in this material are those of the
authors and do not necessarily reflect
those of any of the sponsors.

References
1. Beaton, J., Jeong, S.Y., Xie, Y., Stylos, J., and Myers,
B.A. Usability challenges for enterprise service-oriented architecture APIs. In Proceedings of the
IEEE Symposium on Visual Languages and Human-Centric Computing (Herrsching am Ammersee,
Germany, Sept. 15–18). IEEE Computer Society Press,
Washington, D.C., 2008, 193–196.
2. Blackmon, M.H., Polson, P.G., Kitajima, M., and Lewis,
C. Cognitive walkthrough for the Web. In Proceedings
of the Conference on Human Factors in Computing
Systems (Minneapolis, MN, Apr. 20–25). ACM Press,
New York, 2002, 463–470.
3. Bloch, J. Effective Java Programming Language
Guide. Addison-Wesley, Boston, MA, 2001.
4. Clarke, S. API Usability and the Cognitive Dimensions
Framework, 2003; http://blogs.msdn.com/stevencl/
archive/2003/10/08/57040.aspx
5. Cwalina, K. and Abrams, B. Framework Design
Guidelines, Conventions, Idioms, and Patterns for
Reusable .NET Libraries. Addison-Wesley, Upper Saddle River, NJ, 2006.
6. Ellis, B., Stylos, J., and Myers, B.A. The factory pattern
in API design: A usability evaluation. In Proceedings of
the International Conference on Software Engineering
(Minneapolis, MN, May 20–26). IEEE Computer Society
Press, Washington, D.C., 2007, 302–312.
7. Fahl, S., Harbach, M., Perl, H., Koetter, M., and Smith,
M. Rethinking SSL development in an appified world.
In Proceedings of the ACM SIGSAC Conference on
Computer and Communications Security (Berlin,
Germany, Nov. 4–8). ACM Press, New York, 2013,
49–60.
8. Faulring, A., Myers, B.A., Oren, Y., and Rotenberg, K.
A case study of using HCI methods to improve tools
for programmers. In Proceedings of Workshop on
Cooperative and Human Aspects of Software Engineering
at the International Conference on Software Engineering
(Zürich, Switzerland, June 2). IEEE Computer Society
Press, Washington, D.C., 2012, 37–39.
9. Gamma, E., Helm, R., Johnson, R., and Vlissides, J.
Design Patterns. Addison-Wesley, Reading, MA, 1995.
10. Grill, T., Polacek, O., and Tscheligi, M. Methods
towards API usability: A structural analysis of
usability problem categories. In Proceedings of
the Fourth International Conference on Human-Centered Software Engineering, M. Winckler et al.,
Eds. (Toulouse, France, Oct. 29–31). Springer, Berlin,
Germany, 2012, 164–180.
11. Henning, M. API design matters. ACM Queue 5, 4
(May–June, 2007), 24–36.
12. Kirschner, B. The Perceived Relevance of APIs. Apigee
Corporation, San Jose, CA, 2015; http://apigee.com/
about/api-best-practices/perceived-relevance-apis
13. Ko, A.J., Myers, B.A., and Aung, H.H. Six learning
barriers in end-user programming systems. In
Proceedings of the IEEE Symposium on Visual
Languages and Human-Centric Computing (Rome,
Italy, Sept. 26–29). IEEE Computer Society Press,
Washington, D.C., 2004, 199–206.
14. Ko, A.J., Myers, B.A., Coblenz, M., and Aung, H.H. An
exploratory study of how developers seek, relate,
and collect relevant information during software
maintenance tasks. IEEE Transactions on Software
Engineering 33, 12 (Dec. 2006), 971–987.
15. Mooty, M., Faulring, A., Stylos, J., and Myers, B.A.
Calcite: Completing code completion for constructors
using crowds. In Proceedings of the IEEE Symposium
on Visual Languages and Human-Centric Computing
(Leganés-Madrid, Spain, Sept. 21–25). IEEE Computer
Society Press, Washington, D.C., 2010, 15–22.
16. Nielsen, J. Usability Engineering. Academic Press,
Boston, MA, 1993.
17. Oracle Corp. Secure Coding Guidelines for the
Java Programming Language, Version 4.0,
2014; http://www.oracle.com/technetwork/java/
seccodeguide-139067.html
18. Rama, G.M. and Kak, A. Some structural measures of
API usability. Software: Practice and Experience 45, 1
(Jan. 2013), 75–110; https://engineering.purdue.edu/
RVL/Publications/RamaKakAPIQ_SPE.pdf
19. Robillard, M. and DeLine, R. A field study of API
learning obstacles. Empirical Software Engineering 16,
6 (Dec. 2011), 703–732.
20. Scheller, T. and Kuhn, E. Automated measurement
of API usability: The API concepts framework.
Information and Software Technology 61 (May 2015),
145–162.
21. Stylos, J., Busse, D.K., Graf, B., Ziegler, C., Ehret,
R., and Karstens, J. A case study of API design
for improved usability. In Proceedings of the IEEE
Symposium on Visual Languages and Human-Centric
Computing (Herrsching am Ammersee, Germany,
Sept. 20–24). IEEE Computer Society Press,
Washington, D.C., 2008, 189–192.
22. Stylos, J. and Clarke, S. Usability implications of
requiring parameters in objects’ constructors. In
Proceedings of the International Conference on
Software Engineering (Minneapolis, MN, May 20–26).
IEEE Computer Society Press, Washington, D.C., 2007,
529–539.
23. Stylos, J., Faulring, A., Yang, Z., and Myers, B.A.
Improving API documentation using API usage
information. In Proceedings of the IEEE Symposium
on Visual Languages and Human-Centric Computing
(Corvallis, OR, Sept. 20–24). IEEE Computer Society
Press, Washington, D.C., 2009, 119–126.
24. Stylos, J. and Myers, B.A. Mapping the space of
API design decisions. In Proceedings of the IEEE
Symposium on Visual Languages and Human-Centric
Computing (Coeur d’Alene, ID, Sept 23–27). IEEE
Computer Society Press, Washington, D.C., 2007, 50–57.
25. Stylos, J. and Myers, B.A. The implications of method
placement on API learnability. In Proceedings of the
16th ACM SIGSOFT Symposium on Foundations of
Software Engineering (Atlanta, GA, Sept. 23–27). ACM
Press, New York, 2008, 105–112.
Brad A. Myers ([email protected]) is a professor in the
Human-Computer Interaction Institute in the School
of Computer Science at Carnegie Mellon University,
Pittsburgh, PA.
Jeffrey Stylos ([email protected]) is a software
engineer at IBM in Littleton, MA, and received his Ph.D.
in computer science at Carnegie Mellon University,
Pittsburgh, PA, while doing research reported in this
article.
© 2016 ACM 0001-0782/16/06 $15.00
DOI:10.1145/2851486
Computers broadcast their secrets via
inadvertent physical emanations that
are easily measured and exploited.
BY DANIEL GENKIN, LEV PACHMANOV, ITAMAR PIPMAN,
ADI SHAMIR, AND ERAN TROMER
Physical
Key Extraction
Attacks on PCs
Cryptography is ubiquitous. Secure websites and
financial, personal communication, corporate, and
national secrets all depend on cryptographic algorithms
operating correctly. Builders of cryptographic systems
have learned (often the hard way) to devise algorithms
and protocols with sound theoretical analysis,
write software that implements them correctly,
and robustly integrate them with the surrounding
applications. Consequently, direct attacks against
state-of-the-art cryptographic software are getting
increasingly difficult.
For attackers, ramming the gates of cryptography is
not the only option. They can instead undermine the
fortification by violating basic assumptions made by
the cryptographic software. One such assumption is
software can control its outputs. Our programming
courses explain that programs produce their outputs
through designated interfaces (whether print, write,
send, or mmap); so, to keep a secret, the software just
needs to never output it or anything that
may reveal it. (The operating system
may be misused to allow someone else’s
process to peek into the program’s
memory or files, though we are getting
better at avoiding such attacks, too.)
Yet programs’ control over their
own outputs is a convenient fiction,
for a deeper reason. The hardware running the program is a physical object
and, as such, interacts with its environment in complex ways, including
electric currents, electromagnetic
fields, sound, vibrations, and light
emissions. All these “side channels”
may depend on the computation performed, along with the secrets within
it. “Side-channel attacks,” which exploit such information leakage, have
been used to break the security of numerous cryptographic implementations; see Anderson,2 Kocher et al.,19 and
Mangard et al.23 and references therein.
Side channels on small devices.
Many past works addressed leakage
from small devices (such as smartcards, RFID tags, FPGAs, and simple
embedded devices); for such devices,
physical key extraction attacks have
been demonstrated with devastating
effectiveness and across multiple physical channels. For example, a device’s
power consumption is often correlated
with the computation it is currently executing. Over the past two decades, this
physical phenomenon has been used
extensively for key extraction from
small devices,19,23 often using powerful techniques, including differential
power analysis.18
key insights

- Small differences in a program's data can cause large differences in acoustic, electric, and electromagnetic emanations as the program runs.
- These emanations can be measured through inexpensive equipment and used to extract secret data, even from fast and complex devices like laptop computers and mobile phones.
- Common hardware and software are vulnerable, and practical mitigation of these risks requires careful application-specific engineering and evaluation.
The electromagnetic emanations
from a device are likewise affected by the
computation-correlated currents inside
it. Starting with Agrawal et al.,1 Gandolfi
et al.,11 and Quisquater and Samyde,28
such attacks have been demonstrated
on numerous small devices involving
various cryptographic implementations.
Optical and thermal imaging of circuits provides layout information and
coarse activity maps that are useful for
reverse engineering. Miniature probes
can be used to access individual internal wires in a chip, though such techniques require invasive disassembly
of the chip package, as well as considerable technical expertise. Optical
emanations from transistors, as they
switch state, are exploitable as a side
channel for reading internal registers and extracting keys.29
See Anderson2 for an extensive survey of such attacks.
Vulnerability of PCs. Little was
known, however, about the possibility
of cryptographic attacks through physical side channels on modern commodity laptop, desktop, and server computers. Such “PC-class” computers (or
“PCs,” as we call them here) are indeed
very different from the aforementioned
small devices, for several reasons.
First, a PC is a very complex environment—a CPU with perhaps one billion
transistors, on a motherboard with other
circuitry and peripherals, running an
operating system and handling various
asynchronous events. All these introduce complexity, unpredictability, and
noise into the physical emanations as
the cryptographic code executes.
Second is speed. Typical side-channel techniques require the analog leakage signal be acquired at a bandwidth
greater than the target’s clock rate.
For PCs running GHz-scale CPUs, this
means recording analog signals at
multi-GHz bandwidths requiring expensive and delicate lab equipment, in
addition to a lot of storage space and
processing power.
Figure 1. An acoustic attack using a parabolic microphone (left) on a target laptop (right);
keys can be extracted from a distance of 10 meters.
Figure 2. Measuring the chassis potential by touching a conductive part of the laptop;
the wristband is connected to signal-acquisition equipment.
A third difference involves attack
scenarios. Traditional techniques for
side-channel attacks require long, uninterrupted physical access to the target
device. Moreover, some such attacks
involve destructive mechanical intrusion into the device (such as decapsulating chips). For small devices, these
scenarios make sense; such devices
are often easily stolen and sometimes
even handed out to the attacker (such
as in the form of cable TV subscription
cards). However, when attacking other
people’s PCs, the attacker’s physical
access is often brief, constrained, and
must proceed unobserved.
Note numerous side channels in
PCs are known at the software level;
timing,8 cache contention,6,26,27 and
many other effects can be used to
glean sensitive information across the
boundaries between processes or even
virtual machines. Here, we focus on
physical attacks that do not require deployment of malicious software on the
target PC.
Our research thus focuses on two
main questions: Can physical side-channel attacks be used to nonintrusively extract secret keys from PCs,
despite their complexity and operating
speed? And what is the cost of such attacks in time, equipment, expertise,
and physical access?
Results. We have identified multiple
side channels for mounting physical
key-extraction attacks on PCs, applicable in various scenarios and offering
various trade-offs among attack range,
speed, and equipment cost. The following sections explore our findings, as published in several recent articles.12,15,16
Acoustic. The power consumption of
a CPU and related chips changes drastically (by many Watts) depending on
the computation being performed
at each moment. Electronic components in a PC’s internal power supply,
struggling to provide constant voltage
to the chips, are subject to mechanical forces due to fluctuations of voltages and currents. The resulting vibrations, as transmitted to the ambient
air, create high-pitched acoustic noise,
known as “coil whine,” even though it often originates from capacitors. Because
this noise is correlated with the ongoing computation, it leaks information
about what applications are running and
what data they process. Most dramatically, it can acoustically leak secret keys
during cryptographic operations.
By recording such noise while a
target is using the RSA algorithm to
decrypt ciphertexts (sent to it by the
attacker), the RSA secret key can be extracted within one hour for a high-grade
4,096-bit RSA key. We experimentally
demonstrated this attack from as far as
10 meters away using a parabolic microphone (see Figure 1) or from 30cm away
through a plain mobile phone placed
next to the computer.
Electric. While PCs are typically
grounded to the mains earth (through
their power supply “brick,” or grounded peripherals), these connections
are, in practice, not ideal, so the electric potential of the laptop’s chassis
fluctuates. These fluctuations depend
on internal currents, and thus on the
ongoing computation. An attacker
can measure the fluctuations directly
through a plain wire connected to a
conductive part of the laptop, or indirectly through any cable with a conductive shield attached to an I/O port
on the laptop (such as USB, Ethernet,
display, or audio). Perhaps most surprising, the chassis potential can be
measured, with sufficient fidelity,
even through a human body; human
attackers need to touch only the target computer with a bare hand while
their body potential is measured (see
Figure 2).
This channel offers a higher bandwidth than the acoustic one, allowing
observation of the effect of individual
key bits on the computation. RSA and
ElGamal keys can thus be extracted
from a signal obtained from just a few
seconds of measurement, by touching
a conductive part of the laptop's chassis, or by measuring the chassis potential from the far side of a 10-meter-long cable connected to the target's
I/O port.
Electromagnetic. The computation
performed by a PC also affects the electromagnetic field it radiates. By monitoring the computation-dependent
electromagnetic fluctuations through
an antenna for just a few seconds,
it is possible to extract RSA and ElGamal secret keys. For this channel,
the measurement setup is notably
unintrusive and simple. A suitable
electromagnetic probe antenna can
be made from a simple loop of wire
and recorded through an inexpensive
software-defined radio USB dongle. Alternatively, an attacker can sometimes
use a plain consumer-grade AM radio
receiver, tuned close to the target’s signal frequency, with its headphone output connected to a phone’s audio jack
for digital recording (see Figure 3).
Applicability. A surprising result of
our research is how practical and easy physical key-extraction side-channel attacks on PC-class devices are, despite
the devices’ apparent complexity and
high speed. Moreover, unlike previous
attacks, our attacks require very little
analog bandwidth, as low as 50kHz,
even when attacking multi-GHz CPUs,
thus allowing us to utilize new channels, as well as inexpensive and readily
available hardware.
We have demonstrated the feasibility of our attacks using GnuPG
(also known as GPG), a popular open
source cryptographic software that
implements both RSA and ElGamal.
Our attacks are effective against various versions of GnuPG that use different implementations of the targeted
cryptographic algorithm. We tested
various laptop computers of different
models from different manufacturers
and running various operating systems, all “as is,” with no modification
or case intrusions.
History. Physical side-channel attacks have been studied for decades in
military and espionage contexts in the
U.S. and NATO under the codename
TEMPEST. Most of this work remains
classified. What little is declassified
confirms the existence and risk of
physical information leakage but says
nothing about the feasibility of the key
extraction scenarios discussed in this
article. Acoustic leakage, in particular,
has been used against electromechanical ciphers (Wright31 recounts how
the British security agencies tapped a
phone to eavesdrop on the rotors of a
Hagelin electromechanical cipher machine); but there is strong evidence it
was not recognized by the security services as effective against modern electronic computers.16
Non-Cryptographic Leakage
Peripheral devices attached to PCs are
prone to side-channel leakage due to
their physical nature and lower operating speed; for example, acoustic noise
from keyboards can reveal keystrokes,3 printer noise can reveal printed content,4 and status LEDs can reveal data on a communication line.22 Computer screens inadvertently
broadcast their content as “van Eck”
electromagnetic radiation that can be
picked up from a distance;21,30 see Anderson2 for a survey.
Some observations have also been
made about physical leakage from
PCs, though at a coarse level. The general activity level is easily gleaned from
temperature,7 fan speed, and mechanical hard-disk movement. By tapping
the computer’s electric AC power, it
is possible to identify the webpages
Figure 3. An electromagnetic attack using a consumer AM radio receiver placed near the
target and recorded by a smartphone.
Figure 4. A spectrogram of an acoustic signal. The vertical axis is time (3.7 seconds), and
the horizontal axis is frequency (0kHz–310kHz). Intensity represents instantaneous energy
in the frequency band. The target is performing one-second loops of several x86 instructions: CPU sleep (HLT), integer multiplication (MUL), floating-point multiplication (FMUL),
main memory access, and short-term idle (REP NOP).
[Spectrogram image with labeled segments HLT, MUL, FMUL, ADD, MEM, and NOP.]
loaded by the target’s browser9 and
even some malware.10 Tapping USB
power lines makes it possible to identify when cryptographic applications
are running.25
The acoustic, electric, and electromagnetic channels can also be used to
gather coarse information about a target’s computations; Figure 4 shows a
microphone recording of a PC, demonstrating loops of different operations
have distinct acoustic signatures.
Cryptanalytic Approach
Coarse leakage is ubiquitous and easily demonstrated once the existence
of the physical channel is recognized.
However, there remains the question
of whether the physical channels can
be used to steal finer and more devastating information. The crown jewels,
in this respect, are cryptographic keys,
for three reasons. First, direct impact,
as compromising cryptographic keys
endangers all data and authorizations
that depend on them. Second, difficulty, as cryptographic keys tend to be well
protected and used in carefully crafted
algorithms designed to resist attacks;
so if even these keys can be extracted,
it is a strong indication more pedestrian data can be also extracted. And
third, commonality, as there is only a
small number of popular cryptograph-
ic algorithms and implementations,
so compromising any of them has a direct effect on many deployed systems.
Consequently, our research focused on
key extraction from the most common
public-key encryption schemes—RSA
and ElGamal—as implemented by the
popular GnuPG software.
When analyzing implementations
of public-key cryptographic algorithms, an attacker faces the difficulties described earlier of complexity,
noise, speed, and nonintrusiveness.
Moreover, engineers implementing
cryptographic algorithms try to make
the sequence of executed operations
very regular and similar for all secret
keys. This is done to foil past attacks
that exploit significant changes in control flow to deduce secrets, including
timing attacks,8 cache contention attacks6,26,27 (such as a recent application
to GnuPG32,33), and many other types of
attacks on small devices.
We now show how to overcome these
difficulties, using a careful selection of
the ciphertext to be decrypted by the
algorithm. By combining the following
two techniques for ciphertext selection,
we obtain a key-dependent leakage that
is robustly observable, even through
low-bandwidth measurements.
Internal value poisoning. While the
sequence of performed operations
Algorithm 1. Modular exponentiation using square-and-always-multiply.

Input: Three integers c, d, q in binary representation such that d = d_1 ... d_m.
Output: a = c^d mod q.
1: procedure MOD_EXP(c, d, q)
2:     c ← c mod q
3:     a ← 1
4:     for i ← 1 to m do
5:         a ← a^2 mod q
6:         t ← a · c mod q
7:         if d_i = 1 then
8:             a ← t
9:     return a
Algorithm 2. GnuPG's basic multiplication code.

Input: Two integers a = a_s ... a_1 and b = b_t ... b_1 of size s and t limbs, respectively.
Output: a · b.
1: procedure MUL_BASECASE(a, b)
2:     p ← a · b_1
3:     for i ← 2 to t do
4:         if b_i ≠ 0 then    (and if b_i = 0 do nothing)
5:             p ← p + a · b_i · 2^(32·(i−1))
6:     return p
is often decoupled from the secret
key, the operands to these operations
are often key-dependent. Moreover,
operand values with atypical properties (such as operands containing
many zero bits or that are unusually
short) may trigger implementation-dependent corner cases. We thus craft
special inputs (ciphertexts to be decrypted) that “poison” internal values
occurring inside the cryptographic
algorithm, so atypically structured operands occur at key-dependent times.
Measuring leakage during such a poisoned execution can reveal at which
operations these operands occurred,
and thus leak key information.
Leakage self-amplification. In order
to overcome a device’s complexity
and execution speed, an attacker can
exploit the algorithm’s own code to
amplify its own leakage. By asking for
decryption of a carefully chosen ciphertext, we create a minute change (compared to the decryption of a random-looking ciphertext) during execution
of the innermost loop of the attacked
algorithm. Since the code inside the
innermost loop is executed many
times throughout the algorithm, this
yields an easily observable global
change affecting the algorithm’s
entire execution.
GnuPG's RSA Implementation
For concreteness in describing our basic attack method, we outline GnuPG’s
implementation of RSA decryption,
as of version 1.4.14 from 2013. Later
GnuPG versions revised their implementations to defend against the adaptive attack described here; we discuss
these variants and corresponding attacks later in the article.
Notation. RSA key generation is done by choosing two large primes p, q, a public exponent e, and a secret exponent d such that ed ≡ 1 (mod Φ(n)), where n = pq and Φ(n) = (p − 1)(q − 1). The public key is (n, e) and the private key is (p, q, d). RSA encryption of a message m is done by computing m^e mod n, and RSA decryption of a ciphertext c is done by computing c^d mod n. GnuPG uses a common optimization for RSA decryption; instead of directly computing m = c^d mod n, it first computes m_p = c^(d_p) mod p and m_q = c^(d_q) mod q (where d_p and d_q are derived from the secret key), then combines m_p and m_q
into m using the Chinese Remainder
Theorem. To fully recover the secret
key, it suffices to learn any of its components (p, q, d, d_p, or d_q); the rest can
be deduced.
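The CRT shortcut can be sketched in a few lines of Java BigInteger code; this is an illustrative toy (small parameters, and e = 65537 is assumed coprime to Φ(n), which holds with overwhelming probability for random primes), not GnuPG's implementation.

import java.math.BigInteger;
import java.security.SecureRandom;

public class RsaCrtSketch {
    public static void main(String[] args) {
        SecureRandom rnd = new SecureRandom();
        BigInteger p = BigInteger.probablePrime(512, rnd);
        BigInteger q = BigInteger.probablePrime(512, rnd);
        BigInteger n = p.multiply(q);
        BigInteger phi = p.subtract(BigInteger.ONE).multiply(q.subtract(BigInteger.ONE));
        BigInteger e = BigInteger.valueOf(65537);
        BigInteger d = e.modInverse(phi);
        BigInteger dp = d.mod(p.subtract(BigInteger.ONE));   // d_p
        BigInteger dq = d.mod(q.subtract(BigInteger.ONE));   // d_q
        BigInteger qInv = q.modInverse(p);                   // q^(-1) mod p

        BigInteger m = new BigInteger(1000, rnd).mod(n);     // some message
        BigInteger c = m.modPow(e, n);                       // encryption: c = m^e mod n

        BigInteger mp = c.modPow(dp, p);                     // m_p = c^(d_p) mod p
        BigInteger mq = c.modPow(dq, q);                     // m_q = c^(d_q) mod q
        BigInteger h = qInv.multiply(mp.subtract(mq)).mod(p);
        BigInteger recovered = mq.add(h.multiply(q));        // Chinese Remainder recombination

        System.out.println(recovered.equals(m));             // prints true
    }
}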
Square-and-always-multiply
exponentiation. Algorithm 1 is pseudocode
of the square-and-always-multiply exponentiation used by GnuPG 1.4.14
to compute m_p and m_q. As a countermeasure to the attack of Yarom and
Falkner,32 the sequence of squarings
and multiplications performed by
Algorithm 1 is independent of the
secret key. Note the modular reduction in line 2 and the multiplication
in line 6. Both these lines are used by
our attack on RSA—line 2 for poisoning internal values and line 6 for leakage
self-amplification.
Since our attack uses GnuPG’s multiplication routine for leakage self-amplification, we now analyze the code of
GnuPG’s multiplication routines.
Multiplication. For multiplying large
integers (line 6), GnuPG uses a variant
of the Karatsuba multiplication algorithm. It computes the product of two 2k-bit numbers a and b recursively, using the identity ab = (2^(2k) + 2^k)·a_H·b_H + 2^k·(a_H − a_L)·(b_L − b_H) + (2^k + 1)·a_L·b_L, where a_H, b_H are the most significant halves of a and b, respectively, and, similarly, a_L, b_L are the least significant halves of a and b.
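A quick numeric check of this identity (a sketch, unrelated to GnuPG's actual code) can be written with BigInteger:

import java.math.BigInteger;
import java.util.Random;

public class KaratsubaIdentityCheck {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        int k = 64;                                           // half-size in bits
        BigInteger a = new BigInteger(2 * k, rnd), b = new BigInteger(2 * k, rnd);
        BigInteger aH = a.shiftRight(k), aL = a.subtract(aH.shiftLeft(k));
        BigInteger bH = b.shiftRight(k), bL = b.subtract(bH.shiftLeft(k));
        BigInteger twoK = BigInteger.ONE.shiftLeft(k), two2K = BigInteger.ONE.shiftLeft(2 * k);

        // (2^(2k)+2^k)*aH*bH + 2^k*(aH-aL)*(bL-bH) + (2^k+1)*aL*bL
        BigInteger rhs = two2K.add(twoK).multiply(aH.multiply(bH))
                .add(twoK.multiply(aH.subtract(aL).multiply(bL.subtract(bH))))
                .add(twoK.add(BigInteger.ONE).multiply(aL.multiply(bL)));
        System.out.println(rhs.equals(a.multiply(b)));        // prints true
    }
}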
The recursion’s base case is a
simple grade-school “long multiplication” algorithm, shown (in simplified form) in Algorithm 2. GnuPG
stores large integers in arrays of 32-bit
words, called limbs. Note how Algorithm 2 handles the case of zero limbs
of b. Whenever a zero limb of b is encountered, the operation in line 5 is
not executed, and the loop in line 3
proceeds to handle the next limb of
b. This optimization is exploited by
the leakage self-amplification component of our attack. Specifically, each
of our chosen ciphertexts will cause a
targeted bit of q to affect the number
of zero limbs of b given to Algorithm 2
and thus the control flow in line 4 and
thereby the side-channel leakage.
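To make the skipped branch concrete, here is a simplified Java rendering of such a base-case limb multiplication; it is an illustrative sketch with the same zero-limb shortcut, not GnuPG's actual C routine.

public class MulBasecaseSketch {
    // Schoolbook multiplication over little-endian arrays of 32-bit limbs, with the same
    // data-dependent shortcut as line 4 of Algorithm 2: zero limbs of b are skipped,
    // so the work performed depends on the values being multiplied.
    static int[] mulBasecase(int[] a, int[] b) {
        int[] p = new int[a.length + b.length];
        for (int i = 0; i < b.length; i++) {
            long bi = b[i] & 0xFFFFFFFFL;
            if (bi == 0) continue;                 // the branch the attack observes
            long carry = 0;
            for (int j = 0; j < a.length; j++) {
                long t = (a[j] & 0xFFFFFFFFL) * bi + (p[i + j] & 0xFFFFFFFFL) + carry;
                p[i + j] = (int) t;                // low 32 bits
                carry = t >>> 32;                  // high 32 bits
            }
            p[i + a.length] = (int) carry;         // this limb is still untouched at this point
        }
        return p;
    }

    public static void main(String[] args) {
        int[] a = {0x89ABCDEF, 0x01234567};                  // limbs, least significant first
        int[] b = {0x00000003, 0x00000000, 0x00000002};      // the zero limb is skipped
        for (int limb : mulBasecase(a, b)) System.out.printf("%08x ", limb);
        System.out.println();
    }
}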
Adaptive Chosen Ciphertext Attack
We now describe our first attack on
RSA, extracting the bits of the secret
prime q, one by one. For each bit of q,
denoted q_i, the attack chooses a ciphertext c^(i) such that when c^(i) is decrypted
by the target the side-channel leakage
reveals the value of q_i. Eventually the
entire q is revealed. The choice of each
ciphertext depends on the key bits
learned thus far, making it an adaptive
chosen ciphertext attack.
This attack requires the target to
decrypt ciphertexts chosen by the attacker, which is realistic since GnuPG
is invoked by numerous applications
to decrypt ciphertexts arriving via
email messages, files, webpages, and
chat messages. For example, Enigmail and GpgOL are popular plugins
that add PGP/MIME encrypted-email
capabilities to Mozilla Thunderbird
and Outlook, respectively. They decrypt incoming email messages by
passing them to GnuPG. If the target
uses them, an attacker can remotely
inject a chosen ciphertext into GnuPG
by encoding the ciphertext as a PGP/
MIME email (following RFC 3156)
and sending it to the target.
Cryptanalysis. We can now describe
the adaptive chosen ciphertext attack
on GnuPG’s RSA implementation.
Internal value poisoning. We begin
by choosing appropriate ciphertexts
that will poison some of the internal
values inside Algorithm 1. Let p, q be
two random k-bit primes comprising an RSA secret key; in the case of
high-security 4,096-bit RSA, k = 2,048.
GnuPG always generates RSA keys
such that the most significant bit of
p and q is set, thus q_1 = 1. Assume we have already recovered the topmost i − 1 bits of q and define the ciphertext c^(i) to be the k-bit ciphertext whose topmost i − 1 bits are the same as q, whose i-th bit is 0, and whose remaining bits are set to 1. Consider the effects of decrypting c^(i) on the intermediate values of Algorithm 1, depending on the secret key bit q_i.
Suppose q_i = 1. Then c^(i) < q, and this c^(i) is passed as the argument c to Algorithm 1, where the modular reduction in line 2 returns c = c^(i) (since c^(i) < q), so the lowest k − i bits of c remain 1. Conversely, if q_i = 0, then c^(i) > q, so when c^(i) is passed to Algorithm 1, the modular reduction in line 2 modifies the value of c. Since c^(i) agrees with q on its topmost i − 1 bits, it holds that q < c^(i) < 2q, so in this case the modular reduction computes c ← c − q, which is a random-looking number of length k − i bits.
We have thus obtained a connection between the i-th bit of q and the
resulting structure of c after the modular reduction—either long and repetitive or short and random looking—
thereby poisoning internal values in
Algorithm 1.
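A small BigInteger sketch (toy parameters, not the attack code) of the ciphertext construction just described may help; bit positions are counted from the most significant end, matching the text.

import java.math.BigInteger;

public class ChosenCiphertextSketch {
    // c^(i): topmost i-1 bits equal the recovered bits of q, the i-th bit is 0,
    // and the remaining k-i bits are all 1.
    static BigInteger chosenCiphertext(BigInteger q, int k, int i) {
        BigInteger top = q.shiftRight(k - (i - 1)).shiftLeft(k - (i - 1)); // topmost i-1 bits of q
        BigInteger lowOnes = BigInteger.ONE.shiftLeft(k - i).subtract(BigInteger.ONE); // k-i ones
        return top.or(lowOnes);                                           // i-th bit stays 0
    }

    public static void main(String[] args) {
        BigInteger q = new BigInteger("11010110", 2);   // toy 8-bit value for illustration only
        System.out.println(chosenCiphertext(q, 8, 4).toString(2)); // prints 11001111
    }
}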
Leakage self-amplification. To learn
the i-th bit of q, we need to amplify the
leakage resulting from this connection
so it becomes physically distinguishable. Note the value c is used during
the main loop of Algorithm 1 in line
6. Moreover, since the multiplication
in line 6 is executed once per bit of d,
we obtain that Algorithm 1 performs k
multiplications by c, whose structure
depends on q_i. We now analyze the effects of repetitive vs. random-looking
second operand on the multiplication
routine of GnuPG.
Suppose q_i = 1. Then c has its lowest k − i bits set to 1. Next, c is passed
to the Karatsuba-based multiplication
routine as the second operand b. The
result of (b_L − b_H), as computed in the
Karatsuba-based multiplication, will
thus contain many zero limbs. This invariant, of having the second operand
containing many zero limbs, is preserved by the Karatsuba-based multiplication all the way until the recursion
reaches the base-case multiplication
routine (Algorithm 2), where it affects
the control flow in line 4, forcing the
loop in line 3 to perform almost no
multiplications.
Conversely, if q_i = 0, then c is random-looking, containing few (if any)
zero limbs. When the Karatsuba-based
multiplication routine gets c as its second operand b, the derived values stay
random-looking throughout the recursion until the base case, where these
random-looking values affect the control flow in line 4 inside the main loop
of Algorithm 2, making it almost always perform a multiplication.
Our attack thus creates a situation
where, during the entire decryption
operation, the branch in line 4 of Algorithm 2 is either always taken or is never taken, depending on the current bit
of q. During the decryption process, the
branch in line 4 is evaluated numerous
times (approximately 129,000 times for
4,096-bit RSA). This yields the desired
self-amplification effect. Once q_i is extracted, we can compute the next chosen ciphertext c^(i+1) and proceed to ex-
Figure 5. Measuring acoustic leakage: (a) is the attacked target; (b) is a microphone picking
up the acoustic emanations; (c) is the microphone power supply and amplifier; (d) is the
digitizer; and the acquired signal is processed and displayed by the attacker’s laptop (e).
Figure 6. Acoustic emanations (0kHz–20kHz, 0.5 seconds) of RSA decryption during an
adaptive chosen-ciphertext attack.
[Two spectrogram panels, (a) and (b), each spanning 0kHz–20kHz over 0.5 seconds, with the p and q exponentiation phases marked.]
tract the next secret bit, q_(i+1), through
the same method.
The full attack requires additional
components (such as error detection
and recovery16).
Acoustic cryptanalysis of RSA. The
basic experimental setup for measuring acoustic leakage consists of a microphone for converting mechanical
air vibrations to electronic signals, an
amplifier for amplifying the microphone’s signals, a digitizer for converting the analog signal to a digital form,
and software to perform signal processing and cryptanalytic deduction. Figure
1 and Figure 5 show examples of such
setups using sensitive ultrasound microphones. In some cases, it even suffices to record the target through the
built-in microphone of a mobile phone
placed in proximity to the target and
running the attacker’s mobile app.16
Figure 6 shows the results of applying the acoustic attack for different values (0 or 1) of the attacked bit
of q. Several effects are discernible.
First, the transition between the two
modular exponentiations (using the
moduli p and q) is clearly visible. Second, note the acoustic signature of the second exponentiation differs between Figure 6a and Figure 6b. This is exactly the effect created
by our attack, which can be utilized to
extract the bits of q.
By applying the iterative attack algorithm described earlier, attacking
one key bit at a time by sending the
chosen ciphertext for decryption and
learning the key bit from the measured
acoustic signal, the attacker can fully
extract the secret key. For 4,096-bit RSA
keys (which, according to NIST recommendations, should remain secure for
decades), key extraction takes approximately one hour.
Parallel load. This attack assumes
decryption is triggered on an otherwise-idle target machine. If additional software is running concurrently,
then the signal will be affected, but
the attack may still be feasible. In particular, if other software is executed
through timeslicing, then the irrelevant timeslices can be identified and
discarded. If other, sufficiently homogenous software is executed on a
different core, then (empirically) the
signal of interest is merely shifted.
Characterizing the general case is an
open question, but we conjecture that
exploitable correlations will persist.
Non-Adaptive Chosen
Ciphertext Attacks
The attack described thus far requires decryption of a new adaptively
chosen ciphertext for every bit of the
secret key, forcing the attacker to interact with the target computer for a
long time (approximately one hour).
To reduce the attack time, we turn to
the electrical and electromagnetic
channels, which offer greater analog bandwidth, though still orders of
magnitude less than the target’s CPU
frequency. This increase in bandwidth
allows the attacker to observe finer details about the operations performed
by the target algorithm, thus requiring
less leakage amplification.
Utilizing the increased bandwidth,
our next attack trades away some of the
leakage amplification in favor of reducing the number of ciphertexts. This
reduction shortens the key-extraction
time to seconds and, moreover, makes
the attack non-adaptive, meaning the
chosen ciphertexts can be sent to the
target all at once (such as on a CD with
a few encrypted files).
Cryptanalysis. The non-adaptive
chosen ciphertext attack against
square-and-always-multiply exponentiation (Algorithm 1) follows the approach of Yen et al.,34 extracting the
bits of d instead of q.
Internal value poisoning. Consider
the RSA decryption of c = n − 1. As in the
previous acoustic attack, c is passed to
Algorithm 1, except this time, after the
modular reduction in line 2, it holds
that c ≡ –1 (mod q). We now examine
the effect of c on the squaring operation performed during the main loop
of Algorithm 1.
First note the value of a during the
execution of Algorithm 1 is always either 1 or –1 modulo q. Next, since (–1)² ≡ 1² ≡ 1 (mod q), we have that the value
of a in line 6 is always 1 modulo q. We
thus obtain the following connection
between the secret key bit di–1 and the
value of a at the start of the i-th iteration of Algorithm 1’s main loop.
Suppose di–1 = 0, so the branch in
line 7 is not taken, making the value
of a at the start of the i-th iteration
be 1 mod q = 1. Since GnuPG’s internal representation does not truncate
leading zeros, a contains many leading zero limbs that are then passed to
the squaring routine during the i-th
iteration. Conversely, if di–1 = 1, then
the branch in line 7 is taken, making
the value of a at the start of the i-th
iteration be –1 modulo q, represented
as q – 1. Since q is a randomly generated prime, the value of a, and therefore
the value sent to the squaring routine
during the i-th iteration, is unlikely to
contain any zero limbs.
We have thus poisoned some of the
internal values of Algorithm 1, creating
a connection between the bits of d and
the intermediate values of a during the
exponentiation.
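To make the internal-value poisoning concrete, here is a minimal Python sketch (ours, not GnuPG's code; the modulus, exponent bits, and names are illustrative) of textbook square-and-always-multiply exponentiation. With a base congruent to –1 modulo q, the value entering each squaring is 1 after a 0 bit and q – 1 after a 1 bit, which is the bit-dependent zero-limb pattern described above.

# A minimal sketch (not GnuPG's code) of square-and-always-multiply modular
# exponentiation. The trace records the value of a at the start of each
# iteration; with base = q - 1 (that is, -1 mod q) it equals 1 after a 0 bit
# and q - 1 after a 1 bit, mirroring the poisoning described in the text.
def square_and_always_multiply(c, d_bits, q, trace):
    a = 1
    for bit in d_bits:            # most-significant bit first
        trace.append(a)           # value entering this iteration's squaring
        a = (a * a) % q           # squaring, always performed
        t = (a * c) % q           # multiplication, always performed
        if bit == 1:              # result is kept only when the bit is 1
            a = t
    return a

q = 101                           # toy stand-in for the secret prime
d_bits = [1, 0, 1, 1, 0, 1]       # toy stand-in for secret exponent bits
trace = []
square_and_always_multiply(q - 1, d_bits, q, trace)
print(trace)                      # [1, 100, 1, 100, 100, 1] tracks the bits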
Amplification. GnuPG’s squaring
routines are implemented in ways
similar to the multiplication routines,
including the optimizations for handling zero limbs, yielding leakage self-amplification, as in the adaptive attack.
Since each iteration of the exponentiation’s main loop leaks one bit of the
secret d, all the bits of d can be extracted
from (ideally) a single decryption of
a single ciphertext. In practice, a few
measurements are needed to cope with
noise, as discussed here.
Windowed exponentiation. Many
RSA implementations, including
GnuPG version 1.4.16 and newer, use
an exponentiation algorithm that is
faster than Algorithm 1. In such an
implementation, the exponent d is
split into blocks of m bits (typically m =
5), either contiguous blocks (in “fixed
window” or “m-ary” exponentiation)
Figure 7. Measuring the chassis potential
from the far side of an Ethernet cable (blue)
plugged into the target laptop (10 meters
away) through an alligator clip leading to
measurement equipment (green wire).
or blocks separated by runs of zero
bits (in “sliding-window” exponentiation). The main loop, instead of handling the exponent one bit at a time,
handles a whole block at every iteration, by multiplying a by c^x, where x is the block's value. The values c^x are
pre-computed and stored in a lookup
table (for all m-bit values x).
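As a reference point, the following Python sketch implements textbook fixed-window (m-ary) exponentiation with a precomputed table of powers. The names are ours, the window width is called w here to avoid clashing with the modulus, and GnuPG's actual windowed code differs in representation and optimizations.

# A textbook sketch of fixed-window ("m-ary") exponentiation; w is the
# window width (the text's m). Real implementations, including GnuPG's,
# differ in internal representation and in sliding-window handling.
def fixed_window_pow(c, d, n, w=5):
    table = [1] * (1 << w)
    for x in range(1, 1 << w):
        table[x] = (table[x - 1] * c) % n       # table[x] = c**x mod n
    blocks = []                                 # split d into w-bit blocks
    while d > 0:
        blocks.append(d & ((1 << w) - 1))
        d >>= w
    a = 1
    for x in reversed(blocks):                  # most-significant block first
        for _ in range(w):
            a = (a * a) % n                     # w squarings per block
        a = (a * table[x]) % n                  # one multiplication by c**x
    return a

assert fixed_window_pow(7, 65537, 101) == pow(7, 65537, 101)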
An adaptation of these techniques
also allows attacking windowed exponentiation.12 In a nutshell, we focus
on each possible m-bit value x, one at a
time, and identify which blocks in the
exponent d, that is, which iterations of
the main loop, contain x. This is done
by crafting a ciphertext c such that c^x
mod q contains many zero limbs. Leakage amplification and measurement
then work similarly to the acoustic and
electric attacks described earlier. Once
we identify where each x occurred, we
aggregate these locations to deduce
the full key d.
Electric attacks. As discussed earlier, the electrical potential on the chassis of laptop computers often fluctuates (in reference to the mains earth
ground) in a computation-dependent
way. In addition to measuring this potential directly using a plain wire connected to the laptop chassis, it is possible to measure the chassis potential
from afar using the conductive shielding of any cable attached to one of the
laptop’s I/O ports (see Figure 7) or
from nearby by touching an exposed
metal part of the laptop’s chassis, as
in Figure 2.
To cope with noise, we measured
the electric potential during a few
(typically 10) decryption operations.
Each recording was filtered and demodulated. We used frequency demodulation since it produced the best
results compared to amplitude and
phase demodulations. We then combined the recordings using correlation-based averaging, yielding a
combined signal (see Figure 8). The
successive bits of d can be deduced
from this combined signal. Full key
extraction, using non-adaptive electric measurements, requires only a few
seconds of measurements, as opposed
to an hour using the adaptive attack.
We obtained similar results for ElGamal encryption; Genkin et al.15 offer a
complete discussion.
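The following numpy/scipy sketch illustrates the kind of processing just described: frequency-demodulate each recording via its analytic signal, align the traces by cross-correlation, and average them. The synthetic recordings and all parameters are illustrative; this is not the authors' measurement pipeline.

# Illustrative sketch of frequency demodulation plus correlation-based
# averaging of several recordings. Synthetic input; not the real pipeline.
import numpy as np
from scipy.signal import hilbert

def fm_demodulate(x, fs):
    phase = np.unwrap(np.angle(hilbert(x)))     # phase of the analytic signal
    return np.diff(phase) * fs / (2 * np.pi)    # instantaneous frequency

def combine(recordings, fs):
    demods = [fm_demodulate(r, fs) for r in recordings]
    ref = demods[0]
    aligned = [ref]
    for d in demods[1:]:
        # best (circular) shift of d against the reference trace
        lag = np.argmax(np.correlate(d, ref, mode="full")) - (len(ref) - 1)
        aligned.append(np.roll(d, -lag))
    return np.mean(aligned, axis=0)             # averaging suppresses noise

fs = 200_000                                    # 200 kSample/sec
t = np.arange(0, 0.02, 1 / fs)
clean = np.cos(2 * np.pi * 20_000 * t + 3 * np.sin(2 * np.pi * 50 * t))
recordings = [clean + 0.5 * np.random.default_rng(i).standard_normal(t.size)
              for i in range(10)]
combined = combine(recordings, fs)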
Electromagnetic attacks. The electromagnetic channel, which exploits
computation-dependent fluctuations
in the electromagnetic field surrounding the target, can also be used for key
extraction. While this channel was previously used for attacks on small devices at very close proximity,1,11,28 the PC
class of devices was only recently considered by Zajic and Prvulovic35 (without
cryptographic applications).
Measuring the target’s electromagnetic emanations requires an antenna,
electronics for filtering and amplification, analog-to-digital conversion,
and software for signal processing and
cryptanalytic deduction. Prior works
(on small devices) typically used cumbersome and expensive lab-grade
Figure 8. A signal segment from an electric attack, after demodulating and combining
measurements of several decryptions. Note the correlation between the signal (blue) and
the correct key bits (red).
equipment. In our attacks,12 we used
highly integrated solutions that are
small and inexpensive (such as a software-defined radio dongle, as in Figure
9, or a consumer-grade radio receiver
recorded by a smartphone, as in Figure
3). Demonstrating how an untethered
probe may be constructed from readily
available electronics, we also built the
Portable Instrument for Trace Acquisition (PITA), which is compact enough
to be concealed, as in pita bread (see
Figure 10).
Experimental results. Attacking RSA
and ElGamal (in both square-and-always-multiply and windowed implementations) over the electromagnetic
channel (sampling at 200 kSample/sec
around a center frequency of 1.7MHz),
using the non-adaptive attack described earlier, we have extracted secret keys in a few seconds from a distance of half a meter.
Attacking other schemes and other devices. So far, we have discussed
attacks on the RSA and ElGamal cryptosystems based on exponentiation
in large prime fields. Similar attacks
also target elliptic-curve cryptography. For example, we demonstrated
key extraction from GnuPG’s implementation of the Elliptic-Curve Diffie-Hellman scheme running on a
PC;13 the attacker, in this case, can
measure the target’s electromagnetic leakage from an adjacent room
through a wall.
Turning to mobile phones and tab-
Figure 9. Measuring electromagnetic emanations from a target laptop (left) through a loop
of coax cable (handheld) recorded by a software-defined radio (right).
Figure 10. Extracting keys by measuring a laptop’s electromagnetic emanations
through a PITA device.
lets, as well as to other cryptographic
libraries (such as OpenSSL and iOS
CommonCrypto), electromagnetic key
extraction from implementations of
the Elliptic Curve Digital Signature Algorithm has also been demonstrated,
including attacks that are non-invasive,17 low-bandwidth,5,24 or both.14
Conclusion
Extraction of secret cryptographic keys
from PCs using physical side channels
is feasible, despite their complexity
and execution speed. We have demonstrated such attacks on many public-key encryption schemes and digital-signature schemes, as implemented
by popular cryptographic libraries, using inexpensive and readily available
equipment, by various attack vectors
and in multiple scenarios.
Hardware countermeasures. Side-channel leakage can be attenuated through such physical means as
sound-absorbing enclosures against
acoustic attacks, Faraday cages
against electromagnetic attacks, insulating enclosures against chassis
and touch attacks, and photoelectric
decoupling or fiber-optic connections
against “far end of cable” attacks.
However, these countermeasures are
expensive and cumbersome. Devising inexpensive physical leakage protection for consumer-grade PCs is an
open problem.
Software countermeasures. Given
a characterization of a side channel,
algorithms and their software implementations may be designed so the
leakage through the given channel
will not convey useful information.
One such approach is “blinding,”
or ensuring long operations (such
as modular exponentiation) that involve sensitive values are, instead,
performed on random dummy values
and later corrected using an operation that includes the sensitive value
but is much shorter and thus more
difficult to measure (such as modular
multiplication). A popular example of
this approach is ciphertext randomization,20 which was added to GnuPG
following our observations and indeed prevents both the internal value
poisoning and the leakage self-amplification components of our attacks.
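For concreteness, here is a textbook Python sketch of RSA ciphertext randomization (blinding) with a toy key; the names and parameters are ours, and GnuPG's actual implementation differs in detail.

# Textbook sketch of ciphertext randomization (blinding): the long modular
# exponentiation is applied to c * r**e mod n for a fresh random r, so the
# leaky computation never operates directly on the attacker-chosen
# ciphertext; the result is corrected by r**-1. Toy key, not GnuPG's code.
import math
import secrets

def blinded_rsa_decrypt(c, d, e, n):
    r = 0
    while math.gcd(r, n) != 1:
        r = secrets.randbelow(n - 2) + 2         # fresh blinding factor
    c_blinded = (c * pow(r, e, n)) % n           # randomized input
    m_blinded = pow(c_blinded, d, n)             # the long, leaky operation
    return (m_blinded * pow(r, -1, n)) % n       # short correction step

n, e = 61 * 53, 17                               # toy key: n = 3233
d = pow(e, -1, 780)                              # 780 = lcm(60, 52)
m = 1234
assert blinded_rsa_decrypt(pow(m, e, n), d, e, n) == m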
However, such countermeasures
require careful design and adaptation
contributed articles
for every cryptographic scheme and
leakage channel; moreover, they often involve significant cost in performance. There are emerging generic
protection methods at the algorithmic level, using fully homomorphic
encryption and cryptographic leakage
resilience; however, their overhead is
currently so great as to render them
impractical.
Future work. To fully understand
the ramifications and potential of
physical side-channel attacks on PCs
and other fast and complex devices,
many questions remain open. What
other implementations are vulnerable, and what other algorithms tend
to have vulnerable implementations?
In particular, can symmetric encryption algorithms (which are faster and
more regular) be attacked? What other physical channels exist, and what
signal processing and cryptanalytic
techniques can exploit them? Can the
attacks’ range be extended (such as
in acoustic attacks via laser vibrometers)? What level of threat do such
channels pose in various real-world
scenarios? Ongoing research indicates the risk extends well beyond the
particular algorithms, software, and
platforms we have covered here.
On the defensive side, we also raise
three complementary questions: How
can we formally model the feasible
side-channel attacks on PCs? What engineering methods will ensure devices
comply with the model? And what algorithms, when running on compliant devices, will provably protect their
secrets, even in the presence of sidechannel attacks?
Acknowledgments
This article is based on our previous
research,12,13,15,16 which was supported by the Check Point Institute for
Information Security, the European
Union’s 10th Framework Programme
(FP10/2010-2016) under grant agreement no. 259426 ERC-CaC, a Google
Faculty Research Award, the Leona M.
& Harry B. Helmsley Charitable Trust,
the Israeli Ministry of Science, Technology and Space, the Israeli Centers
of Research Excellence I-CORE program (center 4/11), NATO’s Public Diplomacy Division in the Framework of
“Science for Peace,” and the Simons
Foundation and DIMACS/Simons Col-
laboration in Cryptography through
National Science Foundation grant
#CNS-1523467.
References
1. Agrawal, D., Archambeault, B., Rao, J.R., and Rohatgi,
P. The EM side-channel(s). In Proceedings of the
Workshop on Cryptographic Hardware and Embedded
Systems (CHES 2002). Springer, 2002, 29–45.
2. Anderson, R.J. Security Engineering: A Guide to
Building Dependable Distributed Systems, Second
Edition. Wiley, 2008.
3. Asonov, D. and Agrawal, R. Keyboard acoustic
emanations. In Proceedings of the IEEE Symposium
on Security and Privacy. IEEE Computer Society
Press, 2004, 3–11.
4. Backes, M., Dürmuth, M., Gerling, S., Pinkal, M., and
Sporleder, C. Acoustic side-channel attacks on printers.
In Proceedings of the USENIX Security Symposium
2010. USENIX Association, 2010, 307–322.
5. Belgarric, P., Fouque, P.-A., Macario-Rat, G., and
Tibouchi, M. Side-channel analysis of Weierstrass and
Koblitz curve ECDSA on Android smartphones. In
Proceedings of the Cryptographers’ Track of the RSA
Conference (CT-RSA 2016). Springer, 2016, 236–252.
6. Bernstein, D.J. Cache-timing attacks on AES. 2005;
http://cr.yp.to/papers.html#cachetiming
7. Brouchier, J., Dabbous, N., Kean, T., Marsh, C., and
Naccache, D. Thermocommunication. Cryptology
ePrint Archive, Report 2009/002, 2009; https://eprint.
iacr.org/2009/002
8. Brumley, D. and Boneh, D. Remote timing attacks
are practical. Computer Networks 48, 5 (Aug. 2005),
701–716.
9. Clark, S.S., Mustafa, H.A., Ransford, B., Sorber, J., Fu,
K., and Xu, W. Current events: Identifying webpages
by tapping the electrical outlet. In Proceedings of the
18th European Symposium on Research in Computer
Security (ESORICS 2013). Springer, Berlin, Heidelberg,
2013, 700–717.
10. Clark, S.S., Ransford, B., Rahmati, A., Guineau, S.,
Sorber, J., Xu, W., and Fu, K. WattsUpDoc: Power
side channels to nonintrusively discover untargeted
malware on embedded medical devices. In
Proceedings of the USENIX Workshop on Health
Information Technologies (HealthTech 2013). USENIX
Association, 2013.
11. Gandolfi, K., Mourtel, C., and Olivier, F. Electromagnetic
analysis: Concrete results. In Proceedings of the
Workshop on Cryptographic Hardware and Embedded
Systems (CHES 2001). Springer, Berlin, Heidelberg,
2001, 251–261.
12. Genkin, D., Pachmanov, L., Pipman, I., and Tromer,
E. Stealing keys from PCs using a radio: Cheap
electromagnetic attacks on windowed exponentiation.
In Proceedings of the Workshop on Cryptographic
Hardware and Embedded Systems (CHES 2015).
Springer, 2015, 207–228.
13. Genkin, D., Pachmanov, L., Pipman, I., and Tromer,
E. ECDH key-extraction via low-bandwidth
electromagnetic attacks on PCs. In Proceedings of the
Cryptographers’ Track of the RSA Conference (CT-RSA
2016). Springer, 2016, 219–235.
14. Genkin, D., Pachmanov, L., Pipman, I., Tromer, E., and
Yarom, Y. ECDSA Key Extraction from Mobile Devices
via Nonintrusive Physical Side Channels. Cryptology
ePrint Archive, Report 2016/230, 2016; http://eprint.
iacr.org/2016/230
15. Genkin, D., Pipman, I., and Tromer, E. Get your hands
off my laptop: Physical side-channel key-extraction
attacks on PCs. In Proceedings of the Workshop on
Cryptographic Hardware and Embedded Systems
(CHES 2014). Springer, 2014, 242–260.
16. Genkin, D., Shamir, A., and Tromer, E. RSA key
extraction via low-bandwidth acoustic cryptanalysis.
In Proceedings of the Annual Cryptology Conference
(CRYPTO 2014). Springer, 2014, 444–461.
17. Kenworthy, G. and Rohatgi, P. Mobile device security:
The case for side-channel resistance. In Proceedings
of the Mobile Security Technologies Conference
(MoST), 2012; http://mostconf.org/2012/papers/21.pdf
18. Kocher, P., Jaffe, J., and Jun, B. Differential power
analysis. In Proceedings of the Annual Cryptology
Conference (CRYPTO 1999). Springer, 1999, 388–397.
19. Kocher, P., Jaffe, J., Jun, B., and Rohatgi, P.
Introduction to differential power analysis. Journal of
Cryptographic Engineering 1, 1 (2011), 5–27.
20. Kocher, P.C. Timing attacks on implementations of
Diffie-Hellman, RSA, DSS, and other systems. In
Proceedings of the Annual Cryptology Conference
(CRYPTO 1996). Springer, 1996, 104–113.
21. Kuhn, M.G. Compromising Emanations: Eavesdropping
Risks of Computer Displays. Ph.D. Thesis and
Technical Report UCAM-CL-TR-577. University of
Cambridge Computer Laboratory, Cambridge, U.K.,
Dec. 2003; https://www.cl.cam.ac.uk/techreports/
UCAM-CL-TR-577.pdf
22. Loughry, J. and Umphress, D.A. Information leakage
from optical emanations. ACM Transactions on
Information Systems Security 5, 3 (Aug. 2002), 262–289.
23. Mangard, S., Oswald, E., and Popp, T. Power Analysis
Attacks: Revealing the Secrets of Smart Cards.
Springer, Berlin, Heidelberg, 2007.
24. Nakano, Y., Souissi, Y., Nguyen, R., Sauvage, L.,
Danger, J., Guilley, S., Kiyomoto, S., and Miyake, Y. A
pre-processing composition for secret key recovery
on Android smartphones. In Proceedings of the
International Workshop on Information Security
Theory and Practice (WISTP 2014). Springer, Berlin,
Heidelberg, 2014.
25. Oren, Y. and Shamir, A. How not to protect
PCs from power analysis. Presented at the
Annual Cryptology Conference (CRYPTO
2006) rump session. 2006; http://iss.oy.ne.ro/
HowNotToProtectPCsFromPowerAnalysis
26. Osvik, D.A., Shamir, A., and Tromer, E. Cache
attacks and countermeasures: The case of AES. In
Proceedings of the Cryptographers’ Track of the RSA
Conference (CT-RSA 2006). Springer, 2006,1–20.
27. Percival, C. Cache missing for fun and profit. In
Proceedings of the BSDCan Conference, 2005; http://
www.daemonology.net/hyperthreading-considered-harmful
28. Quisquater, J.-J. and Samyde, D. Electromagnetic
analysis (EMA): Measures and countermeasures
for smartcards. In Proceedings of the Smart Card
Programming and Security: International Conference
on Research in Smart Cards (E-smart 2001). Springer,
2001, 200–210.
29. Skorobogatov, S. Optical Surveillance on Silicon Chips.
University of Cambridge, Cambridge, U.K., 2009;
http://www.cl.cam.ac.uk/~sps32/SG_talk_OSSC_a.pdf
30. van Eck, W. Electromagnetic radiation from video
display units: An eavesdropping risk? Computers and
Security 4, 4 (Dec. 1985), 269–286.
31. Wright, P. Spycatcher. Viking Penguin, New York, 1987.
32. Yarom, Y. and Falkner, K. FLUSH+RELOAD: A high-resolution, low-noise, L3 cache side-channel attack.
In Proceedings of the USENIX Security Symposium
2014. USENIX Association, 2014, 719–732.
33. Yarom, Y., Liu, F., Ge, Q., Heiser, G., and Lee, R.B.
Last-level cache side-channel attacks are practical.
In Proceedings of the IEEE Symposium on Security
and Privacy. IEEE Computer Society Press, 2015,
606–622.
34. Yen, S.-M., Lien, W.-C., Moon, S.-J., and Ha, J. Power
analysis by exploiting chosen message and internal
collisions: Vulnerability of checking mechanism for
RSA decryption. In Proceedings of the International
Conference on Cryptology in Malaysia (Mycrypt 2005).
Springer, 2005, 183–195.
35. Zajic, A. and Prvulovic, M. Experimental demonstration
of electromagnetic information leakage from modern
processor-memory systems. IEEE Transactions on
Electromagnetic Compatibility 56, 4 (Aug. 2014),
885–893.
Daniel Genkin ([email protected]) is a Ph.D.
candidate in the Computer Science Department at
Technion-Israel Institute of Technology, Haifa, Israel, and
a research assistant in the Blavatnik School of Computer
Science at Tel Aviv University, Israel.
Lev Pachmanov ([email protected]) is a master’s candidate
in the Blavatnik School of Computer Science at Tel Aviv
University, Israel.
Itamar Pipman ([email protected]) is a master’s
candidate in the Blavatnik School of Computer Science at
Tel Aviv University, Israel.
Adi Shamir ([email protected]) is a professor in
the faculty of Mathematics and Computer Science at the
Weizmann Institute of Science, Rehovot, Israel.
Eran Tromer ([email protected]) is a senior lecturer
in the Blavatnik School of Computer Science at Tel Aviv
University, Israel.
Copyright held by authors.
review articles
DOI:10.1145/ 2842602
Randomization offers new benefits
for large-scale linear algebra computations.
BY PETROS DRINEAS AND MICHAEL W. MAHONEY
RandNLA: Randomized Numerical Linear Algebra
Matrices are ubiquitous in computer science,
statistics, and applied mathematics. An m × n
matrix can encode information about m objects
(each described by n features), or the behavior of a
discretized differential operator on a finite element
mesh; an n × n positive-definite matrix can encode
the correlations between all pairs of n objects, or the
edge-connectivity between all pairs of nodes in a social
network; and so on. Motivated largely by technological
developments that generate extremely large scientific
and Internet datasets, recent years have witnessed
exciting developments in the theory and practice of
matrix algorithms. Particularly remarkable is the use of
randomization—typically assumed to be a property of the
input data due to, for example, noise in the data
generation mechanisms—as an algorithmic or computational resource for
the development of improved algorithms for fundamental matrix problems such as matrix multiplication,
least-squares (LS) approximation, low-rank matrix approximation, and Laplacian-based linear equation solvers.
Randomized Numerical Linear
Algebra (RandNLA) is an interdisciplinary research area that exploits
randomization as a computational
resource to develop improved algorithms for large-scale linear algebra
problems.32 From a foundational perspective, RandNLA has its roots in
theoretical computer science (TCS),
with deep connections to mathematics (convex analysis, probability theory,
metric embedding theory) and applied
mathematics (scientific computing,
signal processing, numerical linear
algebra). From an applied perspective, RandNLA is a vital new tool for
machine learning, statistics, and data
analysis. Well-engineered implementations have already outperformed
highly optimized software libraries
for ubiquitous problems such as least-squares,4,35 with good scalability in parallel and distributed environments.52
Moreover, RandNLA promises a sound
algorithmic and statistical foundation
for modern large-scale data analysis.
key insights
•• Randomization isn't just used to model noise in data; it can be a powerful computational resource to develop algorithms with improved running times and stability properties as well as algorithms that are more interpretable in downstream data science applications.
•• To achieve best results, random sampling of elements or columns/rows must be done carefully; but random projections can be used to transform or rotate the input data to a random basis where simple uniform random sampling of elements or rows/columns can be successfully applied.
•• Random sketches can be used directly to get low-precision solutions to data science applications; or they can be used indirectly to construct preconditioners for traditional iterative numerical algorithms to get high-precision solutions in scientific computing applications.
An Historical Perspective
To get a broader sense of RandNLA, recall
that linear algebra—the mathematics
of vector spaces and linear mappings
between vector spaces—has had a long
history in large-scale (by the standards
of the day) statistical data analysis.46 For
example, the least-squares method is
due to Gauss, Legendre, and others, and
was used in the early 1800s for fitting
linear equations to data to determine
planet orbits. Low-rank approximations
based on Principal Component Analysis
(PCA) are due to Pearson, Hotelling, and
others, and were used in the early 1900s
for exploratory data analysis and for
making predictive models. Such methods are of interest for many reasons, but
especially if there is noise or randomness in the data, because the leading
principal components then tend to capture the signal and remove the noise.
With the advent of the digital computer in the 1950s, it became apparent
that, even when applied to well-posed
problems, many algorithms performed
poorly in the presence of the finite precision that was used to represent real
numbers. Thus, much of the early work
in computer science focused on solving
discrete approximations to continuous numerical problems. Work by Turing
and von Neumann (then Householder,
Wilkinson, and others) laid much of the
foundations for scientific computing and
NLA.48,49 Among other things, this led to
the introduction of problem-specific complexity measures (for example, the
condition number) that characterize
the behavior of an input for a specific
class of algorithms (for example, iterative algorithms).
A split then occurred in the nascent
field of computer science. Continuous
linear algebra became the domain of
applied mathematics, and much of
computer science theory and practice became discrete and combinatorial.44 Nearly all subsequent work
in scientific computing and NLA
has been deterministic (a notable
exception being the work on integral
evaluation using the Markov Chain
Monte Carlo method). This led to
high-quality codes in the 1980s and
1990s (LINPACK, EISPACK, LAPACK,
ScaLAPACK) that remain widely used
today. Meanwhile, Turing, Church,
and others began the study of computation per se. It became clear that several seemingly different approaches
(recursion theory, the λ-calculus, and
Turing machines) defined the same
class of functions; and this led to the
belief in TCS that the concept of computability is formally captured in a
Figure 1. (a) Matrices are a common way to model data. In genetics, for example, matrices can describe data from tens of thousands of
individuals typed at millions of Single Nucleotide Polymorphisms or SNPs (loci in the human genome). Here, the (i, j)th entry is the genotype
of the ith individual at the jth SNP. (b) PCA/SVD can be used to project every individual on the top left singular vectors (or “eigenSNPs”),
thereby providing a convenient visualization of the “out of Africa hypothesis” well known in population genetics.
[Panel (a): a genotype matrix with rows indexed by individuals and columns by SNPs. Panel (b): projection of individuals onto the top three eigenSNPs (EigenSNP 1, 2, and 3), with clusters corresponding to Africa, the Middle East, Europe, Central/South Asia, East Asia, Oceania, America, and the Mexican and Gujarati samples.]
qualitative and robust way by these
three equivalent approaches, independent of the input data. Many of
these developments were deterministic; but, motivated by early work on
the Monte Carlo method, randomization—where the randomness is
inside the algorithm and the algorithm
is applied to arbitrary or worst-case
data—was introduced and exploited
as a powerful computational resource.
Recent years have seen these two
very different perspectives start to
converge. Motivated by modern massive dataset problems, there has been
a great deal of interest in developing
algorithms with improved running
times and/or improved statistical
properties that are more appropriate
for obtaining insight from the enormous quantities of noisy data that is
now being generated. At the center of
these developments is work on novel
algorithms for linear algebra problems, and central to this is work on
RandNLA algorithms.a In this article,
we will describe the basic ideas that
underlie recent developments in this
interdisciplinary area.
For a prototypical data analysis
example where RandNLA methods
have been applied, consider Figure
1, which illustrates an application
in genetics38 (although the same
RandNLA methods have been applied
in astronomy, mass spectrometry
imaging, and related areas33,38,53,54).
While the low-dimensional PCA plot
illustrates the famous correlation
a Avron et al., in the first sentence of their
Blendenpik paper, observe that RandNLA is
“arguably the most exciting and innovative
idea to have hit linear algebra in a long time.”4
between geography and genetics,
there are several weaknesses of PCA/
SVD-based methods. One is running
time: computing PCA/SVD approximations of even moderately large
data matrices is expensive, especially
if it needs to be done many times as
part of cross validation or exploratory
data analysis. Another is interpretability: in general, eigenSNPs (that
is, eigenvectors of individual-by-SNP
matrices) as well as other eigenfeatures don’t “mean” anything in terms
of the processes generating the data.
Both issues have served as motivation to design RandNLA algorithms
to compute PCA/SVD approximations
faster than conventional numerical
methods as well as to identify actual
features (instead of eigenfeatures)
that might be easier to interpret for
domain scientists.
Basic RandNLA Principles
RandNLA algorithms involve taking an
input matrix; constructing a “sketch”
of that input matrix—where a sketch
is a smaller or sparser matrix that represents the essential information in
the original matrix—by random sampling; and then using that sketch as
a surrogate for the full matrix to help
compute quantities of interest. To be
useful, the sketch should be similar
to the original matrix in some way, for
example, small residual error on the
difference between the two matrices,
or the two matrices should have similar action on sets of vectors or in downstream classification tasks. While
these ideas have been developed in
many ways, several basic design principles underlie much of RandNLA:
(i) randomly sample, in a careful
­data-dependent manner, a small number of elements from an input matrix
to create a much sparser sketch of the
original matrix; (ii) randomly sample,
in a careful data-dependent manner, a
small number of columns and/or rows
from an input matrix to create a much
smaller sketch of the original matrix;
and (iii) preprocess an input matrix
with a random-­projection-type matrix,
in order to “spread out” or uniformize
the information in the original matrix,
and then use naïve data-independent
uniform sampling of rows/columns/
elements in order to create a sketch.
Element-wise sampling. A naïve way
to view an m × n matrix A is an array of
numbers: these are the mn elements
of the matrix, and they are denoted by
Aij (for all i = 1, . . ., m and all j = 1, . . ., n).
It is therefore natural to consider the
following approach in order to create
a small sketch of a matrix A: instead
of keeping all its elements, randomly
sample and keep a small number of
them. Algorithm 1 is a meta-algorithm
that samples s elements from a matrix
A in independent, identically distributed trials, where in each trial a single
element of A is sampled with respect to
the importance sampling probability
distribution pij. The algorithm outputs
a matrix à that contains precisely the
selected elements of A, after appropriate
rescaling. This rescaling is fundamental
from a statistical perspective: the sketch
à is an estimator for A. This rescaling
makes it an unbiased estimator since,
element-wise, the expectation of the
estimator matrix à is equal to the original matrix A.
Algorithm 1 A meta-algorithm for
element-wise sampling
Input: m × n matrix A; integer s > 0
denoting the number of elements to
be sampled; probability distribution pij
(i = 1, . . ., m and j = 1, . . ., n) with ∑i, j pij = 1.
1. Let à be an all-zeros m × n matrix.
2. For t = 1 to s,
•• Randomly sample one element
of A using the probability distribution pij.
•• Let A_{i_t j_t} denote the sampled element and set
Ã_{i_t j_t} = A_{i_t j_t} / (s p_{i_t j_t}).   (1)
Output: Return the m × n matrix Ã.
How to sample is, of course, very
important. A simple choice is to perform uniform sampling, that is, set pij
= 1/mn, for all i, j, and sample each element with equal probability. While simple, this suffers from obvious problems:
for example, if all but one of the entries
of the original matrix equal zero, and
only a single non-zero entry exists, then
the probability of sampling the single
non-zero entry of A using uniform sampling is negligible. Thus, the estimator
would have very large variance, in which
case the sketch would, with high probability, fail to capture the relevant structure of the original matrix. Qualitatively
improved results can be obtained by
using nonuniform data-dependent
importance sampling distributions. For
example, sampling larger elements (in
absolute value) with higher probability
is advantageous in terms of variance
reduction and can be used to obtain
worst-case additive-error bounds for
low-rank matrix approximation.1,2,18,28
More elaborate probability distributions (the so-called element-wise leverage scores that use information in the
singular subspaces of A10) have been
shown to provide still finer results.
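As a concrete illustration, here is a compact numpy sketch of Algorithm 1 with the squared-magnitude probabilities just discussed; the function name and test matrix are ours.

# Numpy sketch of Algorithm 1 (element-wise sampling) with probabilities
# p_ij proportional to A_ij**2. Names and test data are illustrative.
import numpy as np

def sample_elements(A, s, seed=0):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    p = (A ** 2) / np.sum(A ** 2)                    # importance sampling distribution
    picks = rng.choice(m * n, size=s, replace=True, p=p.ravel())
    A_tilde = np.zeros_like(A, dtype=float)
    for idx in picks:
        i, j = divmod(idx, n)
        A_tilde[i, j] = A[i, j] / (s * p[i, j])      # rescaled sampled element
    return A_tilde

A = np.random.default_rng(1).standard_normal((200, 50))
A_tilde = sample_elements(A, s=4000)
print(np.linalg.norm(A - A_tilde, 2) / np.linalg.norm(A, "fro"))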
The first results2 for Algorithm 1
showed that if one chooses entries
with probability proportional to their
squared-magnitudes (that is, if pij = Aij² / ‖A‖F², in which case larger
magnitude entries are more likely to
be chosen), then the sketch à is similar
to the original matrix A, in the sense that
the error matrix, A − Ã, has, with high
probability, a small spectral norm.
A more refined analysis18 showed that
‖A − Ã‖₂ ≤ O(√((m + n) ln(m + n) / s)) ‖A‖F,   (2)
where ‖·‖₂ and ‖·‖F are the spectral and Frobenius norms, respectively, of the
matrix.b If the spectral norm of the difference A − à is small, then à can be
used as proxy for A in applications. For
example, one can use à to approximate
the spectrum (that is, the singular values and singular vectors) of the original matrix.2 If s is set to be a constant
multiple of (m + n) ln (m + n), then the
error scales with the Frobenius norm
of the matrix. This leads to an additive-error low-rank matrix approximation algorithm, in which ‖A‖F is the scale
of the additional additive error.2 This
is a large scaling factor, but improving
upon this with element-wise sampling,
even in special cases, is a challenging
open problem.
The mathematical techniques used
in the proof of these element-wise sampling results exploit the fact that the residual matrix A − Ã is a random matrix whose
entries have zero mean and bounded
variance. Bounding the spectral norm
of such matrices has a long history in
random matrix theory.50 Early RandNLA
element-wise sampling bounds2 used
a result of Füredi and Komlós on the
spectral norm of symmetric, zero mean
matrices of bounded variance.20 Sub­
sequently, Drineas and Zouzias18 introduced the idea of using matrix measure
concentration inequalities37,40,47 to simplify the proofs, and follow-up work18
has improved these bounds.
Row/column sampling. A more sophisticated way to view a matrix A is as
a linear operator, in which case the
role of rows and columns becomes
more central. Much RandNLA research
has focused on sketching a matrix by
keeping only a few of its rows and/or
b In words, the spectral norm of a matrix measures
how much the matrix elongates or deforms the
unit ball in the worst case, and the Frobenius
norm measures how much the matrix elongates
or deforms the unit ball on average. Sometimes
the spectral norm may have better properties
especially when dealing with noisy data, as discussed by Achlioptas and McSherry.2
columns. This method of sampling
predates element-wise sampling algorithms,19 and it leads to much stronger
worst-case bounds.15,16
Algorithm 2 A meta-algorithm for row
sampling
Input: m × n matrix A; integer s > 0
denoting the number of rows to be
sampled; probabilities pi (i = 1, . . ., m)
with ∑i pi = 1.
1. Let à be the empty matrix.
2. For t = 1 to s,
•• Randomly sample one row of
A using the probability distribution pi.
•• Let A_{i_t *} denote the sampled row and set
Ã_{t *} = A_{i_t *} / √(s p_{i_t}).   (3)
Output: Return the s × n matrix Ã.
Consider the meta-algorithm for row
sampling (column sampling is analogous) presented in Algorithm 2. Much of
the discussion of Algorithm 1 is relevant
to Algorithm 2. In particular, Algorithm
2 samples s rows of A in independent,
identically distributed trials according to the input probabilities pis; and
the output matrix à contains precisely the selected rows of A, after a
rescaling that ensures un-biasedness
of appropriate estimators (for example,
the expectation of ÃᵀÃ is equal to AᵀA,
element-wise).13,19 In addition, uniform
sampling can easily lead to very poor
results, but qualitatively improved
results can be obtained by using nonuniform, data-dependent, importance
sampling distributions. Some things,
however, are different: the dimension
of the sketch à is different than that of
the original matrix A. The solution is
to measure the quality of the sketch by
comparing the difference between the
matrices AᵀA and ÃᵀÃ. The simplest nonuniform distribution is known as ℓ2 sampling or norm-squared sampling, in which pi is proportional to the square of the Euclidean norm of the ith rowc:
pi = ‖Ai*‖₂² / ‖A‖F².   (4)
c We will use the notation Ai* to denote the ith row of A as a row vector.
When using norm-squared sampling, one can prove that
‖AᵀA − ÃᵀÃ‖F ≤ (1/√s) ‖A‖F²   (5)
holds in expectation (and thus, by
standard arguments, with high probability) for arbitrary A.13,19,d The proof
of Equation (5) is a simple exercise
using basic properties of expectation and variance. This result can
be generalized to approximate the
product of two arbitrary matrices A
and B.13 Proving such bounds with
respect to other matrix norms is
more challenging but very important
for RandNLA. While Equation (5)
trivially implies a bound for AT A −
ÃT Ã2, proving a better spectral norm
error bound necessitates the use of
more sophisticated methods such as
the Khintchine inequality or matrix-Bernstein inequalities.42,47
Bounds of the form of Equation (5)
immediately imply that à can be used
as a proxy for A, for example, in order
to approximate its (top few) singular
values and singular vectors. Since à is
an s × n matrix, with s ≪ n, computing
its singular values and singular vectors
is a very fast task that scales linearly
with n. Due to the form of Equation (5),
this leads to additive-error low-rank
matrix approximation algorithms, in
which AF is the scale of the additional additive error.19 That is, while
norm-squared sampling avoids pitfalls of uniform sampling, it results
in additive-error bounds that are only
comparable to what element-wise
sampling achieves.2,19
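A compact numpy sketch of Algorithm 2 with the norm-squared probabilities of Equation (4) is shown below; the observed error tracks the 1/√s scaling of Equation (5). Names and test data are ours.

# Numpy sketch of Algorithm 2 (row sampling) with norm-squared probabilities;
# the rescaling by 1/sqrt(s * p_i) makes A_tilde.T @ A_tilde an estimator of
# A.T @ A, and the Frobenius-norm error behaves like ||A||_F**2 / sqrt(s).
import numpy as np

def sample_rows(A, s, seed=0):
    rng = np.random.default_rng(seed)
    p = np.sum(A ** 2, axis=1) / np.sum(A ** 2)        # Equation (4)
    idx = rng.choice(A.shape[0], size=s, replace=True, p=p)
    return A[idx, :] / np.sqrt(s * p[idx])[:, None]

A = np.random.default_rng(1).standard_normal((10_000, 20))
A_tilde = sample_rows(A, s=500)
err = np.linalg.norm(A.T @ A - A_tilde.T @ A_tilde, "fro") / np.linalg.norm(A, "fro") ** 2
print(err)    # roughly of order 1/sqrt(s), as in Equation (5)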
To obtain stronger and more useful bounds, one needs information
about the geometry or subspace
structure of the high-dimensional
Euclidean space spanned by the
columns of A (if m ≫ n) or the space
spanned by the best rank-k approximation to A (if m ∼ n). This can be
achieved with leverage score sampling, in which pi is proportional to
the ith leverage score of A. To define these scores, for simplicity assume that m ≫ n and that U is any m × n orthogonal matrix spanning the column space of A.e In this case, UᵀU is equal to the identity and UUᵀ = PA is an m-dimensional projection matrix onto the span of A. Then, the importance sampling probabilities of Equation (4), applied to U, equal
pi = ‖Ui*‖₂² / n.   (6)
d That is, a provably good approximation to the product AᵀA can be computed using just a few rows of A; and these rows can be found by sampling randomly according to a simple data-dependent importance sampling distribution. This matrix multiplication algorithm can be implemented in one pass over the data from external storage, using only O(sn) additional space and O(s²n) additional time.
e A generalization holds if m ∼ n: in this case, U is any m × k orthogonal matrix spanning the best rank-k approximation to the column space of A, and one uses the leverage scores relative to the best rank-k approximation to A.14,16,33
Due to their historical importance in regression diagnostics and outlier detection, the pi's in Equation (6) are known as statistical leverage scores.9,14 In some applications of RandNLA, the largest leverage score is called the coherence of the matrix.8,14
Importantly, while one can naïvely compute these scores via Equation (6) by spending O(mn²) time to compute U exactly, this is not necessary.14 Let Π be the fast Hadamard Transform as used in Drineas et al.14 or the input-sparsity-time random projection of Refs.12,34,36 Then, in o(mn²) time, one can compute the R matrix from a QR decomposition of ΠA and from that compute 1 ± ε relative-error approximations to all the leverage scores.14
In RandNLA, one is typically interested in proving that
‖UᵀU − ŨᵀŨ‖₂ = ‖I − ŨᵀŨ‖₂ ≤ ε,   (7)
where Ũ denotes a sketch of U (constructed, for example, with Algorithm 2 and the probabilities of Equation (6)), either for arbitrary ε ∈ (0, 1) or for some fixed ε ∈ (0, 1). Approximate matrix multiplication bounds of the form of Equation (7) are very important in RandNLA algorithm design since the resulting sketch Ã preserves rank properties of the original data matrix A and provides a subspace embedding: from the NLA perspective, this is simply an acute perturbation from the original high-dimensional space to a much lower dimensional space.22 From the TCS perspective, this provides bounds analogous to the usual Johnson–Lindenstrauss bounds, except that it preserves the geometry of the entire subspace.43
Subspace embeddings were first used in RandNLA in a data-aware manner (meaning, by looking at the input data to compute exact or approximate leverage scores14) to obtain sampling-based relative-error approximations to the LS regression and related low-rank CX/CUR approximation problems.15,16 They were then used in a data-oblivious manner (meaning, in conjunction with a random projection as a preconditioner) to obtain projection-based relative-error approximations to several RandNLA problems.43 A review of data-oblivious subspace embeddings for RandNLA, including its relationship with the early work on least absolute deviations regression,11 has been provided.51 Due to the connection with data-aware and data-oblivious subspace embeddings, approximating matrix multiplication is one of the most powerful primitives in RandNLA. Many error formulae for other problems ultimately boil down to matrix inequalities, where the randomness of the algorithm only appears as a (randomized) approximate matrix multiplication.
Figure 2. In RandNLA, random projections can be used to "precondition" the input data so that uniform sampling algorithms perform well, in a manner analogous to how traditional pre-conditioners transform the input to decrease the usual condition number so that iterative algorithms perform well (see (a)). In RandNLA, the random projection-based preconditioning involves uniformizing information in the eigenvectors, rather than flattening the eigenvalues (see (b)).
[Panels (a) and (b): leverage score versus row index.]
Random projections as preconditioners. Preconditioning refers to the application of a transformation, called the preconditioner, to a given problem instance such that the transformed instance is more easily solved by a given class of algorithms.f The main challenge for sampling-based RandNLA algorithms is the construction of the nonuniform sampling probabilities. A natural question arises: is there a way to precondition an input instance such that uniform random sampling of rows, columns, or elements yields an insignificant loss in approximation accuracy?
f For example, if one is interested in iterative algorithms for solving the linear system Ax = b, one typically transforms a given problem instance to a related instance in which the so-called condition number is not too large.
The obvious obstacle to sampling uniformly at random from a matrix is that the relevant information in the matrix could be concentrated on a small number of rows, columns, or elements of the matrix. The solution is to spread out or uniformize this information, so that it is distributed almost uniformly over all rows, columns, or elements of the matrix. (This is illustrated in Figure 2.) At the same time, the preprocessed
matrix should have similar properties
(for example, singular values and singular vectors) as the original matrix, and
the preprocessing should be computationally efficient (for example, it should
be faster than solving the original problem exactly) to perform.
Consider Algorithm 3, our meta-algorithm for preprocessing an input
matrix A in order to uniformize information in its rows or columns or elements. Depending on the choice of
preprocessing (only from the left, only
from the right, or from both sides) the
information in A is uniformized in different ways (across its rows, columns,
or elements, respectively). For pedagogical simplicity, Algorithm 3 is described
such that the output matrix has the
same dimensions as the original matrix
(in which case Π is approximately a random rotation). Clearly, however, if this
algorithm is coupled with Algorithm
1 or Algorithm 2, then with trivial to
implement uniform sampling, only the
rows/columns that are sampled actually need to be generated. In this case
the sampled version of Π is known as a
random projection.
Algorithm 3 A meta-algorithm for preconditioning a matrix for random sampling algorithms
1: Input: m × n matrix A, randomized preprocessing matrices ΠL and/or ΠR.
2: Output:
•• To uniformize information across
the rows of A, return ΠLA.
•• To uniformize information across
the columns of A, return AΠR.
•• To uniformize information across
the elements of A, return ΠL AΠR.
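The following numpy sketch applies Algorithm 3 with one possible choice of ΠL, a Gaussian-based random rotation, and prints the largest row leverage score before and after preconditioning to show the uniformization; names and the toy matrix are ours.

# Numpy sketch of Algorithm 3 with Pi_L taken to be a random rotation
# (other choices, such as FJLTs or sparse projections, are faster). The
# largest leverage score drops from ~1 to roughly n/m after preconditioning.
import numpy as np

def leverage_scores(A):
    U, _, _ = np.linalg.svd(A, full_matrices=False)    # orthogonal basis for range(A)
    return np.sum(U ** 2, axis=1)

rng = np.random.default_rng(0)
m, n = 2000, 10
A = rng.standard_normal((m, n))
A[0, :] *= 100                                         # plant one high-leverage row

Pi_L, _ = np.linalg.qr(rng.standard_normal((m, m)))    # random rotation
print(leverage_scores(A).max())                        # close to 1: very nonuniform
print(leverage_scores(Pi_L @ A).max())                 # close to n/m: nearly uniform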
There is wide latitude in the choice
of the random matrix Π. For example,
although Π can be chosen to be a random orthogonal matrix, other constructions can have much better algorithmic
properties: Π can consist of appropriately scaled independent identically distributed (i.i.d.) Gaussian random variables,
i.i.d. Rademacher random variables
(+1 or −1, up to scaling, each with probability 50%), or i.i.d. random variables
drawn from any sub-Gaussian distribution. Implementing these variants
depends on the time to generate the random bits plus the time to perform the
matrix-matrix multiplication that actually performs the random projection.
More interestingly, Π could be a
so-called Fast Johnson Lindenstrauss
Transform (FJLT). This is the product
of two matrices, a random diagonal
matrix with +1 or −1 on each diagonal
entry, each with probability 1/2, and the
Hadamard-Walsh (or related Fourier-based) matrix.3 Implementing FJLT-based
random projections can take advantage
of well-studied fast Fourier techniques
and can be extremely fast for arbitrary
dense input matrices.4,41 Recently, there
has even been introduced an extremely
sparse random projection construction
that for arbitrary input matrices can be
implemented in “input-sparsity time,”
that is, time depending on the number
of nonzeros of A, plus lower-order terms,
as opposed to the dimensions of A.12,34,36
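A minimal numpy/scipy sketch of an FJLT-style sketching operator, as described above, is shown next: random signs, a Walsh–Hadamard transform, and uniform row sampling. For clarity it builds the Hadamard matrix explicitly (so the number of rows must be a small power of two); fast implementations use a recursive transform instead. All names are ours.

# FJLT-style sketch: D (random signs), H (Walsh-Hadamard), then uniform row
# sampling. Explicit Hadamard matrix for clarity only; m must be a power of 2.
import numpy as np
from scipy.linalg import hadamard

def fjlt_sketch(A, s, seed=0):
    rng = np.random.default_rng(seed)
    m = A.shape[0]
    D = rng.choice([-1.0, 1.0], size=m)              # random diagonal signs
    H = hadamard(m) / np.sqrt(m)                     # orthogonal Hadamard matrix
    mixed = H @ (D[:, None] * A)                     # information is spread across rows
    idx = rng.choice(m, size=s, replace=False)       # uniform sampling is now safe
    return np.sqrt(m / s) * mixed[idx, :]

A = np.random.default_rng(1).standard_normal((1024, 8))
SA = fjlt_sketch(A, s=128)
print(np.linalg.norm(A.T @ A - SA.T @ SA, 2) / np.linalg.norm(A, 2) ** 2)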
With appropriate settings of problem parameters (for example, the
number of uniform samples that are
subsequently drawn, which equals
the dimension onto which the data is
projected), all of these methods precondition arbitrary input matrices so
that uniform sampling in the randomly
rotated basis performs as well as nonuniform sampling in the original basis.
For example, if m ≫ n, in which case
the leverage scores of A are given by
Equation (6), then by keeping only
roughly O(n log n) randomly-rotated
dimensions, uniformly at random, one
can prove that the leverage scores of
the preconditioned system are, up to
logarithmic fluctuations, uniform.g
Which construction for Π should be
used in any particular application of
RandNLA depends on the details of
the problem, for example, the aspect
ratio of the matrix, whether the RAM
model is appropriate for the particular computational infrastructure, how
expensive it is to generate random bits,
and so on. For example, while slower
in the RAM model, Gaussian-based
random projections can have stronger
conditioning properties than other
constructions. Thus, given their ease
of use, they are often more appropriate
for certain parallel and cloud-computing architectures.25,35
Summary. Of the three basic RandNLA
principles described in this section, the
g This is equivalent to the statement that the
coherence of the preconditioned system is small.
first two have to do with identifying nonuniformity structure in the input data;
and the third has to do with preconditioning the input (that is, uniformizing the
nonuniformity structure) so uniform random sampling performs well. Depending
on the area in which RandNLA algorithms
have been developed and/or implemented and/or applied, these principles
can manifest themselves in very different
ways. Relatedly, in applications where
elements are of primary importance
(for example, recommender systems26),
element-wise methods might be most
appropriate, while in applications where
subspaces are of primary importance
(for example, scientific computing25),
column/row-based methods might be
most appropriate.
Extensions and Applications of
Basic RandNLA Principles
We now turn to several examples of problems in various domains where the basic
RandNLA principles have been used in
the design and analysis, implementation, and application of novel algorithms.
Low-precision approximations and
high-precision numerical implementations: least-squares and low-rank
approximation. One of the most fundamental problems in linear algebra is
the least-squares (LS) regression problem: given an m × n matrix A and an
m-dimensional vector b, solve
min_x ‖Ax − b‖₂,   (8)
where ‖·‖₂ denotes the ℓ2 norm of a vector. That is, compute the n-dimensional
vector x that minimizes the Euclidean
norm of the residual Ax − b.h If m ≫ n,
then we have the overdetermined (or
overconstrained) LS problem, and its
solution can be obtained in O(mn²)
time in the RAM model with one of
several methods, for example, solving
the normal equations, QR decompositions, or the SVD. Two major successes
of RandNLA concern faster (in terms
of low-precision asymptotic worst-case
theory, or in terms of high-precision
wall-clock time) algorithms for this
ubiquitous problem.
h Observe this formulation includes as a special
case the problem of solving systems of linear
equations (if m = n and A has full rank, then
the resulting system of linear equations has a
unique solution).
One major success of RandNLA
was the following random sampling
algorithm for the LS problem: quickly
compute 1 ± ε approximations to the
leverage scores;14 form a subproblem
by sampling with Algorithm 2 roughly
Θ(n log(m)/ε) rows from A and the corresponding elements from b using
those approximations as importance
sampling probabilities; and return
the LS solution of the subproblem.14,15
Alternatively, one can run the following random projection algorithm: precondition the input with a
Hadamard-based random projection;
form a subproblem by sampling with
Algorithm 2 roughly Θ(n log(m)/ε)
rows from A and the corresponding
elements from b uniformly at random; and return the LS solution of
the subproblem.17, 43
Both of these algorithms return 1±ε
relative-error approximate solutions
for arbitrary or worst-case input; and
both run in roughly Θ(mn log(n)/ε)
= o(mn2) time, that is, qualitatively
faster than traditional algorithms
for the overdetermined LS problem.
(Although this random projection
algorithm is not faster in terms of
asymptotic FLOPS than the corresponding random sampling algorithm, preconditioning with random
projections is a powerful primitive
more generally for RandNLA algorithms.) Moreover, both of these algorithms have been improved to run in
time that is proportional to the number of nonzeros in the matrix, plus
lower-order terms that depend on the
lower dimension of the input.12
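The following numpy sketch mirrors the sampling-based least-squares algorithm just described, except that for clarity it computes exact leverage scores with an SVD rather than the fast 1 ± ε approximations; names and test data are ours.

# Sketch of leverage-score sampling for overdetermined least squares:
# sample rows of (A, b), rescale, and solve the small problem. Exact
# leverage scores are used here for clarity; the fast algorithm in the
# text only approximates them.
import numpy as np

def sketched_lstsq(A, b, s, seed=0):
    rng = np.random.default_rng(seed)
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    p = np.sum(U ** 2, axis=1) / U.shape[1]            # leverage-score probabilities
    idx = rng.choice(A.shape[0], size=s, replace=True, p=p)
    w = 1.0 / np.sqrt(s * p[idx])                      # rescaling, as in Algorithm 2
    x, *_ = np.linalg.lstsq(w[:, None] * A[idx], w * b[idx], rcond=None)
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((50_000, 30))
b = A @ rng.standard_normal(30) + 0.01 * rng.standard_normal(50_000)
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)
x_approx = sketched_lstsq(A, b, s=2000)
print(np.linalg.norm(A @ x_approx - b) / np.linalg.norm(A @ x_exact - b))   # close to 1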
Another major success of RandNLA
was the demonstration that the
sketches constructed by RandNLA
could be used to construct preconditioners for high-quality traditional
NLA iterative software libraries.4 To see
the need for this, observe that because
of its dependence on ε, the previous
RandNLA algorithmic strategy (construct a sketch and solve a LS problem
on that sketch) can yield low-precision
solutions, for example, ε = 0.1, but
cannot practically yield high-precision solutions, for example, ε = 10⁻¹⁶.
Blendenpik4 and LSRN35 are LS solvers
that are appropriate for RAM and parallel environments, respectively, that
adopt the following RandNLA algorithmic strategy: construct a sketch, using
an appropriate random projection;
use that sketch to construct a preconditioner for a traditional iterative NLA
algorithm; and use that to solve the preconditioned version of the original full
problem. This improves the ε dependence
from poly(1/ε) to log(1/ε). Carefully engineered implementations of this
approach are competitive with or beat
high-quality numerical implementations of LS solvers such as those implemented in LAPACK.4
The difference between these two
algorithmic strategies (see Figure 3
for an illustration) highlights important differences between TCS and
NLA approaches to RandNLA, as
well as between computer science
and scientific computing more generally: subtle but important differences in problem parameterization,
between what counts as a “good” solution, and between error norms of interest. Moreover, similar approaches
have been used to extend TCS-style
RandNLA algorithms for providing 1 ±
ε relative-error low-rank matrix approximation16,43 to NLA-style RandNLA algorithms for high-quality numerical
low-rank matrix approximation.24,25,41
Figure 3. (a) RandNLA algorithms for least-squares problems first compute sketches, SA and Sb, of the input data, A and b. Then, either
they solve a least-squares problem on the sketch to obtain a low-precision approximation, or they use the sketch to construct a traditional
preconditioner for an iterative algorithm on the original input data to get high-precision approximations. Subspace-preserving embedding:
if S is a random sampling matrix, then the high leverage point will be sampled and included in SA; and if S is a random-projection-type
matrix, then the information in the high leverage point will be homogenized or uniformized in SA. (b) The “heart” of RandNLA proofs is
subspace-preserving embedding for orthogonal matrices: if UA is an orthogonal matrix (say the matrix of the left singular vectors of A),
then SUA is approximately orthogonal.
[Panel (a): the data (A, b), including a high-leverage data point, their sketches (SA, Sb), and the least-squares fit. Panel (b): an orthogonal matrix UA and its sketch SUA, with (SUA)ᵀ(SUA) ≈ I.]
For example, a fundamental structural
condition for a sketching matrix to satisfy to obtain good low-rank matrix
approximation is the following. Let
Vk ∈ Rn × k (resp., Vk,⊥ ∈ Rn × (n−k)) be any
matrix spanning the top-k (resp., bottom-(n − k) ) right singular subspace of
A ∈ Rm × n, and let Σk (resp., Σk,⊥) be the
diagonal matrix containing the top-k
(resp., all but the top-k) singular values.
In addition, let Z ∈ Rn × r (r ≥ k) be any matrix (for example, a random sampling matrix S, a random projection matrix Π, or a matrix Z constructed deterministically) such that (Vk)T Z has full rank. Then,

‖A − (AZ)(AZ)+ A‖ξ ≤ ‖A − Ak‖ξ + ‖Σk,⊥ ((Vk,⊥)T Z)((Vk)T Z)+‖ξ,   (9)

where ‖·‖ξ denotes any unitarily invariant matrix norm.
How this structural condition is used
depends on the particular low-rank
problem of interest, but it is widely used
(either explicitly or implicitly) by low-rank
RandNLA algorithms. For example,
Equation (9) was introduced in the
context of the Column Subset Selection
Problem7 and was reproven and used to
reparameterize low-rank random projection algorithms in ways that could be
more easily implemented.25 It has also
been used in ways ranging from developing improved bounds for kernel methods in machine learning21 to coupling
with a version of the power method to
obtain improved numerical implementations41 to improving subspace iteration methods.24
The structural condition in Equation
(9) immediately suggests a proof strategy for bounding the error of RandNLA
algorithms for low-rank matrix approximation: identify a sketching matrix Z such that (Vk)T Z has full rank; and, at the same time, bound the relevant norms of ((Vk)T Z)+ and Σk,⊥ (Vk,⊥)T Z.
Importantly, in many
of the motivating scientific computing
applications, the matrices of interest
are linear operators that are only implicitly represented but that are structured
such that they can be applied to an
arbitrary vector quickly. In these cases,
FJLT-based or input-sparsity-based
projections applied to arbitrary matrices can be replaced with Gaussian-based projections applied to these
structured operators with similar computational costs and quality guarantees.
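As one concrete instance of using such a sketching matrix Z, here is a minimal Python/NumPy sketch of a Gaussian-based randomized range finder for low-rank approximation. The oversampling parameter and the names are illustrative; production codes add power iterations and structured projections.

import numpy as np

def randomized_low_rank(A, k, oversample=10, seed=0):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Z = rng.standard_normal((n, k + oversample))  # sketching matrix Z
    Y = A @ Z                                     # sample the range of A
    Q, _ = np.linalg.qr(Y)                        # orthonormal basis for range(AZ)
    B = Q.T @ A                                   # project A onto that basis
    return Q, B                                   # A is approximated by Q @ B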
Matrix completion. Consider the
following problem, which is an idealization of the important recommender
systems problem.26 Given an arbitrary
m × n matrix A, reconstruct A by sampling a set of O((m + n) poly(1/ε^a)), as opposed to all mn, entries of the matrix such that the resulting approximation Ã satisfies, either deterministically or up to some failure probability,

‖A − Ã‖F ≤ (1 + ε) ‖A − Ak‖F.   (10)
Here, a should be small (for example, 2);
and the sample size could be increased
by (less important) logarithmic factors
of m, n, and ε. In addition, one would
like to construct the sample and compute à after making a small number of
passes over A or without even touching
all of the entries of A.
A first line of research (already mentioned) on this problem from TCS
focuses on element-wise sampling:2
sample entries from a matrix with probabilities that (roughly) depend on their
magnitude squared. This can be done in
one pass over the matrix, but the resulting additive-error bound is much larger
than the requirements of Equation (10),
as it scales with the Frobenius norm of A
instead of the Frobenius norm of A − Ak.
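A minimal sketch of this element-wise sampling rule, with probabilities proportional to squared magnitudes and the standard unbiased rescaling (the function name and parameters are illustrative):

import numpy as np

def elementwise_sample(A, s, seed=0):
    rng = np.random.default_rng(seed)
    p = (A ** 2) / np.sum(A ** 2)                  # probabilities ~ magnitude squared
    picks = rng.choice(A.size, size=s, p=p.ravel())
    A_sparse = np.zeros_like(A, dtype=float)
    for flat_index in picks:
        i, j = np.unravel_index(flat_index, A.shape)
        A_sparse[i, j] += A[i, j] / (s * p[i, j])  # rescale so E[A_sparse] = A
    return A_sparse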
A second line of research from signal
processing and applied mathematics has referred to this as the matrix
completion problem.8 In this case, one is
interested in computing à without even
observing all of the entries of A. Clearly,
this is not possible without assumptions on A.i Typical assumptions are
on the eigenvalues and eigenvectors of
A: for example, the input matrix A has
rank exactly k, with k ≪ min{m, n}, and
also that A satisfies some sort of eigenvector delocalization or incoherence
conditions.8 The simplest form of the
latter is that the leverage scores of Equation
(6) are approximately uniform. Under
these assumptions, one can prove that
given a uniform sample of O ( (m + n) k
ln (m + n) ) entries of A, the solution to
the following nuclear norm minimization problem recovers A exactly, with
high probability:
min_Ã ‖Ã‖*   s.t. Ãij = Aij, for all sampled entries Aij,   (11)

where ‖·‖* denotes the nuclear (or trace) norm of a matrix (basically, the sum of the singular values of the matrix).
i This highlights an important difference in problem parameterization: TCS-style approaches assume worst-case input and must identify nonuniformity structure, while applied mathematics approaches typically assume well-posed problems where the worst nonuniformity structure is not present.
That is, if A is exactly low-rank (that is,
A = Ak and thus A − Ak is zero) and satisfies an incoherence assumption,
then Equation (10) is satisfied, since
A = Ak = Ã. Recently, the incoherence
assumption has been relaxed, under
the assumption that one is given oracle
access to A according to a non-uniform
sampling distribution that essentially
corresponds to element-wise leverage
scores.10 However, removing the assump­
tion that A has exact low-rank k, with k ≪ min{m, n}, is still an open problem.j
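A minimal sketch of the nuclear norm minimization of Equation (11), written with CVXPY as an illustrative off-the-shelf convex solver; the mask construction and the names are not from the article.

import numpy as np
import cvxpy as cp

def complete_by_nuclear_norm(A, observed_mask):
    # observed_mask is a 0/1 array marking the sampled entries of A.
    m, n = A.shape
    X = cp.Variable((m, n))
    constraints = [cp.multiply(observed_mask, X) == observed_mask * A]
    problem = cp.Problem(cp.Minimize(cp.normNuc(X)), constraints)
    problem.solve()          # needs an SDP-capable solver such as SCS
    return X.value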
Informally, keeping only a few rows/
columns of a matrix seems more powerful than keeping a comparable number
of elements of a matrix. For example,
consider an m × n matrix A whose rank
is exactly equal to k, with k ≪ min{m, n}:
selecting any set of k linearly independent rows allows every row of A to be
expressed as a linear combination of the
selected rows. The analogous procedure
for element-wise sampling seems harder.
This is reflected in the fact that state-of-the-art element-wise sampling algorithms use convex optimization and other heavier-duty algorithmic machinery.
Solving systems of Laplacian-based
linear equations. Consider the special
case of the LS regression problem of
Equation (8) when m = n, that is, the well-known problem of solving the system
of linear equations Ax = b. For worst-case
dense input matrices A this problem can
be solved exactly in O(n^3) time, for example,
using the partial LU decomposition and
other methods. However, especially when
A is positive semidefinite (PSD), iterative
techniques such as the conjugate gradients method are typically preferable,
mainly because of their linear dependency on the number of non-zero entries
in the matrix A (times a factor depending
on the condition number of A).
An important special case is when
the PSD matrix A is the Laplacian matrix
j It should be noted that there exists prior work
on matrix completion for low-rank matrices
with the addition of well-behaved noise; however, removing the low-rank assumption and
achieving error that is relative to some norm of
the residual A − Ak is still open.
of an underlying undirected graph G
= (V, E), with n = |V| vertices and |E|
weighted, undirected edges.5 Variants
of this special case are common in
unsupervised and semi-supervised
machine learning.6 Recall the Laplacian
matrix of an undirected graph G is an n
× n matrix that is equal to the n × n diagonal matrix D of weighted node degrees minus the
n × n adjacency matrix of the graph. In
this special case, there exist randomized, relative-error algorithms for the
problem of Equation (8).5 The running
time of these algorithms is
O (nnz(A)polylog(n)),
where nnz(A) represents the number
of non-zero elements of the matrix
A, that is, the number of edges in
the graph G. The first step of these
algorithms corresponds to randomized graph sparsification and keeps a
small number of edges from G, thus
creating a much sparser Laplacian matrix. This sparse matrix is subsequently used (in a recursive manner) as an efficient preconditioner to
approximate the solution of the problem of Equation (8).
While the original algorithms in
this line of work were major theoretical
breakthroughs, they were not immediately applicable to numerical implementations and data applications. In
an effort to bridge the theory-practice
gap, subsequent work proposed a much
simpler algorithm for the graph sparsification step.45 This subsequent work
showed that randomly sampling edges
from the graph G (equivalently, rows
from the edge-incidence matrix) with
probabilities proportional to the effective resistances of the edges provides a
satisfying
sparse Laplacian matrix
the desired properties. (On the negative side, in order to approximate the
effective resistances of the edges of G, a
call to the original solver was necessary,
clearly hindering the applicability of
the simpler sparsification algorithm.45)
The effective resistances are equivalent
to the statistical leverage scores of the
weighted edge-incidence matrix of G.
Subsequent work has exploited graph
theoretic ideas to provide efficient algorithms to approximate them in time proportional to the number of edges in the
graph (up to polylogarithmic factors).27
Recent improvements have essentially
removed these polylogarithmic factors,
leading to useful implementations of
Laplacian-based solvers.27 Extending
such techniques to handle general PSD
input matrices A that are not Laplacian
is an open problem.
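The sampling rule itself is simple to state in code. The sketch below (plain NumPy; the dense pseudoinverse is only for illustration, since the fast solvers cited above approximate effective resistances in near-linear time, and all names are illustrative) samples edges with probability proportional to weight times effective resistance and reweights them to keep the sparsifier unbiased.

import numpy as np

def sparsify_by_effective_resistance(n, edges, weights, num_samples, seed=0):
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    # Weighted Laplacian L = D - W.
    L = np.zeros((n, n))
    for (u, v), w in zip(edges, weights):
        L[u, u] += w; L[v, v] += w
        L[u, v] -= w; L[v, u] -= w
    Lpinv = np.linalg.pinv(L)
    # Effective resistance of edge (u, v) is (e_u - e_v)^T L^+ (e_u - e_v).
    reff = np.array([Lpinv[u, u] + Lpinv[v, v] - 2.0 * Lpinv[u, v]
                     for (u, v) in edges])
    # Probabilities proportional to weight times effective resistance
    # (the leverage scores of the weighted edge-incidence matrix).
    p = weights * reff
    p = p / p.sum()
    picks = rng.choice(len(edges), size=num_samples, p=p)
    # Reweight sampled edges so the sparsifier is unbiased.
    L_tilde = np.zeros((n, n))
    for i in picks:
        u, v = edges[i]
        w = weights[i] / (num_samples * p[i])
        L_tilde[u, u] += w; L_tilde[v, v] += w
        L_tilde[u, v] -= w; L_tilde[v, u] -= w
    return L_tilde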
Statistics and machine learning.
RandNLA has been used in statistics
and machine learning in several ways,
the most common of which is in
so-called kernel-based machine learning.21 This involves using a PSD matrix
to encode nonlinear relationships
between data points; and one obtains
different results depending on whether
one is interested in approximating a
given kernel matrix,21 constructing new
kernel matrices of particular forms,39
or obtaining a low-rank basis with
which to perform downstream classification, clustering, and other related
tasks.29 Alternatively, the analysis
used to provide relative-error low-rank
matrix approximation for worst-case
input can also be used to provide
bounds for kernel-based divide-and-conquer algorithms.31 More generally,
CX/CUR decompositions provide scalable and interpretable solutions to
downstream data analysis problems
in genetics, astronomy, and related
areas.33,38,53,54 Recent work has focused
on statistical aspects of the “algorithmic leveraging” approach that is central to RandNLA algorithms.30
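As one concrete instance of kernel matrix approximation, here is a minimal sketch of a plain Nyström approximation of a PSD kernel matrix; the uniform, unweighted column sampling and the names are illustrative simplifications of the methods cited above.

import numpy as np

def nystrom_approximation(K, c, seed=0):
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    cols = rng.choice(n, size=c, replace=False)   # sampled landmark columns
    C = K[:, cols]                                # n x c block of sampled columns
    W = K[np.ix_(cols, cols)]                     # c x c intersection block
    return C @ np.linalg.pinv(W) @ C.T            # K is approximated by C W^+ C^T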
Looking Forward
RandNLA has proven to be a model
for truly interdisciplinary research in this
era of large-scale data. For example, while
TCS, NLA, scientific computing, mathematics, machine learning, statistics,
and downstream scientific domains are
all interested in these results, each of
these areas is interested for very different
reasons. Relatedly, while technical results
underlying the development of RandNLA
have been nontrivial, some of the largest
obstacles to progress in RandNLA have
been cultural: TCS being cavalier about
polynomial factors, ε factors, and working in overly idealized computational
models; NLA being extremely slow to
embrace randomization as an algorithmic resource; scientific computing
researchers formulating and implementing algorithms that make strong domainspecific assumptions; and machine
learning and statistics researchers being
more interested in results on hypothesized unseen data rather than the data
being input to the algorithm.
In spite of this, RandNLA has already
led to improved algorithms for several
fundamental matrix problems, but it is
important to emphasize that “improved”
means different things to different
people. For example, TCS is interested
in these methods due to the deep connections with Laplacian-based linear
equation solvers5,27 and since fast random sampling and random projection
algorithms12,14,17,43 represent an improvement in the asymptotic running time of
the 200-year-old Gaussian elimination
algorithms for least-squares problems
on worst-case input. NLA is interested in
these methods since they can be used to
engineer variants of traditional NLA algorithms that are more robust and/or faster
in wall clock time than high-quality software that has been developed over recent
decades. (For example, Blendenpik
“beats LAPACK’s direct dense least-squares solver by a large margin on
essentially any dense tall matrix;”4 the
randomized approach for low-rank matrix
approximation in scientific computing
“beats its classical competitors in terms
of accuracy, speed, and robustness;”25
and least-squares and least absolute
deviations regression problems “can be
solved to low, medium, or high precision
in existing distributed systems on up to
terabyte-sized data.”52) Mathematicians
are interested in these methods since
they have led to new and fruitful fundamental mathematical questions.23,40,42,47
Statisticians and machine learners are
interested in these methods due to their
connections with kernel-based learning and since the randomness inside the
algorithm often implicitly implements a
form of regularization on realistic noisy
input data.21,29,30 Finally, data analysts are
interested in these methods since they
provide scalable and interpretable solutions to downstream scientific data analysis problems.33, 38,54 Given the central role
that matrix problems have historically
played in large-scale data analysis, we
expect RandNLA methods will continue
to make important contributions not only
to each of those research areas but also to
bridging the gaps between them.
References
1. Achlioptas, D., Karnin, Z., Liberty, E. Near-optimal
entrywise sampling for data matrices. In Annual
Advances in Neural Information Processing
Systems 26: Proceedings of the 2013 Conference, 2013.
2. Achlioptas, D., McSherry, F. Fast computation of
low-rank matrix approximations. J. ACM 54, 2 (2007),
Article 9.
3. Ailon, N., Chazelle, B. Faster dimension reduction.
Commun. ACM 53, 2 (2010), 97–104.
4. Avron, H., Maymounkov, P., Toledo, S. Blendenpik:
Supercharging LAPACK’s least-squares solver. SIAM
J. Sci. Comput. 32 (2010), 1217–1236.
5. Batson, J., Spielman, D.A., Srivastava, N., Teng,
S.-H. Spectral sparsification of graphs: Theory
and algorithms. Commun. ACM 56, 8 (2013), 87–94.
6. Belkin, M., Niyogi, P. Laplacian eigenmaps for
dimensionality reduction and data representation.
Neural Comput. 15, 6 (2003), 1373–1396.
7. Boutsidis, C., Mahoney, M.W., Drineas, P. An improved
approximation algorithm for the column subset
selection problem. In Proceedings of the 20th Annual
ACM-SIAM Symposium on Discrete Algorithms
(2009), 968–977.
8. Candes, E.J., Recht, B. Exact matrix completion via
convex optimization. Commun. ACM 55, 6 (2012),
111–119.
9. Chatterjee, S., Hadi, A.S. Influential observations,
high leverage points, and outliers in linear regression.
Stat. Sci. 1, 3 (1986), 379–393.
10. Chen, Y., Bhojanapalli, S., Sanghavi, S., Ward, R.
Coherent matrix completion. In Proceedings of the
31st International Conference on Machine Learning
(2014), 674–682.
11. Clarkson, K. Subgradient and sampling algorithms
for ℓ1 regression. In Proceedings of the 16th Annual
ACM-SIAM Symposium on Discrete Algorithms
(2005), 257–266.
12. Clarkson, K.L., Woodruff, D.P. Low rank approximation
and regression in input sparsity time. In Proceedings
of the 45th Annual ACM Symposium on Theory of
Computing (2013), 81–90.
13. Drineas, P., Kannan, R., Mahoney, M.W. Fast Monte
Carlo algorithms for matrices I: approximating matrix
multiplication. SIAM J. Comput. 36 (2006), 132–157.
14. Drineas, P., Magdon-Ismail, M., Mahoney, M.W.,
Woodruff, D.P. Fast approximation of matrix
coherence and statistical leverage. J. Mach. Learn.
Res. 13 (2012), 3475–3506.
15. Drineas, P., Mahoney, M.W., Muthukrishnan, S.
Sampling algorithms for ℓ2 regression and applications.
In Proceedings of the 17th Annual ACM-SIAM
Symposium on Discrete Algorithms (2006), 1127–1136.
16. Drineas, P., Mahoney, M.W., Muthukrishnan, S.
Relative-error CUR matrix decompositions.
SIAM J. Matrix Anal. Appl. 30 (2008), 844–881.
17. Drineas, P., Mahoney, M.W., Muthukrishnan, S., Sarlós,
T. Faster least squares approximation. Numer. Math.
117, 2 (2010), 219–249.
18. Drineas, P., Zouzias, A. A note on element-wise matrix
sparsification via a matrix-valued Bernstein inequality.
Inform. Process. Lett. 111 (2011), 385–389.
19. Frieze, A., Kannan, R., Vempala, S. Fast Monte-Carlo
algorithms for finding low-rank approximations.
J. ACM 51, 6 (2004), 1025–1041.
20. Füredi, Z., Komlós, J. The eigenvalues of random
symmetric matrices. Combinatorica 1, 3 (1981),
233–241.
21. Gittens, A. Mahoney, M.W. Revisiting the Nyström
method for improved large-scale machine learning.
J. Mach. Learn Res. In press.
22. Golub, G.H., Van Loan, C.F. Matrix Computations.
Johns Hopkins University Press, Baltimore, 1996.
23. Gross, D. Recovering low-rank matrices from few
coefficients in any basis. IEEE Trans. Inform. Theory
57, 3 (2011), 1548–1566.
24. Gu, M. Subspace iteration randomization and singular
value problems. Technical report, 2014. Preprint:
arXiv:1408.2208.
25. Halko, N., Martinsson, P.-G., Tropp, J.A. Finding
structure with randomness: Probabilistic algorithms
for constructing approximate matrix decompositions.
SIAM Rev. 53, 2 (2011), 217–288.
26. Koren, Y., Bell, R., Volinsky, C. Matrix factorization
techniques for recommender systems. IEEE Comp.
42, 8 (2009), 30–37.
27. Koutis, I., Miller, G.L., Peng, R. A fast solver for a
class of linear systems. Commun. ACM 55, 10 (2012),
99–107.
28. Kundu, A., Drineas, P. A note on randomized elementwise matrix sparsification. Technical report, 2014.
Preprint: arXiv:1404.0320.
29. Le, Q.V., Sarlós, T., Smola, A.J. Fastfood—
approximating kernel expansions in loglinear time.
In Proceedings of the 30th International Conference
on Machine Learning, 2013.
30. Ma, P., Mahoney, M.W., Yu, B. A statistical perspective
on algorithmic leveraging. J. Mach. Learn. Res. 16
(2015), 861–911.
31. Mackey, L., Talwalkar, A., Jordan, M.I. Distributed
matrix completion and robust factorization. J. Mach.
Learn. Res. 16 (2015), 913–960.
32. Mahoney, M.W. Randomized Algorithms for
Matrices and Data. Foundations and Trends in
Machine Learning. NOW Publishers, Boston, 2011.
33. Mahoney, M.W., Drineas, P. CUR matrix
decompositions for improved data analysis. Proc.
Natl. Acad. Sci. USA 106 (2009), 697–702.
34. Meng, X., Mahoney, M.W. Low-distortion subspace
embeddings in input-sparsity time and applications to
robust linear regression. In Proceedings of the 45th
Annual ACM Symposium on Theory of Computing
(2013), 91–100.
35. Meng, X., Saunders, M.A., Mahoney, M.W. LSRN: A
parallel iterative solver for strongly over- or underdetermined systems. SIAM J. Sci. Comput. 36, 2
(2014), C95–C118.
36. Nelson, J., Huy, N.L. OSNAP: Faster numerical
linear algebra algorithms via sparser subspace
embeddings. In Proceedings of the 54th Annual
IEEE Symposium on Foundations of Computer
Science (2013), 117–126.
37. Oliveira, R.I. Sums of random Hermitian matrices and
an inequality by Rudelson. Electron. Commun. Prob.
15 (2010) 203–212.
38. Paschou, P., Ziv, E., Burchard, E.G., Choudhry, S.,
Rodriguez-Cintron, W., Mahoney, M.W., Drineas, P. PCA-correlated SNPs for structure identification in worldwide
human populations. PLoS Genet. 3 (2007), 1672–1686.
39. Rahimi, A., Recht, B. Random features for large-scale
kernel machines. In Annual Advances in Neural
Information Processing Systems 20: Proceedings of
the 2007 Conference, 2008.
40. Recht, B. A simpler approach to matrix completion.
J. Mach. Learn. Res. 12 (2011), 3413–3430.
41. Rokhlin, V., Szlam, A., Tygert, M. A randomized
algorithm for principal component analysis. SIAM J.
Matrix Anal. Appl. 31, 3 (2009), 1100–1124.
42. Rudelson, M., Vershynin, R. Sampling from large
matrices: an approach through geometric functional
analysis. J. ACM 54, 4 (2007), Article 21.
43. Sarlós, T. Improved approximation algorithms
for large matrices via random projections. In
Proceedings of the 47th Annual IEEE Symposium on
Foundations of Computer Science (2006), 143–152.
44. Smale, S. Some remarks on the foundations of
numerical analysis. SIAM Rev. 32, 2 (1990), 211–220.
45. Spielman, D.A., Srivastava, N. Graph sparsification by
effective resistances. SIAM J. Comput. 40, 6 (2011),
1913–1926.
46. Stigler, S.M. The History of Statistics: The
Measurement of Uncertainty before 1900. Harvard
University Press, Cambridge, 1986.
47. Tropp, J.A. User-friendly tail bounds for sums of random
matrices. Found. Comput. Math. 12, 4 (2012), 389–434.
48. Turing, A.M. Rounding-off errors in matrix processes.
Quart. J. Mech. Appl. Math. 1 (1948), 287–308.
49. von Neumann, J., Goldstine, H.H. Numerical inverting
of matrices of high order. Bull. Am. Math. Soc. 53
(1947), 1021–1099.
50. Wigner, E.P. Random matrices in physics. SIAM Rev.
9, 1 (1967), 1–23.
51. Woodruff, D.P. Sketching as a Tool for Numerical Linear
Algebra. Foundations and Trends in Theoretical
Computer Science. NOW Publishers, Boston, 2014.
52. Yang, J., Meng, X., Mahoney, M.W. Implementing
randomized matrix algorithms in parallel and distributed
environments. Proc. IEEE 104, 1 (2016), 58–92.
53. Yang, J., Rübel, O., Prabhat, Mahoney, M.W.,
Bowen, B.P. Identifying important ions and positions
in mass spectrometry imaging data using CUR matrix
decompositions. Anal. Chem. 87, 9 (2015), 4658–4666.
54. Yip, C.-W., Mahoney, M.W., Szalay, A.S., Csabai, I.,
Budavari, T., Wyse, R.F.G., Dobos, L. Objective
identification of informative wavelength regions
in galaxy spectra. Astron. J. 147, 110 (2014), 15.
Petros Drineas ([email protected]) is an associate
professor in the Department of Computer Science at
Rensselaer Polytechnic Institute, Troy, NY.
Michael W. Mahoney ([email protected])
is an associate professor in ICSI and in the Department
of Statistics at the University of California at Berkeley.
Copyright held by authors.
Publication rights licensed to ACM. $15.00.
research highlights
P. 92  Technical Perspective: Veritesting Tackles Path-Explosion Problem
By Koushik Sen

P. 93  Enhancing Symbolic Execution with Veritesting
By Thanassis Avgerinos, Alexandre Rebert, Sang Kil Cha, and David Brumley

P. 101  Technical Perspective: Computing with the Crowd
By Siddharth Suri

P. 102  AutoMan: A Platform for Integrating Human-Based and Digital Computation
By Daniel W. Barowy, Charlie Curtsinger, Emery D. Berger, and Andrew McGregor
research highlights
DOI:10.1145/2927922
Technical Perspective
Veritesting Tackles
Path-Explosion Problem
To view the accompanying paper,
visit doi.acm.org/10.1145/2927924
By Koushik Sen
Imagine you are working on a large
piece of software for a safety-critical
system, such as the braking system of
a car. How would you make sure the
car will not accelerate under any circumstance when the driver applies
the brake? How would you know that
someone other than the driver would
not be able to stop a moving car by exploiting a remote security vulnerability
in the software system? How would you
confirm the braking system will not
fail suddenly due to a fatal crash in the
software system?
Testing is the predominant technique used by the software industry to answer such questions and to make software systems reliable. Studies show that testing accounts for more than half of the total software development cost in industry. Although testing is a widely used and well-established technique for building reliable software, existing techniques for
testing are mostly ad hoc and ineffective—serious bugs are often exposed
post-deployment. Wouldn’t it be nice
if one could build a software system
that could exhaustively test any software and report all critical bugs in the
software to its developer?
In recent years, symbolic execution
has emerged as one such automated
technique to generate high-coverage
test suites. Such test suites could
find deep errors and security vulnerabilities in complex software applications. Symbolic execution analyzes
the source code or the object code of
a program to determine what inputs
would execute the different paths of
the program. The key idea behind
symbolic execution was introduced
almost 40 years ago. However, it has
only recently been made practical, as
a result of significant advances in program analysis and constraint-solving
techniques, and due to the invention
of dynamic symbolic execution (DSE)
or concolic testing, which combines
concrete and symbolic execution.
Since its introduction in 2005, DSE
and concolic testing have inspired the
development of several scalable symbolic execution tools such as DART,
CUTE, jCUTE, KLEE, JPF, SAGE, PEX,
CREST, BitBlaze, S2E, Jalangi, CATG,
Triton, CONBOL, and SymDroid. Such
tools have been used to find crashing inputs, to generate high-coverage
test-suites, and to expose security vulnerabilities. For example, Microsoft’s
SAGE has discovered one-third of all
bugs revealed during the development
of Windows 7.
Although modern symbolic execution tools have been successful in finding high-impact bugs and security vulnerabilities, it has been observed that
symbolic execution techniques do not
scale well to large realistic programs
because the number of feasible execution paths of a program often increases exponentially with the length of an
execution path. Therefore, most modern symbolic execution tools achieve
poor coverage when they are applied to
large programs. Most of the research
in symbolic execution nowadays is,
therefore, focusing on mitigating the
path-explosion problem.
To mitigate the path-explosion
problem, a number of techniques
have been proposed to merge symbolic execution paths at various program points. Symbolic path merging,
also known as static symbolic execution (SSE), enables carrying out symbolic execution of multiple paths simultaneously. However, this form of
path merging often leads to large and
complex formulas that are difficult to
solve. Moreover, path merging fails
to work for real-world programs that
perform system calls. Despite these
recent proposals for mitigating the
path explosion problem, the proposed
techniques are not effective enough to
handle large systems code.
The following work by Avgerinos et
al. is a landmark in further addressing
the path-explosion problem for real-
world software systems. The authors
have proposed an effective technique
called veritesting that addresses the
scalability limitations of path merging in symbolic execution. They have
implemented veritesting in MergePoint, a tool for automatically testing
all program binaries in a Linux distribution. A key attraction of MergePoint is that the tool can be applied
to any binary without any source information, re-compilation, preprocessing, or user setup. A broader
impact of this work is that users can
now apply symbolic execution to larger software systems and achieve better code coverage while finding deep
functional and security bugs.
Veritesting works by alternating
between dynamic symbolic execution and path merging or static symbolic execution. DSE helps to handle
program fragments that cannot be
handled by SSE, such as program
fragments making system calls and
indirect jumps. SSE, on the other
hand, helps to avoid repeated exploration of an exponential number of
paths in small program fragments
by summarizing their behavior as a
formula. What I find truly remarkable is that this clever combination of
DSE and SSE has enabled veritesting to scale to thousands of binaries
in a Linux distribution. The tool has
found more than 10,000 bugs in the
distribution and Debian maintainers
have already applied patches to 229
such bugs. These results and impact
on real-world software have demonstrated that symbolic execution has
come out of its infancy and has become a viable alternative for testing
real-world software systems without
user-intervention.
Koushik Sen ([email protected]) is an associate
professor in the Department of Electrical Engineering
and Computer Sciences at the University of California,
Berkeley.
Copyright held by author.
Enhancing Symbolic
Execution with Veritesting
DOI:10.1145/2927924
By Thanassis Avgerinos, Alexandre Rebert, Sang Kil Cha, and David Brumley
1. INTRODUCTION
Symbolic execution is a popular automatic approach for
testing software and finding bugs. Over the past decade,
numerous symbolic execution tools have appeared—both
in academia and industry—demonstrating the effectiveness
of the technique in finding crashing inputs, generating test
cases with high coverage, and exposing software vulnerabilities.5 Microsoft’s symbolic executor SAGE is responsible for
finding one-third of all bugs discovered during the development of Windows 7.12
Symbolic execution is attractive because of two salient
features. First, it generates real test cases; every bug report
is accompanied by a concrete input that reproduces the
problem (thus eliminating false reports). Second, symbolic
execution systematically checks each program path exactly
once—no work will be repeated as in other typical testing
techniques (e.g., random fuzzing).
Symbolic execution works by automatically translating
program fragments to logical formulas. The logical formulas
are satisfied by inputs that have a desired property, for example, they execute a specific path or violate safety. Thus, with
symbolic execution, finding crashing test cases effectively
reduces to finding satisfying variable assignments in logical formulas, a process typically automated by Satisfiability
Modulo Theories (SMT) solvers.9
At a high level, there are two main approaches for generating formulas: dynamic symbolic execution (DSE) and static
symbolic execution (SSE). DSE executes the analyzed program
fragment and generates formulas on a per-path basis. SSE
translates program fragments into formulas, where each
formula represents the desired property over any path
within the selected fragment. The path-based nature of DSE
introduces significant overhead when generating formulas,
but the formulas themselves are easy to solve. The statement-based nature of SSE has less overhead and produces more
succinct formulas that cover more paths, but the formulas
are harder to solve. Is there a way to get the best of both
worlds?
In this article, we present a new technique for generating
formulas called veritesting that alternates between SSE and
DSE. The alternation mitigates the difficulty of solving formulas, while alleviating the high overhead associated with
a path-based DSE approach. In addition, DSE systems replicate the path-based nature of concrete execution, allowing them to handle cases such as system calls and indirect
jumps where static approaches would need summaries or
additional analysis. Alternating allows veritesting to switch
to DSE-based methods when such cases are encountered.
We implemented veritesting in MergePoint, a system for
automatically checking all programs in a Linux distribution.
MergePoint operates on 32-bit Linux binaries and does not
require any source information (e.g., debugging symbols).
We have systematically used MergePoint to test and evaluate
veritesting on 33,248 binaries from Debian Linux. The binaries were collected by downloading all available packages from the Debian main repository and mining them for executable programs. We did not pick particular binaries or a dataset
that would highlight specific aspects of our system; instead
we focus on our system as experienced in the general case.
The large dataset allows us to explore questions with high
fidelity and with a smaller chance of per-program sample
bias. The binaries are exactly what runs on millions of systems throughout the world.
We demonstrate that MergePoint with veritesting beats
previous techniques in the three main metrics: bugs found,
node coverage, and path coverage. In particular, MergePoint
has found 11,687 distinct bugs in 4379 different programs.
Overall, MergePoint has generated over 15 billion SMT queries
and created over 200 million test cases. Out of the 1043 bugs
we have reported so far to the developers, 229 have been fixed.
Our main contributions are as follows. First, we propose
a new technique for symbolic execution called veritesting.
Second, we provide and study in depth the first system for
testing every binary in an OS distribution using symbolic
execution. Our experiments reduce the chance of per-program or per-dataset bias. We evaluate MergePoint with
and without veritesting and show that veritesting outperforms previous work on all three major metrics. Finally, we
improve open source software by finding over 10,000 bugs
and generating millions of test cases. Debian maintainers have already incorporated 229 patches due to our bug
reports. We have made our data available on our website.20
For more experiments and details we refer the reader to the
original paper.2
2. SYMBOLIC EXECUTION BACKGROUND
Symbolic execution14 is similar to normal program execution with one main twist: instead of using concrete input
values, symbolic execution uses variables (symbols). During
execution, all program values are expressed in terms of input
variables. To keep track of the currently executing path,
symbolic execution stores all conditions required to follow
the same path (e.g., assertions, conditions on branch statements, etc.) in a logical formula called the path predicate.
The original version of this paper was published in
the Proceedings of the 36th International Conference on
Software Engineering (Hyderabad, India, May 31–June 7,
2014). ACM, New York, NY, 1083–1094.
Algorithm 1: Dynamic Symbolic Execution Algorithm
Input: Initial program counter (entry point): pc0
       Instruction fetch & decode: instrFetchDecode
Data: State worklist: Worklist, path predicate: Π, variable state: ∆

 1  Function ExecuteInstruction(instr, pc, Π, ∆)
 2    switch instr do
 3      case var := exp                          // assignment
 4        ∆[var] ← exp
 5        return [(succ(pc), Π, ∆)]
 6      case assert(exp)                         // assertion
 7        return [(succ(pc), Π ∧ exp, ∆)]
 8      case if (exp) goto pc′                   // conditional jump
 9        // Regular DSE forks 2 states (red)
10        return [(pc′, Π ∧ exp, ∆), (succ(pc), Π ∧ ¬exp, ∆)]
 9        // Veritesting integration (green)
10        return Veritest(pc, Π, ∆)
11      case halt: return []                     // terminate

12  Worklist = [(pc0, true, {})]                 // initial worklist
13  while Worklist ≠ [] do
14    pc, Π, ∆ = removeOne(Worklist)
15    instr = instrFetchDecode(pc)
16    NewStates = ExecuteInstruction(instr, pc, Π, ∆)
17    Worklist = add(Worklist, NewStates)
Inputs that make the path predicate true are guaranteed to
follow the exact same execution path. If there is no input
satisfying the path predicate, the current execution path is
infeasible.
In the following sections, we give a brief overview of the
two main symbolic execution approaches: dynamic and SSE.
We refer the reader to symbolic execution surveys for more
details and examples.5, 21
2.1. Dynamic symbolic execution (DSE)
Algorithm 1 presents the core steps in DSE. The algorithm
operates on a representative low-level language with assignments, assertions and conditional jumps (simplified from the
original Avgerinos et al.2).
Similar to an interpreter, a symbolic executor consists
of a main instruction fetch-decode-execute loop (Lines
14–17). On each iteration, the removeOne function selects
the next state to execute from Worklist, decodes the instruction, executes it and inserts the new execution states in
Worklist. Each execution state is a triple (pc, Π, ∆) where
pc is the current program counter, Π is the path predicate
(the condition under which the current path will be executed), and ∆ is a dictionary that maps each variable to its
current value.
Unlike concrete interpreters, symbolic executors need
to maintain a list of execution states (not only one). The
reason is conditional branches. Line 10 (highlighted
in red) demonstrates why: the executed branch condition could be true or false—depending on the program
input—and the symbolic executor needs to execute
both paths in order to check correctness. The process of
generating two new execution states out of a single state
(one for the true branch and one for the false) is typically called “forking.” Due to forking, every branch encountered during execution doubles the number of states
that need to be analyzed, a problem known in DSE as path
(or state) explosion.5
Example. We now give a short example demonstrating
DSE in action. Consider the following program:
1 if (input_char == ’B’) {
2     bug();
3 }
Before execution starts DSE initializes the worklist with
a state pointing to the start of the program (Line 12): (1, true,
{}). After it fetches the conditional branch instruction, DSE
will have to fork two new states in ExecuteInstruction:
(2, input_char = ’B’, {}) for the taken branch and (3, input_
char ≠ ’B’, {}) for the non-taken. Generating a test case for
each execution path is straightforward: we send each path
predicate to an SMT solver and any satisfying assignment
will execute the same path, for example, input_char → ’B’ to
reach the buggy line of code. An unsatisfiable path predicate
means the selected path is infeasible.
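The final step, turning each path predicate into a concrete test case, can be sketched with an off-the-shelf SMT solver. The snippet below uses the Z3 Python bindings purely as an illustrative choice; MergePoint's own solver interface is not shown in this article.

from z3 import BitVec, Solver, sat

input_char = BitVec("input_char", 8)

# The two path predicates from the example above.
path_predicates = [input_char == ord("B"),   # taken branch: reaches bug()
                   input_char != ord("B")]   # non-taken branch

for predicate in path_predicates:
    solver = Solver()
    solver.add(predicate)
    if solver.check() == sat:
        model = solver.model()
        print("test case: input_char =", model[input_char])
    else:
        print("infeasible path")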
Advantages/Disadvantages. Forking executors and analyzing a single path at a time has benefits: the analysis code is
simple, solving the generated path predicates is typically fast
(e.g., in SAGE4 99% of all queries take less than 1 s) since we
only reason about a single path, and the concrete path-specific
state resolves several practical problems. For example, executors
can execute hard-to-model functionality concretely (e.g.,
system calls), side effects such as allocating memory in each
DSE path are reasoned about independently without extra
work, and loops are unrolled as the code executes. The disadvantage is path explosion: the number of executors can
grow exponentially in the number of branches. The path
explosion problem is the main motivation for our veritesting
algorithm (see Section 3).
2.2. Static symbolic execution (SSE)
SSE is a verification technique for representing a program as
a logical formula. Safety checks are encoded as logical assertions that will falsify the formula if safety is violated. Because
SSE checks programs, not paths, it is typically employed to
verify the absence of bugs. As we will see, veritesting repurposes SSE techniques for summarizing program fragments
instead of verifying complete programs.
Modern SSE algorithms summarize the effects of both
branches at path confluence points. In contrast, DSE traditionally forks off two executors at the same line, which
remain subsequently forever independent. Due to space, we
do not repeat complete SSE algorithms here, and refer the
reader to previous work.3, 15, 23
Advantages/Disadvantages. Unlike DSE, SSE does not
suffer from path explosion. All paths are encoded in a single formula that is then passed to the solver.a For acyclic programs, existing techniques allow generating compact formulas of size O(n^2),10,18 where n is the number of program statements. Despite these advantages over DSE, state-of-the-art tools still have trouble scaling to very large programs.13,16 Problems include the presence of loops (how many times should they be unrolled?), formula complexity (are the formulas solvable if we encode loops and recursion?), the absence of concrete state (what is the concrete environment the program is running in?), as well as unmodeled behavior (a kernel model is required to emulate system calls). Another hurdle is completeness: for the verifier to prove absence of bugs, all program paths must be checked.
a Note the solver may still have to reason internally about an exponential number of paths—finding a satisfying assignment to a logical formula is an NP-hard problem.
3. VERITESTING
DSE has proven to be effective in analyzing real-world programs.6,12 However, the path explosion problem can severely
reduce the effectiveness of the technique. For example, consider the following 7-line program that counts the occurrences of the character ’B’ in an input string:
1 int counter = 0, values = 0;
2 for (i = 0; i < 100; i++)
3   if (input[i] == ’B’) {
4     counter++;
5     values += 2;
6   }
7 if (counter == 75) bug();
The program above has 2^100 possible execution paths.
Each path must be analyzed separately by DSE, thus making full path coverage unattainable for practical purposes.
In contrast, two test cases suffice for obtaining full code
coverage: a string of 75 ‘B’s and a string with no ‘B’s.
However, finding such test cases in the 2^100 state space is challenging.b We ran the above program with several state-of-the-art symbolic executors, including KLEE,6 S2E,8
Mayhem,7 and Cloud9 with state merging.16 None of the
above systems was able to find the bug within a 1-h time
limit (they ran out of memory or kept running). Veritesting
allows us to find the bug and obtain full path coverage in
47 s on the same hardware.
Veritesting starts with DSE, but switches to an SSE-style approach when we encounter code that—similar
to the example above—does not contain system calls,
indirect jumps, or other statements that are difficult to
precisely reason about statically. Once in SSE mode, veritesting performs analysis on a dynamically recovered control flow graph (CFG) and identifies a core of statements
that are easy for SSE, and a frontier of hard-to-analyze
statements. The SSE algorithm summarizes the effects of
all paths through the easy nodes up to the hard frontier.
Veritesting then switches back to DSE to handle the cases
that are hard to treat statically.
b For example, approximately 2^78 paths reach the buggy line of code. The probability of finding one of those paths by random selection is approximately 2^78/2^100 = 2^−22.
In the rest of this section, we present the main algorithm
and the details of the technique.
3.1. The algorithm
In default mode, MergePoint behaves as a typical dynamic
symbolic executor. It starts exploration with a concrete seed
and explores paths in the neighborhood of the original
seed following a generational search strategy.12 MergePoint
does not always fork when it encounters a symbolic branch.
Instead, MergePoint intercepts the forking process—as
shown in Line 10 (highlighted in green) of algorithm 1—of
DSE and performs veritesting.
Veritesting consists of four main steps:
1. CFG Recovery. Obtains the CFG reachable from the
address of the symbolic branch (Section 3.2).
2. Transition Point Identification & Unrolling. Takes in a
CFG, and outputs candidate transition points and a
CFGe, an acyclic CFG with edges annotated with the
control flow conditions (Section 3.3). Transition points
indicate CFG locations with hard-to-model constructs
where DSE may continue.
3. SSE. Takes the acyclic CFGe and current execution state,
and uses SSE to build formulas that encompass all feasible paths in the CFGe. The output is a mapping from
CFGe nodes to SSE states (Section 3.4).
4. Switch to DSE. Given the transition points and SSE
states, returns the DSE executors to be forked (Section
3.5).
3.2. CFG recovery
The goal of the CFG recovery phase is to obtain a partial CFG
of the program, where the entry point is the current symbolic
branch. We now define the notion of underapproximate and
overapproximate CFG recovery.
A recovered CFG is an underapproximation if all edges
of the CFG represent feasible paths. A recovered CFG is an
overapproximation if all feasible paths in the program are
represented by edges in the CFG (statically recovering a
perfect—that is, non-approximate—CFG on binary code can
be non-trivial). A recovered CFG might be an underapproximation or an overapproximation, or even both in practice.
Veritesting was designed to handle both underapproximated and overapproximated CFGs without losing paths or
precision (see Section 3.4). MergePoint uses a customized
CFG recovery mechanism designed to stop recovery at function boundaries, system calls and unknown instructions.
The output of this step is a partial (possibly approximate)
intraprocedural CFG. Unresolved jump targets (e.g., ret,
call, etc.) are forwarded to a generic Exit node in the CFG.
Figure 1a shows the form of an example CFG after the recovery phase.
3.3. Transition point identification and unrolling
Once the CFG is obtained, MergePoint proceeds to identifying a set of transition points. Transition points define the
boundary of the SSE algorithm (where DSE will continue
exploration). Note that every possible execution path from
the entry of the CFG needs to end in a transition point (our
implementation uses domination analysis2).
For a fully recovered CFG, a single transition point may
be sufficient, for example, the bottom node in Figure 1a.
However, for CFGs with unresolved jumps or system calls,
any predecessor of the Exit node will be a possible transition
point (e.g., the ret node in Figure 1b). Transition points
represent the frontier of the visible CFG, which stops at unresolved jumps, function boundaries and system calls. The
number of transition points gives an upper-bound on the
number of executors that may be forked.
Unrolling Loops. Loop unrolling represents a challenge for
static verification tools. However, MergePoint is dynamic and
can concretely execute the CFG to identify how many times
each loop will execute. The number of concrete loop iterations determines the number of loop unrolls. MergePoint
also allows the user to extend loops beyond the concrete iteration limit, by providing a minimum number of unrolls.
To make the CFG acyclic, back edges are removed and forwarded to a newly created node for each loop, for example,
the “Incomplete Loop” node in Figure 1b, which is a new
transition point that will be explored if executing the loop
more times is feasible. In a final pass, the edges of the CFG are
annotated with the conditions required to follow the edge.
The end result of this step is a CFGe and a set of transition points. Figure 1b shows an example CFG—without
edge conditions—after transition point identification and
loop unrolling.
3.4. Static symbolic execution
Given the CFGe, MergePoint applies SSE to summarize the
execution of multiple paths. Previous work,3 first converted
the program to Gated Single Assignment (GSA)22 and then
performed symbolic execution. In MergePoint, we encode
SSE as a single pass dataflow analysis where GSA is computed
on the fly—more details can be found in the full paper.2
To illustrate the algorithm, we run SSE on the following
program:
if (x > 1) y = 42; else if (x < 42) y = 17;
Figure 1. Veritesting on a program fragment with loops and system
calls. (a) Recovered CFG. (b) CFG after transition point identification &
loop unrolling. Unreachable nodes are shaded.
Figure 2 shows the progress of the variable state as SSE iterates through the blocks. SSE starts from the entry of the
CFGe and executes basic blocks in topological order. SSE
uses conditional ite (if-then-else) expressions—ite is a ternary operator similar to ?: in C—to encode the behavior
of multiple paths. For example, every variable assignment
following the true branch after the condition (x > 1) in Figure
2 will be guarded as ite(x > 1, value, ⊥), where value denotes
the assigned value and ⊥ is a don’t care term. Thus, for the
edge from B3 to B6 in Figure 2, ∆ is updated to {y → ite
(x > 1, 42, ⊥)}.
When distinct paths (with distinct ∆’s) merge to the same
confluence point on the CFG, a merge operator is needed to
“combine” the side effects from all incoming edges. To do
so, we apply the following recursive merge operation M to
each symbolic value:
M(υ1, ⊥) = υ1; M(⊥, υ2) = υ2;
M(ite(e, υ1, υ2), ite(e, υ′1, υ′2)) = ite(e, M(υ1, υ′1), M(υ2, υ′2))
This way, at the last node of Figure 2, the value of y will be
M(ite(x > 1, 42, ⊥), ite(x > 1, ⊥, ite(x < 42, 17, y0) ) ) which is
merged to ite(x > 1, 42, ite(x < 42, 17, y0) ), capturing all possible paths.c Note that this transformation is inlining multiple statements into a single one using ite operators. Also,
note that values from unmerged paths (⊥ values) can be
immediately simplified, for example, ite(e, x, ⊥) = x. During
SSE, MergePoint keeps a mapping from each traversed
node to the corresponding variable state.
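The recursive merge operator can be sketched directly. Below is a minimal Python sketch with ⊥ represented as None and ite expressions as nested tuples; this representation is purely illustrative, is not MergePoint's internal one, and handles only the shapes that arise in the example above.

BOT = None                                   # the "don't care" term ⊥

def ite(cond, then, els):
    return ("ite", cond, then, els)

def merge(v1, v2):
    if v2 is BOT:                            # M(v1, ⊥) = v1
        return v1
    if v1 is BOT:                            # M(⊥, v2) = v2
        return v2
    _, e1, a1, b1 = v1                       # both values are ite terms
    _, e2, a2, b2 = v2                       # sharing the same guard
    assert e1 == e2
    return ite(e1, merge(a1, a2), merge(b1, b2))

# The two incoming values of y at node B6 of Figure 2:
left  = ite("x > 1", 42, BOT)
right = ite("x > 1", BOT, ite("x < 42", 17, "y0"))
print(merge(left, right))
# ('ite', 'x > 1', 42, ('ite', 'x < 42', 17, 'y0'))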
Handling Overapproximated CFGs. At any point during SSE, the path predicate is computed as the conjunction of the DSE predicate ΠDSE and the SSE predicate computed by substitution: ΠSSE. MergePoint uses the resulting predicate to perform path pruning, offering two advantages: any infeasible edges introduced by CFG recovery are eliminated, and our formulas only consider feasible paths (e.g., the shaded nodes in Figure 1b can be ignored).
c To efficiently handle deeply nested and potentially duplicated expressions, MergePoint utilizes hash-consing at the expression level.2
Figure 2. SSE running on an unrolled CFG—the variable state (∆) is shown within brackets. [The figure traces the example through blocks B1: ∆ = {y → y0}, if (x > 1); B2: if (x < 42); B3: y = 42, ∆ = {y → 42}; B4: y = 17, ∆ = {y → 17}; B5: ∆ = {y → ite(x > 1, ⊥, ite(x < 42, 17, y0))}; and B6: ∆ = {y → ite(x > 1, 42, ite(x < 42, 17, y0))}.]
3.5. Switch to DSE
After the SSE pass is complete, we check which states need
to be forked. We first gather transition points and check
whether they were reached by SSE. For the set of distinct,
reachable transition points, MergePoint will fork a new symbolic state in a final step, where a DSE executor is created
(pc, Π, ∆) using the state of each transition point.
Generating Test Cases. Though MergePoint can generate an
input for each covered path, that would result in an exponential number of test cases in the size of the CFGe. By default, we
only output one test per CFG node explored by SSE. (Note that
for branch coverage the algorithm can be modified to generate
a test case for every edge of the CFG.) The number of test cases
can alternatively be minimized by generating test cases only
for nodes that have not been covered by previous test cases.
Underapproximated CFGs. Last, before proceeding with
DSE, veritesting checks whether we missed any paths due to
the underapproximated CFG. To do so, veritesting queries the
negation of the path predicate at the Exit node (the disjunction of the path predicates of forked states). If the query is satisfiable, an extra state is forked to explore missed paths.
4. EVALUATION
In this section we evaluate our techniques using multiple
benchmarks with respect to three main questions:
1. Does Veritesting find more bugs than previous
approaches? We show that MergePoint with veritesting finds twice as many bugs as without.
2. Does Veritesting improve node coverage? We show MergePoint with veritesting improves node coverage over DSE.
3. Does Veritesting improve path coverage? Previous work
showed dynamic state merging outperforms vanilla
DSE.16 We show MergePoint with veritesting improves
path coverage and outperforms both approaches.
We detail our large-scale experiment on 33,248 programs
from Debian Linux. MergePoint generated billions of SMT
queries, hundreds of millions of test cases, millions of
crashes, and found 11,687 distinct bugs.
Overall, our results show MergePoint with veritesting
improves performance on all three metrics. We also show
that MergePoint is effective at checking a large number of
programs. Before proceeding to the evaluation, we present
our setup and benchmarks sets. All experimental data from
MergePoint are publicly available online.20
Experiment Setup. We ran all distributed MergePoint experiments on a private cluster consisting of 100 virtual nodes running Debian Squeeze on a single Intel 2.68 GHz Xeon core with
1 GB of RAM. All comparison tests against previous systems were
run on a single node with an Intel Core i7 CPU and 16 GB of RAM since
these systems could not run on our distributed infrastructure.
We created three benchmarks: coreutils, BIN, and Debian.
Coreutils and BIN were compiled so that coverage information could be collected via gcov. The Debian benchmark
consists of binaries used by millions of users worldwide.
Benchmark 1: GNU coreutils (86 programs) We use the
coreutils benchmark to compare to previous work since:
(1) the coreutils suite was originally used by KLEE6 and other
researchers6, 7, 16 to evaluate their systems, and (2) configuration parameters for these programs used by other tools
are publicly available.6 Numbers reported with respect to
coreutils do not include library code to remain consistent
with compared work. Unless otherwise specified, we ran
each program in this suite for 1 h.
Benchmark 2: The BIN suite (1023 programs). We obtained
all the binaries located under the /bin, /usr/bin, and /sbin
directories from a default Debian Squeeze installation.d We
kept binaries reading from /dev/stdin, or from a file specified
on the command line. In a final processing step, we filtered
out programs that require user interaction (e.g., GUIs). BIN
consists of 1023 binary programs, and comprises 2,181,735
executable lines of source code (as reported by gcov). The BIN
benchmark includes library code packaged with the application
in the dataset, making coverage measurements more conservative than coreutils. For example, an application may include
an entire library, but only one function is reachable from the
application. We nonetheless include all uncovered lines from
the library source file in our coverage computation. Unless otherwise specified, we ran each program in this suite for 30 min.
Benchmark 3: Debian (33,248 programs). This benchmark consists of all binaries from Debian Wheezy and Sid.
We extracted binaries and shared libraries from every package available from the main Debian repository. We downloaded 23,944 binaries from Debian Wheezy, and 27,564
binaries from Debian Sid. After discarding duplicate binaries in the two distributions, we are left with a benchmark
comprising 33,248 binaries. This represents an order of
magnitude more applications than have been tested by prior
symbolic execution research. We analyzed each application
for less than 15 min per experiment.
4.1. Bug finding
Table 1 shows the number of bugs found by MergePoint with
and without veritesting. Overall, for BIN, veritesting finds 2× more bugs than DSE alone. Veritesting finds 63 (83%) of the
bugs found without veritesting, as well as 85 additional distinct bugs that traditional DSE could not detect.
Veritesting also found two previously unknown crashes
in coreutils, even though these applications have been thoroughly tested with symbolic execution.6, 7, 16 Further investigation showed that the coreutils crashes originate from a
library bug that had been undetected for 9 years. The bug is in
the time zone parser of the GNU portability library Gnulib,
which dynamically deallocates a statically allocated memory
buffer. It can be triggered by running touch -d ‘TZ=“”” ’,
or date -d ‘TZ=“”” ’. Furthermore, Gnulib is used by several popular projects, and we have confirmed that the bug affects other programs, for example, find, patch, tar.
d What better source of benchmark programs than the ones you use every day?
Table 1. Veritesting finds 2× more bugs.

              Veritesting          DSE
  Coreutils   2 bugs/2 progs       0/0
  BIN         148 bugs/69 progs    76 bugs/49 progs
4.2. Node coverage
We evaluated MergePoint both with and without Veritesting on node coverage. Table 2 shows our overall results.
Veritesting improves node coverage on average in all cases.
Note that any positive increase in coverage is important.
In particular, Kuznetsov et al. showed both dynamic state
merging and SSE reduced node coverage when compared to
vanilla DSE (Figure 8 in Ref.16).
Figures 3 and 4 break down the improvement per program. For coreutils, enabling veritesting decreased coverage
in only three programs (md5sum, printf, and pr). Manual
investigation of these programs showed that veritesting generated much harder formulas, and spent more than 90% of
its time in the SMT solver, resulting in timeouts. In Figure 4
for BIN, we omit programs where node coverage was the
same for readability. Overall, the BIN performance improved
for 446 programs and decreased for 206.
Figure 5 shows the average coverage over time achieved
by MergePoint with and without veritesting for the BIN suite.
After 30 min, MergePoint without veritesting reached 34.45%
code coverage. Veritesting achieved the same coverage in
less than half the original time (12 min 48 s). Veritesting’s
coverage improvement becomes more substantial as analysis time goes on. Veritesting achieved higher coverage
velocity, that is, the rate at which new coverage is obtained, than standard symbolic execution. Over a longer period of time, the difference in velocity means that the coverage difference between the two techniques is likely to increase further, showing that the longer MergePoint runs, the more essential veritesting becomes for high code coverage.

Table 2. Veritesting improves node coverage.

              Veritesting (%)   DSE (%)   Difference (%)
  Coreutils   75.27             63.62     +11.65
  BIN         40.02             34.71     +5.31

Figure 3. Code coverage difference on coreutils before and after veritesting. [Plot of coverage difference versus programs.]
The above tests demonstrate the improvements of veritesting for MergePoint. We also ran both S2E and MergePoint
(with veritesting) on coreutils using the same configuration
for 1 h on each utility, excluding 11 programs
where S2E emits assertion errors. Figure 6 compares the
increase in coverage obtained by MergePoint with veritesting
over S2E. MergePoint achieved 27% more code coverage on
average than S2E. We investigated programs where S2E outperforms MergePoint. For instance, on pinky—the main
outlier in the distribution—S2E achieves 50% more coverage.
The main reason for this difference is that pinky uses a system call not handled by the current MergePoint implementation (netlink socket).
4.3. Path coverage
We evaluated the path coverage of MergePoint both with and
without veritesting using two metrics: time to complete exploration and multiplicity.
Time to complete exploration. This metric reports the
amount of time required to completely explore a program,
in those cases where exploration finished.
The number of paths checked by an exhaustive DSE run
is also the total number of paths possible. In such cases we
can measure (a) whether veritesting also completed, and (b)
if so, how long it took relative to DSE. MergePoint without
veritesting was able to exhaust all paths for 46 programs.
MergePoint with veritesting completes all paths 73% faster
than without veritesting. This result shows that veritesting is faster when reaching the same end goal.
Figure 5. Coverage over time (BIN suite). [Plot omitted: code coverage (%) vs. time (s), with and without veritesting.]
Figure 4. Code coverage difference on BIN before and after veritesting, where it made a difference. [Bar chart omitted: coverage difference (%) per program.]

Figure 6. Code coverage difference on coreutils obtained by MergePoint versus S2E. [Bar chart omitted: coverage difference (%) per program.]
Multiplicity. Multiplicity was proposed by Kuznetsov et al.16
as a metric correlated with path coverage. The initial multiplicity of a state is 1. When a state forks, both children
inherit the state multiplicity. When combining two states,
the multiplicity of the resulting state is the sum of their
multiplicities. A higher multiplicity indicates higher path
coverage.
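To make this bookkeeping concrete, the following is a minimal sketch of the multiplicity rule just described; it is an illustration with a simplified state type, not MergePoint's implementation.

// Sketch of multiplicity bookkeeping: every state starts at 1, forked
// children inherit the parent's multiplicity, and merging sums them.
final case class ExecState(multiplicity: BigInt = BigInt(1)) {
  def fork: (ExecState, ExecState) = (copy(), copy())
  def merge(other: ExecState): ExecState =
    ExecState(multiplicity + other.multiplicity)
}

object MultiplicityDemo extends App {
  val (a, b) = ExecState().fork      // both children have multiplicity 1
  val merged = a.merge(b)            // multiplicity 2
  val (c, d) = merged.fork           // both have multiplicity 2
  println(c.merge(d).multiplicity)   // prints 4
}

Because each fork-and-merge of a diamond doubles a state's multiplicity, deeply merged runs quickly reach the very large values reported below.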
We also evaluated the multiplicity for veritesting. Figure 7
shows the state multiplicity probability distribution function
for BIN. The average multiplicity over all programs was
1.4 × 10^290 and the median was 1.8 × 10^12 (recall, higher is better). The distribution resembles a lognormal with a spike for programs with multiplicity of 4096 (2^12). The multiplicity average and median for coreutils were 1.4 × 10^199 and 4.4 × 10^11,
respectively. Multiplicity had high variance; thus the median
is likely a better performance estimator.
4.4. Checking Debian
In this section, we evaluate veritesting’s bug finding ability
on every program available in Debian Wheezy and Sid. We
show that veritesting enables large-scale bug finding.
Since we test 33,248 binaries, any type of per-program
manual labor is impractical. We used a single input specification for our experiments: -sym-arg 1 10 -sym-arg 2
2 -sym-arg 3 2 -sym-anon-file 24 -sym-stdin
24 (3 symbolic arguments up to 10, 2, and 2 bytes, respectively, and symbolic files/stdin up to 24 bytes). MergePoint
encountered at least one symbolic branch in 23,731 binaries. We analyzed Wheezy binaries once, and Sid binaries
twice (one experiment with a 24-byte symbolic file, the other
with 2100 bytes to find buffer overflows). Including data processing, the experiments took 18 CPU-months.
Figure 7. Multiplicity distribution (BIN suite). [Histogram omitted: number of programs vs. multiplicity, log scale from 2^1 to 2^1024.]
Our overall results are shown in Table 3. Veritesting found
11,687 distinct bugs that crash programs. The bugs appear
in 4379 of the 33,248 programs. Veritesting also finds bugs
that are potential security threats. Two hundred and twenty-four crashes have a corrupt stack, that is, a saved instruction
pointer has been overwritten by user input. As an interesting data point, it would have cost $0.28 per unique crash had
we run our experiments on the Amazon Elastic Compute
Cloud, assuming that our cluster nodes are equivalent to
large instances.
The volume of bugs makes it difficult to report all bugs in
a usable manner. Note that each bug report includes a crashing test case, so reproducing the bug is easy. Instead, practical problems such as identifying the correct developer and
ensuring responsible disclosure of potential vulnerabilities
dominate our time. As of this writing, we have reported 1043
crashes in total.19 Not a single report was marked as unreproducible on the Debian bug tracking system. Two hundred and twenty-nine bugs have already been fixed in the
Debian repositories, demonstrating the real-world impact
of our work. Additionally, the patches gave package maintainers an opportunity to harden at least 29 programs by enabling modern defenses like stack canaries and DEP.
4.5. Discussion
Our experiments so far show that veritesting can effectively
increase multiplicity, achieve higher code coverage, and find
more bugs. In this section, we discuss why it works well
according to our collected data.
Each run takes longer with veritesting because multi-path
SMT formulas tend to be harder. The coverage improvement
demonstrates that additional SMT cost is amortized over the
increased number of paths represented in each run. At its
core, veritesting is pushing the SMT engine harder instead
of brute-forcing paths by forking new DSE executors. This
result confirms that the benefits of veritesting outweigh its
cost. The distribution of path times (Figure 8b) shows that
the majority (56%) of paths explored take less than 1 s
for standard symbolic execution. With veritesting, the fast
paths are fewer (21%), and we get more timeouts (6.4% vs.
Figure 8. MergePoint performance before and after veritesting for BIN: (a) performance breakdown for each component; (b) analysis time distribution.

(a)
Component             DSE (%)   Veritesting (%)
Instrumentation       40.01     16.95
SMT solver            19.23     63.16
Symbolic execution    39.76     19.89

(b) [Histogram omitted: percentage of analyses vs. per-path analysis time (s), with and without veritesting.]
Table 3. Overall numbers for checking Debian.

Total programs           33,248
Total SMT queries        15,914,407,892
Queries hitting cache    12,307,311,404
Symbolic instrs          71,025,540,812
Run time                 235,623,757 s
Symb exec time           125,412,247 s
SAT time                 40,411,781 s
Model gen time           30,665,881 s
# test cases             199,685,594
# crashes                2,365,154
# unique bugs            11,687
# reported bugs          1043
# fixed bugs             229
1.2%). The same differences are also reflected in the component breakdown. With veritesting, most of the time (63%)
is spent in the solver, while with standard DSE most of the
time (60%) is spent re-executing similar paths that could be
merged and explored in a single execution.
Of course there is no free lunch, and some programs do
perform worse. We emphasize that on average over a fairly
large dataset our results indicate the tradeoff is beneficial.
5. RELATED WORK
Symbolic execution was discovered in 1975,14 with the volume
of academic research and commercial systems exploding in
the last decade. Notable symbolic executors include SAGE
and KLEE. SAGE4 is responsible for finding one third of all
bugs discovered by file fuzzing during the development of
Windows 7.4 KLEE6 was the first tool to show that symbolic
execution can generate test cases that achieve high coverage
on real programs by demonstrating it on the UNIX utilities.
There is a multitude of symbolic execution systems—for
more details, we refer the reader to recent surveys.5, 21
Merging execution paths is not new. Koelbl and Pixley15 pioneered path merging in SSE. Concurrently and independently,
Xie and Aiken23 developed Saturn, a verification tool capable of
encoding multiple paths before converting the problem to
SAT. Hansen et al.13 follow an approach similar to Koelbl et al.
at the binary level. Babic and Hu3 improved their static algorithm to produce smaller and faster-to-solve formulas by leveraging GSA.22 The static portion of our veritesting algorithm
is built on top of their ideas. In our approach, we alternate
between SSE and DSE. Our approach amplifies the effect of
DSE and takes advantage of the strengths of both techniques.
The efficiency of the static algorithms mentioned above
typically stems from various types of if-conversion,1 a technique for converting code with branches into predicated
straight-line statements. The technique is also known as
φ-folding,17 a compiler optimization technique that collapses
simple diamond-shaped structures in the CFG. Godefroid11
introduced function summaries to test code compositionally.
The main idea is to record the output of an analyzed function,
and reuse it whenever the function is called again. Veritesting
generates context-sensitive on-demand summaries of code
fragments as the program executes—extending to compositional summaries is possible future work.
6. CONCLUSION
In this article we proposed MergePoint and veritesting, a
new technique to enhance symbolic execution with verification-based algorithms. We evaluated MergePoint on 1023
programs and showed that veritesting increases the number of bugs found, node coverage, and path coverage. We
showed that veritesting enables large-scale bug finding by
testing 33,248 Debian binaries, and finding 11,687 bugs.
Our results have had real-world impact, with 229 bug fixes
already present in the latest version of Debian.
Acknowledgments
We would like to thank Samantha Gottlieb, Tiffany Bao, and
our anonymous reviewers for their comments and suggestions. We also thank Mitch Franzos and PDL for the support
they provided during our experiments. This research was
supported in part by grants from DARPA and the NSF, as well
as the Prabhu and Poonam Goel Fellowship.
References
1. Allen, J.R., Kennedy, K., Porterfield, C., Warren, J. Conversion of control dependence to data dependence. In Proceedings of the 10th ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages (Austin, TX, 1983). ACM Press, New York, NY, 177–189.
2. Avgerinos, T., Rebert, A., Cha, S.K., Brumley, D. Enhancing symbolic execution with veritesting. In Proceedings of the 36th International Conference on Software Engineering, ICSE 2014 (Hyderabad, India, 2014). ACM, New York, NY, 1083–1094. DOI: 10.1145/2568225.2568293.
3. Babic, D., Hu, A.J. Calysto: Scalable and precise extended static checking. In Proceedings of the 30th International Conference on Software Engineering (Leipzig, Germany, 2008). ACM, New York, NY, 211–220.
4. Bounimova, E., Godefroid, P., Molnar, D. Billions and billions of constraints: Whitebox fuzz testing in production. In Proceedings of the 35th IEEE International Conference on Software Engineering (San Francisco, CA, 2013). IEEE Press, Piscataway, NJ, 122–131.
5. Cadar, C., Sen, K. Symbolic execution for software testing: Three decades later. Commun. ACM 56, 2 (2013), 82–90.
6. Cadar, C., Dunbar, D., Engler, D. KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs. In Proceedings of the 8th USENIX Symposium on Operating System Design and Implementation (San Diego, CA, 2008). USENIX Association, Berkeley, CA, 209–224.
7. Cha, S.K., Avgerinos, T., Rebert, A., Brumley, D. Unleashing Mayhem on binary code. In Proceedings of the 33rd IEEE Symposium on Security and Privacy (2012). IEEE Computer Society, Washington, DC, 380–394.
8. Chipounov, V., Kuznetsov, V., Candea, G. S2E: A platform for in vivo multi-path analysis of software systems. In Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems (Newport Beach, CA, 2011). ACM, New York, NY, 265–278.
9. de Moura, L., Bjørner, N. Satisfiability modulo theories: Introduction and applications. Commun. ACM 54, 9 (Sept. 2011), 69. DOI: 10.1145/1995376.1995394.
10. Flanagan, C., Saxe, J. Avoiding exponential explosion: Generating compact verification conditions. In Proceedings of the 28th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (London, United Kingdom, 2001). ACM, New York, NY, 193–205.
11. Godefroid, P. Compositional dynamic test generation. In Proceedings of the 34th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (Nice, France, 2007). ACM, New York, NY, 47–54.
12. Godefroid, P., Levin, M.Y., Molnar, D. SAGE: Whitebox fuzzing for security testing. Commun. ACM 55, 3 (2012), 40–44.
13. Hansen, T., Schachte, P., Søndergaard, H. State joining and splitting for the symbolic execution of binaries. Runtime Verif. (2009), 76–92.
14. King, J.C. Symbolic execution and program testing. Commun. ACM 19, 7 (1976), 385–394.
15. Koelbl, A., Pixley, C. Constructing efficient formal models from high-level descriptions using symbolic simulation. Int. J. Parallel Program. 33, 6 (Dec. 2005), 645–666.
16. Kuznetsov, V., Kinder, J., Bucur, S., Candea, G. Efficient state merging in symbolic execution. In Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation (Beijing, China, 2012). ACM, New York, NY, 193–204.
17. Lattner, C., Adve, V. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-Directed and Runtime Optimization (Palo Alto, CA, 2004). IEEE Computer Society, Washington, DC, 75–86.
18. Leino, K.R.M. Efficient weakest preconditions. Inform. Process. Lett. 93, 6 (2005), 281–288.
19. Mayhem. 1.2K Crashes in Debian, 2013. http://lists.debian.org/debian-devel/2013/06/msg00720.html.
20. Mayhem. Open Source Statistics & Analysis, 2013. http://www.forallsecure.com/summaries.
21. Schwartz, E.J., Avgerinos, T., Brumley, D. All you ever wanted to know about dynamic taint analysis and forward symbolic execution (but might have been afraid to ask). In Proceedings of the 31st IEEE Symposium on Security and Privacy (2010). IEEE Computer Society, Washington, DC, 317–331.
22. Tu, P., Padua, D. Efficient building and placing of gating functions. In Proceedings of the 16th ACM Conference on Programming Language Design and Implementation (La Jolla, CA, 1995). ACM, New York, NY, 47–55.
23. Xie, Y., Aiken, A. Scalable error detection using Boolean satisfiability. In Proceedings of the 32nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (Long Beach, CA, 2005). ACM, New York, NY, 351–363.
Thanassis Avgerinos, Alexandre Rebert, and David Brumley ({thanassis, alex}@forallsecure.com), ForAllSecure, Inc., Pittsburgh, PA.

Thanassis Avgerinos, Alexandre Rebert, Sang Kil Cha, and David Brumley ({sangkilc, dbrumley}@cmu.edu), Carnegie Mellon University, Pittsburgh, PA.
© 2016 ACM 0001-0782/16/06 $15.00
DOI:10.1145/2927926
Technical Perspective
Computing with the Crowd
By Siddharth Suri
To view the accompanying paper, visit doi.acm.org/10.1145/2927928
COMPUTER SCIENCE IS primarily focused
on computation using microprocessors or CPUs. However, the recent rise
in the popularity of crowdsourcing
platforms, like Amazon’s Mechanical
Turk, provides another computational device—the crowd. Crowdsourcing
is the act of outsourcing a job to an
undefined group of people, known
as the crowd, through an open call.3
Crowdsourcing platforms are online
labor markets where employers can
post jobs and workers can do jobs
for pay, but they can also be viewed
as distributed computational systems where the workers are the CPUs
and will perform computations for
pay. In other words, crowdsourcing
platforms provide a way to execute
computation with humans. In a traditional computational system when
a programmer wants to compute
something, they interact with a CPU
through an API defined by an operating system. But in a crowdsourcing
environment, when a programmer
wants to compute something, they
interact with a human through an API
defined by a crowdsourcing platform.
Why might one want to do computation with humans? There are a
variety of problems that are easy for
humans but difficult for machines.
Humans have pattern-matching skills
and linguistic-recognition skills that
machines have been unable to match
as of yet. For example, FoldIt1 is a system where people search for the natural configuration of proteins and their
results often outperform solutions
computed using only machines. Conversely, there are problems that are
easy for machines to solve but difficult
for humans. Machines excel at computation on massive datasets since they
can do the same operations repeatedly without getting tired or hungry. This
brings up the natural question: What
kinds of problems can be solved with
both human and machine computation that neither could do alone?
Systems like AutoMan, described
in the following paper by Barowy et
al., provide the first steps toward answering this question. AutoMan is a
domain-specific programming language that provides an abstraction
layer on top of the crowd. It allows
the programmer to interleave the expression of computation using both
humans and machines in the same
program. In an AutoMan program,
one function could be executed by a
CPU and the next could be executed
by humans.
This new type of computation
brings new types of complexity, which
AutoMan is designed to manage.
Most of this complexity stems from
the fact that unlike CPUs, humans
have agency. They make decisions;
they have needs, wants, and biases.
Humans can choose what tasks to do,
when to quit, what is and isn’t worth
their time, and when to communicate
with another human and what about.
CPUs, on the other hand, always execute whatever instructions they are
given. Much of the design and implementation of AutoMan addresses this
key difference between humans and
machines. For example, AutoMan
has extensive functionality for quality
control on the output of the workers.
It also has functionality to discover
the price that will be enough to incentivize workers to do the given task
and to reduce collusion among workers. Computation with CPUs does
not require any of this functionality.
AutoMan also addresses the natural
difference in speed between human
and machine computation by allowing eager evaluation of the machine
commands and only blocking on the
humans when necessary.
Being able to express human computation and interleave human and
machine computation opens up interesting new research directions in
human computation and organizational dynamics. In the nascent field
of human computation, since we can
now express human computation in a
programming language, we can next
develop a model of human computation analogous to the PRAM.2 This
would, in turn, allow us to develop a
theory of complexity for human computation to help us understand what
problems are easy and difficult for humans to solve. Developing these theories might help us scale up AutoMan,
which is currently designed to solve
microtasks, in terms of complexity to
solve bigger tasks and workflows.
Taking a broader and more interdisciplinary perspective, one can view
a company as a computational device
that combines the human computation of its employees with the machine computation of the company’s
computers. A better theoretical and
empirical understanding of human
computation could allow the field of
computer science to inform how best
to architect and organize companies
for greater accuracy and efficiency.
Whether or not AutoMan proves revolutionary as a programming language,
it is important as an idea because it
provides a “computational lens”4 on
the science of crowdsourcing, human
computation, and the study of group
problem solving.

References
1. Cooper, S. et al. Predicting protein structures with a multiplayer online game. Nature 466 (Aug. 2010), 756–760.
2. Fortune, S. and Wyllie, J. Parallelism in random access machines. In Proceedings of the 10th Annual Symposium on Theory of Computing (1978). ACM, 114–118.
3. Howe, J. The rise of crowdsourcing. Wired (June 1, 2006).
4. Karp, R.M. Understanding science through the computational lens. J. Computer Science and Technology 26, 4 (July 2011), 569–577.
Siddharth Suri ([email protected]) is Senior
Researcher at—and one of the founding members of—
Microsoft Research in New York City.
Copyright held by author.
AutoMan: A Platform for Integrating Human-Based and Digital Computation
DOI:10.1145/2927928
By Daniel W. Barowy, Charlie Curtsinger, Emery D. Berger, and Andrew McGregor
Abstract
Humans can perform many tasks with ease that remain difficult or impossible for computers. Crowdsourcing platforms
like Amazon Mechanical Turk make it possible to harness
human-based computational power at an unprecedented
scale, but their utility as a general-purpose computational
platform remains limited. The lack of complete automation makes it difficult to orchestrate complex or interrelated
tasks. Recruiting more human workers to reduce latency
costs real money, and jobs must be monitored and rescheduled when workers fail to complete their tasks. Furthermore,
it is often difficult to predict the length of time and payment
that should be budgeted for a given task. Finally, the results of
human-based computations are not necessarily reliable, both
because human skills and accuracy vary widely, and because
workers have a financial incentive to minimize their effort.
We introduce AutoMan, the first fully automatic crowd-programming system. AutoMan integrates human-based computations into a standard programming language as ordinary
function calls that can be intermixed freely with traditional
functions. This abstraction lets AutoMan programmers
focus on their programming logic. An AutoMan program
specifies a confidence level for the overall computation and
a budget. The AutoMan runtime system then transparently
manages all details necessary for scheduling, pricing, and
quality control. AutoMan automatically schedules human
tasks for each computation until it achieves the desired confidence level; monitors, reprices, and restarts human tasks as
necessary; and maximizes parallelism across human workers
while staying under budget.
1. INTRODUCTION
Humans perform many tasks with ease that remain difficult or
impossible for computers. For example, humans are far better
than computers at performing tasks like vision, motion planning, and natural language understanding.16, 18 Many researchers expect these “AI-complete” tasks to remain beyond the reach
of computers for the foreseeable future.19 Harnessing human-based computation in general and at scale faces the following
challenges:
Determination of pay and time for tasks. Employers must
decide the payment and time allotted before posting tasks. It is
both difficult and important to choose these correctly since workers will not accept tasks with too-short deadlines or too little pay.
Scheduling complexities. Employers must manage the
tradeoff between latency (humans are relatively slow) and
cost (more workers means more money). Because workers
may fail to complete their tasks in the allotted time, jobs
need to be tracked and reposted as necessary.
Low quality responses. Human-based computations
always need to be checked: worker skills and accuracy vary
widely, and they have a financial incentive to minimize their
effort. Manual checking does not scale, and majority voting
is neither necessary nor sufficient. In some cases, majority
vote is too conservative, and in other cases, it is likely that
workers will agree by chance.
Contributions
We introduce AutoMan, a programming system that integrates human-based and digital computation. AutoMan
addresses the challenges of harnessing human-based computation at scale:
Transparent integration. AutoMan abstracts human-based
computation as ordinary function calls, freeing the programmer from scheduling, budgeting, and quality control concerns
(Section 3).
Automatic scheduling and budgeting. The AutoMan runtime system schedules tasks to maximize parallelism across
human workers while staying under budget. AutoMan tracks
job progress, reschedules, and reprices failed tasks as necessary (Section 4).
Automatic quality control. The AutoMan runtime system
manages quality control automatically. AutoMan creates
enough human tasks for each computation to achieve the
confidence level specified by the programmer (Section 5).
2. BACKGROUND
Since crowdsourcing is a novel application domain for programming language research, we summarize the necessary
background on crowdsourcing platforms. We focus on
Amazon Mechanical Turka (MTurk), but other crowdsourcing platforms are similar. MTurk acts as an intermediary
between requesters and workers for short-term tasks.
a Amazon Mechanical Turk is hosted at http://mturk.com.
The original version of this paper was published in the Proceedings of OOPSLA 2012.
Human intelligence task. In MTurk parlance, tasks are known as human intelligence tasks (HITs). Each HIT is represented as a question form, composed of any number
of questions and associated metadata such as a title, description, and search keywords. Questions can be either freetext questions, where workers provide a free-form textual
response, or multiple-choice questions, where workers make
one or more selections from a set of options. Most HITs on
MTurk are for relatively simple tasks, such as “does this
image match this product?” Compensation is generally low
(usually a few cents) since employers expect that work to be
completed quickly (on the order of seconds).
Requesting work. Requesters can create HITs using
either MTurk’s website or programmatically, using an API.
Specifying a number of assignments greater than one allows
multiple unique workers to complete the same task, parallelizing HITs. Distinct HITs with similar qualities can also
be grouped to make it easy for workers to find similar work.
Performing work. Workers may choose any available task,
subject to qualification requirements (see below). When a
worker selects a HIT, she is granted a time-limited reservation
for that particular piece of work such that no other worker
can accept it.
HIT expiration. HITs have two timeout parameters: the
amount of time that a HIT remains visible on MTurk, known
as the lifetime of a HIT, and the amount of time that a worker
has to complete an assignment once it is granted, known as the
duration of an assignment. If a worker exceeds the assignment’s
duration without submitting completed work, the reservation is
cancelled, and the HIT becomes available to other workers. If a
HIT reaches the end of its lifetime without its assignments having been completed, the HIT expires and is made unavailable.
Requesters: Accepting or rejecting work. Once a worker
submits a completed assignment, the requester may then
accept or reject the completed work. Acceptance indicates
that the completed work is satisfactory, at which point the
worker is paid. Rejection withholds payment. The requester
may provide a textual justification for the rejection.
Worker quality. The key challenge in automating work
in MTurk is attracting good workers and discouraging bad
workers from participating. MTurk provides no mechanism for requesters to seek out specific workers (aside from
emails). Instead, MTurk provides a qualification mechanism
that limits which workers may participate. A common qualification is that workers must have an overall assignment-acceptance rate of 90%.
Given the wide variation in tasks on MTurk, overall
worker accuracy is of limited utility. For example, a worker
may be skilled at audio transcription tasks and thus have
a high accuracy rating, but it would be a mistake to assume
on the basis of their rating that the same worker could also
perform Chinese-to-English translation tasks. Worse, workers who cherry-pick easy tasks and thus have high accuracy
ratings may be less qualified than workers who routinely
perform difficult tasks that are occasionally rejected.
3. OVERVIEW
AutoMan is a domain-specific language embedded in Scala.
AutoMan’s goal is to abstract away the details of crowdsourcing so that human computation can be as easy to invoke as a
conventional programming language function.
3.1. Using AutoMan
Figure 1 presents a real AutoMan program that recognizes automobile license plate texts from images. Note that
the programmer need not specify details about the chosen crowdsourcing backend (Mechanical Turk) other than
the appropriate backend adapter and account credentials.
Crucially, all details of crowdsourcing are hidden from the
AutoMan programmer. The AutoMan runtime abstracts
away platform-specific interoperability code, schedules and
determines budgets (both cost and time), and automatically
ensures that outcomes meet a minimum confidence level.
Initializing AutoMan. After importing the AutoMan and
MTurk adapter libraries, the first thing an AutoMan programmer does is to declare a configuration for the desired
crowdsourcing platform. The configuration is then bound
to an AutoMan runtime object that instantiates any platform-specific objects.
Specifying AutoMan functions. AutoMan functions declaratively describe questions that workers must answer. They must
include the question type and may also include text or images.
Confidence level. An AutoMan programmer can optionally specify the degree of confidence they want to have in their
computation, on a per-function basis. AutoMan’s default confidence is 95%, but this can be overridden as needed. The meaning and derivation of confidence is discussed in Section 5.
Metadata and question text. Each question declaration
requires a title and description, used by the crowdsourcing
Figure 1. A license plate recognition program written using AutoMan.
getURLsFromDisk() is omitted for clarity. The AutoMan programmer
specifies only credentials for Mechanical Turk, an overall budget,
and the question itself; the AutoMan runtime manages all other
details of execution (scheduling, budgeting, and quality control).
import edu.umass.cs.automan.adapters.MTurk._

object ALPR extends App {
  val a = MTurkAdapter { mt =>
    mt.access_key_id = "XXXX"
    mt.secret_access_key = "XXXX"
  }

  def plateTxt(url: String) = a.FreeTextQuestion { q =>
    q.budget = 5.00
    q.text = "What does this license plate say?"
    q.image_url = url
    q.allow_empty_pattern = true
    q.pattern = "XXXXXYYY"
  }

  automan(a) {
    // get plate texts from image URLs
    val urls = getURLsFromDisk()
    val plate_texts = urls.map { url =>
      (url, plateTxt(url))
    }

    // print out results
    plate_texts.foreach { case (url, outcome) =>
      outcome.answer match {
        case Answer(ans, _, _) => println(url + ": " + ans)
        case _ => ()
      }
    }
  }
}
platform’s user interface. These fields map to MTurk’s fields
of the same name. A declaration also includes the question
text itself, together with a map between symbolic constants
and strings for possible answers.
Question variants. AutoMan supports multiple-choice questions, including questions where only one
answer is correct (“radio-button” questions), where any
number of answers may be correct (“checkbox” questions),
and a restricted form of free-text entry. Section 5 describes
how AutoMan’s quality control algorithm handles each
question type.
Invoking a function. A programmer can invoke an AutoMan
function as if it were any ordinary (digital) function. In
Figure 1, the programmer calls the plateTxt function with
a URL pointing to an image as a parameter. The function
returns an Outcome object representing a Future[Answer]
that can then be passed as data to other functions. AutoMan
functions execute eagerly, in a background thread, as soon as
they are invoked. The program does not block until it needs
to read an Outcome.answer field, and only then if the
human computation is not yet finished.
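For example, continuing the program in Figure 1 (inside its automan(a) { ... } block, with hypothetical image URLs), the calls below return immediately and only block when an answer is actually read:

// Hypothetical usage of plateTxt from Figure 1: each call posts work and
// returns an Outcome right away; reading .answer is what may block.
val first  = plateTxt("http://example.com/plates/1.jpg")
val second = plateTxt("http://example.com/plates/2.jpg")  // runs concurrently
println(first.answer)   // blocks here only if workers are not finished yet
println(second.answer)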
4. SCHEDULING ALGORITHM
AutoMan’s scheduler controls task marshaling, budgeting
of time and cost, and quality. This section describes how
AutoMan automatically determines these parameters.
4.1. Calculating timeout and reward
AutoMan’s overriding goal is to recruit workers quickly and
at low cost in order to keep the cost of a computation within
the programmer’s budget. AutoMan posts tasks in rounds that
have a fixed timeout during which tasks must be completed.
When AutoMan fails to recruit workers in a round, there are
two possible causes: workers were not willing to complete the
task for the given reward, or the time allotted was not sufficient.
AutoMan does not distinguish between these cases. Instead,
the reward for a task and the time allotted are both increased by
a constant factor g every time a task goes unanswered. g must
be chosen carefully to ensure the following two properties:
1. The reward for a task should quickly reach a worker’s
minimum acceptable compensation.
2. The reward should not grow so quickly that it incentivizes workers to wait for a larger reward.
Section 4.4 presents an analysis of reward growth rates.
We also discuss the feasibility of our assumptions and possible
attack scenarios in Section 5.4.
4.2. Scheduling the right number of tasks
AutoMan’s default policy for spawning tasks is optimistic: it creates the smallest number of tasks required to reach the desired
confidence level when workers agree unanimously. If workers do
agree unanimously, AutoMan returns their answer. Otherwise,
AutoMan computes and then schedules the minimum number of additional votes required to reach confidence.
When the user-specified budget is insufficient, AutoMan
suspends the computation before posting additional tasks.
The computation can either be resumed with an increased
budget or accepted as-is, with a confidence value lower than
the one requested. The latter case is considered exceptional,
and must be explicitly handled by the programmer.
4.3. Trading off latency and money
AutoMan allows programmers to provide a time-value parameter that counterbalances the default optimistic assumption that all workers will agree. The parameter instructs the
system to post more than the minimum number of tasks in
order to minimize the latency incurred when jobs are serialized across multiple rounds. The number of tasks posted is
a function of the value of the programmer's time.
As a cost savings, when AutoMan receives enough
answers to reach the specified confidence, it cancels all
unaccepted tasks. In the worst case, all posted tasks will
be answered before AutoMan can cancel them, which will
cost no more than time_value ⋅ task_timeout. While this
strategy runs the risk of paying substantially more for a
computation, it can yield dramatic reductions in latency.
We re-ran the example program described in Section 7.1
with a time-value set to $50. In two separate runs, the computation completed in 68 and 168 seconds; by contrast, the
default time-value (minimum wage) took between 1 and 3
hours to complete.
4.4. Maximum reward growth rate
When workers encounter a task with an initial reward of R, they may choose to accept the task or wait for the reward to grow. If R is below R_min, the smallest reward acceptable to workers, then tasks will not be completed. Let g be the reward growth rate and let i be the number of discrete time steps, or rounds, that elapse from an initial time i = 0, such that a task's reward after i rounds is g^i R. We want a g large enough to reach R_min quickly, but not so large that workers have an incentive to wait. We balance the probability that a task remains available against the reward's growth rate so that workers should not expect to profit by waiting.
Let p_a be the probability that a task remains available from one round to the next, assuming this probability is constant across rounds. Suppose a worker's strategy is to wait i rounds and then complete the task for a larger reward. The expected reward for this worker's strategy is

E[reward_i] = (p_a g)^i R.

When g ≤ 1/p_a, the expected reward is maximized at i = 0; workers have no incentive to wait, even if they are aware of AutoMan's pricing strategy. A growth rate of exactly 1/p_a will reach R_min as fast as possible without incentivizing waiting. This pricing strategy remains sound even when p_a is not constant, provided the desirability of a task does not decrease with a larger reward.
The true value of p_a is unknown, but it can be estimated by modeling the acceptance or rejection of each task as an independent Bernoulli trial. The maximum likelihood estimator p̃_a = t/n is a reasonable estimate for p_a, where n is the number of times a task has been offered and t is the number of times the task was not accepted before timing out. To be conservative, p̃_a can be over-approximated, driving g downward.
The difficulty of choosing a reward a priori is a strong case
for automatic budgeting.
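As an illustration of this pricing rule, here is a minimal sketch (not AutoMan's implementation; the offer counts and rewards are made-up numbers) that estimates p_a, caps the growth rate at 1/p_a, and counts how many rounds a reward needs to reach a hypothetical R_min:

// Sketch of the reward-growth rule from Section 4.4; all inputs are
// illustrative assumptions, not values taken from AutoMan.
object RewardGrowthSketch extends App {
  // Estimate of p_a: t tasks timed out among n offers (over-approximate to be safe).
  def estimatePa(timesOffered: Int, timesTimedOut: Int): Double =
    timesTimedOut.toDouble / timesOffered

  // Growth rate g is capped at 1/p_a so waiting never pays off in expectation.
  def growthRate(pa: Double): Double = 1.0 / pa

  // Number of rounds until g^i * r first reaches rMin.
  def roundsToReach(r: Double, rMin: Double, g: Double): Int =
    Iterator.iterate(r)(_ * g).indexWhere(_ >= rMin)

  val pa = estimatePa(timesOffered = 10, timesTimedOut = 8)  // 0.8
  val g  = growthRate(pa)                                    // 1.25
  val i  = roundsToReach(r = 0.06, rMin = 0.25, g = g)
  println(f"g = $g%.2f, rounds to reach R_min: $i")
}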
5. QUALITY CONTROL
AutoMan’s quality control algorithm is based on collecting enough consensus for a given question to rule out the
possibility, for a specified level of confidence, that the
results are due to random chance. AutoMan’s algorithm
is adaptive, taking both the programmer’s confidence
threshold and the likelihood of random agreement into
account. By contrast, majority rule, a commonly used technique for achieving higher-quality results, is neither necessary nor sufficient to rule out outcomes due to random
chance (see Figure 2). A simple two-option question (e.g.,
“Heads or tails?”) with three random respondents demonstrates the problem: a majority is not just likely, it is guaranteed. Section 5.4 justifies this approach.
Initially, AutoMan spawns enough tasks to meet the
desired confidence level if all workers who complete the
tasks agree unanimously. Computing the confidence of
an outcome in this scenario is straightforward. Let k be
the number of options, and n be the number of tasks. The
confidence is then 1 − k(1/k)^n. AutoMan computes the
smallest n such that the probability of random agreement
is less than or equal to one minus the specified confidence threshold.
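A minimal sketch of this computation (an illustration, not AutoMan's code) finds the smallest n for which k(1/k)^n drops below one minus the requested confidence:

// Smallest n such that the probability that n random respondents all agree
// on one of k options, k * (1/k)^n, is at most 1 - confidence.
def tasksForUnanimousAgreement(k: Int, confidence: Double): Int =
  Iterator.from(1).find(n => k * math.pow(1.0 / k, n) <= 1.0 - confidence).get

// Example: a five-option radio-button question at the default 95% confidence
// needs only three unanimous responses.
println(tasksForUnanimousAgreement(k = 5, confidence = 0.95))  // 3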
Humans are capable of answering a rich variety of question
types. Each of these question types requires its own probability analysis.
Radio buttons. For multiple-choice "radio-button" questions where only one choice is possible, k is exactly the number of possible options.
Checkboxes. For "checkbox" questions with c boxes, k is much larger: k = 2^c. In practice, k is often large enough that as few as two workers are required to rule out random behavior. To avoid accidental agreement caused when low-effort workers simply submit a form without changing any of the checkboxes, AutoMan randomly pre-selects checkboxes.

Figure 2. The fraction of workers that must agree to reach 0.95 confidence for a given number of tasks. For a three-option question and 5 workers, 100% of the workers must agree. For a six-option question and 15 or more workers, only a plurality is required to reach confidence. Notice that majority vote is neither necessary nor sufficient to rule out random respondents. [Plot omitted: fraction of responses that must agree (β = 0.95) vs. number of options (2–6), one curve per number of responses (5, 10, 15, 20, 25).]
Restricted free-text input. "Free-text" input is mathematically equivalent to a set of radio-buttons where each
option corresponds to a valid input string. Nonetheless, even
a small set of valid strings represented as radio buttons
would be burdensome for workers. Instead, workers are
provided with a text entry field and the programmer supplies a pattern representing valid inputs so that AutoMan
can perform its probability analysis.
AutoMan’s pattern specification syntax resembles
COBOL’s picture clauses. A matches an alphabetic character, B matches an optional alphabetic character, X matches
an alphanumeric character, Y matches an optional alphanumeric character, 9 matches a numeric character, and 0
matches an optional numeric character. For example, a telephone number recognition application might use the pattern
09999999999.
For example, given a 7-character numeric pattern with no
optional characters, k = 10^7. Again, k is often large, so a small
number of HITs suffice to achieve high confidence in the result.
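To illustrate how a pattern determines k, here is a small sketch that counts the inputs matched by a picture-clause pattern; it handles only the required classes (A, X, 9), assumes a 26-letter case-insensitive alphabet, and ignores the optional classes for simplicity, so it is not AutoMan's implementation.

// Sketch: number of strings matched by a picture-clause pattern built from
// required classes only (A: alphabetic, X: alphanumeric, 9: numeric).
// The alphabet sizes (26 and 36) are simplifying assumptions.
def optionCount(pattern: String): BigInt =
  pattern.foldLeft(BigInt(1)) { (k, c) =>
    c match {
      case 'A'   => k * 26   // alphabetic character
      case 'X'   => k * 36   // alphanumeric character
      case '9'   => k * 10   // numeric character
      case other => sys.error(s"unsupported pattern character: $other")
    }
  }

println(optionCount("9999999"))  // seven numeric characters: 10^7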
As with checkbox questions, AutoMan treats free-text questions specially to cope with low-effort workers who
might simply submit an empty string. To avoid this problem,
AutoMan only accepts the empty string if it is explicitly
entered with the special string NA.
5.1. Definitions
Formally, AutoMan’s quality control algorithm depends
on two functions, t and v, and associated parameters β and p*. t computes the minimum threshold (the number of votes) needed to establish that an option is unlikely to be due to random chance with probability β (the programmer's confidence threshold). t depends on the
random variable X, which models when n respondents
choose one of k options uniformly at random. If no option
crosses the threshold, v computes the additional number of votes needed. v depends on the random variable
Y , which models a worker choosing the correct option
with the observed probability p* and all other options uniformly at random.
Let X and Y be multinomial distributions with parameters (n, 1/k, . . . , 1/k) and (n, p, q, . . . , q), respectively, where
q = (1 − p)/(k − 1). We define two functions E1 and E2 that have
the following properties2:
Lemma 5.1. E1(x, n) is the probability that every X_i falls below x, and E2(x, n) is the probability that Y_1 reaches at least x while every other Y_i stays below x; both can be computed exactly by coefficient extraction, where coeff_{λ,n}(f(λ)) is the coefficient of λ^n in the polynomial f.
Note that E1(n, n) = 1 − 1/k^(n−1), and define t(n, β) as the smallest threshold x for which E1(x, n) ≥ β (with t(n, β) = ∞ if no such x exists). Thus, when n voters each choose randomly, the probability that any option meets or exceeds the threshold t(n, β) is at most α = 1 − β.
Finally, we define v, the number of extra votes needed, as the smallest number of votes for which the following guarantee holds.
If workers have a bias of at least p* toward a “popular” option
(the remaining options being equiprobable), then when we
ask v(p*, β) voters, the number of votes cast for the popular
option passes the threshold (and all other options are below
threshold) with probability at least β.
5.2. Quality control algorithm
AutoMan’s quality control algorithm, which gathers
responses until it can choose the most popular answer not
likely to be the product of random chance, proceeds as follows (a sketch in code follows the list):
1. Set b = min {m | t(m, β) ≠ ∞}. Set n = 0.
2. Ask b workers to vote on a question with k options. Set
n = b + n.
3. If any option has more than t(n, β) votes, return the
most frequent option as the answer.
4. Let b = v(p*, β ) and repeat from step 2.
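The following is a minimal sketch of this loop. The helpers threshold, extraVotes, and askWorkers are hypothetical stand-ins for t, v, and posting tasks to the crowd (with Int.MaxValue standing in for ∞); it illustrates the control flow above rather than AutoMan's implementation.

// Sketch of the quality control loop; `threshold`, `extraVotes`, and
// `askWorkers` are assumed stand-ins for t, v, and task posting.
def chooseAnswer(beta: Double, pStar: Double,
                 threshold: (Int, Double) => Int,       // t(n, beta); Int.MaxValue means infinity
                 extraVotes: (Double, Double) => Int,   // v(p*, beta)
                 askWorkers: Int => Seq[Int]): Int = {
  // Step 1: smallest batch size for which a finite threshold exists.
  var b = Iterator.from(1).find(m => threshold(m, beta) != Int.MaxValue).get
  var n = 0
  val votes = scala.collection.mutable.Map.empty[Int, Int]
  while (true) {
    // Step 2: ask b workers to vote on the question (options encoded as Ints).
    askWorkers(b).foreach(opt => votes(opt) = votes.getOrElse(opt, 0) + 1)
    n += b
    // Step 3: return the most frequent option once it has more than t(n, beta) votes.
    val (best, count) = votes.maxBy(_._2)
    if (count > threshold(n, beta)) return best
    // Step 4: otherwise schedule v(p*, beta) additional votes and repeat.
    b = extraVotes(pStar, beta)
  }
  sys.error("unreachable")
}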
Figure 2 uses t to compute the smallest fraction of
workers that need to agree for β = 0.95. As the number of
tasks and the number of options increase, the proportion
of workers needed to agree decreases. For example, for
a 4-option question with 25 worker responses, only 48%
(12 of 25) of workers must agree to meet the confidence
threshold. This figure clearly demonstrates that quality
control based on majority vote is neither necessary nor
sufficient to limit outcomes based on random chance.
5.3. Multiple comparisons problem
Note that AutoMan must correct for a subtle bias that is introduced as the number of rounds—and correspondingly, the number of statistical tests—increases. This bias is called the multiple comparisons problem. As the number of hypotheses grows with respect to a fixed sample size, the probability that at least one true hypothesis will be incorrectly falsified by chance increases. Without the correction, AutoMan is susceptible to accepting low-confidence answers when the proportion of good workers is low. AutoMan applies a Bonferroni correction to its statistical threshold, which ensures that the familywise error rate remains at or below the 1 − β threshold set by the programmer.10 We empirically evaluate the cost and time overhead for this correction in Section 7.4.
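A one-line sketch of the idea (AutoMan's internal bookkeeping may differ): with m hypothesis tests performed so far, each individual test runs against a stricter threshold so that the family-wise error rate stays at or below α = 1 − β.

// Bonferroni-style adjustment: split the overall error budget alpha = 1 - beta
// evenly across the m tests performed so far.
def perTestAlpha(beta: Double, testsSoFar: Int): Double =
  (1.0 - beta) / testsSoFar

// Example: at beta = 0.95, the third round's test runs at alpha of about 0.0167,
// i.e., an effective per-test confidence of roughly 98.3%.
println(perTestAlpha(beta = 0.95, testsSoFar = 3))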
5.4. Quality control discussion
For AutoMan’s quality control algorithm to work, two
assumptions must hold: (1) workers must be independent, and (2) random choice is the worst-case behavior for
workers; that is, they will not deliberately pick the wrong
answer. Workers may break the assumption of independence by masquerading as multiple workers, performing
multiple tasks, or by colluding on tasks. We address each
scenario below.
Scenario 1: Sybil Attack. A single user who creates multiple
electronic identities for the purpose of thwarting identity-based security policy is known in the literature as a "Sybil
attack.”6 The practicality of a Sybil attack depends directly on
the feasibility of generating multiple identities.
Carrying out a Sybil attack on MTurk would be burdensome. Since MTurk provides a payment mechanism for
workers, Amazon requires that workers provide uniquely
identifying financial information, typically a credit card or
bank account. These credentials are difficult to forge.
Scenario 2: One Worker, Multiple Responses. In
order to increase the pay or allotted time for a task, MTurk
requires requesters to post a new HIT. This means that
a single AutoMan task can span multiple MTurk HITs.
MTurk provides a mechanism to ensure worker uniqueness
for a single HIT that has multiple assignments, but it lacks
the functionality to ensure that worker uniqueness is maintained across multiple HITs. For AutoMan’s quality control
algorithm to be effective, AutoMan must be certain that
workers who previously supplied responses cannot supply
new responses for the same task.
Our workaround for this shortcoming is to use MTurk’s
“qualification” feature inversely: once a worker completes
a HIT, AutoMan grants the worker a special “disqualification” that precludes them from supplying future responses.
Scenario 3: Worker Collusion. While it is appealing to
lower the risk of worker collusion by ensuring that workers
are geographically separate (e.g., by using IP geolocation),
eliminating this scenario entirely is not practical. Workers
can collude via external channels (e-mail, phone, word-of-mouth) to thwart our assumption of independence. Instead,
we opt to make the effort of thwarting defenses undesirable
given the payout.
By spawning large numbers of tasks, AutoMan makes it
difficult for any group of workers to monopolize them. Since
MTurk hides the true number of assignments for a HIT,
workers cannot know how many wrong answers are needed
to defeat AutoMan’s quality control algorithm. This makes
collusion infeasible. The bigger threat comes from workers
who do as little work as possible to get compensated: previous research on MTurk suggests that random-answer spammers are the primary threat.20
Random as worst case. AutoMan's quality control algorithm is based on excluding random responses. AutoMan
gathers consensus not just until a popular answer is revealed,
but also until its popularity is unlikely to be the product
of random chance. As long as there is a crowd bias toward
the correct answer, AutoMan’s algorithm will eventually
choose it. Nevertheless, it is possible that workers could act
maliciously and deliberately choose incorrect answers.
Random choice is a more realistic worst-case scenario:
participants have an economic incentive not to deliberately
choose incorrect answers. First, a correct response to a
given task yields an immediate monetary reward. If workers
know the correct answer, it is against their own economic
self-interest to choose otherwise. Second, supposing that a
participant chooses to forego immediate economic reward
by deliberately responding incorrectly (e.g., out of malice), there are long-term consequences. MTurk maintains
an overall ratio of accepted responses to total responses
submitted (a “reputation” score), and many requesters
only assign work to workers with high ratios (typically
around 90%). Since workers cannot easily discard their
identities for new ones, incorrect answers have a lasting
negative impact on workers. We found that many MTurk
workers scrupulously maintain their reputations, sending
us e-mails justifying their answers or apologizing for having
misunderstood the question.
6. SYSTEM ARCHITECTURE
AutoMan is implemented in tiers in order to cleanly separate three concerns: delivering reliable data to the end user, interfacing with an arbitrary crowdsourcing system,
and specifying validation strategies in a crowdsourcing
system-agnostic manner. The programmer’s interface to
AutoMan is a domain-specific language embedded in
the Scala programming language. Scala was chosen to maintain full interoperability with existing Java Virtual
Machine code. The DSL abstracts questions at a high level
as question functions.
Upon executing a question function, AutoMan computes
the number of tasks to schedule, the reward, and the timeout; marshals the question to the appropriate backend;
and returns immediately, encapsulating work in a Scala
Future. The runtime memoizes all responses in case the
user’s program crashes. Once quality control goals are satisfied, AutoMan selects and returns an answer.
Each tier in AutoMan is abstract and extensible.
The default quality control strategy implements the
algorithm described in Section 5.2. Programmers
can replace the default strategy by implementing the
ValidationStrategy interface. The default backend is
MTurk, but this backend can be replaced with few changes
to client code by supplying an AutomanAdapter for a different crowdsourcing platform.
7. EVALUATION
We implemented three sample applications using AutoMan:
a semantic image-classification task using checkboxes (Section
7.1), an image-counting task using radio buttons (Section 7.2),
and an optical character recognition (OCR) pipeline using text
entry (Section 7.3). These applications were chosen to be representative of the kinds of problems that remain difficult even
for state-of-the-art algorithms. We also performed a simulation
using real and synthetic traces to explore AutoMan's performance as confidence and worker quality are varied (Section 7.4).
7.1. Which item does not belong?
Our first sample application asks users to identify which
object does not belong in a collection of items. This kind of
task requires both image- and semantic-classification capability, and is a component in clustering and automated
construction of ontologies. Because tuning of AutoMan’s
parameters is unnecessary, relatively little code is required
to implement this program (27 lines in total).
We gathered 93 responses from workers during our sampling runs. Runtimes for this program were on the order of
minutes, but runtime varies substantially with the time of day. Demographic studies of MTurk have
shown that the majority of workers on MTurk are located in
the United States and in India.11 These findings largely agree
with our experience, as we found that this program (and
variants) took upward of several hours during the late evening hours in the United States.
7.2. How many items are in this picture?
Counting the number of items in an image also remains difficult for state-of-the-art machine learning algorithms. Machine-learning algorithms must integrate a variety of feature detection
and contextual reasoning algorithms in order to achieve a fraction of the accuracy of human classifiers.18 Moreover, vision
algorithms that work well for all objects remain elusive.
Counting tasks are trivial with AutoMan. We created an
image processing pipeline that takes a search string as input,
downloads images using Google Image Search, resizes the
images, uploads the images to Amazon S3, obscures the URLs
using TinyURL, and then posts the question “How many
$items are in this image?”
We ran this task eight times, spawning 71 question
instances at the same time of the day (10 a.m. EST), and
employing 861 workers. AutoMan ensured that for each
of the 71 questions asked, no worker could participate
more than once. Overall, the typical task latency was short.
We found that the mean runtime was 8 min, 20 s and that the
median runtime was 2 min, 35 s.
The mean is skewed upward by the presence of one long-running task that asked "How many spoiled apples are in this
image?" The ambiguity of the word "spoiled" caused worker answers to be nearly evenly
distributed between two answers. This ambiguity forced
AutoMan to collect a large number of responses in order to
meet the desired confidence level. AutoMan handled this
unexpected behavior correctly, running until statistical confidence was reached.
7.3. Automatic license plate recognition
Our last application is the motivating example shown in
Figure 1, a program that performs automatic license plate recognition (ALPR). Although ALPR is now widely deployed using
distributed networks of traffic cameras, it is still considered a
difficult research problem,8 and academic literature on this
subject spans nearly three decades.5 While state-of-the-art systems can achieve accuracy near 90% under ideal conditions,
b The MediaLab LPR database is available at http://www.medialab.ntua.gr/research/LPRdatabase.html.
these systems require substantial engineering in practice.4
False positives have dramatic negative consequences in
unsupervised ALPR systems as tickets are issued to motorists
automatically. A natural consequence is that even good unsupervised image-recognition algorithms may need humans in
the loop to audit results and to limit false positives.
Figure 3. A sample trace from the ALPR application shown in Figure 1. AutoMan correctly selects the answer 767JKF, spending a total of $0.18. Incorrect, timed-out, and cancelled tasks are not paid for, saving programmers money. [Timeline diagram omitted: across rounds t1–t6, tasks are posted for $0.06, $0.06, and $0.12 for the question "What does this license plate say?"; one worker answers 767JFK while others answer 767JKF, one task times out, one is cancelled, and the final answer is 767JKF.]
Using AutoMan to engage humans to perform this
task required only a few hours of programming time and
AutoMan’s quality control ensures that it delivers results
that match or exceed the state-of-the-art on even the most
difficult cases. We evaluated the ALPR application using
the MediaLab LPRb database. Figure 3 shows a sample trace
for a real execution.
The benchmark was run twice on 72 of the “extremely difficult” images, for a total of 144 license plate identifications.
Overall accuracy was 91.6% for the “extremely difficult” subset. Each task cost an average of 12.08 cents, with a median
latency of less than 2 min per image. AutoMan runs all
identification tasks in parallel: one complete run took less
than 3 h, while the other took less than 1 h. These translate
to throughputs of 24 and 69 plates/h. While the AutoMan
application is slower than computer vision approaches, it is
simple to implement, and it could be used for only the most
difficult images to increase accuracy at low cost.
7.4. Simulation
We simulate AutoMan’s ability to meet specified confidence thresholds by varying two parameters, the minimum
confidence threshold β, where 0 < β < 1 (we used 50 levels of β),
and the probability that a random worker chooses the correct answer p_r ∈ {0.75, 0.50, 0.33}. We also simulate worker
responses drawn from trace data (“trace”) for the “Which
item does not belong?” task (Section 7.1). For each setting of
β and pr we run 10,000 simulations and observe AutoMan’s response. We classify responses as either correct or incorrect given the ground truth. Accuracy is the mean proportion of correct responses for a given confidence threshold. Responses required is the mean number of workers needed to satisfy a given confidence threshold.
Figure 4. These plots show the effect of worker accuracy on (a) overall accuracy and (b) the number of responses required on a five-option question. “Trace” is a simulation based on real response data while the other simulations model worker accuracies of 33%, 50%, and 75%. Each round of responses ends with a hypothesis test to decide whether to gather more responses, and AutoMan must schedule more rounds to reach the confidence threshold when worker accuracy is low. Naively performing multiple tests creates a risk of accepting a wrong answer, but the Bonferroni correction eliminates this risk by increasing the confidence threshold with each test. Using the correction, AutoMan (c) meets quality control guarantees and (d) requires few additional responses for real workers.
[Figure 4 panels: (a) overall accuracy, (b) responses required for confidence, (c) overall accuracy with Bonferroni correction, (d) additional responses with Bonferroni correction; x-axis: confidence threshold (0.900–1.000), with curves for worker accuracies of 33%, 50%, 75%, and the trace data.]
Figures 4a and 4b show accuracy and the number of required
responses as a function of β and pr, respectively. Since the
risk of choosing a wrong answer increases as the number of
hypothesis tests increases (the “multiple comparisons”
problem), we also include figures that show the result of correcting for this effect. Figure 4c shows the accuracy and Figure
4d shows the increase in the number of responses when we
apply the Bonferroni bias correction.10
These results show that AutoMan’s quality control
algorithm is effective even under pessimistic assumptions
about worker quality. AutoMan is able to maintain high
accuracy in all cases. Applying bias correction ensures that
answers meet the programmer’s quality threshold even
when worker quality is low. This correction can significantly increase the number of additional worker responses
required when bad workers dominate. However, worker
accuracy tends to be closer to 60%,20 so the real cost of this
correction is low.
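For readers who want to reproduce this kind of analysis, the sketch below simulates one AutoMan-style question under the same two parameters (confidence threshold β and worker accuracy pr). It is a minimal illustration, not AutoMan's implementation: it assumes a fixed batch of two responses per round, a one-sided binomial test of the modal answer against random guessing, and a simple Bonferroni-style division of the significance level by the number of tests performed so far.

import random
from math import comb

def binom_tail(n, m, p):
    """P(X >= m) for X ~ Binomial(n, p): chance that random guessers
    produce at least m votes for one particular option."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(m, n + 1))

def run_question(beta, worker_acc, options=5, batch=2, max_rounds=50):
    """Simulate one question: collect responses in rounds until the modal
    answer's agreement is unlikely under random guessing, applying a
    Bonferroni-style correction for the number of tests performed so far."""
    votes = {}
    for test_no in range(1, max_rounds + 1):
        for _ in range(batch):
            # Answer 0 is ground truth; wrong answers are uniform over the rest.
            ans = 0 if random.random() < worker_acc else random.randrange(1, options)
            votes[ans] = votes.get(ans, 0) + 1
        top, m = max(votes.items(), key=lambda kv: kv[1])
        n = sum(votes.values())
        alpha = (1 - beta) / test_no          # corrected significance level
        if binom_tail(n, m, 1.0 / options) < alpha:
            return top == 0, n                # (correct?, responses used)
    return False, sum(votes.values())         # round limit exhausted; count as a miss

def simulate(beta, worker_acc, trials=10_000):
    correct = responses = 0
    for _ in range(trials):
        ok, n = run_question(beta, worker_acc)
        correct += ok
        responses += n
    return correct / trials, responses / trials

if __name__ == "__main__":
    # Smaller than the paper's 10,000 runs, just for a quick check.
    for pr in (0.75, 0.50, 0.33):
        acc, avg_n = simulate(beta=0.95, worker_acc=pr, trials=2_000)
        print(f"pr={pr:.2f}: accuracy={acc:.3f}, mean responses={avg_n:.1f}")

Sweeping β over a grid and pr over {0.75, 0.50, 0.33} should qualitatively reproduce the shapes in Figure 4, though the absolute response counts depend on the batch size and the exact test used.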
8. RELATED WORK
Programming the Crowd. While there has been substantial ad hoc use of crowdsourcing platforms, there have been
few efforts to manage workers programmatically beyond
MTurk’s low-level API.
TurKit Script extends JavaScript with a templating feature for common MTurk tasks and adds checkpointing to
avoid re-submitting tasks if a script fails.15 CrowdForge and
Jabberwocky wrap a MapReduce-like abstraction on MTurk tasks.1, 13 Unlike AutoMan, neither TurKit nor CrowdForge automatically manages scheduling, pricing, or quality control; Jabberwocky uses fixed pricing along with a majority-vote-based quality-control scheme.
CrowdDB models crowdsourcing as an extension to SQL for database cleansing tasks.9 Its query planner tries to minimize the expense of human operations. CrowdDB is not general-purpose and relies on majority voting as its sole quality-control mechanism.
Turkomatic crowdsources an entire computation, including the “programming” of the task itself.14 Turkomatic can
be used to construct arbitrarily complex computations, but
Turkomatic does not automatically handle budgeting or quality control, and programs cannot be integrated with a conventional programming language.
Quality Control. CrowdFlower is a commercial web service.17 To enhance quality, CrowdFlower seeds questions
with known answers into the task pipeline. CrowdFlower
incorporates methods to programmatically generate these
“gold” questions to ease the burden on the requester. This
approach focuses on establishing trust in particular workers.12 By contrast, AutoMan does not try to estimate worker
quality, instead focusing on worker agreement.
Shepherd provides a feedback loop between task requesters and task workers in an effort to increase quality; the idea is
to train workers to do a particular job well.7 AutoMan requires
no feedback between requester and workers.
Soylent crowdsources finding errors, fixing errors, and verifying the fixes.3 Soylent can handle open-ended questions that
AutoMan currently does not support. Nonetheless, unlike
AutoMan, Soylent’s approach does not provide any quantitative quality guarantees.
9. CONCLUSION
Humans can perform many tasks with ease that remain difficult or impossible for computers. We present AutoMan, the
first crowdprogramming system. Crowdprogramming integrates human-based and digital computation. By automatically managing quality control, scheduling, and budgeting,
AutoMan allows programmers to easily harness human-based computation for their applications.
AutoMan is available at www.automan-lang.org.
Acknowledgments
This work was supported by the National Science
Foundation Grant No. CCF-1144520 and DARPA Award
N10AP2026. Andrew McGregor is supported by the National
Science Foundation Grant No. CCF-0953754. The authors
gratefully acknowledge Mark Corner for his support and initial prompting to explore crowdsourcing.
References
1. Ahmad, S., Battle, A., Malkani, Z.,
Kamvar, S. The Jabberwocky
programming environment for
structured social computing. In UIST
2011, 53–64.
2. Barowy, D.W., Curtsinger, C., Berger, E.D.,
McGregor, A. AutoMan: A platform
for integrating human-based and
digital computation. In OOPSLA 2012,
639–654.
3. Bernstein, M.S., Little, G., Miller, R.C.,
Hartmann, B., Ackerman, M.S., Karger,
D.R., Crowell, D., Panovich, K.
Soylent: A word processor with a
crowd inside. In UIST 2010, 313–322.
4. Chang, G.-L., Zou, N. ITS Applications
in Work Zones to Improve Traffic
Operation and Performance
Measurements. Technical Report
MD-09-SP708B4G, Maryland
Department of Transportation State
Highway Administration, May.
5. Davies, P., Emmott, N., Ayland, N.
License plate recognition technology
for toll violation enforcement. In IEE
Colloquium on Image Analysis for
Transport Applications (Feb 1990),
7/1–7/5.
6. Douceur, J.R. The Sybil attack.
In IPTPS 2001, 251–260.
7. Dow, S., Kulkarni, A., Bunge, B.,
Nguyen, T., Klemmer, S., Hartmann, B.
Shepherding the crowd: Managing and
providing feedback to crowd workers.
In CHI 2011, 1669–1674.
8. Due, S., Ibrahim, M., Shehata, M.,
Badawy, W. Automatic license plate
recognition (ALPR): A state of the art
review. IEEE Trans. Circ. Syst. Video
Technol. 23 (2012), 311–325.
9. Franklin, M.J., Kossmann, D., Kraska, T., Ramesh, S., Xin, R. CrowdDB: Answering queries with crowdsourcing. In SIGMOD 2011, 61–72.
10. Holm, S. A simple sequentially rejective multiple test procedure. Scand. J. Stat. 6, 2 (1979), 65–70.
11. Ipeirotis, P.G. Demographics of Mechanical Turk. Technical Report CeDER-10-01, NYU Center for Digital Economy Research, 2010.
12. Ipeirotis, P.G., Provost, F., Wang, J. Quality management on Amazon Mechanical Turk. In HCOMP 2010, 64–67.
13. Kittur, A., Smus, B., Khamkar, S., Kraut, R.E. CrowdForge: Crowdsourcing Complex Work.
14. Kulkarni, A.P., Can, M., Hartmann, B. Turkomatic: Automatic recursive task and workflow design for Mechanical Turk. In CHI 2011, 2053–2058.
15. Little, G., Chilton, L.B., Goldman, M., Miller, R.C. TurKit: Human computation algorithms on Mechanical Turk. In UIST 2010, 57–66.
16. Marge, M., Banerjee, S., Rudnicky, A. Using the Amazon Mechanical Turk for transcription of spoken language. In ICASSP 2010, 5270–5273, Mar.
17. Oleson, D., Hester, V., Sorokin, A., Laughlin, G., Le, J., Biewald, L. Programmatic gold: Targeted and scalable quality assurance in crowdsourcing. In HCOMP 2011, 43–48.
18. Parikh, D., Zitnick, L. Human-debugging of machines. In NIPS CSS 2011.
19. Shahaf, D., Amir, E. Towards a theory of AI completeness. In Commonsense 2007.
20. Tamir, D., Kanth, P., Ipeirotis, P. Mechanical Turk: Now with 40.92% spam, Dec 2010. www.behind-the-enemy-lines.com.
Daniel W. Barowy, Charlie Curtsinger, Emery D. Berger, and Andrew McGregor ({dbarowy, charlie, emery, mcgregor}@cs.umass.edu), College of Information and Computer Sciences, University of Massachusetts Amherst, 140 Governors Drive, Amherst, MA.
Copyright held by authors. Publication rights licensed to ACM. $15.00
last byte
[CONTINUED FROM P. 112]
it; they
just announced it in the Federal Register as a proposed standard. We quickly
realized the key size was too small and
needed to be enlarged.
DIFFIE: I had an estimate roughly of
half a billion dollars to break it. We
eventually decided it could be done for
$20-ish million.
HELLMAN: And because of Moore’s
Law, it would only get cheaper.
DIFFIE: If you can make a cryptographic system that’s good, it’s usually
not hard to make one that’s effectively
unbreakable. So it takes some explaining if you make the key size small
enough that somebody might conceivably search through it.
HELLMAN: So in March 1975, NBS announced the DES and solicited comments and criticism. And we were naïve enough to think they actually were
open to improving the standard. Five
months later, it was clear to us the key
size problem was intentional and the
NSA was behind it. If we wanted to improve DES—and we did—we had a political fight on our hands.
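A quick back-of-the-envelope view of why the key size mattered: DES used a 56-bit key, so an exhaustive search covers 2^56 candidates, and a fixed-capability search machine gets cheaper roughly in step with Moore's Law. The snippet below uses the $20-million figure Diffie mentions; the 1.5-year cost-halving period is an assumption chosen only to show the trend.

# Illustrative only: size of the 56-bit DES keyspace and how a fixed search-machine
# budget shrinks if hardware cost halves every ~1.5 years (a Moore's-Law-style assumption).
keyspace = 2 ** 56
print(f"keys to search: {keyspace:,}")        # 72,057,594,037,927,936

machine_cost = 20_000_000                     # Diffie's mid-1970s estimate, in dollars
halving_period_years = 1.5                    # assumed cost-halving period
for years in (0, 3, 6, 9, 12):
    cost = machine_cost / 2 ** (years / halving_period_years)
    print(f"after {years:2d} years: ~${cost:,.0f}")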
That fight was partly about your work on public key cryptography.
HELLMAN: There was a lot that led up to that idea. The DES announcement suggested the value of trap door ciphers. It became clear to us that NSA wanted secure encryption for U.S. communications, but still wanted access to foreign ones. Even better than DES’ small key size would be to build in a trap door that made the system breakable by NSA—which knows the trap door information—but not by other nations. It’s a small step from there to public key cryptography, but it still took us time to see.
Whit, you have also said you were inspired by John McCarthy’s paper about buying and selling through so-called “home information terminals.”
DIFFIE: I was concerned with two
problems and didn’t realize how closely related they were. First, I had been
thinking about secure telephone calls
since 1965, when a friend told me—
mistakenly, as it turned out—that
the NSA encrypted the telephone traffic within its own building. From my
countercultural point of view, though,
my understanding of a secure telephone call was: I call you, and nobody
else in the world can understand what
we’re talking about. I began thinking
about what we call the key-management problem.
In 1970, about the time I got to
Stanford, John McCarthy presented
the paper that you note. I began to
think about electronic offices and
what you would do about a signature,
because signatures on paper depend
so heavily on the fact that they’re hard
to copy, and digital documents can be
copied exactly.
So in the spring of 1975, as you were
preparing your critique of DES, you
came to the solution to both problems.
DIFFIE: I was living at John McCarthy’s house, and I was trying to combine what is called identification,
friend or foe (IFF), which is a process by
which a Fire Control radar challenges
an aircraft and expects a correctly encrypted response, and what is called
one-way enciphering, which is used in
UNIX to avoid having the password table be secret. One of these protects you
from the compromise of the password
table, and the other protects you from
someone eavesdropping on the transmission of your password.
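The one-way enciphering Diffie refers to stores only a value that is easy to compute from a password but hard to invert, so the stored table itself need not be secret. The sketch below uses a salted SHA-256 digest purely for illustration; classic UNIX used crypt(3), and modern systems use deliberately slow functions such as bcrypt or Argon2.

import hashlib, hmac, os

def enroll(password: str):
    """Store only a salt and a one-way digest, never the password itself."""
    salt = os.urandom(16)
    digest = hashlib.sha256(salt + password.encode()).digest()
    return salt, digest

def check(password: str, salt: bytes, digest: bytes) -> bool:
    """Recompute the digest from the offered password and compare in constant time."""
    candidate = hashlib.sha256(salt + password.encode()).digest()
    return hmac.compare_digest(candidate, digest)

salt, digest = enroll("correct horse battery staple")
assert check("correct horse battery staple", salt, digest)
assert not check("guess", salt, digest)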
You came to the concept of what we
now call digital signatures, constructions in which somebody can judge the
correctness of the signature but cannot
have generated it.
DIFFIE: Only one person can generate it, but anybody can judge its correctness. And then a few days later,
I realized this could be used to solve
the problem I’d been thinking of since
1965. At that point, I realized I really
had something. I told Mary about it as I
fed her dinner and then went down the
hill to explain it to Marty.
HELLMAN: So then we had the problem of coming up with a system that
would actually implement it practically, and some time later we met Ralph
Merkle, who had come up with related
but slightly different ideas at Berkeley
as a master’s student. The algorithm I
came up with was a public key distribution system, a concept developed by
Merkle. Whit and I didn’t put names
on the algorithm, but I’ve argued it
should be called Diffie-Hellman-Merkle, rather than the Diffie-Hellman Key
Exchange, as it now is.
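The public key distribution idea Hellman describes is now usually taught as Diffie-Hellman (or Diffie-Hellman-Merkle) key exchange, and a toy version fits in a few lines. The modulus and generator below are illustrative placeholders, far too small for real use; deployed systems rely on standardized groups or elliptic curves.

import secrets

# Toy Diffie-Hellman key agreement. Parameters are illustrative placeholders;
# do not use them for real security.
p = 0xFFFFFFFB          # a small prime modulus (2**32 - 5), far too small in practice
g = 5                   # assumed generator, chosen only for illustration

a = secrets.randbelow(p - 2) + 1     # Alice's private exponent
b = secrets.randbelow(p - 2) + 1     # Bob's private exponent

A = pow(g, a, p)        # Alice publishes g^a mod p
B = pow(g, b, p)        # Bob publishes g^b mod p

# Each side combines its own secret with the other's public value.
k_alice = pow(B, a, p)  # (g^b)^a mod p
k_bob = pow(A, b, p)    # (g^a)^b mod p
assert k_alice == k_bob # both arrive at the same shared secret
print(hex(k_alice))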
The NSA was not happy you intended to
publish your results.
HELLMAN: NSA was very upset at
our publishing in an area where they
thought they had a monopoly and
could control what was published.
Marty, you have been at Stanford ever
since. Whit, you left Stanford in 1978
to work at Bell Northern Research, and
later went to Sun Microsystems. And
you now are working on a project to
document the history of cryptography.
DIFFIE: There have been some major
shifts in cryptographic technology in
the latter half of the 20th century; public key is only one of them. I am trying to write the history of some others
before all the people who worked on
them die off.
Marty, you’re writing a book about your
marriage and nuclear weapons.
HELLMAN: Starting about 35 years
ago, my interests shifted from cryptography to the bigger problems in
the world, particularly nuclear weapons and how fallible human beings
are going to survive having that kind
of power. What got me started was
wanting to save my marriage, which
at that time was in trouble. Dorothie
and I not only saved our marriage, but
recaptured the deep love we felt when
we first met. The changes needed to
transform our marriage are the same
ones needed to build a more peaceful, sustainable world. But it has kind
of come full circle, because as we become more and more wired, cyber
insecurity may become an existential
threat. The global part of our effort
is really about solving the existential
threats created by the chasm between
the God-like physical power technology has given us and our maturity level
as a species, which is at best that of an
irresponsible adolescent.
Leah Hoffmann is a technology writer based in
Piermont, NY.
© 2016 ACM 0001-0782/16/06 $15.00
last byte
DOI:10.1145/2911977
Leah Hoffmann
Q&A
Finding New Directions
in Cryptography
Whitfield Diffie and Martin Hellman on their meeting,
their research, and the results that billions use every day.
L I K E M A N Y D E V E L O P M E N T S we now
take for granted in the history of the
Internet, public key cryptography—
which provides the ability to communicate securely over an insecure
channel—followed an indirect path
into the world. When ACM A.M. Turing Award recipients Martin Hellman and Whitfield Diffie began their
research, colleagues warned against
pursuing cryptography, a field then
dominated by the U.S. government.
Their 1976 paper “New Directions in
Cryptography” not only blazed a trail
for other academic researchers, but
introduced the ideas of public-key
distribution and digital signatures.
How did you meet?
DIFFIE: In the summer of 1974, my
wife and I traveled to Yorktown Heights
(NY) to visit a friend who worked for
Alan Konheim at IBM.
You’re talking about the head of the IBM mathematics group and author of Cryptography: A Primer, who subsequently moved to the University of California, Santa Barbara.
DIFFIE: Konheim said he couldn’t tell me very much because of a secrecy order, but he did mention that his friend Marty Hellman had been there a few months ago. He said, “I couldn’t tell him anything either, but you should look him up when you get back to Stanford, because two people can work on a problem better than one.”
HELLMAN: So Whit gives me a call. Whit, you were up in Berkeley at the time?
DIFFIE: I was staying with Leslie Lamport.
HELLMAN: I think I set up a half-hour meeting in my office, which went on for probably two hours, and at the end of it, I said, “Look, I’ve got to go home to watch my daughters, but can we continue this there?” Whit came to our house and we invited him and his wife, Mary, to stay for dinner, and as I remember we ended the conversation around 11 o’clock at night.
The two of you worked together for the next four years.
HELLMAN: Whit had been traveling around the country and I tried to figure out ways to keep him at Stanford. I found a small amount of money in a research grant that I could use. A lot of good things came of that.
Among them was a vigorous critique of the Data Encryption Standard (DES), a symmetric-key algorithm developed at IBM.
HELLMAN: DES came full-blown from the brow of Zeus. “Zeus,” in this case, was NBS, the National Bureau of Standards, or NSA, the National Security Agency, or IBM, or some combination. They didn’t tell us how they had come up with [CONTINUED ON P. 110]
Association for
Computing Machinery
ACM Seeks New
Editor(s)-in-Chief
for ACM Interactions
The ACM Publications Board is seeking a volunteer editor-in-chief or
co-editors-in-chief for its bimonthly magazine ACM Interactions.
ACM Interactions is a publication of great influence in the fields that
envelop the study of people and computers. Every issue presents
an array of thought-provoking commentaries from luminaries in
the field together with a diverse collection of articles that examine
current research and practices under the HCI umbrella.
For more about ACM Interactions, see http://interactions.acm.org
Job Description The editor-in-chief is responsible for organizing all
editorial content for every issue. These responsibilities include: proposing
articles to prospective authors; overseeing the magazine’s editorial board
and contributors; creating new editorial features, columns and much more.
An annual stipend will be available for the hiring of an editorial assistant.
Financial support will also be provided for any travel expenses
related to this position.
Eligibility Requirements The EiC search is open to applicants
worldwide. Experience in and knowledge about the issues, challenges,
and advances in human-computer interaction is a must.
The ACM Publications Board has convened a special search
committee to review all candidates for this position.
Please send your CV and vision statement of 1,000 words or
less expressing the reasons for your interest in the position
and your goals for Interactions to the search committee at
[email protected], Subject line: RE: Interactions.
The deadline for submissions is June 1, 2016, or until the position is filled.
The editorship will commence on September 1, 2016.
You must be willing and able to make a three-year commitment to this post.