Intelligent Interfaces Enabling Blind Web Users
to Build Accessibility Into the Web
Jeffrey P. Bigham
A dissertation submitted in partial fulfillment of
the requirements for the degree of
Doctor of Philosophy
University of Washington
2009
Program Authorized to Offer Degree: Computer Science and Engineering
University of Washington
Graduate School
This is to certify that I have examined this copy of a doctoral dissertation by
Jeffrey P. Bigham
and have found that it is complete and satisfactory in all respects,
and that any and all revisions required by the final
examining committee have been made.
Chair of the Supervisory Committee:
Richard E. Ladner
Reading Committee:
Richard E. Ladner
Tessa Lau
Jacob O. Wobbrock
Date:
In presenting this dissertation in partial fulfillment of the requirements for the doctoral
degree at the University of Washington, I agree that the Library shall make its copies
freely available for inspection. I further agree that extensive copying of this dissertation is
allowable only for scholarly purposes, consistent with “fair use” as prescribed in the U.S.
Copyright Law. Requests for copying or reproduction of this dissertation may be referred
to Proquest Information and Learning, 300 North Zeeb Road, Ann Arbor, MI 48106-1346,
1-800-521-0600, to whom the author has granted “the right to reproduce and sell (a) copies
of the manuscript in microform and/or (b) printed copies of the manuscript made from
microform.”
Signature
Date
University of Washington
Abstract
Intelligent Interfaces Enabling Blind Web Users
to Build Accessibility Into the Web
Jeffrey P. Bigham
Chair of the Supervisory Committee:
Boeing Professor Richard E. Ladner
Computer Science and Engineering
The web holds incredible potential for blind computer users. Most web content is relatively
open, represented in digital formats that can be automatically converted to voice or refreshable Braille. Software programs called screen readers can convert some content to an
accessible form, but struggle on content not created with accessibility in mind. Even content that is possible to access may not have been designed for non-visual access, requiring
blind web users to inefficiently search for what they want in the lengthy serialized views of
content exposed by their screen readers. Screen readers are expensive, costing nearly $1000,
and are not installed on most computers. These problems collectively limit the accessibility,
usability, and availability of web access for blind people.
Existing approaches to addressing these problems have not included blind people as
part of the solution, instead relying on either (i) the owners of content and infrastructure
to improve access or (ii) automated approaches that are limited in scope and can produce
confusing errors. Developers can improve access to their content and administrators to
their computing infrastructure, but relying on them represents a bottleneck that cannot be
easily overcome when they fail. Automated tools can improve access but cannot address all
concerns and can cause confusion when they make errors. Despite having the incentive to
improve access, blind web users have largely been left out.
This dissertation explores novel intelligent interfaces enabling blind people to independently improve web content. These tools are made possible by novel predictive models of
web actions and careful consideration of the design constraints for creating software that
can run anywhere. Solutions created by users of these tools can be shared so that blind
users can collaboratively help one another make sense of the web. Disabled people should
not only be seen as access consumers but also as effective partners in achieving better access
for everyone.
The thesis of this dissertation is the following:
With intelligent interfaces supporting them, blind end users can collaboratively and effectively improve the accessibility, usability, and availability of their own web access.
TABLE OF CONTENTS

List of Figures

Chapter 1: Introduction
    1.1 Achieving the Full Potential of the Web
    1.2 Accessibility, Usability and Availability
    1.3 Who should fix the web?
    1.4 Dissertation Outline

Chapter 2: Related Work
    2.1 Understanding the User Experience
    2.2 Enabling Users to Improve Accessibility and Usability
    2.3 Improving the Availability of Accessible Interfaces

Chapter 3: WebinSitu: Understanding Accessibility Problems
    3.1 Motivation
    3.2 Recording Data
    3.3 Study Design
    3.4 Results
    3.5 Discussion
    3.6 Related Work
    3.7 Summary

Chapter 4: Collaborative Accessibility with Accessmonkey
    4.1 Motivation
    4.2 Related Work
    4.3 Accessmonkey Framework
    4.4 Implementations
    4.5 Implemented Scripts
    4.6 Discussion
    4.7 Summary & Ongoing Work

Chapter 5: A More Usable Interface to Audio CAPTCHAs
    5.1 Introduction and Motivation
    5.2 Related Work
    5.3 Evaluation of Existing CAPTCHAs
    5.4 Improved Interface for Non-Visual Use
    5.5 Evaluation of New CAPTCHA Interface
    5.6 Future Work
    5.7 Summary

Chapter 6: More Effective Access with TrailBlazer
    6.1 Motivation
    6.2 Related Work
    6.3 An Accessible Guide
    6.4 Clipping
    6.5 Formative Evaluation
    6.6 Dynamic Script Generation
    6.7 Evaluation of Suggestions
    6.8 Summary and Future Directions

Chapter 7: Improving the Availability of Web Access with WebAnywhere
    7.1 Introduction & Motivation
    7.2 Related Work
    7.3 Public Computer Terminals
    7.4 The WebAnywhere System
    7.5 User-Driven Design
    7.6 User Evaluation
    7.7 Reducing Latency
    7.8 Evaluation
    7.9 Security
    7.10 Summary

Chapter 8: A New Delivery Model for Access Technology
    8.1 The Audience for WebAnywhere
    8.2 Getting New Technology to Users: Two Examples
    8.3 Summary

Chapter 9: Conclusion and Future Directions
    9.1 Contributions
    9.2 Future Directions
    9.3 Final Remarks

Bibliography
LIST OF FIGURES

1.1 A motivating example of existing access problems. (a) Finding content, even on the relatively simple gmail.com login page, can be time-consuming. (b) An incorrect login is difficult to detect and an audio CAPTCHA must be solved to try again. (c) The most efficient route to the inbox requires knowing arbitrary key mappings tied to the underlying HTML structure of the web page, (d) as does finding the beginning of the message. (e) A table of important statistics and other information on nytimes.com is an image assigned the uninformative alternative text “INSERT DESCRIPTION,” making it impossible for a screen reader user to read.

1.2 Effective web access involves more than simply making it possible for users with diverse abilities to access content. Content must also be usable and the tools needed to access it widely available. Accessibility is the foundation of usability and availability, usability increases the potential audience for whom access is possible, and availability determines where content can be accessed and who will be able to access it.

2.1 An overview of the flow of accessible content from web developer to blind web user, along with a selection of the components designed to improve web accessibility that act at each stage. While later stages can influence earlier stages, such change is slower and more difficult to achieve.

2.2 Many products enable web access for blind individuals, but few have high availability and low cost (upper-left portion of this diagram). Only systems that both voice web information and provide an interface for browsing it are included.

3.1 Log frequency of visits per domain name recorded for all participants, ordered by popularity.

3.2 Diagram of the system used to record users’ browsing behavior.

3.3 For the web pages visited by each participant, percentage of: (1) images with alt text, (2) pages that had one or more mouse movements, (3) pages with Flash, (4) pages with Asynchronous Javascript and XML (AJAX), (5) pages containing dynamic content, (6) pages where the participant interacted with dynamic content.

3.4 Number of probes for each page that had at least one probe. Blind participants performed more probes on more pages.

3.5 For each participant, average time spent on: (1) all pages visited, (2) the WebinSitu search page, (3) the WebinSitu results page, (4) the Google home page, (5) Google results pages.

3.6 Web history search interface and results.

4.1 Accessmonkey allows web users, web developers and systems for automated accessibility improvement to collaboratively improve web accessibility.

4.2 A screenshot of the WebInSight Accessmonkey script in developer mode applied to the homepage of the International World Wide Web Conference. This script helps web developers discover images that are assigned inappropriate alternative text (such as the highlighted image) and suggests appropriate alternatives. The developer can modify these suggestions, as was done here, to produce the final alternative text.

4.3 The menubar of this online retailer is inaccessible due to its reliance on the mouse. To fix this problem we wrote an Accessmonkey script that makes this menu accessible from the keyboard.

4.4 This script moves the header and navigation menus of this site to the bottom of the page, providing users with a view of the web page that presents the main content window first in the page.

5.1 Examples of existing interfaces for solving audio CAPTCHAs. (a) A separate window containing the sound player opens to play the CAPTCHA, (b) the sound player is in the same window as the answer box but separate from the answer box, and (c) clicking a link plays the CAPTCHA. In all three interfaces, a button or link is pressed to play the audio CAPTCHA, and the answer is typed in a separate answer box.

5.2 A summary of the features of the CAPTCHAs that we gathered. Audio CAPTCHAs varied primarily along the several common dimensions shown here.

5.3 An interface for solving audio CAPTCHAs modeled after those currently provided to users to solve audio CAPTCHAs (Figure 5.1).

5.4 Percentage of participants answering each value on a Likert scale from 1 (Strongly Agree) to 5 (Strongly Disagree) reflecting perceived frustration of blind and sighted participants in solving audio and visual CAPTCHAs. Participants could also respond “I have never independently solved a visual [audio] CAPTCHA.” Results illustrate that (i) nearly half of sighted and blind participants had not solved an audio or visual CAPTCHA, respectively, (ii) visual CAPTCHAs are a great source of frustration for blind participants, and (iii) audio CAPTCHAs are also somewhat frustrating to solve.

5.5 The average time spent by blind and sighted users to submit their first solution to the ten audio CAPTCHAs presented to them. Error bars represent ±1 standard error (SE).

5.6 The number of tries required to correctly answer each CAPTCHA problem, illustrating that (i) multiple tries resulted in relatively few corrections, (ii) the success rates of blind and sighted solvers were on par, and (iii) many audio CAPTCHAs remained unsolved after three tries.

5.7 The new interface developed to better support solving audio CAPTCHAs. The interface is combined within the answer textbox to give users control of CAPTCHA playback from within the element in which they will type the answer.

5.8 The percentage of CAPTCHAs answered correctly by blind participants using the original and optimized interfaces. The optimized interface enabled participants to answer 59% more CAPTCHAs correctly on their first try as compared to the original interface.

6.1 A CoScript for entering time worked into an online time card. The natural language steps in the CoScript can be interpreted both by tools such as CoScripter and TrailBlazer, and also read by humans. These steps are also sufficient to identify all of the web page elements required to complete this task - the textbox and two buttons. Without TrailBlazer, steps 2-4 would require a time-consuming linear search for screen reader users.

6.2 The TrailBlazer interface is integrated directly into the page, is keyboard accessible, and directs screen readers to read each new step. A) The description of the current step is displayed visually in an offset bubble but is placed in DOM order so that the target of a step immediately follows its description when viewed linearly with a screen reader. B) Script controls are placed in the page for easy discoverability but also have alternative keyboard shortcuts for efficient access.

6.3 TrailBlazer guiding a user step-by-step through purchasing a book on Amazon. 1) The first step is to go to the Amazon.com homepage. 2) TrailBlazer directs the user to select the “Books” option from the highlighted listbox. 8) On the product detail page, TrailBlazer directs users past the standard template material directly to the product information.

6.4 The descriptions provided by two participants for the screenshots shown, illustrating diversity in how regions were described. Selected regions are 1) the table of statistics for a particular baseball player, and 2) the search results for a medical query.

6.5 Participant responses to Likert scale questions indicating that they think completing new tasks and finding content is difficult (1, 2), think TrailBlazer can help them complete tasks more quickly and easily (3, 4, 5), and want to use it in the future (6), especially if scripts are available for more tasks (7).

6.6 Proportion of action types at each step number for scripts in the CoScripter repository. These scripts were contributed by current users of CoScripter. The action types represented include actions recognized by CoScripter which appeared in at least one script as of October 2008.

6.7 The features calculated and used by TrailBlazer in order to rank potential action suggestions, along with the three sources from which they are formed.

6.8 Suggestions are presented to users within the page context, inserted into the DOM of the web page following the last element with which they interacted. In this case, the user has just entered “105” into the “Flight Number” textbox and TrailBlazer recommends clicking on the “Check” button as its first suggestion.

6.9 The fraction of the time that the correct action appeared among the top suggestions provided by TrailBlazer for varying numbers of suggestions. The correct suggestion was listed first in 41.4% of cases and within the top 5 in 75.9% of cases.

7.1 People often use computers besides their own, such as computers in university labs, library kiosks, or friends’ laptops.

7.2 WebAnywhere is a self-voicing web browser inside a web browser.

7.3 Survey of public computer terminals by category, indicating that WebAnywhere can run on most of them.

7.4 Browsing the ICWE 2008 homepage with the WebAnywhere self-voicing, web-browsing web application. Users use the keyboard to interact with WebAnywhere as they would with their own screen readers. Here, the user has pressed the TAB key to skip to the next focusable element, and CTRL+h to skip to the next heading element. Both web content and interfaces are voiced to give blind web users access.

7.5 The WebAnywhere system consists of server-side components that convert text to speech and proxy web content, and client-side components that provide the user interface and coordinate which speech will be played and play it. Users interact with the system using the keyboard.

7.6 Selected shortcut functionality provided by WebAnywhere and the default keys assigned to each. The system implements the functionality for more than 30 different shortcut keys. Users can customize the keys assigned to each.

7.7 Participant responses to WebAnywhere, indicating that they saw a need for a low-cost, highly-available screen-reading solution (7, 8, 9) and thought that WebAnywhere could provide it (3, 4, 6, 10). Full results are available in Appendix A.

7.8 Caching and prefetching on the server and client improve latency.

7.9 An evaluation of server load as the number of simultaneous users reading news.google.com is increased for 3 different caching combinations.

7.10 Counts of recorded actions along with the contexts in which they were recorded (current node and prior action), ordered by observed frequency.

7.11 Average latency per sound using different prefetching strategies. The first set contains tasks performed by participants in our user evaluation, including results for prefetching strategies that are based on user behavior. The second set contains five popular sites, read straight through from top to bottom, with and without DFS prefetching. Bars are shown overlapping.

8.1 Weekly web usage between November 15, 2008 and May 1, 2009. An average of approximately 600 unique users visit WebAnywhere each week. The large drop in users in December roughly corresponded to winter break. WebAnywhere offers the chance for a living laboratory to improve our understanding of how blind people browse the web and the problems that they face.

8.2 From November 2008 to May 2009, WebAnywhere was used by people from over 90 countries. This chart lists the 40 best-represented countries ranked by the number of unique IPs identified from each country that accessed WebAnywhere over this period. 33.9% of the total 23,384 IPs could not be localized and are not included.

8.3 WebAnywhere, May 2009. Since its release, new languages have been added to WebAnywhere. This screenshot shows an early Cantonese version of the system. We have also started to introduce features that may make it more useful for other populations. The text being read is highlighted within the page and shown in a magnified, high-contrast view.
ACKNOWLEDGMENTS
I would like to thank my advisor Richard Ladner. He has helped to shape all of the work
presented in this dissertation, and none of it would have turned out as well absent his input.
I have enjoyed my time working with Richard. He has helped me become the researcher
that I am today. From him, I have learned the importance of connecting my research to
impact and outreach. For that, I will always be thankful.
I also thank Jacob Wobbrock, who somehow has always known the right advice to
give on everything from research direction to future plans, and Tessa Lau, who graciously
entrusted me with her wisdom. Tessa pressed me to consider the broader implications of
my research and provides an excellent example of an incredibly successful researcher who
somehow manages a balanced life. I thank Jake and Tessa for helping me become a better
HCI researcher, both through their explicit mentoring and by example.
Anna Cavender has played an integral role in nearly all of the projects forming this dissertation. From early brainstorming and developing prototypes, to conducting user studies
and considering broader themes, my projects would not have been nearly as successful
without her input and support.
Craig Prince is one of the smartest and most talented people I know. He was always there to
listen to another crazy idea, spend sleepless nights helping to solidify prototypes, and get
everything working before the next deadline. Thank you Craig.
I thank Jeffrey Nichols for his energy and great conversations. In Jeff, I see an example
of what I want to become - a builder with a strong base in science, an innovator who pushes
the boundaries of what interfaces can do.
I would like to thank the following organizations who have helped to fund me over the
years: the National Science Foundation, Boeing, Microsoft, TRACE, and NASA. I would
like to especially thank Allan and Inger Osberg who funded me through my last year of
graduate school via their generous Osberg Fellowship.
A large number of other students, faculty, friends, and family have also influenced
this work. They are, in alphabetical order: Chieko Asakawa, Jennison Asuncion, Yevgen Borodin, Jeremy Brudvik, Charles Chen, Wendy Chisholm, Allen Cypher, Clemens
Drews, Reuben Firmin, James Fogarty, Jim Fruchterman, Susumu Harada, Simon Harper,
Sangyun Hahn, Shaun Kane, Ed Lazowska, Clayton Lewis, Benson Limketkai, Jimmy Lin,
I.V. Ramakrishnan, T.V. Raman, Hironobu Takagi, Gregg Vanderheiden, Lindsay Yazzolino
and Yeliz Yesilada. This list is a who’s who in the access and end user programming worlds
- I am grateful for the time that they have given me to learn from them.
I want to thank my parents, Richard and Peggy Bigham. The unshakable confidence
and support that they have shown in me throughout my life has allowed me to dream big.
Thank you.
Finally, I want to thank my wife Jennifer Bigham. Without her unwavering support,
my work would not have been possible. She inspires me with her unending enthusiasm and
positive attitude, and helps to connect me to what is really important in life. Thanks Jen
for forcing me to take a walk on a sunny day every once in a while even in the midst of yet
another important deadline; thanks for understanding when I disappeared to California for a
summer to pursue another important opportunity (twice!); and thanks for being everything
you are.
DEDICATION
This dissertation has two dedications.
• To my parents, Richard and Peggy Bigham, who have supported and encouraged me
throughout my life, helping me realize that anything is possible.
• To my wife Jennifer Bigham, whose unwavering love and support has encouraged me
to strive for my best.
Chapter 1
INTRODUCTION
The web holds incredible potential for blind computer users. Most web content is relatively open, represented in digital formats that can be automatically converted to voice or
refreshable Braille. Software programs called screen readers can convert some content to an
accessible form, but struggle on content not created with accessibility in mind. Even content that is possible to access may not have been designed for non-visual access, requiring
blind web users to inefficiently search through content using the linearized view of content
exposed by screen readers. Screen readers are expensive, costing nearly $1000 [82, 146], and
are not installed on most computers. These problems can make access impossible, unusable,
and unavailable for blind web users.
This dissertation explores intelligent tools that enable blind web users to collaboratively
solve these problems for themselves and share the results with one another.
1.1 Achieving the Full Potential of the Web
The web has already become an incredible resource for blind people. Unlike printed material of the past, most web information can be automatically converted to an accessible
form. Remaining problems can make web browsing frustrating and access to some content
impossible. Despite its potential, the web remains difficult to use for many of those who
could benefit most [34].
This is particularly true for blind web users whose view of content is devoid of the rich
visual structure that developers use to organize web information. Web developers can help
by creating content that relies less on its visual representation. Existing web standards,
such as the World Wide Web Consortium (W3C) Web Content Accessibility Guidelines [142]
and Section 508 of the U.S. Rehabilitation Act, offer quantifiable rules to which web developers can strive to adhere, but creating truly accessible content requires a more subtle and
discerning evaluation [85]. Web developers can improve access by following standards and
creating content with accessibility in mind, but many developers are either ill-equipped to
create accessible content or unaware of the problem.
Access problems are wide-ranging and preventing them can require a subjective design
process. Information encoded as images, in the layout, or as color is not accessible for blind
web users. Finding desired content in the long linear stream to which a complex page is
converted can be frustrating for screen reader users. Small text and fixed-width structures
do not work well for low-vision users or anyone using a small screen device. With so many
possible problems, everyone can find themselves in a situation in which the web fails to work
as well as it could for them, preventing full realization of the web’s potential.
Compounding the problem are the numerous stakeholders in web access, who often
lack tools and opportunities to improve access and help one another. Blind people often
cannot fix the problems that they experience themselves and cannot easily communicate the
problems to those who could. Developers and content producers may not understand the
problems faced by users different than themselves, or may overestimate the costs of fixing
existing problems. Developer tools and user agents have evolving support for creating and
conveying accessible content, and often play catch up when new technologies are released.
This dissertation considers intelligent tools that enable blind web users to help one another
improve the web.
1.1.1 A Motivating Example
As an example, consider the following scenario describing the problems that Joe, a blind
computer user, might experience when reading his web-based email and following a link sent
by a friend. First, Joe opens a web browser and navigates to gmail.com. His screen reader
presents the content in a linear, time-dependent stream of voice. To login to the site, Joe
first needs to find the “username” textbox. Fortunately, he has memorized the keyboard
shortcut to skip directly there, is able to quickly enter his username and password, and then
press submit.
Figure 1.1: A motivating example of existing access problems. (a) Finding content, even on the relatively simple gmail.com login page, can be time-consuming. (b) An incorrect login is difficult to detect and an audio CAPTCHA must be solved to try again. (c) The most efficient route to the inbox requires knowing arbitrary key mappings tied to the underlying HTML structure of the web page, (d) as does finding the beginning of the message. (e) A table of important statistics and other information on nytimes.com is an image assigned the uninformative alternative text “INSERT DESCRIPTION,” making it impossible for a screen reader user to read.

Joe’s screen reader announces that a page has loaded and he begins to explore the new
page, expecting to find his inbox. Skipping through the page by heading tags usually lets
him get to his inbox more quickly, but his screen reader unexpectedly reports that the page
contains no heading tags. After some exploration, Joe eventually realizes that he’s still on
the login page. Without a quick visual scan, it was not obvious that the page had not changed as he expected, and finding the error message explaining that he had not entered his password correctly would have taken a long time. The interface was designed
for visual use, and can be frustrating for Joe to use with a screen reader even though all of
the content is technically accessible to him.
Due to the failed login, gmail.com asks Joe to complete a CAPTCHA. The site provides
an audio CAPTCHA as an accessible alternative, but the interface provided is not easy to
use with a screen reader. Joe presses a button to play the CAPTCHA and then navigates to
an answer field to enter the text that he hears. During this navigation, his screen reader talks
over the playing audio CAPTCHA, announcing navigation information. Audio CAPTCHAs
are purposefully made to be difficult to understand, and the added interference from the
screen reader makes them all but impossible. Joe eventually just asks a sighted friend to
solve the much easier visual CAPTCHA.
After successfully logging into the site, Joe reads his new messages. To reach the inbox,
he presses the ‘h’ key repeatedly to cycle through the headings on the page. Once he
hears “Gmail Inbox,” he can navigate forward from the beginning of the inbox to review
his messages. Joe has not yet learned the trick of pressing the ‘x’ key to jump directly
to the inbox, which works because each message in the inbox happens to be preceded by
a checkbox. This mapping is arbitrary, impossible to predict before visiting the site, and
confusing to a non-technical user.
A new message from a friend includes a link to an article on nytimes.com that discusses
the prevalence of bullying in elementary schools [23]. A figure containing numerous statistics
and descriptive text supporting the claims in the article is represented as an image. The
creator of this content chose not to provide a meaningful textual description that the screen
reader could read. Instead, a default description (“INSERT DESCRIPTION”) is read when
Joe reaches the image.
Finally, Joe had to wait until he returned home to his own computer before he was able
to access his email even though he had been at the public library earlier that day. The
computers at the library he visited did not have screen readers installed on them. Without
a screen reader, it was not possible for him to use those computers. He could have tried
to run the screen reader executable that he keeps on a USB keychain, but the USB ports
weren’t easy to access and the computers prevent new software from being run as a security
precaution anyway.
The preceding example highlighted a number of accessibility problems that a blind person might encounter while completing the common task of checking their email. Most of
these problems have work-arounds - for example, blind users can always default to a slow
linear scan of a page to find changed content, or they can carry their own laptop around
with them so they’ll also have access to a screen reader. Some problems, like the image
lacking alternative text, do not have easy work-arounds. Nevertheless, with the right tools
(for instance Optical Character Recognition (OCR) software), blind people could use this
content. Even the lack of access technology could be addressed by blind web users if a base
level of access was available from any computer. Once one blind user has solved a particular
problem, future users should benefit from the prior solution.
This dissertation looks at how blind web users could be empowered to collaboratively improve web content, and help one another more effectively overcome the problems highlighted
in the motivating example.
1.2 Accessibility, Usability and Availability
Creating accessible content is multi-faceted, just like any design process. Content is not
simply accessible or not; instead, access exists on a spectrum across several dimensions.
The accessibility and usability of content have been extensively explored [130, 95], and
research suggests that they are linked [103]. We also explore the availability of access, a less
frequently explored dimension that often determines successful web access.
This dissertation considers how end users can improve their access along the dimensions of accessibility, usability and availability (Figure 1.2). The remainder of this section
describes these dimensions in relation to web access and provides examples.
1.2.1 Accessibility
Accessibility means making access to content possible. For example, in the motivating
example presented earlier, Joe was unable to access the statistics and other information
contained within the image that was not assigned alternative text. Other examples include
encoding information in either the color or visual layout of the page - for instance, “names marked in red have been chosen to participate” - and embedding content that screen readers cannot access - for instance, some Flash applications do not correctly implement the accessibility APIs that enable blind people to access them.

Figure 1.2: Effective web access involves more than simply making it possible for users with diverse abilities to access content. Content must also be usable and the tools needed to access it widely available. Accessibility is the foundation of usability and availability, usability increases the potential audience for whom access is possible, and availability determines where content can be accessed and who will be able to access it.
Chapter 4 discusses a tool called Accessmonkey that helps end users write scripts to
improve the accessibility of the content that they access. For example, one component of
this tool can automatically formulate alternative text for images. Blind and sighted people
can use Accessmonkey to create scripts that make content more accessible and then share
those scripts with others. Making content possible to access is fundamental to improving
access along other dimensions.
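To make this concrete, the following sketch shows the general shape of such a script: it scans the page for images whose alternative text is missing or is a known placeholder, and assigns a simple fallback derived from context. This is an illustrative sketch only, not the actual WebInSight component; the deriveAltText helper and its heuristics are our own invention and far simpler than the real pipeline:

    // Hypothetical Accessmonkey-style user script (illustrative sketch only).
    (function () {
      function deriveAltText(img) {
        // If the image is inside a link, describing the link target is often
        // more informative than saying nothing at all.
        var node = img.parentNode;
        while (node && node.nodeName !== 'A') {
          node = node.parentNode;
        }
        if (node && node.href) {
          return 'Link to ' + node.href;
        }
        // Otherwise fall back to the image filename, e.g. "sales-chart.png".
        var parts = img.src.split('/');
        return parts[parts.length - 1] || 'Image';
      }

      var images = document.getElementsByTagName('img');
      for (var i = 0; i < images.length; i++) {
        var alt = images[i].getAttribute('alt');
        // Treat missing alt text and known junk values as needing repair.
        if (alt === null || alt === '' || alt === 'INSERT DESCRIPTION') {
          images[i].setAttribute('alt', deriveAltText(images[i]));
        }
      }
    })();

A repair script of this form can then be shared, so that a fix written once benefits every later visitor to the same page.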
1.2.2 Usability
Usability means creating content in a way that helps users be more effective and efficient.
Content can be possible for someone to access, but it might require inefficient or confusing
interactions. Just as the elements of usable visual designs are partly subjective, so are the
elements of design for other modalities. A complex page containing a large amount of
information may be confusing or inefficient for blind web users to navigate, but adding
semantic information - for example, an outline using heading tags - can help blind users
more effectively access it [138].
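For illustration, the short script below approximates the heading navigation that screen readers provide (cycling through headings with the 'h' key), which is why an explicit heading outline gives blind users something concrete to navigate by. It is a hypothetical sketch, not a system described in this dissertation:

    // Illustrative sketch: cycle keyboard focus through the page's headings,
    // roughly approximating screen reader heading navigation.
    (function () {
      var headings = document.querySelectorAll('h1, h2, h3, h4, h5, h6');
      var index = -1;
      document.addEventListener('keydown', function (e) {
        // Ignore keystrokes typed into form fields.
        if (/INPUT|TEXTAREA|SELECT/.test(e.target.nodeName)) return;
        if (e.key !== 'h' || headings.length === 0) return;
        index = (index + 1) % headings.length;
        var h = headings[index];
        h.setAttribute('tabindex', '-1');  // make the heading focusable
        h.focus();                         // focus changes are announced by screen readers
      });
    })();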
Chapter 5 illustrates this concept by exploring in depth the interactions required to
solve current audio CAPTCHAs. By making the CAPTCHA interface more usable, blind
participants improved their success rate by 59%. The improvements made demonstrate
the importance of broadly applicable principles in designing content for blind web users.
Importantly, web users themselves have control over the interface they use to solve audio
CAPTCHAs as they can overwrite the existing interface with the improved interface using
a script that we provide.
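A minimal sketch of this single-textbox idea follows. The element id, audio URL, and the comma key binding are placeholders chosen for illustration, not necessarily what the actual interface in Chapter 5 uses:

    // Illustrative sketch: playback of the audio CAPTCHA is controlled from
    // inside the answer box, so users never navigate away while it plays.
    (function () {
      var audio = new Audio('captcha.mp3');               // hypothetical URL
      var answerBox = document.getElementById('captcha-answer');
      answerBox.addEventListener('keydown', function (e) {
        if (e.key === ',') {         // designated replay key (placeholder choice)
          e.preventDefault();        // keep the comma out of the typed answer
          audio.currentTime = 0;     // restart playback from the beginning
          audio.play();
        }
      });
    })();

Because the screen reader's focus never leaves the answer box, it no longer talks over the playing CAPTCHA while the user navigates.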
TrailBlazer, presented in Chapter 6, extends the idea of a more usable interface to arbitrary web-based tasks. Blind web users have been shown to be less efficient than their sighted counterparts [17], in part because it takes time to find content in the time-dependent, linear
stream of information exposed by the screen reader. TrailBlazer helps guide users through
completing tasks on the web, suggesting actions that they might want to take based on
their prior actions and those of other users.
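The flavor of this kind of prediction can be sketched as follows; the features and weights below are invented placeholders for illustration, not TrailBlazer's actual model, which Chapter 6 describes:

    // Invented-for-illustration scoring of candidate next actions.
    function rankSuggestions(candidates, history) {
      var last = history[history.length - 1];
      var scored = candidates.map(function (action) {
        var score = 0;
        if (last && action.page === last.page) {
          score += 1;                             // stays on the current page
        }
        if (last && action.follows === last.type) {
          score += 2;                             // often follows the last action type
        }
        score += action.timesChosenByOthers || 0; // popularity among other users
        return { action: action, score: score };
      });
      scored.sort(function (a, b) { return b.score - a.score; });
      return scored.slice(0, 5);                  // surface only the top suggestions
    }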
1.2.3 Availability
The technology that users need to access the web is often not available to them. Many
people do not own their own computers and rely on public computers, such as those at
libraries and schools, for access. Access technology is not installed on most computers for
many reasons. Improving availability means making access technology available wherever
users happen to be. Chapter 7 explores tools that bring access technology, including many of
the projects outlined elsewhere in this dissertation, to the computers to which users happen
to have access. Chapter 8 discusses the implications of a more available delivery model for
access technology.
Summary
Many access problems can be improved with existing technology. Accessible alternatives
can be provided for inaccessible content, web interfaces can be designed with careful consideration of blind people using screen readers, and computer administrators can install screen
readers on all computers. With current approaches, there is always a bottleneck, someone
who may not make the best choices for access but on whom access relies.
The creation of accessible web content relies on web developers and content producers, and getting access software onto public computers relies on computer administrators. While
these people are in the best position to improve access, they may not have the knowledge,
desire or awareness to make the best choices for achieving access.
1.3 Who should fix the web?
Who should make the web accessible? This is a technical question, a legal question, and
even an ethical question. It is a question that can conjure a desire to help, but one that
quickly leads to a litany of other questions: What does it mean to make the web accessible?
How much will it cost? Do disabled people even visit my site? With so many questions
being asked, and with few concrete answers available, often nothing is done.
The question of who should make the web accessible is misleading. Until recently, only
content producers could directly influence its accessibility; blind end users were primarily
consumers. This dissertation shows the advantages of a complementary approach: enabling
blind web users to collaboratively improve the accessibility of their own web experiences.
Expanding the set of possible whos to anyone with the need, incentive or desire to create a
more accessible web forms a more pragmatic solution to access problems.
This dissertation discusses how to better understand the problems that blind web users
face on the web and enable blind web users to independently improve their own access across
the three dimensions.
1.3.1 Enabling Users to Improve Access
In order to enable blind web users to improve the accessibility of their own web experiences,
we made contributions in the following areas: recording web interactions, understanding
the problems that users face, predicting future web interactions, and leveraging predictions
to help end users improve access for themselves. Our goal is to understand what problems
most impact the access of users, and develop tools that effectively leverage the intelligence
of blind web users to improve access for everyone.
The contributions of this dissertation are summarized in Section 9.1 and described in more
detail throughout the document as outlined next.
1.4 Dissertation Outline
The remainder of this dissertation is organized as follows:
Background (Chapter 2): Overview of prior work in (i) understanding the user experience, (ii) improving the accessibility of web content for blind web users, and (iii) improving the availability of access technology.

Understanding Problems (Chapter 3): Describes WebinSitu, infrastructure for conducting remote studies with disabled users. WebinSitu records web interactions with a proxy and, unlike many prior systems, records user actions within web pages. The studies that we completed using WebinSitu quantify the differences between blind and sighted web users.

Accessibility (Chapter 4): Explores the potential of blind web users to both collaboratively improve accessibility for themselves and partner with web developers to improve their content. Presents an implementation of this idea, Accessmonkey, which enables users to create and inject scripts across many platforms to improve the accessibility of web content and share improvements with web developers.

Usability (Chapters 5 and 6): Demonstrates via a large study of audio CAPTCHAs conducted with WebinSitu that audio CAPTCHAs are much more difficult than visual alternatives. Offers a redesigned interface that targets non-visual use, which blind web users can add to existing web sites with Accessmonkey. Provides quantitative results showing the importance of designing non-visual interfaces with users in mind. Presents TrailBlazer, which lets blind web users record and replay tasks on the web to make browsing more efficient. By predicting the actions that users might want to complete next, TrailBlazer can also help make users more effective at completing tasks for which no script has already been recorded.

Availability (Chapters 7 and 8): WebAnywhere improves the availability of access by enabling users to access the web on any computer that has web access. WebAnywhere can also help get technology for improving accessibility and usability to users wherever they happen to be. WebAnywhere is broadly a platform for delivering access technology to people with diverse needs, requirements and goals. Since its release, WebAnywhere has drawn a large global audience of blind and low-vision users, but also web developers, special education teachers, people learning English, and people with learning disabilities. Because WebAnywhere is free and does not need to be installed, people can quickly try it and use it if it works for them.
Realizing the full potential of the web for blind people involves improving access across
several dimensions. Relying only on developers or computer administrators to create accessible and usable content and to provide the tools required by users has been shown not to be enough to address these problems. This dissertation describes intelligent tools that can help
blind web users partner with developers and administrators without relying on them.
Chapter 2
RELATED WORK
This chapter summarizes related work in understanding and improving the accessibility,
usability and availability of the web. Section 2.1 considers work in understanding the web
experience of blind users, specifically in terms of the accessibility and usability of web
content, and describes gaps in this understanding that WebinSitu (Chapter 3) helps to fill.
Section 2.2 describes work that has sought to improve the accessibility and usability of
web content, either automatically or by supporting developers. This dissertation explores
several tools that involve users in the process of improving content (Chapters 4-6). Section
2.3 considers a wide variety of products designed to improve the availability of web access
for blind web users, motivating the need for WebAnywhere (Chapter 7), which provides a
base level of access on any web-enabled device.
2.1 Understanding the User Experience
A vital component of improving web accessibility is understanding the user experience of
those involved. Previous work has offered surprising results: for instance, that blind web
users may not evaluate the accessibility of web content as thoroughly as sighted developers
employing screen readers [85]. More work needs to be done to understand the user experience from the perspective of blind web users and understand the implications for blind
users working collaboratively to help improve access. In this section, we review work in
understanding the experience of web users, especially disabled users, motivating our own
WebinSitu approach (Chapter 3).
The Disability Rights Commission in the United Kingdom sought to formally investigate
the accessibility problems faced by users with different disabilities [34]. In this extensive
study, the results of focus groups, automated accessibility testing, user testing of 100 websites, and a controlled study of 6 web pages were combined. This work identified the effects
of a number of different accessibility problems. Coyne and Nielsen conducted extensive
observation of blind web users by going to the homes and workplaces of a number of blind
individuals [35]. Each session comprised manually observing users completing four specified
tasks, which did not allow them to record low-level events associated with each browsing
session or for extended periods. Since both studies were conducted, new web technologies such as scripting and dynamic content in the form of AJAX, Adobe Flash, and Rich Internet Applications (RIAs) have become increasingly important. WebinSitu
adds to both studies by enabling the observation of participants over a longer period of
time. We measure the practical and observable effects of accessibility features on web users,
which is difficult to determine in a controlled lab study.
Watanabe used lab-based studies to find that the proper use of HTML heading elements
to structure web content can dramatically improve the observed completion time of both
sighted and blind web users [138]. They used screen recordings and key loggers combined
with in-person observation to implement their study, but another approach that has been
explored is to consider the accessibility of the web divorced from user behavior. Some studies
have used manual evaluation of pages [102] and others have been conducted automatically
via a web crawl [18, 36]. Other studies have used automated evaluation [39]. Not considering
which pages users will likely visit or the coping strategies they might employ makes the
practical effects of the results obtained difficult to interpret.
Because of the difficulty of finding participants who meet specific requirements in a given geographic area and of recreating personalized setups and software configurations, studies with disabled populations can be costly, time-consuming and, in many cases, impractical. This has led some researchers
to utilize remote evaluation and observation procedures instead [101, 85, 27]. Such user
studies are particularly well-suited for investigating web accessibility issues because users
employ a variety of screen readers and other assistive technologies that may be difficult to
replicate in the lab. Because these technologies are designed for web usage, they are already
connected to the Internet and are therefore more easily adapted to remote monitoring. In
the remainder of this section, we first review work that has investigated remote user evaluation with disabled users in order to highlight the lessons learned. We will then discuss
papers that investigate technical aspects of observing web users remotely.
2.1.1 Remote Studies with Blind Web Users
The most common type of remote study involving blind web users is a diary-based study.
For example, Lazar et al. conducted a study in which blind web users recorded their web
experiences in user diaries and discovered that, in contrast to sighted users, the mood
of blind users does not seem to deteriorate as the inefficiency of the browsing experience
increases [74]. Possible problems with diary-based studies are that users may misrepresent
their difficulty in achieving tasks and may choose to report only experiences that are unusual.
Another option is on-site observation in the homes and offices of participants, as explored by
Coyne et al. [35]. On-site studies are expensive and impractical for longitudinal observation.
Mankoff et al. compared the following four evaluation techniques for discovering accessibility problems in web pages: the Bobby [140] automated accessibility evaluator, web
developers, web developers using screen readers, and blind web users who evaluated web
pages remotely [85]. The web developers were each introduced to the Web Content Accessibility Guidelines (WCAG) 1.0 accessibility standard [143]. Based on representative tasks
developed in a baseline investigation with blind web users, members of each of the four
conditions were asked to identify accessibility problems. The results indicated that multiple
web developers employing screen readers were able to find the most accessibility problems
and the automated tool was able to find the least. Remote blind users, although shown
to be much less thorough at identifying web accessibility problems, most often found true
problems (that is, they labeled few false positives). This number was also artificially deflated because users were unable to complete some of the tasks due to severe accessibility problems.
The researchers also speculated that the remote users were not adequately encouraged to
report all accessibility problems and hoped to improve on this in future work.
Petrie et al. investigated remote evaluation by disabled users further by comparing
the results obtained via remote evaluation as compared to laboratory evaluation [101]. In
this work, participants conducted the following two evaluation tasks: an evaluation of a
system that converts certain technical diagrams into spoken language, and a summative
evaluation of websites. Users that conducted their evaluations remotely were more likely to
give high-level qualitative feedback whereas users that evaluated locally gave more specific
evaluations. For example, when evaluating the same feature, a remote participant
said “there are a lot of problems with the description,” while a local participant said, “I take
it each thing in brackets is the thing they are talking about [Door to Hall and Living room]...
[Door to Bathroom] it is not clear what the relationship is... I cannot technically understand
these relationships...it just doesn’t work.” While both participants expressed their inability
to understand the relationships expressed by this system, the local participant provided
much more useful feedback. In other instances, usability was so poor that remote users
could not determine if they had successfully completed a task and, therefore, provide an
adequate evaluation. In local studies, researchers could tell participants whether or not they
had successfully completed the task. An important conclusion of this work is that while
remote evaluation may be appropriate for summative results, it is often not as valuable
during formative studies when the technology may not work as intended and researchers
benefit greatly from observing how users attempt to interact with it.
A common theme of both the Mankoff and Petrie studies is that, to maximize the value of remote
evaluation, the technology used must achieve at least a base level of usability, otherwise
participants may not be able to provide useful feedback. When operating away from the
guidance of researchers, participants may be less likely to be able to successfully complete
required tasks. Furthermore, the quality and extent of qualitative feedback is likely to be
much greater in local studies in which participants can interact with the researchers. We
seek to address these problems with our WebinSitu work by asking qualitative questions
before and after remote studies.
2.1.2 Recording User Data
Numerous projects have been created to record and aggregate web usage statistics to better
understand the experience of web users. A common approach has been to use a traditional
proxy that passively observes participants as they browse the web.
Users have been central to the web from its inception and an extensive body of work has
been dedicated to better understanding the user experience. Initially, much of this research
concentrated on improving the quantitative performance of web infrastructure components.
The Medusa proxy allows web researchers to go a step beyond these first systems by investigating the impact such systems have as perceived by the user [71]. The Medusa proxy
is a non-caching forwarding proxy that provides several convenient features to researchers
investigating the user experience. These features include the ability to simultaneously mirror requests (send a request to multiple sources) and the ability to transform user requests.
The Medusa proxy was used to explore the user-perceived advantages of using the NLANR
cache hierarchy and the Akamai content distribution network. Later work used the proxy
to discover that of several HTTP optimizations, parallel connections provide the largest
benefit and that persistent connections are helpful, but only on the subset of pages on which
they are fully utilized [12]. The ability of the Medusa proxy to record and then replay
request logs allows collected data to be reused, reducing the cost of testing variations.
Medusa allows researchers to accurately record quantitative data about web browsing
as long as that data can be directly observed as part of an HTTP request, but the system is
unable to record finer-grained user events such as mouse movement, button clicks, or keys
pressed. Such detail is important for discovering which web components directly affect the
observed access of users.
Goecks and Shavlik used an augmented browser to record the mouse movements and
scrolling behaviors of users. They showed that these recordings could be used to predict
how users would interact with their web browsers [45]. Claypool et al. utilized a similar
approach in order to compare the explicit ratings that users provide to web pages with
several methods of gathering implicit ratings [32]. While an augmented web browser allows
data to be recorded at a finer granularity, we would like to avoid requiring users to use
specific web browsers, as doing so introduces confounding factors related to new tools. Web
users with disabilities also use a wide variety of web browsers and screen readers, making it
impractical to provide a version tailored to each user’s individual setup. WebinSitu (Chapter
3) addresses these concerns using an enhanced web proxy.
A proxy-based approach is desirable because it is relatively easy to set up and users can
be remotely observed using the technology to which they are already accustomed. Gonzalez
introduced a Java-based system that is capable of remotely monitoring users as they browse
the web [46] and later advocated its potential for conducting remote usability studies with
people with disabilities [47]. The goal of using a proxy is to eliminate confounding factors
that are often an unfortunate consequence of traditional laboratory studies, such as users
being unable to use the assistive technology to which they have become accustomed. The
Gonzalez proxy introduces Java applets into web pages that, once on a client’s machine,
can observe the actions of users and report back to the remote server.
UsaProxy takes the lighter-weight approach of using Javascript to remotely record user actions,
which its authors found to be quite powerful [9]. In this approach, users connect to a proxy server which
alters web pages on-the-fly to include a Javascript script that monitors the user’s actions.
When a page is loaded, the Javascript script attaches event listeners to most Javascript
events, including the keypress, mouseover, and mousedown events. It also includes code
that polls at regular intervals to determine the position of the mouse. For all events, the
script records the event type, the time the event occurred and the element of the Document
Object Model (DOM) associated with it (when applicable). To allow this information to
be recorded, the system sends messages back to the server using AJAX. The resulting log
file contains a combination of these user event recordings and the information recorded by
a traditional web proxy, such as the Medusa proxy. Use of the system is unobtrusive and
does not affect the normal function of the web browser. Atterer et al. later demonstrated
how a complex AJAX application (gmail.com) can be remotely observed and evaluated
using the system [8]. Javascript running in the browser is particularly suited for this type
of observation because it has access to the DOM representation of the document being
displayed, and automatically incorporates updates that have been made dynamically.
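To make this approach concrete, the following minimal sketch (with an assumed endpoint and assumed record fields, not UsaProxy's actual code) illustrates how an injected script can listen for user events, poll the mouse position, and report records back to the proxy via AJAX.

    // Sketch of an injected logging script (illustrative; UsaProxy's actual
    // implementation differs in detail). The proxy adds this to each page.
    (function () {
      var LOG_URL = '/log';  // assumed logging endpoint on the proxy

      // Send one event record back to the proxy asynchronously so that
      // browsing is not interrupted.
      function report(type, target) {
        var record = {
          time: Date.now(),
          type: type,
          // The DOM element associated with the event, when applicable.
          tag: target && target.tagName ? target.tagName : null
        };
        var xhr = new XMLHttpRequest();
        xhr.open('POST', LOG_URL, true);
        xhr.setRequestHeader('Content-Type', 'application/json');
        xhr.send(JSON.stringify(record));
      }

      // Attach listeners for most user-generated events.
      ['keypress', 'mouseover', 'mousedown'].forEach(function (type) {
        document.addEventListener(type, function (e) { report(type, e.target); }, true);
      });

      // Poll at regular intervals to determine the position of the mouse.
      var x = -1, y = -1;
      document.addEventListener('mousemove', function (e) { x = e.clientX; y = e.clientY; }, true);
      setInterval(function () { if (x >= 0) { report('mousepos ' + x + ',' + y, null); } }, 1000);
    })();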
WebinSitu (Chapter 3) uses the UsaProxy approach, but modifies it to record additional
information useful for understanding the user experience of blind web users. The framework
exposed by WebinSitu facilitates short experiments or longitudinal studies on arbitrary web
content, and will be discussed in the next chapter.
2.2 Enabling Users to Improve Accessibility and Usability
The accessibility of web content can be implemented and improved during the many stages
between when the idea for the content is originally conceived and when the implemented web
page is conveyed to web users. A simplified view of the stages of this process is outlined in
Figure 2.1: An overview of the flow of accessible content from web developer to blind web
user, along with a selection of the components designed to improve web accessibility that
have been explored at each stage. While later stages can influence earlier stages, such
change is slower and more difficult to achieve.
Figure 2.1, including a selection of techniques explored at each stage that help to situate
the related work presented in the remainder of this section. Most of this work
has not involved blind web users in the process of improving content.
This section highlights work at each stage of the web publishing pipeline, discussing a
sample of the solutions that have been explored at each.
Accessibility Standards
Web accessibility standards set guidelines that, if met, should ensure that a web page is accessible to blind users and other web users with disabilities. Achieving accessibility through
a set of specific guidelines is difficult in general because web accessibility requires providing
efficient access, not just making information available. By requiring the web developers in one
condition of the comparative study by Mankoff et al., discussed previously, to use a screen reader,
the researchers may have implicitly forced those developers to consider usability [85]. This may partially explain why they fared
better than their counterparts who did not make use of the software. Problems with static
standards have been previously noted [69], but such standards remain in use, in part, due to the difficulty
of formulating, checking and enforcing more subjective usability standards.
The most important web accessibility standards for web developers in the United States
are the World Wide Web Consortium (W3C)’s WCAG 2.0 [142] and the technology access
requirements of Section 508 of the U.S. Rehabilitation Act of 1973 that were expanded and
strengthened by the Rehabilitation Act Amendments of 1998. Many other countries have
similar accessibility requirements, many of which are based on the W3C’s WCAG [126].
In the United States, only web sites that receive funding from the Federal government are
compelled to comply with Section 508 guidelines, while private entities are exempt. A
recent court case, however, may expose a wider range of web sites to legal liability
if they fail to implement accessibility features. In National Federation of the Blind vs.
Target Corporation, a Federal Circuit Court ruled that Target Corporation could be sued
on grounds that its inaccessible web site violated the Americans with Disabilities Act (ADA)
because the web site is an extension of the physical store [93].
Developer Tools
Numerous tools are available to web developers that automatically identify accessibility
problems. Some of the most popular include A-Prompt [1], UsableNet’s LIFT [77], W3C
Accessibility Validation Service [139], IBM alphaWork’s aDesigner [63], and Watchfire’s
Bobby Worldwide [140]. These tools commonly report on how well a web page adheres
to web standards, reporting problems that can be identified automatically, such as missing
alternative text, missing row and column heading tags in HTML tables, and use of deprecated
HTML tags instead of Cascading Style Sheets (CSS).
Automated tools can provide advice to web developers on how to fix the errors that
have been identified, but they cannot offer subjective suggestions or critical feedback. As
an example, a web page that uses zero-length alternative text for all images will pass most
validators with only warnings because zero-length alternative text is appropriate for purely
decorative images. As noted previously, web developers must be skilled in order to effectively
implement standards.
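As an illustration, a single such automated check might look like the following sketch (illustrative only; real validators implement many rules and are far more thorough).

    // Sketch of one automated accessibility check: flag images with a missing
    // alt attribute as errors and zero-length alt text only as warnings, since
    // empty alt text is appropriate for purely decorative images.
    function checkAltText(doc) {
      var problems = [];
      var images = doc.getElementsByTagName('img');
      for (var i = 0; i < images.length; i++) {
        if (!images[i].hasAttribute('alt')) {
          problems.push({ element: images[i], level: 'error' });
        } else if (images[i].getAttribute('alt') === '') {
          problems.push({ element: images[i], level: 'warning' });
        }
      }
      return problems;
    }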
Another approach is to semantically tag web page elements using ontologies that client
tools can use to improve the interface to content. The Dante approach recognizes that not
all elements of accessibility, particularly those that deal with efficient use of a web page,
can be met by existing Hyper Text Markup Language (HTML) markup, and offers web
developers the ability to semantically annotate their web pages in order to facilitate more
appropriate audio presentation. The approach was introduced by Yesilada et al., and this
semantic knowledge was originally assigned manually [150]. The Web Authoring for Accessibility (WAfA)
ontology (introduced as the Travel Ontology) was designed to facilitate web navigation and
is the ontology that Dante uses for annotation [149]. This ontology includes such concepts as
Advertisement, Header, and NavigationalList. Once annotations from this ontology
are made to a web page, Dante allows easier navigation of it by performing transformations,
such as removing advertisements, providing skip links to bypass headers and navigation
links, and allowing users to easily move between semantically-related sections of a page.
A downside of this approach is that it required manual annotation by web developers.
Work by Plessers et al. removes the manual component of annotating visual objects with semantic concepts entirely by building such annotation directly into the design process [105]. This work was based on an existing web engineering approach (Web Site Design
with semantic guidelines entirely by building such annotation directly into the design process [105]. This work was based on an existing web engineering approach (Web Site Design
Method (WSDM) [38]), but could potentially be extended to work with any formal design
methodology. Design methodologies in general, and WSDM in particular, help web developers break down the task of creating a usable web site into a series of manageable phases.
They force designers to carefully consider their target audience and the tasks they will most
likely want to perform before considering low-level details related to implementation. As
part of this process, WSDM guides users through creating models of navigation, tasks and
objects on the site. Plessers et al. demonstrated that 70.51% of the WAfA ontology could be
automatically generated on web sites that were created using the WSDM process by directly
mapping elements of the WSDM ontology to elements in the WAfA ontology (84.62% in the
best case). This approach certainly has potential in that it shows web developers who are
already using a formalized method for designing and implementing web pages that they can
build in accessibility without added cost. Unfortunately, its practical benefit would seem
to come largely from its ability to shift the justification for semantic annotation of content
away from the merits of accessibility toward the merits of using a formal
design method. Plessers et al. also showed that accessibility through semantic markup can
be built into dynamically-generated, template-based web pages, which is a powerful idea
and may benefit the accessibility of a number of web sites even if they are not employing a
formal design method. The method presented in this work still requires some manual
annotation, however, and so still suffers from the drawbacks associated with such approaches.
Automatic Transcoding for Improved Accessibility
Automatically transcoding content in order to render it more accessible is an approach that
has been explored extensively. Tools that use this approach generally intercept content after
it is retrieved from the web server (where it is stored) and before it is read to web
users. This step can occur as part of an edge-service co-located with the web server, as part
of a transformation proxy to which users connect, or as part of a web browser plugin or
extension.
A number of previous systems have been introduced that help screen reader users independently transform documents to better suit their needs, including several systems that
use a proxy-based architecture that allows web pages to be automatically made more accessible before they are delivered to blind web users. Iaccarino described edge-services for
improving web accessibility designed to be implemented at the edge between web servers
and the Internet [61, 62]. Others have proposed that such transformations be implemented
in a proxy to which clients can connect [53]. Harper et al. created an extension for the
Firefox web browser that inserted GIST summaries to allow blind users to get
an overview of web content without reading the entire page [54].
Luis von Ahn et al. created online games that entice humans to accurately label images
and suggested that the labels produced could be stored in a centralized database where
they could be used to help make web images more accessible [135, 134]. Altifier provides
alternative text for images using a set of heuristics [136].
Chapter 4 discusses how we can apply many transcoding services using Javascript scripts
that can be introduced by web pages, web proxies or client-side tools. As an example, we
present our WebInSight system, which automatically formulates and adds alternative text to
web images that lack it [18].
Screen Readers
Screen readers were originally developed in the context of console computing systems and
converted the visual information displayed by a computer to text that could be read (hence
the name screen reader) by directly interfacing with the computer’s screen buffer. Since the
arrival of this technology in the 1980s, a number of more advanced screen readers have been
developed to handle the graphical user interfaces exposed by modern operating systems.
Popular screen readers include Freedom Scientific’s JAWS [82], G.W. Micro’s Window-Eyes
[146], IBM’s Home Page Reader [64], IBM alphaWorks’ open source Linux Screen Reader
(LSR) [78] and Screenreader.net’s freeware Thunder Screen Reader [127]. Modern screen
readers attempt to interface directly with the applications that they are reading and have
introduced technology, such as off-screen models of web browsing, that seeks to overcome
limitations of that interface.
Screen readers have been developed extensively to make web content accessible and they
work reasonably well when content has been appropriately annotated for accessibility, but
the user experience can still be frustrating and inefficient. The hypertext environment of
the web is represented by a wealth of links, multimedia and structure, and the HTML which
represents much of it often lacks the necessary annotation for conveying this information
non-visually. For a blind person, a screen reader currently provides the most direct control
over how web content is presented. Determining the semantic purpose of web elements is
difficult in general, so most screen readers instead present content based on the underlying
HTML markup.
The Hearsay web browser is a screen reader that is designed to be controlled by voice
and transforms web content into VoiceXML for audio output [106]. Hearsay does extensive
processing on the DOM of web pages to automatically transform it into a semantic tree
represented in VoiceXML [131] that is ideal for audio browsing. This approach leverages
automatic identification of repetition and locality in HTML documents in order to derive semantic structure. The process for discovering implicit schemas on web pages
was derived from earlier work by Mukherjee et al. that focused on isolating the schemas
imposed by template-driven HTML documents [91]. This process identifies semantically-
related groups of objects that appear together, such as items appearing together in a list.
For instance, alternating <h1> and <p> tags in HTML often express the semantic pairings of a title of an object and its summary. In this example, these would all be siblings
of their parent node, but for auditory browsing they should be paired to allow the listener
to efficiently select each group. To aid in the successful reconstruction of the semantic
tree, the system uses various heuristics in order to accept partial matches, and leverages
semantic similarity of nodes as determined by the similarity of the words contained within
them. CSurf, a recent addition to the Hearsay web browser, allows web users to browse in a
context-directed manner [84]. In this system, when a user clicks on a link, the text of that
link is used to automatically detect an appropriate place in the resulting web page for the
browser to begin reading.
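A toy sketch of this sibling pairing, under the simplifying assumption of a fixed <h1>/<p> pattern, is shown below; Hearsay's actual schema discovery uses more general partial matching and word-level similarity rather than fixed tag patterns.

    // Regroup alternating <h1>/<p> siblings into (title, summary) pairs so
    // that a listener can select each pair as a single semantic group.
    function pairTitlesWithSummaries(parent) {
      var pairs = [];
      var children = parent.children;
      for (var i = 0; i + 1 < children.length; i++) {
        if (children[i].tagName === 'H1' && children[i + 1].tagName === 'P') {
          pairs.push({
            title: children[i].textContent,
            summary: children[i + 1].textContent
          });
          i++;  // skip the <p> just consumed
        }
      }
      return pairs;
    }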
Hearsay has the potential to greatly improve the efficiency of web browsing, but does
so at the potential cost of reduced transparency for users. The tree representation created
by Hearsay may place items in unexpected levels of the tree, which may render it confusing
to users. The system was shown to be very accurate in its ability to correctly create the
semantic tree, but these experiments were conducted in the “news” domain for which an
ontology was manually created and Hearsay was manually tuned. The authors later showed
the promise of bootstrapping ontologies for new domains [90], although it remains unclear
how well these techniques will perform in arbitrary domains. User studies of the system
were positive, but they too operated on this manually-tuned “news” domain and were
tested only in short lab studies by (mostly) sighted users. Evaluators hinted at the issue of
transparency by suggesting that they would like additional control and that they wanted
Hearsay to be more explicit about the types of elements in the semantic tree. CSurf suffers
similar transparency concerns because users are automatically redirected to content but
provided no information concerning where on the page they have been redirected or the
surrounding context.
Chapter 6 presents a system called TrailBlazer that helps users decide what to do next.
For its interface, we chose to offer a list of suggestions from which users could choose.
This interface is designed to keep users in control, but it remains unclear whether this is
sufficient given the limited context afforded by the audio interface. Balancing transparency
with automation is a concern with any intelligent user interface, and especially so for
blind web users, who may not have supplementary clues to help them determine
what is being done for them by the interface.
User Involvement
To browse the web as a blind web user, one must be both skilled and patient. The reality
is that a primary way for blind web users to influence the accessibility of content is to
request improvements from the original developer of the site. In some cases, these requests
have escalated into lawsuits, for example the National Federation of the Blind vs. Target
Corporation [93] and Maguire vs. SOCOG [31]). The ability of blind users to directly
influence the accessibility of the web has historically been severely limited; the motivation
of this dissertation is to bring control back to these users.
Blind users also affect the accessibility of the web through their choice of access technology. Iaccarino et al. let users choose among many options for transcoding using a single
tool [62]. This offered considerable personalization, but users were still restricted
to the options offered by the system. There was no easy way for users to contribute new
transcoding services. Accessmonkey (Chapter 4) and TrailBlazer (Chapter 6) help users
contribute improvements.
Recent projects have sought to involve blind users more directly in the process of improving web content. IBM’s Social Accessibility Project lets blind web users report problems
to sighted volunteers who can fix them [121] and share those improvements in a common
repository [67]. This project has two primary advantages: first, it lets blind people easily
evaluate and report access problems, and, second, any sighted person with the knowledge
and desire to fix problems can do so. Blind users do not directly improve content as part
of Social Accessibility but they are active participants in the process.
This dissertation presents several examples of involving blind users in the process of
improving web access. Chapter 4 discusses our Accessmonkey Project, which was at the
forefront of collaborative accessibility, and Chapter 6 presents TrailBlazer, which lets blind
users demonstrate trails through web pages that they can share with others.
[Figure 2.2 plots web access products along axes of software portability and user cost ($0.00 to $5,000.00): free self-voicing web browsers (FireVox, HearSay), traditional screen readers (JAWS, Window-Eyes, and others at roughly $1,000.00), Braille Sense PDAs at roughly $5,000.00, smartphones with Mobile Speak Pocket, System Access Mobile on USB, SA to Go on mobile devices, and WebAnywhere, which runs on any computer at no cost.]
Figure 2.2: Many products enable web access for blind individuals, but few have high availability and low cost (the upper-left portion of this diagram). Only systems that both voice web
information and provide an interface for browsing it are included.
2.3 Improving the Availability of Accessible Interfaces
Many existing solutions convert web content to voice to enable access by blind individuals.
Three important dimensions on which to compare them are functionality, portability and
cost. WebAnywhere provides full web-browsing functionality from any computer, free of
charge.
Screen Readers
Screen readers, such as JAWS [82] or Window-Eyes [146], are special-purpose software
programs that cost more than $1,000 US. The Linux Screen Reader [78], the Orca screen
reader [97] for the GNOME platform and the NVDA screen reader for Windows [96] are free
alternatives to the commercial products. Screen readers are seldom installed on computers
not normally used by blind individuals because of their expense and because their owners
are unaware of free alternatives or that blind users might want to use them. Fire Vox is
a free extension to the Firefox web browser that provides screen reading functionality [30],
the HearSay web browser is a standalone self-voicing web browser [106], and aiBrowser
is a self-voicing web browser that targets making multimedia web content accessible [87].
Although free, these alternatives are similarly unlikely to be installed on most systems.
Users are rarely allowed to install new software on public terminals, and installing software
takes time and may be difficult for blind users without a screen reader already running. Many
users would also be hesitant to install new software on a friend’s laptop. WebAnywhere is
designed to replicate the web functionality of screen readers in a way that can be easily accessed from any computer, requires minimal permissions on the client computer, and starts
up quickly without requiring a large download before the system can be used.
Recent versions of Macintosh OS X include a screen reader called VoiceOver which voices
the contents of a web page and provides support for navigating through web content [132].
Most computers do not run OS X, and, on public terminals, access to features not explicitly
allowed by an administrator may be restricted. Windows XP and Vista include a limited-functionality screen reader called Narrator, which is described as a “text-to-speech utility”
and does not support the interaction required for web browsing [147].
Mobile Alternatives
Mobile access alternatives can be quite costly. PDA solutions can access the web in remote
locations that offer wireless Internet and usually include an integrated Braille display that
can be used with any computer. The GW Micro Braille Sense [25], for example, costs roughly
$5,000 US. A Pocket PC device and the screen reading
software designed for it, Mobile Speak Pocket [88], together cost about $1,000 US. Many
cannot afford or would prefer not to carry such expensive devices.
The Serotek System Access Mobile (SAM) [119] is a screen reader designed to run on
Windows computers from a USB key without prior installation. It is available for $500
US and requires access to a USB port and permission to run arbitrary executables. The
Serotek System Access To Go (SA to Go) screen reader can be downloaded from the web
via a speech-enabled web page, but the program requires Windows, Internet Explorer, and
permission to run executables on the computer. The AIR Foundation [2] has recently made
this product free. A self-voicing Flash interface for downloading and running the screen
reader is provided.
Alternatives with Limited Functionality
Some web pages voice their own content, but are limited either by the scope of information
that can be voiced or by the lack of an accessible interface for reaching the speech. Talklets
enable web developers to provide a spoken version of their web pages by including code that
interacts with the Talklet server to play a spoken version of each page [123]. This speech plays
as a single file and neither its playback nor the interface of the page can be manipulated by
users. Scribd.com converts documents to speech [117], but the speech is available only as a
single MP3 file that does not support interactive navigation. The interface for converting
a document is also not voiced. The National Association for the Blind, India, provides
access to a portion of their web content via a self-voicing Flash Movie [92]. The information
contained in the separate Flash movie is not nearly as comprehensive as the entire web page,
which could be read by WebAnywhere.
Web information can also be retrieved using a phone. The free GOOG-411 service
enables users to access business information from Google Maps using a voice interface
over the phone [48]. For a fee, email2phone provides voice access to email over the phone
[40].
Availability Summary
Products that provide full web-browsing functionality are shown in Figure 2.2. The portability axis is approximate. Solutions that can be run on any computer can also be run on
wireless devices and are therefore rated more highly. Mobile phones are more portable than
other solutions but only when cellphone service is available. WebAnywhere will be able to
run on many mobile devices, regardless of the underlying platform, as they are increasingly
supporting more complex web browsers that can play sound. The WebAnywhere web application is more highly portable than the Serotek System Access Mobile, which can run
only on Windows computers on which users have permission to run arbitrary executables.
Braille Sense PDAs use a proprietary operating system and some versions cannot connect
to wireless networks using WPA security. WebAnywhere is designed to be free and highly
available, but other solutions may be more appropriate or provide more functionality for
users with different requirements using different devices.
Chapter 3
WEBINSITU: UNDERSTANDING ACCESSIBILITY PROBLEMS
The extent of accessibility problems and their practical effects on the browsing experience
of blind web users is not yet adequately understood. For web access guidelines, standards,
and future technology to be truly relevant and useful, more information about the real-life
web interactions of blind web users is needed.
In this chapter, we present infrastructure that enables remote web studies of blind participants. Unlike prior systems, the WebinSitu infrastructure presented here can capture
and record the actions that users perform on the web: for example, the buttons pressed,
the text entered into form fields, and the links clicked.
The focus of this chapter is on a study using this infrastructure that illustrates the
differences in browsing behavior of blind and sighted web users. The problems identified in
this study help to motivate many of the solutions presented later in this dissertation, and
the infrastructure is used later to help evaluate these solutions.
3.1 Motivation
We used an advanced web proxy to enable our study and quantitatively measured both
the presence and observed effectiveness of components thought to impact web accessibility.
Most proxy systems can only record HTTP requests and cannot easily discern user actions
performed on web pages [28, 58]. WebinSitu is an enhanced version of UsaProxy [9], designed
to be used for long periods of time and record information especially important for access.
UsaProxy was used as a base for WebinSitu because it can record actions that are
impossible to record with a traditional proxy, such as key presses, clicks on arbitrary page
elements (including within-page anchor links), and the use of the “back” button to return
to a page that was previously viewed. Recording user actions has traditionally required
study participants to install specialized browser plugins [45, 32], but UsaProxy is able to
record most user actions by using Javascript code that is injected into pages that are viewed.
Because it uses Javascript to parse the viewed web pages, WebinSitu can also record the use
of technology at the center of increasingly important accessibility concerns, such as dynamic
page changes, interaction with dynamic content and AJAX requests. A proxy approach
enables transparent setup by participants and allows them to use their own equipment with
its existing configuration.
Prior work has sought a better understanding of the web user experience for general
users [58, 71]. The importance of measuring accessibility in situ from the user perspective is
illustrated by the relative popularity of web sites visited by web users in our study, as shown
in Figure 3.1. The distribution is Zipf-like [26], which results in three sites (google.com,
myspace.com and msn.com) accounting for approximately 20% of the pages viewed by the
participants in our study. The google.com domain alone accounted for almost twice as many
page views as the 630 domains that were viewed five or fewer times during our study. The
accessibility of popular sites affects users more strongly than does the accessibility of sites
on the long tail of popularity. While our study is not a replacement for laboratory studies that use common
tasks, it offers an important view of accessibility that better matches the experiences of real
users.
Blind web users have proven adept at overcoming accessibility problems, and one of
the goals of this study was to gain a clearer picture of how blind users might be changing
their browsing behavior to avoid problems. For instance, the lack of alternative text is
an often-cited accessibility concern, but blind users can often obtain the same information
contained within an image from surrounding context. Within-page anchors called “skip
links” are designed to help blind users effectively navigate complex web pages by enabling
them to jump to relevant content, but these links may be used infrequently because other
screen reader functionality also enables users to move quickly through a page. If the context
surrounding links on a page is not clearly expressed to blind users, they may explore the page
by clicking on links simply to see where they point and then return.
WebinSitu explores whether blind web users tend not to visit inaccessible content and
considers strategies that they may be using to cope with the problems they might experience.
Quantitative differences that are observed may suggest motivation or browsing strategies,
Figure 3.1: Log frequency of visits per domain name recorded for all participants ordered
by popularity.
but causal relationships are difficult to determine clearly from interaction histories alone.
To address this problem, we supplement our observations with qualitative feedback to get
a better sense of why we might have observed these differences.
The direct effects of technology and developer practices for improving accessibility are
also difficult to measure in practice because users employ many different browsing and coping
strategies that may vary based on the user’s familiarity with the page being accessed. Related
work has looked at task-based analysis of accessibility [85, 138, 34, 101], with a major focus
on supporting effective accessibility evaluation (see Ivory for a survey of this work [65]).
Realistic studies with blind web users are difficult to conduct in the lab due to difficulties
in replicating the diversity of assistive technology and configurations normally used by
participants. Previous work has advocated remote studies because they allow participants
to use their existing assistive technology and software [85, 101, 47]. These studies noted
that blind participants often provide less feedback when a page is considerably inaccessible,
indicating that simply asking blind users to list the problems they face may not be sufficient.
Overall, we found that blind web users browse the web quite similarly to sighted users
and that most pages visited during our study were inaccessible to some degree. In our study
these problems are placed in the context of their predicted effects because we implicitly
weighted pages relative to their popularity. Perhaps most surprising, blind participants
generally did not shy away from pages exhibiting accessibility problems anymore than did
sighted users. Blind participants were, however, much less likely to visit pages containing
content not well addressed by assistive technology. Blind users tended not to visit sites
heavily dependent on AJAX, but visited many pages that included Flash content. Blind
users also interacted less with both dynamic content and inaccessible web images. Skip
links, added to web pages to assist screen reader users, were only used occasionally by our
participants. Although from interaction histories alone we cannot determine with certainty
the causal relationship of such differences in browsing behavior, these observations combined
with the known technical capabilities of assistive technology present a strong case for access
problems causing the differences.
The contributions of this chapter are as follows:
• We present the design of the WebinSitu infrastructure which enables remote user
studies with disabled participants.
• We demonstrate the effectiveness of proxy-based recording for exploring the interaction
of blind web users.
• We compare the browsing experience of sighted and blind web users on several quantitative dimensions, and report on access as experienced by blind web users.
• We formulate practical user observations that can influence the direction of future
web accessibility research.
3.2 Recording Data
To enable our study, we developed a tracking proxy called WebinSitu to record statistics
about the web browsing of its users. WebinSitu is an extended implementation of UsaProxy
(Figure 3.2), which allows both HTTP request data and user-level events to be recorded [9].
This method of data collection allows participants to be located remotely and use their own
equipment. This is important for our study because of the diversity of assistive technology
and configurations employed by blind users. Our proxy-based approach requires minimal
Figure 3.2: Diagram of the system used to record users’ browsing behavior.
configuration by the user and does not require the installation of new software. To connect
to the system, participants configured their browsers to communicate with the tracking proxy
and entered their login and password. Names and passwords were not connected with
individuals, but a record was kept indicating whether the participant primarily uses a screen
reader or a visual browser to browse the web.
A browsing session begins with the participant initiating an HTTP request, which is
first sent to the proxy and then passed directly to the web server. The web server sends
a response back to the proxy, which logs statistics about the response header and web
page contents. The proxy also injects Javascript into HTML responses to record user-generated events and sends this modified response back to the user. After the response is
received by the user and is loaded in their browser, the Javascript inserted into the page
can record events such as key presses, mouse events, and focus events, and send data about
each event, including the DOM elements associated with each event, back to the proxy for
logging. For example, if a user clicks on a linked image, the click event and its associated
image (dimension, source, alternative text, etc.), the link address and position in the DOM
are sent to the proxy and recorded. The proxy also records whether content with which
participants interact is dynamic (i.e., created via Javascript after the page was loaded) and
whether the pages viewed issue AJAX requests.
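As an illustration of the event records just described, the following sketch (with assumed field names and an assumed endpoint, not WebinSitu's exact format) shows how a click on a linked image might be serialized before being sent to the proxy for logging.

    // Serialize a click on a linked image and send it to the tracking proxy.
    document.addEventListener('click', function (e) {
      var target = e.target;
      if (target.tagName !== 'IMG') { return; }
      var link = target.closest('a');  // enclosing link, if any
      var record = {
        event: 'click',
        time: Date.now(),
        src: target.src,
        width: target.width,
        height: target.height,
        alt: target.getAttribute('alt'),  // null if alternative text is missing
        href: link ? link.href : null,    // the link address
        domPath: pathTo(target)           // position in the DOM
      };
      var xhr = new XMLHttpRequest();
      xhr.open('POST', '/webinsitu/log', true);  // assumed logging endpoint
      xhr.send(JSON.stringify(record));
    }, true);

    // Build a simple root-to-element tag path such as "HTML/BODY/DIV/A/IMG".
    function pathTo(el) {
      var parts = [];
      for (; el && el.tagName; el = el.parentElement) { parts.unshift(el.tagName); }
      return parts.join('/');
    }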
All of the data pertaining to a participant’s browsing experience is stored in a remote
database. At any time during the study, participants could examine their generated web
traces, comment on the web pages viewed, enter general comments about their browsing
experience, or delete portions of their recorded browsing history (see Figure 3.6). Our
participants deleted only three browsing history entries.
3.3 Study Design
In this study, we considered two categories of data related to web browsing that yield insight
into accessibility problems faced by blind web users. Many definitions of blindness exist;
we use the term blind users for those who primarily use a screen reader to browse
the web and sighted users for those who use a visual display. First, we recorded statistics
relating to basic web accessibility of pages viewed in our study, such as alternative text for
images, heading tags for added structure, and label elements to associate form inputs with
their labels. Second, we considered the browsing behavior of both blind and sighted users,
including average time spent on pages and interaction with elements.
3.3.1 Accessibility of Content
Accessibility guidelines for web developers offer suggestions on how to create accessible web
content. Most notable is the WCAG [142], on which many other accessibility guidelines are
based. Web developers often do not include the advice presented in these guidelines in their
designs [34, 18]. Our study effectively weights pages based on the frequency with which
they are viewed, allowing us to measure the accessibility of web content as perceived by
web users. The individual metrics reported here suggest the accessibility of web pages that
users view, but cannot capture the true usability of these pages. Because inaccessible pages
can be inefficient or impractical to use, blind users may choose to visit sites that are more
accessible according to our metrics. In our analysis, we compared the browsing behavior
of blind and sighted users according to the metrics below.
Descriptive Anchor Text and Skip Links
Navigating from link to link is a common method of moving through web pages using a screen
reader. Many screen readers provide users with a list of links accessed via a shortcut key.
However, links can be difficult to interpret when separated from the surrounding context.
For instance, the destination of a link labeled “Click Here” is impossible to determine
without accompanying context. Prior work has shown that descriptive link text helps users
efficiently navigate web pages [55] and related work has explored automatically supplying
richer descriptions for links [54]. In our study we collected all links on the pages viewed by
our participants as well as all links clicked on by our participants. We sampled 1000 links
from each set and manually labeled whether or not each was descriptive.
Skip links are within-page links that enable users to skip ahead in content. They normally appear near the beginning of the HTML source of a page and are meant for blind web
users. We identified skip links using two steps. First, we selected all within-page anchors
whose anchor text or alternative text (in the case of images used as skip links) contained
one of the following phrases (case insensitive): “skip,” “jump to,” “content,” “navigation,”
“menu.” These phrases may not appear in all skip links, but this works for our purposes of
efficiently selecting a set of such links. To ensure that the chosen links were skip links, we
manually verified each one chosen in the first step.
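The first, automated step of this selection corresponds to a filter like the following sketch; the second step remained a manual verification of each candidate.

    // Step one: select within-page anchors whose text (or, for image links,
    // alternative text) contains one of the skip-link phrases.
    var SKIP_PHRASES = ['skip', 'jump to', 'content', 'navigation', 'menu'];

    function candidateSkipLinks(doc) {
      var candidates = [];
      var anchors = doc.getElementsByTagName('a');
      for (var i = 0; i < anchors.length; i++) {
        var href = anchors[i].getAttribute('href') || '';
        if (href.charAt(0) !== '#') { continue; }  // within-page anchors only
        var img = anchors[i].getElementsByTagName('img')[0];
        var text = (img ? (img.getAttribute('alt') || '')
                        : anchors[i].textContent).toLowerCase();
        for (var j = 0; j < SKIP_PHRASES.length; j++) {
          if (text.indexOf(SKIP_PHRASES[j]) !== -1) {
            candidates.push(anchors[i]);
            break;
          }
        }
      }
      return candidates;  // step two: manual verification of each candidate
    }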
Structure, Semantics and Images
Browsing is made more efficient for blind web users when the structure of the page is
encoded in its content and when the semantics of elements are not dependent on visual
features. Heading tags (<h1>... <h6>) have been shown to provide useful structure that
can aid navigation efficiency [138]. The <label> tag allows web developers to semantically
associate input elements with the text that describes them. Often this association is expressed only visually, which can make filling out forms difficult for blind web users. These are
some of the easiest methods for encoding structure and semantics into HTML pages. Their
use may foreshadow the likelihood that web developers will use more complex methods for
assigning structure and semantics to web content, such as the Roadmap for Accessible Rich
Internet Applications (WAI-ARIA) [110].
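An illustrative version of the label metric (not the study's exact procedure) is sketched below; here an input counts as labeled if a <label> references it by id or wraps it directly.

    // Find form inputs whose description is not programmatically associated
    // via a <label> element (and so is likely expressed only visually).
    // Simplified: ignores input types, such as hidden, that need no label.
    function inputsLackingLabels(doc) {
      var unlabeled = [];
      var inputs = doc.querySelectorAll('input, select, textarea');
      for (var i = 0; i < inputs.length; i++) {
        var id = inputs[i].id;
        var referenced = id && doc.querySelector('label[for="' + id + '"]');
        var wrapped = inputs[i].closest('label');
        if (!referenced && !wrapped) { unlabeled.push(inputs[i]); }
      }
      return unlabeled;
    }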
The accessibility of web images has often been used as an easily measured indicator of overall
web accessibility [18]. In this study, we analyzed the appropriateness of alternative text on
images viewed by participants. We sampled both 1000 of the images contained on the pages
viewed by our participants and 1000 images that were clicked on by our participants. We
manually judged the appropriateness of the alternative text provided for these images.
Dynamic Content, AJAX and Flash
The web has evolved from static web pages into a more dynamic medium. This
trend, popularly known as Web 2.0, uses Dynamic HTML (DHTML) and Javascript to
arbitrarily modify web pages on the client-side after they have been loaded. All users
may benefit from this technology, but it raises important accessibility concerns for blind
users. Changes or updates to content that occur dynamically have long been recognized by
standards such as the WCAG [142] as potential problems for screen reader users because
dynamic changes often occur away from a user’s focus. In our study, we recorded dynamic
changes in viewed pages. A dynamic change is defined as any change to the DOM after the
page has loaded. We took special interest when users directly interacted with dynamically
changed content. Our system cannot detect when users read content that is dynamically
introduced, but can detect when users perform an action that uses such an element. We
also recorded how many of the pages viewed by our participants contained programmatic
or AJAX web requests. While not necessarily an accessibility concern in themselves, these
requests are indicative of the complex applications that often do raise accessibility concerns.
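A sketch of such detection using the modern MutationObserver API is shown below; a system of this era would have relied on DOM mutation events instead, but the logic is the same: mark elements changed after load, then flag interaction with marked elements.

    // Track elements introduced or altered after the page has loaded, and
    // flag when the user interacts with one of them.
    window.addEventListener('load', function () {
      var dynamicElements = new Set();
      var observer = new MutationObserver(function (mutations) {
        mutations.forEach(function (m) {
          for (var i = 0; i < m.addedNodes.length; i++) {
            if (m.addedNodes[i].nodeType === 1) { dynamicElements.add(m.addedNodes[i]); }
          }
          if (m.type === 'attributes') { dynamicElements.add(m.target); }
        });
      });
      observer.observe(document.body, { childList: true, attributes: true, subtree: true });

      document.addEventListener('click', function (e) {
        if (dynamicElements.has(e.target)) {
          // In the study, such interactions would be logged to the proxy.
          console.log('interaction with dynamic content:', e.target.tagName);
        }
      }, true);
    });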
A growing number of web pages include Flash content. Recent improvements to this
technology have enabled web developers to make much of this content accessible, but doing
so requires them to consciously decide to implement accessibility features. Conveying this
accessibility information to users requires that they browse with up-to-date versions of their
web browsers, screen readers and Adobe Flash. We report on the percentage of web pages
visited by blind and sighted web users that contain Flash content.
3.3.2 Browsing Behavior
Blind web users browse the web differently from their sighted counterparts in terms of the
tools that they use and the way information is conveyed to them. We explored how these
different access methods manifest in quantifiable differences according to several metrics.
In particular, because blind web users have proven quite adept at overcoming accessibility
problems, it is interesting to explore the practical effects of accessibility problems. For
instance, an image that lacks alternative text does not conform to accessibility guidelines,
but may still be accessible if it points to a web page with a meaningful filename. Similarly,
skip links seem as though they would be of assistance to users, but users may choose not
to follow them either because they are most often interested in content that would be
skipped or because they prefer potentially longer reading times to potentially missing out
on valuable information. Our study seeks to measure such factors. Beyond the simple
presence of accessible and inaccessible components in web pages, we also wanted to collect
information that helps suggest the effects of the accessibility of web page components.
Probing
A probing event occurs when a user leaves and then quickly returns to a page. Web users
often exhibit probing behavior as a method of exploration when they are unsure which link
to choose [55]. Probing is also often used as a metric of the quality of results returned when
analyzing search engines [145]. If a returned link is probed, then the user likely did not find
the contents relevant. Because exploring the context surrounding links is less efficient for
screen reader users, they may choose to directly follow links to determine explicitly where
they lead. If screen reader users probe more than their sighted counterparts, then this would
motivate the further development of techniques for associating contextual clues with links.
In our study, we investigated the use of probing by our blind and sighted participants.
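Over a recorded browsing log, probes can be counted with logic like the following sketch; the 30-second return threshold is an assumption for illustration, not the study's exact definition.

    // Count probes in a chronological log of { url, time } page loads: the
    // user leaves page a for page b, then quickly returns to a, so b was probed.
    function countProbes(visits, thresholdMs) {
      var probes = {};
      for (var i = 0; i + 2 < visits.length; i++) {
        var a = visits[i], b = visits[i + 1], c = visits[i + 2];
        if (a.url === c.url && a.url !== b.url &&
            c.time - b.time < (thresholdMs || 30000)) {
          probes[b.url] = (probes[b.url] || 0) + 1;
        }
      }
      return probes;  // probe counts keyed by probed page
    }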
Timing
Underlying work in improving web accessibility is the goal of increasing efficiency for blind
web users. In our study, we attempted to quantify the differences in time spent web browsing
by blind and sighted web users. We first looked at average time per page to see if there is a
measurable effect of blindness on per-page browsing time. We then looked at specific tasks
that were common across our users that we identified from our collected data. The first was
entering a query on the Google search engine, looking through the returned results and then
clicking on a result page. The second was using our web history page to find a particular
page they themselves had visited during the web study, finding it on the results page and
then entering feedback for the page. Even though both groups of users could accomplish
these tasks (they were accessible to each group), this comparison provides a sense of the
relative efficiency of performing typical tasks.
3.4 Results
For our study, we recruited both blind and sighted web users. In the end, we had 10 blind
participants (5 female) ranging in age from 18 to 63 years old and 10 sighted participants
ranging in age from 19 to 61 (3 female). We began our recruiting of blind users by first
contacting people who had previously expressed interest in volunteering for one of our user
studies and then by advertising on an email list for blind web users. Our sighted participants
were contacted from a large list of potential volunteers and chosen to be roughly similar to
our blind participants according to age and occupation area. Participants were given $30 in
exchange for completing the week-long study. Both our blind and sighted participants were
diverse in their occupations, although fields related to engineering and science accounted
for slightly more than half of participants in both groups. We placed no restriction on
participation, but all of our participants resided in either the United States or Canada,
with geographical diversity within this region.
Participants were sent instructions outlining how to configure their computers to access
the web through our proxy. Only one participant had difficulty with this setup procedure
and the issue was quickly resolved by speaking to the researchers on the phone. Each
participant was told to browse the web as they normally would for 7 days. During this time,
our participants visited 21,244 total pages (7,161 by blind participants), which represented
approximately 325 combined hours of browsing (141 by blind participants). “Browsing”
time here refers to total time spent on our system with no more than 1 hour of inactivity.
The pages they viewed contained 337,036 images (109,264 by blind participants) and 926,901
links (285,207 by blind participants). Of our blind participants, 8 used the JAWS screen
reader, 2 used Window-Eyes; 9 used Internet Explorer, 1 used Mozilla Firefox. All of our
blind participants but one used the latest major version of their preferred screen reader.
Figure 3.3: For the web pages visited by each participant, percentage of: (1) images with alt
text, (2) pages that had one or more mouse movements, (3) pages with Flash, (4) pages with
AJAX, (5) pages containing dynamic content, (6) pages where the participant interacted
with dynamic content.
None reported using multiple screen readers, although we know of individuals who report
switching between JAWS and Window-Eyes depending on the application or web page. All
of our participants used Javascript-enabled web browsers, although we did not screen for
this.
Our data was collected “in the wild,” and, as is often required when working with real
data, it was necessary to remove outliers that might otherwise have inappropriately skewed
our results. For each metric in this section, we removed data that was more than 3 standard
deviations (SD) from the mean. This resulted in an average of 1.04% of our data being
eliminated for the applicable metrics. Our measures are averages over subjects.
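For concreteness, the filter applied to each metric amounts to the following sketch.

    // Remove values more than 3 standard deviations from the mean.
    function removeOutliers(values) {
      var n = values.length;
      var mean = values.reduce(function (s, v) { return s + v; }, 0) / n;
      var sd = Math.sqrt(values.reduce(function (s, v) {
        return s + (v - mean) * (v - mean);
      }, 0) / n);
      return values.filter(function (v) { return Math.abs(v - mean) <= 3 * sd; });
    }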
The remainder of this section explores the results of our study for the two broad categories initially outlined in our Study Design (Section 3.3). A summary of many of the
measurements reported in this section is presented in Figure 3.3 for both blind and sighted
participants.
3.4.1 Accessibility of Content
Descriptive Anchor Text and Skip Links
Overall, 93.71% (SD 0.07) of the anchors on pages visited by blind users contained descriptive anchor text, compared with 92.84% (0.06) of anchors on pages visited by sighted
users. The percentages of clicked anchors that were descriptive were slightly higher,
at 98.25% (0.03) and 95.99% (0.06), respectively, but this difference was not detectably
significant. These results suggest that web developers do a reasonably good job of providing descriptive anchor
text.
We identified 822 skip links viewed by our blind participants compared to 881 skip links
viewed by our sighted participants, which was not a detectably significant difference. Blind
participants clicked on 46 (5.60%) of the skip links presented to them, whereas sighted
users clicked on only 6 (0.68%). Often these links are made to be invisible in visual web
browsers. These results suggest that blind users may use other functionality provided by
their screen readers to skip forward in content in lieu of skip links. We were unable to
test this hypothesis due to difficulty in reliably determining when users used screen reader
functionality to skip forward in content.
Structure, Semantics and Images
Overall, 53.08% of the web pages viewed by our participants contained at least one heading
tag and there was no significant difference between pages visited by sighted and blind users.
We found that on pages that contained input elements that required labels, only 41.73%
contained at least one label element. Using manual evaluation, we found that 56.9% of
all images on the pages visited by our participants were properly assigned alternative text,
as were 55.3% of the images that participants clicked on. Blind participants were more likely
to click on images that contained alternative text: 72.17% (19.61) of images clicked on by
blind participants were assigned appropriate alternative text, compared to 34.03% (29.74)
of the images clicked on by sighted participants, which represents a statistically significant
effect of blindness on this measure (F1,19 = 11.46, p < .01).
Dynamic Content, AJAX and Flash
Many of the pages viewed by our participants contained dynamic content, AJAX and Flash
content. Pages visited by sighted participants underwent an average of 21.65 (35.38) dynamic changes to their content as compared to an average of only 1.44 (1.81) changes
per page visited by blind participants. This difference was marginally significant (F1,19 =
3.59, p = .07). Blind users interacted with an average of only 0.04 (0.08) dynamically introduced
or dynamically altered page elements per page, while sighted users interacted with 0.77
(0.89) such elements per page. There was a significant effect of blindness on this measure
(F1,19 = 7.49, p < 0.01). Our blind participants may not have been aware that the content had
been introduced or changed, or may have been unable to interact with it. Pages visited by blind and
sighted users issued an average of 0.02 (0.02) and 0.15 (0.20) AJAX requests, respectively.
This result is statistically significant (F1,19 = 4.59, p < 0.05) and suggests that blind users
tend not to visit web pages that contain AJAX content.
Of the dynamic content categories, Flash was the only one for which we were unable to
detect a significant difference in the likelihood of blind versus sighted participants visiting
those types of pages. On average 17.03% (SD 0.24) and 16.00% (11.38) of the web pages
viewed by blind and sighted participants, respectively, contained some Flash content. There
was not a detectably significant difference on this measure (F1,19 = 0.90, n.s.). We also calculated these four metrics for domains visited (groups of web pages) and reached analogous
conclusions.
3.4.2 Browsing Behavior
Blind users used the mouse (or simulated it using the keyboard) a surprising amount. On
average, blind participants used or simulated the mouse on 25.85% (SD 22.01) of the pages
that they viewed and sighted participants used the mouse on 35.07% (12.56) of the pages
that they viewed. This difference was not detectably significant (F1,19 = 1.35, n.s.). Blind
and sighted participants, however, on average performed 0.43 (0.33) and 8.92 (4.21) discrete
mouse movements per page. This difference was statistically significant (F1,19 = 44.57, p <
.0001).
Figure 3.4: Number of probes for each page that had at least one probe. Blind participants
performed more probes from more pages.
Our users arrived at 24.21% of the pages that they viewed by following a link. The
HearSay browser leverages the context surrounding followed links to begin reading
at relevant content on the resulting page [84]; this approach could likely apply in these cases.
Probing
Our blind participants exhibited more probing than their sighted counterparts as shown in
Figure 3.4. On average, blind participants executed 0.34 (SD 0.18) probes per page while
sighted participants executed only 0.12 (0.12), a significant difference (F1,19 = 10.40, p < 0.01)
that may be indicative of the greater difficulty blind web users face due to limited context
(see Figure 3.4 for participant probing behavior on individual pages).
Timing
In examining the time spent per task, we found that our data was skewed toward shorter
time periods, which is typical when time is used as a performance measure. Because this
data does not meet the normality assumption of ANOVA, we applied the commonly used log
transformation to all time data [120]. Although this complicates the interpretation of results,
it was necessary in order to perform parametric statistical analysis [43]. All statistical significance
reported here is in reference to the transformed data; however, results are reported in the
original, untransformed scale.
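In symbols, this convention amounts to the following (the base of the logarithm is not specified in the text):

    % each recorded duration t (in minutes) enters the ANOVA as its logarithm
    t' = \log(t)
    % F and p statistics are computed on the transformed values t', while the
    % means and standard deviations reported below are computed on the
    % original, untransformed t values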
Range        Blind            Sighted          Sig. F1,19
0 - 1.0      0.38 (0.26)      0.23 (0.23)      32.55, p < .0001
0 - 2.5      0.76 (0.65)      0.38 (0.49)      31.83, p < .0001
0 - 5.0      1.04 (1.05)      0.51 (0.80)      10.69, p < .01
0 - 10.0     1.25 (1.54)      0.77 (1.52)      6.90, p < .05
0 - 20.0     1.50 (2.35)      1.11 (2.51)      5.01, p < .05
0 - ∞        5.08 (16.68)     11.30 (74.36)    0.01, p = .91

Table 3.1: Average time (minutes) and standard deviation per page for increasing time
ranges.
We found that blind participants spent more time on average on each page visited than
sighted participants. For a summary of the results, see Figure 3.5. These results seemed
particularly strong for short tasks, where sighted users were able to complete the tasks much
faster than blind users. Blindness had a significant effect on the log of time spent on each
page for all but the longest time period. Table 3.1 shows that the average times spent by
blind and sighted participants approach one another as task length increases.
We also identified four tasks conducted by both blind and sighted participants, which
enabled us to compare the time required for users to complete these tasks.
Google This task consisted of two subtasks: 1) querying from the Google homepage, and
2) choosing a result. On the first subtask, blind and sighted users spent a mean of 74.66
(SD 31.57) and 34.54 (105.5) seconds, respectively. Blindness had a significant effect
on the log of time spent issuing queries (F1,17 = 7.47, p < .01). On the second
subtask, the time from page load to clicking on a search result for blind and sighted
users was 155.06 (46.14) and 34.81 (222.24) seconds, respectively. This represents a significant effect of
blindness on the log of time spent searching Google’s results (F1,19 = 28.3, p < .0001).
Providing Feedback Another common task performed by most of our participants
was to provide qualitative comments on some of the web pages that they visited as part
of the study (Figure 3.6). This task also consisted of two subtasks: 1) querying for web
pages from the user's web history, and 2) commenting on one of the pages returned. On average, blind and sighted users took 30.36 and 18.41 seconds (SD 20.59, 19.84), respectively, to complete the first subtask. This represents a marginally significant effect of blindness on the log of time spent querying personal web history (F1,14 = 4.2529, p = .06). On average, blind and sighted participants spent 104.60 (30.98) and 68.74 (78.74) seconds, respectively, to leave a comment. This represented a significant effect of blindness on the log of time spent commenting on personal web history (F1,11 = 5.23, p < .05).

Figure 3.5: For each participant, average time spent on: (1) all pages visited, (2) the WebinSitu search page, (3) the WebinSitu results page, (4) the Google home page, and (5) Google results pages.
3.5 Discussion
Our study provided an interesting look into the web accessibility experienced by web users.
Overall, the presence of traditional accessibility problems measured in our study did not seem to deter blind web users from visiting most pages, but problems with the dynamic content characteristic of Web 2.0 did. Our blind participants were less likely than sighted participants to visit pages that either contained dynamic content or issued AJAX requests. Much of this content is known to be largely inaccessible to blind web users.
Figure 3.6: Web history search interface and results.

Our blind participants did not detectably avoid Flash content. Upon manual review of 2000 examples of Flash content, we found that 44.1% of the Flash objects shown to our participants were advertisements. The inaccessibility of these Flash objects is unlikely to
deter blind users from visiting the pages which contain them. Only 5.6% of the Flash objects
viewed by our participants (both blind and sighted) presented the main content of the page.
The remainder of the Flash objects contained content relevant to the main content of the page but supplementary to it. Blind users may miss out on some information contained in such
Flash objects, but might still find value in other information on the page that is accessible.
Flash was also often used to play sound, which does not require a visual interface. Finally,
recent strides in Flash accessibility are making it easier to design accessible Flash objects
that can be used by blind users.
We also observed that blind web users were less likely to interact with content that is
inaccessible. Participants were less likely to interact with content that was dynamically
introduced. We also found that blind users are more likely to click on images assigned
appropriate alternative text. This should be a warning to web developers that not only are
their pages more difficult to navigate by blind users when they fail to assign appropriate
alternative text, but they may be driving away potential visitors.
Our blind participants appeared to employ numerous coping strategies. For example,
blind participants used the mouse cursor when page elements were otherwise inaccessible.
One participant explained that he is often required to search for items that cannot be reached using keyboard commands. Blind participants also exhibited more probing than their
sighted counterparts, suggesting that web pages still have far to go to make their content
efficiently navigable using a screen reader. Technology that obviates the need for these
coping strategies would be quite useful.
Overall, our observations underscore the importance of enabling accessible dynamic content. While our blind participants may have employed (inefficient) coping strategies to access web content that might be considered inaccessible, they generally tended not to visit
pages that rely on dynamic content at all.
3.6 Related Work
Proxy-based recording of user actions on the web has been explored before. The Medusa
Proxy measures user-perceived web performance [71] and WebQuilt displays a visualization
of web experiences based on recorded HTTP requests [57]. Traditional proxy systems record only the information contained in HTTP requests, so others have created browser plugins that
can record richer information about user experiences [32]. UsaProxy, on which WebinSitu
is based, is not the only example of using Javascript to record web user actions. Google
Analytics allows web developers to include a Javascript file in their web pages that tracks
basic actions of visitors on their web pages [50]. WebAnywhere uses a web proxy to both
record what users are doing and speak the web content being read (Chapter 7).
The benefits and trade-offs involved in conducting remote studies with blind participants
have been explored previously (Section 2.1.1). WebinSitu enables remote deployment to
blind and sighted participants who are likely using a diversity of browsers and assistive
technology. Developing plugins for each desired browser and deploying them would be a
large undertaking. Our users initially expressed concern over installing new software onto
their machines and wanted to make sure they knew when it was and was not collecting
data. Specifying a proxy server is easy in popular web browsers (Internet Explorer, Firefox,
Safari, Opera, etc.) and allows users to maintain transparent control. WebinSitu is the first
large-scale, longitudinal study demonstrating the promise of this approach.
3.7 Summary
This chapter has presented a study in situ of blind and sighted web users performing real-life
web browsing tasks using their own equipment over the period of one week. Our analysis
indicates that blind web users employ coping strategies to overcome many accessibility
problems and are undeterred from visiting pages containing them, although they took more
time to access all pages than their sighted counterparts. Blind users tended not to visit
pages containing severe accessibility problems, such as those related to dynamic content. In
all cases our blind participants were less likely than our sighted participants to interact with
page elements that exhibited accessibility problems. Our user-centered approach afforded a
unique look at web accessibility and the problems that most need addressing, motivating work
in later chapters that seeks to address these problems.
Chapter 5 uses the WebinSitu infrastructure to explore a particular interaction, solving
audio CAPTCHAs, in more depth and to evaluate an interface that addresses observed
problems. Chapter 4 presents Accessmonkey, an intelligent tool that helps blind web users and others improve web access, specifically addressing problems identified by our WebinSitu study. Finally,
to address browsing efficiency, Chapter 6 presents a tool called TrailBlazer that helps suggest
paths through the web by predicting the actions that users will want to complete next. As
users choose which suggestions to take, TrailBlazer records what they choose, and lets other
users play them back to make blind web users more efficient. Although these tools do not fully solve the problems highlighted by our study, they represent the new direction forward advanced by this dissertation: one in which blind users have more control.
Chapter 4
COLLABORATIVE ACCESSIBILITY WITH ACCESSMONKEY
Standards outline what is required for content to be accessible [142], but relying on
developers has proven insufficient. As a clear example, nearly fifteen years after the introduction of the alt attribute to the image tag, less than 50% of informative web images are
assigned descriptive alternative text [18]. The need for alternative text is readily apparent,
but has not been pervasively applied. Our WebinSitu study (Chapter 3) demonstrated that
other accessibility problems may be even more prevalent and that these problems negatively
influence the behavior and effectiveness of blind web users.
A Firefox extension called Greasemonkey lets users customize web content by writing
Javascript scripts which change web pages after they are loaded [51]. This chapter explores
the potential of scripts to improve accessibility on-the-fly. We introduce a variation on
Greasemonkey called Accessmonkey [19]. Accessmonkey is a framework that targets reuse of
scripts to help developers improve access to content they control. Several implementations of Accessmonkey are offered to increase the availability of the improvements that Accessmonkey scripts provide, letting users take advantage of Accessmonkey without installing new software.
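To make this concrete, the following is a minimal sketch of a Greasemonkey-style user script of the kind Accessmonkey runs; the metadata values and the particular page change are illustrative rather than taken from an actual Accessmonkey script.

    // ==UserScript==
    // @name        Add Search Access Key
    // @description Gives the first text input on a page an access key
    // @include     http://*
    // ==/UserScript==

    // Find the first text input, which on many pages is a search box,
    // and assign it an access key so it can be focused directly from
    // the keyboard rather than found by a linear search of the page.
    var inputs = document.getElementsByTagName('input');
    for (var i = 0; i < inputs.length; i++) {
      if (inputs[i].type == 'text') {
        inputs[i].setAttribute('accesskey', 's');
        break;
      }
    }

Scripts in this style run after a page loads and manipulate its DOM directly, which is what allows them to repair content that was not authored with accessibility in mind.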
This chapter explores collaborative accessibility, the idea that anyone with the incentive,
desire, or need to create more accessible content should be able to do so. We present Accessmonkey, an implementation of this idea that enables people to fix accessibility problems
and share solutions both with one another and the developers of the content.
4.1 Motivation
Despite efforts to promote accessible web development, web developers often fail to implement even basic elements of accessibility standards. For many web developers, the cause
is a lack of appropriate experience, although even experienced web developers may require
more time to produce a visually appealing web page that is also accessible [100]. Available
tools can help spot deviation from the easily quantifiable portions of established standards,
but they fail to adequately guide developers through the process of creating accessible content. Perhaps it should be unsurprising that, when faced with a deadline to release new
web content or when faced with a daunting backlog of web pages to be updated, web developers often delay a full consideration of accessibility. Web developers need tools that can
efficiently guide them through the process of creating accessible web pages, and blind users
need tools that can help them overcome accessibility problems even when developers fail.
One approach is to automatically transcode documents in order to render them more
accessible and usable [18, 62, 60, 53]. Automatic transcoding is potentially powerful, but
implementations typically require web users to employ a specific browser or platform on their machines. The transformations made by these tools help the web users who know to use them,
but are not easily utilized by web developers, who might create more accessible content
given easier access to the same underlying technology. Most importantly, web users cannot
easily influence this technology or independently suggest new accessibility improvements.
Accessmonkey is a scripting framework that helps web users and web developers collaboratively improve the accessibility of the web. Accessmonkey helps web users automatically
and independently transcode web content according to their personal needs on a number
of browsers and platforms. The same Accessmonkey scripts can also be used by developers
so that they leverage the same scripts that transcode content for blind users to offer suggestions to them. Many existing tools and systems address accessibility problems, but they
often require a specific browser or require the user to install a separate tool.
As such, they can be difficult for users to independently improve and difficult for developers to integrate into existing tools. Accessmonkey provides a convenient mechanism for
sharing techniques developed and insights gained. The framework allows both web users
and web developers to collaboratively improve the accessibility of the web by leveraging the
work of those that have come before them. Users can improve the accessibility of the web
by writing a script and other users can immediately use and adapt the script to fit their own
needs. Web developers can use the script to improve the accessibility of their web pages
automatically, reducing the job of providing accessibility information to a more efficient
editing and approval task. To allow as many users as possible to utilize our framework,
we offer several implementations of it that work on multiple platforms and in multiple web
browsers.
The contributions of our work include the following:
1. We illustrate the advantages of using Javascript and dynamic content for accessibility
improvement. Both technologies have been thought to decrease access.
2. We introduce a framework for dual-display of the results of scripts that enables web
users and web developers to utilize the same underlying technology and avoid duplicating work.
3. We re-implement several previous systems designed to improve accessibility as Accessmonkey scripts. These systems can now be used on more web browsers and platforms
by both web users and developers.
4.2 Related Work
Automatically transcoding web content to better support the needs of web users has been
a popular topic in web accessibility research for almost a decade, especially in relation to
improving access for blind web users. Accessmonkey seeks to allow both web users and web
developers to collaboratively create scripts that direct the automatic transcoding of web
content in a way that helps both groups of users efficiently increase web accessibility.
4.2.1 Scripting Frameworks
Greasemonkey was created by Aaron Boodman and first released in 2005. The project was partially motivated by a desire to provide web users with the ability to automatically transcode web
pages into a form that is more accessible. Several examples of such scripts are offered in
the book Greasemonkey Hacks [104]. NextPlease! is an example of a Greasemonkey script
that has become quite popular among blind web users [137]. This script allows web users to
define key combinations that simulate clicks on links that contain user-defined text strings.
Currently, to implement similar functionality in their own web pages, web developers cannot directly leverage the code changes made by users of a script like NextPlease! and,
instead, must independently decide on these changes. Accessmonkey extends the original
idea behind Greasemonkey by providing a mechanism by which web users who write scripts
can also make their scripts useful to web developers. We also provide a web and proxy
implementation of Accessmonkey that opens the system to use by additional users.
Scripts designed to automatically improve accessibility are already available as Greasemonkey scripts. Popular existing user scripts include those for automatically detecting and
removing distracting ads and those that add new functionality to popular web sites like
google.com and hotmail.com. Others automatically add accessibility features to web pages.
Often these scripts present solutions that web developers could have included (support for
access keys, proper table annotation, etc.), while others address problems that apply to
particular subsets of a web page's visitors (high-contrast colors, larger fonts, etc.). Some
of the most popular scripts are those that add access keys to popular web sites and those
that adapt web pages to be easier to view by people with low vision. A large repository
of Greasemonkey scripts is available at userscripts.org, including 49 scripts targeted at accessibility. These scripts alter pages in ways that users have found helpful, such as adding
heading tags (<h2>) to the Google results page. Many scripts are available, which suggests
that a number of individuals are willing to write such scripts and that many web users find
them useful.
Another web scripting framework is Chickenfoot, which allows users to programmatically
manipulate web pages using a superset of Javascript that includes methods specific to web
browsing [24]. The interface of Chickenfoot is designed to make web page manipulation
easier, although it still requires some level of programming knowledge. Platypus, another
Firefox extension, seeks to remove this requirement as well by providing an interface that
allows users to manipulate the content of web pages by simply clicking on items [129].
Neither of these systems offers a mechanism that allows users to save altered web pages,
which Accessmonkey supports.
4.2.2 Accessibility Evaluation
The automatic evaluation of the accessibility of web pages has been a popular topic in
both research and industry, and has resulted in the availability of many evaluation tools
[140, 139, 1, 77]. Most of these tools have focused on assisting developers in meeting
quantifiable accessibility standards, such as the W3C Web Content Accessibility Guidelines
[142] or Section 508 of the U.S. Rehabilitation Act. The research community has sought
to extend the capabilities of evaluation tools to allow for the automatic detection of more
subtle usability and accessibility concerns [65], but tools that can do this well have yet to
be developed. Mankoff et al. noted that an effective method for identifying accessibility
problems in a web page is to have it reviewed by multiple web developers using a screen
reader, but that blind web users could effectively detect subtle usability problems [85].
Accessmonkey allows both groups to collaboratively assist in the evaluation and remediation
of web content, but neither group must rely on members of the other before accessibility
improvements can be implemented.
4.2.3 Automatic Accessibility Improvement
Previous work has explored automatically improving the accessibility of web pages
[60, 62, 53, 18]. To take advantage of these systems, content has generally needed to be
processed by the web developer using a specialized tool, displayed using a specialized web
browser [106], or transcoded for users on-the-fly.
Harper et al. suggested the following three alternative approaches for transcoding documents to make them more accessible for blind web users [53]: (i) in a browser plugin or
extension, (ii) in a transcoding proxy, and (iii) in Javascript included within web pages. Accessmonkey uses a hybrid approach in which scripts are injected using a proxy implemented
as either a browser extension, traditional proxy, or as a web-based proxy. This flexibility
lets Accessmonkey transcode pages and allows scripts written for it to be used on many
different platforms and in many different browsers. Users can personalize what selection of
available transcoding services they would like to apply to web pages that they view and can
also write or modify their own scripts.
Implementing transcoders as scripts also has the potential to make it easier to extend the techniques that they encompass to web development tools. While several systems have
suggested that techniques used to automatically transcode documents could also be used
to help web developers more easily create accessible content [77, 136], this process has often been difficult to directly integrate into existing developer tools. Accessmonkey allows
scripts to be written once and included in a variety of tools used by both web users and web
developers. To our knowledge this is the first example of a system that unifies the automatic
accessibility improvement targeted at web users and web developers in an extensible way.
Despite the similarity in the underlying technology, little work has been devoted to
assisting web developers in automatically improving the content of their web pages through
specific suggestions. Many tools used for evaluation display general advice about how
to rectify problems that are discovered [42, 140, 1], but a web developer must still be
skilled in accessibility to correctly act on this advice. The guidance provided is usually
general and is often drawn from the standard against which the tool is evaluating the
web page. A-Prompt, for example, guides web developers through the process of fixing
accessibility problems. Related systems have been designed to assist users in annotating
web content for the semantic web [11] and Plessers et al. showed how such annotations could
be automatically generated as a direct result of the web design process [105]. Accessmonkey
scripts can utilize the same technology used to assist web users to help web developers.
4.2.4 Adding Alternative Text with WebInSight
WebInSight improves the accessibility of web images by automatically deriving alternative
text [18]. This system was shown to be capable of automatically providing labels for 43.2%
of web images that originally lacked alternative text with high accuracy by using the title
of the linked page for linked images and by applying optical character recognition (OCR)
to the images that contained text. WebInSight was originally developed for web users, but
developers could also use it to help them choose appropriate alternative text. Many available
accessibility tools inform web developers when images lack alternative text, but few suggest
alternative text or automatically judge the quality of the alternative text already provided.
WebInSight uses a supervised learning model built from contextual features in order
to identify alternative text that is likely to be incorrect [14]. The ability to automatically
judge the quality of alternative text could potentially improve the user experience by eliding
alternative text that is likely to be inappropriate. Using the WebInSight Accessmonkey
script, web developers are not only told when an image lacks alternative text, but also whether the alternative text already provided is likely to be correct.
4.3 Accessmonkey Framework
The framework exported by the Accessmonkey system allows users to edit web pages using
Javascript. The Greasemonkey Firefox extension [51] is one of the most successful examples
of an open scripting framework and exposes the framework from which Accessmonkey is
derived. The Greasemonkey extension allows users to inject their own scripts into arbitrary
web pages and these scripts can then alter web pages automatically. The main difference
between Accessmonkey and Greasemonkey is that Accessmonkey natively supports web
developers by providing a mechanism for web developers to edit, approve and save changes
that have been made to web pages by user scripts. Figure 4.1 shows the relation between
the components that use the Accessmonkey framework.
Accessmonkey is designed to support multiple implementations which may be placed
on a remote server, on the user’s computer or directly integrated into web tools. Accessmonkey scripts can be used in different browsers and on different platforms because of the
near ubiquity of Javascript. While Greasemonkey is only available on Mozilla web browsers,
other major web browsers, such as Internet Explorer, Safari and Opera, already afford similar capabilities and can often run Greasemonkey scripts unaltered. Incompatibility concerns remain because of differences between the implementations of the ECMAScript standard
(commonly known as Javascript) used by different browsers. The primary implementations
of ECMAScript are JScript as implemented by Internet Explorer and Javascript as implemented by other popular browsers, including Firefox and Safari. Despite these limitations,
web developers are accustomed to writing scripts that are compatible with the different
implementations of ECMAScript.
Figure 4.1: Accessmonkey allows web users, web developers and systems for automated
accessibility improvement to collaboratively improve web accessibility.
Many web browsers and screen readers do not work well with Javascript code, and some
ignore it altogether. These increasingly rare browsers are currently not supported. The
scripts presented in this chapter have been tested with Window-Eyes 6.0 [146].
Accessmonkey gives users the option of running solely on the client side. A disadvantage
of our approach is that Javascript limits the space of possible transcoding operations that
can be performed, but, as shown in Section 4.5, many of the transcoding operations that
have been previously suggested can be achieved using Javascript only. Furthermore, as we
discuss in Section 4.6, future versions of Accessmonkey may allow Java code to be bundled
with Accessmonkey scripts in order to enhance their functionality.
4.3.1 Writing Scripts
Accessmonkey scripts share structure with Greasemonkey scripts but rely on additional
functionality provided by Accessmonkey implementations. Greasemonkey scripts can be
run unaltered in Accessmonkey implementations. Accessmonkey scripts are expected to
provide a mechanism for users to view, edit and approve changes that are automatically
made by the script when appropriate and rely on functionality exposed by the Accessmonkey
implementation to facilitate this. Accessmonkey differentiates two modes of operation: a
user mode in which changes are automatically made to a page and a developer mode in
which users are provided an interface that allows them to edit, approve and save automatic
changes.
A script can query the implementation in which it is running to determine the mode
that is currently activated and to obtain a pre-defined area in which the script can place
its developer interface. The implementation provides functionality that coordinates which
script’s developer interface should be displayed and allows changes that have been made to
the web page to be saved.
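As a sketch of how a script might use this two-mode pattern, consider a script that suggests alternative text for an image. The method names AM_inDeveloperMode and AM_getDeveloperArea below are hypothetical stand-ins for the two methods an implementation exposes, and the sketch assumes the page contains at least one image.

    // AM_inDeveloperMode() and AM_getDeveloperArea() are placeholder
    // names for the two Accessmonkey-specific methods described above.
    var img = document.getElementsByTagName('img')[0];
    var suggestion = 'Company logo'; // a computed alternative text

    if (AM_inDeveloperMode()) {
      // Developer mode: present the suggestion for editing and approval.
      var area = AM_getDeveloperArea(); // div reserved for this script
      var box = document.createElement('input');
      box.value = suggestion;
      var apply = document.createElement('button');
      apply.appendChild(document.createTextNode('Apply'));
      apply.onclick = function () { img.alt = box.value; };
      area.appendChild(box);
      area.appendChild(apply);
    } else {
      // User mode: apply the suggested change automatically.
      img.alt = suggestion;
    }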
To write a script, a user must be able to write Javascript code, but any user can use an
existing script. Future versions of this tool will include a mechanism to help users locate
applicable scripts. We also plan to explore ways of enabling users who are not technically
savvy to create scripts (see Section 4.6).
4.3.2 Requirements
An Accessmonkey implementation requires only a few elements. First, the implementation
must have the capabilities of Greasemonkey. Specifically, it must be able to load a web
page, add custom Javascript to it and execute this Javascript. The implementation must
provide the standard Greasemonkey API [104] and two additional methods required for
Accessmonkey features. The first method returns a Boolean value indicating whether the
system is in developer mode or user mode, which allows user scripts to appropriately alter
their presentation and editing options. The second method returns a reference to a div
element that represents the script’s development area. The script may append elements to
this element to form its developer interface. This interface supports the user in viewing,
editing and approving changes automatically suggested by the script. Each implementation
must also provide a mechanism for saving changes that were made to the web page by the
user. Figure 4.2 shows an implementation of Accessmonkey running a script. The selection
boxes and buttons at the top of this screenshot form the developer interface. They let users
both switch tools and usage modes and save changes made to the web page by the script.
4.3.3 Developer Workflow
An important consideration for the usability of Accessmonkey is its potential to fit into
developer workflows. We hope to address one of the main shortcomings of accessibility
tools, which is their inability to integrate well into current developer workflows [65]. Designing a tool that easily integrates into the wide diversity of available developer products
is impractical, but the implementations provided allow our system to be immediately available. Accessmonkey integrates into the developer workflow by letting developers make and edit potential changes and then save them.
Developers of sites that are generated dynamically from underlying data sources and web page templates cannot directly save the changes Accessmonkey makes to rendered pages, but they may still benefit from its suggestions. Ideally, Accessmonkey would be implemented directly into the tools
already used by web developers. The simple and open scripting framework exposed by
Accessmonkey allows users to develop such implementations that more closely integrate
into these tools. Regardless, previous work has shown that an improved workflow that still
involves the use of several applications can nevertheless dramatically increase efficiency [73].
4.4 Implementations
People use a variety of web browsers on a number of platforms, and web developers likewise use a variety of development tools. Accessmonkey should be
easy to integrate into these tools. The decision to implement Accessmonkey as a scripting
framework using Javascript lets new implementations be easily developed because many
platforms already contain support for Javascript.
Creating implementations of Accessmonkey that integrate directly into all possible tools
used by users and developers is impractical. Instead, Accessmonkey provides a simple
framework which can be extended to other tools and platforms by users. We have created
the following three implementations of Accessmonkey covering a wide variety of use cases: a
Firefox extension, a stand-alone web page, and a web proxy. Web users and developers can
access the full range of Accessmonkey functionality by using any of these implementations.
Figure 4.2: A screenshot of the WebInSight Accessmonkey script in developer mode applied
to the homepage of the International World Wide Web Conference. This script helps web
developers discover images that are assigned inappropriate alternative text (such as the
highlighted image) and suggests appropriate alternatives. The developer can modify these
suggestions, as was done here, to produce the final alternative text.
4.4.1 Firefox Extension
The Firefox Extension implementation is a straightforward adaptation of the existing Greasemonkey extension, which was the motivation for Accessmonkey and already provides much
of the required functionality. To allow the extension to fully support Accessmonkey scripts,
we enhanced the extension by adding the Accessmonkey-specific methods described earlier
in Section 4.3.1 and added a toggle that allows users to switch between user and developer
mode. Finally, we added the ability to save changes that were made to the web page. A
screenshot of the resulting system is shown in Figure 4.2.
4.4.2 Web Proxy
The web proxy version of Accessmonkey is implemented as an Apache module. This module
inserts a script containing the Accessmonkey code into each web page visited. Disadvantages
of proxy-based approaches were discussed previously in Section 4.2.3, but for some users it
is the most viable option because it does not require Firefox. Currently, the administrator
of the proxy is responsible for adding new user scripts, although future versions may allow
users to upload scripts and have them immediately included in their suite of Accessmonkey
scripts. Eventually, we would also like to offer a proxy-based solution that processes web
pages on the fly according to user scripts as the user browses the web.
Some methods in the Greasemonkey API do not have direct analogues in standard Javascript. The Greasemonkey method for retrieving the content of an arbitrary URL is useful for allowing scripts to include information derived from web services or integrated from other web sites; for security reasons, the analogous Javascript functionality is restricted to retrieving content from the same domain where the script is located. To allow Accessmonkey scripts running in this implementation to incorporate data not available
on the original domain of the web page, this implementation allows scripts to request the
content of any URL from the proxy, which effectively anonymizes these requests. To avoid
abuse, the proxy implementation limits use of the system to registered users.
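A sketch of how a script might issue such a proxy-mediated request follows; the endpoint path /am-fetch is a placeholder for whatever address the proxy actually exposes.

    // Retrieve the contents of an arbitrary URL through the proxy, which
    // fetches the page server-side on the script's behalf. Because the
    // request goes to the same origin, the browser's same-domain
    // restriction does not apply.
    function proxyFetch(url, callback) {
      var xhr = new XMLHttpRequest();
      xhr.open('GET', '/am-fetch?url=' + encodeURIComponent(url), true);
      xhr.onreadystatechange = function () {
        if (xhr.readyState == 4 && xhr.status == 200) {
          callback(xhr.responseText);
        }
      };
      xhr.send(null);
    }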
4.4.3 Web Page
Several popular evaluation tools are web-based [140, 139, 141]. Visitors to these web sites
can enter the URL of a web page that they would like to evaluate and the tools will analyze
it. Such tools are convenient because they do not require users to install additional software
and can be accessed from anywhere. Because the evaluation is done remotely, these tools
require the web page to be publicly available and, therefore, may be inappropriate for
accessibility evaluation of private or pre-release web pages. To allow users of our system
additional flexibility, we have created a web-based version of Accessmonkey.
Our web implementation of Accessmonkey requires a browser that supports Javascript,
but requires the user to neither use a specific browser nor install an extension, which opens
Accessmonkey scripts to potential users that prefer Internet Explorer, Opera or another
web browser. This implementation allows a large portion of web users and developers to
use Accessmonkey scripts.
Our web page version of Accessmonkey is implemented using a variation on the module
for the Apache Web Server that we developed for our proxy implementation. When users
visit the Accessmonkey web page they are first asked for a URL. The system then fetches
that URL and alters the page returned in a way that allows it to be displayed at a local
address. All of the URLs in each web page are automatically translated into fully qualified
URLs that are directed through the proxy. This Accessmonkey implementation uses the
same techniques for producing the full Accessmonkey API that were required in the proxy
implementation discussed previously.
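The translation itself is performed server-side by the Apache module, but its effect can be sketched with the equivalent client-side rewrite below; /am-page is a placeholder for the local proxy address.

    // Rewrite links and images so that they point at fully qualified
    // URLs routed back through the proxy. Reading the DOM property
    // (rather than getAttribute) yields an already fully qualified URL.
    function routeThroughProxy(el, attr) {
      el[attr] = '/am-page?url=' + encodeURIComponent(el[attr]);
    }

    var anchors = document.getElementsByTagName('a');
    for (var i = 0; i < anchors.length; i++) {
      routeThroughProxy(anchors[i], 'href');
    }
    var images = document.getElementsByTagName('img');
    for (var j = 0; j < images.length; j++) {
      routeThroughProxy(images[j], 'src');
    }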
4.4.4 Future Implementations
Future implementations will allow more web users and developers to use Accessmonkey on
more platforms. Turnabout is a plug-in for Internet Explorer that is similar to Greasemonkey and allows user-defined scripts [128]. It could be modified to provide the added
functionality required of an Accessmonkey implementation. We would also like to add the
capability of running Accessmonkey scripts directly in web development tools. SeaMonkey
Composer and Adobe Dreamweaver are attractive options because they already support
Javascript, although we would like to eventually create Accessmonkey implementations for
other popular tools, such as Microsoft FrontPage.
4.5 Implemented Scripts
We have implemented several scripts for our system that demonstrate the usefulness of the
Accessmonkey architecture. Our current implementations are both strengthened and limited by their restriction to Javascript. Restricting our scripts to Javascript allows
them to be easily extended to many other platforms, but comes at the cost of accepting
the limitations of Javascript. For instance, our scripts cannot gain pixel-level access to
images. One method of circumventing this limitation is to utilize web services, as we did
for our WebInSight script so that it could access OCR functionality. In this section, we
further demonstrate the diversity of powerful transformations that can be accomplished
using Accessmonkey and how they can be leveraged by both web users and developers.
4.5.1 WebInSight Script
One motivation for Accessmonkey was to enable web developers to leverage the technology
that we developed for WebInSight (described in Section 4.2.4) to make the creation of
accessible web pages easier. Our belief is that web developers would be more likely to
create accessible content if they were given specific suggestions on what to do; our WebInSight
Accessmonkey script is one example.
A screenshot of the Accessmonkey system running the WebInSight script is shown in
Figure 4.2. The developer interface provides web developers with functionality to quickly
approve and edit the alternative text assigned to each image in the page. To assist in this
process, the system provides several automatically-computed suggestions for appropriate
alternative text that web developers can select with a single click after optionally editing
the suggestion.
The script automatically computes all suggestions, except for the OCR suggestion, which
is retrieved from a web service. Each suggestion is automatically evaluated by the system
and the best suggestion is always displayed in the lowest text box. The interface allows
developers to skip images that are unlikely to be informative and assign these images a
zero-length alternative text. The system does not provide developers with a button that
automatically applies alternative text to all images because the system’s suggestions are not
always correct. Following the spirit of Accessmonkey, blind web users can also utilize this
script. In user mode the script simply inserts the best alternative text for each image directly
into the page, although users are provided the option to preface each inserted alternative
text label with a string indicating that it was automatically generated.
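As an illustration of one suggestion heuristic, the sketch below derives candidate alternative text for a linked image from the title of the page it links to. proxyFetch is the hypothetical proxy helper sketched in Section 4.4.2, and the regular expression is a simplification of real title extraction.

    // One WebInSight heuristic: for an image wrapped in a link, suggest
    // the title of the linked page as its alternative text.
    function suggestAltForLinkedImage(img, callback) {
      var parent = img.parentNode;
      if (parent && parent.tagName == 'A' && parent.href) {
        proxyFetch(parent.href, function (html) {
          var match = /<title[^>]*>([^<]+)<\/title>/i.exec(html);
          if (match) {
            callback(match[1]); // candidate alternative text
          }
        });
      }
    }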
4.5.2 Context-Centered Web Browsing
Mahmud et al. introduced a context-driven method called CSurf for browsing the web that
they showed to be much more efficient than browsing with a traditional screen reader [84].
The increased efficiency of this method is derived from its ability to automatically direct
users to relevant content, instead of requiring them to read a web page starting at the
beginning, as is common in most screen readers. When using the system, users are directed to content related to links that they have followed. The text of a link is likely similar to the content in which they are interested. The system calculates where in the web page to begin reading by choosing the section of the web page that contains content most similar to the text of the link that was followed. This enhanced functionality is expected to be included in the Hearsay browser [106].

Figure 4.3: The menubar of this online retailer is inaccessible due to its reliance on the mouse. To fix this problem we wrote an Accessmonkey script that makes this menu accessible from the keyboard.
We have implemented a variation of this accessibility improvement as an Accessmonkey
script. On every web page, the system first adds an onclick event to each anchor tag on
the page. When a user clicks on a link, the text of the link is recorded. When a new
page is loaded, the script checks to see if it occurred as a result of the user clicking a link.
If so, it then finds the DOM element of the page that is most similar to the text of the
clicked link using a simple word-vector comparison. The focus of the web page is changed
to the identified DOM element, which allows modern screen readers to begin reading at that
location. The system also assists web developers in setting skip links, which are links at the
beginning of a web page that are visually hidden but provide a mechanism to screen reader
users to skip to the main content area of a web page. This Accessmonkey script detects
content on the web page that is likely to be relevant, highlights the identified area and adds
the skip link if it is approved by the user. While this script cannot perform the full machine
learning and semantic analysis that is done in CSurf, it allows this powerful technique to
be used by users immediately with the tools they already own.
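A condensed version of the script's core logic appears below. The word-overlap scoring is a simplification of the word-vector comparison described above, and carrying the link text across the page load via sessionStorage is one possible mechanism among several.

    // Remember the text of the last link the user clicked.
    var anchors = document.getElementsByTagName('a');
    for (var i = 0; i < anchors.length; i++) {
      anchors[i].addEventListener('click', function () {
        sessionStorage.setItem('lastLinkText', this.textContent);
      }, false);
    }

    // On load, focus the block of content sharing the most words with
    // the link that was followed (if the page was reached via a link).
    var linkText = sessionStorage.getItem('lastLinkText');
    sessionStorage.removeItem('lastLinkText');
    if (linkText) {
      var words = linkText.toLowerCase().split(/\s+/);
      var best = null, bestScore = 0;
      var blocks = document.getElementsByTagName('p');
      for (var j = 0; j < blocks.length; j++) {
        var text = blocks[j].textContent.toLowerCase();
        var score = 0;
        for (var k = 0; k < words.length; k++) {
          if (words[k] && text.indexOf(words[k]) != -1) score++;
        }
        if (score > bestScore) { bestScore = score; best = blocks[j]; }
      }
      if (best) {
        best.setAttribute('tabindex', '-1'); // make the block focusable
        best.focus(); // screen readers begin reading at the focus
      }
    }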
4.5.3 Personalized Edge Services
Iaccarino et al. outlined a number of edge services that transcoded web content into a
form that better suited the personal needs of web users [62]. Accessmonkey provides an
ideal framework in which to implement these edge services and we have replicated many of
them as Accessmonkey scripts. The original intent of the edge services was to provide web
users with disabilities options for personalization. By implementing them as Accessmonkey
scripts, web developers can leverage them as well. Although many of these services are not
appropriate for all users, web developers may employ them to produce alternative views for
specific audiences.
We have replicated several edge services as Accessmonkey scripts, including services that
replace all images in a page with links to the image, delete the target attribute from all
anchor tags and add access keys to all links. Such improvements can improve access for
certain individuals. These scripts can also be used by web developers, although, because of
the nature of the transformations applied, they may be best used to help create alternative
versions of a web page rather than used to create a universally accessible version.
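Two of these replicated services are small enough to sketch in full; the assignment of digit keys to the first ten links is illustrative.

    var anchors = document.getElementsByTagName('a');
    for (var i = 0; i < anchors.length; i++) {
      // Remove target attributes so links cannot silently open new
      // windows, which can disorient screen reader users.
      anchors[i].removeAttribute('target');
      // Assign numeric access keys to the first ten links on the page.
      if (i < 10) {
        anchors[i].setAttribute('accesskey', String(i));
      }
    }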
4.5.4 Site-Specific Scripts
Many useful accessibility improvements cannot yet be implemented in a general way that
will apply to multiple web sites. Site-specific scripts can reorganize a site's layout, add accessibility features, or improve the accessibility of dynamic content. We have implemented several scripts
that demonstrate the dramatic improvements that can be made by Accessmonkey scripts
targeted at specific sites. For example, the web page of a popular online retailer contains
a menubar at the top listing the major categories of products that they sell, organized
in a tree. This menubar and the elements contained within it are inaccessible because they require the use of a mouse. We wrote an Accessmonkey script that allows the same
content to be accessed via keyboard commands (see Figure 4.3). In this example, the menu
content is available to screen reader users, but is not efficiently conveyed to them. Figure 4.4
demonstrates another example of a site-specific script that, in this case, removes distracting
ads and places the main content of the page closest to the top in reading order.
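The sketch below shows the general shape of such a repair; the .menu-item selector and the handler wiring are placeholders for the retailer's actual markup and hover logic.

    // Make each mouse-only menu item reachable and usable from the
    // keyboard. '.menu-item' is a placeholder for the site's menu entries.
    var items = document.querySelectorAll('.menu-item');
    for (var i = 0; i < items.length; i++) {
      items[i].setAttribute('tabindex', '0'); // add to the tab order
      // Mirror the site's hover behavior when an item receives focus.
      items[i].addEventListener('focus', function () {
        if (this.onmouseover) this.onmouseover();
      }, false);
      // Let the Enter key invoke the item's existing click handler.
      items[i].addEventListener('keydown', function (e) {
        if (e.keyCode == 13 && this.onclick) this.onclick(e);
      }, false);
    }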
The content that the scripts in this section modify already exists on the page, and,
therefore, blind web users could potentially conduct these transformations independently.
This is in contrast to content locked in images or Flash, which is more difficult to
access. While figuring out how to create a script that will improve accessibility may take
time, the user will benefit from these improvements on subsequent visits to the page. These
improvements could be leveraged by other web users visiting the page and, perhaps, even
the web developers responsible for creating it. Javascript is a powerful mechanism for
transcoding content and we are exploring how users can more easily discover and apply
these scripts.
4.6 Discussion
Accessmonkey provides a common framework for web users, web developers, and web researchers to share automatic accessibility improvements. To facilitate this collaboration,
we plan to create an online repository where such scripts can be posted and shared. We
also plan to explore methods for enabling users to easily locate and, perhaps, automatically install scripts from this repository. For example, users could arrive at a news site that they have not visited before and be immediately greeted with the possibility of jumping
directly to the content, navigation or search areas of the page.
Creating an Accessmonkey script currently requires a knowledge of Javascript programming, but many tools let users personalize web content without programming. For example,
the Platypus Firefox extension lets users create Greasemonkey scripts by clicking and dragging elements with the mouse [129]. Keyword commands let users naturally create simple scripts [81], and their keyboard-driven interface is naturally accessible.
Accessmonkey seeks to enable blind web users to improve the accessibility and usability
of their own web experiences, but programming is often still required. Programming-by-demonstration methods for automating web tasks, such as Web Macros [113], PLOW [66],
and Turquoise [86] enable more people to customize the web for themselves. Chapter 6
introduces TrailBlazer, which helps extend these capabilities to blind web users.
The transformations that current Accessmonkey scripts can achieve are limited by the
Javascript programming language. While Javascript is more than adequate for achieving
many transformations, more complex transformations often require specialized libraries. For example, natural language processing and image manipulation functions are not currently available in Javascript and would be difficult to implement in Javascript alone. Javascript also has limited ability to directly access other formats in which web content is represented, such as Flash and Silverlight.

Figure 4.4: This script moves the header and navigation menus of this site to the bottom of the page, providing users with a view of the web page that presents the main content first in the reading order.
Accessmonkey scripts currently rely on web services for advanced functionality, but a better solution may be for scripts to use supplementary libraries. We will explore both adding commonly-used functionality to Accessmonkey implementations and allowing user scripts to bundle Java code libraries. The implementations that we have provided already support calling Java code from Javascript, so a main challenge is to provide a standardized method for users to include such code along with their scripts and to support such bundles in a variety of Accessmonkey implementations.
4.7 Summary & Ongoing Work
Accessmonkey is a common scripting framework that enables web users and developers to
write and apply scripts for improving web accessibility. We have created implementations
that run on a variety of platforms, including as a Firefox extension and directly from the
web. We have reimplemented several existing systems for automatically improving web
pages, which renders these systems available on more platforms and allows them to more
easily be utilized by web developers. In particular, we have converted our WebInSight
system for automatically generating and inserting alternative text into web pages into an
Accessmonkey script from which both web users and web developers can benefit. We also
demonstrated that dynamic content can be made accessible on a per-site basis.
Accessmonkey is at the forefront of the burgeoning area of social accessibility. Soon
after its development, IBM Japan released Social Accessibility [121], a project which helps
connect blind web users experiencing accessibility problems with volunteers who can fix
those problems. They provide end-user tools for both groups. The improvements that have been made are stored in a shared repository called the Accessibility Commons [67],
enabling collaborative improvement of accessibility problems. Social Accessibility has been
developed beyond the prototype stage and is currently in use by many blind web users and sighted volunteers. The AxsJAX project from Google is a scripting framework that uses scripts
to turn web pages into dynamic web applications customized for non-visual use [49]. Both
projects explore different aspects of collaborative accessibility and illustrate the continued importance of this approach.
An important component of Accessmonkey was getting access improvements out to as
many people as possible, regardless of what browser or platform they were running. One
solution advanced by this work was a web-based proxy that could be accessed from any
computer. This work foreshadows our work on WebAnywhere (Chapter 7), which brings
not only access improvement but access itself to users on any computer.
Chapter 5
A MORE USABLE INTERFACE TO AUDIO CAPTCHAS
The goal of a CAPTCHA1 is to differentiate humans from automated agents by requesting the solution to a problem that is easy for humans but difficult for computers.
CAPTCHAs are used to guard access to web resources and, therefore, prevent automated
agents from abusing them. Current CAPTCHAs rely on superior human perception, leading to CAPTCHAs that are predominantly visual and, therefore, unsolvable by people with
vision impairments. Audio CAPTCHAs that rely instead on human audio perception were
introduced as a non-visual alternative but are much more difficult for web users to solve.
Part of the problem is that the interface has not been designed for non-visual use.
This chapter first presents a study of audio CAPTCHAs conducted using the WebinSitu
infrastructure (Chapter 3), and then presents and evaluates a more usable interface designed
for non-visual use. With the new interface, blind web users had a 59% higher success
rate in solving audio CAPTCHAs [16]. The results of improving this interaction illustrate
broader themes that can inform the design of interfaces for non-visual use; specifically, that visual interfaces cannot simply be naively adapted, and that achieving effective access means making interfaces usable. Chapter 6 seeks to extend these benefits more generally to
completing other web-based tasks with a tool called TrailBlazer.
5.1 Introduction and Motivation
Most CAPTCHAs on the web today exhibit the following pattern: the solver is presented
text that has been obfuscated in some way and is asked to type the original text into an
answer box. The technique for obfuscation is chosen such that it is difficult for automated
agents to recover the original text but humans should be able to do so easily. Visually
this most often means that graphic text is displayed with distorted characters (Figure 5.1).
1 Completely Automated Public Turing test to tell Computers and Humans Apart
(a) Microsoft CAPTCHA (b) reCAPTCHA (c) AOL CAPTCHA
Figure 5.1: Examples of existing interfaces for solving audio CAPTCHAs. (a) A separate
window containing the sound player opens to play the CAPTCHA, (b) the sound player is
in the same window as the answer box but separate from the answer box, and (c) clicking
a link plays the CAPTCHA. In all three interfaces, a button or link is pressed to play the
audio CAPTCHA, and the answer is typed in a separate answer box.
In audio CAPTCHAs, this often means text is synthesized and mixed in with background
noise, such as music or unidentifiable chatter. Although the two types of CAPTCHAs seem
roughly analogous, the usability of the two types of CAPTCHAs is quite different because
of inherent differences in the interfaces used to perceive and answer them.
Visual CAPTCHAs are perceived as a whole and can be viewed even when focus is
on the answer box. After focusing the answer box, solvers can continue to look at visual
CAPTCHAs, edit their answer, and verify their answer. They can repeat this process until
satisfied without pressing any keys other than those that form their answer. Errors primarily
arise from CAPTCHAs that are obfuscated too much or from careless solvers.
Audio playback is linear. A solver of an audio CAPTCHA first plays the CAPTCHA and
then quickly focuses the answer box to provide their answer. For sighted solvers, focusing
the answer box involves a single click of the mouse, but for blind solvers, focusing the answer
box requires navigating with the keyboard using audio output from a screen reader. Solving
audio CAPTCHAs is difficult, especially when using a screen reader.
Screen readers voice user interfaces that have been designed for visual display, enabling
blind people to access and use standard computers. As solvers navigate to the answer box, their screen readers announce the interface but also talk over the playing CAPTCHA. A playing CAPTCHA will not pause for solvers as they type their
answer or deliberate about what they heard. Reviewing an audio CAPTCHA is cumbersome, often requiring the user to start again from the beginning, and replaying an audio
CAPTCHA requires solvers to navigate away from the answer box in order to access the
controls of the audio player. The interface to audio CAPTCHAs was not designed to help blind users solve them non-visually.
Audio CAPTCHAs have been shown previously to be difficult for blind web users. Sauer
et al. found that six blind participants had a success rate of only 46% in solving the audio
version of the popular reCAPTCHA [114], and Bigham et al. observed that none of the
fifteen blind high school students in an introductory programming class were able to solve
the audio CAPTCHA guarding a web service required for the course [15]. In this chapter,
we present a study with 89 blind web users who achieved only a 43% success rate in solving
10 popular audio CAPTCHAs. On many websites, unsuccessful solvers must try again on
a new CAPTCHA with no guarantee of success on subsequent attempts, a frustrating and
often time-consuming experience.
Given its limitations, audio may be an inappropriate modality for CAPTCHAs. CAPTCHAs that require human intelligence that computers do not yet possess seem an ideal alternative, but developing them has proven elusive [44].
CAPTCHAs cannot be drawn from a fixed set of questions and answers because doing so
would make them easily solvable by computers. Computers are quite good at the math
and logic questions that can be generated automatically. Audio CAPTCHAs could also be
made more understandable, but that could also make them easier for computers to solve
automatically.
The new interface that we developed improves usability without changing the underlying
audio CAPTCHAs. By moving the interface for controlling playback directly into the
answer box, a change in focus (and thus a change in context) is not required. Using the new
interface, solvers have localized access to playback controls without the need to navigate
from the answer box to the playback controls. Solvers also do not need to memorize the
CAPTCHA, hurry to navigate to the answer box after starting playback of the CAPTCHA,
or solve the CAPTCHA while their screen readers are talking over it. Solvers can play the
CAPTCHA without triggering their screen readers to speak, type their answer as they go,
pause to think or correct what they have typed, and rewind to review, all from within the
answer box.
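A sketch of the idea follows: playback is controlled by key presses handled inside the answer box itself, so focus never has to leave it. The element ids, the use of an HTML5 audio element, and the particular key bindings are all illustrative; the deployed interface defines its own player and bindings.

    // Control CAPTCHA playback from within the answer box itself.
    var audio = document.getElementById('captcha-audio');   // the player
    var answer = document.getElementById('captcha-answer'); // the text box

    answer.addEventListener('keydown', function (e) {
      if (e.ctrlKey && e.keyCode == 32) {         // Ctrl+Space: play/pause
        if (audio.paused) { audio.play(); } else { audio.pause(); }
        e.preventDefault();                        // focus stays in the box
      } else if (e.ctrlKey && e.keyCode == 37) {  // Ctrl+Left: rewind 2s
        audio.currentTime = Math.max(0, audio.currentTime - 2);
        e.preventDefault();
      }
      // All other keys type into the answer box as usual.
    }, false);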
Because popular audio CAPTCHAs have similarities in their interfaces, our optimized
interface can easily be used in place of these existing interfaces. Both the ideas and interface
itself are likely to be applicable to CAPTCHAs yet to be developed. Finally, the design
considerations explored here have application to improving a wide range of interfaces for
non-visual access.
Our work on audio CAPTCHAs offers the following four contributions:
• A study of 162 blind and sighted web users showing that popular audio CAPTCHAs
are much more difficult than their visual counterparts.
• An improved interface for solving audio CAPTCHAs optimized for non-visual use that
moves the controls for playback into the answer box.
• A study of the optimized interface indicating that it increases the success rate of
blind web users on popular CAPTCHAs by 59% without altering the underlying
CAPTCHAs.
• An illustration via the optimized interface that usable interfaces for non-visual access should not be directly adapted from their visual alternatives without considering
differences in non-visual access.
5.2 Related Work
CAPTCHAs were developed in order to control access to online resources and prevent access by automated agents that may seek to abuse these resources [133]. As their popularity
increased, so did the concern that the CAPTCHAs used were primarily based on the superiority of human visual perception, and therefore excluded blind web users. Although audio
CAPTCHAs were introduced as an accessible alternative, the interface used to solve them
did not consider the lessons of prior work on optimizing interfaces for non-visual use.
5.2.1 Making CAPTCHAs Accessible
Audio CAPTCHAs were introduced soon after their visual alternatives [133, 70], and have
been slowly adopted by web sites using visual CAPTCHAs since that time. Although the
adoption of audio CAPTCHAs has been slower than that of visual CAPTCHAs, many popular sites now include audio alternatives, including services offered by Google and Microsoft.
Over 2600 web users have signed a petition asking for Yahoo to provide an accessible alternative [148]. The reCAPTCHA project, a popular, centralized CAPTCHA service with
the goal of improving the automated OCR (Optical Character Recognition) processing of
books, also provides an audio alternative. Although audio CAPTCHAs exist, their usability
has not been adequately examined.
Researchers have quantified the difficulty that users have solving both audio and visual
CAPTCHAs. For instance, Kumar et al. explored the solvability of visual CAPTCHAs
while varying their difficulty on several dimensions [29]. Studies on audio CAPTCHAs
have been smaller but informative. For instance, Sauer et al. conducted a small usability
study (N=6) in order to evaluate the effectiveness of the reCAPTCHA audio CAPTCHA
[114]. They noted that participants in the study employed a variety of strategies for solving
audio CAPTCHAs. Four participants memorized the characters as they were being read
and then entered them into the answer box after the CAPTCHA had finished playing, and one participant used a separate note-taking device to record the CAPTCHA characters
as they were read. They noted that the process of solving this audio CAPTCHA was
highly error-prone, resulting in only a 46% success rate. The study presented in the next
section expands these results to a diverse selection of popular CAPTCHAs in use today and
further illustrates the frustration that blind web users experience and the strategies they employ to solve audio CAPTCHAs.
The usability of CAPTCHAs for human users must be achieved while maintaining the
inability of automated agents to solve them. Although visual CAPTCHAs have had the
highest profile in attempts to break them, audio CAPTCHAs have recently faced similar
attempts [124]. As audio CAPTCHAs are increasingly made the target of automated attacks, changes that make them easier to understand will be less likely to be adopted out of
concern that they will make automated attacks easier as well. Changing the interface used
to solve a CAPTCHA, however, only impacts the usability for human solvers.
The audio CAPTCHAs described earlier are currently the most popular type of accessible
CAPTCHA, but they are not the only approach pursued. Holman et al. developed a
CAPTCHA that pairs pictures with the sounds that they make (for instance, a dog is
paired with a barking sound) so that either the visual or audio representation can be used to
identify the subject of the CAPTCHA [56]. Tam et al. proposed phrase-based CAPTCHAs
that could be more obfuscated than current audio CAPTCHAs but remain easy for humans
to solve because human solvers will be able to rely on context [124]. The improvements
provided by our optimized interface to audio CAPTCHAs could be adapted to both of
these new approaches should they be shown to be better alternatives.
5.2.2 Other Alternatives
Because audio CAPTCHAs remain difficult to use and are not offered on many web sites,
several alternatives have been developed supporting access for blind web users. Many sites
require blind web users to call or email someone to gain access. This can be slow and
detracts from the instant gratification afforded to sighted users. The WebVisum Firefox
extension enables web users to submit requests for CAPTCHAs to be solved, which are then forwarded to its servers and solved by a combination of automated and manual
techniques [144]. Because of the potential for abuse, the system is currently offered by
invitation only and questions remain about its long-term effectiveness. For many blind web
users the best solution continues to be asking a sighted person for assistance when required
to solve a visual CAPTCHA.
Combinations of (i) new approaches to creating audio CAPTCHA problems and (ii)
interfaces targeting non-visual use promise to enable blind web users to independently solve
CAPTCHAs in the future. This chapter demonstrates the importance of the interface.
5.2.3 Targeting Non-Visual Access
The interface that we developed for solving audio CAPTCHAs builds on work considering
the development of non-visual interfaces. Such interfaces are often very different than
the interfaces developed for visual use even though they enable equivalent interaction. For
instance, in the math domain, specialized interfaces have been developed to make navigation
of complex mathematics feasible in the linear space exposed by non-visual interfaces [108].
Emacspeak explores the usability improvement resulting from applications designed for non-visual access instead of being adapted from visual interfaces [107]. For instance, a standard
screen reader may not correctly reflect the semantics of columns in a software calendar,
making it difficult for users to determine what day a particular date falls on. Emacspeak
would announce the day along with the date.
With the increasing importance of web content, much work has targeted better non-visual web access. For instance, the HearSay browser converts web pages into semantically-meaningful trees [106] and, in some circumstances, automatically directs users to content
in a web page that is likely to be interesting to them [84]. TrailBlazer (Chapter 6) suggests
paths through web content for users to follow, helping them avoid slow linear searches
through content [20]. A common theme in work targeting web accessibility is that content
should be accessed in a semantically meaningful way and functionality should be easily
available from the context in which it most makes sense.
The aiBrowser for multimedia web content enables users to independently control the
volume of their screen reader and multimedia content on the web pages they view [87].
Without the interface provided by aiBrowser, content on a web page can begin making
noise (for instance, playing a song in an embedded sound player or Flash movie) making
screen readers difficult to hear. This audio clutter can make navigating to the controls
of the multimedia content using a screen reader difficult, if controls are provided for the
multimedia content at all.
One of the goals of our optimized interface to audio CAPTCHAs is to prevent CAPTCHAs
from starting to play before the user is in the answer field where they will type their answers, a major complaint of our study participants concerning how audio CAPTCHAs currently work. Just as with the aiBrowser, the goal is, in part, to give users finer control over the
audio channel used by both their screen readers and applications.
Work in accessibility has also explored the difference between accessibility and usability.
Many web sites are technically accessible to screen reader users, but they are inefficient and
time-consuming to access. Prior work has shown that the addition of heading elements to
semantically break up a web page or the use of skip links to enable users to quickly skip
to the main content of a page can increase its usability [126, 138]. Audio CAPTCHAs are
accessible non-visually, but their usability is quite poor for most blind web users. Our new
interface helps to improve usability.
5.3 Evaluation of Existing CAPTCHAs
Many web services now offer audio CAPTCHAs because they believe them to be an accessible alternative to visual CAPTCHAs. However, the accessibility and usability of these audio CAPTCHAs have not been extensively evaluated. Our initial study aims to evaluate the accessibility of existing audio CAPTCHAs and to identify ways to improve them. We did this by gathering currently used CAPTCHAs from the most popular
web services and presenting them to study participants to solve. During the study, we collected tracking data to investigate the means by which both sighted and blind users solve
CAPTCHAs. The tracking data we collected allowed us to analyze the timing (from page
load to submit) of every key pressed and button clicked, and search for problem areas and
possible improvements to existing CAPTCHAs.
5.3.1 Existing Audio CAPTCHAs
To gather existing audio CAPTCHAs for our study, we used Alexa [4], a web tracking and statistics-gathering service, to determine the most popular web sites visited from the United States as of July 2008. Of the top 100, 38 used some form of CAPTCHA, and of those fewer than half (47%) had an audio CAPTCHA alternative. For our study, we chose to include only sites offering both visual and audio CAPTCHAs and avoided sites using the same third-party CAPTCHA services.
Using this method we chose 10 unique types of CAPTCHAs that represent those used by
today’s most popular websites: AOL (aol), Authorize.net payment gateway service provider
(authorize), craigslist.org online classifieds (craigslist), Digg content sharing forum (digg),
Facebook social utility (facebook), Google (google), Microsoft Windows Live individual
web services and software products (mslive), PayPal e-commerce site (paypal), Slashdot
Features of Audio CAPTCHAs

                     aol      authorize  craigslist  digg    facebook  google  mslive  paypal   slashdot  veoh
Assistance Offered   no       no         yes         no      no        no      no      no       yes       no
Beeps Before         3        0          0           0       1         3       0       0        0         1
Background Noise     voice    none       music       static  voice     voice   voice   static   none      voice
Challenge Alphabet   A-Z,0-9  A-Z        A-Z         A-Z     0-9       0-9     0-9     A-Z,0-9  Word      0-9
Duration (sec)       10.2     5.1        9.3         6.9     24.7      40.9    7.1     4.3      3.0       25.1
Repeat               no       no         no          no      no        no      no      no       yes       no
Figure 5.2: A summary of the features of the CAPTCHAs that we gathered. Audio
CAPTCHAs varied primarily along the several common dimensions shown here.
technology-related news website (slashdot), and Veoh Internet television service (veoh).
For each of the 10 CAPTCHA types we downloaded 10 examples, resulting in a total of 100
audio CAPTCHAs used for the study (Figure 5.2).
Several of these sites attempted to block the download of the audio files representing each CAPTCHA, although all of the files were in either the MP3 or WAV format. Many sites added the audio files to web pages using obfuscated Javascript and would allow each to be downloaded only once. These techniques at best marginally improve security, but can often hinder access for users who may want to play the audio CAPTCHA with a separate interface that is easier for them to use.
5.3.2 Study Description
To conduct our study, we created interfaces for solving visual and audio CAPTCHAs mimicking those we observed on existing web pages (Figure 5.3). The interface for visual
CAPTCHAs consisted of the CAPTCHA image, an answer field, and a submit button.
The interface for solving audio CAPTCHAs replaced the image with a play button that
when pressed caused the audio CAPTCHA to play. These simplified interfaces preserve
the necessary components of the CAPTCHA interface, enabling interface components to be
[Figure: a CAPTCHA interface with a separate play button and a separate answer field.]
Figure 5.3: An interface for solving audio CAPTCHAs modeled after those currently provided to users to solve audio CAPTCHAs (Figure 5.1).
isolated from the surrounding content. Solving CAPTCHAs in real web pages may be more
difficult as there are additional distractions, such as other content, and the CAPTCHA may
need to be solved with a less ideal interface, for instance using a pop-up window.
Our study was conducted remotely. As Petrie et al. observed, conducting studies with
disabled people in a lab setting can be difficult, but remote studies can produce similar
results [101]. Blind users in particular use many different screen readers and settings that
would be difficult to replicate fully in a lab setting, meaning that remote studies can better
approximate the true performance of participants.
Participants were first presented with a questionnaire asking about their experience
with web browsing, experience with CAPTCHAs and the level of difficulty or frustration
they present, as well as demographic information. Sighted participants were then asked to solve 10 visual CAPTCHAs and 10 audio CAPTCHAs; blind participants were asked to solve 10 audio CAPTCHAs. Each participant was asked to solve one problem randomly drawn
from each CAPTCHA type, and the CAPTCHA types were presented in random order to
help avoid ordering effects.
For this study, participants were designated as belonging to the blind or sighted condition
based on their response to the question: “How do you access the web?” The following
answers were provided as options: “I am blind and use a screen reader,” “I am sighted and
use a visual browser," and "Other." In this chapter, blind participants refers to those who
answered with the first option and sighted participants to those who answered with the
second option.
Participants were given up to 3 chances to correctly solve each CAPTCHA, but of
primary concern was their ability to correctly solve each CAPTCHA on the first try because
this is what is required by most existing CAPTCHAs.
To instrument our study, we included Javascript tracking code on each page of the
study that allowed us to keep track of the keys users typed and other interaction with
page elements. This approach is similar to that provided by the more general UsaProxy [9]
system which records all user actions in the browser when users connect through its proxy.
This approach has also been used before in studies with screen reader users [17].
The data recorded enabled us to make observations, including the time required to
answer the CAPTCHA, how many times the CAPTCHA was played, how many mistakes
were made in the process of answering a CAPTCHA, and the number of attempts required.
The full list of the events gathered and the information recorded for each is shown below:
• Page Loaded - the web page has loaded.
• Focused Play - participant selected the play button.
• Pressed Play - participant pressed the play button.
• Blurred Play - participant moved away from the play button.
• Answer Box Focused - participant entered the answer box either by clicking on it
or tabbing to it.
• Answer Box Blurred - participant exited the answer box either by clicking out or
moving away.
• Key Pressed - participant pressed a keyboard key.
• Focused Submit - submit button was selected.
• Pressed Submit - submit button was pressed.
• Blurred Submit - participant moved from the submit button but did not press it.
• Incorrect Answer - the answer provided by the participant is incorrect, leading the
participant to be presented with a 2nd or 3rd try.
Personally identifying information was not recorded.
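For concreteness, the following is a minimal sketch of how such Javascript instrumentation might be written. The element ids (play, answer, submit) and the logging endpoint (/log) are illustrative, not those used in the study.

    // Minimal event-logging sketch; element ids and endpoint are illustrative.
    var startTime = Date.now();
    function logEvent(name, detail) {
      // Record the event name, optional detail, and time since page load.
      var xhr = new XMLHttpRequest();
      xhr.open('POST', '/log', true);
      xhr.setRequestHeader('Content-Type', 'application/json');
      xhr.send(JSON.stringify({ event: name, detail: detail || null,
                                elapsedMs: Date.now() - startTime }));
    }
    window.addEventListener('load', function () {
      logEvent('Page Loaded');
      var play = document.getElementById('play');
      var answer = document.getElementById('answer');
      var submit = document.getElementById('submit');
      play.addEventListener('focus', function () { logEvent('Focused Play'); });
      play.addEventListener('click', function () { logEvent('Pressed Play'); });
      play.addEventListener('blur', function () { logEvent('Blurred Play'); });
      answer.addEventListener('focus', function () { logEvent('Answer Box Focused'); });
      answer.addEventListener('blur', function () { logEvent('Answer Box Blurred'); });
      answer.addEventListener('keydown', function (e) { logEvent('Key Pressed', e.key); });
      submit.addEventListener('focus', function () { logEvent('Focused Submit'); });
      submit.addEventListener('click', function () { logEvent('Pressed Submit'); });
      submit.addEventListener('blur', function () { logEvent('Blurred Submit'); });
    });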
5.3.3 Results
Of our 162 participants, 89 were blind and 73 were sighted; 56 were female, 99 were male,
and 7 chose not to answer that question; and their ages ranged from 18 to 69 with an
average age of 38.0 (SD = 13.2).
Before participating in our study, blind and sighted participants showed differing levels
of frustration toward the audio and visual CAPTCHAs they had already come across.
Participants were asked to rate the following statements on a scale from Strongly Agree (1) to Strongly Disagree (5), or to opt out by answering "I have never independently solved a visual[audio] CAPTCHA": "Audio CAPTCHAs are frustrating to solve." and "Visual CAPTCHAs are frustrating to solve."
For the question about audio CAPTCHAs, averages from the two groups were similar,
2.73 (SD = 1.3) for blind participants and 2.82 (SD = 1.4) for sighted participants. Far more sighted participants opted out, however: only 7.87% of blind participants opted out, compared to 44.44% of sighted participants (χ² = 69.13, N = 161, df = 1, p < .0001). This shows that nearly half of our sighted participants had never solved
an audio CAPTCHA before, but those who had were nearly as frustrated by them as blind
participants. For the question about visual CAPTCHAs, blind participants averaged 1.58
(SD = 0.9) with 38.2% opting out and sighted participants averaged 2.98 (SD = 1.2) with
only 1.4% opting out (χ² = 14.21, N = 161, df = 1, p = .0002). This shows that more than
a third of blind participants said they had never solved a visual CAPTCHA and the others
found them very frustrating with a rating very close to (1) Strongly Agree. This rating may
mean that some of our participants who checked the “I am blind and use a screen reader”
box did have some vision and had tried to solve visual CAPTCHAs before, or perhaps some
participants found the required phone call to technical support, the added step of waiting
for an email, or the task of finding a sighted person for help to be extremely frustrating.
These results are summarized in Figure 5.4.
The data gathered from the Javascript tracking code were analyzed using a mixed-effects
model analysis of variance with repeated measures [79, 116]. Condition (blind or sighted),
CAPTCHA type (audio or visual), and CAPTCHA source were modeled as fixed effects,
[Figure: two bar charts of participant agreement with "Audio CAPTCHAs are frustrating to solve." and "Visual CAPTCHAs are frustrating to solve.", showing the percentage of blind and sighted participants giving each response from 1 (strongly agree) to 5 (strongly disagree) or "never solved".]
Figure 5.4: Percentage of participants answering each value on a Likert scale from 1 Strongly
Agree to 5 Strongly Disagree reflecting perceived frustration of blind and sighted participants
in solving audio and visual CAPTCHAs. Participants could also respond “I have never
independently solved a visual[audio] CAPTCHA.” Results illustrate that (i) nearly half of
sighted and blind participants had not solved an audio or visual CAPTCHA, respectively,
(ii) visual CAPTCHAs are a great source of frustration for blind participants, and (iii)
audio CAPTCHAs are also somewhat frustrating to solve.
with Condition and CAPTCHA type combined as a fixed effect group with three possible
values (blind-audio, sighted-audio, and sighted-visual). Participant was modeled correctly
as a random effect. Mixed-effects models properly handle the imbalance in our data due
to not all participants solving both audio and visual CAPTCHAs. Mixed-effects models
[Figure: bar chart of average time per CAPTCHA, in seconds, for the audio-blind, audio-sighted, and visual-sighted conditions.]
Figure 5.5: The average time spent by blind and sighted users to submit their first solution
to the ten audio CAPTCHAs presented to them. Error bars represent ± 1 standard error
(SE).
also account for correlated measurements within participants. However, they retain large
denominator degrees of freedom, which can be fractional for unbalanced data.
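In standard notation, the model has roughly the following form (a sketch only; the exact parameterization of the fixed effects is not spelled out here):

    y_{ij} = \mu + \alpha_{g(i)} + \beta_{s(j)} + b_i + \varepsilon_{ij},
    \qquad b_i \sim N(0, \sigma_b^2), \qquad \varepsilon_{ij} \sim N(0, \sigma^2)

where y_{ij} is the measure (e.g., time to first submission) for participant i on CAPTCHA j, \alpha_{g(i)} is the fixed effect of the participant's group (blind-audio, sighted-audio, or sighted-visual), \beta_{s(j)} is the fixed effect of the CAPTCHA's source, b_i is a random intercept for participant i, and \varepsilon_{ij} is residual error.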
Sighted participants solving visual CAPTCHAs were much faster than blind participants solving audio CAPTCHAs: on average, more than 5 times faster. Sighted participants averaged 9.9 seconds (SD = 1.9) and blind participants averaged 50.9 seconds (SD = 1.8) (F(1, 232.1) = 243.9, p < .0001). This may have been expected, but sighted participants also outperformed blind participants on audio CAPTCHAs, with average completion times of 22.8 seconds (SD = 1.9), about twice as fast as our blind participants (F(1, 232.4) = 113.9, p < .0001). The timing data alone show the drastic inequalities
in current CAPTCHAs for blind web users (Figure 5.5).
The largest differences were observed in success rates. The sighted participants in this
study successfully solved nearly 80% of the visual CAPTCHAs presented to them (on the
first try). This resembles the 90% previously reported [29]. (The lower observed success rate may reflect the trend of CAPTCHAs having become more difficult in order to thwart increasingly sophisticated automated attacks.) These same participants, however, were only able to solve 39% of audio CAPTCHAs on the first try, demonstrating again
the higher difficulty of solving audio CAPTCHAs. And while it did take blind participants
longer (see above), blind and sighted participants were on par when it came to solving the
audio CAPTCHAs correctly. Blind participants solved 43% of audio CAPTCHAs presented
to them successfully on the first try, although the difference between blind and sighted was
not significant (χ² = 3.46, N = 161, df = 1, p = .06). Second and third tries rarely helped
in finding a correct answer (Figure 5.6).
Even though blind participants were on par (slightly better, but not significantly so) at
solving audio CAPTCHAs correctly, they took twice as long to do so. So, what occupied
the remaining time? This extra time may have been spent listening to the CAPTCHA: on average, blind participants clicked play 3.6 (SD = 0.1) times, whereas sighted participants clicked play 2.5 (SD = 0.1) times (F(1, 232.1) = 52.2, p < .0001). They may also have spent more time navigating to and from the text box: blind participants entered the text box on average 2.9 (SD = 0.1) times, whereas sighted participants entered it on average 2.4 (SD = 0.1) times (F(1, 232.2) = 10.2, p < .001).
5.3.4 Discussion
Recruiting participants to take part in studies can be especially difficult when looking for
participants with specific characteristics, such as participants who use a screen reader.
Despite this, we had very little trouble recruiting participants for this study (as reflected
by the large number of responses). Our post on an online mailing list for blind web users was greeted with a flurry of responses, both positive and negative. Many seemed pleased to find that this problem was being worked on, and many doubted that audio CAPTCHAs could ever be improved. Our first anecdotal evidence that CAPTCHAs were a widely-acknowledged problem was the number of responses, many of which were written with what appeared to be significant emotion.
Audio CAPTCHAs were anecdotally a great source of frustration to both blind and
sighted participants in our study. Many sighted participants had no prior experience with
audio CAPTCHAs and told us that they were much more difficult than they expected.
In fact, one participant said, "After going through this exercise, I've changed my opinion
Figure 5.6: The number of tries required to correctly answer each CAPTCHA problem
illustrating that (i) multiple tries resulted in relatively few corrections, (ii) the success
rates of blind and sighted solvers were on par, and (iii) many audio CAPTCHAs remained
unsolved after three tries.
that audio CAPTCHA is a good alternative solution for people who are blind." Many participants, but perhaps especially the blind participants, expressed exasperation toward
CAPTCHAs: “I understand the necessity for CAPTCHAs, but they are the only obstacle
on the Internet I have been unable to overcome independently.”
Clearly, some types of audio CAPTCHAs are much more difficult to solve than others
and some features were better received than others. For example, “The random-letters,
random-numbers ones were completely impossible for me to solve. I couldn’t tell the difference between c/t/v/b, for example. Those with human-intelligible context (e.g. ‘c as in
cucumber’) were far easier and less stressful.”
While some of the frustration from solving CAPTCHAs seemed to stem from the difficulty of deciphering distorted audio, for blind people, much of the frustration comes from
interacting with the CAPTCHA with their screen reader. For example, “It will always be
hard to activate the play button, jump to the answer edit box, silence a screen reader and
get focused to listen and enter data accurately.” This process takes time and often content
in the beginning of the CAPTCHA is missed: “At the beginning of the captcha, give me
time to get down to the edit box and enter it. My screen reader is chattering while I’m
getting to the edit box and the captcha is playing.”
Instead of trying to navigate while the CAPTCHA plays, some people try to memorize
the answer, wait for the play to finish, and then move to the text box and start typing. But,
this presents an entirely new challenge: “I heard them, but could not remember them. And
if I tried to type them out [while] listening, my screen reader interfered with my listening.”
This process resembles what one might expect sighted users to do if the visual CAPTCHA
and the answer box were located on different pages and only one could be viewed at a time.
The interaction problems identified in this study motivate a new interface design with
simple improvements that could greatly increase the usability of audio CAPTCHAs.
5.4 Improved Interface for Non-Visual Use
The comments of participants identified two main areas in which audio CAPTCHAs could be
improved. As expected, one area was the audio itself – the speech representation should be
made clearer, background noise reduced, and additional contextual hints provided in order
to make audio CAPTCHAs easier to solve. The audio characteristics of a CAPTCHA were
important in determining its difficulty but are difficult to change because they directly determine how resistant the CAPTCHA will be to automated attacks. Audio CAPTCHAs have
recently become a more popular target for automated attacks; for example, reCAPTCHA was shown to be likely vulnerable to automated attack [124].
The second area of difficulty mentioned by participants was the interface provided for
solving audio CAPTCHAs. Users found the current interfaces cumbersome and sometimes
confusing to use. Unlike many improvements to the CAPTCHA itself, improvements to the
interface do not affect the resistance to automated attacks. As long as the interface does
not embed clues to the answer of the CAPTCHA, then it can be modified in whatever way
is best for users.
The navigation elements used to listen to and answer an audio CAPTCHA can be
distracting, forcing users to either miss the beginning of the CAPTCHA or memorize the
[Figure: the answer textbox with its keyboard controls: comma rewinds 1 second, period plays/pauses, forward slash forwards 1 second; all other keys behave as normal.]
Figure 5.7: The new interface developed to better support solving audio CAPTCHAs. The
interface is combined within the answer textbox to give users control of CAPTCHA playback
from within the element in which they will type the answer.
entire CAPTCHA before typing the answer. Participants reported that they appreciated
CAPTCHAs that began with a few beeps (as 4 of the 10 CAPTCHAs did) because this
allowed them time to move from the “Play” button to the answer box. This suggested that
a more usable interface would not require users to navigate back and forth. Our interface
optimized for non-visual use addresses this navigation problem by moving the controls for
playback into the answer box, obviating the need to navigate from playback controls to the
answer box because they are now one and the same.
By combining the playback controls and the answer box into a single control, the interface
is designed to present less of a hurdle for users to overcome, enabling them to focus on
answering the CAPTCHA. Many participants mentioned that using the current interface
required them to play through an entire audio CAPTCHA to review a specific portion.
Even when controls other than “play” are available, users do not use them because they
require them to navigate to the appropriate control and then back again to the answer box.
Based on this feedback, we added simple controls into the answer box that enabled users
to both play/pause the CAPTCHA and to rewind or fast-forward by one second without
additional navigation (Figure 5.7).
Through several rounds of iterative design with several blind participants, we refined this new interface. For example, we initially used various control key combinations to control playback of the CAPTCHA (such as CTRL+P for play), but we found that the shortcuts that we chose often overlapped with shortcuts available in screen readers. We briefly considered
using the single key “p” for play, but this overlaps with the alphabet used in many popular
CAPTCHAs, meaning our interface could not be used with them.
On the suggestion of a blind participant, we chose to use the following individual keys for the playback controls: comma (,) for rewind, period (.) for play/pause, and forward slash (/) for fast-forward. These were not included in the alphabets of any of the CAPTCHAs that we considered (Figure 5.2) and are located in that order on standard American keyboards.
For users of keyboards with different layouts, the keys could be similarly chosen to avoid
collision with screen reader shortcuts and characters used in language-specific CAPTCHAs,
and such that they are conveniently located on popular local keyboards.
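As a sketch of how the combined control can be implemented, the following intercepts these keys in the answer box. The element ids and the use of an HTML5 audio element are illustrative; the interface evaluated here predates HTML5 audio.

    // Control CAPTCHA playback from within the answer box itself.
    var audio = document.getElementById('captcha-audio');   // illustrative id
    var answer = document.getElementById('answer');         // illustrative id
    answer.addEventListener('keydown', function (e) {
      if (e.key === '.') {                 // period: play/pause
        if (audio.paused) { audio.play(); } else { audio.pause(); }
      } else if (e.key === ',') {          // comma: rewind 1 second, keep playing
        audio.currentTime = Math.max(0, audio.currentTime - 1);
        audio.play();
      } else if (e.key === '/') {          // forward slash: skip ahead 1 second
        audio.currentTime = Math.min(audio.duration, audio.currentTime + 1);
        audio.play();
      } else {
        return;                            // any other key types as normal
      }
      e.preventDefault();                  // control keys are never entered as text
    });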
5.4.1 Integration Into Existing Websites
An advantage of altering the interface used to solve CAPTCHAs instead of attempting
to make CAPTCHA problems themselves more usable is that a new interface can be independently added to existing web sites. We have written a Greasemonkey script [104]
that detects the reCAPTCHA interface and replaces the interface used to solve its audio
CAPTCHA with our optimized interface.
For web sites where this is not currently possible, web developers could add this interface to their sites without concern that the new interface will expose them to additional
risk of automated attack. All of the currently-used CAPTCHAs considered in the study in
the previous section can be used directly with our optimized interface.
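A userscript of this kind might be structured as follows. This is a sketch only: the element ids are illustrative, and the real script must match reCAPTCHA's actual (and changing) markup.

    // ==UserScript==
    // @name     Optimized audio CAPTCHA interface (sketch)
    // @include  *
    // ==/UserScript==
    (function () {
      // Illustrative detection of the CAPTCHA answer field and audio challenge.
      var answer = document.getElementById('recaptcha_response_field');
      var audio = document.querySelector('audio');
      if (!answer || !audio) { return; }   // nothing to optimize on this page
      answer.addEventListener('keydown', function (e) {
        if (e.key === '.') { audio.paused ? audio.play() : audio.pause(); }
        else if (e.key === ',') { audio.currentTime -= 1; audio.play(); }
        else if (e.key === '/') { audio.currentTime += 1; audio.play(); }
        else { return; }
        e.preventDefault();                // keep control keys out of the answer
      });
    })();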
5.5 Evaluation of New CAPTCHA Interface
We evaluated our new interface for solving audio CAPTCHAs with the optimizations for
screen reader users based on the insights from our initial study.
5.5.1 Study Design
To evaluate the new interface for audio CAPTCHAs we repeated the study described earlier
but with the new interface. Below is a snippet from the instructions that were given to
participants before the study began:
We are testing a different interface for solving CAPTCHAs. Please take some time to
familiarize yourself with the new interface. Keys for controlling playback are as follows:
• Typing a period in this box will cause the CAPTCHA to play, and pressing it again
will pause playback
• Typing a comma will rewind the CAPTCHA by 1 second and then continue playing.
• Typing a forward slash will fast forward the CAPTCHA by 1 second and then continue.
These keys work only when the textbox used to answer the CAPTCHA problem has focus.
This allows you to control the CAPTCHA directly from the box into which you will enter
your answer. The control key characters will not be entered into the box and are only used
to control playback of the CAPTCHA.
No participants reported difficulty learning the new interface.
5.5.2 Results
This study included 14 blind participants: 2 were female, 10 were male, and 2 chose not to
answer that question; their ages ranged from 22 to 59 with an average age of 36.1 (SD =
10.2).
We again used a mixed-effects model analysis of variance with repeated measures to
analyze our data. While our optimized interface did not have a significant effect on the time required to solve CAPTCHAs (participants averaged 50.9 seconds (SD = 2.4) with the original interface and 47.3 seconds (SD = 5.9) with the optimized interface; F(1, 101.3) = 0.31, n.s.), it did have significant and positive effects on the number of tries required
to solve CAPTCHAs and the observed success rate of participants.
With our optimized interface, participants were able to reduce the average number of attempts required to solve the CAPTCHAs from 2.21 (SD = 0.4) with the original interface to 1.56 (SD = 1.2) (F(1, 100) = 20.3, p < .0001). Perhaps more importantly, participants solved over 50% more CAPTCHAs correctly on the first try with the optimized interface than they did with the original interface: 42.9% (SD = 0.2) were correctly solved on the first try using the original interface and 68.5% (SD = 0.5) were correctly solved on the first try using the optimized interface (F(1, 100) = 22.3, p < .0001), a relative improvement of 68.5/42.9 ≈ 1.59, or 59%. These improvements are shown in Figure
5.8.
[Figure: bar chart of the percentage of CAPTCHAs answered correctly on the 1st try, 2nd try, 3rd try, or never, using the original and optimized interfaces.]
Figure 5.8: The percentage of CAPTCHAs answered correctly by blind participants using the original and optimized interfaces. The optimized interface enabled participants
to answer 59% more CAPTCHAs correctly on their first try as compared to the original
interface.
5.5.3 Discussion
Participants in this study were generally enthusiastic about the new interface to audio
CAPTCHAs that we created, leading one participant to say, “I really liked the interface
provided here for answering the captchas. I think it could really be benefitial [sic] if widely
used.” Some participants felt that while the new interface offered an improvement, audio
CAPTCHAs were still frustrating. For example, one participant said, “... sometimes the
audio captchas are still so distorted that it’s hard to solve them even then.”
In general, while audio CAPTCHAs remained challenging for users, they were both
more accurate with the new interface (answered incorrectly less often) and required fewer
attempts to find the right answer. Because the new interface does not affect the security
of the underlying CAPTCHA and can be easily adapted to new CAPTCHAs, we hope this
interface will become the default in the near future.
5.6 Future Work
In this work, we have demonstrated the difficulty of audio CAPTCHAs and offered improvements to the interface used to answer them that can help make them more usable. We
plan to explore other areas in which interface changes may improve non-visual access, and
consider how the lessons we learned in this work may generalize beyond the interfaces to
audio CAPTCHAs.
Future work may explore how audio CAPTCHAs could be created that are easier for
humans to solve while still addressing the improved automatic techniques for defeating them.
The ten audio CAPTCHAs explored in our study varied along several dimensions but remained quite similar in design. Perceptual CAPTCHAs
face many problems, including that (i) none are currently accessible to individuals who are
both blind and deaf and (ii) automated techniques are becoming increasingly effective in
defeating them. An important direction for future work is addressing these problems.
5.7 Summary
Creating an interface optimized for non-visual access presents challenges that are very different from those of interfaces targeting visual access. Our study with blind participants demonstrated that existing audio CAPTCHAs are inadequate alternatives and that participants' frustration is due in part to the interface provided for solving them. Based on this feedback, we optimized
the interface to solving audio CAPTCHAs for non-visual use by localizing the playback
interface to the answer box.
Although we did not change the audio CAPTCHAs themselves, users in our subsequent study were able to successfully solve CAPTCHAs on their first try 59% more of the
time. This dramatic improvement can be directly used in existing interfaces to CAPTCHAs
without impacting the ability of the CAPTCHA to protect access from automatic agents.
Because of the incredible differences in non-visual access, the interface can make all the
difference when developing applications designed to be accessed non-visually.
This chapter demonstrates the utility of our WebinSitu infrastructure (Chapter 3) for
conducting large studies of user interaction with blind web users. Not only did we discover
the difficulty of using the interfaces to audio CAPTCHAs with WebinSitu, but we were also able to evaluate our newly designed interface using it. The new interface can be
added to existing pages using an Accessmonkey script (Chapter 4), injected by the user on
a variety of different platforms.
In the next chapter, we present TrailBlazer, a collaborative tool that lets blind web users connect individual interactions together to complete web-based tasks more effectively. The goal of TrailBlazer is to enable end users to create interactions that work better
for them, just as the script offered in this chapter improves access to audio CAPTCHAs,
without requiring programming experience.
Chapter 6
MORE EFFECTIVE ACCESS WITH TRAILBLAZER
The previous chapter illustrated how screen reader users could dramatically improve
their success rate at solving audio CAPTCHAs by injecting a script that altered the interface
to make it better for them. This chapter explores the potential for screen reader users
to help one another be more effective at completing tasks on the web by improving the
constituent interactions by demonstration. We introduce TrailBlazer, a system that provides
an accessible, non-visual interface to guide blind users through existing how-to knowledge
[20]. A formative study indicated that participants saw the value of TrailBlazer but wanted
to use it for tasks and web sites for which no existing script was available. To address this,
TrailBlazer offers suggestion-based help created on-the-fly from a short, user-provided task
description and an existing repository of how-to knowledge.
6.1 Motivation
For blind web users, completing tasks on the web can be time-consuming and frustrating.
Blind users interact with the web through software programs called screen readers. Screen
readers convert information on the screen to a linear stream of either synthesized voice or
refreshable Braille. If a blind user needs to search for a specific item on the page, they must either listen to the entire linear stream until the goal item is reached or skip around in the page using structural elements, such as headings, as a guide. To become
proficient, users must learn hundreds of keyboard shortcuts to navigate web page structures
and access mouse-only controls. Unfortunately, as shown in Chapter 3, even experienced
screen reader users do not approach the speed of searching a web page that is afforded to
sighted users for many tasks [17, 122].
Existing repositories contain how-to knowledge that is able to guide people through web
tasks quickly and efficiently. This how-to knowledge is often encoded as a list of steps that
Time Card CoScript
1. goto “http://www.mycompany.com/timecard/”
2. enter “8” into the “Hours worked” textbox
3. click the “Submit” button
4. click the “Verify” button
Figure 6.1: A CoScript for entering time worked into an online time card. The natural
language steps in the CoScript can be interpreted both by tools such as CoScripter and
TrailBlazer, and also read by humans. These steps are also sufficient to identify all of the
web page elements required to complete this task – the textbox and two buttons. Without
TrailBlazer, steps 2-4 would require a time-consuming linear search for screen reader users.
must be performed in order to complete the task. The description of each step consists of
information describing the element that must be interacted with, such as a button or text
box, and the type of operation to perform with that element. For example, one step in the
task of buying an airplane flight on orbitz.com is to enter the destination city into the text
box labeled “To”. One such repository is provided by CoScripter [80], which contains a
collection of scripts written in a “sloppy” programming language that is both human- and
machine-understandable (Figure 6.1).
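For illustration, a simple interpreter might recognize the commands of Figure 6.1 with patterns like the following. The regular expressions are illustrative, not CoScripter's actual grammar.

    // Parse a sloppy-language step into an operation, a value, and a target.
    function parseStep(step) {
      var m;
      if ((m = step.match(/^goto "(.+)"$/))) {
        return { op: 'goto', url: m[1] };
      }
      if ((m = step.match(/^enter "(.+)" into the "(.+)" textbox$/))) {
        return { op: 'enter', value: m[1], target: m[2] };
      }
      if ((m = step.match(/^click the "(.+)" button$/))) {
        return { op: 'click', target: m[1] };
      }
      return null;  // unrecognized step: leave it for the user to perform
    }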
In this chapter, we present TrailBlazer, a system that guides blind users through completing tasks step-by-step. TrailBlazer offers users suggestions of what to do next, automatically advancing the focus of the screen reader to the interactive element that needs to be
operated or the information that needs to be heard. This capability reduces the need for
time-consuming linear searches when using a screen reader.
TrailBlazer was created to support the specific needs of blind users. First, its interface
explicitly accommodates screen readers and keyboard-only access. Second, TrailBlazer augments the CoScripter language with a “clip” command that specifies a particular region of
the page on which to focus the user’s attention. A feature for identifying regions is not
included in other systems under the assumption that users can visually search content to
quickly find desired regions.
Third, TrailBlazer is able to dynamically create new scripts from a brief user-specified
description of the goal task and the existing corpus of scripts. Dynamic script creation
was inspired by a formative user study of the initial TrailBlazer system, which confirmed
that TrailBlazer made the web more usable but was not helpful in the vast majority of
cases where a script did not already exist. To address this problem, we hypothesized that
users would be willing to spend a few moments describing their desired task for TrailBlazer
if it could make them more efficient on tasks lacking a script. The existing repository of
scripts helps TrailBlazer to incorporate knowledge from similar tasks or sub-tasks that have
already been demonstrated. Building from the existing corpus of CoScripter scripts and the
short task description, TrailBlazer dynamically creates new scripts that suggest patterns of
interaction with previously unseen web sites, guiding blind users through sites for which no
script exists.
As blind web users interact with TrailBlazer to follow these dynamically-suggested steps,
they are implicitly supervising the synthesis of new scripts. These scripts can be added to
the script repository and reused by all users. Studies have shown that many users are
unwilling to pay the upfront costs of script creation even though those scripts could save
them time in the future [75]. Through the use of TrailBlazer, we can effectively reverse the
traditional roles of the two groups, enabling blind web users to create new scripts that lead
sighted users through completing web tasks.
This chapter makes the following three contributions:
• An Accessible Guide - TrailBlazer is an accessible interface to the how-to knowledge
contained in the CoScripter repository that enables blind users to avoid linear searches
of content and complete tasks more efficiently.
• Formative Evaluation - A formative evaluation of TrailBlazer illustrating its promise
for improving non-visual access, as well as the desire of participants to use it on tasks
for which a script does not already exist.
• Dynamic Script Generation - TrailBlazer, when given a natural language description of a user’s goal and a pre-existing corpus of scripts, dynamically suggests steps
to follow to achieve the user’s goal.
6.2 Related Work
Work related to TrailBlazer falls into two main categories: (i) tools and techniques for
improving non-visual web access, and (ii) programming by demonstration and interactive
help systems that play back and record how-to knowledge.
6.2.1 Improving Web Accessibility
Most screen readers simply speak aloud a verbal description of the visual interface. While
this enables blind users to access most of the software available to sighted people, these programs are often not easy to use because their interfaces were not designed to be used non-visually.
Emacspeak demonstrated the benefits to usability resulting from designing applications
with voice output in mind [107]. The openness of the web enables it to be adapted for
non-visual access.
Unfortunately, most web content is not designed with voice output in mind. In order to
produce a usable spoken interface to a web site, screen readers extract semantic information
and structure from each page and provide interaction techniques designed for typical web
interactions. When pages contain good semantics, these can be used to improve the usability
of the page, for instance by enabling users to skip over sections irrelevant to them.
Semantic information can either be added to pages by content providers or formulated
automatically when pages are accessed. Adding meaningful heading tags (<h1> to <h6>) has
been shown to improve web efficiency for blind web users browsing structural information
[138] but, as shown in Chapter 3, less than half of web pages use them. To improve web
navigation, in-page “skip” links visible only to screen reader users can be added to complex
pages by web developers. These links enable users to quickly jump to areas of the page
possibly far in linear distance. Unfortunately, these links are often broken (Chapter 3).
Web developers have proven unreliable in manually providing navigation aids by annotating
their web pages.
Numerous middleware systems [7] have explored ways of inserting semantically-relevant markup into web pages before they reach the client. Other systems have moved the
automatic detection of semantically-important regions to the interface itself. For example,
the Hearsay non-visual web browser parses web pages into a semantic tree that can be more
easily navigated with a screen reader [106].
Augmenting the screen reader interface has also been explored. Several systems have
added information about surrounding pages to existing pages to make them easier to use.
Harper et al. augment links in web pages with "Gist" summaries of the linked pages in
order to provide users more information about the page to which a link would direct them
[54]. CSurf observes the context of clicked links in order to begin reading at a relevant point
in the resulting page [84].
Although adding appropriate semantic information makes web content more usable,
finding specific content on a page is still a difficult problem for screen reader users. AxsJAX addresses this problem by embedding “trails” into web pages that guide users through
semantically-related elements [49]. TrailBlazer scripts expand on this trail metaphor. Because AxsJAX trails are generally restricted to a single page and are written in Javascript,
AxsJAX trails cannot be created by end users or applied to the same range of tasks as
TrailBlazer’s scripts.
6.2.2 Recording and playback of how-to knowledge
Interactive help systems and programming by demonstration tools have explored how to
capture procedural knowledge and express it to users. COACH [118] and Eager [37] are
early systems in this space that work with standard desktop applications instead of the
web. COACH observed computer users in order to provide targeted help, and Eager learned
and executed repetitive tasks by observing users.
Expressing procedural knowledge, especially to assist a user who is currently working
to complete a task, is a key issue for interactive help systems. Kelleher et al.’s work on
stencil-based tutorials demonstrates a variety of useful mechanisms [68], such as blurring all of the items on the screen except those relevant to the current task. Sticky notes that add useful contextual information were also found to be effective. TrailBlazer makes
use of analogous ideas to direct the attention of users to important content in its non-visual
user interface.
[Figure: a web page with TrailBlazer's guidance embedded: A) a bubble reading "Step 2 of 15: select 'Books' from the 'Search' listbox" anchored at the 'Search' listbox, and B) 'Previous Step', 'Play from Here', and 'Next Step' buttons.]
Figure 6.2: The TrailBlazer interface is integrated directly into the page, is keyboard accessible, and directs screen readers to read each new step. A) The description of the current
step is displayed visually in an offset bubble but is placed in DOM order so that the target
of a step immediately follows its description when viewed linearly with a screen reader.
B) Script controls are placed in the page for easy discoverability but also have alternative
keyboard shortcuts for efficient access.
Representing procedural knowledge is also a difficult challenge. Keyword commands are one method; they use simple pseudo-natural-language descriptions to refer to interface elements and the operations to be applied to them [81]. This is similar to the sloppy language used by CoScripter to describe web-based activity [80]. TrailBlazer builds upon these approaches because the procedural knowledge it stores can be easily spoken aloud and understood by blind users.
A limitation of most current systems is that they cannot generalize captured procedural
knowledge to other contexts. For example, recording the process of purchasing a plane
flight on orbitz.com will not help perform the same task on travelocity.com. One of the few systems to explore generalization is the Goal-Oriented Web Browser [41], which attempts to generalize a previously demonstrated script using a database of common-sense knowledge. This approach centers on data detectors that could determine the type of data appearing on web sites. TrailBlazer incorporates additional inputs into its generalization process, including a brief task description from the user, and does not require a common-sense knowledge base.
An alternate approach to navigating full-size web pages with a script, as TrailBlazer
does, is to instead shrink the web pages by keeping only the information needed to perform
the current task. This can be done using a system such as Highlight, which enables users
to re-author web pages for display on small screen devices by demonstrating which parts of
the pages used in the task are important [94]. The resulting simplified interfaces created
by Highlight are more efficient to navigate with a screen reader, but prevent the user from
deviating from the task by removing content that is not directly related to the task.
6.3 An Accessible Guide
TrailBlazer was designed from the start for non-visual access using the following three
guidelines (Figure 6.2):
• Keyboard Access. All playback functions are accessible using only the keyboard, making access feasible for those who do not use a mouse.
• Minimize Context Switches. The playback interface is integrated directly into the
web pages through which the user is being guided. This close coupling of the interface
into the web page enables users to easily switch between TrailBlazer’s suggestions and
the web page components needed to complete each step.
• Directing Focus. TrailBlazer directs users to the location on each page to complete
each step. As mentioned, a main limitation of using a screen reader is the difficulty
in finding specific content quickly. TrailBlazer directs users to the content necessary
to complete the instruction that it suggests. If the user wants to complete a different
action, the rest of the page is immediately available.
The bubbles used to visually highlight the relevant portion of the page and provide
contextual information were inspired by the “sticky notes” used in Stencil-Based Tutorials
[68]. The non-visual equivalent in TrailBlazer was achieved by causing the screen reader
to begin reading at the step (Figure 6.2). Although the location of each bubble is visually
offset from the target element, the DOM order of the bubble’s components was chosen such
that they are read in an intuitive order for screen reader users. The visual representation
resembles that of some tutoring systems, and may also be preferred by users of visual
browsers, in addition to supporting non-visual access with TrailBlazer.
Upon advancing to a new instruction, the screen reader’s focus is set to the instruction
description (e.g., “Step 2 of 5: click the “search” button”). The element containing that
text is inserted immediately before the relevant control (e.g., the search button) in DOM
order so that exploring forward from this position will take the user directly to the element
mentioned in the instruction. The playback controls for previous step, play, and next step
are represented as buttons and are inserted following the relevant control. Each of these
functions can also be activated by a separate keyboard shortcut - for example, “ALT+S”
advances to the next step.
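A sketch of this placement logic follows; the function and element names are illustrative.

    // Insert the step description immediately before the target element in DOM
    // order, so that reading forward from the description reaches the target next.
    function showStep(target, stepNumber, totalSteps, description) {
      var bubble = document.createElement('div');
      bubble.setAttribute('tabindex', '-1');   // allow focus to be moved here
      bubble.textContent = 'Step ' + stepNumber + ' of ' + totalSteps +
                           ': ' + description;
      target.parentNode.insertBefore(bubble, target);

      // The playback controls follow the target so they are read after it.
      var controls = document.createElement('span');
      ['Previous Step', 'Play from Here', 'Next Step'].forEach(function (label) {
        var button = document.createElement('button');
        button.textContent = label;
        controls.appendChild(button);
      });
      target.parentNode.insertBefore(controls, target.nextSibling);

      bubble.focus();   // direct the screen reader to begin reading at the step
    }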
The TrailBlazer interface enables screen reader users to move from step to step, verifying
that each step is going to be conducted correctly, while avoiding all linear searches through
content (Figure 6.3). In the event that the user does not want to follow a particular step of
the script they are using, the entire web page is available to them as normal. TrailBlazer is
a guide but does not override the user’s intentions.
6.4 Clipping
While examining the scripts in the CoScripter repository, we noticed that many scripts
contained comments directing users to specific content on the page. Comments are not
interpreted by CoScripter however, and there is no command in CoScripter’s language that
can identify a particular region of the screen. Whether users were looking up the status of
their flight, checking the prices of local apartments or searching Google, the end goal was
not to press buttons, enter information into text boxes, or follow links; the goal was to find
information. A visual scan might locate this information quickly, but doing so with a screen
reader would be a slower process.
Coyne et al. observed that blind web users often use the “Find” function of their web
browsers to address this issue [35]. The find function provides a simple way for users to
quickly skip to the content, but requires them to know in advance appropriate text for which
[Figure: three screenshots of TrailBlazer guiding a user through purchasing a book on Amazon, showing step 1 of 15 ("go to www.amazon.com"), step 2 of 15 ("select 'Books' from the 'Search' listbox"), and step 8 of 15 ("clip the TABLE containing 'List Price'").]
Figure 6.3: TrailBlazer guiding a user step-by-step through purchasing a book on Amazon. 1) The first step is to goto the Amazon.com homepage. 2) TrailBlazer directs the
user to select the “Books” option from the highlighted listbox. 8) On the product detail
page, TrailBlazer directs users past the standard template material directly to the product
information.
to search. The “clip” command that TrailBlazer adds to the CoScripter language enables
regions to be described and TrailBlazer users to be quickly directed to them.
6.4.1 Region Description Study
Existing CoScripter commands are written in natural language. In order to determine what
language would be appropriate for our CoScripter command, we conducted a study in which
we asked 5 participants to describe 20 regions covering a variety of content (Figure 6.4).
To encourage participants to provide descriptions that would generalize to multiple regions,
two different versions of each region were presented.
Upon an initial review of the results of this study, we concluded that the descriptions
provided fell into the following 5 non-exclusive categories: high-level semantic descriptions
1-a. "2008 season stats"
1-b. "The highlighted region is of statistics. This is a table that has multiple numbers describing a player's achievements and records of what he has accomplished."
2-a. "This region lists search results for your query."
2-b. "This area contains the heading 'Search Results' along with the returns from a search of a term."
Figure 6.4: The descriptions provided by two participants for the screenshots shown illustrating diversity in how regions were described. Selected regions are 1) the table of statistics
for a particular baseball player, and 2) the search results for a medical query.
of the content (78%), descriptions matching all or part of the headings provided on the page
for the region (53%), descriptions drawn directly from the words used in the region (37%),
descriptions including the color, size, or other stylistic qualities of the region (18%), and
descriptions of the location of the region on the page (11%).
6.4.2 The Syntax of the Clip Command
We based the formulation of the syntax of the clip command on the results of the study
just described. Clearly, users found it most convenient to describe the semantic class of the
region. While future work may seek to leverage a data detector like Miro to automatically
determine the class of data in order to facilitate such a command [41], our clip command
currently refers to regions by either their heading or the content contained within them.
When using a heading to refer to a region, a user lists text that starts the region of
interest. For instance, the step "clip the 'search results'" would begin the clipped region at
the text “search results.” This formulation closely matched what many users wrote in our
study, but does not explicitly specify an end to the clip. TrailBlazer uses several heuristics
to end the clip. The most important goal is directing the user to the general area just before the information that is valuable to them. If the end of the clipped region comes too soon,
they can simply keep reading past the end of the region.
To use text contained within a region to refer to it, users write commands like "clip the region containing 'flight status'". For scripts operating on templated web sites or for those that use dynamically-generated content, this is not always an ideal formulation because
specific text may not always be present in a desired region. By using both commands, users
have the flexibility to describe most regions, and, importantly, TrailBlazer is able to easily
interpret them.
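One plausible implementation of the heading form searches for the innermost element whose visible text begins with the given phrase (a sketch; TrailBlazer's actual matching and region-ending heuristics are not shown). The "containing" form can be implemented the same way with a substring test in place of the prefix test.

    // Find the start of a clipped region: the innermost element whose visible
    // text begins with the phrase given in the "clip" step.
    function findClipStart(phrase) {
      var lower = phrase.toLowerCase();
      var match = null;
      var walker = document.createTreeWalker(document.body, NodeFilter.SHOW_ELEMENT);
      for (var el = walker.nextNode(); el; el = walker.nextNode()) {
        var text = (el.textContent || '').replace(/^\s+/, '').toLowerCase();
        if (text.indexOf(lower) === 0) {
          match = el;   // descendants are visited later, so the last match is innermost
        }
      }
      return match;
    }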
6.5 Formative Evaluation
The improvements offered in the previous sections were designed to make TrailBlazer accessible to blind web users using a screen reader. In order to investigate its perceived usefulness
and remaining usability concerns, we conducted a formative user study with 5 blind participants. Our participants were experienced screen reader users. On average, they had 15.0
(SD=4.7) years of computer experience, including 11.8 (2.8) years using the web.
We first demonstrated how to use TrailBlazer as a guide through pre-defined tasks. We
showed users how they could, at each step, choose to either have TrailBlazer complete the
step automatically, complete it themselves, or choose any other action on the page. After
this short introduction, participants performed the following three tasks using TrailBlazer:
(i) checking the status of a flight on united.com, (ii) finding real estate listings fitting specific
criteria, and (iii) querying the local library to see if a particular book is available.
Each participant was asked the extent to which they agreed with several statements on a
Likert scale after completing the tasks (Figure 6.5). In general, participants were enthusiastic about TrailBlazer, leading one to say “this is exactly what most blind users would like.”
One participant said TrailBlazer was a “very good concept, especially for the work setting
where the scenarios and templates are already there.” Another participant who trains people on screen reader use thought that it would be a good way to gradually introduce the
concept of using a screen reader to a new computer user for whom the complexity of web
sites and the numerous shortcuts available can be overwhelming. Participants uniformly
agreed that despite their experience using screen readers, “finding a little kernel of information can be really time-consuming on a complex web page” and that “sometimes there
is too much content to just use headings and links to navigate.”
Participants wondered if TrailBlazer could help them with dynamic web content, which
often is added to the DOM of web pages far from where it appears visually, making it
difficult to find. Screen readers can also have trouble presenting dynamically-created content
to users. TrailBlazer could not only direct users to content automatically, avoiding a long
linear search, but also help them interact with it.
Despite their enthusiasm for using TrailBlazer for tasks that were already defined, they
questioned how useful it would be if they had to rely on others to provide the scripts for
them to use. One participant even questioned the usefulness of scripts created for a task
that he wanted to complete because “designers and users do not always agree on what is
important.” TrailBlazer did not support recording new tasks at the time of the evaluation,
although new CoScripts could be created by sighted users using CoScripter.
Participants also had several suggestions on how to improve the interface. TrailBlazer
guides users from one step to the next by dynamically modifying the page, but screen
readers do not always update their external models of the pages that they read from. To fix
this, users would need to occasionally refresh the model of the screen reader, which many
thought could be confusing to novice users. Other systems that improve non-visual access
have similar limitations [49], and these problems are being addressed in upcoming versions
of screen readers.
6.6
Dynamic Script Generation
TrailBlazer can suggest actions that users may want to take even when no pre-existing
script is available for their current task. These suggestions are based on a short task description provided by the user and an existing repository of how-to knowledge. Suggestions
are presented to users as options, which they can quickly jump to when correct but also
easily ignore. Collectively, these suggestions help users dynamically create a new script, potentially increasing efficiency even the first time they complete a task.
Statements rated by participants from Disagree (1) to Agree (5):
1. Completing tasks that are new to me is easy on most web sites.
2. Finding relevant content on web pages can be challenging.
3. TrailBlazer makes completing tasks easier.
4. TrailBlazer makes completing tasks faster.
5. TrailBlazer made it easier to find content on web pages.
6. I want to use TrailBlazer in the future.
7. I would be more likely to use TrailBlazer if more scripts were available.
Figure 6.5: Participant responses to Likert scale questions indicating that they think completing new tasks and finding content is difficult (1, 2), think TrailBlazer can help them
complete tasks more quickly and more easily (3, 4, 5), and want to use it in the future (6), especially
if scripts are available for more tasks (7).
6.6.1
Example Use Case
To inform its suggestions, TrailBlazer first asks users for a short textual description of the
task that they want to complete; it then provides appropriate suggestions to help them
complete that task. As an example, consider Jane, a blind web user who wants to look
up the status of her friend’s flight on Air Canada. She first provides TrailBlazer with the
following description of her task: “flight status on Air Canada.” The CoScripter repository
does not contain a script for finding the status of a flight on Air Canada, but it does contain
scripts for finding the status of flights on Delta and United.
After some pre-processing of the request, TrailBlazer conducts a web search using the
task description to find likely web sites on which to complete it. “goto aircanada.com”
is its first suggested step, and Jane chooses to follow that suggestion. If an appropriate
suggestion was not listed, then Jane could have chosen to visit a different web site or even
searched the web for the appropriate web site herself (perhaps using TrailBlazer to guide
her search). TrailBlazer automatically loads aircanada.com and then presents Jane with the
following three suggestions: “click the ‘flight status’ button,” “click the ‘flight’ button,” and
“fill out the ‘search the site’ textbox.” Jane chooses the first, and TrailBlazer completes it
automatically. Importantly, TrailBlazer only suggests actions that are possible to complete
on the current web page. Jane uses this interface to complete the entire task without needing
to search within the page for any of the necessary page elements.

[Figure 6.6 plot, “Action Types at Each Step”: proportion (0 to 1) of each action type
(click button, click link, enter, goto, select, turnon) at each step number from 1 to 10.]

Figure 6.6: Proportion of action types at each step number for scripts in the CoScripter
repository. These scripts were contributed by current users of CoScripter. The action types
represented include actions recognized by CoScripter which appeared in at least one script
as of October 2008.
A pre-existing script for the described task is not required for TrailBlazer to accurately
suggest appropriate actions. TrailBlazer can in effect apply scripts describing tasks (or
subtasks) on one web site on other web sites. It can, for example, use a script for buying a
book at Amazon to buy a book at Barnes and Noble, a script for booking a trip on Amtrak
to help book a trip on the United Kingdom’s National Rail Line, or a script for checking
the status of a package being delivered by UPS to help check on one being delivered by
Federal Express. Subtasks contained within scripts can also be applied by TrailBlazer in
different domains. For example, the sequence of steps in a script on a shopping site that
helps users enter their contact information can be applied during the registration process on
an employment site. If a script already exists for a user’s entire task, then the suggestions
they receive can follow that script without the user having to conduct a search for that
specific script in advance.
6.6.2
Suggestion Types
The CoScripter language provides a fixed set of action types (Figure 6.6). Most CoScripts
begin with a “goto” command that directs users to a specific web page. Next, users are led
through interaction with a number of links and form controls. Although not included in the
scripts in the CoScripter repository, the final action implicitly defined in most CoScripts is to
read the information that resulted from completion of the previous steps, which corresponds
to the “clip” command added by TrailBlazer.
The creation of suggestions in TrailBlazer is divided into the following three corresponding components:
• Goto Component - TrailBlazer converts a user’s task description to keywords, and
then searches the web using those keywords to find appropriate starting sites.
• General Suggestion Component - TrailBlazer combines a user’s task description,
scripts in an existing repository, and the history of the user’s actions to suggest the
next action that the user should take.
• Automatic Clipping Component - TrailBlazer uses the textual history of user
actions represented as CoScripter commands to find the area on the page that is most
likely relevant to the user at this point using an algorithm inspired by CSurf [84].
Finding the relevant region to read is equivalent to an automatic clip of content.
The following sections describe the components used by TrailBlazer to choose suggestions.
6.6.3
Goto Component
As shown in Figure 6.6, most scripts begin with a goto command. Accordingly, TrailBlazer
offers goto commands as suggestions when calculating the first step for the user to take.
The goto component uses the task description input by the user to suggest web sites on
which the task could be completed.
Forming goto suggestions consists of the following three steps: (i) determining which
words in the task description are most likely to describe the web site on which the task
is to be completed, (ii) searching the web using these most promising keywords, and (iii)
presenting the top results to users. A part-of-speech tagger first isolates the URLs, proper
nouns, and words that follow prepositions (e.g., “United” from the phrase “on United”) in
the task description. TrailBlazer proceeds in a series of rounds, querying with keywords in
the order described until it gathers at least 5 unique URLs. These URLs are then offered
as suggestions.
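A minimal sketch of this round-based querying appears below. The searchWeb function stands in for whatever web-search API is actually used and is an assumption of ours, as is the exact grouping of keywords into rounds:

    // Query with the most promising keywords first (URLs, then proper nouns,
    // then words following prepositions), widening until at least 5 unique
    // URLs have been gathered. searchWeb is a hypothetical search API.
    declare function searchWeb(query: string): Promise<string[]>;

    async function suggestGotoSites(keywordRounds: string[][]): Promise<string[]> {
      const unique = new Set<string>();
      for (const round of keywordRounds) {
        if (round.length === 0) continue;
        for (const url of await searchWeb(round.join(" "))) unique.add(url);
        if (unique.size >= 5) break; // enough candidate sites to present
      }
      return [...unique].slice(0, 5).map((url) => `goto ${url}`);
    }

For the task description “flight status on Air Canada,” the rounds might be [["Air Canada"], ["flight", "status"]], yielding “goto aircanada.com” among the first suggestions.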
The success of the goto component is highly dependent on the task description provided
by the user and on the popularity of the site on which it should be completed. The authors
have observed this component to work well on many real-world tasks, but future work will
test and improve its performance.
6.6.4
General Suggestion Component
The main suggestion component described in this chapter is the general suggestion component, which suggests specific actions for users to complete on the current page. These
suggestions are presented as natural language steps in the CoScripter language and are
chosen from all the actions possible to complete on the current page. TrailBlazer ranks suggestions based on the user’s task description, knowledge mined from the CoScripter script
repository, and the history of actions that the user has already completed.
Suggestions are first assigned a probability by a Naive Bayes classifier and then ranked
according to those probabilities. Naive Bayes is a simple but powerful supervised learning method that,
after training on labeled examples, can assign probability estimates to new examples. Although the probabilities assigned are only estimates, they are known to be useful for ranking
[76]. The model is trained on tasks that were previously demonstrated using either TrailBlazer or CoScripter, which are contained within the CoScripter script repository.
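The ranking step itself is simple once the model produces estimates. The sketch below assumes a trained model behind a hypothetical estimateProbability function; it illustrates the ranking, not TrailBlazer's actual code:

    // Rank candidate actions by the classifier's probability estimates and
    // keep the top k. The feature vector follows Figure 6.7; the names here
    // are placeholders.
    interface Candidate { step: string; features: number[]; }
    declare function estimateProbability(features: number[]): number;

    function rankSuggestions(candidates: Candidate[], k = 5): string[] {
      return candidates
        .map((c) => ({ step: c.step, p: estimateProbability(c.features) }))
        .sort((a, b) => b.p - a.p) // estimates need only be good for ranking
        .slice(0, k)
        .map((c) => c.step);
    }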
The knowledge represented by the features used in the model could also have been
expressed as static rules for the system to follow. TrailBlazer’s built-in machine learning
model enables it to continually improve as it is used. Because tasks that users complete
using TrailBlazer implicitly describe new scripts, the features based on the script repository
should become more informative over time as more scripts are added.

Figure 6.7: The features calculated and used by TrailBlazer in order to rank potential action
suggestions, along with the three sources from which they are formed (the task description,
the script repository, and the user's history): 1. Task Description Similarity, 2. Task Script
Similarity, 3. Prior Action Script Similarity, 4. Likelihood Action Pair, 5. Same Form as
Prior Action, 6. Button First Form Action.
6.6.5
Features Used in Making Suggestions
In order to accurately rank potential actions, TrailBlazer relies on a number of informative, automatically-derived features (Figure 6.7). The remainder of this section explains
the motivation behind the features found to be informative and describes how they are
computed.
Leveraging Action History
TrailBlazer includes several features that leverage its record of actions that it has observed
the user perform. Two features capture how the user’s prior actions relate to their interaction with forms (Figure 6.7-5,6). Intuitively, when using a form containing more than
one element, interacting with one increases the chance that the user will interact with another
in the same form. The Same Form as Prior Action feature expresses whether the action
under consideration refers to an element in a form for which an action has previously been
completed. Next, although it can occur in other situations, pressing a button in a form
usually occurs after acting on another element in a form. The Button First Form Action
feature captures whether the potential action is a button press in a form in which no other
elements have been acted upon.
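A sketch of how these two boolean features might be computed follows, assuming each candidate and historical action records its target's enclosing form; the interfaces are ours, not TrailBlazer's:

    // Two form-history features (Figure 6.7-5,6).
    interface PageAction { form: HTMLFormElement | null; isButtonPress: boolean; }

    // True when the candidate acts on a form the user has already acted on.
    function sameFormAsPriorAction(candidate: PageAction, history: PageAction[]): boolean {
      return candidate.form !== null &&
        history.some((a) => a.form === candidate.form);
    }

    // True when the candidate presses a button in a form in which no other
    // element has been acted upon, usually an unlikely next step.
    function buttonFirstFormAction(candidate: PageAction, history: PageAction[]): boolean {
      return candidate.isButtonPress && candidate.form !== null &&
        !history.some((a) => a.form === candidate.form);
    }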
Similarity to Task Description
The Task Description Similarity feature enables TrailBlazer to weight steps similar to the
task description provided by the user more highly (Figure 6.7-1). Similarity is quantified
by calculating the vector-cosine between the words in the task description and the words
in each potential suggestion. The word-vector cosine metric considers each set of words as
a vector in which each dimension corresponds to a different term, and in which each
component is set to the frequency with which that word has been observed. For this
calculation, words on a stopword list are removed. The similarity between the task
description word vector v_d and the potential suggestion word vector v_s is calculated
as follows.
VC(v_d, v_s) = (v_d · v_s) / (||v_d|| ||v_s||)        (6.1)
The word-vector cosine is often used in information retrieval settings to compare documents with keyword queries [10].
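A direct implementation of Equation 6.1 is compact. In the sketch below, the tokenization and the deliberately tiny stopword list are illustrative choices of ours:

    // Word-vector cosine (Equation 6.1) with simple stopword filtering.
    const STOPWORDS = new Set(["the", "a", "an", "on", "of", "to", "in"]);

    function wordVector(text: string): Map<string, number> {
      const v = new Map<string, number>();
      for (const w of text.toLowerCase().match(/[a-z0-9']+/g) ?? []) {
        if (!STOPWORDS.has(w)) v.set(w, (v.get(w) ?? 0) + 1);
      }
      return v;
    }

    function vectorCosine(a: Map<string, number>, b: Map<string, number>): number {
      let dot = 0;
      for (const [word, freq] of a) dot += freq * (b.get(word) ?? 0);
      const norm = (v: Map<string, number>) =>
        Math.sqrt([...v.values()].reduce((s, f) => s + f * f, 0));
      const denom = norm(a) * norm(b);
      return denom === 0 ? 0 : dot / denom;
    }

For example, vectorCosine(wordVector("flight status on Air Canada"), wordVector("enter flight number")) scores the overlap on "flight" while ignoring the stopword "on."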
Using the Existing Script Repository
TrailBlazer uses the record of user actions combined with the scripts contained within the
script repository to directly influence which suggestions are chosen. The CoScripter repository contains nearly 1000 human-created scripts describing the steps to complete a diversity
of tasks on the web. These scripts contain not only the specific steps required to complete
a given web-based task, but also more general knowledge about web tasks. Features used
for ranking suggestions built from the repository are based on (i) statistics of the actions
and the sequence in which they appear, and (ii) matching suggestions to relevant scripts
already in the repository. These features represent the knowledge contained within existing
scripts, enabling TrailBlazer to apply that knowledge to tasks for which no script exists.
Some action types are more likely than others according to how many actions the user
has completed (Figure 6.6). For instance, clicking a link is more likely near the beginning
of a task than near the end. In addition, some actions are more likely to follow actions of
particular types. For instance, clicking a button is more likely to follow entering text than
it is clicking a link because buttons are usually pressed after entering information into a
form. Following this motivation, the Likelihood Action Pair feature used by TrailBlazer is
the likelihood of each action given the actions that the user completed before (Figure 6.7-4).
This likelihood is computed through consideration of all scripts in the repository.
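One simple way to realize this feature is to estimate bigram probabilities over action types from the repository, as sketched below; the add-one smoothing is our assumption, since the exact estimator is not specified:

    // Estimate P(next action type | previous action type) from bigram counts
    // over the action types appearing in repository scripts.
    type ActionType = "goto" | "click link" | "click button" | "enter" | "select" | "turnon";

    function pairLikelihood(
        scripts: ActionType[][], prev: ActionType, next: ActionType): number {
      let pairCount = 0;
      let prevCount = 0;
      const types = new Set<ActionType>();
      for (const script of scripts) {
        script.forEach((action, i) => {
          types.add(action);
          if (i > 0 && script[i - 1] === prev) {
            prevCount++;
            if (action === next) pairCount++;
          }
        });
      }
      return (pairCount + 1) / (prevCount + types.size); // add-one smoothing
    }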
Leveraging Existing Scripts for Related Tasks
TrailBlazer also directly uses related scripts already in the repository to help form its suggestions. TrailBlazer retrieves two sets of related scripts and computes a separate feature
for each. First, TrailBlazer uses the task description provided by the user as a query to the
repository, retrieving scripts that are related to the user’s task. For instance, if the user’s
task description was “Flight status on Air Canada,” matches on the words “flight” and
“status” will enable the system to retrieve scripts for finding the flight status on “United”
and “Delta.” The procedure for checking flight status on both of these sites is different
than it is on Air Canada, but certain steps, like entering information into a textbox with
a label containing the words “Flight Number,” are repeated on all three. The Task Script
Similarity feature captures this information (Figure 6.7-2).
The second set of scripts that TrailBlazer retrieves are found using the last action that
the user completed. These scripts may contain subtasks that do not relate to the user’s
stated goal but can still be predictive of the next action to be completed. For instance, if
the user just entered their username into a textbox with the label “username,” many scripts
will be retrieved suggesting that a good next action would be to enter their “password” into
a password box. The Prior Action Script Similarity feature enables TrailBlazer to respond
to relevant sub-tasks (Figure 6.7-3).
The motivation for the Task Script Similarity and Prior Action Script Similarity features is that
if TrailBlazer can find steps in existing scripts similar to either the task description or an
action previously completed by the user, then subsequent steps in that script should be
predictive of future actions. The scores assigned to each step are, therefore, fed forward to
other script steps so that they are weighted more highly. All tasks implicitly start with a
goto step specifying the page on which the user first requests suggestions, so a prior action
always exists. The process used is similar to spreading activation, which is a method used
to connect semantically-related elements represented in a tree structure [33]. The added
value from a prior step decreases exponentially for each subsequent step, meaning that steps
following close after highly-weighted steps primarily benefit.
To compute these features, TrailBlazer first finds a set of related scripts S by sending
either the task description or the user’s prior action as a query to the CoScripter repository.
TrailBlazer then derives a weight for each of the steps contained in each related script. Each
script s contains a sequential list of natural language steps (Figure 6.1). The weight of each
script’s first step is set to VC(s_0, query), the vector-cosine between the first step and the
query as described earlier. TrailBlazer computes the weight of each subsequent step, as
follows:
W(s_i) = w · W(s_{i-1}) + VC(s_i, query)        (6.2)
TrailBlazer currently uses w = 0.3, which has worked well in practice. The fractional
inclusion of the weight of prior steps serves to feed their weight forward to later steps.
Next, TrailBlazer constructs a weighted sentence sent_S of all the words contained within
S. The weight of each word is set to the sum of the computed weights W(s_i) of the steps
in which that word is contained. The final feature value is the word-vector cosine between
vectors formed from the words in sent_S and the query. Importantly, although the features
constructed in this way do not explicitly consider action types, the labels assigned to page
elements, or the types of page elements, all are implicitly included because they are included
in the natural language CoScripter steps.
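The sketch below implements Equation 6.2 and the weighted-sentence feature it feeds, reusing the wordVector and vectorCosine helpers sketched earlier; the function names are ours:

    // Feed-forward step weighting (Equation 6.2) with w = 0.3.
    function weightSteps(steps: string[], query: string, w = 0.3): number[] {
      const qv = wordVector(query);
      const weights: number[] = [];
      steps.forEach((step, i) => {
        const sim = vectorCosine(wordVector(step), qv); // VC(s_i, query)
        weights.push(i === 0 ? sim : w * weights[i - 1] + sim);
      });
      return weights;
    }

    // Weighted sentence over all steps: a word's weight is the sum of W(s_i)
    // over the steps containing it; the feature is its cosine with the query.
    function scriptSimilarityFeature(steps: string[], query: string): number {
      const weights = weightSteps(steps, query);
      const weighted = new Map<string, number>();
      steps.forEach((step, i) => {
        for (const word of wordVector(step).keys()) {
          weighted.set(word, (weighted.get(word) ?? 0) + weights[i]);
        }
      });
      return vectorCosine(weighted, wordVector(query));
    }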
6.6.6
Presenting Suggestions to Users
Once the values of all of the features are computed and all potential actions are ranked,
the most highly-ranked actions are presented to the user as suggestions. The suggestions
are integrated into the accessible guide interface outlined earlier. TrailBlazer provides five
suggestions, displayed in the interface in rank order (Figure 6.8).
Figure 6.8: Suggestions are presented to users within the page context, inserted into the
DOM of the web page following the last element with which they interacted. In this case, the
user has just entered “105” into the “Flight Number” textbox and TrailBlazer recommends
clicking on the “Check” button as its first suggestion.
The suggestions are inserted into the DOM immediately following the target of the prior
command, making them appear to non-visual users to come immediately after the step
that they just completed. This continues the convenient non-visual interface design used in
TrailBlazer for script play back. Users are directed to the suggestions just as they would be
directed to the next action in a pre-existing script. Just as with pre-defined actions, users
can choose to review the suggestions or choose to skip past them if they prefer, representing
a hallmark of mixed-initiative design [59]. Because the suggestions are contained within a
single listbox, moving past them requires only one keystroke.
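A minimal sketch of this insertion pattern follows; the markup details (a focused select element rendered as a listbox) are illustrative rather than TrailBlazer's exact markup:

    // Insert ranked suggestions immediately after the target of the user's
    // last action, so a screen reader encounters them right after that step.
    function presentSuggestions(lastTarget: Element, suggestions: string[]): void {
      const box = document.createElement("select");
      box.size = suggestions.length; // render as a listbox, not a dropdown
      box.setAttribute("aria-label", "TrailBlazer suggestions");
      for (const s of suggestions) {
        const option = document.createElement("option");
        option.textContent = s;
        box.appendChild(option);
      }
      lastTarget.insertAdjacentElement("afterend", box);
      box.focus(); // direct the user's reading cursor to the suggestions
    }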
Future user studies will seek to answer questions about how to best present suggestions
to users, how many suggestions should be presented, and how the system’s confidence in its
suggestions might be conveyed by the user interface.
6.7
Evaluation of Suggestions
We evaluated TrailBlazer by testing its ability to accurately suggest the correct next action
while being used to complete 15 tasks. The chosen tasks represented the 15 most popular
scripts in the CoScripter repository according to the number of people who have run them.
The scripts contained a total of 102 steps, with an average of 6.8 steps per script (SD=3.1).
None of the scripts included in the test set were included when training the model.
Using existing scripts to test TrailBlazer provided two advantages. The first was that
the scripts represented a natural ground truth to which we could compare TrailBlazer’s
suggestions, and the second was that each provided a short title that we could use as the
user’s description for purposes of testing. The provided titles were relatively short, averaging
5.1 words per title. Because these titles were themselves written by users, we believe it is
reasonable to assume that users of TrailBlazer could provide similar task descriptions.
On the 15 tasks in this study, TrailBlazer listed the correct next action as its top suggestion in 41.4% of cases and within the top 5 suggestions in 75.9% of cases (Figure 6.9).
Predicting the next action correctly can dramatically reduce the number of elements that
users need to consider when completing tasks on the web. The average number of possible
actions per step was 41.8 (SD=37.9), meaning that choosing the correct action by chance
has a probability of only 2.3%. TrailBlazer’s suggestions could help users avoid a long,
linear search over these possibilities.
6.7.1
Discussion
TrailBlazer was able to suggest the correct next action among its top 5 suggestions in
75.9% of cases. The current interface enables users to review these 5 choices quickly, so
that in these cases users will not need to search the entire page in order to complete the
action; TrailBlazer will lead them right there. Furthermore, in the 24.1% of cases in which
TrailBlazer did not make an accurate suggestion, users can continue completing their tasks
as they would have without TrailBlazer. Future studies will look at the effect on users of
incorrect suggestions and how we might mitigate these problems.
[Figure 6.9 plot, “Suggestion Performance”: fraction of tasks (0.00 to 1.00) for which the
correct action appeared among the top suggestions, versus number of suggestions provided
(1 to 10).]
Figure 6.9: The fraction of the time that the correct action appeared among the top suggestions provided by TrailBlazer for varying numbers of suggestions. The correct suggestion
was listed first in 41.4% of cases and within the top 5 in 75.9% of cases.
TrailBlazer had difficulties making correct suggestions when the steps of the procedure
seemed arbitrary. For instance, the script for changing one’s Emergency Contact information
begins by clicking through links titled “Career and Life” and “About me - personal” on pages
on which nearly 150 different actions could be made. Because no similar scripts existed,
no features indicated that it should be selected. Later, the script included the step, “click
the “Emergency contacts” link,” which the system recommended as its first choice. These
problems illustrate the importance of having scripts in the first place, and are indicative of the
improvements possible as additional scripts and other knowledge are incorporated.
Fortunately, TrailBlazer is able to quickly recover upon making an error. For instance, if
it does not suggest an action involving a particular form and the user completes an action on
the form anyway, TrailBlazer is likely to recover on the next suggestion. This is particularly
true with form elements because of the features created specifically for this case (features
5 and 6 in Figure 6.7), but TrailBlazer is often able to recover in more general cases.
TrailBlazer benefits greatly from its design as a guide for a human who can occasionally
correct its suggestions, and its users benefit in turn from the efficiency gains when it
is correct.
6.8
Summary and Future Directions
This chapter introduced TrailBlazer, an accessible interface to how-to knowledge that helps
blind users complete web-based tasks faster and more effectively by guiding them through
a task step-by-step. By directing the user’s attention to the right places on the page and
by providing accessible shortcut keys, TrailBlazer enables users to follow existing how-to
instructions quickly and easily. A formative evaluation of the system revealed that users
were positive about the system, but that the lack of how-to scripts could be a barrier to
use. We extended the TrailBlazer system to dynamically suggest possible next steps based
on a short description of the desired task, the user’s previous behavior, and a repository of
existing scripts. The user’s next step was contained within the top 5 suggestions 75.9% of
the time, showing that TrailBlazer is successfully able to guide users through new tasks.
TrailBlazer can accurately predict the next step that users are likely to want to perform
based on an initial action and provides an accessible interface enabling blind users to leverage
those suggestions. A next step will be user studies with blind web users to observe how
they use the system and to discover improvements to the suggestion interface. Interfaces
for non-visual access could benefit from moving toward task-level
assistance of the kind exposed by TrailBlazer. Current interfaces too often focus on either
low-level annotations or deciding beforehand what users will want to do, taking away their
control.
The need to inform potential users of TrailBlazer and to encourage them to install it can limit
its potential impact. In the next chapter (Chapter 7), we discuss WebAnywhere, which
not only improves the availability of web access but also the availability of new tools such
as TrailBlazer. Chapter 8 discusses in more depth the implications of the WebAnywhere
delivery model, which promises to make it easier for users to gain access to tools like
TrailBlazer.
Chapter 7
IMPROVING THE AVAILABILITY OF WEB ACCESS WITH
WEBANYWHERE
WebAnywhere is a web application that enables blind web users to access the web on
any computer that is available. WebAnywhere requires no new software to be installed and,
as a web application, users always have the most recent version when they visit the site.
WebAnywhere is extensible and open source, enabling researchers to easily try out
new ideas. The infrastructure supports remote user studies, so researchers can also use it
as part of empirical evaluations.
This chapter first motivates WebAnywhere and describes its architecture. An evaluation
of WebAnywhere that shows blind participants can use it to perform common tasks is
presented, followed by an automated performance study under different conditions to explore
the limitations of this approach.
7.1
Introduction & Motivation
WebAnywhere is a web-based screen reader that can be used by blind individuals to access
the web from any computer with web access and audio output. Other screen readers are
expensive, costing nearly a thousand dollars for each installation because of their complexity,
relatively small market and high support costs. Development of these programs is complex
because the interface with each supported program must be deciphered independently. As
a result of their expense and a general lack of awareness of them, screen readers are not
installed on most computers, leaving blind users on the go unable to access the web from
computers that they happen to have access to, and leaving the many blind users who cannot
afford a screen reader unable to access the web at all.
Even blind web users who have a screen reader installed at home or at work cannot
access the web everywhere that sighted people can. From terminals in public libraries to
the local gym, from coffee shops to pay-per-use computers at the airport, from a friend’s
laptop to a school laboratory (Figure 7.1); computers are used for a variety of useful tasks
that most of us take for granted, such as checking our email, viewing the bus schedule,
or finding a restaurant. Few would argue that web mail or document editors have surpassed
their desktop analogs in ease of use, but their popularity is increasing, indicating the rising
importance of accessing the web from wherever someone happens to be.

Figure 7.1: People often use computers besides their own, such as computers in university
labs, library kiosks, or friends’ laptops.
The cost of screen readers is one problem that leads to screen readers not being installed
where people need them. Libraries are known to provide an invaluable connection to web
information for those with low incomes [13], and, while almost all libraries in the United
States today provide Internet access, many do not provide screen readers because of their
expense. Even when they do, they are provided on a limited number of computers. Most
blind people with low incomes cannot afford a computer and a screen reader of their own,
and governmental assistance in receiving one (in countries where such programs exist) is
often tied to a demonstrated need because of school or employment, as it is in the
United States. Unemployed blind people often miss out on such services even though they
could potentially benefit the most. In countries where these programs do not exist, the
problem may be worse.
For blind users unable to afford a full screen reader, WebAnywhere might serve as a
temporary alternative. Voice output while navigating through a web page can also be
beneficial for people who have low vision or dyslexia. Web developers have been shown to
produce more accessible content when they have access to a screen reader [85], but may be
deterred from using one by the expense and the hassle of installing one. All of these users
could use WebAnywhere for free from anywhere.

[Figure 7.2 diagram: the WebAnywhere browser frame replicates browser functionality and
provides a screen-reading interface to both web content and browser functions; the content
frame loads web content via a proxy server, and the browser frame speaks the content
loaded there.]

Figure 7.2: WebAnywhere is a self-voicing web browser inside a web browser.
7.1.1
Using WebAnywhere
When users open WebAnywhere’s homepage, it speaks the contents of the page that is
currently loaded (initially a welcome page). WebAnywhere voices both its interface and the
content of the current web page. Users can navigate from this starting page to any other
web page and the WebAnywhere interface will speak those pages to the user as well. No
separate software needs to be downloaded or run by the user; the system runs entirely as a
web application with minimal permissions.
WebAnywhere replicates much of the functionality of existing screen readers for enabling
interaction with the web. WebAnywhere traverses the DOM of each loaded web page using
a pre-order Depth First Search (DFS), which is approximately top-to-bottom, left-to-right
in the visual page. As the system’s cursor reaches each element of the DOM, it speaks the
element’s textual representation to the user (Figure 7.2). For instance, upon reaching a link
containing the text “Google,” it will speak “Link Google.” Users can follow the link by
pressing enter. They can also skip forward or backward in this content by sentence, word
or character. Users can also skip through page content by headings, input elements, links,
paragraphs and tables. In input fields, the system speaks the characters typed by the user
and allows them to review what they have typed.
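The reading order just described can be expressed as a small generator over the DOM. The sketch below is a simplification: only links and headings receive special phrases here, and the phrase templates are stand-ins for WebAnywhere's real rules:

    // Pre-order depth-first traversal, yielding a spoken phrase per element.
    function* readingOrder(node: Node): Generator<string> {
      if (node.nodeType === Node.ELEMENT_NODE) {
        const el = node as HTMLElement;
        const text = el.textContent?.trim() ?? "";
        if (el.tagName === "A") { yield `Link ${text}`; return; }
        if (/^H[1-6]$/.test(el.tagName)) {
          yield `Heading ${el.tagName[1]} ${text}`;
          return;
        }
      } else if (node.nodeType === Node.TEXT_NODE) {
        const t = node.textContent?.trim();
        if (t) yield t;
        return;
      }
      for (const child of Array.from(node.childNodes)) yield* readingOrder(child);
    }

On reaching a link containing “Google,” this traversal yields “Link Google,” matching the behavior described above.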
As a web application running with minimal permissions, WebAnywhere does not have
access to the interface of the browser. It instead replicates functionality that the browser normally provides, such as its own address bar, where the URL of the current page is displayed
and users can type to navigate to a new page (Figure 7.2).
We used the following design goals while developing WebAnywhere in order to ensure
that it would be accessible and usable on most computers:
1. WebAnywhere should replicate the functionality and security provided by traditional
screen readers when used to browse the web.
2. In order to work on as many web-enabled devices and public computer terminals as
possible, WebAnywhere should not require special software or permissions on the host
machine in order to run.
3. WebAnywhere should be usable: users should not be substantially affected by, or even
aware of, the engineering decisions made in order to satisfy the first two goals.
7.2
Related Work
Section 2.3 presents an overview of different systems impacting the availability of access for
blind people. Prior solutions have all required users to either have permission to run or
install software on their machine or carry around a specialized device.
Serotek provides the products most closely related to WebAnywhere. The Serotek
System Access Mobile (SAM) [119] screen reader is designed to run directly from
a USB key on Windows computers without prior installation. It is available for $500 US
and still requires access to a USB port and permission to run arbitrary executables. The
Serotek System Access to Go (SA-to-Go) screen reader can be downloaded from the web
via a speech-enabled web page, but the program requires Windows, Internet Explorer, and
permission to run executables on the computer. This product has recently been made
freely available by the AIR Foundation [2], who also provide a self-voicing Flash interface
for downloading and running the screen reader. Starting this system requires downloading
more than 10 MB of data, compared to WebAnywhere’s 100 kB. WebAnywhere, therefore,
may be more appropriate for quickly checking email or looking up information, even on
systems where SA-to-Go is able to run.
A small number of web pages voice their own content, but the scope of information
or access is limited. Talklets enable web developers to provide a spoken version of their
web pages as a single file that users cannot control [123]. Scribd.com provides an unvoiced
interface for converting documents to speech [117], but the speech is available only as a
single MP3 file that does not support interactive navigation. The National Association for
the Blind, India, provides access to a portion of their web content via a self-voicing Flash
Movie that provides keyboard commands for navigation [92], but the information contained
in the separate Flash movie does not cover the entire web site. WebAnywhere, in contrast,
does not limit which web information can be accessed.
7.3
Public Computer Terminals
The potential of WebAnywhere to provide access everywhere is limited by the capabilities
of public computer terminals. In the United States, nearly all public libraries provide Internet access to their patrons, and 63% of these libraries provide high speed connections [13].
As of 2006, 22.4% of libraries specifically provided Internet-based video content and 32.8%
provided Internet-based audio content [13]. Other public terminals likely have different
capabilities. The aspects of public terminals explored here will also determine the ability of other web applications to enable self-voicing technology, and this survey is, therefore, generally
applicable beyond WebAnywhere.
We surveyed 15 public computer terminals available in Seattle, WA, to get an idea of
their technical capabilities and the environments in which they are located. We visited
computers in the following locations: 5 libraries, 3 Internet cafes, 3 university & community
college labs, 2 business centers, a gym and a retirement community. The computers were
all managed by different groups. For instance, we visited only one Seattle public library.
Although we considered only public terminals located in Seattle, we found public terminals
with diverse capabilities, suggesting that our results may generalize. For instance, while
most ran Windows XP as their operating system, two ran Macintosh OS X, and one ran
Linux. The terminals were designed for different use cases. Several assumed users would
stand while accessing them, one was used primarily as a print station, and many appeared
to be several years old.

Public Terminal Survey
Category           | WebAnywhere Usable | Screen Reader Installed | Help Available | Flash Installed | Other Sound Player | Sound Initially Played | Sound Audible | Headphone Jack | Headphones | Speakers
Internet Cafes (3) |  3 | 0 |  3 |  3 |  3 |  3 |  3 |  2 | 1 | 1
Kiosks (3)         |  2 | 0 |  2 |  2 |  3 |  2 |  2 |  3 | 0 | 2
Libraries (5)      |  5 | 0 |  5 |  5 |  4 |  3 |  3 |  5 | 1 | 1
Other (4)          |  4 | 0 |  2 |  4 |  3 |  4 |  4 |  3 | 0 | 1
All (15)           | 14 | 0 | 12 | 14 | 13 | 12 | 12 | 13 | 2 | 5

Figure 7.3: Survey of public computer terminals by category indicating that WebAnywhere
can run on most of them.
Figure 7.3 summarizes the results of this survey. Most computers tested were capable
of running WebAnywhere, and in 12 of the locations, someone was available in an official
capacity to assist users. In all locations, people (including employees) were nearby and could
have also assisted users. The most notable restriction to access that we found was that only
2 locations (a library and an Internet cafe) provided headphones for users. However, in all
of the libraries that we visited, we were able to ask to use headphones. Bringing headphones
seems like a reasonable requirement given that headphones are inexpensive and many people
already carry them for listening to their music players and other devices.
In the 5 locations that did not provide headphones, speakers were available. Using speakers is not ideal because it renders a user’s private browsing public and could be bothersome
to others, but at least in these locations users could access the web without needing to
bring anything with them. One location was restricted to using the embedded sound
player, which suggests that while supporting it could help in some cases, access on most
computers could probably be achieved only using the Flash player to play sound. On 14
out of 15 systems, blind users could potentially access web content even though none of the
computers had a screen reader installed on them.
7.4
The WebAnywhere System
WebAnywhere is designed to function on any computer with web access and the ability to
play sound. Its design carefully avoids dependence on any particular web browser or plugin.
To facilitate its use on public systems on which users may not have permission to install
new software, functionality that would require such permission has been moved to a remote server.
WebAnywhere can play sound using several different sound players commonly available on
web browsers.
The system consists of the following three components (Figure 7.5): (i) client-side
Javascript, which supports user interaction, decides which sounds to play and interfaces
with sound players to play them, (ii) server-side text-to-speech generation and caching, and
(iii) a server-side transformation proxy that makes web pages appear to come from a local server, to avoid violating the same-origin security policy enforced by most web
browsers. WebAnywhere consists of less than 100 KB of data in four files, and that is all a
user needs to download to begin using the system.
7.4.1
WebAnywhere Script
The client-side portion of WebAnywhere forms a self-voicing web browser that can be run inside an existing web browser (Figure 7.4). The system is written in cross-browser Javascript
that is downloaded from the server, allowing it to be run in most modern web browsers, including Firefox, Internet Explorer and Safari. The system captures all key events, allowing
it to both provide a rich set of keyboard commands like what users are accustomed to in
their usual screen readers and to maintain control of the browser window. WebAnywhere’s
use of Javascript to capture a rich set of user interaction is similar to that of UsaProxy [9]
and Google Analytics [50], which are used to gather web usage statistics.
Web pages viewed in the system are loaded through a modified version of the web-based
proxy PHProxy [6]. This enables the system to bypass the same-origin policy that prevents
scripts from accessing content loaded from other domains. Without this step, the scripts
retrieved from WebAnywhere’s domain would not be able to traverse the DOM of pages
that are retrieved, for instance, from google.com. Deliberately bypassing the same-origin
policy can introduce security concerns, which we address in Section 7.9.

[Figure 7.4 diagram: the browser frame replicates browser functionality and provides the
screen-reading interface to both web content and browser functions; the content frame loads
web content via the proxy server and is voiced by the browser frame. A timeline shows a
user’s actions and the speech sounds each triggers: loading the page speaks “Page has loaded.
ICWE 2008. Welcome. Image ICWE 2008”; pressing TAB speaks “Link: Home.” and then
“Link: Open Calls.”; pressing CTRL+h speaks “Heading 1 Welcome” and then “Heading 2
Important Deadline Ahead.” Each phrase is a separate sound of 4.4 to 12.1 kB, played over
roughly two seconds.]

Figure 7.4: Browsing the ICWE 2008 homepage with the WebAnywhere self-voicing, web-browsing
web application. Users use the keyboard to interact with WebAnywhere like they
would with their own screen readers. Here, the user has pressed the TAB key to skip to
the next focusable element, and CTRL+h to skip to the next heading element. Both web
content and interfaces are voiced to enable blind web users access.
7.4.2
Producing and Playing Speech
Speech is produced on separate speech servers. Our current system uses the free Festival
Text-to-Speech System [125] because it is distributed along with the Fedora Linux
distribution on which the rest of the system runs. The sounds produced by the Festival
Text-to-Speech (TTS) system are converted server-side to the MP3 format because this
format can be played by most sound players already available in browsers and because it
creates the small files necessary for achieving our goal of low latency. For example, the
sound “Welcome to WebAnywhere,” played when the system loads, is approximately 10 kB,
while the single letter “t”, played when users type the letter, is 3 kB. See Figure 7.4 for
more examples of how the speech sounds in WebAnywhere are generated and used.

Figure 7.5: The WebAnywhere system consists of server-side components that convert text
to speech and proxy web content, and client-side components that provide the user interface,
coordinate what speech will be played, and play the speech. Users interact with the
system using the keyboard.
Sounds are cached both on the server, so they do not need to be generated again by the TTS
service, and on the client as part of the existing browser cache. For the most efficient caching,
WebAnywhere would generate a separate sound for each word, but this results in choppy-sounding speech. Another option would be to generate a single speech sound for an entire
web page, but this would prevent the system from providing its rich interface. Sound players
already installed in the browser do not support jumping to arbitrary places in a sound file
as would be required when a user decides that they want to skip to the middle of the page.
Instead, the system generates a separate sound for each phrase and the WebAnywhere script
coordinates which sound to play based on user input.
The system primarily uses the SoundManager 2 Flash Object [115] for playing sound.
This Flash object provides a Javascript bridge between the WebAnywhere Javascript code
and the Flash sound player. It provides an event-based API for sound playback that includes
an onfinish event that signals when a sound has finished playing. It is also able to begin
playing sounds before they have finished downloading using streaming, which results in
lower perceived latency. Adobe reports that version 8 or later of the Flash player, required
for Sound Manager 2, is installed on 98.5% of computers [3].
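A sketch of how per-phrase sounds can be chained with this event-based API follows. The queueing code and the TTS URL scheme are ours, though the onfinish callback is part of SoundManager 2's documented interface:

    // Play a queue of phrases, advancing when each sound finishes. The sound
    // URLs point at the server-side TTS service and are illustrative.
    declare const soundManager: {
      createSound(opts: { id: string; url: string; onfinish?: () => void }): { play(): void };
    };

    function playQueue(phrases: string[], ttsBase = "/tts?text="): void {
      let i = 0;
      const playNext = () => {
        if (i >= phrases.length) return;
        const sound = soundManager.createSound({
          id: `phrase-${i}`,
          url: ttsBase + encodeURIComponent(phrases[i++]),
          onfinish: playNext, // fires when playback completes
        });
        sound.play();
      };
      playNext();
    }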
To enable the system to operate on more computers, we have also developed our own
Javascript API for playing speech using embedded players, such as Quicktime or the Windows Media Player. The existing API for controlling embedded players is limited and makes
precise reaction to sounds that are being played difficult. While programmatic methods exist for playing and stopping audio files, they do not implement an onfinish event that
would provide a programmatic method for determining when a sound has finished playing.
WebAnywhere relies on this information to tell it when it should start playing the next
sound in the page when a user is reading through a page. We initially required users to
manually advance sounds, but this proved cumbersome and caused the system to act differently based on which sound player was being used. It also made it frustrating for users to
read a large section of a page continuously, as they have become accustomed to doing with
their usual screen readers.
To enable WebAnywhere to simulate an onfinish event for embedded sound players, the
TTS service includes a header in its HTTP responses that specifies the length of each speech
sound. Before programmatically embedding each sound file into the page, WebAnywhere
first issues an xmlHttpRequest for the file and records the length of the returned sound.
WebAnywhere then sets a timer for slightly longer than the length of the sound and finally
inserts an embedded player for the sound into the navigation frame. Because the sound
has already been retrieved via the programmatic request, it is located in the cache and the
embedded player can start playing it almost immediately; there is only a small delay for the
embedded player to load and begin playing the sound. The timer is used as an onfinish
event signaling that the sound has stopped playing and the next sound in the queue of
sounds to play should be played.
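The sketch below illustrates this simulated onfinish under our own assumptions: the response header carrying the sound's length is hypothetical, and the 250 ms of padding stands in for “slightly longer than the length of the sound”:

    // Fetch the sound first so it lands in the browser cache, read its length
    // from a (hypothetical) header, embed a player, and fire a timer as a
    // simulated onfinish event.
    function playWithTimer(url: string, container: HTMLElement, onFinish: () => void): void {
      const req = new XMLHttpRequest();
      req.open("GET", url, true);
      req.onload = () => {
        const ms = Number(req.getResponseHeader("X-Sound-Length-Ms") ?? "0");
        const embed = document.createElement("embed");
        embed.setAttribute("src", url); // served from cache, so playback starts quickly
        embed.setAttribute("autostart", "true");
        embed.setAttribute("hidden", "true");
        container.appendChild(embed);
        setTimeout(onFinish, ms + 250); // slightly longer than the sound itself
      };
      req.send();
    }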
7.5
User-Driven Design
WebAnywhere was designed in close consultation with blind users. These potential users
of WebAnywhere have offered valuable feedback on its design.
During preliminary evaluations with 3 blind web users, participants wanted to be able
to customize the shortcut keys used and other features of WebAnywhere in order to emulate
their usual screen readers (either JAWS or Window-Eyes) [21]. To support this, WebAnywhere now includes user-configurable components that specify which key combinations to
use for various functionality and the text that is read when specified actions are taken. Users
can also choose to emulate their preferred screen reader (either JAWS or Window-Eyes) and
their preferred browser (Internet Explorer or Firefox). For instance, Internet Explorer announces “Type the Internet address of a document or folder, and Internet Explorer will
open it for you” when users move to the location bar, while users will hear “Location Bar
Combo Box” if they are using Firefox. These settings can be changed using a web-based
form and are saved for each individual user.
To load their personal settings, users first press a shortcut key. The system asks them
to enter the name of their profile and press enter when they are finished. It then applies
the appropriate settings. Users can create an account and edit their personal settings
either using a screen reader to which they are already accustomed or by using the default
interface provided by WebAnywhere. Explanations of the available functionality and initial
keyboard shortcuts assigned to each are read when the system first loads. It is unclear to
what extent frequent users will want to use personalized settings, but the option to use
personalized settings may ease the transition to WebAnywhere from another screen reader.
In the future, personal profiles may also enable users to specify their preferred speech rate
and interface language.
When browsing the web, screen readers use an off-screen model of each web page that is
encountered, which results in the screen reader exposing two different complementary but
incomplete modes of operation to the user. Because WebAnywhere accesses the DOM of
the web page directly, it does not need a separate forms mode. This has the advantage of
immediately updating when content in the page dynamically changes. Traditional screen
readers must be manually toggled between a “forms mode” and a “browse mode” (names for these modes differ; these are the names used by JAWS) using
specific control key. Even though this is cumbersome and unnecessary in WebAnywhere, it
caused confusion for users that were accustomed to it, and so this behavior can be emulated
in WebAnywhere.
Our consultants felt that the main limitation of the system was its limited functionality
for skipping through page content relative to other screen readers. Users felt
that WebAnywhere most needed the following two features:
• Skipping functionality - Users have become accustomed to the rich array of shortcuts designed to enable users to skip through content. Our initial system only supported skipping by focusable element using the TAB key, but most screen readers
provide shortcuts to skip through web content by heading, by input element, by paragraph and by link.
• Find feature - Users wanted to be able to search for text in the page using the
familiar find functionality provided by most browsers, which has been shown valuable
for blind web users [35]. The existing find functionality in the web browser is not
accessible using WebAnywhere because, as a web application, WebAnywhere can only
access the elements in the web pages it loads.
We used these preliminary evaluations to direct the priorities for development of WebAnywhere, and the system now includes these features. Individuals also wanted a variety of
other functionality available in existing screen readers, such as the ability to spell out the
word that was just spoken, to speak user-defined pronunciations for words, and to specify
the speech rate. These can be implemented in future versions of WebAnywhere.
7.5.1
Reaching WebAnywhere
Before using WebAnywhere, blind users must first navigate to the WebAnywhere web page.
Blind users have proven adept at using computers to start a screen reader when one is not yet
running. For instance, some screen readers required users to log in to the operating system
without using the screen reader. Existing solutions, such as the Serotek System Access
Mobile [119], share this requirement and are still used. In most locations, blind users can
ask for assistance in navigating to the WebAnywhere page and then browse independently.
This issue is explored more in our survey of public terminals presented in Section 7.3.
Windows provides a limited voice interface that could be used to reach WebAnywhere.
Windows Narrator can be started using a standard shortcut key and can voice the run
dialog, enabling users to navigate to the WebAnywhere URL. Windows Narrator’s rudimentary
support for browsing the web is not sufficient for web access, but would enable users to open
the WebAnywhere web page.
Navigation Functionality
CTRL+l: Focus location text field.
CTRL+f: Focus finder text field.

Reading Granularity
Arrow-key combinations: Read next/previous element, word (with SHIFT), or character.

Skipping Functionality
TAB: Skip to next focusable element.
CTRL+h: Skip to next heading.
CTRL+i: Skip to next input element.
CTRL+r: Move to next row in table.
CTRL+d: Move to next column in table.

* Pressing SHIFT in combination with these keys will reverse skipping direction.
Figure 7.6: Selected shortcut functionality provided by WebAnywhere and the default keys
assigned to each. The system implements the functionality for more than 30 different
shortcut keys. Users can customize the keys assigned to each.
7.6
User Evaluation
In order to investigate the usability and perceived value of WebAnywhere, we conducted a
study with 8 blind users (4 female) ranging in age from 18 to 51. Participants represented a
diversity of screen reader users, including both students and professionals, involved in fields
ranging from the sciences and law to journalism and sales. Their self-described screen-reader expertise varied from beginner to expert. Similarly, their experience using methods
for accessing the web when away from their own computers using the methods described
in Section 2.3 varied considerably. Participants were compensated with $15 US for their
participation in our study, and none had participated in earlier stages of the development
of WebAnywhere.
Two of our participants were located remotely. Remote user studies can be particularly
appropriate for users with disabilities, for whom it may be difficult or costly to conduct in-person studies. Such studies have been shown to yield similar quantitative results, although
they risk collecting less-informative qualitative feedback [101]. We conducted interviews with
remote participants to gather valuable qualitative feedback.
We examined (i) the effects of technological differences between our remote screen reader
and a local one, and (ii) the likelihood that participants would use WebAnywhere in the
future. While we did not explicitly compare the screen reading interface with existing
screen readers, all of our participants had previously used a screen reader and many of their
comments were made with respect to this experience.
In this evaluation, participants were first introduced to the system and then asked to
browse the WebAnywhere homepage using it. They then independently completed the
following four tasks: searching Google to find the phone number of a local restaurant,
finding when the next bus will be arriving at a bus stop, checking a Gmail email account, and
completing a survey about their experience using WebAnywhere. Gmail.com and google.com
were frequently visited by blind participants in our WebinSitu study (Chapter 3), and
mybus.org is a popular site for checking the bus in Seattle, where most of our evaluators
live. The authors feel that these tasks are representative of those that a screen reader user
who is away from their primary computer may want to perform.
We did not test the system with blind individuals who would like to learn to use a
screen reader but cannot afford one. Using a screen reader efficiently requires practice and
our expectation is that if current screen reader users can use WebAnywhere then others
could likely learn to use it as well.
Task 1: Restaurant Phone Number on Google
Participants were asked to find the phone number of the Volterra Italian Restaurant in
Seattle by using google.com. Participants were told to search for the phrase “Volterra
Seattle.” The phone number of the restaurant can be found on the Google results page,
although some participants did not notice this and, instead, found the number on the
restaurant’s home page. This task represented an example of users wanting to quickly find
information on-the-go.
Task 2: Gmail
Participants checked a web-based email account and located and read a message with the
subject “Important Information.” Participants first navigated to the gmail.com homepage
and entered a provided username and password into the appropriate fields. They next
found the message and then read its text. This task involved navigating the complex pages
of gmail.com that include a lot of information and large tables.
Task 3: Bus Schedule
Participants found when the 48 bus would next be arriving at the intersection of 15th
Ave and 65th St using the real-time bus tracking web page mybus.org. Participants first
navigated to the mybus.org homepage, entered the bus number into a text input field and
clicked the submit button. This brought them to a page with a large list of links consisting
of all of the stops where information is available for the bus. Participants needed to find
the correct stop among these links and then navigate to its results. This task also included
navigating through large tables of information.
Task 4: WebAnywhere Survey
The final task asked participants to complete a survey about their experiences using WebAnywhere. They completed the survey about WebAnywhere using the WebAnywhere screen
reader itself. This task involved completing a web-based survey that consisted of eleven
statements and associated selection boxes that allowed them to specify to what extent they
agreed or disagreed with each statement on a Likert scale. The survey also included a text
area for additional, free-form comments. For this task, the researchers left the room, and
so participants completed the task completely independently.
7.6.1
Study Results
All participants were able to complete the four tasks. Most users were not familiar with
the pages used in the study and found it tedious to find the information of interest without
already knowing the structure of the page. However, most noted that this experience was
not unlike using their usual screen reader to access a page which they had not accessed
before. Some participants noted functionality available in their current screen reader that
would have helped them complete the tasks more quickly. For example, the JAWS screen
reader has a function for skipping to the main content area of a page.
Participants who were already familiar with the web pages used in the study were,
unsurprisingly, faster at completing those tasks. For instance, several
participants were frequent Gmail users and leveraged their knowledge of the page’s structure to quickly jump to the inbox. In that example, skipping through the page using a
combination of the next heading and next input element shortcut keys is an efficient way
to reach the messages in the inbox.
Figure 7.7 summarizes the results of the survey participants took as part of task 4.
Participants all felt that WebAnywhere was a bit tedious to use, although many mentioned
in a post-study interview that it was only slightly more tedious than the screen reader to
which they are accustomed. Most agreed that mobile technology for accessing the web is
expensive and most find themselves in situations where a computer is available but they
cannot access it because a screen reader is not installed on it. The main exception was a
skilled computer user who carries a portable version of the NVDA screen reader with him
on a USB key. He was uniformly negative about WebAnywhere because WebAnywhere
provided an inferior experience relative to NVDA and his solution works on the computers
that he has tried. Most of our participants could see themselves using the system when it
is released.
Figure 7.7: Participant responses to the WebAnywhere survey, reported as median agreement on a 5-point Likert scale (1 = strongly disagree, 5 = strongly agree), indicating that they saw a need for a low-cost, highly-available screen-reading solution (7, 8, 9) and thought that WebAnywhere could provide it (3, 4, 6, 10). The statements were: (1) WebAnywhere is difficult to use. (2) WebAnywhere is tedious to use. (3) I could use WebAnywhere to access the web. (4) I could access the web from more locations using WebAnywhere. (5) With practice, I could independently access the web with WebAnywhere. (6) I would use WebAnywhere to access the web from computers lacking screen readers. (7) Technologies enabling mobile access are expensive. (8) Other tools would provide access in as many locations as this tool. (9) I often find myself where a computer is available but it lacks a screen reader. (10) I would use WebAnywhere if no other screen reader is available. (11) WebAnywhere would be useful for someone unable to afford another screen reader. Full results are available in Appendix A.
7.6.2 Discussion
Participants felt that WebAnywhere would be adequate for accessing the web, although
none were prepared to give up their normal screen reader to use it instead. This was the
expected and desired result. One participant remarked that the system would be useful for
providing access when he is visiting a relative where he would not be comfortable installing
new software.
Participants who completed our study after the release of the free Serotek tool said that
they could see themselves using both tools depending on their situation. For instance, if they
only needed to find a phone number or an email, they would probably use WebAnywhere
because it does not take as long to load. They would also use WebAnywhere if they were
on a machine on which SA-to-Go would not run. SA-to-Go is an important option because
they can use it to access applications other than the web.
Participants did not mention the latency of the system as a major concern, but some were confused when a sound or web page took a while to load because, during that period, WebAnywhere was silent. We can address this by having the system periodically say "content loading, X%" while content is loading. One participant mentioned that the latency of the speech echoed to him while typing was bothersome. We have since improved this by prefetching the speech representations of the letters and punctuation that result when users type. Others noted that errors in the current implementation occasionally produced incorrect effects and that WebAnywhere lacks some of the functionality of existing screen readers. Many of these shortcomings have already been addressed in the current version of WebAnywhere.

Figure 7.8: Caching and prefetching on the server and client improve latency. Speech is fastest when played from the browser cache (MB), slower when retrieved from the server cache (GB), and slowest when it must be generated by the text-to-speech server.
Most importantly, participants successfully accessed the web using WebAnywhere; future
versions will seek to further improve the user experience and functionality.
7.7 Reducing Latency
WebAnywhere uses remote text-to-speech conversion, and the latency of requesting, generating, and retrieving speech could potentially disrupt the user experience. Because the sound players used in WebAnywhere can play sounds soon after they begin downloading, latency is low on high-speed connections but can be noticeable on slower connections. The latency of retrieving speech is important because it directly determines the user-perceived latency of the system: when a user hits a key, they know the system has responded only when the sound corresponding to that key press plays.
To reduce the perceived latency of retrieving speech, the system aggressively caches the
speech that is generated by the TTS service on both the server and client. In order to
increase the likelihood that the speech a user wants to be played is in the cache when they
want to play it, the system can use several different prefetching strategies designed to prime
these caches. Prefetching has been explored before as a way to reduce web latency [98] and
has been shown to dramatically reduce the latency for general web browsing [72]. Traditional
screen readers running as processes on the client machine do not require prefetching because
generating and playing sound has low latency. Web applications have long used prefetching
as a mechanism for reducing the delay introduced by network latency. For instance, web
applications that swap images dynamically use Javascript to preload those images so that
the swaps appear more responsive. The prefetching and caching strategies explored in
WebAnywhere may also be useful for visually rich web applications.
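For reference, the image-preloading idiom mentioned above is typically only a few lines of Javascript; the URLs here are illustrative placeholders.

    // Preload images so later swaps come from the browser cache.
    var preloaded = [];
    function preloadImages(urls) {
      for (var i = 0; i < urls.length; i++) {
        var img = new Image();
        img.src = urls[i];      // triggers the download and primes the cache
        preloaded.push(img);    // keep a reference so the browser retains it
      }
    }
    preloadImages(['button-hover.png', 'button-active.png']);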
7.7.1 Caching
The system uses caching on both the server and client browser in order to reduce the
perceived latency of retrieving speech. TTS conversion is a relatively processor-intensive
task. To reduce how often speech must be generated from scratch, WebAnywhere stores
the speech that is generated on a hard disk on the server. While hard disk seek times can
be slow, their latency is low compared with the cost of generating the speech again.
The speech that is retrieved by the client is cached on the client machine by the
browser. Most browsers maintain their own caches of files retrieved from the web, and an
unprivileged web application such as WebAnywhere does not have permission to directly
set either the size of the cache or the cache replacement policy, and WebAnywhere does
not attempt to do so. Flash uses the regular browser cache. Both the Internet Explorer and
Firefox disk caches default to 50 megabytes, which can hold a large number of the relatively
small MP3 files used to represent speech.
The performance implications of these caching strategies are explored in Section 7.8.1.
7.7.2 Prefetching Speech
The goal of prefetching in WebAnywhere is to determine what text the user is likely to want
to play as speech in the future and increase the likelihood that the speech sounds requested
are in the web browser’s cache by the time that the user wants them to be played. The
browser cache is stored in a combination of memory and hard disk, and retrieving sounds
to play from it is a very low-latency operation relative to retrieving sounds from the remote
WebAnywhere server. The distribution of requested speech sounds is expected to be Zipf-like [26], meaning that the most popular sounds are likely to be in the cache already, while a long tail of speech sounds will not have been generated before.
All prefetching strategies add strings to a priority queue, which a separate prefetching
thread uses to prioritize which strings should be converted to speech. We explored several
different strategies for deciding what strings should be given highest prefetching priority
by the system. The function of each strategy is to determine the priority that should be
assigned to each string.
To prefetch speech sounds, the prefetching thread issues an xmlHttpRequest request
for the speech sound (MP3) representing each string from its queue. This populates the
browser’s local cache, so that when the sound is requested later, it is retrieved quickly from
the cache. We next present several different prefetching strategies that we have implemented. Section 7.8.3 presents a comparison of these strategies.
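Before turning to the individual strategies, a minimal sketch of this prefetch loop follows; the priority-queue interface and the speech endpoint are illustrative assumptions, not WebAnywhere's actual identifiers.

    // A minimal sketch of the prefetching loop, assuming a priority queue
    // of strings ordered so that the most urgent string pops first.
    function prefetchNext(queue) {
      if (queue.isEmpty()) return;
      var text = queue.pop();               // highest-priority string
      var xhr = new XMLHttpRequest();
      // Requesting the MP3 for this string primes the browser cache.
      xhr.open('GET', '/speech?text=' + encodeURIComponent(text), true);
      xhr.onreadystatechange = function () {
        if (xhr.readyState === 4) {
          prefetchNext(queue);              // fetch one sound at a time
        }
      };
      xhr.send(null);
    }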
DFS Prefetching
WebAnywhere and other screen readers traverse the DOM using a pre-order Depth First
Search (DFS). The basic prefetching mode of the WebAnywhere system performs a separate
DFS of the DOM that inserts the text that will be spoken for each node in the DOM into the
priority queue with a weight corresponding to its order in the DFS traversal of the DOM.
This method retrieves speech sounds in the order in which users would reach them if they
were to read through a page in the natural top-to-bottom, left-to-right order. If users read
in this order and do not move through the page more quickly than the prefetcher can
retrieve sounds, then this strategy should work well. However, blind
web users are known for leveraging the large number of shortcut keys made available by
their screen readers to skip around in web content [35, 17], so it is worthwhile considering
other strategies that may better address this usage.
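For concreteness, a minimal sketch of the pre-order DFS that feeds the queue appears below; speakableText and the queue interface are illustrative assumptions.

    // Walk the DOM in pre-order and enqueue each node's spoken text with a
    // priority reflecting its position; lower values are fetched first.
    function enqueueDfs(node, queue, order) {
      var text = speakableText(node);       // assumed helper: text spoken for node
      if (text) {
        queue.push(text, order.count++);    // earlier nodes fetched sooner
      }
      for (var child = node.firstChild; child; child = child.nextSibling) {
        enqueueDfs(child, queue, order);
      }
    }
    // enqueueDfs(document.body, queue, { count: 0 });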
DFS Prefetching + Position Update
The system could better adapt if it updated the nodes to be prefetched based on the node
currently being read. This could prevent nodes that have already been skipped by the user
from being prefetched and taking bandwidth that could otherwise be used to download
sounds that are more likely to be played. The DFS+Update prefetching algorithm includes
support for changing its position in prefetching. For example, if the prefetcher is working
on content near the top of the page when a user skips to the middle of the page using
the find functionality, the prefetcher updates its current position and
continues prefetching at the new location. When the user skips ahead in the page, the
priorities of elements in the queue are updated to reflect the new position. These updates
make prefetching more likely to improve performance.
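A minimal sketch of such a reprioritization, assuming each queue entry remembers its DFS order and that the queue supports iteration, might look like the following.

    // Reweight queued entries relative to the user's new position in the
    // DFS order; entries already behind the cursor sink to lowest priority.
    function updatePriorities(queue, currentOrder) {
      queue.forEach(function (entry) {               // assumed queue operation
        if (entry.order < currentOrder) {
          entry.priority = Infinity;                 // already skipped; fetch last
        } else {
          entry.priority = entry.order - currentOrder; // soonest-next fetched first
        }
      });
      queue.reheapify();   // assumed queue operation restoring heap order
    }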
DFS Prefetching + Prediction
WebAnywhere also prefetches sounds based on a personalized, predictive model of user
behavior. The shortcut keys supported by the system are rich, but users frequently employ
only a few. Furthermore, users do not skip through the page at random, meaning that the
likelihood that a user will issue each keyboard shortcut can be inferred from such factors
as the keys that they previously pressed and the node that is currently being read. For
instance, a user who has pressed TAB to move from one link to the next is more likely to
press TAB again upon reaching another link than is a user who has been reading through
the page sequentially. Similarly, some users may frequently skip forward by form element,
while others may rarely do so.
To use such behavior patterns to improve the efficacy of prefetching, WebAnywhere
records each user’s interactions with the system and uses this empirical data to construct
a predictive model. The model is used to predict which actions the user is most likely to
take at each point, helping to direct the prefetcher to retrieve those sounds most likely
to be played. An action is defined as a shortcut key pressed by the user. WebAnywhere
records the history of actions performed by the user and the history of the current node
types associated with each. The system distinguishes three types: link, input element, and
other. These actions were chosen because they roughly align with the most popular actions
currently implemented in the system and could be expanded in the future.
The probability that the next action, $\mathit{action}_i$, is $x$, assuming that the next action depends only on prior observations of actions and visited nodes, is as follows:

$$P(\mathit{action}_i = x \mid \mathit{node}_0, \ldots, \mathit{node}_i, \mathit{action}_0, \ldots, \mathit{action}_{i-1})$$

WebAnywhere uses the standard Markov assumption to estimate this probability by looking back only one time step [112]. Therefore, the probability that the user's next action is $x$ given the type of the current node and the user's previous action can be expressed as follows:

$$P(\mathit{action}_i = x \mid \mathit{node}_i, \mathit{action}_{i-1})$$
All actions are initially assigned uniform probability. These probabilities are dynamically
updated as the system is used and sounds in the priority queue are weighted using them.
To be specific, for each possible condition $w$ (a combination of previous action and type of the current node), a count $c_w(x)$ is maintained for each action $x$. Each count is initially set to 1 and is incremented by 1 whenever the user takes action $x$ in condition $w$. The probability of each next action can then be calculated as follows:

$$P(\mathit{action}_i = x \mid \mathit{node}_i, \mathit{action}_{i-1} = w) = \frac{c_w(x)}{\sum_y c_w(y)}$$
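This update rule is simple enough to run directly in the browser as the user works. The following is a minimal JavaScript sketch of the count-based estimator; the action names and function names are illustrative assumptions, not WebAnywhere's actual identifiers.

    // Maintain per-condition counts and derive action probabilities from them.
    // A condition combines the current node type with the user's previous action.
    var ACTIONS = ['next_node', 'prior_node', 'next_focusable', 'prior_focusable',
                   'next_heading', 'prior_heading', 'next_input', 'prior_input'];
    var counts = {};

    function conditionTable(nodeType, prevAction) {
      var key = nodeType + '|' + prevAction;
      if (!counts[key]) {
        counts[key] = {};
        // Counts start at 1, giving every action a uniform prior.
        for (var i = 0; i < ACTIONS.length; i++) counts[key][ACTIONS[i]] = 1;
      }
      return counts[key];
    }

    function recordAction(nodeType, prevAction, action) {
      conditionTable(nodeType, prevAction)[action] += 1;
    }

    // P(action | nodeType, prevAction) = c_w(action) / sum over y of c_w(y)
    function actionProbability(nodeType, prevAction, action) {
      var table = conditionTable(nodeType, prevAction);
      var total = 0;
      for (var a in table) total += table[a];
      return table[action] / total;
    }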
WebAnywhere reweights nodes in the priority queue used for prefetching according
to these probabilities. Section 7.8.2 presents an evaluation of the accuracy of predictive
prefetching.
7.8 Evaluation
We evaluated WebAnywhere along several dimensions, including how caching improves the performance of the system, the load that the system can withstand, and the accuracy and latency effects of the prefetching strategies discussed in Section 7.7.2.
7.8.1 Server Load
In order for us to release the system, we must be able to support a reasonable number of
simultaneous users per machine. In this section, we present our evaluation of the performance of the WebAnywhere speech retrieval system under increasing load, while varying
the caching and prefetching strategies used by the system.
We chose to evaluate the latency of sound retrieval because it will contribute most to
the perceived latency of the system. When users press a key, they expect the appropriate
speech sound to be played immediately. The TTS system and cache were running on a single
machine with a 2.26 GHz Intel Pentium 4 processor and 1 GB of memory. To implement
this evaluation, we first recorded the first 250 requests that the system made for speech
sounds when reading news.google.com from top to bottom. This represented a total of
446 seconds of speech with an average file size of 11.7 kB over the 250 retrieved files. A
multi-threaded script was used to replay those requests for any number of users. The script
first retrieves a speech sound and then waits for the length of the speech sound. It repeats
this process until all of the recorded sounds have completed, reproducing the requests that
WebAnywhere would make when reading the page. This script was run on a separate
machine that issued requests over a high-speed connection to the WebAnywhere server.
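The replay script itself is straightforward; the following Node.js sketch approximates its behavior with concurrent asynchronous requests rather than threads. The trace format and names are assumptions for illustration.

    // Replay a recorded trace of speech requests for simulated users.
    // Each trace entry is assumed to be { url: ..., durationSeconds: ... }.
    var http = require('http');

    function replayUser(trace, index, done) {
      if (index >= trace.length) return done();
      var entry = trace[index];
      var start = Date.now();
      http.get(entry.url, function (res) {
        res.on('data', function () {});          // drain the MP3 payload
        res.on('end', function () {
          console.log(entry.url + ' took ' + (Date.now() - start) + ' ms');
          // Wait for the duration of the sound before requesting the next one,
          // reproducing the pacing of a user listening through the page.
          setTimeout(function () { replayUser(trace, index + 1, done); },
                     entry.durationSeconds * 1000);
        });
      });
    }

    // Simulate numUsers users reading the page simultaneously.
    function simulateLoad(trace, numUsers) {
      for (var i = 0; i < numUsers; i++) replayUser(trace, 0, function () {});
    }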
We tested the following three caching conditions: (i) both server and browser caching
enabled, (ii) only server caching enabled, and (iii) no caching enabled. Speech sounds were
assumed to be in the server cache when it was used. Speech sounds were added to the
browser cache as they were retrieved. Figure 7.9 presents the results of our evaluation,
which demonstrates that TTS production is the major bottleneck in the system. Latency
of retrieved speech quickly increases as the system attempts to serve more than 10 users.
With server-side caching, this is dramatically improved. Client-side caching in the browser improves latency slightly more, although its effect is limited in this example because relatively few speech sounds are repeated. Repeated sounds include "link," which is said before each link that is read, and "Associated Press," which appears frequently because this is a news site. As we move toward releasing WebAnywhere to a larger audience, we will continue to evaluate its performance under real-world conditions.

Figure 7.9: An evaluation of server load as the number of simultaneous users reading news.google.com is increased, for three caching combinations: no cache, server cache only, and server plus browser cache. The figure plots average latency (seconds) against the number of simultaneous users (from 5 to 20).
A number of assumptions were made that may not be upheld in practice. For example,
the system will not achieve the perfect server-side cache hit-rate assumed here, although
both the server and browser caches will likely have had more opportunity to be primed when
users read through multiple pages. Most users also do not read pages straight through. In our
observations of the system in use, users most often skip around in the page, often returning
to nodes that they had visited before, causing speech already retrieved by the browser to be
played again. In this initial system we have also
not optimized the TTS generation or the cache routines themselves but could likely achieve
better results by doing so. Finally, latency here was calculated as the delay between when
a speech sound was requested and when it was retrieved. In practice, the Flash sound
player can stream sounds and begin playing them much earlier. Despite its limitations,
this evaluation generally illustrates the performance of the current system and where future
development should be targeted.
7.8.2 Prefetching Accuracy
This section presents an analysis of the predictive power of the model underlying DFS
Prefetching + Prediction described in Section 7.7.2. To conduct this study we collected
traces of the interactions of 3 users of WebAnywhere in a study that we presented earlier
in this chapter (Section 7.6). In that study, users completed the following four tasks using
WebAnywhere: checking their email on gmail.com, looking up the next arrival of a bus,
finding the phone number of a restaurant using google.com and completing a web survey
about WebAnywhere.
In total, we recorded 2632 individual key presses. Of these, 1110 were not command
key presses and resulted in the name of a key being spoken, for instance "a" or "forward
slash." The system prefetches these key-name sounds automatically when it first loads. The
remaining 1522 key presses were commands that caused the screen reader to read a new element in the
page. For instance, the TAB key would cause WebAnywhere to advance its internal cursor
to the next focusable page element and read it to the user. Using this data, we computed
the probability of each future action given the current node and the user’s previous action
as described earlier. Figure 7.10 shows the counts that we recorded. From this data, it
appears that users are more likely to issue a keyboard command again after issuing it once.
Some commands are also more likely given certain types of nodes; for instance, users are
more likely to request the next input element when the cursor is currently located on an
input element.
We replayed the traces to build the models that would have been created as each user
browsed with WebAnywhere, and measured the system's accuracy in predicting the user's
next action. The Markov model correctly predicts the next action in 72.4% of cases. However, simply
next action in 74.5% of cases. Markov prediction is still useful because it can quickly adapt
139
Node
Link
Input
Other
next node
next focusable
prior focusable
prior node
next focusable
prior node
next node
next input
next node
next focusable
prior node
next heading
next focusable
next node
prior node
prior focusable
next heading
next input
prior heading
prior input
Actions Observed
83
00
00
03
05
17
33
00
91
06
11
07
11
207
31
01
12
01
00
00
05 00 00
00 60 01
01 04 09
06 00 00
07 117 16
56 02 02
10 12 01
00 08 00
08 03 00
02 36 12
23 03 00
00 00 00
09 213 29
23 15 01
85 00 02
05 22 25
01 00 00
00 09 00
00 00 00
00 00 00
03
01
00
00
00
00
01
01
03
01
02
15
02
07
02
00
19
01
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
16
03
01
00
00
01
03
00
00
01
18
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
ne
x
ne pr t no
xt i o r d e
pr fo no __
i o cu d
rf s e
n e o cu a b l _ _
x s e_
pr t he ab _
i o a l e_
r h di _
e n
ne ad g__
x in
p r t i n g_
io pu _
r i t_
np _
ut
__
All
Prior Action
Observed
Actions
Figure 7.10: Counts of recorded actions along with the contexts in which they were recorded
(current node and prior action), ordered by observed frequency.
to individual users whose behavior may not follow this particular pattern. Its predictions
also enable prefetching of the second-most-likely action. The true action is in the two most
likely candidates in 87.3% of cases.
Predictive prefetching is thus quite accurate. These Markov predictions drive the DFS Prefetching + Prediction strategy; the next section explores how that accuracy manifests in the perceived latency of the system.
7.8.3 Observed Latency
The main difference between WebAnywhere and other screen readers is that WebAnywhere
generates speech remotely and transfers it to the client for playback. Existing screen readers
play sounds with almost no latency because the sounds are both generated and played on
the client. To better understand the latency trade-offs inherent in WebAnywhere and the
Figure 7.11: Average latency per sound using different prefetching strategies. The first
set contains tasks performed by participants in our user evaluation, including results for
prefetching strategies that are based on user behavior. The second set contains five popular
sites, read straight through from top to bottom, with and without DFS prefetching. Bars
are shown overlapping.
implemented prefetching algorithms, we performed a series of experiments designed to test
latency under various conditions. A summary of these results is presented in Figure 7.11.
For all experiments, we report the average latency per sound on both a dial-up and a
high-speed connection, whose connection speeds were 44 kBps and 791 kBps, respectively.
Although 63% of public libraries have high-speed connections in the United States [13], a
dial-up connection may better represent speeds available in many communities. Timing was
recorded and aggregated within the WebAnywhere client script. The latency of a sound is
the time between when the sound is requested by the client-side script and when that sound
begins playing. The average latency is the time that one should expect to wait before each
sound is played. Because the system streams audio files, the entire sound does not need to
be loaded when it starts playing. Experiments were conducted on a single computer and
the browser cache was cleared between experiments.
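The client-side timing can be as simple as the following sketch; the sound player's load callback is an assumption for illustration.

    // Measure per-sound latency: time from request to the start of playback.
    var latencies = [];
    function requestSound(url, player) {
      var requested = new Date().getTime();
      player.load(url, function onPlaybackStarted() {   // assumed player hook
        // Record how long the user waited before the sound began playing.
        latencies.push(new Date().getTime() - requested);
      });
    }
    function averageLatency() {
      var total = 0;
      for (var i = 0; i < latencies.length; i++) total += latencies[i];
      return latencies.length ? total / latencies.length : 0;
    }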
We first compared the DFS prefetching strategy to using no prefetching on a straight-through read of the 5 web pages visited most often by blind users in a study of browsing
behavior [17]. We did not test the other prefetching methods because, absent user interaction,
they operate identically to DFS prefetching. On these tests, the average latency for the
high-speed connection was under 500 ms even without prefetching and under 100 ms with
prefetching (Figure 7.11). Delays under 200 ms generally are not perceivable by users [83]
and this may explain why most users did not cite latency as a concern with WebAnywhere
during a user evaluation of it [22]. The dial-up connection recorded latencies above 2
seconds per sound for all five web pages, making it almost unusable without prefetching.
The latency with prefetching averages less than one second per sound, and the average
length of all sounds was 2.4 seconds.
Screen reader users often skip around in content instead of reading it straight through.
Using recordings of the actions performed by participants during a user evaluation [22], we
identified the common methods used to complete the tasks and replayed them manually to
see the effect of different prefetching strategies on these tasks. Recording and then replaying
actions in order to test web performance under varying strategies has been done before [71].
The prefetching strategies tested were DFS-DOM traversal, DFS with dynamic updating,
and Markov model prediction.
Observed latency was again quite low for runs using the high-speed connection. When
using the dial-up connection, however, the results differed dramatically. On both the Gmail
and Google tasks, DFS prefetching increased the latency of retrieving sounds. This happened because our participants used skipping extensively on these sites and quickly moved
beyond the point in the DOM where the prefetcher was retrieving sounds. When this happened, the prefetcher used bandwidth to retrieve speech that the user was not going to play
later, slowing the retrieval of speech that would be played. Only the survey task showed
a significant benefit for the predictive model of user behavior. On this task, participants
exhibited a regular pattern of tabbing from selection box to selection box, making their
actions easy to predict correctly. Importantly, the prediction method did not perform
worse than the Update method, and both far outperformed DFS and no prefetching.
7.9 Security
WebAnywhere enables users to browse untrusted web content. Because WebAnywhere is a
web application running inside the browser with no special permissions, it lacks the usual
mechanisms that web browsers use to enforce security. In this section, we describe the
primary security concerns resulting from our engineering decisions in building WebAnywhere
and the steps we have taken to address them.
7.9.1 Enforcing the Same-Origin Policy
The primary security policy that web browsers enforce is called the same-origin policy,
which prevents scripts loaded from one web domain from accessing scripts or documents
loaded from another domain [111]. This policy prevents scripts from stealYourPassword.com
from accessing content on, for instance, bankOfAmerica.com. To enable WebAnywhere to
access and voice the contents of the pages its users want to visit, the system retrieves all
web content through a web proxy in order to bypass the same-origin restriction. This makes
all content appear to originate from the WebAnywhere domain (for these examples, assume
wadomain.org) and affords the WebAnywhere script access to that content.
Bypassing the same-origin policy is fundamental for WebAnywhere’s operation, but
gives malicious web sites an opportunity to violate expected security guarantees, potentially
accessing information to which they should not have access. WebAnywhere cannot directly
enforce the same-origin policy, and so it instead ensures that all content retrieved that should
be isolated based on the same-origin policy originates from a different domain. This is done
by prepending the original domain of a resource onto its existing domain. For instance,
content from yahoo.com is made to originate from yahoo.com.wadomain.org. WebAnywhere
rewrites all URLs in this way, causing the browser to correctly enforce the same-origin policy
for the content viewed.
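In WebAnywhere the rewriting is performed by the web proxy; the following JavaScript sketch merely illustrates the transformation, with wadomain.org standing in for the real domain.

    // Rewrite an absolute URL so its content appears to come from a subdomain
    // of the proxy, letting the browser enforce same-origin isolation.
    // e.g. "http://www.yahoo.com/news" -> "http://www.yahoo.com.wadomain.org/news"
    function rewriteUrl(originalUrl) {
      var match = originalUrl.match(/^(https?):\/\/([^\/]+)(\/.*)?$/);
      if (!match) return originalUrl;   // leave relative or unusual URLs alone
      var scheme = match[1], host = match[2], path = match[3] || '/';
      return scheme + '://' + host + '.wadomain.org' + path;
    }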
All requests to open new pages and URL references within retrieved pages are rewritten
to point to the proper domain. The web proxy server enforces that a request for web content
located on domain d must originate from the d.wadomain.org domain. WebAnywhere is able
to respond to requests for any domain, regardless of the subdomain supplied, through the
use of a wildcard DNS record [89] for *.wadomain.org that directs all such requests to
WebAnywhere. The WebAnywhere script and the Flash sound player must also originate
from the d.wadomain.org domain, and so they are reloaded. This 100 KB download
need only occur when users browse to a new domain.
Browser cookies also have access restrictions designed to prevent unauthorized scripts
from accessing them [99]. The PHProxy web proxy used by WebAnywhere keeps track of
the cookies assigned by each domain and only sends cookies set by a domain to that domain.
This is not entirely sufficient: access to cookies is controlled by both the domain and the path
listed when a cookie is set, as determined by the web page that sets it. Future versions of
WebAnywhere will modify the domain and the path requested by the cookie to match its
location in WebAnywhere. For example, the URL www.domain.com/articles/index.php will
appear to come from www.domain.com.wadomain.org/web-proxy/articles/index.php using
the URL rewriting supported by the Apache web server. The domain and path of the
SetCookie request could be adjusted accordingly.
Others have attempted to detect malicious Javascript in the browser client [52], but this
relies on potentially malicious Javascript code being isolated from code within the browser.
The WebAnywhere Javascript runs in the same security context as the potentially malicious
code. BrowserShield describes a method for rewriting static web pages in order to enforce
run-time checks for security vulnerabilities with Javascript scripts [109]. It is targeted at
protecting against generalized threats and is, therefore, a fairly heavyweight option. The
same-origin policy is our main concern because it is the security policy that we removed by
introducing the web proxy. We believe the approach here could be used more generally by
web proxies in order to make them less vulnerable to violations of the same-origin policy.
7.9.2 Remaining Concerns
Fixing the same-origin policy vulnerability created by WebAnywhere was of primary importance to us, but other concerns remain. The first is that, in order to work for secure
web sites, WebAnywhere must intercept and decrypt secure connections made by users of
the system before forwarding them on. When a secure request is made, the web proxy
establishes a separate secure connection with both the client and the server. The data is
unencrypted on the WebAnywhere server. All accesses to WebAnywhere are made over its
SSL-enabled web server, but users still must trust that the WebAnywhere server is secure
and, therefore, may want to avoid connecting to secure sites using the system.
The second concern that remains unresolved is the opportunity for sites to use phishing
to misrepresent their identity, potentially tricking users into giving up personal information.
Although phishing is a problem general to web browsing, the unique design of WebAnywhere makes phishing potentially more difficult to detect. A web page could override the
WebAnywhere script used to play speech and prevent users from discovering the real origin
of the web page. For instance, as the system currently exists, a malicious site could override
the commands in WebAnywhere used to speak the current URL, preventing a user from
discovering the real web address they are visiting. Future versions of WebAnywhere will
include protections like those in BrowserShield [109] to enforce runtime checks to ensure the
WebAnywhere functions have not been altered.
Finally, because content that has been read previously is cached on the server, malicious
users could determine what other users have had read to them, possibly exposing private
information. While this problem is shared by all proxy-based systems, WebAnywhere enables it at a finer granularity than most other systems, which is potentially more revealing.
For instance, if a user visits a page that contains their credit card number, it is likely that
the system will choose to generate a separate speech sound for their number. A malicious
user could repeatedly query the system for credit card numbers and isolate those that are
retrieved most quickly. We have partially addressed this problem by not caching sounds
that originate from secure web addresses.
7.10 Summary
The WebAnywhere web-based, self-voicing web browser enables blind individuals otherwise
unable to afford a screen reader and blind individuals on-the-go to access the web from
any computer that happens to be available. WebAnywhere can run on most systems, even
public terminals on which users are given few permissions. Its small startup size means
that users can quickly begin browsing the web, even on relatively slow connections.
Participants who evaluated WebAnywhere were able to complete tasks representative of
those that users may want to complete on-the-go.
The focus of this chapter has been to improve web access for blind web users. After we
released WebAnywhere, we found that a wider variety of people were using it to address their
needs, illustrating the power of access technology that is easy to get up and running. WebAnywhere
also offers a promising way for new technology to reach users. These implications are
discussed in greater detail in the next chapter.
Chapter 8
A NEW DELIVERY MODEL FOR ACCESS TECHNOLOGY
In the previous chapter (Chapter 7), we introduced WebAnywhere, a web-based screen
reader that blind people can use to access the web on any computer to which they have
access. This chapter explores the implications of the WebAnywhere model as a method for
delivering more general access technology.
We released WebAnywhere on a public site in June 2008 and since then it has attracted
a large number of visitors. Surprisingly, many of these visitors were not blind web users:
WebAnywhere also attracted people with low vision, web developers, special education
teachers, and people with learning disabilities. People came from all over the world, and
a small community of developers has begun to create localized versions for many different
languages. WebAnywhere has the potential to serve as a vehicle to disseminate access
technology quickly and easily to a large number of users across the world.
8.1 The Audience for WebAnywhere
Since its release in June 2008, a large number of people have used WebAnywhere (Figure 8.1).
These users have offered their feedback directly through emails and indirectly via features
of the web traffic that they generate.
8.1.1 User Feedback
Participants in our initial user study of the system requested several features that are offered
by other screen readers but which are currently unavailable in WebAnywhere. Many of these
requests involved new keyboard shortcuts and functionality, but several involved producing
different, individualized speech sounds. Implementing these features in a straightforward
way has the potential to reduce the efficacy of the prefetching and caching strategies employed by the system. For instance, users requested that the system use a voice that is more
Figure 8.1: Weekly web usage (unique users per week, computed over a sliding window) between November 15, 2008 and May 1, 2009. An average of approximately 600 unique users visit WebAnywhere each week. The large drop in users in December roughly corresponded to winter break. WebAnywhere offers the chance for a living laboratory to improve our understanding of how blind people browse the web and the problems that they face.
preferred by them; popular screen readers offer tens of voices from which users can choose.
Others asked for the ability to set the speech rate to a custom value. Because using a screen
reader can be inefficient, many users speed up the rate of the speech by two times or more;
others, however, prefer not to because speech can be difficult to understand at high
speeds. Both of these improvements will cause the speech played by the
system to less frequently be located in its cache, and, therefore, the value of these features
will need to be balanced by their performance implications. We also plan to explore the
option of switching to client-side TTS when users both have the permission to use it and
it is available. Several operating systems have native support for TTS that WebAnywhere
could leverage when permitted.
8.1.2 Broader Audience
Releasing WebAnywhere demonstrated that its audience was much
broader than we had originally anticipated. Many WebAnywhere users have sent us email
feedback since its release, from which we have identified two themes. First, we discovered
initially from our usage logs that WebAnywhere had a large global reach (Figure 8.2). The
Figure 8.2: From November 2008 to May 2009, WebAnywhere was used by people from over 90 countries. The 40 best-represented countries, ranked by the number of unique IPs from each country that accessed WebAnywhere over this period: United States (6815), United Kingdom (2307), Canada (675), Italy (611), India (455), Australia (409), Spain (351), Taiwan (341), Uruguay (335), Brazil (321), Germany (315), Singapore (273), France (265), China (207), Hong Kong (147), Thailand (137), New Zealand (127), Japan (97), Netherlands (95), Mexico (67), Switzerland (59), Austria (57), Denmark (51), Poland (51), Ireland (49), Belgium (47), Portugal (43), Sweden (43), Malaysia (37), South Africa (37), Argentina (35), United Arab Emirates (33), Iran (33), Oman (29), Finland (29), Norway (27), Chile (25), Slovakia (21), Israel (19), Turkey (19). 33.9% of the total 23,384 IPs could not be localized and are not included.
most popular request we have received is for support of additional languages. The most
active contributor to the WebAnywhere open source project1 has been a developer who has
created a Cantonese version of WebAnywhere (Figure 8.3).
Second, from user feedback, it has become clear that blind people are not the only ones
using WebAnywhere. We have received emails from web developers who use WebAnywhere
to quickly test the accessibility of their content. A special education teacher emailed us
saying that she uses WebAnywhere with her students. Specialized software is available
that is specifically designed for developers wanting to create accessible content and for
students with learning disabilities. We speculate that because WebAnywhere can be used
without installing new software and works on any platform, it is likely to be used even
when better alternatives are available. Future work will look to (i) understand why people
are using WebAnywhere, (ii) explore how WebAnywhere could better support the features that
these new audiences want, and (iii) examine how tools targeting specific groups might provide the
advantages that cause people to use options like WebAnywhere instead.
1 http://webanywhere.googlecode.com
Figure 8.3: WebAnywhere, May 2009. Since its release, new languages have been added
to WebAnywhere. This screenshot shows an early Cantonese version of the
system. We have also started to introduce features that may make it more useful for other
populations. The text being read is highlighted within the page and shown in a magnified,
high-contrast view.
The implication is that the WebAnywhere approach to providing access may be appropriate for people with different needs all over the world.
8.2 Getting New Technology to Users: Two Examples
WebAnywhere has the potential to bring both access and new technology to users. Many
research projects and good ideas fail to make it to users because they are difficult to integrate
into existing products and require users to find and install new software.
8.2.1 Social Accessibility
To take advantage of collaboratively-authored accessibility improvements in systems like
Accessmonkey (Chapter 4), Social Accessibility [121], or AxsJAX [49], users must first
install software. Often this involves installing a browser extension or plugin. For the reasons
mentioned previously (Chapter 7), users may not be able to install new software in order
to benefit from this technology.
WebAnywhere can include new technology, for instance adding support for TrailBlazer
trails (Chapter 6), and users will get the latest updated version when they next use it.
The improvements that are made using all three current social accessibility systems can be
introduced using a Javascript script. WebAnywhere can easily inject these scripts when it
loads a new web page.
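A minimal sketch of such an injection follows; the function name and the assumption that WebAnywhere holds a reference to the loaded page's document are illustrative.

    // Inject a community-authored improvement script into the loaded page.
    function injectScript(contentDocument, scriptUrl) {
      var script = contentDocument.createElement('script');
      script.type = 'text/javascript';
      script.src = scriptUrl;   // e.g. an Accessmonkey or AxsJAX script
      contentDocument.getElementsByTagName('head')[0].appendChild(script);
    }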
8.2.2 Recording and Replaying Tasks
TrailBlazer (Chapter 6) offers users the ability to record and replay web-based tasks. WebAnywhere has immediate access to the actions that users are performing since it is the
interface that they are using and can easily record them. We have already implemented
basic macro recording and playback in WebAnywhere and plan to fully support TrailBlazer
scripts in future versions. Importantly, WebAnywhere users will not have to install any new
software; support for web macros can be added transparently, without user involvement.
8.2.3 Key Features
The WebAnywhere delivery model includes several key features that future projects may want to emulate. We have outlined these features below:
• Free - The fact that WebAnywhere is free to use is important. Beyond
the issues of fairness and equal access discussed in the introduction to the previous
chapter, a free tool allows people to easily try the software without committing to a
purchase. We also do not have to put restrictions on where it can be run.
• No Installation - A related advantage of the WebAnywhere model is that no new
software needs to be installed. As a consequence, software developed following the
WebAnywhere model will work on any platform that supports web access, even those
that are developed later.
• Low-Cost Distribution & Updates - As a consequence of web-based delivery,
users always receive the latest version of WebAnywhere. New features and updates
can reach users quickly, helping to decrease the lag users might experience in their
access technology responding to technology trends.
The features just described make WebAnywhere easy to try out, easy to demonstrate to
others, quick to adapt to changing technology, and available wherever people are.
8.3 Summary
This chapter has discussed some of the interesting implications of WebAnywhere and some
of our experiences since releasing it publicly. The main conclusion we have drawn
from this experience is that there is a large need for access technology that is easy to run and
personalize on whatever machine people have access to. Releasing technology
on the web means that anyone worldwide can access it, and the need for affordable access
technology that can run on any computer may be even greater in other countries.
Chapter 9
CONCLUSION AND FUTURE DIRECTIONS
This dissertation has explored the potential of including blind web users as partners in
improving access to the web. In this context, we have offered general contributions in (i)
tracking the actions of web users for predictive purposes, (ii) enabling end user customization
of web content through better interfaces, and (iii) formulating design constraints for making
tools widely available on the web. The inclusion of blind users in this work has been both
explicit, such as by providing improvements using Accessmonkey or choosing to install the
Usable CAPTCHA Greasemonkey script, and implicit, such as when their web interactions
improved the performance of WebAnywhere’s prefetching and TrailBlazer’s suggestions.
Including users in the process of improving their access is powerful and we believe this
approach may extend to other populations and contexts.
This chapter first overviews the contributions of this dissertation, then discusses future
directions, and lastly presents some final remarks on the broader message of the work.
9.1 Contributions
This dissertation has offered contributions in both understanding the problems that blind
web users face and in technology that will help blind web users improve the accessibility
of their own web experiences. Many of the tools and technology developed have broader
applications to understanding the web experiences of all users and helping users with diverse
requirements collaboratively create more effective access.
Recording the actions that users perform on the web forms an important component
of improving access for two reasons. First, it helps in building tools that improve our
understanding of how people use the web and the problems that they experience. Second,
recording actions is an important first step to predicting what actions users will perform
next. Both of these can be used to improve access.
9.1.1 Understanding Web Browsing Problems
WebinSitu (Chapter 3) explored the extent of accessibility problems and their practical
effects from the perspective of users. This remote study used an advanced web proxy that
leverages AJAX technology to record both the pages viewed and the actions taken by users
on the web pages that they visited. Over the period of one week, participants used the
assistive technology and software to which they were already accustomed and had already
configured according to preference. These advantages allowed us to aggregate observations
of many users and to explore the practical effects on, and coping strategies employed by,
our blind participants.
Our WebinSitu study reflects web accessibility from the perspective of web users and
describes quantitative differences in the browsing behavior of blind and sighted web users.
These results have motivated the areas that we have explored with the subsequent research
presented in this dissertation. Conducting remote studies of these systems using the WebinSitu infrastructure has allowed us to include more participants and offers promising
directions for future research (Section 9.2.4).
9.1.2 Predicting Web Actions
A core contribution of the research presented here is a set of methods for observing and predicting
user actions on the web: button presses, clicks on links, reading the next heading, and so on. We
show several examples of how predicting what users will do next can result in interfaces
that users can use to improve their own web experiences.
There has been a long history of both (i) prefetching the web pages users will likely
visit next, and (ii) adapting pages for smaller screens (or screen readers, for that matter).
For instance, the PROTEUS system [5] learned a model of page access from web logs.
PROTEUS used this model to create shortcut links to pages located deep within a site and
to hide content not related to the inferred user goal. Many other systems (before and after)
offered their own variations on this theme.
The models used in these prior systems were generally learned from the history of pages
that users visited, often drawn from page access logs. In contrast, the work presented here
learns models of user actions from actions observed within web pages (buttons pressed,
content read, links followed, commands executed, etc.).
Action-based learning within web pages is an important new direction. Web pages are
becoming more complex and web applications more popular, making the question of what
to do next within a web page increasingly important; it has long been important for blind
users. Users may also only visit a single web page when using a web application
(for instance, on gmail.com), making the question of what web page to visit next irrelevant.
Importantly, actions other than following a link can lead to a new HTTP request. Both of
these trends lead to the breakdown of page-based models.
More fundamentally, page-based models treat the web as a collection of linked, static
documents - an assumption that is increasingly violated. Action-based models treat web
pages as applications. As such, ideas explored in this area can find applications in traditional
desktop applications as well. Related work in this area has generally been limited by the
difficulty of recording and automating arbitrary desktop applications. The relative openness
of the web allows more progress to be made in this domain.
The models explored here are constructed and predictions are made directly in the
browser, personalized to each user as they browse. The models can learn not only from
what others have done on a page before, but also from what a particular user has done on
the pages they have visited recently. Most of the prior work has formulated models offline,
often requiring the logs from specific sites.
9.1.3 Intelligent End User Tools for Improving Access
As part of this dissertation, we have created a number of tools that have either been released
by us or helped to influence released projects.
• Accessmonkey - Accessmonkey was at the forefront of the burgeoning area of social accessibility. The Accessmonkey research prototype has influenced several related
projects. AxsJAX by Google injects scripts into web pages converting them to dynamic web applications targeting non-visual use [49]. The Social Accessibility project
lets blind web users report accessibility problems to sighted volunteers whose fixes
are represented as scripts [121]. Accessmonkey was created with the idea that blind
web users who were not programmers could independently improve accessibility - both
Accessmonkey and the tools that have followed it have made steps toward that aim.
• More Usable Audio CAPTCHAs - Our more usable interface to audio CAPTCHAs
helps improve the interactions required to solve existing audio CAPTCHAs and results in improved performance by blind web users. Several popular sites have since
made their audio CAPTCHAs easier to use by blind web users. Although they did not
adopt our entire interface, they solve the problem of the screen reader speaking over
the playing audio CAPTCHA in another way. A common solution is to automatically
focus the answer box after the play button is pressed and then delay the start of the
audio CAPTCHA by preceding it with several seconds of beeps. This prevents the
screen reader from talking over the playing CAPTCHA.
• TrailBlazer - A version of the TrailBlazer interface for accessible script playback has
been released as a feature of CoScripter. WebAnywhere now includes limited support
for macro recording and playback. We plan for both tools to include better support in the
future and study how these capabilities are used “in the wild.” WebinSitu clearly
demonstrated the differences in browsing efficiency of blind and sighted participants.
This difference is one reason we believe script playback and suggestions might be
especially beneficial for blind users. As the web becomes more complex, similar tools
may become increasingly useful for sighted users as well.
• WebAnywhere - WebAnywhere has already been released for nearly a year as discussed in Chapter 8. Multiple parties have contributed to the open source project,
both to improve and add desired features, and as a platform for their own research.
We believe WebAnywhere is a harbinger of a delivery model especially promising
for access technology. Hosting technology remotely and reusing mainstream devices
to deliver it means disabled users are in control and do not need to rely on computer
administrators who may be ill-equipped to provide the access technology they need.
9.1.4 Summary of Technical Contributions
We have shown that tools that record, play back, and predict web actions can both (i) help
us understand the problems faced by blind web users and (ii) enable blind web users to
improve the accessibility, usability, and availability of web content. We have created tools
that implement these ideas, several of which people are actively using.
9.2 Future Directions
Throughout most of its existence, the web has been an untamed, loosely-connected collection
of documents. As the web continues to evolve, we see it transitioning away from the current
model in which users are responsible for finding what they want to one in which the web
itself takes a more active role in enabling what users want. The research and directions
explored within this dissertation suggest numerous opportunities for future work.
9.2.1 Improving Web Interactions by Predicting Actions
Both WebAnywhere (Chapter 7) and TrailBlazer (Chapter 6) use predictive models of web
actions to improve the web interface. Predicting actions within web pages has not been
adequately explored prior to this work and has the potential to improve interactions in
diverse scenarios and on many different types of devices. For instance, if the web browser
could predict which parts of a complex web page you might be interested in, it could
highlight those parts of the page, providing easy access. A browser on a small screen device
could provide easy access to its top 5 predictions. Prefetching based on predicted actions
within a web application could also be useful for any application that updates its content.
9.2.2 Extending to New Technologies
A problem with accessibility is that nearly as soon as a new technology is made accessible it
is eclipsed by another. This has already happened with the web, as the static web pages
created using only HTML have started to give way to dynamic web applications powered
by Javascript. While the particular technologies used will change, the ideas presented here
are likely to translate. The challenge is to build technologies that can adapt faster and with
less modification.
Currently, the web is becoming more closed and more resistant to end user personalization. For instance, Flash and Silverlight are increasingly popular formats for web content.
While the long-term success of either format is uncertain, these examples suggest that whatever comes next may be more closed, representing a challenge not only for accessibility but
also for end user web programming in general.
9.2.3 More Available Access
This dissertation has motivated availability as a new dimension on which to consider access.
WebAnywhere seeks to improve the availability of access by exposing a base level of access on
any computer with web access. Future work may look to explore improving the availability
in other ways. For instance, in some situations, people don’t have access to a computer,
but they do have access to a smart phone. Can we build in access on smart phones? Even
more people have access to a simple cellphone that does not provide data services. Can we
provide access to the web using only the voice channel?
Increasing the diversity of devices on which people can access web content can help
realize the potential of the web for everyone. Building software, such as web applications,
that can work on a variety of common, mainstream devices can help maximize potential
impact. Another option may be to build non-visual access into the server and enable anyone
to access their computer or the web with a phone call. Most importantly, it is increasingly
not enough to consider whether the web is possible to use, or even whether the experience
is usable; we also need to consider the availability of tools enabling access.
9.2.4 Longitudinal Field Studies in a Living Laboratory
WebAnywhere offers not only the opportunity to provide web access to many people who
might not have access; it also provides the opportunity to study how a large number of
people browse the web using the tool and to iterate on tools to help them improve access.
We have recently added the ability to record and play back tasks on the
web to WebAnywhere and plan to use this to help test the ability of trails to help web users
more easily complete web tasks.
For the WebinSitu study (Chapter 3), users had to explicitly sign up and were paid
for participation. The large number of visitors coming to WebAnywhere every day offer
a rich resource to better understand what is working and what is not without requiring
participants to use or configure a new tool. We think conducting studies and iterating over
designs in this environment over longer periods of time is a powerful direction for future
research.
9.3 Final Remarks
This dissertation has explored the following thesis:
With intelligent interfaces supporting them, blind end users can collaboratively and effectively improve the accessibility, usability, and availability of their own web access.
Promoting blind web users as active participants in the development of accessible content
represents a paradigm shift in access technology, demonstrating a new role that disabled
users can play in access technology. We hope this work will be a valuable example to researchers, developers, and practitioners. In the design of new tools, disabled people should
be seen as effective partners in creating a more accessible experience for everyone. This dissertation represents important first steps in this direction, revealing significant opportunities
for new research.
BIBLIOGRAPHY
[1] A-prompt. Adaptive Technology Resource Centre (ATRC) and the TRACE Center
at the University of Wisconsin. http://www.aprompt.ca/. Accessed April 17, 2007.
[2] Accessibility is a right foundation. http://accessibilityisaright.org/. Accessed May 28,
2009.
[3] Adobe Shockwave and Flash players: Adoption statistics. Adobe. http://www.adobe.com/products/player_census/. Accessed June 15, 2007.
[4] Alexa web search – data services, 2008. http://www.alexa.com. Accessed June 15,
2008.
[5] Corin Anderson, Pedro Domingos, and Daniel S. Weld. Web site personalizers for
mobile devices. In IJCAI Workshop on Intelligent Techniques for Web Personalization
(ITWP), 2001.
[6] Abdullah Arif. PHProxy. http://whitefyre.com/poxy/. Accessed June 7, 2007.
[7] Chieko Asakawa and Hironobu Takagi. Web Accessibility: A Foundation for Research,
chapter Transcoding. Springer, 2008.
[8] Richard Atterer. Logging usage of ajax applications with the usaproxy http proxy.
In Proceedings of the WWW 2006 Workshop on Logging Traces of Web Activity: The
Mechanics of Data Collection, 2006.
[9] Richard Atterer, Monika Wnuk, and Albrecht Schmidt. Knowing the user’s every
move - user activity tracking for website usability evaluation and implicit interaction.
In Proceedings of the 15th International Conference on World Wide Web (WWW ’06),
pages 203–212, New York, NY, 2006.
[10] Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison Wesley Longman, 1999.
[11] Sean Bechhofer, Carole Goble, Leslie Carr, Simon Kampa, Wendy Hall, and Dave
De Roure. COHSE: Conceptual open hypermedia service. Frontiers in Artificial Intelligence and Applications, 96, 2003.
[12] Leean Bent and Geoff Voelker. Whole page performance. In Proceedings of the 7th
annual Web Caching Workshop, June 2002.
[13] John Carlo Bertot, Charles R. McClure, Paul T. Jaeger, and Joe Ryan. Public libraries
and the internet 2006: Study results and findings. Technical report, Information Use
Management and Policy Institute, Florida State University, September 2006.
[14] Jeffrey P. Bigham. Increasing web accessibility by automatically judging alternative
text quality. In Proceedings of the 12th International Conference on Intelligent user
interfaces (IUI ’07), New York, NY, USA, 2007. ACM Press.
[15] Jeffrey P. Bigham, Maxwell B. Aller, Jeremy T. Brudvik, Jessica O. Leung, Lindsay A.
Yazzolino, and Richard Ladner. Inspiring blind high school students to pursue computer science with instant messaging chatbots. In Proceedings of the 39th SIGCSE
technical symposium on Computer science education (SIGCSE ’08), Portland, OR,
USA, 2008.
[16] Jeffrey P. Bigham and Anna C. Cavender. Evaluating existing audio CAPTCHAs and
an interface optimized for non-visual use. In Proceedings of the ACM Conference on
Human Factors in Computing Systems (CHI 2009), pages 1829–1838, Boston, MA,
USA.
[17] Jeffrey P. Bigham, Anna C. Cavender, Jeremy T. Brudvik, Jacob O. Wobbrock, and
Richard Ladner. WebinSitu: A comparative analysis of blind and sighted browsing
behavior. In Proceedings of the 9th International ACM SIGACCESS Conference on
Computers and Accessibility (ASSETS ’07), pages 51–58, Tempe, AZ, USA.
[18] Jeffrey P. Bigham, Ryan S. Kaminsky, Richard Ladner, Oscar M. Danielsson, and
Gordon L. Hempton. WebInSight: Making web images accessible. In Proceedings
of 8th International ACM SIGACCESS Conference on Computers and Accessibility
(ASSETS ’06), pages 181–188, Portland, Oregon, 2006.
[19] Jeffrey P. Bigham and Richard E. Ladner. Accessmonkey: A Collaborative Scripting
Framework for Web Users and Developers. In Proceedings of the 4th International
Cross-Disciplinary Conference on Web Accessibility (W4A 2007), pages 25–34, Banff,
Canada, 2007.
[20] Jeffrey P. Bigham, Tessa Lau, and Jeffrey Nichols. TrailBlazer: Enabling blind users
to blaze trails through the web. In Proceedings of the 12th International Conference
on Intelligent User Interfaces (IUI 2009), Sanibel Island, FL, USA, 2009.
[21] Jeffrey P. Bigham and Craig M. Prince. WebAnywhere: A screen reader on-the-go.
In Proceedings of the 9th International ACM SIGACCESS Conference on Computers
and accessibility (ASSETS ’07), pages 225–226, New York, NY, USA, 2007. ACM.
[22] Jeffrey P. Bigham, Craig M. Prince, and Richard E. Ladner. WebAnywhere: A screen
reader on-the-go. In Proceedings of the International Cross-Disciplinary Conference
on Web Accessibility (W4A 2008), pages 73–82, Beijing, China, 2008.
[23] Charles M. Blow. Two little boys. http://blow.blogs.nytimes.com/2009/04/24/two-little-boys/. Accessed May 15, 2009.
[24] Michael Bolin, Matthew Webber, Philip Rha, Tom Wilson, and Robert C. Miller.
Automation and customization of rendered web pages. In Proceedings of the 18th
annual ACM symposium on User interface software and technology (UIST ’05), pages
163–172, Seattle, WA, USA, 2005.
[25] Braille Sense. GW Micro. http://www.gwmicro.com/Braille_Sense/. Accessed April 12, 2007.
[26] Lee Breslau, Pei Cao, Li Fan, Graham Phillips, and Scott Shenker. Web caching and
zipf-like distributions: Evidence and implications. In INFOCOM (1), pages 126–134,
1999.
[27] A.J. Bernheim Brush, Morgan Ames, and Janet Davis. A comparison of synchronous
remote and local usability studies for an expert interface. In CHI ’04 extended abstracts
on Human factors in computing systems, pages 1179–1182, New York, NY, USA, 2004.
ACM Press.
[28] Michael D. Byrne, Bonnie E. John, Neil S. Wehrle, and David C. Crow. The tangled
web we wove: A taxonomy of www use. In Proceedings of the Conference on Human
Factors in Computing Systems (CHI ’99), 1999.
[29] Kumar Chellapilla, Kevin Larson, Patrice Y. Simard, and Mary Czerwinski. Designing
human friendly human interaction proofs (hips). In Proceedings of Computer Human
Interaction (CHI ’05), pages 711–720, 2005.
[30] Charles Chen. Fire Vox: A screen reader Firefox extension. http://firevox.clcworld.net/. Accessed July 23, 2007.
[31] Joe Clark. Reader’s guide to Sydney Olympics accessibility complaint, 2001. http://www.contenu.nu/socog.html. Accessed May 15, 2007.
[32] Mark Claypool, Phong Le, Makoto Wased, and David Brown. Implicit interest indicators. In Proceedings of the 6th International Conference on Intelligent User Interfaces
(IUI 2001), pages 33–40, New York, NY, USA, 2001.
[33] A.M. Collins and E.F. Loftus. A spreading activation theory of semantic processing.
Psychological Review, 82:407–428, 1975.
[34] Disability Rights Commission. The web: Access and inclusion for disabled people.
The Stationery Office, 2004.
[35] Kara Pernice Coyne and Jakob Nielsen. Beyond alt text: Making the web easy to use
for users with disabilities, 2001.
[36] Timothy C. Craven. Some features of alt text associated with images in web pages.
Information Research, 11, 2006.
[37] Allen Cypher. Eager: programming repetitive tasks by example. In Proceedings of the
SIGCHI Conference on Human factors in computing systems (CHI ’91), pages 33–39,
New Orleans, Louisiana, United States, 1991.
[38] O. De Troyer and C. Leune. Wsdm: A user-centered design method for web sites. In
Proceedings of the Seventh International World Wide Web Conference, pages 85–94,
1998.
[39] D. Diaper and L. Worman. Two falls out of three in the automated accessibility
assessment of world wide web sites: A-prompt v. bobby. People and Computers
XVII, pages 349–363, 2003.
[40] Email2me. Across Communications. http://www.email2phone.net/. Accessed February 9, 2007.
[41] Alexander Faaborg and Henry Lieberman. A goal-oriented web browser. In Proceedings of the SIGCHI Conference on Human Factors in computing systems (CHI 2006),
pages 751–760, 2006.
[42] Firefox accessibility extension. Illinois Center for Information Technology. Accessed
July 23, 2007.
[43] G. E. P. Box and D. R. Cox. An analysis of transformations. Journal of the Royal Statistical Society, Series B(26):211–246, June 1964.
[44] Philip Brighten Godfrey. Text-based captcha algorithms. In First Workshop on Human Interactive Proofs. Unpublished manuscript, 2002. http://www.aladdin.cs.cmu.edu/hips/events/abs/godfreyb_abstract.pdf.
[45] Jeremy Goecks and Jude Shavlik. Learning users’ interests by unobtrusively observing
their normal behavior. In Proceedings of the 5th International Conference on Intelligent user interfaces (IUI 2000), pages 129–132, New York, NY, USA, 2000. ACM
Press.
[46] Martin Gonzalez. Automatic data-gathering agents for remote navigability testing.
IEEE Software, 19(6):78–85, 2002.
[47] Martin Gonzalez, Marcos Gonzalez, Cristobal Rivera, Ignacio Pintado, and Agueda
Vidau. Testing web navigation for all: An agent-based approach. In Proceedings
of 10th International Conference on Computers Helping People with Special Needs
(ICCHP 2006), volume 4061 of Lecture Notes in Computer Science, pages 223–228,
Berlin, Germany, 2006. Springer.
[48] GOOG-411. Google Labs. http://labs.google.com/goog411/. Accessed February 7,
2007.
[49] Google-AxsJAX. http://code.google.com/p/google-axsjax/. Accessed April 15, 2009.
[50] Google analytics. http://analytics.google.com/. Accessed February 12, 2009.
[51] Greasemonkey Firefox Extension. http://www.greasespot.net/. Accessed June 4,
2009.
[52] Oystein Hallaraker and Giovanni Vigna. Detecting malicious javascript code in
mozilla. In Proceedings of the 10th IEEE International Conference on Engineering
of Complex Computer Systems (ICECCS 2005), pages 85–94, Washington, DC, USA,
2005. IEEE Computer Society.
[53] Simon Harper, Carole Goble, Robert Stevens, and Yeliz Yesilada. Middleware to
expand context and preview in hypertext. In Proceedings of the 6th International
ACM SIGACCESS Conference on Computers and accessibility (ASSETS 2004), pages
63–70, New York, NY, USA, 2004. ACM Press.
[54] Simon Harper and Neha Patel. Gist summaries for visually impaired surfers. In
Proceedings of the 7th International ACM SIGACCESS Conference on Computers
and Accessibility (ASSETS 2005), pages 90–97, New York, NY, USA, 2005. ACM
Press.
[55] Simon Harper, Yeliz Yesilada, Carole Goble, and Robert Stevens. How much is too
much in a hypertext link?: investigating context and preview – a formative evaluation.
In Proceedings of the 15th Conference on Hypertext and hypermedia (HYPERTEXT
2004), pages 116–125, Santa Cruz, CA, USA, 2004.
[56] Jonathan Holman, Jonathan Lazar, Jinjuan Heidi Feng, and John D’Arcy. Developing
usable captchas for blind users. In Proceedings of the 9th International ACM SIGACCESS Conference on Computers and accessibility (ASSETS 2007), pages 245–246,
New York, NY, USA, 2007.
[57] Jason I. Hong, Jeffrey Heer, Sarah Waterson, and James A. Landay. Webquilt:
A proxy-based approach to remote web usability testing. Information Systems,
19(3):263–285, 2001.
[58] Jason I. Hong and James A. Landay. Webquilt: a framework for capturing and
visualizing the web experience. In Proceedings of the 10th International Conference
on the World Wide Web (WWW 2001), pages 717–724, 2001.
[59] Eric Horvitz. Principles of mixed-initiative user interfaces. In Proceedings of the
SIGCHI Conference on Human factors in computing systems (CHI ’99), pages 159–
166, New York, NY, USA, 1999. ACM Press.
[60] Anita W. Huang and Neel Sundaresan. A semantic transcoding system to adapt web
services for users with disabilities. In Proceedings of the fourth International ACM
Conference on Assistive technologies (Assets 2000), pages 156–163, New York, NY,
USA, 2000. ACM Press.
[61] Gennaro Iaccarino, Delfina Malandrino, and Vittorio Scarano. Efficient edge-services
for colorblind users. In Proceedings of the 15th International Conference on World
Wide Web (WWW 2006), pages 919–920, New York, NY, 2006. ACM Press.
[62] Gennaro Iaccarino, Delfina Malandrino, and Vittorio Scarano. Personalizable edge services for web accessibility. In Proceedings of the 2006 International cross-disciplinary
workshop on Web accessibility (W4A 2006), pages 23–32, New York, NY, USA, 2006.
ACM Press.
[63] IBM alphaWorks’ aDesigner. http://www.alphaworks.ibm.com/tech/adesigner. Accessed May 15, 2007.
[64] IBM home page reader. http://www-03.ibm.com/able/. Accessed May 15, 2009.
[65] Melody Y. Ivory. Automated Web Site Evaluation: Researchers’ and Practitioners’
Perspectives. Kluwer Academic Publishers, 2003.
[66] H. Jung, J. Allen, N. Chambers, L. Galescu, M. Swift, and W. Taysom. One-shot procedure learning from instruction and observation. In Proceedings of the International
FLAIRS Conference: Special Track on Natural Language and Knowledge Representation.
[67] Shinya Kawanaka, Yevgen Borodin, Jeffrey P. Bigham, Darren Lunn, Hironobu Takagi, and Chieko Asakawa. Accessibility commons: a metadata infrastructure for web
accessibility. In Proceedings of the 10th International ACM SIGACCESS Conference
on Computers and accessibility (ASSETS 2008), pages 153–160, Halifax, Nova Scotia,
Canada, 2008.
[68] Caitlin Kelleher and Randy Pausch. Stencils-based tutorials: design and evaluation.
In Proceedings of the SIGCHI Conference on Human factors in computing systems
(CHI 2005), pages 541–550, Portland, Oregon, USA, 2005.
[69] Brian Kelly, David Sloan, Lawrie Phipps, Helen Petrie, and Fraser Hamilton. Forcing
standardization or accommodating diversity?: a framework for applying the wcag in
the real world. In Proceedings of the 2005 International Cross-Disciplinary Workshop
on Web Accessibility (W4A 2005), pages 46–54, New York, NY, USA, 2005. ACM
Press.
[70] Greg Kochanski, Daniel Lopresti, and Chilin Shih. A reverse turing test using speech.
In Proceedings of the International Conference on Spoken Language Processing (ICSLP 2002), pages 1357–1360, 2002.
[71] Mimika Koletsou and Geoff Voelker. The Medusa proxy: A tool for exploring user-perceived web performance. In Proceedings of the 6th annual Web Caching Workshop,
June 2001.
[72] Tom M. Kroeger, Darrell D. E. Long, and Jeffrey C. Mogul. Exploring the bounds
of web latency reduction from caching and prefetching. In USENIX Symposium on
Internet Technologies and Systems, 1997.
[73] Richard Ladner, Melody Y. Ivory, Raj Rao, Sheryl Burgstahler, Dan Comden,
Sangyun Hahn, Mathew Renzelmann, Satria Krisnandi, Mahalakshmi Ramasamy,
Beverly Slabosky, Andrew Martin, Amelia Lacenski, Stuart Olsen, and Dmitri Groce.
Automating tactile graphics translation. In Proceedings of the Seventh International
ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2005), pages
50–57, New York, NY, 2005. ACM Press.
[74] Jonathan Lazar, Jinjuan Feng, and Aaron Allen. Determining the impact of computer
frustration on the mood of blind users browsing the web. In Proceedings of the 8th International ACM SIGACCESS Conference on Computers and accessibility (ASSETS
2006), pages 149–156, New York, NY, USA, 2006. ACM Press.
[75] Gilly Leshed, Eben M. Haber, Tara Matthews, and Tessa Lau. Coscripter: Automating & sharing how-to knowledge in the enterprise. In Proceedings of the 26th SIGCHI
Conference on Human Factors in Computing Systems (CHI 2008), pages 1719–1728,
Florence, Italy, 2008.
[76] David D. Lewis. Naive bayes at forty: The independence assumption in information
retrieval. In Proceedings of ECML-98, 10th European Conference on Machine Learning, number 1398, pages 4–15, Chemnitz, DE, 1998. Springer Verlag, Heidelberg, DE.
[77] Lift. UsableNet, 2006. http://www.usablenet.com/. Accessed April 15, 2009.
[78] Linux screen reader (LSR). http://live.gnome.org/LSR. Accessed February 17, 2007.
[79] R. C. Littell, G. A. Milliken, W. W. Stroup, and R. D. Wolfinger. SAS System for
Mixed Models. SAS Institute, Inc., Cary, North Carolina, USA, 1996.
[80] Greg Little, Tessa Lau, Allen Cypher, James Lin, Eben M. Haber, and Eser Kandogan. Koala: capture, share, automate, personalize business processes on the web. In
Proceedings of the SIGCHI Conference on Human factors in computing systems (CHI
2007), pages 943–946, 2007.
[81] Greg Little and Robert C. Miller. Translating keyword commands into executable
code. In Proceedings of the 19th annual ACM symposium on User interface software
and technology (UIST 2006), pages 135–144, New York, NY, USA, 2006. ACM Press.
[82] JAWS 8.0 for Windows. Freedom Scientific. http://www.freedomscientific.com. Accessed May 4, 2007.
[83] I. Scott MacKenzie and Colin Ware. Lag as a determinant of human performance
in interactive systems. In Proceedings of the INTERACT ’93 and CHI ’93 Conference on Human Factors in Computing Systems, pages 488–493, New York, NY, USA, 1993.
ACM Press.
[84] Jalal Mahmud, Yevgen Borodin, and I.V. Ramakrishnan. Csurf: A context-driven
non-visual web-browser. In Proceedings of the International Conference on the World
Wide Web (WWW 2007), pages 31–40.
[85] Jennifer Mankoff, Holly Fait, and Tu Tran. Is your web page accessible?: a comparative study of methods for assessing web page accessibility for the blind. In Proceedings
of the SIGCHI Conference on Human factors in computing systems (CHI 2005), pages
41–50, New York, NY, USA, 2005.
[86] Robert C. Miller and B. Myers. Creating dynamic world wide web pages by demonstration, 1997.
[87] Hisashi Miyashita, Daisuke Sato, Hironobu Takagi, and Chieko Asakawa. Aibrowser
for multimedia: introducing multimedia content accessibility for visually impaired
users. In Proceedings of the 9th International ACM SIGACCESS Conference on Computers and accessibility (ASSETS 2007), pages 91–98, New York, NY, USA, 2007.
ACM.
[88] Mobile Speak Pocket. Code Factory. http://www.codefactory.es/mobile_speak_pocket/.
Accessed February 7, 2009.
[89] Paul Mockapetris. RFC 1034 Domain Names - Concepts and Facilities. Network
Working Group, November 1987. http://tools.ietf.org/html/rfc1034.
[90] Saikat Mukherjee, I.V. Ramakrishnan, and A. Singh. Bootstrapping semantic annotation for content-rich html documents. In Proceedings of the International Conference
on Data Engineering (ICDE 2005), 2005.
[91] Saikat Mukherjee, Guizhen Yang, Wenfang Tan, and I.V. Ramakrishnan. Automatic
discovery of semantic structures in html documents. In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR 2003), 2003.
[92] National Association for the Blind, India. http://www.nabindia.com/. Accessed July
23, 2007.
[93] National Federation of the Blind v. Target Corporation. U.S. District Court: Northern
District of California, 2006. No. C 06-01802 MHP.
[94] Jeffrey Nichols and Tessa Lau. Mobilization by demonstration: using traces to reauthor existing web sites. In Proceedings of the 13th International Conference on
Intelligent User Interfaces (IUI 2008), pages 149–158, New York, NY, USA, 2008.
[95] Jakob Nielsen, editor. Designing Web Usability: The Practice of Simplicity. New
Riders, 2000.
[96] NVDA screen reader. NV Access Inc., http://www.nvda-project.org/. Accessed
November 17, 2007.
[97] Orca: the gnome project. http://live.gnome.org/Orca. Accessed February 11, 2008.
[98] Venkata N. Padmanabhan and Jeffrey C. Mogul. Using predictive prefetching to
improve world wide web latency. SIGCOMM Comput. Commun. Rev., 26(3):22–36,
1996. ISSN 0146-4833.
[99] Joon S. Park and Ravi Sandhu. Secure cookies on the web. IEEE Internet Computing,
4(4):36–44, July 2000.
[100] Helen Petrie, Fraser Hamilton, and Neil King. Tension, what tension?: Website
accessibility and visual design. In Proceedings of the International cross-disciplinary
workshop on Web accessibility (W4A 2004), pages 13–18, New York, NY, USA, 2004.
ACM Press.
[101] Helen Petrie, Fraser Hamilton, Neil King, and Pete Pavan. Remote usability evaluations with disabled people. In Proceedings of the SIGCHI Conference on Human
Factors in computing systems (CHI 2006), pages 1133–1141, New York, NY, USA,
2006. ACM Press.
[102] Helen Petrie, Chandra Harrison, and Sundeep Dev. Describing images on the web:
a survey of current practice and prospects for the future. In Proceedings of Human
Computer Interaction International (HCII 2005), July 2005.
[103] Helen Petrie and Omar Kheir. The relationship between accessibility and usability of
websites. In Proceedings of the SIGCHI Conference on Human factors in computing
systems (CHI 2007), pages 397–406, San Jose, California, USA, 2007.
[104] Mark Pilgrim, editor. Greasemonkey Hacks: Tips & Tools for Remixing the Web with
Firefox. O’Reilly Media, 2005.
[105] Peter Plessers, Sven Casteleyn, Yeliz Yesilada, Olga De Troyer, Robert Stevens, Simon
Harper, and Carole Goble. Accessibility: a web engineering approach. In Proceedings
of the 14th International Conference on World Wide Web (WWW 2005), pages 353–
362, New York, NY, USA, 2005. ACM Press.
[106] I.V. Ramakrishnan, A. Stent, and Guizhen Yang. Hearsay: Enabling audio browsing
on hypertext content. In Proceedings of the 13th International Conference on the
World Wide Web (WWW 2004), 2004.
[107] T.V. Raman. Emacspeak—a speech interface. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’96), pages 66–71, Vancouver,
British Columbia, Canada, 1996.
[108] T.V. Raman. Auditory User Interfaces: Toward the Speaking Computer. Kluwer
Academic Publishers, Boston, MA, 1997.
[109] Charlie Reis, John Dunagan, Helen J. Wang, Opher Dubrovsky, and Saher Esmeir.
Browsershield: Vulnerability-driven filtering of dynamic html. In Proceedings of the
8th Symposium on Operating Systems Design and Implementation (OSDI 2006), 2006.
[110] Roadmap for accessible rich internet applications (wai-aria roadmap). World Wide
Web Consortium, 2007. http://www.w3.org/TR/wai-aria-roadmap/.
[111] Jesse Ruderman. The same origin policy, 2008. http://www.mozilla.org/projects/security/components/same-origin.html.
[112] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, Englewood Cliffs, NJ, 2nd edition, 2003.
[113] Alex Safonov. Web macros by example: users managing the www of applications. In
CHI ’99 extended abstracts on Human factors in computing systems (CHI ’99), pages
71–72, New York, NY, USA, 1999. ACM Press.
[114] Graig Sauer, Harry Hochheiser, Jinjuan Feng, and Jonathan Lazar. Towards a universally usable captcha. In Proceedings of the 4th Symposium On Usable Privacy and
Security (SOUPS 2008), Pittsburgh, PA, USA, 2008.
[115] Scott Schiller. Sound manager 2, 2007. http://www.schillmania.com/projects/soundmanager2/.
[116] C. Schuster and A. Von Eye. The relationship of anova models with random effects
and repeated measurement designs. Journal of Adolescent Research, 16(2):205–220,
2001.
[117] Scribd. http://www.scribd.com/. Accessed March 21, 2008.
[118] Ted Selker. Cognitive adaptive computer help (coach). In Proceedings of the International Conference on Artificial Intelligence, pages 25–34, IOS, Amsterdam, 1989.
[119] Serotek system access mobile. Serotek. http://www.serotek.com/. Accessed November
7, 2007.
[120] David C. Howell. Statistical Methods for Psychology. PWS-KENT Publishing Company, Boston, third
edition, 1992.
[121] Hironobu Takagi, Shinya Kawanaka, Masatomo Kobayashi, Takashi Itoh, and Chieko
Asakawa. Social accessibility: achieving accessibility through collaborative metadata
authoring. In Proceedings of the 10th International ACM SIGACCESS Conference
on Computers and accessibility (ASSETS 2008), pages 193–200, Halifax, Nova Scotia,
Canada, 2008.
[122] Hironobu Takagi, Shin Saito, Kentarou Fukuda, and Chieko Asakawa. Analysis of
navigability of web applications for improving blind usability. ACM Transactions on
Computer-Human Interaction, 14(3):13, 2007.
[123] Talklets. Hidden Differences Group. http://www.talklets.com/. Accessed April 7,
2007.
[124] Jennifer Tam, Jiri Simsa, David Huggins-Daines, Luis von Ahn, and Manuel Blum.
Improving audio captchas. In Proceedings of the 4th Symposium on Usable Privacy
and Security (SOUPS 2008), Pittsburgh, PA, USA, July 2008.
[125] Paul A. Taylor, Alan W. Black, and Richard J. Caley. The architecture of the
festival speech synthesis system. In Proceedings of the 3rd International Workshop on
Speech Synthesis, Sydney, Australia, November 1998.
[126] Jim Thatcher, Paul Bohman, Michael Burks, Shawn Henry, Bob Regan, Sarah
Swierenga, Mark D. Urban, and Cynthia D. Waddell. Constructing Accessible Web
Sites. glasshaus Ltd., Birmingham, UK, 2002.
[127] Thunder screenreader. http://www.screenreader.net/. Accessed February 16, 2007.
[128] Turnabout. Reify Software. http://www.reifysoft.com/turnabout.php. Accessed June
7, 2006.
[129] Scott R. Turner. Platypus firefox extension, 2006. http://platypus.mozdev.org/.
[130] Gregg Vanderheiden. Fundamental principles and priority setting for universal usability. In Proceedings on the 2000 Conference on Universal Usability (CUU 2000), pages
32–37, Arlington, Virginia, United States, 2000.
[131] Voice Extensible Markup Language (VoiceXML) 2.1. http://www.w3.org/TR/voicexml21/. Accessed April 6, 2007.
[132] VoiceOver: Macintosh OS X. http://www.apple.com/accessibility/voiceover/. Accessed April 5, 2007.
[133] Luis von Ahn, Manuel Blum, and John Langford. Telling humans and computers
apart automatically: How lazy cryptographers do ai. Communications of the ACM,
47(2):57–60, February 2004.
[134] Luis von Ahn and Laura Dabbish. Labeling images with a computer game. In Proceedings of the SIGCHI Conference on Human Factors in computing systems (CHI
2004), April 2004.
[135] Luis von Ahn, Shiry Ginosar, Mihir Kedia, Ruoran Liu, and Manuel Blum. Improving
accessibility of the web with a computer game. In Proceedings of the SIGCHI Conference on Human Factors in computing systems (CHI 2006), pages 79–82, New York,
NY, USA, 2006. ACM Press.
[136] Michael Vorburger. Altifier: Web accessibility enhancement tool, 1999.
[137] Howie Wang. Nextplease!, 2006. http://nextplease.mozdev.org/.
[138] Takayuki Watanabe. Experimental evaluation of usability and accessibility of heading
elements. In Proceedings of the International Cross-Disciplinary Conference on Web
Accessibility (W4A 2007), pages 157–164, 2007.
[139] W3C markup validation service v0.7.4. http://validator.w3.org/. Accessed November
11, 2006.
[140] Watchfire bobby. http://www.watchfire.com/products/webxm/bobby.aspx. Accessed
April 7, 2007.
[141] Web accessibility checker. University of Toronto Adaptive Technology Resource Centre
(ATRC), 2006. http://checker.atrc.utoronto.ca/. Accessed March 15, 2007.
[142] Web Content Accessibility Guidelines 2.0 (wcag 2.0). World Wide Web Consortium.
http://www.w3.org/TR/WCAG20/.
[143] Web content accessibility guidelines 1.0 (WCAG 1.0). World Wide Web Consortium,
1999.
[144] WebVisum firefox extension, 2008. http://www.webvisum.com/.
[145] Ryen White and Steven M. Drucker. Investigating behavioral variability in web search.
In Proceedings of the International Conference on the World Wide Web (WWW 2007),
2007.
[146] Window-eyes. GW Micro. http://www.gwmicro.com/Window-Eyes/. Accessed April
3, 2008.
[147] Windows Narrator: Microsoft’s Windows XP and Vista, 2008. http://www.microsoft.com/enable/training/windowsxp/narratorturnon.aspx.
[148] Yahoo accessibility improvement petition. http://www.petitiononline.com/yabvipma/.
Accessed September 3, 2008.
[149] Yeliz Yesilada, Simon Harper, Carole Goble, and Robert Stevens. Screen readers
cannot see (ontology based semantic annotation for visually impaired web travellers).
In Proceedings of the 4th International Conference on Web Engineering (ICWE 2004),
pages 445–458, 2004.
[150] Yeliz Yesilada, Robert Stevens, and Carole Goble. A foundation for tool based mobility
support for visually impaired web users. In Proceedings of the 12th International
Conference on World Wide Web (WWW 2003), pages 422–430, New York, NY, USA,
2003. ACM Press.
VITA
Jeffrey P. Bigham received his B.S.E. degree in Computer Science from Princeton University in 2003. Starting in fall 2003, he attended the University of Washington, where he
worked with Richard E. Ladner. He has won the Microsoft Imagine Cup Accessible Technology Award, the W4A Accessibility Challenge Delegate’s Award, the Andrew W. Mellon Foundation Award for Technology Collaboration, the NCTI Technology in the Works
Award, and the University of Washington College of Engineering Student Innovator Award
for Research. In 2008, he was awarded an Osberg Fellowship. He received his M.Sc. degree in 2005 and his Ph.D. in 2009, both in Computer Science and Engineering from the
University of Washington.