NORTHWESTERN UNIVERSITY
Perceptually-motivated Non-Photorealistic Graphics
A DISSERTATION
SUBMITTED TO THE GRADUATE SCHOOL
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
for the degree
DOCTOR OF PHILOSOPHY
Field of Computer Science
By
Holger Winnemöller
EVANSTON, ILLINOIS
December 2006
© Copyright by Holger Winnemöller 2006
All Rights Reserved
ABSTRACT
Perceptually-motivated Non-Photorealistic Graphics
Holger Winnemöller
At a high level, computer graphics deals with conveying information to an observer by visual
means. Generating realistic images for this task requires considerable time and computing
resources. Human vision faces the opposite challenge: to distill knowledge of the world from
a massive influx of visual information. It is reasonable to assume that synthetic images based
on human perception and tailored for a given task can (1) decrease image synthesis costs by
obviating a physically realistic lighting simulation, and (2) increase human task performance
by omitting superfluous detail and enhancing visually important features.
This dissertation argues that the connection between non-realistic depiction and human perception is a valuable tool to improve the effectiveness of computer-generated images to support
visual communication tasks, and conversely, to learn more about human perception of such images. Artists have capitalized on non-realistic imagery to great effect, and have become masters
of conveying complex and even abstract messages by visual means. The relatively new field of
non-photorealistic computer graphics attempts to harness artists’ implicit expertise by imitating
their visual styles, media, and tools, but only a few works move beyond such simulations to verify
the effectiveness of generated images with perceptual studies, or to investigate which stylistic
elements are effective for a given visual communication task.
This dissertation demonstrates the mutual beneficence of non-realistic computer graphics
and perception with two rendering frameworks and accompanying psychophysical studies:
(1) Inspired by low-level human perception, a novel image-based abstraction framework
simplifies and enhances images to make them easier to understand and remember.
(2) A non-realistic rendering framework generates isolated visual shape cues to study human perception of fast-moving objects.
The first framework leverages perception to increase effectiveness of (non-realistic) images
for visually-driven tasks, while the second framework uses non-realistic images to learn about
task-specific perception, thus closing the loop. As instances of the bi-directional connections between perception and non-realistic imagery, the frameworks illustrate numerous benefits including effectiveness (e.g. better recognition of abstractions versus photographs), high performance
(e.g. real-time image abstraction), and relevance (e.g. shape perception in non-impoverished
conditions).
Dedication
To my parents, for their unconditional love and support.
Acknowledgements
There are many people who can be blamed, to various degrees, for helping me get away
with a PhD:
My parents endowed me with a working brain, and always made sure that I use it to its full
potential. My sister, Martina, kept telling me to believe in myself, and I am starting to listen
to her. Angela has been my confidante and friend in good times and when things were rough,
which they were a bit.
Bruce Gooch, my advisor at Northwestern University, believed in my ideas, gave me the
freedom to pursue my goals, supported me generously wherever he could, and has been the
mentor that I had wanted for a long time. He also managed to give the graphics group a sense
of family and belonging. Without Jack Tumblin, I would never have come to Northwestern to
begin with. He made the first contact, invited me to come to Evanston as a scholar, and has been
supportive and interested even when I decided to work with Bruce. Talking to Jack reminds you
that there is always more to learn in life. Amy Gooch was one of my NPR contacts when I
was at the University of Cape Town (UCT), looking for a place to finish my PhD. She was helpful then, and
has helped me with papers, gruesome corrections, and good advice ever since. Bryan Pardo
graciously agreed to be on my PhD committee, and gave me much of his time and many helpful
suggestions for the dissertation. James Gain kindly offered to be my co-advisor at UCT when
all else failed. I am sure he would have done a fine job.
Ankit Mohan and Pin Ren have been my Evanston friends from the first day they welcomed
me in their office. Since that day, we’ve had many interesting, silly, and funny times together.
I’ll always remember our 48 hour retreat around Lake Michigan. Sven Olsen has been my
conspirator for the Videoabstraction project and a great companion during those long office
hours when everybody else was already sleeping. David Feng joined the team only later, but
quickly became an integral part of the crew. I miss having a worthy squash opponent. Marc
Nienhaus joined the graphics group as a post-doc and left three months later as a cool roommate
and good friend. I will also miss the rest of the graphics lab, Tom Lechner, Sangwon Lee,
Yolanda Rankin, Vidya Setlur, and Conrad Albrecht-Buehler, but I am sure that our paths will
cross in the times to come.
The rest of my family, especially my brother Ronald, and my friends in Germany and South
Africa have been a constant source of inspiration in my life. They have achieved so much, made
me so proud, and given me good reasons not to give up whenever times were tough. You know
who you are.
I would like to thank the many volunteers for my experiments, who were always patient,
courteous, and interested. I also owe thanks to Rosalee Wolfe and Karen Alkoby for the deaf
signing video; Douglas DeCarlo and Anthony Santella for proof-reading and supplying data-driven abstractions and eye-tracking data for the Videoabstraction project; as well as Marcy
Morris and James Bass for acquiring image permission from Cameron Diaz.
Table of Contents

ABSTRACT
Dedication
Acknowledgements
List of Tables
List of Figures

Chapter 1. Introduction
1.1. Realistic versus Non-realistic Graphics
1.2. The Art of Perception and the Perception of Art
1.3. Contributions

Chapter 2. General Related Work
2.1. Simple Error Metrics (Non-Perceptual)
2.2. Saliency
2.3. Visible Difference Predictors (VDPs)
2.4. Applications

Chapter 3. Real-Time Video Abstraction
3.1. Related Work
3.2. Human Visual System
3.3. Implementation
3.4. Experiments
3.5. Framework Results and Discussion
3.6. Summary

Chapter 4. An Experiment to Study Shape-from-X of Moving Objects
4.1. Introduction
4.2. Human Visual System
4.3. Related Work
4.4. Implementation
4.5. Procedure
4.6. Evaluation
4.7. Results and Discussion
4.8. Summary

Chapter 5. General Future Work
5.1. Vision Shortcuts
5.2. Discussion

Chapter 6. Conclusion
6.1. Conclusion drawn from Real-time Video Abstraction Chapter
6.2. Conclusions drawn from Shape-from-X Chapter
6.3. Summary

References

Appendix A. User-data for Videoabstraction Studies
Appendix B. User-data for Shape-from-X Study
Appendix C. Links for Selected Objects
List of Tables

2.1 Comparison Metrics
4.1 Within-“Aggregate Measure” Effects
4.2 Significance Analysis
A.1 Data for Videoabstraction Study 1
A.2 Data for Videoabstraction Study 2
B.1 Shading Data
B.2 Outline Data
B.3 Mixed Data
B.4 TexISO Data
B.5 TexNOI Data
B.6 Questionnaire Data
C.1 Internet references
List of Figures

1.1 Photorealistic Graphics
1.2 Realism vs. Non-Realism - Subway System
1.3 Perceptual Constancy
1.4 Realism in Art
2.1 Simple Error Metrics
3.1 Abstraction Example
3.2 Explicit Image Structure
3.3 Scale-Space Abstraction
3.4 Framework Overview
3.5 Linear vs. Non-linear Filtering
3.6 Diffusion Conduction Functions and Derivatives
3.7 Progressive Abstraction
3.8 Data-driven Abstraction
3.9 Painted Abstraction
3.10 Separable Bilateral Approximation
3.11 Center-Surround Cell Activation
3.12 DoG Edge Detection and Enhancement
3.13 DoG Parameter Variations
3.14 Edge Cleanup Passes
3.15 DoG vs. Canny Edges
3.16 IWB Effect
3.17 Computing Warp Fields
3.18 Luminance Quantization Parameters
3.19 Sample Images for Study 1
3.20 Sample Images from Study 2
3.21 Participant-data for Video Abstraction Experiments
3.22 Failure Case
3.23 Benefits for Vectorization
3.24 Automatic Indication
3.25 Motion Blur Examples
3.26 Motion Blur Result
4.1 Shape-from-X Cues
4.2 Left: Depth Ambiguity
4.3 Right: Tilt & Slant
4.4 Real-time Models
4.5 Experimental Setup
4.6 Display Modes
4.7 The First Version of the Experiment
4.8 Constructing Shapes
4.9 Shape Categories
4.10 Experiment Object Matrix
4.11 Mistaken Identity
4.12 Aggregate Comparison
4.13 Detailed Aggregate Measures
4.14 Detailed Aggregate Measures Histograms
5.1 Lifecycle of a Synthetic Image
5.2 Flicker Color Designs
5.3 Retinex Images
5.4 Originals for Retinex Images
5.5 Anomalous Motion
5.6 Deleted Contours
B.1 Questionnaire
CHAPTER 1
Introduction
This dissertation presents two rendering frameworks and validatory studies to demonstrate,
by example, the intimate connection between non-photorealistic (NPR) graphics and perception, and to show how the two research areas can form an effective and natural symbiosis.
While the notion of such a connection is not novel in itself, it is also not commonly leveraged, particularly within the NPR community. It is my hope that future researchers will adopt
the frameworks, methodologies, and experiments documented in this dissertation to the mutual
benefit of both communities.
To explain the origins of this connection and its significance, it is instructive to discuss
non-photorealism and then list the commonalities of non-photorealism and perception. Seeing
as non-photorealism is defined by exclusion (being not realistic) rather than by explicit goals, it
seems appropriate to look briefly at the historical contrast between realistic and non-realistic
graphics.
1.1. Realistic versus Non-realistic Graphics
Traditionally, the ultimate goal of computer graphics has been photo-realism: to generate
synthetic images that are indistinguishable from photographs [166, 63, 85, 28]. Today, this goal
has arguably been achieved. Given enough time and resources, synthetic renderers can generate
imagery that is indistinguishable from photographic images to the naked eye (Figure 1.1), and
models exist that simulate optical processes down to the level of individual photons [37].

Figure 1.1. Photorealistic Graphics. This image shows a state-of-the-art rendering of a synthetic scene (using POV-Ray 3.6). Notable realistic optical effects include: reflection, refraction, global illumination, depth-of-field, and lens distortion.— {By Gilles Tran, Public Domain. See Table C.1 for URLs to selected images.}

While this success does not foreclose further research to advance the number and types of optical
phenomena that can be modeled, or to improve efficiency, there are an increasing number of
researchers that question realism as the only viable goal for computer graphics. The question
these scientists ask is: What are the images we create used for?1

1 Pat Hanrahan, in his Eurographics 2005 keynote address, saw slideshow presentations at conferences as one of the main uses.

1.1.1. Depiction Purpose

“Because pictures always have a purpose, producing a picture is essentially an optimization process. Depiction consists in trying to make the picture that best satisfies the goals.” ([40], pg. 116, original emphasis). If this purpose is simulation of physical interaction between light and matter
(for research, realistic conceptualization, or entertainment) [162, 37, 165], then photo-realism
is a logically sound choice. If, on the other hand, the purpose is more general or abstract (to
convey an idea, to give directions, to explain a situation, to give an example) then photo-realism
may confuse the issue at hand through unnecessary specificity, visual clutter (masking), and
physical limitations. For example, the spatial layout map of a subway system does not include
every bend and corner (specificity) because only the stations and their relative positions are
of interest to the viewer (Figure 1.2). The map does not include all the buildings and streets
where the subway runs (visual clutter) because this would make it difficult to see the subway
paths. Lastly, the map could not have been captured in a single photograph (physical limitation)
because most parts of the subway system are underground and mutually hidden.
Figure 1.2. Realism vs. Non-Realism - Subway System. (a) An aerial photograph of London. This image is ill-suited to show the underground subway system covering the photographed area. (b) A schematic (non-realistic) map of part of the London subway system with a variety of abstractions/simplifications: All streets, buildings, and parks are omitted. Train-paths are drawn color-coded and so that angles are multiples of 45°. Train stations are symbolized by circles, indicating connections through connected circles. Other symbols list additional services offered at a given station.— {(a) by Andrew Bossi, GNU Free Documentation License. (b) after maps of Transport of London.}
1.1.2. Realistic Non-realism
Images generated with a specific purpose in mind could thus be called artistic, symbolic, stylistic, comprehensive, instructive, expressive, or communicative but, unfortunately, the rather
unimaginative term non-photorealism has become established. Perhaps because of this lack
of a purpose-statement, much of the research in non-photorealism has focussed, again, on realism instead. This dissertation uses a similar classification to Gooch and Gooch [61] who
identify three main areas of NPR research: (1) Natural media simulation; (2) Artistic Tools; and
(3) Artistic style simulation.
Natural media simulation.
Natural media simulation concerns itself with simulating
(realistically) the substance that is applied to an image (e.g. oil, acrylic, coal), the instruments
with which the substance is applied (e.g. brush, pencil, crayon), and the substrate to which the
substance is applied (e.g. canvas, paper) [61]. In all cases, the simulated media is intended to
produce surface marks that are indistinguishable from the real media [131, 170, 143, 32].
Artistic Tools.
Simulated media itself is of little practical use if it is not controlled by
some entity. Assisting users in creating images is therefore a worthwhile endeavor. Commercial
products, such as Photoshop or CorelDraw, provide a rich set of tools and functionality by repurposing standard input devices (mouse, keyboard, digital tablet). Other software and research
work assist users with technically challenging, tedious, or repetitive tasks [152, 190, 120, 72,
24], but ultimately the user still has to create the image and therefore make all decisions about
layout, design, placement, etc.
Artistic Styles.
The last category of NPR research takes inspiration from existing artistic
styles and attempts to automatically transform some data (usually geometric models or photographs) into images in a given artistic style. Examples of this work include the creation of
line drawings from three-dimensional models [170, 38, 86, 174], light-models for cartoon-like
shading [99, 25, 84, 173], and painterly systems from geometric models [110], videos [73, 68]
or photographs [176].
It should be noted that none of these systems presumes to create Art; they merely generate images that (as realistically as possible) resemble a particular artistic style. In short, much
of NPR research is still devoted to realistic picture (as opposed to photo) creation. There exist several noteworthy exceptions to this trend, which serve as inspiration and which I discuss
throughout this dissertation. Saito and Takahashi [142] increased the comprehensibility of images using G-Buffers. Gooch and Willemsen used a non-realistic virtual environment to study
virtual distance perception [60]. DeCarlo and Santella created meaningful abstractions guided
by eye-tracking data [36, 144]. Gooch et al. showed that illustrations of faces were more effective than photographs for facial recognition [62]. Raskar et al. visually enhanced images
with multi-flash hardware [138]. DeCarlo et al. facilitated shape perception with suggestive
contours [35]. It is clear from these citations that currently only a fairly small number of researchers are addressing NPR issues beyond realistic simulation of artistic media and styles.
1.1.3. Stylistic Effects and Effective Styles
So what is wrong with reproducing an artistic style? After all, artists have been very successful
at communicating abstract ideas, expressing emotions, triggering curiosity, and entertaining.
The answer is that there might be nothing wrong at all, but we cannot be sure. As Durand put
it, “[the] availability of this new variety of styles raises the important question of the choice
of an appropriate style, especially when clarity is paramount” ([40], pg. 120). Santella and
DeCarlo [144] argued that many NPR systems abstract imagery, but often without meaning
or purpose, other than stylistic. In Santella and DeCarlo’s experiment they found that lack of
shading and uniformly varying the level of detail in pure line drawings had no effect on the
number of viewers’ fixation points, whereas targeted control of detail “[. . . ] affected viewers in
a way that supports an interpretation of enhanced understanding.” ([144], pg.77).
Many authors of NPR systems motivate their work with the expressiveness and communicative benefits of stylistic imagery, but few go on to prove that transferring visual aspects
of a given style to their synthesized imagery satisfies these higher perceptual or cognitive
goals. Admittedly, some NPR systems do not stand to gain much from perceptual validation,
particularly artistic tools and natural media simulations. Such systems are better served with
the extreme programming method and physical validation, respectively. Most other NPR systems and the NPR community at large, however, stand to benefit from perceptual evaluation and
validation. The measures currently employed to compare different NPR algorithms and implementations are mainly taken from realistic graphics performance measures, chiefly frame-rates
dependent on geometric complexity or screen resolution. While such measures are suitable to
demonstrate performance enhancements for algorithmic improvements, they are ill suited to
objectively answer questions like: “Does this system capture the essence of an artistic style?”;
“Does this system help a user in detecting certain features of an image faster?”; or “How do
we know which style to choose to support a given perceptual task?”
Several authors, including myself, believe that the answers to these questions lie in perception. Seeing as most visual art is conceived through visual experience and expressed through an
artistic process that heavily relies on feedback from the human visual system (HVS), it is likely
that (1) Perception is a large influence in the creation of Art; and (2) the analysis of artistic
principles may lead to insights on human perception. The following Section discusses these
connections between perception and art (and by extension NPR).

Figure 1.3. Perceptual Constancy. (a) Size Constancy: Objects at a distance or reflections in a mirror produce greatly reduced retinal images, yet they are perceived as being of a normal size (e.g. a train at a distance does not appear to be a toy-train; it appears as a normal-sized train at a distance). (b) Shape Constancy: Although the two views of the bunny generate radically different retinal images, they are nonetheless perceived as depicting the same bunny. Note that shape constancy does not require an observer to have seen any particular view previously.— {Bunny model courtesy of Stanford University Computer Graphics Laboratory.}
1.2. The Art of Perception and the Perception of Art
The neurobiologist Semir Zeki claims that “[. . . ] the overall function of art is an extension
of the function of the brain” ([189], pg. 76). More specifically, Zeki defines the function of the
(visual) brain as the “[. . . ] search for constancies with the aim of obtaining knowledge about
the world” (pg. 79). Similarly, he defines the general function of art as a “[. . . ] search for the
constant, lasting, essential, and enduring features of objects, surfaces, situations, and so on.”
(pg. 79).
1.2.1. Constancy
Why is constancy so important? In the words of Durand, “The notion of invariants and constancy are crucial in studying vision and the complex dualism of pictures. Invariants are intrinsic
properties of scenes or objects, such as reflectance, as opposed to accidental extrinsic properties
such as outgoing light that vary with, e.g., the lighting condition or the viewpoint. Constancy is
the ability to discount the accidental conditions and to extract invariants.” ([40], pg. 113, original emphasis). There are many examples of perceptual constancy (Figure 1.3): color constancy
allows us to see a green apple as green, regardless of whether we encounter it during an orange
sunset or in a fluorescently-lit room. Size constancy allows us to subjectively perceive our own
reflection in a mirror as normal-sized, although the dimensions of the reflection are objectively
halved. Shape constancy permits objects to be recognized from a variety of viewpoints, even
novel ones that have not been experienced before.
Not surprisingly, the notion of intrinsic and extrinsic properties has had a profound impact
on the evolution of art. For example, the Dutch Golden Age of the 17th century focussed on
high detail and realism, whereas many of the modern artistic styles, like cubism, pointillism,
fauvism, and expressionism focussed instead on cognitive and perceptual aspects of depiction.
The difference between the realistic and expressionistic art forms (Figure 1.4) “[. . . ] can also
be stated in terms of depicting ’what I see’ (extrinsic) as opposed to depicting ’what I know’
(intrinsic).” ([40], pg. 113).

Figure 1.4. Realism in Art. Two approaches to engage a viewer. (a) This painting, called Escaping Criticism (1874), by Pere Borrell de Caso is an example of a trompe l’oeil, a work of art that is so realistic that it tricks the observer into believing that the depicted scene exists in reality. (b) This Portrait of Dr. Gachet (1890) by Vincent van Gogh shortly before his suicide employs various stylistic elements like visible brush-strokes, contrasting colors, and symbolism (the foxglove was used for medical cures and thus attributes Gachet).— {Both images in public domain.}
1.2.2. Goals of Art and Vision
Focussing again on vision, Gregory believes that, “[. . . ] perception involves going beyond
the immediately given evidence of the senses: this evidence is assessed on many grounds and
23
(a) Photorealistic Painting
(b) Expressionistic Painting
Figure 1.4. Realism in Art. Two approaches to engage a viewer. (a) This painting, called Escaping Criticism (1874), by Pere Borrell de Caso is an example of
a trompe l’oeil, a work of art that is so realistic that it tricks the observer into
beliving that the depicted scene exists in reality. (b) This Portrait of Dr. Gachet
(1890) by Vincent van Gogh shortly before his suicide employs various stylistic elements like visible brush-strokes, contrasting colors, and symbolism (the
foxglove was used for medical cures and thus attributes Gachet)— {Both images in
public domain.}
generally we make the best bet, and see things more or less correctly. But the senses do not give
us a picture of the world directly; rather they provide evidence for the checking of hypothesis
about what lies before us. Indeed, we may say that the perception of an object is an hypothesis,
suggested and tested by sensory data” ([65], p. 13). The process of seeing is therefore not just
a passive absorption of electromagnetic radiation, but an active, highly complex, and parallel
search2 to gain knowledge from our visual surroundings. It appears then, that many of the
goals of art and perception are similar - “[. . . ] the brain must discount much of the information
2
Given the complexity of the vision process, Hoffman refers to the mechanisms which allow for our effortless
visual experience as Visual Intelligence [75]. Biological evolutionists have even offered that much of the human
brain’s cognitive and intellectual capabilities owe to the great computational demands of vision [117, 34].
24
reaching it, select only what is necessary in order to obtain knowledge about the visual world,
and compare the selected information with its stored record of all that it has seen.” ([189],
pg. 78). An “[. . . ] artist must also be selective and invest his work with attributes that are
essential, discarding much that is superfluous. It follows that one of the functions of art is an
extension of the major function of the visual brain.” ([189], pg. 79).

2 Given the complexity of the vision process, Hoffman refers to the mechanisms which allow for our effortless visual experience as Visual Intelligence [75]. Biological evolutionists have even offered that much of the human brain’s cognitive and intellectual capabilities owe to the great computational demands of vision [117, 34].
Given this goal agreement, it is not farfetched to assume that artistic images (e.g. pictures,
paintings) that are designed appropriately can greatly assist the brain in performing its difficult
task.
1.2.3. Perceptual Art(ists)
Some authors go as far as claiming that many artistic styles are based upon the collective perceptual insight of generations of artists (e.g. [189, 135]). Zeki (himself a leading neurologist)
writes, “artists are neurologists, studying the brain with techniques that are unique to them and
reaching interesting but unspecified conclusions about the organization of the brain. Or, rather,
that they are exploiting the characteristics of the parallel processing-perceptual systems of the
brain to create their works, sometimes even restricting themselves largely or wholly to one system, as in kinetic art.” ([189], pg. 80). Specifically, Zeki and Lamb found that various types
of late kinetic art are ideal stimuli for the motion sensitive cells in area V5 of the visual cortex [185]. In another experiment, Zeki and Marini [186] showed that fauvist paintings, which
often divorce shapes from their naturally assumed colors, excite quite distinct neurological pathways from representational art where objects appear in normal color. Gooch et al. demonstrated
that caricatured line drawings of unknown faces are learned up to two times faster than the corresponding photographs [62]. Ryan and Schwartz reported similar findings for drawings and
cartoons of objects [141].
Zeki refers to Art that is designed to specifically stimulate particular types of cortical cells
(intentionally or not) as art of the receptive field. “The receptive field is one of the most important concepts to emerge from sensory physiology in the past fifty years. It refers to the part
of the body (in the case of the visual system, the part of the retina or its projection into the
visual field) that, when stimulated, results in a reaction from the cell, specifically, an increase
or decrease in its resting electrical discharge rate. To be able to activate a cell in the visual
brain, one must not only stimulate in the correct place (i.e., stimulate the receptive field) but
also stimulate the receptive field with the correct visual stimulus, because cells in the visual
brain are remarkably fussy about the kind of visual stimulus to which they will respond. The art
of the receptive field may thus be defined as that art whose characteristic components resemble
the characteristics of the receptive fields of cells in the visual brain and which can therefore be
used to activate such cells.” ([189], pp. 88).
1.2.4. Benefits of combining NPR and Perception
One principled method3 of unlocking the perceptual potential of art for the purpose of creating
task-oriented computer-generated imagery, then, is to study and leverage the different visual
areas of the brain, or more precisely, the cells comprising these areas and the stimuli to which
these cells are responsive. The benefit of designing imagery based on perceptual principles
instead of physical/optical laws is that we can focus on creating and supporting the visual stimuli
pertinent for a given perceptual task and eliminate unnecessary detail.

3 This is not to say that artistic development is unprincipled, but rather that less quantifiable factors, like experience, intuition, and aesthetic sense, play a more marked role than is commonly regarded as scientific (of course, it is often exactly these qualities that lead to the most exciting and groundbreaking scientific discoveries).
The reverse approach is similarly advantageous (compared to fully realistic imagery) - by
generating non-realistic images that purposefully only trigger certain visual areas we can study
how the generated visual stimuli influence task-specific perception in isolation.
These are the two approaches exemplified by the NPR rendering frameworks and perceptual
studies in this dissertation.
1.3. Contributions
This dissertation presents two frameworks and accompanying studies that demonstrate the
important link between non-realistic graphics and perception research. Each framework uses
fundamental concepts of one research area to inform the other.
1.3.1. Perception informing Graphics
Chapter 3 presents a real-time NPR image processing framework to convert images or video
into abstracted representations of the input data. The framework is designed to operate on general natural scenes and produces abstractions that can improve the communication content of
the resulting imagery. Specifically, participants in two user studies are able to recognize/identify
objects and faces quicker than in the source photographs. The framework achieves meaningful
abstraction by implementing a simple model of low-level human vision. This model estimates
regional perceptual importance within images and removes superfluous detail (simplification)
while at the same time supporting perception of important regions by increasing local contrast (enhancement) and thus catering specifically to edge-sensitive cortical cells. Compared
to other automatic abstraction systems, the framework presented here offers superior temporal
coherence, does not rely on an explicit image structure representation, and can be efficiently
implemented on modern parallel graphics hardware.
1.3.2. Graphics informing Perception
The human visual system derives shape from a multitude of shape cues. Chapter 4 presents a
novel experiment to study shape perception of dynamically moving objects. The experimental
framework generates NPR display conditions that specifically target individual shape perception
mechanisms. By comparing user performance for a highly dynamic, interactive task under
each of the display conditions, the experiment establishes a relative effectiveness ordering for
the given shape cues. Data collected during experimentation indicates that shape perception
in a severely time-constrained condition may behave differently from static shape perception
and that a shape cue prioritization may occur in the former condition. The sensitivity of the
experimental design and its flexibility enable a large number of future investigations into the
effects of isolated shape cues and their parameterizations. Such research, in turn, should help
in the design of better graphics and visualization systems.
1.3.3. Evaluation for NPR systems
Several reasons exist why psychophysical evaluation and validation experiments are not performed more commonly for NPR systems designed to increase the communication potential or
expressiveness of images. Experiments are difficult to devise, time consuming to perform, and
require careful analysis. These issues could be somewhat mitigated by establishing a corpus
of experiments for NPR validation, along with a database of test imagery. This dissertation
contributes to such a corpus by defining clear perceptual goals for the stylization frameworks
presented, along with psychophysical experiments to test the effectiveness of achieving these
goals. It is my hope that the presented frameworks will provide a foundation for future NPR
work, and similarly, that the given validatory experiments will be used to evaluate and compare
future NPR systems.
CHAPTER 2
General Related Work
Various existing works have used mathematical and perceptual models and metrics to guide
approximation algorithms, to control data compression, and to determine data similarity. Although many metrics designed for photorealistic imagery do not directly apply to NPR imagery,
they are nonetheless illustrative of the different approaches to compression, comparison, and
analysis that realistic and non-realistic imagery require. I therefore discuss related photorealistic works in this chapter and defer the discussion of non-photorealistic works to the individual
frameworks in Chapter 3 and Chapter 4.
Most research into perception for photorealistic graphics1 centers around perceptual models
and metrics. In the context of this dissertation a perceptual model is an algorithm that simulates
a particular aspect of human visual perception (for example saliency or contrast sensitivity),
whereas a perceptual metric may use a given model to quantify the perceived differences between two stimuli or the probability that artifacts (e.g. as a result of compression) in a stimulus
may be detected.

1 It is interesting to note that many photorealistic applications employ perceptual metrics to degrade imagery up to the point where such degradation becomes perceptible or even objectionable. Their ultimate goal therefore shifts from physical realism to perceived realism; a goal much more in line with other perceptually-guided but intentionally non-realistic graphics.

2.1. Simple Error Metrics (Non-Perceptual)

A number of commonly used metrics, particularly in the compression and signal processing
communities, are mathematical in nature and not derived from perceptual models. Among
these are the relative error (RE), the mean-squared error (MSE), and the peak signal-to-noise
ratio (PSNR). Given two grayscale images2, A and B, with J pixels in the horizontal direction
(width) and I pixels in the vertical direction (height), the measures are defined as:
(2.1)    RE(A, B) = \frac{\sum_{i=0}^{I-1} \sum_{j=0}^{J-1} (A_{i,j} - B_{i,j})^2}{\sum_{i=0}^{I-1} \sum_{j=0}^{J-1} (A_{i,j})^2},

(2.2)    MSE(A, B) = \frac{\sum_{i=0}^{I-1} \sum_{j=0}^{J-1} (A_{i,j} - B_{i,j})^2}{I \cdot J},

(2.3)    PSNR(A, B, m) = 10 \cdot \log_{10} \frac{m^2}{MSE(A, B)}.
While Equation 2.2 yields an absolute value depending on the range of A and B, Equation 2.1 and Equation 2.3 give a relative error value. In the case of PSNR, the result is based on
a maximum possible value, m, for each pixel3, and expressed in decibels (dB).
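As a concrete illustration, a minimal NumPy sketch of Equations 2.1, 2.2, and 2.3 might look as follows; the function names and the synthetic test images are illustrative placeholders, not part of the dissertation's framework.

```python
import numpy as np

def re_metric(a, b):
    """Relative error (Equation 2.1): summed squared difference, normalized by the energy of A."""
    a, b = a.astype(np.float64), b.astype(np.float64)
    return float(((a - b) ** 2).sum() / (a ** 2).sum())

def mse(a, b):
    """Mean-squared error (Equation 2.2)."""
    a, b = a.astype(np.float64), b.astype(np.float64)
    return float(((a - b) ** 2).mean())

def psnr(a, b, m=255.0):
    """Peak signal-to-noise ratio in dB (Equation 2.3); m is the maximum possible pixel value."""
    return float(10.0 * np.log10(m ** 2 / mse(a, b)))

# Illustrative usage on a synthetic image pair (placeholder data, not the images of Figure 2.1).
rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(256, 256)).astype(np.float64)
noisy = np.clip(original + rng.normal(0.0, 10.0, original.shape), 0, 255)
print(re_metric(original, noisy), mse(original, noisy), psnr(original, noisy))
```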
Figure 2.1 puts the use of these error metrics for quantifying image quality or image fidelity
into perspective. I generated these images by creating an abstraction (see Chapter 3) of an
original image, computing the error values between original and abstraction, and generating two
other types of common image distortion (noise and blur) with similar error values (Table 2.1).
Comparing the images in Figure 2.1 should make it clear that the perceived quality of each
image, and the perceived fidelity to the original image differs greatly between the types of
distortions, despite the fact that their RE, MSE, and PSNR scores are nearly identical.
2 This discussion uses scalar-valued images for simplicity, but applies equally to color images.
3 For a common grayscale image, m = 2^8 − 1 = 255.
Figure 2.1. Simple Error Metrics. An original image and three variations with
the same level of errors. Noise: The original image with salt-and-pepper noise
added. Blur: The original image with a Gaussian filter applied. Abstract: The
original image processed with the real-time abstraction framework discussed in
Chapter 3.
Metric       | Noise        | Blur         | Abstract     | Polarity
RE           | 0.0576       | 0.0570       | 0.0578       | ↓
MSE          | 1.004 × 10^3 | 0.991 × 10^3 | 1.006 × 10^3 | ↓
PSNR         | 18.106       | 18.169       | 18.113       | ↑
PNG (78.4%)  | 94.7%        | 31.3%        | 40.5%        | n.a.
PDIFF4       | 1070         | 2917         | 3144         | ↓
HDR-VDP5     | 42.53%       | 99.71%       | 54.72%       | ↓

Table 2.1. Comparison Metrics. This table lists numeric values for a number of error and comparison metrics applied to the images in Figure 2.1. Polarity symbolizes whether a low numeric value indicates a small error (↓) or a large error (↑).
4 Number of pixels perceived to be different from original. Settings: gamma = 2.2, luminance = 100 lux, fov = 6°.
5 Percentage of pixels with p > 95% chance of being perceived as different. Same settings as PDIFF.

Another method of comparing images is to look at their information content (entropy). Considering that humans have to extract information from images in order to understand them, this
seems like a sensible approach.
Table 2.1, row 4, lists file-size ratios for the lossless PNG compression6 compared to an
uncompressed image. The compression ratio of the original image is given in the Metric column. When examining the other columns we can see that adding noise to the image increases
the entropy of the original image, while blurring (averaging) reduces the entropy, as expected.
The problem here is that my generic use of the word information (or entropy) does not determine how useful this information might be for visual communication purposes. The addition
of random information (noise), uncorrelated to the content of the image, does not enhance the
image. Conversely, I demonstrate in Chapter 3 that targeted removal of information (unlike the
uniform blur in Figure 2.1) can actually help perceptual tasks based on image understanding.
From Section 1.2.1, we know that much of visual perception is concerned with removing extrinsic information while distilling intrinsic information, so it is not information in itself that is
important; rather, the type of information plays the deciding role. Simple metrics are not designed to
make such distinctions.
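The compression-ratio comparison in Table 2.1, row 4, can be approximated in spirit with a few lines of Pillow; the file names below are placeholders and the exact percentages depend on the PNG encoder settings, so this is a sketch of the idea rather than the procedure used to produce the table.

```python
from io import BytesIO
from PIL import Image

def png_ratio(path):
    """PNG file size relative to the raw (uncompressed) grayscale pixel data, used here
    as a crude proxy for information content: noisy images compress poorly, while
    smooth or abstracted images compress well."""
    img = Image.open(path).convert("L")         # grayscale, matching Section 2.1
    raw_bytes = img.width * img.height          # one byte per pixel
    buf = BytesIO()
    img.save(buf, format="PNG", optimize=True)  # lossless encoding
    return buf.tell() / raw_bytes

for name in ["original.png", "noise.png", "blur.png", "abstract.png"]:  # placeholder file names
    print(name, f"{100.0 * png_ratio(name):.1f}%")
```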
2.2. Saliency
Simple, mathematical metrics commonly fail for perceptual applications because, when it
comes to the human visual system, not all pixels are created equal7. The location, neighborhood,
and semantic meaning of (a group of) pixels are generally more important than their exact color.
Humans can only focus in a very narrow foveal region8, so pixels in this region have more
impact on the perceived image. Additionally, color discrimination in this region is fairly good
but motion detection is better outside the foveal region. Pixels can further be masked9 by texture
or noise [46].

6 ISO standard, ISO/IEC 15948:2003 (E).
7 Besides the fact that humans do not operate on pixels per se, anyway.
8 The fovea spans about 15° visual angle.
As a rule, some image regions are visually more important (have a higher saliency) than
others. Given their narrow foveal extent, humans have to continually scan their visual field
with head movements and quick saccadic eye movements. For visual efficiency and to preserve
energy, these movements are mostly directed towards salient regions in the visual field. Saliency
is therefore an important tool to model and predict perceptual attention10.
Itti et al. [79, 78] computed explicit contrast measures for brightness, color opponency,
and orientation (via Gabor filters) at multiple spatial scales. They then averaged the individual
contrasts over all scales onto an arbitrary common scale. Finally, they normalized and averaged
all contrasts to obtain a combined saliency map. From this, they predicted the sequence and
durations of eye fixations using local maxima and a capacitance-based model of inhibition-of-return11.
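The multi-scale center-surround idea described above can be sketched in a few lines; the following is a deliberately simplified, intensity-only approximation and omits the color-opponency and Gabor orientation channels (as well as the inhibition-of-return model) of Itti et al.'s actual system.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def saliency_map(gray, center_sigmas=(1, 2, 4), surround_scale=4):
    """Toy intensity-only saliency: |center - surround| contrast at several scales,
    each normalized to [0, 1] and averaged onto a common map."""
    gray = gray.astype(np.float64)
    acc = np.zeros_like(gray)
    for sigma in center_sigmas:
        center = gaussian_filter(gray, sigma)
        surround = gaussian_filter(gray, sigma * surround_scale)
        contrast = np.abs(center - surround)
        span = contrast.max() - contrast.min()
        if span > 0:
            contrast = (contrast - contrast.min()) / span  # per-scale normalization
        acc += contrast
    return acc / len(center_sigmas)
```

The global maximum of such a map would serve as the first predicted fixation; repeatedly suppressing a neighborhood around each visited maximum (inhibition-of-return) then yields a predicted fixation sequence.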
Privitera and Stark [130] analyzed the effectiveness of 10, partly perceptually-inspired, image processing operators (including Gabor, discrete wavelet transform, and Laplacian of Gaussian) to predict human eye fixations. They computed the 10 operators for a test image and
clustered local maxima until they reached a predetermined number of clusters. By comparing
the remaining clusters with actual fixation locations obtained from human subjects, they determined the reliability of each operator to predict fixation points. Privitera and Stark’s approach
was novel in that they did not assert a priori which image processing operator would model human attention accurately. Rather, they assembled a number of suitable operators and evaluated
them empirically.

9 For example, a green leaf on a red blanket is perceived very prominently, whereas the same leaf would probably not be noticed in a pile of other leaves. The pile of leaves thus masks the single leaf.
10 Santella and DeCarlo [36, 144] exploit this fact by using eye-tracking data to guide their NPR abstraction system.
11 This prevents revisiting the same maxima in short succession.
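A simplified version of such an empirical evaluation is sketched below: one candidate operator (a Laplacian-of-Gaussian, one of the ten they considered) proposes fixation points as its strongest local maxima, and the proposal is scored against recorded fixations by a simple hit rate within a pixel radius. This is an illustrative stand-in only; Privitera and Stark's actual procedure clusters the maxima and compares cluster sets.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace, maximum_filter

def predicted_fixations(gray, k=7, nms_size=15, sigma=3.0):
    """Top-k local maxima of a Laplacian-of-Gaussian response as fixation candidates."""
    response = np.abs(gaussian_laplace(gray.astype(np.float64), sigma))
    peaks = response == maximum_filter(response, size=nms_size)   # non-maximum suppression
    ys, xs = np.nonzero(peaks)
    order = np.argsort(response[ys, xs])[::-1][:k]
    return np.stack([ys[order], xs[order]], axis=1)

def hit_rate(predicted, fixations, radius=30.0):
    """Fraction of recorded human fixations lying within `radius` pixels of any predicted point."""
    d = np.linalg.norm(fixations[:, None, :] - predicted[None, :, :], axis=2)
    return float((d.min(axis=1) <= radius).mean())
```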
Because image distortions in non-salient regions commonly remain unnoticed, saliency
forms a central component in many perceptual error metrics (Section 2.3) as well as optimization and compression algorithms (Section 2.4).
2.3. Visible Difference Predictors (VDPs)
To address the shortcomings of simple error metrics, researchers have designed several
perceptually-based difference predictors that take into account a limited number of low level
human vision mechanisms, including saliency. As the name suggests, a VDP metric predicts if
a human could tell two images apart, or how different a human would judge two images to be.
Daly’s [33] VDP modeled three aspects of human vision: non-linear brightness perception, the contrast sensitivity function (CSF), and masking [46] due to texture and other noise.
Mantiuk et al. [106] modified Daly’s VDP for use with high-dynamic range (HDR) imagery.
Yee et al. [180] defined a predictive error map, ℵ, that considered intensity, color, orientation, and motion at different spatial scales to estimate visual saliency and to determine the
perceived visual differences in salient regions.
The PDIFF and HDR-VDP entries in Table 2.1 list the pairwise difference scores between
the original image and the distorted images in Figure 2.1 for Yee et al.’s [180] (PDIFF) and Mantiuk et al.’s [106] (HDR-VDP) metrics. For the comparisons, I chose environmental conditions
similar to the user-studies in Section 3.4. The PDIFF scores for Noise and Blur appropriately
indicate the aggressive distortion of the blur operation. Note though, that the abstracted image,
itself derived using a model of human perception, attains the worst score of all. Although HDR-VDP still prefers the Noise image to the Abstract image, the metric at least performs better at
judging the excessive visual loss in the Blur image.
The problem lies not in the abstracted image and not even necessarily in the VDPs but in
my use of the VDPs. The above VDPs predict perceivable differences between images; they
do not predict the perceived likeness of images. Many forms of art are exceptionally good
likenesses of a scene, despite the fact that their visual appearance is markedly different from the
real world. For this reason, standard VDPs and other perceptual metrics devised for realistic
scenes generally fare poorly on NPR imagery.
To the best of my knowledge, no NPR image quality or fidelity metrics exist to date, and
I believe this to be an excellent opportunity for future research. As a starting point it might
be interesting to leverage the null-operator qualities12 of some NPR systems to transform both
images to be compared into the same domain and then compute a simple error score.

12 For example, abstracting an already abstracted image in Chapter 3 changes almost nothing.
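A minimal sketch of this suggestion, assuming access to some roughly idempotent abstraction operator (for instance the framework of Chapter 3, represented here only by a placeholder function), might look like this:

```python
import numpy as np

def npr_fidelity(reference, stylized, abstract):
    """Null-operator comparison: map both images into the same abstracted domain,
    where abstract(abstract(x)) is approximately abstract(x), and then apply a
    simple error metric. `abstract` is a placeholder for such an NPR operator."""
    a = abstract(reference).astype(np.float64)
    b = abstract(stylized).astype(np.float64)
    return float(((a - b) ** 2).mean())   # plain MSE, now computed in the common domain
```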
2.4. Applications
While the above models and metrics can be used directly to compare images, for example
in database searches, they are more commonly integrated into applications to control image
distortion. The two application areas I focus on, lossy compression (Section 2.4.1) and adaptive
rendering (Section 2.4.2), are both very active research areas in their own right. Because this
section only addresses peripherally related work and because this work is too vast to present
comprehensively in this space, I limit my discussion to exemplary applications, instead.
Two points are worth remembering throughout the following discussion. First, all of the
listed works are perceptually motivated, yet only a small number of them perform user
studies for perceptual validation. Second, even the most sophisticated perceptual models are
studies for perceptual validation. Second, even the most sophisticated perceptual models are
mostly used only to hide artifacts and to degrade images without an objectionable loss in visual quality; they are generally not designed to make an image easier or quicker to understand.
Although counter-examples exist, particularly for contrast reduction and tone-mapping work,
these shortcomings prevent us from harnessing the full potential of perceptual models for graphical applications.
2.4.1. Lossy Compression
To obtain very high compression ratios, lossy compression methods sacrifice some information,
that is, the signal recovered from a compressed stream is commonly not identical to the original signal. To ensure that this information loss remains below a perceivable threshold, or at
least does not become objectionable, perceptual models and metrics can guide the compression
process.
In the past, many researchers have developed lossy compression methods for a number of
different signal types and signal dimensions. The most common types are images, video, and
geometric meshes, while the most common dimensions are spatial (domain), temporal, and
dynamic range.
Images.
Reid et al. [139] gave an overview of so-called second-generation (2G) coding
techniques, i.e. lossy image compression systems that incorporate a simple HVS model. They
concluded that most existing 2G systems outperform first-generation systems, that the 2G systems are of similar complexity, and that an objective quality comparison is impossible until a
quantitative quality metric is adopted.
In similar work, Kambhatla et al. [87] compared several image compression schemes, including mixture of principal components (MPC), wavelets, and Karhunen Loève transform
(KLT; also known as principal component analysis, PCA). They found that while PSNR for
wavelet transform and KLT are higher than MPC, the MPC method produced fewer subjective
errors as judged by (a)13 radiologist(s) analyzing brain magnetic resonance images (MRI).
Video.
Bordes and Philippe [15] proposed perceptual enhancements to the MPEG-2
compression standard14. They developed a quality map based on a pyramid decomposition of
spatial frequencies together with a multi-resolution motion representation. This quality map
was then used in a pre-process to remove non-visible15 information to limit the amount of data
to be encoded. The second use of the quality map was to locally adapt the encoding quantization
for constant bitrate encoding.
Meshes.
Williams et al. [167] developed a view-dependent mesh simplification algorithm
sensitive to an object’s silhouette, its texture, as well as the dynamic scene illumination. The
authors weighed the cost-benefit trade-off between these factors in terms of distortion effects
and rendering costs, and allocated run-time resources accordingly.
Watson et al. [163] applied two different mesh simplification schemes, VClust and QSlim,
to 36 polygonal models of animals and manmade artifacts. They then compared the results of a
series of user studies, including naming times, ratings, and preferences, with numerous automatic
measures computed in object and image space. They found that ratings and preferences were
predicted adequately by the automatic measures, while naming times were not. The authors also
found significant effects between the two object types, indicating that mesh simplification
systems may need to consider a broader range of information than mere geometry and connectivity.
13. The authors gave no details on their subjective evaluation.
14. This codec, most commonly used for high-quality DVD video encoding, is part of the larger MPEG compression and coding family. More information is available at http://www.mpeg.org.
15. The authors did not define this term clearly. I assume they referred to quality loss below a certain threshold.
Dynamic Range.
To address the problem of displaying high dynamic range images on
low dynamic range displays, Tumblin et al. [157] developed two contrast reduction methods.
The first method, practical only for synthetic images, computed separate image channels for
lighting and surface information. By compressing only the lighting channels the authors were
able to reduce the overall contrast of images while preserving much of the surface information.
The second, generally applicable method allowed users to manually specify foveal fixation
locations. The algorithm then adjusted global contrast based on foveal contrast adaptation while
attempting to preserve local contrast in the fixation regions.
Tumblin and Turk [158] took inspiration from artists’ approach to high dynamic range reproduction in developing their low curvature image simplifier (LCIS). They argued that skilled
artists preserve details by drawing scene contents in coarse-to-fine order using a hierarchy of
scene boundaries and shadings. The LCIS operator, a partial differential equation inspired by
anisotropic diffusion, was designed to dissect a scene into smooth regions bounded by sharp
gradient discontinuities. A single parameter, K, chosen for each LCIS, controlled region size
and boundary complexity. Using a hierarchy of LCISs the authors could compress the dynamic
range of large contrast features and then add detail from small features back into the final image.
In addition to its value as a tone reproduction operator, this work is relevant to my research due
to its similar approach (albeit for different reasons) to feature analysis and simplification via
anisotropic-like diffusion (Section 3.3.2).
Temporal Dynamic Range.
Mantiuk et al. [107] extended the MPEG-4 video com-
pression standard to deal with high dynamic range video. They described a luminance quantization method optimized for contrast threshold perception of the HVS. Additionally, the proposed quantization offered perceptually-optimized luminance sampling to implement global
tone mapping operators via simple and efficient look-up tables.
Pattanaik et al. [122] proposed a new operator to account for transient scene intensity adjustments of the HVS in animation or interactive real-time simulations. Their operator simulated
the dramatic compression of visual responses, and the gradual recovery of normal vision, caused
by large contrast fluctuations, for example when quickly entering or leaving a dark tunnel on a
bright sunny day.
2.4.2. Adaptive Rendering
Realistic image synthesis is computationally extremely expensive due to the complexity of interactions between light and matter that need to be modeled to achieve a convincing level of
optical/physical realism. This problem can be mitigated by lowering the goal from optical realism to perceived realism. Using perceptual models and metrics, applications can
allocate rendering resources to salient image regions, while reducing computational accuracy
and resolution in less salient regions.
Error Sources.
Arvo et al. [2] defined three main causes of error in global illumina-
tion algorithms: (1) Perturbed Boundary Data - errors in the input data due to limitations of
measurement or modeling; (2) Discretization Errors - introduced when analytical functions are
replaced by finite-dimensional linear systems for actual computations; and (3) Computational
Errors - due to limited arithmetic precision. Any or all of these errors can result in the visual degradation of synthetic images and objectionable artifacts, such as faceting on tessellated
curved surfaces, banding (and even exaggerated Mach-banding effects) due to quantization,
aliasing as a result of insufficient sampling, and noise as a residual effect of stochastic models
used in random sample placement.
Static Scenes.
Ferwerda et al. [46] made use of the common observation that some of
these artifacts can be masked (hidden) when they appear co-located with visual texture. The
authors developed a computational model of visual masking that predicted how the presence of
one visual pattern affected the detection of another. Using their system, the authors could select
and devise texture patterns to use in synthetic image generation that would hide artifacts due to
the above error-types.
Bolin and Meyer [13] presented a perceptually inspired approach to optimize sampling distributions for image synthesis. They computed a wavelet representation of the currently rendered scene and used a custom image quality model in combination with statistical information
about the spatial frequency distribution of natural images to determine locations where additional samples needed to be taken. Their approach was able to predict masking effects and
could be used to attain equivalent visual quality from different rendering techniques by controlling sample placement.
In similar work, Ramasubramanian et al. [136] devised a physical error metric that accounted for the HVS’s loss of sensitivity at high background illumination levels, high spatial
frequencies, and high contrast levels (visual masking). To reduce the cost of their metric for
adaptive rendering, the authors separated luminance-dependent processing from the expensive
spatially-dependent component, which could be pre-computed once.
Recently, Cater et al. [23] performed user studies to demonstrate that different visual tasks
affect observers' eye movements over images (effectively changing saliency in an image). They
therefore extended previous HVS-based systems by additionally considering a so-called task
map. The task map encoded information about objects' locations and their purpose for a given
task, and was generally specified manually. The authors modified the Radiance rendering engine [162] to synthesize images optimized for a given task. Unlike Santella and DeCarlo [144],
they did not perform further user studies to prove that their optimized images retained the same
fixation locations as the unoptimized images.
Dynamic Scenes.
In addition to a saliency map and spatial frequency estimation, the
perceptual model of Yee et al. [180] included an estimate of retinal velocity. Because detail
resolution in high velocity regions is limited, the authors could speed up global illumination
solutions by up to an order of magnitude.
Myszkowski [116] developed an extension to Daly's [33] VDP, called the Animation Quality
Metric (AQM) to facilitate high-quality walk-throughs of static environments and to speed up
global illumination computations of dynamic environments.
CHAPTER 3
Real-Time Video Abstraction
Figure 3.1. Abstraction Example. Abstractions like the one shown here can
be more effective in visual communication tasks than photographs. Original:
Snapshot of two business students on an overcast day. Abstracted: After several
bilateral filtering passes and with DoG-edges overlayed. Quantized: Luminance
channel soft-quantized to 8 bins. Note how folds in the clothing and shadows on
the ground are emphasized.
In this chapter, I present an automatic, real-time video and image processing framework with
the goal of improving the effectiveness of imagery for visual communication tasks (Figure 3.1).
This goal is naturally broken down into two tasks: (1) Modifying imagery based on visual
perception principles (Sections 3.2-3.3); and (2) proving that such modifications can lead to
improved performance in visual communication (Section 3.4). Additionally, I show how the
various processing steps in my framework can be utilized for artistic stylization purposes.
The framework operates by modifying the contrast of perceptually important features, namely
luminance and color opponency. It reduces contrast in low-contrast regions using an approximation to anisotropic diffusion, and artificially increases contrast in higher contrast regions
with difference-of-Gaussian edges. The abstraction step is extensible and allows for artistic or
data-driven control. Abstracted images can optionally be stylized using soft color quantization
to create cartoon-like effects.
Technical Contributions. Unlike most previous video stylization systems, my framework
is purely image-based and refrains from deriving an explicit image representation1. That is, instead of computing a structural description of the image content and then subsequently stylizing
or otherwise modifying this description, my framework directly manipulates perceptual features
of an image, in image space. While this may seem to limit stylization capabilities at first sight,
I devise several soft quantization functions that offer important benefits for abstraction, performance, and stylization: (1) a significant improvement in temporal coherence without requiring
user-correction; (2) a highly parallel framework design, allowing for a GPU-based, real-time
implementation; and (3) parameters for the quantization functions which allow for a different,
but rich set of stylization options, not easily available to previous systems.
Theoretical Contributions. I demonstrate the effectiveness of the abstraction framework
with two user-studies and find that participants are faster at naming abstracted faces of known
persons compared to photographs. Traditionally, faces are considered very difficult to abstract
1. While implementation details may vary, an explicit image representation generally describes an image in terms of vector or curve-based bounded areas. See Section 3.1.1, pg. 47 and Figure 3.2 for details.
and stylize. Participants are also better at remembering abstracted images of arbitrary scenes in
a memory task. The user studies employ small images to emulate portable display technology.
I believe that small imagery will play an increasingly important role in the immediate future,
with the growing ubiquity of mobile display-enabled devices like mobile phones, digital cameras, personal digital assistants, game consoles, and multimedia players. To keep these devices
portable, their display size is necessarily limited and the given screen space has to be used
effectively. A framework that offers increased recognition of image features for visual communication purposes while reducing the complexity of images and thus aiding compression is
therefore a valuable asset.
My framework is one of only a few existing automatic abstraction systems built upon perceptual principles, and the only one to date that achieves real-time performance.
3.1. Related Work
A number of issues are important for most stylization and abstraction systems and can be
used to differentiate my work from previous systems. These are defined in the following section
and later used to discuss previous systems.
3.1.1. Definitions
Automatic vs. User-driven.
As discussed in Section 2.3, various computational models of
low-level human perception have been proposed. These automatically approximate a limited
set of visual perceptual phenomena. No computational (or even theoretical) model exists to date that satisfactorily predicts or synthesizes anything but the most basic visual features. Most
models break down when attempting to analyze global effects requiring semantic information
or integration over the entire visual field, and effects based on binocular vision. These limitations are partly due to the fact that not much is known about how humans achieve such global
analysis [188]. Consequently, any system relying on semantic information or intended to create art requires human interaction. Other systems, particularly those intended to aid (and not
replace) humans in a particular visual task, can well benefit from automation. Ideally, a system
should offer a best-effort automatic solution along with an overriding or extension mechanism
to improve upon the results. This is the approach I have taken in my automatic video abstraction
framework.
Real-time vs. Off-line.
By definition, the amount of computation that can be performed
by a real-time system is limited by the intended frame-rate. Because my framework is designed
to support visual communication, real-time performance is paramount for interactive
applications like video-telephony or video-conferencing. Other applications, like visual database searches or summaries, can be created off-line and then accessed asynchronously. My
framework design leverages parallelism of the underlying image processing operations wherever possible, enabling real-time performance on modern GPU processors.
Temporal Coherence.
Temporal coherence is a desirable property of any animation and
video system because unintentional incoherence draws perceptual attention and is therefore
distracting. A system exhibits temporal coherence if small input changes lead to small output changes; this property does not hold for most stylization systems using discrete conditionals and hard
quantization functions. Additional problems arise if scene objects need to be identified and
tracked through computer vision algorithms, as those algorithms are often brittle (see Explicit
Figure 3.2. Explicit Image Structure. Two pairs of images showing explicit
image structure (Left image of pair shows color coded segments. Right image of
pair shows colors derived from original image). Coarse Segmentation: The level
of detail is manually chosen to segment the image into semantically meaningful
segments. Some detail, like the face, is too fine to be resolved at this level. Fine
Segmentation: The level of detail is chosen so that the face is resolved, but this
leads to over-segmentation in the remaining image. A common approach to this
problem is to over-segment an image and then use a heuristic method to merge
adjacent segments, but such heuristics are commonly non-robust and temporally
incoherent, requiring user correction.
image structure, below). My framework offers temporal coherence through two different mechanisms: (1) reducing noise in the input images with non-linear diffusion; and (2) soft pseudo-quantization functions that are all continuous or semi-continuous2 (and adaptive where applicable).
2. Formally, a function f, defined on some topological space X, f : X → R, is upper semi-continuous at x0 if lim sup_{x→x0} f(x) ≤ f(x0), and lower semi-continuous if lim inf_{x→x0} f(x) ≥ f(x0). For my soft quantization functions, it is also true that the ranges of the continuous intervals are much greater than the ranges of the discontinuities.
Explicit Image Structure and Stylization.
An explicit image structure is the logical rep-
resentation of image elements, such as objects, and their relative positioning (Figure 3.2). Image structure is commonly represented with a (possibly multi-resolution) hierarchy of contour-bound areas, expressed as polylines or parametric curves. There exist several advantages of
such explicit representations. They can be arbitrarily scaled, they can be recombined in different ways, and most importantly for stylization systems, their geometric descriptions can be parameterized and then simplified or stylized freely. Several disadvantages counterbalance these
benefits. Correctly identifying and extracting image structure from raw images is a difficult and
costly vision problem, often requiring user-correction and preventing real-time performance
(see Automatic vs. User-driven and Real-time vs. Off-line, above). A related problem is that
of tracking image structure between successive frames, particularly for noisy input, non-trivial
camera movements, and occlusions. My framework steers clear of these vision problems to become fully automatic as well as real-time, at the cost of a more limited range of stylistic options.
I offset this limitation by providing a rich set of user-parameters to the quantization functions
of the framework.
In addition to the points mentioned above, the discussion in Section 1.1.3 on the merits of
psychophysical validation applies directly to related works as well.
Having defined the most important design factors for work directly related to mine, I can
now continue to discuss previous systems in terms of these factors.
3.1.2. Previous Systems
Among the earliest work on image-based NPR was that of Saito and Takahashi [142] who
performed image processing operations on data buffers derived from geometric properties of
3-D scenes. These buffers contained highly accurate values for scene normals, curvature, depth
discontinuities, and other measures that are difficult to derive from natural images without knowledge of the underlying scene geometry. Unlike my own framework, their approach was mainly
limited to visualizing synthetic scenes with known geometry.
To reliably derive limited image structure from their source data, Raskar et al. [138] computed ordinal depth from pictures taken with purpose-built multi-flash hardware. This allowed
them to separate texture edges from depth edges and perform effective texture removal and
other stylization effects. My own framework cannot derive ordinal depth information or deal
well with general repeated texture but also requires no specialized hardware and therefore does
not face the technical challenges of multi-flash for video.
Several video stylization systems have been proposed, mainly to help artists with labor-intensive procedures [161, 26]. Such systems computed explicit image structure by extending
the mean-shift-based stylization approach of DeCarlo and Santella [36] to computationally expensive3 three-dimensional segmentation surfaces. Difficulties with contour tracking required
substantial user intervention to correct errors in the segmentation results, particularly in the
presence of occlusions and camera movement. My framework does not derive an explicit representation of image structure but offers a different mechanism for stylization, which is much
faster to compute, fully automatic, and temporally coherent.
Contemporaneous work by Fischer et al. [49] explored the use of automatic stylization techniques in augmented reality applications. To visually merge virtual objects with a live video
stream, they applied stylization effects to both virtual and real inputs. Although parts of their
3. Wang et al.'s [161] system took over 12 hours to segment 300 frames (10 seconds of video) and users had to correct errors in approximately every third frame.
system are similar to the framework presented here, their approach is style-driven instead of perceptually motivated, leading to different implementation approaches. As a result, their system
is limited in the amount of detail it can resolve, their stylization edges require a post-processing
step for thickening, and their edges tend to suffer from temporal noise4.
Recently, some authors of NPR systems have defined task-dependent objectives for their
stylized imagery and tested these with perceptual user studies. DeCarlo and Santella [36] used
eye-tracking data to guide image simplification in a multi-scale system. In follow-up work, Santella and DeCarlo [144] found that their eye-tracking-driven simplifications guided viewers to
regions determined to be important. They also considered the use of computational saliency as
an alternative to measured saliency. My own work does not rely on eye-tracking data, although
such data can be used. My implicit visual saliency model is less elaborate than the explicit
model of Santella and DeCarlo’s later work, but can be computed in real-time and can be extended for a more sophisticated off-line version. Their explicit image structure representation
allowed for more aggressive stylization, but included no provisions for the temporal coherence
featured in my framework.
Gooch et al. [62] automatically created monochromatic human facial illustrations from
Difference-of-Gaussian (DoG) edges and a simple model of brightness perception. Using an
extended soft-quantization version of a DoG edge detector, my framework can create similar
illustrations in a single pass and additionally address color, real-time performance and temporal
coherence. My face recognition study follows closely the protocol set forth by Stevenage [149]
and consequently used by Gooch et al. [62].
4. It should be noted that while these drawbacks are generally not desirable for a video stylization system, they helped to effectively hide the boundaries between real and virtual objects in Fischer et al.'s system.
Work by Tumblin and Turk [158], traditionally associated with the tone-mapping literature,
is worth mentioning for its use of related techniques and the fact that the authors took inspiration from artistic painterly techniques5. In order to map high-dynamic range (HDR) images into
a range displayable on standard display devices, Tumblin and Turk decomposed an HDR image into a hierarchy of large and fine features (as defined by a conductance threshold function,
related to local contrast). Hierarchical levels with a large dynamic range were then compressed
before combination with smaller features, effectively compressing the range of the entire image without sacrificing small detail. The low curvature image simplifiers (LCIS) used at each
hierarchy level are closely related to the approximate anisotropic diffusion operation I use for
simplification, but are based on higher order derivatives. Despite this similarity, Tumblin and
Turk’s goals were different in that they would not modify low dynamic range images, whereas
I am interested in simplifying and abstracting these.
3.2. Human Visual System
Visual processing of information in humans involves a large part of the brain and processing
operations too vast and complex to be fully understood at present, let alone modeled. Given
the design considerations defined in Section 3.1.1 (automation, real-time performance, temporal
coherence) I limit the framework to modeling a small part of visual processing and base my
design on the following assumptions:
(1) The human visual system operates on various features of a scene.
(2) Changes in these features (contrasts) are of perceptual importance and therefore visually interesting (salient).
5. Artists are commonly faced with the difficulty of capturing high dynamic range, real-world scenes on a canvas of limited dynamic range.
(3) Polarizing contrast (decreasing low contrast while increasing high contrast) is a basic
but useful method for automatic image abstraction.
3.2.1. Features
Although the human visual experience is generally holistic, several distinct visual features are
believed to play a vital role in low-level human vision; among these are luminance, color opponency, orientation, and motion [121]. Evidence for such features derives from several sources.
Within the visual cortex, several structurally different and variedly connected sub-regions
have been identified, whose constituent cells are selectively sensitive to very distinct visual
stimuli (e.g. Area V3: Orientation. Area V4: Color. Area V5: Global Motion) [188]. In addition, cerebral lesions and other pathological conditions can lead to cases where the holistic
visual experience is selectively impaired (e.g. color blindness types: protanopia, deuteranopia,
tritanopia, monochromasy, and cerebral achromatopsia; form deficiency: visual agnosia; motion blindness: akinetopsia) [188]. Similar evidence can be gleaned from blind people who
regain sight. Their visual system is generally heavily underdeveloped and (depending on age)
may never fully recover, but they can almost immediately perceive lines, edges, brightness, and
color6 [65].
Based on this evidence, I consider luminance, color, and edges (which really are a secondary feature) in my real-time framework. The framework uses the perceptually uniform
CIELab [179] color space to encode the luminance (L) and color opponency (a and b) features
of input images and performs all abstraction operations in this feature space. The perceptual
6. The problem often does not lie in perceiving the individual features of the visual world, but in their meaningful integration and interpretation.
uniformity of CIELab guarantees that small distances measured in this space correspond to perceptually just noticeable differences (see Contrast, below). The framework design further allows
for inclusion of additional features for off-line processing or when such features can be viably
computed in real-time on future hardware (Section 3.5.4).
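For concreteness, the following sketch shows the color-space conversion step in plain Python using scikit-image. It is an illustrative stand-in, not the GPU shader code of the actual implementation, and the function names are mine.

import numpy as np
from skimage import color

def to_feature_space(rgb_image):
    # Map an 8-bit RGB frame to CIELab: L in [0, 100], (a, b) roughly in [-127, 127].
    rgb = rgb_image.astype(np.float64) / 255.0
    return color.rgb2lab(rgb)            # shape (H, W, 3), channels L, a, b

def to_display_space(lab_image):
    # Convert an abstracted CIELab image back to 8-bit RGB for display.
    rgb = np.clip(color.lab2rgb(lab_image), 0.0, 1.0)
    return (rgb * 255.0).astype(np.uint8)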
3.2.2. Contrast
Constant features are generally not a prime source of biologically vital information (e.g. a featureless blue sky; a tree with uniformly green leaves; a stationary bush). Changes in features
(feature contrasts) are often much more important (e.g. the silhouette of a hawk hovering above;
the color-contrast of a red apple on a green tree; the motion of a tiger moving in the bushes).
For this reason, humans are notoriously inept at estimating absolute stimuli and much more proficient at distinguishing even small differences between two similar stimuli [105]. People can
name and describe only a handful of colors, yet they can differentiate hundreds of thousands of
colors. People have difficulty estimating speed when moving, yet they are extremely sensitive
to acceleration and deceleration7. Only very few people can tell the frequency of a pure sinusoidal sound wave, yet most people can distinguish two different notes. In technical terms, the
resolution of absolute measures of features can be orders of magnitude less than the differential resolution of so-called just-noticeable differences8 (JND) [105, 45]. Because changes play
such an important role in perception, much of my framework is based on contrasts (see below).
7. For example, without visual feedback, one cannot tell if an elevator is moving or stationary, only if it is starting or stopping.
8. It is therefore not surprising to find differential measures becoming increasingly prominent in computer graphics research [123, 156, 59].
3.2.3. Saliency
Removing extraneous detail from imagery while emphasizing important detail requires a measure of visual importance. Itti et al. [79, 78] recognized the biological importance of high feature contrasts in their saliency model, introduced in Section 2.2. Because their explicit model
is computationally rather expensive and thus too complex for real-time applications, I employ
a simpler, implicit9 saliency model for my automatic, real-time implementation. Within my
framework, the following restrictions apply:
(1) It considers just two feature contrasts: luminance, and color opponency.
(2) It does not model effects requiring global integration.
(3) It processes images only within a small range of spatial scales (Section 3.2.5).
Since the framework (Figure 3.4) optionally allows for externally-guided abstraction via user maps (Equation 3.2 and Figure 3.9), a more complex saliency map, like that of Itti et al., can be
supplied at the cost of sacrificing real-time performance.
3.2.4. Contrast Polarization
Exaggerating feature contrasts can aid in visual perception. For example, super-portraits and
caricatures have been shown to help recognition of faces10 [17, 149, 62] and can be considered
a special case of the more general peak-shift effect [66].
My approach for image simplification and abstraction is therefore to simply polarize the
existing contrast characteristics of an image: to diminish feature contrast in low contrast regions,
9. Here, implicit means that contrast is both the measure that defines saliency and the operand that is modified via saliency.
10. Here, feature refers to facial features (like big nose, tight lips); and contrast refers to feature differentials compared to an ideal norm-face.
Figure 3.3. Scale-Space Abstraction. Left: Image of a man used as the base
level in a scale-space representation. Left to right: Difference-of-Gaussian
(DoG) feature edges computed at increasingly coarser scales. As the kernel size
for the DoG filters increases (about an order of magnitude from left to right), the
visual depiction changes from a concrete instance of a man in shirt and trousers,
to a generic and abstract standing figure.
while increasing feature contrast in high contrast regions in order to yield abstractions that are
easier and faster to understand.
3.2.5. Scale-Space
Real-world entities are composed of structural elements at different scales (Figure 3.3). A forest
can span dozens of kilometers, each tree can be dozens of meters high, branches are several meters long, leaves are best measured in centimeters, while the leaves’ cells extend only fractions
of millimeters. It makes as little sense to describe a forest in terms of millimeters as it does to
describe a leaf in terms of kilometers. The fact that scale is such an important aspect when discussing structure has led to the development of several scale-space theories [175, 93, 102]. In
terms of the human visual system, Witkin’s continuous (linear) scale-space theory is compatible
with results by De Valois and De Valois [159], showing that receptive fields (Figure 3.11) of
cortical cells include a fairly dense representation of sizes in the spatial frequency domain [121].
Figure 3.4. Framework Overview. Each step lists the function performed,
along with user parameters. The right-most paired images show alternative results, depending on whether luminance quantization is enabled (right) or not
(left). The top image pair shows the final output after the optional image-based
warping step.— {Cameron Diaz with permission of Cameron Diaz.}
My framework supports structural scale with various framework parameters (σd , σe ), which
can be used to extract and smooth features at a given scale (Figures 3.3 and 3.13). In particular,
a single spatial scale can be defined for edge detection, and the non-linear diffusion process
(Section 3.3.2) inherently operates at multiple scales due to its iterative nature11.
3.3. Implementation
The basic workflow of my framework is shown in Figure 3.4. The framework first polarizes
the given contrast in an image using nonlinear diffusion (Section 3.3.2). It then adds highlighting edges to increase local contrast (Section 3.3.3), and it optionally stylizes (Section 3.3.5) and
sharpens (Section 3.3.4) the resulting images.
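The following sketch summarizes this workflow in Python. The four processing steps are passed in as callables (corresponding to the sketches given with Sections 3.3.2-3.3.5), and the way the edge map is combined with the luminance channel is an assumption of this sketch rather than a specification of the original GPU pipeline.

def abstract_frame(lab, bilateral_pass, dog_edges, soft_quantize, warp_sharp,
                   n_b=4, n_e=2):
    # One frame: n_b bilateral passes, with DoG edges extracted after n_e passes
    # (Section 3.3.3), optional soft quantization (Section 3.3.5), and an
    # optional image-based warp (Section 3.3.4) as the final step.
    edges = None
    for i in range(n_b):
        lab = bilateral_pass(lab)
        if i + 1 == n_e:
            edges = dog_edges(lab)
    out = soft_quantize(lab)
    out[..., 0] = out[..., 0] * edges    # overlay edges by darkening luminance (assumed)
    return warp_sharp(out)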
11. It should be noted that only the base scale of the non-linear diffusion is well-defined. Additional scales are spatially varying due to the non-linearity. As such, the multi-scale operations are less powerful than those based on Itti et al.'s explicit representation.
3.3.1. Notation
This work combines results from such diverse disciplines as Psychology, Physics, Computer Vision, and Computer Graphics, each of which tend to have their unique formalisms and notation.
In my own work, I try to use recognizable formulations of existing results and favor readability
over mathematical rigor. Specifically, I mix notation from continuous and discrete domains and
I do not discuss issues arising due to boundary conditions, as these issues are not specific to
my work. Additionally, numerical accuracy is not a deciding factor in my framework because
there exists no ground-truth to judge against and because the filters I employ are stable for the
parameter ranges given, unless explicitly stated otherwise.
3.3.2. Extended Nonlinear Diffusion
Linear vs. Non-linear diffusion.
Linear filters, like the well-known Gaussian blur, are an
effective method for decreasing the contrast of an image (Figure 3.5). In the frequency domain,
the Gaussian blur acts as a low-pass filter, meaning that high-frequency components are subdued
or even eliminated. As a result, edges become softer and contrast decreases. Unfortunately, this
particularly applies to sharp edges, which contain a broad spectrum of frequency components.
As it is my goal not only to lower low contrast, but also to preserve or even increase high
contrast, edge blurring poses a problem.
To explain the relevant technical terms in their historical context, it is useful to introduce
an alternative description of the Gaussian blur. Several linear filters, the Gaussian blur among
them, can be interpreted as solutions to the heat equation [43]. To gain an intuitive understanding one can imagine a room filled with a gas of spatially varying temperature. Because the gas
is free to move around, it will attempt to reach an equilibrium state of constant temperature
Figure 3.5. Linear vs. Non-linear Filtering. Scanlines (top row) of luminance
values for the horizontal lines marked in green in the bottom row images. Significant luminance discontinuities are marked with vertical lines. Original: The
original scanline contains several large and sharp discontinuities, corresponding to semantically meaningful regions in the source image that I would like
to preserve (wall, guard outline right, right leg fold, guard outline left). The
scanline also contains a large amount of small, high frequency components on
top of the base signal. These smaller components generally constitute texture
or noise, which I would like to subdue. Linear Filter: Linear filtering (here,
Gaussian blur) successfully subdues high-frequency components, thus simplifying the scanline. Since a linear filter operates isotropically and homogenously
it also suppresses the high-frequency components of the sharp discontinuities
thus smoothing these undesirably. Non-Linear Filter: The anisotropic and inhomogeneous action of the non-linear filter smooths away high frequencies in
low contrast regions, while preserving most frequencies in high contrast regions.
Compare the shape of all scanlines, especially at the discontinuities marked with
vertical lines.
everywhere. If there exists no spatial bias in the way the gas can move (apart from boundary
conditions), the system is said to have a constant diffusion conduction function and the gas diffuses isotropically in all directions. In that case a linear diffusion function, like the Gaussian
blur, can be used to model the diffusion process12.
12. To bring this example back to the image domain, imagine an arbitrary image to which one applies a very small Gaussian blur. As a result, neighboring colors mix. Repeating this process ad infinitum mixes all colors into a single color that is the average of the initial colors.
Perona and Malik [125] defined a class of filters with spatially varying diffusion conduction functions, resulting in anisotropic diffusion. These filters have the property of blurring
small discontinuities and sharpening edges13 (Figure 3.5). Using such a filter with a conduction function based on feature contrast, the contrast can be effectively amplified or subdued.
Unfortunately, Perona and Malik’s neural-net-like implementation is not efficient enough for a
real-time system on standard graphics hardware, due to its very small (see Footnote 12) spatial
support14.
Barash and Comaniciu [4] demonstrated that anisotropic diffusion solvers can be extended
to larger spatial neighborhoods, thus producing a broader class of extended nonlinear diffusion
filters. This class includes iterated bilateral filters as one special case, which I prefer due to
their larger support size and the fact that they can be approximated quickly and with few visual
artifacts using a separated kernel [126].
Extended Nonlinear Diffusion.
Given an input image f (·), which maps pixel locations
into some feature space, I define the following customized bilateral filter, H(·):
(3.1)
H(\hat{x}, \sigma_d, \sigma_r) = \frac{\displaystyle\int e^{-\frac{1}{2}\left(\frac{\|\hat{x}-x\|}{\sigma_d}\right)^{2}}\, w(x, \hat{x}, \sigma_r)\, f(x)\, dx}{\displaystyle\int e^{-\frac{1}{2}\left(\frac{\|\hat{x}-x\|}{\sigma_d}\right)^{2}}\, w(x, \hat{x}, \sigma_r)\, dx}

13. A filter that increases the steepness of edges towards a true step-like discontinuity is sometimes called shock-forming.
14. Their approach is still very much parallelizable and could be efficiently implemented in special hardware.
x̂ : Pixel location
w(·) : Range weight function
x : Neighboring pixel
m(·) : Linear weighting function
σd : Neighborhood size
w0 (·) : Diffusion conduction function
σr : Conductance threshold
u(·) : User-defined map
In this formulation, x̂ is a pixel location and x are neighboring pixels, where the neighborhood size is defined by σd (blur radius). For implementation purposes, I limit the evaluation
radius to two standard deviations, ±2σd , and normalize the convolution kernel to account for
the missing area under the curve. This rule applies similarly to all following functions involving
convolutions with exponential fall-off.
Increasing σd results in more blurring, but if σd is too large features may blur across significant boundaries. The range weighting function, w(·), is closely related to the diffusion
conduction function (see below) and determines where in the image contrasts are smoothed or
sharpened by iterative applications of H(·).
(3.2)
w(x, x̂, σr ) = (1 − m(x̂)) · w0 (x, x̂, σr ) + m(x̂) · u(x̂)
Range Weights and Diffusion Conduction Functions.
My definition of Equation 3.2
extends the traditional Bilateral filter to become more customizable for data-driven or artistic
control.
For the real-time, automatic case, I set m(·) = 0, such that w(·) = w0 (·) and Equation 3.1
becomes the familiar bilateral filter [155]. Here, w0(·) is the traditional diffusion conduction
Figure 3.6. Diffusion Conduction Functions and Derivatives. Three possible
range functions, w0 , for use in Equation 3.2. All functions have a Gauss-like
bell shape, but differ in their differentiability and differential function shape.
Since all functions produce very similar results when applied to an image, the
best choice for a given application depends largely on the support for optimized
implementations of each function.
function and can take on numerous forms (Figure 3.6), given that w0(x = x̂, x̂, σr) = c, with c
some finite constant, and lim_{x→±∞} w0(x, x̂, σr) = 0.
(3.3)
w_{0}^{E}(x, \hat{x}, \sigma_r) = e^{-\frac{1}{2}\left(\frac{\Delta f_{\hat{x}}}{\sigma_r}\right)^{2}}

(3.4)
w_{0}^{I}(x, \hat{x}, \sigma_r) = \frac{A}{1 + \left(\frac{\Delta f_{\hat{x}}}{\sigma_r}\right)^{2}}

(3.5)
w_{0}^{C}(x, \hat{x}, \sigma_r) = \begin{cases} \frac{A}{2}\left(1 + \cos\frac{\Delta f_{\hat{x}} \cdot \pi}{3\,\sigma_r}\right) & \text{if } -3\,\sigma_r \le \Delta f_{\hat{x}} \le 3\,\sigma_r, \\ 0 & \text{otherwise.} \end{cases}

(3.6)
\text{where } \Delta f_{\hat{x}} = \|f(\hat{x}) - f(x)\|
Equation 3.3 is the conduction function used by Tomasi and Manduchi [155] (they use the
term range weighting function, as above) and employed for most images in this chapter. Equation 3.4 is based on one of Perona and Malik’s [125] original functions and Equation 3.5 is a
function I devised for its finite spatial support (both other functions have infinite support and
need to be truncated and normalized for practical implementations). Figure 3.6 shows comparisons of these functions along with their first two derivatives. In practice, I find that all functions
give comparable results and a selection is best based on implementation efficiency on a given
platform and subjective quality estimates. As I am interested in manipulating contrast, all proposed conduction functions operate on local contrast15, as defined in Equation 3.6. Perona and
Malik [125] called parameter σr in Equations 3.2-3.5 the conductance threshold in reference
to its deciding role in whether contrasts are sharpened or blurred. Small values of σr preserve
almost all contrasts, and thus lead to filters with little effect on the image, whereas for large
15. Other non-linear diffusion filters that operate on higher-order derivatives of the image have been proposed to achieve different goals [164, 158].
Figure 3.7. Progressive Abstraction. This figure shows a source image (unfiltered) that is progressively abstracted by successive applications of an extended
nonlinear diffusion filter. Note how low contrast detail (e.g. the texture in the
stone wall and the soft folds in the guard’s garments) is smoothed away, while
high contrast detail (facial features, belts, sharp creases in garment) is preserved
and possibly enhanced.
values, lim_{σr→∞} w0(·) = 1, thus turning H(·) into a standard, linear Gaussian blur. For intermediate
values of σr, iterative filtering of H(·) results in an extended nonlinear diffusion process,
where the degree of smoothing or sharpening is determined by local contrasts in f (·)’s feature
space. Figure 3.7 shows the progressive removal of low contrast detail due to iterative nonlinear
diffusion.
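As a concrete illustration, the following CPU sketch implements one iteration of H(·) (Equation 3.1) for the automatic case m(·) = 0, using the exponential conduction function of Equation 3.3 and the feature contrast of Equation 3.6. It is written for clarity in Python/NumPy rather than for speed; the default parameter values, the wrap-around boundary handling, and the function name are choices of this sketch, not prescriptions of the framework.

import numpy as np

def bilateral_pass(lab, sigma_d=3.0, sigma_r=4.25):
    # One iteration of the (non-separable) bilateral filter H(.) on a CIELab
    # image of shape (H, W, 3). Evaluation is limited to +/- 2 sigma_d.
    radius = int(np.ceil(2.0 * sigma_d))
    num = np.zeros_like(lab)
    den = np.zeros(lab.shape[:2])
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            shifted = np.roll(lab, (dy, dx), axis=(0, 1))      # neighbor f(x)
            spatial = np.exp(-0.5 * (dy * dy + dx * dx) / sigma_d ** 2)
            delta_f = np.linalg.norm(shifted - lab, axis=2)    # Eq. 3.6
            w_range = np.exp(-0.5 * (delta_f / sigma_r) ** 2)  # Eq. 3.3
            weight = spatial * w_range
            num += weight[..., None] * shifted
            den += weight
    return num / den[..., None]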
Automatic vs. Data-driven Abstraction.
With m(·) ≠ 0, the range weighting function,
w(·), turns into a weighted sum of w0 (·) and an arbitrary importance field, u(·), defined over
the image. In this case, m(·) and u(·) can be computed via a more elaborate visual saliency
model [130, 78], derived from eye-tracking data [36], or painted by an artist [71]. Figure 3.8
shows comparisons between DeCarlo and Santella’s [36] explicit stylization system and my
Figure 3.8. Data-driven Abstraction. This figure shows images abstracted automatically (left) vs. abstractions guided by eye-tracking data (right). The top
row are original images by DeCarlo and Santella [36], while the bottom row
shows my results given the same data. It is noteworthy that despite some stylistic differences the two systems abstract images very similarly, although the systems themselves are radically different in design. Particularly, it is not necessary
to derive a computationally expensive explicit image representation to achieve
meaningful abstraction.— {Top images and eye-tracking data by Doug DeCarlo and Anthony
Santella, with permission.}
implicit framework. I created the data-driven example by converting DeCarlo and Santella’s
eye-tracking data into an importance map, u(·), setting m(·) := u(·), and tuning the remaining
framework parameters to approximate the spatial scales and simplification levels found in DeCarlo and Santella’s original image, to allow for better comparability. After setting the initial
Figure 3.9. Painted Abstraction. User-painted masks achieve an effective separation of foreground and background objects. Automatic: The automatic abstraction of the source image yields the same level of abstraction everywhere.
Foreground & Background: Masks (shown as insets) selectively focus abstraction primarily on the background and foreground, respectively.— {Original Source
image in public domain.}
parameters, the framework ran automatically. Note that although the two abstraction systems are
radically different in design and implementation (e.g. DeCarlo and Santella's image-structure-based system versus my image-based framework), the level of abstraction achieved by both is
very similar.
Figure 3.9 demonstrates the use of a user-painted importance mask, u(·). As above, I set
m(·) := u(·). The masks, shown as insets in the figure, are kept simple for demonstrative
reasons but could be arbitrarily complex. In effect, a user can simply paint abstraction onto
an image with a brush, the level of abstraction depending on the brightness and spatial extent
of the brush. Since the framework operates in real-time, this process affords immediate visual
feedback to the user and allows even novice users to easily create abstractions with a simple
and intuitive interaction common in many image manipulation products.
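A minimal sketch of Equation 3.2 is given below; delta_f is the per-neighbor feature contrast of Equation 3.6, while m and u are per-pixel maps in [0, 1] (an eye-tracking-derived or painted importance map, for instance). Setting m := u, as in the examples above, is one possible choice.

import numpy as np

def range_weight(delta_f, sigma_r, m=None, u=None):
    # w(x, x_hat, sigma_r) per Equation 3.2. With m = None (i.e. m(.) = 0) this
    # reduces to the automatic conduction function w0 of Equation 3.3.
    w0 = np.exp(-0.5 * (delta_f / sigma_r) ** 2)
    if m is None:
        return w0
    return (1.0 - m) * w0 + m * u

In the bilateral sketch of Section 3.3.2, this function would replace the automatic range weight term, so that the user map locally overrides the contrast-based weighting.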
Optimizations and Other Considerations.
Applying a full extended non-linear diffu-
sion solver with reasonable spatial support and sufficient iterations to achieve valuable abstraction is computationally too expensive for real-time purposes. Fischer et al. [49] addressed this
problem by applying their full filter implementation on a downsampled input image and then
interpolating the result to the original size. While this allowed them to perform at least one
iteration in real-time, the upsampling interpolation caused blurring of the resulting image, as
expected.
My solution uses a separable implementation of the non-linear diffusion kernel. A two-dimensional kernel is separable if it is equal to the convolution of two one-dimensional kernels:

\int_{\mathbb{R}^2} k_1(x_1, x_2)\, dx_1\, dx_2 = \int_{\mathbb{R}} k_2(x_1)\, dx_1 * \int_{\mathbb{R}} k_3(x_2)\, dx_2
The one-dimensional kernels can be applied sequentially (the latter operating on the result of the
former), thus reducing the computational complexity from O(n²) to O(n), where n is the radius
of the convolution kernel, and in turn limiting costly memory fetches. Mathematically, a nonlinear filter is generally not separable, the bilateral filter included. Still, I have obtained good
results with this approach in practice. My results show empirically that a separable approximation to a bilateral filter produces minor (difficult to see with the naked eye) spatial biasing
artifacts compared to the full implementation for a small number of iterations (< 5 for most
images tested and using the default values in this chapter). Due to the shock-forming behavior
of the bilateral filter, these biases tend to harden and become more pronounced with successive
iterations (Figure 3.10).
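A sketch of the separable approximation follows: each iteration applies one one-dimensional bilateral pass along the rows and one along the columns. As above, the boundary handling and default parameters are choices of this sketch.

import numpy as np

def bilateral_1d(lab, sigma_d, sigma_r, axis):
    # One-dimensional bilateral pass along the given image axis (0 = vertical).
    radius = int(np.ceil(2.0 * sigma_d))
    num = np.zeros_like(lab)
    den = np.zeros(lab.shape[:2])
    for d in range(-radius, radius + 1):
        shift = (d, 0) if axis == 0 else (0, d)
        shifted = np.roll(lab, shift, axis=(0, 1))
        spatial = np.exp(-0.5 * (d / sigma_d) ** 2)
        delta_f = np.linalg.norm(shifted - lab, axis=2)        # Eq. 3.6
        weight = spatial * np.exp(-0.5 * (delta_f / sigma_r) ** 2)
        num += weight[..., None] * shifted
        den += weight
    return num / den[..., None]

def separable_bilateral_pass(lab, sigma_d=3.0, sigma_r=4.25):
    # Two 1-D passes approximate one full 2-D bilateral iteration.
    return bilateral_1d(bilateral_1d(lab, sigma_d, sigma_r, axis=0),
                        sigma_d, sigma_r, axis=1)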
Figure 3.10. Separable Bilateral Approximation. Two images of a bilateral
filter diffusion process after 41 iterations. Full: Using the full two-dimensional
implementation. Approximate: Using two separate one-dimensional passes. In
most cases these errors are fairly small and only become prominent after a large
number of iterations.— {Fair use: The images shown here for educational purposes are derivations of a small portion of an original image as shown on daily television.}
Pham and Vliet [126] corroborate this result in contemporaneous work. They show empirically that a single iteration of a separable bilateral filter produces few visual artifacts, even for
the worst-case scenario of a 45◦ tilted discontinuous edge.
Figure 3.10 shows results for a large number of iterations, where errors tend to accumulate.
I observe two types of effects: (1) sharp diagonal edges often evolve into jagged horizontal
and vertical steps; and (2) soft diagonal edges fail to evolve (see the marked examples in
Figure 3.10). For most of my videos and images, including the user study, I have found it
sufficient to apply between 2 and 4 iterations, so that spatial biases are rarely noticeable. The speed
improvement, on the other hand, is in excess of 30 times in the GPU implementation.
[Figure 3.11 diagram labels: Receptive Field; 1D Response Profile; Position; Output]
Figure 3.11. Center-Surround Cell Activation. The receptive field of a cortical cell is modeled as an antagonistic system in which the stimulation of the
central cell (blue) is inhibited by the simultaneous excitation of its surrounding
neighbors (green). In other words, a center-surround cell is only triggered if it
itself receives a signal while its receptive field is not stimulated. This system
gives rise to the Mexican hat shape in 3-D (left, checkered shape) and the corresponding curves shown in the right image. The combined response curve can be
modeled by subtracting two Gaussian distribution functions whose standard deviations are proportional to the spatial extent of the central cell and its receptive
field [121].
As noted previously, all abstraction operations are performed in CIELab space. Consequently, the parameter values given here and in the following sections are based on the assumption that L ∈ [0, 100] and (a, b) ∈ [−127, 127].
3.3.3. Edge detection
In general, edges are defined by high local contrast, so adding visually distinct edges to regions
of high contrast further increases the visual distinctiveness of these locations.
Figure 3.12. DoG Edge Detection and Enhancement. The center-surround
mechanism described in Figure 3.11 can be used to detect edges in an image.
Source: An abstracted image used to detect edges. DoG Result: The raw output
of a DoG filter needs to be quantized to obtain high contrast edges. Step Quantization: Discontinuous quantization results in temporally incoherent edges near
the step boundary. Smooth Quantization: Using Equation 3.7 for quantization
results in edges that quickly fade at the quantization boundary, leading to improved temporal coherence. Compare the circled edges in the bottom images.
Marr and Hildreth [109] formulated an edge detection mechanism based on zero-crossings
of the second derivative of the luminance function. They postulated that retinal cells (center), which are stimulated while their surrounding cells are not stimulated, could act as neural
implementations of this edge detector (Figure 3.11). A computationally efficient approximation to this edge detection mechanism is the quantized result of the difference-of-Gaussians
Figure 3.13. DoG Parameter Variations. Extending the standard DoG edge
detector with soft quantization parameters allows me to create a rich set of
stylistic variations. Left: A classic DoG result (no shading) with fine edges
(low scale parameter σe). (σe, τ, ε) = (0.7, 0.9904, 0). Center: Same edge
scale as in left image, but with additional shading information. Note that this
image is not simply a combination of edges with luminance information as in
Gooch et al. [62], because edges in dark regions (e.g. person's right cheek,
bottom of beard) are still visible (as bright lines against dark background).
In terms of style, the image has a distinct charcoal-and-pencil appearance.
(σe, τ, ε) = (0.7, 0.9896, 0.00292). Right: Coarse edges using a large spatial
kernel (compare detail in hair and hat with left image) and light shading around
eyes, cheek and throat. (σe, τ, ε) = (1.8, 0.9650, 0.01625). Parameter ϕe = 5.0
throughout. Given the above parameters, these images are created fully automatically
in a single processing step.— {Original photograph used as input © Andrew Calder, with permission.}
(DoG) operator (Figure 3.12). Rather than using a binary edge quantization model as in previous works [22, 62, 49], I define my edges using a slightly smoothed continuous function,
D(·), (Equation 3.7; depicted in Figure 3.4, bottom inset) to increase temporal coherence in
animations and to allow for a wider range of stylization effects (Figure 3.13) than previous
implementations.
Figure 3.14. Edge Cleanup Passes. DoG edges are extracted after ne < nb
bilateral filter passes to eliminate noise that could lead to temporal incoherence
in the edges. From left to right, this figure shows the original edges contained
in a source image, and the edges extracted after two and after four bilateral cleanup
passes. Note that the differences between no cleanup and two passes are much
greater than between two and four passes, indicating that a point of diminishing
returns is quickly reached.— {Original photograph used as input © Andrew Calder, with permission.}
(3.7)
D(\hat{x}, \sigma_e, \tau, \varepsilon, \varphi_e) = \begin{cases} 1 & \text{if } (S_{\sigma_e} - \tau \cdot S_{\sigma_r}) > \varepsilon, \\ 1 + \tanh\!\left(\varphi_e \cdot (S_{\sigma_e} - \tau \cdot S_{\sigma_r})\right) & \text{otherwise.} \end{cases}

(3.8)
S_{\sigma_e} \equiv S(\hat{x}, \sigma_e)

(3.9)
S_{\sigma_r} \equiv S(\hat{x}, 1.6 \cdot \sigma_e)

(3.10)
S(\hat{x}, \sigma_e) = \frac{1}{\sqrt{2\pi\sigma_e^{2}}} \int f(x)\, e^{-\frac{1}{2}\left(\frac{\|\hat{x}-x\|}{\sigma_e}\right)^{2}} dx
Equation 3.8 and Equation 3.9 represent Gaussian blurs (Equation 3.10) with different standard deviations and correspond to the center and negative surround responses of a cell, respectively. The factor of 1.6 in Equation 3.9 relates the size of a typical center-surround cell to
Figure 3.15. DoG vs. Canny Edges. DoG Edges: Soft DoG edges tuned to
yield results comparable to the Canny edges. The thickness of the lines is
proportional to the strength of edges as well as the scale at which edges are
detected (Figure 3.3), giving the lines an organic feel. Canny Edges: Canny edge-lines
are designed to be infinitely thin, irrespective of scale. This is advantageous
for image segmentation (Figure 3.2), but often belies the true scale of edges,
making it more difficult to visually interpret the resulting lines. Canny Edges
Eroded: Morphological thickening of lines, as in Fischer et al. [49], can easily
hide small detail (e.g. threads in hat).— {Original photograph used as input © Andrew Calder, with permission.}
the extent of its receptive field [109]. Together, the parameters τ and ε in Equation 3.7 control
the amount of center-surround difference required for cell activation. Parameter ε commonly
remains zero, while τ is smaller than, yet very close to, one. Various visual effects can be achieved by
changing these default values (Figure 3.13). Parameter ϕe controls the sharpness of the activation falloff. A larger value of ϕe increases the sharpness of the fall-off function thereby creating
a highly sensitive edge detector with reduced temporal coherence, while a small value increases
temporal coherence but only detects strong edges. Typically, I set ϕe ∈ [0.75, 5.0]. Parameter
σe determines the spatial scale for edge detection (Figure 3.3). The larger the value, the coarser
the edges that are detected. For nb bilateral iterations, I extract edges after ne < nb iterations to
reduce noise (Figures 3.14 and 3.23). Typically, ne ∈ {1, 2} and nb ∈ {3, 4}.
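The sketch below implements the soft-quantized DoG detector of Equations 3.7-3.10 on the luminance channel, using a library Gaussian blur in place of the explicit integral; the default parameter values are merely illustrative, chosen within the ranges discussed above.

import numpy as np
from scipy.ndimage import gaussian_filter

def dog_edges(lab, sigma_e=1.0, tau=0.98, eps=0.0, phi_e=2.0):
    # D(.) per Equation 3.7: values stay at 1 where the center-surround
    # difference exceeds eps and fall smoothly towards 0 along edge lines.
    lum = lab[..., 0]
    s_e = gaussian_filter(lum, sigma_e)          # Eq. 3.8, center response
    s_r = gaussian_filter(lum, 1.6 * sigma_e)    # Eq. 3.9, surround response
    diff = s_e - tau * s_r
    soft = 1.0 + np.tanh(phi_e * diff)           # smooth branch of Eq. 3.7
    return np.where(diff > eps, 1.0, soft)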
Canny [22] devised a more sophisticated edge detection algorithm (sometimes called optimal), which due to its computer vision roots is commonly used to derive explicit image representations via segmentation [36], but has also been used in purely image-based systems [49].
Canny edges are well suited for image segmentation because they are infinitely thin16 and guaranteed to lie on any real edge in an image, but at the same time they can become disconnected
for large values of σe and are computationally more expensive than DoG edges. DoG edges
are cheaper to compute and not prone to disconnectedness, but may drift from real image edges
for large values of σe. I prefer DoG edges for their computational efficiency and temporal coherence,
because their thickness scales naturally with σe (Figure 3.3 and Figure 3.15), and because my
soft-quantization version (Equation 3.7) allows for a number of stylistic variations. I address
edge drift with image-based warping.
3.3.4. Image-based warping (IBW)
DoG edges can become dislodged from true edges for large values of σe and may not line up
perfectly with edges in the color channels. To address such small edge drifts and to sharpen
the overall appearance of the final result (Figure 3.4, top-right), I optionally perform an imagebased warp, or warpsharp filter. IBW is a technique first proposed by Arad and Gotsman [1] for
image sharpening and edge-preserving upscaling, in which they moved pixels along a warping
field towards nearby edges (Figure 3.16).
16. For their image-based system, Fischer et al. [49] artificially increase the thickness of Canny edges using morphological operations.
Figure 3.16. IBW Effect. Top Row: An image before and after warping and the
color-coded differences between the two (Green = black expands; Red = black
recedes). Bottom Row: Detail of the person's left eye. Note that although the
effect is fairly subtle (zoom in for better comparison) it generally improves the
subjective quality of images considerably, particularly for upscaled images. This
figure uses an edge image as input for clarity, but in the full implementation the
entire color image is warped.— {Original photograph used as input © Andrew Calder, with permission.}
Given an image, f(·), and a warp-field, M_w : \mathbb{R}^2 \mapsto \mathbb{R}^2, which maps the image plane onto
itself17, the warped image, W(x), is constructed as:

(3.11)
W(x) = f\left(M_w^{-1}(x)\right)

This notation is after Arad and Gotsman [1], where M_w^{-1} is used to indicate backward mapping,
which is preferable for upscaling interpolation.

17. That is, the warp-field maps pixel positions rather than pixel values.
Figure 3.17. Computing Warp Fields. An input image is blurred and convolved with horizontal and vertical Sobel kernels, resulting in spatially varying
warp fields for sharpening an image.— {Original photograph used as input © Andrew Calder, with permission.}
In my implementation, which closely follows Loviscach’s [103] simpler IBW approach, Mw
is the blurred and scaled result of a Sobel filter, a simple 2-valued vector field that in the discrete
domain (see Section 3.3.1) is easily invertible to obtain Mw−1 :
(3.12)  $M_w(\hat{x}, \sigma_w, \varphi_w) = \varphi_w \cdot \frac{1}{2\pi\sigma_w^2} \int \Psi(x)\, e^{-\frac{1}{2}\left(\frac{\|\hat{x}-x\|}{\sigma_w}\right)^2}\, dx$

(3.13)  $\Psi(x) = \left[\, f_L(x) * \begin{pmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{pmatrix},\;\; f_L(x) * \begin{pmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{pmatrix} \right]^{T}$
Here, parameter σw in Equation 3.12 controls the area of influence that edges have on the
resulting warp. The larger the value the more distant pixels are affected. Parameter ϕw controls
the warp-strength, that is, how much affected pixels are warped toward edges. A value of zero
has no effect, while very large values can significantly distort the image and push pixels beyond
the attracting edge18. For most images, I use σw = 1.5, and ϕw = 2.7 with bi-linear or bi-cubic
backward mapping.
Note that while Equation 3.11 operates on all channels of the input image, Equation 3.13 is
only based on the Luminance channel, fL , of the image. Figure 3.17 shows the horizontal and
vertical Sobel components of Ψ(·) for a given input image.
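A minimal Python sketch of this warp-sharp step, assuming NumPy/SciPy: the warp field is the blurred, scaled Sobel gradient of the luminance channel (Equations 3.12 and 3.13), and the color image is resampled by backward mapping (Equation 3.11). The sign convention used to pull pixels toward edges is an implementation assumption.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel, map_coordinates

def warp_sharp(image, lum, sigma_w=1.5, phi_w=2.7):
    """Image-based warp (warpsharp) sketch for a color image of shape (H, W, C)."""
    # Equation 3.13: horizontal and vertical Sobel responses of the luminance channel.
    gx = sobel(lum, axis=1)
    gy = sobel(lum, axis=0)
    # Equation 3.12: blur and scale the gradient field to obtain the warp field.
    wx = phi_w * gaussian_filter(gx, sigma_w)
    wy = phi_w * gaussian_filter(gy, sigma_w)
    # Equation 3.11: backward mapping W(x) = f(Mw^-1(x)), sampled bilinearly (order=1).
    rows, cols = np.mgrid[0:lum.shape[0], 0:lum.shape[1]].astype(float)
    coords = np.array([rows + wy, cols + wx])
    warped = np.empty_like(image, dtype=float)
    for c in range(image.shape[2]):  # every color channel is warped with the same field
        warped[..., c] = map_coordinates(image[..., c], coords, order=1, mode='nearest')
    return warped
```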
3.3.5. Temporally Coherent Stylization
To further simplify an image (in terms of its color histogram) and to open the framework further
for creative use, I perform an optional color quantization step on the abstracted images, which
results in cartoon or paint-like effects (Figures 3.1 and 3.18).
(3.14)  $Q(\hat{x}, q, \varphi_q) = q_{nearest} + \frac{\Delta q}{2} \tanh\!\left(\varphi_q \cdot \left(f(\hat{x}) - q_{nearest}\right)\right)$
In Equation 3.14, Q(·) is the pseudo-quantized image, ∆q is the bin width, qnearest is the bin
boundary closest to f (x̂), and ϕq controls the sharpness of the transition from one bin to another
(top inset, Figure 3.4). Equation 3.14 is formally a discontinuous function, but for sufficiently
large ϕq , these discontinuities are not noticeable.
For a fixed ϕq , the transition sharpness is independent of the underlying image, possibly
creating many noticeable transitions in large smooth-shaded regions. To minimize jarring transitions, I define the sharpness parameter, ϕq , to be a function of the luminance gradient in the
abstracted image. I allow hard bin boundaries only where the luminance gradient is high. In
low gradient regions, bin boundaries are spread out over a larger area. I thus offer the user
a trade-off between reduced color variation and increased quantization artifacts by defining a target sharpness range [Λϕ, Ωϕ] and a gradient range [Λδ, Ωδ]. I clamp the calculated gradients to [Λδ, Ωδ] and then generate a ϕq value by mapping them linearly to [Λϕ, Ωϕ]. The effect for typical parameter values is hard, cartoon-like boundaries in high gradient regions and soft, painterly transitions in low gradient regions (Figure 3.18). Typical values for these parameters are q ∈ [8, 10] equal-sized bins and a gradient range of [Λδ = 0, Ωδ = 2], mapped to sharpness values between [Λϕ = 3, Ωϕ = 14].

18. For completeness: negative values of ϕw push pixels away from edges, which looks interesting, but is generally not useful for meaningful image abstraction.

Figure 3.18. Luminance Quantization Parameters. An original image along with parameters resulting in sharp and soft quantizations. Compare details in marked regions. Sharp: A very large ϕq creates hard, toon-shading-like boundaries. (q, Λϕ, Ωϕ, ϕq) = (8, 2.0, 32.0, 500.0). Soft: A larger number of quantization bins and a low value of ϕq creates soft, paint-like whisks at the quantization boundaries. (q, Λϕ, Ωϕ, ϕq) = (14, 3.4, 10.6, 9.7). Edge scale σe = 2.0 for both abstractions.
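A sketch of this adaptive soft quantization in Python follows; the gradient estimator (Gaussian-smoothed finite differences) and the placement of bin boundaries at multiples of the bin width are assumptions, since the text does not prescribe them.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def soft_quantize(lum, q=8, lambda_phi=3.0, omega_phi=14.0,
                  lambda_delta=0.0, omega_delta=2.0):
    """Gradient-adaptive soft luminance quantization in the spirit of Equation 3.14."""
    delta_q = (lum.max() - lum.min()) / q          # bin width
    q_nearest = np.round(lum / delta_q) * delta_q  # closest bin boundary
    # Estimate the local luminance gradient magnitude.
    gy, gx = np.gradient(gaussian_filter(lum, 1.0))
    grad = np.hypot(gx, gy)
    # Map the clamped gradient linearly to the target sharpness range.
    t = np.clip((grad - lambda_delta) / (omega_delta - lambda_delta), 0.0, 1.0)
    phi_q = lambda_phi + t * (omega_phi - lambda_phi)
    # Equation 3.14: pseudo-quantization with a tanh transition between bins.
    return q_nearest + 0.5 * delta_q * np.tanh(phi_q * (lum - q_nearest))
```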
Although soft quantization is not a novel idea, it has hardly been used for abstraction systems, particularly in a locally adaptive form. My pseudo quantization approach, apart from being effective and efficient to implement, offers significant temporal coherence advantages over
previous systems using discontinuous quantization or automatic image-structure-based systems.
In standard quantization, an arbitrarily small luminance change can push a value to a different
bin, thus causing a large output change for a small input change, which is particularly troublesome for noisy input. With soft quantization, such a change is spread over a larger area,
making it less noticeable. Using a gradient-based sharpness control, sudden changes are further
subdued in low-contrast regions, where they would be most objectionable. Finally, an adaptive
controlling mechanism offers the benefits of both effective quantization and temporal coherence
with easily adjustable trade-off parameters set by the user.
3.3.6. Optimizations
In designing my framework, I capitalize on two types of optimizations: parallelism and separability.
Parallelism.
Modern graphics processor units (GPUs) are highly efficient parallel computation machines and are particularly well suited for many image processing operations. To take advantage of this parallel computing power, every element in my processing framework is highly parallelizable, that is, it does not rely on global operations (like min(·), max(·), Σ(·), etc.) and all operations only rely on previous processing steps (i.e. no forward dependencies).
In addition, the non-linear diffusion (Section 3.3.2) and edge-detection (Section 3.3.3) operations after the initial noise-removal iterations (n > ne ) can be performed in parallel, as can the
center and surround kernel convolutions of the edge-detection itself. I use Olsen’s [119] GPU
image processing system to automatically compute and schedule processes and resolve memory
dependencies.
Separability.
As discussed in Section 3.3.2, the separable implementation of a two-dimensional filter kernel yields a significant performance gain. Since the Gaussian(-like) convolution features so heavily in the abstraction framework (see Section 6.1.3 for a discussion of this observation), I take advantage of this optimization in almost every processing step (non-linear diffusion, edge-detection, and image-based warping).
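As a concrete illustration of the separability argument, the sketch below blurs an image with two 1-D Gaussian passes instead of one 2-D convolution, reducing the per-pixel cost from O(r²) to O(2r) for a kernel of radius r.

```python
import numpy as np

def gaussian_kernel_1d(sigma):
    """Normalized 1-D Gaussian kernel, truncated at three standard deviations."""
    r = int(np.ceil(3 * sigma))
    x = np.arange(-r, r + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def separable_gaussian(img, sigma):
    """Blur a 2-D (grayscale) image with a horizontal pass followed by a vertical pass."""
    k = gaussian_kernel_1d(sigma)
    tmp = np.apply_along_axis(lambda row: np.convolve(row, k, mode='same'), 1, img)
    return np.apply_along_axis(lambda col: np.convolve(col, k, mode='same'), 0, tmp)
```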
3.4. Experiments
Section 3.2 explains the perceptual considerations that have gone into the framework design
and Section 3.3 details the various image processing operations that implement the corresponding image simplification and abstraction steps, but this still does not guarantee that the abstracted images are effective for visual communication. To verify that my abstractions preserve
or even distill perceptually important information, I performed two task-based studies that test
recognition speed and short term memory retention. The studies use small images because (1) I
see portable visual communication and low-bandwidth applications to practically benefit most
from my framework and (2) because small images may be a more telling test of the framework
as each pixel represents a larger percentage of the image.
Participants.
In each study, 10 (5 male, 5 female) undergraduates, graduate students or
research staff acted as volunteers.
Materials.
Images in Study 1 are scaled to 176 × 220, while those in Study 2 are scaled
to 152 × 170. These resolutions approximate those of many portable devices. Images are shown
centered on an Apple Cinema Display at a distance of 24 inches to subtend visual angles of 6.5◦
and 6.0◦ , respectively. The unused portion of the monitor framing the images is set to white.
Figure 3.19. Sample Images for Study 1. The top row shows the original images (non-professional photographs) and the bottom row shows the abstracted
versions. Note how many wrinkles and individual strands of hair are smoothed
away, reducing the complexity of the images while actually improving recognition in the experiment. All images use the same σe for edges and the same
number of simplification steps, nb . — {Pierce Brosnan and Ornella Muti by Rita Molnár, Creative Commons License. Paris Hilton by Peter Schäfermeier, Creative Commons License. George Clooney,
public domain.}
In Study 1, 50 images depicting the faces of 25 famous movie stars are used as visual stimuli.
Each face is depicted as a color photograph and as a color abstracted image created with my
framework (Figure 3.19). In Study 2, 32 images depicting arbitrary scenes are used as visual
stimuli. Humans are a component in 16 of these images (Figure 3.20).
Analysis.
For both studies, p-values are computed using two-way analysis of variance
(ANOVA), with α = 0.05.
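For readers who wish to replicate this kind of analysis, the sketch below runs a two-way ANOVA with statsmodels, assuming the per-trial data is available as a table with participant, condition, and reaction-time columns; the file and column names are illustrative, not taken from the dissertation's materials.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical data file with one row per trial:
# columns 'participant', 'condition' (photograph/abstraction), and 'rt' in seconds.
data = pd.read_csv("study1_reaction_times.csv")
model = ols("rt ~ C(condition) + C(participant)", data=data).fit()
print(sm.stats.anova_lm(model, typ=2))  # ANOVA table with p-values, tested at alpha = 0.05
```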
Figure 3.20. Sample Images from Study 2. The top row shows the original
snapshot-style photographs and the bottom row shows the abstracted versions.
Note how much of the texture in the original photographs (like water waves,
sand, and grass) is abstracted away to simplify the images. All images use the
same σe for edges and the same number of simplification steps, nb .
3.4.1. Study 1: Recognition Speed
Hypothesis.
Study 1 tests the hypothesis (H1 ) that abstracted images of familiar faces are
recognized more quickly than normal photographs. Faces are a very important component of
daily human visual communication and I want the framework to help in the efficient representation of faces.
Procedure.
To ensure that participants in the study are likely to know the persons depicted
in the test images, I use photographs of celebrities as source images and controls. The study
uses a protocol [149] demonstrated to be useful in the evaluation of recognition times for facial
images [62] and consists of two phases: (1) reading the list of 25 movie star names out loud;
and (2) a reaction time task in which participants are presented with sequences of the 25 facial
images. All faces take up approximately the same space in the images and are three quarter
views. By pronouncing the names of the people that are rated, participants tend to reduce
the tip-of-the-tongue effect where a face is recognized without being able to quickly recall the
associated name [149]. For the same reason, participants are told that first, last, or both names
can be given, whichever is easiest. Each participant is asked to say the name of the pictured
person as soon as that person’s face is recognized. A study coordinator records reaction times,
as well as accuracy of the answers. Images are shown and reaction times recorded using the
Superlab software product for 5 seconds at 5-second intervals. The order of image presentation
is randomized for each participant.
Data Conditioning.
Two additional volunteers were eliminated from the study after
failing familiarity requirements. One volunteer was not familiar with at least 25 celebrities.
Another volunteer claimed familiarity with at least 25 celebrities, but his or her accuracy for
both photographs and abstractions was more than three standard deviations from the remainder
of the group, indicating that the volunteer was not reliably able to associate faces with names.
By the same reasoning, two images were deleted from the experimental evaluation because
their accuracy (in both conditions) was more than three standard deviations from the mean. This
could indicate that those images simply were not good likenesses of the depicted celebrities or
that familiarity with the celebrities’ names was higher than with their faces.
Results and Discussion.
Data for this study (Figure 3.21, Top Graph; Table A.1) shows a
correlation trend between timings for abstractions and photographs. Three data pairs (2, 4 & 5) show only a very small difference between recognition times in both presentation conditions, but for all data pairs, the abstraction condition requires less time than the photographs.

Figure 3.21. Participant-data for Video Abstraction Experiments. Top Graph: Data for study 1 showing per-participant averages for all faces. Middle & Bottom Graphs: Data for study 2 showing timings and number of clicks for participants to complete two memory games, one with photographs and one with abstractions. Data pairs for both experiments are not intended to refer to the same participant and are sorted in ascending order of abstraction time.
Averaging over all participants shows that participants are faster at naming abstract images
(M = 1.32s) compared to photographs (M = 1.51s), thus rejecting the null hypothesis in favor
of H1 (p < 0.018). In other words, the likelihood of obtaining the results of our study by pure
chance is less than 1.8% and it is therefore more reasonable to assume that the results were
caused by a significant increase in recognizability of the abstracted images. The accuracies for recognizing abstract images and photographs are 97% and 99%, respectively, and there is no significant speed-for-accuracy trade-off. I can thus conclude that substituting abstract images
for fully detailed photographs reduces recognition latency by 12.6%.
Interestingly, this significant improvement was neither reported by Stevenage [149] nor by
Gooch et al. [62]. Since both of these authors only used black-and-white stimuli, I suspect
that the simplified color information in my abstraction framework contributes to the measured
improvement in recognition speed. This promises to be a worthwhile avenue for future research.
It is worth pointing out that the performance improvement measured in this study might
seem small in terms of percentage, but it represents an improvement in a task that humans are
already extremely proficient at. In fact, there exist brain structures dedicated to the recognition
of faces [188] and many people can recognize familiar faces from the image of a single eye or
the mouth alone. A similar remark can be made about the results of the next study, which are
even more marked.
3.4.2. Study 2: Memory Game
Hypothesis.
Study 2 tests the hypothesis (H2 ) that abstracted images are easier to memorize
(in a memory game) compared with photographs. By removing extraneous detail from source
images and highlighting perceptually important features, my framework emphasizes the essential information in these images. If done successfully, less information needs to be remembered
and prominent details are remembered more easily.
Procedure.
Study 2 assesses short term memory retention for abstract images versus
photographs with a memory game, consisting of a grid of 24 cards (12 pairs) that are randomly
distributed and placed face-down. The goal is to create a match by turning over two identical
cards. If a match is made, the matched cards are removed. Otherwise, both cards are returned
to their face-down position and another set of cards is turned over. The game ends when all
pairs are matched. The study uses a Java program of the card game in which a user turns over a
virtual card with a mouse click. The 12 images used in any given memory game are randomly
chosen from the pool of 32 images without replacement, and randomly arranged. The program
records the time it takes to complete a game and the number of cards turned over (clicks) before
all pairs are matched.
Study 2 consists of three phases: (1) a practice memory game with alphabet cards (no
images); (2) a memory game of photographs; and (3) a memory game of abstract images. All
participants first play a practice game with alphabet cards to learn the user-interface and to
develop a game strategy without being biased with any of the real experimental stimuli. No
data is recorded for the practice phase. For the remaining two phases, half the participants are
presented with photographs followed by abstracted images; and the other half is presented with
abstracted images followed by photographs.
Results and Discussion.
In the study, participants were significantly faster in completing a memory game using abstract images (Mtime = 60.0s) compared to photographs (Mtime = 76.1s), thus rejecting the null hypothesis in favor of H2 (ptime < 0.003). The fact that
the probability of obtaining the measured timings by pure chance is less than 0.3% indicates a
statistically highly significant result. Participants further needed to turn over far fewer cards in the game with abstract images (Mclicks = 49.2) compared to photographs (Mclicks = 62.4), with a type-I error likelihood of pclicks < 0.004, again highly significant. Presentation order (abstractions first or photographs first) did not have a significant effect. Despite the fact
that the measured reduction in time (21.3%) and the reduction in the number of cards turned
over (21.2%) were almost identical, the per-participant data in Figure 3.21 (Middle and Bottom
graphs) and Table A.1 does not indicate a strong correlation between timing results and clicks.
As in study 1, the results (both timing and clicks) for all participants were lower for abstractions
than for photographs (only minimally so for the timing data pairs 2 & 10). Since the number
of clicks corresponds to the number of matching errors made before completing the game19, the
lower number of clicks for the abstracted images indicates significantly fewer matching errors
compared to photographs and I conclude that my framework can simplify images in a way that
makes them easier to remember.
19. The minimum number of clicks is 24, one per card. This is unrealistic, however, as the probability for randomly picking a matching pair by turning two cards out of 24 is 1 : 23. By removing this pair, no additional knowledge of the game is discovered, so that even with perfect memory the probability for the next pair is 1 : 21, and so on.
Figure 3.22. Failure Case. A case where the contrast-based importance assumption fails. Left: The subject of this photograph has very low contrast compared with its background. Right: The cat’s low contrast fur is abstracted away,
while the detail in the structured carpet is further emphasized. Despite this rare
reversal of contrast assignment, the cat is still well represented.
3.5. Framework Results and Discussion
3.5.1. Performance
The framework was implemented and tested in both a GPU-based real-time version, using
OpenGL and fragment shader programs, and a CPU-based version. Both versions were tested
on an Athlon 64 3200+ with Windows XP and a GeForce GT 6800. Performance values depend
on graphics drivers, image size, and framework parameters. Typical values for a 640 × 480
video stream and the default parameters given in this text are 9 − 15 frames per second (FPS)
for the GPU version and 0.3 − 0.5 FPS for the CPU version.
3.5.2. Limitations
Contrast.
The framework depends on local contrast to estimate visual saliency. Images with
very low contrast do not carry much visual information to abstract (e.g. the fur in Figure 3.22).
Simply increasing contrast of the original image may reduce this problem, but also increases
noise. Figure 3.22 demonstrates a rare inversion of this general assumption, where the main
subject exhibits low contrast and is deemphasized, while the background exhibits high contrast
and is emphasized. Extracting semantic meaning about foreground versus background from
images automatically and reliably is a hard problem, which is why I use the contrast heuristic,
instead. Note that despite the contrast reversal the cat in the abstracted image in Figure 3.22 is
still clearly separated from the similarly colored background due to overall contrast polarization.
In practice, I have obtained good results for many indoor and outdoor scenes.
Scale-Space.
Human vision operates at a large range of spatial scales simultaneously. By
applying multiple iterations of a non-linear diffusion filter, the framework covers a small range
of spatial scales, but the range is not explicitly parameterized and not as extensive as that of real
human vision.
Global Integration.
Several features that may be emphasized by my framework are actu-
ally deemphasized in human vision, among these are specular highlights and repeated texture
(like the high-contrast carpet in Figure 3.22). Repeated texture can be considered a higher-order contrast problem: while the weaving of the carpet exhibits high-contrast locally, at a
global level the high-contrast texture itself is very regular and therefore exhibits low contrast in
terms of texture-variability. Dealing with these phenomena using existing techniques requires
global image processing, which is impractical in real-time on today’s GPUs, due to their limited
gather-operation capabilities20.
20. The framework deals partially with some types of repeated texture. See Section 3.5.7 (Indication) for details.
3.5.3. Compression
A thorough discussion of theoretical data compression and codecs exceeds the scope of this
dissertation because traditional compression schemes and error metrics are optimized for natural
images, not abstractions (Sections 2.3 and 2.4.1). To recall, many existing error metrics, even
perceptual ones, yield a high error value for the image pairs in Figures 3.19 and 3.20, although
I have shown in Section 3.4 that my abstractions are often better at representing image content
for visual communication purposes than photographs.
An interesting point of discussion in this respect is the error source. Several popular block-based encoding schemes (e.g. JPEG, MPEG-1, MPEG-2) exhibit blockiness artifacts at low bit-rates, while many frequency-based compression schemes produce ringing around sharp edges.
All of these artifacts are perceptually very noticeable. Artifacts in abstraction systems, like that
presented here, are of stylistic nature and people tend to be much more accepting of these [145]
because they do not expect a realistic result. Non-realistic image compression promises to be
an exciting new research direction.
In terms of the constituent filters in the framework, Pham and Vliet [126] have shown that
video compresses better using traditional coding methods when bilaterally filtered beforehand,
judged by RMS error and MPEG quality score. Collomosse et al. [26] list theoretical compression results for vectorized cartoon images. Possibly most applicable to my abstractions is work
by Elder [42], who describes a method for storing the color information of an image only in
high-contrast regions, achieving impressive compression results.
Without going into technical detail, it can be shown that the individual filter steps in the
framework simplify an image in the Shannon [146] sense and a suitable component compression
scheme should be able to capitalize on that. For example, the emphasis edges in Section 3.3.3
pose a problem for most popular compression schemes due to their large spectral range, yet the
edges before quantization are derived from a severely band-limited DoG filter of an image. In
general, an effective compression scheme would not attempt to compress the final images, but
rather the individual filter outputs before quantization. The final composition of the channels
would then be left to the decompressor. Another advantage of this approach, which promises
novel applications for streaming video, is that only selected channels may be distributed for
extreme low-bandwidth transmission (e.g. only the highlight edges) and that the stylistic options
represented by the quantization parameters can be chosen by a decompression client (viewer)
instead of hard-coded into the image-stream.
3.5.4. Feature Extension
I do not include an orientation dependent feature in the contrast feature space because of its
relatively high computational cost and because orientation is generally only necessary for high-level vision processes, like object recognition, whereas my work focuses on using low-level
human vision processes to improve visual communication. Should such a feature be required,
the combined response for Gabor filters at different angular orientations can be included in the
input feature space conversion step in Figure 3.4. This response would need to be scaled to a
comparable range as the other feature channels to retain perceptual uniformity. For implementation details of a separable, recursive Gabor filter, compatible with the framework, see Young
et al. [181].
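A sketch of such an orientation channel, assuming scikit-image's Gabor filter: responses at several orientations are combined and rescaled to a range comparable with the other feature channels (the [0, 100] target range, roughly that of the luminance channel, is an assumption).

```python
import numpy as np
from skimage.filters import gabor

def orientation_feature(lum, frequency=0.2, n_orientations=4):
    """Combined Gabor energy over several orientations, rescaled to [0, 100]."""
    responses = []
    for i in range(n_orientations):
        theta = i * np.pi / n_orientations
        real, imag = gabor(lum, frequency=frequency, theta=theta)
        responses.append(np.hypot(real, imag))     # Gabor energy at this orientation
    combined = np.max(responses, axis=0)           # strongest response over orientations
    # Rescale so the channel is commensurate with the other feature channels.
    return 100.0 * (combined - combined.min()) / (np.ptp(combined) + 1e-8)
```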
Figure 3.23. Benefits for Vectorization. Vectorizing abstracted images carries
several advantages. Edges: Extracting edges after ne smoothing passes removes
many noisy artifacts that would otherwise have to be vectorized. Here, the difference between two consecutive passes is shown. Color: Quantization results after
1 and 5 non-linear diffusion passes, respectively. The simplification achieved by
the abstraction is evident in the simplified quantization contours, which requires
fewer control-points for vector-encoding. Zoom in for full detail.
3.5.5. Video Segmentation
My stylization step in Section 3.3.5 is a relatively simple modification to an abstracted image.
Despite this, I have found that it yields surprisingly good results in terms of color flattening and
is much faster than the mean-shift procedures used in off-line cartoon stylization for video [26,
161]. Interestingly, several authors [4, 14] have shown that anisotropic diffusion filters are
closely related to the mean-shift algorithm [27]. It is thus conceivable that various graphics
applications that today rely on mean-shift could benefit from a much faster anisotropic diffusion
implementation, at least as a pre-process to speed up convergence.
3.5.6. Vectorization
An explicit image representation is an integral part of many existing stylization systems. Although I already discussed the trade-offs between these explicit representations and my image-based approach, I show here how my abstraction framework can be used as a pre-process to
derive an efficient explicit image representation.
Benefits.
Vectorization21 of images of natural scenes requires significant simplification
for most practical applications because generally neighboring pixels are of different colors so
that a true representation of an input image might require a single polygon for each pixel. This
simplification is essentially analogous to the abstraction qualities I have discussed so far: I want
to simplify (the contours and colors) of a vector representation of an image, while retaining
most of the perceptually important information. Consequently, it is not surprising that my abstraction framework can aid in this simplification process with its use of a non-linear diffusion
filter and pseudo color quantization. Figure 3.23 demonstrates the two key benefits: (1) noise
removal; and (2) contour simplification. Because the non-linear diffusion step of the framework
removes high-frequency noise, this information does not need to be encoded into a complex
vector representation. Similarly, the quantization contours of the abstracted images are progressively simplified in their shape, requiring fewer control points to encode into any standard
vector format. This approach of simplification followed by vectorization contrasts with the
traditional approach of vectorization followed by simplification [36, 161, 26]. The main advantage of the traditional approach is that vector representations at different spatial scales can be
treated independently, in the course of which some features may be removed completely as part
of the simplification. The advantages of the approach presented here are that, as above, many features do not need to be vectorized in the first place, that the simplification can happen much faster, and that temporal coherence of consecutive frames is improved. The reason for increased temporal coherence is rooted in the sparse parametric representation that vectorization affords. Given an efficient (low redundancy) vector representation (e.g. B-splines) of a shape, this shape can change considerably if one of its control-points is removed or altered excessively. Since simplification of vectors includes just these types of modifications, the traditional vectorization approach is prone to very unstable shape representations22. If vectorization is performed after simplification, then temporal coherence is mainly a function of the coherence quality of the vectorization input. Given the good coherence characteristics of my framework23, this leads to improved temporal coherence after vectorization.

21. Here, defined as the act of converting an image into bounded polygons of a single color.
Implementation.
The vectorization implementation I have chosen is based on simple iso-
contour extraction of the color information in the abstracted and hard-quantized images. I vectorize the edge and color information separately to keep the vectorized representation as simple
as possible. Individual polygons are expressed as polylines or Bézier curves, depending on the local curvature of the underlying contours, and written out as Postscript files. Vectorization of a single image takes on the order of 1–3 seconds, depending on the resolution of the input image
and the desired complexity of the vectorized output. This process is not optimized for efficiency.
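A sketch of the iso-contour extraction step using scikit-image's marching-squares routine is shown below; converting the traced contours into polylines or Bézier segments and writing Postscript output is omitted, and the choice of iso-values midway between adjacent quantization levels is an assumption.

```python
import numpy as np
from skimage import measure

def quantization_contours(quantized_lum):
    """Return (iso_value, contour) pairs; each contour is an (N, 2) array of points."""
    levels = np.unique(quantized_lum)
    contours = []
    for lo, hi in zip(levels[:-1], levels[1:]):
        mid = 0.5 * (lo + hi)  # iso-value between two adjacent quantization bins
        for c in measure.find_contours(quantized_lum, mid):
            contours.append((mid, c))
    return contours
```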
Limitation.
An advantage of temporal vectorization extensions, in addition to increased
temporal coherence, is the possibility of a more compact temporal vector representation. Instead
of encoding each frame independently one can specify an initial shape and then encode how the shape transforms for successive frames24. Unfortunately, this is a difficult problem, as it requires accurate knowledge of where a shape in one frame can be found in the next. This object-tracking problem (in this case, contour-tracking) is a major research effort in the computer vision community and not, as yet, robustly solved. For this reason, both Wang et al. [161] and Collomosse et al. [26] require user-interaction to correct for tracking mistakes, particularly in the presence of camera movement and occlusions. My vectorization approach faces the same challenges and limitations when moving from a single frame encoding to an inter-frame encoding scheme.

22. This has led to some computationally very expensive temporal vectorization extensions [161, 26].

23. This refers to coherence as a result of simplification and smoothing, not soft-quantization functions, as these functions cannot be used for vectorization (see Limitation, below).
Vectorization, as defined here, requires true discontinuous quantization boundaries (for both
edges and color information). As a result, my vectorized images lose those temporal coherence
advantages that stem from the soft-quantization functions of my framework.
3.5.7. Complementary Framework Effects
In addition to the design goals that I implemented within the abstraction framework, a handful
of stylistic effects presented themselves for free as a result of the framework’s various image
processing operations. Initially, this came as a surprise to me, considering that (1) I did not
intentionally program these effects; (2) most of the effects are traditionally considered artistic,
not perceptual or computational; and (3) most effects are considered challenging research objectives in their own right (see Indication, below). Upon reflection, though, these observations
strengthen my belief that there are many unknown connections between perception and art that
remain to be modeled and measured with the use of NPR. In this dissertation, I include the two
most prominent effects that have also been discussed in previous work.
24. Most video compression schemes make use of this inter-frame coherence by encoding just the information that changes between two frames in so-called delta-frames.
Figure 3.24. Automatic Indication. The inhomogeneous texture in these images causes spatially varying abstraction. As a result, fine detail subsists in some
regions, while being abstracted away in other regions. Note how the bricks in
the top and middle images are only represented intermittently with edges, yet the
observer perceives the entire wall as bricked. The few visible brick instances are
interpreted as indicating a brick wall and empty regions are visually interpolated.
The same applies to the blinds in the middle image and the shingles, the wheat
and the trees in the bottom image. These types of indication are commonly used
by artists, particularly the shadows indicated underneath the windowsill in the
top image and the fine branches in the bottom image, which are hinted at by
faint color, while only the main branches are drawn with edges.
Indication.
Indication is the process of representing a repeated texture with a small number of exemplary patches and relying on an observer to interpolate between patches. Winkenbach and Salesin [169] explain the associated challenges thus: “Indication is one of the most
notoriously difficult techniques for the pen-and-ink student to master. It requires putting just
enough detail in just the right places, and also fading the detail out into the unornamented
parts of the surface in a subtle and unobtrusive way. Clearly, a purely automated method for
artistically placing indication is a challenging research project.”
For structurally simple, slightly inhomogeneous textures with limited scale variation, like
the examples in Figure 3.24, my framework can perform simple automatic indication, including stroke texture25 (Figure 3.24: top-right image, shadows under window-sill). The framework achieves indication by extracting edges after a number of abstraction simplification steps.
Depending on the given image contrast and Equation 3.2, some parts of an image are simplified more, some less, in an approximation to the perceived difference in those image regions.
The emphasis DoG edges then highlight high contrast texture regions that remain prominent
throughout the simplification process. All other edges in textured regions are removed, leaving
the missing texture to be inferred by an observer.
As DeCarlo and Santella [36] noted, such simple indication does not deal well with complex
or foreshortened textures. The automatic indication in my framework is not as effective as
the user-drawn indications of Winkenbach and Salesin [169], but some user guidance can be
supplied via Equation 3.2, to provide vital semantic meaning.
25. Winkenbach and Salesin [169] refer to line markings that represent both texture and tone (brightness) as stroke texture.
Figure 3.25. Motion Blur Examples. Motion Lines: Cartoons often indicate
motion with motion lines. Motion Blur: This sequence shows a radial pattern of
rays at different orientations (angle) and of varying width (radius), which is convolved with a motion blur filter at different orientations. Note that lines parallel
to the direction of the motion blur are preserved, while lines perpendicular to the
motion blur are maximally blurred.
Figure 3.26. Motion Blur Result. Original: Images of a stationary car and a
moving motion-blurred car. DoG Filter: Corresponding images from my modified DoG filter. Note how many of the remaining horizontal lines resemble the
speed lines used by comic artists.— {Original image released under GNU Free Documentation
License.}
Motion Lines. Comic artists commonly indicate motion with motion lines parallel to the
suggested direction of movement (Figure 3.25, Motion Lines). Interestingly, Kim and Francis [89, 52] showed that these motion lines are not purely artistic and actually have perceptual
foundations, which is likely the reason why artists have adopted them in the first place and why
they are so easily understood. The DoG edges in my framework automatically create streaks
resembling motion lines as shown in Figure 3.26. Although I did not explicitly program this
behavior (as in Collomosse et al. [26]), it can be easily explained.
Motion blur is a temporal accumulation effect that occurs when a camera moves relative to
a photographed scene. This relative movement can be any affine transformation like translation,
rotation, and scaling but I focus this discussion on translational movements only. A motion blur,
or oriented blur, O(·), can be formulated using a modification of the familiar Gaussian kernel:
(3.15)  $O(\hat{x}, \sigma_o, \theta) = \frac{\int f(x)\, e^{-\frac{1}{2}\left(\frac{\|\Theta(\hat{x}-x,\,\theta)\|}{\sigma_o}\right)^2} dx}{\int e^{-\frac{1}{2}\left(\frac{\|\Theta(\hat{x}-x,\,\theta)\|}{\sigma_o}\right)^2} dx}$

(3.16)  $\Theta(x, \theta) = \begin{pmatrix} \cos(\theta) & \sin(\theta) \\ 0 & 0 \end{pmatrix} \cdot x$
Here, parameter σo determines how much the image is blurred, i.e. the duration of the exposure in relation to the speed of the scene relative to the camera. Parameter θ indicates the blur
direction in the image plane. Equation 3.15 is a very simple but sufficient model for this discussion that does not take into account depth (image elements moving at different speeds) and
assumes that only the camera moves with respect to the scene. Figure 3.25 (Motion Blur) shows
the result of this filter on a pattern of lines of varying widths and different orientations. Lines
parallel to the blur direction are blended only with themselves and appear unaffected, while
lines perpendicular to the blur direction are blended with neighboring lines and lose sharpness. Intermediate angles vary with the sine of the angle. In the car example in Figure 3.26, the
vertical line of the door is blurred away, while the door’s horizontal line (parallel to the motion)
is preserved.
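The following sketch implements such a translational motion blur as a Gaussian-weighted average sampled along the blur direction; the finite sampling window, nearest-pixel lookup, and border clamping are implementation assumptions rather than a literal transcription of Equation 3.15.

```python
import numpy as np

def motion_blur(img, sigma_o=8.0, theta=0.0):
    """Gaussian-weighted averaging along direction theta (grayscale or color image)."""
    h, w = img.shape[:2]
    r = int(np.ceil(3 * sigma_o))
    t = np.arange(-r, r + 1)
    weights = np.exp(-0.5 * (t / sigma_o) ** 2)
    weights /= weights.sum()                      # normalization (denominator of Eq. 3.15)
    out = np.zeros_like(img, dtype=float)
    ys, xs = np.mgrid[0:h, 0:w]
    for ti, wi in zip(t, weights):
        # Offset every pixel by ti steps along the blur direction, clamped at the borders.
        sy = np.clip(np.round(ys + ti * np.sin(theta)).astype(int), 0, h - 1)
        sx = np.clip(np.round(xs + ti * np.cos(theta)).astype(int), 0, w - 1)
        out += wi * img[sy, sx]
    return out
```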
The DoG filter therefore mainly detects edges in the direction of motion, because other
edges are largely blurred away. As a consequence, the resulting image looks like it has motion
lines added.
3.5.8. Comparison to Previous Systems
I have pointed out throughout this Chapter how my framework differs from previous systems
in terms of the design goals I have chosen, and I have demonstrated performance increases
in perceptual tasks not evident in previous work [149, 62]. However, these comparisons are
still not as detailed as they should be, mainly due to the fact that the NPR community is lacking comparison criteria above the level of simple frame-rate counts. There are other issues
that compound this problem. Most stylization systems are not based on perceptual principles
and therefore not psychophysically validated. Performing such comparative analyses oneself
is complicated by the fact that previous stylization systems are rarely freely available, difficult
and time-consuming to implement, and that they generally have a limited amount of results
openly available. There simply is no standard repository of imagery for NPR applications (like
the Stanford bunny for meshes, or the Lena image for image processing). I hope that my work
can contribute to the solution to these problems by making available a large number of input
and result images and videos, and more importantly, by validating my own framework with
psychophysical experiments that can be used in direct comparison with future NPR systems.
3.5.9. Future Work
Despite the numerous processing steps that comprise my video abstraction framework, it is
simple to implement and shows great potential in terms of computational and perceptual efficiency. I therefore hope that the framework will be adopted for a number of interesting research
directions.
NPR Compression.
As noted in Section 3.5.3, I believe that abstractions generated by
my framework lend themselves to good compression ratios, yet most current compression schemes
are likely to perform sub-optimally. Non-photorealistic compression is basically unheard of,
partly because the compression community has very well-defined and rigid ideas about realism
and desirable image fidelity. I believe NPR compression to be promising future research mainly
because of the significant removal of information in abstractions and because of the ability to
alter the reconstruction parameters on the decompression side for stylistic effect and perceptual
efficiency.
Minimal Graphics.
In their paper called Minimal Graphics, Herman and Duke [69]
state that “[the] main question which still remains is how to automatically extract the minimal
amount of information necessary for a particular task?”. I have shown that two specific tasks
can be performed better given my abstractions, but I did not show (nor do I believe) that this
performance increase is maximal. As Section 3.4 demonstrated, removal of information can
actually lead to better efficiency for specific perceptual tasks, but there is a point at which additional removal of information will bring about a decline in efficiency26. It would be interesting
and valuable to use a framework like the one presented here to graph a chart of image information versus task efficiency and to map these findings to framework parameters. Such perceptual
26. This can be proven by considering the extreme case of removing all information.
research using an NPR framework would be another example of how to close the loop of mutual
beneficence that this dissertation is intended to demonstrate.
3.6. Summary
In this chapter, I presented a video and image abstraction framework (Figure 3.4) that works
in real-time, is temporally coherent, and can increase perceptual performance for two recognition and memory tasks (Section 3.4).
Framework.
Unlike previous systems, my framework is purely image-based and demon-
strates that meaningful abstraction is possible without requiring a computationally expensive
explicit image representation. To the best of my knowledge, my framework is one of only three
automatic abstraction systems that prove effectiveness for visual communication tasks with
user studies. Of these, my studies are the most comprehensive (two tasks with colored stimuli compared to one study with colored stimuli and two studies with black-and-white stimuli,
respectively).
By basing the framework design on perceptual principles, I obtain at least two visual effects
(Section 3.5.7) in my output images for free27, which previous systems implemented explicitly
and with computational overhead. These effects are indication, the suggestion of extensive
image texture by sparse texture elements, and motion lines, an artistic technique to illustrate
motion in static images.
Customizable Non-linear Diffusion.
I developed an extension (Equation 3.2) to the
bilateral filter (Equation 3.1) as an approximation to non-linear diffusion that allows for external
control via user-data in various forms (painted, data-driven, or computed).
27. Without explicit computation devoted to these effects.
Temporally Coherent Quantization.
I constructed two smooth quantization functions,
both of which (1) increase temporal coherence for animation; and (2) offer stylization options
not available to previous systems using discontinuous quantization functions. The first quantization function (Equation 3.7) operates on the well-known DoG function to extract edges from
an image. The second quantization function (Equation 3.14) flattens colors in an image for
data reduction and artistic purposes. Another contribution of this second function is its spatially
adaptive behavior, which achieves a good trade-off between the desired level of quantization
and temporal coherence by adapting to local image gradients.
Additional Materials.
More information on this project, including a conference paper [171], GPU code, and an explanatory video, can be found on the Siggraph 2006 conference DVD. The same materials and additional images are also available online at http://videoabstraction.net.
CHAPTER 4
An Experiment to Study Shape-from-X of Moving Objects
Figure 4.1. Shape-from-X Cues. The human visual system derives shape information from a number of distinct visual cues which can be targeted and tested
using non-photorealistic imagery. Shading: Lambertian shading varies with the
cosine of the angle between light direction and surface normal. Flat surfaces exhibit constant color, while curved surfaces show color gradients. Texture: Texture
elements, or texels, accommodate their form to align with the underlying surface,
causing texture compression. Contours: Discontinuities of various surface properties are shown in different colors (red: silhouette, black: outline, green: ridges,
blue: valleys). Note that the objects' shadows are synonymous with silhouette contours as seen from the casting light's point-of-view. Motion: Under rigid
body motion, points on the surface move at different speeds and in different
directions.
In this Chapter, I present a psychophysical experiment that uses non-photorealistic imagery
to study the perception of several shape cues for rigidly moving objects in an interactive task.
Traditionally, most shape perception studies only display a small number of static objects (generally one; two for some comparison experiments, see Section 4.3.4). Yet, most interactive graphical environments, such as medical visualization, architectural visualization, virtual
reality, physical simulations, and games, contain a large number of concurrent dynamic shapes
and objects that move independently or relative to an observer. Because shape perception is
vital to many recognition and interaction tasks, it is of great interest to study shape perception
for multiple shapes in dynamic environments, in order to develop effective display algorithms.
The experiment I propose in this chapter benefits greatly from carefully designed non-photorealistic imagery to separate and individually study shape cues that find common usage in
many computer graphics applications.
4.1. Introduction
The art and science of photorealism in computer graphics, as exemplified in Figure 1.1,
has shown impressive improvements over the last decades, but the associated computational
demands have put this level of realism out of reach of most real-time applications. As a result,
real-time 3-D graphics commonly only offer best-effort approximations in terms of realistic
lighting, shading, and material effects. These limitations raise several important questions. What
effects do the approximations have on applications depending on shape perception? If we want
to prioritize computational resources for the most effective shape cues for a given set of shapes
or a given application, how do we determine this effectiveness?
Another set of questions concerns the necessity for realism. I have already mentioned in
Chapter 1 and demonstrated in Chapter 3 that sometimes less is more when it comes to visual
stimuli for humans1. Realistic images can, at times, be overbearing or conflicting in terms of
the information that is presented to a viewer, and it may be more effective for a given task to
display less information that is emphasized appropriately [150]. Being freed from the restrictions that reality (even an approximate one) imposes, how can we emphasize the shape of an object effectively using stylistic (non-realistic) elements? Similarly, how can we compare the effects of various known stylization techniques for conveying shape?

1. Incidentally, results in this Chapter reiterate this concept.
I believe the answers to these questions to be important, not only because they will advance
the state-of-the-art in realistic and non-realistic graphics, but because they may provide insights
into the development of art, and our perception of art. Of course, this chapter provides only
very few actual answers to these questions. What it provides instead is a simple and flexible
experiment that enables research into these questions.
4.1.1. Experimental Design Goals
The set of shape cues I investigate (shading, contours, and textures) is not meant to be exhaustive, but rather demonstrative. The number of additional existing shape cues, their possible
parameterizations, and the permutations of combined effects are probably too vast to explore
in a single lifetime. As such, the main purpose of this chapter is to demonstrate an example
of the types of studies that my experiment supports and to offer my methodology up for other
researchers in computer graphics and perception to perform their own investigations.
In designing the experiment, I take special care to address the following key issues:
(1) A number of different shape cues can be studied in isolation and in combination —
This is important to support a broad range of studies.
(2) The difficulty of the experimental task can be easily adjusted — If the task is too easy
or too difficult no meaningful data can be gathered. The task should be designed so that
participants at different performance levels can provide meaningful statistical data.
(3) The interaction itself is simple — It is important to separate the task from the interaction necessary to perform the task. While the task should be as difficult as possible
(without being impossible), the interaction should be very simple to ensure that the
performance of the task is measured and not that of the interaction.
(4) The performance of participants can be tested under time-constrained conditions —
Most traditional shape experiments have no time limit for their trials. Because humans
can only attend to very few stimuli simultaneously [39, 160], the results under time-pressure might very well be different from these static experiments and offer important
guidance for the design of real-time applications.
(5) The experimental shapes are general, relevant, and parameterizable — This is important so that valid and meaningful statements can be made about the shapes that are
tested and the results that apply to them. It also facilitates replication and verification
of experimental results by third parties.
(6) Learning effects and other biases for the task are minimal — After an initial period of
getting acquainted with the interaction and developing a strategy, learning and memory
of the experimental procedure should not impact the performance of the interactive
task2, so that performance differences are due to varied experimental conditions and
not increasing experience. For the same reason, the experimental conditions should not
be biased or otherwise predictable to ensure the experimental data reflects perceptual
performance instead of system biases or deductive reasoning abilities.
2. Note that this is different from studying memory performance, as in Section 3.4.2. Even there, the position of cards between trials was randomized, so that participants could not remember the correct position from the previous trial. Instead, participants had to remember the positions anew for each trial, thus making each trial independent.
It is my hope that these design goals are specific enough to provide meaningful results, yet
general enough to allow other researchers to (1) adopt my experimental framework to study
other types of shape cues, and to (2) evaluate the effectiveness of interactive non-photorealistic
rendering systems to convey shape information. Section 4.4 explains how I implement the
above goals in my own experiment and Section 4.7 demonstrates via data analysis that these
goals were attained.
4.1.2. Overview
In the experiment I present here, participants are shown 16 moving objects, 4 of which are designated targets, rendered in different shape-from-X styles. Participants select these targets by
simply touching a touch-sensitive table onto which the objects are projected. The experimental data shows that simple Lambertian shading offers the best shape cue, followed by outline
contours and, lastly, texturing. The data also indicates that multiple shape cues should be used
with care, as these may not behave additively in a highly dynamic environment. This result is
in contrast to previous additive shape cue studies for static environments and reflects the importance of investigating shape perception in the presence of motion. To the best of my knowledge,
my experiment is unique in its capacity to compare the effectiveness of multiple shape cues in
dynamic environments and it represents a step away from traditional, impoverished (reductionist) test conditions, which may not translate well to real-time, interactive applications. Other
advantages of the experiment are that it is simple to implement, engaging and intuitive for participants, and sensitive enough to detect significant performance differences between all single
shape cues.
4.1.3. Note on Chapter structure
Although this chapter follows largely the same structure as Chapter 3, it does so in a slightly different order. This is due to the fact that Chapter 3 presents an automatic abstraction system based
on perception and verified by two experiments; whereas this chapter presents a psychophysical
experiment to study perception, based on non-photorealistic imagery. In this chapter, I therefore
introduce important aspects of the human visual system (Section 4.2) before discussing related
work (Section 4.3).
4.2. Human Visual System
Most interaction with our visual world requires some shape identification or categorization.
The shape of the visible portion of an object can be correctly interpreted if the distance between
each point on the surface of the object, PO , and its projection onto the eye’s retina, PE , is
known (Figure 4.2). I will refer to this distance as the depth at PE . Calculating the depth from
the light-signal at the retina is an ill-constrained problem, because the light reaching PE could
have emanated from any point along the view-ray cast through PO . To address this problem,
the human visual system is equipped with a number of mechanisms to infer depth information
from an image. The convergence of depth interpretations from different mechanisms leads to
a stable perception of shape3. The different depth interpretation mechanisms that allow shape
perception are collectively referred to as Shape-from-X. The most important shape cues for
computer graphics applications are shading, texture, contours, and motion. Other important shape cues exist, like binocular stereopsis and ocular accommodation and vergence, but are less commonly applied in a computer graphics context.

3. Sometimes convergence does not occur, leading to multiple possible shape interpretations, as in the famous Necker cube illusion. It is interesting that in such cases only one interpretation can be perceived at a time and that the different perceived interpretations alternate perpetually [75].

Figure 4.2. Left: Depth Ambiguity. Light reflects off a surface point PO, reaching the retina at point PE. The length of the vector |v⃗|, v⃗ = PO − PE, is the distance between the surface point and the viewer. This situation is ambiguous for the viewer, because the light could have emanated anywhere along the ray, PE + α · v⃗, α ∈ ℝ⁺.

Figure 4.3. Right: Tilt & Slant. The orientation of a surface at a point can be described by the tilt and slant of a thumbtack gimbal placed at that point and aligned with the surface normal. Both the length of the gimbal's rod and the elongation of the attached disk in the image plane indicate the local surface orientation.
4.2.1. Shading
The shading of an object is a complex function of the object’s properties, such as shape and
material, as well as those of its environment, including lights, other objects and the direction
from which it is viewed. For simple illumination conditions (see below) a change in surface
orientation can be inferred from a change in surface shading [95, 96]. Real-time computer
graphics commonly approximate realistic shading with the Phong reflection model [127], a
local illumination model that considers ambient light, diffuse reflection, and specular reflection.
To reduce the number of free variables in my experiment, I set the ambient contribution to zero
and only model diffuse reflection (Lambertian shading) as
Ir = kd · I0 · dot(n⃗, ℓ⃗),
where I0 and Ir are the incoming and reflected light intensities, respectively, n⃗ is the surface normal at PO, ℓ⃗ is the incoming light direction (as in Figure 4.2), dot denotes the vector dot-product, and kd ∈ [0 . . . 1] indicates the diffuse reflectance properties of the object. I use a single
point-light-source at infinity. Since the dot-product of two vectors changes smoothly according
to the cosine of the angle between the vectors, the change in light-intensity on a Lambertian
surface is a good indicator of change in surface orientation (Figure 4.1).
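A minimal sketch of this shading model follows; the clamp to zero for surfaces facing away from the light is standard practice and is left implicit in the equation above.

```python
import numpy as np

def lambertian(normals, light_dir, k_d=1.0, i_0=1.0):
    """Diffuse (Lambertian) shading, I_r = k_d * I_0 * dot(n, l).

    `normals` is an (..., 3) array of unit surface normals, `light_dir` a vector
    toward a single point light at infinity.
    """
    l = np.asarray(light_dir, dtype=float)
    l = l / np.linalg.norm(l)
    # Clamp at zero so surfaces facing away from the light receive no light.
    return k_d * i_0 * np.clip(normals @ l, 0.0, None)
```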
4.2.2. Contours
Smooth changes in depth often indicate smooth changes in the shape of a surface. Depth discontinuities, on the other hand, are a likely sign of figure/ground separation (where figure is
an object and ground is everything else), changes in local topology, or abutting but distinct
surfaces. Such discontinuities are therefore important visual markers for the distinction of
figure from ground, object components, and surfaces. Changes in surface normals and other
differential geometry measures, such as principal curvatures, can also be used to mark shape
discontinuities or extrema in images. Together, these define the set of contours of an object,
some of which are shown in Figure 4.1, Contours. Note that while some contour-types depend
only on object shape, others also depend on the observer's point-of-view [97]. Several non-photorealistic rendering algorithms rely on contours to convey essential, but much condensed
shape information [70, 35].
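As a small illustration of the view-dependence of some contour types, the sketch below extracts silhouette edges of a triangle mesh as edges shared by one front-facing and one back-facing triangle; an orthographic view with the view direction pointing from the viewer into the scene is assumed, and ridge/valley detection from surface curvature is omitted.

```python
import numpy as np

def silhouette_edges(vertices, faces, view_dir):
    """Return mesh edges (vertex-index pairs) lying on the silhouette for a view direction."""
    v = np.asarray(vertices, dtype=float)
    f = np.asarray(faces, dtype=int)
    # Face normals from the cross product of two triangle edges.
    n = np.cross(v[f[:, 1]] - v[f[:, 0]], v[f[:, 2]] - v[f[:, 0]])
    front = (n @ np.asarray(view_dir, dtype=float)) < 0.0  # faces oriented toward the viewer
    # Collect, for every edge, the indices of the faces that share it.
    edge_faces = {}
    for fi, (a, b, c) in enumerate(f):
        for e in ((a, b), (b, c), (c, a)):
            edge_faces.setdefault(tuple(sorted((int(e[0]), int(e[1])))), []).append(fi)
    # A silhouette edge separates a front-facing face from a back-facing one.
    return [e for e, fs in edge_faces.items()
            if len(fs) == 2 and front[fs[0]] != front[fs[1]]]
```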
4.2.3. Texture
Texture is most often described in terms of elemental texture units, called texels, and their distribution on a surface. Figure 4.1, Texture, illustrates the use of a random cell texture to indicate
shape. Many natural scenes and materials contain textures, such as fields of grass or flowers,
heads in a crowd, woven fabric, etc. While Gibson [56] was the first to identify and investigate the importance of texture as a depth-cue, several works have since extended his research.
Cumming et al. [31] defined three distinct parameters along which texture covaries with depth:
compression4, density, and perspective. They found that compression accounts for the majority of texture variation in shape, so I focus on this cue. Compression refers to the change in
shape of a texel when mapped onto a surface non-orthogonal to the viewer. Another important factor in texturing is the distribution of texels on a surface, which is generally achieved
through a parametrization function of the surface. This function provides a mapping relating
texel distribution and orientation to surface shape.
4.2.4. Motion
If an object moves relative to an observer (via translation, rotation, or a combination of the
two), then points on its surface that are at different depths move at different relative speeds.
Therefore, the relative movements of these points convey the underlying depths for rigid objects
4. Not to be confused with the term compression used in information theory.
(Figure 4.1, Motion). The rigidity constraint is necessary because plastic deformations can also
lead to relative movement of surface points, and the two types of motion, rigid and plastic,
cannot be distinguished visually. Such a constraint is also employed by human perception in
the form of a bias towards recognizing motion as rigid, if such an interpretation is consistent
with the visual stimulus [121].
4.2.5. Limitations
None of the shape cues above is, by itself, sufficient for reliable shape perception. The shading of
an object may be indistinguishable from the color variation of its material. The efficiency of
shape-from-texture depends largely on the texture used, its homogeneity and its parametrization, all of which are arbitrary. Contours are highly localized visual markers, requiring visual
interpolation, and are commonly under-constrained. Lastly, shape-from-motion depends on the
reliable tracking of surface points, as well as robust distinction of rigid and plastic motion,
neither of which can be guaranteed. This insufficiency of any single shape cue to provide robust depth-information explains the redundant shape-detection mechanisms of the human visual
system.
4.3. Related Work
Compared to Chapter 3, the rendering techniques used in my experiment are too basic to
warrant a comparison to previous work. Instead, I focus here on the related experiments that
researchers have undertaken to study shape. The following discussion is structured according
to the shape-from-X cues of Section 4.2.
4.3.1. Shape from Shading
In early pioneering work on non-realistic shape perception, Ryan and Schwartz [141] presented
participants with photographs and shaded images of objects in different configurations and measured the time it took participants to correctly identify the depicted configuration. Due to the
preliminary nature of their study, they used only 3 arbitrary real objects. The configurations
of the objects depended on their functionality, which may not have been known to participants.
More importantly, the authors lacked computer graphics capabilities and commissioned an artist
to produce the shaded images. Their experiment therefore largely measured the artist’s craftsmanship at conveying the different object configurations.
Koenderink et al. [95, 96] invented a thumbtack-shaped widget, as in Figure 4.3, for participants to indicate the perceived shape on a Lambertian surface. Sweet and Ware [153] investigated the interaction between shading and texture and also included specular reflections in one of
their experiments. Again, participants used Koenderink’s thumbtack widget for feedback. Johnston and Passmore [83] mapped Phong-shaded spheres with band-limited random-dot textures.
Instead of using the thumbtack widget they asked subjects forced-choice questions about the
spheres and a paired surface patch, which was oriented in a different direction or had a different curvature from the spheres. As explained in Section 4.3.4, none of the presented evaluation techniques lends itself to experimentation with moving objects, and their results may therefore
not apply to many highly dynamic computer graphics applications.
Barfield et al. [5] investigated the effect of simple computer shading techniques (wireframe,
flat shading with one or two light sources, and smooth shading) on the mental rotation performance (see Section 4.3.4, Mental Rotation) of participants. The mental rotation task is similar
to the task in my experiment, but several differences exist to enable real-time dynamic interaction: my experiment uses multiple concurrent shapes, the shapes all move, and the shapes differ
in their constituent parts, not in their arrangement.
Rademacher et al. [132] compared photographs to shaded computer graphics images of simple geometric shapes and asked participants whether they were seeing a photograph or synthetic
image. In a similar experiment, Ferwerda et al. [47] compared the perceived realism of photographs of automobile designs with versions rendered in OpenGL and rendered with a global
illumination model. As such, neither experiment directly measured the perception of shape, but rather
the contribution of soft shading, number of lights, and surface properties to the perception of
subjective realism.
Kayert et al. [88] probed neural activity of Macaque monkeys using invasive surgical probes
to study the modulation of inferior temporal cells to nonaccidental shape properties (NAP)
versus metric shape properties (MP) of shaded objects. Such experiments can obviously not be
performed on human subjects.
Biederman and Bar [8] used nonsensical objects with diffuse and specular shading to compare shape perception theories based on NAP against theories based on MP. The effects of
shading itself were not measured.
4.3.2. Shape from Contours
In their study described above, Ryan and Schwartz [141] also presented participants with line
drawings and cartoons of different object configurations. Because an artist generated their images, it is likely that considerable perceptual and cognitive effort went into creating effective
images5. At the same time, the method for creating the images cannot be described quantitatively. The results of their experiment are thus largely biased by the artist and difficult to
replicate.
Shepard and Metzler [147] evaluated mental rotation performance of three-dimensional
chained cubes presented in a line-drawing style. Yuille and Steiger [184] performed follow-up work based on the same experiment.
In his recognition-by-components (RBC) paper, Biederman [10] used line drawings to illustrate his theory and perform various experiments to determine the effects of reducing the
number of components used to represent an object, and of deleting parts of the lines used to
represent each component. The experiments were designed to support the RBC theory and not
to test the effectiveness of line drawings at conveying shape.
In contrast to the large corpus of research in rendering techniques for contours on polygonal
meshes [35, 70], implicit surfaces [16, 128], and even volumetric data [20, 41], the literature
on perceptual evaluation of contour rendering from 3-D models is relatively sparse. Gooch and
Willemsen [60] performed a blind walking task in an immersive virtual environment, rendered
with contours, to determine perceived distances as compared to distances estimated in the real
world. They did not evaluate how contours compared to other shape cues. Another difference from my work is that Gooch and Willemsen probed estimation of quantitative distances, a
task that humans find notoriously difficult, whereas my experimental design is geared towards
shape estimation and categorization, for which only relative or qualitative depth-information is
required.
5. In fact, the different types of representations varied not only in their use of shading or lines, but also in the amount of detail that was depicted. In particular, the cartoon representations were more symbolic than literal copies of the original scene.
4.3.3. Shape from Texture
Various authors have shown that texture elements, aligned with the first and second principal curvature directions of a surface, are good candidates for indicating local surface shape and curvature [77, 58, 90, 153]. The specific experiments of these authors do not translate well to dynamic scenes, but verifying their results in dynamic environments using my experiment remains interesting future work.
4.3.4. Measurements
Most work on shape perception uses one of the following established methods to measure perceived surface shape:
(1) Thumbtack gimbal. Participants place a (virtual) gimbal widget akin to a thumbtack at a particular orientation on the object's surface so that the pin's direction is aligned with the estimated surface normal. Both the direction of the pin and the eccentricity of the attached disk are used to indicate and measure estimated tilt and slant, as in Figure 4.3 [95, 98, 118, 96, 58, 90, 153].
(2) Mental Rotation. Participants are shown a pair of images in one of two configurations: (a) depicting the same object but from different viewpoints; and (b) depicting different objects. The experimenter measures the time a participant takes to decide on a given configuration. This task requires participants to mentally rotate one of the shapes to match the other and can employ 2-D shapes and rotations (in-plane) [30, 51] or 3-D shapes and rotations [147, 184, 5, 11].
(3) Exemplars/Comparisons. Several physical objects with a variety of surface shapes are kept at hand. These are used as exemplars, so that a study participant can indicate the position on a surface with properties similar to those of the object being studied (e.g. [151]). When measuring perceived distances in virtual environments, some experiments required users to walk the estimated distance in the real world as an analogous measure [60, 111], while others asked participants to estimate the time it would take them to walk the perceived distance at a constant pace in the real world [129].
(4) Naming of objects. Subjects are shown depictions of real-world objects and are asked to name the depicted object as quickly as they can [10].
Discussion. While the gimbal and exemplar methods are capable of yielding highly sensitive quantitative data, they do not transfer to a fully dynamic context like the one I am investigating. Moving objects simply do not hold still long enough to perform these types of
measurements. The same restriction applies to mental rotation, because at least one of the two
shapes to be compared has to be stationary. Naming of objects requires participants to be familiar with the object they are presented with. Even if they know the name, participants might still
suffer from the tip-of-the-tongue effect, where a known word is not readily verbalized [149].
My solution is to opt for the more qualitative shape perception task detailed in Section 4.4.
Technically, this task is closer to shape categorization than exact shape quantification, but then,
so are most everyday shape-dependent tasks. To distinguish a plate from a cup humans need to
make qualitative judgements about the objects’ shapes rather than compare their exact dimensions or other geometric measures.
Direct vs. Indirect Measurements. Another way to classify measurements is in terms
of direct versus indirect measurements. Placing a widget on a surface position yields direct
numerical values for the estimated surface normal. As discussed, this is not practical for moving
objects. Consequently, much of the work on distance perception in Virtual Environments (VEs)
uses indirect measurements. Plumert et al. [129], Gooch and Willemsen [60], and Messing and
Durgin [111], have all used walking-related tasks to indirectly estimate perceived distances in
VEs. The task was either to guess the time it would take to walk from the current position
to another position inside the VE, or to actually walk the estimated distance in the real world
without visual feedback from the VE.
Indirect measurements have the disadvantage that they generally include larger individual
variations and have to be related to the measure of interest via some mapping, which may introduce additional error. The advantage is that they allow a dynamic and often more natural and
intuitive experimental scenario, e.g. walking versus orienting a widget with a mouse. In my
experiment, I use several indirect measures of performance, allowing me to present participants
with an intuitive task and interaction paradigm that evokes competitive performance levels. I
address individual variations and cumulative errors in the statistical analysis of the experimental
data (Section 4.6.2 and Section 4.6.3).
4.3.5. Test Shapes
The test objects in previous studies can be broken down into two broad categories: (1) representational, i.e. representing real-world objects (e.g. flashlight, banana, table, chair) [141, 10,
11, 60, 47, 129]; and (2) non-representational (or nonsensical) objects.
Figure 4.4. Real-time Models. The complexity of models used in real-time and
interactive applications, like games and 3-D visualizations, is often kept fairly
low to minimize the computational demand of the rendering process. Many of
these simple models can thus be well described in terms of generalized cylinders [12, 67].
Because the perception of representational objects is affected by a person’s familiarity with
the object (e.g. a telephone versus a specialized technical instrument), most related work employed non-representational shapes, instead. Some authors have used wire-like [140, 18] or
tube-shaped [5] objects that resemble bent paper-clips and pipes, while others used similar
stimuli comprised of chained cubes instead of wires [147, 184, 154]. Several authors have
used soft, organic, blob-shaped or amoebic objects [18, 118]. Some experiments were based
on generalized cylinders [8, 11, 88]. Finally, a number of experiments, mostly those involving shape-from-texture studies, used shapes resembling undulating hills and valleys or folded
cloth [151, 153, 90, 3].
My own experiment uses a form of generalized cylinders, called geons [10] (Section 4.4.6)
to avoid the familiarity problem associated with real-world objects. Additionally, generalized
cylinders are easily parameterized, and, unlike wires, cubes, blobs, or cloth, are flexible enough
to describe many basic shapes, and can be combined to form a large number of real-world
objects, particularly low-resolution objects commonly found in real-time graphics (Figure 4.4).
4.4. Implementation
In this Section, I use the concepts and terminology introduced in the related work (Section 4.3) to explain how my experimental design implements the goals put forth in the introduction (Section 4.1). I start off with a brief overview of the experiment to define the given task
and interaction, and then list the details of implementing each of the goals.
4.4.1. Overview
Participants in the experiment have to distinguish particular shapes from a set of moving objects
under different display conditions. Figure 4.5 shows the experimental setup. Participants sit in
front of a touch-sensitive board onto which I project moving test shapes with an overhead data projector. Participants are asked to select objects that share certain shape characteristics. An object is selected by simply touching the board with a finger where the object is displayed. Each
experimental trial consists of different phases during which objects are displayed in different
shape-from-X modes (Figure 4.6). Although I am mainly interested in shape from shading,
contours, and texture, I test two additional display modes, one combining shading and contours
and one using an alternative color texture (for details, see Section 4.4.2). The system records
all user events, as well as several system events (Section 4.6).
Figure 4.5. Experimental Setup. Left: Schematic diagram of setup. A data projector projects imagery onto a conductivity-based touch-sensitive surface. The
user, grounded through the seat and a foot mat, simply taps on virtual objects
displayed on the surface. A single computer synthesizes imagery and gathers
data. Right: Photograph of actual setup.
4.4.2. Shape Cue Variety (1)
In the real world (or in photorealistic rendering) all shape cues are simultaneously present in a
scene to various degrees. The fact that the human visual system derives shape information from
numerous sources does not mean, however, that all of these sources are equally valuable or can
be leveraged equally. To test the individual contributions of shape cues to shape perception, the
cues have to be separated into orthogonal (mutually independent) stimuli. The following list
Figure 4.6. Display Modes. Left column: Screen-shots from the experiment for
each of the display modes. Test objects are rendered onto a static background
with the same visual characteristics as the foreground objects to prevent outlines
from depth-discontinuities (as in the right column). Static objects are extremely
difficult to identify. Once they are moving, they pop out from the background
immediately. Right column: Objects highlighted visually for comparison.
describes the rendering techniques used to produce these non-photorealistic stimuli, which are
depicted in Figure 4.6.
(1) Outline. Of the many possible contours I can render (silhouette, outline, creases, ridges, valleys, etc. [70, 137]), I choose outlines, because they are the basis of most NPR line-style algorithms. Outlines are those edges, e, for which
(4.1) dot(v, t_1) · dot(v, t_2) < 0, with e = {t_1, t_2},
where v is the view vector, and e is the edge shared by the triangles6 with normals t_1 and t_2. There exist many efficient methods to implement Equation 4.1 [70], but one of the fastest is a two-pass rendering approach in which first only back-facing (dot(v, t_b) > 0) triangles are rendered into the graphics card's depth buffer, followed by front-facing (dot(v, t_f) < 0) triangles that are rendered into the color buffer with OpenGL's line drawing mode. (A code sketch of the per-edge test in Equation 4.1 is given after this list.)
Since colored, shaded, and textured objects create a natural silhouette (a type of
contour) against a differently colored, shaded, or textured background (right column
in Figure 4.6), I have to ensure that the other display modes do not inadvertently create
a contour cue. To ensure this, I fill the display background with static random elements
as in Figure 4.6, left column. These backgrounds are designed to resemble the current
display mode without containing any complete instance of the 16 test objects (this
ensures that participants are not exposed to targets in the background with which they
might try to interact) and thereby create a homogeneous display. While this makes
6. I use triangles for the outline definition because triangles form a generic geometric primitive supported by most rendering systems.
static identification extremely difficult, the test objects pop out immediately from their
surroundings when animated (see also the discussion on Motion, below).
(2) Shading.
The Open Graphics Library (OpenGL [177]) provides built-in support for
the shading model described in Section 4.2.1. To obtain smooth shading across triangles approximating curved surfaces, OpenGL interpolates the normals at the vertices.
Special care needs to be taken when rendering sharp edges (e.g. the boxes in Row 3
of Figure 4.10) to prevent the interpolation scheme from visually smoothing out these
edges. I achieve this with so-called smoothing groups, which only interpolate normals
within each group, but not across groups. This requires specifying several normals per
vertex, depending on which face references the vertex.
(3) Mixed. In anticipation that Shading and Outline might yield statistically indistinguishable data, I add a Mixed mode combining the two, to test for cumulative effects7.
(4) TexISO & TexNOI. For texturing I rely on OpenGL’s built-in texturing capabilities. I use a trichromatic random-design texture, which is sphere-mapped onto the
objects8. To prevent the texture cue from interfering with the shading cue, the colors
of the texture are isoluminant (TexISO mode), i.e. they have different chrominance
values but the same luminance9. The colors are chosen to roughly fall within the red,
green, and blue part of the color spectrum, and are calibrated for equal luminance at
the participant’s head position using a Pentax Spotmeter V light meter. In case
7. See Section 4.7.3 for a discussion of possible interactions between shape cues.
8. Along with the choices for lighting model and contour type, this mapping was picked with reason but can still be considered fairly arbitrary. There exist many more possible mappings than can be explored in this dissertation. See Section 6.2 for further discussion.
9. Note that colors may reproduce non-isoluminant on your printer or display device.
an isoluminant texture interacts particularly strongly with motion [104], I include a
control texture mode without isoluminant colors (TexNOI mode).
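The two-pass depth-buffer method described in item (1) never enumerates edges explicitly; for completeness, the following minimal C++ sketch shows the per-edge test of Equation 4.1 directly. It is not the experimental renderer; the Edge structure and the mesh representation are assumptions made for illustration.

#include <vector>

struct Vec3 { float x, y, z; };
float dot(const Vec3& a, const Vec3& b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

struct Edge {
    int v0, v1;   // indices of the edge's two endpoints
    Vec3 n1, n2;  // normals of the two triangles sharing this edge
};

// Equation 4.1: an edge is an outline edge if its adjacent triangles face in
// opposite directions with respect to the view vector v.
bool isOutlineEdge(const Edge& e, const Vec3& v) {
    return dot(v, e.n1) * dot(v, e.n2) < 0.0f;
}

// Gather all outline edges of a mesh for the current view direction.
std::vector<Edge> outlineEdges(const std::vector<Edge>& edges, const Vec3& v) {
    std::vector<Edge> result;
    for (const Edge& e : edges)
        if (isOutlineEdge(e, v))
            result.push_back(e);
    return result;
}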
Motion. As illustrated in Figure 4.1, motion itself is a shape cue. There are several reasons why I do not separate motion from the other shape cues. First, many real-time graphics
environments, including games, immersive VR, and visualizations contain a significant number
of dynamic elements. Since shape cues may perform differently in highly dynamic environments compared to static environments, it is sensible to include motion in the experimental
setup. Second, shape-from-Motion relies on discernible parts of an object to move. In general,
these parts are only discernible because of their color, shading, contour, or texture properties.
Although motion is processed independently in separate cortical structures (area V5, to be specific [188]), these structures rely on the output from other cortical areas. It is therefore simply
impractical to separate motion from other shape cues.
One concern might still be that motion interacts with (depends on) some cues more strongly
than with others. Critics of my experiment have even ventured that motion is the main effect
I am able to measure. My position is, as above, that I am interested in the effectiveness of
different shape cues in dynamic environments, irrespective of any natural bias, which is therefore also part of any real or virtual dynamic scene. In response
to the above criticism, I refer to the results in Section 4.7, showing significant performance
differences for all different types of shape cues.
4.4.3. Adjusting Task Difficulty (2)
If an experimental task is too easy, all participants are likely to excel, and no statistical variation
can be measured to compute the effect of different experimental conditions. The same applies
to a task that is too difficult. If no participant is able to perform the task, no meaningful data can
be gathered. If we think of the performance graph as a simple statistical curve with total failure
(0%) on the one side and absolute success (100%) on the other, then there exists some transitional region in the middle that represents the performance threshold for that task. The best
data is measured at the point of greatest slope in the transition region because small changes
in subjective task difficulty lead to the greatest effect in measured performance. Of course,
this point is difficult to find in practice because it depends on many variables, including priming, learning, day-to-day condition, and individual variability. Unlike physical performance measures like
speed and strength, perceptual performance tends not to vary as greatly between participants.
Small variations do exist, though, and should be accounted for in the task design.
In short, the given task should be difficult enough to challenge experts, yet manageable
for novices. I design towards this goal by including two system parameters that affect task
difficulty: object speed and object count. The speed of objects is set low enough to ensure
adequate visual detection and interaction, that is, all participants can interact at least with some
objects and the number of objects they interact with, correctly or incorrectly, determines their
performance. The number of objects, on the other hand, is set so high as to make a perfect
trial (correct interaction with all objects) unlikely, even for an expert. I determined the actual
parameters for speed and object count heuristically by performing a small set of trials.
4.4.4. Simple Interaction (3)
Version 0.1: What not to do. In the first version of the experiment, I used a task
that required participants to drive a virtual car through a winding course (Figure 4.7). The
idea was to use a task that participants were used to from daily experience and that would bear
practical relevance for interactive tasks in computer graphics applications (e.g. navigation and
orienteering). A fair amount of effort and coding went into modeling the car’s interaction with
gravity, the terrain, friction, inertia, etc. down to the wheels spinning at the correct rotational
velocity when in contact with the ground. I took every precaution to ensure that the car would
handle as expected from a real car, yet would be easy to drive. I even set the initial acceleration
and maximum speed so that participants only had to steer right or left. Despite all this, the
experiment turned out to be a failure because for the majority of participants the driving task
was simply too difficult. Because my intention in this chapter is not merely to introduce a
single experiment, but rather a flexible methodology, I believe the mistakes of that first design
are instructive examples of which problems to be aware of and which design decisions can impact interaction performance (the main discussion continues with the Conclusion paragraph below).
Adjusting task difficulty.
As discussed above, the task should accommodate participants
with different performance levels or skills. In the car experiment, the main means of adjusting
the task difficulty was to change the maximum speed of the car. Because a faster car allowed
less reaction time, the task difficulty could be increased. Due to the realistic physics, however,
a faster car was also more difficult to control, would slide out in sharp turns, etc. (Figure 4.7,
right). Despite several trials to establish a good common speed, the interaction turned out to be
too difficult for most participants and too easy for some.
Measuring performance.
I employed several indirect performance measurements in the
car experiment including lap-time and deviation from an ideal path. In the analysis of the data, I
determined that lap-time was directly correlated with deviation from the ideal path and that the
deviation was mostly a factor of interaction difficulty and not of the difficulty in perceiving the
Figure 4.7. The First Version of the Experiment. Left: Screen-shots of single
(top row) and dual (bottom row) shape cue modes. The outline mode provides
a strong visual cue for the bottom of the valley (center of screen), a good guide
to drive towards. Shading provides the same cue although less pronounced, but
additionally yields curvature information. Texture does not provide much shape
information in a static screen-shot but helps to produce shape-from-Motion when
animated. Right: Analysis view (top-down) of 1.5 laps of a good driving performance with inset showing the participant’s view at the simulated time. Note
how the driver undulated around the ideal path, even on straight sections.
shape of the driving terrain. Participants constantly oversteered in one direction, followed by
overcompensation in the opposite direction (Figure 4.7, right). As a result, the distance traveled
by some participants was almost 1.7 times that of the ideal path and many participants suffered
various degrees of motion sickness.
Viewpoint.
To increase the available reaction time and give a better sense of spatial
awareness, the car experiment featured a third person (bird’s eye) view of the scene, common in
many games (Figure 4.7, left). This allowed participants to see further ahead and perceive the
car in its environmental context. It also placed the participants in a very unusual position to drive
a car and most participants had trouble adjusting to this view. As it turned out, participants with
game experience had few problems adjusting to the third person perspective, while for most
others the cognitive leap seemed too large without significant practice.
Conclusion.
The interaction with the system should fulfil the following requirements:
• Learning – The interaction should be simple to learn so that the amount of necessary
training per participant is minimized.
• Intuition – The interaction should be intuitive, so that participants are not preoccupied
with remembering arbitrary mappings between interaction and task.
• Unobtrusiveness – The interaction should not be obtrusive (e.g. by attaching many
wires or restrictive head-gear) to ensure that participants behave naturally.
• Dynamics – The interaction should be suitable for a task with moving objects.
The car experiment was able to address the dynamics and unobtrusiveness requirements, but
appeared difficult to learn and unintuitive for some participants.
Solution.
My approach in the present version of the experiment is a touch-to-select inter-
action paradigm (Figure 4.5), which requires no technical skills10 and is commonplace in many
social situations (simple learning).
To the best of my knowledge, the use of a touch-table interface for shape perception studies
is novel and offers three distinct advantages. First, it removes the level of indirection associated
with many pointing devices (intuition). Mice, for example, translate motion on the plane of
their supporting surface (e.g. table) into motion in the plane of the display device. These two
planes are commonly perpendicular to each other, resulting in some learning effort for novice
users. Second, it eliminates the need for a cursor to indicate the current position of the pointing
10. In fact, pointing is one of the earliest methods of gestural expression, emerging as early as infancy [50].
device, which might otherwise distract. Third, it does not require any external pointing device
(unobtrusiveness).
A disadvantage of the particular front-projection model I use is that the participant’s hands
cast visible shadows on the display. Interestingly, only 7 of the 21 participants ever noticed these shadows, and of those 7, only 3 felt somewhat impaired by them.
4.4.5. High-demand Task/Time-Constraint (4)
In shape discrimination tasks involving reaction time there are two conceptual methods to increase the task difficulty. One is to decrease the perceptual difference between target and distractor stimuli, thereby increasing the time participants take to correctly distinguish the two.
The other is to decrease the time participants have to make a distinction.
I implement the second method by exposing participants to a large number of potential
targets (not by minimizing the time each target is displayed) so that there is less reaction time
available per target11. The reason I prefer the second method is that, in my opinion, it better reflects the type of dynamic interactions we encounter in daily life. When driving
down a road we have to make split-second decisions about objects that are either negligible or
potential hazards. When playing sports or even just walking in a crowded room, we have to
constantly make decisions about the shape and motion of our surroundings, often without much
time to make these decisions.
In a severely time-constrained situation, we may not have the luxury of taking in all the
available evidence and making a correct decision. Instead, we have to make a best-effort decision with the information available at the time. Additionally, it is known that humans can only
11. Of course, participants can choose to ignore most targets and instead focus on only a few, but with more accuracy. My experiment accounts for such strategic variations (Section 4.6).
attend to a rather limited number of perceptual stimuli at a time [39, 160]. This may lead to
shape cue prioritization not evident in static experiments but important for real-time graphics
applications.
4.4.6. Test Shapes (5)
Section 4.3.5 briefly discussed the different test shapes that have featured in previous work. I
now pick up some of the concepts introduced there to define what I consider to be important
characteristics of a test shape set.
Familiarity Bias.
Familiarity with the shapes should not influence their perception.
Some people might be more familiar with dogs than with chinchillas, and this familiarity may
influence the reaction time and accuracy for shape-dependent tasks, particularly those using
naming times as measurements.
Generality.
The test shapes should represent a large number of basic shapes. If, for
example, a shape set only consists of shapes with right angles, then the results I obtain from a
study using these shapes do not apply to rounded shapes, or shapes with non-orthogonal angles.
Relevance.
While I prefer nonsensical shapes to avoid familiarity biases, I want the set
of shapes to reasonably approximate a number of real-world objects in conjunction with each
other. For example, stick-shaped objects could be used to model a broom or rake, but they
would make poor components for building voluminous objects like a book or a refrigerator.
Parametrization.
In reporting my experimental findings, I should be able to parameter-
ize the shapes that I used, so that others can get an intuition for the types of shapes the findings
are valid for and so that they can replicate my results. A counter-example I mentioned previously is the work by Ryan and Schwartz [141]. They used images of a hand, a switch, and a
steam valve as stimuli. No obvious correlation exists between these objects, they are not representative of a class of objects, and it is not obvious which of their shape characteristics had an
influence on the experimental outcome.
Shape Theories.
A number of different theories exist to explain human perception of
shape [94, 108, 57, 10, 64], each with its own strengths and limitations. While some theories are based on exemplars, others define metric properties or non-accidental properties of
shapes12.
For the purpose of my experiment, I can choose any theory that suitably fulfills my requirements of non-bias, generality, relevance, and parametrization. An important point to note is that
I do not actually require the theory to be valid in terms of modeling human perception because
I use the theory only to model shapes, not to model perception13.
Geon Theory.
One theory that suits my requirements, and which describes shapes
that are easily parameterized with standard computer graphics techniques is Biederman’s [10]
recognition-by-components (RBC), or geon theory. A geon is a volumetric shape similar to a
generalized cylinder [12, 67], i.e. a volume constructed by sweeping a two-dimensional shape
along a possibly curved axis. Geons can vary in terms of the geometry and symmetry of the
sweeping shape and axis. Biederman defined 4 categories in which geons may vary, by imposing the restriction that each category must produce non-accidental viewing features. That is,
the feature (e.g. curved vs. straight sweeping axis) must be evident for all but a small number
12. An in-depth discussion of different theories is beyond the scope of this dissertation, but the interested reader might find Gordon [64] useful as a starting reference.
13. A good theory can, of course, help to interpret the results of an experiment.
Figure 4.8. Constructing Shapes. Each experimental object consists of a main
body and two attachments. To ensure that attachments are visible from any viewpoint, they are duplicated for each object, mirrored, and rotated by 90◦ . Colors
are used only for illustration purposes.
Figure 4.9. Shape Categories. Both the main body of objects and the attachments vary along two non-accidental shape categories. Main Body: Main bodies
either have a round or square cross-section, and have a longitudinal axis of constant or tapered width. Attachments: Attachments also have a round or square
cross-section, but their longitudinal axis is either straight or curved.
of distinct accidental views and that slight perturbation of an accidental view must reveal the
feature. This restriction ensures that unique geon identification is invariant to most translations
and rotations.
To construct compound objects, geon theory devises a hierarchical network, whose nodes
are geons and whose structure indicates relative geon positioning. Like all other existing shape
categorization theories, geon theory faces problems of generality, i.e. it is not evident how subtle
shape differences, like those between apples and oranges, could be modeled. Nonetheless, the
simple shapes described by geon theory are reminiscent of the basic shape primitives found
in many virtual environments, architectural mockups, and computer games (Figure 4.4). The
principled categorization of geon shapes further allows me to specify exactly the types of shapes
and objects for which my experimental results are valid.
Constructing Shapes.
To construct shapes for my experiment, I combine a main body
shape with two identical but mirrored and rotated attachment shapes (Figure 4.8). Because a
single large shape is easy to see and differentiate, participants are instructed to ignore the shape
of the main body and only differentiate shapes according to their attachments. The main body
therefore merely serves to increase the difficulty of the perceptual task without increasing the
cognitive load on participants14. The attachments are duplicated and transformed so that they
are visible from any direction as the compound object moves across the display.
The main bodies and attachments vary along three parametric dimensions (CS, LS, LA, see
Figure 4.10 caption), adapted from a subset of Biederman’s descriptors. Main bodies vary
according to their cross-section (CS) and longitudinal size (LS), whereas attachments vary
according to their cross-section and longitudinal axis (LA), (Figure 4.9). Parameters for the
14. Compared to a hypothetical combination task, where participants would have to look for attachments only on particular main body shapes.
Figure 4.10. Experiment Object Matrix. The complete set of experimental objects, comprising 2 × 2 × 2 × 2 = 16 shape permutations. The variational parameters are: CS, cross-section (square/round); LA, longitudinal axis
(straight/curved); LS, longitudinal size (constant/tapered). Parameters for these
properties are chosen to preserve the volumes of main bodies and attachments.
construction of the main body and attachments are chosen to yield an approximately constant
volume, to ensure that average display size of objects under random rotation is approximately
equal.
Targets and Distractors.
Together, the permutations of parameters add up to 2 (main body shapes) × 2³ (3 parametric dimensions with 2 choices each) = 16 objects, listed in Figure 4.10. For each trial of the experiment, a different column (same attachments, different main
body) in Figure 4.10 is selected as the set of target objects, with the remaining objects acting as
distractors.
4.4.7. Learning and Biasing (6)
The performance of participants in a visual perception experiment depends partly on the participants (e.g. acuteness of vision, reaction-time) and partly on the experimental setup (e.g. display
modes, test-shapes). The setup parameters affecting performance should not be predictable
to ensure that the experimental data reflects the participants’ perceptual abilities and not their
memory or deductive reasoning skills. For example, if participants knew that every fourth object
was a target object, while the three intermediate objects were distractors, then they would only
have to recognize one target correctly and from then on simply count. In the car experiment described in Section 4.4.4, I only used two different tracks, one for training and one for the trials,
so that participants’ performances were the aggregated effect of the different visual stimuli and
of learning the curves of the trial-track. Such aggregation can be separated into its constituent
components using statistical techniques, but this requires many more independent trials.
In practice, learning of some sort is often unavoidable, even if it is only to practice an
interaction technique or to internalize the experimental instructions. To ensure that this learning
does not affect the experimental data, the experimenter can perform a practice trial without
collecting data.
A system bias is an effect that results in two experimental conditions differing by some measure other than the intended free variable. If, for example, the target-to-distractor ratio of the
Outline mode was different from the Shading mode, then this could affect the experimental
data even if the two modes otherwise behaved identically in terms of their ability to provide
shape information. The following paragraphs list the precautions I took to minimize learning
effects and biases during experimental trials.
Display Strategies.
Because the different shape cues provide different shape informa-
tion, participants have to develop varied strategies to distinguish targets from distractors (Figure 4.11). For example: flat, shaded surfaces are single-colored, while curved, shaded surfaces
show color gradients. Outlines, on the other hand, do not use color at all. To enable participants
to develop a strategy for each display mode and to ensure that the learned strategy for one mode
does not bias the performance in a later display mode that uses a similar strategy, I require
participants to perform a trial run that shows all display modes in random order (the detailed
experimental procedure is listed in Section 4.5).
Randomization.
To avoid introducing a system bias into the experiment, I fully random-
ize all system variables. In particular, I randomize the order in which the columns in Figure 4.10 are chosen as target objects, including the practice trial. The order of the 5 display modes for
each trial and the practice trial is also random.
Objects move across the screen in random linear paths, but I ensure that they always cross
the entire display, that all objects take the same amount of time to cross the display (8 seconds),
Figure 4.11. Mistaken Identity. During the experiment, objects constantly rotate randomly. This ensures that the objects can be viewed from all directions,
generates a depth-from-motion cue, and separates objects from the background.
The rotation also increases the likelihood of accidental views for which some
objects may look alike. By definition, accidental views are inherently unstable and will quickly disambiguate, but the viewer is required to track
objects for a finite amount of time to reliably interpret the scene. Labels in the
image correspond to the labels in Figure 4.10, i.e. objects with the same label are
different views of the same object, while objects with different labels are views
of different objects. The top row illustrates different objects that look similar for
some views. The bottom row shows that for some views (middle) the silhouette
of two objects (left and right) can be identical. Different perceptual strategies
may be necessary to disambiguate similar looking shapes.
and that they constantly rotate at similar speeds (between 1 and 3 radians per second). I determined
the values for linear and angular velocities empirically based on a small group of participants.
I ensure that the ratio of target objects to distractor objects always remains 1 : 4 by only
using the fixed set of objects shown in Figure 4.10. When any object is selected by a participant
(correct or incorrect), or when an object has crossed the display entirely, it is re-initialized,
which causes its trajectory and rotational velocity to be reset. The object is also deactivated for
a random time between 2 and 7 seconds, so that participants cannot anticipate the type of object
re-appearing on the display.
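A minimal C++ sketch of this re-initialization logic follows. It is not the experimental code; the TestObject fields, the helper name, and the choice to spawn objects on the left or top border are my own assumptions, while the constants (8-second crossing time, 1-3 rad/s rotation, 2-7 s deactivation) come from the text above.

#include <cmath>
#include <random>

struct TestObject {
    float startX, startY;   // entry point on the display border
    float dirX, dirY;       // normalized direction of the linear path
    float speed;            // display units per second
    float angularVelocity;  // radians per second
    float inactiveSeconds;  // delay before the object re-appears
};

void reinitialize(TestObject& obj, float width, float height, std::mt19937& rng) {
    std::uniform_real_distribution<float> unit(0.0f, 1.0f);

    // Pick a path that enters on one border and exits on the opposite one,
    // so that the object always crosses the entire display.
    bool horizontal = unit(rng) < 0.5f;
    float endX, endY;
    if (horizontal) {
        obj.startX = 0.0f;               obj.startY = unit(rng) * height;
        endX       = width;              endY       = unit(rng) * height;
    } else {
        obj.startX = unit(rng) * width;  obj.startY = 0.0f;
        endX       = unit(rng) * width;  endY       = height;
    }

    float dx = endX - obj.startX, dy = endY - obj.startY;
    float length = std::sqrt(dx * dx + dy * dy);
    obj.dirX = dx / length;
    obj.dirY = dy / length;
    obj.speed = length / 8.0f;                     // cross the display in 8 seconds

    obj.angularVelocity = 1.0f + 2.0f * unit(rng); // 1-3 radians per second
    obj.inactiveSeconds = 2.0f + 5.0f * unit(rng); // deactivated for 2-7 seconds
}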
4.4.8. Hardware & Software
I implemented the entire experimental system in C++, using OpenGL for rendering. The system,
including rendering and data acquisition, ran on an AMD Athlon™ 3500+ with 2 GB of RAM and
displayed via a Dell 3300MP digital projector. Participants interacted via a touch-sensitive
DiamondTouch table interface (Figure 4.5).
4.5. Procedure
All participants are seated in front of the experimentation desk and given brief oral instructions about the duration of the experiment and the interaction method. They
are then asked to follow the instructions on the screen and address any questions they may have
after reading the instructions but before commencing the experiment. The participants wear
ear muffs and sit in an isolated partition to minimize distractions. Participants read a few
short pages of instructions. The instructions introduce the objects that are used in the experiment and explain with text and visual examples how to differentiate targets from distractors. To
advance to the next instruction page, participants use the same interaction as in the experiment
(i.e. touching the table). Participants are then asked to perform a short practice trial (20 seconds per display mode). In the practice trial, participants are given visual feedback about their
performance. The instructions state that such feedback is not given during the experimental
trials. The user display is replicated on an external display visible to the experimenter who can
monitor the participants’ performances and note any obvious problems (e.g. a participant only
hitting distractors instead of targets). After the practice trial, the experimental trials begin.
Each experimental trial is preceded by a single instruction summary page, followed by a
textual description and visual example of targets versus distractors. Afterwards, the actual trial
begins. Each trial consists of the same set of targets shown in all 5 display modes in randomized
order. Each display mode is shown for 60 seconds, followed by a fade-to-black and several
seconds of darkness to prevent delayed interactions from one mode affecting the following
mode. Each trial ends with instructions to the participants informing them of the completion
of the trial and allowing them to rest for up to a minute before continuing. Altogether, each
participant performs 4 trials, one for each column in Figure 4.10, for a total time of about
25-30 minutes, including the practice trial and rest-periods.
After the last trial, participants are asked to fill out a short questionnaire with yes/no and
Likert-type (1 to 5) questions (Figure B.1) to collect subjective ratings for shape cues, personal
performance, experimental duration, fatigue, and discomfort.
4.6. Evaluation
In Section 4.3.4, I explained the different measurement methods commonly used in shape
perception experiments. In this Section, I discuss the direct measurements I gather for each
participant (Section 4.6.1) and how these are converted into indirect measures to discount individual variations due to risk disposition and interaction strategy (Section 4.6.2). Finally, Section 4.6.3 describes the statistical analysis of the acquired data.
4.6.1. Measurements
For each trial of each participant, the system records the following named interaction events
(direct measurements), along with time-stamps:
shots: The number of times a participant indicates an object-selection by touching the table input-device.
correct: The number of times the touched object is of the target type.
incorrect: The number of times the touched object is of the distractor type.
missed: The number of times the participant touches the background instead of an object. This is seldom due to a participant mistaking the background for an actual object, but rather because of imprecise hand-eye coordination (missed events are generally followed immediately by correct or
Using the above definitions, it is always true that shots = correct + incorrect + missed.
The system also records the following system events:
lost: The number of target objects that traverse the screen completely without intervention. This happens if the participant fails to identify the object as being of the target type, or if the participant is too busy interacting with other objects.
initialized: The number of objects that are initialized to traverse the screen. An object is re-initialized every time the participant touches it, or if it traverses the screen completely without interaction.
4.6.2. Aggregate (Indirect) Measures
Independent of their objective performance skills, participants’ data in task-driven studies is
sensitive to subjective factors such as competitiveness, strategy, and risk profile. In my experiment some people might only take a shot if they are very certain of the object type they are
selecting, while others might take as many shots as possible while risking false target identification. To normalize for these factors, and to answer other performance questions that cannot
be measured directly, I define the following aggregate measures in terms of their name, the
question the measure is supposed to help answer, a motivation for the measure’s definition, the
formula to compute the measure, and the measure’s scale/range. All aggregate measures are
defined in terms of the direct measures of Section 4.6.1.
success — Of the shots taken, how many are correct? — Some participants might be more
aggressive and willing to take risks. If they shoot often, then their absolute correct count might
be high despite also making many mistakes. To compare such participants to more conservative
ones, who shoot less but are more accurate, I normalize the correct count by the number of shots taken:
success = correct / shots.
Scale: The range of success is normalized, with a value of 1 indicating a perfect score.
failure — Of the shots taken, how many are incorrect? — An equivalent motivation as for
success (above) applies here:
failure = incorrect / shots.
Scale: The range of failure is normalized, with a value of 1 indicating that all attempted shots
were incorrect.
risk — Of the objects crossing the screen, how many shot attempts are taken? — To assess
the risk profile of a participant, I measure the readiness of that participant to shoot at an object
crossing the screen. The higher the risk measure, the more a participant is willing to risk an
incorrect target choice (or the more confident the participant is in his or her decision). This
makes risk more of a personality measure than a performance measure:
risk = shots / (initialized − shots).
Scale: Because each shot itself causes an object re-initialization (to keep the number of objects
on the screen roughly constant), I subtract shots from the denominator. This means that risk
has no upper bound (and is not normalized to 1). A risk value of 0 indicates no risk (no shots).
A value of 1 indicates very high risk (the number of shots equals the number of freely initialized
objects). Values above 1 indicate extreme risk (more shot attempts than new objects traversing
the screen), but no participant in the study exhibited such risk behavior.
placement — How good is each participant’s hand-eye coordination, and is this a function
of display mode? — When participants interact with the system they are supposed to select a
target object (whether their actual selection is correct or incorrect is irrelevant). Failure to do so
(selecting the background instead), indicates poor hand-eye coordination, which may be linked
to the display mode:
placement = (correct + incorrect) / shots = (shots − missed) / shots.
Scale: The range of placement is normalized, with a value of 1 indicating perfect placement.
detection — How well can correct target objects be detected on the display? — This
measure compares the correctly shot targets to the number of target objects that traversed the
screen completely without having been shot at. Because lack of hand-eye coordination can
lower the number of correctly identified targets that are actually shot, the numerator takes into
account both the correct shots, as well as a fraction of missed shots that likely would have been
correct given overall performance:
detection = (correct + missed · correct / (correct + incorrect)) / lost.
Scale: Like risk, the range of detection is not normalized to 1, but uses the value of 1 as a
qualitative threshold. A detection value of 0 means that no objects were correctly detected. A
value of 1 indicates that half the target-type objects were detected. A value of 2 means that twice as many target-type objects were detected as were lost, etc.
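A minimal C++ sketch of these aggregate computations follows. It is not the actual analysis code; the struct and function names are my own, and division-by-zero guards (e.g. for trials with no shots or no lost objects) are omitted for brevity.

struct TrialCounts {
    int correct, incorrect, missed, lost, initialized;
    int shots() const { return correct + incorrect + missed; }
};

struct AggregateMeasures {
    double success, failure, risk, placement, detection;
};

AggregateMeasures aggregate(const TrialCounts& c) {
    const double shots = c.shots();
    AggregateMeasures m;
    m.success   = c.correct / shots;
    m.failure   = c.incorrect / shots;
    m.risk      = shots / (c.initialized - shots);
    m.placement = (c.correct + c.incorrect) / shots;
    // Fraction of missed shots that would likely have been correct,
    // estimated from the participant's overall hit ratio.
    const double likelyCorrect =
        c.missed * (double(c.correct) / (c.correct + c.incorrect));
    m.detection = (c.correct + likelyCorrect) / c.lost;
    return m;
}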
4.6.3. Analysis
I first convert the direct measurements into aggregate measures and average the latter over the
4 trials that each participant performs (Figure 4.13 and Figure 4.14). I then average these per-participant averages (Table B.1–B.5) over each display mode (Figure 4.12) to obtain overall
performance values for each display mode.
To establish whether different shape-from-X display modes have a significant effect on aggregate performance measures, I perform a repeated-measure analysis of variance (ANOVA)
for each of the aggregate measures (Table 4.1).
Figure 4.12. Aggregate Comparison. These bar graphs show aggregate measures averaged over all participants and all trials. Error bars are normalized
standard errors. Note the different vertical scales (to show small details) when
comparing. Interpretations for scales are given in Section 4.6.2.
Figure 4.13. Detailed Aggregate Measures. These charts show per-participant
values for each display mode and each aggregate measure. Most data is approximately normally distributed (see Figure 4.14), with occasional outliers, some of
which are extreme. Individual performance appears mostly consistent throughout, in accordance with the general analysis remarks in this Section. Note the
different scales of the individual charts.
Figure 4.14. Detailed Aggregate Measures Histograms. These charts show
the histograms (frequency distribution) for the detailed participant data in Figure 4.13. Despite the erratic appearance of the traces (the number of participants
limits the resolution of the histogram), the distributions appear to be uniformly or normally distributed, and no clustering is evident. Note the different scales of the
individual charts.
The ANOVA analysis only determines overall effect of display modes. Further analysis
of pairs of results for different display modes using Student’s t-test allows me to detect significantly different means for each display mode pair. Table 4.2 shows the t-test results for all
combinations of display modes. Since the likelihood of a false positive is higher across multiple
tests than for each individual test, I use the highly conservative Bonferroni correction, which divides the alpha-value, α, by the number of tests, n, so that α → α/n.
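As a minimal illustration of the corrected significance test, the following C++ sketch applies the Bonferroni threshold to a set of pre-computed pairwise p-values; the p-values themselves are assumed to come from a standard statistics package, and the structure and function names are my own.

#include <vector>

struct PairwiseTest {
    int modeA, modeB;  // indices of the two display modes being compared
    double pValue;     // p-value of the t-test for this pair
};

// A pairwise difference is reported as significant only if p < alpha / n,
// where n is the total number of pairwise tests performed.
std::vector<bool> bonferroniSignificant(const std::vector<PairwiseTest>& tests,
                                        double alpha = 0.05) {
    const double correctedAlpha = alpha / tests.size();
    std::vector<bool> significant;
    for (const PairwiseTest& t : tests)
        significant.push_back(t.pValue < correctedAlpha);
    return significant;
}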
4.7. Results and Discussion
I tested 21 participants (8 female, 13 male), consisting of graduate students and university
staff volunteers. All participants had normal or corrected-to-normal vision.
4.7.1. Data Consistency and Distribution
The scatter-spread charts in Figure 4.13 show that the group of participants as a whole performed fairly consistently (vertical ordering of participants does not change dramatically between the different display modes). Good performers generally performed well throughout and
poor performers generally performed worse for most modes. Some notable outliers are evident,
though. One extreme outlier (more than three interquartile ranges from the third quartile) is
recorded for participant BA056’s Mixed-risk performance. Further analysis of this trial data
shows that the participant attempted more than double the number of shots during the first trial
compared to the remaining trials. Although this led to an 18% higher success rate in the first
trial, it also resulted in an up to ten-fold higher failure rate. The reason for this abnormality
is difficult to ascertain because the remaining trials show much more moderation in terms of
shots attempted. Since no feedback was given during or between trials, the participant decided
to adjust his or her interaction behavior autonomously. I eliminate the above-mentioned extreme
outlier from the analysis lest I contaminate the remaining data.
Another interesting example is BA055’s Mixed-detection performance, which, in this case,
is exceptionally good. As above, this anomaly is due to the performance of a single trial (third).
During this trial the participant lost almost no objects, and detection becomes very sensitive for very low values of lost because it is not a normalized measure. It should be noted, however, that
BA055’s detection score is among the highest of all participants for all display modes. Given
this detection proficiency, it is somewhat surprising that the participant’s success scores are
only average, especially considering that the failure scores are also among the lowest of all
participants. A possible reason may be the fairly low placement scores (hitting the background
instead of objects). Judging by the low failure scores, the poor placement is not likely due to
mistaking the background for real objects, but more likely due to poor hand-eye coordination
or fatigue (most missed objects occurred in the last trial).
4.7.2. Strategies
During the initial practice trial, participants develop strategies for categorizing target and distractor objects for the different display modes (Section 4.4.7). To support reasoning about shape
perception strategies, I use Figure 4.14, which displays the same data as Figures 4.13 and 4.12
but this time as histograms, to demonstrate distribution properties. The graphs in Figure 4.14
approximate a near normal distribution (when taking into account the low resolution of the histogram, limited by the number of participants in the study and the discrete nature of histograms).
Most importantly, there exists no evidence to suggest participants splitting into multiple distinct
clusters (e.g. like the two humps of a Bactrian camel). Such clusters would be suggestive of the
existence of a small number of shape recognition strategies with distinct performance characteristics, applied by participant subgroups. The lack of clusters does not rule out the possibility of
several coexisting strategies, however. Multiple strategies could exist that happen to be equally
effective. Or there could be a large number of strategies with different performance characteristics and the exhibited distributions are an effect of the population’s likelihood to adopt an
optimal strategy for a given display mode.
A simpler explanation, and one in line with exit interviews and personal experience, is that
all single shape cues suggest a single strategy15. The results for Mixed mode are therefore
highly interesting as several possibilities arise:
(1) Coexistence: The two strategies for Outline and Shading remain in effect independently and participants choose the better strategy to apply to Mixed. In that case, Mixed should perform like the better of Outline and Shading.
(2) Interference: The two strategies for Outline and Shading remain in effect but interdependently. In the case of constructive interference, participants make use of both strategies simultaneously and performance rises above the level of Outline or Shading alone. In the case of destructive interference, one strategy hinders the other and performance is less than the better of Outline and Shading.
(3) Synergy: The simultaneous presence of Outline and Shading shape cues allows for a novel strategy only applicable to Mixed. In this case, participants can choose to use the new strategy or stick with the better of the Outline and Shading strategies. The performance should thus not be worse than for the individual shape cues.
15 The following discussion is facilitated by assuming a single strategy per display mode, but does not depend on it. Each occurrence of strategy could be replaced by a distinct set of strategies without affecting the arguments.
4.7.3. Interference vs. Synergy
Altogether, I find that Shading provides the best shape cue in my study (as determined by success, failure, and detection scores), followed by Mixed, Outline, and the texture modes, TexISO & TexNOI (Figure 4.12).
According to the above discussion, the particular ordering of Shading, Mixed, and Outline
suggests a destructive interference effect instead of coexistence, synergy, or constructive interference.
Intriguingly, the combination of Outline and Shading actually decreases the efficiency of
Shading, instead of adding constructively by helping to disambiguate, as might be expected.
This finding reiterates a common theme throughout this dissertation, that for perceptual tasks
less visual information can be more effective than more information. Indeed, several participants commented in the exit interview that the Mixed mode offered too much information and confused them. My explanation for this result is that different detection strategies for shape-from-contours and shape-from-shading could impede each other. For the Outline mode it is
advantageous to compare the terminating contour angle of attachments, while Shading offers
the most reliable information in terms of presence or absence of gradients along the surface interior of attachments. If these strategies are different enough, or even mutually exclusive,
participants may find it difficult to focus on one strategy while ignoring the other. These results
are therefore highly valuable for the design of effective shapes and shape-cues for interactive
non-realistic rendering systems.
This result is also important because it partly contradicts and partly augments findings of
previous shape perception studies. Bülthoff [19], for example, found that subjects underestimated curvature of static objects shown with shading or texture alone, but results improved
when shading and texture were shown in conjunction, lending support to a synergy or constructive interference theory. I believe the fact that Bülthoff detected an additive effect while I found
that multiple shape-cues may be counterproductive can be explained by expanding upon my
above theory on detection strategies with a timing argument.
Participants may find it difficult to focus on one strategy while ignoring the other, under
time-constrained conditions. While there is no reason to believe that humans would not take
all available evidence under consideration when given the time, I have mentioned previously
that the human visual system can only attend to a limited number of stimuli simultaneously.
It is therefore conceivable that for static scenes and when given ample time humans use multiple shape cues constructively, while in a time-critical interactive situation a shape cue prioritization takes place16. Such an argument could also find support in findings by Moutoussis
and Zeki [114], stating that each of the different visual processing systems of the HVS “[...]
terminates its perceptual task and reaches its perceptual endpoint at a slightly different time
than the others, thus leading to a perceptual asynchrony in vision - color is seen before form,
which is seen before motion, with the advantage of colour over motion being of the order of
60-100 ms [...]” ([189], pg. 79). I thus believe it is vital to perform more studies on shapeperception for real-time, interactive tasks.
4.7.4. Interaction
Table 4.1 shows analysis of variance (ANOVA) results for the different aggregate measurements, to test if varying the display mode had a significant effect on the means of these measures. The F(dof, n) value in the first column represents the ratio of two independent estimates
of the variance of a normal distribution, where dof = m − 1 are the degrees of freedom, m is the factor level (here, different display modes: m = 5), and n = 84 is the number of observations under identical conditions (4 trials for 21 participants). Higher F-ratios indicate a greater dissimilarity of the variances under investigation. For the given dof and n values, an F-ratio above 2.5 indicates statistical significance at the p = 0.05 level (actual p-values are shown in the second column). That is, for F-ratios above 2.5 the chance of obtaining the observed data under the null hypothesis is less than or equal to five percent. It is thus much more likely that the observations were not obtained by chance and instead represent an actual effect, in this case that the different display modes have a significant effect on aggregate performance measures.

16 Although the results were not statistically significant, I also found evidence for such prioritization in the data trends of the car-experiment.

Measure      F(4,84)   p
success      48.594    1.14 · 10^-20
failure      50.154    4.68 · 10^-21
risk         13.317    2.23 · 10^-8
placement     1.412    0.238
detection    49.625    6.32 · 10^-21

Table 4.1. Within-“Aggregate Measure” Effects. This table lists the F-value for the given degrees of freedom, and the p-value for each of the aggregate measures across all display modes. Display modes are averaged over all trials of all participants. Given values assume sphericity.
Given the values in Table 4.1, I find that the different display modes have a (highly, p <
0.01) significant effect on all aggregate measures, except placement. This is an ideal result,
because it means that while performance measures related to the task are critically affected by
the display modes, the placement measure, related to interaction, does not vary significantly
with display mode. In other words, participants are able to consistently touch their intended
moving objects, even if the objects themselves may be difficult to differentiate.
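As a quick sanity check of the thresholds quoted above, the critical F-ratio and an implied p-value can be computed directly from the F distribution, using the degrees of freedom as reported in Table 4.1:

from scipy.stats import f

dof_effect, dof_error = 4, 84
f_crit = f.ppf(0.95, dof_effect, dof_error)   # critical F-ratio at the p = 0.05 level
print(f_crit)                                  # approximately 2.5, as quoted above

# p-value implied by one of the observed F-ratios, e.g. success (F = 48.594)
print(f.sf(48.594, dof_effect, dof_error))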
Modes                 succ.   fail.   risk    place.  detec.
Outline vs. TexNOI    vsig    vsig    vsig    0.061   vsig
Outline vs. TexISO    vsig    vsig    vsig    0.080   vsig
Outline vs. Shading   vsig    vsig    vsig    0.184   vsig
Shading vs. TexNOI    vsig    vsig    vsig    0.469   vsig
Shading vs. TexISO    vsig    vsig    vsig    0.401   vsig
Mixed vs. TexNOI      vsig    vsig    0.001   0.282   vsig
Mixed vs. TexISO      vsig    vsig    0.001   0.190   vsig
Mixed vs. Outline     0.025   0.002   0.018   0.740   vsig
Mixed vs. Shading     0.025   vsig    0.346   0.380   0.099
TexISO vs. TexNOI     0.267   0.237   0.490   0.987   0.347

Table 4.2. Significance Analysis. This table lists p-values for Student’s paired t-test of all combinations of display modes. The columns refer to each of the aggregate measures. A value of p < 0.005 is considered significant (bold-italic), while a value of vsig (p < 0.0005) is highly significant, under the highly conservative Bonferroni correction.
Another conclusion is that performance differences for the different display modes can be
detected with the aggregate measures, number of participants and number of trials specified in
this chapter. This suggests reusability of the experimental setup for numerous other dynamic
shape perception studies (Section 6.2).
As evident in Figure 4.12, the success rates for all display modes are significantly higher than pure chance (25%), and also higher than the 50% chance level that would apply if participants had ignored the instructions and considered
only one of the two attachment categories (CS & LA, in Figure 4.10). This indicates that participants understand the instructions correctly (using both attachment categories for distinction)
and find the task easy enough to perform.
These results demonstrate the successful implementation of the third design goal (interaction simplicity), and I am hopeful that the experimental methodology can be adopted for a large variety
of additional display modes.
4.7.5. Motion and Color
For the detailed t-test analysis in Table 4.2 no significant differences are found between the
different texture modes. This is surprising, as motion perception is commonly thought to be
linked to luminance channels and independent of color [187, 134, 115]. In that case, the isoluminant texture mode, TexISO, should perform worse than the non-isoluminant texture mode, TexNOI. In fact, the results of my study show the opposite trend (although that trend is not statistically significant) and are in line with subjective responses from the exit interview (Table B.6, bottom row), indicating that TexISO appears easier than TexNOI to most participants. This
is an interesting finding and may substantiate recent studies that propose two different motion
pathways in the HVS, which process slow motion (chromatically) differently from fast motion
(achromatically) [55, 168, 104].
4.7.6. Risk Assessment
An interesting result, evident in Figure 4.12, is the positive correlation between success and
risk, and the negative correlation between failure and risk (significant at p(dof=3) < 0.01 in both cases). Intuitively, it seems that more risk should lower success and increase failure, to
the point of ultimate risk, equating to pure chance. In my interpretation of this data, participants
are generally able to judge their limitations well and behave rather conservatively, in line with
the instruction to be as fast as possible without making any mistakes.
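The correlation itself can be sketched as follows, computed across the five per-mode means (hence dof = 5 − 2 = 3); per_mode is the hypothetical table from the earlier aggregation sketch, standing in for the values plotted in Figure 4.12.

from scipy.stats import pearsonr

r_sr, p_sr = pearsonr(per_mode["success"], per_mode["risk"])
r_fr, p_fr = pearsonr(per_mode["failure"], per_mode["risk"])
print(f"success vs. risk: r = {r_sr:+.2f}, p = {p_sr:.3f}")
print(f"failure vs. risk: r = {r_fr:+.2f}, p = {p_fr:.3f}")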
Mostly non-significant differences are found for Mixed vs. Outline, and Mixed vs. Shading. I attribute this to the fact that Mixed performance is between Shading and Outline for all measures but risk (Figure 4.12), the relatively large variability of Mixed, and the use
of the conservative Bonferroni correction. A possible explanation for the high risk value of
Mixed is that participants became more daring because they assumed that more visual information would improve their correct target identification. The objective performance measures (success and failure) do not corroborate this notion, and, interestingly, neither do the exit
interview results (Table B.6).
4.7.7. Exit Interview
From the exit interview (Figure B.1 and Table B.6) I gather that participants were content with
the duration of the experiment. No participant felt dizzy or disoriented during or after the
experiment and participants found the interaction paradigm very intuitive. Most participants
described the experiment as “fun”, even though this was not asked in the questionnaire.
4.8. Summary
In this chapter, I presented an experiment to study shape perception of multiple concurrent
dynamic objects. This experimental approach is novel as it deviates from traditional reductionist
(single, static shapes) studies whose results may not apply directly to most interactive graphics
applications.
Experimental Framework.
I created several non-realistic display modes specifically
designed to target only a single shape cue at a time, allowing me to study individual shape cues
as well as combinations thereof.
My framework implementation carefully follows a number of high-level design goals described in Section 4.1.1 and the statistical significance of the data collected during my study
(Section 4.6.3) indicates a high-quality experimental setup where these design goals were attained. Results further indicate that my experiment supports a number of additional shape-from-X studies with only minor modifications (Section 6.2).
Interaction.
I presented a novel interaction paradigm that does not rely on pointing devices or other indirect mechanisms. Participants interact with the experiment by simply touching a table at the position where they see an object. This interaction is simple, intuitive, unobtrusive, and reliable (see placement values in Figure 4.12).
Task.
Compared to most previous shape perception studies, which can quickly become
monotonous and tiring, my experimental task is inspired by games (Section 4.4) and intended to
be motivating. Participants have to react quickly and stay alert to achieve a good performance.
Most participants became very competitive during the experiment and described the task as fun
(Section 4.7.7).
Results.
Although the main contribution of this chapter is a reusable experimental design
that allows for a large number of graphics-relevant shape perception studies, the results for the
single study I performed are already interesting (Section 4.7). The most important of these
results is that, in dynamic situations, shape cues do not seem to add constructively and may
even interfere destructively. This is an important result for interactive graphics, and one which
contrasts previous studies on static objects (Section 4.7.3).
Shape Cues in Action.
It is common knowledge amongst graphic designers that different
types of shape-cues are effective for different design goals. I have made use of this concept
throughout the figures in this chapter: Figure 4.10 uses shading and coloring to indicate shape
and differentiate object parts, Figure 4.6 uses contours to draw attention to target objects, and
Figure 4.3 uses texture to illustrate a curved surface. It is important to study the effectiveness
of these shape cues for various perceptual tasks, to apply the cues appropriately in a given
graphical situation.
CHAPTER 5
General Future Work
Given the beneficial relationship between NPR and Perception advocated in this dissertation,
one might naturally ask questions such as: How far can we push this relationship? or What is
the ultimate perceptual depiction? or What other non-realistic imagery, apart from that inspired
by art, can be used for visual communication purposes? Although I do not presume to have
conclusive answers to these questions, I believe the direction outlined by my dissertation points
towards some interesting leads. To start this discussion off, I revisit the topic of realism versus
non-realism in the light of the issues addressed in previous chapters.
5.0.1. Realistic Images
Figure 5.1 illustrates simplistically the general lifecycle of a synthetic image from conception
to perception and onto cognition. I argued in Section 1.1.1 that every image serves a purpose.
For now, let this purpose be to convey a message, even if this message is only the image itself
(e.g. “A table and chair in the corner of a room”). In the purely photorealistic approach, this
message is encoded into a life-like visual representation, without reference to the HVS1, to be
consumed by an observer. The observer’s task, then, is to decipher the message given the input
image. If all elements of this encoding and decoding process work well, the observer recovers
a good approximation of the original message. Because the entire process is rather lengthy
1 As noted in Section 2.4.2, even adaptive rendering, which sometimes does consider the HVS, does so mostly to hide artifacts, not to enhance images.
Figure 5.1. Lifecycle of a Synthetic Image. The image generation (rendering,
blue outlines) starts with a concept: A table and chair stand in the corner of a
room. A user models the objects, sets up the scene, and renders the image to
a display device. An observer views the final image on the display and starts
deconstructing the retinal projection (vision, red outlines). The observer goes
through various low-level and cognitive processing steps before recognizing the
depicted scene: A chair next to a table in a room. If rendering and vision work
in perfect harmony, the initial concept and the recognized scene are identical.
Vision shortcuts are the attempt to bypass some of the rendering and visual decoding pipeline to effect a more direct visual communication.
and complicated, and because there are no possible shortcuts (see below) for realistic image
synthesis, there are various stages at which the message can be degraded or confused.
5.0.2. Non-realistic Images
Non-realistic image synthesis is not bound by the constraints of the physical world. It thus
becomes easier to eliminate detail that (1) does not contribute to representing the message and
could in the worst case mask the message (confusion), and (2) requires additional rendering
resources, thereby incurring unnecessary costs. The best example of purposeful omission of
information is abstraction. Restrooms around the world generally do not post photographs of a
man and a woman on their doors. Doing so would give too much information, be too specific.
Patrons may be led to believe that the room behind the door belongs to the depicted person.
Instead, restroom signs are abstract representations of men and women, so that any person of
the appropriate gender can identify with the depiction. The allowed shortcuts for non-realistic
images are to bypass optical models required for realistic image synthesis. I should note that
my use of the term shortcut chiefly refers to optimizations in visual communication. While it
is possible that such shortcuts are also computationally efficient (as is the case for many non-realistic image synthesis algorithms that do not rely on global illumination solutions), I do not
require this to consider a shortcut to be effective.
5.0.3. Perceptually-based Images
I have argued throughout this dissertation that the effectiveness of non-realistic imagery can be
further increased by considering human perception. I indicate this with the perceptually-based
rendering label in Figure 5.1. To generate images optimized for low-level human vision, the
rendering process needs to include a model of perception (light-blue rendering input). Although
such a model does not introduce additional shortcuts on the rendering side, it might increase
efficiency on the visual decoding side. This is the approach I took in Chapter 3 and which led
to increased performance in two perceptual tasks.
One way to discuss the questions I pose at the beginning of this Section is to investigate any
perceptual shortcuts beyond those already mentioned. In other words: can we generate images that convey a given message while bypassing more of the coding/decoding pipeline? I believe
the answer is, yes. To substantiate this claim, let me give a few examples of what I refer to as
vision shortcuts.
5.1. Vision Shortcuts
In most realistic and even non-realistic graphics, there exists a fairly straightforward connection between a generated visual stimulus and its perceptual response. The intensity of a pixel
on a monitor is related to the perceived brightness of that pixel. The perceived color of a pixel
is related to the red, green, and blue intensities of that pixel, and so on.
There exist, however, various examples of visual stimuli producing a perceptual sensation
that is naturally associated with a very different type of stimulus: a sequence of black-and-white
signals can create the illusion of colors. An interlaced duo-chrome image can be perceived to
contain colors outside the gamut of additive mixture. A static texture pattern can elicit the
sensation of motion. Partially deleted outlines can be perceived as complete. The following
sections introduce these perceptual phenomena in terms of non-realistic imagery and discuss
some of their potential applications for visual communication.
5.1.1. Benham-Fechner Illusion: Flicker Color
The Phenomenon.
In 1895, a toy-maker named Charles Benham created a spinning top
painted with a pattern similar to the left pattern in Figure 5.2. This toy was inspired by his
finding that when the pattern was spun, it created the appearance of multiple colored, concentric rings2 [7]. Gustav Fechner [44] and Hermann von Helmholtz investigated the phenomenon
2 I first experienced this illusion in a Natural Science museum in India. The exhibit was in motion when I read the accompanying instructions and it was not until the disk was almost stationary that I was finally convinced that there were, indeed, no colors.
Figure 5.2. Flicker Color Designs. The left and center circular designs can be
enlarged, cut out and placed on an old record turntable with adjustable speed.
When viewing the animated pattern, most people experience concentric circles
in different colors. When the rotation is reversed, the color ordering reverses
accordingly. The square design is intended for a conveyor-belt motion, or to be
painted onto a cylinder. These designs are but a few of many others possible.
Note though, that all designs contain half a period of blackness.
more generally and termed it pattern induced flicker color (PIFC), or flicker color for short.
Although the effect has been researched for a long time [21], a satisfactory explanation remains
elusive. An early theory stipulated that the pulse patterns of the Benham design approximated
neural coding of color information, similar to Morse code. Festinger et al. [48] argued that
Benham’s induced (or subjective) colors were only faint because they poorly approximated real
neural codes. They devised several new patterns with cell-typical activation and fall-off characteristics and demonstrated that their patterns did not require the half-period rest-state of typical
Benham-like patterns (Figure 5.2). Festinger et al.’s theory was later disputed, particularly by
Jarvis [81] who could not reproduce their results. A currently accepted partial explanation argues that lateral inhibition of neighboring HVS cells exposed to flicker stimuli causes subjective
colors to be seen [182].
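To make the pattern structure concrete, the following sketch rasterizes a Benham-like disk with half a period of blackness and a few groups of concentric arcs. The particular arc layout, band radii, and angular extents are arbitrary choices rather than a reproduction of Figure 5.2; spinning the resulting image (on screen or on a turntable) should elicit the flicker-color effect for most observers.

import numpy as np
import matplotlib.pyplot as plt

size = 512
y, x = np.mgrid[-1:1:size * 1j, -1:1:size * 1j]
r = np.hypot(x, y)
theta = np.mod(np.arctan2(y, x), 2 * np.pi)

disk = np.ones((size, size), dtype=np.float32)      # start with a white image
disk[(r <= 1.0) & (theta < np.pi)] = 0.0            # half a period of blackness

# Three groups of thin concentric arcs in the white half, each occupying a
# different radial band and a different 60-degree angular segment.
bands = [(0.30, 0.45), (0.50, 0.65), (0.70, 0.85)]
for i, (r0, r1) in enumerate(bands):
    t0 = np.pi + i * np.pi / 3.0
    in_segment = (r >= r0) & (r <= r1) & (theta >= t0) & (theta < t0 + np.pi / 3.0)
    arcs = np.mod(np.floor((r - r0) / (r1 - r0) * 8), 2) == 0   # alternating thin rings
    disk[in_segment & arcs] = 0.0

disk[r > 1.0] = 1.0                                  # keep the area outside the disk white
plt.imshow(disk, cmap="gray", vmin=0, vmax=1)
plt.axis("off")
plt.show()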
Applications.
Apart from research work, the PIFC effect has found applications in ophthalmic treatment and has even been the subject of numerous patents (BD Patent Nr. 931533, 11. Aug. 1955; U.S. Patent #2844990, July 29, 1958; U.S. Patent #3311699, Mar. 28, 1967), including novelty advertisement before the era of color television. If more knowledge existed about the causes of
PIFCs and a reliable method to synthesize saturated and vibrant PIFCs were known, it could
be possible to induce color sensations in retinally color-blind or otherwise retinally-damaged
people.
5.1.2. Retinex Color
The Phenomenon. In Section 1.2.1, I mentioned the phenomenon of color constancy, which
allows humans to perceive the true color of a material instead of its reflected color. Another
phenomenon, sometimes called color illusion, is best explained with an example: If a small grey
square (the shape is not important) is placed upon a larger green square, then the grey square
appears tinted lightly red. Similarly, if a grey square is placed upon a larger red square, the
grey square appears tinted slightly green, i.e. the tint appears as the opposite color of the square
it is placed upon. This effect also works with cyan/yellow combinations and poses problems
for theories that posit that the cones in human retinas are independently sensitive to red,
blue, and green wavelengths. To explain these color illusions, the cones’ responses cannot be
interpreted independently. Alternative theories, along with supporting physiological evidence
exist, based on antagonistic interactions between combinations of cones resulting in spectrally
opposing stimulation [76, 80].
Figure 5.3. Retinex Images. Viewing Instructions: Due to interlacing, the images may not display well at some magnification levels. In the electronic version
of this document, zoom into each image until all the horizontal lines comprising
the images appear of the same height. Then adjust your viewing distance to the
display until the individual lines cease to be discernible. In this configuration,
examine the images for 30 seconds or more and then determine what colors you
see. Afterwards you can compare the real colors in Figure 5.4. Finally, zoom
fully into the above images to inspect the actual colors used.
Edwin Land devised an experiment using both phenomena to suggest subjective colors,
which are objectively not present. In this experiment, he implemented the color illusion phenomenon with a picture slide and a few color-filters to produce duo-chrome images that induced
the illusion of colors which were present only in the original image. The HVS interpreted
the overall color bias of his images as a global illuminant, thus taking advantage of the color
constancy phenomenon. Land described this experiment and the accompanying theory in his
Retinex3 publication [100].
Figure 5.3 shows two Retinex image examples. The images are best viewed on a computer
display with adjusted viewing conditions (see Figure 5.3 caption for instructions). The left
3 Retinex = Retina + visual cortex.
Figure 5.4. Originals for Retinex Images. Originals used to construct images
in Figure 5.3.— {Left: Public Domain. Right: Creative Commons License.}
image really only uses one color, red, but induces the sensation of green (and other colors) with
interlaced grey bands. Note the brown tinge of the burger bun, the yellow of the fries and the
bluish-green tray. The right image uses two different colors, green and red, to achieve a much
fuller color appearance. Note the grey color of the sweater-vest and the blue color of the shirt.
None of these colors are in the gamut of additive mixture of red and green. The image borders
are not strictly necessary but they help to improve the effect. In the left image, I selected a green
that is suggestive of the perceived color of the tray. In the right image, I selected a substitute
white, sampled from the bright stripes in the sweater-vest.
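As a rough sketch of the general construction idea (an illustrative assumption, not the exact procedure used for Figure 5.3 or in [172]), an interlaced duo-chrome image can be built by alternating scanlines that carry only a grey (luminance) signal with scanlines that carry only a red signal:

import numpy as np
from PIL import Image

# "original.jpg" is a placeholder for any RGB photograph.
rgb = np.asarray(Image.open("original.jpg").convert("RGB"), dtype=np.float32) / 255.0
luma = rgb @ np.array([0.299, 0.587, 0.114], dtype=np.float32)   # per-pixel luminance

out = np.zeros_like(rgb)
out[0::2, :, :] = luma[0::2, :, None]   # even scanlines: grey (luminance only)
out[1::2, :, 0] = luma[1::2, :]         # odd scanlines: luminance shown as pure red

Image.fromarray((out * 255).astype(np.uint8)).save("retinex_interlaced.png")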
In previous work [172], I have conducted a distributed user-study4, to test whether subjective
colors could be induced reliably on different monitors and under different illumination conditions. The results indicated that this was possible for a variety of monitors and illumination
4 Similar to those suggested in Section 6.2.
conditions; that participants perceived colors clearly outside the gamut of additive mixture; but
that some colors were not identified uniquely.
Applications.
Retinex theory, even though heavily criticized when Land first presented
it, has regained some interest in the research community and is used increasingly in image
enhancement applications [54, 6], including images taken by NASA5 [133, 178] during orbital
and space missions. Along with the flicker color illusion, discussed above, applied Retinex
theory is one of the prime examples of vision shortcuts. In fact, because these two examples
address visual processes so early on in the HVS, it could be possible to do away with an external
display device altogether. Once artificial retinas become a reality, Retinex theory and flicker
color could be used to encode color for direct neural stimulation.
5.1.3. Anomalous Motion
The Phenomenon. Anomalous motion is an example of a more indirect method of triggering
a sensation generally associated with a different stimulus. When viewing images like those in
Figure 5.5 at a suitable magnification level, the motion texture elements, which I call motiels,
appear to be moving. Akiyoshi Kitaoka has created many different types of anomalous motion
illusions6 and published several papers and books on the phenomenon [92, 91]. Despite Kitaoka’s and other research efforts, there exist many more types of anomalous motion designs
than theories explaining their perceptual mechanisms. In a sense, these illusions are great examples of Zeki’s observation about artists acting as neurologists7 (Section 1.2.3). There exist
5 http://dragon.larc.nasa.gov.
6 A large number of Kitaoka’s designs are available at http://www.ritsumei.ac.jp/~akitaoka/index-e.html.
7 This is not to imply that A. Kitaoka’s scientific prowess is in any way inferior to his artistic talents, but rather that less scholarly individuals throughout the Internet have found it possible to adopt and modify his original designs.
Figure 5.5. Anomalous Motion. Top row: Two anomalous motion designs
using the same motiel shape, but different color schemes. Bottom Row: The
Rotating Snakes illusion, after A. Kitaoka. Changing the viewing distance or
zooming in/out affects the magnitude of the effect. Try viewing only one image
at a time.
various rules of thumb to create anomalous motion designs. Most designs require a repeated
texture element (motiel) with the following characteristics: One side of the shape is brighter
than the center, while the opposing side is darker than the center. The brightness of the center
region should not be too different from the background. The shape of the motiel can be varied.
Most observers perceive motion in the light-to-dark direction of the motiel. The size of the
motiel has a significant effect on the magnitude of the illusion. These and additional rules help
in designing anomalous motion illusions, but they do not explain them. However, parametrization of these rules combined with computer graphics visualization may help us to learn more
about the extent to which these rules apply and when they break down. This, in turn, is likely to
increase our understanding of the illusions, and may lead to perceptual models explaining them
in more detail, again reiterating the leitmotif of my dissertation.
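A minimal sketch of such a parametrization, following the rules of thumb above, is given below: each tiled element has a bright flank and a dark flank around a center that stays close to the mid-grey background. Element shape, size, spacing, and the exact grey levels are free parameters of the kind such a study would vary.

import numpy as np
import matplotlib.pyplot as plt

tile = 32
motiel = np.full((tile, tile), 0.5, dtype=np.float32)   # background and center grey
yy, xx = np.mgrid[0:tile, 0:tile]
inside = np.hypot(yy - tile / 2, xx - tile / 2) < tile * 0.35   # circular element

motiel[inside & (xx < tile * 0.40)] = 1.0   # bright flank (left)
motiel[inside & (xx > tile * 0.60)] = 0.0   # dark flank (right)
# the interior between the flanks stays close to the background grey

texture = np.tile(motiel, (12, 12))         # repeat the element over the image
# most observers should see drift in the light-to-dark direction of the elements
plt.imshow(texture, cmap="gray", vmin=0, vmax=1)
plt.axis("off")
plt.show()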
Applications.
Possible uses of these illusions, in addition to their entertainment factor
and scientific interest, could include indication of motion in print media, motion visualization
in static displays, and velocity indication of slow moving objects. These applications are similar
to those Freeman et al. [53] proposed, although their system required a short animation sequence
not realizable on truly static media.
5.1.4. Deleted Contours
The Phenomenon.
As noted in Section 4.2, the HVS is equipped with a fair amount of redun-
dancy to increase robustness of visual tasks and to deal with underconstrained visual situations.
Figure 5.6 illustrates another facet of this concept. The cube example shows that the straight
edges connecting the corners of the cube do not add much more visually useful information
to the image. Their presence can be inferred from the termination points of the corners. The
exact mechanism by which humans are able to automatically complete such missing contours
is not fully understood, but Hoffman [75] composed a set of rules that are viable candidates
for visual hypothesis testing, as introduced in Section 1.2.2. While Koenderink [94, 97] investigated the geometric properties of contours that allow shape recovery, Biederman and others [10, 9, 112, 113] demonstrated via user-studies which types of contour deletions the HVS
could recover. The scissor example in Figure 5.6 shows that, as mentioned in Section 2.2, not all
visual information is of equal importance. While some contour deletions are easily recovered,
others are not. Interestingly, adding arbitrary plausible masking shapes to the unrecoverable
scissor image re-enables recognition.
I believe deleted contours are an excellent example of the minimal graphics described by
Herman et al. [69], which I mentioned in Section 3.5.9. If we do not require a complete contour
description to obtain shape, then how much do we need, and what? Junctions (corners and intersections) are good candidates for a necessity requirement, but we need additional information
to discern curved features (e.g. a circle). Hoffman’s minima rule (referring to an extremum in
curvature) and other shape parsing rules [74, 148] could help in that respect.
Applications.
In terms of applications of this phenomenon, it is surprising that (to the best
of my knowledge) no adaptive rendering technique (Section 2.4.2) makes use of the fact that
some parts of an outline are more salient than others. Given that most rendering systems have
3-D information readily available and could easily compute contour saliency, this is an avenue
worth investigating; not only to speed up rendering and hide artifacts, but to actively increase
the visual clarity of a rendered image.
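As a sketch of what such contour-saliency pruning might look like in 2-D (the curvature measure and thresholds are illustrative assumptions, not a method proposed in this dissertation), the snippet below keeps only the samples of a closed contour that lie near high-curvature points and deletes the low-curvature runs in between:

import numpy as np

def prune_contour(points, angle_thresh=0.4, keep_radius=4):
    """Keep only samples of a closed 2-D contour near high-curvature points."""
    prev = np.roll(points, 1, axis=0)
    nxt = np.roll(points, -1, axis=0)
    v1, v2 = points - prev, nxt - points
    cross = v1[:, 0] * v2[:, 1] - v1[:, 1] * v2[:, 0]
    dot = (v1 * v2).sum(axis=1)
    turning = np.abs(np.arctan2(cross, dot))       # turning angle as a saliency measure
    keep = np.zeros(len(points), dtype=bool)
    for c in np.flatnonzero(turning > angle_thresh):
        keep[np.arange(c - keep_radius, c + keep_radius + 1) % len(points)] = True
    return points[keep]

# Example: a densely sampled square; only samples near its four corners survive.
n = 100
e = np.linspace(0, 1, n, endpoint=False)
square = np.concatenate([
    np.stack([e, np.zeros(n)], axis=1),        # bottom edge
    np.stack([np.ones(n), e], axis=1),         # right edge
    np.stack([1 - e, np.ones(n)], axis=1),     # top edge
    np.stack([np.zeros(n), 1 - e], axis=1),    # left edge
])
print(len(square), "->", len(prune_contour(square)), "samples kept")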
Figure 5.6. Deleted Contours. Cube: To perceive a cube, it is not necessary to
fully depict it because the HVS fills in missing information automatically. Scissors: Not all information is equally valuable. Strategically placed information
in the recoverable version facilitates the recognition task. Both the recoverable and non-recoverable versions contain about the same length of total contour, but redundant and coincidental information in the non-recoverable version makes
identification very difficult. Adding masking cues to the non-recoverable version
(recoverable again) disambiguates coincidental information, leading to renewed
recovery.— {Scissor example after Biederman [10].}
5.2. Discussion
The examples of vision shortcuts I give in Section 5.1 are all more or less ad hoc and there
exists no unified framework that ties them all together. Some of their possible applications may
sound fantastical – for now. Too little is still known about the perceptual mechanisms whereby
vision shortcuts operate. NPR systems in conjunction with perceptual studies might bridge this
knowledge gap some day.
The great potential I see in vision shortcuts is as a continuation of what art (and NPR) has
already achieved: a divorce of function from form. This separation allows for greater freedom
in the design of images and more direct targeting of visual information. Art is not bound by the
requirement to simulate optical processes in order to convey a message, and more often than not
this helps the visual communication purpose of an image. Vision shortcuts take the separation
of function and form one step further. By almost directly addressing HVS processes (as in
the Flicker color examples) function (perception of color) can be targeted completely without
form. Note that form, here, does not refer to a generic medium through which function may
be applied, but only to the natural medium associated with the function. In the case of color,
the natural medium is light of different wavelengths. Flicker color replaces this medium with a
series of pulses that objectively are nothing more than intermittent light signals, but in the HVS
these become perceptions of color. The divorce of color from wavelengths may enable us to
create revolutionary new display devices and techniques.
I do concede that we are still a long way from incorporating vision shortcuts into standard
rendering pipelines, but I hope that research at the interface between NPR and Perception, as
advocated in my dissertation, will bring us closer to that goal.
CHAPTER 6
Conclusion
In the beginning of this dissertation, I argue that the connection between non-realistic depiction and human perception is a valuable tool to improve the visual communication potential
of computer-generated images, and conversely, to learn more about human perception of such
images.
My perception-centric approach to non-realistic depiction differs from most previous NPR
work in that I am not interested in merely replicating an artistic style1, but that I focus on the
perceptual motivation for using non-realistic imagery.
In Chapter 3, I show how a perceptually inspired image processing framework can create
images that are effective for visual communication tasks. These images also resemble cartoons;
not primarily because my motivation was to imitate a cartoon-style, but because in the design
of the framework I used the same perceptual principles that make cartoons highly effective for
visual communication purposes. This subtle difference has important consequences: although
the resulting images of my framework may resemble those of previous works, my perceptually
inspired framework is faster than previous systems, more temporally coherent, and implicitly
generates certain visual effects (indication and motion lines) that other NPR cartooning systems have to program explicitly. The appearance of these complementary effects is likely linked
to the fact that, although the effects are commonly considered merely stylistic, they actually
1 This is not to say that I argue against purely artistic use of NPR. There definitely is merit in such use for creative expression and aesthetic purposes. For this reason, I included the various stylistic parameters introduced in Chapter 3.
have roots in perceptual mechanisms and physiological structures [89, 52]. This demonstrates
some of the benefits of re-examining non-realistic graphics and artistic styles in the light of
perceptual motivations. Not only can this approach teach us about art and perception of art, but
it can provide insights to leverage the perceptual principles that make art so effective for visual
communication. We can then use that knowledge to improve computer graphics, realistic and
non-realistic alike.
Similarly, we can leverage existing non-realistic imaging techniques and the immense processing power of graphics hardware to perform perceptual studies more relevant to interactive
computer graphics applications (Chapter 4) than the impoverished studies that are traditionally performed. The knowledge gained from such studies is not only valuable for non-realistic
graphics, but is likely to transfer to improving realistic computer graphics, as well.
6.1. Conclusions drawn from Real-time Video Abstraction Chapter
In Chapter 3, I have presented a simple and effective real-time framework that abstracts
images while retaining much of their perceptually important information, as demonstrated by
two user studies (Section 3.4).
In addition to the contribution of the actual framework, I can draw several high-level conclusions from Chapter 3. While not all of these conclusions are necessarily novel, they are in
my opinion particularly well reflected in the framework’s design and implementation.
6.1.1. Contrast
All of the important processing steps in the framework are based on contrasts, not absolute
values, continuing a recently developing trend in the graphics community towards differential
(change-based) models and algorithms [123, 156, 59]. Particularly, the automatic version of
the non-linear diffusion approximation in Section 3.3.2 uses the given contrast in an image
to change said contrast, forming a closed-loop, implicit algorithm. I believe that differential
methods will play an increasingly important role in future systems, particularly in those based
on perceptual principles.
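To make the notion of contrast-driven processing concrete, the following is a minimal (and deliberately brute-force) single-channel bilateral filter: its range weight depends on local contrast, so low-contrast regions are flattened while high-contrast edges survive, and iterating it behaves like a non-linear diffusion. The real-time, separable approximation actually used is described in Section 3.3.2; parameter values here are illustrative.

import numpy as np

def bilateral(img, sigma_s=3.0, sigma_r=0.1, radius=6):
    """Brute-force bilateral filter for a 2-D array of values in [0, 1]."""
    h, w = img.shape
    out = np.empty_like(img)
    yy, xx = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(xx**2 + yy**2) / (2 * sigma_s**2))   # fixed spatial kernel
    pad = np.pad(img, radius, mode="edge")
    for y in range(h):
        for x in range(w):
            patch = pad[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            rng = np.exp(-((patch - img[y, x]) ** 2) / (2 * sigma_r**2))  # contrast term
            weights = spatial * rng
            out[y, x] = (weights * patch).sum() / weights.sum()
    return out

# A few iterations flatten low-contrast regions while preserving strong edges:
# smoothed = bilateral(bilateral(bilateral(luminance)))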
6.1.2. Soft Quantization
Temporal coherence has been a major problem for many animated stylization systems from the
very beginning [101]. There are several reasons for this. Stylization is often an arbitrary external
addition to an image (e.g. waviness or randomness in line-drawings of exact 3-D models) and
should therefore be controlled via a temporally smooth function (e.g. Perlin noise [124] or
wavelet noise [29]). Another problem that is more difficult to address is that of quantization.
Many existing stylization systems force discontinuous quantizations, particularly to derive an
explicit image representation [36, 161, 26]. My approach is different. I want to aid the human
visual system to increase efficiency for certain visual tasks, but I do not endeavor to perform
the visual task for the user, who is much more capable than any system that I can devise. In
terms of quantization this means that I will not force a quantization if I cannot be relatively sure
to make the correct decision (e.g. whether a pixel belongs to one object or another). Instead, I
perform a quasi-quantization, or soft-quantization, which suggests rather than enforces, and let
the observers mentally complete the picture for themselves. This principle is used effectively
in the color-quantization in Section 3.3.5 and the edge detection in Section 3.3.3. In essence, it
can often be more useful to give a good partial solution than an erroneous full solution which needs to be corrected with the help of the user [161, 26].
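One way to realize such a soft quantization (a sketch in the spirit of Section 3.3.5, not a verbatim reproduction of it) is to pull each luminance value towards its nearest quantization level with a smooth tanh ramp, so that bin transitions are suggested rather than enforced and remain stable under small temporal changes:

import numpy as np

def soft_quantize(lum, n_bins=8, sharpness=10.0):
    """Softly quantize luminance values in [0, 1] into n_bins levels."""
    lum = np.clip(lum, 0.0, 1.0 - 1e-6)
    width = 1.0 / n_bins
    nearest = (np.floor(lum / width) + 0.5) * width           # nearest bin center
    return nearest + (width / 2.0) * np.tanh(sharpness * (lum - nearest))

ramp = np.linspace(0.0, 1.0, 11)
print(np.round(soft_quantize(ramp), 3))   # smoothly staircased, not hard-stepped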
6.1.3. Gaussians, Art, and Perception
Gaussian filters or Gaussian-like convolutions appear repeatedly in Chapter 3, from scale-space
theory and edge-detection to diffusion-approximations and motion-blur. Why should this be
so? My personal explanation is related to the receptive-field concept, which is illustrated in
Figure 3.11 and to which Zeki refers as “[...] one of the most important concepts to emerge
from sensory physiology in the past fifty years.” ([189], pp. 88). The receptive field of a cell
connects the cell with its physical or logical neighbors2 and thus performs information integration over larger and larger areas, and eventually the entire visual field. In essence, the output
of a cortical cell often depends on the weighted input of its own trigger mechanism and that of its neighbors, not entirely unlike the convolution of a Gaussian-like kernel with an image. This
connection has been shown physiologically [183] before, but I believe it to be important for two
additional reasons.
First, using Gaussian-like convolutions allows for very efficient, parallel, and implicit information processing frameworks, which become akin to neural net implementations when
processed iteratively. It might thus be interesting to look at neural nets and related artificial
intelligence applications in terms of image processing operations that can leverage the parallel
processing power of modern GPUs for high-performance computations.
Second, many of the Gaussian-based image processing operations I have used show interesting connections to well-known artistic techniques and principles. There are obvious connections3 like the one between DoG edges and line-drawings, but there are also less obvious
connections, like the indication and motion lines of Section 3.5.7. Another effect (not shown in detail) is that recursively bilaterally filtered images often tend to look like water-color paintings (in which non-linear color diffusion through canvas and water plays a pivotal role).
2 This corresponds to the topological (physical) versus feature-based (logical) mappings found to connect the different visual cortical areas [188].
3 In terms of relatedness, not causality.
In short, Gaussian-like filters seem to play heavily into both perception and art, and given the fact that various artistic techniques have been traced back to cortical structures and low-level perception [66, 17, 89, 52, 189], it might be worthwhile to attempt an explanation and
parametrization of art in terms of a perceptually-based computational information-integration
model using Gaussian-like functions.
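As a compact illustration of the Gaussian-based operations referred to in this section, the following sketch computes a difference-of-Gaussians (DoG) edge response over a luminance image and applies a soft threshold; it is in the spirit of the DoG edges of Section 3.3.3 and the soft quantization of Section 6.1.2, with illustrative parameter values.

import numpy as np
from scipy.ndimage import gaussian_filter

def dog_edges(lum, sigma=1.0, k=1.6, tau=0.98, phi=5.0):
    """Difference-of-Gaussians edge response with a soft threshold."""
    center = gaussian_filter(lum, sigma)
    surround = gaussian_filter(lum, sigma * k)
    d = center - tau * surround
    # 1 (white) where the response is positive, smooth tanh fall-off towards 0 (dark line)
    return np.where(d > 0, 1.0, 1.0 + np.tanh(phi * d))

# edges = dog_edges(luminance)   # dark lines appear where contrast changes sharply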
6.2. Conclusions drawn from Shape-from-X Chapter
In Chapter 4, I have presented an experiment to study shape perception of moving objects
using non-realistic imagery. My main contributions in designing the experiment are the use of
simple, non-realistic visual stimuli to separate shape cues onto orthogonal perceptual shape-from-X axes, and to display these cues in a highly dynamic environment for a time-critical,
game-like task. The experiment therefore demonstrates well the contribution that NPR can
make to perceptual studies, as well as a methodology that NPR researchers can use to validate
and improve their systems.
One of the most interesting results of the user study I performed using my experimental design is that shape cues in a time-constrained condition need not interact constructively. In fact,
my results indicate that the opposite may be the case. When multiple shape cues are present,
they can conflict and impede each other. Apart from the other results discussed in Section 4.7,
the most important conclusion to be drawn from Chapter 4 is that the experimental framework
fulfilled all the design goals put forth in Section 4.1.1, and that this allows for numerous additional studies to be performed and evaluated with my experimental design. One benefit of
the design is that setting up new studies can be as simple as generating new sets of shapes to
be tested, or varying the texture parametrization. The following sections describe some of the
possible dynamic shape studies that my experiment supports, and which might yield valuable
perceptual insights to improve existing rendering systems and to develop new display algorithms for interactive applications.
6.2.1. Contours
The different types of contours (silhouettes, outlines, ridges, valleys, creases, etc.) illustrated in
Figure 4.1, Contours, might be investigated.
Of particular interest would be the evaluation of the perceptually motivated suggestive
contours [35]. I actually included DeCarlo et al.’s suggestive contour code in the initial car-experiment, but ended up not using it because the coarse real-time models of my setup did not
provide enough geometric detail for the suggestive contour method to work properly, and higher
resolution models prohibited real-time rendering. The same limitations apply to the geon shapes
of my current experiment. Because studying the effectiveness of suggestive contours requires a
more complex shape set and higher resolution models, some obstacles will have to be overcome
to enable real-time rendering performance. An exciting development in that regard is the new
feature set of the latest generation of graphics cards, which allows for geometry generation in
GPU code. This, together with instancing, might enable real-time suggestive contour rendering
of multiple complex, high-resolution models.
6.2.2. Textures
Apart from the simple sphere-mapping used in my experiment, a number of other shape parameterizations can be used to map texture onto objects [77, 58, 90, 153]. The type of texture can
also be varied. My experiment uses a random-design texture comprised of structure at a variety
of spatial scales. Most perception studies that focus solely on texture use sinusoidal gratings
of a well-defined frequency and amplitude. It will be interesting to study how these textures
perform in a dynamic experiment.
6.2.3. Shapes
I varied attachment shape along two categorical axes but there are other shape categories to
explore. Some of my results (Section 4.7.3) suggest that different shape cues might be more
effective for certain types of shapes than for others, so it will be interesting to perform additional
studies that match shape classes with optimal shape cues or shape cue combinations.
6.2.4. Dynamics
Another result of my experiment indicates that shape perception in dynamic environments may
abide by different rules than perception in static environments. In line with recent research
on motion detection of variously colored stimuli [55, 168, 104], it might be interesting to see
what exactly constitutes a static versus a dynamic environment. What kind of translational and
rotational velocities can be considered dynamic or even highly dynamic? How fast can the
different shape-from-X mechanisms of the human visual system reliably detect subtle shape
differences?
I explained in Section 4.4.2, Motion (pg. 124), that shape-from-Motion in isolation is difficult to study because for motion to be perceived, something has to move. Experiments on the
perception of biological motion have shown that motion can indeed be divorced from form [82].
Coherently moving dots in a random dot display can be perceived to represent the motion of
various rigid bodies or even complex biological entities. The dots are totally devoid of shape
when stationary, but become part of a moving form when animated (similar to the effect of the
complex background textures in my experiment). The resolution of sparsely distributed dots on
a display is obviously too limited to resolve the subtle differences between the shape categories
tested in my study, but it will be interesting to devise a modified version of my experiment that
can help to separate the effect of shape-from-Motion from the contribution of the other shape
cues in a dynamic environment.
6.2.5. Games and Distributed Studies
Finally, I am encouraged by the positive user-feedback from the experiment. Perceptual studies
often employ repetitive, time-consuming, and tedious tasks to obtain accurate data. Such tasks
can negatively impact concentration and performance levels. I found in my experiment that
participants generally enjoyed the interaction because it was simple and engaging. The task
also triggered competitive behavior in participants who wanted to do well and shoot as many
correct targets as possible. I see game-like interaction tasks for perceptual studies as a way of
obtaining data that is more relevant to real-life activities than in most traditional, reductionist
experiments. Of course, there are problems and pitfalls, as I have found out with the car-experiment, but a careful experimental design can minimize those problems. I am interested
to see how popular game paradigms, like racing games, first-person shooters, or third-person
obstacle games can be modified to yield perceptually valuable and scientifically sound data.
One big advantage of such an approach, in addition to its immediate applicability to interactive
graphics, game design, and perception research, is the large base of volunteer gamers who could
download the experiment/game and would generate valuable data just by playing. The obvious
problems of limited control over the environmental conditions during the experimental trials
would have to be weighed against the benefits of fast and copious data-gathering possible in a
distributed, autonomous experiment.
6.3. Summary
The graphics community at large has acquired much knowledge about the design and performance of rendering algorithms as well as interactive and even immersive applications. Yet,
very little is known about the perceptual effects of these algorithms and applications on human
task performance. It is my hope that in the future we will harness more of the advanced rendering systems and processing power that computer graphics has to offer, to perform perceptual
studies that would otherwise not be possible. In return, the insights gained from such perceptual
studies can flow right back into designing graphical systems that are not only fast and photorealistic, but that provide verifiably effective visual stimuli for the human tasks they are intended
to support.
References
[1] Nur Arad and Craig Gotsman. Enhancement by image-dependent warping. IEEE Trans.
on Image Processing, 8(9):1063–1074, 1999. 72, 73
[2] James Arvo, Kenneth Torrance, and Brian Smits. A framework for the analysis of error
in global illumination algorithms. In SIGGRAPH ’94: Proceedings of the 21st annual
conference on Computer graphics and interactive techniques, pages 75–84, New York,
NY, USA, 1994. ACM Press. 39
[3] Alethea Bair, Donald House, and Colin Ware. Perceptually optimizing textures for layered surfaces. In APGV ’05: Proceedings of the 2nd symposium on Applied perception in
graphics and visualization, pages 67–74, New York, NY, USA, 2005. ACM Press. 118
[4] Danny Barash and Dorin Comaniciu. A common framework for nonlinear diffusion,
adaptive smoothing, bilateral filtering and mean shift. Image and Video Computing,
22(1):73–81, 2004. 58, 90
[5] Woodrow Barfield, James Sandford, and James Foley. The mental rotation and perceived
realism of computer-generated three-dimensional images. Intl. J. Man-Machine Studies,
29:669–684, 1988. 112, 115, 118
[6] H.G. Barrow and J.M. Tenenbaum. Line drawings as three-dimensional surfaces. Artificial Intelligence, 17:75–116, 1981. 166
[7] C.E. Benham. The artificial spectrum top. Nature (London), 51:200, 1894. 161
[8] I. Biederman and M. Bar. One-shot viewpoint invariance in matching novel objects. Vision Research, 39(17):2885–2899, 1999. 113, 118
[9] I. Biederman and E. E. Cooper. Priming contour-deleted images: evidence for intermediate representations in visual object recognition. Cognitve Psychology, 23(3):393–419,
1991. 169
[10] Irving Biederman. Recognition-by-components: A theory of human image understanding. Psychological Review, 94(2):115–147, 1987. 114, 116, 117, 118, 131, 169, 170
[11] Irving Biederman and Peter C. Gerhardstein. Recognizing depth-rotated objects: Evidence and conditions for three-dimensional viewpoint invariance. Experimental Psychology, 19(6):1162–1182, 1993. 115, 117, 118
[12] T. O. Binford. Generalized cylinders representation. In S. C. Shapiro, editor, Encyclopedia of Artificial Intelligence, pages 321–323, New York, 1987. John Wiley & Sons. 118,
131
[13] Mark R. Bolin and Gary W. Meyer. A perceptually based adaptive sampling algorithm.
In SIGGRAPH ’98: Proceedings of the 25th annual conference on Computer graphics
and interactive techniques, pages 299–309, New York, NY, USA, 1998. ACM Press. 40
[14] R. Van den Boomgaard and J. Van de Weijer. On the equivalence of local-mode finding,
robust estimation and mean-shift analysis as used in early vision tasks. 16th Internat.
Conf. on Pattern Recog., 3:927–930, 2002. 90
[15] Philippe Bordes and Philippe Guillotel. Perceptually adapted MPEG video encoding.
Human Vision and Electronic Imaging V, 3959(1):168–175, 2000. 37
[16] D. J. Bremer and J. F. Hughes. Rapid approximate silhouette rendering of implicit surfaces. Implicit Surfaces ’98, pages 155–164, 1998. 114
[17] S. E. Brennan. Caricature generator: The dynamic exaggeration of faces by computer.
Leonardo, 18(3):170–178, 1985. 53, 176
[18] H. H. Bülthoff and S. Edelman. Psychophysical Support for a Two-Dimensional View
Interpolation Theory of Object Recognition. Proc. of the Natl. Ac. of Sciences, 89(1):60–
64, 1992. 118
[19] H. H. Bülthoff and H. A. Mallot. Integration of stereo, shading and texture. In A. Blake
and T. Troscianko, editors, AI and the Eye, pages 119–146. Wiley, London, UK, 1990.
150
[20] Michael Burns, Janek Klawe, Szymon Rusinkiewicz, Adam Finkelstein, and Doug DeCarlo. Line drawings from volume data. ACM Trans. Graph., 24(3):512–518, 2005. 114
[21] C. Von Campenhausen and J. Schramme. 100 years of Benham’s top in colour science.
Perception, 24(6):695–717, 1995. 162
183
[22] J. F. Canny. A computational approach to edge detection. IEEE Trans. on Pattern Analysis and Machine Intelligence, 8:769–798, 1986. 69, 72
[23] K. Cater, A. Chalmers, and G. Ward. Detail to attention: exploiting visual tasks for
selective rendering. In EGRW ’03: Proceedings of the 14th Eurographics workshop on
Rendering, pages 270–280, Aire-la-Ville, Switzerland, Switzerland, 2003. Eurographics
Association. 41
[24] Stephen Chenney, Mark Pingel, Rob Iverson, and Marcin Szymanski. Simulating cartoon
style animation. In NPAR ’02: Proceedings of the 2nd international symposium on Nonphotorealistic animation and rendering, pages 133–138, New York, NY, USA, 2002.
ACM Press. 18
[25] Johan Claes, Fabian Di Fiore, Gert Vansichem, and Frank Van Reeth. Fast 3D cartoon
rendering with improved quality by exploiting graphics hardware. In Proceedings of Image and Vision Computing New Zealand (IVCNZ) 2001, pages 13–18. IVCNZ, November
2001. 19
[26] John P. Collomosse, David Rowntree, and Peter M. Hall. Stroke surfaces: Temporally
coherent artistic animations from video. IEEE Trans. on Visualization and Computer
Graphics, 11(5):540–549, 2005. 48, 88, 90, 91, 92, 93, 97, 174
[27] Dorin Comaniciu and Peter Meer. Mean shift analysis and applications. In ICCV ’99:
Proceedings of the Int. Conference on Computer Vision-Volume 2, page 1197, Washington, DC, USA, 1999. IEEE Computer Society. 90
[28] Robert L. Cook, Loren Carpenter, and Edwin Catmull. The Reyes image rendering architecture. In SIGGRAPH ’87: Proceedings of the 14th annual conference on Computer
graphics and interactive techniques, pages 95–102, New York, NY, USA, 1987. ACM
Press. 15
[29] Robert L. Cook and Tony DeRose. Wavelet noise. ACM Trans. Graph., 24(3):803–811,
2005. 174
[30] Lynn A. Cooper. Mental rotation of random two-dimensional shapes. Cognitive Psychology, 7:20–43, 1975. 115
[31] B. Cumming, E. Johnston, and A. Parker. Effects of different texture cues on curved
surfaces viewed stereoscopically. Vision Research, 33(5-6):827–838, 1993. 110
[32] Cassidy J. Curtis, Sean E. Anderson, Joshua E. Seims, Kurt W. Fleischer, and David H.
Salesin. Computer-generated watercolor. Proceedings of SIGGRAPH 97, pages 421–430,
August 1997. 18
184
[33] S. J. Daly. Visible differences predictor: an algorithm for the assessment of image fidelity.
Proc. SPIE, 1666:2–15, 1992. 34, 41
[34] Richard Dawkins. Climbing Mount Improbable. W. W. Norton & Company, 1997. 23
[35] Doug DeCarlo, Adam Finkelstein, and Szymon Rusinkiewicz. Interactive rendering of
suggestive contours with temporal coherence. In NPAR ’04, pages 15–24, New York,
NY, USA, 2004. ACM Press. 19, 110, 114, 177
[36] Doug DeCarlo and Anthony Santella. Stylization and abstraction of photographs. ACM
Trans. Graph., 21(3):769–776, 2002. 19, 33, 48, 49, 62, 63, 72, 91, 95, 174
[37] Michael F. Deering. A photon accurate model of the human eye. ACM Trans. Graph.,
24(3):649–658, 2005. 15, 17
[38] Oliver Deussen and Thomas Strothotte. Computer-generated pen-and-ink illustration of
trees. Proceedings of SIGGRAPH 2000, pages 13–18, July 2000. 19
[39] J. Duncan. Selective attention and the organization of visual information. Journal of experimental psychology. General., 113(4):501–517, December 1984. 105, 130
[40] Frédo Durand. An invitation to discuss computer depiction. In NPAR ’02: Proceedings of
the 2nd international symposium on Non-photorealistic animation and rendering, pages
111–124, New York, NY, USA, 2002. ACM Press. 16, 20, 22
[41] David Ebert and Penny Rheingans. Volume illustration: non-photorealistic rendering of
volume models. In VIS ’00: Proceedings of the conference on Visualization ’00, pages
195–202, Los Alamitos, CA, USA, 2000. IEEE Computer Society Press. 114
[42] James H. Elder. Are edges incomplete? Internat. Journal of Computer Vision, 34(23):97–122, 1999. 88
[43] L. C. Evans. Partial Differential Equations. American Mathematical Society, Providence,
1998. 56
[44] G. T. Fechner. Über eine Scheibe zur Erzeugung subjectiver Farben. Annalen der Physik
und Chemie. Verlag von Johann Ambrosius Barth, Leipzig, pages 227–232, 1838. 161
[45] G. T. Fechner. Elemente der Psychophysik, volume 2. Breitkopf und Haertel, Leipzig,
1860. 52
[46] James A. Ferwerda, Peter Shirley, Sumanta N. Pattanaik, and Donald P. Greenberg. A
model of visual masking for computer graphics. In SIGGRAPH ’97: Proceedings of the
185
24th annual conference on Computer graphics and interactive techniques, pages 143–
152, New York, NY, USA, 1997. ACM Press/Addison-Wesley Publishing Co. 33, 34,
40
[47] James A. Ferwerda, Stephen H. Westin, Randall C. Smith, and Richard Pawlicki. Effects
of rendering on shape perception in automobile design. In APGV ’04: Proceedings of
the 1st Symposium on Applied perception in graphics and visualization, pages 107–114,
New York, NY, USA, 2004. ACM Press. 113, 117
[48] L. Festinger, M. R. Allyn, and C. W. White. The perception of color with achromatic
stimulation. Vision Res., 11(6):591–612, 1971. 162
[49] J. Fischer, D. Bartz, and W. Straßer. Stylized Augmented Reality for Improved Immersion. In Proc. of IEEE VR, pages 195–202, 2005. 48, 65, 69, 71, 72
[50] Alan Fogel and Thomas E. Hannan. Manual actions of nine- to fifteen-week-old human infants during face-to-face interaction with their mothers. Child Development,
56(5):1271–1279, Oct. 1985. 128
[51] Mark D. Folk and R. Duncan Luce. Effects of stimulus complexity on mental rotation rate
of polygons. Experimental Psychology: Human Perception and Performance, 13(3):395–
404, 1987. 115
[52] Gregory Francis and Hyungjun Kim. Motion parallel to line orientation: Disambiguation
of motion percepts. Perception, 28:1243–1255, 1999. 96, 173, 176
[53] William T. Freeman, Edward H. Adelson, and David J. Heeger. Motion without movement. In SIGGRAPH ’91: Proceedings of the 18th annual conference on Computer
graphics and interactive techniques, pages 27–30, New York, NY, USA, 1991. ACM
Press. 168
[54] B. Funt, K. Barnard, M. Brockington, and V. Cardei. Luminance based multi scale
retinex. In Proceedings AIC Colour 97 Kyoto 8th Congress of the International Colour
Association, volume 1, pages 330–333, May 1997. 166
[55] K. R. Gegenfurtner and M. J. Hawken. Interaction of motion and color in the visual
pathways. Trends Neuroscience, 19(9):394–401, 1996. 154, 178
[56] J. J. Gibson. The perception of the visible world. American Journal of Psychology,
63:367–384, 1950. 110
[57] J. J. Gibson. The Ecological Approach to Visual Perception. Lawrence Erlbaum Assoc.
Inc., 1987. 131
186
[58] Ahna Girshick, Victoria Interrante, Steven Haker, and Todd Lemoine. Line direction
matters: an argument for the use of principal directions in 3D line drawings. In NPAR
’00, pages 43–52, New York, NY, USA, 2000. ACM Press. 115, 178
[59] Amy Ashurst Gooch. Preserving Salience By Maintaining Perceptual Differences for
Image Creation and Manipulation. PhD thesis, Northwestern University, 2006. 52, 174
[60] Amy Ashurst Gooch and Peter Willemsen. Evaluating space perception in NPR immersive environments. In NPAR ’02, pages 105–110, New York, NY, USA, 2002. ACM
Press. 19, 114, 116, 117
[61] Bruce Gooch and Amy Ashurst Gooch. Non-Photorealistic Rendering. A. K. Peters,
2001. 18
[62] Bruce Gooch, Erik Reinhard, and Amy Gooch. Human facial illustrations: Creation and
psychophysical evaluation. ACM Trans. Graph., 23(1):27–44, 2004. 19, 25, 49, 53, 69,
81, 83, 98
[63] Cindy M. Goral, Kenneth E. Torrance, Donald P. Greenberg, and Bennett Battaile. Modeling the interaction of light between diffuse surfaces. In SIGGRAPH ’84: Proceedings
of the 11th annual conference on Computer graphics and interactive techniques, pages
213–222, New York, NY, USA, 1984. ACM Press. 15
[64] Ian E. Gordon. Theories of Visual Perception. Psychology Press, New York, 3rd edition,
Dec. 2004. 131
[65] R. L. Gregory. Eye and Brain - The Psychology of Seeing. Oxford University Press, 1994.
23, 51
[66] M. H. Hansen. Effects of discrimination training on stimulus generalization. Journal of
Experimental Psychology, 58:321–334, 1959. 53, 176
[67] J. W. Harris and H. Stocker. General cylinder. In Handbook of Mathematics and Computational Science, page 103, New York, 1998. Springer-Verlag. 4.6.1. 118, 131
[68] James Hays and Irfan Essa. Image and video based painterly animation. In NPAR ’04:
Proceedings of the 3rd international symposium on Non-photorealistic animation and
rendering, pages 113–120, New York, NY, USA, 2004. ACM Press. 19
[69] Ivan Herman and D. J. Duke. Minimal graphics. IEEE Computer Graphics and Applications, 21(6):18–21, 2001. 99, 169
187
[70] Aaron Hertzmann. Introduction to 3D non-photorealistic rendering: Silhouettes and outlines. In Non-Photorealistic Rendering (Siggraph ’99 Course Notes), 1999. 110, 114,
122
[71] Aaron Hertzmann. Paint by relaxation. In CGI ’01:Computer Graphics Internat. 2001,
pages 47–54, 2001. 62
[72] Aaron Hertzmann, Charles E. Jacobs, Nuria Oliver, Brian Curless, and David H. Salesin.
Image analogies. In SIGGRAPH ’01: Proceedings of the 28th annual conference on
Computer graphics and interactive techniques, pages 327–340, New York, NY, USA,
2001. ACM Press. 18
[73] Aaron Hertzmann and Ken Perlin. Painterly rendering for video and interaction. In NPAR
’00: Proceedings of the 1st international symposium on Non-photorealistic animation
and rendering, pages 7–12, New York, NY, USA, 2000. ACM Press. 19
[74] D. D. Hoffman and M. Singh. Salience of visual parts. Cognition, 63(1):29–78, 1997.
169
[75] Donald D. Hoffman. Visual Intelligence: How We Create What We See. W.W. Norton &
Company, NY, 2000. 23, 107, 169
[76] L. Hurvich. Color Vision. Sinauer Assoc., Sunderland, Mass., 1981. 163
[77] Victoria Interrante. Illustrating surface shape in volume data via principal directiondriven 3D line integral convolution. In SIGGRAPH ’97, pages 109–116, New York, NY,
USA, 1997. ACM Press/Addison-Wesley Publishing Co. 115, 178
[78] Laurent Itti and Christof Koch. Computational modeling of visual attention. Nature Reviews Neuroscience, 2(3):194–203, 2001. 33, 53, 62
[79] Laurent Itti, Christof Koch, and Ernst Niebur. A model of saliency-based visual attention
for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell., 20(11):1254–1259,
1998. 33, 53
[80] D. Jameson and L. M. Hurvich. Some quantitative aspects of an opponent-colors theory.
I. chromatic response and spectral saturation. II. brightness, saturation and hue in normal
and dichromatic vision. Journal of the Optical Society of America, 45(8):602–616, 1955.
163
[81] J. R. Jarvis. On Fechner-Benham subjective colour. Vision Res., 17(3):445–451, 1977.
162
188
[82] G. Johansson. Visual perception of biological motion and a model for its analysis. Perception and Psychophysics, 14(2):201–211, 1973. 179
[83] Alan Johnston and Peter J. Passmore. Shape from shading. I: Surface curvature and orientation. Perception, 23:169–189, 1994. 112
[84] Scott F. Johnston. Lumo: illumination for cel animation. In NPAR ’02: Proceedings of
the 2nd international symposium on Non-photorealistic animation and rendering, pages
45–52, New York, NY, USA, 2002. ACM Press. 19
[85] James T. Kajiya. The rendering equation. In SIGGRAPH ’86: Proceedings of the 13th
annual conference on Computer graphics and interactive techniques, pages 143–150,
New York, NY, USA, 1986. ACM Press. 15
[86] Robert D. Kalnins, Philip L. Davidson, Lee Markosian, and Adam Finkelstein. Coherent
stylized silhouettes. ACM Trans. Graph., 22(3):856–861, 2003. 19
[87] Nanda Kambhatla, Simon Haykin, and Robert D. Dony. Image compression using KLT,
wavelets and an adaptive mixture of principal components model. J. VLSI Signal Process.
Syst., 18(3):287–296, 1998. 37
[88] G. Kayaert, I. Biederman, and R. Vogels. Shape Tuning in Macaque Inferior Temporal
Cortex. Journal of Neuroscience, 23(7):3016–3027, 2003. 113, 118
[89] Hyungjun Kim and Gregory Francis. A computational and perceptual account of motion
lines. Perception, 27:785–797, 1998. 96, 173, 176
[90] Sunghee Kim, Haleh Hagh-Shenas, and Victoria Interrante. Conveying threedimensional shape with texture. In APGV ’04: Proceedings of the 1st Symposium on
Applied perception in graphics and visualization, pages 119–122, New York, NY, USA,
2004. ACM Press. 115, 118, 178
[91] Akiyoshi Kitaoka. Trick Eyes. Barnes & Noble Books, 2005. 166
[92] Akiyoshi Kitaoka and Hiroshi Ashida. Phenomenological characteristics of the peripheral drift illusion. Vision, 15(4):261–262, 2003. 166
[93] J. J. Koenderink. The structure of images. Biological Cybernetics, 50:363–370, 1984. 54
[94] J. J. Koenderink and A. J. Doorn. The internal representation of solid shape with respect
to vision. Biological Cybernetics, 32(4):211–216, 1979. 131, 169
189
[95] J. J. Koenderink, A. J. Van Doorn, and A. M. L. Kappers. Surface perception in pictures.
Perception and Psychophysics, 52(5):487–496, 1992. 108, 112, 115
[96] J. J. Koenderink, A. J. Van Doorn, and A. M. L. Kappers. Pictorial surface attitude and
local depth comparisons. Perception and Psychophysics, 58(2):163–173, 1996. 108, 112,
115
[97] Jan J. Koenderink. What does the occluding contour tell us about solid shape? Perception,
13:321–330, 1984. 110, 169
[98] Jan J. Koenderink and Andrea J. Van Doorn. Relief: Pictorial and otherwise. Image and
Vision Computing, pages 321–334, 1995. 115
[99] Adam Lake, Carl Marshall, Mark Harris, and Marc Blackstein. Stylized rendering techniques for scalable real-time 3D animation. In NPAR ’00: Proceedings of the 1st international symposium on Non-photorealistic animation and rendering, pages 13–20, New
York, NY, USA, 2000. ACM Press. 19
[100] Edwin H. Land. The retinex theory of color vision. Scientific American, 237(6):108–128,
1977. 164
[101] John Lansdown and Simon Schofield. Expressive rendering: A review of nonphotorealistic techniques. IEEE Comput. Graph. Appl., 15(3):29–37, 1995. 174
[102] Tony Lindeberg. Scale-Space Theory in Computer Vision. Kluwer, Netherlands, 1994. 54
[103] Joern Loviscach. Scharfzeichner: Klare Bilddetails durch Verformung. Computer Technik, 22:236–237, 1999. 74
[104] Zhong-Lin Lu, Luis A. Lesmes, and George Sperling. The mechanism of isoluminant
chromatic motion perception. Proc. Natl. Acad. Science USA, 96(14):8289–8294, 1999.
124, 154, 178
[105] R. Duncan Luce and Ward Edwards. The derivation of subjective scales from just noticeable differences. Psychol. Rev., 65(4):222–237, 1958. 52
[106] Rafał Mantiuk, Scott Daly, Karol Myszkowski, and Hans-Peter Seidel. Predicting visible
differences in high dynamic range images - model and its calibration. In Bernice E. Rogowitz, Thrasyvoulos N. Pappas, and Scott J. Daly, editors, Human Vision and Electronic
Imaging X, volume 5666, pages 204–214, 2005. 34
190
[107] Rafał Mantiuk, Grzegorz Krawczyk, Karol Myszkowski, and Hans-Peter Seidel.
Perception-motivated high dynamic range video encoding. ACM Trans. Graph.,
23(3):733–741, 2004. 39
[108] D. Marr. Vision. W. H. Freeman, San Francisco, 1982. 131
[109] D. Marr and E. C. Hildreth. Theory of edge detection. Proc. Royal Soc. London, Bio.
Sci., 207:187–217, 1980. 68, 71
[110] Barbara J. Meier. Painterly rendering for animation. Proceedings of SIGGRAPH 96,
pages 477–484, August 1996. 19
[111] Ross Messing and Frank H. Durgin. Distance perception and the visual horizon in headmounted displays. ACM Trans. Appl. Percept., 2(3):234–250, 2005. 116, 117
[112] A. S. Meyer, A. M. Sleiderink, and W. J. M. Levelt. Viewing and naming objects: Eye
movements during noun phrase production. Cognition, 66(2):25–33, 1998. 169
[113] C. Moore and P. Cavanagh. Recovery of 3D volume from 2-tone images of novel objects.
Cognition, 67(1):45–71, 1998. 169
[114] K. Moutoussis and S. Zeki. A direct demonstration of perceptual asynchrony in vision.
Proc. R. Soc. Lond. B Biol. Sci., 264(1380):393–399, 1997. 151
[115] K. T. Mullen and C. L. Baker Jr. A motion aftereffect from an isoluminant stimulus.
Vision Res., 25(5):685–688, 1985. 154
[116] Karol Myszkowski. Perception-based global illumination, rendering, and animation techniques. In SCCG ’02: Proceedings of the 18th spring conference on Computer graphics,
pages 13–24, New York, NY, USA, 2002. ACM Press. 41
[117] D. E. Nilsson and S. Pelger. A pessimistic estimate of the time required for an eye to
evolve. Proc. R. Soc. Lond. B Bio. Sci., 256(1345):53–58, 1994. 23
[118] J. F. Norman, J. T. Todd, and F. Phillips. The perception of surface orientation from
multiple sources of information. Perception and Psychophysics, 57(5):629–636, 1995.
115, 118
[119] Sven C. Olsen, Holger Winnemöller, and Bruce Gooch. Implementing real-time video
abstraction. In Proceedings of SIGGRAPH 2006 Sketches. 77
[120] Victor Ostromoukhov. Digital facial engraving. Proceedings of SIGGRAPH 99, pages
417–424, August 1999. 18
191
[121] Stephen E. Palmer. Vision Science: Photons to Phenomenology. The MIT Press, 1999.
51, 54, 67, 111
[122] Sumanta N. Pattanaik, Jack Tumblin, Hector Yee, and Donald P. Greenberg. Timedependent visual adaptation for fast realistic image display. In SIGGRAPH ’00: Proceedings of the 27th annual conference on Computer graphics and interactive techniques,
pages 47–54, New York, NY, USA, 2000. ACM Press/Addison-Wesley Publishing Co.
39
[123] Patrick Pérez, Michel Gangnet, and Andrew Blake. Poisson image editing. ACM Trans.
Graph., 22(3):313–318, 2003. 52, 174
[124] Ken Perlin. Improving noise. In SIGGRAPH ’02: Proceedings of the 29th annual conference on Computer graphics and interactive techniques, pages 681–682, New York, NY,
USA, 2002. ACM Press. 174
[125] Pietro Perona and Jitendra Malik. Scale-space and edge detection using anisotropic diffusion. IEEE Trans. on Pattern Analysis and Machine Intelligence, 12(7):629–639, 1991.
58, 61
[126] Tuan Q. Pham and Lucas J. Van Vliet. Separable bilateral filtering for fast video preprocessing. In IEEE Internat. Conf. on Multimedia & Expo, pages CD1–4, Amsterdam, July
2005. 58, 66, 88
[127] B. T. Phong. Illumination for computer generated pictures. Communications of the ACM,
18(6):311–317, 1975. 108
[128] Simon Plantinga and Gert Vegter. Contour generators of evolving implicit surfaces. In
SM ’03: Proceedings of the eighth ACM symposium on Solid modeling and applications,
pages 23–32, New York, NY, USA, 2003. ACM Press. 114
[129] Jodie M. Plumert, Joseph K. Kearney, James F. Cremer, and Kara Recker. Distance perception in real and virtual environments. ACM Trans. Appl. Percept., 2(3):216–233, 2005.
116, 117
[130] Claudio M. Privitera and Lawrence W. Stark. Algorithms for defining visual regions-ofinterest: Comparison with eye fixations. IEEE Trans. on Pattern Analysis and Machine
Intelligence, 22(9):970–982, 2000. 33, 62
[131] Thierry Pudet. Real time fitting of hand-sketched pressure brushstrokes. Eurographics
1994, 13(3):277–292, August 1994. 18
192
[132] Paul Rademacher, Jed Lengyel, Edward Cutrell, and Turner Whitted. Measuring the perception of visual realism in images. In Proceedings of the 12th Eurographics Workshop
on Rendering Techniques, pages 235–248, London, UK, 2001. Springer-Verlag. 113
[133] Z. Rahman, D. J. Jobson, G. A. Woodell, and G. D. Hines. Automated, on-board terrain
analysis for precision landings. In Visual Information Processing XIV, Proc. SPIE 6246,
2006. 166
[134] V. S. Ramachandran and R. L. Gregory. Does colour provide an input to human motion
perception? Nature, 275:55–56, Sep. 1978. 154
[135] V. S. Ramachandran and W. Hirstein. The science of art. Journal of Consciousness Studies, 6(6–7):15–51, 1999. 24
[136] Mahesh Ramasubramanian, Sumanta N. Pattanaik, and Donald P. Greenberg. A perceptually based physical error metric for realistic image synthesis. In SIGGRAPH ’99: Proceedings of the 26th annual conference on Computer graphics and interactive techniques,
pages 73–82, New York, NY, USA, 1999. ACM Press/Addison-Wesley Publishing Co.
40
[137] Ramesh Raskar. Hardware support for non-photorealistic rendering. In HWWS ’01: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS workshop on Graphics hardware,
pages 41–47, New York, NY, USA, 2001. ACM Press. 122
[138] Ramesh Raskar, Kar-Han Tan, Rogerio Feris, Jingyi Yu, and Matthew Turk. Nonphotorealistic camera: depth edge detection and stylized rendering using multi-flash
imaging. ACM Trans. Graph., 23(3):679–688, 2004. 19, 48
[139] M. M. Reid, R. J. Millar, and N. D. Black. Second-generation image coding: an overview.
ACM Comput. Surv., 29(1):3–29, 1997. 36
[140] I. Rock and J. DiVita. A case of viewer-centered perception. Cognitive Psychology,
19:280–293, 1987. 118
[141] T. A. Ryan and C. B. Schwartz. Speed of perception as a function of mode of representation. American Journal of Psychology, 69(1):60–69, March 1956. 25, 112, 113, 117,
131
[142] Takafumi Saito and Tokiichiro Takahashi. Comprehensible rendering of 3-D shapes. In
Proc. of ACM SIGGRAPH 90, pages 197–206, 1990. 19, 47
[143] Michael P. Salisbury, Michael T. Wong, John F. Hughes, and David H. Salesin. Orientable
textures for image-based pen-and-ink illustration. In SIGGRAPH ’97: Proceedings of the
193
24th annual conference on Computer graphics and interactive techniques, pages 401–
406, New York, NY, USA, 1997. ACM Press/Addison-Wesley Publishing Co. 18
[144] Anthony Santella and Doug DeCarlo. Visual interest and NPR: An evaluation and manifesto. In Proc. of NPAR ’04, pages 71–78, 2004. 19, 20, 33, 41, 49
[145] Jutta Schumann, Thomas Strothotte, Andreas Raab, and Stefan Laser. Assessing the effect of non-photorealistic rendered images in CAD. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems: Common Ground, pages 35–41, 1996.
88
[146] Claude E. Shannon. A mathematical theory of communication. Bell System Technical
Journal, 27:623–656, October 1948. 88
[147] R. N. Shepard and J. Metzler. Mental rotation of three-dimensional objects. Science, New
Series, 171(3972):701–703, Feb. 1971. 114, 115, 118
[148] M. Singh, G. D. Seyranian, and D. D. Hoffman. Parsing silhouettes: the short-cut rule.
Perceptual Psychophysics, 61(4):636–660, 1999. 169
[149] Sarah V. Stevenage. Can caricatures really produce distinctiveness effects? British Journal of Psychology, 86:127–146, 1995. 49, 53, 80, 81, 83, 98, 116
[150] Thomas Strothotte and Stefan Schlechtweg. Non-Photorealistic Computer Graphics:
Modeling, Rendering, and Animation. Morgan Kaufmann, 2002. 103
[151] Kim Sunghee, H. Hagh-Shenas, and Victoria Interrante. Conveying shape with texture:
An experimental investigation of the impact of texture type on shape categorization judgments. 2003 IEEE Symposium on Information Visualization, pages 163–170, 2003. 116,
118
[152] Ivan Sutherland. Sketchpad: A man-machine graphical communication system. In Proc.
AFIPS Spring Joint Computer Conference, pages 329–346, Washington, D.C, 1963.
Spartan Books. 18
[153] Graeme Sweet and Colin Ware. View direction, surface orientation and texture orientation for perception of surface shape. In GI ’04: Proceedings of the 2004 conference on
Graphics interface, pages 97–106. Canadian Human-Computer Communications Society, 2004. 112, 115, 118, 178
[154] M. J. Tarr. Orientation Dependence in Three-Dimensional Object Recognition. PhD thesis, Massachusetts Institute of Technology, Dept. of Brain and Cognitive Sciences, 1989.
118
194
[155] C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In Proceedings
of ICCV ’98, pages 839–846, 1998. 59, 61
[156] J. Tumblin, A. Agarwal, and R. Raskar. Why I want a gradient camera. Computer Vision
and Pattern Recognition (CVPR), pages 103–110, 2005. 52, 174
[157] Jack Tumblin, Jessica K. Hodgins, and Brian K. Guenter. Two methods for display of
high contrast images. ACM Trans. Graph., 18(1):56–94, 1999. 38
[158] Jack Tumblin and Greg Turk. LCIS: A boundary hierarchy for detail-preserving contrast
reduction. In SIGGRAPH ’99: Proceedings of the 26th annual conference on Computer
graphics and interactive techniques, pages 83–90, New York, NY, USA, 1999. ACM
Press/Addison-Wesley Publishing Co. 38, 50, 61
[159] R. L. De Valois and K. K. De Valois. Spatial Vision. Oxford University Press, New York,
1988. 54
[160] P. Verghese and D. G. Pelli. The information capacity of visual attention. Vision Research,
32(5):983–995, May 1992. 105, 130
[161] Jue Wang, Yingqing Xu, Heung-Yeung Shum, and Michael F. Cohen. Video tooning.
ACM Trans. Graph., 23(3):574–583, 2004. 48, 90, 91, 92, 93, 174
[162] Gregory J. Ward. The RADIANCE lighting simulation and rendering system. In SIGGRAPH ’94: Proceedings of the 21st annual conference on Computer graphics and
interactive techniques, pages 459–472, New York, NY, USA, 1994. ACM Press. 17, 41
[163] Benjamin Watson, Alinda Friedman, and Aaron McGaffey. Measuring and predicting
visual fidelity. In SIGGRAPH ’01: Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pages 213–220, New York, NY, USA, 2001.
ACM Press. 37
[164] Joachim Weickert. Anisotropic Diffusion in Image Processing. ECMI. Teubner, Stuttgart,
1998. 61
[165] Andreas Wenger, Andrew Gardner, Chris Tchou, Jonas Unger, Tim Hawkins, and Paul
Debevec. Performance relighting and reflectance transformation with time-multiplexed
illumination. ACM Trans. Graph., 24(3):756–764, 2005. 17
[166] Turner Whitted. An improved illumination model for shaded display. Commun. ACM,
23(6):343–349, 1980. 15
195
[167] Nathaniel Williams, David Luebke, Jonathan D. Cohen, Michael Kelley, and Brenden
Schubert. Perceptually guided simplification of lit, textured meshes. In SI3D ’03: Proceedings of the 2003 symposium on Interactive 3D graphics, pages 113–121, New York,
NY, USA, 2003. ACM Press. 37
[168] A. Willis and S. J. Anderson. Separate colour-opponent mechanisms underlie the detection and discrimination of moving chromatic targets. Proc. R. Soc. Lond. B Biol. Sci.,
265(1413):2435–2441, 1998. 154, 178
[169] Georges Winkenbach and David H. Salesin. Computer-generated pen-and-ink illustration. In Proc. of ACM SIGGRAPH 94, pages 91–100, 1994. 95
[170] Georges Winkenbach and David H. Salesin. Rendering parametric surfaces in pen and
ink. In SIGGRAPH ’96: Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pages 469–476, New York, NY, USA, 1996. ACM Press.
18, 19
[171] Holger Winnemöeller, Sven C. Olsen, and Bruce Gooch. Real-time video abstraction.
ACM Trans. Graph., 25(3):1221–1226, 2006. 101
[172] Holger Winnemöller. Testing effects of color constancy for images displayed on CRT
devices. Technical Report CS03-03-00, University of Cape Town, Computer Science
Department, September 2003. 165
[173] Holger Winnemöller and Shaun Bangay. Geometric approximations towards free specular comic shading. Computer Graphics Forum, 21(3):309–316, September 2002. 19
[174] Holger Winnemöller and Shaun Bangay. Rendering Optimisations for Stylised Sketching. In ACM Afrigraph 2003: 2nd International Conference on Computer Graphics,
Virtual Reality and Visualization in Africa, pages 117–122. ACM, ACM SIGGRAPH,
February 2003. 19
[175] A. P. Witkin. Scale-space filtering. In 8th Int. Joint Conference on Artificial Intelligence,
pages 1019–1022, Karlsruhe, Germany, 1983. 54
[176] Eric Wong. Artistic rendering of portrait photographs. Master’s thesis, Cornell University, 1999. 19
[177] M. Woo and M. B. Sheridan. OpenGL Programming Guide: The Official Guide to Learning OpenGL, Version 1.2. Addison-Wesley Longman Publishing Co., Inc. Boston, MA,
USA, 1999. 123
196
[178] G. A. Woodell, D. J. Jobson, Z. Rahman, and G. D. Hines. Advanced image processing
of aerial imagery. In Visual Information Processing XIV, Proc. SPIE 6246, 2006. 166
[179] G. Wyszecki and W. S. Styles. Color Science: Concepts and Methods, Quantitative Data
and Formulae. Wiley, New York, NY, 1982. 51
[180] Hector Yee, Sumanita Pattanaik, and Donald P. Greenberg. Spatiotemporal sensitivity and visual attention for efficient rendering of dynamic environments. ACM Trans.
Graph., 20(1):39–65, 2001. 34, 41
[181] Ian T. Young, Lucas J. Van Vliet, and Michael Van Ginkel. Recursive gabor filtering.
IEEE Trans. on Signal Processing, 50(11):2798–2805, 2002. 89
[182] R. A. Young. Some observations on temporal coding of color vision: psychophysical
results. Vision Res., 17(8):957–965, 1977. 162
[183] R. A. Young. The gaussian derivative model for spatial vision: I. retinal mechanisms.
Spatial Vision, 2:273–293, 1987. 175
[184] John C. Yuille and James H. Steiger. Nonholistic processing in mental rotation: Some
suggestive evidence. Perception & Psychophysics, 31(3):201–209, 1982. 114, 115, 118
[185] S. Zeki and M. Lamb. The neurology of kinetic art. Brain, 117:607–636, 1994. 24
[186] S. Zeki and M. Marini. Three cortical stages of colour processing in the human brain.
Brain, 121:1669–1685, 1998. 24
[187] S. M. Zeki. Colour coding in the superior temporal sulcus of rhesus monkey visual cortex.
Proc. R. Soc. Lond. B Biol. Sci., 197(1127):195–223, 1977. 154
[188] Semir Zeki. A vision of the brain. Blackwell Scientific Publications Oxford, 1993. 45,
51, 83, 124, 175
[189] Semir Zeki. Art and the brain. Journal of Consciousness Studies, 6(6–7):76–96, 1999.
21, 24, 25, 151, 175, 176
[190] Robert C. Zeleznik, Kenneth P. Herndon, and John F. Hughes. SKETCH: An interface
for sketching 3D scenes. In SIGGRAPH ’96: Proceedings of the 23rd annual conference
on Computer graphics and interactive techniques, pages 163–170, New York, NY, USA,
1996. ACM Press. 18
197
APPENDIX A
User-data for Videoabstraction Studies
Table A.1 and Table A.2 list the per-participant data values for Study 1 and Study 2 in Section 3.4. Figure 3.21 visualizes the data for both tables.
In these tables and the following, Std. Dev. stands for standard deviation, σ, and Std. Err. stands for standard error (not normalized), se ≡ σ/√n, where n is the number of samples.
Study 1: Recognition — Time (msec)

Data Pair    Photograph   Abstraction
1                  1159           965
2                  1291          1237
3                  1660          1281
4                  1305          1285
5                  1342          1330
6                  1486          1367
7                  1712          1378
8                  1622          1388
9                  1748          1435
10                 1811          1520
Average          1513.5        1318.5
Std. Dev.         227.5         148.5
Std. Err.          72.0          47.0

Table A.1. Data for Videoabstraction Study 1. This table shows the average time (in milliseconds) each participant took to recognize a depicted face (photograph or abstraction), taken over all faces presented to the participant. The data pairs are ordered in ascending abstraction time, corresponding to Figure 3.21, top graph.
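For reference, the summary rows of Table A.1 can be reproduced from the per-pair values with a few lines of Python. This is only an illustrative sketch; it assumes the sample (n−1) standard deviation, which matches the printed numbers up to rounding of the per-participant averages.

    from math import sqrt
    from statistics import mean, stdev

    photograph  = [1159, 1291, 1660, 1305, 1342, 1486, 1712, 1622, 1748, 1811]
    abstraction = [ 965, 1237, 1281, 1285, 1330, 1367, 1378, 1388, 1435, 1520]

    for label, times in (("Photograph", photograph), ("Abstraction", abstraction)):
        sigma = stdev(times)              # sample standard deviation
        se = sigma / sqrt(len(times))     # standard error, se = sigma / sqrt(n)
        print(f"{label}: average = {mean(times):.1f}, "
              f"std. dev. = {sigma:.1f}, std. err. = {se:.1f}")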
Study 2: Memory

                  Time (secs)                 Clicks
Data Pair    Photograph  Abstraction    Photograph  Abstraction
1                  60.5         48.7            60           42
2                  54.1         50.8            58           52
3                  68.1         51.7            86           64
4                  64.6         55.4            50           40
5                  92.5         57.0            62           52
6                  77.7         57.1            59           42
7                  76.9         60.0            64           52
8                  92.0         66.2            64           44
9                  91.2         71.2            51           42
10                 83.7         81.4            70           62
Average            75.5         59.4          62.4         49.2
Std. Dev.          13.3          9.9           9.7          8.2
Std. Err.           4.0          3.0           2.9          2.5

Table A.2. Data for Videoabstraction Study 2. This table shows the time (in seconds) and number of clicks each participant used to complete a memory game with photographs and a memory game with abstraction images. The data pairs are ordered in ascending abstraction time, corresponding to Figure 3.21, middle and bottom graphs (this ordering is not intended to correspond to Table A.1).
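Because each row of Table A.2 pairs one participant's photograph and abstraction runs, one simple way to summarize the data is via per-participant differences. The sketch below is only an illustration of that idea, not the analysis reported in Section 3.4.

    from math import sqrt
    from statistics import mean, stdev

    photo_time = [60.5, 54.1, 68.1, 64.6, 92.5, 77.7, 76.9, 92.0, 91.2, 83.7]
    abstr_time = [48.7, 50.8, 51.7, 55.4, 57.0, 57.1, 60.0, 66.2, 71.2, 81.4]

    # Per-participant time saved with abstraction images in the memory game.
    diffs = [p - a for p, a in zip(photo_time, abstr_time)]
    se = stdev(diffs) / sqrt(len(diffs))
    print(f"mean difference = {mean(diffs):.1f} s (std. err. = {se:.1f} s)")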
APPENDIX B
User-data for Shape-from-X Study
Tables B.1–B.5 list the experimental data (aggregate values averaged over four trials) for each display mode for all 21 participants of the shape-from-X study described in Section 4.5. Figure B.1 shows the questionnaire given to participants after they completed the experimental trials, and Table B.6 lists the numerical data gathered from the questionnaire.
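The following Python sketch illustrates how such aggregates could be formed: each participant's four trials are averaged per metric, and the Average, Std. Dev. (sample), and Std. Err. rows are then computed over participants. The per-trial record layout (a list of metric dictionaries keyed by participant ID) is a hypothetical stand-in for the actual logging format, which is not described here.

    from math import sqrt
    from statistics import mean, stdev

    METRICS = ("success", "failure", "risk", "placement", "detection")

    def summarize(trials):
        # trials: {user_id: [trial_dict, ...]} with one dict of metric values
        # per trial (hypothetical layout).
        per_user = {
            uid: {m: mean(t[m] for t in runs) for m in METRICS}
            for uid, runs in trials.items()
        }
        summary = {}
        for m in METRICS:
            column = [row[m] for row in per_user.values()]
            sigma = stdev(column)                       # sample standard deviation
            summary[m] = (mean(column), sigma, sigma / sqrt(len(column)))
        return per_user, summary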
Shading

UserID       success   failure   risk      placement   detection
BA044        0.868     0.018     0.525     0.885       2.970
BA046        0.910     0.073     0.235     0.983       1.021
BA047        0.863     0.050     0.460     0.910       2.311
BA048        0.920     0.055     0.688     0.978       3.344
BA049        0.928     0.020     0.403     0.948       2.135
BA050        0.878     0.043     0.505     0.923       2.445
BA051        0.860     0.060     0.360     0.923       1.585
BA052        0.820     0.040     0.335     0.860       1.483
BA053        0.855     0.020     0.895     0.880       6.654
BA054        0.800     0.035     0.443     0.833       2.196
BA055        0.835     0.000     0.910     0.835       5.767
BA056        0.670     0.198     0.793     0.870       1.758
BA057        0.810     0.035     0.813     0.845       4.565
BA058        0.903     0.008     0.643     0.910       3.868
BA059        0.900     0.033     0.393     0.933       1.857
BA060        0.785     0.053     0.463     0.835       1.976
BA061        0.778     0.073     0.625     0.850       2.957
BA062        0.785     0.020     0.800     0.805       3.545
BA063        0.788     0.038     0.345     0.823       1.547
BA064        0.793     0.038     0.493     0.830       2.289
BA065        0.880     0.010     0.465     0.890       2.354
Average      0.839     0.044     0.552     0.883       2.792
Std. Dev.    0.063     0.041     0.198     0.051       1.433
Std. Err.    0.014     0.009     0.043     0.011       0.313

Table B.1. Shading Data. Averages of each participant over four trials for the Shading display mode.
Outline

UserID       success   failure   risk      placement   detection
BA044        0.848     0.095     0.495     0.938       2.179
BA046        0.693     0.280     0.093     0.973       0.263
BA047        0.723     0.185     0.283     0.908       0.883
BA048        0.830     0.140     0.310     0.965       1.200
BA049        0.838     0.128     0.358     0.963       1.413
BA050        0.903     0.068     0.315     0.968       1.311
BA051        0.765     0.213     0.235     0.978       0.808
BA052        0.683     0.190     0.188     0.873       0.639
BA053        0.850     0.058     0.483     0.908       2.366
BA054        0.690     0.065     0.355     0.755       1.465
BA055        0.808     0.038     0.648     0.848       3.080
BA056        0.683     0.248     0.690     0.928       1.507
BA057        0.723     0.130     0.565     0.848       2.016
BA058        0.755     0.110     0.428     0.863       1.664
BA059        0.753     0.170     0.230     0.923       0.732
BA060        0.803     0.103     0.390     0.905       1.464
BA061        0.760     0.163     0.418     0.920       1.499
BA062        0.648     0.065     0.825     0.715       3.194
BA063        0.743     0.095     0.288     0.838       1.157
BA064        0.785     0.090     0.398     0.875       1.646
BA065        0.918     0.028     0.263     0.943       1.196
Average      0.771     0.127     0.393     0.897       1.509
Std. Dev.    0.075     0.069     0.177     0.069       0.740
Std. Err.    0.016     0.015     0.039     0.015       0.162

Table B.2. Outline Data. Averages of each participant over four trials for the Outline display mode.
Mixed

UserID       success   failure   risk      placement   detection
BA044        0.853     0.085     0.533     0.938       2.287
BA046        0.760     0.190     0.248     0.953       1.005
BA047        0.890     0.075     0.470     0.965       2.103
BA048        0.903     0.055     0.693     0.960       3.440
BA049        0.933     0.038     0.430     0.973       2.083
BA050        0.838     0.068     0.470     0.903       2.149
BA051        0.843     0.070     0.365     0.908       1.506
BA052        0.900     0.055     0.345     0.953       1.557
BA053        0.885     0.043     0.633     0.928       3.231
BA054        0.738     0.083     0.485     0.820       2.047
BA055        0.788     0.045     1.048     0.833       5.975
BA056        0.683     0.193     2.900     0.875       3.033
BA057        0.743     0.078     0.698     0.823       2.833
BA058        0.800     0.065     0.608     0.868       2.741
BA059        0.805     0.063     0.420     0.870       1.904
BA060        0.825     0.103     0.398     0.928       1.690
BA061        0.798     0.075     0.630     0.873       2.669
BA062        0.750     0.135     0.903     0.885       3.172
BA063        0.688     0.080     0.345     0.765       1.343
BA064        0.723     0.098     0.533     0.823       2.050
BA065        0.860     0.040     0.503     0.898       2.635
Average      0.810     0.083     0.650     0.892       2.450
Std. Dev.    0.073     0.043     0.549     0.057       1.043
Std. Err.    0.016     0.009     0.120     0.012       0.228

Table B.3. Mixed Data. Averages of each participant over four trials for the Mixed display mode.
TexISO

UserID       success   failure   risk      placement   detection
BA044        0.668     0.223     0.235     0.888       0.718
BA046        0.658     0.253     0.133     0.908       0.403
BA047        0.640     0.245     0.258     0.883       0.747
BA048        0.620     0.295     0.353     0.915       0.934
BA049        0.648     0.300     0.303     0.945       0.743
BA050        0.758     0.185     0.195     0.940       0.673
BA051        0.730     0.230     0.180     0.960       0.517
BA052        0.670     0.185     0.148     0.855       0.458
BA053        0.740     0.150     0.390     0.888       1.214
BA054        0.735     0.140     0.133     0.880       0.423
BA055        0.705     0.100     0.378     0.805       1.415
BA056        0.468     0.350     0.513     0.818       0.780
BA057        0.673     0.233     0.300     0.905       0.789
BA058        0.680     0.250     0.223     0.928       0.677
BA059        0.820     0.058     0.173     0.878       0.733
BA060        0.728     0.183     0.293     0.910       0.955
BA061        0.543     0.398     0.235     0.940       0.544
BA062        0.580     0.135     0.438     0.715       1.309
BA063        0.590     0.188     0.183     0.775       0.577
BA064        0.540     0.213     0.275     0.753       0.814
BA065        0.718     0.138     0.235     0.855       0.811
Average      0.662     0.212     0.265     0.873       0.773
Std. Dev.    0.084     0.082     0.103     0.066       0.274
Std. Err.    0.018     0.018     0.022     0.014       0.060

Table B.4. TexISO Data. Averages of each participant over four trials for the TexISO display mode.
TexNOI

UserID       success   failure   risk      placement   detection
BA044        0.688     0.273     0.193     0.958       0.555
BA046        0.675     0.218     0.153     0.893       0.439
BA047        0.510     0.385     0.240     0.895       0.533
BA048        0.620     0.323     0.275     0.945       0.717
BA049        0.520     0.458     0.300     0.975       0.655
BA050        0.773     0.200     0.258     0.968       0.878
BA051        0.645     0.258     0.193     0.900       0.532
BA052        0.573     0.323     0.115     0.895       0.344
BA053        0.653     0.238     0.295     0.893       0.903
BA054        0.715     0.123     0.115     0.838       0.374
BA055        0.690     0.120     0.383     0.808       1.404
BA056        0.565     0.315     0.533     0.880       0.719
BA057        0.695     0.165     0.358     0.858       1.127
BA058        0.650     0.228     0.233     0.880       0.731
BA059        0.690     0.135     0.170     0.828       0.577
BA060        0.640     0.113     0.328     0.753       1.135
BA061        0.628     0.258     0.285     0.888       0.794
BA062        0.480     0.165     0.395     0.645       1.111
BA063        0.710     0.170     0.145     0.880       0.489
BA064        0.580     0.255     0.263     0.833       0.707
BA065        0.795     0.143     0.210     0.938       0.770
Average      0.643     0.231     0.259     0.874       0.738
Std. Dev.    0.082     0.093     0.103     0.076       0.277
Std. Err.    0.018     0.020     0.023     0.017       0.060

Table B.5. TexNOI Data. Averages of each participant over four trials for the TexNOI display mode.
PiGeonAtor Questionnaire

1.) Please rank the display modes in order of difficulty. If some modes felt the same, you can assign the same number. (Scale: 1=Easiest … 5=Most difficult)
    (A) Shading          Rating: ...............
    (B) Lines            Rating: ...............
    (C) Shading&Lines    Rating: ...............
    (D) Texture1         Rating: ...............
    (E) Texture2         Rating: ...............
2.) Rate how well you think you performed in the experiment (did you hit most targets?)
    Performance: ............................ (Scale: 1=Very Good … 5=Poor)
3.) Rate how clear the instructions were to understand
    Instructions: ........................... (Scale: 1=Very Clear … 5=Totally Unclear)
4.) Rate how difficult you found the mode of interaction with the system (i.e. clicking/tapping)
    Interaction: ............................ (Scale: 1=Very Easy … 5=Very difficult)
5.) Rate the duration of the experiment
    Length: ................................. (Scale: 1=Too short … 5=Too long)
6.) Did the experiment tire/exhaust you?
    Exhaustion: ............................. (Scale: 1=Not at all … 5=I was very exhausted)
7.) Did the experiment cause you any discomfort?
    Comfort: ................................ (Scale: 1=Not at all … 5=I was very uncomfortable)
8.) If you did not answer Not at all above, please explain:
    Comfort explanation: ....................................................................
9.) Did you notice that your hands were casting a shadow?
    ( ) Yes    ( ) No
10.) If, yes, above, do you think it impaired your performance?
    ( ) Yes    ( ) No
11.) Please give any additional comments or suggestions you may have
    .........................................................................................

Figure B.1. Questionnaire. Participants were asked to fill out this short questionnaire after completing all trials.
Column key: 1a) Shading, 1b) Lines (Outlines), 1c) Shading & Lines (Mixed), 1d) Texture1 (TexISO), 1e) Texture2 (TexNOI), 2) Performance, 3) Instructions, 4) Interaction, 5) Duration, 6) Exhaustion, 7) Discomfort, 9) ShadowCast, 10) ShadowImpair. Columns 1a–1e hold the subjective difficulty rankings.

UserID      1a   1b   1c   1d   1e    2)    3)   4)   5)    6)   7)   9)  10)
BA044        1    2    2    5    4    3     1    2    4     4    3    0    0
BA046        2    5    1    4    3    3     2    1    3     3    1    0    0
BA047        2    3    1    5    4    4     1    1    3     1    1    0    0
BA048        1    5    2    3    4    2     1    1    3     2    2    0    0
BA049        3    4    1    5    5    4     2    3    3     2    1    0    0
BA050        2    3    1    4    5    2     1    1    3     2    1    0    0
BA051        1    4    1    5    5    3     2.5  2    5     4    2    1    0
BA052        2    3    1    5    5    3     1    2    4     2    1    0    0
BA053        2    4    1    3    5    2     4    2    3     3    3    1    0
BA054        1    2    1    4    3    3     3    1    4     3    1    0    0
BA055        2    3    1    4    5    4     1    1    3     1    1    0    0
BA056        3    1    1    4    5    3     4    3    2     2    1    0    0
BA057        1    3    1    5    4    2     2    3    5     4    3    0    0
BA058        1    4    2    5    5    2     1    3    3     4    1    1    1
BA059        1    3    2    4    5    2.5   3    1    3     2    1    1    0
BA060        2    3    1    4    5    3     2    1    4     2    1    0    0
BA061        2    3    1    4    5    3     1    1    3     2    2    0    0
BA062        2    3    1    4    5    3     1    3    3     1    1    0    0
BA063        2    3    1    4    5    2.5   4    2    2.5   2    2    1    1
BA064        2    3    1    5    4    3     2    2    5     3    3    1    1
BA065        2    3    1    4    5    4     5    2    3     2    2    0    0
Average     1.8  3.2  1.2  4.3  4.6   2.9   2.1  1.8  3.4   2.4  1.6  0.3  0.2
Std. Err.   0.1  0.2  0.1  0.1  0.2   0.2   0.3  0.2  0.2   0.2  0.2  0.1  0.1
Mode         2    3    1    4    5

Table B.6. Questionnaire Data. Numerical results for the questionnaire shown in Figure B.1. See the questionnaire for the meaning of each column and the scales used. Display mode names in parentheses are those used in this dissertation. For questions 9 and 10: 1=Yes and 0=No. Mode in the last row refers to the statistical measure (most frequent number), not a display mode.
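As a small sketch, the Mode row of Table B.6 (the most frequent difficulty ranking given to each display mode) can be reproduced from the 1a–1e columns above; the ranking lists below are copied directly from the table, one value per participant BA044–BA065.

    from collections import Counter

    rankings = {
        "Shading": [1, 2, 2, 1, 3, 2, 1, 2, 2, 1, 2, 3, 1, 1, 1, 2, 2, 2, 2, 2, 2],
        "Outline": [2, 5, 3, 5, 4, 3, 4, 3, 4, 2, 3, 1, 3, 4, 3, 3, 3, 3, 3, 3, 3],
        "Mixed":   [2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1],
        "TexISO":  [5, 4, 5, 3, 5, 4, 5, 5, 3, 4, 4, 4, 5, 5, 4, 4, 4, 4, 4, 5, 4],
        "TexNOI":  [4, 3, 4, 4, 5, 5, 5, 5, 5, 3, 5, 5, 4, 5, 5, 5, 5, 5, 5, 4, 5],
    }
    for name, ranks in rankings.items():
        most_frequent, count = Counter(ranks).most_common(1)[0]
        print(f"{name}: mode = {most_frequent} ({count} of {len(ranks)} participants)")
    # The printed modes match the Mode row: 2, 3, 1, 4, 5.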
APPENDIX C
Links for Selected Objects
Table C.1 lists publicly accessible internet URLs for several images and other objects used
in this dissertation. No guarantees can be made about the validity and availability of these links.
Figure 1.1 :http://commons.wikimedia.org/wiki/Image:Glasses_800.png
Figure 1.2 (a): http://commons.wikimedia.org/wiki/Image:IMG_0071_-_England%2C_London.JPG
Figure 1.3 (b), Bunny model: http://graphics.stanford.edu/data/3Dscanrep/
Figure 1.4 (a): http://commons.wikimedia.org/wiki/Image:Escaping_criticism_by_Caso.jpg
Figure 1.4 (b): http://commons.wikimedia.org/wiki/Image:Portrait_of_Dr._Gachet.jpg
Figure 3.8 Eye-tracking data and source image: http://www.cs.rutgers.edu/~decarlo/abstract.html
Figure 3.9 Source images: http://upload.wikimedia.org/wikipedia/commons/4/4c/Pitt_Clooney_Damon.jpg
Figure 3.10 Source image: http://www.indcjournal.com/archives/Lehrer.jpg
Figures 3.13–3.17 Source image: http://www.flickr.com/photos/johnnydriftwood/115499900/
Figure 3.26 Original, stationary: http://commons.wikimedia.org/wiki/Image:Ferrari-250-GT-Berlinetta-1.jpg
Figure 4.4 Girl courtesy of: www.crystalspace3d.org
Figure 4.4 Man & Tool courtesy of: http://www.3dcafe.com
Figure 4.4 Architecture model courtesy of Google 3D Warehouse: http://sketchup.google.com/3dwarehouse
Figure 4.7 Rendering engine: http://fabio.policarpo.nom.br/fly3d/index.htm
Figure 5.4 Left: http://commons.wikimedia.org/wiki/Image:Burger_King_Whopper_Combo.jpg
Figure 5.4 Right: http://www.flickr.com/photo_zoom.gne?id=100995096&size=o
Table C.1. Internet references. Links to selected images and 3-D models.