Image Sequence Stylization:
A Frame-to-Frame Coherent Approach
Mihai Parparita
Princeton University
Abstract
We present a system for the
generation of stylized images from real-world
captured 2D material. A set of passes is used
to extract interesting features such as edges
and large color areas. To resolve previously
encountered issues with frame-to-frame
coherence, an approach using optical flow is
chosen. The end result is a stylized drawing
of configurable style that does not exhibit the
traditional "jitter" seen in systems that are not
aware of image sequences. We demonstrate
our technique with the accompanying still
images rendered with our system.
1 Introduction
A recent trend in computer graphics is
a move away from trying to achieve
photorealistic results exclusively, with
projects appearing that have stylized outputs
as their objectives. Many such systems use
3D data as their inputs, in order to leverage
hardware acceleration frameworks for
performance and the added precision that
comes from knowing all of the properties
of an entire scene. Despite these advantages,
an approach that uses captured 2D images as
input offers different trade-offs, most notably in
the ease-of-use that can be achieved (since no
3D modeling package is required) and the
broadened audience that is enabled (e.g.
children could conceivably film a scene they
would like to see processed, even if they are
unable to model it).
Working in image-space, we apply
basic computer vision algorithms such as
Canny edge detection and k-means clustering
in order to extract desired features. The key
points of differentiation in our system lie in our chosen method of stylization, and in the way it is modified to deal with image sequences. Past algorithms have focused on converting the entire image into a set of
brush strokes, mimicking the way a painter
would approach a scene as shown in [3] and
[5]. We choose to emphasize edges between
objects and large, flat color areas for a more
cartoon and/or simplistic watercolor look,
depending on the chosen parameters. A focus
on edges has the implication that any "false
motion" (i.e. jitter induced by noise and
minute differences in the appearance of a
supposedly stationary object) is much more
apparent than it would otherwise be. Though
this problem has been tackled in the daLi! 3D
rendering system as implemented in [8], no
work has been done in 2D space. Solving this
problem is one of the key steps towards
achieving reasonable stylized output from 2D
image sequences.
Figure 1 Sample program output
Section 2 presents our basic method
of extracting straight-line segments from the
gradient of an image. These segments are then combined into continuous strokes, as
presented in Section 3. Color areas are
extracted using k-means clustering, working
in HSV space. Section 4 shows how clusters are also used to deduce additional information about strokes. Each of
these stages works by taking only a single
frame into account, but with the aid of optical
flow, frame-to-frame coherence is
established, as shown in Section 5. Section 6
shows how the results of all these steps are
combined in order to yield our desired
stylized output. Sections 7 through 10 further
evaluate the system.
2 Edge Detection and Segment Extraction
Edge detection is performed by first obtaining the image gradient by convolving the image with the derivative of a Gaussian kernel. The magnitudes of these gradients represent the edge strengths throughout the image. Per the Canny algorithm [2], elimination of local non-maximal points is also performed. Once a suitable edge-strength image is thus generated, we then find "seed" points for the segments by finding pixels with gradient magnitudes above a certain limit. These seeds are also chosen to be relatively separated from one another, so that areas with stronger edges don't necessarily get more segments. Finding N such seeds would normally have O(N^2) complexity, since each candidate seed would have to be compared with all existing ones to ensure the distance condition holds. However, since the image is traversed vertically, the current seeds are implicitly sorted by their y coordinates. Thus, when comparing seeds, it is possible to limit the search to the ones in the previous M scanlines (where M is the desired minimum separation), resulting in near-linear performance for increasing dataset sizes.
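As a rough illustration of this windowed search (not code from the original system), the following Python sketch scans a NumPy edge-strength image top to bottom and compares each candidate only against seeds in the last M scanlines; all names and thresholds are illustrative.

```python
import numpy as np

def find_seeds(edge_strength, threshold, min_separation):
    """Scan the edge-strength image top to bottom, keeping pixels above
    `threshold` that are at least `min_separation` pixels from any
    previously accepted seed. Because the scan is vertical, accepted
    seeds are sorted by y, so only seeds within the last
    `min_separation` scanlines need to be checked."""
    seeds = []        # accepted (x, y) positions, implicitly sorted by y
    window_start = 0  # index of the first seed still within the y window
    h, w = edge_strength.shape
    for y in range(h):
        # advance the window: drop seeds more than min_separation rows above
        while window_start < len(seeds) and seeds[window_start][1] < y - min_separation:
            window_start += 1
        for x in range(w):
            if edge_strength[y, x] < threshold:
                continue
            # compare only against seeds in the previous M scanlines
            if all((x - sx) ** 2 + (y - sy) ** 2 >= min_separation ** 2
                   for sx, sy in seeds[window_start:]):
                seeds.append((x, y))
    return seeds
```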
Once the seeds are chosen, segments are "grown" around each one. This is done by looking at the image gradient at the seed and then extending a line in both directions perpendicular to the gradient, as long as the following conditions hold:
- the line endpoints are still within image bounds
- the edge strength has not dropped by more than a certain percentage of the seed's strength
- the gradient direction has not changed by more than a certain amount
- no more than a set portion of the pixels that the line passes through have been used by other segments (to allow crossing segments while preventing completely overlapped ones)
These conditions mean that there is not a 1-to-1 mapping between segments and seeds; rather, for N seeds, usually around N/2 segments are found. The growing loop is sketched below.
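A minimal sketch of the growing loop, under the same caveats: the gradient arrays, thresholds, and the `used_mask` bookkeeping are assumed inputs, and marking newly claimed pixels is omitted for brevity.

```python
import numpy as np

def grow_segment(seed, grad_mag, grad_dir, used_mask,
                 strength_frac=0.5, max_angle=0.3, max_overlap=0.25):
    """Extend a line from `seed` in both directions perpendicular to the
    gradient, stopping when any of the conditions above fails."""
    h, w = grad_mag.shape
    x0, y0 = seed
    seed_strength = grad_mag[y0, x0]
    theta = grad_dir[y0, x0] + np.pi / 2          # perpendicular to gradient
    step = np.array([np.cos(theta), np.sin(theta)])
    endpoints = []
    for direction in (+1, -1):
        pos = np.array([x0, y0], dtype=float)
        pixels, overlapped = [], 0
        while True:
            nxt = pos + direction * step
            xi, yi = int(round(nxt[0])), int(round(nxt[1]))
            if not (0 <= xi < w and 0 <= yi < h):
                break                              # outside image bounds
            if grad_mag[yi, xi] < strength_frac * seed_strength:
                break                              # edge strength dropped too far
            dtheta = abs(grad_dir[yi, xi] - grad_dir[y0, x0])
            if min(dtheta, 2 * np.pi - dtheta) > max_angle:
                break                              # gradient direction changed too much
            overlapped += used_mask[yi, xi]
            pixels.append((xi, yi))
            if overlapped > max_overlap * len(pixels):
                break                              # too many pixels already claimed
            pos = nxt
        endpoints.append(tuple(pos.astype(int)))
    return endpoints
```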
3 Stroke Extraction
Although distinct line segments do result in an acceptable replication of object outlines, there are drawbacks to leaving outline information in this state. First of all, curved or otherwise deformed outlines may not be represented by a continuous set of edges (i.e. there may be gaps, and segment endpoints may not line up correctly). Furthermore, it is impossible to apply the same artistic style to the entire outline of an object if the segments that make it up are not grouped in any way.
To alleviate these issues, nearby segments are collapsed into a single continuous stroke (which can be thought of as a polyline). Finding segments in proximity to one another by a simple traversal of the segment list would have O(N^2) complexity for N segments, so a more efficient method was chosen, at the expense of additional space. An "ID reference image," as demonstrated in [4] and [9], is generated by rasterizing all segments, each with a distinctive color that is in fact an index into the original segment list. To then find segments within M pixels of one another, it is enough to examine a 2M x 2M neighborhood around each segment endpoint. If pixels of another color are encountered, the two segments they represent can be collapsed into a single stroke.
The net result of this approach is a scalable method of stroke extraction that works well enough to yield clean object borders.
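The merging step might look like the following sketch, assuming each segment exposes a hypothetical `endpoints` attribute and that the ID image stores each segment's index plus one (zero meaning background); union-find stands in for whatever grouping the original implementation used.

```python
import numpy as np

def merge_nearby_segments(segments, id_image, m):
    """For each segment endpoint, scan a 2M x 2M neighborhood of the ID
    reference image; any other segment index found there marks a pair of
    segments to collapse into one stroke (tracked with union-find)."""
    parent = list(range(len(segments)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i
    h, w = id_image.shape
    for idx, seg in enumerate(segments):
        for (ex, ey) in seg.endpoints:      # hypothetical attribute
            window = id_image[max(0, ey - m):min(h, ey + m),
                              max(0, ex - m):min(w, ex + m)]
            for other in np.unique(window):
                if other != 0 and other - 1 != idx:
                    parent[find(other - 1)] = find(idx)  # collapse into one stroke
    # group segments by their union-find root: one stroke per group
    strokes = {}
    for i in range(len(segments)):
        strokes.setdefault(find(i), []).append(segments[i])
    return list(strokes.values())
```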
4 Clustering
A further trait of the stylization method that we chose was to limit the number of colors present in the final image, in order to mimic the limited palette of an artist. To achieve this we use simple k-means clustering, as in [10], to divide all of the colors in the image into k fundamental "buckets." However, to yield more deterministic results (since simple k-means clustering starts off with random guesses for the initial values), we perform a simple histogram analysis of the colors present (at a much reduced precision) in order to obtain reasonable "seeds." A further tweak was to ensure that these seeds were separated (as determined by the color distance function described below) by at least a certain amount, so that even if various shades of the same color dominated the image, they did not dominate the cluster list.
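A sketch of this deterministic initialization, assuming an (N, 3) array of HSV pixels scaled to [0, 1) and a color distance function `dist` (such as the one sketched after the next paragraph); the bin count is an illustrative choice.

```python
import numpy as np

def histogram_seeds(hsv_pixels, k, min_dist, dist, bins=8):
    """Deterministic k-means initialization: coarsely histogram the HSV
    pixels, then walk the bins from most to least populated, keeping a
    bin center as a seed only if it is at least `min_dist` away (per
    `dist`) from every seed kept so far."""
    q = np.clip((hsv_pixels * bins).astype(int), 0, bins - 1)
    flat = q[:, 0] * bins * bins + q[:, 1] * bins + q[:, 2]
    counts = np.bincount(flat, minlength=bins ** 3)
    seeds = []
    for b in np.argsort(counts)[::-1]:      # most populated bins first
        if counts[b] == 0 or len(seeds) == k:
            break
        center = (np.array([b // (bins * bins), (b // bins) % bins, b % bins])
                  + 0.5) / bins             # bin center back in [0, 1)
        if all(dist(center, s) >= min_dist for s in seeds):
            seeds.append(center)
    return np.array(seeds)
```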
One of the key operations in the iterative clustering process is finding the distance between colors (in our case, between the color of a pixel and the clusters that we are trying to assign it to). Initial attempts to compute this distance in RGB color space revealed a key issue with that numerical representation of color: Euclidean distance in RGB space does not map well to perceptual distance, with visually close colors having a large numeric separation, and vice versa. To counteract this problem, the source images were converted to the HSV color space, which provides a much better mapping.
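The paper does not give the exact distance formula, so the following is only a plausible sketch: hue is treated as a wrapping angle, and the channel weights are illustrative.

```python
import numpy as np

def hsv_distance(a, b):
    """Distance between two HSV colors with h, s, v each in [0, 1).
    Hue is an angle, so its difference wraps around; the channel
    weights here are illustrative, not the system's tuned values."""
    dh = abs(a[0] - b[0])
    dh = min(dh, 1.0 - dh)   # wrap hue: red at 0.99 is close to red at 0.01
    ds = a[1] - b[1]
    dv = a[2] - b[2]
    return np.sqrt((2 * dh) ** 2 + ds ** 2 + dv ** 2)
```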
Though HSV did solve the distance problem, it brought another issue to light: compression artifacts (as seen in JPEG images and DV stream captures) became much more apparent in some of the channels. To alleviate this, a slight blur was applied to the image before it was clustered. This blurring also helped to eliminate "speckling," the presence of a few scattered pixels belonging to one cluster embedded in a larger area belonging to another cluster.
Cluster data also proved to have an additional use in the final image. Though the extracted strokes represent distinctive edges in the source image, some of these edges are the true outlines of objects, while others represent internal features. Clusters can be used to make this distinction: if a stroke straddles two or more clusters, then it is most likely part of an outline, whereas if it lies completely in one cluster (or just crosses from one to another), then it is an internal edge. By altering the rendered appearance of such "internal" (or "detail") strokes in the final image, more subtle effects can be achieved. Figure 2 shows the source image, the strokes rendered using uniform thickness, and finally the use of thinning applied to strokes labeled as "detail."
Figure 2 Detailed strokes in action
5 Frame-to-Frame Coherence
If the above steps were simply
applied to every frame in an image sequence,
the results would be suboptimal. Despite the
advent of consumer digital video cameras of
increased resolution, noise, compression
artifacts and subtle changes in lighting still
affect the appearance of supposedly
unchanged portions of the frame. The net
result is a combination of "jitter" (where
strokes appear to shake) and "flicker" (where
strokes quickly appear and disappear).
Though these effects can be considered artistic in nature (i.e. if an artist were to draw each frame by hand, he or she would not place lines and strokes in exactly the same position each time), they have the downside of
detracting from the real motion that is
present, since the entire scene appears to be
moving.
The most basic approach to dealing with this problem is to find strokes that are supposed to be stationary from frame to
frame, and "snap" them to the position of the
previous ones. Though this will not help with
the issue of rapid stroke appearance and
disappearance, it will alleviate jitter.
5.1 Optical Flow
Determining which strokes are
stationary requires some knowledge about the
motion present in a frame. We choose to
apply an optical flow detection algorithm in
order to obtain this information, as shown in
[6]. An iterative approach was chosen, which
first finds broad motion and then more
detailed movement. The net result is a vector
for each pixel, representing the displacement
that it underwent when compared to the
previous frame. If this displacement is below
a certain value (to account for noise in the
data), then it can be declared as being
stationary. By looking at all of the pixels that
a stroke sits on top of, it is possible to determine its movement status.
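A sketch of this classification, assuming a dense (H, W, 2) flow field and a hypothetical `pixels` attribute listing the pixels under each stroke; the majority-vote fraction is an added assumption, since the paper only says that the stroke's pixels are examined against a displacement threshold.

```python
import numpy as np

def stationary_strokes(strokes, flow, max_disp=0.75, frac=0.9):
    """Classify strokes as stationary from a dense optical flow field.
    `flow[y, x]` is the displacement of pixel (x, y) relative to the
    previous frame; the small `max_disp` tolerance absorbs noise in
    the flow estimate."""
    labels = []
    for stroke in strokes:
        disp = np.array([np.hypot(*flow[y, x]) for (x, y) in stroke.pixels])
        labels.append(bool(np.mean(disp < max_disp) >= frac))
    return labels
```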
5.2 Stroke Correspondence
Once strokes are labeled as "stationary," there is still the matter of determining correspondence between frames. This is complicated by the fact that stroke parametrizations (i.e. the length and count of the segments that make up a stroke) may differ between frames, for example to the point where a stroke in one frame is replaced by two smaller ones. Thus, not only would a simple traversal of the stroke lists for two frames have O(N^2) complexity, it would also have to deal with cases such as this. In the end, it was decided to implement an approach similar to the one used in the original extraction of strokes. An "ID reference image" containing one frame's strokes is drawn into a temporary buffer. Then, the control points of the other frame's stationary strokes are examined. If they overlap with another stroke, that stroke is transferred. Strokes that are completely overlapped by ones from previous frames are removed entirely. Figure 4 shows two adjacent frames from the source image sequence, the extracted strokes with no coherence matching attempted, and the result where stationary strokes are matched from frame to frame.
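In sketch form, assuming hypothetical `pixels` and `control_points` attributes on strokes; transferring the previous frame's stroke preserves its random seed, which matters for the rendering described in Section 6.

```python
import numpy as np

def match_stationary_strokes(prev_strokes, curr_strokes, shape):
    """Rasterize the previous frame's strokes into an ID buffer, then
    test each current stationary stroke's control points against it."""
    id_buffer = np.zeros(shape, dtype=np.int32)      # 0 = background
    for i, stroke in enumerate(prev_strokes, start=1):
        for (x, y) in stroke.pixels:                 # rasterized footprint
            id_buffer[y, x] = i
    matched = []
    for stroke in curr_strokes:
        hits = [id_buffer[y, x] for (x, y) in stroke.control_points]
        hit_ids = {h for h in hits if h != 0}
        if hit_ids and all(h != 0 for h in hits):
            continue                                 # fully overlapped: remove
        if hit_ids:
            matched.append(prev_strokes[hit_ids.pop() - 1])  # transfer old stroke
        else:
            matched.append(stroke)                   # genuinely new stroke
    return matched
```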
6 Rendering and Sample Output
Due to the blurring of the image prior to clustering, the clusters already have a stylized, "organic" appearance. However, the strokes are still just a set of points through which a polyline must pass, so some stylized effect must be applied to them. We chose a simple approach that repeatedly draws the same stroke while perturbing each control point by various amounts on each pass. Drawing only one pass with no perturbation gives an effect akin to ink. A few (i.e. around five) passes with a thin line and a small amount of perturbation have the appearance of a sketch, while a thicker pen size with a large amount of perturbation and numerous passes results in the appearance of charcoal. Figure 3 shows these three styles in action.
Figure 3 Various rendering styles
The use of a random perturbation amount per control point has additional implications when considering image sequences. If the control points of a stationary stroke are perturbed by different amounts each frame, the "jitter" effect that was previously removed returns, though this time it is completely artificially induced. The solution is to store a "seed" for the random number generator with each stroke, so that subsequent re-drawings of the stroke preserve its appearance. This also makes accurate stroke correspondence even more important.
As an alternative to drawing the strokes in black on top of the clusters, it is also possible to make the strokes take on the color of the clusters they are near, for another stylized effect, as Figure 5 shows.
Finally, to further enhance the output of the system, it is also possible to modulate the output with a paper texture.
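A sketch of this multi-pass renderer: `draw_polyline` is a hypothetical canvas call, and the pass counts and jitter amounts are illustrative rather than the system's tuned values.

```python
import random

def draw_stroke(canvas, stroke, passes=5, jitter=1.5, width=1):
    """Render one stroke as `passes` overlaid polylines, perturbing each
    control point by up to `jitter` pixels per pass. Seeding the RNG
    with the stroke's stored seed makes the perturbation identical on
    every frame the stroke survives, so stationary strokes don't shimmer.
    One pass with zero jitter reads as ink; ~5 thin jittered passes as a
    sketch; many thick, heavily jittered passes as charcoal."""
    rng = random.Random(stroke.seed)         # per-stroke seed stored at creation
    for _ in range(passes):
        jittered = [(x + rng.uniform(-jitter, jitter),
                     y + rng.uniform(-jitter, jitter))
                    for (x, y) in stroke.control_points]
        canvas.draw_polyline(jittered, width=width)   # hypothetical canvas API
```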
7 Performance
Though the aim of this system is not
to process images in real-time, performance
is still crucial due to image sequences
serving as input. For example, an initial
version took 9 seconds per 640 x 480 frame
on a 450 MHz Power Macintosh G4. This
means that to process a one-minute long 30
fps clip it would take 4.5 hours. As
mentioned above, processing algorithms were
also chosen such that they exhibited near-linear scalability wherever possible.
Additionally, since the target platform has at
least two CPUs, passes were multi-threaded
wherever possible. Since the two halves of
each image can be processed independently
with almost no communication or resource
contention between threads, gains from
multi-threading were near-ideal. The final
performance was close to 4 seconds per
frame, which still makes processing of long
clips time-consuming, but manageable.
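A rough Python equivalent of this split (the original was presumably native threads; in Python the gain only materializes when the pass releases the GIL, as NumPy and OpenCV kernels do):

```python
from concurrent.futures import ThreadPoolExecutor

def run_pass_parallel(image, pass_fn, overlap=8):
    """Run one image-processing pass on the top and bottom halves of a
    frame concurrently. The small overlap band keeps passes with local
    support (e.g. convolutions) seam-free; with two CPUs and almost no
    shared state, the speedup is near 2x."""
    h = image.shape[0]
    top = image[: h // 2 + overlap]
    bottom = image[h // 2 - overlap :]
    with ThreadPoolExecutor(max_workers=2) as pool:
        top_future = pool.submit(pass_fn, top)
        bottom_future = pool.submit(pass_fn, bottom)
        return top_future.result(), bottom_future.result()
```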
Performance could be increased further by taking advantage of other target-platform-specific features, such as the presence of a vector processing unit in the G4, which would allow up to 8x performance increases for convolutions and other operations.
8 Interface
One of the issues with having a system with so many passes is that each step has its own collection of parameters that must be tweaked in order to obtain optimal results. An initial version of the system had a total of 20 sliders that controlled various thresholds, blurring amounts, etc. In order to simplify the interface, and in keeping with the stated goal of making the system usable by non-experts, these were collapsed to three values (a possible mapping is sketched at the end of this section):
• "Detail" (low to high) controls the amount of blurring, various distance thresholds, and neighborhood sizes for segment and stroke extraction.
• "Colors" (few to many) controls the number of clusters and the separation between them.
• "Motion" (slow to fast) controls the number of iterations and initial guesses for optical flow.
Controls for other parameters, such as the stroke rendering options described in Section 6, were retained, since they should be intuitive even to non-technical users.
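One possible, entirely illustrative mapping from the three sliders to underlying parameters; none of the constants are the paper's tuned values.

```python
def expand_parameters(detail, colors, motion):
    """Map the three user-facing sliders (each in [0, 1]) onto the
    underlying pass parameters. All constants are placeholders."""
    return {
        # "Detail": blurring, distance thresholds, neighborhood sizes
        "blur_sigma":        3.0 - 2.0 * detail,
        "edge_threshold":    0.4 - 0.3 * detail,
        "merge_radius":      int(12 - 8 * detail),
        # "Colors": cluster count and separation
        "num_clusters":      int(4 + 12 * colors),
        "seed_min_distance": 0.35 - 0.25 * colors,
        # "Motion": optical-flow effort
        "flow_iterations":   int(2 + 8 * motion),
        "flow_levels":       int(2 + 3 * motion),
    }
```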
9 Discussion
As can be seen from the sample outputs, the system was successful at achieving its goals. However, it is still dependent on having good input. Two classes of images were observed to be best suited to our stylization approach. First, buildings and other constructions have very straight edges and large flat color areas, which makes it very easy to determine their outlines and cluster their colors, as demonstrated in Figure 7. One conceivable application of this property is to rapidly convert images of existing buildings into sketches, so that artistic renderings of proposed buildings can be more easily integrated into the current cityscape.
A second type of image that worked very well was that of children's toys, since they are usually very brightly colored and thus have very definite edges. This suggests another possible application: something akin to the SnakeToonz [1] system that can be used by children. Simple scenes and skits starring toys can be filmed and then processed with the system, all requiring very little adult interaction.
In its current implementation, clustering also suffers slightly from a case of frame-to-frame incoherence. Since the cluster seeds are recalculated every single frame, it is possible that very similar colors will be assigned to different clusters, resulting in flicker here as well. One solution that was considered was to use the cluster values from the previous frame as the seeds for the current one. While this did alleviate flicker, it caused problems when objects of a completely new color entered the frame, since it took a "while" (i.e. several frames) before new cluster values representing those colors appeared.
10 Future Work
As mentioned, the coherence system currently only concerns itself with "jitter," while flicker is still an issue. To solve the problem of strokes disappearing and reappearing, the fact that the system has "perfect" knowledge of future events can be taken into account. For example, if a stationary stroke is not present in a frame, it can be checked whether it reappears two, three, four, etc. frames in the future, and if that is the case, the stroke can be copied over regardless of its present state, as sketched below.
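A sketch of this look-ahead gap filling, assuming the correspondence step assigns each stroke a persistent ID; the data layout is an assumption for illustration.

```python
def fill_stroke_gaps(frames, lookahead=4):
    """`frames` is a list of per-frame stroke dicts keyed by persistent
    stroke ID. If a stroke vanishes but reappears within `lookahead`
    frames, copy it into the gap; this exploits the offline system's
    "perfect" knowledge of future frames to suppress flicker."""
    for t in range(1, len(frames) - 1):
        for sid, stroke in frames[t - 1].items():
            if sid in frames[t]:
                continue
            # does the stroke come back within the lookahead window?
            for dt in range(1, lookahead + 1):
                if t + dt < len(frames) and sid in frames[t + dt]:
                    frames[t][sid] = stroke      # bridge the gap
                    break
    return frames
```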
To solve the above-mentioned cluster coherence issue, optical flow could be put to use again. When the overall optical flow in a frame exhibits a large amount of change, it could be interpreted as a sign that significant changes in color are taking place, and the cluster seeds could be re-initialized from the current scene contents.
The optical flow information that is currently obtained is in fact overkill for our needs, since all we care about is whether pixels have moved or not, while direction and magnitude are not really significant. Two paths exist here: either the movement detection could be simplified in order to reap performance benefits, or the extra information could be put to use. As an example of the latter, consider the addition of motion lines behind moving objects as a way to emphasize motion, as was done in 3D space in [7]. Stroke coherence could also be applied to slow-moving objects, where "jitter" is also likely to be disturbing.
Another issue that comes up is that for a longer image sequence, the same parameters may not apply at all points in time. It should be possible to implement a keyframing scheme where parameters are chosen at "significant" intervals (significance could again be determined by sudden changes in overall optical flow) and then blended in between those frames, as sketched below.
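A sketch of such keyframed blending for numeric parameters; the keyframe selection itself (via optical-flow spikes) is left out, and the data layout is an assumption.

```python
def blended_parameters(keyframes, t):
    """`keyframes` is a sorted list of (frame_index, params_dict) pairs
    chosen at "significant" frames. Parameters for frame `t` are
    linearly interpolated between the surrounding keyframes (numeric
    parameters only), and clamped outside the keyframed range."""
    for (t0, p0), (t1, p1) in zip(keyframes, keyframes[1:]):
        if t0 <= t <= t1:
            a = (t - t0) / (t1 - t0) if t1 > t0 else 0.0
            return {k: (1 - a) * p0[k] + a * p1[k] for k in p0}
    # before the first or after the last keyframe: clamp
    return keyframes[0][1] if t < keyframes[0][0] else keyframes[-1][1]
```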
References
[1] Aseem Agarwala. SnakeToonz: A Semi-Automatic Approach to Creating Cel Animation from Video. Proceedings of NPAR 2002.
[2] John Canny. A Computational Approach to Edge Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 8, No. 6, Nov. 1986.
[3] Aaron Hertzmann and Ken Perlin. Painterly Rendering for Video and Interaction. Proceedings of NPAR 2000, 7-12.
[4] Michael A. Kowalski, Lee Markosian, J. D. Northrup, Lubomir Bourdev, Ronen Barzel, Loring S. Holden, and John F. Hughes. Art-Based Rendering of Fur, Grass, and Trees. Proceedings of SIGGRAPH 1999.
[5] Peter Litwinowicz. Processing Images and Video for an Impressionist Effect. Proceedings of SIGGRAPH 1997.
[6] Bruce D. Lucas and Takeo Kanade. An Iterative Image Registration Technique with an Application to Stereo Vision. Proceedings of the Imaging Understanding Workshop, 121-130, 1981.
[7] Maic Masuch and Stefan Schlechtweg. Speedlines: Depicting Motion in Motionless Pictures. SIGGRAPH 1999, 8-13 August 1999, Los Angeles.
[8] Maic Masuch, Lars Schuhmann, and Stefan Schlechtweg. Animating Frame-to-Frame-Coherent Line Drawings for Illustrative Purposes. Proceedings of Simulation und Visualisierung '98, 101-112, SCS Europe, 1998.
[9] J. D. Northrup and Lee Markosian. Artistic Silhouettes: A Hybrid Approach. In Non-Photorealistic Animation and Rendering, ACM SIGGRAPH, June 2000.
[10] Rose H. Turi and Siddheswar Ray. K-Means Clustering for Colour Image Segmentation with Automatic Detection of K. Proceedings of IASTED Signal and Image Processing (SIP '98), Las Vegas, Nevada, USA, 28-31 October 1998.
Figure 4 Demonstration of frame-to-frame coherence: two adjacent frames from the source image sequence, the extracted strokes with no coherence matching attempted, and the result where stationary strokes are matched from frame to frame
Figure 5 Demonstration of stroke coloring as alternative stylization
Figure 6 More sample output
Figure 7 Demonstration of application on buildings