Image Sequence Stylization: A Frame-to-Frame Coherent Approach

Mihai Parparita
Princeton University

Abstract

We present a system for the generation of stylized images from real-world captured 2D material. A set of passes is used to extract interesting features such as edges and large color areas. To resolve previously encountered issues with frame-to-frame coherence, an approach using optical flow is chosen. The end result is a stylized drawing of configurable style that does not exhibit the traditional "jitter" seen in systems that are not aware of image sequences. We demonstrate our technique with the accompanying still images rendered with our system.

1 Introduction

A recent trend in computer graphics is a move away from trying to achieve photorealistic results exclusively, with projects appearing that have stylized outputs as their objectives. Many such systems use 3D data as their inputs, in order to leverage hardware acceleration frameworks for performance and the added precision that comes from knowing all of the properties of an entire scene. Despite these advantages, an approach that uses captured 2D images as input enjoys other trade-offs, most notably in the ease of use that can be achieved (since no 3D modeling package is required) and the broadened audience that is enabled (e.g. children could conceivably film a scene they would like to see processed, even if they are unable to model it).

Working in image space, we apply basic computer vision algorithms such as Canny edge detection and k-means clustering in order to extract the desired features. The key points of differentiation in our system lie in our chosen style of stylization, and the way it is modified in order to deal with image sequences. Past algorithms have focused on converting the entire image into a set of brush strokes, mimicking the way a painter would approach a scene, as shown in [3] and [5]. We choose to emphasize edges between objects and large, flat color areas for a more cartoon-like and/or simplistic watercolor look, depending on the chosen parameters.

Figure 1: Sample program output

A focus on edges has the implication that any "false motion" (i.e. jitter induced by noise and minute differences in the appearance of a supposedly stationary object) is much more apparent than it would otherwise be. Though this problem has been tackled in the daLi! 3D rendering system as implemented in [8], no comparable work has been done in 2D space. Solving this problem is one of the key steps towards achieving reasonable stylized output from 2D image sequences.

Section 2 presents our basic method of extracting straight-line segments from the gradient of an image. These segments are then combined into continuous strokes, as presented in Section 3. Color areas are extracted using k-means clustering, working in HSV space; Section 4 shows how clusters are also used to deduce additional information about strokes. Each of these stages works by taking only a single frame into account, but with the aid of optical flow, frame-to-frame coherence is established, as shown in Section 5. Section 6 shows how the results of all these steps are combined in order to yield our desired stylized output. Sections 7 through 10 further evaluate the system.
2 Edge Detection and Segment Extraction

Edge detection is performed by first obtaining the image gradient, convolving the image with the derivative of a Gaussian kernel. The magnitudes of these gradients represent the edge strengths throughout the image. Per the Canny algorithm [2], elimination of locally non-maximal points is also performed.

Once a suitable edge strength image is thus generated, we find "seed" points for the segments by locating pixels with gradient magnitudes above a certain limit. These seeds are also chosen to be relatively separated from one another, such that areas with stronger edges do not necessarily get more segments. Finding N such seeds would normally have O(N^2) complexity, since upon finding a candidate seed, it would have to be compared with all existing ones to ensure the distance condition holds. However, since the image is traversed vertically, the current seeds are implicitly sorted by their y coordinates. Thus, when comparing seeds, it is possible to limit the search to the ones in the previous M scanlines (where M is the desired minimum separation), resulting in near-linear performance for increasing dataset sizes.

Once the seeds are chosen, segments are "grown" around each one of them. This is done by looking at the image gradient at the seed, and then extending lines in both directions perpendicular to the gradient as long as the following conditions hold:

- the line endpoints are still within image bounds
- the edge strength has not dropped by more than a certain percentage of the seed's strength
- the gradient direction has not changed by more than a certain amount
- no more than a set portion of pixels that the line passes through have been used by other segments (to allow crossing segments while preventing completely overlapped ones)

The above conditions mean that there is not a 1-to-1 mapping between segments and seeds; rather, for N seeds, usually N/2 segments are found.
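To make the seeding pass concrete, the following Python sketch implements the scanline-window separation test described above. It is a minimal illustration, assuming the edge strengths arrive as a NumPy array; the function name find_seeds and its parameters are ours rather than the system's.

```python
import numpy as np

def find_seeds(strength, threshold, min_sep):
    """Select well-separated seed pixels in a single top-to-bottom pass.

    Because the image is scanned row by row, accepted seeds are implicitly
    sorted by y, so each candidate only needs to be compared against seeds
    from the previous min_sep scanlines (near-linear overall, instead of
    comparing against every existing seed).
    """
    seeds = []
    first_recent = 0  # index of the oldest seed that could still conflict
    h, w = strength.shape
    for y in range(h):
        # advance the window: seeds more than min_sep rows above cannot
        # violate the minimum-separation condition any more
        while first_recent < len(seeds) and seeds[first_recent][0] < y - min_sep:
            first_recent += 1
        for x in range(w):
            if strength[y, x] < threshold:
                continue
            if all((y - sy) ** 2 + (x - sx) ** 2 >= min_sep ** 2
                   for sy, sx in seeds[first_recent:]):
                seeds.append((y, x))
    return seeds
```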
3 Stroke Extraction

Although distinct line segments do result in an acceptable replication of object outlines, there are drawbacks to leaving outline information in this state. First of all, curved or otherwise deformed outlines may not be represented by a continuous set of edges (i.e. there may be gaps, and segment endpoints may not line up correctly). Furthermore, it is impossible to apply the same artistic style to the entire outline of an object if the segments that make it up are not grouped in any way. To alleviate these issues, nearby segments are collapsed into a single continuous stroke (which can be thought of as a polyline).

Finding segments in proximity to one another by a simple traversal of the segment list would have O(N^2) complexity for N segments, so a more efficient method was chosen, at the expense of a space tradeoff. An "ID reference image", as demonstrated in [4] and [9], is generated by rasterizing all segments, each with a distinctive color that is in fact an index into the original segment list. To then find segments within M pixels of one another, it is enough to examine a 2M x 2M neighborhood around each segment endpoint. If pixels of another color are encountered, the two segments they represent can be collapsed into a single stroke. The net result of this approach is a scalable method of stroke extraction that works well enough to yield clean borders of objects.
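The sketch below illustrates the ID-reference-image lookup under some assumptions: segments expose an endpoints pair, a caller-supplied rasterize helper yields the integer pixels a segment covers (e.g. via Bresenham's algorithm), and a union-find structure (our choice for the illustration, not necessarily the system's) collects merged segments into strokes.

```python
import numpy as np

def merge_nearby_segments(segments, image_shape, m, rasterize):
    """Collapse segments whose endpoints lie within m pixels of another
    segment into strokes, using an integer "ID reference image" instead
    of an O(N^2) all-pairs distance test."""
    # rasterize every segment with its list index as the "color"
    id_image = np.full(image_shape, -1, dtype=np.int32)
    for i, seg in enumerate(segments):
        for y, x in rasterize(seg):
            id_image[y, x] = i

    # simple union-find over segment indices, with path halving
    parent = list(range(len(segments)))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    h, w = image_shape
    for i, seg in enumerate(segments):
        for y, x in seg.endpoints:
            # examine the (2m+1) x (2m+1) neighborhood around the endpoint
            window = id_image[max(0, y - m):min(h, y + m + 1),
                              max(0, x - m):min(w, x + m + 1)]
            for j in np.unique(window):
                if j >= 0 and j != i:
                    parent[find(i)] = find(int(j))

    # segments sharing a root form one stroke
    strokes = {}
    for i in range(len(segments)):
        strokes.setdefault(find(i), []).append(segments[i])
    return list(strokes.values())
```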
4 Clustering

A further trait of the stylization method that we chose was to limit the number of colors present in the final image, in order to mimic the limited palette of an artist. To achieve this we use simple k-means clustering, as in [10], to divide all of the colors in the image into k fundamental "buckets." However, to yield more deterministic results (since simple k-means clustering starts off with random guesses for the initial values), we perform a simple histogram analysis of the colors present (at a much reduced precision) in order to obtain reasonable "seeds." A further tweak was to ensure that these seeds were separated (as determined by a color distance function described below) by at least a certain amount, so that even if various shades of the same color dominated the image, they did not dominate the cluster list.

One of the key operations in the iterative process of clustering is finding distances (in our case, the distance between the color of a pixel and the clusters that we are trying to assign it to). Initial attempts to compute this distance in RGB color space revealed a key issue with this numerical representation of color: Euclidean distance in this space does not map very well to perceptual distance, with visually close colors having a large numeric separation, and vice-versa. To counteract this problem, the source images were converted to the HSV color space, which provides a much better mapping. Though HSV did solve the distance problem, it brought another to light, since compression artifacts (as seen in JPEG images and DV stream captures) became much more apparent in some of the channels. To alleviate this problem, a slight blur was applied to the image before it was clustered. This blurring also helped to eliminate "speckling," the presence of a few scattered pixels belonging to one cluster embedded in a larger area belonging to another cluster.

Cluster data also proved to have an additional use in the final image. Though the extracted strokes represent distinctive edges in the source image, some of these edges are the true outlines of objects, while others represent internal features. Clusters can be used to make this distinction: if a stroke straddles two or more clusters, then it is most likely part of an outline, whereas if it lies completely in one cluster (or just crosses from one to another), then it is an internal edge. By altering the rendered appearance of such "internal" (or "detail") strokes in the final image, more subtle effects can be achieved. Figure 2 shows the source image, the strokes rendered using uniform thickness, and finally the use of thinning applied to strokes labeled as "detail."

Figure 2: Detailed strokes in action
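A minimal sketch of the deterministic seeding and clustering loop follows. It assumes HSV pixel values normalized to [0, 1] and uses plain Euclidean distance for brevity; the actual color distance function (in particular any handling of hue wraparound) is not reproduced here, and the function names are illustrative.

```python
import numpy as np

def histogram_seeds(hsv_pixels, k, min_sep, bits=4):
    """Pick k initial cluster centers from a reduced-precision histogram,
    skipping candidates closer than min_sep to an already chosen seed."""
    scale = 2 ** bits - 1
    quantized = (hsv_pixels * scale).astype(np.int32).reshape(-1, 3)
    codes, counts = np.unique(quantized, axis=0, return_counts=True)
    seeds = []
    for code in codes[np.argsort(-counts)]:  # most common colors first
        color = code.astype(np.float64) / scale
        if all(np.linalg.norm(color - s) >= min_sep for s in seeds):
            seeds.append(color)
        if len(seeds) == k:
            break
    return np.array(seeds)

def kmeans_hsv(hsv_pixels, k, min_sep, iters=10):
    """Plain k-means over HSV colors, seeded deterministically."""
    pixels = hsv_pixels.reshape(-1, 3)
    centers = histogram_seeds(hsv_pixels, k, min_sep)
    for _ in range(iters):
        # assign each pixel to its nearest center (Euclidean in HSV here)
        dists = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(len(centers)):
            members = pixels[labels == c]
            if len(members):
                centers[c] = members.mean(axis=0)
    return labels.reshape(hsv_pixels.shape[:2]), centers
```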
5 Frame-to-Frame Coherence

If the above steps were simply applied to every frame in an image sequence, the results would be suboptimal. Despite the advent of consumer digital video cameras of increased resolution, noise, compression artifacts and subtle changes in lighting still affect the appearance of supposedly unchanged portions of the frame. The net result is a combination of "jitter" (where strokes appear to shake) and "flicker" (where strokes quickly appear and disappear). Though these effects can be considered of an artistic nature (i.e. if an artist were to draw each frame by hand, he/she would not place lines and strokes in exactly the same position each time), they have the downside of detracting from the real motion that is present, since the entire scene appears to be moving. The most basic approach for dealing with this problem is to find strokes that are supposed to be stationary from frame to frame and "snap" them to the position of the previous ones. Though this will not help with the issue of rapid stroke appearance and disappearance, it will alleviate jitter.

5.1 Optical Flow

Determining which strokes are stationary requires some knowledge about the motion present in a frame. We choose to apply an optical flow detection algorithm in order to obtain this information, as shown in [6]. An iterative approach was chosen, which first finds broad motion and then more detailed movement. The net result is a vector for each pixel, representing the displacement that it underwent when compared to the previous frame. If this displacement is below a certain value (to account for noise in the data), the pixel can be declared stationary. By looking at all of the pixels that a stroke sits on top of, it is possible to determine its movement status.
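The stationarity test reduces to a small amount of bookkeeping over the flow field. The sketch below assumes an (H, W, 2) per-pixel displacement array (e.g. from an iterative Lucas-Kanade computation [6]) and a rasterize helper yielding the pixels a stroke covers; the max_disp and min_fraction defaults are illustrative stand-ins for the tunable thresholds described above.

```python
import numpy as np

def label_stationary_strokes(strokes, flow, rasterize,
                             max_disp=0.5, min_fraction=0.9):
    """Return one boolean per stroke: True if most of the pixels it sits
    on moved less than max_disp since the previous frame."""
    magnitudes = np.linalg.norm(flow, axis=2)  # per-pixel displacement length
    stationary = []
    for stroke in strokes:
        pixels = list(rasterize(stroke))
        still = sum(1 for y, x in pixels if magnitudes[y, x] < max_disp)
        stationary.append(bool(pixels) and still >= min_fraction * len(pixels))
    return stationary
```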
5.2 Stroke Correspondence

Once strokes are labeled as "stationary," there is still the matter of determining correspondence between frames. This is complicated by the fact that parametrizations (i.e. the length and count of segments that make up a stroke) may differ, for example to the point where a stroke in one frame is replaced by two smaller ones. Thus, not only would a simple traversal of the stroke lists for two frames have O(N^2) complexity, it would also have to deal with cases such as this. In the end, it was decided to implement an approach similar to the one used in the original extraction of strokes. An "ID reference image" with one frame's strokes is drawn into a temporary buffer. Then, the control points of the other frame's stationary strokes are examined. If they overlap with another stroke, that stroke is transferred. Strokes that are completely overlapped by ones from previous frames are removed entirely.

Figure 4: Two adjacent frames from the source image sequence, the extracted strokes with no coherence matching attempted, and the result where stationary strokes are matched from frame to frame.

6 Rendering and Sample Output

Due to the blurring of the image prior to clustering, the clusters already have a stylized, "organic" appearance to them. However, the strokes are still just a set of points through which a polyline must pass, so some stylized effect must be applied to them. We chose a simple approach that repeatedly draws the same stroke while perturbing each control point by various amounts over each pass. By drawing only one pass with no perturbation, an effect akin to ink is achieved. A few (i.e. around five) passes with a thin line and a small amount of perturbation have the appearance of a sketch, while a thicker pen size with a large amount of perturbation and numerous passes results in the appearance of charcoal. Figure 3 shows these three styles in action.

Figure 3: Various rendering styles

The use of a random perturbation amount per control point has additional implications when considering image sequences. If the control points of a stationary stroke are perturbed by different amounts each frame, the "jitter" effect that was previously removed is now back, though this time it is completely artificially induced. The solution is to store a "seed" for the random number generator with each stroke, such that subsequent re-drawings of the stroke will preserve its appearance. This also makes accurate stroke correspondence even more important.

As an alternative to drawing the strokes in black on top of the clusters, it is also possible to make the strokes take on the color of the clusters they are near, for another stylized effect, as Figure 5 shows. Finally, to further enhance the output of the system, it is also possible to modulate the output with a paper texture.
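A sketch of the seeded multi-pass perturbation follows; stroke.seed and draw_polyline are assumed interfaces (the latter standing in for whatever line rasterizer is used), and the default pass count and magnitude merely echo the ink/sketch/charcoal presets described above.

```python
import random

def render_stroke(canvas, stroke, passes=5, magnitude=1.5, thickness=1):
    """Draw a stroke several times, jittering each control point per pass.

    One pass with zero magnitude gives an ink look; around five thin,
    slightly perturbed passes give a sketch; many thick, strongly
    perturbed passes approach charcoal.
    """
    # same per-stroke seed -> identical jitter every frame, so stationary
    # strokes keep their appearance across the sequence
    rng = random.Random(stroke.seed)
    for _ in range(passes):
        jittered = [(x + rng.uniform(-magnitude, magnitude),
                     y + rng.uniform(-magnitude, magnitude))
                    for x, y in stroke.control_points]
        draw_polyline(canvas, jittered, thickness)
```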
7 Performance

Though the aim of this system is not to process images in real time, performance is still crucial, because image sequences serve as input. For example, an initial version took 9 seconds per 640 x 480 frame on a 450 MHz Power Macintosh G4, which means that processing a one-minute 30 fps clip would take 4.5 hours. As mentioned above, processing algorithms were chosen such that they exhibited near-linear scalability wherever possible. Additionally, since the target platform has at least two CPUs, passes were multi-threaded wherever possible. Since the two halves of each image can be processed independently, with almost no communication or resource contention between threads, gains from multi-threading were near-ideal. The final performance was close to 4 seconds per frame, which still makes processing of long clips time-consuming, but manageable. Performance could be increased further by taking advantage of other target-platform specific features, such as the presence of a vector processing unit in the G4, which would allow up to 8x increases in performance for convolutions and other operations.

8 Interface

One of the issues with having a system with so many passes is that each step has its own collection of parameters that must be tweaked in order to obtain optimal results. An initial version of the system had a total of 20 sliders that controlled various thresholds, blurring amounts, etc. In order to simplify the interface, and to keep with the stated goal of making the system usable by non-experts, these were collapsed into three values:

- "Detail" (low to high) controls the amount of blurring, various distance thresholds, and neighborhood sizes for segment and stroke extraction.
- "Colors" (few to many) controls the number of clusters and the separation between them.
- "Motion" (slow to fast) controls the number of iterations and initial guesses for optical flow.

Controls for other parameters, such as the stroke rendering options described in Section 6, were retained, since they should be intuitive even to non-technical users.

9 Discussion

As can be seen from the sample outputs, the system was successful at achieving its goals. However, it is still dependent on having good input. Two classes of images were observed to be best suited for our stylization approach. First, buildings and other constructions have very straight edges and large flat color areas, which makes it very easy to determine outlines and cluster their colors, as demonstrated in Figure 7. One conceivable application of this property is to rapidly convert images of existing buildings into sketches, such that artistic renderings of proposed buildings can be more easily integrated into the current cityscape. A second type of image that worked very well was that of children's toys, since they are usually very brightly colored and thus have very definite edges. This brings up another possible application: something akin to the SnakeToonz [1] system that can be used by children. Simple scenes and skits starring toys can be filmed and then processed with the system, all requiring very little adult interaction.

In its current implementation, clustering also suffers slightly from a case of frame-to-frame incoherence. Since the cluster seeds are recalculated every single frame, it is possible that very similar colors will be assigned to different clusters, resulting in flicker here as well. One solution that was considered was to use the cluster values from the previous frame as the seeds for the current one. While this did alleviate flicker, it caused problems when objects of a completely new color entered the frame, since it took a "while" (i.e. several frames) before new cluster values representing those colors appeared.

10 Future Work

As mentioned, the coherence system currently only concerns itself with "jitter"; flicker is still an issue. To solve the problem of strokes disappearing and reappearing, the fact that the system has "perfect" knowledge of future events can be taken into account. For example, if a stationary stroke is not present in a frame, it can be checked whether it will reappear two, three, four, etc. frames in the future, and if that is the case, it can be copied over regardless of its present state.

To solve the above-mentioned cluster coherence issue, optical flow could be put to use again. When the overall optical flow in a frame exhibits a large amount of change, it could be interpreted as a sign that significant changes in color are taking place, and the cluster seeds could be re-initialized with the current scene contents.

The optical flow information that is currently obtained is in fact overkill for our needs, since all we care about is whether pixels have moved or not, while direction and magnitude are not really significant. Two paths exist here: either the movement detection could be simplified in order to reap performance benefits, or the extra information could be put to use. As an example of the latter, consider the addition of motion lines behind moving objects as a way to emphasize motion, as was done in 3D space in [7]. Stroke coherence could also be applied to slow-moving objects, where "jitter" is also likely to be disturbing.

Another issue that comes up is that, for a longer image sequence, the same parameters may not apply at all points in time. It should be possible to implement a keyframing scheme where parameters are chosen at "significant" intervals (significance could again be determined by sudden changes in overall optical flow) and then blended between those frames.

References

[1] Aseem Agarwala. SnakeToonz: A Semi-Automatic Approach to Creating Cel Animation from Video. NPAR 2002.
[2] John Canny. A Computational Approach to Edge Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 8, No. 6, Nov. 1986.
[3] Aaron Hertzmann and Ken Perlin. Painterly Rendering for Video and Interaction. Proceedings of NPAR 2000, 7-12.
[4] Michael A. Kowalski, Lee Markosian, J. D. Northrup, Lubomir Bourdev, Ronen Barzel, Loring S. Holden, and John F. Hughes. Art-Based Rendering of Fur, Grass, and Trees. Proceedings of SIGGRAPH 1999.
[5] Peter Litwinowicz. Processing Images and Video for an Impressionist Effect. Proceedings of SIGGRAPH 1997.
[6] Bruce D. Lucas and Takeo Kanade. An Iterative Image Registration Technique with an Application to Stereo Vision. Proceedings of the Imaging Understanding Workshop, 121-130, 1981.
[7] Maic Masuch and Stefan Schlechtweg. Speedlines: Depicting Motion in Motionless Pictures. SIGGRAPH 1999, 8-13 August 1999, Los Angeles.
[8] Maic Masuch, Lars Schuhmann, and Stefan Schlechtweg. Animating Frame-to-Frame-Coherent Line Drawings for Illustrative Purposes. Proceedings of Simulation und Visualisierung '98, pp. 101-112, SCS Europe, 1998.
[9] J. D. Northrup and Lee Markosian. Artistic Silhouettes: A Hybrid Approach. In Non-Photorealistic Animation and Rendering, ACM SIGGRAPH, June 2000.
[10] Rose H. Turi and Siddheswar Ray. K-means Clustering for Colour Image Segmentation with Automatic Detection of K. Proceedings of IASTED Signal and Image Processing (SIP '98), Las Vegas, Nevada, USA, 28-31 October 1998.

Figure 4: Demonstration of frame-to-frame coherence
Figure 5: Demonstration of stroke coloring as alternative stylization
Figure 6: More sample output
Figure 7: Demonstration of application on buildings