The roots of image based rendering
Wolfgang Illmeyer, 532, 0326382
June 1, 2006

Abstract

When we talk about rendering, especially in conjunction with widely available consumer 3D graphics accelerators and computer games, we almost always mean geometry based approaches. There is, however, also the image based approach, which works with light rays instead of geometry. This paper describes the roots of image based rendering up to the papers that uncovered the full potential of image based rendering techniques: Levoy and Hanrahan's "Light Field Rendering" [Levoy '96] and Gortler et al.'s "The Lumigraph" [Gortler '96].

Contents

1 Image based rendering
  1.1 About image based rendering
  1.2 Traditional / image based rendering tradeoff
2 Sprites
  2.1 About sprites
  2.2 Sprites in hardware
    2.2.1 The C64 and the VIC-II Chip
    2.2.2 The Amiga and the Agnus Chip
    2.2.3 Sprites in OpenGL based accelerators
  2.3 Sprites and image based rendering
3 QuickTime VR
  3.1 About QuickTime VR
  3.2 Panoramic images in QuickTime VR ("Panorama movies")
    3.2.1 Creation
    3.2.2 Storage
    3.2.3 Viewing
  3.3 3D object viewing in QuickTime VR ("Object movies")
    3.3.1 Creation
    3.3.2 Storage
    3.3.3 Viewing
  3.4 QuickTime VR and image based rendering
4 The plenoptic function
5 Light field rendering
  5.1 The light slab
  5.2 Light field creation
  5.3 Light field storage
  5.4 Light field display
6 Lumigraph
  6.1 Free handed camera
  6.2 Surface approximation
  6.3 Depth correction

1 Image based rendering

1.1 About image based rendering

Traditional geometry based rendering approaches rely on some kind of model of the geometry to be drawn. This is often a problem when an object from the real world needs to be modeled. There are several acquisition techniques, such as coordinate measuring machines (CMM) or laser scanners, but they only deliver surface points, so there is still a long way to go until a complete model of the object exists: the surface has to be reconstructed, either by adding edges for a polygonal model or by fitting some kind of spline patches, and at that point there is still no texture.

Image based techniques allow for easier modeling of real world objects. To create a model for image based rendering, we just need a set of photos of the object, captured at specific positions around it. The resulting data still has to be processed for storage and rendering speed reasons, but the model is already complete at that point.
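As a rough illustration of such a capture setup (this sketch is not taken from any of the cited papers; the hemispherical layout, the angular step and all names are merely illustrative assumptions), camera positions covering the space around an object could be generated like this:

import math

def capture_positions(radius, step_deg=10):
    """Illustrative sketch: generate camera positions on a hemisphere of the
    given radius around an object at the origin, spaced step_deg degrees apart
    in azimuth and elevation. Every camera looks at the origin."""
    positions = []
    for elev in range(0, 91, step_deg):            # 0 = equator, 90 = zenith
        for azim in range(0, 360, step_deg):
            e, a = math.radians(elev), math.radians(azim)
            x = radius * math.cos(e) * math.cos(a)
            y = radius * math.cos(e) * math.sin(a)
            z = radius * math.sin(e)
            positions.append(((x, y, z), (0.0, 0.0, 0.0)))  # (position, look-at point)
    return positions

Each entry pairs a camera position with the point it looks at; a real capture setup additionally has to record these poses precisely, since they are needed later for rendering.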
Back in the days when polygonal 3D accelerators were not widely available, "scaled down" image based rendering approaches were used a lot in consumer hardware. Virtually every computer game used sprites in some way. Aside from games, there is also Apple's QuickTime VR, which was designed to deliver a "virtual reality" experience on consumer devices with limited computing power and without any special acceleration or input hardware. Both are image based rendering approaches, because no geometry data is used for rendering in either case.

1.2 Traditional / image based rendering tradeoff

Whereas geometry based rendering relies heavily on computing power for better image quality or more complex models, this is not the case with image based rendering approaches. If we want better image quality, we have to improve the quality of the source image data. The computing power needed for one image of a given size is always the same, no matter how complex the scene is. The only exception is the choice of interpolation technique: to prevent aliasing, we can interpolate in 2D or 4D, which of course slows down the rendering process. However, image based rendering techniques do rely heavily on storage space. If we want to render larger images, we need source material in higher resolutions, which is limited mostly by the available RAM.

2 Sprites

2.1 About sprites

Sprites are small rectangular bitmaps which can be partially transparent. They contain no geometry data. Sprites saw heavy use in computer games, but they are also used for special effects in movies. Sprites have the advantage that they can be accelerated very easily, so many home computers and gaming consoles supported accelerated sprite rendering. Support for dedicated sprite hardware has become less important in recent years, because nearly every device has polygonal rendering acceleration, which covers sprites as a special case. Also, computing power is no longer an issue these days, so acceleration would not even be needed for sprite rendering.

Sprites are used where classic polygonal rendering delivers bad results. In terms of image quality, flames are the best example: they look quite strange when rendered as polygons, but have a quite natural look when rendered as a sprite. When there are a lot of "unimportant" geometry details, as is the case with grass in outdoor scenes, polygonal rendering would be too slow. The viewer does not pay attention to every single blade of grass, so it can be rendered more efficiently using sprites, at a fraction of the polygon count.

2.2 Sprites in hardware

2.2.1 The C64 and the VIC-II Chip

The C64 was a popular 8-bit home computer by Commodore. For rendering sprites, it used the "VIC-II" chip by MOS Technology [Bauer '96]. Up to eight sprites could be defined in the 16 KB bank of RAM visible to the chip. The chip only needed to be told which of the sprites should be drawn where, and all of that could be changed on every scanline. The VIC-II had no framebuffer; it rendered directly to the PAL/NTSC output.

2.2.2 The Amiga and the Agnus Chip

The Amiga [Wikipedia], also sold by Commodore, was another popular home computer, heavily used for video editing. The Amiga's graphics chip, called "Denise", used a simple kind of framebuffer, which could be accessed by the "Agnus" chip that controlled the RAM. "Agnus" contained a special blitting unit, which enabled it to copy sprite data to the framebuffer while the CPU could do something else.
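To illustrate the kind of operation such a blitter performs in hardware, here is a minimal software sketch of copying a sprite with transparent pixels into a framebuffer (plain Python with illustrative names; real hardware blitters work on bitplanes and masks rather than per-pixel color comparisons):

def blit_sprite(framebuffer, sprite, x0, y0, transparent=0):
    """Illustrative sketch: copy sprite (a 2D list of color values) into
    framebuffer at position (x0, y0), skipping pixels with the transparent color."""
    height, width = len(framebuffer), len(framebuffer[0])
    for sy, row in enumerate(sprite):
        for sx, color in enumerate(row):
            if color == transparent:
                continue                      # keep the background pixel
            x, y = x0 + sx, y0 + sy
            if 0 <= x < width and 0 <= y < height:
                framebuffer[y][x] = color

The point of dedicated hardware is that exactly this kind of copy loop runs without occupying the CPU.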
2.2.3 Sprites in OpenGL based accelerators

Today's 3D accelerators can draw sprites as fast as any polygon. On OpenGL based accelerators, sprites are simply textured quadrilaterals whose normal faces the viewer. In the context of 3D scenes, this technique is called "billboarding".

2.3 Sprites and image based rendering

A sprite is exactly as powerful as a single captured image: there is only one viewpoint for the scene and only one viewing direction. This is the most basic form of image based rendering. We can add more viewpoints by making multiple images/sprites of the same object; this was actually used in "Doom" to simulate different views of the monsters. Besides that, we can also perform simple transforms on sprites, although not all of the acceleration methods mentioned above support them. Sprites can be rotated around the viewing axis, and they can be zoomed. Other transforms (e.g. shear) are possible too, but in the context of image based rendering they are not significant.

Figure 1: Example sprites of Doom by id Software for different viewpoints

3 QuickTime VR

3.1 About QuickTime VR

QuickTime VR [Chen '95] was created by Apple Computer Inc. as a cheap virtual reality extension to their "QuickTime" multimedia framework. "Virtual reality" in this context means either rotating the camera at a fixed viewpoint, so that one can look around almost freely in a pre-captured environment, or rotating the camera around an object. QuickTime VR provides complete means to create, distribute and view such scenes. It includes software for authoring, streaming and displaying virtual reality content. The most important design goal of QuickTime VR was an affordable system, so that the rendering could take place on ordinary personal computers without the need for too much storage space or an especially fast CPU. Because 3D input and output devices are also quite expensive, there is no support for 3D helmets or data gloves. These requirements impose restrictions on what can be done in QuickTime VR.

3.2 Panoramic images in QuickTime VR ("Panorama movies")

Figure 2: Example cylinder environment map stitched together from multiple photos

3.2.1 Creation

There are several ways to create panoramic pictures. We can stitch together multiple images captured with a rotatable camera whose rear nodal point of the lens lies on the rotation axis. For this purpose, a so-called fisheye lens can be used: fisheye lenses have a field of view of about 180 degrees, so only two images need to be captured for a full panorama. Another possibility is to use special panoramic cameras.

To create a model of the panorama, the image material needs to be projected onto a surface which can be easily parameterized, such as a sphere, a cube or a cylinder. QuickTime VR uses a cylinder without caps. This limits the vertical viewing angle, but it makes projection, reprojection and storage easier and faster.

3.2.2 Storage

What essentially needs to be stored for a panoramic image is the cylinder, which can be cut and unrolled into a plane, so it is actually just the image of a cylindrical environment map. In QuickTime, this image is split into tiles, and each tile is saved as a separate, compressed image in a QuickTime video track.

3.2.3 Viewing

To view a panoramic image, the required tiles are extracted from the video track and reprojected onto the screen.
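The reprojection can be sketched as follows: for every screen pixel, the viewing ray is converted into cylinder coordinates (a rotation angle and a height), which tell us where to sample the stored cylindrical map. This is only an illustrative sketch under simplifying assumptions (pinhole camera, unit-radius cylinder, no tiling), not the actual QuickTime VR warping code:

import math

def cylinder_lookup(px, py, view_w, view_h, yaw, fov_deg, map_w, map_h, v_range=1.0):
    """Illustrative sketch: map screen pixel (px, py) of a view_w x view_h view
    with horizontal field of view fov_deg, rotated horizontally by yaw radians,
    to pixel coordinates in a cylindrical environment map of size map_w x map_h.
    The cylinder has radius 1 and covers heights -v_range .. +v_range."""
    focal = (view_w / 2) / math.tan(math.radians(fov_deg) / 2)
    dx = px - view_w / 2          # ray through the pixel; camera looks along +z
    dy = py - view_h / 2
    dz = focal
    angle = (yaw + math.atan2(dx, dz)) % (2 * math.pi)   # column angle on the cylinder
    v = dy / math.hypot(dx, dz)                          # height where the ray reaches radius 1
    u_map = angle / (2 * math.pi) * map_w
    v_map = (v + v_range) / (2 * v_range) * map_h
    return u_map, v_map

Running this for every pixel of the output view and sampling the map at the returned coordinates yields the reprojected image.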
Tiles adjacent to the current viewing direction are preloaded and cached in order to provide real-time viewing.

Figure 3: Reprojection using a cube and a sphere

3.3 3D object viewing in QuickTime VR ("Object movies")

3.3.1 Creation

Object movies are created by taking photos of an object from different angles at a fixed distance from its center. We could, for example, take a photo every 10 degrees vertically and horizontally. This is achieved with a special apparatus consisting of a rotatable platform and a meridian arc on which the camera can be mounted at different positions.

For object movies, lighting is especially important. Lighting can either be fixed relative to the camera's position or relative to the object's rotation. With camera-attached lighting, the viewer gets the impression of staying at a fixed position while rotating the object, whereas with object-attached lighting, the viewer seems to walk around an object that stays in a fixed place.

3.3.2 Storage

QuickTime VR uses the same storage pattern for object movies as for panorama movies, but this time there already are multiple pictures, so nothing needs to be split into tiles.

3.3.3 Viewing

For viewing, the corresponding frame is extracted from the video and shown. Adjacent frames can be preloaded and cached to allow for real-time viewing.

3.4 QuickTime VR and image based rendering

Panorama movies go one step beyond sprite rendering. There is still only one possible viewpoint, but the viewing direction can be chosen freely (with the exception of the extreme vertical viewing angles, because the caps of the cylinder map are not saved). There can be multiple viewing points, each with its own panorama. In QuickTime VR, panoramas can be linked together, so that the viewer can click on the next viewing point in the panorama to advance. The images could also be rotated around the viewing axis, but this is not supported by QuickTime VR. The object movie approach is essentially equivalent to sprite rendering. Rotation around the view axis would be possible here too, but is not implemented in QuickTime VR.

4 The plenoptic function

The plenoptic function is the basis of any image based rendering approach. The name combines the Latin "plenus", meaning "complete" or "full", with "optic", "pertaining to vision". It describes the radiance arriving from any direction (θ, φ) at any viewing point (Vx, Vy, Vz), at any time t and at any wavelength λ, independent of where the light originated:

    R = P(θ, φ, λ, Vx, Vy, Vz, t)    (1)

For affordable image based rendering approaches, the plenoptic function needs to be sampled in an efficient way. Current approaches reduce the plenoptic function so that it includes neither time nor the individual wavelengths. The reduced plenoptic function looks like this:

    RGB = P(θ, φ, Vx, Vy, Vz)    (2)

Figure 4: Illustration of the plenoptic function

Of the original 7 dimensions, only 5 are left, and the function returns an RGB value rather than the radiance at a single wavelength. However, even this parameterization is not feasible for practical applications, because 5D sampling generates far too much data. Current applications place an additional restriction on the scene: when all light rays originate from a convex region and the viewpoint is outside of that region (in "occluder-free space"), a 4D parameterization of the plenoptic function is sufficient.
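The reason four dimensions are enough is that in occluder-free space the radiance is constant along a ray, so only the ray itself has to be identified, not the point from which it is observed. One way to do that, anticipating the light slab of Section 5.1, is to record where the ray crosses two fixed parallel planes. The following sketch (plane placement and names are illustrative assumptions) turns an eye position and a viewing direction into such a 4D coordinate:

def ray_to_slab_coords(eye, direction, z_uv=0.0, z_st=1.0):
    """Illustrative sketch: intersect the ray eye + t * direction with the
    planes z = z_uv (the (u, v) plane) and z = z_st (the (s, t) plane) and
    return the four intersection coordinates (u, v, s, t)."""
    ex, ey, ez = eye
    dx, dy, dz = direction
    if dz == 0:
        raise ValueError("ray is parallel to the two planes")
    t_uv = (z_uv - ez) / dz
    t_st = (z_st - ez) / dz
    u, v = ex + t_uv * dx, ey + t_uv * dy
    s, t = ex + t_st * dx, ey + t_st * dy
    return (u, v, s, t)

Two different viewpoints on the same ray map to the same (u, v, s, t), which is exactly where the fifth dimension disappears.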
5 Light field rendering

The light slab structure provides a means of creating, storing and rendering pre-captured objects efficiently with image based rendering methods. It offers a freely choosable viewpoint and viewing direction. However, the lighting of the model is static and there is no concept of time, so the model cannot be lit dynamically or animated in any way. The Light Field paper not only describes the light slab structure, but also suggests methods for creating, storing and rendering light fields.

5.1 The light slab

The light slab is a structure to efficiently store a subset of the plenoptic function. It consists of two planes which are parameterized by (u, v) and (s, t). Every ray of light that passes through both planes can be stored in the light slab and referenced by (u, v, s, t), where (u, v) and (s, t) are the intersection points with the two planes.

Figure 5: The light slab representation

Using light slabs, we only have a 4D parameterization of the "ray space", in contrast to the 5D parameterization of the reduced plenoptic function. This requires unobstructed, free space, though. To create a light field model of a given object, we have to enclose it in light slabs so that it can be viewed from any point outside the model.

5.2 Light field creation

Light fields can be created from virtual scenes as well as from photographs of real objects. For synthetic scenes, the raytracer software only needs to be modified slightly to output the light slabs directly, by tracing every ray that crosses any pair of discrete sampling points of the (u, v) and (s, t) planes. If the synthetic scene is not rendered using raytracing, we can produce a set of 2D images instead: the virtual camera is placed at every sampling point of the (u, v) plane and looks at the (s, t) plane. However, the pixels of the rendered image must correspond exactly to the sampling points of the (s, t) plane, which can be achieved with a sheared perspective projection. The resulting 4D light slab can be visualized as a (u, v) array of images, each of size s × t, or the other way round as an (s, t) array of images of size u × v, as seen in Figure 6.

To create a light field of a real world object, it needs to be photographed from different known positions in order to create light slabs. The Light Field paper suggests using a special gantry which can move the camera horizontally and vertically in a plane and additionally adjust the camera's pitch and yaw so that it always points at the center of the object. This ensures full coverage of the (u, v) and (s, t) planes. After the whole plane of the gantry has been scanned, the object and its lighting are rotated by 90 degrees and the process is repeated until four complete light slabs have been captured.

Figure 6: Visual representations of a light slab

Figure 7: Gantry for creating a light field

5.3 Light field storage

As a light slab is actually a 4D image, it naturally uses up a lot of storage space. The largest light field example in the paper takes up 1.6 GB in uncompressed form. The data in light slabs is also highly redundant, so a compression scheme is needed. The authors set up a compression pipeline for light fields consisting of lossy and lossless compression stages, which allows for a compression ratio of about 120:1. As a first step, the light field array is compressed with vector quantization: the light slabs are split into either 2D or 4D tiles, yielding 12- or 48-dimensional vectors (a 2 × 2 tile with three color channels gives 12 dimensions, a 2 × 2 × 2 × 2 tile gives 48).
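A minimal sketch of this tiling step for 2D tiles (using numpy; the array layout with the color channel last is an assumption, not the layout used in the paper) could look like this:

import numpy as np

def slab_to_vectors(slab, tile=2):
    """Illustrative sketch: split a light slab stored as a (U, V, S, T, 3) RGB
    array into 2D tiles of tile x tile pixels in the (s, t) dimensions and
    return them as vectors of length tile * tile * 3 (12 for 2 x 2 tiles)."""
    U, V, S, T, C = slab.shape
    assert S % tile == 0 and T % tile == 0
    tiled = slab.reshape(U, V, S // tile, tile, T // tile, tile, C)
    tiled = tiled.transpose(0, 1, 2, 4, 3, 5, 6)   # bring the two tile axes together
    return tiled.reshape(-1, tile * tile * C)      # one row per tile

These vectors are the input to the codebook training described next.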
A representative subset of the tiles is then used for training, which means deriving a small set of representative vectors that matches the training set with the least mean squared error. This set of representatives is called the "codebook". Once the codebook has been generated, compression can start: each tile of the light slab is replaced by the index of the codebook entry it matches best. The paper uses 14-bit indices which are padded to 16 bits to simplify the decoding process. This first compression stage achieves a compression ratio of 24:1.

The result of the vector quantization still contains a lot of redundancy, for example in areas of constant background color. The second stage removes that redundancy by entropy coding; the gzip implementation of Lempel-Ziv coding is used for this. gzip typically reduces the size of the light slab indices by another factor of about 5:1.

5.4 Light field display

Figure 8: Rendering an image from a light slab

To view a light field, it has to be loaded into memory: first, the gzip-compressed indices and the codebook are decompressed and stored in RAM. The renderer then traces a ray from the eye point through the view plane and through the (u, v) and (s, t) planes. It then looks up the codewords needed for the current ray in the codeword array and gets the corresponding vectors from the codebook. Due to the discrete nature of the (u, v) and (s, t) sampling points, adjacent rays have to be interpolated when rendering an image to prevent aliasing effects. This can be done in the (u, v) plane only, but quadrilinear interpolation over both planes (u, v, s, t) delivers the best results.

6 Lumigraph

The Lumigraph paper by Gortler et al. [Gortler '96] discusses ideas similar to those of the light field paper and introduces a system that makes it easier to capture real world objects for the creation of a light field, or Lumigraph, as they call it. It eliminates the need for computer-controlled or -assisted camera positioning in favor of a hand-held camera, so that bigger objects can also be captured. The Lumigraph also introduces depth correction based on a surface approximation, as well as a method to fill gaps in the model where too little data has been captured.

6.1 Free handed camera

To allow the use of a free handed camera, the camera must be calibrated. This is a two-step process. Intrinsic parameter calibration determines how all rays captured by the camera are positioned relative to the viewpoint. The Lumigraph system uses a camera with a fixed lens, so the intrinsic calibration only has to be performed once. Extrinsic parameter calibration extracts the pose of the camera (its position and orientation) from the image. Because the pose of the camera changes with every frame, this step is performed repeatedly. To assist this process, the images are taken on a special stage consisting of three fixed, orthogonal walls. The walls are all colored cyan to help with the surface approximation stage, and for parameter calibration they bear 30 markers, each consisting of up to three concentric circles of different sizes. The markers are used for both intrinsic and extrinsic parameter calibration; for extrinsic calibration, only 8 markers need to be visible at any time. Using this stage, the whole upper hemisphere around the object can be captured.

6.2 Surface approximation

The Lumigraph system uses a surface approximation for the depth correction step described below. The approximation is obtained by constructing an octree.
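The carving procedure described in the next paragraph can be sketched roughly as follows; this is a strong simplification (a flat list of candidate cells instead of a hierarchical octree, idealized 3 x 4 camera matrices, illustrative names), not the algorithm from the Lumigraph paper:

import numpy as np

def carve(cell_centers, silhouettes, cameras):
    """Illustrative sketch of silhouette carving.
    cell_centers: (N, 3) world-space centers of candidate cells.
    silhouettes:  list of H x W boolean masks (True = object pixel).
    cameras:      list of 3 x 4 projection matrices, one per silhouette.
    A cell is kept only if it projects onto the object in every view."""
    keep = np.ones(len(cell_centers), dtype=bool)
    homog = np.hstack([cell_centers, np.ones((len(cell_centers), 1))])
    for mask, P in zip(silhouettes, cameras):
        proj = (P @ homog.T).T
        uv = proj[:, :2] / proj[:, 2:3]            # perspective divide
        u = np.round(uv[:, 0]).astype(int)
        v = np.round(uv[:, 1]).astype(int)
        h, w = mask.shape
        inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
        hit = np.zeros(len(keep), dtype=bool)
        hit[inside] = mask[v[inside], u[inside]]
        keep &= hit
    return keep

The actual system applies such a test to octree cells and refines them over several subdivision levels, as described next.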
To carve the octree, a subset of the captured images is segmented into object and background. The segmented images are then used, together with their camera pose information, to cut the shape of the object out of the octree. After about 4 subdivisions of the octree, the resulting 3D model can be hollowed out and passed on to depth correction.

Figure 9: Markers for camera calibration on a stage for data acquisition

Figure 10: Segmented image and surface approximation

6.3 Depth correction

If we wanted to approximate the ray (s, u) in a 2D Lumigraph (see Figure 11), we would most likely choose the ray (s_{i+1}, u_p) for looking up the color of the corresponding pixel, because it is the nearest one to (s, u). However, given the depth information z extracted from the previous step, we can see that, for example, the ray (s_i, u′) is much nearer to the point where (s, u) intersects the object's surface than (s_{i+1}, u_p). We can calculate a u′ for any s_i so that (s_i, u′) intersects the surface at the same place as the original ray. If z is normalized so that z = 0 means the object surface lies on the (u, v) plane and z = 1 means it lies on the (s, t) plane, then u′ can be calculated as follows:

    u′ = u + (s − s_i) · z / (1 − z)

Figure 11: Depth correction

References

[Levoy '96] Marc Levoy, Pat Hanrahan. Light Field Rendering. Computer Graphics Proceedings, Annual Conference Series (Proc. SIGGRAPH '96), pages 31-42, 1996.

[Gortler '96] Steven J. Gortler, Radek Grzeszczuk, Richard Szeliski, Michael F. Cohen. The Lumigraph. Computer Graphics Proceedings, Annual Conference Series (Proc. SIGGRAPH '96), pages 43-54, 1996.

[McMillan & Bishop '95] Leonard McMillan, Gary Bishop. Plenoptic Modeling. Computer Graphics Proceedings, Annual Conference Series (Proc. SIGGRAPH '95), pages 39-46, 1995.

[Chen '95] Shenchang Eric Chen (Apple Computer Inc.). QuickTime VR - an image-based approach to virtual environment navigation. Computer Graphics Proceedings, Annual Conference Series (Proc. SIGGRAPH '95), pages 29-38, 1995.

[Bauer '96] Christian Bauer. The MOS 6567/6569 video controller (VIC-II) and its application in the Commodore 64. 1996. http://www.minet.uni-jena.de/~andreasg/c64/vic_artikel/vic_article_1.htm

[Wikipedia] Wikipedia: Amiga. http://de.wikipedia.org/w/index.php?title=Amiga&id=17159946