Diploma Thesis
Erich Semlak / 9655661

Distributed Event-Based Video Capturing and Streaming

Diploma thesis for the attainment of the academic degree Diplomingenieur in the diploma programme Informatik (Computer Science).
Written at the Institute for Telecooperation.
Submitted by: Erich Semlak
Supervision: Prof. Dr. Gabriele Kotsis
Assessment: Prof. Dr. Gabriele Kotsis
Linz, June 2005

Acknowledgement

I would like to thank …
… Prof. Dr. Gabriele Kotsis for her fair and constructive comments,
… Mag. Reinhard Kronsteiner for his supportive patience and understanding,
… Margot for bearing with me while I was developing and writing this thesis,
… Gunther for his hints and vocabulary,
… Oliver for his encouraging words,
… my colleagues for making me finish my studies.

Abstract

Diese Arbeit erläutert einen Lösungsansatz für Echtzeit-Videoverarbeitung mit besonderem Schwerpunkt auf verteilte(s) Capturing, Verarbeitung und Streaming. Das entwickelte System erörtert die Möglichkeiten heutiger Standard-PC-Hardware, CPU-Rechenleistung und Ethernet-Netzwerke. Es basiert auf einer weitverbreiteten und flexiblen Multimedia-Architektur (DirectShow), um Erweiterungen zu vereinfachen und eine große Anzahl an bereits vorhandenen Komponenten zu eröffnen. Das entwickelte System kann für Überwachungs- oder Sendezwecke eingesetzt werden, abhängig von den Anforderungen an Qualität und Bildauflösung. Es ist begrenzt skalierbar in Bezug auf die Anzahl der Videoeingänge und die Videoauflösung und kann unter Windows ohne zusätzliche Kompressionshardware oder -software betrieben werden.

This thesis presents an approach to realtime video processing with special emphasis on distributed capturing, processing and subsequent streaming. The developed system explores the capabilities of standard PC hardware, CPU processing power and Ethernet networking. It is based on a widespread and universal multimedia architecture (DirectShow) to ease further expansion and to open up a wide range of already available components. The developed system is applicable to surveillance or broadcasting tasks, depending on the quality and resolution requirements. Its scalability is limited with regard to the number of video inputs and the video resolution, and it can be run under Windows without extra compression hardware or expensive software.

Index

1. Introduction
1.1. Motivation
1.2. Outline of the Work
2. Theoretical Part
2.1. Digital Video
2.1.1. History
2.1.2. Fundamentals
2.2. Video Capturing
2.2.1. Analogue-Digital-Conversion
2.2.2. Capturing Hardware
2.2.3. Relation to this work
2.3. Video Compression
2.3.1. Introduction
2.3.2. Discrete Cosine Transform
2.3.3. Non-linear editing
2.3.4. Video Compression Formats
2.3.5. Relation to this work
2.4. Video Streaming
2.4.1. Video Streaming Formats
2.4.2. Streaming Methods
2.4.3. Streaming Method Considerations
2.4.4. Protocols
2.4.5. Protocols considerations
2.5. Software
2.5.1. Commercial
2.5.2. Shareware
2.6. Techniques and SDKs
2.6.1. Microsoft DirectShow
2.6.2. Video For Windows
2.6.3. Windows Media Series 9
3. Practical Part
3.1. Problem
3.1.1. Model
3.1.2. Applications
3.2. Technology Evaluation
3.2.1. Evaluation Aspects
3.2.2. DirectShow SDK
3.2.3. Evaluation of DirectShow
3.2.4. Windows Media SDK
3.2.5. Evaluation of Windows Media SDK
3.2.6. Combination of DirectShow and Windows Media
3.2.7. Video for Windows
3.2.8. From Scratch
3.2.9. Overview
3.2.10. Conclusion
3.3. Implementation Considerations
3.4. Filter Development
3.4.1. Motion Detection Filter
3.4.2. Object Detection Filter
3.4.3. Network Sending/Receiving Filter
3.4.4. Infinite Pin Tee Filter
3.4.5. Web Streaming Filter
3.5. Applications
3.5.1. Slave Application
3.5.2. Master Application
3.6. Test & Conclusion
3.6.1. Development
3.6.2. Capacity and Stress Test
4. Recent Research
4.1. Video Capturing and Processing
4.1.1. Object Tracking
4.1.2. Object Detection and Tracking Approaches
4.1.3. Object Tracking Performance
4.1.4. Motion Detection
4.1.5. Object Tracking Using Motion Information
4.1.6. Summary
4.1.7. Relation to this work
4.2. Video Streaming
4.2.1. Compression vs. Delivery
4.2.2. Compression
4.2.3. Streaming
4.2.4. Relation to this work
5. Final Statement
6. Sources
6.1. Books & Papers
6.2. URLs
7. Pictures
8. Code samples
9. Tables
A. A Short History Of DirectShow
A.1. DirectShow Capabilities
A.2. Supported Formats in DirectShow
A.3. Concepts of DirectShow
A.4. Modular Design
A.5. Filters
A.5.1. Filter Types
A.5.2. Connections between Filters
A.5.3. Intelligent Connect
A.6. Filter Graphs
A.7. The Life Cycle of a Sample
A.8. GraphEdit
B. Programming DirectShow
B.1. Writing a DirectShow Application
B.2. C# or C++?
B.3. Rewriting DirectShow interfaces for C#
B.4. Initiating the filter graph
B.5. Adding Filters to the Capture Graph
B.6. Connecting Filters through Capture Graph
B.7. Running the Capture Graph
B.8. Writing new DirectShow Filters
C. Sources for Appendix A and B
C.1. Books & Papers
C.2. URLs

1. Introduction

"A picture is worth a thousand words."

1.1. Motivation

If a picture is worth a thousand words, what about the value of video?

During the 1990s, the personal computer underwent a radical transformation. At the beginning of that decade it was primarily an information processing device, suitable only for word processing and spreadsheets. By the turn of the century (and millennium), the PC had become a media machine, playing music, movies, games and DVDs, and streaming live reports from the web.

Part of the reason for this expansion of the PC's possibilities came from the exponential growth in processing power, memory and storage, following Moore's Law [URL32], which effectively states that CPU speed doubles every 18 months. Even with today's high-performance CPUs there are still problems which would take too long to be solved on a single computer. Such problems are solved by distributed systems [URL33], which share the workload among several computers.

Complex video processing tasks, such as video compression and object detection, still exceed the capabilities of a single high-end 3.5 GHz personal computer, especially when handling multiple high-resolution video sources. A distributed approach therefore seems appropriate for solving such problems while still using inexpensive off-the-shelf PC hardware.

Today there are few systems which use a distributed approach for video processing based on common PC hardware. Some of them use parallel systems for video effect insertion [MAY1999]; others try to parallelize video content analysis (e.g. motion and object detection) [LV2002]. There are also systems which use distributed video input, such as the DIVA system [TRI2002], which observes a remote scene for incidents. None of these systems offers the whole range of functions: capturing multiple (remote) sources, processing them (effects, motion and object detection) based on an analysis of the video content, and streaming the result out onto a network.

This thesis shows an approach to realtime video processing with special emphasis on distributed capturing, processing and subsequent streaming. The developed system explores the capabilities of standard PC hardware, CPU processing power and Ethernet networking. It is based on a widespread and universal multimedia architecture (DirectShow) to ease further expansion and to open up a wide range of already available filters. This makes it possible to use different compression types and formats without having to write new components.

These qualities are unique in comparison with other products (see chapter 2.5), which are harder to extend because they require expensive SDKs and rely on proprietary architectures. The capability to combine distributed processing with standard hardware and a standard multimedia software architecture can hardly be found even among commercial products.

The developed system is applicable to surveillance or web broadcasting tasks, depending on the quality and resolution requirements. Its scalability is limited with regard to the number of inputs and the video resolution, and it can be run under Windows without extra compression hardware or expensive software.

1.2. Outline of the Work

To fathom the possibilities of the hardware and of appropriate software systems, an application had to be developed that allows scalable use of today's common hardware and networks for simultaneous capturing and video analysis on multiple PCs and subsequent streaming to the Internet in a convenient and stable manner. A distributed hierarchical approach will be proposed. Detailed problem and model considerations can be found in chapter 3.1.

To this end, different software approaches have been evaluated in terms of modular design and openness towards future technologies. Among these technologies, Microsoft's DirectShow (see chapter 2.6) has been chosen as the most appropriate basis for developing a system which fulfills the stated requirements. Chapter 2 provides the theoretical foundation needed to understand the background of video capturing and streaming.

Before starting development, an in-depth look into DirectShow was necessary to understand its internals and the techniques for developing new components. Using C# to handle DirectShow filter graphs (see chapter A.6) in particular turned out to be difficult, because DirectShow is intended to be used from C++.

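To give an impression of what this interop work involves, the following is a minimal, simplified sketch (not the exact declarations developed for the thesis): every DirectShow COM interface that is needed from managed code has to be redeclared by hand, with its methods in exact vtable order, before a filter graph can be created and used from C#.

using System;
using System.Runtime.InteropServices;

// Simplified sketch: redeclaring a DirectShow COM interface for use from C#.
// The GUIDs are the published IID of IGraphBuilder and the CLSID of the
// standard Filter Graph Manager; a complete declaration must list every
// method of the interface in vtable order, not just the two shown here.
[ComImport, Guid("56a868a9-0ad4-11ce-b03a-0020af0ba770"),
 InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
public interface IGraphBuilder
{
    [PreserveSig]
    int AddFilter([In, MarshalAs(UnmanagedType.Interface)] object filter,
                  [In, MarshalAs(UnmanagedType.LPWStr)] string name);

    [PreserveSig]
    int RemoveFilter([In, MarshalAs(UnmanagedType.Interface)] object filter);

    // ... the remaining IFilterGraph and IGraphBuilder methods follow here ...
}

public static class GraphFactory
{
    // CLSID of the Filter Graph Manager COM object.
    private static readonly Guid FilterGraphClsid =
        new Guid("e436ebb3-524f-11ce-9f53-0020af0ba770");

    // Managed equivalent of CoCreateInstance: creates the filter graph
    // object and casts it to the hand-written interface declared above.
    public static IGraphBuilder CreateGraph()
    {
        return (IGraphBuilder)Activator.CreateInstance(
            Type.GetTypeFromCLSID(FilterGraphClsid));
    }
}

Chapter B.3 describes how the required interfaces were rewritten in this way.
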
In the course of this cumbersome research a DirectShow tutorial emerged, which should help other developers to use C# and DirectShow together more easily. This tutorial can be found in chapter B. After writing the needed interfaces for C# and DirectShow, two applications (Master and Slave, chapter 3.5) have been created to perform the needed tasks (see chapter 3.1 for the problem description) for the distributed video processing approach. For this purpose some new DirectShow filters (chapter 3.4) had to be developed, which can also be used by other applications based on DirectShow.

These applications have then been tested with different input, output and hardware configurations (chapter 3.6). The results show CPU and network workloads for up to four video sources, allow an estimate for higher input counts, and show where bottlenecks limit the scalability of the developed system. Finally, recent research in the field of video processing has been compared with the results of this thesis (chapter 4).

2. Theoretical Part

2.1. Digital Video

The recording and editing of sound has long been in the domain of the PC, but doing the same with moving video has only recently gained acceptance as a mainstream PC application. In the past, digital video work was limited to a small group of specialist users, such as multimedia developers and professional video editors, who were prepared to pay for expensive and complex digital video systems. It was not until 1997, when CPUs reached the 200 MHz mark after several years of intense technological development, that the home PC was powerful enough for the tasks to come.

2.1.1. History

In the early 1990s, a digital video system capable of capturing full-screen video images would have cost several thousand euros. The biggest cost element was the compression hardware, needed to reduce the huge files that result from the conversion of an analogue video signal into digital data to a manageable size. Those expansion cards contained several processors and exceeded the PC's own processing power several times over, as they did its price.

Less powerful "video capture" cards were available, capable of compressing quarter-screen images (320x240 pixels), but even these were far too expensive (more than $2000) for the average PC user. The end user market was limited to basic cards that could capture video, but which had no dedicated hardware compression features of their own. These low-cost cards relied on the host PC to handle the raw digital video files they produced, and the only way to keep file sizes manageable was to drastically reduce the image size. This often meant capturing the video data uncompressed to hard disk arrays (to handle the approx. 33 MB/sec of S-VHS quality, which needed the full bandwidth of an EISA bus) and compressing it afterwards, which took several hours for a few minutes of material.

Until the arrival of the Pentium processor in 1993, even the most powerful PCs were limited to capturing images of no more than 160x120 pixels. For a graphics card running at a resolution of 640x480, a 160x120 image filled just one-sixteenth of the screen. As a result these low-cost video capture cards were generally dismissed as little more than toys, incapable of any worthwhile real-world application. The turning point for digital video systems came as processors finally exceeded 200 MHz. At this speed, PCs could handle images up to 320x240 without the need for expensive compression hardware.

The advent of the Pentium II and ever-increasing processing power made video capture cards offering less than full-screen capability virtually redundant, and by the autumn of 1998 there were several consumer-oriented video capture devices on the market which provided full-screen video capture for as little as a few hundred euros.

2.1.2. Fundamentals

Understanding what digital video is first requires an understanding of its ancestor: broadcast television, or analogue video. The invention of radio demonstrated that sound waves can be converted into electromagnetic waves and transmitted over great distances to radio receivers. Likewise, a television camera converts the color and brightness information of individual optical images into electrical signals to be transmitted through the air or recorded onto video tape. Similar to a movie, television signals are converted into frames of information and projected at a rate fast enough to fool the human eye into perceiving continuous motion. When viewed on an oscilloscope, the analogue signal looks like a continuous landscape of jagged hills and valleys, analogous to the changing brightness and color information.

Most countries around the world use one of three main television broadcast standards: NTSC, PAL and SECAM. Unfortunately the standards are mutually incompatible. The table below describes each standard and the technical variations within it:

System            Lines/Fields  Horizontal Freq.  Vertical Freq.  Color Subcarrier  Video Bandwidth  Sound Carrier
NTSC M            525/60        15.734 kHz        60 Hz           3.579545 MHz      4.2 MHz          4.5 MHz
PAL B,G,H         625/50        15.625 kHz        50 Hz           4.433618 MHz      5.0 MHz          5.5 MHz
PAL I             625/50        15.625 kHz        50 Hz           4.433618 MHz      5.5 MHz          6.0 MHz
PAL D             625/50        15.625 kHz        50 Hz           4.433618 MHz      6.0 MHz          6.5 MHz
PAL N             625/50        15.625 kHz        50 Hz           3.582056 MHz      4.2 MHz          4.5 MHz
PAL M             525/60        15.750 kHz        60 Hz           3.575611 MHz      4.2 MHz          4.5 MHz
SECAM B,G,H       625/50        15.625 kHz        50 Hz           -                 5.0 MHz          5.5 MHz
SECAM D,K,K1,L    625/50        15.625 kHz        50 Hz           -                 6.0 MHz          6.5 MHz

Table 1: Overview of TV formats [URL17]

The following details about each standard are extracted from [URL18] and [ORT1993]:

NTSC (National Television System Committee) was the first color TV broadcast system, implemented in the United States in 1953. NTSC is used by many countries on the American continent as well as many Asian countries including Japan. NTSC uses 525 lines per frame. As the electric current in the USA alternates 60 times per second, 60 half-pictures are displayed per second.

With PAL (Phase Alternating Line) each complete frame is drawn line by line, from top to bottom. It was developed by Walter Bruch at Telefunken [URL31] in Germany and is used in most countries in Western Europe and Asia, throughout the Pacific and in southern Africa. Europe uses an AC electric current that alternates 50 times per second (50 Hz), so PAL performs 50 passes (half-pictures) each second. As it takes two passes to draw a complete frame, the picture rate is 25 fps. The odd lines are drawn on the first pass, the even lines on the second. This procedure is known as "interlacing" and is meant to compensate for the low refresh rate. Computer monitors display images non-interlaced, so they show a whole picture on each pass.
Interlaced signals, particularly at a rather low rate of 50 Hz (modern monitors display pictures at 70 Hz or more), cause unsteadiness and flicker, and are inappropriate for displaying text or thin horizontal lines.

SECAM (Séquentiel Couleur Avec Mémoire, or Sequential Color with Memory) was developed in France and is used in France and its territories, much of Eastern Europe, the Middle East and northern Africa. This system uses the same resolution as PAL (625 lines) and the same frame rate (25 per second), but the way SECAM processes the color information is not compatible with PAL.

2.2. Video Capturing

The practical part of this thesis handles live video input for further processing. It therefore needs video capturing to turn the camera's view into digitally usable data. To help understand this process, this chapter explains its principles and techniques.

2.2.1. Analogue-Digital-Conversion

To store visual information digitally, the hills and valleys of the analogue video signal have to be translated into their digital equivalent - ones and zeros - by a sophisticated computer-on-a-chip called an analogue-to-digital converter (ADC). The conversion process is known as sampling, or video capture. Since computers can deal with digital graphics information directly, no other special processing of this data is needed to display digital video on a computer monitor. To view digital video on a traditional television set, the process has to be reversed: a digital-to-analogue converter (DAC) is required to decode the binary information back into the analogue signal.

The digitisation of the analogue TV signal is performed by a video capture card, which converts each frame into a series of bitmapped images to be displayed and manipulated on the PC. The card takes one horizontal line at a time and, for the PAL system, splits it into 768 sections. At each of these sections, the red, green and blue values of the signal are calculated, resulting in 768 colored pixels per line. The 768 pixel width arises out of the 4:3 aspect ratio of a TV picture. Out of the 625 lines in a PAL signal, about 50 are used for Teletext (an information retrieval service delivered through the television broadcast) and contain no picture information, so they are not digitised. To get the 4:3 ratio, 575 lines times four divided by three gives 766.7. Video capture cards therefore usually digitise 576 lines, splitting each line into 768 segments, which gives an exact 4:3 ratio (compared to the more modern 16:9 ratio of widescreen television). [STO_1995]

Thus, after digitisation, a full frame is made up of 768x576 pixels. Each pixel requires three bytes for storing the red, green and blue components of its color (for 24-bit color). Each frame therefore requires 768x576x3 bytes = 1.3 MB. In fact, the PAL system takes two passes to draw a complete frame, each pass resolving alternate scan lines to reduce flickering. The upshot is that one second of video requires a massive 32.5 MB (1.3 MB x 25 fps).

The human eye is more sensitive to brightness than to color [URL34]. The YUV model is a method of encoding pictures, used in television broadcasting, in which intensity is processed independently from color. Y stands for intensity and is measured at full resolution, while U and V are color difference signals and are measured at either half resolution (known as YUV 4:2:2) or quarter resolution (known as YUV 4:1:1).

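As a quick cross-check of these figures, the short calculation below (an illustrative sketch using the PAL numbers just derived, not code from the thesis system) reproduces the per-frame and per-second data volumes for uncompressed 24-bit RGB and for 16-bit YUV 4:2:2 capture:

using System;

class RawVideoDataRate
{
    static void Main()
    {
        const int width  = 768;   // pixels per digitised PAL line
        const int height = 576;   // digitised lines per frame
        const int fps    = 25;    // PAL frame rate

        // 24-bit RGB: three bytes per pixel.
        double rgbFrameMB  = width * height * 3 / 1e6;   // ~1.3 MB per frame
        double rgbSecondMB = rgbFrameMB * fps;           // ~33 MB per second
                                                         // (quoted above as 32.5 MB, i.e. 1.3 x 25)

        // YUV 4:2:2: on average two bytes per pixel.
        double yuvSecondMB = width * height * 2 * fps / 1e6;  // ~22 MB per second

        Console.WriteLine("RGB frame:        {0:F2} MB", rgbFrameMB);
        Console.WriteLine("RGB, 1 second:    {0:F1} MB", rgbSecondMB);
        Console.WriteLine("YUV 4:2:2, 1 sec: {0:F1} MB", yuvSecondMB);
    }
}
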
Digitising a YUV signal instead of an RGB signal requires 16 bits (two bytes) instead of 24 bits (three bytes) to represent true color, so one second of PAL video ends up requiring about 22 MB. [ORT1993]

The NTSC system (chapter 2.1.2) used by America and Japan has 525 lines and runs at 30 fps - the latter being a consequence of the fact that their electric current alternates at 60 Hz rather than the 50 Hz found in Europe. NTSC frames are usually digitised at 640x480, which fits exactly into VGA resolution. This is not a coincidence, but a result of the PC having been designed in the US and the first IBM PCs having the capability to be plugged into a TV.

2.2.2. Capturing Hardware

A typical video capture card is a system of hardware and software which together allow a user to convert video into a computer-readable format by digitising video sequences into uncompressed or, more usually, compressed data files. Uncompressed PAL video produces about 32.5 MB of data per second, so some kind of compression has to be employed to make it more manageable. It is up to a codec to compress video during capture and decompress it again for playback, and this can be done in software or hardware. Even in the age of GHz-speed CPUs, a hardware codec is necessary to achieve anything near broadcast-quality video.

Picture 1: miro Video PCTV card

The TV card shown consists of a TV tuner (the closed box in the upper section) and the video decoder chip (lower right corner, labelled "Bt").

Most video capture devices employ a hardware Motion-JPEG codec [URL35], which uses JPEG compression on each frame to achieve smaller file sizes while retaining editing capabilities. The huge success of DV-based camcorders in the late 1990s has led to some high-end cards employing a DV codec instead (see chapter 2.3.4 for compression formats). Once compressed, the video sequences can be edited on the PC using appropriate video editing software and output in S-VHS quality to a VCR, television, camcorder or computer monitor. The higher the quality of the video input and the higher the PC's data transfer rate, the better the quality of the video image output. Less compression means less loss of information, so there are fewer artifacts and/or a higher frame rate is possible.

Some video capture cards (e.g. Hauppauge Bt878 TV cards) keep their price down by omitting their own audio recording hardware. Instead they provide pass-through connectors that allow the audio input to be directed to the host PC's sound card. This is no problem for simple editing work, but without dedicated audio hardware, problems can arise in synchronising the audio and video tracks on longer and more complex edits.

Video capture cards are equipped with a number of input and output connectors. There are two main video formats: composite video is the standard for most domestic video equipment, although higher-quality equipment often uses the S-Video format. Most capture cards provide at least one input socket that can accept either type of video signal, allowing connection to any video source (e.g. VCR, video camera, TV tuner or laser disc) that generates a signal in either of these formats. Additional sockets can be of benefit though, since complex editing work often requires two or more inputs. Some cards are designed to take an optional TV tuner module and, increasingly, video capture cards include an integrated TV tuner (e.g. the Hauppauge WinTV series or the Pinnacle PCTV).
Video output sockets are provided to allow video sequences to be recorded back to tape, and some cards also allow video to be played back on a computer monitor or TV. Less sophisticated cards require a separate graphics adapter or TV tuner card to provide this functionality. [CT1996_11]

2.2.3. Relation to this work

Capturing video by software needs an interface to communicate with the camera device in order to configure input resolution, frame rate, shutter and light exposure. Some capturing hardware also has the ability to move the camera head and/or provides zoom. To handle all these functions, a driver is needed which acts as a mediator between the capturing hardware and the software.

The system developed in this work is based on DirectShow (chapter 2.6.1) and therefore needs DirectShow-compatible capturing hardware. To be DirectShow-compatible, drivers must be available which provide filters that can be used through DirectShow. Most available webcams and DV cameras can be controlled by DirectShow; some older hardware provides only Video for Windows (chapter 2.6.2) drivers, which are useless with DirectShow, and some proprietary hardware does not provide any drivers at all. As the handling of the capturing hardware is entirely managed by DirectShow, the developed system only takes the video data that comes through the input filter, i.e. the driver, without any need to access the hardware directly.

2.3. Video Compression

2.3.1. Introduction

Video compression methods tend to be lossy. This means that after decompression the video data is not exactly the same as what was originally encoded. By cutting video resolution, color depth and frame rate, PCs managed postage-stamp-size windows at first, but then ways were devised to represent images more efficiently and reduce data without affecting physical dimensions.

The technology by which video compression is achieved is known as a "codec", an abbreviation of compression/decompression. Various types of codec have been developed, implementable in software or hardware and sometimes utilising both, allowing video to be readily translated to and from its compressed state. Lossy techniques reduce data - both through complex mathematical encoding and through selective, intentional shedding of visual information that our eyes and brain usually ignore - and can lead to perceptible loss of picture quality. "Lossless" compression, by contrast, discards only redundant information. Codecs have compression ratios ranging from a gentle 2:1 to an aggressive 100:1, making it feasible to deal with huge amounts of video data. The higher the compression ratio, the worse the resulting image: color fidelity fades, artefacts and noise appear in the picture, and the edges of objects become over-apparent until eventually the result is unwatchable. [CTR_1996_11]

2.3.2. Discrete Cosine Transform

By the end of the 1990s, the dominant techniques were based on a three-stage algorithm known as DCT (Discrete Cosine Transform). DCT uses the fact that adjacent pixels in a picture - either physically close in the image (spatial) or in successive images (temporal) - may have the same value.

Picture 2: DCT compression cycle

A mathematical transform - a relative of the Fourier transform - is performed on grids of 8x8 pixels (hence the blocks of visual artefacts at high compression levels).

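For reference, the transform applied to each 8x8 block is the two-dimensional DCT-II. The sketch below is a naive, textbook implementation (real codecs use fast factorised variants, but the outcome is the same): 64 pixel values become 64 frequency coefficients, with the visually important information concentrated in the low-frequency corner.

using System;

class Dct8x8
{
    // Naive forward 2D DCT-II of one 8x8 block of pixel values.
    public static double[,] Forward(byte[,] block)
    {
        var coeff = new double[8, 8];
        for (int u = 0; u < 8; u++)
        {
            for (int v = 0; v < 8; v++)
            {
                double cu = (u == 0) ? 1.0 / Math.Sqrt(2.0) : 1.0;
                double cv = (v == 0) ? 1.0 / Math.Sqrt(2.0) : 1.0;
                double sum = 0.0;

                for (int x = 0; x < 8; x++)
                    for (int y = 0; y < 8; y++)
                        sum += block[x, y]
                             * Math.Cos((2 * x + 1) * u * Math.PI / 16.0)
                             * Math.Cos((2 * y + 1) * v * Math.PI / 16.0);

                // Coefficient (0,0) is the DC value (average brightness of the
                // block); the others describe increasingly fine detail.
                coeff[u, v] = 0.25 * cu * cv * sum;
            }
        }
        return coeff;
    }
}

Quantisation and entropy coding, described next, then act on these coefficients rather than on the pixels themselves.
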
The transform itself does not reduce the amount of data, but the resulting frequency coefficients are no longer equal in their information-carrying roles. Specifically, it has been shown that for visual systems the lower-frequency components are more important than the high-frequency ones. A quantisation process weights them accordingly and ejects those contributing the least visual information, depending on the compression level required. For instance, losing 50 percent of the transformed data may only result in a loss of five percent of the visual information. Then entropy encoding - a lossless technique - jettisons any truly unnecessary bits. [URL20]

2.3.3. Non-linear editing

Initially, compression was performed by software. Limited CPU power constrained how clever an algorithm could be and still perform its task in a 25th of a second - the time needed to draw a frame of full-motion video. Nevertheless, Avid Technology [URL49] and other pioneers of NLE (non-linear editing) introduced PC-based editing systems at the end of the 1980s using software compression. Although the video was a quarter of the resolution of broadcast TV, with washed-out color and thick with blocky artefacts, NLE signaled a revolution in production techniques. At first it was used for off-line editing, when material is trimmed down for a program. Up to 30 hours of video may be shot for a one-hour documentary, so it is best to prepare it on cheap, non-broadcast equipment to save time in an on-line edit suite.

NLE systems really took off in 1991, when hardware-assisted compression brought VHS-quality video. The first hardware-assisted video compression is known as M-JPEG (motion JPEG). It is a derivative of the DCT-based standard developed for still images, JPEG. JPEG was never intended for video compression, but when C-Cube (bought by LSI Logic in 2001) introduced a codec chip in the early 1990s that could JPEG-compress as many as 30 still images a second, NLE pioneers could not resist. By squeezing data by as much as 50 times, VHS-quality digital video could be handled by PCs. [CT1996_5] [URL22]

In time, PCs got faster and storage got cheaper, meaning less compression had to be used and better video could be edited. By compressing video by as little as 10:1, a new breed of non-linear solutions emerged in the mid-1990s. These systems were declared ready for on-line editing; that is, finished programs could essentially be played out of the box. Their video was at least considered to be of broadcast quality for the sort of time- and cost-critical applications that most benefited from NLE, such as news, current affairs and low-budget productions.

The introduction of this technology proved controversial. Most images compressed cleanly at 10:1, but certain material - with a lot of detail and areas of high contrast - was degraded. "Normal" viewers would hardly ever notice, but to broadcast engineers the ringing and blocky artefacts seemed obvious. Also, in order to change the contents of the video images, to add an effect or graphic, the material must first be decompressed and then recompressed. This process, though digital, is akin to an analogue generation loss: artefacts are added like noise with each cycle (so-called "concatenation"). Sensibly designed systems render every effect in a single pass, but if several compressed systems are used in a production and broadcast environment, this concatenation presents a problem.

Compression technology arrived just as proprietary uncompressed digital video equipment had infiltrated all areas of broadcasters and video facilities. Though the cost savings of the former were significant, the associated degradation in quality meant that acceptance by the engineering community was slow at first. However, as compression levels dropped - to under 5:1 - objections began to evaporate and even the most exacting engineers conceded that such video was comparable to the widely used BetaSP analogue tape. Mild compression enabled Sony to build its successful Digital Betacam format [URL50] video recorder, which is now considered a gold standard. With compression a little over 2:1, so few artefacts (if any) are introduced that video can be processed through dozens of generations apparently untouched.

The cost of M-JPEG hardware has fallen steeply in the past few years, and reasonably priced PCI cards and USB devices capable of a 3:1 compression ratio and bundled with NLE software are now readily available (e.g. Pinnacle's Dazzle Digital Video Creator [URL51]). Useful as M-JPEG is, it was not designed for moving pictures. When it comes to digital distribution, where bandwidth is at a premium, the MPEG family of standards - specifically designed for video - offers significant advantages. Chapter 4.2.2 gives further insight into recent research in the field of video compression.

2.3.4. Video Compression Formats

The internal details of many media formats are closely guarded pieces of information, for competitive reasons. Nearly every encoding method employs sophisticated techniques of mathematical analysis to squeeze a sound or video sequence into fewer bits. In the early days of media programming for the PC, the development of such methods was a long and complex process. The table below shows the differences between the common compression formats:

                          VCD                SVCD               X(S)VCD                     DivX              DV                 DVD
Formal standard?          Yes                Yes                No                          No                Yes                Yes
Resolution (NTSC/PAL)     352x240 / 352x288  480x480 / 480x576  720x480 / 720x576 or lower  640x480 or lower  720x480 / 720x576  720x480 / 720x576
Video compression         MPEG-1             MPEG-2             MPEG-1 or MPEG-2            MPEG-4            DV                 MPEG-2
Audio compression         MPEG-1             MPEG-1             MPEG-1                      WMA, MP3          DV                 AC-3, MPEG-2
MB/min                    10                 20-30              20-40                       10-20             216                30-70
DVD player compatibility  Very good          Good               Good                        None              None               Excellent
CPU intensive             Low                High               High                        Very high         High               Very high
Quality                   Good               Very good          Very good                   Very good         Excellent          Excellent

Table 2: Video compression formats overview

The underlying video compression techniques are described in the following sections.

2.3.4.1. H.261

H.261 is also known as P*64, where P is an integer meant to represent multiples of 64 kbit/s. H.261 was targeted at teleconferencing applications and is intended for carrying video over ISDN - in particular for face-to-face videophone applications and for videoconferencing with multiple participants. The actual encoding algorithm is similar to (but incompatible with) MPEG. H.261 needs substantially less CPU power for real-time encoding than MPEG. The algorithm includes a mechanism which optimises bandwidth usage by trading picture quality against motion, so that a quickly changing picture will have a lower quality than a relatively static picture. H.261 used in this way is thus a constant-bit-rate encoding rather than a constant-quality, variable-bit-rate encoding. [URL23]

2.3.4.2. H.263

H.263 is a draft ITU-T standard designed for low-bitrate communication.
It is nevertheless expected to be used for a wide range of bitrates, not just low-bitrate applications, and to replace H.261 in many of them. The coding algorithm of H.263 is similar to that of H.261, but with some changes to improve performance and error recovery. Half-pixel precision is used for H.263 motion compensation, whereas H.261 used full-pixel precision and a loop filter. Some parts of the hierarchical structure of the data stream are now optional, so the codec can be configured for a lower data rate or better error recovery. Four negotiable options are included to improve performance: unrestricted motion vectors, syntax-based arithmetic coding, advanced prediction, and forward and backward frame prediction similar to MPEG's P- and B-frames.

H.263 supports five resolutions:
- CIF (Common Intermediate Format, 352x288 pixels at 30 fps)
- QCIF (Quarter CIF, 176x144)
- SQCIF (Semi QCIF, 128x96)
- 4CIF (704x576)
- 16CIF (1408x1152)

The support for 4CIF and 16CIF means the codec can also compete with higher-bitrate video coding standards such as the MPEG standards. [URL23]

2.3.4.3. MPEG

The Moving Picture Experts Group (MPEG) has defined a series of standards for compressing motion video and audio signals using DCT (Discrete Cosine Transform) compression, which provide a common world language for high-quality digital video. These use the JPEG algorithm for compressing individual frames and then eliminate the data that stays the same in successive frames. The MPEG formats are asymmetrical - meaning that it takes longer to compress a frame of video than it does to decompress it - requiring serious computational power to reduce the file size. The results, however, are impressive: MPEG video needs less bandwidth than M-JPEG because it combines two forms of compression. M-JPEG video files are essentially a series of compressed stills: using intraframe, or spatial, compression, they dispose of redundancy within each frame of video. MPEG does this too, but also utilises another process known as interframe, or temporal, compression, which eradicates redundancy between video frames. Take two sequential frames of video and you will notice that very little changes in a 25th of a second. MPEG therefore reduces the data rate by recording changes instead of complete frames.

MPEG video streams consist of a sequence of sets of frames known as a GOP (group of pictures). Each group, typically eight to 24 frames long, has only one complete frame represented in full, which is compressed using only intraframe compression. It is just like a JPEG still and is known as an I-frame. Around it are temporally compressed frames, representing only change data. During encoding, powerful motion prediction techniques compare neighbouring frames and pinpoint areas of movement, defining vectors for how each will move from one frame to the next. By recording only these vectors, the data which needs to be recorded can be substantially reduced. P (predictive) frames refer only to the previous frame, while B (bi-directional) frames rely on both previous and subsequent frames. This combination of compression techniques makes MPEG highly scalable. [URL23]

2.3.4.4. MJPEG

There is really no such standard as "motion JPEG" or "MJPEG" for video. Various vendors have applied JPEG to individual frames of a video sequence and have called the result "M-JPEG".
JPEG is designed for compressing full-color or gray-scale images of natural, real-world scenes. It works well on photographs, naturalistic artwork and similar material; not so well on lettering, simple cartoons or line drawings. JPEG is a lossy compression algorithm which uses DCT-based encoding. JPEG can typically achieve 10:1 to 20:1 compression without visible loss; 30:1 to 50:1 compression is possible with small to moderate defects; and for very-low-quality purposes such as previews or archive indexes, 100:1 compression is quite feasible.

Non-linear video editors are typically used in broadcast TV, commercial post-production and high-end corporate media departments. Low-bitrate MPEG-1 quality is unacceptable to these customers, and it is difficult to edit video sequences that use inter-frame compression. Consequently, non-linear editors (e.g. AVID, Matrox, FAST) will continue to use motion JPEG with low compression factors (e.g. 6:1 to 10:1). [URL21]

2.3.4.5. MPEG1

MPEG-1 (also known as the White Book standard) was designed to get VHS-quality video down to a fixed data rate of 1.5 Mbit/s so it could play from a regular CD (for the VideoCD format). Published in 1993, the standard supports video coding at bit rates up to about 1.5 Mbit/s and virtually transparent stereo audio quality at 192 kbit/s, providing 352x240 resolution at 30 fps, with quality roughly equivalent to VHS videotape. The 352x240 resolution is typically scaled and interpolated. (Scaling causes a blocky appearance when one pixel, scaled up, becomes four pixels of the same color value. Interpolation blends adjacent pixels by interposing pixels with "best-guess" color values.) Most graphics chips can scale the picture for full-screen playback; however, software-only half-screen playback is a useful trade-off. MPEG-1 enables more than 70 minutes of good-quality video and audio to be stored on a single CD-ROM disc. Prior to the introduction of Pentium-based computers, MPEG-1 required dedicated hardware support. It is optimised for non-interlaced video signals. [URL21]

2.3.4.6. MPEG2

During 1990, MPEG recognised the need for a second, related standard for coding video at higher data rates and in an interlaced format. The resulting MPEG-2 standard is capable of coding standard-definition television at bit rates from about 1.5 Mbit/s to some 15 Mbit/s. MPEG-2 also adds the option of multi-channel surround sound coding and is backwards compatible with MPEG-1. It is interesting to note that, for video signals coded at bitrates below 3 Mbit/s, MPEG-1 may be more efficient than MPEG-2. MPEG-2 has a resolution of 704x480 at 30 fps - four times greater than MPEG-1 - and is optimised for the higher demands of broadcast and entertainment applications, such as DSS satellite broadcast and DVD-Video. At a data rate of around 10 Mbit/s, the latter is capable of delivering near-broadcast-quality video with five-channel audio. The resolution is about twice that of a VHS videotape, and the standard supports additional features such as scalability and the ability to place pictures within pictures. (Extended) Super Video CD and DVD use MPEG-2. MPEG-3, originally intended for HDTV, was folded into MPEG-2. [URL21]

2.3.4.7. MPEG4

In 1993 work was started on MPEG-4, a low-bandwidth multimedia format akin to QuickTime that can contain a mix of media, allowing recorded video images and sounds to co-exist with their computer-generated counterparts.
Importantly, MPEG-4 provides standardised ways of representing units of audible, visual or audio-visual content as discrete "media objects". These can be of natural or synthetic origin, meaning, for example, that they could be recorded with a camera or microphone, or generated with a computer. Possibly the greatest of the advances made by MPEG-4 is that it allows viewers and listeners to interact with objects within a scene. DivX uses MPEG-4 to compress videos so that they fit on a single Mode 2 CD while preserving high quality. [URL21]

2.3.4.8. MPEG7

MPEG-7, formally named "Multimedia Content Description Interface", aims to create a standard for describing multimedia content data that supports some degree of interpretation of the information's meaning, which can be passed on to, or accessed by, a device or computer code. [ISO1999]

2.3.5. Relation to this work

There are two points in the developed system where video compression is needed and takes place.

The first is the connection between Slave and Master (see Picture 10: slave model). As the video data coming in from the camera's driver amounts to about 7.6 MByte/s (at 352x288, 25 fps), it consumes a lot of bandwidth when the stream is transferred to the Master machine; a single client would eat up a whole 100 MBit connection. It is therefore necessary to compress the Slaves' videos before delivering them. As the video has to be decompressed for the fading and re-compressed before being streamed out onto the web, there are two compression-decompression cycles. Every such cycle decreases video quality, since it adds artefacts and color inconsistencies. So on the one hand, the Slave's compression codec has to counter this issue by using a compression format that keeps artefacts down. On the other hand, it has to be fast enough to compress video in realtime, which means that the compression time per frame has to be less than the frame's time span.

After testing the codecs which come with Windows (Intel Video 5.10, Intel Indeo, Cinepak, Microsoft RLE, Microsoft Video 1), Microsoft's MPEG-4 codec emerged as the most appropriate in terms of CPU usage and compression ratio (about 1:120). A lossless codec by Alparysoft [URL59] was also tested; it achieves a compression ratio of 1:3, but at resolutions higher than 352x288 its CPU usage rises to 90%, which is too high for the Slave application, as the object and motion detection have to run simultaneously. The compression filter used in the Slave application can easily be replaced by any other, as long as it is a DirectShow filter and the Master machine has the necessary decoding filter installed.

The second point where video has to be compressed is for streaming the result out onto the Web (see Picture 11: master model). This stream has to meet the requirements for being viewed with common media players such as Microsoft Media Player or QuickTime (chapter 2.4.1). Besides, the compression ratio has to be high enough for the available bandwidth. As the intended output resolution will not exceed 352x288 and the bitrate will remain below 1.5 Mbit/s, MPEG-1 (chapter 2.3.4.5) seems appropriate. The current version of the Master application uses the Moonlight encoder to convert the raw video into an MPEG-1 video stream. Another benefit of using MPEG-1 with the Moonlight encoder is that the trial version runs without registration as long as MPEG-1 is chosen as the output format.
Moreover, as MPEG-1 is the oldest MPEG standard, every player that can play MPEG streams should handle the resulting stream without problems.

2.4. Video Streaming

There are two ways to view media (such as video, audio or animations) on the Internet: downloading and streaming.

When a file is downloaded, the entire file is saved on the user's computer (usually in a temporary folder) and is then opened and viewed. This has some advantages (such as quicker access to different parts of the file) but has the big disadvantage of having to wait for the whole file to download before any of it can be viewed. If the file is quite small this may not be too much of an inconvenience, but for large files and long presentations it can be annoying.

Streaming media works a bit differently: the end user can start watching the file almost as soon as it begins downloading. In effect, the file is sent to the user in a (more or less) constant stream, and the user watches it as it arrives. When audio or video is streamed, a small buffer space is created on the user's computer, and data starts downloading into it. As soon as the buffer is full (usually just a matter of seconds), the file starts to play. As the file plays, it uses up information in the buffer, but while it is playing, more data is being downloaded. As long as the data can be downloaded as fast as it is used up in playback, the file will play smoothly. The obvious advantage of this method is that almost no waiting is involved (compared to the full length of the media). Streaming media can also be used to broadcast live events (within certain limitations); this is sometimes referred to as a webcast (or netcast) [URL38].

When creating streaming video, two things need to be understood: the video file format and the streaming method.

2.4.1. Video Streaming Formats

There are many video file formats to choose from when creating video streams; the three most common are:

• Windows Media [URL8]
• RealMedia [URL36]
• Quicktime [URL37]

There are pros and cons for each type, but in the end it comes down to personal preference. One has to be aware that many users will have their own preferences and some users will only use a particular format, so separate files for each format should be created to reach the widest possible audience. The following overview is extracted from [URL16], [MEN2003] and [TOP2003]:

2.4.1.1. Windows Media Technology

WMT is a streamable version of the popular AVI format. The player is free and comes integrated with Windows as the Windows Media Player. The video window can be resized by the viewer, and it streams extremely well off the net at full-screen resolution. The ASF format breaks the AVI file up into small streamable data packets that can be transmitted over the net at a rate matching the targeted audience's connection speed. A browser downloads the file, then opens the media player, which plays the file immediately. When high bandwidth is available, the quality can be as good as MPEG-1 and significantly higher than RealVideo or QuickTime.

2.4.1.2. RealMedia

RealMedia is the popular streaming standard on the PC market, but Mac support is limited and the server is expensive. RealVideo can be created and embedded into HTML without a special server, but the playback of such a stream is not as good. Streaming performance depends on both the speed of the user's connection and the speed of the web server and the line it is connected to.
While the picture quality is good at high connection speeds, the frame rate is not. The video can become so jumpy that watching the movie is frustrating and the storytelling essence is lost.

2.4.1.3. Quicktime

QuickTime is by far the most portable video standard and has been around for a long time. QuickTime streaming requires a special server, but any Mac running the latest OS and QuickTime Pro can be configured as a QuickTime server; all that is needed is a high-speed connection. QuickTime allows the video to be configured into streams that download and play simultaneously without jerks. Performance varies with web server performance. It is popular because it lets both Mac and PC users share data.

2.4.2. Streaming Methods

Part of the practical requirements of this thesis is to stream the resulting video out onto the Internet. To achieve this, some considerations are necessary, as streaming uses a network which is subject to restrictions and conditions. These conditions are described and discussed in the following chapters.

2.4.2.1. HTTP Server versus Dedicated Media Server

Two major approaches are emerging for streaming multimedia content to clients. The first is the "server-less" approach, which uses a standard web server and the associated HTTP protocol to get the multimedia data to the client. The second is the server-based approach, which uses a separate server specialized for the video/multimedia streaming task. The specialization takes many forms, including optimized routines for reading the huge multimedia files from disk, the flexibility to choose any of the UDP/TCP/HTTP/multicast protocols to deliver data, and the option to exploit continuous contact between client and server to dynamically optimize content delivery to the client.

The primary advantages of the server-less approach are:
- there is one less piece of software to learn and manage, and
- from an economic perspective, there is no video server to pay for.

In contrast, the server-based approach has the advantages that it:
- makes more efficient use of the network bandwidth,
- offers better video quality to the end user,
- supports advanced features like admission control and multi-stream multimedia content,
- scales to support a large number of end users, and
- protects content copyright.

The tradeoffs clearly indicate that for serious providers of streaming multimedia content the server-based approach is the superior solution.

RealPlayer, StreamWorks [URL40] and VDOnet's VDOLive [URL41] require their A/V server software to be installed on the web server computer. Among other things, this software can tailor the quality and number of streams and provide detailed reports of who requested which streams. Other programs, such as Shockwave and VivoActive, are server-less: they do not require any special A/V server software beyond the ordinary web server software. With these programs, a file on the server's hard drive is simply linked from a web page; when someone follows the link, the file starts to download. Server-less programs are simple to incorporate into a website but do not have the reporting capabilities of server-based programs.

Depending on the users' Internet connections (modem, DSL, cable), the maximum bitrates they can consume differ. If versions at different bitrates are to be offered, there has to be a link for each, because there is no automatic bandwidth detection of the kind provided by e.g. a QuickTime media server.
[TOP2003] [MEN2003] Page 25 Diploma Thesis Erich Semlak 2.4.2.2. HTTP Streaming Video This is the simplest and cheapest way to stream video from a website. Small to medium-sized websites are more likely to use this method than the more expensive streaming servers. For this method there is no need of any special type of website or host - just a host server which recognises common video file types (most standard hosting accounts do this). There are some limitations to bear in mind regarding HTTP streaming: - HTTP streaming is a good option for websites with modest traffic, i.e. less than about a dozen people viewing at the same time. For heavier traffic a more serious streaming solution should be considered. There can't be stream live video, since the HTTP method only works with complete files stored on the server. There is no automatic detection of the end user's connection speed using HTTP. If different versions for different speeds shall be created, a separate file for each speed has to be created HTTP streaming is not as efficient as other methods as it produces more data overhead which therefore leads to additional server load Especially traffic issues may be of interest if many people request streaming files simultaneously as there might not be enough bandwidth for the server. In this case it may be necessary to limit the number of simultaneous connections, so there will be enough bandwidth for every connected user. [TOP2003] [MEN2003] 2.4.2.3. Java Replayers Replacing Plugins New solutions are appearing which use Java to eliminate the need to download and install plugins or players [URL39]. Such an approach will become standard once the Java Media Player APIs being developed by Sun, Silicon Graphics and Intel are available. This approach will also ensure client platform independence. [TOP2003] [MEN2003] 2.4.2.4. FireWalls For security reasons cautious administrators and users run their computers behind firewalls to guard them against intruders and hackers. Nearly all streaming products require users behind a firewall to have a UDP port opened for the video streams to pass through (1558 for StreamWorks [URL40], 7000 for VDOLive [URL41], 7070 for RealAudio). Rather than punch security holes in the firewall, Xing/StreamWorks has developed a proxy software package you can compile and use, while VDONet/VDOLive and Progressive Networks/RealPlayer are approaching leading firewall developers to get support for their streams incorporated into upcoming products. Currently a number of products change from UDP to HTTP or TCP when UDP can't get through firewall restrictions. This reduces the quality of the video. In all cases, it is still a security issue for network managers. [URL28] Page 26 Diploma Thesis Erich Semlak 2.4.3. Streaming Method Considerations This thesis’ system uses a dedicated RTSP server for delivering the resulting MPEG1 stream. This server-based approach shows its’ advantage when serving multiple clients, as it uses multicast [URL61]. Therefore it doesn’t need to send an own stream to each client but sends one single stream to all clients at the same time. This holds down network and machine load. As the streamed video is live (in opposite to a prerecorded video) all connected clients see the same moment in time, which is desired in this case. The protocol would provide functions for random seeking within media, but this is not required and therefore disabled. There is no automatic bandwidth detection implemented. 
The current system’s version provides only one bitrate, but as the filter graph (which connects each video processing component (filter) within the streaming application, see chapter A.6 for details) can be easily extended with multiple output filters, there could be run multiple streams with different bitrates and streamed through different ports. Limitation will be CPU performance as every compression task takes his share of workload. Future versions can remove this limitation by distributing the resulting video per network to multiple PCs which then do the compression for each desired bitrate. In that scenario, the compression-decompression-cycle-problem (see chapter 2.3.5) has to be regarded, as the video distribution to the PCs (for compression) would lead to further decreased video quality (which means more artefacts) if a lossy compression codec would be used. Page 27 Diploma Thesis Erich Semlak 2.4.4. Protocols To send the streaming data across the network, a protocol has to be used to assure the correct arrival on the other side. There are several possible protocols, some more appropriate than others. The differences and properties of these protocols will be explained following. This information about protocols is excerpted from [URL23], [URL60], [MEN2003] and [TOP2003]. 2.4.4.1. TCP HTTP (Hypertext Transfer Protocol) uses TCP (Transmission Control Protocol) as the protocol for reliable document transfer. It needs to establish a connection before sending data is possible. If packets are delayed or damaged, TCP will effectively stop traffic until either the original packets or backup packets arrive. Hence it is unsuitable for video and audio because: • • TCP imposes its own flow control and windowing schemes on the data stream, effectively destroying temporal relations between video frames and audio packets Reliable message delivery is unnecessary for video and audio - losses are tolerable and TCP retransmission causes further jitter and skews. Picture 3: TCP protocol handshaking and dataflow Page 28 Diploma Thesis Erich Semlak 2.4.4.2. UDP UDP (User Datagram Protocol) is the alternative to TCP. RealPlayer, StreamWorks [URL40] and VDOLive [URl41] use this approach. (RealPlayer gives you a choice of UDP or TCP, but the former is preferred.) UDP needs no connection, forsakes TCP's error correction and allows packets to drop out if they're late or damaged. When this happens, dropouts will occur but the stream will continue. Picture 4: UDP protocol dataflow Despite the prospect of dropouts, this approach is arguably better for continuous media delivery. If broadcasting live events, everyone will get the same information simultaneously. One disadvantage to the UDP approach is that many network firewalls (see FireWalls, chapter 2.4.2.4) block UDP information. While Progressive Networks, Xing and VDOnet offer work-arounds for client sites (revert to TCP), some users simply may not be able to access UDP files. 2.4.4.3. RTP RTP (Real Time Protocol) is the Internet-standard protocol (RFC 1889, 1890) for the transport of real-time data, including audio and video. RTP consists of a data and a control part called RTCP. The data part of RTP is a thin protocol providing support for applications with real-time properties such as continuous media (e.g., audio and video), including timing reconstruction, loss detection, security and content identification. RTCP provides support for real-time conferencing of groups of any size within an internet. 
This support includes source identification and support for gateways like audio and video bridges as well as multicast-to-unicast translators. It offers quality-of-service feedback from receivers to the multicast group as well as support for the synchronization of different media streams. None of the commercial streaming products uses RTP (Real-time Transport Protocol), a relatively new standard designed to run over UDP. Initially designed for video at T1 or higher bandwidths, it promises more efficient multimedia streaming than UDP. Streaming vendors are expected to adopt RTP, which is used by the MBONE. 2.4.4.4. VDP Vosaic uses VDP (Video Datagram Protocol), which is an augmented RTP i.e. RTP with demand resend. VDP improves the reliability of the data stream by creating two channels between the client and server. One is a control channel the two machines use to coordinate what information is being sent across the network, and the other channel is for the streaming data. When configured in Java, this protocol, like HTTP, is invisible to the network and can stream through firewalls. Page 29 Diploma Thesis Erich Semlak 2.4.4.5. RTSP In October 1996, Progressive Networks and Netscape Communications Corporation announced that 40 companies including Apple Computer, Autodesk/Kinetix, Cisco Systems, Hewlett-Packard, IBM, Silicon Graphics, Sun Microsystems, Macromedia, Narrative Communications, Precept Software and Voxware would support the Real Time Streaming Protocol (RTSP), a proposed open standard for delivery of real-time media over the Internet. RTSP is a communications protocol for control and delivery of real-time media. It defines the connection between streaming media client and server software, and provides a standard way for clients and servers from multiple vendors to stream multimedia content. The first draft of the protocol specification, RTSP 1.0, was submitted to the Internet Engineering Task Force (IETF) on October 9, 1996. RTSP is built on top of Internet standard protocols, including: UDP, TCP/IP, RTP, RTCP, SCP and IP Multicast. Netscape's Media Server and Media Player products use RTSP to stream audio over the Internet. RTSP is designed to work with time-based media, such as streaming audio and video, as well as any application where application-controlled, time-based delivery is essential. It has mechanisms for time-based seeks into media clips, with compatibility with many timestamp formats, such as SMPTE timecodes. In addition, RTSP is designed to control multicast delivery of streams, and is ideally suited to full multicast solutions, as well as providing a framework for multicast-unicast hybrid solutions for heterogeneous networks like the Internet 2.4.4.6. RSVP RSVP (Resource Reservation Protocol) is an Internet Engineering Task Force (IETF, [URL56]) proposed standard for requesting defined quality-of-service levels over IP networks such as the Internet. The protocol was designed to allow the assignment of priorities to "streaming" applications, such as audio and video, which generate continuous traffic that requires predictable delivery. RSVP works by permitting an application transmitting data over a routed network to request and receive a given level of bandwidth. Two classes of reservation are defined: a controlled load reservation provides service approximating "best effort" service under unloaded conditions; a guaranteed service reservation provides service that guarantees both bandwidth and delay. Page 30 Diploma Thesis Erich Semlak 2.4.5. 
Protocol Considerations

This thesis' system uses network connections for three purposes:

Communication between Slave and Master application
This connection is used for negotiating media types and for sending commands and the results of object and motion detection. Especially for sending commands between Master and Slaves, a reliable connection is essential, so TCP is the only appropriate protocol for this purpose. The additional overhead caused by the protocol can be neglected, as the number of packets is rather low. Although a connection has to be set up for each client, this is not an issue because the number of clients will be limited.

Delivering video input from the Slaves' cameras to the Master
The video data from a Slave is delivered in a continuous stream to the Master application. The data flows in one direction only and no handshaking is needed, so UDP is the best choice. If single packets are dropped, there is no big impact on the resulting output; in the worst case some additional artefacts appear in the video picture for a moment. The video delivery task is performed by the network sender filter (chapter 3.4.3).

Streaming video out on the Web
The resulting video is sent as an MPEG stream out on the web. The streaming server filter component (chapter 3.4.5) used in this thesis' system uses the RTSP protocol for delivery (chapter 2.4.3). This protocol provides multicast and is therefore economical with bandwidth: the maximum number of servable clients is not limited to the available bandwidth divided by the bitrate. As mentioned before, each client sees the same moment in time, which is desired as the streamed content is live; therefore any random access functions are disabled.

2.5. Software

There are a lot of video processing tools available; every capturing device such as a webcam or TV tuner card comes with at least one bundled. The main difference is whether the software is intended to capture still images, motion video, or both. Only a few of them are picked out here, without emphasis on any particular company but on the diversity of purpose, as the captured stills and/or video can be further processed in many different ways.

2.5.1. Commercial

The following information describes the key features of each application and is taken from www.burnworld.com and www.manifest-tech.com, respectively extracted from the software provider's website.

ISpy
Grabs an image from any Video for Windows compatible framegrabber and sends the JPEG image to a homepage on the Internet. It can use either dial-up networking or a LAN connection for an FTP upload, or it can just save the image on the local disk. There is a free demo version of ISpy available, which is fully functional but adds the words 'demo version' to every uploaded image. (http://www.surveyorcorp.com/ispy/)

Gotcha
Motion detection video software that records what you want to see. Capturing images only when triggered, it provides time-stamped video files of events as they occur without recording inactivity between events. FTP and e-mail support. It is well suited for office or home video surveillance because it is able to capture several sources at the same time. (http://www.gotchanow.com/)

Adobe Premiere
Adobe Premiere provides a single environment that is focused on optimizing the editing process and it is famous for offering precise control for editing tracks in the timeline. Premiere has an impressive heritage as the market-leading professional video-editing software package, on both Windows and Macintosh platforms.
The benefits of its popularity include the large number of resources available for it, such as books, training courses, user groups, and on-line discussion groups. (http://www.adobe.com/motion/) Page 32 Diploma Thesis Erich Semlak Pinnacle Studio Pinnacle Studio is video editing software which delivers a top-notch blend of features, usability, and performance for most consumers. It is easy to use, has a very broad feature set and is very handy, particularly for audio and DVD authoring. Pinnacle Studio Plus is an approachable video editor for novices, while offering the depth experienced video enthusiasts won't find in other packages at this price. (http://www.pinnaclesys.com) 2.5.2. Shareware Ulead Video Studio Ulead VideoStudio is a fast and easy video editor with leading technology in DV and MPEG-2 video formats for DVD quality input, editing and output. You can use a straightforward, six-step Video Wizard to introduce you to video editing. Video can be edited by adding titles, transitions, voiceovers and music to turn them into movie masterpieces. You can also send video e-mail and web cards to share your video on the computer in full-screen for the best viewing. (http://www.ulead.com) Video Site Monitor Surveillance WebCams 2.59 This is a multi-camera video monitoring surveillance program for use at home or work. It features video motion detection and can capture video to disk. Cameras can be accessed remotely over the Internet to review captured video, or take in a live feed. Up to eight cameras are supported. It also offers audio alarms, e-mail notification and quick FTP file uploads. You can schedule the video settings by timer, and everything is password protected. (http://www.fgeng.com/) Virtual Dub VirtualDub is a video capture/processing utility for 32-bit Windows platforms (98/NT/2000/XP), licensed under the GNU General Public License (GPL). It lacks the editing power of a general-purpose editor such as Adobe Premiere, but is streamlined for fast linear operations over video. It has batch-processing capabilities for processing large numbers of files and can be extended with third-party video filters. VirtualDub is mainly geared toward processing AVI files, although it can read (not write) MPEG-1 and also handle sets of BMP images. (http://www.virtualdub.org) Page 33 Diploma Thesis Erich Semlak 2.6. Techniques and SDKs There are different bases where multimedia applications can be developed on. To provide some fundamentals for better understanding the decisions made in chapter 3.2 the techniques to be considered are presented and described in their history and origin. These fundamentals also support considerations about their future development and life cycle. 2.6.1. Microsoft DirectShow In the early 1990s, after the release of Windows 3.1, a number of hardware devices were introduced to take advantage of the features of its graphical user interface. Among these were inexpensive digital cameras (also known as so-called “webcams” today), which used CCD1 technology to create low-resolution (often poor black-andwhite) video. These devices were connected to the host computer through the parallel port (normally reserved for the printer) with software drivers that cared about the data transfer from the camera to the computer. As these devices became more common, Microsoft introduced “Video for Windows” (VfW), which is described in a later chapter. 
Although VfW proved to be sufficient for many software developers, it has a some limitations, in particular, it is quite difficult to support the popular MPEG standard for video. It would take a complete rewrite of Video for Windows to do accomplish that. As Windows 95 was near release, Microsoft started a project known as “Quartz”2, meant to create a new set of APIs that could provide all of Video for Window’s functionality with MPEG support in a 32-bit environment (although Windows 95 isn’t really 32-bit). That seemed straightforward enough, but the engineers working on Quartz realized that an extensive set of devices was going to enter the market, such as digital camcorders and PC-based TV-tuners, which would require a more comprehensive level of support than anything they had planned to offer. The designers of Quartz realized they couldn’t possible imagine every scenario or even try to get it all into a single API. Instead, the designers of Quartz chose a “framework” architecture, where the components can be snapped together, much like LEGO bricks. To simplify the architecture of a complex multimedia application, Quartz provides a basic set of building components, known as filters, to perform essential function such as reading data from a file, playing it to the speakers, rendering it to the screen, and so on. Using the (at that time) newly developed Microsoft Component Object Model (COM), Quartz tied these filters together into filter graphs, which orchestrated a flow of media data, a so-called stream, from capture through any intermediate processing to its eventual output to the display. Through COM, each filter would be able to inquire about the capabilities of other filters as they were connected together into a filter graph. And because Quartz filters would be self-contained COM objects, they could 1 „charge coupled device“, a light-sensitive electronic component which measures luminosity. It contains a matrix of pixels. The proportional charge of each pixel is then stored digitally. [URL2] 2 „Quartz“ can be even found today, as the main library of DirectShow is named “quartz.dll”. It contains all standard filters that come with DirectShow. Page 34 Diploma Thesis Erich Semlak be created by third-party developers for their own hardware designs or software needs. In this way, Quartz would be endlessly extensible, if one needed some feature that Quartz didn't have, he could always write his own filter. The developers of Quartz raised a Microsoft research project known as "Clockwork” [URL57], which provided a basic framework of modular, semiindependent components working together on a stream of data. From this beginning, Quartz evolved into a complete API for video and audio processing, which Microsoft released in 1995 as ActiveMovie, shipping it as a component in the DirectX Media SDK. In 1996, Microsoft renamed ActiveMovie to DirectShow (to indicate its relationship with DirectX3), a name it retains to this day. In 1998, a subsequent release of DirectShow added support for DVDs and analog television applications, both of which had become common. Finally, in 2000, DirectShow was fully integrated with DirectX, shipping as part of the release of DirectX 8. This integration means that every Windows computer with DirectX installed (and that are most PCs nowadays) has the complete suite of DirectShow services and is fully compatible with any DirectShow application. 
DirectX 8 also added support for Windows Media (which will be described in a following chapter), a set of streaming technologies designed for high-quality audio and video delivered over low-bandwidth connections, and the DirectShow Editing Services, a complete API for video editing. Picture 5: Windows Movie Maker GUI Microsoft bundled a new application into the release of Windows Millennium Edition: Windows Movie Maker. Built using DirectShow, it gives novice users of digital camcorders an easy-to-use interface for video capture, editing, and export, which means an outstanding demonstration of the capabilities of the DirectShow API. In the two years after the release of DirectX 8, most of the popular video editing applications have come to use DirectShow to handle the intricacies of communication with a wide array of digital camcorders. Those programmers of these applications 3 Microsoft DirectX is an advanced suite of multimedia application programming interfaces (APIs) built into Microsoft Windows. It provides a standard development platform for Windows-based PCs for writing hardware-specific code. [URL 6] Page 35 Diploma Thesis Erich Semlak made the choice, using DirectShow to handle the low-level sorts of tasks that would have consumed many, many hours of research, programming, and testing. With the release of DirectX 9 (the most recent, as this thesis is written), very little has changed in DirectShow, with one significant exception: the Video Mixing Renderer (VMR) filter. The VMR allows the programmer to mix multiple video sources into a single video stream that can be played within a window or applied as a "texture map", a bit like wallpaper, to a surface of a 3D object created in Microsoft Direct3D. The list of uses for DirectShow is long, but the two most prominent examples are Windows Movie Maker and Windows Media Player. Both have shipped as standard components of Microsoft operating systems since Windows Millennium Edition. [URL25] [PES2003] For a deeper insight into DirectShow and its components, please refer to appendix A. 2.6.2. Video For Windows Video For Windows is a set of software application program interfaces (APIs) that provided basic video and audio capture services that could be used in conjunction with these new devices. Video for Windows was introduced as an SDK separate from a Microsoft OS release in the fall of 1992. VFW became part of the core operating system in Windows 95 and NT3.51. Although it will continue to be supported indefinitely, further development has stopped, and the feature set effectively frozen with the release of Windows98. [ORT1993] Video for Windows consists also of 5 applications [URL24], so called MultiMedia Data Tools: - VideEdit, for loading, playing, editing and saving video clips - VidCap, for capturing of single frames or video clips - BitEdit, for editing single frames in a video file - PalEdit, for editing the color palette of an video file - WavEdit, for recording and editing audio files Page 36 Diploma Thesis Erich Semlak 2.6.3. Windows Media Series 9 Microsoft Windows Media Services is a next-generation platform for streaming digital media. You can use the Windows Media Services SDK to build custom applications on top of this platform. For example, you can use the SDK to: - Create a custom user interface to administer Windows Media Services. Programmatically control a Windows Media server. Programmatically configure the properties of system plug-ins included with Windows Media Services. 
Create your own plug-ins to customize core server functionality. Dynamically create and manage server-side playlists. Following descriptions are taken from [URL8] and [URL9]: 2.6.3.1. Windows Media Encoder SDK The Windows Media Encoder SDK is one of the main components of the Microsoft Windows Media 9 Series SDK. The Windows Media Encoder 9 Series SDK is designed for anyone who wants to develop a Windows Media Encoder application by using a powerful Automation-based application programming interface (API). With this SDK, a developer using C++, Microsoft Visual Basic, or a scripting language can capture multimedia content and encode it into a Windows Media-based file or stream. For instance, you can use this Automation API to: - - - - - Broadcast live content. A news organization can use the Automation API to schedule the automatic capture and broadcast of live content. Local transportation departments can stream live pictures of road conditions at multiple trouble spots, alerting drivers to traffic congestion and advising them of alternate routes. Batch-process content. A media production organization that must process a high volume of large files can create a batch process that uses the Automation API to repeatedly capture and encode streams, one after the other. A corporation can use the Automation API to manage its streaming media services with a preferred scripting language and Windows Script Host. Windows Script Host is a language-independent host that can be used to run any script engine on the Microsoft Windows 95 or later, Windows NT, or Windows 2000 operating systems. Create a custom user interface. An Internet service provider (ISP) can build an interface that uses the functionality of the Automation API to capture, encode, and broadcast media streams. Alternatively, you can use the predefined user interfaces within the Automation API for the same purpose. Remotely administer Windows Media Encoder applications. You can use the Automation API to run, troubleshoot and administer Windows Media Encoder applications from a remote computer. This SDK documentation provides an overview of general encoding topics, a programming guide, and a full reference section documenting the exposed interfaces, objects, enumerated types, structures and constants. Developers are also encouraged to view the included samples. [URL8] Page 37 Diploma Thesis Erich Semlak 2.6.3.2. Windows Media Format SDK The Windows Media Format SDK is a component of the Microsoft Windows Media Software Development Kit (SDK). Other components include the Windows Media Services SDK, Windows Media Encoder SDK, Windows Media Rights Manager SDK, Windows Media Device Manager SDK, and Windows Media Player SDK. The Windows Media Format SDK enables developers to create applications that play, write, edit, encrypt, and deliver Advanced Systems Format (ASF) files and network streams, including ASF files and streams that contain audio and video content encoded with the Windows Media Audio and Windows Media Video codecs. ASF files that contain Windows Media–based content have the .wma and .wmv extensions. For more information about the Advanced Systems Format container structure, see Overview of the ASF Format. The key features of the Windows Media Format SDK are: - - - - - - Support for industry-leading codecs: The Windows Media Format 9 Series SDK includes the Microsoft Windows Media Video 9 codec and the Microsoft Windows Media Audio 9 codec. Both of these codecs provide exceptional encoding of digital media content. 
This SDK also includes the Microsoft Windows Media Video 9 Screen codec for compressing computer-screen activity during sessions of user applications, and the new Windows Media Audio 9 Voice codec, which encodes low-complexity audio such as speech and intelligently adapts to more complex audio such as music, for superior representation of combined voice-music scenarios. Support for writing ASF files: Files are created based on customizable profiles, enabling easy configuration and standardization of files. This SDK can be used to write files in excess of 2 gigabytes, enabling longer, better-quality, continuous files. Support for reading ASF files: This SDK provides support for reading local ASF files as well as reading ASF data being streamed over a network. Support is also provided for many advanced reading features, such as native support for multiple bit rate (MBR) files, which contain multiple streams with the same content encoded at different bit rates. The reader automatically selects which MBR stream to use, depending upon available bandwidth at the time of playback. Support for delivering ASF streams over a network: This SDK provides support for delivering ASF data through HTTP to remote computers on a network, and also for delivering data directly to a remote Windows Media server. Support for editing metadata in ASF files: Information about a file and its content is easily manipulated with this SDK. Developers can use the robust system of metadata attributes included in the SDK, or create custom attributes to suit their needs. Improved support for editing applications: This version of the Windows Media Format SDK includes support for advanced features useful to editing applications. The features include fast access to decompressed content, frame-based indexing and seeking, and general improvements in the accuracy of seeking. The new synchronous reading methods provide reading capabilities all within a single thread, for cleaner, more efficient code. Page 38 Diploma Thesis - - Erich Semlak Support for reading and editing metadata in MP3 files: This SDK provides integrated support for reading MP3 files with the same methods used to read ASF files. Applications built with the Windows Media Format SDK can also edit metadata attributes in MP3 files using built-in support for the most common ID3 tags used by content creators. Support for Digital Rights Management protection: This SDK provides methods for reading and writing ASF files and network streams that are protected by Digital Rights Management to prevent unauthorized playback or copying of the content. Page 39 Diploma Thesis Erich Semlak 3. Practical Part 3.1. Problem This thesis is supposed to show an approach for realtime video processing with special emphasize on distributed capturing, processing and subsequent streaming. The system to be developed shall check out the possibilities of standard PC hardware, CPU processing power and Ethernet network. This software solution is supposed to capture video from multiple sources, process them in a specific way and stream the result out on the network. 3.1.1. Model The system to be developed can be considered as simple black box: Picture 6: black box model The involved processes can be divided into four main phases: Picture 7: video process phases The estimated workload for handling the input videos and generating the output stream may exceed the performance of a today common single processor system, therefore a solution is aspired which distributes the workload on multiples computers. 
Considerations about how to divide the process phases have to take into account the dataflow and the amount of data involved. If each process phase were managed by its own machine, the whole video data (and the analysis results) would have to be delivered to the next phase's machine over a network. This would result in a lot of traffic and might not be necessary. As the capturing task is not considered to be very CPU-consuming (see chapter 2.2.2), it can be kept on the same machine as the video analyzing task. Therefore it seems advantageous to combine the video capturing and analyzing functions in one application. These functions occur once per video input, so their count can vary.

The video processing (effects and mixing) part exists only once in the system. It can be combined with the web streaming part unless the web streaming task has to be distributed to multiple machines (if the workload exceeds CPU performance). Tests with a filter graph in GraphEdit have shown that a single compression process (352x288/25fps) can be managed on a single machine without performance problems, so the video processing and web streaming tasks can also be combined into a single application.

Hence the entire process can be divided into two parts, the video capturing and analyzing part and the video processing and web streaming part, each in its own application. For easier reference, the parts which capture the input video will be named "Slaves", and the machine (for the time being considered to be a single computer) which generates the output stream will be named "Master". The refined model then looks like this:

Picture 8: distributed system model

This approach also supports scalability regarding the number of video inputs but is limited by the performance of the Master.

A real-world view could look like this:

Picture 9: real world example

This example illustrates, in a comprehensible way, the set-up of a small-scale use case. It can also be considered a substitute for a larger-scale application, like a racetrack or the surveillance system of an underground train. It is assumed that the slave and master machines are connected through some kind of network. The capacity of this network is considered to be constrained; in other respects the specific properties of the network will be neglected in the following analysis. Now the roles of the slave and master machines are to be examined more precisely.

3.1.2. Applications

In the following, the particular tasks of the Master and Slave applications will be specified in detail, and an object model will be derived from them.

3.1.2.1. Slave Application

The Slaves will be responsible for
- controlling the attached video devices,
- receiving captured video data from these devices,
- compressing the video data to reduce network traffic,
- motion and object detection, and
- sending the compressed video data to the Master machine.

Picture 10: slave model

What seems rather unimportant in the above schema is the network communication task, which is only illustrated by the arrow out of the Slave box. Nevertheless this communication can be seen as a task of its own and will be covered in the object model. The Slave task can therefore be divided into two major parts with regard to the usage of DirectShow.
On the one side a filter graph has to be spanned, and on the other side the connection and communication with the Master application has to be established and operated.

3.1.2.2. Master Application

The Master machine's tasks will be
- receiving the compressed video data from the Slave machines,
- decompressing the Slaves' video data,
- combining video frames for specific effects,
- compressing the resulting video data and
- streaming the compressed video data out on the network.

Picture 11: master model

As in the Slave task model, the network communication is only illustrated by the arrow from the left which carries the compressed video data. Of course the network communication task itself has to be discussed and will be covered later. The video processing task performs the combination of the incoming video streams via different effects and decides which video stream(s) to show. This decision is made upon video content analysis (motion detection data from the Slaves) or through manual input.

Picture 12: master video processing

3.2. Technology Evaluation

The following chapter describes which technologies and platforms have been tested, which aspects have been considered and how far the candidates fulfill these aspects and requirements. The candidates to be evaluated are:
- DirectShow SDK
- Windows Media SDK
- Combination of DirectShow and Windows Media
- Video For Windows
- "from scratch"

3.2.1. Evaluation Aspects

There are some requirements which the target and development platform have to meet to be appropriate for solving the given specification. Some of them arise from the technical circumstances, others are more personal preferences based on former experience or special interests and expertise.

3.2.1.1. Target Platform

When searching for video software, almost every link that can be found points to software for the Windows platform. In particular, well-known developers of video editing and processing software like Pinnacle, Adobe, Ulead, Cyberlink and Microsoft only provide solutions for the Windows platform. There are a few video processing and editing tools available for Linux [URL64], but none of them are professional solutions as they all seem to be shareware. Therefore there aren't any development standards regarding video data on Linux. This means that Windows is the appropriate platform for the development of the intended solution.

3.2.1.2. Development Environment

I do my regular programming and development work under Windows .NET 2003 with Visual Studio .NET and have had quite good experiences with it. The .NET object model provides a lot of convenient components and functions. Therefore I decided to use .NET as much as possible for the things to come. I would prefer to code in C#, as it is well structured and more easily readable than C++ code sometimes is. But I am aware that there may be situations where I will have to fall back to C++, as most SDKs and libraries are based on C++. This means that the technology to be chosen has to be supported by .NET and vice versa. The only constraint for the later usage and operation of the developed applications is that the .NET Framework 1.1 has to be installed, but I consider this acceptable because there will be more applications to come which will be based on the .NET Framework.
Page 45 Diploma Thesis Erich Semlak 3.2.1.3. Development aspects Development aspects are: - the needed components and SDKs have to be available for free - the components or environment have to be available for the specified target platform - the SDKs have to be available for a programming language which I am able to code (Visual Basic, C++, C#, Java) to avoid an additional initial skill adaptation phase. 3.2.1.4. Modularity One of the specification issues is that the system to develop has to be modular, so that it is possible to add new parts (e.g. effects) without being constrained to redesign the whole thing, maybe even without recompiling anything. So it would be advantageous if the underlying components or development techniques would support this aspect. 3.2.1.5. Complexity Depending on the programming experience the complexity of the development environment is criteria. High complex object models may take a longer phase to get accustomed with and may lead to more development and debugging costs. Since the grade of complexity sometimes coheres with the grade of possibilities a system possesses it might be accepted or even wanted. A low level of complexity may mean that an average task can be achieved with a few lines of code. 3.2.1.6. Abstraction level The components to be used shall provide an appropriate level of abstraction of the underlying hardware devices and drivers to release the developer of hardware or driver quirks. On the other hand it has to offer facilities for accessing low level information and interfaces if needed, e.g. the raw video and stream data itself or some hardware settings, to ensure “total control” over what is going on. This is more or less the “KO”-criteria because without access to specific low level data and interfaces some functions required by the specification cannot be obtained. 3.2.1.7. 3rd Party Components The availability of 3rd party components is of interest as there may be components which could be integrated to avoid to reinvent the wheel and to save development time. Page 46 Diploma Thesis Erich Semlak 3.2.1.8. Documentation A very important part of every SDK is the documentation. Without a proper and comprehensive documentation it would be nearly impossible to use any unfamiliar environment. The documentation has to contain at least an object reference. Page 47 Diploma Thesis Erich Semlak 3.2.2. DirectShow SDK As DirectShow is part of DirectX so support for handling DirectShow is included in the DirectX SDK. To my mind DirectShow seems to be treated somehow stepmotherly. The focus is more set on those parts of DirectX which are used for game development. The DirectShow SDK offers a powerful digital media streaming architecture, highlevel APIs, and an extensive library of plug-in components respectively filters. The DirectShow architecture provides a generalized and consistent approach for creating virtually any type of digital media application. A DirectShow application is written by creating an instance of a high-level object called the filter graph manager, and then use it to create configurations of filters (called filter graphs) that work together to perform some desired task on a media stream. 
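To make this concrete, the following minimal sketch (not taken from the thesis' code) shows the pattern just described: the application creates the filter graph manager through COM, lets it assemble a playback graph for a file and runs it. The file name is only a placeholder and error handling is omitted for brevity.

#include <dshow.h>                     // DirectShow API; link with strmiids.lib and ole32.lib

int main()
{
    CoInitialize(NULL);                // COM has to be initialized first

    IGraphBuilder *pGraph   = NULL;
    IMediaControl *pControl = NULL;
    IMediaEvent   *pEvent   = NULL;

    // Create the filter graph manager and query its control interfaces.
    CoCreateInstance(CLSID_FilterGraph, NULL, CLSCTX_INPROC_SERVER,
                     IID_IGraphBuilder, (void**)&pGraph);
    pGraph->QueryInterface(IID_IMediaControl, (void**)&pControl);
    pGraph->QueryInterface(IID_IMediaEvent,   (void**)&pEvent);

    // Intelligent connect: the graph manager picks and connects the source,
    // decoder and renderer filters needed for the given file by itself.
    pGraph->RenderFile(L"C:\\sample.avi", NULL);   // placeholder file name

    long evCode = 0;
    pControl->Run();                               // start the media stream
    pEvent->WaitForCompletion(INFINITE, &evCode);  // block until playback is done

    pControl->Release();
    pEvent->Release();
    pGraph->Release();
    CoUninitialize();
    return 0;
}

GraphEdit, mentioned below, essentially performs the same steps interactively.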
The filter graph manager and the filters themselves handle all buffer creation, synchronization, and connection details, so the application developer only needs to write code to build and operate the filter graph; there is no need to touch the media stream directly, although access to the raw data is provided in various ways for those applications that require it. DirectShow also includes a set of high-level APIs, known as DirectShow Editing Services (DES), which enables the creation of non-linear video editing applications. The documentation is not included in the DirectX SDK's documentation and help files but can be downloaded separately and integrates into the MSDN Library, if installed. It contains an object reference and a lot of examples for many different purposes and is easily readable and comprehensible. With the help of GraphEdit, a tool shipped with DirectShow, filter graphs can easily be pre-tested before casting them into code. This supports the development of new filters.

3.2.3. Evaluation of DirectShow

After evaluation of the DirectShow SDK with regard to the required development aspects, the following compliances can be found:

3.2.3.1. Development and Platform Aspects

- The DirectShow SDK is available for Windows as part of DirectX.
- The SDK is available for free and can be obtained from the Microsoft Downloads website.
- The SDK can be used within the desired programming environment and supports one or more of my favored programming languages. Managed code such as C# can be used only by rewriting the needed interfaces, because the SDK provides only some of them. Filter development in particular is only reasonable in C++.
- Debugging with DirectShow is sometimes annoying when developing new filters. No direct step-by-step debugging is possible, as a filter has to be compiled as a DLL and then be loaded and hooked into a filter graph. It is, however, feasible to attach an external debugger to the GraphEdit process and halt at predefined breakpoints, if the filter was compiled with debugging information.

3.2.3.2. Modularity

The level of modularity and openness in DirectShow is very high. The structure of a filter graph is totally flexible and can contain multiple source inputs and target outputs. The number of filters between input and output is also variable and more or less constrained only by performance issues. As a result, a self-developed filter can be used in any third-party DirectShow-based application.

3.2.3.3. Complexity

Complexity in DirectShow depends on how far its possibilities shall be exhausted. A simple capture-render application can be developed in a short time and is supported in this by the intelligent connection mechanisms. Nevertheless, the object model of DirectShow is extensive and powerful and supports every problem mentioned in the requirements.

3.2.3.4. Abstraction Level

The level of abstraction depends on how DirectShow and filter graphs are used. As long as only existing filters are hooked together, the abstraction level remains high and the developer doesn't need to bother with hardware- or file-specific subtleties. But by writing one's own filters, every aspect of DirectShow and the underlying architecture can be controlled and exploited.

3.2.3.5. 3rd Party Components

The availability of 3rd party components is huge.
A broad range of filters for many different purposes (color conversion, resizing, cropping, network source/target, compressors/decompressors) can be found at MontiVision [URL43] and LEAD [URL44]. Aside from this, DirectShow already comes with a great number of filters.

3.2.3.6. Documentation

The provided documentation is comprehensive and easily understandable. The object model and the programming techniques for several example applications are well explained. The DirectShow documentation can also be found on the web, without the need to install the SDK, at the online MSDN library [URL42].

3.2.4. Windows Media SDK

The Windows Media 9 Series SDK consists of five major parts: the Windows Media Encoder SDK, Windows Media Format SDK, Windows Media Player SDK, Windows Media Services SDK and Windows Media Rights Management SDK [URL52]. The Windows Media Services SDK component enables content developers and system administrators to support Windows Media Technologies in their web sites and is therefore not of interest for our specification. The Windows Media Rights Manager SDK is designed for developers who wish to deliver digital media via the Internet in a protected and secure manner. It can help protect the rights of content owners while enabling consumers to easily and legitimately obtain digital content. This is also not an aspect regarding our specification and hence not part of the evaluation.

3.2.4.1. Windows Media Encoder SDK

The Windows Media Encoder 9 Series SDK is designed for people who want to develop an application based on the Windows Media Encoder and provides a powerful Automation-based API. The SDK can be used with C++, C# or Microsoft Visual Basic to capture multimedia content and encode it into a Windows Media-based file or stream. The Automation API can be used to:
- Broadcast live content. A live video source can be captured and streamed directly to a Media Server or out on the network.
- Batch-process content. A high volume of large files can be processed by a pre-created batch process that uses the Automation API to repeatedly capture and encode streams, one after the other. This can be achieved by using a preferred scripting language and Windows Script Host.
- Create a custom user interface. An interface can be built that uses the functionality of the Automation API to capture, encode, and broadcast media streams. Alternatively, the predefined user interfaces within the Automation API can be used for the same purpose.
- Remotely administer Windows Media Encoder applications. The Automation API can be used to run, troubleshoot and administer Windows Media Encoder applications from a remote computer.

The SDK documentation provides an overview of general encoding topics, a programming guide, and a full reference section documenting the exposed interfaces, objects, enumerated types, structures and constants. There are also several samples included.

3.2.4.2. Windows Media Format 9 Series SDK

The Windows Media Format SDK is meant for creating applications that play, write, edit, encrypt, and deliver Advanced Systems Format (ASF) files and network streams, including ASF files and streams that contain audio and video content encoded with the Windows Media Audio and Windows Media Video codecs. The key features of the Windows Media Format SDK are:
- Support for industry-leading codecs. The SDK includes the Microsoft Windows Media Video 9 codec and the Microsoft Windows Media Audio 9 codec.
Both of these codecs provide exceptional encoding of digital media content.
- Support for writing ASF files based on customizable profiles, including files exceeding 2 GB.
- Support for reading ASF files as well as reading ASF data being streamed over a network.
- Support for delivering ASF streams over a network through HTTP and also for delivering data directly to a remote Windows Media server.
- Support for Digital Rights Management protection. Methods for reading and writing ASF files and network streams that are protected by Digital Rights Management are provided to prevent unauthorized playback or copying of the content.

3.2.4.3. Windows Media Player SDK

The Microsoft Windows Media Player SDK provides information and tools to customize Windows Media Player and to use the Windows Media Player ActiveX control. Support for customizing Windows Media Player is provided by:
- Windows Media Player skins. Skins allow both customizing the Player user interface and enhancing its functionality by using XML.
- Windows Media Player plug-ins. Windows Media Player includes support for plug-ins that create visualization effects, that perform digital signal processing (DSP) tasks, that add custom user interface elements to the full mode Player, and that render custom data streams in digital media files created using the ASF file format.

Embedding the Windows Media Player control is supported for a variety of technologies, including:
- HTML in web browsers. Microsoft Internet Explorer and Netscape Navigator versions 4.7, 6.2, and 7.0 are supported.
- Programs created with the Microsoft Visual C++ development system
- Programs based on Microsoft Foundation Classes (MFC)
- Programs created with Microsoft Visual Basic 6.0
- Programs created using the .NET Framework, including programs written in the C# programming language
- Microsoft Office

3.2.5. Evaluation of Windows Media SDK

Regarding the initially mentioned aspects, the Windows Media SDK fulfills the given demands as follows:

3.2.5.1. Development and Platform Aspects

- The Windows Media SDK is obviously available for Windows.
- The SDK is available for free and can be downloaded from the Microsoft Downloads website.
- The SDK can be used within .NET and supports C++ as well as managed languages.

3.2.5.2. Modularity

The level of modularity and openness in Windows Media is rather low. The structure of capturing and streaming is strictly predetermined, as there is an input source and an output target with compression in the middle. There is no possibility to hook components in between the input and output, so no custom effects can be added. There can also be only one input source at a time, therefore no combination of inputs can be mixed.

3.2.5.3. Complexity

Complexity in Windows Media is consistently low. The object model is quite simple and conceals the powerful mechanisms beneath. This prevents them from being accessed and thus thwarts any attempt to use the SDK for tasks it isn't designed for.

3.2.5.4. Abstraction Level

The abstraction level is very high with Windows Media. There are no interfaces down to the drivers or hardware devices. The raw streaming data is also not exposed and cannot be read or modified directly.

3.2.5.5. Documentation

The provided documentation is comprehensive and easily understandable. The object model and the programming techniques for several example applications are well explained. The documentation can also be found online at Microsoft [URL45].
3.2.5.6. 3rd Party Components

The availability of 3rd party plug-ins is limited; there are some available at Inscriber [URL46] for titling and graphic insertion and at Consolidated Video [URL47] for color masking and alpha blending. DirectShow transform filters can be used, but the cooperation is sometimes brittle.

3.2.6. Combination of DirectShow and Windows Media

Using a combination of DirectShow and Windows Media has also been considered. For instance, the source and effect parts could be developed with DirectShow and the network streaming parts with the Windows Media Encoder, to take advantage of the included compression codecs. Unfortunately, there are no mechanisms in the Windows Media Encoder object model that allow the integration of other components that could read or modify the media data stream, nor is it possible to use the Windows Media Encoder in DirectShow as an output rendering filter, so a combination cannot be managed.

3.2.7. Video for Windows

Video for Windows seems to be more of a relic from the 16-bit era and lacks many of the now common functions. VfW has a number of other deficiencies:
- No way to enumerate available capture formats.
- No TV tuner support.
- No video input selection.
- No VBI support.
- No programmatic control of video quality parameters such as brightness, contrast, hue, focus, zoom.
Therefore it doesn't seem appropriate for developing an application that meets all the requirements of the specification.

3.2.8. From Scratch

For the sake of completeness, the possibility shall be addressed to write all of the communication with the hardware drivers oneself and to implement all of the video processing and output. Obviously this is an unfavorable alternative because it would be a huge development task. It doesn't make any sense in consideration of the previously described techniques.

3.2.9. Overview

This comparison shall once again illustrate the aspects of the evaluated alternatives:

                    Windows Media SDK   DirectShow SDK          Video for Windows   From Scratch
Platform            Windows             Windows                 Windows             Windows
Development         .NET, C++           .NET (parts), C++, VB   C, C++              C++ (total)
Modularity          low                 high                    very low            -
Complexity          low                 flexible                simple              -
Abstraction Level   high                flexible                very high           -
Documentation       comprehensive       comprehensive           general             none
Components          few                 lots                    none                none

Table 3: Overview SDKs

3.2.10. Conclusion

The above table shows that the DirectShow SDK is the favorable alternative for the given task if the values for modularity and abstraction level, which are the two most important categories, are compared. As mentioned at the beginning of this chapter, the most important criterion is the option to gear into the media data stream and to access low-level device functionality if needed. This is not assured with the Windows Media SDK, which is rated low for modularity, offers no access beneath its high abstraction level and provides only a few components. Microsoft states that "…DirectShow provides generic sourcing and rendering support, whereas the Windows Media Encoder SDK and Windows Media Player SDK ActiveX control are tailored capture and playback solutions..." [URL48]. This means that Windows Media does not fit the requirements for developing this thesis' applications. DirectShow simplifies the video managing process through its modular design and the availability of many ready-to-go filters. The comprehensible documentation and extensive object model enable the development of reusable components.
The GraphEdit tool eases the testing of those components and allows some kind of prototyping without the need for code. Furthermore, DirectShow shows no fundamental disadvantages regarding the remaining evaluated aspects. Therefore the applications to be developed will be created using the DirectShow SDK.

3.3. Implementation Considerations

The decision to use DirectShow as the base technology for developing the applications for this thesis effectively determines the way they have to be designed. The modularity of DirectShow offers a serial way to process video streams by adding several filter components to build a filter graph. This approach gives maximum flexibility for all required functions. Therefore the applications' model structure (as illustrated in Picture 10: slave model and Picture 11: master model) can be almost entirely transferred to the implementation design. Each video processing and analyzing task can be packaged in a DirectShow filter (source filter, motion detection filter, object detection filter, zoom/clip filter, compression filter, network sending filter).

Picture 13: Slave application model
Picture 14: Master application model

As explained in chapter B.1, writing a DirectShow application follows certain development rules and thus specifies the internal structures of the data and its processing. The implementation process is supported through the use of GraphEdit, which allows filter graphs to be tested without the need for code and so eases the testing and debugging of self-written filters. Therefore all needed filters have to be ready before the Master and Slave applications can be developed.

3.4. Filter Development

To handle the different tasks between video source and output target, a filter graph has to be spanned which consists of appropriate filters and, aside from the pure video processing, event handling. These tasks, in execution order, closely match the models of the Slave and Master applications:

Slave:
- capturing input
- motion detection
- object detection
- video compression
- delivery to the Master

Master:
- receiving the Slaves' input (video & motion/object data)
- video decompression
- mixing and fading
- video re-compression
- web streaming

For some of these tasks, ready-to-use filters are available, but the rest has to be developed. Capturing filters have to be available for the video device to be used. Compression/decompression filters can be found in great numbers (a lot of them for free), so there is no need to develop another one. The motion/object detection and network delivery filters, however, can only be found as expensive commercial products which, besides, are not compatible with DirectShow, so these have to be written from scratch.

3.4.1. Motion Detection Filter

The Motion Detection filter analyzes the incoming video pictures for motion. It takes the average of the last 5 pictures (which corresponds to a fifth of a second when capturing at 25 frames per second) to avoid disturbance by flickering or shutter changes of the camera. The incoming camera pictures are downscaled and averaged into grayscale tiles of 8x8 pixels. These tile planes are then compared to the initially stored backplane. Major differences are interpreted as motion; the average position of the tiles with changed grayscale values is then taken as the center of motion. The backplane gradually converges towards the averaged camera pictures to adapt to the current camera picture.
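The tile-and-backplane comparison described above can be pictured with the following simplified sketch. It is an illustration only, not the filter's actual code; the frame layout (8-bit grayscale), the difference threshold and the blending factor are assumptions.

#include <cstdlib>

struct MotionResult { int count; float cx; float cy; };   // changed tiles and their centroid

// frame: 8-bit grayscale picture of size width x height
// backplane: one averaged grayscale value per 8x8 tile, updated in place
MotionResult DetectMotion(const unsigned char *frame, int width, int height,
                          unsigned char *backplane)
{
    const int tilesX = width / 8, tilesY = height / 8;
    const int threshold = 20;          // assumed grayscale difference that counts as motion
    MotionResult r = { 0, 0.0f, 0.0f };

    for (int ty = 0; ty < tilesY; ty++) {
        for (int tx = 0; tx < tilesX; tx++) {
            // Average the 8x8 pixel tile down to a single grayscale value.
            int sum = 0;
            for (int y = 0; y < 8; y++)
                for (int x = 0; x < 8; x++)
                    sum += frame[(ty * 8 + y) * width + tx * 8 + x];
            int avg = sum / 64;

            unsigned char *bp = &backplane[ty * tilesX + tx];
            if (abs(avg - *bp) > threshold) {   // major difference -> motion in this tile
                r.count++;
                r.cx += (float)tx;
                r.cy += (float)ty;
            }
            // Let the backplane converge slowly towards the current picture.
            *bp = (unsigned char)((*bp * 15 + avg) / 16);
        }
    }
    if (r.count > 0) { r.cx /= r.count; r.cy /= r.count; }  // "epicenter" of motion (tile units)
    return r;
}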
The interface and functions provided by the Motion Detection filter are:

Interface: GUID {43D849C0-2FE8-11cf-BCB1-444553540000}
Property page: GUID {43D849C0-2FE8-11cf-BCB1-444553540000}

STDMETHODIMP get_motion(int *c, int *px, float *py);

Parameter description:
c: count of moving picture parts
px/py: average center of the moving picture parts, the "epicenter" of motion

STDMETHODIMP StartDetect();
STDMETHODIMP StopDetect();

These functions allow starting and stopping the motion detection in order to decrease the CPU load while no motion detection is needed. The same information can be retrieved through the property sheet when using e.g. GraphEdit:

Picture 15: motion detection filter properties page

3.4.2. Object Detection Filter

The Object Detection Filter provides functionality to follow an object defined by a certain picture clip of the video input. The object data can be read out or fed in so that it can be distributed to several instances of the filter. Object detection is based on a color histogram of the defined object's picture clip. The Object Detection Filter uses a template with image subtraction. For performance reasons the image of the selected object is divided into 8x8-pixel tiles, for each of which the color histogram is calculated. The incoming video picture is also divided into 8x8-pixel tiles, and the object's histogram is then compared with every possible position in the picture. The effort of searching for the object therefore depends on the size of the object's picture and the size of the incoming video.

As the clip selection cannot easily be done through a property page, the filter cannot be used with e.g. GraphEdit and can only be driven from code. This is acceptable, because handling the resulting detection information also has to be done in code. The interface of the Object Detection Filter is:

Interface: GUID {D9114FD6-9227-48e2-9193-4E8FC0664081}

STDMETHOD StartDetect();
STDMETHOD StopDetect();
STDMETHOD get_object_data(int *px, int *py, int *width, int *height, BYTE **imgdata);
STDMETHOD get_object_pos(int *px, int *py, int *width, int *height, int *setobject);
STDMETHOD set_object(int px, int py, int width, int height);
STDMETHOD get_huemask_data(BYTE **huedata, long *huelen, BYTE **maskdata, long *masklen);
STDMETHOD set_huemask_data(BYTE *huedata, BYTE *maskdata, int picx, int picy, int picwidth, int picheight);

As with the Motion Detection filter, only every fifth frame is analyzed. This should be fast enough to follow a moving object.

Parameter description:
px/py: top left corner of the picture clip containing the object
width/height: size of the clip containing the object
imgdata: picture data of the captured frame for the object selection bounding box
huedata/huelen: pointer where to put/get the hue data and its length
maskdata/masklen: pointer where to put/get the mask data and its length
picx/picy, picwidth/picheight: position and size of the picture clip containing the object

3.4.3. Network Sending/Receiving Filter

To deliver the compressed video data from the Slave to the Master, a sending filter at the Slave and a receiving filter at the Master are needed. To successfully hook the receiving filter into a filter graph, the filter's output has to have a defined media type. The Slave therefore sends its media type, determined from the Slave's filter graph, once per second. After the media type has been set on the receiver's output pin, the filter graph can be built.
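To make the once-per-second media type announcement concrete, the sketch below shows one possible packet layout for the Slave-to-Master link (which, as described next, uses connectionless datagrams); the header fields are assumptions and not the filters' actual wire format:

// Hypothetical packet layout (assumption, not the real format). A packet either
// announces the media type or carries one compressed video sample; samples larger
// than a single datagram would have to be split, which is omitted here.
#include <winsock2.h>
#include <cstdint>
#include <cstring>
#include <vector>
#pragma comment(lib, "ws2_32.lib")

enum PacketKind : uint8_t { PK_MEDIATYPE = 0, PK_SAMPLE = 1 };

#pragma pack(push, 1)
struct PacketHeader {
    uint8_t  kind;       // PK_MEDIATYPE (sent once per second) or PK_SAMPLE
    uint32_t sequence;   // lets the receiver notice lost frames without resending them
    uint32_t length;     // payload bytes following this header
};
#pragma pack(pop)

// Send one datagram to the Master's receiving port.
bool SendPacket(SOCKET s, const sockaddr_in& master, PacketKind kind,
                uint32_t sequence, const uint8_t* payload, uint32_t length)
{
    std::vector<uint8_t> buf(sizeof(PacketHeader) + length);
    PacketHeader hdr = { static_cast<uint8_t>(kind), sequence, length };
    std::memcpy(buf.data(), &hdr, sizeof(hdr));
    std::memcpy(buf.data() + sizeof(hdr), payload, length);
    return sendto(s, reinterpret_cast<const char*>(buf.data()),
                  static_cast<int>(buf.size()), 0,
                  reinterpret_cast<const sockaddr*>(&master),
                  sizeof(master)) == static_cast<int>(buf.size());
}

On the Master's side the receiver would use the media type payload to configure its output pin before the rest of the graph is connected.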
Delivering the video over the network is done via the connectionless UDP protocol. If a video frame is lost on the way, the packet is not resent; for video streams this is not a big problem as long as the stream keeps flowing onward. TCP tries to resend packets in correct order, which could cause stuttering video, so it is not used. It is recommended to use the Network Sending/Receiving filters in combination with a light compression/decompression filter to reduce network traffic, otherwise even a 100 MBit line could be fully loaded quickly.

The interfaces of the network filters look as follows:

Interface: GUID {1a8f2631-2bde-4ff5-a79e-0c495902ed1d}
Property pages:
Sender: GUID {47cfd9eb-6c13-4d90-9605-84e1115c7c96}
Receiver: GUID {7d8a1eb0-dfd1-453e-b225-0808f8e4f810}

STDMETHODIMP SetNetwork(ULONG ulIP, USHORT usPort);
STDMETHODIMP GetNetwork(ULONG *pIP, USHORT *pPort);
STDMETHODIMP GetTraffic(ULONG *bytecnt, ULONG *framecnt, ULONG *medtcount);
STDMETHODIMP GetInfo(LPSTR *infostr);
STDMETHODIMP AddTraffic(ULONG bytecnt, ULONG medtcnt);
STDMETHODIMP ResetTraffic(void);

Parameter description:
ulIP: the IP of the source
pPort: port the data is to be sent to; pPort+1 is used as the sending port
bytecnt/framecnt/medtcount: count of data bytes/frames/media type infos sent
infostr: info about the filter's state (stopped/paused/running)

The property page also allows entering IP and port and shows the amount of data and frames:

Picture 16: HNS MPEG-2 sender properties page

3.4.4. Infinite Pin Tee Filter

A "tee" is, in technical terms, a component (e.g. a pipe) with three inputs/outputs, i.e. some kind of branching. It is often shaped like a Y or a T (hence the "tee"). The "infinite" comes from the possibility of (theoretically) opening up an infinite number of inputs which result in one single output. The infinite tee is more or less the heart of the Master application. It combines several video inputs, does fading and clipping, and has a single video stream as output. Each time an input filter is connected, a new input pin is generated. The maximum number of input pins is hardwired to 1000, which is more than a PC will be able to handle for the next few years. This behavior can easily be demonstrated using GraphEdit:

Picture 17: filter graph using infinite pin tee

The input pins' resolutions may differ from each other; the output pin's resolution is the maximum width and height of the incoming media types.
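As a rough illustration of what the tee has to do per frame, the following C++ sketch derives the output size from the connected inputs and blends them back to front using the clip position, opacity and Z-order values described in the next paragraphs; the pixel arithmetic is an assumption and not the filter's actual code:

// Sketch only: output size as the maximum of the inputs, then back-to-front
// alpha blending of RGB24 clips onto a black background.
#include <algorithm>
#include <cstdint>
#include <vector>

struct Input {
    int width, height;        // resolution of this input's clip
    int topx, topy;           // position of the clip in the output picture
    int oppac;                // 0 = fully translucent, 255 = opaque
    int zorder;               // 0 = most backward plane
    const uint8_t* rgb;       // RGB24 sample of the current frame
};

// Output resolution is the maximum width and height over all connected inputs.
void OutputSize(const std::vector<Input>& in, int& w, int& h) {
    w = h = 0;
    for (const Input& i : in) { w = std::max(w, i.width); h = std::max(h, i.height); }
}

// Blend one source pixel over the destination according to its opacity.
inline void BlendPixel(uint8_t* dst, const uint8_t* src, int oppac) {
    for (int c = 0; c < 3; ++c)
        dst[c] = static_cast<uint8_t>((src[c] * oppac + dst[c] * (255 - oppac)) / 255);
}

void Compose(std::vector<Input> inputs, std::vector<uint8_t>& out, int outW, int outH) {
    out.assign(outW * outH * 3, 0);                        // background is black
    std::sort(inputs.begin(), inputs.end(),                // draw backward planes first
              [](const Input& a, const Input& b) { return a.zorder < b.zorder; });
    for (const Input& in : inputs)
        for (int y = 0; y < in.height && in.topy + y < outH; ++y)
            for (int x = 0; x < in.width && in.topx + x < outW; ++x)
                BlendPixel(&out[((in.topy + y) * outW + in.topx + x) * 3],
                           &in.rgb[(y * in.width + x) * 3], in.oppac);
}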
To set the inputs' clip positions and fading values, the property page can be used:

Picture 18: infinite pin tee filter properties page

The same functionality can be achieved through the filter's COM interface:

Interface: GUID {16a81957-4662-4616-be5a-a559ee21725a}
Property page: GUID {bac542e3-30f0-4039-99d3-61b6a7849e45}

STDMETHOD set_window(int input, int posx, int posy, int topx, int topy, int width, int height, int oppac, int zorder);
STDMETHOD get_window(int input, int *posx, int *posy, int *topx, int *topy, int *width, int *height, int *oppac, int *zorder);
STDMETHOD get_resolution(int input, int *width, int *height);
STDMETHOD get_sampleimage_ready(int inputnr, int *ready);
STDMETHOD get_sampleimage(int inputnr, BYTE *buf);
STDMETHOD get_ConnCount(int *count);
STDMETHOD get_FramesOut(long *framesout);
STDMETHOD get_FramesIn(long *framesin, long *framesc);

Parameter description:
posx/posy: position of the clip in the source video
topx/topy: position of the clip in the output video
width/height: size of the clip
oppac: opacity, 0 = totally translucent, 255 = opaque
zorder: number of the Z-plane, 0 = most backward
inputnr: number of the input pin to configure
ready: sample image (RGB) is ready at pointer buf
buf: pointer to memory for the sample image, needs (width*height*3) bytes
count: number of input pins
framesout: number of frames delivered to the output pin
framesin: number of frames received on the input pins

3.4.5. Web Streaming Filter

There are a couple of streaming applications available which stream a video file or a live capture source out onto the network, but I did not find any working DirectShow filter for this task. It would have meant a lot of work to develop an RTP streamer from scratch, but fortunately there is an open source RTP streaming application available at LIVE.COM [29], which was originally written as a console application and for file sources only. The classes for the input of MPEG files (ByteStreamFileSource) had to be rewritten to communicate over shared memory with the filter's input pin. The streaming server runs in its own thread, simultaneously with the filter's input thread, and can be started and stopped through the shared memory interface.

Interface of the Streaming filter: GUID {066ec65d-75bf-4b66-80e1-44e5ec492f97}

STDMETHOD SetNetwork(unsigned long ulIP, int usPort, char *aName);
STDMETHOD GetNetwork(unsigned long *ulIP, int *pPort, char *aName);
STDMETHOD StartStream();
STDMETHOD PauseStream();

Parameter description:
ulIP: IP of the local network interface to be used
usPort: port on which the stream shall be served (default: 9999)
aName: stream name for the access URL (default: "stream")

These parameters may also be set through the properties page:

Picture 19: HNS MPEG streamer filter properties page

The streaming filter is somewhat fussy about the MPEG stream format that comes through the input pin. It only works with MPEG1 elementary streams, so no audio can be added. The Master application in my case uses the "Moonlight Video Encoder Std" with these settings:

Picture 20: Moonlight MPEG2 Encoder settings

3.5. Applications

To use the previously described filters and their functionality in a comfortable way, they have to be wrapped into applications, one for the slave and one for the master machines. These applications provide the graphical user interface and manage the network communication needed in addition to the video broadcasting.
The following chapters describe the functionality, usage and technical details of these applications.

3.5.1. Slave Application

The slave application is meant to run without any user intervention except for choosing the video input to be used. After that, no further user input is necessary.

Picture 21: Slave application camera selection dialog

The program waits until a seeking broadcast (from port 4711 to 4811) of a master comes around and then answers it. Any configuration and control is then done through the master application. If the network connection is lost or the master application is shut down, the slave goes back to listening mode and waits for a master's broadcast.

If master and slave reside in different networks, so that no broadcast can be sent from one to the other, a "ping mode" can be used. For this, the slave must be configured with a particular master's IP address and port. The slave then tries to contact the master autonomously. The disadvantage of this mode is that only a single master is contacted, whereas in broadcast mode any master can reach all waiting slaves. For routing reasons the configured address has to be routable, so neither the master nor the slave may reside in a (different) private network that is not reachable from outside.

After a connection has been set up and the video transmission has started, the slave application shows the local video picture and traffic information. If motion/object detection is activated, the current position values are also displayed.

Picture 22: Slave application graphical user interface

The internal dataflow in the Slave's filter graph involves motion and object detection, video compression and delivery to the Master. The filter graph could be set up in GraphEdit like this:

Picture 23: Slave's filter graph

3.5.2. Master Application

The Master is responsible for the central user interaction, slave management and web streaming. It searches for waiting slaves, contacts them and starts the video transmission. Any necessary data distribution (e.g. for object detection or motion zoom) is also done by the Master. The typical workflow from starting the Master application until streaming the result to the web looks like this:

Picture 24: Master application schematic workflow

The optional object detection is not mentioned in this workflow but will be discussed in detail later. The internal filter graph which is built when two Slaves are connected and web streaming is activated would look like this in GraphEdit:

Picture 25: Master's filter graph

3.5.2.1. Usage

Now the same interaction steps through the graphical user interface: First of all, it has to be selected which of the waiting slaves should be activated. To get the list of waiting slaves, choose "Search" from the File menu. The Master then broadcasts its IP and incoming port address. The list displays the names of the machines the slaves are running on as well as their IP addresses and network ports. There can be more than one slave running on a single machine, therefore the ports are necessary to distinguish between the running instances:

Picture 26: Master application Slave selection

After the selection procedure is finished, the Master sends a start command to those slaves. Each activated slave now sends its possible camera resolutions back to the Master. The user can then choose the desired camera resolution and the size of the clip (e.g. for motion zoom).
Half and quarter clips can be selected more easily by clicking the appropriate buttons:

Picture 27: Master application Slave resolution selection

After the resolution has been selected for each Slave, the Master builds its filter graph and hooks a receiving filter for each Slave into it. This takes a few seconds.

Picture 28: Master application filter graph setup

After everything is up, the GUI shows a configuration panel for each Slave where clip manipulation can be done and Motion Zoom or Object Detection may be activated.

Picture 29: Master application graphical user interface

If motion detection is activated, the motion indicator shows the "amount" of motion for each slave. On the right side of the application window the resulting video preview is displayed.

The "fade" slider indicates the opacity of the Slave's video in the output. The further the slider is drawn to the left, the more translucent the video gets. This also depends on the "ZOrder", which sorts the video inputs from back to front, 0 meaning the most backward plane. The background is black, so if there is only one input video, this video fades to black.

The "Zoom", "Clip" and "Output" panels can be used if a smaller clip size than the camera resolution was selected:
- "Zoom" indicates which part of the input video picture shall be delivered by the Slave
- "Clip" shows which part of the delivered Slave's video shall be passed on
- "Output" sets the position of the clip in the output video

After everything has been configured to satisfaction, web streaming can be activated simply by choosing "Streamer" from the "Extras" menu. The resulting video is then streamed out onto the network. To view the stream on a remote machine with QuickTime or Media Player, the URL to be opened is "rtsp://server name or IP:port/streamname".

3.5.2.2. Motion Detection / Motion Zoom

If Motion Detection is activated for a Slave, it measures the amount of movement in the camera's visual field. How motion is measured is discussed in the chapter "Motion Detection Filter". This motion information can be used to drive the "Motion Zoom" feature, which automatically moves the input clip (if it is smaller than the camera's input picture size) to the center of motion. This means that the part of the input video where motion takes place is always shown. While the Slave does the video analysis, the Master (or the user) sets the clip position and sends the coordinates back to the Slave, which then sends only this particular part of the video picture. The data flow for motion detection and clip selection is as follows:

Picture 30: motion detection schematic data flow

Motion Detection can also be used to automatically select the Slave with the highest motion value. This mode is activated by choosing "Motion Fade" in the "Extras" menu. Sometimes this causes flickering video streams when more than one video input shows motion.

3.5.2.3. Object Detection

In some cases the flickering effect of simple motion fading can be very annoying, especially when following a particular object which moves across the separate cameras' visual fields or when there are several moving objects. In those cases "Object Detection" might help. As the Slaves have to know what the object looks like, the user first has to locate the desired object in one of the incoming videos. Then the picture clip containing the object's view has to be selected with a bounding box. This clip is then received from the particular Slave and distributed to the others.
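Conceptually, this distribution boils down to reading the hue/mask description of the object from the Object Detection filter of the selecting Slave and writing it into the filters of the other Slaves. The C++ sketch below illustrates this with a stand-in interface declaration modelled on chapter 3.4.2; the Master's network transport in between and the ownership of the returned buffers are omitted, so those details are assumptions:

// Sketch only: IObjectDetect is a stand-in for the filter interface from chapter 3.4.2.
#include <windows.h>
#include <vector>

struct IObjectDetect : public IUnknown {
    virtual HRESULT STDMETHODCALLTYPE get_huemask_data(BYTE** huedata, long* huelen,
                                                       BYTE** maskdata, long* masklen) = 0;
    virtual HRESULT STDMETHODCALLTYPE set_huemask_data(BYTE* huedata, BYTE* maskdata,
                                                       int picx, int picy,
                                                       int picwidth, int picheight) = 0;
};

// On the Slave that owns the bounding box: serialize the object description.
HRESULT ReadObject(IObjectDetect* src, std::vector<BYTE>& hue, std::vector<BYTE>& mask)
{
    BYTE *h = NULL, *m = NULL;
    long hlen = 0, mlen = 0;
    HRESULT hr = src->get_huemask_data(&h, &hlen, &m, &mlen);
    if (SUCCEEDED(hr)) {
        hue.assign(h, h + hlen);    // copy so the blob can be forwarded over the network
        mask.assign(m, m + mlen);
    }
    return hr;
}

// On every other Slave (after the Master has forwarded the blob): load the object.
HRESULT WriteObject(IObjectDetect* dst, std::vector<BYTE>& hue, std::vector<BYTE>& mask,
                    int picx, int picy, int picwidth, int picheight)
{
    return dst->set_huemask_data(hue.data(), mask.data(), picx, picy, picwidth, picheight);
}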
From then on, each Slave tries to locate and follow the object in its view. If a Slave detects the object, it sends the position to the Master, which then selects this clip position in the same way as for the Motion Zoom. How object detection works in detail is discussed in the chapter "Object Detection Filter".

The data flow for object selection and distribution looks like this:

Picture 31: object detection schematic data flow

Object detection works fine as long as the object does not move too fast or change its shape (caused by rotation) and color (because of shadows) too much. This depends on the contrast between the object and its background or other objects. Detection works best with the Slave which initially has the object in its view. Any other Slave whose view is entered by the object has to "espy" the object first, i.e. it tries to determine where the object enters the border of its view. Only then can the object be detected and followed.

3.6. Test & Conclusion

In the course of evaluating, developing and testing the applications, a lot of experience was gained. Some issues shall be mentioned here to save others from repeating efforts that I have already made. The objective of the tests is to investigate to what extent the expectations and requirements stated at the beginning of this thesis have been met.

3.6.1. Development

Although there seem to be several people ([URL5], [URL43], [URL44]) dealing with DirectShow development, there are still a lot of "secrets" which are difficult to uncover. Especially when using C# in combination with DirectShow interfaces, there were some differences to C++ to figure out. Fortunately there is the Marshal class in .NET, which had to be used heavily when dealing with C++ pointers and arrays.

As mentioned in the evaluation chapter, one of the big difficulties in developing DirectShow filters is debugging them. As they cannot be started stand-alone but only within a filter graph, debugging information can often only be extracted by writing it out to a file. It is possible to attach a debugger to the application which builds the filter graph, e.g. GraphEdit, but there may be several threads running, so setting breakpoints can be complicated. Finally, a failure may show up in one of the self-written filters but actually be caused by a "foreign" filter, e.g. by handling shared pointers in a wrong way. In a nutshell, these are the main reasons why developing filters may take considerably longer than writing a typical Windows application.

Another issue which makes debugging challenging is the fact that a filter graph typically consists of several filters running more or less in parallel which sometimes deal with the same data, as they pass a pointer to the video sample among each other. This may lead to searching for the failure in a filter which is not responsible, only because it throws an exception that actually originated from a faulty operation in the filter hooked in before or after the "suspect" filter. This interconnectivity, which is certainly one of the biggest strengths of the DirectShow concept, is also one of the reasons why filter development is sometimes more complicated. I started developing my first filters over a year ago, with more than 15 years of programming experience, but even today filter development seems to be more than "just a hack".
There are hardly any forums that discuss filter development; most address problems with particular codecs or plug-ins, so this is only my personal opinion.

3.6.2. Capacity and Stress Test

As with every distributed system, there are always some parts which have to be executed on one single machine, such as data distribution or gathering, initialization, input or output. In the case of the Slave and Master applications, the video data has to be gathered at one single point. This means that a bottleneck is to be expected, which limits the number of possible incoming video streams and their frame resolutions and rates. These tests are intended to find out how many input streams at which resolution a given test system is capable of managing in a fluent and reasonable way, meaning that there must be no drop-outs or an increasing number of artefacts caused by dropped P-frames. The output video size is also analyzed in the test sequence, as the transcoding (compression-decompression cycle) for streaming needs processing power.

3.6.2.1. Test system

The test system consists of a Master machine equipped with a Pentium 4 at 3.5 GHz (with Hyperthreading [URL63] enabled), connected to the slaves through a single 100 MBit network. The four Slave machines are differently equipped PCs ranging from 600 MHz to 2.6 GHz. The results were determined using the Windows 2000 performance monitor, taking the average CPU load after 2 minutes of testing while running a single instance of each application. The resulting MPEG1 stream is compressed at 500 Kbit/sec. For lack of cameras, only up to four Slaves could be tested simultaneously, but the test results allow estimating the behavior with a higher number of attached Slaves.

Stress tests, which try to run the CPU at full load, have shown that from a CPU workload of about 70-80% the output video sometimes starts to flicker. This comes from synchronization problems between the individual incoming videos and the fixed rate of 25 frames per second that has to be kept, because MPEG1 by definition requires at least 23.97 frames per second. That this already happens well below 100% results from the CPU's Hyperthreading mode, which appears to the performance monitor as two processors. As long as only one (logical) processor is used, the performance monitor shows only 50% load. As there are parts of the Hyperthreading CPU which cannot be shared between tasks, this can lead to a bottleneck before 100% CPU load is reached.

3.6.2.2. Master Performance

The first test was made using an input and output video resolution of 352x288 pixels at 25 frames per second. The chart shows the CPU workload as a function of the number of attached slaves: about 22% for one Slave, 30% for two, 35% for three and 40% for four Slaves.

Picture 32: CPU load at Master for 352x288/25fps

From these results it can be estimated that the streaming needs about 16% CPU and every Slave decompression task takes about 6%. This is assumed because every decompression filter at the Master machine runs as its own task, each consuming the same share of CPU as long as it is not at full capacity. So the estimated maximum count of Slaves is about nine before the CPU workload reaches 70%.
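Written out under the same linear-load assumption, this estimate is

load(n) ≈ 16% + n · 6%

so the recommended 70% mark is reached at about n = (70 - 16) / 6 ≈ 9 Slaves, which is the figure given above.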
The same test series was run with an input video resolution of 352x288 pixels at 25 frames per second and an output resolution of 640x480 at 25 fps: about 45% CPU load for one Slave, 53% for two, 61% for three and 70% for four Slaves.

Picture 33: CPU load at Master for 352x288/25fps input and 640x480/25fps output

These results show an estimated streaming workload of 37%, with every Slave decompression task taking about 9%. This means that, again assuming that the workload increases linearly, the maximum recommended CPU workload is already reached with four incoming streams.

For the sake of completeness the same tests were carried out with an input and output resolution of 640x480/25fps, a load which quickly saturated the Master machine: about 60% CPU load for one Slave, 76% for two, 95% for three and 100% for four Slaves.

Picture 34: CPU load at Master for 640x480/25fps

While testing with two Slaves the output video stream already started to flicker, so under these conditions the Master machine seems to be too slow to handle the whole task.

3.6.2.3. Network Load

Measuring the network load showed that it will not be the bottleneck. As one incoming stream uses about 60 KB of data per second, the maximum count of Slaves depends more on the CPU than on the network's bandwidth:

Picture 35: incoming network load (KB/sec and packets/sec) for 352x288/25fps

Theoretically, a count of 50 Slaves would result in a network load of 3 MB/sec, but the overhead of 2800 packets/sec might limit it to a smaller number. The outgoing video data stream was also measured to show that it will not be an issue regarding network capacity: about 50 KB/sec and 430 packets/sec for 352x288/25fps, and about 105 KB/sec and 720 packets/sec for 640x480/25fps.

Picture 36: outgoing network load

3.6.2.4. Slave Performance

The Slave application is not as performance-demanding as the Master application as long as only the capturing feature is activated. For this task a Pentium II with 600 MHz is just fast enough to handle a video input resolution of 352x288 at 25 frames per second without drop-outs. For object or motion detection a machine with about 1.5 GHz or faster is needed, depending on resolution and detection frequency (normally set to 5 times per second, but for noisy activity a higher frequency may be necessary). This means that for simple streaming systems older machines can be reactivated to handle the camera and send the compressed input video. As there are no additional hardware requirements, the Slave machines are assumed to be very cheap to obtain.

4. Recent Research

This chapter gives an overview of the scientific research and development going on in the field of video capturing/processing and streaming, the progress that has been made in the last few years and what may come in the near future.

4.1. Video Capturing and Processing

Research in the field of video capturing covers less the techniques of how to get the pictures from the camera's lens into digital data, and more what to do with the digitized video and how to handle and analyze the input. The areas in which digital video input is processed are quite diversified, therefore the following chapters cover some different fields of application.

4.1.1.
Object Tracking Videos are actually sequences of images, each of which called a frame, displayed in fast enough frequency so that human eyes can percept the continuity of its content. It is obvious that all image processing techniques can be applied to individual frames. Besides, the contents of two consecutive frames are usually closely related. Visual content can be modeled as a hierarchy of abstractions. At the first level are the raw pixels with color or brightness information. Further processing yields features such as edges, corners, lines, curves, and color regions. A higher abstraction layer may combine and interpret these features as objects and their attributes. At the highest level are the human level concepts involving one or more objects and relationships among them [ZIV2002]. Object detection in videos involves verifying the presence of an object in image sequences and possibly locating it precisely for recognition. Object tracking is to monitor an object’s spatial and temporal changes during a video sequence, including its presence, position, size, shape, etc. This is done by solving the temporal correspondence problem, the problem of matching the target region in successive frames of a sequence of images taken at closely-spaced time intervals. These two processes are closely related because tracking usually starts with detecting objects, while detecting an object repeatedly in subsequent image sequence is often necessary to help and verify tracking. [SIU2002] In this thesis, object tracking is performed by the Slave application. Because object detection is a complex and costly task, it is done directly at the Slave and only the resulting coordinates are delivered to the Master (see chapter 3.5.2.3). Page 78 Diploma Thesis Erich Semlak 4.1.1.1. Applications Object tracking is quite important because it enables several applications such as: - Security and surveillance: to recognize people, to provide better sense of security using visual information - Medical therapy: to improve the quality of life for physical therapy patients and disabled people - Retail space instrumentation: to analyze shopping behavior of customers, to enhance building and environment design - Video abstraction: to obtain automatic annotation of videos, to generate object-based summaries - Traffic management: to analyze flow, to detect accidents - Video editing: to eliminate cumbersome human-operator interaction, to design futuristic video effects - Interactive games: to provide natural ways of interaction with intelligent systems such as weightless remote control. 4.1.1.2. Challenges A robust, accurate and high performance approach is still a great challenge today. The difficulty level of this problem highly depends on how you define the object to be detected and tracked. If only a few visual features, such as a specific color, are used as representation of an object, it is fairly easy to identify all pixels with same color as the object. On the other extremity, the face of a specific person, which full of perceptual details and interfering information such as different poses and illumination, is very hard to be accurately detected, recognized and tracked. Most challenges arise from the image variability of video because video objects generally are moving objects. As an object moves through the field of view of a camera, the images of the object may change dramatically. 
This variability comes from three principle sources: variation in target pose or target deformations, variation in illumination, and partial or full occlusion of the target [HAG1998] [ELL2001]. Page 79 Diploma Thesis Erich Semlak 4.1.2. Object Detection and Tracking Approaches 4.1.2.1. Feature-based object detection In feature-based object detection, standardization of image features and registration (alignment) of reference points are important. The images may need to be transformed to another space for handling changes in illumination, size and orientation. One or more features are extracted and the objects of interest are modeled in terms of these features. Object detection and recognition then can be transformed into a graph matching problem. [PUA2000] There are two sources of information in video that can be used to detect and track objects: visual features (such as color, texture and shape) and motion information. Combination of statistical analysis of visual features and temporal motion information usually lead to more robust approaches. A typical strategy may segment a frame into regions based on color and texture information first, and then merge regions with similar motion vectors subject to certain constraints such as adjacency. A large number of approaches have been proposed in literature. All these efforts focus on several different research areas each deals with one aspect of the object detection and tracking problems or a specific scenario. Most of them use multiple techniques and there are combinations and intersections among different methods. All these make it very difficult to have a uniform classification of existing approaches. So in the following sections, most of the approaches will be reviewed separately in association with different research highlights. [ZIV2002] Shape-based approaches Shape-based object detection is one of the hardest problems due to the difficulty of segmenting objects of interest in the images. In order to detect and determine the border of an object, an image may need to be preprocessed. The preprocessing algorithm or filter depends on the application. Different object types such as persons, flowers, and airplanes may require different algorithms. For more complex scenes, noise removal and transformations invariant to scale and rotation may be needed. Once the object is detected and located, its boundary can be found by edge detection and boundary-following algorithms. The detection and shape characterization of the objects becomes more difficult for complex scenes where there are many objects with occlusions and shading. [YIL2004] Color-based approaches Unlike many other image features (e.g. shape) color is relatively constant under viewpoint changes and it is easy to be acquired. Although color is not always appropriate as the sole means of detecting and tracking objects, but the low computational cost of the algorithms proposed makes color a desirable feature to exploit when appropriate. Color-based trackers have been proved robust and versatile for a modest computational cost. They are especially appealing for tracking tasks where the spatial structure of the tracked objects exhibits such a dramatic variability that Page 80 Diploma Thesis Erich Semlak trackers based on a space-dependent appearance reference would break down very fast. Trackers rely on the deterministic search of a window whose color content matches a reference histogram color model. 
Relying on the same principle of color histogram distance, but within a probabilistic framework, a new Monte Carlo tracking technique has been introduced. The use of a particle filter allows one to better handle color clutter in the background, as well as complete occlusion of the tracked entities over a few frames. This probabilistic approach is very flexible and can be extended in a number of useful ways. In particular, multi-part color modeling to capture a rough spatial layout ignored by global histograms, incorporation of a background color model when relevant, and extension to multiple objects. [PER2002] 4.1.2.2. Template-based object detection If a template describing a specific object is available, object detection becomes a process of matching features between the template and the image sequence under analysis. Object detection with an exact match is generally computationally expensive and the quality of matching depends on the details and the degree of precision provided by the object template. There are two types of object template matching, fixed and deformable template matching. Fixed template matching Fixed templates are useful when object shapes do not change with respect to the viewing angle of the camera. Two major techniques have been used in fix template matching. Image subtraction In this technique, the template position is determined from minimizing the distance function between the template and various positions in the image. Although image subtraction techniques require less computation time than the following correlation techniques, they perform well in restricted environments where imaging conditions, such as image intensity and viewing angles between the template and images containing this template are the same. Correlation Matching by correlation utilizes the position of the normalized cross-correlation peak between a template and an image to locate the best match. This technique is generally immune to noise and illumination effects in the images, but suffers from high computational complexity caused by summations over the entire template. Point correlation can reduce the computational complexity to a small set of carefully chosen points for the summations. [NGU2004] The Object Detection Filter (chapter 3.4.2) in this thesis uses a fixed template with image subtraction. For performance reasons the image of the selected object is divided into 8x8 pixel tiles of which the color histogram is calculated. The incoming video picture is also divided in 8x8 pixel tiles and the objects histogram is then compared with every possible position on the picture. The effort in searching the Page 81 Diploma Thesis Erich Semlak object depends therefore on the size of the objects picture and the size of the incoming video. Deformable template matching Deformable template matching approaches are more suitable for cases where objects vary due to rigid and non-rigid deformations. These variations can be caused by either the deformation of the object per se or just by different object pose relative to the camera. Because of the deformable nature of objects in most video, deformable models are more appealing in tracking tasks. [ZHO2000] In this approach, a template is represented as a bitmap describing the characteristic contour/edges of an object shape. A probabilistic transformation on the prototype contour is applied to deform the template to fit salient edges in the input image. 
An objective function with transformation parameters which alter the shape of the template is formulated reflecting the cost of such transformations. The objective function is minimized by iteratively updating the transformation parameters to best match the object. [SCL2001] 4.1.3. Object Tracking Performance Effectively evaluating the performance of moving object detection and tracking algorithms is an important step towards attaining robust digital video systems with sufficient accuracy for practical applications. As systems become more complex and achieve greater robustness, the ability to quantitatively assess performance is needed in order to continuously improve performance. [BLA2003] describes a framework for performance evaluation using pseudo-synthetic video, which employs data captured online and stored in a surveillance database. Tracks are automatically selected from a surveillance database and then used to generate ground truthed video sequences with a controlled level of perceptual complexity that can be used to quantitatively characterise the quality of the tracking algorithms. The main strength of this framework is that it can automatically generate a variety of different testing datasets. But there is also an algorithm which allows to evaluate a video tracking system without the need of ground-truth data. The algorithm is based on measuring appearance similarity and tracking uncertainty. Several experimental results on vehicle and human tracking are presented. Effectiveness of the evaluation scheme is assessed by comparisons with ground truth. The proposed self evaluation algorithm has been used in an acoustic/video based moving vehicle detection and tracking system where it helps the video surveillance maintaining a good target track by reinitializing the tracker whenever its performance deteriorates. [WU2004] Especially for pedestrian detection [BER2004] developed tools for evaluating results thereof. The developed tool allows a human operator to annotate on a file all pedestrians in a previously acquired video sequence. A similar file is produced by the algorithm being tested using the same annotation engine. A matching rule has been established to validate the association between items of the two files. For each frame Page 82 Diploma Thesis Erich Semlak a statistical analyzer extracts the number of mis-detections, both positive and negative, and correct detections. Using these data, statistics about the algorithm behavior are computed with the aim of tuning parameters and pointing out recognition weaknesses in particular situations. The presented performance evaluation tool has been proven to be effective though it requires a very expensive annotation process. Traditional motion-based tracking schemes cannot usually distinguish the shadow from the object itself, which results in a falsely captured object shape, posing a severe difficulty for a pattern recognition task. [JIA2004] present a color processing scheme to project the image into an illumination invariant space such that the shadow’s effect is greatly attenuated. The optical flow in this projected image together with the original image is used as a reference for object tracking so that it can extract the real object shape in the tracking process. 4.1.4. Motion Detection Detecting moving objects, or motion detection, obviously has very important significance in video object detection and tracking. 
Compared with object detection without motion, on one hand, motion detection complicates the object detection problem by adding object’s temporal change requirements, on the other hand, it also provides another information source for detection and tracking. A large variety of motion detection algorithms have been proposed. They can be classified into the following groups approximately. 4.1.4.1. Thresholding technique over the interframe difference These approaches rely on the detection of temporal changes either at pixel or block level. The difference map is usually binarized using a predefined threshold value to obtain the motion/no-motion classification. [URL30] The Motion Detection filter used in the Slave application (chapter 3.4.1) uses a similar algorithm. It takes a flowing average of the last 5 pictures (which would be a fifth of a second when capturing with 25 frames per second) to avoid disturbance by flickering or shutter changes of the camera. The incoming camera pictures are downscaled and averaged in tiles of 8x8 pixels grayscale. These tile planes are then compared to the initially stored backplane. Major differences are seen as motion, the average position of the tiles with changed grayscale values are then estimated as center of motion. The backplane is gradually converged to the averaged camera pictures to adapt to the current camera picture. 4.1.4.2. Statistical tests constrained to pixelwise independent decisions These tests assume intrinsically that the detection of temporal changes is equivalent to the motion detection. However, this assumption is valid when either large displacement appear or the object projections are sufficiently textured, but fails in the case of moving objects that preserve uniform regions. To avoid this limitation, Page 83 Diploma Thesis Erich Semlak temporal change detection masks and filters have also been considered. The use of these masks improves the efficiency of the change detection algorithms, especially in the case where some a priori knowledge about the size of the moving objects is available, since it can be used to determine the type and the size of the masks. On the other hand, these masks have limited applicability since they cannot provide an invariant change detection model (with respect to size, illumination) and cannot be used without an a priori context-based knowledge. [TOT2003] 4.1.4.3. Global energy frameworks The motion detection problem is formulated to minimize a global objective function and is usually performed using stochastic (Mean-field, Simulated Annealing) or deterministic relaxation algorithms (Iterated Conditional Modes, Highest Confidence First). In that direction, the spatial Markov Random Fields have been widely used and motion detection has been considered as a statistical estimation problem. Although this estimation is a very powerful, usually it is very time consuming. [JOD1997] 4.1.5. Object Tracking Using Motion Information Motion detection provides useful information for object tracking. Tracking requires extra segmentation of the corresponding motion parameters. There are numerous research efforts dealing with the tracking problem. Existing approaches can be mainly classified into two categories: motion-based and model-based approaches. Motion-based approaches rely on robust methods for grouping visual motion consistencies over time. These methods are relatively fast but have considerable difficulties in dealing with non-rigid movements and objects. 
Model-based approaches also explore the usage of high-level semantics and knowledge of the objects. These methods are more reliable compared to the motion-based ones, but they suffer from high computational costs for complex models due to the need for coping with scaling, translation, rotation, and deformation of the objects. [TAD2003] Tracking is performed through analyzing geometrical or region-based properties of the tracked object. Depending on the information source, existing approaches can be classified into boundary-based and region-based approaches. 4.1.5.1. Boundary-based approaches Also referred to as edge-based, this type of approaches relies on the information provided by the object boundaries. It has been widely adopted in object tracking because the boundary-based features (edges) provide reliable information which does not depend on the motion type, or object shape. Usually, the boundary-based tracking algorithms employ active contour models, like snakes and geodesic active contours. These models are energy-based or geometric-based minimization approaches that evolve an initial curve under the influence of external potentials, while it is being constrained by internal energies. [YIL2004] Page 84 Diploma Thesis Erich Semlak 4.1.5.2. Region-based approaches These approaches rely on information provided by the entire region such as texture and motion-based properties using a motion estimation/segmentation technique. In this case, the estimation of the target's velocity is based on the correspondence between the associated target regions at different time instants. This operation is usually time consuming (a point-to-point correspondence is required within the whole region) and is accelerated by the use of parametric motion models that describe the target motion with a small set of parameters. The use of these models introduces the difficulty of tracking the real object boundaries in cases with non-rigid movements/objects, but increases robustness due to the fact that information provided by the whole region is exploited. [HUA2002] propose a region-based method for model-free object tracking. In their method the object information of temporal motion and spatial luminance are fully utilized. First the dominant motion of the tracked object is computed. Using this result the object template is warped to generate a prediction template. Static segmentation is incorporated to modify this prediction, where the warping error of each watershed segment and its rate of overlapping with warped template are utilized to help classification of some possible watershed segments near the object border: Applications of facial expression tracking and two-handed gesture tracking demonstrate its performance. 4.1.6. Summary Along with the increasing popularity of video on internet and versatility of video applications, availability, efficiency of usage and application automation of videos will heavily rely on object detection and tracking in videos. Although so much work has been done, it still seems impossible so far to have a generalized, robust, accurate and real-time approach that will apply to all scenarios. This may require a combination of multiple complicated methods to cover all of the difficulties, such as noisy background, moving camera or observer, bad shooting conditions, object occlusions, etc. Of course, this will make it even more time consuming. But that does not mean nothing has been achieved. 
It seems that research may go more directions, each targeting on some specific applications. Some reliable assumption can always be made in a specific case, and that will make the object detection and tracking problem much more simplified. More and more specific cases will be conquered, and more and more good application products will appear. As the computing power keeps increasing more complex problems may become solvable. Page 85 Diploma Thesis Erich Semlak 4.1.7. Relation to this work The object detection feature in the current (first) version of this thesis’ system is working with easily detectable objects, as long as color and contrast differs enough from other objects and background. As object detection is not used for recognition of particular objects or patterns, the implemented approach is appropriate for now, even if it is rather weak. As this thesis mainly focuses on the capturing and streaming features, there was not as much effort invested in implementing a stronger algorithm for object detection. Depending on the planned application field, a different algorithm could be implemented in future versions of the object detection filter. As there can be connected different object detection filters for each Slave at runtime, an optimized version algorithm can be applied for different light and contrast conditions and objects. Page 86 Diploma Thesis Erich Semlak 4.2. Video Streaming The emergence of the Internet as a pervasive communication medium, and the widespread availability of digital video technology have led to the rise of several networked streaming media applications such as live video broadcasts, distance education and corporate telecasts. However, packet loss, delay, and varying bandwidth of the Internet have remained the major problems of multimedia streaming applications. The goal is to deliver smooth and low-delay streaming quality by combining new designs in network protocols, network infrastructure, source and channel coding algorithms. Efforts to reach this goal are as diverse as the mentioned problems. This chapter shall give an overview of the currently ongoing research and progress in moderating and solving these problems. 4.2.1. Compression vs. Delivery Streaming video over network, if it is assumed that the material is available in whatever pre-captured format, involves two major tasks, video compression and video delivery. These two steps can’t always be separated and aren’t therefore totally independent. There are some situations where the delivery procedure affects the compression process, e.g. when adapting the compression level to the available bandwith. 4.2.2. Compression Since raw video consumes a lot of bandwidth, compression is usually employed to achieve transmission efficiency. Video compression can be classified into two categories: scalable and nonscalable video coding. A nonscalable video encoder compresses the raw video into a single bit-stream, thereby leaving little scope for adaptation. On the other hand, a scalable video encoder compresses the raw video into multiple bit streams of varying quality. One of the multiple bit streams provided by the scalable video encoder is called the base stream, which, if decoded provides a coarse quality video presentation, whereas the other streams, called enhancement streams, if decoded and used in conjunction with the base stream improve the video quality. The best schemes in this area are Fine Granularity Scalability (FGS, [RAD2001]) which utilizes bitplane coding method to represent enhancement streams. 
A variation of FGS is Progressive FGS (PFGS) which, unlike the two layer approach of the FGS, uses a multiple layered approach towards coding. The advantage in doing this is that errors in motion prediction are reduced due to availability of incremental reference layers. [ZHU2005] Streaming video applications on the Internet generally have very high bandwidth requirements and yet are often unresponsive to network congestion. In order to avoid congestion collapse and improve video quality, these applications need to respond to congestion in the network by deploying mechanisms to reduce their bandwidth requirements under conditions of heavy load. In reducing bandwidth, video frames have low quality, while video with low motion will look better if some frames are dropped but the remaining frames have high quality. Page 87 Diploma Thesis Erich Semlak [TRI2002] presents a content-aware scaling mechanism that reduces the bandwidth occupied by an application by either dropping frames (temporal scaling) or by reducing the quality of the frames transmitted (quality scaling). Therefore video quality could be improved as much as 50%. [LEO2005] evaluated the performance on compressed video of a number of available similarity, blocking and blurring quality metrics. Using a systematic, objective framework based on simple subjective comparisons, he evaluated the ability of each metric to correctly rank order images according to the subjective impact of different spatial content, quantization parameters, amounts of filtering, distances from the most recent I-frame, and long-term frame prediction strategies. This evaluation shows that the recently proposed quality all have some weakness in measuring the quality of still frames from compressed video. As mentioned above, scalable video encoding is advantageous when adapting to varying network bandwiths. [TAU2003] proposed method involves the construction of quality layers for the coded wavelet sample data and a separate set of quality layers for scalably coded motion parameters. When the motion layers are truncated, the decoder receives a quantized version of the motion parameters used to generate the wavelet sample data. A linear model is used to infer the impact of motion quantization on reconstructed video distortion. An optimal tradeoff between the motion and subband bit-rates may then be found. Experimental results indicate that the cost of scalability is small. At low bit-rates, significant improvements are observed relative to lossless coding of the motion information. 4.2.3. Streaming Internet video delivery has been motivating research in multicast routing, quality of service (QoS, [URL54]), and the service model of the Internet itself for the last 15 years. Multicast delivery has the potential to deliver a large amount of content that currently cannot be delivered through broadcast. IP and overlay multicast are two architectures proposed to provide multicast support. A large body of research has been done with IP multicast and QoS mechanisms for IP multicast since the late 1980s. In the past five years, overlay multicast research has gained momentum with a vision to accomplish ubiquitous multicast delivery that is efficient and scales in the dimensions of the number of groups, number of receivers, and number of senders. 4.2.3.1. Internet One of the challenging aspects of video streaming over the Internet is the fact that the Internet's transmission resources exhibit multiple time-scale variability. 
There are two approaches to deal with this variability: small time-scale variability can be accommodated by a receiver buffer, and large time-scale variability can be accommodated by scalable video encoding. Difficulties still exist, since an encoded video generally exhibits significant rate variability in order to provide consistent video quality. [KIM2005] show that real-time adaptation as well as the optimal adaptation algorithm provide consistent video quality when used over both TCP-friendly rate control (TFRC) and the transmission control protocol (TCP).

4.2.3.2. Wireless

As wireless networks are getting more and more common [URL55], video streaming techniques have to be adapted to the conditions in which wireless environments differ from cable-bound ones. [ARG2003] propose a protocol for media streaming in wireless IP-capable mobile multi-homed hosts. Next-generation wireless networks, like 3G, 802.11a WLAN and Bluetooth, are the target underlying technologies for the proposed protocol. Operation at the transport layer guarantees TCP-friendliness, error resilience and independence from the inner workings of the access technology. Furthermore, compatibility with UMTS and IMT-2000 wireless streaming service models is one of the additional benefits of their approach.

Wireless streaming media communications are fragile to delay jitter because the conditions and requirements vary frequently with the users' mobility [STA2003]. Buffering is a typical way to reduce the delay jitter of media packets before playback, but it incurs a longer end-to-end delay. [WAN2004] propose a novel adaptive playback buffer (APB) based on a probing scheme. By means of the probing scheme, instantaneous network conditions are collected and, together with the delay margin and the delay jitter margin, the playback buffer is adaptively adjusted to present continuous and real-time streaming media at the receiver. Unlike previous studies, the novelty and contributions of their work are: a) accuracy: by employing instantaneous network information, the adjustment of the playback buffer correctly reflects the current network conditions, which makes the adjustment effective; b) efficiency: by utilizing the simple probing scheme, APB obtains the current network conditions without complex mathematical prediction, making the adjustment of the playback buffer efficient. Performance data obtained through extensive simulations show that the APB is effective at reducing the delay jitter and decreasing the buffer delay.

4.2.3.3. Summary

In spite of double-layer DVDs and high-bandwidth DSL connections, developers at MainConcept, Sorenson and Real are still trying to find even more efficient algorithms. These are supposed to shrink video for DVDs as well as down to the smallest bitrates for watching videos on a cell phone display. H.264 is being touted as the successor for Blu-ray and HD-DVD. Tests have shown that most currently common codecs (DivX, XviD, MainConcept H.264, Nero Digital AVC, Sorenson AVC Pro and RealVideo 10) achieve comparable quality and compression levels. The open source community demonstrates that teamwork can be advantageous [CT2005_10].

4.2.4. Relation to this work

The current version of this thesis' system uses a rather simple and straight-through compression and streaming approach. As mentioned in chapter 2.4.3, only a single output stream is supported.
To stream different bitrates and/or video resolutions at the same time, multiple encoders and streaming filters would have to be connected to the filter graph. To overcome performance limitations, as encoding consumes quite some CPU time, this task could be distributed to multiple streaming servers. This approach for future versions is highly scalable, and thanks to the modular design of DirectShow it is also possible to determine the number of required server machines at runtime. To adapt to network congestion, an adaptive streaming filter can be chosen. [LIN1998] addresses the problem of adding network and host adaptive capabilities to DirectShow RTP.

Another approach for serving different bitrates and adaptive streaming would be to use a dedicated streaming server (Helix Universal Server from Real, Streaming Media Server from Microsoft) which takes the current RTP stream as input and then generates different bitrates and/or formats from it. For this purpose the compression ratio of the RTP stream should be rather low to avoid losing too much information. The highest bitrate later served by the dedicated streaming server acts as the reference level; it makes sense for it not to be higher than the source stream's bitrate.

Page 90 Diploma Thesis Erich Semlak 5. Final Statement
In the course of this work a system has been developed that is capable of processing and streaming captured video from multiple sources. This system uses multiple machines for capturing and analyzing multiple video inputs. The video streams and the resulting motion and object detection data are delivered over a common Ethernet network to a single Master machine. This machine merges and broadcasts the resulting video stream. The system is therefore scalable, within limits, regarding the number of inputs and the video resolution.

This system gets by on today's common hardware. It is based on Microsoft's DirectShow technology and can fall back on numerous readily available components. According to this thesis' evaluation, DirectShow shows significant advantages over Windows Media and Video for Windows. The system needs no extra compression hardware and is entirely software-based. As Slave machines, even older hardware with about 600 MHz can still be used.

As a by-product of development and familiarization, a tutorial emerged which explains usage of and development under DirectShow with special emphasis on C#. This tutorial compensates for the lack of books and information to be found on the Internet on this topic.

Tests of the developed applications show that the performance of common PCs with 3.0 GHz is sufficient to handle multiple input streams. As the workload for capturing and analyzing is distributed to the Slave machines, up to 9 inputs can be processed simultaneously. These tests also reveal the bottleneck at the Master machine. The system is capable of handling video resolutions of up to CIF (352x288), which are usual in web streaming, without dropouts. Higher video resolutions, like SVHS (720x576) or HDTV (starting at 1280x720), which are common in the field of professional video processing, exceed the system's capabilities by far and remain reserved for special hardware.

Page 91 Diploma Thesis Erich Semlak 6. Sources 6.1. Books & Papers [ARG2003] A. Argyriou and V. Madisetti; A media streaming protocol for heterogeneous wireless networks; CCW 2003 Proceedings; Pages 30 – 33 [APO2002] John G. Apostolopoulos, Wai-tian Tan and Susie J.
Wee; Video Streaming: Concepts, Algorithms and Systems; HP Technical Report; 2002 [BER2004] M. Bertozzi, A.Broggi, P.Grisieri and A. Tibaldi; A tool for vision based pedestrian detection performance evaluation; IEEE Intelligent Vehicles Symposium 2004; Pages 784 - 789 [BLA2003] James Black, Tim Ellis and Paul Rosin; A Novel Method for Video Tracking Performance Evaluation; Joint IEEE Int Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, London 2003, Pages 125-132 [BRU2003] Kai Bruns and Benjamin Neidhold; Audio-, Video- und Grafikprogrammierung, Fachbuchverlag Leipzig; 2003 [CT1996_5] Ulrich Hilgefort, Neue digitale Schnittsysteme, C’t 5/1996, p. 84 [CT1996_11] Robert Seetzen, Bilderflut, C’t 11/1996, p. 30 [CT2005_10] Dr. Volker Zota; Kompressionist; Ct 10/2005; p. 146 [DEM2003] Gregory C. Demetriades; Streaming Media; Wiley; 2003 [EID2005] Horst Eidenberger; Proceedings of the 11th International Multimedia Modelling Conference, 2005, Pages 358-363 [ELL2001] Tim Ellis and Ming Xu; Object Detection and Tracking in an Open Dynamic World; Second IEEE International Workshop on Performance Evaluation of Tracking and Surveillance; 2001; Pages 211-219 [HAG1998] Gregory D. Hager and Peter N. Belhumeur; Efficient Region Tracking With Parametric Models of Geometry and Illumination; IEEE Transactions on Volume 20, Issue 10, Oct. 1998 Page(s):1025 1039 Page 92 Diploma Thesis Erich Semlak [HEL1994] Tobias Helbig; Development and Control of Distributed Multimedia Applications; Proceedings of the 4th Open Workshop on High-Speed Networks; 1994; Pages 208-213 [HUA2002] Yu Huang, Thomas S. Huang and Heinrich Niemann, A Region Based Method for Model-Free Object Tracking; 16th International Conference on Pattern Recognition, 2002; Proceedings. Volume 1; Page 592-595 [HP2002] John. G. Apostolopoulos; Video Streaming: Concepts, Algorithms and Systems; HP Laboratories Palo Alto; 2002 [ISO1999] MPEG-7 Context and Objectives; ISO/IETC JTC1/SC29/WG11 [ISO2004_1] Coding of moving pictures and associated audio for digital storage media at up to about 1.5 MBit/s; ISO Standard CD 11172-1 [ISO2004_2] Coding of moving pictures and associated audio for digital storage media at up to about 1.5 MBit/s; ISO Standard CD 11172-2 [JIA2003] Hao Jiang and Mark S. Drew; Shadow-Resistant Tracking in Video; International Conference on Multimedia and Expo; Proceedings Vol. 3, Pages 77-80 [JOD1997] P-M. Jodoin and M. Mignotte; Unsupervised motion detection using a Markovian temporal model with global spatial constraints; International Conference on Image Processing 2004; Volume 4 Pages 2591-2594 [KIM2005] Kim Taehyun and M.H. Ammar, Optimal quality adaptation for scalable encoded video; Journal on Selected Areas in Communications, IEEE Volume 23, Issue 2, Feb 2005, Pages:344 356 [LEO2005] Athanasios Leontaris; Comparison of blocking and blurring metrics for video compression; AT&T Labs; 2005 [LIN1998] Linda S. Cline, John Du, Bernie Keany, K. Lakshman, Christian Maciocco and David M. Putzolu; DirectShow(tm) RTP Support for Adaptivity in Networked Multimedia Applications; IEEE International Conference on Multimedia Computing and Systems, 1998; Proceedings, Pages:13 – 22 [LV2002] T. LV, B. Ozer and W. Wolf; Parallel architecture for video processing in a smart camera system; IEEE Workshop on Signal Processing Systems, 2002; Pages: 9 – 14 [MAY1999] Ketan Desharath Mayer-Patel; A Parallel Software-Only Video Effects Processing System; Ph.D. 
Thesis; 1999 [MEN2003] Eyal Menin; The Streaming Media Handbook; Prentice Hall; 2003 Page 93 Diploma Thesis Erich Semlak [MSDN2003] Microsoft; MSDN Library for Visual Studio .NET 2003; 2003 [MUL1993] Armin Müller; Multimedia PC; Vieweg; 2003 [NGU2004] Hieu T. Nguyen and Arnold W.M. Smeulders; Fast Occluded Object Tracking; Transactions on Pattern Analysis and Machine Intelligence, IEEE Volume 26, Issue 8, Aug. 2004; Pages: 1099 - 1104 [ORT1993] Michael Ortlepp and Michael Horsch; Video für Windows; Sybex; 1993 [PER2003] P. Perez et al.; Color-Based Probabilistic Tracking; Microsoft Research Paper; 2002 [PES2003] Mark D. Pesce; Programming Microsoft DirectShow for Digital Video and Television, Microsoft Press; 2003 [PUA2000] Kok Meng Pua; Feature-Based Video Sequence Identification; Ph.D. Thesis; 2000 [RAD2001] H.M. Radha, M. van der Schaar, Yingwei Chen; The MPEG-4 finegrained scalable video coding method for multimedia streaming over IP; IEEE Transactions on Multimedia, Volume 3, Issue 1, March 2001 Pages: 53 - 68 [ROS1997] Paul L.Rosin and Tim Ellis; Image difference threshold strategies and shadow detection; British Machine Vision Conf., Pages 347-356 [SCL2001] Stan Scarloff and Lifeng Liu; Deformable Shape Detection and Description, Transactions on Pattern Analysis and Machine Intelligence, IEEE Volume 23, Issue 5, May 2001 Pages: 475 - 489 [SHA2004] John Shaw; Introduction to Digital Media and Windows Media Series 9; Microsoft Corporation; 2004 [STA2003] Vladimir Stankovic and Raouf Hamazoui; Live Video Streaming over Packet Networks and Wireless Channels; Paper; Proceedings for the 13th Packet Video Workshop; Nantes 2003; [STO1995] [Dieter Stotz, Computergestützte Audio- und Videotechnik, Springer; 1995 [TAD2003] Hadj Hamma Tadjine and Gerhard Joubert; Colour Object Tracking without Shadow; The 23rd Picture Coding Symposium (PCS 2003); Pages 391-394 [TAU2003] David Taubman and Andrew Secker; Highly Scalable Video Compression with Scalable Motion Coding; International Conference on Image Processing, 2003; Proceedings, Pages: 273-276 Page 94 Diploma Thesis Erich Semlak [TOP2004] Michael Topic; Streaming Demystified; McGraw-Hill; 2002 [TOT2003] Daniel Toth and Til Aach; Detection and recognition of moving objects using statistical motion detection and Fourier descriptors; 12th International Conference on Image Analysis and Processing, 2003; Proceedings, Pages: 430 - 435 [TRI2002] Avanish Tripathi and Mark Claypool; Improving Multimedia Streaming with Content-Aware Video Scaling; Proceedings of the Second International Workshop on Intelligent Multimedia Computing and Networking; 2002; Pages 1021-1024 [TRV2002] Mohan M. Trivedi, Andrea Prati and Greg Kogut; Distributed Interactive Video Arrays For Event Based Analysis of Incidents; The IEEE 5th International Conference on Intelligent Transportation Systems; 2002; Proceedings, Pages: 950 - 956 [WAN2004] Tu Wanging and Jia Weija; Adaptive playback buffer for wireless streaming media; IEEE International Conference on Networks; 2004; Proceedings. 12th Volume 1, Pages: 191 - 195 vol.1 [WU2004] Hao Wu and Qinfen Zheng; Self Evaluation for Video Tracking Systems; 24th Army Science Conference Proceedings; 2004 [YIL2004] Alper Yilmaz and Mubarak Shah; Contour-Based Object Tracking with Occlusion Handling; IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 26, Issue 11, Nov. 2004 Pages:1531 – 1536 [ZHO2000] Yu Zhong, Anil K. Jain and M.-P. 
Dubuisson-Jolly, Object Tracking Using Deformable Templates; IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 22, Issue 5, May 2000; Pages: 544 - 549 [ZHU2005] Bin B. Zhu, Chun Yuan, Yidong Wang and Shipeng Li, Scalable Protection for MPEG-4 Fine Granularity Scalability; Transactions on Multimedia, IEEE Volume 7, Issue 2, Apr 2005; Pages: 222 – 233 [ZIV2003] Zoran Živković; Motion Detection and Object Tracking in Image Sequences; Ph.D. Thesis; 2002 Page 95 Diploma Thesis Erich Semlak 6.2. URLs [URL1] Microsoft; DirectShow Reference; http://msdn.microsoft.com/library/default.asp?url=/library/enus/directshow/htm/directshowreference.asp;2004 [URL2] Wikipedia; Charge-coupled Device; http://de.wikipedia.org/wiki/Charge-coupled_Device;2004 [URL3] Microsoft, MSDN Library “Audio and Video”; http://msdn.microsoft.com/library/default.asp?url=/library/enus/dnanchor/html/audiovideo.asp;2004 [URL4] Microsoft, MSDN Supported Formats; http://msdn.microsoft.com/library/default.asp?url=/library/enus/directshow/htm/supportedformatsindirectshow.asp;2004 [URL5] LeadTools, DirectShow, http://www.leadtools.com/SDK/Multimedia/MultimediaDirectShow.htm;2004 [URL6] Microsoft; DirectX Overview, http://www.microsoft.com/windows/directx/default.aspx?url=/windows/direc tx/productinfo/overview/default.htm;2004 [URL7] Jim Taylor; DVD Demystified, http://www.dvddemystified.com; 2004 [URL8] Microsoft; Windows Media; http://www.microsoft.com/windows/windowsmedia/default.aspx ;2004 [URL9] Idael Cardoso, CSharp Windows Media Format SDK Translation; http://www.codeproject.com/cs/media/ManWMF.asp ; 2004 [URL10] NetMaster; DirectShow .NET; http://www.codeproject.com/cs/media/directshownet.asp ; 2004 [URL11] Robert Laganiere, Programming computer vision applications; http://www.site.uottawa.ca/~laganier/tutorial/opencv+directshow/ ;2003 [URL12] Intel, Open Source Computer Vision Library; http://www.intel.com/research/mrl/research/opencv/; 2004 [URL13] Yunqiang Chen; DirectShow Transform Filter AppWizard; http://www.ifp.uiuc.edu/~chenyq/research/Utils/DShowFilterWiz/DShowFilt erWiz.html; 2001 [URL14] John McAleely; DXMFilter; http://www.cs.technion.ac.il/Labs/Isl/DirectShow/dxmfilter.htm; 1998 Page 96 Diploma Thesis Erich Semlak [URL15] Dennis P. 
Curtin; A Short Course in Digital Video; http://www.shortcourses.com/video/introduction.htm; 2004 [URL16] Wavelength Media; Creating Streaming Video; http://www.mediacollege.com/video/streaming/overview.html; 2004 [URL17] Alken M.R.S; Video Standards; http://www.alkenmrs.com/video/standards.html;2004 [URL18] Britannica Online; http://www.britannica.com; 2004 [URL19] MPEG.ORG; http://www.mpeg.org; 2004 [URL20] Björn Eisert; Was ist MPEG ?; http://www.cybersite.de/german/service/Tutorial/mpeg/; 1995 [URL21] Berkeley Multimedia Research Center; MPEG Background; http://bmrc.berkeley.edu/frame/research/mpeg/mpeg_overview.html; 2004 [URL22] CyberCollege; Linear and Nonlinear Editing; http://www.cybercollege.com/tvp056.htm; 2005 [URL23] Siemens; Online Lexikon; http://www.networks.siemens.de/solutionprovider/_online_lexikon; 2005 [URL24] Microsoft; Video For Windows; http://msdn.microsoft.com/library/default.asp?url=/library/enus/multimed/htm/_win32_video_for_windows.asp;2005 [URL25] Michael Blome; Core Media Technology in Windows XP Empowers You to Create Custom Audio/Video Processing Components; http://msdn.microsoft.com/msdnmag/issues/02/07/DirectShow/default.asp x; 2002 [URL26] Chris Thompson; DirectShow For Media Playback in Windows; http://www.flipcode.com/articles/article_directshow01.shtml;2000 [URL27] DFNO-Expo; Streaming Overview; http://www.dfn-expo.de/Technologie/Streaming/Streaming_tab.html; 1999 [URL28] Jane Hunter; A Review of Video Streaming over the Internet; http://archive.dstc.edu.au/RDU/staff/jane-hunter/video-streaming.html; 1997 [URL29] Ross Finlayson; LIVE.COM; Internet Streaming Media, Wireless and Multicast Technology, Services & Standards;http://www.live.com; 2005 [URL30] Robert B. Fisher; CVonline: The Evolving, Distributed, Non-Proprietary, On-Line Compendium of Computer Vision; http://homepages.inf.ed.ac.uk/rbf/CVonline/; 2005 Page 97 Diploma Thesis Erich Semlak [URL30] Paul L. 
Rosin; Thresholding for Change Detection; http://users.cs.cf.ac.uk/Paul.Rosin/thresh/thresh.html; 2004 [URL31] Telefunken Germany; http://www.telefunken.de/de/his/2336/his.htm; 2005 [URL32] WikiPedia; Moore's Law, http://en.wikipedia.org/wiki/Moore's_law; 1965 [URL33] Encyclopedia Britannica, Distributed Computing;http://www.britannica.com/eb/article?tocId=168849; 2005 [URL34] Encyclopedia Britannica, Human Eye Color Vision; http://www.britannica.com/eb/article?tocId=64933; 2005 [URL35] Webopedia; Motion JPEG; http://www.webopedia.com/TERM/m/motion_JPEG.html; 2005 [URL36] RealMedia; www.real.com; 2005 [URL37] Apple; Quicktime; http://www.apple.com/quicktime/; 2005 [URL38] WikiPedia; Webcast; http://en.wikipedia.org/wiki/Webcast; 2005 [URL39] Sun Developer Network; How to Create a Multimedia Player; http://java.sun.com/developer/technicalArticles/Media/Mediaplayer/; 2005 [URL40] StreamWorks; http://www.streamworks.dk/uk/default.asp; 2005 [URL41] Siggraph; VDO Live; http://www.siggraph.org/education/materials/HyperGraph/video/architectur es/VDO.html; 2005 [URL42] Microsoft; DirectShow Documentation; http://msdn.microsoft.com/library/default.asp?url=/library/enus/directshow/htm/directshow.asp; 2005 [URL43] MontiVision, DirectShow Filters; http://www.montivision.com/;2005 [URL44] LEAD; DirectShow Filters; http://www.leadtools.com/SDK/MULTIMEDIA/Multimedia-DirectShowFilters.htm; 2005 [URL45] Microsoft; Windows Media SDK Documentation; http://msdn.microsoft.com/library/default.asp?url=/library/enus/dnanchor/html/anch_winmedsdk.asp; 2005 [URL46] Inscriber; Windows Media Plugins; http://www.inscriber.com/; 2005 [URL47] Consolidated Video; LiveAlpha; http://convid.com/; 2005 Page 98 Diploma Thesis Erich Semlak [URL48] Microsoft; Windows Media SDKs FAQ; http://msdn.microsoft.com/library/enus/dnwmt/html/ds_wm_faq.asp?frame=true#ds_wm_faq_topic07; 2005 [URL49] Avid Technology; http://www.avid.com/ [URL50] Betacam Format; http://betacam.palsite.com/format.html [URL51] Pinnacle Systems; Dazzle Video Creator; http://www.pinnaclesys.com/VideoEditing.asp?Category_ID=1&Langue_ID =7&Family=24 [URL52] Microsoft; Windows Media SDK Components; http://www.microsoft.com/windows/windowsmedia/mp10/sdk.aspx [URL54] Cisco; Quality of Service; http://www.cisco.com/univercd/cc/td/doc/cisintwk/ito_doc/qos.htm [URL55] Carmen Nobel; Video Streaming goes Wireless; EWeek; http://www.eweek.com/article2/0,1759,1436939,00.asp; 2004 [URL56] The Internet Engineering Task Force; http://www.ietf.org [URL57] Michael B. Jones; The Microsoft Interactive TV System: An Experience Report; http://research.microsoft.com/~mbj/papers/mitv/tr-97-18.html; 1997 [URL58] Elecard Ltd; http://www.elecard.com [URL59] Alparysoft; Lossless Video Codec; http://www.alparysoft.com/products.php?id=8&item=35; 2005 [URL60] Real Time Streaming Protocol Information and Updates; http://www.rtsp.org/; 2005 [URL61] Wikipedia; Multicast technology; http://en.wikipedia.org/wiki/Multicast; 2005 [URL62] Agava’s DirectShow Development Site; Forum; http://dsforums.agava.com/cgi/yabb/YaBB.cgi?board=directshow; 2005 [URL63] Intel Corp.; Hyper-threading Technology; http://www.intel.com/technology/hyperthread/; 2005 [URL64] Cinelerra; Movie Studio in a Linux Box; http://heroinewarrior.com/cinelerra.php3; 2005 Page 99 Diploma Thesis Erich Semlak 7. Pictures Picture 1: miro Video PCTV card.............................................................................. 
13 Picture 2: DCT compression cycle............................................................................ 15 Picture 3: TCP protocol handshaking and dataflow .................................................. 28 Picture 4: UDP protocol dataflow.............................................................................. 29 Picture 5: Windows Movie Maker GUI ...................................................................... 35 Picture 6: black box model........................................................................................ 40 Picture 7: video process phases............................................................................... 40 Picture 8: distributed system model.......................................................................... 41 Picture 9: real world example ................................................................................... 42 Picture 10: slave model ............................................................................................ 43 Picture 11: master model.......................................................................................... 44 Picture 12: master video processing......................................................................... 44 Picture 13: Slave application model.......................................................................... 56 Picture 14: Master application model........................................................................ 56 Picture 15: motion detection filter properties page.................................................... 58 Picture 16: HNS MPEG-2 sender properties page ................................................... 60 Picture 17: filter graph using infinite pin tee .............................................................. 61 Picture 18: infinite pin tee filter properties page........................................................ 62 Picture 19: HNS MPEG streamer filter properties page............................................ 63 Picture 20: Moonlight MPEG2 Encoder settings....................................................... 64 Picture 21: Slave application camera selection dialog.............................................. 65 Picture 22: Slave application graphical user interface .............................................. 66 Picture 23: Slave's filter graph .................................................................................. 66 Picture 24: Master application schematic workflow .................................................. 67 Picture 25: Master’s Filter graph............................................................................... 67 Picture 26: Master application Slave selection ......................................................... 68 Picture 27: Master application Slave resolution selection ......................................... 68 Picture 28: Master application filter graph setup....................................................... 68 Picture 29: Master application graphical user interface ............................................ 69 Picture 30: motion detection schematic data flow..................................................... 70 Picture 31: object detection schematic data flow ...................................................... 71 Picture 32: CPU load at Master for 352x288/25fps................................................... 74 Picture 33: CPU load at Master for 352x288/25fps-640x480/25fps .......................... 
74 Picture 34: CPU load at Master for 640x480/25fps................................................... 75 Picture 35: incoming network load for 352x288/25fps .............................................. 76 Picture 36: outgoing network load ............................................................................ 76 Picture 37: Windows Media Player 9 ...................................................................... 103 Picture 38: Simple filter graph................................................................................. 105 Picture 39: connecting pins..................................................................................... 109 Picture 40: connection pins "intelligently" ............................................................... 110 Picture 41: sample life cycle ................................................................................... 111 Picture 42: Graphical User Interface of GraphEdit.................................................. 112 Picture 43: DirectShow application tasks................................................................ 113 Picture 44: Sample filter graph ............................................................................... 116 Picture 45: Filter selection dialog in GraphEdit ....................................................... 119 Page 100 Diploma Thesis Erich Semlak 8. Code samples Code sample 1: IFilter graph interface in C++ ........................................................ 115 Code sample 2: IFilter graph interface in C# .......................................................... 115 Code sample 3: initiating a filter graph in C# .......................................................... 117 Code sample 4: helper function for searching a filter by GUID ............................... 118 Code sample 5: finding a filter by GUID ................................................................. 118 Code sample 6: helper function for searching a filter by its name .......................... 120 Code sample 7: searching a filter by its name ........................................................ 120 Code sample 8: adding filters to the filter graph ..................................................... 120 Code sample 9: finding a filter’s pins helper function.............................................. 121 Code sample 10: get unconnected pins ................................................................. 121 Code sample 11: finding pins by name................................................................... 121 Code sample 12: connecting pins........................................................................... 122 Code sample 13: running the filter graph................................................................ 122 Code sample 14: stopping and clearing the filter graph.......................................... 123 9. Tables Table 1: Overview of TV formats [URL17] ................................................................ 10 Table 2: Video compression formats overview ......................................................... 18 Table 3: Overview SDKs........................................................................................... 55 Page 101 Diploma Thesis Erich Semlak A. A Short History Of DirectShow A.1. DirectShow Capabilities DirectShow capabilities can be separated into three broad areas, which reflect the three basic types of DirectShow filters. First are the capture capabilities. 
DirectShow can handle capture of audio from the microphone or from a line input, can control a digital camcorder or D-VHS VCR (a new kind of VCR that stores video digitally in high resolution), or can capture both audio and video from a live camera, such as a webcam. DirectShow can also open a file and treat it as if it were a "live" source. This way, one can work on video or audio that has been previously captured. Once the media stream has been captured, DirectShow filters can be used to transform it. Transform filters have been written to convert color video to black-andwhite, resize video images, add an echo effect to an audio stream, and so on. These transform filters can be connected, one after another, like so many building blocks, until the desired effect is achieved. Streams of audio and video data can be "split" and sent to multiple filters simultaneously, as if you added a splitter to the coaxial cable that carries a cable TV or satellite signal. Media streams can also be multiplexed, or “muxed”, together, taking two or more streams and making them one. Using a multiplexer, you can add a soundtrack to a video sequence, putting both streams together synchronously. After all the heavy lifting of the transform filters has been accomplished, there's one task left: rendering the media stream to the display, speakers, or a device. DirectShow has a number of built-in render filters, including simple ones that provide a window on the display for video playback. You can also take a stream and write it to disk or to a device such as a digital camcorder. Most DirectShow applications do not need the full range of DirectShow's capabilities; in fact, very few do. For example, Windows Media Player doesn't need much in the way of capture capabilities, but it needs to be able to play (or render) a very wide range of media types-MP3s, MPEG movies, AVI movies, WAV sounds, Windows Media, and so on. You can throw almost any media file at Windows Media Player (with the notable exception of Apple QuickTime and the RealNetworks media formats), and it'll play the file without asking for help. That's because Windows Media Player, built with DirectShow, includes all of DirectShow's capabilities to play a broad range of media. Page 102 Diploma Thesis Erich Semlak Picture 37: Windows Media Player 9 On the other hand, Windows Movie Maker is a great example of an application that uses nearly the full suite of DirectShow capabilities. It is fully capable of communicating with and capturing video from a digital camcorder (or a webcam). Once video clips have been captured, they can be edited, prettied up, placed onto a timeline, mixed with a soundtrack, and then written to disk (or a digital camcorder) as a new, professional-looking movie. You can even take a high-resolution, highbandwidth movie and write it as a low-resolution, low-bandwidth Windows Media file, suitable for dropping into an e-mail message or posting on a Web site. All of these capabilities come from Windows Movie Maker's extensive use of DirectShow because they're all DirectShow capabilities. The flexibility of DirectShow means that it can be used to rapidly prototype applications. DirectShow filters can be written quite quickly to provide solutions to a particular problem. It is widely used at universities and in research centers-including Microsoft's own-to solve problems in machine vision (using the computer to recognize portions of a video image) or for other kinds of audio or video processing, including the real-time processing of signals. 
There are some tasks that DirectShow cannot handle well, a few cases in which "rolling your own" is better than using DirectShow. These kinds of applications generally lie on the high end of video processing, with high-definition video pouring in at tens of megabytes per second or multiple cameras being choreographed and mixed in real time. Right now these kinds of applications push even the fastest computers to the limits of processor speed, memory, and network bandwidth. That's not to say that you'll never be able to handle high-definition capture in DirectShow or real-time multi-camera editing. You can write DirectShow applications that edit high-definition images and handle real-time multi-camera inputs, as this thesis proves. In any case, working with video is both processor-intensive and memory-intensive, and many DirectShow applications will use every computing resource available, up to 100 percent of the CPU. So when the decision is made to use DirectShow for a project, expectations have to be set appropriately. DirectShow is an excellent architecture for media processing, but it has its limits. [PES2003] [URL25]

Page 103 Diploma Thesis Erich Semlak A.2. Supported Formats in DirectShow
DirectShow is an open architecture, which means that it can support any format as long as there are filters to parse and decode it. The filters provided by Microsoft themselves, either as redistributables through DirectShow or as Windows operating system components, provide default support for the following file and compression formats. [URL4]

File types: Windows Media Audio (WMA)*, Windows Media Video (WMV)*, Advanced Systems Format (ASF)*, Motion Picture Experts Group (MPEG), Audio-Video Interleaved (AVI), QuickTime (version 2 and lower), WAV, AIFF, AU, SND, MIDI.

Compression formats: Windows Media Video*, ISO MPEG-4 video version 1.0*, Microsoft MPEG-4 version 3*, Sipro Labs ACELP*, Windows Media Audio*, MPEG Audio Layer-3 (MP3) (decompression only), Digital Video (DV), MPEG-1 (decompression only), MJPEG, Cinepak.

An asterisk (*) indicates that DirectShow applications must use the Windows Media Format SDK to support this format. For more information, see the Audio and Video section of the Microsoft MSDN Library. [URL3]

Microsoft does not provide an MPEG-2 decoder. Several DirectShow-compatible hardware and software MPEG-2 decoders are available from third parties, e.g. Elecard [URL58] and MainConcept [URL43]. As you can see, the MPEG-1 format is supported for decompression only, as MPEG is subject to license fees and an encoder is therefore not provided for free. There are several manufacturers who offer MPEG compression filters, such as Moonlight, Adobe, Panasonic, Honestech and MainConcept.

Page 104 Diploma Thesis Erich Semlak A.3. Concepts of DirectShow
DirectShow is composed of two classes of objects: filters, the atomic entities of DirectShow, and filter graphs, collections of filters connected together. Filters themselves consist mainly of pins and possibly some other properties, which depend on the type of filter and what it is intended to perform. Those pins may be inbound or outbound, which means that data flows into or out of the filter. Some filters only possess input or output pins; e.g. capturing devices need only output pins, because their input comes from an external device, while a video rendering device has only input pins and its output is shown on the screen. The filter graph dictates in which sequence the filters are connected together and takes care of data delivery and synchronization.
This data flow is commonly known as stream. Conceptually, a filter graph might be thought of as consecutive function calls, while the filters provide the functions. One difference between a common sequential program and a filter graph is that the filter graph executes continuously. Another important point distinguishes a DirectShow filter graph from an ordinary computer program: a filter graph can have multiple streams flowing across it and multiple paths through the filter graph. For example, a DirectShow application can simultaneously capture video frames from a webcam and audio from a microphone. Picture 38: Simple filter graph This data enters the filter graph through two independent source filters, which would likely later be multiplexed together into a single audio/ video stream. In another case, you might want to split a stream into two identical streams. One stream could be sent to a video renderer filter, which would draw it upon the display, while the other stream could be written to disk. Both streams execute simultaneously; DirectShow sends the same bits to the display and to the disk. DirectShow filters make computations and decisions internally-for instance, they can change the values of bits in a stream-but they cannot make decisions that affect the structure of the filter graph. A filter simply passes its data along to the next filter in the filter graph. It can't decide to pass its data to filter A if some condition is true or filter B if the condition is false. This means that the behavior of a filter graph is completely predictable; the way it behaves when it first begins to operate on a data stream is the way it will always operate. Page 105 Diploma Thesis Erich Semlak Although filter graphs are entirely deterministic, it is possible to modify the elements within a filter graph programmatically. A C++ program could create filter graph A if some condition is true and filter graph B if it is false. Or both could be created during program initialization (so that the program could swap between filter graph A and filter graph B on the fly) as the requirements of the application change. Program code can also be used to modify the individual filters within a filter graph, an operation that can change the behavior of a filter graph either substantially or subtly. So, although filter graphs can't make decisions on how to process their data, program code can be used to simulate that capability. For example, consider a DirectShow application that can capture video data from one of several different sources, say from a digital camcorder and a webcam. Once the video has been captured, it gets encoded into a compact Windows Media file, which could then be dropped into an e-mail message for video e-mail. Very different source and transform filters are used to capture and process a video stream from a digital camcorder than those used with a webcam, so the same filter graph won't work for both devices. In this case, program logic within the application could detect which input device is being usedperhaps based on a menu selection-and could then build the appropriate filter graph. If the user changes the selection from one device to another, program logic could rebuild the filter graph to suit the needs of the selected device. [PES2003] [URL26] A.4. Modular Design The basic power and flexibility of DirectShow derives directly from its modular design. 
DirectShow defines a standard set of Component Object Model (COM) interfaces for filters and leaves it up to the programmer to arrange these components in some meaningful way. Filters hide their internal operations; the programmer doesn't need to understand or appreciate the internal complexities of the Audio Video Interleaved (AVI) file format, for example, to create an AVI file from a video stream. All that's required is the appropriate sequence of filters in a filter graph. Filters are atomic objects within DirectShow, meaning they reveal only as much of themselves as required to perform their functions. Because they are atomic objects, filters can be thought of and treated just like puzzle pieces. The qualities that each filter possesses determine the shape of its puzzle piece, and that, in turn, determines which other filters it can be connected to. As long as the pieces match up, they can be fitted together into a larger scheme, the filter graph. All DirectShow filters have some basic properties that define the essence of their modularity. Each filter can establish connections with other filters and can negotiate the types of connections it is willing to accept from other filters. A filter designed to process MP3 audio doesn't have to accept a connection from a filter that produces AVI video-and probably shouldn't. Each filter can receive some basic messagesrun, stop, and pause-that control the execution of the filter graph. That's about it; there's not much more a filter needs to be ready to go. As long as the filter defines these properties publicly through COM, DirectShow will treat it as a valid element in a filter graph. This modularity makes designing custom DirectShow filters a straightforward process. The programmer's job is to design a COM object with the common interfaces for a DirectShow filter, plus whatever custom processing the filter requires. A custom Page 106 Diploma Thesis Erich Semlak DirectShow filter might sound like a complex affair, but it is really a routine job, one that will be covered extensively in the examples in Part III. The modularity of DirectShow extends to the filter graph. Just as the internals of a filter can be hidden from the programmer, the internals of a filter graph can be hidden from view. When the filter graph is treated as a module, it can assume responsibility for connecting filters together in a meaningful way. It is possible to create a complete, complex filter graph by adding a source filter and a renderer filter to the filter graph. These filters are then connected with a technique known as Intelligent Connect. Intelligent Connect examines the filters in the filter graph, determines the right way to connect them, adds any necessary conversion filters, and makes the connectionsall without any intervention from the programmer. Intelligent Connect can save you an enormous amount of programming time because DirectShow does the tedious work of filter connection for you. There is a price to be paid for this level of automation: the programmer won't know exactly which filters have been placed into the filter graph or how they're connected. Some users will have installed multiple MPEG decoders, such as one for a DVD player and another for a video editing application. Therefore, these systems will have multiple filters to perform a particular function. With Intelligent Connect, you won't know which filter DirectShow has chosen to use (at least, when a choice is available). 
It is possible to write code that will make inquiries to the filter graph and map out the connections between all the filters in the filter graph, but it is more work to do that than to build the filter graph from scratch. So, modularity has its upsidesease of use and extensibility-and its downsides-hidden code. Hiding complexity isn't always the best thing to do, and you might choose to build DirectShow filter graphs step by step, with complete control over the construction process. Overall, the modular nature of DirectShow is a huge boon for the programmer, hiding gory details behind clean interfaces. This modularity makes DirectShow one of the very best examples of object-oriented programming (OOP), which promises reusable code and clean module design, ideals that are rarely achieved in practice. DirectShow achieves this goal admirably, as you'll see. [PES2003] [URL26] A.5. Filters Filters are the basic units of DirectShow programs, the essential components of the filter graph. A filter is an entity complete unto itself. Although a filter can have many different functions, it must have some method to receive or transmit a stream of data. Each filter has at least one pin, which provides a connection point from that filter to other filters in the filter graph. Pins come in two varieties: input pins can receive a stream, while output pins produce a stream that can be sent along to another filter. A.5.1. Filter Types There are three basic classes of DirectShow filters, which span the path from input, through processing, to output (often referred to “rendering”). All DirectShow filters fall into one of these broad categories. A filter produces a stream of data, operates on that stream, or renders it to some output device. Page 107 Diploma Thesis Erich Semlak A.5.1.1. Source Filters Any DirectShow filter that produces a stream is known as a source filter. The stream might originate in a file on the hard disk, or it might come from a live device, such as a microphone, webcam, or digital camcorder. If the stream comes from disk, it could be a pre-recorded WAV (sound), AVI (movie), or Windows Media file. Alternately, if the source is a live device, it could be any of the many thousands of Windowscompatible peripherals. DirectShow is closely tied in to the Windows Driver Model (WDM), and all WDM drivers for installed multimedia devices are automatically available to DirectShow as source filters. So, for example, webcams with properly installed Windows drivers become immediately available for use as DirectShow source filters. Source filters that translate live devices into DirectShow streams are known as capture source filters. Chapter 12 covers the software design of a source filter in detail. A.5.1.2. Transform Filters Transform filters are where the interesting work gets done in DirectShow. A transform filter receives an input stream from some other filter (possibly a source filter), performs some operation on the stream, and then passes the stream along to another filter. Nearly any imaginable operation on an audio or video stream is possible within a transform filter. A transform filter can parse (interpret) a stream of data, encode it (perhaps converting WAV data to MP3 format) or decode it, or add a text overlay to a video sequence. DirectShow includes a broad set of transform filters, such as filters for encoding and decoding various types of video and audio formats. 
Transform filters can also create a tee in the stream, which means that the input stream is duplicated and placed on two (or more) output pins. Other transform filters take multiple streams as input and multiplex them into a single stream. Using a transform filter multiplexer, separate audio and video streams can be combined into a video stream with a soundtrack. A.5.1.3. Renderer Filters A renderer filter translates a DirectShow stream into some form of output. One basic renderer filter can write a stream to a file on the disk. Other renderer filters can send audio streams to the speakers or video streams to a window on the desktop. The Direct in DirectShow reflects the fact that DirectShow renderer filters use DirectDraw and DirectSound, supporting technologies that allow DirectShow to efficiently pass its renderer filter streams along to graphics and sound cards. This ability means that DirectShow's renderer filters are very fast and do not get tied up in a lot of user-to-kernel mode transitions. (In operating system parlance, this process means moving the data from an unprivileged level in an operating system to a privileged one where it has access to the various output devices.) A filter graph can have multiple renderer filters. It is possible to put a video stream through a tee, sending half of it to a renderer filter that writes it to a file, and sending the other half to another renderer filter that puts it up on the display. Therefore, it is possible to monitor video operations while they're happening, even if they're being recorded to disk-an important feature we'll be using later on. Page 108 Diploma Thesis Erich Semlak A.5.1.4. Roundup All DirectShow filter graphs consist of combinations of these three types of filters, and every DirectShow filter graph will have at least one source filter, one renderer filter, and (possibly) several transform filters. In each filter graph, a source filter creates a stream that is then operated on by any number of transform filters and is finally output through a renderer filter. These filters are connected together through their pins, which provide a well-defined interface point for transfer of stream data between filters. A.5.2. Connections between Filters Although every DirectShow filter has pins, it isn't always possible to connect an input pin to an output pin. When two filters are connecting to each other, they have to reach an agreement about what kind of stream data they'll pass between them. For example, there are many video formats in wide use, such as DV (digital video), MPEG-I, MPEG-2, QuickTime, and so on. The pins on a DirectShow filter handle the negotiation between filters and ensure that the pin types are compatible before a connection is made between any two filters. Every filter is required to publish the list of media types it can send or receive and a set of transport mechanisms describing how each filter wants the stream to travel from output pin to input pin. When a DirectShow filter graph attempts to connect the output pin of one filter to the input pin of another, the negotiation process begins. The filter graph examines the media types that the output pin can transmit and compares these with the media types that the input pin can receive. If there aren't any matches, the pins can't be connected and the connection operation fails. A transform filter that can handle DV might not be able handle any other video format. 
Therefore, a source filter that creates an MPEG-2 stream (perhaps read from a DVD) should not be connected to that transform filter because the stream data would be unusable. Picture 39: connecting pins Next the pins have to agree on a transport mechanism. If they can't agree, the connection operation fails. Finally one of the pins has to create an allocator, an object that creates and manages the buffers of stream data that the output pin uses to pass data along to the input pin. The allocator can be owned by either the output pin or the input pin; it doesn't matter, so long as they're in agreement. If all these conditions have been satisfied, the pins are connected. This connection operation must be repeated for each filter in the graph until there's a complete, uninterrupted stream from source filter, through any transform filters, to a renderer filter. When the filter graph is started, a data stream will flow from the output pin of one filter to the input pin of the other through the entire span of the filter graph. [URL1] [PES2003] Page 109 Diploma Thesis Erich Semlak A.5.3. Intelligent Connect One of the greatest strengths of DirectShow is its ability to handle the hard work of supporting multiple media formats. Most of the time it is not necessary for the programmer to be concerned with what kinds of streams will run through a filter graph. Yet to connect two pins, DirectShow filters must have clear agreement on the media types they're handling. How can both statements be true simultaneously? Intelligent Connect automates the connection process between two pins. You can connect two pins directly, as long as their media types agree. In a situation in which the media types are not compatible, you'll often need one (or several) transform filters between the two pins so that they can be connected together. Intelligent Connect does the work of adding and connecting the intermediate transform filters to the filter graph. Picture 40: connection pins "intelligently" For example, a filter graph might have a source filter that produces a stream of an MPEG file This filter graph has a renderer filter that shows the video on the screen. These two filters have nothing in common. They do not share any common media types because the MPEG data is encoded and maybe interleaved and must be decoded and de-interleaved before it can be shown. With Intelligent Connect, the filter graph can try combinations of intermediate transform filters to determine whether there's a way to translate the output requirements of the pin on the source filter into the input requirements of the render filter. The filter graph can do this because it has access to all possible DirectShow filters. It can make inquiries to each filter to determine whether a transform filter can transform one media type to another which might be an intermediate type-transform that type into still another, and so on, until the input requirements of the renderer filter have been met. A DirectShow filter graph can look very complex by the time the filter graph succeeds in connecting two pins, but from the programmer's point of view, it is a far easier operation. And if an Intelligent Connect operation fails, it is fairly certain there's no possible way to connect two filters. The Intelligent Connect capability of DirectShow is one of the ways that DirectShow hides the hard work of media processing from the programmer. [URL1] [PES2003] A.6. Filter Graphs The DirectShow filter graph organizes a group of filters into a functional unit. 
When connected, the filters present a path for a stream from source filters, through any transform filters, to rendering filters. But isn't enough to connect the filters, the filter graph has to tell the filters when to start their operation, when to stop, and when to pause. In addition, the filters need to be synchronized because they're all dealing with media samples that must be kept in sync. (Imagine the confusion if the audio and video kept going out of sync in a movie.) For this reason, the filter graph generates a software-based clock that is available Page 110 Diploma Thesis Erich Semlak to all the filters in the filter graph. This clock is used to maintain synchronization and allows filters to keep their stream data in order as it passes from filter to filter. Available to the programmer, the filter graph clock has increments of 100 nanoseconds. (The accuracy of the clock on your system might be less precise than 100 nanoseconds because accuracy is often determined by the sound card or chip set on your system.) When the programmer issues one of the three basic DirectShow commands-run, stop, or pause-the filter graph sends the messages to each filter in the filter graph. Every DirectShow filter must be able to process these messages. For example, sending a run message to a source filter controlling a webcam will initiate a stream of data coming into the filter graph from that filter, while sending a stop command will halt that stream. The pause command behaves superficially like the stop command, but the stream data isn't cleared out like it would be if the filter graph had received a stop command. Instead, the stream is frozen in place until the filter graph receives either a run or stop command. If the run command is issued, filter graph execution continues with the stream data already present in the filter graph when the pause command was issued. [URL1] [PES2003] A.7. The Life Cycle of a Sample To gain a more complete understanding of DirectShow, let's follow a sample of video data gathered from a webcam as it passes through the filter graph on its way to the display. There are only a few transformations necessary for this configuration. After processing, these samples are presented at the output pin of the webcam, which maintains the per-sample timestamp, ensuring that the images stay correctly synchronized as they move from filter to filter. Picture 41: sample life cycle Finally the stream arrives at its destination, the video renderer. The renderer filter accepts a properly formatted video stream from the DV video decoder and draws it onto the display. As each sample comes into the renderer filter, it is displayed within the DirectShow output window. Samples will be displayed in the correct order, from first to last, because the video renderer filter examines the timestamp of each sample to make sure that the samples are played sequentially. Now that the sample has reached a renderer filter, DirectShow is done with it, and the sample is discarded once it has been drawn on the display. The buffer that the filter allocated to store the sample is returned to a pool of free buffers, ready to receive another sample. This flow of samples continues until the filter graph stops or pauses its execution. If the filter graph is stopped, all of its samples are discarded; if paused, the samples are held within their respective filters until the filter graph returns to the running state. [URL5] [PES2003] Page 111 Diploma Thesis Erich Semlak A.8. 
GraphEdit GraphEdit is a tool for trying out filter graphs before coding them in some programming language. Picture 42: Graphical User Interface of GraphEdit
With GraphEdit you can easily create filter graphs by simply choosing the desired filters and connecting them. The basic elements of DirectShow applications – filters, connections and filter graphs – can be represented visually. GraphEdit is like a whiteboard on which prototype DirectShow filter graphs can be sketched. Because GraphEdit is built using DirectShow components, these whiteboard designs are fully functional, executable DirectShow filter graphs. GraphEdit uses a custom persistence format. This format is not supported for application use, but it is helpful for testing and debugging an application.

Page 112 Diploma Thesis Erich Semlak B. Programming DirectShow B.1. Writing a DirectShow Application
In broad terms, there are three tasks that any DirectShow application must perform. These are illustrated in the following diagram: Picture 43: DirectShow application tasks
1. The application creates an instance of the Filter Graph Manager.
2. The application uses the Filter Graph Manager to build a filter graph. The exact set of filters in the graph will depend on the application.
3. The application uses the Filter Graph Manager to control the filter graph and to stream data through the filters.
Throughout this process, the application will also respond to events from the Filter Graph Manager. When processing is completed, the application releases the Filter Graph Manager and all of the filters. DirectShow is based on COM; the Filter Graph Manager and the filters are all COM objects. One should have a general understanding of COM client programming before beginning to program DirectShow. The article "Using COM" in the DirectX SDK documentation gives a good overview of the subject; there are also many books about COM programming available.

B.2. C# or C++?
DirectShow is meant, through its modular concept, to be extended and enhanced with additional components, i.e. filters. For people who are familiar with C++ and COM programming, everything should be easy. Someone interested in using the cutting edge of technology might prefer to develop the GUI and filter graph parts of the application with Microsoft .NET and C#. C# has several advantages over C++ when it comes to GUI programming and makes development much easier, without having to write many lines of code just to open a basic window. Unfortunately there aren't any extensions available in .NET for using DirectShow in managed code (most parts of the DirectX interfaces are available for managed code, but not DirectShow). Page 113 This means that all interfaces and definitions in DirectShow would have to be rewritten for .NET, in other words, several days of work. But only some parts are needed to span a filter graph and to connect some existing filters, so the work of rewriting interfaces can be cut down to a minimum.

The core C# language differs notably from C and C++ in its omission of pointers as a data type. Instead, C# provides references and the ability to create objects that are managed by a garbage collector. In the core C# language it is simply not possible to have an uninitialised variable, a "dangling" pointer, or an expression that indexes an array beyond its bounds.
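A small, self-contained example (illustrative only, not taken from the thesis' code) contrasts bounds-checked managed access with the raw pointer access that C# permits only in unsafe mode:

// Compile with: csc /unsafe UnsafeDemo.cs
using System;

class UnsafeDemo
{
    static void Main()
    {
        int[] samples = { 10, 20, 30, 40 };

        // Managed access: a reference to the array, bounds-checked at runtime.
        Console.WriteLine( samples[2] );            // prints 30

        // Pointer access: only allowed inside an 'unsafe' block and when the
        // compiler is invoked with the /unsafe switch.
        unsafe
        {
            fixed ( int* p = samples )              // pin the array for the garbage collector
            {
                Console.WriteLine( *(p + 2) );      // prints 30 as well
            }
        }
    }
}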
While practically every pointer type construct in C or C++ has a reference type counterpart in C#, there are nonetheless situations where access to pointer types becomes a necessity. In unsafe code it is possible to declare and operate on pointers, to perform conversions between pointers and integral types, to take the address of variables, and so forth. In some sense, writing unsafe code is much like writing C code within a C# program. For advanced DirectShow programming, that is, writing new filters, there is however no alternative to C++. Not only because the SDK is based on C++, but also because there aren't any tutorials or samples that use a language other than C++. Moreover, C++ sometimes generates slightly faster code (because of its "unmanagedness") and is therefore more appropriate for writing the performance-dependent parts, such as filters.

B.3. Rewriting DirectShow interfaces for C#
To make the required functions of DirectShow accessible to C#, the necessary interfaces have to be rewritten. The starting point for this is the header file named "strmif.h", which is located in the "/include" directory of the DirectX SDK. The first important interface is "IFilterGraph"; it is needed to span a filter graph, to add/remove filters and to connect them. In C++ the interface looks like this:

MIDL_INTERFACE("56a8689f-0ad4-11ce-b03a-0020af0ba770")
IFilterGraph : public IUnknown
{
public:
    virtual HRESULT STDMETHODCALLTYPE AddFilter(
        /* [in] */ IBaseFilter *pFilter,
        /* [string][in] */ LPCWSTR pName) = 0;

    virtual HRESULT STDMETHODCALLTYPE RemoveFilter(
        /* [in] */ IBaseFilter *pFilter) = 0;

    virtual HRESULT STDMETHODCALLTYPE EnumFilters(
        /* [out] */ IEnumFilters **ppEnum) = 0;

    virtual HRESULT STDMETHODCALLTYPE FindFilterByName(
        /* [string][in] */ LPCWSTR pName,
        /* [out] */ IBaseFilter **ppFilter) = 0;

    virtual HRESULT STDMETHODCALLTYPE ConnectDirect(
        /* [in] */ IPin *ppinOut,
        /* [in] */ IPin *ppinIn,
        /* [unique][in] */ const AM_MEDIA_TYPE *pmt) = 0;

    virtual HRESULT STDMETHODCALLTYPE Reconnect(
        /* [in] */ IPin *ppin) = 0;

    virtual HRESULT STDMETHODCALLTYPE Disconnect(
        /* [in] */ IPin *ppin) = 0;

    virtual HRESULT STDMETHODCALLTYPE SetDefaultSyncSource( void) = 0;
};
Code sample 1: IFilterGraph interface in C++

Page 114 Diploma Thesis Erich Semlak In C# it has to look like this:

[ComVisible(true), ComImport,
 Guid("56a8689f-0ad4-11ce-b03a-0020af0ba770"),
 InterfaceType( ComInterfaceType.InterfaceIsIUnknown )]
public interface IFilterGraph
{
    [PreserveSig]
    int AddFilter(
        [In] IBaseFilter pFilter,
        [In, MarshalAs(UnmanagedType.LPWStr)] string pName );

    [PreserveSig]
    int RemoveFilter( [In] IBaseFilter pFilter );

    [PreserveSig]
    int EnumFilters( [Out] out IEnumFilters ppEnum );

    [PreserveSig]
    int FindFilterByName(
        [In, MarshalAs(UnmanagedType.LPWStr)] string pName,
        [Out] out IBaseFilter ppFilter );

    [PreserveSig]
    int ConnectDirect(
        [In] IPin ppinOut,
        [In] IPin ppinIn,
        [In, MarshalAs(UnmanagedType.LPStruct)] AMMediaType pmt );

    [PreserveSig]
    int Reconnect( [In] IPin ppin );

    [PreserveSig]
    int Disconnect( [In] IPin ppin );

    [PreserveSig]
    int SetDefaultSyncSource();
}
Code sample 2: IFilterGraph interface in C#

As one can see, most of the definitions look very similar. There are only some cases where the data types differ in such a way that marshalling is needed, especially for pointers to pointers of structures. Some of that basic work had already been done by someone else (CodeProject, [URL10]), so the basic structure for spanning a filter graph within C# could be achieved simply by using CodeProject's code.
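One structure that also has to be mirrored in managed code is AM_MEDIA_TYPE, since ConnectDirect in Code sample 2 takes it as a parameter. As a sketch only (the DirectShow.NET sources [URL10] contain the authoritative declaration; the field names here are merely illustrative), the managed counterpart could look roughly like this:

using System;
using System.Runtime.InteropServices;

// Managed mirror of AM_MEDIA_TYPE from strmif.h (field order must match exactly).
[StructLayout(LayoutKind.Sequential)]
public class AMMediaType
{
    public Guid   majorType;             // e.g. MEDIATYPE_Video
    public Guid   subType;               // e.g. MEDIASUBTYPE_RGB24
    [MarshalAs(UnmanagedType.Bool)]
    public bool   fixedSizeSamples;
    [MarshalAs(UnmanagedType.Bool)]
    public bool   temporalCompression;
    public int    sampleSize;
    public Guid   formatType;            // e.g. FORMAT_VideoInfo
    public IntPtr unkPtr;                // IUnknown*, normally null
    public int    formatSize;            // size of the format block in bytes
    public IntPtr formatPtr;             // pointer to the format block (e.g. a VIDEOINFOHEADER)
}

Note that a format block received from a filter is allocated by COM, so it should be released with Marshal.FreeCoTaskMem once the media type is no longer needed.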
For my purposes, many more functions for managing pins, media types and connections had to be added.

In the following example a filter graph will be set up which looks like this in GraphEdit:

Picture 44: Sample filter graph

B.4. Initiating the filter graph

As soon as the required interfaces are defined for C#, a filter graph can be set up. For this, an instance of a FilterGraph object is created (just like with CoCreateInstance in C++) and is then cast to some additional object types which extend the functionality for different purposes:

IGraphBuilder - provides facilities to create a graph for a specific file, and to add or remove filters individually from the graph.
IMediaControl - provides methods to start and stop the filter graph.
IMediaEventEx - is used for setting up notifications that will be sent to the application when things happen in the graph (end of media reached, buffer underruns, etc.).
IVideoWindow - sets properties on the video window. Applications can use it to set the window owner, the position and dimensions of the window, and other properties.

The following code sets up the appropriate objects for spanning a filter graph:

Type comType = null;
object comObj = null;
try
{
    comType = Type.GetTypeFromCLSID( Clsid.FilterGraph );
    if( comType == null )
        throw new NotImplementedException( @"DShow Filter graph not registered!" );
    comObj = Activator.CreateInstance( comType );
    IGraphBuilder graphBuilder = (IGraphBuilder) comObj;
    comObj = null;

    Guid clsid = Clsid.CaptureGraphBuilder2;
    Guid riid = typeof(ICaptureGraphBuilder2).GUID;
    comObj = DsBugWO.CreateDsInstance( ref clsid, ref riid );
    ICaptureGraphBuilder2 capGraph = (ICaptureGraphBuilder2) comObj;
    comObj = null;

    IMediaControl mediaCtrl = (IMediaControl) graphBuilder;
    IVideoWindow videoWin = (IVideoWindow) graphBuilder;
    IMediaEventEx mediaEvt = (IMediaEventEx) graphBuilder;
}
catch( Exception ee )
{
    MessageBox.Show( this, "Could not get interfaces\r\n" + ee.Message,
        "DirectShow.NET", MessageBoxButtons.OK, MessageBoxIcon.Stop );
}
finally
{
    if( comObj != null )
        Marshal.ReleaseComObject( comObj );
    comObj = null;
}

Code sample 3: initiating a filter graph in C#

B.5. Adding Filters to the Capture Graph

Before the needed filters can be added and connected, they have to be allocated. There are two ways to achieve this: either to find them by their GUIDs or by their names. The first way is the exact one, because a GUID is unambiguous; the second one is sometimes more convenient, since the GUID does not have to be looked up first, but there may be more than one filter with the same name (e.g. "Video Renderer"). Calling filters by GUID is short and simple, and a little helper function makes it even more convenient.
It takes the GUID of the filter to be found and returns the filter object:

public int FindDeviceByID(System.Guid clsid, out IBaseFilter ofilter)
{
    IntPtr ptrIf;
    ofilter = null;
    Guid ibasef = typeof(IBaseFilter).GUID;
    int hr = DsBugWO.CoCreateInstance( ref clsid, IntPtr.Zero,
        CLSCTX.Server, ref ibasef, out ptrIf );
    if( (hr != 0) || (ptrIf == IntPtr.Zero) )
    {
        Marshal.ThrowExceptionForHR( hr );
        return -1;
    }

    Guid iu = new Guid( "00000000-0000-0000-C000-000000000046" );
    IntPtr ptrXX;
    hr = Marshal.QueryInterface( ptrIf, ref iu, out ptrXX );

    ofilter = (IBaseFilter) System.Runtime.Remoting.Services.
        EnterpriseServicesHelper.WrapIUnknownWithComObject( ptrIf );
    return Marshal.Release( ptrIf );
}

Code sample 4: helper function for searching a filter by GUID

With this helper function, finding e.g. the Video Renderer (whose name is ambiguous on some systems) takes a single line:

// Setup preview
int hr = FindDeviceByID(new Guid("70E102B0-5556-11CE-97C0-00AA0055595A"),
    out vrendFilter);

Code sample 5: finding a filter by GUID

For all other filters, where only the name is known, the GUIDs can be looked up either in the registry or with the GraphEdit tool. Open the "Graph" menu and select "Insert Filters…" (or press Ctrl+F). There the required filter can be looked up; most of them are listed in the "DirectShow Filters" section. After choosing a filter, the appropriate information is displayed:

Picture 45: Filter selection dialog in GraphEdit

The Filter Moniker shows the GUID of the selected filter. The first part before the backslash identifies the filter category; "{083863F1-70DE-11D0-BD40-00A0C911CE86}" stands for the legacy filter category. The second part after the backslash is the GUID of the filter itself, in our example "70E102B0-5556-11CE-97C0-00AA0055595A". Another way, as mentioned, is to enumerate the filters by name. This is sometimes more convenient but may take more time, especially if the filter category to search in contains a lot of filters, as the DirectShow category does.
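Both lookup styles need the GUID of the filter category. In the code samples below these categories are referenced through a small Clsid constants class that comes with the CodeProject base code; the following is only a hypothetical sketch of what such a class might contain. The legacy category value is the one shown in the moniker above; the other values are quoted from memory of uuids.h in the DirectX SDK and should be verified there.

// Hypothetical sketch of the Clsid constants class used in the samples below.
public sealed class Clsid
{
    // "DirectShow Filters" (legacy filter category), as shown in the moniker above
    public static readonly Guid LegacyAmFilterCategory =
        new Guid("083863F1-70DE-11D0-BD40-00A0C911CE86");
    // video capture devices (verify against uuids.h)
    public static readonly Guid VideoInputDeviceCategory =
        new Guid("860BB310-5D01-11D0-BD3B-00A0C911CE86");
    // video compressors (verify against uuids.h)
    public static readonly Guid VideoCompressorCategory =
        new Guid("33D9A760-90C8-11D0-BD43-00A0C911CE86");
}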
For crawling through the list of filters of a category, there is another convenient helper function:

public int FindDeviceByName(string name, System.Guid fclsid, out IBaseFilter ofilter)
{
    int i, ok = -1;
    ArrayList devices;
    DsDevice dev = null;
    object capObj = null;
    ofilter = null;

    DsDev.GetDevicesOfCat(fclsid, out devices);
    for(i = 0; i < devices.Count; i++)
    {
        dev = devices[i] as DsDevice;
        Console.WriteLine(dev.Name);
        if (dev.Name != null)
        {
            if (dev.Name.Trim() == name)
                break;
        }
    }
    if (i < devices.Count)
    {
        try
        {
            Guid gbf = typeof( IBaseFilter ).GUID;
            dev.Mon.BindToObject( null, null, ref gbf, out capObj );
            ofilter = (IBaseFilter) capObj;
            capObj = null;
            ok = 0;
        }
        catch( Exception ee )
        {
            MessageBox.Show( this, "Could not create " + name + " device\r\n" + ee.Message,
                "DirectShow.NET", MessageBoxButtons.OK, MessageBoxIcon.Stop );
        }
        finally
        {
            if( capObj != null )
                Marshal.ReleaseComObject( capObj );
            capObj = null;
        }
    }
    return ok;
}

Code sample 6: helper function for searching a filter by its name

So the creation of the remaining filters is also just a matter of a few lines of code:

hr = FindDeviceByName(cameraname,
    Clsid.VideoInputDeviceCategory, out capFilter);
hr = FindDeviceByName("HNS Motion Detection",
    Clsid.LegacyAmFilterCategory, out motdetFilter);
hr = FindDeviceByName("Smart Tee",
    Clsid.LegacyAmFilterCategory, out smarttFilter);
hr = FindDeviceByName("HNS MPEG-2 Sender",
    Clsid.LegacyAmFilterCategory, out nwrendFilter);
hr = FindDeviceByName("Microsoft MPEG-4 Video Codec V1",
    Clsid.VideoCompressorCategory, out comprFilter);

Code sample 7: searching a filter by its name

All these filters have to be added to the filter graph before they can be connected:

hr = graphBuilder.AddFilter(capFilter, "Camera");
hr = graphBuilder.AddFilter(motdetFilter, "MotionDet");
hr = graphBuilder.AddFilter(comprFilter, "Compressor");
hr = graphBuilder.AddFilter(nwrendFilter, "Network Renderer");
hr = graphBuilder.AddFilter(vrendFilter, "Video Renderer");
hr = graphBuilder.AddFilter(smarttFilter, "Smart Tee");

Code sample 8: adding filters to the filter graph

B.6. Connecting Filters through the Capture Graph

Now the filters can be connected through their input and output pins. These pins also have to be found first, which can be done with the following helper function:

public int GetUnconnectedPin(IBaseFilter filter, out IPin opin, PinDirection pindir)
{
    int hr;
    IEnumPins cinpins;
    IPin pin = null, ptmp;
    PinDirection pdir;
    int i;
    int ok = -1;
    opin = null;

    filter.EnumPins(out cinpins);
    while(cinpins.Next(1, ref pin, out i) == 0)
    {
        pin.QueryDirection(out pdir);
        if (pdir == pindir)
        {
            hr = pin.ConnectedTo(out ptmp);
            if (hr < 0)
            {
                opin = pin;
                ok = 0;
                break;
            }
        }
    }
    return ok;
}

Code sample 9: helper function for finding a filter's pins

The function takes the filter and the direction of the pins to be found; there are only input and output pins.
Using this helper function makes finding the needed pins straightforward:

hr = GetUnconnectedPin(capFilter, out capout, PinDirection.Output);
hr = GetUnconnectedPin(motdetFilter, out mdin, PinDirection.Input);
hr = GetUnconnectedPin(motdetFilter, out mdout, PinDirection.Output);
hr = GetUnconnectedPin(smarttFilter, out stin, PinDirection.Input);
hr = GetUnconnectedPin(comprFilter, out comprin, PinDirection.Input);
hr = GetUnconnectedPin(comprFilter, out comprout, PinDirection.Output);
hr = GetUnconnectedPin(nwrendFilter, out nwrendin, PinDirection.Input);
hr = GetUnconnectedPin(vrendFilter, out vrendin, PinDirection.Input);

Code sample 10: get unconnected pins

If the names of the pins are known, they can also be retrieved directly from the filter via FindPin. This is necessary if there are several pins of the same direction; in this case the Smart Tee has two output pins, one for capturing and one for previewing:

hr = smarttFilter.FindPin("Capture", out stoutcap);
hr = smarttFilter.FindPin("Preview", out stoutprev);

Code sample 11: finding pins by name

After all pins have been obtained, they can be connected through the GraphBuilder:

hr = graphBuilder.Connect(capout, stin);
hr = graphBuilder.Connect(stoutcap, mdin);
hr = graphBuilder.Connect(mdout, comprin);
hr = graphBuilder.Connect(comprout, nwrendin);
hr = graphBuilder.Connect(stoutprev, vrendin);

Code sample 12: connecting pins

B.7. Running the Capture Graph

After the filter graph has been defined and spanned, only the video output panel for the video renderer has to be set, and then the graph can be run.

try
{
    // Set the video window to be a child of the main window
    hr = videoWin.put_Owner( vidPanel.Handle );
    if( hr < 0 )
        Marshal.ThrowExceptionForHR( hr );

    // Set video window style
    hr = videoWin.put_WindowStyle( WS_CHILD | WS_CLIPCHILDREN );
    if( hr < 0 )
        Marshal.ThrowExceptionForHR( hr );

    // position video window in client rect of owner window
    Rectangle rc = vidPanel.ClientRectangle;
    videoWin.SetWindowPosition( 0, 0, rc.Right, rc.Bottom );

    // Make the video window visible, now that it is properly positioned
    hr = videoWin.put_Visible( DsHlp.OATRUE );
    if( hr < 0 )
        Marshal.ThrowExceptionForHR( hr );

    hr = mediaEvt.SetNotifyWindow( slaveform.Handle, WM_GRAPHNOTIFY, IntPtr.Zero );
    if( hr < 0 )
        Marshal.ThrowExceptionForHR( hr );
}
catch( Exception ee )
{
    MessageBox.Show( this, "Could not setup video window\r\n" + ee.Message,
        "DirectShow.NET", MessageBoxButtons.OK, MessageBoxIcon.Stop );
}

// run it all
hr = mediaCtrl.Run();
if( hr < 0 )
    Marshal.ThrowExceptionForHR( hr );

Code sample 13: running the filter graph

Then the filters are started and media data runs through the graph until it is paused or stopped. To stop the graph and clean up resources, the allocated object handles have to be released:

public void CloseInterfaces()
{
    int hr;
    try
    {
        if( mediaCtrl != null )
        {
            hr = mediaCtrl.Stop();
            mediaCtrl = null;
        }
        if( mediaEvt != null )
        {
            hr = mediaEvt.SetNotifyWindow( IntPtr.Zero, WM_GRAPHNOTIFY, IntPtr.Zero );
            mediaEvt = null;
        }
        if( videoWin != null )
        {
            hr = videoWin.put_Visible( DsHlp.OAFALSE );
            hr = videoWin.put_Owner( IntPtr.Zero );
            videoWin = null;
        }
        if( capGraph != null )
            Marshal.ReleaseComObject( capGraph );
        capGraph = null;

        if( graphBuilder != null )
            Marshal.ReleaseComObject( graphBuilder );
        graphBuilder = null;

        if( capFilter != null )
            Marshal.ReleaseComObject( capFilter );
        capFilter = null;
    }
    catch( Exception )
    {}
}

Code sample 14: stopping and clearing the filter graph
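Code sample 13 registers slaveform.Handle to receive WM_GRAPHNOTIFY messages, but the handling of these messages is not shown. The following is only a rough sketch of what such a handler might look like, assuming the IMediaEventEx wrapper exposes GetEvent and FreeEventParams with signatures analogous to the COM originals; the wrapper names and the event code constants are assumptions and would have to match the actual interface definitions.

// Hypothetical notification handler in the form that owns the graph.
protected override void WndProc( ref Message m )
{
    const int EC_COMPLETE   = 0x01;   // event codes as defined in evcode.h
    const int EC_ERRORABORT = 0x03;

    if( m.Msg == WM_GRAPHNOTIFY && mediaEvt != null )
    {
        int evCode;
        IntPtr p1, p2;
        // drain all queued events; parameters must be freed after each call
        while( mediaEvt.GetEvent( out evCode, out p1, out p2, 0 ) == 0 )
        {
            mediaEvt.FreeEventParams( evCode, p1, p2 );
            if( evCode == EC_COMPLETE || evCode == EC_ERRORABORT )
            {
                CloseInterfaces();   // stop the graph and release the objects
                break;
            }
        }
        return;
    }
    base.WndProc( ref m );
}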
B.8. Writing new DirectShow Filters

Although DirectShow comes with a rich set of ready-made filters, there might be tasks that demand writing a new DirectShow filter. Because the initial setup of the code framework and of the development environment (e.g. Visual Studio) depends on the required base classes, there is some preparation work to do. This can be simplified by using one of the existing filter creation wizards that can be found on the web ([URL13], [URL14]). Unfortunately, none of them works with Visual Studio .NET 2003, so the only alternative to writing the whole framework by hand is to copy the DirectShow SDK sample project that best fits the intended scheme. This means renaming lots of definitions and possibly deleting some code and rewriting other parts. Filter development is covered in more detail in the practical part of this thesis.

C. Sources for Appendix A and B

C.1. Books & Papers

[PES2003] Mark D. Pesce; Programming Microsoft DirectShow for Digital Video and Television; Microsoft Press; 2003

C.2. URLs

[URL1] Microsoft; DirectShow Reference; http://msdn.microsoft.com/library/default.asp?url=/library/en-us/directshow/htm/directshowreference.asp; 2004
[URL5] LeadTools; DirectShow; http://www.leadtools.com/SDK/Multimedia/MultimediaDirectShow.htm; 2004
[URL13] Yunqiang Chen; DirectShow Transform Filter AppWizard; http://www.ifp.uiuc.edu/~chenyq/research/Utils/DShowFilterWiz/DShowFilterWiz.html; 2001
[URL14] John McAleely; DXMFilter; http://www.cs.technion.ac.il/Labs/Isl/DirectShow/dxmfilter.htm; 1998
[URL25] Michael Blome; Core Media Technology in Windows XP Empowers You to Create Custom Audio/Video Processing Components; http://msdn.microsoft.com/msdnmag/issues/02/07/DirectShow/default.aspx; 2002
[URL26] Chris Thompson; DirectShow For Media Playback in Windows; http://www.flipcode.com/articles/article_directshow01.shtml; 2000

Curriculum Vitae

Name: Erich Semlak
Born: 31.5.1972 in Wilmington (DE) / USA

Education
1978 - 1981: Primary School
1982: First class, Academic Gymnasium Linz / Spittelwiese
1983 - 1989: 2nd - 6th class, Stiftsgymnasium Kremsmünster
1989 - 1991: Vocational School
1993 - 1996: Academy for Applied Computer Science at WIFI Linz
1996: Study permit exam
1996 - 2005: Computer Science studies at Johannes Kepler University / Linz

Business Experience
1989: short experience as apprentice for organ building (2 months)
1989 - 1991: apprenticeship in merchandising at Ludwig Christ / Linz
1991 - 1994, 1996: continued at Ludwig Christ as employee
1995: civil service at the Red Cross
1996 - 2003: freelancer at the Ars Electronica Center / Linz
since 2003: several short contracts as freelancer for different companies