Distributed Event-Based Video Capturing and Streaming

Diploma Thesis
Erich Semlak / 9655661

Diploma thesis submitted for the attainment of the academic degree of Diplomingenieur in the diploma programme Informatik (Computer Science).
Prepared at the Institut für Telekooperation

Submitted by:
Erich Semlak

Supervision:
Prof. Dr. Gabriele Kotsis

Assessment:
Prof. Dr. Gabriele Kotsis

Linz, June 2005
Acknowledgement
I would like to thank …
… Prof. Dr. Gabriele Kotsis for her fair and constructive comments,
… Mag. Reinhard Kronsteiner for his supportive patience and understanding,
… Margot for bearing with me while developing and writing this thesis,
… Gunther for his hints and vocabulary,
… Oliver for his encouraging words,
… my colleagues for making me finish my studies.
Abstract
This thesis describes an approach to realtime video processing, with particular emphasis on distributed capturing, processing and streaming. The developed system explores the possibilities of today's standard PC hardware, CPU processing power and Ethernet networking. It is based on a widespread and flexible multimedia architecture (DirectShow) in order to simplify extensions and to open up a large number of already available components. The developed system can be used for surveillance or broadcasting purposes, depending on the requirements regarding quality and image resolution. Its scalability is limited with respect to the number of video inputs and the video resolution, and it can be run under Windows without additional compression hardware or software.
This thesis explains an approach to realtime video processing with special emphasis on distributed capturing, processing and subsequent streaming. The developed system explores the possibilities of standard PC hardware, CPU processing power and Ethernet networking. It is based on a widespread and universal multimedia architecture (DirectShow) to ease further expansion and to open up a wide range of already available components. The developed system is applicable to surveillance or broadcasting tasks, depending on quality and resolution requirements. Its scalability is limited with regard to the number of inputs and the video resolution, and it can be run under Windows without the need for extra compression hardware or expensive software.
Index

1. Introduction
   1.1. Motivation
   1.2. Outline of the Work
2. Theoretical Part
   2.1. Digital Video
        2.1.1. History
        2.1.2. Fundamentals
   2.2. Video Capturing
        2.2.1. Analogue-Digital Conversion
        2.2.2. Capturing Hardware
        2.2.3. Relation to this work
   2.3. Video Compression
        2.3.1. Introduction
        2.3.2. Discrete Cosine Transform
        2.3.3. Non-linear editing
        2.3.4. Video Compression Formats
        2.3.5. Relation to this work
   2.4. Video Streaming
        2.4.1. Video Streaming Formats
        2.4.2. Streaming Methods
        2.4.3. Streaming Method Considerations
        2.4.4. Protocols
        2.4.5. Protocols considerations
   2.5. Software
        2.5.1. Commercial
        2.5.2. Shareware
   2.6. Techniques and SDKs
        2.6.1. Microsoft DirectShow
        2.6.2. Video For Windows
        2.6.3. Windows Media Series 9
3. Practical Part
   3.1. Problem
        3.1.1. Model
        3.1.2. Applications
   3.2. Technology Evaluation
        3.2.1. Evaluation Aspects
        3.2.2. DirectShow SDK
        3.2.3. Evaluation of DirectShow
        3.2.4. Windows Media SDK
        3.2.5. Evaluation of Windows Media SDK
        3.2.6. Combination of DirectShow and Windows Media
        3.2.7. Video for Windows
        3.2.8. From Scratch
        3.2.9. Overview
        3.2.10. Conclusion
   3.3. Implementation Considerations
   3.4. Filter Development
        3.4.1. Motion Detection Filter
        3.4.2. Object Detection Filter
        3.4.3. Network Sending/Receiving Filter
        3.4.4. Infinite Pin Tee Filter
        3.4.5. Web Streaming Filter
   3.5. Applications
        3.5.1. Slave Application
        3.5.2. Master Application
   3.6. Test & Conclusion
        3.6.1. Development
        3.6.2. Capacity and Stress Test
4. Recent Research
   4.1. Video Capturing and Processing
        4.1.1. Object Tracking
        4.1.2. Object Detection and Tracking Approaches
        4.1.3. Object Tracking Performance
        4.1.4. Motion Detection
        4.1.5. Object Tracking Using Motion Information
        4.1.6. Summary
        4.1.7. Relation to this work
   4.2. Video Streaming
        4.2.1. Compression vs. Delivery
        4.2.2. Compression
        4.2.3. Streaming
        4.2.4. Relation to this work
5. Final Statement
6. Sources
   6.1. Books & Papers
   6.2. URLs
7. Pictures
8. Code samples
9. Tables
A. A Short History Of DirectShow
   A.1. DirectShow Capabilities
   A.2. Supported Formats in DirectShow
   A.3. Concepts of DirectShow
   A.4. Modular Design
   A.5. Filters
        A.5.1. Filter Types
        A.5.2. Connections between Filters
        A.5.3. Intelligent Connect
   A.6. Filter Graphs
   A.7. The Life Cycle of a Sample
   A.8. GraphEdit
B. Programming DirectShow
   B.1. Writing a DirectShow Application
   B.2. C# or C++?
   B.3. Rewriting DirectShow interfaces for C#
   B.4. Initiating the filter graph
   B.5. Adding Filters to the Capture Graph
   B.6. Connecting Filters through Capture Graph
   B.7. Running the Capture Graph
   B.8. Writing new DirectShow Filters
C. Sources for Appendix A and B
   C.1. Books & Papers
   C.2. URLs
Diploma Thesis
Erich Semlak
1. Introduction
“A picture is worth a thousand words”.
1.1. Motivation
If a picture is worth a thousand words, what about the value of video?
During the 1990s, the personal computer underwent a radical transformation. At the beginning of that decade it was more an information processing device, suitable only for word processing and spreadsheets. By the turn of the century (and millennium), the PC had become a media machine, playing music and movies, games and DVDs, and streaming live reports from the web. Part of the reason for this expansion of the PC's possibilities was the exponential growth in processing power, memory, and storage, following Moore's Law [URL32], which is popularly taken to mean that computing power doubles roughly every 18 months.
Even with today’s high performance CPUs there are still problems which would take
too long to be solved on a single computer. Such problems are solved by distributed
systems [URL33] which share the workload among several computers.
Complex video processing tasks, such as video compression and object detection, still exceed the capabilities of a single high-end 3.5 GHz personal computer, especially when handling multiple high-resolution video sources. A distributed approach therefore seems appropriate for solving such problems while still using inexpensive off-the-shelf PC hardware.
Today there are few systems which use a distributed approach to video processing based on common PC hardware. Some of them use parallel systems for video effect insertion [MAY1999], others try to parallelize video content analysis (e.g. motion and object detection) [LV2002]. There are also systems which use distributed video input, such as the DIVA system [TRI2002], which observes a remote scene for incidents.
None of these systems offers the whole range of functions for capturing multiple (remote) sources, processing them (effects, motion and object detection) depending on video content analysis, and streaming the result out onto a network.
This thesis presents an approach to realtime video processing with special emphasis on distributed capturing, processing and subsequent streaming. The developed system explores the possibilities of standard PC hardware, CPU processing power and Ethernet networking. It is based on a widespread and universal multimedia architecture (DirectShow) to ease further expansion and to open up a wide range of already available filters. This makes it possible to use different types of compression and formats without the necessity of writing new components.
These qualities are unique in comparison with other products (see chapter 2.5), which are more complicated to expand because of the expensive SDKs they require and their proprietary architecture. The capability to combine distributed processing with standard hardware and a standard multimedia software architecture can hardly be found even in commercial products.
The developed system is applicable to surveillance or web broadcasting tasks, depending on quality and resolution requirements. Its scalability is limited with regard to the number of inputs and the video resolution, and it can be run under Windows without the need for extra compression hardware or expensive software.
1.2. Outline of the Work
To fathom the possibilities of the hardware and of appropriate software systems, an application had to be developed that allows scalable use of today's common hardware and networks for simultaneous capturing and video analysis on multiple PCs, with subsequent streaming to the internet, in a comfortable and stable manner. A distributed hierarchical approach will be proposed. Detailed problem and model considerations can be found in chapter 3.1.
For this purpose, different software approaches have been evaluated in terms of modular design and openness towards future technologies. Out of these, Microsoft's DirectShow (see chapter 2.6) has been chosen as the most appropriate basis for developing a system which fulfills the stated requirements. Chapter 2 provides the theoretical foundation needed for a better understanding of the entire background of video capturing and streaming.
Before starting development, an in-depth look into DirectShow was necessary to understand its internals and the techniques for developing new components. Especially using C# for handling DirectShow filter graphs (see chapter A.6) turned out to be difficult because DirectShow is intended to be used with C++.
In the course of this cumbersome information retrieval a DirectShow tutorial emerged, which should help other developers to handle C# and DirectShow together more easily. This tutorial can be found in chapter B.
After writing the needed interfaces for C# and DirectShow, two applications (Master and Slave, chapter 3.5) have been created to perform the needed tasks (see chapter 3.1 for the problem description) for the distributed video processing approach. For this, some new DirectShow filters (chapter 3.4) had to be developed, which can also be used with other applications based on DirectShow.
These applications have then been tested with different input, output and hardware configurations (chapter 3.6). The results show CPU and network workloads for up to four video sources, allow an estimation for higher numbers of inputs, and show where bottlenecks limit the scalability of the developed system.
Finally, recent research in the field of video processing has been compared with the results of this thesis (chapter 4).
2. Theoretical Part
2.1. Digital Video
The recording and editing of sound has long been in the domain of the PC, but doing the same with moving video has only recently gained acceptance as a mainstream PC application. In the past, digital video work was limited to a small group of specialist users, such as multimedia developers and professional video editors, who were prepared to pay for expensive and complex digital video systems. It was not until 1997, when CPUs reached the 200 MHz mark after several years of intense technological development, that the home PC was strong enough for the tasks to come.
2.1.1. History
In the early 1990s, a digital video system capable of capturing full-screen video images would have cost several thousands of Euros. The biggest cost element was the compression hardware needed to reduce the huge files that result from the conversion of an analogue video signal into digital data to a manageable size. Those expansion cards consisted of several processors, exceeding the PC's own processing power several times over - and its price as well.
Less powerful "video capture" cards were available, capable of compressing quarter-screen images (320x240 pixels), but even these were far too expensive (more than $2,000) for the average PC user. The end-user market was limited to basic cards that could capture video but had no dedicated hardware compression features of their own. These low-cost cards relied on the host PC to handle the raw digital video files they produced, and the only way to keep file sizes manageable was to drastically reduce the image size. This often meant capturing the video data uncompressed to hard disc arrays (to handle the approx. 33 MB/sec for S-VHS quality, which needed the full bandwidth of an EISA bus) and compressing it afterwards, taking several hours for a few minutes of material.
Until the arrival of the Pentium processor in 1993, even the most powerful PCs were limited to capturing images of no more than 160x120 pixels. For a graphics card
running at a resolution of 640x480, a 160x120 image filled just one-sixteenth of the
screen. As a result these low-cost video capture cards were generally dismissed as
little more than toys, incapable of performing any worthwhile real-world application.
The turning point for digital video systems came as processors finally exceeded
200MHz. At this speed, PCs could handle images up to 320x240 without the need for
expensive compression hardware. The advent of the Pentium II and ever more
processing power made video capture cards which offered less than full-screen
capability virtually redundant and by the autumn of 1998 there were several
consumer-oriented video capture devices on the market which provided full-screen
video capture for as little as a few hundred Euros.
2.1.2. Fundamentals
Understanding what digital video is first requires an understanding of its ancestor: broadcast television, or analogue video. The invention of radio demonstrated that
sound waves can be converted into electromagnetic waves and transmitted over
great distances to radio receivers. Likewise, a television camera converts the color
and brightness information of individual optical images into electrical signals to be
transmitted through the air or recorded onto video tape. Similar to a movie, television
signals are converted into frames of information and projected at a rate fast enough
to fool the human eye into perceiving continuous motion. When viewed by an
oscilloscope, the analogue signal looks like a continuous landscape of jagged hills
and valleys, analogous to the changing brightness and color information.
Most countries around the world use one of three main television broadcast standards: NTSC, PAL and SECAM. Unfortunately, the three standards are mutually incompatible.
The table below describes each standard and the technical variations within it:
System | Lines/Field | Horizontal Frequency | Vertical Frequency | Color Subcarrier Frequency | Video Bandwidth | Sound Carrier
NTSC M | 525/60 | 15.734 kHz | 60 Hz | 3.579545 MHz | 4.2 MHz | 4.5 MHz
PAL B,G,H | 625/50 | 15.625 kHz | 50 Hz | 4.433618 MHz | 5.0 MHz | 5.5 MHz
PAL I | 625/50 | 15.625 kHz | 50 Hz | 4.433618 MHz | 5.5 MHz | 6.0 MHz
PAL D | 625/50 | 15.625 kHz | 50 Hz | 4.433618 MHz | 6.0 MHz | 6.5 MHz
PAL N | 625/50 | 15.625 kHz | 50 Hz | 3.582056 MHz | 4.2 MHz | 4.5 MHz
PAL M | 525/60 | 15.750 kHz | 60 Hz | 3.575611 MHz | 4.2 MHz | 4.5 MHz
SECAM B,G,H | 625/50 | 15.625 kHz | 50 Hz | - | 5.0 MHz | 5.5 MHz
SECAM D,K,K1,L | 625/50 | 15.625 kHz | 50 Hz | - | 6.0 MHz | 6.5 MHz
Table 1: Overview of TV formats [URL17]
The following details about each standard are extracted from [URL18] and
[ORT1993]:
NTSC (National Television System Committee) was the first color TV broadcast system, implemented in the United States in 1953. NTSC is used by many countries on the American continent as well as many Asian countries including Japan.
NTSC runs on 525 lines per frame. As the electric current in the USA alternates 60 times per second, 60 half-pictures are displayed per second.
With PAL (Phase Alternating Line) each complete frame is drawn line by line, from top to bottom. It was developed by Walter Bruch at Telefunken [URL31] in Germany and is used in most countries in Western Europe and Asia, throughout the Pacific and in southern Africa. Europe uses an AC electric current that alternates 50 times per second (50 Hz), therefore PAL performs 50 passes (half-pictures) each second. As it takes two passes to draw a complete frame, the picture rate is 25 fps. The odd lines are drawn on the first pass, the even lines on the second. This procedure is known as "interlacing" and is supposed to compensate for the low refresh rate. Computer monitors display images non-interlaced, so they show
a whole picture each pass. Interlaced signals, particularly at a rather low rate of 50Hz
(modern monitors show pictures at at least 70 Hz), cause unsteadiness and flicker,
and are inappropriate for displaying text or thin horizontal lines.
SECAM (Séquentiel Couleur à Mémoire, or Sequential Color with Memory) was developed in France and is used in France and its territories, much of Eastern Europe, the Middle East and northern Africa. This system uses the same resolution as PAL, 625 lines, and the same frame rate, 25 per second, but the way SECAM processes the color information is not compatible with PAL.
2.2. Video Capturing
The practical part of this thesis handles live video input for further processing. It therefore needs video capturing to turn the cameras' view into digitally usable data. To aid understanding of this process, this chapter explains its principles and techniques.
2.2.1. Analogue-Digital Conversion
To store visual information digitally, the hills and valleys of the analogue video signal
have to be translated into the digital equivalent - ones and zeros - by a sophisticated
computer-on-a-chip, called an analogue-to-digital converter (ADC). The conversion
process is known as sampling, or video capture. Since computers have the capability
to deal with digital graphics information, no other special processing of this data is
needed to display digital video on a computer monitor. To view digital video on a
traditional television set, the process has to be reversed. A digital-to-analogue
converter (DAC) is required to decode the binary information back into the analogue
signal.
The digitisation of the analogue TV signal is performed by a video capture card which
converts each frame into a series of bitmapped images to be displayed and
manipulated on the PC. This takes one horizontal line at a time and, for the PAL
system, splits each into 768 sections. At each of these sections, the red, green and
blue values of the signal are calculated, resulting in 768 colored pixels per line. The
768 pixel width arises out of the 4:3 aspect ratio of a TV picture. Out of the 625 lines
in a PAL signal, about 50 are used for Teletext (an information retrieval service
through television broadcast) and contain no picture information, so they are not
digitised. To get the 4:3 ratio, 575 lines times four divided by three gives 766.7. Video
capture cards usually digitise 576 lines, splitting each line into 768 segments, which
gives an exact 4:3 ratio (compared to the more modern 16:9 ratio of widescreen
television). [STO_1995]
Thus, after digitisation, a full frame is made up of 768x576 pixels. Each pixel requires
three bytes for storing the red, green and blue components of its color (for 24-bit
color). Each frame therefore requires 768x576x3 bytes = 1.3MB. In fact, the PAL
system takes two passes to draw a complete frame - each pass resolving alternate
scan lines to reduce flickering. The upshot is that one second of video requires a
massive 32.5MB (1.3 x 25 fps).
The human eye is more sensitive to brightness than it is to color [URL34]. The
YUV model is a method of encoding pictures used in television broadcasting in which
intensity is processed independently from color. Y is for intensity and is measured in
full resolution, while U and V are for color difference signals and are measured at
either half resolution (known as YUV 4:2:2) or at quarter resolution (known as YUV
4:1:1). Digitising a YUV signal instead of an RGB signal requires 16 bits (two bytes)
instead of 24 bits (three bytes) to represent true color, so one second of PAL video
ends up requiring about 22MB. [ORT1993]
The NTSC system (chapter 2.1.2) used by America and Japan has 525 lines and runs at 30 fps - the latter being a consequence of the fact that their electric current alternates at 60 Hz rather than the 50 Hz found in Europe. NTSC frames are usually digitised at 640x480, which fits exactly into VGA resolution. This is not a coincidence, but a result of the PC having been designed in the US and the first IBM PCs having the capability to be plugged into a TV.
2.2.2. Capturing Hardware
A typical video capture card is a system of hardware and software which together
allow a user to convert video into a computer-readable format by digitising video
sequences to uncompressed or, more normally, compressed data files.
Uncompressed PAL video creates about 32.5MB data per second, so some kind of
compression has to be employed to make it more manageable. It is down to a codec
to compress video during capture and decompress it again for playback, and this can
be done in software or hardware. Even in the age of GHz-speed CPUs, a hardware
codec is necessary to achieve anything near broadcast quality video.
Picture 1: miro Video PCTV card
The TV card shown consists of a TV tuner (the closed box in the upper section) and the video decoder chip (lower right corner, labelled "Bt").
Most video capture devices employ a hardware Motion-JPEG codec [URL35], which
uses JPEG compression on each frame to achieve smaller file sizes, while retaining
editing capabilities. The huge success of DV-based camcorders in the late 1990s has led to some high-end cards employing a DV codec (see chapter 2.3.4 for compression formats) instead.
Once compressed, the video sequences can then be edited on the PC using
appropriate video editing software and output in S-VHS quality to a VCR, television,
camcorder or computer monitor. The higher the quality of the video input and the
higher the PC's data transfer rate, the better the quality of the video image output.
Less compression means less loss of information, so there are fewer artifacts and/or a higher frame rate.
Some video capture cards (e.g. Hauppauge Bt878 TV cards) keep their price down
by omitting their own audio recording hardware. Instead they provide pass-through
connectors that allow audio input to be directed to the host PC's sound card. This is
no problem for simple editing work, but without dedicated audio hardware problems
can arise in synchronising the audio and video tracks on longer and more complex
edits.
Video capture cards are equipped with a number of input and output connectors.
There are two main video formats: composite video is the standard for most domestic
video equipment, although higher quality equipment often uses the S-Video format.
Most capture cards will provide at least one input socket that can accept either type
of video signal, allowing connection to any video source (e.g. VCR, video camera, TV
tuner and laser disc) that generates a signal in either of these formats. Additional
sockets can be of benefit though, since complex editing work often requires two or
more inputs. Some cards are designed to take an optional TV tuner module and,
increasing, video capture cards actually include an integrated TV tuner (e.g. the
Hauppauge WinTV Series or the Pinnacle PCTV).
Video output sockets are provided to allow video sequences to be recorded back to
tape and some cards also allow video to be played back either on a computer
monitor or TV. Less sophisticated cards require a separate graphics adapter or TV
tuner card to provide this functionality. [CT1996_11]
2.2.3. Relation to this work
Capturing video by software needs an interface to communicate with the camera device to configure input resolution, frame rate, shutter and light exposure. Some capturing hardware also has the ability to move the camera head and/or provides zoom. To handle all these functions, a driver is needed which acts as a mediator between the capturing hardware and the software.
The system developed in this work is based on DirectShow (chapter 2.6.1) and therefore needs DirectShow-compatible capturing hardware. To be DirectShow-compatible, drivers must be available which provide filters that can be used through DirectShow. Most available webcams and DV cameras can be controlled by DirectShow; some older hardware provides only Video For Windows (chapter 2.6.2) drivers, which are useless with DirectShow, and some proprietary hardware does not provide any drivers at all.
As the handling of the capturing hardware is entirely managed by DirectShow, the developed system only takes the video data which comes through the input filter, or the driver respectively, without any need to access the hardware directly.
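To illustrate how little device-specific code is needed once DirectShow manages the hardware, the following C# sketch merely lists all video input devices that expose a DirectShow capture filter. It assumes the open-source DirectShow.NET (DirectShowLib) interop wrapper; the thesis itself declares its own interop interfaces (see appendix B), so the wrapper is used here only as a convenient stand-in.

using System;
using DirectShowLib;   // assumption: the open-source DirectShow.NET wrapper, not part of the thesis code

class CaptureDeviceList
{
    static void Main()
    {
        // Ask DirectShow for every registered video input device (webcams, TV cards, DV cameras).
        // Each entry corresponds to a capture source filter that could be inserted into a filter graph.
        DsDevice[] devices = DsDevice.GetDevicesOfCat(FilterCategory.VideoInputDevice);
        foreach (DsDevice device in devices)
            Console.WriteLine(device.Name);
    }
}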
2.3. Video Compression
2.3.1. Introduction
Video compression methods tend to be lossy. This means that after decompression the video data is not exactly the same as the data that was originally encoded. By cutting video's resolution, color depth and frame rate, PCs managed postage-stamp-size windows at first, but then ways were devised to represent images more efficiently and reduce data without affecting physical dimensions. The technology by which video compression is achieved is known as a "codec", an abbreviation of compression/decompression. Various types of codec have been developed - implementable in software or hardware, and sometimes utilising both - allowing video to be readily translated to and from its compressed state.
Lossy techniques reduce data - both through complex mathematical encoding and
through selective intentional shedding of visual information that our eyes and brain
usually ignore - and can lead to perceptible loss of picture quality. "Lossless"
compression, by contrast, discards only redundant information. Codecs can be
implemented in hardware or software, or a combination of both. They have
compression ratios ranging from a gentle 2:1 to an aggressive 100:1, making it
feasible to deal with huge amounts of video data. The higher the compression ratio,
the worse the resulting image. Color fidelity fades, artefacts and noise appear in the
picture, the edges of objects become over-apparent until eventually the result is
unwatchable. [CTR_1996_11]
2.3.2. Discrete Cosine Transform
By the end of the 1990s, the dominant techniques were based on a three-stage
algorithm known as DCT (Discrete Cosine Transform). DCT exploits the fact that adjacent pixels in a picture - either physically close in the image (spatial) or in successive images (temporal) - may have the same value.
Picture 2: DCT compression cycle
A mathematical transform - a relative of the Fourier transform - is performed on grids
of 8x8 pixels (hence the blocks of visual artefacts at high compression levels). It does
not reduce data but the resulting coefficient frequency values are no longer equal in
their information-carrying roles. Specifically, it has been shown that for visual
systems, the lower frequency components are more important than high frequency
ones. A quantisation process weights these accordingly and ejects those contributing
least visual information, depending on the compression level required. For instance,
losing 50 percent of the transformed data may only result in a loss of five per cent of
the visual information. Then entropy encoding - a lossless technique - jettisons any
truly unnecessary bits. [URL20]
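The following C# sketch illustrates the forward 8x8 DCT and a simple uniform quantisation step as described above. It is a didactic sketch of the principle only; real JPEG/MPEG coders use the standardised quantisation matrices and fast DCT algorithms rather than this direct formula.

using System;

class Dct8x8Demo
{
    // Forward 2-D DCT-II on one 8x8 block of pixel values (0..255), level-shifted by 128 as in JPEG.
    static double[,] ForwardDct(byte[,] block)
    {
        var coeff = new double[8, 8];
        for (int u = 0; u < 8; u++)
        for (int v = 0; v < 8; v++)
        {
            double sum = 0.0;
            for (int x = 0; x < 8; x++)
            for (int y = 0; y < 8; y++)
                sum += (block[x, y] - 128)
                     * Math.Cos((2 * x + 1) * u * Math.PI / 16.0)
                     * Math.Cos((2 * y + 1) * v * Math.PI / 16.0);
            double cu = u == 0 ? 1.0 / Math.Sqrt(2) : 1.0;
            double cv = v == 0 ? 1.0 / Math.Sqrt(2) : 1.0;
            coeff[u, v] = 0.25 * cu * cv * sum;
        }
        return coeff;
    }

    // Uniform quantisation: dividing by a step size and rounding drives most of the
    // high-frequency coefficients to zero, which the entropy coder then removes cheaply.
    static int[,] Quantise(double[,] coeff, int step)
    {
        var q = new int[8, 8];
        for (int u = 0; u < 8; u++)
            for (int v = 0; v < 8; v++)
                q[u, v] = (int)Math.Round(coeff[u, v] / step);
        return q;
    }
}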
2.3.3. Non-linear editing
Initially, compression was performed by software. Limited CPU power constrained
how clever an algorithm could be to perform its task in a 25th of a second - the time
needed to draw a frame of full-motion video. Nevertheless, Avid Technology [URL49]
and other pioneers of NLE (non-linear editing) introduced PC-based editing systems
at the end of the 1980s using software compression. Although the video was a
quarter of the resolution of broadcast TV, with washed-out color and thick with blocky
artefacts, NLE signaled a revolution in production techniques. At first it was used for
off-line editing, when material is trimmed down for a program. Up to 30 hours of video
may be shot for a one-hour documentary, so it is best to prepare it on cheap, non-broadcast equipment to save time in an on-line edit suite.
NLE systems really took off in 1991, when hardware-assisted compression brought VHS-quality video. The first hardware-assisted video compression is known as M-JPEG (motion JPEG), a derivative of the DCT-based standard developed for still images known as JPEG. It was never intended for video compression, but when C-Cube (bought by LSI Logic in 2001) introduced a codec chip in the early 1990s that could JPEG-compress as many as 30 still images a second, NLE pioneers could not resist. By squeezing data by as much as 50 times, VHS-quality digital video could be handled by PCs. [CT1996_5] [URL22]
In time, PCs got faster and storage got cheaper, meaning less compression had to
be used so that better video could be edited. By compressing video by as little as
10:1 a new breed of non-linear solutions emerged in the mid-1990s. These systems
were declared ready for on-line editing; that is, finished programs could essentially be
played out of the box. Their video was at least considered to be of broadcast quality
for the sort of time and cost-critical applications that most benefited from NLE, such
as news, current affairs and low-budget productions.
The introduction of this technology proved controversial. Most images compressed cleanly at 10:1, but certain material - with a lot of detail and areas of high contrast - was degraded. "Normal" viewers would hardly notice, but for broadcast engineers the ringing and blocky artefacts seemed obvious. Also, in order to change the contents of the video images, to add an effect or graphic, the material must first be decompressed and then recompressed. This process, though digital, is akin to an analogue generation loss. Artefacts are added like noise with each cycle (so-called "concatenation"). Sensibly designed systems render every effect in a single pass, but if several compressed systems are used in a production and broadcast environment, this concatenation presents a problem.
Compression technology arrived just as proprietary uncompressed digital video
equipment had infiltrated all areas of broadcasters and video facilities. Though the
cost savings of the former were significant, the associated degradation in quality
meant that acceptance by the engineering community was slow at first. However, as
compression levels dropped - to under 5:1 - objections began to evaporate and even
the most exacting engineer conceded that such video was comparable to the widely
used BetaSP analogue tape. Mild compression enabled Sony to build its successful
Digital Betacam format [URL50] video recorder, which is now considered a gold standard. With compression of a little over 2:1, so few artefacts (if any) are introduced that video can be processed through dozens of generations apparently untouched.
The cost of M-JPEG hardware has fallen steeply in the past few years and
reasonably priced PCI cards and USB devices capable of a 3:1 compression ratio
and bundled with NLE software are now readily available (e.g. Pinnacle’s Dazzle
Digital Video Creator [URL51]). Useful as M-JPEG is, it was not designed for moving
pictures. When it comes to digital distribution, where bandwidth is at a premium, the
MPEG family of standards - specifically designed for video - offers significant advantages.
Chapter 4.2.2 gives further insight into the progress of recent research in the field of video compression.
2.3.4. Video Compression Formats
The internal details of many media formats are closely guarded pieces of information, for competitive reasons. Nearly every encoding method employs sophisticated techniques of mathematical analysis to squeeze a sound or video sequence into fewer bits. In the early days of media programming for the PC, developing such formats was a long and complex process.
This table shows the differences between the common compression formats:
Format | VCD | SVCD | X(S)VCD | DivX | DV | DVD
Formal standard? | Yes | Yes | No | No | Yes | Yes
Resolution (NTSC/PAL) | 352x240 / 352x288 | 480x480 / 480x456 | up to 720x480 / 720x576 | 640x480 or lower | 720x480 / 720x576 | 720x480 / 720x576
Video compression | MPEG-1 | MPEG-2 | MPEG-1 or MPEG-2 | MPEG-4 | DV | MPEG-2
Audio compression | MPEG-1 | MPEG-1 | MPEG-1 | WMA, MP3 | DV | AC-3, MPEG-2
MB/min | 10 | 20-30 | 20-40 | 10-20 | 216 | 30-70
DVD player compatibility | Very good | Good | Good | None | None | Excellent
CPU intensive | Low | High | High | Very high | High | Very high
Quality | Good | Very good | Very good | Very good | Excellent | Excellent
Table 2: Video compression formats overview
These video compression techniques are described in the following.
2.3.4.1. H.261
H.261 is also known as P*64 where P is an integer number meant to represent
multiples of 64kbit/sec. H.261 was targeted at teleconferencing applications and is
intended for carrying video over ISDN - in particular for face-to-face videophone
applications and for videoconferencing with multiple participants. The actual
encoding algorithm is similar to (but incompatible with) MPEG. H.261 needs
substantially less CPU power for real-time encoding than MPEG. The algorithm
includes a mechanism which optimises bandwidth usage by trading picture quality
against motion, so that a quickly-changing picture will have a lower quality than a
relatively static picture. H.261 used in this way is thus a constant-bit-rate encoding
rather than a constant-quality, variable-bit-rate encoding. [URL23]
2.3.4.2. H.263
H.263 is a draft ITU-T standard designed for low bitrate communication. It is
expected that the standard will be used for a wide range of bitrates, not just low
bitrate applications. It is expected that H.263 will replace H.261 in many applications.
The coding algorithm of H.263 is similar to that used by H.261, however with some
improvements and changes to improve performance and error recovery. Half pixel
precision is used for H.263 motion compensation whereas H.261 used full pixel
precision and a loop filter. Some parts of the hierarchical structure of the datastream
are now optional, so the codec can be configured for a lower datarate or better error
recovery. There are now four negotiable options included to improve performance: unrestricted motion vectors, syntax-based arithmetic coding, advanced prediction, and forward and backward frame prediction similar to MPEG, called PB frames.
H.263 supports five resolutions:
- CIF (Common Intermediate Format, 352x288 pixels at 30 fps)
- QCIF (Quarter CIF, 176x144)
- SQCIF (Sub-QCIF, 128x96)
- 4CIF (704x576)
- 16CIF (1408x1152)
The support of 4CIF and 16CIF means the codec could then compete with other
higher bitrate video coding standards such as the MPEG standards. [URL23]
2.3.4.3. MPEG
The Moving Picture Experts Group (MPEG) has defined a series of standards for compressing motion video and audio signals using DCT (Discrete Cosine Transform) compression, which provide a common world language for high-quality digital video. These use a JPEG-like algorithm for compressing individual frames and then eliminate the data that stays the same in successive frames. The MPEG formats are asymmetrical - meaning that it takes longer to compress a frame of video than it does to decompress it - requiring serious computational power to reduce the file size. The results, however, are impressive:
MPEG video needs less bandwidth than M-JPEG because it combines two forms of compression. M-JPEG video files are essentially a series of compressed stills: using intraframe, or spatial, compression, it disposes of redundancy within each frame of video. MPEG does this but also utilises another process known as interframe, or temporal, compression, which eradicates redundancy between video frames. Take two sequential frames of video and you will notice that very little changes in a 25th of a second. So MPEG reduces the data rate by recording changes instead of complete frames.
MPEG video streams consist of a sequence of sets of frames known as a GOP (group of pictures). Each group, typically eight to 24 frames long, has only one complete frame represented in full, which is compressed using only intraframe compression. It is just like a JPEG still and is known as an I frame. Around it are temporally compressed frames, representing only change data. During encoding, powerful motion prediction techniques compare neighbouring frames and pinpoint areas of movement, defining vectors for how each will move from one frame to the next. By recording only these vectors, the amount of data which needs to be recorded can be substantially reduced. P (predictive) frames refer only to the previous frame, while B (bi-directional) frames rely on previous and subsequent frames. This combination of compression techniques makes MPEG highly scalable. [URL23]
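To make the idea of temporal redundancy concrete, the sketch below counts how many 8x8 blocks of a frame differ only marginally from the co-located blocks of the previous frame; such blocks are the cheap ones for an interframe coder, since little or no change data has to be stored for them. This is only an illustration of the principle and not the motion-estimation algorithm actually used by MPEG encoders.

using System;

class InterframeRedundancy
{
    // Fraction of 8x8 blocks that are (nearly) unchanged between two grayscale frames.
    // prev and curr are width*height luminance arrays; threshold is the mean absolute
    // pixel difference below which a block is counted as "static".
    static double StaticBlockRatio(byte[] prev, byte[] curr, int width, int height, double threshold)
    {
        int staticBlocks = 0, totalBlocks = 0;
        for (int by = 0; by + 8 <= height; by += 8)
        for (int bx = 0; bx + 8 <= width; bx += 8)
        {
            long diff = 0;
            for (int y = 0; y < 8; y++)
                for (int x = 0; x < 8; x++)
                {
                    int i = (by + y) * width + (bx + x);
                    diff += Math.Abs(prev[i] - curr[i]);
                }
            if (diff / 64.0 < threshold) staticBlocks++;
            totalBlocks++;
        }
        return totalBlocks == 0 ? 0.0 : (double)staticBlocks / totalBlocks;
    }
}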
2.3.4.4. MJPEG
There is really no such standard as "motion JPEG" or "MJPEG" for video. Various
vendors have applied JPEG to individual frames of a video sequence, and have
called the result "M-JPEG". JPEG is designed for compressing full-color or gray-scale
images of natural, real-world scenes. It works well on photographs, naturalistic
artwork, and similar material; not so well on lettering, simple cartoons, or line
drawings. JPEG is a lossy compression algorithm which uses DCT-based encoding.
JPEG can typically achieve 10:1 to 20:1 compression without visible loss; 30:1 to 50:1 compression is possible with small to moderate defects, while for very-low-quality purposes such as previews or archive indexes, 100:1 compression is quite
feasible. Non-linear video editors are typically used in broadcast TV, commercial post
production, and high-end corporate media departments. Low bitrate MPEG-1 quality
is unacceptable to these customers, and it is difficult to edit video sequences that use
inter-frame compression. Consequently, non-linear editors (e.g., AVID, Matrox,
FAST, etc.) will continue to use motion JPEG with low compression factors (e.g., 6:1
to 10:1). [URL21]
2.3.4.5. MPEG1
MPEG-1 (aka White Book standard) was designed to get VHS-quality video to a fixed
data rate of 1.5 Mbit/s so it could play from a regular CD (for the VideoCD format).
Published in 1993, the standard supports video coding at bit-rates up to about 1.5
Mbit/s and virtually transparent stereo audio quality at 192 Kbit/s, providing 352x240
resolution at 30 fps, with quality roughly equivalent to VHS videotape. The 352x240
resolution is typically scaled and interpolated. (Scaling causes a blocky appearance
when one pixel - scaled up - becomes four pixels of the same color value.
Interpolation blends adjacent pixels by interposing pixels with "best-guess" color
values.) Most graphics chips can scale the picture for full-screen playback; however, software-only half-screen playback is a useful trade-off. MPEG-1 enables more than
70 minutes of good-quality video and audio to be stored on a single CD-ROM disc.
Prior to the introduction of Pentium-based computers, MPEG-1 required dedicated
hardware support. It is optimised for non-interlaced video signals. [URL21]
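The difference between plain scaling and interpolation mentioned above can be shown in a few lines of C#: nearest-neighbour upscaling copies each source pixel into a 2x2 block (hence the blocky look), while bilinear interpolation fills the new pixels with weighted blends of their neighbours. This is a simplified grayscale illustration, not code taken from any particular player.

using System;

class Upscaling
{
    // Plain scaling: each source pixel of a w x h grayscale image becomes a 2x2 block of identical pixels.
    static byte[] NearestNeighbour2x(byte[] src, int w, int h)
    {
        var dst = new byte[(2 * w) * (2 * h)];
        for (int y = 0; y < 2 * h; y++)
            for (int x = 0; x < 2 * w; x++)
                dst[y * 2 * w + x] = src[(y / 2) * w + (x / 2)];   // one pixel becomes four
        return dst;
    }

    // Interpolation: new pixels are weighted blends of their neighbours, which hides the blocks.
    static byte[] Bilinear2x(byte[] src, int w, int h)
    {
        var dst = new byte[(2 * w) * (2 * h)];
        for (int y = 0; y < 2 * h; y++)
            for (int x = 0; x < 2 * w; x++)
            {
                // Position in the source image, clamped to the border.
                double sx = Math.Min(x / 2.0, w - 1), sy = Math.Min(y / 2.0, h - 1);
                int x0 = (int)sx, y0 = (int)sy;
                int x1 = Math.Min(x0 + 1, w - 1), y1 = Math.Min(y0 + 1, h - 1);
                double fx = sx - x0, fy = sy - y0;
                double top = src[y0 * w + x0] * (1 - fx) + src[y0 * w + x1] * fx;
                double bot = src[y1 * w + x0] * (1 - fx) + src[y1 * w + x1] * fx;
                dst[y * 2 * w + x] = (byte)Math.Round(top * (1 - fy) + bot * fy);
            }
        return dst;
    }
}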
2.3.4.6. MPEG2
During 1990, MPEG recognised the need for a second, related standard for coding
video at higher data rates and in an interlaced format. The resulting MPEG-2
standard is capable of coding standard definition television at bit-rates from about 1.5
Mbit/s to some 15 Mbit/s. MPEG-2 also adds the option of multi-channel surround
sound coding and is backwards compatible with MPEG-1. It is interesting to note
that, for video signals coded at bitrates below 3 Mbit/s, MPEG-1 may be more
efficient than MPEG-2. MPEG-2 has a resolution of 704x480 at 30 fps - four times
greater than MPEG-1 - and is optimised for the higher demands of broadcast and
entertainment applications, such as DSS satellite broadcast and DVD-Video. At a
data rate of around 10 Mbit/s, the latter is capable of delivering near-broadcast-quality video with five-channel audio. Resolution is about twice that of a VHS
videotape and the standard supports additional features such as scalability and the
ability to place pictures within pictures. (Extended) Super Video CD, DVD and DivX
(until Version 5) use MPEG2. MPEG-3, intended for HDTV, was included in MPEG-2.
[URL21]
2.3.4.7. MPEG4
In 1993 work was started on MPEG-4, a low-bandwidth multimedia format akin to
QuickTime that can contain a mix of media, allowing recorded video images and
sounds to co-exist with their computer-generated counterparts. Importantly, MPEG-4
provides standardised ways of representing units of audible, visual or audio-visual
content, as discrete "media objects". These can be of natural or synthetic origin,
meaning, for example, they could be recorded with a camera or microphone, or
generated with a computer. Possibly the greatest of the advances made by MPEG-4
is that it allows viewers and listeners to interact with objects within a scene. DivX (since version 6) uses MPEG-4 to compress videos to fit on a single Mode 2 CD while preserving high quality.
[URL21]
2.3.4.8. MPEG7
MPEG-7, formally named "Multimedia Content Description Interface", aims to create
a standard for describing the multimedia content data that will support some degree
of interpretation of the information's meaning, which can be passed onto, or
accessed by, a device or a computer code. [ISO1999]
2.3.5. Relation to this work
There are two points where video compression is needed and takes place in the
developed system:
The first is the connection between Slave and Master (see Picture 10: slave model). The amount of video data coming in from the camera's driver is about 7.6 MByte/sec (at 352x288/25 fps), which consumes a lot of bandwidth when the stream is transferred to the Master machine; a single client would eat up a whole 100 MBit connection. It is therefore necessary to compress the Slaves' videos before delivering them. As the video has to be decompressed for the fading but re-compressed again before being streamed out on the web, there are two compression-decompression cycles. Every compression-decompression cycle decreases video quality as it increases artefacts and color inconsistencies.
So on the one hand, the Slave's compression codec has to address this issue by using a compression format that keeps artefacts down. On the other hand it has to be fast enough to provide video compression in realtime, which means that the compression time per frame has to be less than the frame's time span.
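These two constraints can be quantified with the figures given above (352x288 at 25 fps, roughly 7.6 MByte/sec uncompressed, a compression ratio of about 1:120). The short C# sketch below, which assumes 3 bytes per pixel for the uncompressed frames, derives the raw bit rate, the time budget per frame and the resulting compressed bit rate; it is only a back-of-the-envelope check and not part of the Slave application.

using System;

class SlaveBudget
{
    static void Main()
    {
        const int width = 352, height = 288, fps = 25;
        const int bytesPerPixel = 3;          // assumption: RGB24 frames from the capture filter
        const double compressionRatio = 120;  // ratio reported for the MPEG-4 codec in the text

        double rawBytesPerSec = (double)width * height * bytesPerPixel * fps; // roughly 7.6 MB/s
        double rawMbit = rawBytesPerSec * 8 / 1e6;                            // roughly 61 Mbit/s
        double compressedKbit = rawMbit * 1000 / compressionRatio;            // roughly 507 kbit/s

        Console.WriteLine($"Uncompressed: {rawBytesPerSec / 1e6:F1} MB/s = {rawMbit:F0} Mbit/s");
        Console.WriteLine($"Time budget per frame: {1000.0 / fps:F0} ms");    // 40 ms at 25 fps
        Console.WriteLine($"After 1:{compressionRatio} compression: {compressedKbit:F0} kbit/s");
    }
}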
After testing the codecs which come with Windows (Intel Video 5.10, Intel Indeo, Cinepak, Microsoft RLE, Microsoft Video 1), Microsoft's MPEG-4 codec emerged as the most appropriate in terms of CPU usage and compression ratio (about 1:120). A lossless codec by Alparysoft [URL59] has also been tested, which achieves a compression ratio of 1:3, but for resolutions higher than 352x288 its CPU usage rises up to 90%, which is too high for use in the Slave application, as object and motion detection have to be performed simultaneously.
The compression filter used in the Slave application can easily be replaced by any other, as long as it is a DirectShow filter and the Master machine has the necessary decoding filter installed.
The second point where video has to be compressed is for streaming the result out on the Web (see Picture 11: master model). This stream has to meet the requirements for being viewed with common media players, like Microsoft Media Player or QuickTime (chapter 2.4.1). Besides, the compression ratio has to be high enough for the available bandwidth. As the intended output resolution will not exceed 352x288 and the bitrate will remain below 1.5 MBit, MPEG-1 (chapter 2.3.4.5) seems appropriate.
The current version of the Master application uses the Moonlight Encoder to convert the raw video into an MPEG-1 video stream. Another benefit of using MPEG-1 with the Moonlight Encoder is that the trial version runs without registration as long as MPEG-1 is chosen as the output format. Besides, as MPEG-1 is the oldest MPEG standard, every player that is able to play MPEG streams should handle the resulting stream without problems.
2.4. Video Streaming
There are two ways to view media on the Internet (such as video, audio, animations, etc.): downloading and streaming.
When a file is downloaded, the entire file is saved on the user's computer (usually in a temporary folder), where it is then opened and viewed. This has some advantages (such as quicker access to different parts of the file) but has the big disadvantage of having to wait for the whole file to download before any of it can be viewed. If the file is quite small this may not be too much of an inconvenience, but for large files and long presentations it can be annoying.
Streaming media works a bit differently - the end user can start watching the file
almost as soon as it begins downloading. In effect, the file is sent to the user in a
(more or less) constant stream, and the user watches it as it arrives.
When audio or video is streamed, a small buffer space is created on the user's
computer, and data starts downloading into it. As soon as the buffer is full (usually
just a matter of seconds), the file starts to play. As the file plays, it uses up
information in the buffer, but while it is playing, more data is being downloaded. As
long as the data can be downloaded as fast as it is used up in playback, the file will
play smoothly.
The obvious advantage with this method is that almost no waiting is involved
(compared to the full length of media). Streaming media can also be used to
broadcast live events (within certain limitations). This is sometimes referred to as a
Webcast (or Netcast) [URL38]. When creating streaming video, there are two things
which need to be understood: The video file format and the streaming method.
2.4.1. Video Streaming Formats
There are many video file formats to choose from when creating video streams; the three most common are:
• Windows Media [URL8]
• RealMedia [URL36]
• Quicktime [URL37]
There are pros and cons for each type, but in the end it comes down to personal preference. One has to be aware that many users will have their own preferences and some will only use a particular format, so separate files for each format should be created to reach the widest possible audience.
The following overview is extracted from [URL16], [MEN2003] and [TOP2003]:
2.4.1.1. Windows Media Technology
WMT is a streamable version of the popular AVI format. The player is free and integrated into Windows as the Windows Media Player. The video window can be resized by the viewer, and it streams extremely well off the net at full screen resolution. The ASF format breaks up the AVI file into streamable data packets so it can be transmitted over the net, depending on the targeted audience's connection speed. A browser downloads the file, then opens the media player, which plays the file immediately. The quality can be as good as MPEG-1 - significantly higher than RealVideo or QuickTime - when high bandwidth is available.
2.4.1.2. RealMedia
RealMedia is the popular streaming standard on the PC market, but Mac support is limited and the server is expensive. RealVideo can be created and embedded into HTML without a special server, but the playback of such a stream is not as good. Streaming performance depends on the speed of the user's connection as well as on the speed of the web server and the line it is connected to. While the picture quality is great at high speed, the frame rate is not: the video can become so jumpy that watching the movie is frustrating and the storytelling essence is gone.
2.4.1.3. Quicktime
Quicktime is by far the most portable video standard and has been around for a long time. QuickTime streaming requires a special server, but any Mac running the latest OS and Quicktime Pro can be configured as a Quicktime server; all that is needed is a high-speed connection. QuickTime allows the video to be configured into streams that download and play simultaneously without any jerks. Performance varies with web server performance. It is popular because it allows both Mac and PC users to share data.
2.4.2. Streaming Methods
Part of the practical requirements of this thesis is to stream the resulting video out on the Internet. To achieve this, some considerations have to be made, as streaming uses a network which is subject to certain restrictions and conditions. These conditions are described and treated in the following chapters.
2.4.2.1. HTTP Server versus Dedicated Media Server
Two major approaches are emerging for streaming multimedia content to clients. The
first is the “server-less” approach which uses the standard web-server and the
associated HTTP protocol to get the multimedia data to the client. The second is the
server-based approach that uses a separate server specialized to the
video/multimedia streaming task. The specialization takes many forms, including
optimized routines for reading the huge multimedia files from disk, the flexibility to
choose any of UDP/TCP/HTTP/Multicast protocols to deliver data, and the option to
exploit continuous contact between client and server to dynamically optimize content
delivery to the client.
The primary advantages of the server-less approach are:
- there is one less software piece to learn and manage, and
- from an economic perspective, there is no video-server to pay for.
In contrast, the server-based approach has the advantages that it:
- makes more efficient use of the network bandwidth,
- offers better video quality to the end user,
- supports advanced features like admission control and multi-stream
multimedia content,
- scales to support a large number of end users, and
- protects content copyright.
The tradeoffs clearly indicate that for serious providers of streaming multimedia
content the server-based approach is the superior solution. RealPlayer,
StreamWorks [URL40] and VDOnet's VDOLive [URL41] require you to install their
A/V server software on your Web server computer. Among other things, this software
can tailor the quality and number of streams, and provide detailed reports of who
requested which streams. Other programs, such as Shockwave and VivoActive, are
server-less. They do not require any special A/V server software beyond your
ordinary Web server software. With these programs, you simply link a file on your
server's hard drive from a Web page. When someone hits the link, the file starts to
download. Server-less programs are simple to incorporate into a Web site but do not
have the reporting capabilities of server-based programs. Depending on the users' internet connections (modem, DSL, cable), the maximum bitrates they are able to consume may differ. If different bitrate versions shall be offered, there has to be a link for each, because there is no automatic bandwidth detection as provided by, e.g., a QuickTime media server.
[TOP2003] [MEN2003]
2.4.2.2. HTTP Streaming Video
This is the simplest and cheapest way to stream video from a website. Small to
medium-sized websites are more likely to use this method than the more expensive
streaming servers. For this method there is no need of any special type of website or
host - just a host server which recognises common video file types (most standard
hosting accounts do this). There are some limitations to bear in mind regarding
HTTP streaming:
- HTTP streaming is a good option for websites with modest traffic, i.e. less than about a dozen people viewing at the same time. For heavier traffic a more serious streaming solution should be considered.
- Live video can't be streamed, since the HTTP method only works with complete files stored on the server.
- There is no automatic detection of the end user's connection speed using HTTP. If versions for different speeds shall be offered, a separate file has to be created for each speed.
- HTTP streaming is not as efficient as other methods, as it produces more data overhead, which leads to additional server load.
Traffic becomes an issue when many people request streaming files simultaneously, as the server might not have enough bandwidth. In this case it may be necessary to limit the number of simultaneous connections, so that there is enough bandwidth for every connected user.
[TOP2003] [MEN2003]
2.4.2.3. Java Players Replacing Plugins
New solutions are appearing which use Java to eliminate the need to download and
install plugins or players [URL39]. Such an approach will become standard once the
Java Media Player APIs being developed by Sun, Silicon Graphics and Intel are
available. This approach will also ensure client platform independence.
[TOP2003] [MEN2003]
2.4.2.4. FireWalls
For security reasons cautious administrators and users run their computers behind
firewalls to guard them against intruders and hackers. Nearly all streaming products
require users behind a firewall to have a UDP port opened for the video streams to
pass through (1558 for StreamWorks [URL40], 7000 for VDOLive [URL41], 7070 for
RealAudio). Rather than punch security holes in the firewall, Xing/StreamWorks has
developed a proxy software package you can compile and use, while
VDONet/VDOLive and Progressive Networks/RealPlayer are approaching leading
firewall developers to get support for their streams incorporated into upcoming
products. Currently a number of products change from UDP to HTTP or TCP when
UDP can't get through firewall restrictions. This reduces the quality of the video. In all
cases, it is still a security issue for network managers.
[URL28]
2.4.3. Streaming Method Considerations
This thesis' system uses a dedicated RTSP server for delivering the resulting MPEG-1 stream. This server-based approach shows its advantage when serving multiple clients, as it uses multicast [URL61]. Therefore it doesn't need to send a separate stream to each client but sends one single stream to all clients at the same time. This holds down network and machine load. As the streamed video is live (as opposed to a prerecorded video), all connected clients see the same moment in time, which is desired in this case. The protocol would provide functions for random seeking within media, but this is not required and therefore disabled.
There is no automatic bandwidth detection implemented. The current version of the system provides only one bitrate, but as the filter graph (which connects the video processing components (filters) within the streaming application, see chapter A.6 for details) can easily be extended with multiple output filters, multiple streams with different bitrates could be run and streamed through different ports. The limiting factor will be CPU performance, as every compression task takes its share of the workload.
Future versions can remove this limitation by distributing the resulting video over the network to multiple PCs which then do the compression for each desired bitrate. In that scenario, the compression-decompression-cycle problem (see chapter 2.3.5) has to be considered, as distributing the video to the PCs (for compression) would further decrease the video quality (i.e. introduce more artefacts) if a lossy compression codec were used.
2.4.4. Protocols
To send the streaming data across the network, a protocol has to be used to assure correct arrival on the other side. There are several possible protocols, some more appropriate than others. The differences and properties of these protocols will be explained in the following. This information about protocols is excerpted from [URL23], [URL60], [MEN2003] and [TOP2003].
2.4.4.1. TCP
HTTP (Hypertext Transfer Protocol) uses TCP (Transmission Control Protocol)
as the protocol for reliable document transfer. It needs to establish a connection
before sending data is possible. If packets are delayed or damaged, TCP will
effectively stop traffic until either the original packets or backup packets arrive. Hence
it is unsuitable for video and audio because:
• TCP imposes its own flow control and windowing schemes on the data stream, effectively destroying temporal relations between video frames and audio packets.
• Reliable message delivery is unnecessary for video and audio - losses are tolerable and TCP retransmission causes further jitter and skews.
Picture 3: TCP protocol handshaking and dataflow
2.4.4.2. UDP
UDP (User Datagram Protocol) is the alternative to TCP. RealPlayer, StreamWorks
[URL40] and VDOLive [URL41] use this approach. (RealPlayer gives you a choice of
UDP or TCP, but the former is preferred.) UDP needs no connection, forsakes TCP's
error correction and allows packets to drop out if they're late or damaged. When this
happens, dropouts will occur but the stream will continue.
Picture 4: UDP protocol dataflow
Despite the prospect of dropouts, this approach is arguably better for continuous
media delivery. If broadcasting live events, everyone will get the same information
simultaneously. One disadvantage to the UDP approach is that many network
firewalls (see FireWalls, chapter 2.4.2.4) block UDP information. While Progressive
Networks, Xing and VDOnet offer work-arounds for client sites (revert to TCP), some
users simply may not be able to access UDP files.
2.4.4.3. RTP
RTP (Real-time Transport Protocol) is the Internet-standard protocol (RFC 1889, 1890) for the
transport of real-time data, including audio and video. RTP consists of a data and a
control part called RTCP. The data part of RTP is a thin protocol providing support for
applications with real-time properties such as continuous media (e.g., audio and
video), including timing reconstruction, loss detection, security and content
identification. RTCP provides support for real-time conferencing of groups of any size
within an internet. This support includes source identification and support for
gateways like audio and video bridges as well as multicast-to-unicast translators. It
offers quality-of-service feedback from receivers to the multicast group as well as
support for the synchronization of different media streams. None of the commercial
streaming products uses RTP, a relatively new
standard designed to run over UDP. Initially designed for video at T1 or higher
bandwidths, it promises more efficient multimedia streaming than UDP. Streaming
vendors are expected to adopt RTP, which is used by the MBONE.
2.4.4.4. VDP
Vosaic uses VDP (Video Datagram Protocol), which is an augmented RTP, i.e. RTP
with demand resend. VDP improves the reliability of the data stream by creating two
channels between the client and server. One is a control channel the two machines
use to coordinate what information is being sent across the network, and the other
channel is for the streaming data. When configured in Java, this protocol, like HTTP,
is invisible to the network and can stream through firewalls.
2.4.4.5. RTSP
In October 1996, Progressive Networks and Netscape Communications Corporation
announced that 40 companies including Apple Computer, Autodesk/Kinetix, Cisco
Systems, Hewlett-Packard, IBM, Silicon Graphics, Sun Microsystems, Macromedia,
Narrative Communications, Precept Software and Voxware would support the Real
Time Streaming Protocol (RTSP), a proposed open standard for delivery of real-time
media over the Internet. RTSP is a communications protocol for control and delivery
of real-time media. It defines the connection between streaming media client and
server software, and provides a standard way for clients and servers from multiple
vendors to stream multimedia content. The first draft of the protocol specification,
RTSP 1.0, was submitted to the Internet Engineering Task Force (IETF) on October
9, 1996. RTSP is built on top of Internet standard protocols, including: UDP, TCP/IP,
RTP, RTCP, SCP and IP Multicast. Netscape's Media Server and Media Player
products use RTSP to stream audio over the Internet.
RTSP is designed to work with time-based media, such as streaming audio and
video, as well as any application where application-controlled, time-based delivery is
essential. It has mechanisms for time-based seeks into media clips, with compatibility
with many timestamp formats, such as SMPTE timecodes. In addition, RTSP is
designed to control multicast delivery of streams, and is ideally suited to full multicast
solutions, as well as providing a framework for multicast-unicast hybrid solutions for
heterogeneous networks like the Internet.
2.4.4.6. RSVP
RSVP (Resource Reservation Protocol) is an Internet Engineering Task Force (IETF,
[URL56]) proposed standard for requesting defined quality-of-service levels over IP
networks such as the Internet. The protocol was designed to allow the assignment of
priorities to "streaming" applications, such as audio and video, which generate
continuous traffic that requires predictable delivery. RSVP works by permitting an
application transmitting data over a routed network to request and receive a given
level of bandwidth. Two classes of reservation are defined: a controlled load
reservation provides service approximating "best effort" service under unloaded
conditions; a guaranteed service reservation provides service that guarantees both
bandwidth and delay.
2.4.5. Protocols considerations
This thesis’ system uses network connections for three purposes:
Communication between Slave and Master application
This connection is used for negotiating media types and for sending commands and the results of object and motion detection. Especially for sending the commands between Master and Slaves, a reliable connection is essential, so TCP seems the only appropriate protocol for this. The additional overhead caused by the protocol can be neglected, as the number of packets is rather low. Although a connection has to be set up for each client, this is not an issue, because the number of clients will be limited.
Delivering video input from the Slaves’ cameras to the Master
The video data from the Slave is delivered in a continuous stream to the Master application. The data flows one way only and handshaking is not needed, so UDP is the best choice. If single packets are dropped, there is no big impact on the resulting output; in the worst case, some more artefacts might appear in the video picture for a moment. The video delivery task is done by the network sender filter (chapter 3.4.3).
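As a minimal illustration of this one-way delivery (a sketch only, not code from the network sender filter; address, port and payload are placeholders), a single UDP datagram can be sent with Winsock like this:

    #include <winsock2.h>
    #pragma comment(lib, "ws2_32.lib")

    int main()
    {
        WSADATA wsa;
        WSAStartup(MAKEWORD(2, 2), &wsa);

        // Connectionless datagram socket: no handshake, no retransmission.
        SOCKET s = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);

        sockaddr_in master = {0};
        master.sin_family      = AF_INET;
        master.sin_port        = htons(5000);                  // hypothetical port
        master.sin_addr.s_addr = inet_addr("192.168.0.10");    // hypothetical Master address

        char chunk[1400] = {0};  // placeholder for a piece of compressed video data
        // A lost datagram is simply gone - acceptable for live video, as described above.
        sendto(s, chunk, sizeof(chunk), 0, (sockaddr*)&master, sizeof(master));

        closesocket(s);
        WSACleanup();
        return 0;
    }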
Streaming video out on the Web
The resulting video is sent as an MPEG stream out on the web. The streaming server filter component (chapter 3.4.5) used in this thesis' system uses the RTSP protocol for delivery (chapter 2.4.3). This protocol provides multicast and is therefore economical with bandwidth: the maximum number of servable clients is not limited to bandwidth divided by bitrate. As mentioned before, each client will see the same moment in time, which is desired as the streamed content is live; therefore any random access functions are disabled.
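A small back-of-the-envelope comparison (all figures are assumptions for illustration, not measurements from this system) shows why multicast matters here: with unicast the server uplink caps the audience, while one multicast stream costs the same regardless of the number of viewers.

    #include <cstdio>

    int main()
    {
        // Assumed example figures:
        const double uplinkBitsPerSec = 100e6; // 100 Mbit/s server uplink
        const double streamBitsPerSec = 1.5e6; // one MPEG-1 stream at 1.5 Mbit/s
        const int    viewers          = 500;

        double unicastLoad   = viewers * streamBitsPerSec;  // one copy per client
        double multicastLoad = streamBitsPerSec;            // one copy for all clients
        int    unicastLimit  = (int)(uplinkBitsPerSec / streamBitsPerSec);

        std::printf("Unicast: %.0f Mbit/s needed, at most %d viewers on this uplink\n",
                    unicastLoad / 1e6, unicastLimit);
        std::printf("Multicast: %.0f Mbit/s, independent of the viewer count\n",
                    multicastLoad / 1e6);
        return 0;
    }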
2.5. Software
There are a lot of video processing tools available; every capturing device such as a webcam or TV tuner card comes with at least one bundled. The main difference is whether the software is intended to capture still images, motion video, or both.
Only some of them are picked out here, without emphasis on any particular company but on the diversity of purpose, as the captured stills and/or video can be further processed in many different ways.
2.5.1. Commercial
The following information describes the key features of each application and is taken from www.burnworld.com and www.manifest-tech.com, or extracted from the respective software provider's website.
ISpy
Grabs an image from any Video for Windows compatible framegrabber and sends
the JPEG image to a homepage on the Internet. It can use either dial-up networking
or a LAN connection for an FTP upload, or it can just save the image on your local
disk. There is a free demo version of ISpy available, which is fully functional but will
add the words 'demo version' to every uploaded image.
(http://www.surveyorcorp.com/ispy/)
Gotcha
Motion detection video software records what you want to see. Capturing images
only when triggered, it provides time stamped video files of events as they occur
without recording inactivity between events. FTP and E-mail support. It is well suited
for office or home video surveillance, because it is able to capture several sources at
the same time. (http://www.gotchanow.com/)
Adobe Premiere
Adobe Premiere provides a single environment that is focused on optimizing the
editing process and it is famous for offering precise control for editing tracks in the
timeline. Premiere has an impressive heritage as the market-leading professional
video-editing software package, on both Windows and Macintosh platforms. The
benefits of its popularity include the large number of resources available for it, such
as books, training courses, user groups, and on-line discussion groups.
(http://www.adobe.com/motion/)
Pinnacle Studio
Pinnacle Studio is video editing software which delivers a top-notch blend of features,
usability, and performance for most consumers. It is easy to use, has a very broad
feature set and is very handy, particularly for audio and DVD authoring.
Pinnacle Studio Plus is an approachable video editor for novices, while offering the
depth experienced video enthusiasts won't find in other packages at this price.
(http://www.pinnaclesys.com)
2.5.2. Shareware
Ulead Video Studio
Ulead VideoStudio is a fast and easy video editor with leading technology in DV and
MPEG-2 video formats for DVD quality input, editing and output. You can use a
straightforward, six-step Video Wizard to introduce you to video editing. Video can be
edited by adding titles, transitions, voiceovers and music to turn them into movie
masterpieces. You can also send video e-mail and web cards to share your video on
the computer in full-screen for the best viewing.
(http://www.ulead.com)
Video Site Monitor Surveillance WebCams 2.59
This is a multi-camera video monitoring surveillance program for use at home or
work. It features video motion detection and can capture video to disk. Cameras can
be accessed remotely over the Internet to review captured video, or take in a live
feed. Up to eight cameras are supported. It also offers audio alarms, e-mail
notification and quick FTP file uploads. You can schedule the video settings by timer,
and everything is password protected.
(http://www.fgeng.com/)
Virtual Dub
VirtualDub is a video capture/processing utility for 32-bit Windows platforms
(98/NT/2000/XP), licensed under the GNU General Public License (GPL). It lacks the
editing power of a general-purpose editor such as Adobe Premiere, but is
streamlined for fast linear operations over video. It has batch-processing capabilities
for processing large numbers of files and can be extended with third-party video
filters. VirtualDub is mainly geared toward processing AVI files, although it can read
(not write) MPEG-1 and also handle sets of BMP images.
(http://www.virtualdub.org)
2.6. Techniques and SDKs
There are different bases on which multimedia applications can be developed. To provide some fundamentals for better understanding the decisions made in chapter 3.2, the techniques to be considered are presented and described with their history and origin. These fundamentals also support considerations about their future development and life cycle.
2.6.1. Microsoft DirectShow
In the early 1990s, after the release of Windows 3.1, a number of hardware devices
were introduced to take advantage of the features of its graphical user interface.
Among these were inexpensive digital cameras (known today as "webcams"), which used CCD1 technology to create low-resolution (often poor black-and-white) video. These devices were connected to the host computer through the parallel port (normally reserved for the printer), with software drivers that handled the data transfer from the camera to the computer. As these devices became more common, Microsoft introduced "Video for Windows" (VfW), which is described in a later chapter. Although VfW proved to be sufficient for many software developers, it has some limitations; in particular, it is quite difficult to support the popular MPEG standard for video. It would take a complete rewrite of Video for Windows to accomplish that.
As Windows 95 was near release, Microsoft started a project known as “Quartz”2,
meant to create a new set of APIs that could provide all of Video for Window’s
functionality with MPEG support in a 32-bit environment (although Windows 95 isn’t
really 32-bit). That seemed straightforward enough, but the engineers working on
Quartz realized that an extensive set of devices was going to enter the market, such
as digital camcorders and PC-based TV-tuners, which would require a more
comprehensive level of support than anything they had planned to offer. The designers of Quartz realized they couldn't possibly imagine every scenario or even try to get it all into a single API. Instead, they chose a "framework" architecture, where components can be snapped together, much like LEGO bricks. To simplify the architecture of a complex multimedia application, Quartz provides a basic set of building components, known as filters, to perform essential functions such as reading data from a file, playing it to the speakers, rendering it to the screen, and so on.
Using the (at that time) newly developed Microsoft Component Object Model (COM),
Quartz tied these filters together into filter graphs, which orchestrated a flow of media
data, a so-called stream, from capture through any intermediate processing to its
eventual output to the display. Through COM, each filter would be able to inquire
about the capabilities of other filters as they were connected together into a filter
graph. And because Quartz filters would be self-contained COM objects, they could be created by third-party developers for their own hardware designs or software needs. In this way, Quartz would be endlessly extensible: if one needed some feature that Quartz didn't have, one could always write one's own filter.
1 „charge coupled device", a light-sensitive electronic component which measures luminosity. It contains a matrix of pixels. The proportional charge of each pixel is then stored digitally. [URL2]
2 „Quartz" can even be found today, as the main library of DirectShow is named "quartz.dll". It contains all standard filters that come with DirectShow.
The developers of Quartz built upon a Microsoft research project known as "Clockwork" [URL57], which provided a basic framework of modular, semi-independent components working together on a stream of data. From this beginning, Quartz evolved into a complete API for video and audio processing, which Microsoft released in 1995 as ActiveMovie, shipping it as a component in the DirectX Media SDK. In 1996, Microsoft renamed ActiveMovie to DirectShow (to indicate its relationship with DirectX3), a name it retains to this day.
In 1998, a subsequent release of DirectShow added support for DVDs and analog
television applications, both of which had become common. Finally, in 2000,
DirectShow was fully integrated with DirectX, shipping as part of the release of
DirectX 8. This integration means that every Windows computer with DirectX
installed (which is the case for most PCs nowadays) has the complete suite of DirectShow
services and is fully compatible with any DirectShow application. DirectX 8 also
added support for Windows Media (which will be described in a following chapter),
a set of streaming technologies designed for high-quality audio and video
delivered over low-bandwidth connections, and the DirectShow Editing Services, a
complete API for video editing.
Picture 5: Windows Movie Maker GUI
Microsoft bundled a new application into the release of Windows Millennium Edition:
Windows Movie Maker. Built using DirectShow, it gives novice users of digital
camcorders an easy-to-use interface for video capture, editing, and export, and serves as an outstanding demonstration of the capabilities of the DirectShow API.
In the two years after the release of DirectX 8, most of the popular video editing applications came to use DirectShow to handle the intricacies of communication with a wide array of digital camcorders. The programmers of these applications chose DirectShow to handle the low-level tasks that would otherwise have consumed many hours of research, programming, and testing.
3 Microsoft DirectX is an advanced suite of multimedia application programming interfaces (APIs) built into Microsoft Windows. It provides a standard development platform for Windows-based PCs for writing hardware-specific code. [URL6]
With the release of DirectX 9 (the most recent, as this thesis is
written), very little has changed in DirectShow, with one significant
exception: the Video Mixing Renderer (VMR) filter. The VMR allows
the programmer to mix multiple video sources into a single video
stream that can be played within a window or applied as a "texture
map", a bit like wallpaper, to a surface of a 3D object created in Microsoft Direct3D.
The list of uses for DirectShow is long, but the two most prominent examples are
Windows Movie Maker and Windows Media Player. Both have shipped as standard
components of Microsoft operating systems since Windows Millennium Edition.
[URL25] [PES2003]
For a deeper insight into DirectShow and its components, please refer to appendix A.
2.6.2. Video For Windows
Video For Windows is a set of software application program interfaces (APIs) that provided basic video and audio capture services for use with the early capture devices described in the previous chapter.
Video for Windows was introduced as an SDK separate from a Microsoft OS release in the fall of 1992. VfW became part of the core operating system in Windows 95 and NT 3.51. Although it will continue to be supported indefinitely, further development has stopped, and the feature set has effectively been frozen since the release of Windows 98. [ORT1993]
Video for Windows also includes five applications [URL24], the so-called MultiMedia Data Tools:
- VidEdit, for loading, playing, editing and saving video clips
- VidCap, for capturing single frames or video clips
- BitEdit, for editing single frames in a video file
- PalEdit, for editing the color palette of a video file
- WavEdit, for recording and editing audio files
2.6.3. Windows Media Series 9
Microsoft Windows Media Services is a next-generation platform for streaming digital
media. You can use the Windows Media Services SDK to build custom applications
on top of this platform. For example, you can use the SDK to:
- Create a custom user interface to administer Windows Media Services.
- Programmatically control a Windows Media server.
- Programmatically configure the properties of system plug-ins included with Windows Media Services.
- Create your own plug-ins to customize core server functionality.
- Dynamically create and manage server-side playlists.
The following descriptions are taken from [URL8] and [URL9]:
2.6.3.1. Windows Media Encoder SDK
The Windows Media Encoder SDK is one of the main components of the Microsoft
Windows Media 9 Series SDK. The Windows Media Encoder 9 Series SDK is
designed for anyone who wants to develop a Windows Media Encoder application by
using a powerful Automation-based application programming interface (API). With
this SDK, a developer using C++, Microsoft Visual Basic, or a scripting language can
capture multimedia content and encode it into a Windows Media-based file or stream.
For instance, you can use this Automation API to:
- Broadcast live content. A news organization can use the Automation API to schedule the automatic capture and broadcast of live content. Local transportation departments can stream live pictures of road conditions at multiple trouble spots, alerting drivers to traffic congestion and advising them of alternate routes.
- Batch-process content. A media production organization that must process a high volume of large files can create a batch process that uses the Automation API to repeatedly capture and encode streams, one after the other. A corporation can use the Automation API to manage its streaming media services with a preferred scripting language and Windows Script Host. Windows Script Host is a language-independent host that can be used to run any script engine on the Microsoft Windows 95 or later, Windows NT, or Windows 2000 operating systems.
- Create a custom user interface. An Internet service provider (ISP) can build an interface that uses the functionality of the Automation API to capture, encode, and broadcast media streams. Alternatively, you can use the predefined user interfaces within the Automation API for the same purpose.
- Remotely administer Windows Media Encoder applications. You can use the Automation API to run, troubleshoot and administer Windows Media Encoder applications from a remote computer.
This SDK documentation provides an overview of general encoding topics, a
programming guide, and a full reference section documenting the exposed
interfaces, objects, enumerated types, structures and constants. Developers
are also encouraged to view the included samples. [URL8]
2.6.3.2. Windows Media Format SDK
The Windows Media Format SDK is a component of the Microsoft Windows Media
Software Development Kit (SDK). Other components include the Windows Media
Services SDK, Windows Media Encoder SDK, Windows Media Rights Manager SDK,
Windows Media Device Manager SDK, and Windows Media Player SDK.
The Windows Media Format SDK enables developers to create applications that
play, write, edit, encrypt, and deliver Advanced Systems Format (ASF) files and
network streams, including ASF files and streams that contain audio and video
content encoded with the Windows Media Audio and Windows Media Video codecs.
ASF files that contain Windows Media–based content have the .wma and .wmv
extensions. For more information about the Advanced Systems Format container
structure, see Overview of the ASF Format.
The key features of the Windows Media Format SDK are:
- Support for industry-leading codecs: The Windows Media Format 9 Series SDK includes the Microsoft Windows Media Video 9 codec and the Microsoft Windows Media Audio 9 codec. Both of these codecs provide exceptional encoding of digital media content. This SDK also includes the Microsoft Windows Media Video 9 Screen codec for compressing computer-screen activity during sessions of user applications, and the new Windows Media Audio 9 Voice codec, which encodes low-complexity audio such as speech and intelligently adapts to more complex audio such as music, for superior representation of combined voice-music scenarios.
- Support for writing ASF files: Files are created based on customizable profiles, enabling easy configuration and standardization of files. This SDK can be used to write files in excess of 2 gigabytes, enabling longer, better-quality, continuous files.
- Support for reading ASF files: This SDK provides support for reading local ASF files as well as reading ASF data being streamed over a network. Support is also provided for many advanced reading features, such as native support for multiple bit rate (MBR) files, which contain multiple streams with the same content encoded at different bit rates. The reader automatically selects which MBR stream to use, depending upon available bandwidth at the time of playback.
- Support for delivering ASF streams over a network: This SDK provides support for delivering ASF data through HTTP to remote computers on a network, and also for delivering data directly to a remote Windows Media server.
- Support for editing metadata in ASF files: Information about a file and its content is easily manipulated with this SDK. Developers can use the robust system of metadata attributes included in the SDK, or create custom attributes to suit their needs.
- Improved support for editing applications: This version of the Windows Media Format SDK includes support for advanced features useful to editing applications. The features include fast access to decompressed content, frame-based indexing and seeking, and general improvements in the accuracy of seeking. The new synchronous reading methods provide reading capabilities all within a single thread, for cleaner, more efficient code.
- Support for reading and editing metadata in MP3 files: This SDK provides integrated support for reading MP3 files with the same methods used to read ASF files. Applications built with the Windows Media Format SDK can also edit metadata attributes in MP3 files using built-in support for the most common ID3 tags used by content creators.
- Support for Digital Rights Management protection: This SDK provides methods for reading and writing ASF files and network streams that are protected by Digital Rights Management to prevent unauthorized playback or copying of the content.
3. Practical Part
3.1. Problem
This thesis is supposed to show an approach for realtime video processing with special emphasis on distributed capturing, processing and subsequent streaming. The system to be developed shall check out the possibilities of standard PC hardware, CPU processing power and Ethernet networking. This software solution is supposed to capture video from multiple sources, process it in a specific way and stream the result out on the network.
3.1.1. Model
The system to be developed can be considered as a simple black box:
Picture 6: black box model
The involved processes can be divided into four main phases:
Picture 7: video process phases
The estimated workload for handling the input videos and generating the output stream may exceed the performance of a common single-processor system of today; therefore a solution is sought which distributes the workload across multiple computers.
Considerations about how to divide the process phases have to regard the dataflow and the amount of data within. If each process phase were managed by its own machine, all the video (and analysis result) data would have to be delivered to the next phase's machine over a network. This would result in a lot of traffic and might not be necessary. As the capturing task is not considered to be very CPU-consuming (see chapter 2.2.2), it can be kept on the same machine as the video analyzing task. Therefore it seems advantageous to combine the video capturing and analyzing functions in one application. These functions occur per video input, so their count can vary.
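To give an impression of the traffic a naive split across machines would cause, the following small calculation (the resolution and frame rate match the GraphEdit test mentioned below; the colour depth is an assumption) estimates the bandwidth of a single uncompressed video input:

    #include <cstdio>

    int main()
    {
        // Assumed format of one video input:
        const int width = 352, height = 288;   // CIF resolution
        const int fps = 25;                    // frames per second
        const int bytesPerPixel = 3;           // 24-bit RGB, uncompressed

        double bytesPerSec = (double)width * height * bytesPerPixel * fps;
        double mbitPerSec  = bytesPerSec * 8.0 / 1e6;

        // Roughly 60.8 Mbit/s per input - a single uncompressed stream already
        // occupies most of a 100 Mbit/s Ethernet link, which is why the Slaves
        // compress the video before sending it to the Master.
        std::printf("One uncompressed input: %.1f Mbit/s\n", mbitPerSec);
        return 0;
    }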
The video processing (effects and mixing) part exists only once in the system. It can be combined with the web streaming part unless the web streaming task shall be distributed to multiple machines (if the workload exceeds CPU performance). Tests with a filter graph in GraphEdit have shown that a single compression process (352x288/25fps) can be managed on a single machine without performance problems, so the video processing and web streaming tasks can also be combined into a single application.
Hence the entire process can be divided into two parts, the video capturing and analyzing part and the video processing and web streaming part, each in its own application.
For easier reference, the parts which capture the input video will be named "Slaves"; the machine (for the time being we will consider it to be a single computer) which is supposed to generate the output stream will be named the "Master".
The refined model would then look like this:
Picture 8: distributed system model
This approach also supports scalability regarding the number of video inputs but is limited by the performance of the Master.
A real-world view could look like this:
Picture 9: real world example
This example comprehensibly illustrates the setup of a use case on a small scale. It can also be considered a substitute for a larger-scale application, like a racetrack or the surveillance system of an underground railway.
It is assumed that the Slave and Master machines are connected through some kind of network. The capacity of this network is considered to be constrained. In other respects, the specific properties of the network will be neglected in the following discussion.
Now the roles of the slave and master machines are to be examined more precisely.
3.1.2. Applications
In the following, the particular tasks of the Master and Slave applications will be specified in detail, and thereupon an object model will be derived.
3.1.2.1. Slave Application
The Slaves will be responsible for
- controlling the attached video devices,
- receiving captured video data from these devices,
- compressing the video data to reduce network traffic,
- motion and object detection and
- sending the compressed video data to the Master machine.
Picture 10: slave model
What seems rather unimportant in the above schema is the network communication task, which is only illustrated by the arrow out of the Slave box. Nevertheless, this communication can also be seen as a particular task and will be mentioned in the object model.
The Slave task can therefore be divided into two major parts with regard to the usage of DirectShow. On the one side a filter graph has to be spanned, and on the other side the connection and communication with the Master application has to be established and operated.
3.1.2.2. Master application
The Master machine tasks will be
- receiving the compressed video data from the slave machines,
- decompression of the Slaves’ video data,
- combination of video frames for specific effects,
- compression of resulting video data and
- streaming compressed video data out on the network.
Picture 11: master model
Similar to the Slave task model, the network communication is only illustrated by the arrow from the left which carries the compressed video data. Of course the network communication task itself has to be considered and will be discussed later.
The video processing task performs the combination of the incoming video streams via different effects and the decision which video stream(s) are to be shown. This decision is made upon video content analysis (motion detection data from the Slaves) or through manual input.
Picture 12: master video processing (decision mode: manual input or automatic video content analysis; source/effect selection; combination of the video sources)
3.2. Technology Evaluation
The following chapter describes which technologies and platforms have been tested,
which aspects have been considered and how far the candidates fulfill these aspects
and requirements.
The candidates to be evaluated are:
- DirectShow SDK
- Windows Media SDK
- Combination of DirectShow and Windows Media
- Video For Windows
- “from scratch”
3.2.1. Evaluation Aspects
There are some requirements which the target and development platform have to meet to be appropriate for solving the given specification. Some of them arise from technical circumstances; others are more personal preferences based on former experience, special interests or expertise.
3.2.1.1. Target platform
When searching for video software, almost every link that can be found points to software for the Windows platform. Especially well-known developers of video editing and processing software like Pinnacle, Adobe, Ulead, Cyberlink and Microsoft only provide solutions for the Windows platform. There are a few video processing and editing tools available for Linux [URL64], but none of them are professional solutions, as they all seem to be shareware. Therefore there aren't any development standards regarding video data on Linux. This means that Windows is the appropriate platform for development of the intended solution.
3.2.1.2. Development Environment
I am doing my regular programming and development work under Windows .NET 2003 with Visual Studio .NET, and I have had quite good experiences with it. The .NET object model provides a lot of convenient components and functions. Therefore I decided to use .NET as much as possible for the things to come. I would prefer to code in C#, as it is well structured and more easily readable than C++ code sometimes is. But I am aware there may be some situations where I will have to fall back to C++, as most SDKs and libraries are based on C++.
This means that the technology to be chosen has to be supported by .NET and vice versa. The only constraint for the later usage and operation of the developed applications will be that the .NET Framework 1.1 has to be installed, but I consider this acceptable because there will be more applications to come which will be based on the .NET Framework.
3.2.1.3. Development aspects
Development aspects are:
- the needed components and SDKs have to be available for free
- the components or environment have to be available for the specified target
platform
- the SDKs have to be available for a programming language which I am able to code in (Visual Basic, C++, C#, Java), to avoid an additional initial skill adaptation phase.
3.2.1.4. Modularity
One of the specification issues is that the system to be developed has to be modular, so that it is possible to add new parts (e.g. effects) without being forced to redesign the whole thing, maybe even without recompiling anything. So it would be advantageous if the underlying components or development techniques supported this aspect.
3.2.1.5. Complexity
Depending on the programming experience, the complexity of the development environment is a criterion. Highly complex object models may take longer to get accustomed to and may lead to higher development and debugging costs.
Since the degree of complexity often correlates with the range of possibilities a system offers, it might be accepted or even wanted. A low level of complexity may mean that an average task can be achieved with a few lines of code.
3.2.1.6. Abstraction level
The components to be used shall provide an appropriate level of abstraction of the underlying hardware devices and drivers to relieve the developer of hardware or driver quirks. On the other hand, they have to offer facilities for accessing low-level information and interfaces if needed, e.g. the raw video and stream data itself or some hardware settings, to ensure "total control" over what is going on. This is more or less the knock-out criterion, because without access to specific low-level data and interfaces some functions required by the specification cannot be implemented.
3.2.1.7. 3rd Party Components
The availability of 3rd party components is of interest, as there may be components which could be integrated to avoid reinventing the wheel and to save development time.
3.2.1.8. Documentation
A very important part of every SDK is the documentation. Without a proper and
comprehensive documentation it would be nearly impossible to use any unfamiliar
environment. The documentation has to contain at least an object reference.
3.2.2. DirectShow SDK
As DirectShow is part of DirectX, support for handling DirectShow is included in the DirectX SDK. To my mind DirectShow seems to be treated somewhat like a stepchild; the focus is set more on those parts of DirectX which are used for game development.
The DirectShow SDK offers a powerful digital media streaming architecture, high-level APIs, and an extensive library of plug-in components, i.e. filters.
The DirectShow architecture provides a generalized and consistent approach for creating virtually any type of digital media application. A DirectShow application is written by creating an instance of a high-level object called the filter graph manager, and then using it to create configurations of filters (called filter graphs) that work together to perform some desired task on a media stream. The filter graph manager and the filters themselves handle all buffer creation, synchronization, and connection details, so the application developer needs only to write code to build and operate the filter graph; there is no need to touch the media stream directly, although access to the raw data is provided in various ways for those applications that require it.
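The following minimal sketch (generic DirectShow boilerplate rather than code from this thesis; the file path is a placeholder) shows how such a filter graph is created by the filter graph manager and run for simple file playback:

    #include <dshow.h>
    #pragma comment(lib, "strmiids.lib")

    int main()
    {
        CoInitialize(NULL);

        IGraphBuilder *pGraph   = NULL;
        IMediaControl *pControl = NULL;
        IMediaEvent   *pEvent   = NULL;

        // The filter graph manager orchestrates the individual filters.
        CoCreateInstance(CLSID_FilterGraph, NULL, CLSCTX_INPROC_SERVER,
                         IID_IGraphBuilder, (void**)&pGraph);
        pGraph->QueryInterface(IID_IMediaControl, (void**)&pControl);
        pGraph->QueryInterface(IID_IMediaEvent, (void**)&pEvent);

        // "Intelligent connect": suitable source, decoder and renderer filters
        // are chosen and connected automatically for the given file.
        pGraph->RenderFile(L"C:\\sample.avi", NULL);   // placeholder path

        pControl->Run();                               // start the media stream
        long evCode = 0;
        pEvent->WaitForCompletion(INFINITE, &evCode);  // block until playback ends

        pControl->Release();
        pEvent->Release();
        pGraph->Release();
        CoUninitialize();
        return 0;
    }

Error handling is omitted for brevity; a real application would check every HRESULT.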
DirectShow also includes a set of high-level APIs, known as DirectShow Editing
Services (DES), which enables the creation of non-linear video editing applications.
The documentation is not included in the DirectX SDK's documentation and help files but can be downloaded separately and integrates into the MSDN Library, if installed.
It contains an object reference, a lot of examples for many different purposes and is
easily readable and comprehendible.
With the help of GraphEdit, a tool shipped with DirectShow, filter graphs can be
easily pre-tested before casting them into code. This supports developing new filters.
3.2.3. Evaluation of DirectShow
After evaluation of the DirectShow SDK with regard to the required development
aspects, these compliances can be found:
3.2.3.1. Development and Platform Aspects
- DirectShow SDK is available for Windows as part of DirectX.
- The SDK is available for free and can be obtained from the Microsoft Downloads website.
- The SDK can be used within the desired programming environment and supports one or more of my favored programming languages. Managed code as with C# can be developed only by rewriting the needed interfaces, because the SDK provides only some of them. Especially filter development is only reasonable using C++.
- Debugging with DirectShow is sometimes annoying when developing new filters. No direct step-by-step debugging is possible, as a filter has to be compiled as a DLL and then be loaded and hooked into a filter graph. It is feasible, however, to use an external debugger to attach to the GraphEdit process and to halt at predefined breakpoints, if the filter was compiled with debugging information.
3.2.3.2. Modularity
The level of modularity and openness in DirectShow is very high. The structure of a filter graph is totally flexible and can contain multiple source inputs and target outputs. The number of filters between input and output is also variable and more or less constrained only by performance issues. As a result, a self-developed filter can be used in any 3rd party DirectShow-based application.
3.2.3.3. Complexity
Complexity in DirectShow depends on how far its possibilities shall be exhausted. A simple capture-and-render application can be developed in a short time, supported by the intelligent connection mechanisms. Nevertheless, the object model of DirectShow is extensive and powerful and covers every problem mentioned in the requirements.
3.2.3.4. Abstraction Level
The level of abstraction depends on how DirectShow and filter graphs are used. As long as only existing filters are hooked together, the abstraction level remains high and the developer doesn't need to bother with hardware- or file-specific subtleties. But by writing own filters, every aspect of DirectShow and the underlying architecture can be controlled and exploited.
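As a sketch of what writing an own filter looks like (using the DirectShow base class library; the class name and GUID are made-up placeholders, and filter registration code is omitted), an in-place transform filter exposes the raw media samples as follows:

    #include <streams.h>   // DirectShow base classes (CTransInPlaceFilter etc.)

    // Placeholder GUID for illustration only - a real filter needs its own CLSID.
    static const GUID CLSID_SampleInvertFilter =
        { 0x0, 0x0, 0x0, { 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1 } };

    class CSampleInvertFilter : public CTransInPlaceFilter
    {
    public:
        CSampleInvertFilter(LPUNKNOWN pUnk, HRESULT *phr)
            : CTransInPlaceFilter(NAME("Sample Invert Filter"), pUnk,
                                  CLSID_SampleInvertFilter, phr) {}

        // Accept only uncompressed RGB24 video on the input pin.
        HRESULT CheckInputType(const CMediaType *mtIn)
        {
            if (*mtIn->Type() == MEDIATYPE_Video &&
                *mtIn->Subtype() == MEDIASUBTYPE_RGB24)
                return S_OK;
            return VFW_E_TYPE_NOT_ACCEPTED;
        }

        // Called once per media sample: here the raw frame bytes can be read and
        // modified in place (this example simply inverts every pixel value).
        HRESULT Transform(IMediaSample *pSample)
        {
            BYTE *pData = NULL;
            pSample->GetPointer(&pData);
            long len = pSample->GetActualDataLength();
            for (long i = 0; i < len; i++)
                pData[i] = (BYTE)(255 - pData[i]);
            return S_OK;
        }
    };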
3.2.3.5. 3rd party components
The availability of 3rd party components is huge. A broad range of filters for many different purposes (color conversion, resizing, cropping, network source/target, compressors/decompressors) can be found at MontiVision [URL43] and LEAD [URL44]. Aside from this, DirectShow already comes with a great number of filters.
3.2.3.6. Documentation
The provided documentation is comprehensive and easily understandable. The
Object Model and programming techniques for several example applications are well
explained. The DirectShow documentation can also be found on the Web without
need to install the SDK at the online MSDN library [URL42].
3.2.4. Windows Media SDK
The Windows Media 9 Series SDK consists of five major parts: the Windows Media Encoder SDK, Windows Media Format SDK, Windows Media Player SDK, Windows Media Services SDK and Windows Media Rights Manager SDK [URL52].
The Windows Media Services SDK component enables content developers and
system administrators to support Windows Media Technologies in their Web sites
and is therefore not of interest for our specification.
Windows Media Rights Manager SDK is designed for developers who wish to deliver
digital media via the Internet in a protected and secure manner. It can help protect
the rights of content owners while enabling consumers to easily and legitimately
obtain digital content. This is also not an aspect regarding our specification and
hence not part of the evaluation.
3.2.4.1. Windows Media Encoder SDK
The Windows Media Encoder 9 Series SDK is designed for people who want to
develop an application based on the Windows Media Encoder and provides a
powerful automation-based API. The SDK can be used with C++, C# or Microsoft Visual Basic to capture multimedia content and encode it into a Windows Media-based file
or stream.
The automation API can be used to:
- Broadcast live content. A live video source can be captured and streamed directly to a Media Server or out on the network.
- Batch-process content. A high volume of large files can be processed by a pre-created batch process that uses the Automation API to repeatedly capture and encode streams, one after the other. This can be achieved by using a preferred scripting language and Windows Script Host.
- Create a custom user interface. An interface can be built that uses the functionality of the Automation API to capture, encode, and broadcast media streams. Alternatively, the predefined user interfaces within the Automation API can be used for the same purpose.
- Remotely administer Windows Media Encoder applications. The Automation API can be used to run, troubleshoot and administer Windows Media Encoder applications from a remote computer.
The SDK documentation provides an overview of general encoding topics, a
programming guide, and a full reference section documenting the exposed
interfaces, objects, enumerated types, structures and constants. There are also
several samples included.
3.2.4.2. Windows Media Format 9 Series SDK
The Windows Media Format SDK is meant for creating applications that play, write,
edit, encrypt, and deliver Advanced Systems Format (ASF) files and network
streams, including ASF files and streams that contain audio and video content
encoded with the Windows Media Audio and Windows Media Video codecs.
The key features of the Windows Media Format SDK are:
- Support for industry-leading codecs. The SDK includes the Microsoft Windows Media Video 9 codec and the Microsoft Windows Media Audio 9 codec. Both of these codecs provide exceptional encoding of digital media content.
- Support for writing ASF files, based on customizable profiles, in excess of 2 GB.
- Support for reading ASF files as well as reading ASF data being streamed over a network.
- Support for delivering ASF streams over a network through HTTP and also for delivering data directly to a remote Windows Media server.
- Support for Digital Rights Management protection. Methods for reading and writing ASF files and network streams that are protected by Digital Rights Management are provided to prevent unauthorized playback or copying of the content.
3.2.4.3. Windows Media Player SDK
The Microsoft Windows Media Player SDK provides information and tools to customize Windows Media Player and to use the Windows Media Player ActiveX control.
Support for customizing Windows Media Player is provided by:
- Windows Media Player skins. Skins allow you both to customize the Player user interface and to enhance its functionality by using XML.
- Windows Media Player plug-ins. Windows Media Player includes support for plug-ins that create visualization effects, that perform digital signal processing (DSP) tasks, that add custom user interface elements to the full mode Player, and that render custom data streams in digital media files created using the ASF file format.
Embedding the Windows Media Player control is supported for a variety of
technologies, including:
• HTML in Web browsers. Microsoft Internet Explorer and Netscape Navigator version 4.7, 6.2, and 7.0 browsers are supported.
• Programs created with the Microsoft Visual C++ development system
• Programs based on Microsoft Foundation Classes (MFC)
• Programs created with Microsoft Visual Basic 6.0
• Programs created using the .NET Framework, including programs written in the C# programming language
• Microsoft Office
3.2.5. Evaluation of Windows Media SDK
Regarding the initially mentioned aspects, the Windows Media SDK fulfills the given demands as follows:
3.2.5.1. Development and Platform Aspects
- Windows Media SDK is obviously available for Windows.
- The SDK is available for free and can be downloaded from the Microsoft Downloads website.
- The SDK can be used within .NET and supports C++ as well as managed languages.
3.2.5.2. Modularity
The level of modularity and openness in Windows Media is rather low. The structure of capturing and streaming is strictly predetermined: there is an input source and an output target, with compression in the middle. There is no possibility to hook components in between the input and output, so no custom effects can be added. There can also be only one input source at a time; therefore no combination of inputs can be mixed.
3.2.5.3. Complexity
Complexity in Windows Media is consistently low. The object model is quite simple and hides the powerful mechanisms beneath. This prevents them from being accessed directly and therefore thwarts any attempt to use the SDK for tasks it isn't designed for.
3.2.5.4. Abstraction level
Abstraction level is very high with Windows Media. There are no interfaces down to
the drivers or hardware devices. The raw streaming data is also not exposed and
cannot be read or modified directly.
3.2.5.5. Documentation
The provided documentation is comprehensive and easily understandable. The
Object Model and programming techniques for several example applications are well
explained. The documentation can also be found online at Microsoft [URL45].
3.2.5.6. 3rd party components
The availability of 3rd party plug-ins is limited; there are some available at Inscriber [URL46] for titling and graphics insertion and Consolidated Video [URL47] for color masking and alpha blending. DirectShow transform filters can be used, but the interplay is sometimes brittle.
3.2.6. Combination of DirectShow and Windows Media
A combination of DirectShow and Windows Media has also been considered. For instance, the source and effect parts could be developed with DirectShow, and the network streaming parts with the Windows Media Encoder, to take advantage of the included compression codecs.
Unfortunately, there are no mechanisms in the Windows Media Encoder object model that allow the integration of any other components that could read or modify the media data stream, nor is it possible to use the Windows Media Encoder in DirectShow as an output rendering filter, so such a combination cannot be managed.
3.2.7. Video for Windows
Video for Windows seems to be more of a relic of the 16-bit era and lacks many of the now common functions.
VfW has a number of other deficiencies:
- No way to enumerate available capture formats.
- No TvTuner support.
- No video input selection.
- No VBI support.
- No programmatic control of video quality parameters such as brightness,
contrast, hue, focus, zoom.
Therefore it doesn't seem appropriate for developing an application that meets all the requirements of the specification.
3.2.8. From Scratch
For the sake of completeness, one more possibility shall be addressed: writing all of the communication with the hardware drivers oneself and implementing all of the video processing and output.
Obviously this is an unfavorable alternative because it would be a huge development task. It doesn't make any sense in view of the previously described techniques.
3.2.9. Overview
This comparison shall once again illustrate the aspects of the evaluated alternatives:
                  | Windows Media SDK | DirectShow SDK            | Video for Windows | From Scratch
Platform          | Windows           | Windows                   | Windows           | Windows
Development       | .NET, C++         | .NET (parts), C++ (total) | C++, VB           | .NET, C, C++
Modularity        | low               | high                      | very low          | -
Complexity        | low               | flexible                  | simple            | -
Abstraction Level | high              | flexible                  | very high         | -
Documentation     | comprehensive     | comprehensive             | general           | none
Components        | few               | lots                      | none              | none

Table 3: Overview SDKs

3.2.10. Conclusion
The above table shows that the DirectShow SDK is the favorable alternative for the given
task when the values for Modularity and Abstraction Level, the two most important
categories, are compared.
As mentioned at the beginning of this chapter, the most important criterion is the option
to tap into the media data stream and to access low-level device functionality if required.
This is not given with the Windows Media SDK, which shows an unfavorable rating for
modularity and abstraction level and offers only a few components.
Microsoft states that “…DirectShow provides generic sourcing and rendering support,
whereas the Windows Media Encoder SDK and Windows Media Player SDK ActiveX
control are tailored capture and playback solutions...“ [URL48]. This indicates that
Windows Media does not fit the requirements for developing this thesis’ applications.
DirectShow simplifies video handling through its modular design and the availability of
many ready-to-use filters. The comprehensible documentation and extensive object model
enable the development of reusable components. The GraphEdit tool eases the testing of
those components and allows a kind of prototyping without writing code.
Furthermore, DirectShow shows no fundamental disadvantages regarding the
remaining evaluated aspects.
Therefore the applications will be developed using the DirectShow SDK.
3.3. Implementation Considerations
The decision to use DirectShow as the base technology effectively determines how the
applications for this thesis have to be designed. DirectShow’s modularity suggests a
serial way of processing video streams by chaining several filter components into a
filter graph. This approach gives maximum flexibility for all required functions.
Therefore the applications’ model structure (as illustrated in Picture 10: slave model
and Picture 11: master model) can be transferred almost entirely to the implementation
design.
Each video processing and analyzing task can be packaged in a DirectShow filter:
source filter → motion detection filter → object detection filter → zoom/clip filter →
compression filter → network sending filter
Picture 13: Slave application model
Picture 14: Master application model
As explained in chapter B.1, writing a DirectShow application follows certain
development rules and thus specifies the internal structures of data and its
processing.
The implementation process is supported by GraphEdit, which allows filter graphs to be
assembled and tested without writing code and thus eases the testing and debugging of
self-written filters. Consequently, all needed filters have to be ready before the Master
and Slave applications can be developed.
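To illustrate the general development pattern, the following minimal C++ sketch shows how
a filter graph is typically created and run programmatically. The file name is a
placeholder and error handling is omitted, so this is only a rough outline of the
mechanism, not code from the developed applications.

// Minimal sketch of programmatic filter graph construction (error handling omitted).
#include <dshow.h>
#pragma comment(lib, "strmiids.lib")
#pragma comment(lib, "ole32.lib")

void RunSimpleGraph()
{
    CoInitialize(NULL);

    IGraphBuilder *pGraph = NULL;
    IMediaControl *pControl = NULL;

    // Create the filter graph manager and obtain its control interface.
    CoCreateInstance(CLSID_FilterGraph, NULL, CLSCTX_INPROC_SERVER,
                     IID_IGraphBuilder, (void**)&pGraph);
    pGraph->QueryInterface(IID_IMediaControl, (void**)&pControl);

    // Self-written filters would be added here with pGraph->AddFilter(...)
    // and connected pin by pin; for a quick test, RenderFile builds a
    // playback graph automatically.
    pGraph->RenderFile(L"C:\\test\\sample.avi", NULL);

    pControl->Run();      // start streaming through the graph
    // ... wait, process events ...
    pControl->Stop();

    pControl->Release();
    pGraph->Release();
    CoUninitialize();
}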
3.4. Filter Development
To handle the different tasks between video source and output target, a filter graph has
to be built which, aside from the pure video processing, consists of the appropriate
filters and event handling. These tasks, in execution order, closely follow the models of
the Slave and Master applications:
Slave:
- capturing input
- motion detection
- object detection
- video compression
- delivery to the Master

Master:
- receiving the Slaves’ input (video & motion/object data)
- video decompression
- mixing and fading
- video re-compression
- Web streaming
For some of these tasks ready-to-use filters are available, but the rest has to be
developed.
Capturing filters have to be available for the video device to be used. A great many
compression/decompression filters can be found (a lot of them for free), so there is no
need to develop another one.
The motion/object detection and network delivery filters, however, are only available as
expensive commercial products that are, moreover, not compatible with DirectShow, so
these have to be written anew.
3.4.1. Motion Detection Filter
The Motion Detection filter analyzes the incoming video pictures for motion. It takes the
average of the last 5 pictures (which corresponds to a fifth of a second when capturing
at 25 frames per second) to avoid disturbance by flickering or shutter changes of the
camera. The incoming camera pictures are downscaled and averaged into 8x8 pixel grayscale
tiles. These tile planes are then compared to the initially stored backplane. Major
differences are interpreted as motion; the average position of the tiles with changed
grayscale values is then taken as the center of motion. The backplane gradually converges
towards the averaged camera pictures to adapt to the current camera picture.
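The following simplified sketch illustrates the tiling and backplane comparison described
above. The buffer layout, the difference threshold and the adaptation rate are
illustrative assumptions and not the filter's exact values.

// Illustrative sketch of the tile-based motion measure (not the filter's exact code).
// frame: 8-bit grayscale image; 8x8 pixel tiles are averaged and compared against
// a slowly adapting backplane.
#include <vector>
#include <cmath>
#include <cstdint>

struct MotionResult { int count; int cx; int cy; };

MotionResult DetectMotion(const uint8_t* frame, int width, int height,
                          std::vector<float>& backplane /* (width/8)*(height/8) values */)
{
    const int tilesX = width / 8, tilesY = height / 8;
    const float kThreshold = 20.0f;   // assumed difference threshold
    const float kAdapt     = 0.05f;   // assumed backplane adaptation rate

    MotionResult r = {0, 0, 0};
    for (int ty = 0; ty < tilesY; ++ty) {
        for (int tx = 0; tx < tilesX; ++tx) {
            int sum = 0;                               // average the 8x8 tile
            for (int y = 0; y < 8; ++y)
                for (int x = 0; x < 8; ++x)
                    sum += frame[(ty * 8 + y) * width + tx * 8 + x];
            float avg = sum / 64.0f;

            float& bp = backplane[ty * tilesX + tx];
            if (std::fabs(avg - bp) > kThreshold) {    // major difference -> motion
                r.count++;
                r.cx += tx * 8 + 4;
                r.cy += ty * 8 + 4;
            }
            bp += kAdapt * (avg - bp);                 // gradual convergence
        }
    }
    if (r.count > 0) { r.cx /= r.count; r.cy /= r.count; }   // "epicenter" of motion
    return r;
}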
Interface and functions provided by the Motion Detection filter are:
Interface:       GUID {43D849C0-2FE8-11cf-BCB1-444553540000}
Property page:   GUID {43D849C0-2FE8-11cf-BCB1-444553540000}
STDMETHODIMP get_motion(int *c,int *px,float *py);
Parameter description:
c: count of moving picture parts
px/py: average center of moving picture parts, „epicenter“ of motion
STDMETHODIMP StartDetect();
STDMETHODIMP StopDetect();
These functions allow the motion detection to be started and stopped in order to decrease
the CPU load while no motion detection is needed.
The same information can be retrieved through the properties sheet when using e.g.
GraphEdit:
Picture 15: motion detection filter properties page
3.4.2. Object Detection Filter
The Object Detection Filter provides functionality to follow an object which is defined
by a certain picture clip of the video input. The object data can be read out or injected
in order to distribute it to several instances of the filter. The object detection relies
on a color histogram of the defined object’s picture clip.
The Object Detection Filter uses a fixed template with image subtraction. For performance
reasons the image of the selected object is divided into 8x8 pixel tiles for which the
color histogram is calculated. The incoming video picture is also divided into 8x8 pixel
tiles, and the object’s histogram is then compared with every possible position in the
picture. The effort of searching for the object therefore depends on the size of the
object’s picture and the size of the incoming video.
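The following sketch illustrates this tile-based histogram comparison. The number of
histogram bins and the use of a sum-of-absolute-differences score are assumptions for
illustration only and do not reproduce the filter's exact implementation.

// Illustrative sketch of the tile-based histogram matching.
#include <vector>
#include <cstdint>
#include <climits>

typedef std::vector<int> Histogram;   // e.g. 16 hue bins per 8x8 tile

// Compare an object template (grid of tile histograms) against every tile
// position of the incoming picture and return the best-matching pixel position.
void FindObject(const std::vector<Histogram>& objTiles, int objW, int objH,   // in tiles
                const std::vector<Histogram>& picTiles, int picW, int picH,   // in tiles
                int* bestX, int* bestY)
{
    long bestScore = LONG_MAX;
    for (int py = 0; py + objH <= picH; ++py) {
        for (int px = 0; px + objW <= picW; ++px) {
            long score = 0;
            for (int ty = 0; ty < objH; ++ty)
                for (int tx = 0; tx < objW; ++tx) {
                    const Histogram& a = objTiles[ty * objW + tx];
                    const Histogram& b = picTiles[(py + ty) * picW + (px + tx)];
                    for (size_t bin = 0; bin < a.size(); ++bin) {
                        int d = a[bin] - b[bin];          // histogram difference per bin
                        score += (d < 0) ? -d : d;
                    }
                }
            if (score < bestScore) {                      // remember best position
                bestScore = score;
                *bestX = px * 8;
                *bestY = py * 8;
            }
        }
    }
}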
As the clip selection cannot easily be done through a property page, the filter cannot be
used with e.g. GraphEdit and can only be driven from code. This is acceptable because the
resulting detection information has to be processed by code anyway.
The interface of the Object Detection Filter is:
Interface: GUID {D9114FD6-9227-48e2-9193-4E8FC0664081}
STDMETHOD StopDetect();
STDMETHOD StartDetect();
STDMETHOD get_object_data(int *px, int *py, int *width, int *height, BYTE **imgdata);
STDMETHOD get_object_pos(int *px, int *py, int *width, int *height, int *setobject);
STDMETHOD set_object(int px, int py, int width, int height);
STDMETHOD get_huemask_data(BYTE **huedata, long *huelen, BYTE **maskdata, long *masklen);
STDMETHOD set_huemask_data(BYTE *huedata, BYTE *maskdata, int picx, int picy, int picwidth, int picheight);
As with the Motion Detection filter, only every fifth frame is analyzed. This should be
fast enough to follow a moving object.
Parameter description:
px/py: left top corner of picture clip containing object
width/height: size of clip containing object
imgdata: picture data of captured frame for object selection bounding box
huedata/huelen: pointer where to put/get huedata and length of it
maskdata/masklen: pointer where to put/get maskdata and its length
picx/picy, picwidth/picheight: position and size of picture clip with object in it
3.4.3. Network Sending/Receiving Filter
To deliver the compressed video data from the Slave to the Master, a sending at the
Slave and a receiving filter at the Master is needed.
To successfully hook the receiving filter into a filter graph, the filter’s output has to
have a defined media type. The Slave therefore sends its media type, taken from the
Slave’s filter graph, once per second. After the media type has been set on the
receiver’s output pin, the filter graph can be built.
Delivering the video over the network is done via the connectionless UDP protocol. If a
video frame is lost on the way, the packet is not resent; for video streams this is not a
big problem as long as the stream keeps flowing. TCP, in contrast, retransmits packets to
preserve their order, which could cause stuttering video, and is therefore not used. It
is recommended to use the Network Sending/Receiving filters together with at least a
light compression/decompression filter to decrease the network traffic, since otherwise
even a 100 MBit line could be fully loaded quickly.
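The following Winsock sketch illustrates the basic idea of connectionless UDP delivery;
the real filter additionally fragments large frames, prepends its own packet headers,
periodically sends the media type and maintains traffic counters, so this is only a
conceptual outline.

// Minimal Winsock sketch of sending one data block over connectionless UDP.
#include <winsock2.h>
#pragma comment(lib, "ws2_32.lib")

bool SendBlock(const char* data, int len, unsigned long ip, unsigned short port)
{
    WSADATA wsa;
    if (WSAStartup(MAKEWORD(2, 2), &wsa) != 0) return false;

    SOCKET s = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);   // connectionless UDP

    sockaddr_in dst = {};
    dst.sin_family = AF_INET;
    dst.sin_addr.s_addr = htonl(ip);
    dst.sin_port = htons(port);

    // A lost datagram is simply not resent; the stream keeps flowing.
    int sent = sendto(s, data, len, 0, (sockaddr*)&dst, sizeof(dst));

    closesocket(s);
    WSACleanup();
    return sent == len;
}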
The interfaces of the network filters look like the following:
Interface:                 GUID {1a8f2631-2bde-4ff5-a79e-0c495902ed1d}
Property page (Sender):    GUID {47cfd9eb-6c13-4d90-9605-84e1115c7c96}
Property page (Receiver):  GUID {7d8a1eb0-dfd1-453e-b225-0808f8e4f810}
STDMETHODIMP SetNetwork(ULONG ulIP, USHORT usPort);
STDMETHODIMP GetNetwork(ULONG *pIP, USHORT *pPort);
STDMETHODIMP GetTraffic(ULONG *bytecnt, ULONG *framecnt, ULONG *medtcount);
STDMETHODIMP GetInfo(LPSTR *infostr);
STDMETHODIMP AddTraffic(ULONG bytecnt, ULONG medtcnt);
STDMETHODIMP ResetTraffic(void);
Parameter description:
ulIP: the IP of the source
pPort: port to which the data is sent; pPort+1 is used as the sending port
bytecnt/framecnt/medtcount: count of data bytes/frames/mediatype infos sent
infostr: info about the filter’s state (stopped/paused/running)
The property page also allows entering IP and port and shows the amount of data and
frames transferred:
Picture 16: HNS MPEG-2 sender properties page
3.4.4. Infinite Pin Tee Filter
A “tee” is, in technical terms, a component (e.g. a pipe) with three in-/outputs or some
kind of branching. It is often shaped like a Y or a T (hence the “tee”). The “infinite”
comes from the possibility to open up a (theoretically) unlimited number of inputs which
result in one single output.
The infinite tee is more or less the heart of the Master application. It combines
several video inputs, does fading and clipping and has a single video stream as
output.
Each time an input filter is connected, a new input pin is generated. The maximum of
input pins is hardwired to 1000, which is more than a PC will be able to handle for the
next few years.
This behavior can be easily demonstrated when using GraphEdit:
Picture 17: filter graph using infinite pin tee
The input pins’ resolutions can differ from each other; the output pin’s resolution is
the maximum width and height of the incoming media types.
To set the inputs’ clip positions and fading values, the property page can be used:
Picture 18: infinite pin tee filter properties page
The same functionality can be achieved using the filter’s COM interface:
Interface:       GUID {16a81957-4662-4616-be5a-a559ee21725a}
Property page:   GUID {bac542e3-30f0-4039-99d3-61b6a7849e45}
STDMETHOD set_window(int input, int posx, int posy, int topx, int topy, int width, int height, int oppac, int zorder);
STDMETHOD get_window(int input, int *posx, int *posy, int *topx, int *topy, int *width, int *height, int *oppac, int *zorder);
STDMETHOD get_resolution(int input, int *width, int *height);
STDMETHOD get_sampleimage_ready(int inputnr, int *ready);
STDMETHOD get_sampleimage(int inputnr, BYTE *buf);
STDMETHOD get_ConnCount(int *count);
STDMETHOD get_FramesOut(long *framesout);
STDMETHOD get_FramesIn(long *framesin, long *framesc);
Parameter description:
posx/posy: position of clip in the source video
topx/topy: position of clip in the output video
width/height: size of clip
oppac: opacity, 0 = fully transparent, 255 = fully opaque
zorder: number of Z-plane, 0=most backward
inputnr: number of input pin to configure
ready: sample image (RGB) is ready at pointer buf
buf: pointer to memory for sample image, needs (width*height*3) bytes
count: number of input pins
framesout: number of frames delivered to output pin
framesin: number of frames received at the input pin
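To illustrate how the opacity and Z-order values could combine the inputs into the single
output, the following sketch draws the inputs back to front and alpha-blends them. RGB24
buffers and integer blending are assumptions for this illustration, not the filter's
actual code.

// Illustrative sketch of the mixing step: inputs are drawn back to front according
// to their Z-order and blended into the output with their opacity value.
#include <vector>
#include <cstdint>
#include <algorithm>

struct Input {
    const uint8_t* pixels;   // RGB24 clip data
    int width, height;       // clip size
    int topx, topy;          // position of the clip in the output video
    int oppac;               // 0 = fully transparent, 255 = opaque
    int zorder;              // 0 = most backward plane
};

void MixInputs(std::vector<Input> inputs, uint8_t* out, int outW, int outH)
{
    std::fill(out, out + outW * outH * 3, 0);                     // black background
    std::sort(inputs.begin(), inputs.end(),
              [](const Input& a, const Input& b) { return a.zorder < b.zorder; });

    for (size_t i = 0; i < inputs.size(); ++i) {
        const Input& in = inputs[i];
        for (int y = 0; y < in.height && in.topy + y < outH; ++y)
            for (int x = 0; x < in.width && in.topx + x < outW; ++x)
                for (int c = 0; c < 3; ++c) {
                    uint8_t* dst = &out[((in.topy + y) * outW + (in.topx + x)) * 3 + c];
                    uint8_t  src = in.pixels[(y * in.width + x) * 3 + c];
                    *dst = (uint8_t)((src * in.oppac + *dst * (255 - in.oppac)) / 255);
                }
    }
}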
3.4.5. Web Streaming Filter
There are a couple of streaming applications available which stream a video file or a
live capture source out to the network, but I did not find any working DirectShow filter
for this task. It would have meant a lot of work to develop an RTP streamer from scratch,
but fortunately there is an open source RTP streaming application available at LIVE.COM
[29], originally written as a console application for file sources only. The class for
the input of MPEG files (ByteStreamFileSource) had to be rewritten to communicate over
shared memory with the filter’s input pin. The streaming server runs in its own thread
alongside the filter’s input thread and can be started and stopped through the shared
memory interface.
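As a rough illustration of such a shared-memory handoff, the following writer-side sketch
uses a named Win32 file mapping; the mapping name, buffer size and layout are purely
hypothetical and do not describe the actual implementation used in the filter.

// Illustrative sketch of a shared-memory handoff (writer side) between the
// DirectShow input pin and the streaming thread.
#include <windows.h>
#include <string.h>

static const char* kMapName = "Local\\HNS_StreamBuffer";   // hypothetical name
static const DWORD kBufSize = 256 * 1024;                  // assumed buffer size

bool WriteSampleToSharedMemory(const BYTE* data, DWORD len)
{
    HANDLE hMap = CreateFileMappingA(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE,
                                     0, kBufSize + sizeof(DWORD), kMapName);
    if (hMap == NULL) return false;

    BYTE* view = (BYTE*)MapViewOfFile(hMap, FILE_MAP_WRITE, 0, 0, 0);
    if (view == NULL) { CloseHandle(hMap); return false; }

    // First DWORD holds the payload length, followed by the MPEG data;
    // the streaming thread reads this block through its own view of the mapping.
    if (len > kBufSize) len = kBufSize;
    *(DWORD*)view = len;
    memcpy(view + sizeof(DWORD), data, len);

    UnmapViewOfFile(view);
    CloseHandle(hMap);
    return true;
}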
Interface of Streaming filter:
GUID {066ec65d-75bf-4b66-80e1-44e5ec492f97}
STDMETHOD SetNetwork (unsigned long ulIP, int usPort, char * aName);
STDMETHOD GetNetwork (unsigned long * ulIP,int * pPort,char * aName);
STDMETHOD StartStream();
STDMETHOD PauseStream();
Parameter description:
ulIP: IP of local network interface to be used
usPort: port on which stream shall be served (default: 9999)
aName: stream name for accessing URL (default: “stream”)
These parameters may also be set through the properties page:
Picture 19: HNS MPEG streamer filter properties page
The streaming filter is somewhat fussy about the MPEG stream format that comes through
the input pin. It only works with MPEG-1 elementary streams, so no audio can be added.
The Master application in my case uses the “Moonlight Video Encoder Std” with the
following settings:
Picture 20: Moonlight MPEG2 Encoder settings
3.5. Applications
To use the previously described filters and their functionality in a convenient way, they
have to be wrapped into applications, one for the slave machines and one for the master
machine. These applications provide the graphical user interface and manage the network
communication that is needed in addition to the video transmission. The following
chapters describe the functionality, usage and technical details of these applications.
3.5.1. Slave Application
The slave application is meant to run without any user intervention except for
choosing the video input to be used. After that, no more user input is necessary.
Picture 21: Slave application camera selection dialog
The program waits until a master’s search broadcast (on ports 4711 to 4811) arrives and
then answers it. Any configuration and control is then done through the master
application. If the network connection is lost or the master application is shut down,
the slave goes back to listening mode and waits for a master’s broadcast.
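The following sketch illustrates the slave-side discovery step: listen on one of the
search ports and answer the master that sent the broadcast. The reply message is a
hypothetical placeholder; the real applications exchange their own protocol messages.

// Illustrative sketch of the slave-side discovery over UDP broadcast.
#include <winsock2.h>
#include <string.h>
#pragma comment(lib, "ws2_32.lib")

void WaitForMasterBroadcast(unsigned short listenPort /* e.g. 4711 */)
{
    WSADATA wsa;
    WSAStartup(MAKEWORD(2, 2), &wsa);

    SOCKET s = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);

    sockaddr_in local = {};
    local.sin_family = AF_INET;
    local.sin_addr.s_addr = htonl(INADDR_ANY);
    local.sin_port = htons(listenPort);
    bind(s, (sockaddr*)&local, sizeof(local));

    char buf[256];
    sockaddr_in master = {};
    int masterLen = sizeof(master);

    // Blocks until a master's search broadcast arrives, then answers the sender.
    int n = recvfrom(s, buf, sizeof(buf), 0, (sockaddr*)&master, &masterLen);
    if (n > 0) {
        const char* reply = "SLAVE_HERE";          // hypothetical answer message
        sendto(s, reply, (int)strlen(reply), 0, (sockaddr*)&master, masterLen);
    }

    closesocket(s);
    WSACleanup();
}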
If master and slave reside in different networks, so that no broadcast can be sent from
one to the other, a “ping mode” can be used. For this, the slave must be configured with
a particular master’s IP address and port and then tries to contact that master
autonomously. The disadvantage of this mode is that only a single master is contacted,
whereas in broadcast mode any master can reach all waiting slaves.
For routing reasons, this has to be a routable address, so neither the master nor the
slave can reside in a (different) private network if it is to be reached from outside.
After a connection is set up and video transmission has started, the slave application
shows the local video picture and traffic information. If motion/object detection is
activated, the current position values are also displayed.
Picture 22: Slave application graphical user interface
The internal dataflow in the Slave’s filter graph involves motion and object detection,
video compression and delivery to the Master. The filter graph could be built in
GraphEdit like this:
Picture 23: Slave's filter graph
3.5.2. Master Application
The Master is responsible for the central user interaction, slave management and web
streaming. It searches for waiting slaves, contacts them and starts the video
transmission. Any necessary data distribution (e.g. object detection, motion zoom) is
also done by the Master.
The typical workflow from starting the Master application until streaming the result to
the web looks like this:
Picture 24: Master application schematic workflow
The optional object detection is not mentioned in this workflow but will be discussed
later in detail.
The internal filter graph that is built when two Slaves are connected and the web
streaming is activated would look like this in GraphEdit:
Picture 25: Master’s Filter graph
3.5.2.1. Usage
The same interaction steps are now shown through the graphical user interface:
First of all, it has to be selected which of the waiting slaves should be activated. To
get the list of waiting slaves, choose “Search” from the File menu. The Master then
broadcasts its IP address and incoming port.
The list displays the names of the machines the slaves are running on as well as their
IP addresses and network ports. Since more than one slave can run on a single machine,
the ports are necessary to distinguish between the running instances:
Picture 26: Master application Slave selection
After the selection procedure is finished, the Master sends a start command to those
slaves. Each activated slave then sends its possible camera resolutions back to the
Master. The user can now choose the desired camera resolution and the size of the clip
(e.g. for motion zoom). Half and quarter clips can be selected more easily by clicking
the appropriate buttons:
Picture 27: Master application Slave resolution selection
After the resolution has been selected for each Slave, the Master builds its filter graph
and hooks a receiving filter for each Slave into it. This takes a few seconds.
Picture 28: Master application filter graph setup
After everything is up, the GUI shows a configuration panel for each Slave where clip
manipulation can be done and Motion Zoom or Object Detection may be activated.
Picture 29: Master application graphical user interface
If motion detection is activated, the motion indicator shows the “amount” of motion for
each slave. On the right side of the application window the resulting video preview is
displayed.
The “fade” slider indicates the opacity of the Slave’s video in the output. The further
the slider is moved to the left, the more transparent the video gets. This also interacts
with the “Z-Order”, which sorts the video inputs from back to front; 0 denotes the
backmost plane. The background is black, so if there is only one input video, this video
fades to black.
The “Zoom”, “Clip” and “Output” panels can be used if a smaller clip size than the camera
resolution was selected:
- “Zoom” indicates which part of the input video picture shall be delivered by the Slave.
- “Clip” shows which part of the delivered Slave’s video shall be passed on.
- “Output” sets the position of the clip in the output video.
After everything has been configured to satisfaction, web streaming can be activated,
simply by choosing “Streamer” from the “Extras” menu. The resulting video is then
streamed out on the network. To view the stream on a remote machine with
Quicktime or Media Player, the URL to be opened is “rtsp://server name or
IP:port/streamname”.
3.5.2.2. Motion Detection / Motion Zoom
If Motion Detection is activated for a Slave, it measures the amount of movement in the
camera’s visual field. How motion is measured is discussed in the chapter “Motion
Detection Filter”.
This motion information can be used to drive the “Motion Zoom” feature, which
automatically moves the input clip (if it is smaller than the camera’s input picture
size) to the center of motion. This means that the part of the input video where motion
takes place is always shown.
While the Slave does the video analysis, the Master (or the user) sets the clip position
and sends the coordinates back to the Slave, which then transmits only this particular
part of the video picture.
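A minimal sketch of the clip positioning used by such a motion zoom could look as
follows; it simply centers the clip on the reported center of motion and clamps it to the
camera picture. This illustrates the idea only and is not the applications' actual code.

// Illustrative sketch: center the clip on the detected motion "epicenter" and
// clamp it so that it stays inside the camera picture.
struct ClipRect { int x, y, w, h; };

ClipRect CenterClipOnMotion(int motionX, int motionY,     // center of motion
                            int clipW, int clipH,         // selected clip size
                            int frameW, int frameH)       // camera resolution
{
    ClipRect clip;
    clip.w = clipW;
    clip.h = clipH;
    clip.x = motionX - clipW / 2;
    clip.y = motionY - clipH / 2;

    // keep the clip inside the camera picture
    if (clip.x < 0) clip.x = 0;
    if (clip.y < 0) clip.y = 0;
    if (clip.x + clip.w > frameW) clip.x = frameW - clip.w;
    if (clip.y + clip.h > frameH) clip.y = frameH - clip.h;
    return clip;
}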
The data flow for motion detection and clip selection is as follows:
Picture 30: motion detection schematic data flow
Motion Detection can also be used to automatically select the Slave which has the
highest motion value. This mode is activated by choosing “Motion Fade” in the
“Extras” menu. Sometimes this causes flickering video streams when more than one video
input shows motion.
3.5.2.3. Object Detection
In some cases the flickering effect of simple motion fading can be very annoying,
especially when following a particular object which moves between the separate cameras’
visual fields or when there are several moving objects. In those cases “Object Detection”
might help.
As the Slaves have to know what the object looks like, the user first has to locate the
desired object in one of the incoming videos. Then the picture clip containing the
object’s view has to be selected with a bounding box. This clip is retrieved from the
particular Slave and distributed to the others. From then on, each Slave tries to locate
and follow the object in its view. If a Slave detects the object in its view, it sends
the position to the Master, which then selects this clip position in the same way as for
the Motion Zoom.
How object detection works in particular is discussed in the chapter “Object Detection
Filter”.
The data flow for object selection and distribution looks like this:
Picture 31: object detection schematic data flow
Object detection works fine as long as the object does not move too fast or change its
shape (caused by rotation) and color (because of shadows) too much. Success also depends
on the contrast between the object and its background or other objects.
Detection works well for the Slave which initially has the object in its view. Any other
Slave whose view the object enters has to “espy” the object first, which means it tries
to determine where the object enters the view’s border. Only then can it be detected and
followed.
3.6. Test & Conclusion
In the course of evaluating, developing and testing the applications, a lot of experience
was gained. Some issues shall be described here to save others from repeating efforts
that I have already made. The objective of the tests is to investigate to what extent the
expectations and requirements stated at the beginning of this thesis have been achieved.
3.6.1. Development
Although there seem to be several people ([URL5], [URL43], [URL44]) who deal with
DirectShow development, there are still a lot of “secrets” which are difficult to
uncover. Especially when using C# in combination with DirectShow interfaces, several
differences to C++ had to be figured out. Fortunately .NET provides the Marshal class,
which had to be used heavily when dealing with C++ pointers and arrays.
As mentioned in the evaluation chapter, one of the big difficulties while developing
DirectShow filters is debugging them. As they cannot be started stand-alone but only
within a filter graph, debugging information can often only be extracted by writing it
out to a file. It is possible to attach a debugger to the application which hosts the
filter graph, e.g. GraphEdit, but several threads may be running, so setting breakpoints
can be complicated. Finally, a failure may show up in one of the self-written filters
although it was caused by a “foreign” filter, e.g. by handling shared pointers in a wrong
way. In a nutshell, these are the main reasons why developing filters may take
considerably longer than writing a typical Windows application.
Another issue which makes debugging challenging is the fact that a filter graph typically
consists of several filters running more or less in parallel, sometimes dealing with the
same data, since they pass a pointer to the video sample among each other. This may lead
to searching for the failure in a filter which is not responsible, only because it throws
an exception that actually originated from a faulty operation in the filter hooked either
before or after the “suspect” filter. This interconnectivity, certainly one of the
biggest strengths of the DirectShow concept, is also one of the reasons why filter
development is sometimes more complicated.
I started developing my first filters over a year ago, with over 15 years of programming
experience, but even today filter development seems to be more than “just a hack”. There
are hardly any forums that discuss filter development; most address problems with
particular codecs or plugins, so this is only my personal opinion.
3.6.2. Capacity and Stress Test
As with every distributed system, there are always some parts which have to be
initially executed on one single machine, like data distribution or gathering,
initialization, input or output. In the case of the Slave and Master applications, the
video data has to be gathered at one single point. This means that a bottleneck is to be
expected which limits the number of possible incoming video streams as well as their
frame resolutions and rates.
These tests are intended to find out how many input streams at which resolution a given
test system is capable of managing in a fluent and reasonable way, meaning that there
must be no drop-outs or an increasing number of artefacts caused by dropped P-frames.
The output video size will also be analyzed in the testing sequence, as the
transcoding (compression-decompression-cycle) for streaming needs processing
power.
3.6.2.1. Test system
The test system consists of a Master machine equipped with a Pentium 4 at 3.5 GHz (with
Hyperthreading [URL63] enabled), connected to the slaves through a single 100 MBit
network. The four Slave machines are differently equipped PCs ranging from 600 MHz to
2.6 GHz. The results were determined using the Windows 2000 performance monitor, taking
the average CPU load after 2 minutes of testing while running a single instance of each
application. The resulting MPEG-1 stream is compressed at 500 Kbit/s.
For lack of cameras, only up to four Slaves could be tested simultaneously, but the test
results allow estimating the behavior with a higher number of attached Slaves.
Stress tests, which try to run the CPU at full load, have shown that from a CPU workload
of about 70-80% the output video occasionally starts to flicker. This comes from
synchronization problems between the individual incoming videos and the fixed rate of 25
frames per second, which has to be kept because MPEG-1 requires at least 23.97 frames per
second by definition. That the limit lies below 100% is related to the CPU’s
Hyperthreading mode, which appears to the performance monitor as two processors. As long
as only one (logical) processor is used, the performance monitor shows only 50% load.
Since there are parts of the Hyperthreading CPU which cannot be shared between tasks, a
bottleneck can occur before 100% CPU load is reached.
3.6.2.2. Master Performance
The first test was made using an input and output video resolution of 352x288 pixels
with 25 frames per second. The following chart shows the CPU’s workload as a
function of the number of attached slaves:
CPU load (%), input 352x288/25fps, output 352x288/25fps: 1 Slave: 22%, 2 Slaves: 30%,
3 Slaves: 35%, 4 Slaves: 40%
Picture 32: CPU load at Master for 352x288/25fps
From these results it can be estimated that the streaming itself needs about 16% CPU and
every Slave decompression task takes about 6%. This assumption holds because every
decompression filter at the Master machine runs as its own task, each consuming the same
share of CPU as long as the machine is not at full capacity. So the estimated maximum
number of Slaves is about 9 before the CPU workload reaches 70%.
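As a rough check of this estimate against the 70% limit found in the stress tests:
(70% - 16%) / 6% per Slave ≈ 9 Slaves.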
The same test series with an input video resolution of 352x288 pixels with 25 frames
per second and output resolution of 640x480 with 25fps:
CPU load (%), input 352x288/25fps, output 640x480/25fps: 1 Slave: 45%, 2 Slaves: 53%,
3 Slaves: 61%, 4 Slaves: 70%
Picture 33: CPU load at Master for 352x288/25fps-640x480/25fps
These results show an estimated streaming workload of 37%, while every Slave
decompression task takes about 9%. This means that the maximum recommended CPU workload
is already reached with 4 incoming streams, again assuming that the workload increases
linearly.
For the sake of completeness, the same tests were carried out with an input and output
resolution of 640x480/25fps, a load which quickly saturated the Master machine.
CPU load (%), input 640x480/25fps, output 640x480/25fps: 1 Slave: 60%, 2 Slaves: 76%,
3 Slaves: 95%, 4 Slaves: 100%
Picture 34: CPU load at Master for 640x480/25fps
Already while testing with two Slaves, the output video stream started to flicker, so
under these conditions the Master machine seems to be too slow to handle the whole task.
3.6.2.3. Network Load
Measuring the network load showed that it will not be the bottleneck. As one incoming
stream uses about 60 KB/s of data, the maximum number of Slaves depends more on the CPU
than on the network’s bandwidth:
Network load, incoming, input 352x288/25fps: KB/sec and packets/sec plotted as a function
of the number of Slaves (1 to 4)
Picture 35: incoming network load for 352x288/25fps
Theoretically a count of 50 Slaves would result in a network load of 3 MB/sec but the
overhead of 2800 packets/sec might limit it to a smaller number.
The outgoing video data stream was also measured to show that it won’t be an issue
regarding network capacity:
Network load, outgoing: 352x288/25fps: about 50 KB/sec and 105 packets/sec;
640x480/25fps: about 430 KB/sec and 720 packets/sec
Picture 36: outgoing network load
3.6.2.4. Slave Performance
The Slave application is not as performance-demanding as the Master application as long
as only the capturing feature is activated. For this task a Pentium II with 600 MHz is
just fast enough to handle a video input resolution of 352x288 at 25 frames per second
without drop-outs. For object or motion detection, a machine with about 1.5 GHz or faster
is needed, depending on resolution and detection frequency (normally set to 5 times per
second, but for noisy activity a higher frequency may be necessary).
This means that for simple streaming systems, older machines can be reactivated to handle
the camera and send the compressed input video. As there are no additional hardware
requirements, the Slave machines can be obtained very cheaply.
4. Recent Research
This chapter gives an overview of the scientific research and development going on in the
field of video capturing/processing and streaming, the progress that has been made in the
last few years, and what may come in the near future.
4.1. Video Capturing and Processing
Research in the field of video capturing is concerned less with the techniques of getting
the pictures from the camera’s lens into digital data and more with what to do with the
digitized video and how to handle and analyze the input. The areas in which digital video
input is processed are quite diverse, therefore the following chapters cover several
different fields of application.
4.1.1. Object Tracking
Videos are actually sequences of images, each of which is called a frame, displayed at a
frequency high enough that human eyes perceive the content as continuous. It is obvious
that all image processing techniques can be applied to individual frames. In addition,
the contents of two consecutive frames are usually closely related.
Visual content can be modeled as a hierarchy of abstractions. At the first level are
the raw pixels with color or brightness information. Further processing yields features
such as edges, corners, lines, curves, and color regions. A higher abstraction layer
may combine and interpret these features as objects and their attributes. At the
highest level are the human level concepts involving one or more objects and
relationships among them [ZIV2002].
Object detection in videos involves verifying the presence of an object in image
sequences and possibly locating it precisely for recognition. Object tracking is to
monitor an object’s spatial and temporal changes during a video sequence, including
its presence, position, size, shape, etc. This is done by solving the temporal
correspondence problem, the problem of matching the target region in successive
frames of a sequence of images taken at closely-spaced time intervals. These two
processes are closely related because tracking usually starts with detecting objects,
while detecting an object repeatedly in subsequent image sequence is often
necessary to help and verify tracking. [SIU2002]
In this thesis, object tracking is performed by the Slave application. Because object
detection is a complex and costly task, it is done directly at the Slave and only the
resulting coordinates are delivered to the Master (see chapter 3.5.2.3).
4.1.1.1. Applications
Object tracking is quite important because it enables several applications such as:
- Security and surveillance: to recognize people, to provide better sense of
security using visual information
- Medical therapy: to improve the quality of life for physical therapy patients and
disabled people
- Retail space instrumentation: to analyze shopping behavior of customers, to
enhance building and environment design
- Video abstraction: to obtain automatic annotation of videos, to generate
object-based summaries
- Traffic management: to analyze flow, to detect accidents
- Video editing: to eliminate cumbersome human-operator interaction, to design
futuristic video effects
- Interactive games: to provide natural ways of interaction with intelligent
systems such as weightless remote control.
4.1.1.2. Challenges
A robust, accurate and high-performance approach is still a great challenge today. The
difficulty of this problem highly depends on how the object to be detected and tracked is
defined. If only a few visual features, such as a specific color, are used to represent
an object, it is fairly easy to identify all pixels with the same color as the object. At
the other extreme, the face of a specific person, which is full of perceptual details and
interfering information such as different poses and illumination, is very hard to
accurately detect, recognize and track. Most challenges arise from the image variability
of video, because video objects generally are moving objects. As an object moves through
the field of view of a camera, the images of the object may change dramatically. This
variability comes from three principal sources: variation in target pose or target
deformations, variation in illumination, and partial or full occlusion of the target
[HAG1998] [ELL2001].
4.1.2. Object Detection and Tracking Approaches
4.1.2.1. Feature-based object detection
In feature-based object detection, standardization of image features and registration
(alignment) of reference points are important. The images may need to be
transformed to another space for handling changes in illumination, size and
orientation. One or more features are extracted and the objects of interest are
modeled in terms of these features. Object detection and recognition then can be
transformed into a graph matching problem. [PUA2000]
There are two sources of information in video that can be used to detect and track
objects: visual features (such as color, texture and shape) and motion information.
Combining statistical analysis of visual features with temporal motion information
usually leads to more robust approaches. A typical strategy may first segment a frame
into regions based on color and texture information and then merge regions with similar
motion vectors, subject to certain constraints such as adjacency. A large number of
approaches has been proposed in the literature. These efforts focus on several different
research areas, each dealing with one aspect of the object detection and tracking problem
or with a specific scenario. Most of them use multiple techniques, and there are
combinations and intersections among different methods. All this makes it very difficult
to establish a uniform classification of existing approaches. Therefore, in the following
sections, most of the approaches are reviewed separately in association with different
research highlights. [ZIV2002]
Shape-based approaches
Shape-based object detection is one of the hardest problems due to the difficulty of
segmenting objects of interest in the images. In order to detect and determine the
border of an object, an image may need to be preprocessed. The preprocessing
algorithm or filter depends on the application. Different object types such as persons,
flowers, and airplanes may require different algorithms. For more complex scenes,
noise removal and transformations invariant to scale and rotation may be needed.
Once the object is detected and located, its boundary can be found by edge
detection and boundary-following algorithms. The detection and shape
characterization of the objects becomes more difficult for complex scenes where
there are many objects with occlusions and shading. [YIL2004]
Color-based approaches
Unlike many other image features (e.g. shape), color is relatively constant under
viewpoint changes and easy to acquire. Although color is not always appropriate as the
sole means of detecting and tracking objects, the low computational cost of the proposed
algorithms makes it a desirable feature to exploit when appropriate.
Color-based trackers have been proved robust and versatile for a modest
computational cost. They are especially appealing for tracking tasks where the
spatial structure of the tracked objects exhibits such a dramatic variability that
trackers based on a space-dependent appearance reference would break down very
fast. Trackers rely on the deterministic search of a window whose color content
matches a reference histogram color model. Relying on the same principle of color
histogram distance, but within a probabilistic framework, a new Monte Carlo tracking
technique has been introduced. The use of a particle filter allows one to better handle
color clutter in the background, as well as complete occlusion of the tracked entities
over a few frames. This probabilistic approach is very flexible and can be extended in a
number of useful ways, in particular multi-part color modeling to capture a rough spatial
layout ignored by global histograms, incorporation of a background color model when
relevant, and extension to multiple objects. [PER2002]
4.1.2.2. Template-based object detection
If a template describing a specific object is available, object detection becomes a
process of matching features between the template and the image sequence under
analysis. Object detection with an exact match is generally computationally
expensive and the quality of matching depends on the details and the degree of
precision provided by the object template. There are two types of object template
matching, fixed and deformable template matching.
Fixed template matching
Fixed templates are useful when object shapes do not change with respect to the
viewing angle of the camera. Two major techniques have been used in fixed template
matching.
Image subtraction
In this technique, the template position is determined from minimizing the distance
function between the template and various positions in the image. Although image
subtraction techniques require less computation time than the correlation techniques
described below, they perform well only in restricted environments where imaging
conditions, such as image intensity and the viewing angle between the template and the
images containing this template, remain the same.
Correlation
Matching by correlation utilizes the position of the normalized cross-correlation peak
between a template and an image to locate the best match. This technique is
generally immune to noise and illumination effects in the images, but suffers from
high computational complexity caused by summations over the entire template. Point
correlation can reduce the computational complexity to a small set of carefully
chosen points for the summations. [NGU2004]
The Object Detection Filter (chapter 3.4.2) in this thesis uses a fixed template with
image subtraction. For performance reasons the image of the selected object is
divided into 8x8 pixel tiles for which the color histogram is calculated. The incoming
video picture is also divided into 8x8 pixel tiles, and the object’s histogram is then
compared with every possible position in the picture. The effort of searching for the
object therefore depends on the size of the object’s picture and the size of the incoming
video.
Deformable template matching
Deformable template matching approaches are more suitable for cases where
objects vary due to rigid and non-rigid deformations. These variations can be caused
by either the deformation of the object per se or just by different object pose relative
to the camera. Because of the deformable nature of objects in most video,
deformable models are more appealing in tracking tasks. [ZHO2000]
In this approach, a template is represented as a bitmap describing the characteristic
contour/edges of an object shape. A probabilistic transformation on the prototype
contour is applied to deform the template to fit salient edges in the input image. An
objective function with transformation parameters which alter the shape of the
template is formulated reflecting the cost of such transformations. The objective
function is minimized by iteratively updating the transformation parameters to best
match the object. [SCL2001]
4.1.3. Object Tracking Performance
Effectively evaluating the performance of moving object detection and tracking
algorithms is an important step towards attaining robust digital video systems with
sufficient accuracy for practical applications. As systems become more complex and
achieve greater robustness, the ability to quantitatively assess performance is
needed in order to continuously improve performance.
[BLA2003] describes a framework for performance evaluation using pseudo-synthetic
video, which employs data captured online and stored in a surveillance database.
Tracks are automatically selected from a surveillance database and then used to
generate ground truthed video sequences with a controlled level of perceptual
complexity that can be used to quantitatively characterise the quality of the tracking
algorithms. The main strength of this framework is that it can automatically generate
a variety of different testing datasets.
There is also an algorithm which allows a video tracking system to be evaluated without
the need for ground-truth data. The algorithm is based on measuring
appearance similarity and tracking uncertainty. Several experimental results on
vehicle and human tracking are presented. Effectiveness of the evaluation scheme is
assessed by comparisons with ground truth. The proposed self evaluation algorithm
has been used in an acoustic/video based moving vehicle detection and tracking
system where it helps the video surveillance maintaining a good target track by reinitializing the tracker whenever its performance deteriorates. [WU2004]
Especially for pedestrian detection, [BER2004] developed tools for evaluating detection
results. The developed tool allows a human operator to annotate in a file all
pedestrians in a previously acquired video sequence. A similar file is produced by the
algorithm being tested using the same annotation engine. A matching rule has been
established to validate the association between items of the two files. For each frame
a statistical analyzer extracts the number of mis-detections, both positive and
negative, and correct detections. Using these data, statistics about the algorithm
behavior are computed with the aim of tuning parameters and pointing out
recognition weaknesses in particular situations. The presented performance
evaluation tool has been proven to be effective though it requires a very expensive
annotation process.
Traditional motion-based tracking schemes cannot usually distinguish the shadow
from the object itself, which results in a falsely captured object shape, posing a
severe difficulty for a pattern recognition task. [JIA2004] present a color processing
scheme to project the image into an illumination invariant space such that the
shadow’s effect is greatly attenuated. The optical flow in this projected image
together with the original image is used as a reference for object tracking so that it
can extract the real object shape in the tracking process.
4.1.4. Motion Detection
Detecting moving objects, or motion detection, obviously has very important significance
in video object detection and tracking. Compared with object detection without motion, on
the one hand, motion detection complicates the object detection problem by adding the
object’s temporal change to the requirements; on the other hand, it also provides another
information source for detection and tracking.
A large variety of motion detection algorithms has been proposed. They can be roughly
classified into the following groups.
4.1.4.1. Thresholding technique over the interframe difference
These approaches rely on the detection of temporal changes either at pixel or block
level. The difference map is usually binarized using a predefined threshold value to
obtain the motion/no-motion classification. [URL30]
The Motion Detection filter used in the Slave application (chapter 3.4.1) uses a
similar algorithm. It takes a running average of the last 5 pictures (which corresponds
to a fifth of a second when capturing at 25 frames per second) to avoid disturbance by
flickering or shutter changes of the camera. The incoming camera pictures are downscaled
and averaged into 8x8 pixel grayscale tiles. These tile planes are then compared to the
initially stored backplane. Major differences are interpreted as motion; the average
position of the tiles with changed grayscale values is then taken as the center of
motion. The backplane gradually converges towards the averaged camera pictures to adapt
to the current camera picture.
4.1.4.2. Statistical tests constrained to pixelwise independent decisions
These tests intrinsically assume that the detection of temporal changes is equivalent to
motion detection. This assumption is valid when either large displacements appear or the
object projections are sufficiently textured, but it fails in the case of moving objects
that contain uniform regions. To avoid this limitation,
temporal change detection masks and filters have also been considered. The use of
these masks improves the efficiency of the change detection algorithms, especially in
the case where some a priori knowledge about the size of the moving objects is
available, since it can be used to determine the type and the size of the masks. On
the other hand, these masks have limited applicability since they cannot provide an
invariant change detection model (with respect to size, illumination) and cannot be
used without an a priori context-based knowledge. [TOT2003]
4.1.4.3. Global energy frameworks
The motion detection problem is formulated to minimize a global objective function
and is usually performed using stochastic (Mean-field, Simulated Annealing) or
deterministic relaxation algorithms (Iterated Conditional Modes, Highest Confidence
First). In that direction, the spatial Markov Random Fields have been widely used and
motion detection has been considered a statistical estimation problem. Although this
estimation approach is very powerful, it is usually very time consuming. [JOD1997]
4.1.5. Object Tracking Using Motion Information
Motion detection provides useful information for object tracking. Tracking requires
extra segmentation of the corresponding motion parameters. There are numerous
research efforts dealing with the tracking problem. Existing approaches can be
mainly classified into two categories: motion-based and model-based approaches.
Motion-based approaches rely on robust methods for grouping visual motion
consistencies over time. These methods are relatively fast but have considerable
difficulties in dealing with non-rigid movements and objects. Model-based
approaches also explore the usage of high-level semantics and knowledge of the
objects. These methods are more reliable compared to the motion-based ones, but
they suffer from high computational costs for complex models due to the need for
coping with scaling, translation, rotation, and deformation of the objects. [TAD2003]
Tracking is performed through analyzing geometrical or region-based properties of
the tracked object. Depending on the information source, existing approaches can be
classified into boundary-based and region-based approaches.
4.1.5.1. Boundary-based approaches
Also referred to as edge-based, this type of approach relies on the information provided
by the object boundaries. It has been widely adopted in object tracking
because the boundary-based features (edges) provide reliable information which
does not depend on the motion type, or object shape. Usually, the boundary-based
tracking algorithms employ active contour models, like snakes and geodesic active
contours. These models are energy-based or geometric-based minimization
approaches that evolve an initial curve under the influence of external potentials,
while it is being constrained by internal energies. [YIL2004]
4.1.5.2. Region-based approaches
These approaches rely on information provided by the entire region such as texture
and motion-based properties using a motion estimation/segmentation technique. In
this case, the estimation of the target's velocity is based on the correspondence
between the associated target regions at different time instants. This operation is
usually time consuming (a point-to-point correspondence is required within the whole
region) and is accelerated by the use of parametric motion models that describe the
target motion with a small set of parameters. The use of these models introduces the
difficulty of tracking the real object boundaries in cases with non-rigid
movements/objects, but increases robustness due to the fact that information
provided by the whole region is exploited.
[HUA2002] propose a region-based method for model-free object tracking. In their
method the object information of temporal motion and spatial luminance are fully
utilized. First the dominant motion of the tracked object is computed. Using this result
the object template is warped to generate a prediction template. Static segmentation
is incorporated to modify this prediction, where the warping error of each watershed
segment and its rate of overlapping with warped template are utilized to help
classification of some possible watershed segments near the object border. Applications
to facial expression tracking and two-handed gesture tracking demonstrate its
performance.
4.1.6. Summary
With the increasing popularity of video on the Internet and the versatility of video
applications, the availability, usability and automation of video applications will rely
heavily on object detection and tracking in videos. Although much work has been done, it
still seems impossible so far to find a generalized, robust, accurate and real-time
approach that applies to all scenarios. This would require a combination of multiple
complicated methods to cover all of the difficulties, such as noisy backgrounds, moving
cameras or observers, bad shooting conditions, object occlusions, etc., which of course
makes it even more time consuming. But that does not mean that nothing has been achieved.
Research seems to be moving in several directions, each targeting specific applications.
Some reliable assumptions can always be made in a specific case, which simplifies the
object detection and tracking problem considerably. More and more specific cases will be
solved, and more and more good application products will appear. As computing power keeps
increasing, more complex problems may become solvable.
4.1.7. Relation to this work
The object detection feature in the current (first) version of this thesis’ system works
with easily detectable objects, as long as color and contrast differ sufficiently from
other objects and the background. As object detection is not used for recognition of
particular objects or patterns, the implemented approach is appropriate for now, even if
it is rather weak. Since this thesis mainly focuses on the capturing and streaming
features, not as much effort was invested in implementing a stronger algorithm for object
detection. Depending on the planned application field, a different algorithm could be
implemented in future versions of the object detection filter. As a different object
detection filter can be connected for each Slave at runtime, an optimized algorithm can
be applied for different lighting and contrast conditions and objects.
4.2. Video Streaming
The emergence of the Internet as a pervasive communication medium, and the
widespread availability of digital video technology have led to the rise of several
networked streaming media applications such as live video broadcasts, distance
education and corporate telecasts. However, packet loss, delay, and varying
bandwidth of the Internet have remained the major problems of multimedia streaming
applications. The goal is to deliver smooth and low-delay streaming quality by
combining new designs in network protocols, network infrastructure, source and
channel coding algorithms. Efforts to reach this goal are as diverse as the mentioned
problems. This chapter shall give an overview of the currently ongoing research and
progress in moderating and solving these problems.
4.2.1. Compression vs. Delivery
Streaming video over a network, assuming the material is available in some pre-captured
format, involves two major tasks: video compression and video delivery. These two steps
cannot always be separated and are therefore not totally independent. There are
situations where the delivery procedure affects the compression process, e.g. when
adapting the compression level to the available bandwidth.
4.2.2. Compression
Since raw video consumes a lot of bandwidth, compression is usually employed to
achieve transmission efficiency. Video compression can be classified into two
categories: scalable and nonscalable video coding. A nonscalable video encoder
compresses the raw video into a single bit-stream, thereby leaving little scope for
adaptation. On the other hand, a scalable video encoder compresses the raw video
into multiple bit streams of varying quality. One of the multiple bit streams provided
by the scalable video encoder is called the base stream, which, if decoded provides a
coarse quality video presentation, whereas the other streams, called enhancement
streams, if decoded and used in conjunction with the base stream improve the video
quality. Among the best-known schemes in this area is Fine Granularity Scalability (FGS,
[RAD2001]), which utilizes a bitplane coding method to represent enhancement streams. A
variation of FGS is Progressive FGS (PFGS) which, unlike the two-layer approach of FGS,
uses a multi-layered approach to coding. The advantage of doing this is that errors in
motion prediction are reduced due to the availability of incremental reference layers.
[ZHU2005]
Streaming video applications on the Internet generally have very high bandwidth
requirements and yet are often unresponsive to network congestion. In order to avoid
congestion collapse and improve video quality, these applications need to respond to
congestion in the network by deploying mechanisms to reduce their bandwidth
requirements under conditions of heavy load. When bandwidth is reduced by lowering the
quality of the transmitted frames, the video frames have low quality; video with low
motion, however, will look better if some frames are dropped but the remaining frames
keep a high quality.
[TRI2002] presents a content-aware scaling mechanism that reduces the bandwidth occupied by an application either by dropping frames (temporal scaling) or by reducing the quality of the transmitted frames (quality scaling). With this mechanism, video quality could be improved by as much as 50%.
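The core decision of such a content-aware mechanism can be illustrated with a small sketch: given some measure of the motion content, the sender either keeps the frame rate and lowers the per-frame quality, or keeps the quality and drops frames. The code below illustrates that decision only and is not the actual mechanism of [TRI2002]; the motion measure and the threshold are hypothetical.

// Illustrative sketch only: choosing a scaling mode from a simple motion measure
// when the sender has to reduce its bitrate under congestion.
enum ScalingMode { Temporal, Quality }

class ContentAwareScaler
{
    // motionLevel: fraction of blocks with significant motion, 0.0 .. 1.0 (hypothetical metric)
    public static ScalingMode Choose(double motionLevel, double threshold = 0.3)
    {
        // Low motion: dropped frames are barely visible, so keep per-frame quality.
        // High motion: keep the frame rate and lower the per-frame quality instead.
        return motionLevel < threshold ? ScalingMode.Temporal : ScalingMode.Quality;
    }
}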
[LEO2005] evaluated the performance on compressed video of a number of available
similarity, blocking and blurring quality metrics. Using a systematic, objective
framework based on simple subjective comparisons, he evaluated the ability of each
metric to correctly rank order images according to the subjective impact of different
spatial content, quantization parameters, amounts of filtering, distances from the
most recent I-frame, and long-term frame prediction strategies. This evaluation shows that the recently proposed quality metrics all have some weaknesses in measuring the quality of still frames from compressed video.
As mentioned above, scalable video encoding is advantageous when adapting to varying network bandwidths. The method proposed in [TAU2003] involves the construction of
quality layers for the coded wavelet sample data and a separate set of quality layers
for scalably coded motion parameters. When the motion layers are truncated, the
decoder receives a quantized version of the motion parameters used to generate the
wavelet sample data. A linear model is used to infer the impact of motion
quantization on reconstructed video distortion. An optimal tradeoff between the
motion and subband bit-rates may then be found. Experimental results indicate that
the cost of scalability is small. At low bit-rates, significant improvements are observed
relative to lossless coding of the motion information.
4.2.3. Streaming
Internet video delivery has been motivating research in multicast routing, quality of
service (QoS, [URL54]), and the service model of the Internet itself for the last 15
years. Multicast delivery has the potential to deliver a large amount of content that
currently cannot be delivered through broadcast. IP and overlay multicast are two
architectures proposed to provide multicast support. A large body of research has
been done with IP multicast and QoS mechanisms for IP multicast since the late
1980s. In the past five years, overlay multicast research has gained momentum with
a vision to accomplish ubiquitous multicast delivery that is efficient and scales in the
dimensions of the number of groups, number of receivers, and number of senders.
4.2.3.1. Internet
One of the challenging aspects of video streaming over the Internet is the fact that
the Internet's transmission resources exhibit multiple time-scale variability. There are
two approaches to deal with the variability: small time-scale variability can be
accommodated by a receiver buffer, and large time-scale variability can be accommodated by scalable video encoding. Difficulties still exist, since an encoded video generally exhibits significant rate variability in order to provide consistent video quality. [KIM2005] shows that a real-time adaptation as well as an optimal adaptation algorithm provides consistent video quality when used over both TCP-friendly rate control (TFRC) and the transmission control protocol (TCP).
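TCP-friendly rate control derives its sending rate from the measured round-trip time and loss event rate using the TCP throughput equation (roughly as specified in RFC 3448). The sketch below evaluates this equation; a streaming application would re-evaluate it periodically and pick the closest matching quality level of the scalable stream.

using System;

class Tfrc
{
    // s: packet size in bytes, rtt: round-trip time in seconds,
    // p: loss event rate (0..1), b: packets acknowledged per TCP ACK (usually 1).
    public static double RateBytesPerSecond(double s, double rtt, double p, double b = 1)
    {
        if (p <= 0)
            return double.PositiveInfinity;     // no observed loss: the equation does not limit the rate
        double tRto = 4 * rtt;                  // recommended retransmission timeout estimate
        double denominator = rtt * Math.Sqrt(2 * b * p / 3)
                           + tRto * 3 * Math.Sqrt(3 * b * p / 8) * p * (1 + 32 * p * p);
        return s / denominator;
    }
}

For 1500-byte packets, a round-trip time of 100 ms and a loss event rate of 1 %, the equation yields roughly 170 kB/s, i.e. about 1.3 Mbit/s.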
4.2.3.2. Wireless
As wireless networks are getting more and more common [URL55], video streaming techniques have to be adapted to the conditions in which wireless environments differ from wired ones. [ARG2003] proposes a protocol for media streaming to wireless IP-capable mobile multi-homed hosts. Next-generation wireless networks, like 3G, 802.11a WLAN, and Bluetooth, are the target underlying technologies for the proposed protocol. Operation at the transport layer guarantees TCP-friendliness, error resilience and independence from the inner workings of the access technology. Furthermore, compatibility with UMTS and IMT-2000 wireless streaming service models is among the additional benefits of the proposed approach.
Wireless streaming media communications are sensitive to delay jitter because conditions and requirements vary frequently with the users' mobility [STA2003]. Buffering is a typical way to reduce the delay jitter of media packets before playback, but it incurs a longer end-to-end delay. [WAN2004] proposes a novel adaptive playback buffer (APB) based on a probing scheme. Through probing, the instantaneous network situation is measured and, together with the delay margin and the delay jitter margin, used to adaptively adjust the playback buffer so that the stream can be presented continuously and in real time at the receiver. The contributions of this work are: a) accuracy: because instantaneous network information is employed, the adjustment of the playback buffer correctly reflects the current network situation, which makes the adjustment effective; b) efficiency: because the probing scheme is simple, APB captures the current network situation without complex mathematical prediction, so the playback buffer can be adjusted efficiently. Performance data obtained through extensive simulations show that APB is effective in reducing the delay jitter and decreasing the buffering delay.
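The principle behind such an adaptive playback buffer can be sketched as follows: the target buffering delay follows the jitter reported by the probe packets and is kept within sensible bounds. This is only an illustration of the idea, not the APB algorithm of [WAN2004]; the smoothing factor, the safety margin of four jitter intervals and the bounds are arbitrary assumptions.

using System;

class AdaptivePlaybackBuffer
{
    double smoothedJitterMs;

    public double TargetDelayMs { get; private set; } = 200;   // initial buffering delay

    // Called whenever a probe packet reports a new delay jitter measurement.
    public void OnProbe(double measuredJitterMs)
    {
        // Exponentially weighted moving average of the measured jitter.
        smoothedJitterMs = 0.875 * smoothedJitterMs + 0.125 * measuredJitterMs;

        // Keep a safety margin of a few jitter intervals, bounded to sensible values.
        TargetDelayMs = Math.Min(1000, Math.Max(50, 4 * smoothedJitterMs));
    }
}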
4.2.3.3. Summary
In spite of double-layer DVDs and high-bandwidth DSL connections, developers at MainConcept, Sorenson and Real are still trying to find even more efficient algorithms. These are supposed to shrink video for DVDs as well as down to the smallest bitrates for watching video on a cell-phone display.
H.264 is regarded as the successor format for Blu-ray and HD-DVD. Tests have shown that most currently common codecs (DivX, XviD, MainConcept H.264, Nero Digital AVC, Sorenson AVC Pro and RealVideo 10) achieve comparable quality and compression levels. The open source community demonstrates that teamwork can be advantageous [CT2005_10].
4.2.4. Relation to this work
The current version of this thesis' system uses a rather simple, straight-through compression and streaming approach. As mentioned in chapter 2.4.3, only a single output stream is supported. To stream different bitrates and/or video resolutions at the same time, multiple encoders and streaming filters would have to be connected to the filter graph. To overcome performance limitations, as encoding consumes considerable CPU time, this task could be distributed to multiple streaming servers. Such an approach for future versions is highly scalable, and thanks to the modular design of DirectShow the number of required server machines could even be determined at runtime.
To cope with network congestion, an adaptive streaming filter could be chosen. [LIN1998] addresses the problem of adding network and host adaptivity to DirectShow RTP.
Another approach to serving different bitrates and to adaptive streaming would be to use a dedicated streaming server (Helix Universal Server from Real, Streaming Media Server from Microsoft) which takes the current RTP stream as input and generates different bitrates and/or formats from it.
In that case the compression ratio of the RTP stream should be kept rather low to avoid losing too much information: the highest bitrate later served by the dedicated streaming server acts as the reference level, and it makes little sense for it to be higher than the source stream's bitrate.
5. Final Statement
In the course of this work a system has been developed that is capable of processing and streaming captured video from multiple sources. The system uses multiple machines for capturing and analyzing multiple video inputs. The video streams and the resulting motion and object detection data are delivered over a common Ethernet network to a single Master machine, which merges them and broadcasts the resulting video stream. The system is therefore scalable, within limits, regarding the number of inputs and the video resolution.
The system gets by with today's common hardware. It is based on Microsoft's DirectShow technology and can fall back on numerous readily available components. According to this thesis' evaluation, DirectShow shows significant advantages over Windows Media and Video for Windows. The system needs no extra compression hardware and is entirely software-based. Even older hardware with about 600 MHz can still be used for the Slave machines.
As a by-product of the development and familiarization work, a tutorial emerged which explains usage and development under DirectShow with special emphasis on C#. This tutorial compensates for the lack of books and information to be found on the Internet on this topic.
Tests of the developed applications show that the performance of common 3.0 GHz PCs is sufficient to handle multiple input streams. As the workload for capturing and analyzing is distributed to the Slave machines, up to 9 inputs can be processed simultaneously.
These tests also reveal the bottleneck at the Master machine. The system is capable of handling video resolutions of up to CIF (352x288), which are usual in web streaming, without dropouts. Higher video resolutions, like SVHS (720x576) or HDTV (starting at 1280x720), which are common in the field of professional video processing, exceed the system's capabilities by far and remain reserved for special hardware.
6. Sources
6.1. Books & Papers
[ARG2003]
A. Argyriou and V. Madisetti; A media streaming protocol for
heterogeneous wireless networks; CCW 2003 Proceedings; Pages
30 – 33
[APO2002]
John G. Apostolopoulos, Wai-tian Tan and Susie J. Wee; Video
Streaming: Concepts, Algorithms and Systems; HP Technical
Report; 2002
[BER2004]
M. Bertozzi, A.Broggi, P.Grisieri and A. Tibaldi; A tool for vision
based pedestrian detection performance evaluation; IEEE Intelligent
Vehicles Symposium 2004; Pages 784 - 789
[BLA2003]
James Black, Tim Ellis and Paul Rosin; A Novel Method for Video
Tracking Performance Evaluation; Joint IEEE Int Workshop on Visual
Surveillance and Performance Evaluation of Tracking and
Surveillance, London 2003, Pages 125-132
[BRU2003]
Kai Bruns and Benjamin Neidhold; Audio-, Video- und
Grafikprogrammierung, Fachbuchverlag Leipzig; 2003
[CT1996_5]
Ulrich Hilgefort, Neue digitale Schnittsysteme, C’t 5/1996, p. 84
[CT1996_11]
Robert Seetzen, Bilderflut, C’t 11/1996, p. 30
[CT2005_10]
Dr. Volker Zota; Kompressionist; C't 10/2005; p. 146
[DEM2003]
Gregory C. Demetriades; Streaming Media; Wiley; 2003
[EID2005]
Horst Eidenberger; Proceedings of the 11th International Multimedia
Modelling Conference, 2005, Pages 358-363
[ELL2001]
Tim Ellis and Ming Xu; Object Detection and Tracking in an Open
Dynamic World; Second IEEE International Workshop on
Performance Evaluation of Tracking and Surveillance; 2001; Pages
211-219
[HAG1998]
Gregory D. Hager and Peter N. Belhumeur; Efficient Region Tracking
With Parametric Models of Geometry and Illumination; IEEE
Transactions on Pattern Analysis and Machine Intelligence, IEEE Volume 20, Issue 10, Oct. 1998, Pages: 1025 - 1039
[HEL1994]
Tobias Helbig; Development and Control of Distributed Multimedia
Applications; Proceedings of the 4th Open Workshop on High-Speed
Networks; 1994; Pages 208-213
[HUA2002]
Yu Huang, Thomas S. Huang and Heinrich Niemann, A Region
Based Method for Model-Free Object Tracking; 16th International
Conference on Pattern Recognition, 2002; Proceedings. Volume 1;
Page 592-595
[HP2002]
John. G. Apostolopoulos; Video Streaming: Concepts, Algorithms
and Systems; HP Laboratories Palo Alto; 2002
[ISO1999]
MPEG-7 Context and Objectives; ISO/IETC JTC1/SC29/WG11
[ISO2004_1]
Coding of moving pictures and associated audio for digital storage
media at up to about 1.5 MBit/s; ISO Standard CD 11172-1
[ISO2004_2]
Coding of moving pictures and associated audio for digital storage
media at up to about 1.5 MBit/s; ISO Standard CD 11172-2
[JIA2003]
Hao Jiang and Mark S. Drew; Shadow-Resistant Tracking in Video;
International Conference on Multimedia and Expo; Proceedings Vol.
3, Pages 77-80
[JOD1997]
P-M. Jodoin and M. Mignotte; Unsupervised motion detection using a
Markovian temporal model with global spatial constraints;
International Conference on Image Processing 2004; Volume 4
Pages 2591-2594
[KIM2005]
Kim Taehyun and M.H. Ammar, Optimal quality adaptation for
scalable encoded video; Journal on Selected Areas in
Communications, IEEE Volume 23, Issue 2, Feb 2005, Pages: 344 - 356
[LEO2005]
Athanasios Leontaris; Comparison of blocking and blurring metrics
for video compression; AT&T Labs; 2005
[LIN1998]
Linda S. Cline, John Du, Bernie Keany, K. Lakshman, Christian
Maciocco and David M. Putzolu; DirectShow(tm) RTP Support for
Adaptivity in Networked Multimedia Applications; IEEE International
Conference on Multimedia Computing and Systems, 1998;
Proceedings, Pages:13 – 22
[LV2002]
T. LV, B. Ozer and W. Wolf; Parallel architecture for video processing
in a smart camera system; IEEE Workshop on Signal Processing
Systems, 2002; Pages: 9 – 14
[MAY1999]
Ketan Desharath Mayer-Patel; A Parallel Software-Only Video
Effects Processing System; Ph.D. Thesis; 1999
[MEN2003]
Eyal Menin; The Streaming Media Handbook; Prentice Hall; 2003
[MSDN2003]
Microsoft; MSDN Library for Visual Studio .NET 2003; 2003
[MUL1993]
Armin Müller; Multimedia PC; Vieweg; 2003
[NGU2004]
Hieu T. Nguyen and Arnold W.M. Smeulders; Fast Occluded Object
Tracking; Transactions on Pattern Analysis and Machine Intelligence,
IEEE Volume 26, Issue 8, Aug. 2004; Pages: 1099 - 1104
[ORT1993]
Michael Ortlepp and Michael Horsch; Video für Windows; Sybex;
1993
[PER2003]
P. Perez et al.; Color-Based Probabilistic Tracking; Microsoft
Research Paper; 2002
[PES2003]
Mark D. Pesce; Programming Microsoft DirectShow for Digital Video
and Television, Microsoft Press; 2003
[PUA2000]
Kok Meng Pua; Feature-Based Video Sequence Identification; Ph.D.
Thesis; 2000
[RAD2001]
H.M. Radha, M. van der Schaar, Yingwei Chen; The MPEG-4 fine-grained scalable video coding method for multimedia streaming over
IP; IEEE Transactions on Multimedia, Volume 3, Issue 1, March
2001 Pages: 53 - 68
[ROS1997]
Paul L.Rosin and Tim Ellis; Image difference threshold strategies and
shadow detection; British Machine Vision Conf., Pages 347-356
[SCL2001]
Stan Scarloff and Lifeng Liu; Deformable Shape Detection and
Description, Transactions on Pattern Analysis and Machine
Intelligence, IEEE Volume 23, Issue 5, May 2001 Pages: 475 - 489
[SHA2004]
John Shaw; Introduction to Digital Media and Windows Media Series
9; Microsoft Corporation; 2004
[STA2003]
Vladimir Stankovic and Raouf Hamazoui; Live Video Streaming over
Packet Networks and Wireless Channels; Paper; Proceedings for the
13th Packet Video Workshop; Nantes 2003;
[STO1995]
Dieter Stotz, Computergestützte Audio- und Videotechnik, Springer;
1995
[TAD2003]
Hadj Hamma Tadjine and Gerhard Joubert; Colour Object Tracking
without Shadow; The 23rd Picture Coding Symposium (PCS 2003);
Pages 391-394
[TAU2003]
David Taubman and Andrew Secker; Highly Scalable Video
Compression with Scalable Motion Coding; International Conference
on Image Processing, 2003; Proceedings, Pages: 273-276
[TOP2004]
Michael Topic; Streaming Demystified; McGraw-Hill; 2002
[TOT2003]
Daniel Toth and Til Aach; Detection and recognition of moving
objects using statistical motion detection and Fourier descriptors;
12th International Conference on Image Analysis and Processing,
2003; Proceedings, Pages: 430 - 435
[TRI2002]
Avanish Tripathi and Mark Claypool; Improving Multimedia Streaming
with Content-Aware Video Scaling; Proceedings of the Second
International Workshop on Intelligent Multimedia Computing and
Networking; 2002; Pages 1021-1024
[TRV2002]
Mohan M. Trivedi, Andrea Prati and Greg Kogut; Distributed
Interactive Video Arrays For Event Based Analysis of Incidents; The
IEEE 5th International Conference on Intelligent Transportation
Systems; 2002; Proceedings, Pages: 950 - 956
[WAN2004]
Tu Wanging and Jia Weija; Adaptive playback buffer for wireless
streaming media; IEEE International Conference on Networks; 2004;
Proceedings. 12th Volume 1, Pages: 191 - 195 vol.1
[WU2004]
Hao Wu and Qinfen Zheng; Self Evaluation for Video Tracking
Systems; 24th Army Science Conference Proceedings; 2004
[YIL2004]
Alper Yilmaz and Mubarak Shah; Contour-Based Object Tracking
with Occlusion Handling; IEEE Transactions on Pattern Analysis and
Machine Intelligence, Volume 26, Issue 11, Nov. 2004 Pages:1531 –
1536
[ZHO2000]
Yu Zhong, Anil K. Jain and M.-P. Dubuisson-Jolly, Object Tracking
Using Deformable Templates; IEEE Transactions on Pattern Analysis
and Machine Intelligence, Volume 22, Issue 5, May 2000; Pages:
544 - 549
[ZHU2005]
Bin B. Zhu, Chun Yuan, Yidong Wang and Shipeng Li, Scalable
Protection for MPEG-4 Fine Granularity Scalability; Transactions on
Multimedia, IEEE Volume 7, Issue 2, Apr 2005; Pages: 222 – 233
[ZIV2003]
Zoran Živković; Motion Detection and Object Tracking in Image
Sequences; Ph.D. Thesis; 2002
6.2. URLs
[URL1]
Microsoft; DirectShow Reference;
http://msdn.microsoft.com/library/default.asp?url=/library/enus/directshow/htm/directshowreference.asp;2004
[URL2]
Wikipedia; Charge-coupled Device;
http://de.wikipedia.org/wiki/Charge-coupled_Device;2004
[URL3]
Microsoft, MSDN Library “Audio and Video”;
http://msdn.microsoft.com/library/default.asp?url=/library/enus/dnanchor/html/audiovideo.asp;2004
[URL4]
Microsoft, MSDN Supported Formats;
http://msdn.microsoft.com/library/default.asp?url=/library/enus/directshow/htm/supportedformatsindirectshow.asp;2004
[URL5]
LeadTools, DirectShow,
http://www.leadtools.com/SDK/Multimedia/MultimediaDirectShow.htm;2004
[URL6]
Microsoft; DirectX Overview,
http://www.microsoft.com/windows/directx/default.aspx?url=/windows/direc
tx/productinfo/overview/default.htm;2004
[URL7]
Jim Taylor; DVD Demystified,
http://www.dvddemystified.com; 2004
[URL8]
Microsoft; Windows Media;
http://www.microsoft.com/windows/windowsmedia/default.aspx ;2004
[URL9]
Idael Cardoso, CSharp Windows Media Format SDK Translation;
http://www.codeproject.com/cs/media/ManWMF.asp ; 2004
[URL10]
NetMaster; DirectShow .NET;
http://www.codeproject.com/cs/media/directshownet.asp ; 2004
[URL11]
Robert Laganiere, Programming computer vision applications;
http://www.site.uottawa.ca/~laganier/tutorial/opencv+directshow/ ;2003
[URL12]
Intel, Open Source Computer Vision Library;
http://www.intel.com/research/mrl/research/opencv/; 2004
[URL13]
Yunqiang Chen; DirectShow Transform Filter AppWizard;
http://www.ifp.uiuc.edu/~chenyq/research/Utils/DShowFilterWiz/DShowFilt
erWiz.html; 2001
[URL14]
John McAleely; DXMFilter;
http://www.cs.technion.ac.il/Labs/Isl/DirectShow/dxmfilter.htm; 1998
[URL15]
Dennis P. Curtin; A Short Course in Digital Video;
http://www.shortcourses.com/video/introduction.htm; 2004
[URL16]
Wavelength Media; Creating Streaming Video;
http://www.mediacollege.com/video/streaming/overview.html; 2004
[URL17]
Alken M.R.S; Video Standards;
http://www.alkenmrs.com/video/standards.html;2004
[URL18]
Britannica Online; http://www.britannica.com; 2004
[URL19]
MPEG.ORG; http://www.mpeg.org; 2004
[URL20]
Björn Eisert; Was ist MPEG ?;
http://www.cybersite.de/german/service/Tutorial/mpeg/; 1995
[URL21]
Berkeley Multimedia Research Center; MPEG Background;
http://bmrc.berkeley.edu/frame/research/mpeg/mpeg_overview.html;
2004
[URL22]
CyberCollege; Linear and Nonlinear Editing;
http://www.cybercollege.com/tvp056.htm; 2005
[URL23]
Siemens; Online Lexikon;
http://www.networks.siemens.de/solutionprovider/_online_lexikon; 2005
[URL24]
Microsoft; Video For Windows;
http://msdn.microsoft.com/library/default.asp?url=/library/enus/multimed/htm/_win32_video_for_windows.asp;2005
[URL25]
Michael Blome; Core Media Technology in Windows XP Empowers You to
Create Custom Audio/Video Processing Components;
http://msdn.microsoft.com/msdnmag/issues/02/07/DirectShow/default.asp
x; 2002
[URL26]
Chris Thompson; DirectShow For Media Playback in Windows;
http://www.flipcode.com/articles/article_directshow01.shtml;2000
[URL27]
DFNO-Expo; Streaming Overview;
http://www.dfn-expo.de/Technologie/Streaming/Streaming_tab.html; 1999
[URL28]
Jane Hunter; A Review of Video Streaming over the Internet;
http://archive.dstc.edu.au/RDU/staff/jane-hunter/video-streaming.html;
1997
[URL29]
Ross Finlayson; LIVE.COM; Internet Streaming Media, Wireless and
Multicast Technology, Services & Standards;http://www.live.com; 2005
[URL30]
Robert B. Fisher; CVonline: The Evolving, Distributed, Non-Proprietary,
On-Line Compendium of Computer Vision;
http://homepages.inf.ed.ac.uk/rbf/CVonline/; 2005
[URL30]
Paul L. Rosin; Thresholding for Change Detection;
http://users.cs.cf.ac.uk/Paul.Rosin/thresh/thresh.html; 2004
[URL31]
Telefunken Germany; http://www.telefunken.de/de/his/2336/his.htm; 2005
[URL32]
WikiPedia; Moore's Law, http://en.wikipedia.org/wiki/Moore's_law; 1965
[URL33]
Encyclopedia Britannica, Distributed
Computing;http://www.britannica.com/eb/article?tocId=168849; 2005
[URL34]
Encyclopedia Britannica, Human Eye Color Vision;
http://www.britannica.com/eb/article?tocId=64933; 2005
[URL35]
Webopedia; Motion JPEG;
http://www.webopedia.com/TERM/m/motion_JPEG.html; 2005
[URL36]
RealMedia; www.real.com; 2005
[URL37]
Apple; Quicktime; http://www.apple.com/quicktime/; 2005
[URL38]
WikiPedia; Webcast; http://en.wikipedia.org/wiki/Webcast; 2005
[URL39]
Sun Developer Network; How to Create a Multimedia Player;
http://java.sun.com/developer/technicalArticles/Media/Mediaplayer/; 2005
[URL40]
StreamWorks; http://www.streamworks.dk/uk/default.asp; 2005
[URL41]
Siggraph; VDO Live;
http://www.siggraph.org/education/materials/HyperGraph/video/architectur
es/VDO.html; 2005
[URL42]
Microsoft; DirectShow Documentation;
http://msdn.microsoft.com/library/default.asp?url=/library/enus/directshow/htm/directshow.asp; 2005
[URL43]
MontiVision, DirectShow Filters; http://www.montivision.com/;2005
[URL44]
LEAD; DirectShow Filters;
http://www.leadtools.com/SDK/MULTIMEDIA/Multimedia-DirectShowFilters.htm; 2005
[URL45]
Microsoft; Windows Media SDK Documentation;
http://msdn.microsoft.com/library/default.asp?url=/library/enus/dnanchor/html/anch_winmedsdk.asp; 2005
[URL46]
Inscriber; Windows Media Plugins; http://www.inscriber.com/; 2005
[URL47]
Consolidated Video; LiveAlpha; http://convid.com/; 2005
[URL48]
Microsoft; Windows Media SDKs FAQ;
http://msdn.microsoft.com/library/enus/dnwmt/html/ds_wm_faq.asp?frame=true#ds_wm_faq_topic07; 2005
[URL49]
Avid Technology; http://www.avid.com/
[URL50]
Betacam Format; http://betacam.palsite.com/format.html
[URL51]
Pinnacle Systems; Dazzle Video Creator;
http://www.pinnaclesys.com/VideoEditing.asp?Category_ID=1&Langue_ID
=7&Family=24
[URL52]
Microsoft; Windows Media SDK Components;
http://www.microsoft.com/windows/windowsmedia/mp10/sdk.aspx
[URL54]
Cisco; Quality of Service;
http://www.cisco.com/univercd/cc/td/doc/cisintwk/ito_doc/qos.htm
[URL55]
Carmen Nobel; Video Streaming goes Wireless; EWeek;
http://www.eweek.com/article2/0,1759,1436939,00.asp; 2004
[URL56]
The Internet Engineering Task Force; http://www.ietf.org
[URL57]
Michael B. Jones; The Microsoft Interactive TV System: An Experience
Report; http://research.microsoft.com/~mbj/papers/mitv/tr-97-18.html;
1997
[URL58]
Elecard Ltd; http://www.elecard.com
[URL59]
Alparysoft; Lossless Video Codec;
http://www.alparysoft.com/products.php?id=8&item=35; 2005
[URL60]
Real Time Streaming Protocol Information and Updates;
http://www.rtsp.org/; 2005
[URL61]
Wikipedia; Multicast technology; http://en.wikipedia.org/wiki/Multicast;
2005
[URL62]
Agava’s DirectShow Development Site; Forum;
http://dsforums.agava.com/cgi/yabb/YaBB.cgi?board=directshow; 2005
[URL63]
Intel Corp.; Hyper-threading Technology;
http://www.intel.com/technology/hyperthread/; 2005
[URL64]
Cinelerra; Movie Studio in a Linux Box;
http://heroinewarrior.com/cinelerra.php3; 2005
7. Pictures
Picture 1: miro Video PCTV card.............................................................................. 13
Picture 2: DCT compression cycle............................................................................ 15
Picture 3: TCP protocol handshaking and dataflow .................................................. 28
Picture 4: UDP protocol dataflow.............................................................................. 29
Picture 5: Windows Movie Maker GUI ...................................................................... 35
Picture 6: black box model........................................................................................ 40
Picture 7: video process phases............................................................................... 40
Picture 8: distributed system model.......................................................................... 41
Picture 9: real world example ................................................................................... 42
Picture 10: slave model ............................................................................................ 43
Picture 11: master model.......................................................................................... 44
Picture 12: master video processing......................................................................... 44
Picture 13: Slave application model.......................................................................... 56
Picture 14: Master application model........................................................................ 56
Picture 15: motion detection filter properties page.................................................... 58
Picture 16: HNS MPEG-2 sender properties page ................................................... 60
Picture 17: filter graph using infinite pin tee .............................................................. 61
Picture 18: infinite pin tee filter properties page........................................................ 62
Picture 19: HNS MPEG streamer filter properties page............................................ 63
Picture 20: Moonlight MPEG2 Encoder settings....................................................... 64
Picture 21: Slave application camera selection dialog.............................................. 65
Picture 22: Slave application graphical user interface .............................................. 66
Picture 23: Slave's filter graph .................................................................................. 66
Picture 24: Master application schematic workflow .................................................. 67
Picture 25: Master’s Filter graph............................................................................... 67
Picture 26: Master application Slave selection ......................................................... 68
Picture 27: Master application Slave resolution selection ......................................... 68
Picture 28: Master application filter graph setup....................................................... 68
Picture 29: Master application graphical user interface ............................................ 69
Picture 30: motion detection schematic data flow..................................................... 70
Picture 31: object detection schematic data flow ...................................................... 71
Picture 32: CPU load at Master for 352x288/25fps................................................... 74
Picture 33: CPU load at Master for 352x288/25fps-640x480/25fps .......................... 74
Picture 34: CPU load at Master for 640x480/25fps................................................... 75
Picture 35: incoming network load for 352x288/25fps .............................................. 76
Picture 36: outgoing network load ............................................................................ 76
Picture 37: Windows Media Player 9 ...................................................................... 103
Picture 38: Simple filter graph................................................................................. 105
Picture 39: connecting pins..................................................................................... 109
Picture 40: connecting pins "intelligently" ............................................... 110
Picture 41: sample life cycle ................................................................................... 111
Picture 42: Graphical User Interface of GraphEdit.................................................. 112
Picture 43: DirectShow application tasks................................................................ 113
Picture 44: Sample filter graph ............................................................................... 116
Picture 45: Filter selection dialog in GraphEdit ....................................................... 119
8. Code samples
Code sample 1: IFilter graph interface in C++ ........................................................ 115
Code sample 2: IFilter graph interface in C# .......................................................... 115
Code sample 3: initiating a filter graph in C# .......................................................... 117
Code sample 4: helper function for searching a filter by GUID ............................... 118
Code sample 5: finding a filter by GUID ................................................................. 118
Code sample 6: helper function for searching a filter by its name .......................... 120
Code sample 7: searching a filter by its name ........................................................ 120
Code sample 8: adding filters to the filter graph ..................................................... 120
Code sample 9: finding a filter’s pins helper function.............................................. 121
Code sample 10: get unconnected pins ................................................................. 121
Code sample 11: finding pins by name................................................................... 121
Code sample 12: connecting pins........................................................................... 122
Code sample 13: running the filter graph................................................................ 122
Code sample 14: stopping and clearing the filter graph.......................................... 123
9. Tables
Table 1: Overview of TV formats [URL17] ................................................................ 10
Table 2: Video compression formats overview ......................................................... 18
Table 3: Overview SDKs........................................................................................... 55
A. A Short History Of DirectShow
A.1. DirectShow Capabilities
DirectShow capabilities can be separated into three broad areas, which reflect the
three basic types of DirectShow filters. First are the capture capabilities. DirectShow
can handle capture of audio from the microphone or from a line input, can control a
digital camcorder or D-VHS VCR (a new kind of VCR that stores video digitally in
high resolution), or can capture both audio and video from a live camera, such as a
webcam. DirectShow can also open a file and treat it as if it were a "live" source. This
way, one can work on video or audio that has been previously captured.
Once the media stream has been captured, DirectShow filters can be used to
transform it. Transform filters have been written to convert color video to black-and-white, resize video images, add an echo effect to an audio stream, and so on. These
transform filters can be connected, one after another, like so many building blocks,
until the desired effect is achieved. Streams of audio and video data can be "split"
and sent to multiple filters simultaneously, as if you added a splitter to the coaxial
cable that carries a cable TV or satellite signal. Media streams can also be
multiplexed, or “muxed”, together, taking two or more streams and making them one.
Using a multiplexer, you can add a soundtrack to a video sequence, putting both
streams together synchronously.
After all the heavy lifting of the transform filters has been accomplished, there's one
task left: rendering the media stream to the display, speakers, or a device.
DirectShow has a number of built-in render filters, including simple ones that provide
a window on the display for video playback. You can also take a stream and write it
to disk or to a device such as a digital camcorder.
Most DirectShow applications do not need the full range of DirectShow's capabilities;
in fact, very few do. For example, Windows Media Player doesn't need much in the
way of capture capabilities, but it needs to be able to play (or render) a very wide
range of media types: MP3s, MPEG movies, AVI movies, WAV sounds, Windows
Media, and so on. You can throw almost any media file at Windows Media Player
(with the notable exception of Apple QuickTime and the RealNetworks media
formats), and it'll play the file without asking for help. That's because Windows Media
Player, built with DirectShow, includes all of DirectShow's capabilities to play a broad
range of media.
Picture 37: Windows Media Player 9
On the other hand, Windows Movie Maker is a great example of an application that
uses nearly the full suite of DirectShow capabilities. It is fully capable of
communicating with and capturing video from a digital camcorder (or a webcam).
Once video clips have been captured, they can be edited, prettied up, placed onto a
timeline, mixed with a soundtrack, and then written to disk (or a digital camcorder) as
a new, professional-looking movie. You can even take a high-resolution, high-bandwidth movie and write it as a low-resolution, low-bandwidth Windows Media file,
suitable for dropping into an e-mail message or posting on a Web site. All of these
capabilities come from Windows Movie Maker's extensive use of DirectShow
because they're all DirectShow capabilities.
The flexibility of DirectShow means that it can be used to rapidly prototype
applications. DirectShow filters can be written quite quickly to provide solutions to a
particular problem. It is widely used at universities and in research centers, including Microsoft's own, to solve problems in machine vision (using the computer to
recognize portions of a video image) or for other kinds of audio or video processing,
including the real-time processing of signals.
There are some tasks that DirectShow cannot handle well, a few cases in which
"rolling your own" is better than using DirectShow. These kinds of applications
generally lie on the high end of video processing, with high-definition video pouring in
at tens of Megabytes per second or multiple cameras being choreographed and
mixed in real time. Right now these kinds of applications push even the fastest
computers to the limits of processor speed, memory, and network bandwidth. That's
not to say that you'll never be able to handle high-definition capture in DirectShow or
real-time multi-camera editing. You can write DirectShow applications that edit high-definition images and handle real-time multi-camera inputs, as this thesis proves.
In any case, working with video is both processor-intensive and memory-intensive,
and many DirectShow applications will use every computing resource available, up to
100 percent of the CPU. So when the decision is made to use DirectShow for a project, expectations have to be set appropriately. DirectShow is an excellent architecture for media processing, but it has its limits. [PES2003] [URL25]
A.2. Supported Formats in DirectShow
DirectShow is an open architecture, which means that it can support any format as
long as there are filters to parse and decode it. The filters provided by Microsoft
themselves, either as redistributables through DirectShow or as Windows operating
system components, provide default support for the following file and compression
formats. [URL4]
File types:
Windows Media Audio (WMA)*
Windows Media Video (WMV)*
Advanced Systems Format (ASF)*
Motion Picture Experts Group (MPEG)
Audio-Video Interleaved (AVI)
QuickTime (version 2 and lower)
WAV
AIFF
AU
SND
MIDI
Compression formats:
Windows Media Video*
ISO MPEG-4 video version 1.0*
Microsoft MPEG-4 version 3*
Sipro Labs ACELP*
Windows Media Audio*
MPEG Audio Layer-3 (MP3) (decompression only)
Digital Video (DV)
MPEG-1 (decompression only)
MJPEG
Cinepak
An asterisk (*) indicates that DirectShow applications must use the Windows Media
Format SDK to support this format. For more information, see the Audio and Video
section of the Microsoft MSDN Library [URL3]
Microsoft does not provide an MPEG-2 decoder. Several DirectShow-compatible
hardware and software MPEG-2 decoders are available from third parties, e.g.
Elecard [URL58] and Mainconcept [URL43].
As you can see, the MPEG-1 format is supported for decompression only, as MPEG encoding is subject to license fees and is therefore not provided for free. There are several
manufacturers who offer MPEG compression filters, such as Moonlight, Adobe,
Panasonic, Honestech and MainConcept.
A.3. Concepts of DirectShow
DirectShow is composed of two classes of objects: filters, the atomic entities of DirectShow, and filter graphs, collections of filters connected together.
Filters themselves consist mainly of pins and possibly some other properties, which depend on the type of filter and what it is intended to perform. Pins may be inbound or outbound, which means that data flows into or out of the filter.
Some filters possess only input or only output pins; e.g. capture devices need only output pins, because their input comes from an external device, while a video rendering filter has only input pins and its output is shown on the screen.
The filter graph dictates in which sequence the filters are connected together and handles the delivery of data and the synchronization. This data flow is commonly known as a stream. Conceptually, a filter graph might be thought of as consecutive
function calls, while the filters provide the functions. One difference between a
common sequential program and a filter graph is that the filter graph executes
continuously. Another important point distinguishes a DirectShow filter graph from an
ordinary computer program: a filter graph can have multiple streams flowing across it
and multiple paths through the filter graph. For example, a DirectShow application
can simultaneously capture video frames from a webcam and audio from a
microphone.
Picture 38: Simple filter graph
This data enters the filter graph through two independent source filters and would likely later be multiplexed together into a single audio/video stream. In another
case, you might want to split a stream into two identical streams. One stream could
be sent to a video renderer filter, which would draw it upon the display, while the
other stream could be written to disk. Both streams execute simultaneously;
DirectShow sends the same bits to the display and to the disk.
DirectShow filters make computations and decisions internally (for instance, they can change the values of bits in a stream), but they cannot make decisions that affect
the structure of the filter graph. A filter simply passes its data along to the next filter
in the filter graph. It can't decide to pass its data to filter A if some condition is true or
filter B if the condition is false. This means that the behavior of a filter graph is
completely predictable; the way it behaves when it first begins to operate on a
data stream is the way it will always operate.
Although filter graphs are entirely deterministic, it is possible to modify the elements
within a filter graph programmatically. A C++ program could create filter graph A if
some condition is true and filter graph B if it is false. Or both could be created during
program initialization (so that the program could swap between filter graph A and
filter graph B on the fly) as the requirements of the application change. Program code
can also be used to modify the individual filters within a filter graph, an operation that
can change the behavior of a filter graph either substantially or subtly. So, although
filter graphs can't make decisions on how to process their data, program code can be
used to simulate that capability.
For example, consider a DirectShow application that can capture video data from
one of several different sources, say from a digital camcorder and a webcam. Once
the video has been captured, it gets encoded into a compact Windows Media file,
which could then be dropped into an e-mail message for video e-mail. Very different
source and transform filters are used to capture and process a video stream from a
digital camcorder than those used with a webcam, so the same filter graph won't
work for both devices. In this case, program logic within the application could detect which input device is being used, perhaps based on a menu selection, and could then build the appropriate filter graph. If the user changes the selection from
one device to another, program logic could rebuild the filter graph to suit the needs
of the selected device. [PES2003] [URL26]
A.4. Modular Design
The basic power and flexibility of DirectShow derives directly from its modular design.
DirectShow defines a standard set of Component Object Model (COM) interfaces for
filters and leaves it up to the programmer to arrange these components in some
meaningful way. Filters hide their internal operations; the programmer doesn't need
to understand or appreciate the internal complexities of the Audio Video Interleaved
(AVI) file format, for example, to create an AVI file from a video stream. All that's
required is the appropriate sequence of filters in a filter graph. Filters are atomic
objects within DirectShow, meaning they reveal only as much of themselves as
required to perform their functions.
Because they are atomic objects, filters can be thought of and treated just like puzzle
pieces. The qualities that each filter possesses determine the shape of its puzzle
piece, and that, in turn, determines which other filters it can be connected to. As
long as the pieces match up, they can be fitted together into a larger scheme, the
filter graph.
All DirectShow filters have some basic properties that define the essence of their
modularity. Each filter can establish connections with other filters and can negotiate
the types of connections it is willing to accept from other filters. A filter designed to
process MP3 audio doesn't have to accept a connection from a filter that produces
AVI video, and probably shouldn't. Each filter can receive some basic messages (run, stop, and pause) that control the execution of the filter graph. That's about it;
there's not much more a filter needs to be ready to go. As long as the filter defines
these properties publicly through COM, DirectShow will treat it as a valid element
in a filter graph.
This modularity makes designing custom DirectShow filters a straightforward process.
The programmer's job is to design a COM object with the common interfaces for a
DirectShow filter, plus whatever custom processing the filter requires. A custom
DirectShow filter might sound like a complex affair, but it is really a routine job, one that is covered extensively in the examples in [PES2003].
The modularity of DirectShow extends to the filter graph. Just as the internals of a
filter can be hidden from the programmer, the internals of a filter graph can be hidden
from view. When the filter graph is treated as a module, it can assume responsibility
for connecting filters together in a meaningful way. It is possible to create a complete,
complex filter graph by adding a source filter and a renderer filter to the filter graph.
These filters are then connected with a technique known as Intelligent Connect.
Intelligent Connect examines the filters in the filter graph, determines the right way to
connect them, adds any necessary conversion filters, and makes the connections, all without any intervention from the programmer. Intelligent Connect can save you
an enormous amount of programming time because DirectShow does the tedious
work of filter connection for you.
There is a price to be paid for this level of automation: the programmer won't know
exactly which filters have been placed into the filter graph or how they're connected.
Some users will have installed multiple MPEG decoders, such as one for a DVD
player and another for a video editing application. Therefore, these systems will
have multiple filters to perform a particular function. With Intelligent Connect, you
won't know which filter DirectShow has chosen to use (at least, when a choice is
available). It is possible to write code that will make inquiries to the filter graph and
map out the connections between all the filters in the filter graph, but it is more work
to do that than to build the filter graph from scratch. So, modularity has its upsides (ease of use and extensibility) and its downsides (hidden code).
Hiding complexity isn't always the best thing to do, and you might choose to build
DirectShow filter graphs step by step, with complete control over the construction
process. Overall, the modular nature of DirectShow is a huge boon for the
programmer, hiding gory details behind clean interfaces. This modularity makes
DirectShow one of the very best examples of object-oriented programming
(OOP), which promises reusable code and clean module design, ideals that are
rarely achieved in practice. DirectShow achieves this goal admirably, as you'll see.
[PES2003] [URL26]
A.5. Filters
Filters are the basic units of DirectShow programs, the essential components of the
filter graph. A filter is an entity complete unto itself. Although a filter can have many
different functions, it must have some method to receive or transmit a stream of data.
Each filter has at least one pin, which provides a connection point from that filter to
other filters in the filter graph. Pins come in two varieties: input pins can receive a
stream, while output pins produce a stream that can be sent along to another filter.
A.5.1. Filter Types
There are three basic classes of DirectShow filters, which span the path from input,
through processing, to output (often referred to as "rendering"). All DirectShow filters
fall into one of these broad categories. A filter produces a stream of data, operates
on that stream, or renders it to some output device.
A.5.1.1. Source Filters
Any DirectShow filter that produces a stream is known as a source filter. The stream
might originate in a file on the hard disk, or it might come from a live device, such as
a microphone, webcam, or digital camcorder. If the stream comes from disk, it
could be a pre-recorded WAV (sound), AVI (movie), or Windows Media file. Alternatively, if the source is a live device, it could be any of the many thousands of Windows-compatible peripherals. DirectShow is closely tied in to the Windows Driver Model
(WDM), and all WDM drivers for installed multimedia devices are automatically
available to DirectShow as source filters. So, for example, webcams with properly
installed Windows drivers become immediately available for use as DirectShow
source filters. Source filters that translate live devices into DirectShow streams are
known as capture source filters. The software design of a source filter is covered in detail in [PES2003].
A.5.1.2. Transform Filters
Transform filters are where the interesting work gets done in DirectShow. A transform
filter receives an input stream from some other filter (possibly a source filter),
performs some operation on the stream, and then passes the stream along to
another filter. Nearly any imaginable operation on an audio or video stream is
possible within a transform filter. A transform filter can parse (interpret) a stream of
data, encode it (perhaps converting WAV data to MP3 format) or decode it, or
add a text overlay to a video sequence. DirectShow includes a broad set of
transform filters, such as filters for encoding and decoding various types of video
and audio formats.
Transform filters can also create a tee in the stream, which means that the input
stream is duplicated and placed on two (or more) output pins. Other transform
filters take multiple streams as input and multiplex them into a single stream. Using a
transform filter multiplexer, separate audio and video streams can be combined into a
video stream with a soundtrack.
A.5.1.3. Renderer Filters
A renderer filter translates a DirectShow stream into some form of output. One
basic renderer filter can write a stream to a file on the disk. Other renderer filters can
send audio streams to the speakers or video streams to a window on the desktop.
The Direct in DirectShow reflects the fact that DirectShow renderer filters use
DirectDraw and DirectSound, supporting technologies that allow DirectShow to
efficiently pass its renderer filter streams along to graphics and sound cards. This
ability means that DirectShow's renderer filters are very fast and do not get tied up in
a lot of user-to-kernel mode transitions. (In operating system parlance, this process
means moving the data from an unprivileged level in an operating system to a
privileged one where it has access to the various output devices.)
A filter graph can have multiple renderer filters. It is possible to put a video
stream through a tee, sending half of it to a renderer filter that writes it to a file, and
sending the other half to another renderer filter that puts it up on the display.
Therefore, it is possible to monitor video operations while they're happening, even if they're being recorded to disk, an important feature that is used later in this work.
A.5.1.4. Roundup
All DirectShow filter graphs consist of combinations of these three types of filters, and
every DirectShow filter graph will have at least one source filter, one renderer filter,
and (possibly) several transform filters. In each filter graph, a source filter creates a
stream that is then operated on by any number of transform filters and is finally
output through a renderer filter. These filters are connected together through
their pins, which provide a well-defined interface point for transfer of stream data
between filters.
A.5.2. Connections between Filters
Although every DirectShow filter has pins, it isn't always possible to connect an input
pin to an output pin. When two filters are connected to each other, they have to reach an agreement about what kind of stream data they'll pass between them. For example, there are many video formats in wide use, such as DV (digital video), MPEG-1, MPEG-2, QuickTime, and so on.
The pins on a DirectShow filter handle the negotiation between filters and ensure that
the pin types are compatible before a connection is made between any two filters.
Every filter is required to publish the list of media types it can send or receive and a
set of transport mechanisms describing how each filter wants the stream to travel
from output pin to input pin.
When a DirectShow filter graph attempts to connect the output pin of one filter to
the input pin of another, the negotiation process begins. The filter graph examines
the media types that the output pin can transmit and compares these with the media
types that the input pin can receive. If there aren't any matches, the pins can't
be connected and the connection operation fails.
A transform filter that can handle DV might not be able to handle any other video
format. Therefore, a source filter that creates an MPEG-2 stream (perhaps read from
a DVD) should not be connected to that transform filter because the stream data
would be unusable.
Picture 39: connecting pins
Next the pins have to agree on a transport mechanism. If they can't agree, the
connection operation fails. Finally one of the pins has to create an allocator, an
object that creates and manages the buffers of stream data that the output pin uses
to pass data along to the input pin. The allocator can be owned by either the
output pin or the input pin; it doesn't matter, so long as they're in agreement.
If all these conditions have been satisfied, the pins are connected. This
connection operation must be repeated for each filter in the graph until there's a
complete, uninterrupted stream from source filter, through any transform filters, to a
renderer filter. When the filter graph is started, a data stream will flow from the output
pin of one filter to the input pin of the other through the entire span of the filter graph.
[URL1] [PES2003]
A.5.3. Intelligent Connect
One of the greatest strengths of DirectShow is its ability to handle the hard work
of supporting multiple media formats. Most of the time it is not necessary for the
programmer to be concerned with what kinds of streams will run through a filter
graph. Yet to connect two pins, DirectShow filters must have clear agreement on the
media types they're handling. How can both statements be true simultaneously?
Intelligent Connect automates the connection process between two pins. You
can connect two pins directly, as long as their media types agree. In a situation in
which the media types are not compatible, you'll often need one (or several)
transform filters between the two pins so that they can be connected together.
Intelligent Connect does the work of adding and connecting the intermediate
transform filters to the filter graph.
Picture 40: connecting pins "intelligently"
For example, a filter graph might have a source filter that produces a stream from an MPEG file. This filter graph also has a renderer filter that shows the video on the screen. These two filters have nothing in common. They do not share any
common media types because the MPEG data is encoded and maybe
interleaved and must be decoded and de-interleaved before it can be shown. With
Intelligent Connect, the filter graph can try combinations of intermediate transform
filters to determine whether there's a way to translate the output requirements of
the pin on the source filter into the input requirements of the render filter. The
filter graph can do this because it has access to all possible DirectShow filters. It
can make inquiries to each filter to determine whether a transform filter can transform one media type to another, which might be an intermediate type, transform that type into still another, and so on, until the input requirements of the renderer filter have been met. A DirectShow filter graph can look very complex by the time it succeeds in connecting two pins, but from the programmer's point of
view, it is a far easier operation. And if an Intelligent Connect operation fails, it is
fairly certain there's no possible way to connect two filters. The Intelligent Connect
capability of DirectShow is one of the ways that DirectShow hides the hard work of
media processing from the programmer. [URL1] [PES2003]
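In code, the difference between a direct connection and Intelligent Connect boils down to which method of the graph builder is called. The following C# sketch assumes DirectShow interop declarations for IGraphBuilder and IPin such as those of the DirectShow .NET wrapper [URL10]; the exact method signatures depend on the wrapper used.

// Sketch: try a direct connection first, fall back to Intelligent Connect.
static void ConnectPins(IGraphBuilder graph, IPin output, IPin input)
{
    // ConnectDirect fails if the pins' media types are not compatible.
    int hr = graph.ConnectDirect(output, input, null);
    if (hr < 0)
    {
        // Intelligent Connect: the graph inserts whatever intermediate
        // transform filters are needed to translate between the media types.
        hr = graph.Connect(output, input);
    }
    if (hr < 0)
        throw new System.InvalidOperationException("pins cannot be connected");
}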
A.6. Filter Graphs
The DirectShow filter graph organizes a group of filters into a functional unit. When
connected, the filters present a path for a stream from source filters, through
any transform filters, to rendering filters. But it isn't enough to connect the filters;
the filter graph has to tell the filters when to start their operation, when to stop,
and when to pause. In addition, the filters need to be synchronized because
they're all dealing with media samples that must be kept in sync. (Imagine the
confusion if the audio and video kept going out of sync in a movie.)
For this reason, the filter graph generates a software-based clock that is available
to all the filters in the filter graph. This clock is used to maintain synchronization and
allows filters to keep their stream data in order as it passes from filter to filter.
Available to the programmer, the filter graph clock has increments of 100
nanoseconds. (The accuracy of the clock on your system might be less precise than
100 nanoseconds because accuracy is often determined by the sound card or chip
set on your system.)
When the programmer issues one of the three basic DirectShow commands (run, stop, or pause), the filter graph sends the message to each filter in the filter graph.
Every DirectShow filter must be able to process these messages. For example,
sending a run message to a source filter controlling a webcam will initiate a
stream of data coming into the filter graph from that filter, while sending a stop
command will halt that stream. The pause command behaves superficially like
the stop command, but the stream data isn't cleared out like it would be if the filter
graph had received a stop command. Instead, the stream is frozen in place until the
filter graph receives either a run or stop command. If the run command is issued,
filter graph execution continues with the stream data already present in the filter
graph when the pause command was issued. [URL1] [PES2003]
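As a small illustration, the following C# fragment drives a graph through these states. It is only a sketch: it assumes that IMediaControl has been redeclared for managed code (as described in Appendix B) with Run, Pause and Stop methods, and that mediaCtrl has already been obtained by casting the filter graph object, as shown there.
// Stream times in the graph are measured in 100-nanosecond units,
// i.e. 10,000,000 units per second.
int hr = mediaCtrl.Run();               // start streaming
if (hr < 0) Marshal.ThrowExceptionForHR(hr);
System.Threading.Thread.Sleep(5000);    // let the graph run for five seconds
hr = mediaCtrl.Pause();                 // freeze the stream; samples stay in the filters
System.Threading.Thread.Sleep(1000);
hr = mediaCtrl.Run();                   // resume with the samples still in the graph
hr = mediaCtrl.Stop();                  // stop and discard all samples in the graph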
A.7. The Life Cycle of a Sample
To gain a more complete understanding of DirectShow, let's follow a sample of video
data gathered from a webcam as it passes through the filter graph on its way to the
display. Only a few transformations are necessary for this configuration. After processing, the samples are presented at the output pin of the webcam's capture filter; each sample carries a timestamp, which ensures that the images stay correctly synchronized as they move from filter to filter.
Picture 41: sample life cycle
Finally the stream arrives at its destination, the video renderer. The renderer filter accepts a properly formatted video stream from the preceding decoder filter and draws it onto the display. As each sample arrives at the renderer filter, it is displayed within the DirectShow output window. Samples are displayed in the correct order, from first to last, because the video renderer examines the timestamp of each sample to make sure that the samples are played sequentially. Once a sample has reached a renderer filter, DirectShow is done with it, and the sample is discarded after it has been drawn on the display. The buffer that the filter allocated to store the sample is returned to a pool of free buffers, ready to receive another sample.
This flow of samples continues until the filter graph stops or pauses its execution. If the filter graph is stopped, all of its samples are discarded; if it is paused, the samples are held within their respective filters until the filter graph returns to the running state. [URL5] [PES2003]
A.8. GraphEdit
GraphEdit is a tool for trying out filter graphs before coding them in a programming language.
Picture 42: Graphical User Interface of GraphEdit
With GraphEdit you can easily create filter graphs by simply choosing the desired
filters and connecting them. The basic elements of DirectShow applications (filters, connections and filter graphs) can be represented visually.
GraphEdit is like a whiteboard on which prototype DirectShow filter graphs can be
sketched. Because GraphEdit is built using DirectShow components, these
whiteboard designs are fully functional, executable DirectShow filter graphs.
GraphEdit uses a custom persistence format. This format is not supported for
application use, but it is helpful for testing and debugging an application.
B. Programming DirectShow
B.1. Writing a DirectShow Application
In broad terms, there are three tasks that any DirectShow application must perform.
These are illustrated in the following diagram:
Picture 43: DirectShow application tasks
1. The application creates an instance of the Filter Graph Manager.
2. The application uses the Filter Graph Manager to build a filter graph. The
exact set of filters in the graph will depend on the application.
3. The application uses the Filter Graph Manager to control the filter graph and
stream data through the filters. Throughout this process, the application will
also respond to events from the Filter Graph Manager.
When processing is completed, the application releases the Filter Graph Manager
and all of the filters.
DirectShow is based on COM; the Filter Graph Manager and the filters are all COM
objects. One should have a general understanding of COM client programming before beginning to program DirectShow. The article "Using COM" in the DirectX SDK documentation gives a good overview of the subject; many books about COM programming are also available.
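Condensed into code, the three tasks might look like the following sketch for simple file playback. It is only an illustration under the assumptions of Appendix B: it presumes that a Clsid helper class and the IGraphBuilder, IMediaControl and IMediaEventEx interfaces have been redeclared for C# as shown there, that RenderFile and WaitForCompletion are included in those declarations, and the file name is hypothetical.
// 1. Create an instance of the Filter Graph Manager.
Type graphType = Type.GetTypeFromCLSID(Clsid.FilterGraph);
object graph = Activator.CreateInstance(graphType);
IGraphBuilder graphBuilder = (IGraphBuilder) graph;
// 2. Build a filter graph; RenderFile uses Intelligent Connect to
//    assemble all filters needed to play the given file.
int hr = graphBuilder.RenderFile(@"C:\sample.mpg", null);
if (hr < 0) Marshal.ThrowExceptionForHR(hr);
// 3. Control the graph and respond to its events.
IMediaControl mediaCtrl = (IMediaControl) graph;
IMediaEventEx mediaEvt = (IMediaEventEx) graph;
mediaCtrl.Run();
int evCode;
mediaEvt.WaitForCompletion(-1, out evCode);   // block until playback ends
mediaCtrl.Stop();
// Finally, release the Filter Graph Manager (and with it the filters).
Marshal.ReleaseComObject(graph);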
B.2. C# or C++?
DirectShow is meant, through its modular concept, to be extended and enhanced with additional components, i.e. filters.
For anyone familiar with C++ and COM programming, this should be straightforward. Those interested in using the latest technology might prefer to develop the GUI and filter graph parts of the application with Microsoft .NET, i.e. C#. C# has several advantages over C++ for GUI programming and makes development much easier, without requiring several lines of code just to open a basic window.
Unfortunately there aren’t any extensions available in .NET for using DirectShow in
managed code (most parts of DirectX interfaces are available for managed code, but
not DirectShow). This means that all interfaces and definitions in DirectShow would
have to be rewritten for .NET, in other words, several days of work.
But only some parts are needed to span a filter graph and to connect some existing
filters, so the work of rewriting interfaces can be cut down to a minimum.
The core C# language differs notably from C and C++ in its omission of pointers as
a data type. Instead, C# provides references and the ability to create objects that are
managed by a garbage collector. In the core C# language it is simply not possible to
have an uninitialised variable, a "dangling" pointer, or an expression that indexes an
array beyond its bounds. While practically every pointer type construct in C or C++
has a reference type counterpart in C#, nonetheless, there are situations where
access to pointer types becomes a necessity. In unsafe code it is possible to
declare and operate on pointers, to perform conversions between pointers and
integral types, to take the address of variables, and so forth. In some sense, writing
unsafe code is much like writing C code within a C# program.
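As a brief illustration of what such unsafe code looks like (a generic C# sketch, not tied to DirectShow, with hypothetical names), the following method increments every element of an array through a raw pointer; it must be compiled with the /unsafe compiler option.
static unsafe void IncrementAll(int[] values)
{
    // fixed pins the managed array so the garbage collector
    // cannot move it while the pointer is in use.
    fixed (int* p = values)
    {
        for (int i = 0; i < values.Length; i++)
            p[i] = p[i] + 1;   // pointer arithmetic, as in C
    }
}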
Nonetheless, for advanced DirectShow programming, that is, writing new filters, there is no alternative to C++. Not only is the SDK based on C++, but there are also no tutorials or samples that use any other language. Moreover, unmanaged C++ sometimes generates slightly faster code and is therefore more appropriate for the performance-dependent parts, such as filters.
B.3. Rewriting DirectShow interfaces for C#
To make the required functions of DirectShow accessible to C#, the necessary interfaces have to be rewritten. The starting point for this is the header file "strmif.h", located in the "/include" directory of the DirectX SDK.
The first important interface is "IFilterGraph"; it is needed to span a filter graph, add/remove filters and connect them.
In C++ the interface looks like this:
MIDL_INTERFACE("56a8689f-0ad4-11ce-b03a-0020af0ba770")
IFilterGraph : public IUnknown
{
public:
virtual HRESULT STDMETHODCALLTYPE AddFilter(
/* [in] */ IBaseFilter *pFilter,
/* [string][in] */ LPCWSTR pName) = 0;
virtual HRESULT STDMETHODCALLTYPE RemoveFilter(
/* [in] */ IBaseFilter *pFilter) = 0;
virtual HRESULT STDMETHODCALLTYPE EnumFilters(
/* [out] */ IEnumFilters **ppEnum) = 0;
virtual HRESULT STDMETHODCALLTYPE FindFilterByName(
/* [string][in] */ LPCWSTR pName,
/* [out] */ IBaseFilter **ppFilter) = 0;
virtual HRESULT STDMETHODCALLTYPE ConnectDirect(
/* [in] */ IPin *ppinOut,
/* [in] */ IPin *ppinIn,
/* [unique][in] */ const AM_MEDIA_TYPE *pmt) = 0;
virtual HRESULT STDMETHODCALLTYPE Reconnect(
/* [in] */ IPin *ppin) = 0;
virtual HRESULT STDMETHODCALLTYPE Disconnect(
/* [in] */ IPin *ppin) = 0;
virtual HRESULT STDMETHODCALLTYPE SetDefaultSyncSource( void) = 0;
}
Code sample 1: IFilterGraph interface in C++
In C# it has to look like this:
[ComVisible(true), ComImport,
Guid("56a8689f-0ad4-11ce-b03a-0020af0ba770"),
InterfaceType( ComInterfaceType.InterfaceIsIUnknown )]
public interface IFilterGraph
{
[PreserveSig]
int AddFilter(
[In] IBaseFilter pFilter,
[In, MarshalAs(UnmanagedType.LPWStr)] string pName );
[PreserveSig]
int RemoveFilter( [In] IBaseFilter pFilter );
[PreserveSig]
int EnumFilters( [Out] out IEnumFilters ppEnum );
[PreserveSig]
int FindFilterByName(
[In, MarshalAs(UnmanagedType.LPWStr)] string pName,
[Out] out IBaseFilter ppFilter );
[PreserveSig]
int ConnectDirect(
[In] IPin ppinOut,
[In] IPin ppinIn,
[In, MarshalAs(UnmanagedType.LPStruct)] AMMediaType pmt );
[PreserveSig]
int Reconnect( [In] IPin ppin );
[PreserveSig]
int Disconnect( [In] IPin ppin );
[PreserveSig]
int SetDefaultSyncSource();
}
Code sample 2: IFilterGraph interface in C#
As one can see, most of the definitions look very similar. There are only a few cases where the data types differ in such a way that some marshalling is needed, especially for pointers to pointers to structures.
Some of this basic work had already been done by someone else (CodeProject, [URL10]), so the basic structure for spanning a filter graph within C# could be obtained simply by reusing CodeProject's code. For the purposes of this thesis, many more functions for managing pins, media types and connections had to be added.
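One structure that needs such marshalling treatment is AM_MEDIA_TYPE, which appears as AMMediaType in the ConnectDirect declaration above. The following declaration is one possible C# counterpart mirroring the native field layout (using System.Runtime.InteropServices); the member names are a matter of choice and may differ from those used in the CodeProject code.
[StructLayout(LayoutKind.Sequential)]
public class AMMediaType
{
    public Guid majorType;            // e.g. MEDIATYPE_Video
    public Guid subType;              // e.g. MEDIASUBTYPE_RGB24
    [MarshalAs(UnmanagedType.Bool)]
    public bool fixedSizeSamples;
    [MarshalAs(UnmanagedType.Bool)]
    public bool temporalCompression;
    public int sampleSize;
    public Guid formatType;           // e.g. FORMAT_VideoInfo
    public IntPtr unkPtr;             // IUnknown*, normally unused
    public int formatSize;            // size of the format block in bytes
    public IntPtr formatPtr;          // pointer to the format block (e.g. VIDEOINFOHEADER)
}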
In the following example a filter graph will be set up which looks like this in GraphEdit:
Picture 44: Sample filter graph
B.4. Initiating the filter graph
As soon as the required interfaces are defined for C#, a filter graph can be set up. To do so, an instance of a filter graph object is created (just like with CoCreateInstance in C++) and then cast to additional interface types that extend the functionality for different purposes:
IGraphBuilder - provides facilities to create a graph for a specific file, and to
add/remove filters individually from the graph.
IMediaControl - provides methods to start and stop the filter graph.
IMediaEventEx - is used for setting up notifications that are sent to the application when things happen in the graph (end of media reached, buffer underruns, etc.)
IVideoWindow – sets properties on the video window. Applications can use it to set
the window owner, the position and dimensions of the window, and other properties.
The following code sets up the appropriate objects for spanning a filter graph:
Type comType = null;
object comObj = null;
try
{
    comType = Type.GetTypeFromCLSID( Clsid.FilterGraph );
    if( comType == null )
        throw new NotImplementedException( @"DShow Filter graph not registered!" );
    comObj = Activator.CreateInstance( comType );
    IGraphBuilder graphBuilder = (IGraphBuilder) comObj;
    comObj = null;
    Guid clsid = Clsid.CaptureGraphBuilder2;
    Guid riid = typeof(ICaptureGraphBuilder2).GUID;
    comObj = DsBugWO.CreateDsInstance( ref clsid, ref riid );
    ICaptureGraphBuilder2 capGraph = (ICaptureGraphBuilder2) comObj;
    comObj = null;
    IMediaControl mediaCtrl = (IMediaControl) graphBuilder;
    IVideoWindow videoWin = (IVideoWindow) graphBuilder;
    IMediaEventEx mediaEvt = (IMediaEventEx) graphBuilder;
}
catch( Exception ee )
{
    MessageBox.Show( this, "Could not get interfaces\r\n" + ee.Message,
        "DirectShow.NET", MessageBoxButtons.OK, MessageBoxIcon.Stop );
}
finally
{
    if( comObj != null )
        Marshal.ReleaseComObject( comObj ); comObj = null;
}
Code sample 3: initiating a filter graph in C#
B.5. Adding Filters to the Capture Graph
Before the needed filters can be added and connected, they have to be instantiated. There are two ways to achieve this: finding them either by their GUIDs or by name. The first way is exact, because a GUID is unambiguous; the second is sometimes more convenient, since the GUID does not have to be looked up, but there may be more than one filter with the same name (e.g. "Video Renderer").
Creating filters by GUID is short and simple, and a little helper function makes it even more convenient. It takes the GUID of the filter to be found and returns the filter object:
public int FindDeviceByID(System.Guid clsid,out IBaseFilter ofilter)
{
IntPtr ptrIf;
ofilter=null;
Guid ibasef =typeof(IBaseFilter).GUID;
int hr = DsBugWO.CoCreateInstance(
ref clsid, IntPtr.Zero, CLSCTX.Server, ref ibasef, out ptrIf );
if( (hr != 0) || (ptrIf == IntPtr.Zero) )
{
Marshal.ThrowExceptionForHR( hr );
return -1;
}
Guid iu = new Guid( "00000000-0000-0000-C000-000000000046" );
IntPtr ptrXX;
hr = Marshal.QueryInterface( ptrIf, ref iu, out ptrXX );
ofilter = (IBaseFilter) System.Runtime.Remoting.Services.
EnterpriseServicesHelper.WrapIUnknownWithComObject( ptrIf );
return Marshal.Release( ptrIf );
}
Code sample 4: helper function for searching a filter by GUID
Using this helper function, finding e.g. the Video Renderer (whose name is ambiguous on some systems) takes only a single line:
// Setup preview
int hr=FindDeviceByID(new Guid("70E102B0-5556-11CE-97C0-00AA0055595A"),
out vrendFilter);
Code sample 5: finding a filter by GUID
For all other filters, where only the name is known, the GUIDs can be looked up either in the registry or with the GraphEdit tool. Open the menu "Graph" and select "Insert Filters…" (or press Ctrl-F). There you can look up the required filter; most of them are stored in the "DirectShow" section. After choosing a filter, the corresponding information is displayed:
Picture 45: Filter selection dialog in GraphEdit
The Filter Moniker shows the GUID of the selected filter. The first part before the backslash identifies the filter category; "{083863F1-70DE-11D0-BD40-00A0C911CE86}" stands for the legacy filter category. The second part after the backslash is the GUID of the filter itself, in our example "70E102B0-5556-11CE-97C0-00AA0055595A".
Another way, as mentioned, is to enumerate the filters by name. This is sometimes more convenient but may take more time, especially if the filter category to be searched contains many filters, as the DirectShow category does.
For crawling through the list of filters, there is another convenient helper function:
public int FindDeviceByName(string name,System.Guid fclsid,out IBaseFilter ofilter)
{
int i,ok=-1;
ArrayList devices;
DsDevice dev=null;
object capObj=null;
ofilter=null;
DsDev.GetDevicesOfCat(fclsid, out devices);
for(i=0;i<devices.Count;i++)
{
dev=devices[i] as DsDevice;
Console.WriteLine(dev.Name);
if (dev.Name!=null)
{
if(dev.Name.Trim()==name) break;
}
}
if (i<devices.Count)
{
try
{
Guid gbf = typeof( IBaseFilter ).GUID;
dev.Mon.BindToObject( null, null, ref gbf, out capObj );
ofilter = (IBaseFilter) capObj;
capObj = null;
ok=0;
}
catch( Exception ee )
{
MessageBox.Show( this, "Could not create "+name+" device\r\n" +
ee.Message, "DirectShow.NET", MessageBoxButtons.OK,
MessageBoxIcon.Stop );
}
finally
{
if( capObj != null )
Marshal.ReleaseComObject( capObj ); capObj = null;
}
}
return ok;
}
Code sample 6: helper function for searching a filter by its name
So the creation of the remaining filters is also just a matter of a few lines of code:
hr=FindDeviceByName(cameraname,
Clsid.VideoInputDeviceCategory,out capFilter);
hr=FindDeviceByName("HNS Motion Detection",
Clsid.LegacyAmFilterCategory,out motdetFilter);
hr=FindDeviceByName("Smart Tee",
Clsid.LegacyAmFilterCategory,out smarttFilter);
hr=FindDeviceByName("HNS MPEG-2 Sender",
Clsid.LegacyAmFilterCategory,out nwrendFilter);
hr=FindDeviceByName("Microsoft MPEG-4 Video Codec V1",
Clsid.VideoCompressorCategory,out comprFilter);
Code sample 7: searching a filter by its name
All these enumerated filters have to be added to the filter graph before they can be
connected:
hr=graphBuilder.AddFilter(capFilter, "Camera" );
hr=graphBuilder.AddFilter(motdetFilter,"MotionDet");
hr=graphBuilder.AddFilter(comprFilter,"Compressor");
hr=graphBuilder.AddFilter(nwrendFilter,"Network Renderer");
hr=graphBuilder.AddFilter(vrendFilter,"Video Renderer");
hr=graphBuilder.AddFilter(smarttFilter,"Smart Tee");
Code sample 8: adding filters to the filter graph
B.6. Connecting Filters through Capture Graph
Now the filters can be connected through their input and output pins. These pins also have to be found first, which can be achieved with the following helper function:
public int GetUnconnectedPin(IBaseFilter filter,out IPin opin, PinDirection pindir)
{
int hr;
IEnumPins cinpins;
IPin pin=null,ptmp;
PinDirection pdir;
int i;
int ok=-1;
opin=null;
filter.EnumPins(out cinpins);
while(cinpins.Next(1,ref pin,out i)==0)
{
pin.QueryDirection(out pdir);
if (pdir==pindir)
{
hr=pin.ConnectedTo(out ptmp);
if (hr<0)
{
opin=pin;
ok=0;
break;
}
}
}
return ok;
}
Code sample 9: finding a filter’s pins helper function
The function takes the filter and the direction of the pins to be found; there are only input and output pins. Using this helper function makes finding the needed pins straightforward:
hr=GetUnconnectedPin(capFilter,out capout,PinDirection.Output);
hr=GetUnconnectedPin(motdetFilter,out mdin,PinDirection.Input);
hr=GetUnconnectedPin(motdetFilter,out mdout,PinDirection.Output);
hr=GetUnconnectedPin(smarttFilter,out stin,PinDirection.Input);
hr=GetUnconnectedPin(comprFilter,out comprin,PinDirection.Input);
hr=GetUnconnectedPin(comprFilter,out comprout,PinDirection.Output);
hr=GetUnconnectedPin(nwrendFilter,out nwrendin,PinDirection.Input);
hr=GetUnconnectedPin(vrendFilter,out vrendin,PinDirection.Input);
Code sample 10: get unconnected pins
If the names of the pins are known, they can also be retrieved by name directly from the filter. This is necessary when there are several pins of the same direction; in this case the Smart Tee has two output pins, one for capturing and one for previewing:
hr=smarttFilter.FindPin("Capture",out stoutcap);
hr=smarttFilter.FindPin("Preview",out stoutprev);
Code sample 11: finding pins by name
After defining all pins, they can be connected through the GraphBuilder:
hr=graphBuilder.Connect(capout,stin);
hr=graphBuilder.Connect(stoutcap,mdin);
hr=graphBuilder.Connect(mdout,comprin);
hr=graphBuilder.Connect(comprout,nwrendin);
hr=graphBuilder.Connect(stoutprev,vrendin);
Code sample 12: connecting pins
B.7. Running the Capture Graph
After the filter graph is defined and spanned, only the video output panel has to be set for the video renderer, and then the graph can be run.
try
{
// Set the video window to be a child of the main window
hr = videoWin.put_Owner( vidPanel.Handle );
if( hr < 0 )
Marshal.ThrowExceptionForHR( hr );
// Set video window style
hr = videoWin.put_WindowStyle( WS_CHILD | WS_CLIPCHILDREN );
if( hr < 0 )
Marshal.ThrowExceptionForHR( hr );
// position video window in client rect of owner window
Rectangle rc = vidPanel.ClientRectangle;
videoWin.SetWindowPosition( 0, 0, rc.Right, rc.Bottom );
// Make the video window visible, now that it is properly positioned
hr = videoWin.put_Visible( DsHlp.OATRUE );
if( hr < 0 )
Marshal.ThrowExceptionForHR( hr );
hr = mediaEvt.SetNotifyWindow( slaveform.Handle, WM_GRAPHNOTIFY,
IntPtr.Zero );
if( hr < 0 )
Marshal.ThrowExceptionForHR( hr );
}
catch( Exception ee )
{
MessageBox.Show( this, "Could not setup video window\r\n" + ee.Message,
"DirectShow.NET",
MessageBoxButtons.OK,
MessageBoxIcon.Stop );
}
// run it all
hr = mediaCtrl.Run();
if( hr < 0 )
Marshal.ThrowExceptionForHR( hr );
Code sample 13: running the filter graph
The filters are then started, and the media data runs through the graph until it is paused or stopped.
To stop the graph and clean up resources, the allocated COM objects have to be released:
public void CloseInterfaces()
{
int hr;
try
{
if( mediaCtrl != null )
{
hr = mediaCtrl.Stop();
mediaCtrl = null;
}
if( mediaEvt != null )
{
hr = mediaEvt.SetNotifyWindow( IntPtr.Zero, WM_GRAPHNOTIFY,
IntPtr.Zero );
mediaEvt = null;
}
if( videoWin != null )
{
hr = videoWin.put_Visible( DsHlp.OAFALSE );
hr = videoWin.put_Owner( IntPtr.Zero );
videoWin = null;
}
if( capGraph != null )
Marshal.ReleaseComObject( capGraph ); capGraph = null;
if( graphBuilder != null )
Marshal.ReleaseComObject( graphBuilder ); graphBuilder = null;
if( capFilter != null )
Marshal.ReleaseComObject( capFilter ); capFilter = null;
}
catch( Exception )
{}
}
Code sample 14: stopping and clearing the filter graph
B.8. Writing new DirectShow Filters
Although DirectShow comes with a rich set of ready-made filters, some tasks demand writing a new DirectShow filter. Because the initial setup of the code framework and the development environment (e.g. Visual Studio) depends on the required base classes, there is some preparation work to do. This can be simplified by using one of the existing filter creation wizards that can be found on the web ([URL13] [URL14]). Unfortunately none of them works with Visual Studio .NET 2003, so the only alternative to writing the whole framework by hand is to copy the DirectShow SDK sample project that fits the scheme best. This means renaming lots of definitions and possibly deleting some code and rewriting other parts. Filter development is covered further in the practical part of this thesis.
C. Sources for Appendix A and B
C.1. Books & Papers
[PES2003]
Mark D. Pesce; Programming Microsoft DirectShow for Digital Video
and Television, Microsoft Press; 2003
C.2. URLs
[URL1]
Microsoft; DirectShow Reference;
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/directshow/htm/directshowreference.asp; 2004
[URL5]
LeadTools; DirectShow;
http://www.leadtools.com/SDK/Multimedia/MultimediaDirectShow.htm; 2004
[URL13]
Yunqiang Chen; DirectShow Transform Filter AppWizard;
http://www.ifp.uiuc.edu/~chenyq/research/Utils/DShowFilterWiz/DShowFilterWiz.html; 2001
[URL14]
John McAleely; DXMFilter;
http://www.cs.technion.ac.il/Labs/Isl/DirectShow/dxmfilter.htm; 1998
[URL25]
Michael Blome; Core Media Technology in Windows XP Empowers You to Create Custom Audio/Video Processing Components;
http://msdn.microsoft.com/msdnmag/issues/02/07/DirectShow/default.aspx; 2002
[URL26]
Chris Thompson; DirectShow For Media Playback in Windows;
http://www.flipcode.com/articles/article_directshow01.shtml; 2000
Curriculum Vitae
Name: Erich Semlak
Born: 31.5.1972 in Wilmington (DE) / USA
Education
1978 - 1981   Primary School
1982          First Class, Academic Gymnasium Linz / Spittelwiese
1983 - 1989   2nd - 6th Class, Stiftsgymnasium Kremsmünster
1989 - 1991   Vocational School
1993 - 1996   Academy for Applied Computer Science at WIFI Linz
1996          Study permit exam
1996 - 2005   Computer Science Studies at Johannes Kepler University / Linz
Business Experience
1989               short experience as apprentice for organ building (2 months)
1989 - 1991        apprenticeship in merchandising at Ludwig Christ / Linz
1991 - 1994, 1996  continued at Ludwig Christ as employee
1995               Civil Service at the Red Cross
1996 - 2003        freelancer at the Ars Electronica Center / Linz
since 2003         several short contracts as freelancer for different companies