FORMAT INDEPENDENCE PROVISION OF
AUDIO AND VIDEO DATA IN MULTIMEDIA
DATABASE MANAGEMENT SYSTEMS
Der Technischen Fakultät der
Universität Erlangen-Nürnberg
zur Erlangung des Grades
DOKTOR–INGENIEUR
vorgelegt von
Maciej Suchomski
Erlangen – 2008
Als Dissertation genehmigt von
der Technischen Fakultät der
Universität Erlangen-Nürnberg
Tag der Einreichung: 13.05.2008
Tag der Promotion: 31.10.2008
Dekan: Prof. Dr.-Ing. habil. Johannes Huber
Berichterstatter: Prof. Dr.-Ing. Klaus Meyer-Wegener, Vizepräsident der FAU
Prof. Dr. Andreas Henrich
BEREITSTELLUNG DER
FORMATUNABHÄNGIGKEIT VON
AUDIO- UND VIDEODATEN IN
MULTIMEDIALEN
DATENBANKVERWALTUNGSSYSTEMEN
Der Technischen Fakultät der
Universität Erlangen-Nürnberg
zur Erlangung des Grades
DOKTOR–INGENIEUR
vorgelegt von
Maciej Suchomski
Erlangen – 2008
To My Beloved Parents
Dla Moich Kochanych Rodziców
ABSTRACT
Since the late 1990s there has been a noticeable revolution in the consumption of multimedia data, analogous to the electronic data processing revolution of the 1980s and 1990s. The multimedia revolution covers different aspects such as multimedia production, storage, and delivery, but it also triggers completely new solutions on the consumer market such as multifunction handheld devices, digital and Internet TV, or home cinemas. However, it also brings new problems. The variety of multimedia formats is an advantage on the one hand, but on the other it is one of these problems, since every consumer has to understand the data in a specific format in order to consume it.
Database management systems, in turn, have been responsible for providing data to consumers and applications regardless of format and storage characteristics. In the case of multimedia data, however, MMDBMSs have failed to provide data independence due to the complexity of the “translation process”, especially for continuous data such as audio and video. There are many reasons for this situation: the timing characteristics of continuous data (processing must be not only functionally correct but also temporally correct), the complexity of the conversion algorithms (especially compression), and the demand for processing resources that varies over time (due to its dependence on the content) and thus requires sophisticated resource-allocation algorithms.
This work proposes the conceptual model of the real-time audio-video conversion (RETAVIC) architecture in order to diminish the existing problems in the multimedia format translation process and thus to enable the provision of format independence for audio and video data. Data processing within the RETAVIC architecture is divided into four phases: capturing, analysis, storage, and delivery. The key assumption is meta-data-based real-time transcoding in the delivery phase, where quality-adaptive decoding and encoding take place employing the Hard-Real-Time Adaptive model. In addition, the Layered Lossless Video format (LLV1) has been implemented within this project, and an analysis of format-independence approaches and their support in current multimedia management systems has been conducted. A prototypical real-time implementation of the critical parts of the video transcoding chain provides the functional, quantitative, and qualitative evaluation.
KURZFASSUNG
Seit den späten 1990er Jahren gibt es eine wahrnehmbare Revolution im Konsumverhalten von
Multimediadaten, welche analog der Revolution der elektronischen Datenverarbeitung in 1980er
und 1990er Jahren ist. Diese Multimediarevolution umfasst verschiedene Aspekte wie
Multimediaproduktion, -speicherung und -verteilung, sie bedingt außerdem vollständig neue
Lösungen auf dem Absatzmarkt für Konsumgüter wie mobile Endgeräte, digitales und Internet-Fernsehen oder Heimkinosystemen. Sie ist jedoch ebenfalls Auslöser bis dato unbekannter
Probleme. Die Multimediaformatvielzahl ist einerseits ein Vorteil, auf der anderen Seite aber
eines dieser Probleme, da jeder Verbraucher die Daten in einem spezifischen Format
„verstehen“ muss, um sie konsumieren zu können.
Andererseits sind die Datenbankverwaltungssysteme aber auch dafür verantwortlich, dass die
Daten unabhängig von Format- und Speichereigenschaften für die Verbraucher und für die
Anwendungen zur Verfügung stehen. Im Falle der Multimediadaten jedoch haben die
MMDBVSe die Datenunabhängigkeit wegen der Komplexität „im Übersetzungsprozess“ nicht
zur Verfügung stellen können, insbesondere wenn es sich um kontinuierliche Datenströme wie
Audiodaten und Videodaten handelt. Es gibt viele Gründe solcher Phänomene, die
Zeiteigenschaften von den kontinuierlichen Daten (die Verarbeitung entsprechend der
Funktionskorrektheit aber auch entsprechend der Zeitkorrektheit), die Komplexität der
Umwandlungsalgorithmen (insbesondere jene der Kompression) und die Anforderungen an die
Verarbeitungsressourcen, die in der Zeit schwanken (wegen der Inhaltsabhängigkeit), die daher
anspruchsvolle Ressourcenzuweisungsalgorithmen erforderlich machen.
Die vorliegende Arbeit konzentriert sich auf einen Vorschlag des Begriffsmodells der
Echtzeitumwandlungsarchitektur der Audio- und Videodaten (RETAVIC), um vorhandene
Probleme im Multimediaformat-Übersetzungsprozess zu mindern und somit die Bereitstellung
der Formatunabhängigkeit von Audio- und Videodaten zu erlauben. Die Datenverarbeitung
innerhalb der RETAVIC-Architektur ist in vier Phasen unterteilt worden: Erfassung, Analyse, Speicherung und Anlieferung. Die Haupthypothese ist die metadaten-bezogene Echtzeittranskodierung in der Anlieferungsphase, in der die qualitätsanpassungsfähige Decodierung und Enkodierung mit dem Einsatz des „Hard-Real-Time Adaptive (Hart-Echtzeit-Adaptiv-)-Modells“ auftritt. Außerdem ist das „Layered Lossless Video Format“ (Geschichtetes Verlustfreies Videoformat) innerhalb dieses Projektes implementiert worden, eine Analyse der Formatunabhängigkeitsansätze sowie der -unterstützung in den gegenwärtigen Multimedia-Managementsystemen wurde geführt. Die prototypische Echtzeit-Implementierung der kritischen Teile der Transkodierungskette für Video ermöglicht die funktionale, quantitative und qualitative Auswertung.
ACKNOWLEDGEMENTS
First and foremost, I would like to thank my supervisor, Prof. Klaus Meyer-Wegener. It was a great pleasure to work under his supervision. He always found time for me and for stimulating discussions. His advice and suggestions at every crossroads allowed me to choose the correct path and bring my research forward, keeping me on track right to the end of the road. His great patience, tolerance, and understanding helped me conduct the research and test new ideas. His great wisdom and active support are beyond doubt. Prof. Meyer-Wegener spent not only days but also nights co-authoring the papers published during the research for this work. Without him, neither the beginning nor the completion of this thesis would have been possible.
Next, I would like to express my gratitude to Prof. Hartmut Wedekind for his great advice, for the spiritual experiences we shared during our stay in the Sudety Mountains, and for accepting the chairman position during the viva voce examination. I also want to thank Prof. Andreas Henrich for many fruitful discussions during the workshops of the MMIS Group. Moreover, I am very happy that Prof. Henrich committed himself to being the reviewer of my dissertation, and I will never forget these efforts. Subsequently, I give my great thanks to Prof. André Kaup for his participation in the PhD examination procedure.
My appreciation for the enjoyment derived from the cooperative and personal work, from the meetings “in the kitchen”, and from the funny everyday situations goes to all my colleagues at the chair. In particular, I would like to thank a few of them. First, I want to give my thanks to our ladies: to Mrs. Knechtel for her organizational help and warm welcome at the university, and to Mrs. Braun and Mrs. Stoyan for keeping the hardware infrastructure up and running, allowing me to work without additional worries. My appreciation also goes to Prof. Jablonski for much scholarly advice and for the smooth and unproblematic collaboration during the preparation of the exercises. I give my great thanks to Dr. Marcus Meyerhöfer – I have really enjoyed the time with you, not only in the shared office but also outside the university, and no time spent together will be forgotten. Finally, I would like to thank my other colleagues: Michael Daum, Dr. Ilia Petrov, Dr. Rainer Lay, Dr. Sascha Müller, Dr. Udo Mayer, Florian Irmert and Robert Nagy. They willingly spent time with me outside the office as well and brought me closer not only to German culture but also to the nightlife fun.
I am also grateful to all the students I supervised, who supported the RETAVIC project with their study projects and master theses. Their contributions, including among other things discussions on architectural issues, writing code, benchmarking, and evaluation, allowed the concepts to be refined and the doubts to be clarified. Especially the best-effort converter prototypes and their subsequent real-time implementations have made proving the assumed hypotheses possible – great thanks to my developers and active discussion partners.
I must express my gratitude to my dear brother, Paweł. We are two people but one heart, blood, soul and mind. His helpfulness and kindness in supporting my requests will never be forgotten. He helped me a lot with the organizational aspects of everyday life and was my reliable representative in many important matters in Wrocław, Poland, during my stay in Erlangen. I cannot say how much I owe him.
I would like to thank my beloved wife, Magda, for her spiritual support and love, for the big and small things day by day, for always being with me in good and bad times, for tolerating my bad moods, and for her patience and understanding. Her entertainment and amusement made me very happy even in times of pressure. I have managed to finish this work thanks to her.
Finally, I am sincerely thankful to my dear parents for their advice, continuous support, and love. It is a great honor, and at the same time a great comfort, to have such loving parents and to have the privilege of drawing on their wisdom and experience. To them I dedicate this work entirely.
CONTENTS
ABSTRACT .....................................................................................................................................................I
KURZFASSUNG .......................................................................................................................................... III
ACKNOWLEDGEMENTS ............................................................................................................................ V
CONTENTS................................................................................................................................................ VII
LIST OF FIGURES.................................................................................................................................... XIII
LIST OF TABLES ..................................................................................................................................... XIX
INDEX OF EQUATIONS ......................................................................................................................... XXI
CHAPTER 1 – INTRODUCTION................................................................ 1
I. INTRODUCTION ........ 1
I.1. EDP Revolution ........ 2
I.2. Digital Multimedia – New Hype and New Problems ........ 3
I.3. Data Independence ........ 6
I.4. Format Independence ........ 9
I.5. AV Conversion Problems ........ 11
I.6. Assumptions and Limitations ........ 11
I.7. Contribution of the RETAVIC Project ........ 12
I.8. Thesis Outline ........ 13
CHAPTER 2 – RELATED WORK .............................................................. 15
II. FUNDAMENTALS AND FRAMEWORKS ........ 15
II.1. Terms and Definitions ........ 16
II.2. Multimedia data delivery ........ 17
II.3. Approaches to Format Independence ........ 18
II.3.1. Redundancy Approach ........ 19
II.3.2. Adaptation Approach ........ 20
II.3.3. Transcoding Approach ........ 22
II.3.3.1 Cascaded transcoding ........ 24
II.3.3.2 MD-based transcoding ........ 25
II.4. Video and Audio Transformation Frameworks ........ 27
II.4.1. Converters and Converter Graphs ........ 28
II.4.1.1 Well-known implementations ........ 29
II.4.2. End-to-End Adaptation and Transcoding Systems ........ 31
II.4.3. Summary of the Related Transformation Frameworks ........ 34
II.5. Format Independence in Multimedia Management Systems ........ 34
II.5.1. MMDBMS ........ 35
II.5.2. Media (Streaming) Servers ........ 39
III. FORMATS AND RT/OS ISSUES ........ 40
III.1. Storage Formats ........ 40
III.1.1. Video Data ........ 40
III.1.1.1 Scalable codecs ........ 41
III.1.1.2 Lossless codecs ........ 42
III.1.1.3 Lossless and scalable codecs ........ 42
III.1.2. Audio Data ........ 43
III.2. Real-Time Issues in Operating Systems ........ 45
III.2.1. OS Kernel – Execution Modes, Architectures and IPC ........ 46
III.2.2. Real-time Processing Models ........ 47
III.2.2.1 Best-effort, soft- and hard-real-time ........ 47
III.2.2.2 Imprecise computations ........ 48
III.2.3. Scheduling Algorithms and QAS ........ 48
CHAPTER 3 – DESIGN .............................................................................. 51
IV. SYSTEM DESIGN ........ 51
IV.1. Architecture Requirements ........ 52
IV.2. General Idea ........ 53
IV.2.1. Real-time Capturing ........ 56
IV.2.1.1 Grabbing techniques ........ 56
IV.2.1.2 Throughput and storage requirements of uncompressed media data ........ 58
IV.2.1.3 Fast and simple lossless encoding ........ 61
IV.2.1.4 Media buffer as temporal storage ........ 64
Different hardware storage solutions and offered throughput ........ 64
Evaluation of storage solutions in context of RETAVIC ........ 68
IV.2.2. Non-real-time Preparation ........ 69
IV.2.2.1 Archiving origin source ........ 70
IV.2.2.2 Conversion to internal format ........ 71
IV.2.2.3 Content analysis ........ 72
IV.2.3. Storage ........ 72
IV.2.3.1 Lossless scalable binary stream ........ 73
IV.2.3.2 Meta data ........ 75
IV.2.4. Real-time Delivery ........ 76
IV.2.4.1 Real-time transcoding ........ 76
Real-time decoding ........ 77
Real-time processing ........ 77
Real-time encoding ........ 78
IV.2.4.2 Direct delivery ........ 80
IV.3. Evaluation of the General Idea ........ 81
IV.3.1. Insert and Update Delay ........ 82
IV.3.2. Architecture Independence of the Internal Storage Format ........ 83
IV.3.2.1 Correlation between modules in different phases related to internal format ........ 84
IV.3.2.2 Procedure of internal format replacement ........ 85
IV.3.2.3 Evaluation of storage format independence ........ 86
V. VIDEO PROCESSING MODEL ........ 87
V.1. Analysis of the Video Codec Representatives ........ 87
V.2. Assumptions for the Processing Model ........ 92
V.3. Video-Related Static MD ........ 94
V.4. LLV1 as Video Internal Format ........ 99
V.4.1. LLV1 Algorithm – Encoding and Decoding ........ 100
V.4.2. Mathematical Refinement of Formulas ........ 102
V.4.3. Compression efficiency ........ 104
V.5. Video Processing Supporting Real-time Execution ........ 105
V.5.1. MD-based Decoding ........ 105
V.5.2. MD-based Encoding ........ 106
V.5.2.1 MPEG-4 standard as representative ........ 106
V.5.2.2 Continuous MD set for video encoding ........ 108
V.5.2.3 MD-XVID as proof of concept ........ 111
V.6. Evaluation of the Video Processing Model through Best-Effort Prototypes ........ 114
V.6.1. MD Overheads ........ 114
V.6.2. Data Layering and Scalability of the Quality ........ 116
V.6.3. Processing Scalability in the Decoder ........ 122
V.6.4. Influence of Continuous MD on the Data Quality in Encoding ........ 126
V.6.5. Influence of Continuous MD on the Processing Complexity ........ 129
V.6.6. Evaluation of Complete MD-Based Video Transcoding Chain ........ 131
VI. AUDIO PROCESSING MODEL ........ 134
VI.1. Analysis of the Audio Encoders Representatives ........ 134
VI.2. Assumptions for the Processing Model ........ 138
VI.3. Audio-Related Static MD ........ 140
VI.4. MPEG-4 SLS as Audio Internal Format ........ 142
VI.4.1. MPEG-4 SLS Algorithm – Encoding and Decoding ........ 143
VI.4.2. AAC Core ........ 144
VI.4.3. HD-AAC / SLS Extension ........ 145
VI.4.3.1 Integer Modified Discrete Cosine Transform ........ 147
VI.4.3.2 Error mapping ........ 151
VI.5. Audio Processing Supporting Real-time Execution ........ 151
VI.5.1. MD-based Decoding ........ 151
VI.5.2. MD-based Encoding ........ 153
VI.5.2.1 MPEG-4 standard as representative ........ 153
Generalization of Perceptual Audio Coding Algorithms ........ 153
MPEG-1 Layer 3 and MPEG-2/4 AAC ........ 154
VI.5.2.2 Continuous MD set for audio coding ........ 156
VI.6. Evaluation of the Audio Processing Model through Best-Effort Prototypes ........ 158
VI.6.1. MPEG-4 SLS Scalability in Data Quality ........ 158
VI.6.2. MPEG-4 SLS Processing Scalability ........ 159
VII. REAL-TIME PROCESSING MODEL ........ 161
VII.1. Modeling of Continuous Multimedia Transformation ........ 161
VII.1.1. Converter, Conversion Chains and Conversion Graphs ........ 161
VII.1.2. Buffers in the Multimedia Conversion Process ........ 164
Jitter-constrained periodic stream ........ 164
Leading time and buffer size calculations ........ 165
M:N data stream conversion ........ 165
VII.1.3. Data Dependency in the Converter ........ 166
VII.1.4. Data Processing Dependency in the Conversion Graph ........ 167
VII.1.5. Problem with JCPS in Graph Scheduling ........ 168
VII.1.6. Operations on Media Streams ........ 173
Media integration (multiplexing) ........ 173
Media demuxing ........ 173
Media replication ........ 173
VII.1.7. Media data synchronization ........ 173
VII.2. Real-time Issues in Context of Multimedia Processing ........ 175
VII.2.1. Remarks on JCPS – Data Influence on the Converter Description ........ 175
VII.2.2. Hard Real-time Adaptive Model of Media Converters ........ 176
VII.2.3. Dresden Real-time Operating System as RTE for RETAVIC ........ 178
Architecture ........ 178
Scheduling ........ 180
Real-time thread model ........ 180
VII.2.4. DROPS Streaming Interface ........ 182
VII.2.5. Controlling the Multimedia Conversion ........ 184
VII.2.5.1 Generalized control flow in the converter ........ 184
VII.2.5.2 Scheduling of the conversion graph ........ 185
Construct the conversion graph ........ 185
Predict quant data volume ........ 187
Calculate bandwidth ........ 187
Check and allocate the resources ........ 187
VII.2.5.3 Converter’s time prediction through division on function blocks ........ 189
VII.2.5.4 Adaptation in processing ........ 190
VII.2.6. The Component Streaming Interface ........ 190
VII.3. Design of Real-Time Converters ........ 192
VII.3.1. Platform-Specific Factors ........ 193
VII.3.1.1 Hardware architecture influence ........ 193
VII.3.1.2 Compiler effects on the processing time ........ 195
VII.3.1.3 Thread models – priorities, multitasking and caching ........ 197
VII.3.2. Timeslices in HRTA Converter Model ........ 199
VII.3.3. Precise time prediction ........ 200
VII.3.3.1 Frame-based prediction ........ 201
VII.3.3.2 MB-based prediction ........ 204
VII.3.3.3 MV-based prediction ........ 210
VII.3.3.4 The compiler-related time correction ........ 215
VII.3.3.5 Conclusions to precise time prediction ........ 216
VII.3.4. Mapping of MD-LLV1 Decoder to HRTA Converter Model ........ 216
VII.3.5. Mapping of MD-XVID Encoder to HRTA Converter Model ........ 218
VII.3.5.1 Simplification in time prediction ........ 218
VII.3.5.2 Division of encoding time according to HRTA ........ 219
CHAPTER 4 – IMPLEMENTATION ...................................................... 223
VIII. CORE OF THE RETAVIC ARCHITECTURE ........ 223
VIII.1. Implemented Best-effort Prototypes ........ 223
VIII.2. Implemented Real-time Prototypes ........ 224
IX. REAL-TIME PROCESSING IN DROPS ........ 226
IX.1. Issues of Source Code Porting to DROPS ........ 226
IX.2. Process Flow in the Real-Time Converter ........ 229
IX.3. RT-MD-LLV1 Decoder ........ 231
IX.3.1. Setting-up Real-Time Mode ........ 231
IX.3.2. Preempter Definition ........ 232
IX.3.3. MB-based Adaptive Processing ........ 233
IX.3.4. Decoder’s Real-Time Loop ........ 234
IX.4. RT-MD-XVID Encoder ........ 235
IX.4.1. Setting-up Real-Time Mode ........ 236
IX.4.2. Preempter Definition ........ 237
IX.4.3. MB-based Adaptive Processing ........ 238
IX.4.4. Encoder’s Real-Time Loop ........ 239
CHAPTER 5 – EVALUATION AND APPLICATION .............................241
X. EXPERIMENTAL MEASUREMENTS ........ 241
X.1. The Evaluation Process ........ 242
X.2. Measurement Accuracy – Low-level Test Bed Assumptions ........ 243
X.2.1. Impact Factors ........ 243
X.2.2. Measuring Disruptions Caused by Impact Factors ........ 245
X.2.2.1 Deviations in CPU Cycles Frequency ........ 245
X.2.2.2 Deviations in the Transcoding Time ........ 246
X.2.3. Accuracy and Errors – Summary ........ 248
X.3. Evaluation of RT-MD-LLV1 ........ 249
X.3.1. Checking Functional Consistency with MD-LLV1 ........ 249
X.3.2. Learning Phase for RT Mode ........ 250
X.3.3. Real-time Working Mode ........ 253
X.4. Evaluation of RT-MD-XVID ........ 255
X.4.1. Learning Phase for RT-Mode ........ 255
X.4.2. Real-time Working Mode ........ 258
XI. COROLLARIES AND CONSEQUENCES ........ 264
XI.1. Objective Selection of Application Approach based on Transcoding Costs ........ 264
XI.2. Application Fields ........ 265
XI.3. Variations of the RETAVIC Architecture ........ 267
CHAPTER 6 – SUMMARY ........................................................................ 269
XII. CONCLUSIONS ..............................................................................................................................269
XIII. FURTHER WORK ..........................................................................................................................271
APPENDIX A ............................................................................................. 275
XIV. GLOSSARY OF DEFINITIONS ......................................................................................................275
XIV.1. Data-related Terms ...............................................................................................................275
XIV.2. Processing-related Terms ........................................................................................................277
XIV.3. Quality-related Terms............................................................................................................280
APPENDIX B ..............................................................................................281
XV. DETAILED ALGORITHM FOR LLV1 FORMAT ..........................................................................281
XV.1. The LLV1 decoding algorithm..............................................................................................281
APPENDIX C ............................................................................................. 285
XVI. COMPARISON OF MPEG-4 AND H.263 STANDARDS..............................................................285
XVI.1. Algorithmic differences and similarities...................................................................................285
XVI.2. Application-oriented comparison ............................................................................................288
XVI.3. Implementation analysis.........................................................................................................291
XVI.3.1. MPEG-4 ........ 292
XVI.3.2. H.263 ........ 292
APPENDIX D............................................................................................. 295
XVII. LOADING CONTINUOUS METADATA INTO ENCODER ..........................................................295
APPENDIX E ............................................................................................. 297
XVIII. TEST BED ...............................................................................................................................297
XVIII.1. Non-real-time processing of high load .....................................................................................297
XVIII.2. Imprecise measurements in non-real-time.................................................................................300
XVIII.3. Precise measurements in DROPS ..........................................................................................301
APPENDIX F ............................................................................................. 303
XIX. STATIC META-DATA FOR FEW VIDEO SEQUENCES ...............................................................303
XIX.1. Frame-based static MD.........................................................................................................303
XIX.2. MB-based static MD ............................................................................................................304
XIX.3. MV-based static MD ...........................................................................................................306
XIX.3.1. Graphs with absolute values .................................................................................................. 306
XIX.3.2. Distribution graphs.................................................................................................................. 307
APPENDIX G ..............................................................................................311
XX. FULL LISTING OF IMPORTANT REAL-TIME FUNCTIONS IN RT-MD-LLV1........................311
XX.1. Function preempter_thread()..................................................................................................311
XX.2. Function load_allocation_params()........................................................................................312
XXI. FULL LISTING OF IMPORTANT REAL-TIME FUNCTIONS IN RT-MD-XVID.......................313
XXI.1. Function preempter_thread()..................................................................................................313
APPENDIX H..............................................................................................315
XXII.MPEG-4 AUDIO TOOLS AND PROFILES ..................................................................................315
XXIII. MPEG-4 SLS ENHANCEMENTS ..........................................................................................317
XXIII.1. Investigated versions - origin and enhancements .......................................................................317
XXIII.2. Measurements........................................................................................................................317
XXIII.3. Overall Final Improvement....................................................................................................318
BIBLIOGRAPHY ........................................................................................321
LIST OF FIGURES
Number   Description   Page
Figure 1. Digital Item Adaptation architecture [Vetro, 2004]. ....................................................... 23
Figure 2. Bitstream syntax description adaptation architecture [Vetro, 2004]. ............................ 24
Figure 3. Adaptive transcoding system using meta-data [Vetro, 2001]......................................... 31
Figure 4. Generic real-time media transformation framework supporting format
independence in multimedia servers and database management systems.
Remark: dotted lines refer to optional parts that may be skipped within a phase.......................... 54
Figure 5. Comparison of compression size and decoding speed of lossless audio
codecs [Suchomski et al., 2006].......................................................................................... 63
Figure 6. Location of the network determines the storage model [Auspex, 2000]. .................... 66
Figure 7. Correlation between real-time decoding and conversion to internal format............... 84
Figure 8. Encoding time per frame for various codecs................................................................... 88
Figure 9. Average encoding time per frame for various codecs..................................................... 89
Figure 10. Example distribution of time spent on different parts in the XVID encoding
process for Clip no 2. ............................................................................................................ 90
Figure 11. Example distribution of time spent on different parts in the FFMPEG
MPEG-4 encoding process for Clip no 2. ......................................................................... 91
Figure 12. Initial static MD set focusing on the video data.............................................................. 98
Figure 13. Simplified LLV1 algorithm: a) encoding and b) decoding. .......................................... 101
Figure 14. Quantization layering in the frequency domain in the LLV1 format......................... 104
Figure 15. Compressed file-size comparison normalized to LLV1 ([Militzer et al.,
2005]).................................................................................................................................... 105
Figure 16. DCT-based video coding of: a) intra-frames, and b) inter-frames.............................. 108
Figure 17. XVID Encoder: a) standard implementation and b) meta-data based
implementation................................................................................................................... 113
Figure 18. Continuous Meta-Data: a) overhead cost related to LLV1 Base Layer for
tested sequences and b) distribution of given costs. ..................................................... 115
Figure 19. Size of LLV1 compressed output: a) cumulated by layers and b) percentage
of each layer. ....................................................................................................................... 118
Figure 20. Relation of LLV1 layers to origin uncompressed video in YUV format................... 119
Figure 21. Distribution of layers in LLV1-coded sequences showing average with
deviation. ............................................................................................................................. 120
Figure 22. Picture quality for different quality layers for Paris (CIF) [Militzer et al.,
2005]..................................................................................................................................... 122
Figure 23. Picture quality for different quality layers for Mobile (CIF) [Militzer et al.,
2005]..................................................................................................................................... 122
Figure 24. LLV1 decoding time per frame of the Mobile (CIF) considering various data
layers [Militzer et al., 2005]................................................................................................ 123
Figure 25. LLV1 vs. Kakadu – the decoding time measured multiple times and the
average. ................................................................................................................................ 126
Figure 26. Quality value (PSNR in dB) vs. output size of compressed Container (QCIF)
sequence [Militzer, 2004]................................................................................................... 127
Figure 27. R-D curves for Tempete (CIF) and Salesman (QCIF) sequences [Suchomski
et al., 2005]........................................................................................................................... 128
Figure 28. Speed-up effect of applying continuous MD for various bit rates
[Suchomski et al., 2005]..................................................................................................... 129
Figure 29. Smoothing effect on the processing time by exploiting continuous MD
[Suchomski et al., 2005]..................................................................................................... 130
Figure 30. Video transcoding scenario from internal LLV1 format to MPEG-4 SP
compatible (using MD-XVID): a) usual real-world case and b) simplified
investigated case. ................................................................................................................ 131
Figure 31. Execution time of the various data quality requirements according to the
simplified scenario.............................................................................................................. 132
Figure 32. OggEnc and FAAC encoding behavior of the silence.wav (based on
[Penzkofer, 2006]). ............................................................................................................. 135
Figure 33. Behavior of the Lame encoder for all three test audio sequences (based on
[Penzkofer, 2006]). ............................................................................................................. 136
Figure 34. OggEnc and FAAC encoding behavior of the male_speech.wav (based on
[Penzkofer, 2006]). ............................................................................................................. 136
Figure 35. OggEnc and FAAC encoding behavior of the go4_30.wav (based on
[Penzkofer, 2006]). ............................................................................................................. 137
Figure 36. Comparison of the complete encoding time of the analyzed codecs......................... 138
Figure 37. Initial static MD set focusing on the audio data............................................................ 141
Figure 38. Overview of simplified SLS encoding algorithm: a) with AAC-based core
[Geiger et al., 2006] and b) without core. ....................................................................... 143
Figure 39. Structure of HD-AAC coder [Geiger et al., 2006]: a) encoder and b)
decoder................................................................................................................................. 146
Figure 40. Decomposition of MDCT. ............................................................................................... 148
Figure 41. Overlapping of blocks. ...................................................................................................... 148
Figure 42. Decomposition of MDCT by Windowing, TDAC and DCT-IV [Geiger et
al., 2001]............................................................................................................................... 150
Figure 43. Givens rotation and its decomposition into three lifting steps [Geiger et al.,
2001]..................................................................................................................................... 150
Figure 44. General perceptual coding algorithm [Kahrs and Brandenburg, 1998]: a)
encoder and b) decoder..................................................................................................... 153
Figure 45. MPEG Layer 3 encoding algorithm [Kahrs and Brandenburg, 1998]........................ 155
Figure 46. AAC encoding algorithm [Kahrs and Brandenburg, 1998].......................................... 155
Figure 47. Gain of ODG with scalability [Suchomski et al., 2006]. .............................................. 159
Figure 48. Decoding speed of SLS version of SQAM with truncated enhancement
stream [Suchomski et al., 2006]. ....................................................................................... 160
Figure 49. Converter model – a black-box representation of the converter (based on
[Schmidt et al., 2003; Suchomski et al., 2004])............................................................... 162
Figure 50. Converter graph: a) meta-model, b) model and c) instance examples........................ 163
Figure 51. Simple transcoding used for measuring times and data amounts on best-effort OS with exclusive execution mode....................................................... 170
Figure 52. Execution time of simple transcoding: a) per frame for each chain element
and b) per chain element for total transcoding time..................................................... 170
Figure 53. Cumulated time of source period and real processing time......................................... 175
Figure 54. Cumulated size of source and encoded data. ................................................................. 176
Figure 55. DROPS Architecture [Reuther et al., 2006]. .................................................................. 179
Figure 56. Reserved context and real events for one periodic thread. .......................................... 181
Figure 57. Communication in DROPS between kernel and application threads (incl.
scheduling context). ...........................................................................................................182
Figure 58. DSI application model (based on [Reuther et al., 2006]).............................................. 183
Figure 59. Generalized control flow of the converter [Schmidt et al., 2003]. .............................. 185
Figure 60. Scheduling of the conversion graph [Märcz et al., 2003].............................................. 186
Figure 61. Simplified OO-model of the CSI [Schmidt et al., 2003]............................................... 191
Figure 62. Application model using CSI: a) chain of CSI converters and b) the details
of control application and converter [Schmidt et al., 2003]......................................... 191
Figure 63. Encoding time of the simple benchmark on different platforms. .............................. 194
Figure 64. Proposed machine index based on simple multimedia benchmark............................ 195
Figure 65. Compiler effects on execution time for media encoding using MD-XVID.............. 196
Figure 66. Preemptive task switching effect (based on [Mielimonka, 2006])............................... 198
Figure 67. Timeslice allocation scheme in the proposed HRTA thread model of the
converter.............................................................................................................................. 199
Figure 68. Normalized average LLV1 decoding time counted per frame type for each
sequence............................................................................................................................... 201
Figure 69. MD-XVID encoding time of different frame types for representative
number of frames in Container QCIF. .............................................................................. 202
Figure 70. Difference between measured and predicted time........................................................ 203
Figure 71. Average of measured time and predicted time per frame type.................................... 204
Figure 72. MB-specific encoding time using MD-XVID for Carphone QCIF. ............................. 205
Figure 73. Distribution of different types of MBs per frame in the sequences: a)
Carphone QCIF and b) Coastguard CIF (no B-frames). ................................................... 206
Figure 74. Cumulated processing time along the execution progress for the MD-XVID
encoding (based on [Mielimonka, 2006])........................................................................ 207
Figure 75. Average coding time partitioning in respect to the given frame type (based
on [Mielimonka, 2006])...................................................................................................... 208
Figure 76. Measured and predicted values for MD-XVID encoding of Carphone QCIF............ 209
Figure 77. Prediction error of MB-based estimation function in comparison to
measured values. ................................................................................................................. 210
Figure 78. Distribution of MV-types per frame in the Carphone QCIF sequence. ....................... 211
Figure 79. Sum of motion vectors per frame and MV type in the static MD for
Carphone QCIF sequence: a) with no B-frames and b) with B-frames. ....................... 211
Figure 80. MD-XVID encoding time of MV-type-specific functional blocks per frame
for Carphone QCIF (96). ..................................................................................................... 212
Figure 81. Average encoding time measured per MB using the given MV-type. ........................ 213
Figure 82. MV-based predicted and measured encoding time for Carphone QCIF (no B-frames). ................................................................................ 214
Figure 83. Prediction error of MB-based estimation function in comparison to
measured values. ................................................................................................................. 215
Figure 84. Mapping of MD-XVID to HRTA converter model..................................................... 221
Figure 85. Process flow in the real-time converter. ......................................................................... 230
Figure 86. Setting up real-time mode (based on [Mielimonka, 2006; Wittmann, 2005])............ 232
Figure 87. Decoder’s preempter thread accompanying the processing main thread
(based on [Wittmann, 2005]). ........................................................................................... 233
Figure 88. Timeslice overrun handling during the processing of enhancement layer
(based on [Wittmann, 2005]). ........................................................................................... 234
Figure 89. Real-time periodic loop in the RT-MD-LLV1 decoder................................................ 235
Figure 90. Encoder’s preempter thread accompanying the processing main thread. ................. 237
Figure 91. Controlling the MB-loop in real-time mode during the processing of
enhancement layer..............................................................................................................238
Figure 92. Logarithmic time scale of computer events [Bryant and O'Hallaron, 2003]. ............ 244
Figure 93. CPU frequency measurements in kHz for: a) AMD Athlon 1800+ and b)
Intel Pentium Mobile 2 GHz.............................. 245
Figure 94. Frame processing time per timeslice type depending on the quality level for
Container CIF (based on [Wittmann, 2005]). ................................................................ 249
Figure 95. Normalized average time per MB for each frame consumed in the base
timeslice (based on [Wittmann, 2005])............................................................................ 251
Figure 96. Normalized average time per MB for each frame consumed in the
enhancement timeslice (based on [Wittmann, 2005]). .................................................. 251
Figure 97. Normalized average time per MB for each frame consumed in the cleanup
timeslice (based on [Wittmann, 2005])............................................................................ 252
Figure 98. Percentage of decoded MBs for enhancement layers for Mobile CIF with
increasing framerate (based on [Wittmann, 2005])........................................................ 253
Figure 99. Percentage of decoded MBs for enhancement layers for Container QCIF
with increasing framerate (based on [Wittmann, 2005])............................................... 254
Figure 100. Percentage of decoded MBs for enhancement layers for Parkrun ITU601 with increasing framerate (based on [Wittmann, 2005])................................ 254
Figure 101. Encoding time per frame of various videos for RT-MD-XVID: a) average and b) deviation.................................................................................................... 256
Figure 102. Worst-case encoding time per frame and achieved average quality vs. requested Lowest Quality Acceptable (LQA) for Carphone QCIF............................... 257
Figure 103. Time-constrained RT-MD-XVID encoding for Mobile QCIF and Carphone QCIF..................................................................................................................... 260
Figure 104. Time-constrained RT-MD-XVID encoding for Mobile CIFN and Coastguard CIF. ................................................................................................................... 261
Figure 105. Time-constrained RT-MD-XVID encoding for Carphone QCIF with B-frames. ................................................................................................................................. 262
Figure 106. Newly proposed temporal layering in the LLV1 format...................................... 272
Figure 107. LLV1 decoding algorithm......................................................................................... 282
Figure 108. Graph of ranges – quality vs. bandwidth requirements ....................................... 291
Figure 109. Distribution of frame types within the used set of video sequences. ................ 304
Figure 110. Percentage of the total gained time between the original code version and the final version [Wendelska, 2007]. ........................................................................ 319
LIST OF TABLES
Number    Description    Page
Table 1.  Throughput and storage requirement for few digital cameras from different classes. .................................................................................................................... 59
Table 2.  Throughput and storage requirements for few video standards. ................................. 60
Table 3.  Throughput and storage requirements for audio data. .................................................. 61
Table 4.  Hardware solutions for the media buffer. ....................................................................... 65
Table 5.  Processing time consumed and amount of data produced by the example transcoding chain for Mobile (CIF) video sequence..................................................... 172
Table 6.  The JCPS calculated for the respective elements of the conversion graph from the Table 5................................................................................................................. 172
Table 7.  JCPS time and size for the LLV1 encoder. .................................................................... 175
Table 8.  Command line arguments for setting up timing parameters of the real-time thread (based on [Mielimonka, 2006])............................................................................. 236
Table 9.  Deviations in the frame encoding time of the MD-XVID in DROPS caused by microscopic factors (based on [Mielimonka, 2006])................................................ 247
Table 10. Time per MB for each sequence: average for all frames, maximum for all
frames, and the difference (max-avg) in relation to the average (based on
[Wittmann, 2005])............................................................................................................... 252
Table 11. Configuration of the MultiMonster cluster. ................................................................... 298
Table 12. The hardware configuration for the queen-bee server. ................................................ 298
Table 13. The hardware configuration for the bee-machines. ...................................................... 299
Table 14. The configuration of PC_RT. .......................................................................................... 300
Table 15. The configuration of PC. .................................................................................................. 300
Table 16. MPEG Audio Object Type Definition based on Tools/Modules [MPEG-4
Part III, 2005]...................................................................................................................... 315
Table 17. Use of few selected Audio Object Types in MPEG Audio Profiles [MPEG-4 Part III, 2005]. ................................................................................................................. 316
INDEX OF EQUATIONS
Equation  Page      Equation  Page      Equation  Page
(1)        94       (27)      165       (53)      207
(2)        94       (28)      165       (54)      208
(3)        95       (29)      165       (55)      209
(4)        95       (30)      165       (56)      209
(5)        95       (31)      166       (57)      209
(6)        95       (32)      166       (58)      212
(7)        96       (33)      166       (59)      213
(8)        96       (34)      166       (60)      213
(9)        96       (35)      168       (61)      213
(10)       96       (36)      169       (62)      215
(11)       96       (37)      169       (63)      217
(12)      103       (38)      169       (64)      217
(13)      103       (39)      169       (65)      217
(14)      103       (40)      171       (66)      217
(15)      107       (41)      177       (67)      219
(16)      140       (42)      188       (68)      219
(17)      140       (43)      188       (69)      219
(18)      140       (44)      188       (70)      219
(19)      140       (45)      188       (71)      220
(20)      148       (46)      202       (72)      220
(21)      149       (47)      202       (73)      220
(22)      149       (48)      202       (74)      220
(23)      149       (49)      203       (75)      264
(24)      150       (50)      206
(25)      150       (51)      206
(26)      151       (52)      206
Chapter 1 – Introduction
The problems treated here are those of data independence –the independence of application programs and terminal activities from growth in data types and changes in data representation– and certain kinds of data inconsistency which are expected to become troublesome even in nondeductive systems. Edgar F. Codd
(1970, A Relational Model of Data for Large Shared Data Banks, Comm. of ACM 13/6)
I. INTRODUCTION
The wisdom of humanity and the accumulated experience of generations derive from the human ability to apply intelligently the knowledge gained through the senses. However, before a human being acquires knowledge, he must learn to understand the meaning of information. The common sense (or meaning) of natural languages has been molded through the ages and passed on implicitly from the older generation to the younger one, but in the case of artificial languages such as programming languages, the meaning of the terms must be explained explicitly to their users. The shared sense of the language terms used in communication between people allows them to understand the information that is carried by the data written or spoken in the specific language. The data are located at the base level of the hierarchy of information science. Data in different forms (i.e. represented by various languages) may carry the same information, and data in the same form may carry different information; in the latter case the interpretation of the used terms has a different meaning, usually depending on the context.
Based on the above discussion, it may be deduced that people rely on data and their meaning in the context of information, knowledge and wisdom; thus, everything is about data and their understanding. However, the data themselves are useless until they are processed in order to obtain the information from them. In other words, data without processing may be just a collection of garbage, and the processing is possible only if the format of the data is known.
I.1. EDP Revolution
The term data processing covers all actions dealing with data, including data understanding, data exchange and distribution, data modification, data translation (or transformation), and data storage. People have been using data as carriers of information for ages, and at the same time they have been processing these data in manifold, but sometimes inefficient, ways.
The revolution in data processing finds its roots in the late sixties of the twentieth century [Fry and Sibley, 1976], when the digital world came into play and two data models were developed: the network model by Charles Bachman, named Integrated Data Store (IDS) but officially known as the Codasyl Data Model (which defined DDL and DML) [CODASYL Systems Committee, 1969], and the hierarchical model implemented by IBM under the supervision of Vern Watts, called the Information Management System (IMS) [IBM Corp., 1968]. In both models, access to the data was provided through operations using pointers and paths linked to the data records. Consequently, restructuring the data required rewriting the access methods, and thus the physical structure had to be known in order to query the data.
Edgar F. Codd, working at that time for IBM, did not like the idea of the physically dependent navigational models of Codasyl and IMS, in which data access depended on the storage structures. Therefore, he proposed a relational model of data for large data banks [Codd, 1970]. The relational model separated the logical organization of the database, called the schema, from the physical storage methods. Based on Codd's model, System R, a relational database system, was developed [Astrahan et al., 1976; Chamberlin et al., 1981]. Moreover, System R was the first implementation of SQL (the structured query language) supporting transactions.
After the official birth of System R, an active movement of data and their processing from the analog into the electronic world began. The first commercial systems, such as IBM DB2 (the production successor of System R) and Oracle, fostered and accelerated electronic data processing. The development of these systems focused on implementing the objectives of database management systems: data independence, centralization of data and system services, a declarative query language (SQL) and application neutrality, multi-user operation with concurrent access, error treatment (recovery), and the concept of transactions. However, these systems supported only textual and numerical data; other, more complex types of data such as images, audio or video were not even mentioned.
I.2. Digital Multimedia – New Hype and New Problems
Nowadays, a multimedia revolution analogous to the electronic data processing (EDP) revolution can be observed. It has been triggered by scientific achievements in information systems, followed by a wide range of available multimedia software and the continuous reduction of computer equipment prices, which was especially noticeable in the storage-device sector in the 1980s and 1990s. On the other hand, there has been a rising demand for multimedia data consumption since the late eighties.
However, the demand could not be fully satisfied due to missing standards, i.e. due to deficiencies in the common definition of norms and terms. Thus, standardization bodies have been established: the WG-11 (known as MPEG) in Europe and the SG-16 in the USA. The MPEG is a working group of JTC1/SC29 within the ISO/IEC organization. It was set up in 1988 to answer these demands, at first by standardizing the coded representation of digital audio and video, enabling many new technologies, e.g. VideoCD and MP3 [MPEG-1 Part III, 1993]. The SG-16, a study group of ITU-T, finds its roots in the ITU CCITT Study Group VIII (SG-7), which was founded in 1986. The SG-16's primary goal was to develop a new, more efficient compression scheme for continuous-tone still images (known as JPEG [ITU-T Rec. T.81, 1992]). Currently, the activities of MPEG and SG-16 cover the standardization of all technologies that are required for interoperable multimedia, including media-specific data coding, composition, description and meta-data definitions, transport systems and architectures.
In parallel to the standardization of coding technologies, the transmission and storage of digital video and audio have become very important for almost all kinds of applications that process audio-visual data. For example, analog archiving and broadcasting was the dominant way of handling audio and video data even less than 10 years ago, but now the situation has changed completely and services such as DVB-T, DTV or ITV are available. Moreover, the standardization process has triggered research activities delivering new ideas, which proposed an even more extensive usage of digital storage and transmission of multimedia data. Digital pay-TV (DTV) is transmitted through cable networks to households, and the terrestrial broadcasting of digital video and audio (DVB-T and DAB-T in Europe, ISDB-T in Japan and Brazil, ATSC in the USA and some other countries) is already available in some high-tech regions, delivering digital standard-definition television (SDTV) and allowing for high-definition TV (HDTV) in the future. Rising Internet bandwidths enable high-quality digital video-on-demand (VoD) applications, and the sharing of home-made videos is also possible through Google Video, YouTube, Vimeo, Videoegg, Jumpcut, blip.tv, and many other providers. Music and video stores are capable of selling digital media content through the Internet (e.g. iTunes). The recent advances in mobile networking (e.g. 3GPP, 3GPP2) permit audio and video streaming to handheld devices. Home entertainment devices like DVD players or digital camcorders with digital video (DV) have hit the masses. Almost all modern PCs by default offer hardware and software support for digital video playback and editing. The national TV and radio archives of analog media go on-line through government-sponsored digitizing companies (e.g. INA in France). Digital cinemas allow for new experiences with respect to high audio and video quality. The democratically created and low-cost Internet TV (e.g. DeinTV), which has its analogy in the open-source community developing free software, seems to be approaching our doors.
The increasing popularity of the mentioned applications causes an ever increasing amount of audio and video data to be processed, stored and transmitted, and at the same time it closes the self-feeding circle of providing better know-how and then developing new applications, which, once applied, indicate new requirements for existing technologies. As such, the common and continuous production of audio and visual information triggered by new standards requires a huge amount of hard disk space and network bandwidth, which in turn highly stimulates the development of more efficient and thus even more complex multimedia compression algorithms.
In the past, the development of MPEG-1 [LeGall, 1991] and MPEG-2 (H.262 [ITU-T Rec. H.262, 2000] is a common text with MPEG-2 Part 2 Video [MPEG-2 Part II, 2001]) was driven by the commercial storage, distribution and transmission of digital video at different resolutions, accompanied by audio in formats such as MPEG-1 Layer 3 (commonly known as MP3) [MPEG-1 Part III, 1993] or Advanced Audio Coding (AAC) [MPEG-2 Part VII, 2006]. Next, H.263 [ITU-T Rec. H.263+, 1998; ITU-T Rec. H.263++, 2000; ITU-T Rec. H.263+++, 2005] was motivated by the demand for a low-bitrate encoding solution for videoconferencing applications. The newer MPEG-4 Video [MPEG-4 Part II, 2004] with High Efficiency AAC [MPEG-4 Part IV Amd 8, 2005] was required to support Internet applications in manifold ways, and H.264 [ITU-T Rec. H.264, 2005] and MPEG-4 Part 10, which were developed by the JVT, the joint video team of MPEG and SG-16, and are technically aligned to each other [MPEG-4 Part X, 2007], were meant as next-generation AV compression algorithms supporting high-quality digital audio and video distribution [Schäfer et al., 2003]. However, after the standardization process, the applicability of a standard usually crosses the borders of imagination present at the time of its definition, e.g. MPEG-2 found application in DTV and DVB/DAB through satellite, cable, and terrestrial transmission as planned, but also as the standard for SVCD and DVD production.
Considering the described multimedia revolution delivering more and more information, people have moved from the poor and illegible world of textual and numerical data to the rich and easily understandable information carried by multimedia data. Owing to the human perception system, the consumption of audio-visual information provided by applications rich in media data like images, music, animation or video is simpler, more pleasant and more comprehensible, and so the popularity of any multimedia-oriented solution rises higher and higher. The trend towards rich-media applications, supported by hardware development and combined with the advances in computing performance over the past few years, has made media data the most important factor in digital storage and transmission today.
On the other hand, the multimedia revolution also causes new problems. Today's large variety of multimedia formats, finding application in many different fields, confuses the usual consumers, because the software and hardware used is not always capable of understanding the formats and presenting the information in the expected way (if at all). Moreover, among the different application fields the quality requirements imposed on the multimedia data vary to a high degree, e.g. a video played back on the small display of a handheld device can hardly have the same dimensions and the same framerate as a movie displayed on a large digital cinema screen, a picture presented on the computer screen will differ from the one downloaded to a cellular phone, and a song downloaded from an Internet audio shop may be decoded and played back by a home set-top multimedia box, but need not be consumable on a portable device.
I.3. Data Independence
The amount and popularity of multimedia data with its variety of applications on the one hand, and the complexity and diversity of coding techniques on the other, have consistently been motivating the researchers and developers of database management systems. Complex data types such as image, audio and video data need to be stored and accessed. Most of the research on multimedia databases considered only timeless media objects, i.e. the multimedia database management systems have been extended by services supporting storage and retrieval of time-independent media data like still images or vector graphics. However, managing time-dependent objects like audio or video data requires sophisticated delivery techniques (including buffering, synchronization and consideration of time constraints) and efficient storage and access techniques (gradation of physical storage, caching with preloading, indexing, layering of data), which imposes completely new challenges on database management systems in comparison to those known from the EDP revolution in the 80s and 90s. Furthermore, new search techniques for audio and video have to be proposed due to the enormously large amounts of data, e.g. context-based retrieval needs a new indexing methodology because present indexing facilities are not able to process the huge amounts of media data stored.
While managing typical timeless data within one single database management system is possible, the handling of audio-video data is usually distributed over two distinct systems: a standard relational database and a specialized media server. The first one is responsible for managing meta-data and searching, and the second one provides data delivery, but together they should fulfill the objectives of a database management system as mentioned in the previous section. The centralization of data and system services and multi-user operation with concurrent access are provided by many multimedia management systems. The concept of transactions is not really applicable to multimedia data (unless one considers the upload of a media stream as a transaction), and thus error treatment is solved by simple reload or restart mechanisms. Work on a declarative query language (SQL) has resulted in the multimedia extension to SQL:1999 (known as SQL/MM [Eisenberg and Melton, 2001]), but it still falls short of the possibilities offered by abstract data types for multimedia and is not implemented in any well-known commercial system. Finally, application neutrality and data independence have been left somewhat behind, and solutions supporting them are lacking.
The data independence of multimedia data has been a research topic of Professor Meyer-Wegener since the early years of his research at the beginning of the 1990s. He started with the design of a media object storage system (MOSS) [Käckenhoff et al., 1994] as a component of a multimedia database management system. Next, he continued with research on a kernel architecture for next generation archives and realtime-operable objects (KANGAROO), allowing for media data encapsulation and media-specific operations [Marder and Robbert, 1997]. In the meantime, he worked on the integration of media servers with a DBMS and support of quality-of-service (IMOS) [Berthold and Meyer-Wegener, 2001], which was continued by Ulrich Marder in the VirtualMedia project [Marder, 2002]. Next, he supervised the Memo.real project, focused on media-object encoding by multiple operations in real time [Lindner et al., 2000]. Finally, this work, started in 2002 and named the RETAVIC project, focuses on format independence provision by real-time audio-video conversion for multimedia database management systems. This work should be a mutual complement to the memo.REAL project (the continuation of Memo.real) [Märcz et al., 2003; Suchomski et al., 2004], which was started in 2001 but unfortunately had to be discontinued midway.
After these years of research, the data independence of multimedia data is still a very complex issue because of the variety of media data types [Marder, 2001]. The media data types are classified into two general groups: non-temporal and temporal (also known as timed, time-dependent, continuous). Text, images, and 2D and 3D graphics belong to the non-temporal group, because time has no influence on the data. In contrast, audio (e.g. wave), music (e.g. MIDI), video and animation are temporal media data, because they are time-constrained. For example, an audio stream consists of discrete values (samples) usually obtained during the process of sampling audio with a certain frequency, and a video stream consists of images (frames) that have also been captured with a certain frequency (called the framerate). The characteristic distinguishing these streams from non-temporal kinds of media data is the time constraint (the continuous property of a data stream), i.e. the occurrence of the data events (samples or frames) is ordered and the periods between them are usually constant [Suchomski et al., 2004]. Thus, providing data-independent access to various media types requires different solutions specific to each media type or at least to each group (non-temporal vs. temporal).
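To make the time constraint concrete, the following minimal sketch (an illustration only, not part of the RETAVIC prototypes; Python is used for all such sketches) derives the presentation instants of periodic media events from an assumed sampling frequency and framerate:

    # Illustration: presentation instants of periodic media events (assumed rates).
    def event_times(rate_hz: float, count: int) -> list[float]:
        """Return the first `count` event instants (in seconds) of a periodic stream
        whose events occur every 1/rate_hz seconds."""
        period = 1.0 / rate_hz
        return [i * period for i in range(count)]

    audio_sample_times = event_times(48000.0, 5)   # samples ~20.8 microseconds apart
    video_frame_times = event_times(25.0, 5)       # [0.0, 0.04, 0.08, 0.12, 0.16]
    print(video_frame_times)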
Secondly, data independence in database management systems covers both format independence (logical) and storage independence (physical) [Connolly and Begg, 2005; Elmasri and Navathe, 2000]. Format independence defines the format neutrality for the user application, i.e. the internal storage format may differ from the output format delivered to the application, but the application is not bothered by any internal conversion steps and just consumes the understandable data as designed in the external schema. Storage independence defines the neutrality of the physical storage and access paths by hiding the internal access and caching methodology, i.e. the application does not have to know how the data are stored physically (on disc, tape, CD/DVD), how they are accessed (index structures, hashing algorithms) or in which file system they are located (local access paths, network shares), but only knows the logical location (usually represented by a URL retrieved from the multimedia database) from the external schema used by the application. So, the provision of data independence and application neutrality for multimedia data relies on many fields of research in computer science:
• databases and information systems (e.g. lossless storage and hierarchical data access, physical storage structures, format transformations),
• coding and compression techniques (e.g. domain transformations (DCT, MDCT) including lifting schemes (binDCT, IntMDCT), entropy repetitive-sequence coding (RLE), entropy statistical variable-length coding (Huffman coding), CABAC, bit-plane coding (BPGC)),
• transcoding techniques (cascade transcoder, transform-domain transcoder, bit-rate transcoder, quality adaptation transcoder, frequency-domain transcoder, spatial-resolution transcoder, temporal transcoder, logo insertion and watermarking),
• audio and video analysis (motion detection and estimation, scene detection, analysis of important macroblocks),
• audio- and video-specific algorithms (zig-zag and progressive scanning, intra- and inter-frame modeling, quantization with constant-step, matrix-based or function-dependent schemes, perceptual modeling, noise shaping),
• digital networks and communication with streaming technologies (time-constrained protocols MMS, RTP/RTSP, bandwidth limitations, buffering strategies, quality-of-service issues, AV gateways),
• and operating systems with real-time aspects (memory allocation, IPC, scheduling algorithms, caching, timing models, OS boot-loading).
None of the solutions created for research purposes so far can be considered complete with respect to the data independence provision of continuous data in an MMDBMS. The RETAVIC project [Suchomski et al., 2005] and the co-related memo.REAL project [Lindner et al., 2000; Suchomski et al., 2004] have been founded with the aim of developing a solution for multimedia database management systems or media servers providing data independence in the context of capturing, storage of and access to potentially large media objects, and real-time quality-aware delivery of continuous media data, comprising a modeling of converters, transcoding and processing with graph-based models. However, considering all the aspects from the mentioned fields of computer science, the complexity of providing data independence and application neutrality based on real-time processing and QoS control is enormous, e.g. it requires the design of a real-time capable file system, network adapters and infrastructure, and the development of a real-time transformation framework. Therefore, the problem considered in this work was limited to just a subset of the mentioned aspects on the server side, namely to a solution of format independence provision for audio and video data by transformations on the MMDBMS side. Many issues of operating systems (e.g. real-time file access, storage design) and of network and communication systems have been left out.
I.4. Format Independence
Format independence can be compared to the Universal Multimedia Access (UMA) idea [Mohan et al., 1999]. In UMA, it is assumed that some amount of audiovisual (AV) signal is transmitted over different types of networks (optical, wireless, wired) to a variety of AV terminals that support different formats. The core assumption of UMA is to provide the best QoS or QoE by either selecting the appropriate content formats, adapting the content format directly to meet the playback environment, or adapting the content playback environment to accommodate the content [Mohan et al., 1999]. The key problem is to fix the mismatch between rich multimedia contents, networks and terminals [Mohan et al., 1999]; however, it has not been specified how to do this. On the other hand, UMA goes beyond the borders of multimedia database management systems and proposes to do format transformations within the network infrastructure and with dedicated transcoding hardware. This, however, makes the problem of format independence even more complex due to the many factors deriving from the variety of networks, hardware solutions, and operating systems. Moreover, introducing real-time QoS control within the scope of a global distribution area is hardly possible, because the networks have their constraints (bandwidth) and the terminals their own hardware (processing power and memory capacity) and software capabilities (supported formats, running OS). Including all these aspects and supporting all applications in one global framework is hardly possible, so this work focuses only on the part connected with the MMDBMS, namely format independence provision and application neutrality within the MMDBMS, where the heterogeneity of the problem is kept at a reasonably low level.
There are three perspectives in the research on the format independence of continuous media data and on application neutrality (a detailed discussion is provided in Section II.3). The first one uses multiple copies of the same media object in many formats and various qualities. The second covers adaptation with scalable formats, where the quality is adapted during transmission. The third one, presented in this work, considers conversion from internal format(s) to miscellaneous delivery formats on demand, analogous to the UMA idea of transparent transformation of media data, but only within the MMDBMS [Marder, 2000].
Storing videos in different formats and dimensions seems as reasonable as transmitting them in a unique format and with fixed dimensions (and then adapting the quality and format on the receiver's side). However, the waste of storage and network resources is huge: in the first case the replicas occupy extra space in storage, and in the second case the bandwidth of the transmission channel, regardless of whether it is wireless or wired, is wasted. Moreover, neither of these two solutions provides full format independence. Only audio-video conversion, allowing for quality adaptation during processing and transformation to the required coding scheme, would allow for full format independence.
I.5. AV Conversion Problems
However, there are many problems when considering audio-video conversion. First, the time characteristic of the continuous data requires that the processing be controlled not only with respect to functional correctness but also with respect to time correctness. For example, a video frame converted and delivered after its presentation time is useless, just as listening to audio whose samples are ordered differently in time than in the original sequence is senseless. Secondly, the conversion algorithms (especially compression) for audio and video data are very complex and computationally demanding (they require a lot of CPU processing power). Thus, optimization of the media-specific transformation algorithms is required. Thirdly, the processing demand of a conversion varies over time and is heavily dependent on the media content, i.e. one part of the media data may be convertible with small effort, while another part contains high energy of the audio or visual signal and requires many more calculations during compression. Thus, the resource allocation must be able to cope with the varying load, or adaptation of the processing and quality-of-service (QoS) support must be included in the conversion process, i.e. the processing elements of the transcoding architecture, such as decoders, filters and encoders, should provide a mechanism to check requests and to guarantee a certain QoS fulfilling such requests when their feasibility has been tested positively.
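As a simple illustration of such a feasibility check (a minimal sketch with assumed numbers, not the admission and scheduling mechanism developed later in this work), a server could accept a new conversion request only if the summed worst-case CPU utilization of the admitted periodic tasks stays within its capacity:

    # Sketch of a utilization-based admission test for periodic conversion tasks.
    # Each request is (wcet_per_frame, frame_period), e.g. (0.015 s, 0.04 s) for 25 fps.
    def admit(accepted: list[tuple[float, float]],
              new_request: tuple[float, float], capacity: float = 1.0) -> bool:
        """Accept the new request only if the summed worst-case CPU utilization
        of all periodic conversion tasks stays within the given capacity."""
        def utilization(request: tuple[float, float]) -> float:
            wcet_per_frame, frame_period = request
            return wcet_per_frame / frame_period
        total = sum(utilization(r) for r in accepted) + utilization(new_request)
        return total <= capacity

    running = [(0.015, 0.04), (0.010, 0.04)]     # two streams already admitted
    print(admit(running, (0.012, 0.04)))         # True:  0.375 + 0.25 + 0.30 = 0.925
    print(admit(running, (0.020, 0.04)))         # False: 0.375 + 0.25 + 0.50 = 1.125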
I.6. Assumptions and Limitations
The MMDBMS should support not only format independence but also application neutrality. This means that different aspects and requirements of applications should already be considered in the process of multimedia database design. One such key aspect is long-time storage involving no loss of information, i.e. the data should keep the original information for generations. As such, the MMDBMS must be capable of supporting a lossless internal format and lossless transformations between the internal and external (delivery) formats. It is assumed that a lossy frame/window of samples is a subset of a lossless frame/window of samples, so the ability to provide lossless properties implies the ability to deliver lossy characteristics as well.
Moreover, it is assumed that huge sets of data are stored in the MMDBMS. In such collections, the access frequency of a media object is low and only a few clients (if any) access the same data (contrary to VoD, where many clients access the same data very often). Examples of such media data sets are scientific collections of video and images, digital libraries with scanned documents, police archives of fingerprints and criminal photography, and videos from surgery and medical operations (e.g. brain surgery or coronary artery (heart) bypassing).
Furthermore, the clients accessing the media data differ from each other and postulate different requirements with respect to format and quality. The quality requirements may range from lossless quality with full information to very low quality with limited but still understandable information. An ideal system would deliver all possible formats; in reality, however, only the formats implemented in the transformation process can be supported by the MMDBMS.
Embedded systems in dedicated hardware would eventually provide a fast environment for audio-video conversion supporting various formats, but they are neither flexible nor inexpensive, so this work aims at a software-only solution for the real-time transcoding architecture. Moreover, if conversion on the client side were assumed, a loss of bandwidth of the transmission channels would occur; besides, it would hardly be possible to do the conversion on power-sensitive and not sufficiently powerful mobile devices, which are usually optimized for decoding one specific format (the manufacturer's implementation of the decoder).
I.7. Contribution of the RETAVIC Project
The central contribution of this dissertation is a proposal of the conceptual model of the real-time audio-video conversion architecture, which includes: a real-time capturing phase with fast, simple lossless encoding and a media buffer; a non-real-time preparation phase with conversion to the internal format and content analysis; a storage phase with lossless, scalable binary formats and a meta-data set definition; and a real-time transcoding phase with real-time capable quality-adaptive decoding and encoding. The key assumption in the real-time transcoding phase is the exploitation of the meta-data set describing a given media object for feasibility analysis, scheduling and controlling of the process. Moreover, the thesis proposes media-specific processing models based on the conceptual model and defines hard-real-time adaptive processing. The need for a lossless, scalable binary format caused the Layered Lossless Video format (LLV1) [Militzer et al., 2005] to be designed and implemented within this project. The work is backed by the analysis of requirements for the internal storage format for audio and video and by a review of the format independence support in current multimedia management systems, i.e. multimedia database management systems and media servers (streaming servers). The proof of concept is given by the prototypical real-time implementation of the critical parts of the transcoding chain for video, which was evaluated with respect to functional, quantitative and qualitative properties.
I.8. Thesis Outline
In Chapter 2, the related work is discussed. It is divided into two major sections: the fundamentals and frameworks being the core related work (Section II), and the data format and real-time operating system issues being loosely coupled related work (Section III).
In Chapter 3, the design is presented. At first, the conceptual model of format independence provision is proposed in Section IV. It includes real-time capturing, non-real-time preparation, storage, and real-time transcoding. In Section V, the video processing model is described: LLV1 is introduced, the video-specific meta-data set is defined in detail, and real-time decoding and encoding are presented. Analogously, in Section VI, the audio processing model with MPEG-4 SLS, audio-specific meta-data and audio transcoding is described. Finally, in Section VII, the real-time issues in the context of continuous media transcoding are explained. Prediction, scheduling, execution and adaptation are the subjects considered there.
In Chapter 4, the details of the best-effort prototypes and the real-time implementation are given. Section VIII points out the core elements of the RETAVIC architecture and states what the target of the implementation phase was. Section IX describes the real-time implementations of two representative converters, RT-MD-LLV1 and RT-MD-XVID respectively. The controlling of the real-time processes is also described in this section.
In Chapter 5, the proof of concept is given. The evaluation and measurements are presented in Section X: the evaluation process, a discussion of measurement accuracy, and the evaluation of the real-time converters. The applicability of the RETAVIC architecture is discussed in Section XI. Moreover, a few variations and extensions of the architecture are mentioned.
Finally, the summary and an outlook on further work are included in Chapter 6. The conclusions are listed in Section XII and the further work is covered by Section 0. Additionally, there are a few appendices detailing some of the related issues.
Chapter 2 – Related Work
One thing I have learned in a long life: that all our science, measured against reality, is primitive and childlike—and yet it is the most precious thing we have. Albert Einstein
(1955, Speech for Israel in Correspondence on a New Anti-War Project)
Even though many fields of research have been named as related areas in the introduction, this chapter covers only the most relevant issues, which are referred to later in this dissertation. The chapter is divided into two sections: the essential section covering the most important related work, and the surrounding section describing adjacent but still important related work.
II. FUNDAMENTALS AND FRAMEWORKS
The essential related work covers terms and definitions, multimedia delivery, format independence methods and applications, and transformation frameworks. Accordingly, there are definitions provided directly from other sources as well as keywords with a meaning refined for the RETAVIC context; both are given in the Terms and Definitions section. They are grouped into data-related, processing-related, and quality-related terms. Next, the issues of delivering multimedia data, such as streaming, size and time constraints, and buffering, are described. The three possible methods of providing format independence (already briefly mentioned in the Introduction), which are later referred to simply as approaches, are discussed subsequently. After that, comments on some video and audio transformation frameworks are given. And last but not least, the related research on format independence with respect to multimedia management systems is considered.
II.1. Terms and Definitions
Most of the following definitions used within this work are collected in [Suchomski et al., 2004], and some of them are adopted or refined for the purposes of this work. There are also some terms added from other works or newly created to clarify the understanding. All terms are listed and explained in detail in Appendix A. The terms and definitions fall into three groups: data-related, processing-related and quality-related, as follows:
a) data-related: media data, media object (MO), multimedia data, multimedia object (MMO),
meta data (MD), quant, timed data (time-constrained data, time-dependent data),
continuous data (periodic or sporadic), data stream (shortly stream), continuous MO,
audio stream, video stream, multimedia stream, continuous MMO, container format,
compression/coding scheme;
b) processing-related: transformation (lossy, lossless), multimedia conversion (media-type,
format and content changers), coding, decoding, converter, coder/decoder, codec, (data)
compression, decompression, compression efficiency (coding efficiency), compression
ratio, compression size, transcoding, heterogeneous transcoding, transcoding efficiency,
transcoder, cascade transcoder, adaptation (homogeneous or unary-format transcoding),
chain of converters, graph of converters, conversion model of (continuous) MMO,
(error) drift;
c) quality-related: quality, objective quality, subjective quality, Quality-of-Service (QoS),
Quality-of-Data (QoD), transformed QoD (T(QoD)), Quality-of-Experience (QoE).
This work uses the above terms extensively; thus, the reader is referred to Appendix A in case of doubts or problems with understanding.
II.2. Multimedia Data Delivery
Delivering multimedia data is different from conventional data delivery, because a multimedia stream is not transferred as a whole from the data storage server, such as an MMDBMS or media server, to the client, due to its size. Usually, the consuming application on the client side starts displaying the data almost immediately after their arrival. Hence, the data do not have to be complete, i.e. not all parts of the multimedia data have to be present on the consumer side, but just those parts required for displaying at a given time. This delivery method is known as streaming [Gemmel et al., 1995].
Streaming multimedia data to clients differs in two ways from transferring conventional data [Gemmel et al., 1995]: 1) the amount of data and 2) real-time constraints. For example, a two-hour MPEG-4 video compressed with an average throughput of 3.96 Mbps¹ needs more than 3.56 GB of disk space; analogously, the accompanying two-hour stereo audio signal in the MPEG-1 Layer III format with an average throughput of 128 kbps requires 115 MB. Even though modern compression schemes allow for compression ratios of 1:50 [MPEG-4 Part V, 2001] or even more, storing and delivering many media files is not a trivial task, especially if higher resolutions and frame rates are considered (e.g. HDTV 1080p). Please note that even for highly compressed high-quality AV streams the data rates vary between 2 and 10 Mbps.

¹ For example, a PAL video (4CIF) in the YV12 color scheme (12 bits per pixel) with a resolution of 720 x 576 pixels and 25 fps requires about 118.5 Mbps. A compression ratio of 1:30, which is reasonable for MPEG-2 compression, gives a value of 3.955 Mbps for this video.
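The figures above can be reproduced with a simple back-of-the-envelope calculation; the following sketch (an illustration only, following the mixed use of binary and decimal prefixes implied by the quoted numbers) recomputes them:

    # Back-of-the-envelope storage estimate for a two-hour AV presentation.
    SECONDS = 2 * 3600

    # Raw PAL video (4CIF): 720x576 pixels, 12 bits per pixel (YV12), 25 fps.
    raw_bps = 720 * 576 * 12 * 25                  # 124,416,000 bit/s
    raw_mbps = raw_bps / 2**20                     # ~118.7 "Mbps" (binary prefix, as in footnote 1)
    video_mbps = raw_mbps / 30                     # ~3.96 Mbps at a 1:30 compression ratio

    video_bytes = video_mbps * 1e6 * SECONDS / 8   # ~3.56e9 B, i.e. about 3.56 GB
    audio_bytes = 128e3 * SECONDS / 8              # 128 kbps MP3 audio -> about 115 MB

    print(round(video_bytes / 1e9, 2), "GB video,", round(audio_bytes / 1e6), "MB audio")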
Secondly, the delivery becomes even more difficult because of real-time constraints. For conventional files, variations of the transfer rate are negligible and all bits have to be transferred correctly; for streaming, in contrast, the most important factor is that a certain minimum transfer rate is sustained [Gemmel et al., 1995]. If we take the above example of a compressed audio-video stream, the transfer rate must at least equal the sum of the average throughputs, 4.09 Mbps, in order to allow the client to consume the multimedia stream. However, even this constraint is not sufficient. Variations of the transfer rate, which may occur due to network congestion or the server's inability to deliver the data on time, usually lead to a stuttering effect during play-out on the client side (i.e. when the data required to continue playing have not arrived yet).
Buffering on the client side is used to reduce the network-throughput fluctuation problem by fetching multimedia data into a local cache and starting play-out only when the buffer cache has been filled up [Gemmel et al., 1995]. If the transfer rate suddenly decreases, the player can still use the data from the buffer, which is filled again when the transfer rate has recovered. On the other hand, the larger the buffer on the client side, the bigger the latency in starting the play-out, so the buffer size shall be as low as required. Buffering overcomes short bottlenecks deriving from network overloads, but if the server cannot deliver data fast enough for a longer period of time, the media playback is affected anyway [Gemmel et al., 1995]. Therefore, enough resources have to be allocated and guaranteed on the server, and the resource allocation mechanism has to detect when the transfer reaches its limits and in such a case disallow further connections.
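The trade-off between buffer size and start-up latency can be illustrated with a small sketch (assumed example numbers only): the pre-buffered amount must cover the data deficit accumulated during the longest expected throughput dip, and the start-up latency grows proportionally with that amount:

    # Sketch: minimum client-side pre-buffer needed to survive a throughput dip.
    def min_prebuffer_bits(stream_bps: float, dip_bps: float, dip_seconds: float) -> float:
        """Bits that must already be buffered so that playback survives a period
        during which the network delivers only dip_bps instead of the full rate."""
        return max(stream_bps - dip_bps, 0.0) * dip_seconds

    stream = 4.09e6                                   # the ~4.09 Mbps AV stream from above
    need = min_prebuffer_bits(stream, 2.0e6, 5.0)     # a 5 s dip down to 2 Mbps
    startup_latency = need / stream                   # time to pre-fill the buffer at the full rate
    print(round(need / 8 / 1e6, 2), "MB buffer,", round(startup_latency, 2), "s extra start-up latency")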
Another approach is buffering within the network infrastructure, e.g. on caching and/or transcoding proxies in a tree-based distribution network [Li and Shen, 2005]. Various caching policies proposing placement and replacement of media objects exist; a promising² one is caching in transcoding proxies for tree networks (TCTN), an optimized strategy using a dynamic-programming-based solution and considering transcoding costs through weighted transcoding graphs [Li and Shen, 2005]. In general, the problems with server unavailability mentioned before could be solved by buffering on the proxies; however, the complexity of the application coordinating the media transcoding then grows enormously, because all network aspects and the different proxy platforms must be considered.

² At least, it outperforms the least recently used (LRU), least normalized cost replacement (LNC-R), aggregate effect (AE) and web caching in transcoding proxies for linear topology (TCLT) strategies.
II.3. Approaches to Format Independence
As already mentioned in the introduction, there are three research approaches in different fields using multimedia data which could be applied for the provision of format independence. These three approaches are defined within this work as redundancy, adaptation and transcoding. They have been developed without application neutrality in mind and usually focus on specific requirements. One exception is the Digital Item Adaptation (DIA) based on the Universal Media Access (UMA) idea, which is discussed later.
II.3.1. Redundancy Approach
The redundancy approach defines format independence through multiple copies of the multimedia data, which are kept in many formats and/or various qualities. In other words, the MO is kept in a few preprocessed instances, which may have the same coding scheme but different predefined quality characteristics (e.g. a lower resolution), may have a different coding scheme but the same quality characteristics, or may have various coding schemes as well as different characteristics. The disadvantages of this method are as follows:
• waste of storage – due to multiple copies representing the same multimedia information;
• a partial format independence solution – i.e. it covers only a limited set of quality characteristics or coding schemes, because it is impossible to prepare copies for all potential qualities and coding schemes supporting yet undefined applications;
• a very high start-up cost – in order to deliver different qualities and coding schemes, the multimedia data must be preprocessed with a cost of O(m·n·o) complexity (see the sketch below), i.e. the number of MO instances, and thus their preparation, depends directly on:
  - m – the number of multimedia objects,
  - n – the number of provided qualities,
  - o – the number of provided coding schemes.
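As a rough illustration of this start-up cost (assumed example numbers only), the number of pre-produced instances, and hence the preprocessing and storage effort, multiplies out as follows:

    # Sketch: number of pre-produced instances in the redundancy approach.
    m = 10_000              # multimedia objects in the collection (assumed)
    n = 4                   # provided quality levels per object (assumed)
    o = 5                   # provided coding schemes per object (assumed)
    avg_size_gb = 0.5       # assumed average size of one instance

    instances = m * n * o   # O(m*n*o) copies to prepare and store
    print(instances, "instances,", instances * avg_size_gb, "GB of storage")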
The biggest advantage is the relatively small cost of delivery and the possibility of using distribution techniques such as multicasting or broadcasting. The redundancy approach could be used as an imitation of format independence, but only in a limited set of applications where just a few classes of devices exist (e.g. Video-on-Demand). Thus, this approach is not enough for format independence provision by an MMDBMS and may be applicable only as an additional extension for optimization purposes at the cost of storage.
II.3.2. Adaptation Approach
Another approach, which could be partly utilized as a format independence solution, is the adaptation approach. The goal is not to provide data independence (any format requested by the application), but rather to adapt an existing (scalable) format to the network environment and end-device capabilities. Here, adaptation points at the borders of networks are defined (usually called media proxies or media gateways), which are responsible for adapting the transmitted multimedia stream to the end-point consumer devices, or rather to a given class of consumer devices. This brings the disadvantage of wasting network resources due to the bandwidth over-allocation of the delivery channel from the central point of distribution to the media proxies on the network borders. A second important disadvantage is that data are always dropped, and the more proxies adapt the data between the client and the server, the lower the provided data quality is. A third drawback is the dependency on a static solution, because the scalable format cannot easily be changed once chosen.
Three architectures have been distinguished among the adaptation approaches, from low to high complexity [Vetro, 2001]³, mainly useful for non-scalable formats (such as MPEG-2 [MPEG-2 Part II, 2001]):
1) Simple open-loop system – simply cutting the variable-length codes corresponding to the high-frequency coefficients (i.e. to the higher-level data carrying less important information), used for bit-rate reduction; this involves only variable-length parsing (and no decoding); however, it produces the biggest error drift;
2) Open-loop system with partial decoding to the un-quantized frequency domain – where the operations (e.g. re-quantization) are conducted on the decoded transform coefficients in the frequency domain;
3) Closed-loop system – where decoding down to the spatial/time domain (with pixel or sample values) is conducted in order to compensate the drift for the newly re-quantized data; however, only one coding scheme is involved due to the design (see the definition of the adaptation term); it is called the simplified spatial domain transcoder (SSDT).

³ The research work of Vetro focuses only on video data, and these architectures are called "classical transcoding" architectures. However, within this work, Vetro's "classical transcoding" reflects the adaptation term.
These three architectures have been refined by [Sun et al., 2005]. Vetro's first architecture is called Architecture 1 – Truncation of the High-Frequency Coefficients and the second one Architecture 2 – Requantizing the DCT Frequency Coefficients; both are subject to drift due to their simplicity and their operation in the frequency domain. They are useful for applications such as trick modes and extended-play recording in digital video recorders. There are also some optimizations proposed through constrained dynamic rate shaping (CDRS) or general (unconstrained) dynamic rate shaping (GDRS) [Eleftheriadis and Anastassiou, 1995], which operate directly on the coefficients in the frequency domain. Architecture 3 – Re-Encoding with Old MVs and Mode Decisions and Architecture 4 – Re-Encoding with Old MVs (and New MD) are resistant to drift error and reflect the closed-loop system from [Vetro, 2001]. They are useful for VoD and statistical multiplexing of multiple channels. Here an optimization is also proposed by simplifying the SSDT and creating a drift-free frequency-domain transcoder at reduced computational complexity, which performs the motion compensation step in the frequency domain through approximate matrices computing the MC-DCT residuals [Assunncao and Ghanbari, 1998].
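The principle behind the open-loop architectures can be illustrated with the following sketch (a simplification with made-up quantizer steps, not an implementation of any particular codec): already quantized frequency coefficients are de-quantized and re-quantized with a coarser step, which reduces the bit demand; since no motion-compensation loop corrects the additional rounding error, the error accumulates over predicted frames, which is the drift mentioned above:

    # Sketch of open-loop re-quantization of transform (e.g. DCT) coefficient levels.
    # q_in:  quantizer step used by the original encoder,
    # q_out: coarser step used by the transcoder (q_out > q_in reduces the bit rate).
    def requantize_block(levels: list[int], q_in: int, q_out: int) -> list[int]:
        """Re-quantize one block of quantized coefficient levels in the coded domain."""
        requantized = []
        for level in levels:
            coefficient = level * q_in                 # approximate de-quantization
            requantized.append(round(coefficient / q_out))
        return requantized

    block = [12, -5, 3, 1, 0, -1, 0, 0]                # toy block of quantized levels
    print(requantize_block(block, q_in=4, q_out=10))   # coarser levels -> fewer bits
    # No motion-compensation loop corrects the rounding error, hence drift in predicted frames.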
Another adaptation approach is to use existing scalable formats and to operate directly on
the coded bit-stream by dropping out the unnecessary parts. This significantly reduces the
complexity of the adaptation process, i.e. the scalable format adaptation is much simpler than
the adaptations mentioned before because it does not require any decoding of the bit-stream, i.e.
it operates in the coded domain. A few scalable formats have been proposed recently, for example
MPEG-4 Scalable Video Coding [MPEG-4 Part X, 2007] or MPEG-4 Scalable to Lossless
Coding [Geiger et al., 2006; MPEG-4 Part III FDAM5, 2006], and a few scalable extensions
to existing formats have been defined as well, e.g. the Fine Granularity Scalability (FGS) profile of the
MPEG-4 standard [MPEG-4 Part II, 2004]. However, there is an unquestionable disadvantage:
the additional cost in coding efficiency caused by the additional scalability information included
within the bit stream.
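The coded-domain adaptation of a scalable stream can be pictured with the minimal Python sketch below, which keeps the base layer and then adds enhancement-layer units until a byte budget is exhausted, without decoding anything; the layered packet structure and all names are assumptions made for illustration, not the syntax of any of the formats cited above.

```python
from dataclasses import dataclass

@dataclass
class LayerUnit:
    layer: int      # 0 = base layer, 1..N = enhancement layers
    payload: bytes

def adapt_scalable_stream(units: list[LayerUnit], byte_budget: int) -> list[LayerUnit]:
    """Keep the base layer, then add enhancement units (lowest layers first)
    as long as the byte budget allows -- no decoding is involved."""
    base = [u for u in units if u.layer == 0]
    kept = list(base)
    used = sum(len(u.payload) for u in base)
    for u in sorted((u for u in units if u.layer > 0), key=lambda u: u.layer):
        if used + len(u.payload) > byte_budget:
            break
        kept.append(u)
        used += len(u.payload)
    return kept

# toy usage: one base unit and three enhancement units, budget of 300 bytes
stream = [LayerUnit(0, b"x" * 100), LayerUnit(1, b"x" * 120),
          LayerUnit(2, b"x" * 120), LayerUnit(3, b"x" * 120)]
print(len(adapt_scalable_stream(stream, byte_budget=300)))  # -> 2 units kept
```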
The biggest disadvantage of all adaptation approaches using a scalable or non-scalable format is
still the assumption that the storage format has to be defined or standardized with respect to
the end-user applications. Or, even worse, if the problem is treated from the distributor's
perspective, only the consumers compliant with the standard storage format can be supported
by the adaptation architecture. As such, there is still the drawback of having just a partial format
independence solution, similar to the one present in the redundancy approach. And even
though one could compare transcoding to scalable coding with adaptivity [Vetro, 2003], the
adaptation of one universal format with different qualities is not considered a fully-fledged data
independence solution usable by an MMDBMS.
II.3.3. Transcoding Approach
The last, but not least, approach is the transcoding approach. It is based on multimedia conversion
from the internal (source) format to the external (requested) format. In general, there are two
methodologies: on-demand (on-line) and off-line. The off-line approach is a best-effort solution
where no delivery time is guaranteed and the MO is converted to the requested format
completely before the delivery starts. Obviously, this introduces huge latencies in the response time.
The on-demand approach is meant for delivering multimedia data just after obtaining the client request,
i.e. the transcoding starts as soon as a request for the data appears. In this case, there are two types of
on-line transformations: real-time and best-effort. While in the first one a sophisticated
mechanism for QoS control may be implemented, in the second case no execution and
delivery guarantees can be given.
On the other hand, [Dogan, 2001] discusses video transcoding in two aspects: homogeneous
and heterogeneous. Homogeneous video transcoders only change the bit rate, frame rate, or
resolution, while heterogeneous video transcoding allows for transformations between different
formats, coding schemes and network topologies, e.g. between different video standards like H.263
[ITU-T Rec. H.263+++, 2005] and MPEG-4 [MPEG-4 Part II, 2004]. In analogy to Dogan's
definitions, homogeneous transcoding is treated as the adaptation from the previous section,
heterogeneous transcoding is discussed subsequently within this section, and both aspects
are considered with respect to audio as well as video.
The transcoding approach is exploited within the MPEG-21 Digital Item Adaptation (DIA)
standard [MPEG-21 Part VII, 2004], which is based on the Universal Media Access (UMA) idea
and tries to cope with the “mismatch” problem. As stated in the overview of DIA [Vetro, 2004],
the DIA is “a work to help alleviate some of burdens confronting us in connecting a wide range
of multimedia content with different terminals, networks, and users. Ultimately, this work will
enable (…) UMA.”. The DIA defines an architecture for Digital Item4 [MPEG-21 Part I, 2004]
transformation delivering format-independent mechanisms that provide support in terms of
resource adaptation, description adaptation, and/or QoS management, which are collectively
referred to as DIA Tools. The DIA architecture is depicted in Figure 1. However, the DIA
describes only the tools assisting the adaptation process but not the adaptation engines
themselves.
Figure 1. Digital Item Adaptation architecture [Vetro, 2004].
The DIA also covers other aspects related to the adaptation process. The usage environment
tools describe four environmental system conditions: the terminal capabilities (incl. codec
capabilities, I/O capabilities, and device properties), the network characteristics (covering
static network capabilities and network conditions), the user characteristics (e.g. user
information, preferences and usage history, presentation preferences, accessibility characteristics
and location characteristics such as mobility and destination), and the natural environment
characteristics (current location and time or the audio-visual environment). Moreover, the DIA
architecture proposes not only multimedia data adaptation, but also bitstream syntax
description adaptations, which could easily be called meta-data adaptations, being on a higher
logical level and allowing the adaptation process to work in the coded bitstream domain in a
codec-independent manner. An overview of the adaptation on the bitstream level is depicted in Figure 2.
4 Digital Item is understood as the fundamental unit of distribution and transaction within the MPEG-21 multimedia framework, representing the "what" part. As such, it reflects the definition of a media object within this work, i.e. a DI is equal to an MO. The detailed definition of what the DI exactly is can be found in Part 2 of MPEG-21 containing the Digital Item Declaration (DID) [MPEG-21 Part II, 2005], which is divided into three normative sections: model, representation, and schema.
Figure 2. Bitstream syntax description adaptation architecture [Vetro, 2004].
To summarize, if the DIA really supported a format-independent mechanism for adaptation,
i.e. allowing transformation from one coding scheme to a different one, which seems to be the
case at least according to the standard's description, it should not be called adaptation anymore, but
rather Digital Item Transformation, and then it is without any doubt to be treated as a transcoding
approach and definitely not as an adaptation approach.
II.3.3.1 Cascaded transcoding
The cascaded transcoding approach is understood in the context of this work as a straightforward
transformation using a cascade transcoder with complete decoding from one coding scheme and
complete encoding to another one; it employs the maximum complexity of the described
solutions [Sun et al., 2005]. Due to this complexity, conversion-based format independence
built on the cascaded approach has not been well accepted as usable in real applications. For
example, when transforming one video format into another, a full recompression of the video
data demanding expensive decoding and encoding steps5 is required. So, in order to achieve a
reasonable processing speed, modern video encoders (e.g. the popular XviD MPEG-4 video
codec) employ sophisticated block-matching algorithms instead of a straightforward full search
in order to reduce the complexity of motion estimation (ME). Often, predictive algorithms like
EPZS [Tourapis, 2002] are used, which offer a 100-5000 times speed-up over a full search while
achieving similar picture quality. The performance of predictive search algorithms, however,
highly depends on the characteristics of the input video (and is especially low for sequences with
irregular motion). This content-dependent and unpredictable characteristic of the ME step
makes it very difficult to predict the behaviour of a video encoder, and thus interferes with making
video encoders a part of a real-time transcoding process within the cascaded transcoder.
5 In multimedia storage and distribution, asymmetric compression techniques are usually used. This means that the effort spent on encoding is much higher than on decoding – the ratio may reach 10 times and more.
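To make the content dependence of motion estimation more concrete, the Python sketch below contrasts an exhaustive full search with a simple predictive search that only probes a few candidate vectors; it is a didactic toy (SAD cost on synthetic data, invented function names), not EPZS or the XviD search.

```python
import numpy as np

def sad(a: np.ndarray, b: np.ndarray) -> int:
    """Sum of absolute differences -- the usual block-matching cost."""
    return int(np.abs(a.astype(int) - b.astype(int)).sum())

def full_search(cur, ref, bx, by, bs=16, rng=8):
    """Exhaustive search: the number of SAD evaluations grows with the square of the range."""
    best, best_mv, checks = None, (0, 0), 0
    for dy in range(-rng, rng + 1):
        for dx in range(-rng, rng + 1):
            y, x = by + dy, bx + dx
            if 0 <= y <= ref.shape[0] - bs and 0 <= x <= ref.shape[1] - bs:
                cost = sad(cur[by:by + bs, bx:bx + bs], ref[y:y + bs, x:x + bs])
                checks += 1
                if best is None or cost < best:
                    best, best_mv = cost, (dx, dy)
    return best_mv, checks

def predictive_search(cur, ref, bx, by, predictors, bs=16):
    """Probe only predicted vectors (e.g. from neighbouring blocks) plus (0,0);
    very few checks, but the result depends on how regular the motion is."""
    best, best_mv, checks = None, (0, 0), 0
    for dx, dy in set(predictors) | {(0, 0)}:
        y, x = by + dy, bx + dx
        if 0 <= y <= ref.shape[0] - bs and 0 <= x <= ref.shape[1] - bs:
            cost = sad(cur[by:by + bs, bx:bx + bs], ref[y:y + bs, x:x + bs])
            checks += 1
            if best is None or cost < best:
                best, best_mv = cost, (dx, dy)
    return best_mv, checks

ref = np.random.default_rng(0).integers(0, 255, (64, 64), dtype=np.uint8)
cur = np.roll(ref, shift=(-2, -3), axis=(0, 1))   # simulate a global motion of (dx=3, dy=2)
print(full_search(cur, ref, 16, 16))              # e.g. ((3, 2), 289) -- hundreds of SAD checks
print(predictive_search(cur, ref, 16, 16, [(3, 2), (2, 2)]))   # e.g. ((3, 2), 3) -- a handful
```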
II.3.3.2 MD-based transcoding
The unpredictability of classical transcoding can be eliminated without compromising
compression efficiency by using meta-data (MD) [Suchomski et al., 2005; Vetro, 2001]. There
are plenty of proposals for simplifying the transcoding process by exploiting MD-based
multimedia conversion. For example, the meta-data guiding the transcoder, specific to video
content, is divided into low-level and high-level features [Vetro et al., 2001], where the low-level ones
refer to color, motion, texture and shape, and the high-level ones may include storyboard information
or the semantics of the various objects in the scene. Other research in the video-processing and
networking fields has discussed specific transcoding problems in more detail [Sun et al., 2005; Vetro
et al., 2003], like object-based transcoding, MPEG-4 FGS-to-SP, or MPEG-2-to-MPEG-4, and
dedicated MD-based approaches have been proposed by [Suzuki and Kuhn, 2000] and
[Vetro et al., 2000].
[Suzuki and Kuhn, 2000] proposed the difficulty hint, which assists in bit allocation during
transcoding by providing information about the complexity of one segment with respect to
the other segments. It is represented as a weight in the range [0,1] for each segment (frame or
sequence of frames), obtained by normalizing the bits spent on this segment by the total
bits spent on all segments. As such, the rate-control algorithm may use this hint for controlling the
transcoding process and for optimizing the temporal allocation of bits (if variable bit rate (VBR)
is allowed). One constraint should be considered during the calculation of the hint, namely that it
should be calculated at a fine QP, and even then the results may vary [Sun et al., 2005]. The specific
application of the difficulty hint and some other issues, such as motion uncompensability and
search-range parameters, are further discussed in [Kuhn and Suzuki, 2001].
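A minimal sketch of how such a difficulty hint could be computed and used follows; the weights are simply the per-segment bit counts normalized to sum to one, and the proportional bit-budget allocation is an assumed, simplified rate-control rule rather than the exact rule of the cited works.

```python
def difficulty_hints(bits_per_segment: list[int]) -> list[float]:
    """Weight in [0,1] per segment: bits spent on the segment over the total bits."""
    total = sum(bits_per_segment)
    return [b / total for b in bits_per_segment]

def allocate_budget(hints: list[float], total_budget_bits: int) -> list[int]:
    """Toy VBR rate control: give each segment a share proportional to its difficulty."""
    return [round(h * total_budget_bits) for h in hints]

# toy usage: four segments of the source stream, transcoded to a 1 Mbit budget
source_bits = [120_000, 480_000, 240_000, 160_000]
hints = difficulty_hints(source_bits)
print(hints)                              # e.g. [0.12, 0.48, 0.24, 0.16]
print(allocate_budget(hints, 1_000_000))  # proportional bit budget per segment
```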
[Vetro et al., 2000] proposes the shape change and motion intensity hints, which are especially
usable for supporting data dependency in object-based transcoding. The well-known problem
of variable temporal resolution of the objects within a visual sequence has already been
investigated in the literature, and two shape-distortion measures have been proposed: the Hamming
distance (the number of different pixels between two shapes) and the Hausdorff distance (a maximum
function over two sets of pixels based on the Euclidean distance between the points). [Vetro,
2001] proposes to derive the shape-change hint from one of these measures, but after normalization
to the range [0,1] by dividing by the number of pixels within the object for the Hamming
distance, or by the maximum width or height of the rectangle bounding the object or the frame for the
Hausdorff distance. The motion intensity hint is defined as a measure of the significance of the object
and is based on the intensity of motion activity6 [MPEG-7 Part III, 2002], the number of bits, a
normalizing factor reflecting the object size, and a constant (greater than zero) usable for
zero-motion objects [Vetro et al., 2000]. Larger values of the motion intensity hint indicate a
higher importance of the object, and may be used for decisions on the quantization parameter
or on skipping with respect to each object.
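As a small illustration, the Python sketch below derives a Hamming-distance-based shape-change hint from two binary object masks, normalized to [0,1]; normalizing by the union of the two masks is an assumption of this toy example, and the motion intensity hint is not modelled here.

```python
import numpy as np

def shape_change_hint(mask_prev: np.ndarray, mask_cur: np.ndarray) -> float:
    """Hamming distance between two binary object masks, normalized to [0,1]
    by the number of pixels within the (union of the) object."""
    hamming = int(np.count_nonzero(mask_prev != mask_cur))
    object_pixels = int(np.count_nonzero(mask_prev | mask_cur))
    return hamming / object_pixels if object_pixels else 0.0

# toy usage: a square object that grows by a few rows between two frames
prev = np.zeros((32, 32), dtype=bool); prev[8:16, 8:16] = True
cur = np.zeros((32, 32), dtype=bool);  cur[8:18, 8:16] = True
print(round(shape_change_hint(prev, cur), 3))   # small value -> little shape change
```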
A collective solution has been proposed by MPEG-7, where the MediaTranscodingHints descriptor in
the MediaProfile descriptor of the Media Description Tools [MPEG-7 Part V, 2003] has been defined. The
transcoding properties proposed there only partly fit the existing implementations of encoders
as required for the format independence provision in an MMDBMS. Among others, the
MediaTranscodingHints define useful general motion, shape and difficulty coding hints7. On the
other hand, a property like "intraFrameDistance" [MPEG-7 Part V, 2003] is not a good hint for
encoding, since scene changes may appear unpredictably (and usually intra frames are used
then). Thus, the intraFrameDistance should rather be treated as a constraint than as a hint.
Regardless, in comparison to the MD set defined later within this work, a few parameters of
MPEG's MDT are similar (e.g. importance is somehow related to MB priority) and some very
important ones are still missing (e.g. frame type or MB mode suggestions).
6 Intensity of motion activity is defined by MPEG-7 as the standard deviation of the motion vector magnitudes.
7 Suzuki, Kuhn, Vetro, and Sun have taken an active part in the MPEG-7 standard development, especially in the part responsible for defining meta-data for transcoding (MediaTranscodingHints) [MPEG-7 Part V, 2003]. As such, the transcoding hints derive from their earlier works (discussed in the paragraphs above).
In general, the focus of the previous research is on the transcoding itself, including functional
compatibility, compression efficiency and network issues, because the goal there was
to simplify the execution in general, but not to predict the required resources or to adapt the
process to an RTE. Concerns such as limiting the motion search range, or improving the bit allocation
between different objects by identifying key objects or by regulating the temporal resolution of
objects, are at the center of interest. The subjects connected with real-time processing,
scheduling of transcoders and QoS control in the context of an MMDBMS (i.e. format
independence and application neutrality) are not investigated, and thus no meta-data set has
been proposed to support coping with these topics.
II.4. Video and Audio Transformation Frameworks
The transformation of continuous multimedia data has been a well-researched topic for a fairly
long time. Many applications for converting or transforming audio and video can be found.
Some built-in support for multimedia data is also already available in a few operating systems
(OSes), which are for that reason called multimedia OSes, but usually this support can neither
meet all requirements nor handle all possible kinds of continuous media data. In order to better
solve the problems associated with the large diversity of formats, frameworks have been
proposed, but to my knowledge there is a lack of collective work presenting a theory of
multimedia transformations in a broad range and in various applications, i.e. considering the
other solutions, various media types, real-time processing, and QoS control at the same time and for
different applications. One very good book that attempts to cover almost all of these topics, but only
with respect to video data, is [Sun et al., 2005], and this work refers to that book many times.
There are many audio and/or video transformation frameworks. They are discussed within this
section, and everything possible was done to keep the description sound and complete.
However, the author cannot guarantee that no other related framework exists.
II.4.1. Converters and Converter Graphs
The well-accepted and most general research approach comes from the signal processing
field and is based on converters (filters) and converter graphs. It has been covered extensively in
multimedia networking and mobile communication, so the idea is not new; it is rooted in
[Pasquale et al., 1993], supported by [Candan et al., 1996; Dingeldein, 1995] and extended by
Yeadon in his doctoral dissertation [Yeadon, 1996]. A more recent approach is described by
Marder [Marder, 2002]. They all, however, restrict the discussion to the function and consider
neither execution time, nor real-time processing, nor QoS control. Typical implemented
representatives are Microsoft DirectShow, Sun's Java Media Framework (JMF), CORBA A/V Streams,
and the Multi Media Extension (MME) Toolkit [Dingeldein, 1995].
The pioneers introduced some generalizations of video transformations in [Candan et al.,
1996; Pasquale et al., 1993; Yeadon, 1996]. A filter as a transformer of one or more input streams
of a multi-stream into an output (multi-)stream was introduced in [Pasquale et al., 1993]. In
other words, after the transformation the output (multi-)stream replaces the input (multi-)stream.
The filters have been classified into three functional groups: selective, transforming, and mixing.
These classes have been extended by [Yeadon, 1996] into five generic filter mechanisms:
hierarchical, frame-dropping, codec, splitting/mixing, and parsing. Yeadon also proposed the
QoS-Filtering Model, which uses a few key objects to constitute the overall architecture: sources,
sinks, filtering entities, streams, and agents. [Candan et al., 1996] proposed collaborators capable of
displaying, editing and converting within the collaborative multimedia system called c-COMS,
which is defined as an un-directed weighted graph consisting of a set of collaborators (V),
connections (E) and the cost of information transmission over a connection (ρ). Moreover, it defines
collaboration formally, and discusses quality constraints and a few object synthesis algorithms (OSAs).
[Dingeldein, 1995] proposes a GUI-based framework for interactive editing of continuous
media supporting synchronization and mixing. It supports media objects divided into complex
media (Synchronizer, TimeLineController) as composition and simple objects (audio, video data) as
media control, and defines ports (source and sink) for processing. An adaptive framework for
developing multimedia software components called the Presentation Processing Engine (PPE)
framework is proposed in [Posnak et al., 1997]. PPE relies on a library of reusable modules
implementing primitive transformations [Posnak et al., 1996] and proposes a mechanism for
composing processing pipelines from these modules. There are other published works, e.g.
[Margaritidis and Polyzos, 2000] or [Wittmann and Zitterbart, 1997], but they follow or
somehow adopt the above-mentioned classifications and approaches of the earlier research and
do not introduce breakthrough ideas. However, all of the mentioned works consider only
aspects of the communication layer (networking) or presentation layer, which is not sufficient
when talking about multimedia transformations in the context of an MMDBMS.
VirtualMedia is another work of very high importance [Marder, 2000], which defines a theory of
multimedia metacomputing as a new approach to the management and processing of
multimedia data in web-based information systems. A solution for application independence of
multimedia data (called transformation independence) through advanced abstraction concept
has been introduced by [Marder, 2001]. The work discusses theoretically several ideas like device
independence, location transparency, execution transparency, and data independence. Moreover,
an approach to construct a set of connected filters, a description of the conversion process, and
an algorithm to set up the conversion graph have been proposed in subsequent work [Marder,
2002], where the individual media and filter signatures are used for creating transformation
graphs. However, no implementation as proof of concept exists.
II.4.1.1 Well-known implementations
Microsoft's DirectX platform is an example of a media framework in an OS-specific
environment. DirectShow [Microsoft Corp., 2002a] is the most interesting part of the DirectX
platform with respect to audio-video transformations. It deals with multimedia files and uses a
filter-graph manager and a set of filters working with different audio and/or video stream formats
and coding schemes. The filters (also called media codecs) are specially designed converters
supporting the DirectShow internal communication interface. Filter graphs are built either manually,
by the programmer creating a well-known execution path consisting of defined filters, or
automatically, by comparing the output formats provided by the previously selected filter with the
acceptable input formats of a potential following media codec [Microsoft Corp., 2002b]. It
provides mechanisms for stream synchronization according to the OS time; however, the
transformation processes cannot get execution and time guarantees from the best-effort OS.
Moreover, DirectX is only available under one OS family, so its use at the client side is limited.
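The automatic graph building described above is essentially a search over converters whose input format matches the output format of the previous stage. The Python sketch below shows this idea as a breadth-first search over a generic converter table; it is a hedged, framework-neutral illustration with invented converter and format names, and it does not use the DirectShow API.

```python
from collections import deque

# each converter is (name, input_format, output_format) -- illustrative entries only
CONVERTERS = [
    ("demux_avi",  "avi",      "mpeg4_es"),
    ("dec_mpeg4",  "mpeg4_es", "raw_yuv"),
    ("enc_h263",   "raw_yuv",  "h263_es"),
    ("mux_3gp",    "h263_es",  "3gp"),
]

def build_chain(src_fmt: str, dst_fmt: str):
    """Breadth-first search for the shortest converter chain from src_fmt to dst_fmt."""
    queue = deque([(src_fmt, [])])
    seen = {src_fmt}
    while queue:
        fmt, chain = queue.popleft()
        if fmt == dst_fmt:
            return chain
        for name, in_fmt, out_fmt in CONVERTERS:
            if in_fmt == fmt and out_fmt not in seen:
                seen.add(out_fmt)
                queue.append((out_fmt, chain + [name]))
    return None

print(build_chain("avi", "3gp"))   # ['demux_avi', 'dec_mpeg4', 'enc_h263', 'mux_3gp']
```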
A competing company, Sun Microsystems, has also provided a media framework analogous to
MS DirectShow. The Java Media Framework (JMF) [Sun Microsystems Inc., 1999] uses the
Manager object to cope with Players, DataSources, DataSinks and Processors. The processors are
equivalent to converter graphs and are built from processing components called plug-ins (i.e.
converters) ordered in transformation tracks within the processor. Processors can be configured
by hand using suitable controls, i.e. constructed by the programmer on the fly (TrackControl), on
the basis of a predefined processor model, which specifies input and output requirements, or left
to auto-configure by specifying only the output format [Sun Microsystems Inc., 1999]. In
contrast to DirectShow filter graphs, processors can be combined with each
other, which introduces one additional logical level of complexity in the hierarchy between
simple converters and very complicated graphs. Moreover, JMF is not limited to just one OS
due to the Java platform independence provided by Java Virtual Machines (JVMs), but
still it does not support QoS control and no real-time execution guarantees can be given8.
One disadvantage may derive from the commonly assumed inefficiency of Java applications in
comparison to low-level languages and platform-specific implementations—a detailed
efficiency investigation and benchmarking would be needed to confirm this, though.
Transcode [Östreich, 2003] is the third related implementation that has to be mentioned.
It is an open-source program for audio and video transcoding and is still under development,
even though reliable versions are available and can be very useful for experienced users.
Transcode's goal was to be the most popular utility for audio and video data processing running
under Linux OSes, controllable from a text console and thus allowing shell scripting and
parametric execution for the purpose of automation. In contrast, the previously described frameworks
first require the development of an application before their transcoding capabilities can be used.
The approach is analogous to cascaded transcoding, which uses raw (uncompressed) data
between coded inputs and outputs, i.e. transcoding is done by loading modules that are either
responsible for decoding and feeding transcode with raw video or audio streams (import modules),
or for encoding the stream from the raw to a coded representation (export modules). Up to now, the
tool supports many popular formats (AVI, MOV, ES, PES, VOB, etc.) and compression
methods (video: MPEG-1, MPEG-2, MPEG-4/DivX/XviD, DV, M-JPEG; sound: AC3, MP3,
PCM, ADPCM), but it does not support QoS control and is developed as a best-effort application
without real-time processing support.
8 Timelines, timing and synchronization in JMF are provided through an internal time model defining objects such as Time, Clock, TimeBase and Duration (all with nanosecond precision).
II.4.2. End-to-End Adaptation and Transcoding Systems
Other work in the field of audio and video transformation relates directly to the concept of
audio and video adaptation and transcoding as the method allowing for interoperability in
heterogeneous networks by changing container format, structure (resolution, frame rate),
transmission rate, and/or the coding scheme, e.g. MPEG transcoder [Keesman et al., 1996],
MPEG-2 transcoder [Kan and Fan, 1998], or low-delay transcoder [Morrison, 1997]. Here the
converter is referred to as a transcoder. [Sun et al., 2005], [Dogan, 2001] and [Vetro, 2001] give
overviews of video transcoding and propose solutions. However, [Dogan, 2001] covers only
H.263 and MPEG-4, and he does not address the problem of transformation between different
standards, which is a crucial assumption for format independence.
Figure 3. Adaptive transcoding system using meta-data [Vetro, 2001].
[Vetro, 2001] proposes an object-based transcoding framework (Figure 3) that, among all the
referred related works, is the most similar solution to the transformation framework of the RETAVIC
architecture; therefore it is described in more detail. The author defines a feature extraction part,
which takes place only in non-real-time scenarios and generates descriptors and meta-data
describing the characteristics of the content. However, he does not specify the set of produced
descriptors or meta-data elements – he just proposes to use the shape change and motion
intensity transcoding hints (discussed earlier in section II.3.3.2 MD-based transcoding).
Moreover, he proposes these hints for the sole purpose of the functional decisions made by the
transcoding control, and more precisely by the analysis units responsible for the recognition of
shape importance, the temporal decision (such as frame skip) and the quantization parameter
selection. The author has also mentioned two additional units for resize analysis and texture
shape analysis (for reduction of the shape's resolution) in his other work [Vetro et al., 2000].
Further, [Vetro, 2001] names two major differences to other research [Assunncao and
Ghanbari, 1998; Sun et al., 1996], namely: the inclusion of the shape hint within the bit-stream
and some new tools adopted for DC/AC prediction with regard to texture coding; no other
descriptors or meta-data are mentioned. The transcoding control is used only for controlling the
functional properties of the transcoders. The core of the object transcoders is analogous to the
multi-program transcoding framework [Sorial et al., 1999], and the only difference is the input
stream – the object-based MPEG-4 video streams do not correspond to frame-based MPEG-2
video program streams9. The issues of real-time processing and QoS control are not
considered – neither in the meta-data set nor in the transcoder design. As such, this work extends
Vetro's research only partially in the functional aspects and entirely in the quantitative aspects
of transcoding.
Many examples of end-to-end video streaming and transcoding systems are discussed in [Sun
et al., 2005]. For example, the MPEG-4 Fine Granular Scalability (FGS) to MPEG-4 Simple
Profile (SP) transcoder is mentioned in the 3rd Chapter, and the spatial and temporal resolution
reduction is discussed on the functional level in the 4th Chapter, accompanied by a discussion of
motion vector refinement and requantization. The "syntactical adaptation" being one-to-one
(binary-format) transcoding is discussed for JPEG-to-MPEG-1, MPEG-2-to-MPEG-1, DV-to-MPEG-2,
and MPEG-2-to-MPEG-4. Some further issues such as error-resilient transcoding, logo
insertion, watermarking, picture switching and statistical multiplexing are also discussed in the 4th
Chapter of [Sun et al., 2005]. Finally, a novel picture-in-picture (PIP) transcoding for H.264/AVC
[ITU-T Rec. H.264, 2005] covers two cases: the PIP Cascade Transcoder (PIPCT) and the optimized
Partial Re-Encoding Transcoder Architecture (PRETA). However, all the transcoder examples
mentioned in this paragraph are either adaptations (homogeneous or unary-format transcoding)
or binary-format (one-to-one) transcoding, and no transformation framework providing one-to-many
or many-to-many coding-scheme transcoding, i.e. a "real" heterogeneous solution, is proposed.
Moreover, they do not consider real-time processing and QoS control.
9 The issue of the impossibility of frame skipping in MPEG-2 is a well-known problem, bypassed by spending one bit marking each MB as skipped for all macroblocks in the frame; thus it is not cited here.
Yet another example is discussed in Chapter 11 of [Sun et al., 2005] – the real-time server
containing a transcoder with precompressed content is given in Figure 11.210 on p. 329.
There are a few elements on the server side in common with our architecture, but a few critical
ones are still missing (e.g. content analysis, MD-based encoding). Moreover, the discussed
test-bed architecture is MPEG-4 FGS-based, which is also one of the adaptation solutions. An
extension of the test bed towards MPEG-4 transcoding is proposed, but it assumes that there are
only several requested resolutions, all being delivered in the MPEG-4 format (which is not the
RETAVIC goal), and the application of an MD-based encoding algorithm is still not considered.
Finally, there are also other, less related proposals enhancing media data transformation during
delivery to the end-client. [Knutsson et al., 2003] proposes an extension to the HTTP protocol to
support server-directed transcoding on proxies, and even though it states that any kind of data
could be managed in this way, only static image data and no other media types, especially
continuous ones, are investigated. [Curran and Annesley, 2005] discusses the transcoding of audio
and video for mobile devices constrained by bandwidth, and additionally discusses properties of
media files and their applicability in streaming with respect to the device type. The framework is not
presented in detail, but it is based on JMF and considers neither QoS nor real-time processing – at least
both are nowhere mentioned, and within the transcoding algorithm there is no step referring to
any kind of time or quality control (it is just a best-effort solution). The perceived quality
evaluation is done by means of mean opinion scores (MOS) in off-line mode, and total execution times
are measured. An audio streaming framework is proposed in [Wang et al., 2004], but there only an
algorithm for multi-stage interleaving of audio data and a layered unequal-sized packetization
useful in error-prone networks are discussed, and no format transformations are mentioned.
10 The complete multimedia test-bed architecture is also depicted in Figure 11.5 on p. 399 of [Sun et al., 2005].
II.4.3. Summary of the Related Transformation Frameworks
Summarizing, there are interesting solutions for media transformations that are ready to be
applied in certain fields, but still there is no solution that supports QoS, real-time, and format
independence in a single architecture or framework that would be directly applicable in an
MMDBMS, where the specific properties of multimedia databases, such as data presence and a
meta-data repository describing the data characteristics, could be exploited.
Thus the RETAVIC media transformation framework [Suchomski et al., 2005] is proposed, in
which each media object is processed at least twice before being delivered to a client and the media
transformations are assisted by meta-data. This is analogous to two-pass encoding techniques in
video compression [Westerink et al., 1999], so the optimization techniques deriving from the
two-pass idea can also be applied to the RETAVIC approach. However, the RETAVIC
framework goes beyond that – it heavily extends the idea of MD-assisted processing, employs
meta-data to reduce the complexity and enhance the predictability of the transformations in
order to meet real-time requirements, and proposes an adaptive converter model to
provide QoS control.
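To illustrate the two-pass idea the framework builds on, the Python sketch below runs a non-real-time analysis pass that records per-frame meta-data (here just a crude complexity estimate) and a second pass that uses it for bit allocation; the meta-data fields and the allocation rule are deliberately simplified assumptions for illustration, not the RETAVIC meta-data set.

```python
import numpy as np

def analysis_pass(frames: list[np.ndarray]) -> list[dict]:
    """Pass 1 (off-line): derive per-frame meta-data, e.g. a rough complexity
    estimate from the mean absolute difference to the previous frame."""
    md, prev = [], None
    for i, f in enumerate(frames):
        complexity = 1.0 if prev is None else \
            float(np.mean(np.abs(f.astype(int) - prev.astype(int)))) + 1e-3
        md.append({"frame": i, "complexity": complexity})
        prev = f
    return md

def encoding_pass(md: list[dict], bit_budget: int) -> list[int]:
    """Pass 2 (delivery): allocate the bit budget proportionally to the recorded
    complexity, so the encoder does not have to re-measure the content."""
    total = sum(m["complexity"] for m in md)
    return [round(bit_budget * m["complexity"] / total) for m in md]

rng = np.random.default_rng(1)
frames = [rng.integers(0, 255, (16, 16), dtype=np.uint8) for _ in range(4)]
meta = analysis_pass(frames)
print(encoding_pass(meta, bit_budget=100_000))
```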
II.5. Format Independence in Multimedia Management Systems
Even though there has been plenty of research done in the direction of audio and video
transcoding, as mentioned in the previous sections, the currently available media servers and
multimedia database management systems have huge deficiencies with respect to format
independence for the media objects, especially when considering audio and video data. They
offer either data independence or real-time support, but not both. The media servers, which
have been analyzed and compared, support only a small number of formats [Bente, 2004], not
to mention the possibility of transforming one format into another when the output format is
defined by the end-user. Some attempts towards format independence are made with respect to
quality scalability, but only within one specific format, i.e. the adaptation approach. The only
server allowing for a limited transformation (from the RealMedia Format to the Advanced Streaming
Format) supports neither QoS control nor real-time processing. Besides, the user cannot
freely specify quality attributes such as a different resolution or a higher compression and can at
most choose between predefined qualities of a given format (redundancy approach).
II.5.1. MMDBMS
The success of using database management systems for organizing large amounts of character-based
data (e.g. structured or unstructured text) has led to the idea of extending them to
support multimedia data as well [Khoshafian and Baker, 1996]. In contrast to multimedia
servers, which are designed to simply store the multimedia data and manage access to them in
analogy to file servers [Campbell and Chung, 1996], multimedia database systems handle the
multimedia data in a more flexible way without redundancy, allowing them to be queried through the
standardized structured query language (SQL) [Rakow et al., 1995], which may deliver data
independence and application neutrality. Moreover, an MMDBMS is responsible for coping with a
multi-user environment and for controlling parallel access to the same data.
Before going into the details of each system, a short summary of the demands on a state-of-the-art
multimedia database management system (MMDBMS), based on [Meyer-Wegener, 2003], is
given. The multimedia data are stored and retrieved as media objects; however, storing means
much more than just file storing (as in media servers). Other algorithms, e.g. manipulation of
the content for video production or broadcasting, should not be implemented at the MMDBMS
side. The majority of DBMSes already have built-in support for binary large objects (BLOBs) for
integrating undefined binary data, but BLOBs are not really suited to the various media data, because
there is no differentiation between media types and thus neither the interpretation nor the specific
characteristics of the data can be exploited. The MO should be kept as a whole without division
into separate parts (e.g. images and not separate pixels and positions; video and not each frame as
an image; audio and not each sample individually). The exceptions are scalable media formats and
coding schemes, but here further media-specific MO refinement is required. The data
independence (including storage device and format) of MOs shall be provided due to its crucial
importance for any DBMS. The device independence is more critical for multimedia data than
the format independence due to hierarchical access driven by the frequency of use of the data,
where the access times of the storage devices differ. However, the format independence may
not be neglected if data independence is to be fully supported, especially for avoiding data
redundancy, supporting long-lived applications and neutrality, allowing internal format
updates without influencing the outside world, etc. Additionally, an MMDBMS must prevent data
inconsistencies if multiple copies are required under any circumstances (e.g. due to optimization of
delivery at the cost of storage, in analogy to materialized views). More sophisticated search
capabilities, not only by name or creation date, have to be provided – for example using indexes
or meta-data delivered by media-specific recognition or analysis algorithms. The indexes or
meta-data can be produced by hand or automatically in advance11. Finally, the time constraints of
continuous data have to be considered, which means the implementation of: a) a best-effort system
with no quality control working fast enough to deliver in real time, b) a soft-RTOS-based
system with just statistical QoS, or c) a hard-RTOS-based system providing scheduling
algorithms and admission control, thus allowing exact QoS control and precise timing.
Summarizing, the MMDBMS combines the advantages of a conventional DBMS with the specific
characteristics of the multimedia data.
11 On-demand analysis at query time is too costly (not to say unfeasible) and has to be avoided.
Researchers have already delivered many solutions considering various perspectives on the
evolution of MMDBMSes. In general, they can be classified into two functional extensions: 1)
focus on the "multi" part of multimedia, or 2) coping with device independence, i.e. access
transparency. The goal of the first group is to provide integrated support for many different
media types in one complete architecture. Here METIS [King et al., 2004] and MediaLand
[Wen et al., 2003] can be named as representatives. METIS is a Java-based unified multimedia
management solution for arbitrary media types including a query processor, persistence
abstraction, web-based visualization and semantic packs (containers for semantically related
objects). In contrast, MediaLand, coming from Microsoft, is a prototypical framework for uniform
modeling, managing and retrieving of various types of multimedia data, exploiting different
querying methodologies such as standard database queries, information retrieval and content-based
retrieval techniques, and hypermedia (graph) modeling.
The second group, coping with device independence, proposes solutions such as SIRSALE
[Mostefaoui et al., 2002] or the Mirror DBMS [van Doorn and de Vries, 2000]. These are especially
useful for distributed multimedia data. The first one, SIRSALE, proposes a modular indexing
model for searching in various domains of interest independently of the device, while the other,
the Mirror DBMS, solves only the problem of physical data independence with respect to
storage access and its querying mechanism. However, neither discusses data independence with
respect to the format of the multimedia data.
From another perspective, there are commercial DBMSs available, such as Oracle, IBM's DB2, or
Informix. These are complex and reliable relational systems with object-oriented extensions;
however, they lack direct support for multimedia data, and it can be added only
through the mentioned object-relational extensions, e.g. by an additional implementation of special
data types and related methods for indexing and query optimization. As such, Informix offers
DataBlades as extensions, DB2 offers Extenders, and Oracle proposes interMedia
Cartridges. All of them provide some limited level of format independence through data
conversion. The most sophisticated solution is delivered by Oracle interMedia [Oracle Corp.,
2003]. There, ORDImage.Process() allows converting a stored picture to different image
formats. Moreover, the functions processAudioCommand() included within the ORDAudio interface
and processVideoCommand() present in ORDVideo are used as calls for media processing (i.e. also
transcoding), but these are just interfaces for passing the processing commands to plug-ins
that have to be implemented anyway for each user-defined audio and video format. And this
format-specific implementation would lead to the M-to-N conversion problem, because there
has to be an implemented solution for each format handing over the data in the user-requested
format. Moreover, Oracle interMedia allows for storing, processing and automatic extraction
of meta-data covering the format (e.g. MIME type, file container, coding scheme) and the
content (e.g. author, title) [Jankiewicz and Wojciechowski, 2004]. The other system, DB2,
proposes Extenders for a few types of media, i.e. for Image, Audio and Video. However, it
provides conversion only for the DB2Image type and for none of the continuous media data types
[IBM Corp., 2003].
There has also been some work done on extensions of the declarative structured query language
(SQL) to support multimedia data. As a result, the multimedia extension to SQL:1999 under the
name SQL Multimedia and Application Packages (known in short as SQL/MM) has been proposed
[Eisenberg and Melton, 2001]. The standard [JTC1/SC32, 2007] defines a number of packages
of generic data types common to various kinds of data used in multimedia and application areas
in order to enable these data to be stored and manipulated in an SQL database. It currently covers
five parts:
1) P.1) Framework – defines the concepts, notations and conventions being common to two
or more other parts of the standard, and in particular explains the user-defined types and
their behavior;
2) P.2) Full-Text – defines the full-text user-defined types and their associated routines;
3) P.3) Spatial – defines the spatial user-defined types, routines and schemas for generic
spatial data handling on aspects such as geometry, location and topology usable by
geographic information system (GIS), decision support, data mining and data
warehousing systems;
4) P.5) Still image – defines the still image user-defined types and their associated routines
for generic image handling covering characteristics such as height, width, format, coding
scheme, color scheme, and image features (average color, histograms, texture, etc.), but
also operations or methods (rotation, scaling, similarity assessment);
5) P.6) Data mining – defines data mining user-defined types and their associated routines
covering data-mining models, settings and test results.
However, the SQL/MM still falls short of the possibilities offered by abstract data types for
timed multimedia data and is not fully implemented in any well-known commercial system. The
Oracle 10g implements only Part 5 of the standard through the SQL/MM StillImage types
(e.g. SI_StillImage), which is an alternative, standard-compliant solution to the more powerful
Oracle-specific ORDImage type [Jankiewicz and Wojciechowski, 2004]. Oracle 10g also partially
supports the full-text and spatial parts of SQL/MM (i.e. Part 2 and Part 3).
To the best knowledge of the author of this thesis, neither the prototypes nor the commercial
MMDBMSs have considered format independence of continuous multimedia data. The end-user
perspective is likewise neglected, because the systems do not address the requirement of
providing a variety of formats on request. None of the systems provides format and quality conversion
through transcoding, besides the adaptation possibility deriving only from using a scalable
format. What is more, the serving of different clients respecting the hardware properties and
limitations (e.g. mobile devices vs. multimedia set-top boxes) is usually solved by limiting the
search result to the subset restricted by the constraints of the user's context, which means that
only the data suiting the given platform are considered during the search. Besides, none of them
has considered being consistent with the MPEG-7 MDS model [MPEG-7 Part V, 2003].
II.5.2. Media (Streaming) Servers
As pointed out previously, the creation of a fully featured MMDBMS supporting continuous data
has failed so far, probably due to the enormous complexity of the task. On the other hand, the
demand for the delivery of audio-visual data, imposing needs for effective storage and transmission,
has been rising incessantly. As a result, the development of simple and pragmatic solutions has
been started, which has delivered today's audio and video streaming server solutions, being
especially useful for video-on-demand applications.
The RealNetworks Media Server [RealNetworks, 2002], the Apple Darwin Streaming Server
[Apple, 2003] and the Microsoft Windows Media Services [Microsoft, 2003] are definitely the
best-known commercial products currently available. All of them deliver the continuous data
with low latency and control the client access to the stored audio/video data on the server side.
The drawback is that the storage of data on the server is conducted only in a proprietary format,
and as such it is also streamed to the client over the network in that format. To provide at least some
degree of scalability, including various qualities of the data accompanied by a range of bandwidth
characteristics, the redundancy approach has to be applied, and usually several instances of the
same video, pre-coded at various bit rates and resolutions, are stored on the server.
III. FORMATS AND RT/OS ISSUES
III.1. Storage Formats
Only the lossless and/or scalable coding schemes are discussed within this section. The non-scalable
and lossy ones are of no interest, as can be derived from the later part of the work. But
before going into details, a few general remarks have to be stated. It is a fact that multimedia
applications have to cope with huge amounts of data, and thus compression techniques are
exploited for storage and transmission. It is also clear that the trade-off between compression
efficiency and data quality has to be considered when choosing the compression method. As such,
the first decision is to choose between lossless and lossy solutions. Secondly, the number of
provided SNR quality levels should be selected: either discrete with no scalability (one quality
level) or just a few levels, or continuous12 with a smooth quality transition over the quality range
(often represented by fine-granular scalability). It is taken for granted that higher
compression leads to lower data quality for lossy algorithms, and that the introduction of
scalability lowers the coding efficiency for the same quality in comparison to non-scalable
coding schemes. Moreover, lossless compression is much worse than lossy compression
when considering only the coding efficiency. All these general remarks apply to both audio and
video data and their processing.
12 The term continuous is overstated in this case, because there is no real continuity in the digital world. As such, the border between discrete and continuous is very vague, i.e. one could ask how many levels are required to call it continuous and not discrete anymore. It is suggested that anything below 5 levels is discrete, but the decision is left to the reader.
III.1.1. Video Data
When this work was started, there was no scalable and lossless video compression available to
the author's knowledge. Therefore, only-scalable and only-lossless coding schemes for video data
are discussed first. Next, as an exception, MJPEG-2000, being a possible lossless scalable
image coding format applied to video encoding, is shortly described, and the 3D-DWT-based
solution is mentioned subsequently. With respect to the newest, still ongoing and unfinished,
standardization process on MPEG-4 SVC [MPEG-4 Part X, 2007], some remarks are given in
the Further Work.
III.1.1.1 Scalable codecs
The MPEG-4 FGS profile [Li, 2001] defines two layers with identical spatial resolution, where
one is the base layer (BL) and the other is the enhancement layer (EL). The base layer is compressed
according to the advanced simple profile (ASP) including temporal prediction, while the EL
stores the difference between the originally calculated DCT coefficients and the coarsely quantized
DCT coefficients stored in the BL using a bit-plane method, such that the enhancement bit stream
can be truncated at any position, and thus fine granularity scalability is provided. There is
no temporal prediction within the EL, so the decoder is resistant to error drift and robust in
recovering from errors in the enhancement stream.
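The bit-plane idea behind the EL can be illustrated with a few lines of Python: the residual coefficients are encoded most significant bit-plane first, so cutting the plane list at any point yields a coarser but still decodable approximation. This is a schematic illustration of bit-plane coding in general, with assumed helper names, not the MPEG-4 FGS syntax.

```python
import numpy as np

def to_bitplanes(residual: np.ndarray, n_planes: int = 4):
    """Split absolute residual values into bit-planes, most significant plane first."""
    magnitude, signs = np.abs(residual), np.sign(residual)
    planes = [(magnitude >> b) & 1 for b in range(n_planes - 1, -1, -1)]
    return signs, planes

def from_bitplanes(signs, planes, n_planes: int = 4):
    """Reconstruct from however many leading planes were received (the truncation point)."""
    magnitude = np.zeros_like(signs)
    for i, plane in enumerate(planes):
        magnitude |= plane << (n_planes - 1 - i)
    return signs * magnitude

residual = np.array([13, -7, 2, 0, -1, 5])     # toy EL residuals, values fit in 4 bit-planes
signs, planes = to_bitplanes(residual)
print(from_bitplanes(signs, planes[:2]))       # truncated EL -> coarse residuals [12 -4 0 0 0 4]
print(from_bitplanes(signs, planes))           # full EL -> exact residuals [13 -7 2 0 -1 5]
```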
Two extensions to FGS have been proposed in order to lower the FGS coding deficiencies due to
the missing temporal redundancy removal in the enhancement layer. The Motion-Compensated
FGS (MC-FGS) [Schaar and Radha, 2000] is one of the extensions, proposing that higher-quality
frames from the EL be used as reference for motion compensation, which leads to smaller
residuals in the EL and thus better compression, but at the same time suffers from higher error
drift due to error propagation within the EL. The Progressive FGS (PFGS) [Wu et al., 2001] is
the other proposal, which adopts a special prediction method for the EL in a separate loop using
only a partial temporal dependency on the higher-quality frame. As such, it closes the gap between
the relatively inefficient FGS with no drift in the EL and the efficient MC-FGS with susceptibility to
error drift in the EL. Moreover, an enhancement to PFGS called Improved PFGS [Ding and Guo,
2003] has already been proposed, where the coding efficiency is even higher (for the same
compressed size the PSNR [Bovik, 2005] gain reaches 0.5 dB) by using the higher-quality frame as
reference for all frames; additionally, error accumulation is prevented by attenuation factors.
Even though very high quality can be achieved by MPEG-4 FGS-based solutions, it is
impossible to obtain losslessly coded information due to the lossy properties of the DCT and
quantization steps of the MPEG-4 standard's implementation.
III.1.1.2 Lossless codecs
Many different lossless video codecs, which provide high compression performance and
efficient storage, are available. They may be divided into two groups: 1) those using general-purpose
compression methods, and 2) those using transform-based coding.
The first group exploits lossless general-purpose compression methods like Huffman coding or the
Lempel-Ziv algorithm and its derivatives, e.g. LZ77, LZ78, LZW or LZY [Effelsberg and
Steinmetz, 1998; Gibson et al., 1998]. A more efficient solution is to combine these methods into
one—known as DEFLATE (defined by Phil Katz in PKZIP and used in the popular GZIP)—
examples of such video codecs are: the Lossless Codec Library (known as LCL AVIzlib/mszh),
the LZO Lossless Codec, and CSCD (RenderSoft CamStudio). These methods are still relatively
inefficient due to their generality and the unexploited spatial and temporal redundancy, so a more
enhanced method employing spatial or temporal prediction to exploit the redundancies in video
data would improve the compression performance. Examples of such enhanced methods are
HuffYUV [Roudiak-Gould, 2006] and Motion CorePNG, and it is also assumed that the
proprietary AlparySoft Lossless Video Codec [WWW_AlparySoft, 2004] belongs to this group
as well.
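A minimal Python sketch of the spatial-prediction idea used by such enhanced lossless codecs follows: each pixel is predicted from its left neighbour and only the residual is handed to a general-purpose entropy coder (zlib serves as a stand-in here); the predictor choice and the use of zlib are illustrative assumptions, not the HuffYUV algorithm.

```python
import numpy as np
import zlib

def left_predict_residuals(frame: np.ndarray) -> np.ndarray:
    """Residuals of a left-neighbour predictor, kept lossless by wrapping modulo 256."""
    predicted = np.zeros_like(frame)
    predicted[:, 1:] = frame[:, :-1]              # predict each pixel from its left neighbour
    return (frame.astype(np.int16) - predicted.astype(np.int16)).astype(np.uint8)

rng = np.random.default_rng(2)
frame = np.cumsum(rng.integers(-2, 3, (64, 256)), axis=1).astype(np.uint8)   # smooth synthetic luma

raw = zlib.compress(frame.tobytes())
res = zlib.compress(left_predict_residuals(frame).tobytes())
print(len(raw), len(res))    # the small residuals of smooth content typically compress much better
```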
The second group of codecs uses compression techniques derived from transform-based
algorithms, which were originally developed for still images. If such a technique is applied to video
data on a frame basis, it produces a simple sequence of pictures, which are independently
coded without loss of information. Typical examples are Lossless Motion JPEG [Wallace, 1991]
or LOCO-I [Weinberger et al., 2000]. There are also many implementations available, both in
software (e.g. PICVideo Lossless JPEG, Lead Lossless MJPEG) and in hardware (e.g.
Pinnacle DC10, Matrox DigiSuite).
III.1.1.3 Lossless and scalable codecs
The combination of lossless and scalability properties in video coding has been proposed by
recent research; however, no ready, implemented solutions have been provided. One
important solution focused on video is MPEG-4 SVC [MPEG-4 Part X, 2007], discussed in
the Further Work (due to its work-in-progress status).
Another possible lossless and scalable solution for video data could be the use of image
compression technology based on a wavelet transform, e.g. Motion JPEG 2000 (MJ2) using the
two-dimensional discrete wavelet transform (2D-DWT) [Imaizumi et al., 2002]. However, there are
still many unsolved problems regarding the efficient exploitation of temporal redundancies (JPEG
2000 only covers still pictures, not motion). Extensions for video exploiting temporal
characteristics are still under development.
Last but not least could be the application of the three-dimensional discrete wavelet transform
(3D-DWT) to video coding. The 3D-DWT operates not in two dimensions, as the transform used by
MJ2, but in three dimensions, i.e. the third dimension is time. This allows treating a video
sequence, or a part of the video sequence consisting of more than one frame, called a group of
pictures (GOP), as a 3-D space of luminance or chrominance values. There is, however, an
unacceptable drawback for real-time compression – a time buffer of at least the size of the GOP
has to be introduced. On the other hand, it allows better exploitation of the correlation between
pixels in subsequent frames around one point or macroblock of the frame, there is no need
to divide the frame into non-overlapping 2-D blocks (which avoids blocking artifacts), and
the method inherently allows for scaling [Schelkens et al., 2003]. It is also assumed that, by
selecting more important regions, one could achieve a compression of 1:500 (vs. 1:60 for DCT-based
algorithms [ITU-T Rec. T.81, 1992]). Still, the computing cost of 3D-DWT-based
algorithms may be much higher than that of previously published video compression algorithms like
the 2D-DWT-based ones. Moreover, only a proprietary prototypical implementation has been developed.
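The temporal part of the 3D-DWT idea can be shown in a few lines: a single-level Haar transform along the time axis of a GOP splits it into a low-pass (average) and a high-pass (detail) sub-band before any spatial transform is applied. This is only a toy, single-level Haar decomposition for illustration, not one of the cited codecs.

```python
import numpy as np

def temporal_haar(gop: np.ndarray):
    """One Haar decomposition level along the time axis of a GOP
    shaped (frames, height, width); the frame count must be even."""
    even, odd = gop[0::2].astype(float), gop[1::2].astype(float)
    low = (even + odd) / np.sqrt(2)    # temporal average sub-band
    high = (even - odd) / np.sqrt(2)   # temporal detail sub-band
    return low, high

def inverse_temporal_haar(low, high):
    """Perfect reconstruction of the GOP from the two temporal sub-bands."""
    even = (low + high) / np.sqrt(2)
    odd = (low - high) / np.sqrt(2)
    gop = np.empty((low.shape[0] * 2,) + low.shape[1:])
    gop[0::2], gop[1::2] = even, odd
    return gop

rng = np.random.default_rng(3)
gop = rng.integers(0, 255, (8, 16, 16)).astype(float)   # an 8-frame GOP of luminance values
low, high = temporal_haar(gop)
print(np.allclose(inverse_temporal_haar(low, high), gop))   # True -> reversible decomposition
```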
III.1.2. Audio Data
Contrary to the video section, the audio section discusses only lossless and scalable coding schemes,
because only a few such coding schemes are applicable within this work. In general, there are a few
methods of storing audio data without loss: 1) in uncompressed form, e.g. in a RIFF WAVE file
using Pulse Code Modulation, Differential PCM, Adaptive DPCM, or any other modulation
method; 2) in compressed form using standard lossless compression algorithms like ARJ, ZIP
or RAR; or 3) in compressed form using one of the audio-specific algorithms for lossless
compression. From the perspective of this work, only the third class is shortly described, due to its
capability of layering, its compression efficiency and its relation to audio data.
The best-known and most efficient coding schemes using audio-specific lossless compression are as
follows:
• Free Lossless Audio Codec (FLAC) [WWW_FLAC, 2006]
• Monkey's Audio [WWW_MA, 2006]
• LPAC
• MPEG-4 Audio Lossless Coding (ALS) [Liebchen et al., 2005] (LPAC has been used as a reference model for ALS)
• RKAU from M Software
• WavPack [WWW_WP, 2006]
• OptimFrog
• Shorten from SoftSound
• Apple Lossless
• MPEG-4 Scalable Lossless Coding (SLS) [MPEG-4 Part III FDAM5, 2006]
Out of all the mentioned coding schemes, only WavPack, OptimFrog and MPEG-4 SLS are (to
some extent) scalable formats, while the others are un-scalable, which means that all data must be
read from the storage medium and decoded completely, and there is no way of obtaining a lower
quality through partial decoding. Both WavPack and OptimFrog have a feature providing rudimentary
scalability, such that there are only two layers: a relatively small lossy file (base layer) and an
error-compensation file which has to be combined with the decoded lossy data to provide the
lossless information (enhancement layer). This scalability feature is called Hybrid Mode in
WavPack and DualStream in OptimFrog, respectively.
Besides the mentioned two-layer scalability, fine granular scalability would be possible in
WavPack, because the algorithm uses linear prediction coding where the most significant bits are
encoded first. As such, if some rework of the WavPack implementation were done, it would be
possible to use every additional bit layer for improving the quality and to make WavPack a
scalable-to-lossless audio codec with multiple layers of scalability. However, such rework has not been
implemented so far and no scalability finer than two layers has been proposed.
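The two-layer (hybrid) idea common to these codecs can be sketched in a few lines of Python: a coarsely quantized base layer approximates the signal, and a correction layer stores the exact residual, so that base plus correction reconstructs the original samples bit-exactly. The quantization step and the plain-array representation are assumptions of this toy model, not the WavPack or OptimFrog formats.

```python
import numpy as np

def encode_two_layer(samples: np.ndarray, step: int = 64):
    """Base layer: coarsely quantized samples; enhancement layer: exact residual."""
    base = np.round(samples / step).astype(int)   # lossy base layer
    correction = samples - base * step            # residual needed for lossless reconstruction
    return base, correction

def decode(base: np.ndarray, correction: np.ndarray = None, step: int = 64):
    """Decode the base only (lossy) or base + correction (lossless)."""
    lossy = base * step
    return lossy if correction is None else lossy + correction

samples = np.array([1023, -400, 8191, -32768, 15], dtype=int)   # toy 16-bit PCM samples
base, corr = encode_two_layer(samples)
print(decode(base))                                   # lossy approximation from the base layer alone
print(np.array_equal(decode(base, corr), samples))    # True -> lossless with the enhancement layer
```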
Similarly, two layers are proposed in MPEG-4 SLS by default [Geiger et al., 2006], i.e. the
base layer (also called core) and the enhancement layer. The core of MPEG-4 SLS uses the
well-known MPEG-4 AAC coding scheme [MPEG-2 Part VII, 2006], while the EL stores the
transform-based encoded error values (i.e. the difference between the decoded core and the
lossless source) coded according to the SLS algorithm. The SLS's EL, analogously to WavPack,
also stores the most significant bits of the difference first, but in contrast the SLS algorithm
allows cutting off bits from the EL, so fine-granular scalability is possible. Moreover, the
transform-based coding of MPEG-4 SLS outperforms the linear prediction coding used in
WavPack in the coding efficiency at low bit rates. Additionally, the MPEG-4 SLS can also be
switched into the SLS-only mode where no audio data are stored in the base layer (no core)13,
and just the enhancement layer encodes the error. Such solution is equal to encoding of error
difference between zero and lossless signal value in the scalable from-the-first-bit manner. Of
course, at low bitrates the SLS-only method cannot compete with the AAC-base core.
III.2. Real-Time Issues in Operating Systems
The idea of transcoding-driven format independence of audio and video data imposes enormous performance requirements on the operating system, covering both computing power and data transfers. Moreover, the time constraints of continuous data limit the applicable solutions to only a subset of the existing algorithms. For example, preemptive scheduling with a few priority levels [Tannenbaum, 1995], which has proved to be relatively simple and efficient in managing the workload in best-effort systems such as Linux or Windows, is insufficient in more complex scenarios where many threads with time constraints appear and execution deadlines have to be considered.
Additionally, in order to understand the later parts of this work, some background in the operating systems field is also required. Not all aspects related to OS issues are discussed within this section; just the most critical points and definitions are covered – more detailed aspects building on the definitions introduced here appear whenever required in the respective chapters (as inline explanations or footnotes). At first, the execution modes, kernel architectures and inter-
13 This method is called SLS-based core, but the name may be misleading, because no bits are spent on storing audio samples in the base layer. Only descriptive and structural information such as bit rate, number of channels, sampling rate, etc. is stored in the BL.
process communication are discussed; then the real-time processing models are briefly introduced, and finally the related scheduling algorithms are referenced.
III.2.1. OS Kernel – Execution Modes, Architectures and IPC
In general, two kinds of memory space (or execution modes) are distinguishable: user space (user mode) and kernel space (kernel mode). As may be derived from the name, the OS kernels (usually together with the device drivers and kernel extensions or modules) run in the non-swappable14 kernel memory space in kernel mode, while the applications use the swappable user memory and are executed in user mode [Tannenbaum, 1995]. Of course, programs in user mode cannot access the kernel space, which is very important for system stability and allows the system to contain buggy or malicious applications.
On the other hand, there are two types of OS kernel architecture commonly used15: a micro-kernel and a monolithic kernel. Monolithic kernels embed all the OS services, such as memory management (address spaces with mapping and TLB16, virtual memory, caching mechanisms), threads and thread switching, scheduling mechanisms, interrupt handling, the inter-process communication (IPC17) mechanism [Spier and Organick, 1969], file system support (block devices, local FS, network FS) and network stacks (protocols, NIC drivers), PnP hardware support, and many more. At the same time, all these services run in the kernel space in kernel mode. Modern monolithic kernels are usually re-configurable through dynamic loading of additional kernel modules, e.g. when a new hardware driver or support for a new file system is required. In this respect only, they are similar to micro-kernels. What distinguishes micro-kernels from monolithic kernels is the embedded set of OS services, and the
14 Swappable means that parts of the virtual memory (VM) may be swapped out to secondary storage (usually to a swap file or swap partition on a hard disk drive) temporarily whenever the process is inactive or the given part of the VM has not been used for some time. Analogously, non-swappable memory cannot be swapped out.
15 A nano-kernel is intentionally left out due to its limitations, i.e. it provides only a minimalist hardware abstraction layer (HAL) including the CPU interface. The interrupt management and the memory management unit (MMU) interface are very often included in nano-kernels (due to the coupled CPU architectures) even though they do not really belong there. Nano-kernels are suitable for real-time single-task applications, and thus are applicable in embedded systems for hardware independence. Alternatively, it might be said that nano-kernels are not OS kernels in the traditional sense, because they do not provide a minimal set of OS services.
16 TLB stands for Translation Lookaside Buffer, which is a cache of mappings from the operating system's page table.
17 IPC finds its roots in the meta-instructions defined for multiprogrammed computations by [Dennis and Van Horn, 1966] within MIT's Project MAC.
address space and execution mode used for these services. Only a minimal set of abstractions (address spaces, threads, IPC) is included in the micro-kernel [Härtig et al., 1997], which is used as a base for the user-space implementation of the remaining functionality (i.e. for the other OS services, usually called servers). In other words, only the micro-kernel with its minimal concepts (primitives) runs in kernel mode using kernel-assigned memory, and all the servers are loaded into the protected user memory space and run at the user level [Liedtke, 1996]. Thus micro-kernels have some advantages over monolithic kernels, such as a smaller binary image, a reduced memory and cache footprint, a higher resistance to malicious drivers, and easier portability to different platforms [Härtig et al., 1997]. An example of a micro-kernel is L4 [Liedtke, 1996], implementing primitives such as: 1) address spaces with grant, map and flush, 2) threads and IPC, 3) clans and chiefs for kernel-based restriction of IPCs (the clans and chiefs themselves are located in user space), and 4) unique identifiers (UIDs). One remark: device-specific interrupts are abstracted to IPCs and are not handled internally by the L4 micro-kernel, but rather by the device drivers in user mode.
An efficient implementation of IPC is required by micro-kernels [Härtig et al., 1997] and is also necessary for parallel and distributed processing [Marovac, 1983]. IPC allows for data exchange (usually by sending messages) between different processes or processing threads [Spier and Organick, 1969]. On the other hand, by limiting the IPC interface of a given server or by implementing a different IPC redirection mechanism, additional security may be enforced through efficient and more flexible access control [Jaeger et al., 1999]. However, a poorly designed IPC mechanism introduces additional overhead into the OS kernel [Liedtke, 1996].
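As a purely generic illustration of message-based IPC between threads, the following sketch shows a blocking send/receive pattern using POSIX threads. It does not mirror the L4 primitives or their addressing scheme; the mailbox structure and function names are made up for this example.

    #include <pthread.h>

    /* Minimal single-slot mailbox illustrating synchronous message-based IPC
     * between two threads. Assume 'lock' and 'cond' are initialized with
     * PTHREAD_MUTEX_INITIALIZER / PTHREAD_COND_INITIALIZER and 'full' with 0. */
    struct mailbox {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        int             full;   /* 1 while a message waits to be received     */
        void           *msg;
    };

    void mbox_send(struct mailbox *mb, void *msg)
    {
        pthread_mutex_lock(&mb->lock);
        while (mb->full)                        /* wait until the slot is free */
            pthread_cond_wait(&mb->cond, &mb->lock);
        mb->msg  = msg;
        mb->full = 1;
        pthread_cond_broadcast(&mb->cond);      /* wake a waiting receiver     */
        pthread_mutex_unlock(&mb->lock);
    }

    void *mbox_receive(struct mailbox *mb)
    {
        pthread_mutex_lock(&mb->lock);
        while (!mb->full)                       /* block until a message is in */
            pthread_cond_wait(&mb->cond, &mb->lock);
        void *msg = mb->msg;
        mb->full  = 0;
        pthread_cond_broadcast(&mb->cond);      /* wake a waiting sender       */
        pthread_mutex_unlock(&mb->lock);
        return msg;
    }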
III.2.2. Real-time Processing Models
III.2.2.1 Best-effort, soft- and hard-real-time
A few processing models applicable to on-line multimedia processing can be distinguished with respect to real-time behavior. The first one is best-effort processing without any consideration of time during scheduling and without any QoS control, for example the mentioned preemptive scheduling with a few priority levels applicable in best-effort OSes [Tannenbaum, 1995].
The second processing model is recognized as soft real-time, where deadline misses are acceptable (with processing being continued) and time buffers are required for keeping constant quality. However, if the execution time varies so much that the limited buffer overruns, a frame skip may occur and the quality will drop noticeably. Moreover, delays depending on the size of the time buffer are introduced, so the trade-off between buffer size and quality guarantees is crucial here (and guarantees of delivering all frames cannot be given). Additionally, a sophisticated scheduling algorithm must allow for variable periods per frame in such a case, so the jitter-constrained periodic streams model (JCPS) [Hamann, 1997] is applicable here.
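The buffer/skip trade-off can be made concrete with an extremely simplified simulation (a sketch only, not the JCPS model itself; the period, buffer size and per-frame execution times below are made-up numbers): as long as the accumulated slack stays non-negative the frames are delivered, and once the time buffer is exhausted a frame has to be skipped.

    #include <stdio.h>

    /* Minimal soft-real-time playout simulation: each frame has a period of
     * PERIOD_MS; execution times vary. A time buffer of BUFFER_MS absorbs
     * overruns, but once it is exhausted the frame is skipped (quality drop). */
    #define PERIOD_MS  40.0   /* 25 fps                                       */
    #define BUFFER_MS  80.0   /* hypothetical time buffer (two frame periods)  */

    int main(void)
    {
        const double exec_ms[] = { 30, 35, 90, 70, 20, 25, 110, 30 }; /* made up */
        double slack = BUFFER_MS;               /* pre-buffered time            */
        for (unsigned i = 0; i < sizeof exec_ms / sizeof exec_ms[0]; i++) {
            slack += PERIOD_MS - exec_ms[i];    /* gain or lose slack per frame */
            if (slack > BUFFER_MS) slack = BUFFER_MS;  /* buffer cannot grow    */
            if (slack < 0) {                    /* buffer exhausted: skip frame */
                printf("frame %u skipped\n", i);
                slack = 0;                      /* continue with empty buffer   */
            } else {
                printf("frame %u delivered (slack %.0f ms)\n", i, slack);
            }
        }
        return 0;
    }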
The third model is hard real-time, where deadline misses are not allowed and the processing must be executed completely. Hence, guarantees for delivering all frames can be given here. However, a waste of resources is the problem, because in order to guarantee complete execution on time, worst-case scheduling must be applied [El-Rewini et al., 1994].
III.2.2.2 Imprecise computations
Imprecise computations [Lin et al., 1987] constitute a scheduling model that obeys deadlines like the hard-real-time model but applies to flexible computations, i.e. if there is still time left within the strict period, the result should be improved to achieve better quality. Here deadline misses are not allowed, but the processing is adapted to meet all deadlines at the cost of graceful degradation of the result's quality. The idea is that the minimum level of quality is provided by the mandatory part, and improved quality is delivered by optional computations if the resources are available.
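A minimal sketch of this mandatory/optional split could look as follows (an illustration of the principle only; decode_mandatory() and refine_once() are hypothetical job functions, and the deadline handling of a real scheduler is of course far more involved):

    #include <time.h>

    /* Imprecise-computation idea: the mandatory part always runs and yields a
     * result of minimum quality; the optional part is refined only as long as
     * the period's deadline has not been reached.                             */
    static double now_ms(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
    }

    /* Hypothetical job functions supplied by the media processing code.       */
    extern void decode_mandatory(void *frame);
    extern int  refine_once(void *frame);      /* returns 0 when fully refined */

    void process_frame(void *frame, double deadline_ms)   /* absolute deadline */
    {
        decode_mandatory(frame);               /* guarantees minimum quality   */
        while (now_ms() < deadline_ms) {       /* optional part: use leftover  */
            if (refine_once(frame) == 0)       /* time for graceful quality    */
                break;                         /* improvement                  */
        }
    }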
III.2.3. Scheduling Algorithms and QAS
The scheduling algorithm based on the idea of imprecise computations, which is applicable to real-time audio-video transformation with minimal quality guarantees and no deadline misses, is the quality-assuring scheduling (QAS) proposed by [Steinberg, 2004]. QAS is based on the periodic use of resources, where the application may split the resource request into a few sub-tasks with different fixed priorities. The reservation time is the sub-task's exclusive time allocated on the resource. This allocation is done a priori, i.e. during the admission test before the application is started. Moreover, the allocation is possible only if there is still free (unused) reservation time available on the CPU.
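A strongly simplified admission test along these lines is sketched below. It is only an illustration: the real QAS admission in [Steinberg, 2004] additionally handles per-sub-task fixed priorities and optional parts and is performed per resource, and the utilization bound of 1.0 is a simplifying assumption.

    #include <stddef.h>

    /* One periodic reservation: 'reserved_us' microseconds of exclusive CPU
     * time granted within every period of 'period_us' microseconds.          */
    struct reservation {
        unsigned long reserved_us;
        unsigned long period_us;
    };

    /* Simplified a-priori admission test: the new request is admitted only if
     * the total utilization of all already admitted reservations plus the new
     * one does not exceed the CPU capacity (bound of 1.0 assumed here).       */
    int admit(const struct reservation *admitted, size_t n,
              const struct reservation *request)
    {
        double u = (double)request->reserved_us / request->period_us;
        for (size_t i = 0; i < n; i++)
            u += (double)admitted[i].reserved_us / admitted[i].period_us;
        return u <= 1.0;   /* free reservation time still available on the CPU */
    }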
The QAS scheduling [Steinberg, 2004] is the most suitable for format independence provision using real-time transformations of continuous multimedia data within the RETAVIC project, because guarantees of delivering all frames may be given and the minimum level of quality will always be provided. It is also assumed that the complete task does not have to be finished by the deadline, but just its mandatory part delivering the minimum requested quality; thus the optional jobs (sub-tasks) within the periodic task may not be executed at all or may be interrupted and stopped. So, QAS will be referred to further in this work and clarified whenever needed.
The reader can find an extensive discussion, including advantages and disadvantages, of scheduling algorithms usable for multimedia data processing in the related work of [Hamann et al., 2001a]. This discussion refers to: 1) extended theoretical research on imprecise computations (Liu, Chong, or Baldwin), resulting in just one attempt to build a system, covering CPU allocation and no other resources (Hull); 2) time-value functions (Jensen with Locke, or Rajkumar with Lee), where the varying execution time of jobs, but no changes in system load, are considered; 3) Abdelzaher's system for QoS negotiation, which considers only long-lived real-time services and assumes the tasks' resource requirements to be constant; 4) a network-oriented view on packet scheduling in media streams (West with Poellabauer), also with guarantees of a specific level of quality, but not considering the semantic dependencies between packets; 5) statistical rate-monotonic scheduling (SRMS by Atlas and Bestavros), which relies on the actual execution time and on scheduling at the beginning of each period; and 6) the resource kernel (Oikawa and Rajkumar), which manages and provides resources directly in the kernel (in contrast, QAS is server-based on top of a micro-kernel) and, moreover, does not address overloads, quality, or resources for optional parts.
Since the opinion on applying QAS in the RETAVIC project is similar to the one presented by [Hamann et al., 2001a], namely that it is applicable to real-time multimedia processing, scheduling algorithms are not discussed further in this work.
Chapter 3 – Design
Entia non sunt multiplicanda praeter necessitatem.18 William of Occam
(14th Century, “Occam’s Razor”)
IV. SYSTEM DESIGN
This chapter gives an overview of the RETAVIC conceptual model. At first, the architecture requirements are briefly stated. Next, the general idea based on the generic 3-phase video server framework proposed in [Militzer, 2004] is depicted – however, the 4-phase model (as already published in [Suchomski et al., 2005]) is described in more detail, considering both audio and video, and then each phase is further clarified in a separate subsection. Some drawbacks of the previously described models of the architecture are mentioned in the respective subsections, and extensions are proposed accordingly. Subsequently, the evaluation of the general idea is given in a short summary. Next, the hard-real-time adaptive model is proposed. Finally, the real-time issues in the context of continuous media according to the proposed conceptual model are described.
The subsections of the conceptual model referring to storage, including the internal format and meta-data (IV.2.3), and to real-time transcoding (IV.2.4) are further explained for each media type
18 In English: Entities should not be multiplied beyond necessity. Based on this sentence the KISS principle has been developed: Keep It Simple, Stupid — but never oversimplify.
(media-specific details) separately in the following sections of this chapter: video in Section V,
audio in Section VI, and both combined as multi-media in Section VII.
IV.1. Architecture Requirements
The main difficulty of media transformation is caused by the huge complexity of the process
[Suchomski et al., 2004], especially for continuous media where it must be done in real-time to
allow uninterrupted playout. The processing algorithms for audio and video require enormous
computational resources, and their needs vary heavily since they depend on the input data. Thus
accurate resource allocation is often impossible due to this unpredictable behavior [Liu, 2003]
(huge difference between worst-case and average execution times), and the missing scalability
and adaptability of the source data formats inhibit QoS control.
The essential objective of this project is to develop functionality extending today's multimedia database services that provides efficient and format-transparent access to multimedia data by utilizing real-time format conversion respecting users' specific demands. In other words, the RETAVIC project has been initiated with the goal of enabling format independence by real-time audio-video conversion in multimedia database servers. The conversion services will run in a real-time environment (RTE) in order to provide a specific QoS. A few main directions have been defined in which the multimedia DBMS should be extended:
Data independence – various clients' requests should be served by transparent on-line
format transformations (format independence) without regard to physical storage and access
methodology (data independence).
Data completeness – data should be stored and provided without any loss of information,
whenever it is required.
Data access scalability – just a portion of the data should be accessed when lower quality is requested, and a complete read should take place only if lossless data are requested.
Real-time conversion – a user should be allowed to transparently access data on-demand.
So, the conversion should be executed just-in-time, which causes specific real-time
requirements and thus requires QoS control.
Redundancy/transformation trade-off – a single copy (an instance) of each media object should be kept to save space and to ease updates. However, the system must not be limited to exactly one copy, especially when many clients with identical quality and format requirements are expected to be served, e.g. using caching proxies.
Real-time capturing – lossless insertion (recording) should be supported, which is
especially important in scientific and medical fields.
These directions do not exclude Codd's well-known Twelve Rules [Codd, 1995]; on the contrary, they extend them by considering continuous multimedia data. For example, the first direction specifies a method for Codd's 8th and 9th rules19. Moreover, Codd's rules have been defined for an ideal relational DBMS and so are not fully applicable to other DBMSes such as ODBMS, ORDBMS or XML-specific DBMS. For example, Codd's 1st rule (Information Rule – all data presented in tables) is of little use with respect to multimedia data, where the presentation layer is usually left to the end-user application.
Based on the mentioned six directions and the limitations of existing media-server solutions (previously discussed in Chapter 2), a proposal of a generic real-time media transformation framework for multimedia database management systems and multimedia servers is developed and presented in the following part.
IV.2. General Idea
The generic real-time media transformation framework defined within the RETAVIC project (in short, the RETAVIC framework) finally consists of four phases: real-time capturing, non-real-time preparation, storage, and real-time delivery.
19 Codd's Rules: 1. The Information Rule, 2. Guaranteed Access Rule, 3. Systematic Treatment of Null Values, 4. Dynamic On-Line Catalog Based on the Relational Model, 5. Comprehensive Data Sublanguage Rule, 6. View Updating Rule, 7. High-level Insert, Update, and Delete, 8. Physical Data Independence, 9. Logical Data Independence, 10. Integrity Independence, 11. Distribution Independence, 12. Nonsubversion Rule.
The RETAVIC framework is depicted in Figure 4.
Figure 4.
Generic real-time media transformation framework supporting format
independence in multimedia servers and database management systems.
Remark: dotted lines refer to optional parts that may be skipped within a phase.
The real-time capturing (Phase 1) includes a fast and simple lossless encoding of captured
multimedia stream and an input buffer for encoded multimedia binary stream (details are given
in Section IV.2.1). Phase 2 –non-real-time preparation phase– prepares the multimedia objects
and creates meta-data related to these objects. Finally it forwards all the produced data to a
multimedia storage system. An optional archiving of the origin source, a format conversion
from many different input formats into an internal format (which has specific characteristics),
and a content analysis are executed in Phase 2 (described in Section IV.2.2). Phase 3 is a
multimedia storage system where the multimedia bitstreams are collected and the accompanying
meta-data are kept. Phase 3 has to be able to provide real-time access to the next phase (more
about that in Section IV.2.3). Finally, Phase 4 –real-time delivery– has two parallel processing channels. The first one is the real-time transcoding sub-phase, where the obligatory processes, namely real-time decoding and real-time encoding, take place and which may include optional real-time media processing (e.g. resizing, filtering). The second channel –marked as bypass delivery process in Figure 4– is used for direct delivery, which does not apply any transformations besides the protocol encapsulation. Both processing channels are treated as isolated, i.e. they work separately and do not influence each other (details about Phase 4 are given in Section IV.2.4).
There are a few general remarks concerning the proposed framework as a whole:
• phases may be distributed over a network – one must then consider the additional complexity and communication overhead, but can gain an increase in processing efficiency (and thus a higher number of served requests)
• Phase 1 is optional and may be completely skipped, however, only if the application scenario does not require live video and audio capturing
• Phase 1 is not meant for non-stop live capturing (there must be some breaks in between to empty the media buffer), because of the unidirectional communication problem between the real-time and non-real-time phases (solvable only by guaranteed resources with worst-case assumptions in the non-real-time phase)
There are also two additional boundary elements visible in Figure 4 besides those within the described four phases. They are not considered to be a part of the framework, but they help to understand its construction. The multimedia data sources and the multimedia clients with an optional feedback channel (useful in failure-prone environments) are the two constituents. The MM data sources represent input data for the proposed architecture. The sources are generally restricted by the previously mentioned remark about non-stop live capturing and by the available decoder implementations. However, if the assumption is made that a decoder is available for each encoded input format, the input data may be in any format and quality, including lossy compression schemes, because the input data are simply treated as origin information, i.e. source data for storage in the MMDBMS or MM server. The multimedia clients are analogously restricted only by the available real-time encoder implementations. Again, if the existence of such encoders is assumed for every existing format, the restriction of clients to a given format disappears thanks to the format independence provision by on-demand real-time transcoding. Moreover, the same classes of multimedia clients consuming identical multimedia data, which have already been requested and cached earlier, are supported by direct delivery without any conversion, in the same way as clients accessing the origin format of the multimedia stream.
IV.2.1. Real-time Capturing
The first phase is designed to deliver real-time capturing, which is required in some application scenarios where no frame drops are allowed. Even though grabbing of audio and video on the fly (live media recording) is a rather rare case considering the variety of applications in reality, where most of the time the input audio and video data have already been stored on some other digital storage or device (usually pre-compressed to some media-specific format), it must not be neglected. A typical use case for a capture-heavy scenario are scientific and industrial research systems, especially in the medical field, which regularly collect (multi)media data where loss of information is not allowed and real-time recording of the processes under investigation is a necessity20. Thus, the capturing phase of the RETAVIC architecture must provide a solution for information preservation and has to rely on a real-time system. Of course, if the multimedia data is already available as a media file, then Phase 1 can be completely skipped, as shown in Figure 4 by connecting the “Media Files” oval shape directly to Phase 2.
IV.2.1.1 Grabbing techniques
The process of audio-video capturing includes analog-digital conversion (ADC) of the signal, which is a huge topic of its own and will not be discussed in detail. Generally, the ADC is carried out only for the analog signal and produces the digital representation of the given analog output. Of course, the ADC is a lossy process, as is every discrete mapping of a continuous function. In many cases the ADC is already conducted directly by the recording equipment (e.g. within the digital camera's hardware). In general, there are a few possibilities of getting the audio and video signal from reality into the digital world:
20 In case of live streaming of already compressed data, like MPEG-2 coded digital TV channels distributed through DVB-T/M/C/S, these streams could simply be dumped to the media buffer in real-time. The process of dumping is not investigated due to its simplicity, and the media buffer is discussed in the next subsection.
• by directly connecting (from microphone or camera) to standardized digital interfaces like IEEE 139421 (commonly known as FireWire, limited to 400Mbps), Universal Serial Bus (USB) in version 2.0 (limited to 480Mbps), wireless Bluetooth technology in version 2.0 (limited to 2,1Mbps), Camera Link using a coaxial MDR-26 pin connector (or the optical RCX C-Link Adapter over MDR-26), or S/P-DIF22 (Sony/Philips Digital Interface Format), being a consumer version of IEC 6095823 Type II Unbalanced using a coaxial RCA jack, or Type II Optical using an optical F05 connector (known as TOSLINK or EIAJ Optical – an optical fiber connection developed by Toshiba).
• by using specialized audio-video grabbing hardware producing digital audio and
video like PC audio and video cards24 (e.g. Studio 500, Studio Plus, Avid Liquid Pro
from Pinnacle, VideoOh!’s from Adaptec, All-In-Wonder’s from ATI, REALmagic
Xcard from Sigma Designs, AceDVio from Canopus, PIXIC Frame Grabbers from
EPIX) –a subset of these is also known as Personal Video Recorder cards (PVRs e.g.,
WinTV-PVR from Hauppauge, EyeTV from Elgato Systems)–, or stand-alone digital
audio and video converters25 (e.g. ADVC A/D Converters from Canopus, DVD
Express or Pyro A/V Link from ADS Technologies, PX-AV100U Digital Video
Converter from Plextor, DAC DV Converter from Datavideo, Mojo DNA from Avid)
–a subset of stand-alone PVRs is also available (e.g. PX-TV402U from Plextor).
21 The first version of IEEE 1394 appeared in 1995, but the IEEE 1394a (2000) and 1394b (2002) amendments are also available. The new amendment IEEE 1394c is coming up, but it is already known as FireWire 800, which is limited to 800Mbps; thus the previous standards may be referred to as FireWire 400. However, to our knowledge, the faster FireWire has not yet been applied in audio-video grabbing solutions. The i.Link is Sony's implementation of IEEE 1394 (in which the 2 power pins are removed).
22 There is a small difference in the Channel Status Bit information between the specification of AES/EBU (or AES3) and its implementation by S/P-DIF. Moreover, the optical version of S/P-DIF is usually referred to as TOSLINK (contrary to "coaxial SPDIF" or just "SPDIF").
23 It is known as AES/EBU or AES3 Type II (Unbalanced or Optical).
24 Analog audio is typically connected through RCA jacks in cinch pairs, and analog video is usually connected through: coaxial RF (known as F-type, used for antennas; audio and video signal together), composite video (known as RCA – one cinch for video, a pair for audio), S-video (luminance and chrominance separately; known as Y/C video; audio separately) or component video (each R,G,B channel separately; audio also independently).
25 Digital audio is carried by the mentioned S/P-DIF or TOSLINK. Digital video is transmitted by DVI (Digital Video Interface) or HDMI (High-Definition Multimedia Interface). The HDMI is additionally capable of transmitting the digital audio signal.
• by using network communication, e.g. WLAN-based mobile devices (like stand-alone
cameras with built-in WLAN cards) or Ethernet-based cameras (connected directly to
the network),
• by using generic data acquisition (DAQ) hardware doing AD conversion, either stand-alone with USB support (e.g. HSE-HA USB from Harvard Apparatus, Multifunction DAQs from National Instruments, Personal DAQ from IOTech), PCI-based (e.g. HSE-HA PCI from Harvard Apparatus, DaqBoard from IOTech), PXI-based (PCI eXtensions for Instrumentation, e.g. Dual-Core PXI Express Embedded Controller from National Instruments) or Ethernet-based (e.g. E-PDISO 16 from Measurement Computing).
Regardless of the variety of grabbing technologies named above, it is assumed in the RETAVIC project that the audio and visual signals are already delivered to the system as digital signals for further capturing and storage (no ADC is analyzed any further). Moreover, it is assumed that huge amounts of continuous data (discussed in the next section) must be processed without loss of information26, for example when monitoring a scientific process with high-end 3CCD industrial video cameras connected through 3 digital I/O channels by DAQs, or when recording digital audio of a high-quality symphonic concert using multi-channel AD converters transmitting over coaxial S/P-DIF.
IV.2.1.2 Throughput and storage requirements of uncompressed media data
A few digital cameras grouped in different classes may serve as examples of throughput and storage requirements27:
1) very high-resolution 1CCD monochrome/color camera e.g. Imperx IPX-11M5-LM/LC,
Basler A404k/A404kc or Pixelink PL-A686C,
26 Further loss of information is meant, i.e. in addition to the loss already introduced by the ADC.
27 The digital cameras used as examples are representative of the market on September 3rd, 2006. The proposed division into five classes should not change in the future, i.e. the characteristics of each class allow existing cameras to be assigned correctly at any given point in time. For example, an HDTV camera might in the future be available in the group of consumer cameras (class 5) instead of professional cameras (class 4).
2) high-resolution 3CCD (3-chips) color camera e.g. Sony DXC-C33P or JVC KY-F75U,
3) high speed camera e.g. Mikrotron MotionBLITZ Cube ECO1/Cube ECO2/Cube3,
Mikrotron MC1310/1311, Basler A504k/A504kc,
4) professional digital camera, e.g. JVC GY-HD111E, Sony HDC-900 or Thomson LDK-6000 MKII (all three are HDTV cameras),
5) consumer digital (DV) camera, e.g. Sony DCR-TRV950E, JVC MG505, or Canon DM-XM2E or DC20.
The first three classes comprise industrial and scientific cameras. Class 4 refers to the professional market, while class 5 covers only the needs of standard consumers.
Class  Name       Model                   Width  Height  FPS   Pixel Bit  Throughput  File Size of
                                                                Depth      [Mbps]      60 sec. [GB]
1      Imperx     IPX-11M5-LM             4000   2672    5     12         612         4,48
1      Imperx     IPX-11M5-LC             4000   2672    5     12         612         4,48
1      Basler     A404k                   2352   1726    96    10         3717        27,22
1      Basler     A404kc                  2352   1726    96    10         3717        27,22
1      Pixelink   PL-A686C                2208   3000    5     10         316         2,31
2      JVC        KY-F75U                 1360   1024    7,5   12         120         0,88
2      Sony       DXC-C33P                752    582     50    10         209         1,53
3      Mikrotron  MotionBLITZ Cube ECO1   640    512     1000  8          2500        18,31
3      Mikrotron  MotionBLITZ Cube ECO2   1280   1024    500   8          5000        36,62
3      Mikrotron  MotionBLITZ Cube3       512    512     2500  8          5000        36,62
3      Basler     A504k                   1280   1024    500   8          5000        36,62
3      Basler     A504kc                  1280   1024    500   8          5000        36,62
3      Mikrotron  MC1310                  1280   1024    500   10         6250        45,78
3      Mikrotron  MC1311                  1280   1024    500   10         6250        45,78
4      JVC        GY-HD111E               1280   720     60    8          422         3,09
4      Sony       HDC-900                 1920   1080    25    12         593         4,35
4      Thomson    LDK-6000 MKII           1920   1080    25    12         593         4,35
5      Sony       DCR-TRV950E             720    576     25    8          79          0,58
5      JVC        MG505                   1173   660     25    8          148         1,08
5      Canon      DM-XM2E                 720    576     25    8          79          0,58
5      Canon      DC20                    720    576     25    8          79          0,58

Table 1.   Throughput and storage requirement for few digital cameras from different classes.
Table 1 shows how much data has to be handled during recording from the mentioned five classes of digital cameras. The bandwidth of video data ranges from 5 Mbps up to 6250 Mbps. Interestingly, the highest bit rate is achieved not by the highest resolution nor by the highest frame rate, but by a mixture of high resolution and high frame rate (for the high-speed cameras in class 3). Moreover, a file keeping just 60 seconds of the visual signal from the most demanding camera requires almost 46 GB of space on the storage system.
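The figures in Table 1 and Table 2 can be reproduced by a straightforward calculation; the small sketch below assumes, as the table values suggest, binary prefixes (1 Mbps = 2^20 bit/s and 1 GB = 2^30 bytes) and uses the Mikrotron MC1310 as an example.

    #include <stdio.h>

    /* Raw video bandwidth is width * height * fps * bit depth; the table
     * values match binary prefixes (Mi, Gi), which is assumed here.        */
    int main(void)
    {
        double width = 1280, height = 1024, fps = 500, bit_depth = 10;

        double bits_per_sec = width * height * fps * bit_depth;
        double mbps   = bits_per_sec / (1 << 20);              /* Mibit/s    */
        double gb_60s = bits_per_sec * 60 / 8 / (1UL << 30);   /* GiB, 60 s  */

        printf("throughput: %.0f Mbps, 60 s file: %.2f GB\n", mbps, gb_60s);
        /* prints: throughput: 6250 Mbps, 60 s file: 45.78 GB               */
        return 0;
    }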
Name           Width  Height  FPS  Pixel Bit  Throughput  File Size of
                                   Depth      [Mbps]      60 sec. [GB]
HDTV 1080p     1920   1080    25   8          396         2,90
HDTV 720p      1280   720     25   8          176         1,29
SDTV           720    576     50   8          158         1,16
PAL (ITU601)   720    576     25   8          79          0,58
CIF            352    288     25   8          19          0,14
CIFN           352    240     30   8          19          0,14
QCIF           176    144     25   8          5           0,04

Table 2.   Throughput and storage requirements for few video standards.
Additionally, the standard resolutions are listed in Table 2 in order to compare their bandwidth and storage requirements as well. Thus, the highest quality of the high-definition television standard (HDTV 1080p) requires as much as a low-end camera among the high-resolution industrial cameras (class 1). Consequently, the requirements of the consumer-market standards are not as critical as the scientific or industrial demands.
Analogously, the audio requirements are presented in Table 3. The throughput of uncompressed audio ranges from 84 kbps for low-quality speech, through 1,35 Mbps for CD quality, up to 750 Mbps for 128-channel studio recordings. Correspondingly, an audio file of 60 seconds needs from 600 kB, through 10 MB, up to 5,6 GB (please note that the values in the file size column are given in megabytes).
Name               Sampling    Sample Bit  Number of  Throughput  File Size of
                   frequency   Depth       channels   [Mbps]      60 sec. [MB]
                   [kHz]
Studio 32/192/128  192         32          128        750,000     5625
Studio 24/96/128   96          24          128        281,250     2109
Studio 32/192/48   192         32          48         281,250     2109
Studio 24/96/48    96          24          48         105,469     791
Studio 32/192/4    192         32          4          23,438      176
Studio 24/96/4     96          24          4          8,789       66
Studio 20/96/4     96          20          4          7,324       55
Studio 16/48/4     48          16          4          2,930       22
DVD 7+1            96          24          8          17,578      132
DVD 5+1            96          24          6          13,184      99
DVD Stereo         96          24          2          4,395       33
DAT                48          16          2          1,465       11
CD                 44,1        16          2          1,346       10
HQ Speech          44,1        16          1          0,673       5,0
PC-Quality         22,05       16          2          0,673       5,0
Low-End-PC         22,05       8           2          0,336       2,5
LQ-Stereo          11          8           2          0,168       1,3
LQ Speech          11          8           1          0,084       0,6

Table 3.   Throughput and storage requirements for audio data.
IV.2.1.3 Fast and simple lossless encoding
Due to such huge storage requirements, compression should be exploited in order to store the captured volumes of data efficiently. However, the throughput requirement must not be neglected. Thus, the trade-off between compression efficiency and algorithm speed must be considered.
In general, a slower algorithm generates smaller output than a faster one, due to the more sophisticated compression schemes of slower algorithms –the algorithm is referred to here and not the issue of a good or bad implementation– including steps such as motion estimation and prediction, scene-change detection, or the decision about a frame type or a macro-block type. The slow and complex algorithms unfortunately cannot be used in the capture process, because their motion estimation and prediction consume too many resources to permit recording of visual signals as provided by high-quality, very high-quality or high-speed digital cameras. Besides, some slow algorithms are not even able to process videos from HDTV cameras in real-time. What is more, the implementations of the compression algorithms are usually still best-effort implementations dedicated to non-real-time systems, and as such they just work in real time (or rather "on time"), i.e. they produce the compressed data fast enough by processing a sufficient number of frames or samples per second on time using expensive hardware. Secondly, there is very little or no control over the process of on-line encoding. On the other hand, and in spite of the expectations, some of the implementations are not even able to handle huge amounts of data at high resolution and high frame rate, or at high sample rate with multiple streams, in real time even on modern hardware.
So, the only solution is to capture the input media signals with a fast and simple compression scheme, which does not contain complicated functions, is easy to control and at least delivers the data on time (ideally running on a real-time system). A very good candidate for such a compression of video data is the HuffYUV algorithm proposed by [Roudiak-Gould, 2006]. It is a lossless video codec employing a selectable prediction scheme and Huffman entropy coding. It predicts each sample separately, and the entropy coding encodes the resulting error signal by selecting the most appropriate predefined Huffman table for a given channel (it is also possible to specify an own external table). There are three possible prediction schemes: left, gradient and median. The left prediction scheme predicts from the previous sample of the same channel and is the fastest one, but in general delivers the worst results. The gradient method predicts from a calculation over three values: it adds Left, adds Above and subtracts AboveLeft; it is a good trade-off between speed and compression efficiency. The median scheme predicts the median of three values: Left, Above and the gradient predictor; it delivers maximal compression, however it is the slowest one (but still much faster than complex algorithms with motion prediction, e.g. DCT-based codecs [ITU-T Rec. T.81, 1992]). The HuffYUV codec supports the YUV 4:2:2 (16 bpp), RGB (24 bpp28), and RGBA (RGB with alpha channel; 32 bpp) color schemes. With respect to the YUV color scheme, the UYVY and YUY2 ordering methods are allowed. As the author claims, a computer "with a modern processor and a modern IDE hard drive should be able to capture CCIR 601 video at maximal compression (…) without problems" [Roudiak-Gould, 2006]29. A known limitation of HuffYUV is that the resolution must be a multiple of four.
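The three prediction schemes can be sketched as follows (an illustration of the prediction step only, not of the complete HuffYUV codec; the third input of the median scheme is assumed here to be the gradient value Left + Above - AboveLeft, and 8-bit samples with modulo-256 arithmetic are assumed):

    #include <stdint.h>

    /* Prediction schemes applied per sample and per channel; the entropy
     * coder then encodes the prediction error (cur - pred) with Huffman
     * tables.                                                              */

    static uint8_t predict_left(uint8_t left)
    {
        return left;                              /* previous sample, fastest */
    }

    static uint8_t predict_gradient(uint8_t left, uint8_t above,
                                    uint8_t aboveleft)
    {
        int p = (int)left + (int)above - (int)aboveleft;  /* planar gradient */
        return (uint8_t)p;                        /* wraps modulo 256        */
    }

    static uint8_t median3(int a, int b, int c)
    {
        if (a > b) { int t = a; a = b; b = t; }   /* ensure a <= b           */
        if (c < a) return (uint8_t)a;
        if (c > b) return (uint8_t)b;
        return (uint8_t)c;
    }

    static uint8_t predict_median(uint8_t left, uint8_t above,
                                  uint8_t aboveleft)
    {
        int grad = (int)left + (int)above - (int)aboveleft;
        return median3(left, above, grad);        /* best compression, slowest */
    }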
28 bpp stands for bits per pixel.
29 Some comparison tests done within the RETAVIC project are available. For details, the reader is referred to [Militzer et al., 2005].
According to [Suchomski et al., 2006], from among the lossless audio formats available on the market, WavPack [WWW_WP, 2006] and the Free Lossless Audio Codec (FLAC) [WWW_FLAC, 2006] are good candidates for use in the real-time capturing phase. Figure 5 shows a comparison of the decoding speed and compression size of some lossless audio codecs, measured on the EBU Sound Quality Assessment Material (SQAM) [WWW_MPEG SQAM, 2006] and on a private set of music samples (Music) [WWW_Retavic - Audio Set, 2006].
[Figure 5 – scatter plot (not reproduced here): decoding speed (multiples of real-time, logarithmic scale) versus compression rate (25%–70%) for WavPack, Monkey's Audio, FLAC, OptimFrog and MPEG-4 SLS (SLS 64, SLS 256, SLS no core), measured on the SQAM and Music test sets.]

Figure 5.   Comparison of compression size and decoding speed of lossless audio codecs [Suchomski et al., 2006].
It has been observed in [Suchomski et al., 2006] that the codecs achieve different compression rates depending on the content of the input samples, as well as different coding speeds. The best results in terms of speed are achieved by FLAC, i.e. it compresses about 150 to 160 times faster (on the used test bed) than required for real-time compression. In contrast, WavPack is only about 100 times faster. In terms of compression rate, however, FLAC does not achieve the best results. FLAC is very feature-rich: it supports bit depths from 4 to 32 bits per sample, sampling rates up to 192 kHz and currently up to 8 audio channels. There is no kind of scalability feature in FLAC. As the source code of the FLAC libraries is licensed under the BSD license, it would be possible to make modifications without licensing problems. Analogous behavior with respect to the encoding speeds may also be observed, but further investigation should be conducted.
IV.2.1.4 Media buffer as temporary storage
According to the extreme throughput and storage requirements presented in the previous subsection, a solution for storing the audio and video data in real-time must be provided. It would be possible to store them directly in the MMDBMS; however, this direct integration is not further investigated with respect to the RETAVIC architecture. One of the arguments for keeping a separate media buffer is the non-real-time preparation phase (Phase 2), which does not allow any real-time processing and comes after Phase 1 and before storing the media data in the MMDBMS (Phase 3).
Thus, the solution proposed to handle the storing process defines a media buffer as the intermediate temporary storage of the captured media data associated with the real-time capturing phase. This prevents losing the data and keeps it for further complex analysis and preparation in the following Phase 2. Of course, its content is only temporary and is to be removed after the execution of the required steps in Phase 2.
Different hardware storage solutions and offered throughput
The media buffer should have a few characteristics derived from the capturing requirements, namely: real-time processing, appropriate size with respect to the application, and enough bandwidth according to the properties of the grabbed media data (allowing the required throughput). A few hardware solutions are compared in Table 4. Caches on different levels (L1, L2, L3) are omitted from the MEM class due to their small sizes, which make them inapplicable as the media buffer.
The MEM class is non-permanent memory – the most temporary storage of all four classes given in Table 4, since its content is erased after powering off. It is divided into: simple RAM, dual-channel RAM and NUMA. Simple RAM (random access memory) is delivered in hardware as single in-line memory modules (SIMMs) or dual in-line memory modules (DIMMs). The best-known representatives of SIMMs are DRAM SIMM (30)30 and FPM31 SIMM (72). There are many versions of DIMMs available: DDR2 SDRAM (240), DDR32
30 The number of pins present on the module is given in brackets.
31 FPM – Fast Page Mode (DRAM), optimized for successive reads and writes (bursts of data).
32 DDR – Double Data Rate (of the front-side bus), achieved by transferring data on both the rising and falling edges of the clock.
SDRAM33 (184), (D)RDRAM34 (184), FPM (168), EDO35 (168), and SDRAM (168). There are also versions for notebooks, called SODIMMs, which include: DDR SDRAM (200 pins), FPM (144), EDO (144), SDRAM (144), FPM (72), and EDO (72). Only DDR2 SDRAM is referred to in Table 4, because these modules are the fastest of all those listed above.
Class  Type                      Example                            Peak Bandwidth [MB/s]
MEM    Simple RAM                DIMM DDR2 64bit/800MHz             6104
MEM    DUAL-CHANNEL RAM          2x DIMM DDR2 64bit/800MHz          12207
MEM    NUMA                      HP Integrity SuperDome             256000
NAS    ETHERNET                  10GbE                              1250
NAS    CHANNEL BONDED ETHERNET   2x 10GbE                           2500
NAS    MYRINET                   4th Generation Myrinet             1250
NAS    INFINIBAND                12x Link                           3750
SAN    SCSI over Fiber Channel   FC SCSI                            500
SAN    SATA over Fiber Channel   FC SATA                            500
SAN    iSCSI                     10GbE iSCSI                        1250
SAN    AoE                       10GbE AoE                          1250
DAS    RAID SAS                  16x SAS Drives Array               300
DAS    RAID SATA                 16x SATA Drives Array              150
DAS    RAID SCSI                 16x Ultra-320 SCSI Drives Array    320
DAS    RAID ATA/ATAPI            16x ATAPI UDMA 6 Drives Array      133

Table 4.   Hardware solutions for the media buffer.36
Dual-channel RAM is a technology for bonding two memory modules into one logical unit. A present-day motherboard chipset may support two independent memory controllers allowing simultaneous access to the RAM modules; thus, one 64-bit channel may be defined for upstream data while the other carries downstream data (IN and OUT), or both channels are bonded into one 128-bit channel (either IN or OUT).
33 SDRAM – Synchronous Dynamic RAM, i.e. it is synchronized with the CPU speed (no wait states are present).
34 (D)RDRAM – (Direct) Rambus DRAM is a type of SDRAM proposed by Rambus Corp., but its latency is higher in comparison to DDR/DDR2 SDRAM. Moreover, the heat output is higher, requiring special metal heat spreaders.
35 EDO – Extended Data Out (DRAM), about 5% faster than FPM, because a new cycle starts while data output from the previous cycle is still in progress (overlapping of operations, i.e. pipelining).
36 This table includes only raw values of the specified hardware solutions, i.e. no protocol, communication or other overheads are considered. In order to take them into account, measurements of the user-level or application-level performance must be conducted.
NUMA is Non-Uniform Memory Access (sometimes Architecture), defined as a computer memory design with hierarchical access (i.e. the access time depends on the memory location relative to a processor) used in multiprocessor systems. Thus, a CPU can access its own local memory faster than non-local (remote) memory (which is memory local to another processor, or shared memory). ccNUMA is cache-coherent NUMA, in which keeping the caches of the same memory regions consistent is critical; usually this is solved by inter-process communication (IPC) between the cache controllers. As a result, ccNUMA performs badly for applications accessing the memory in rapid succession. A huge advantage of NUMA and ccNUMA is the much bigger memory space available to the application while still providing very high bandwidth, known as the scalability advantage over symmetric multiprocessors (SMPs – where it is extremely hard to scale beyond 8-12 CPUs). In other words, it is a good trade-off between Shared-Memory MIMD37 and Distributed-Memory MIMD.
The differences between the next classes of storage, namely NAS, SAN and DAS, are depicted in Figure 6, which comes directly from the well-known Auspex Storage Architecture Guide38 [Auspex, 2000].
Figure 6.   Location of the network determines the storage model [Auspex, 2000].
37 MIMD – Multiple Instructions Multiple Data (in contrast to SISD, MISD, or SIMD).
38 Auspex is commonly recognized as the first true NAS company, active before the term NAS had been defined.
The Network Attached Storage (NAS) class is a network-based storage solution and, in general, it is provided by a network file system. The network infrastructure is used as a base layer for a storage solution provided by clusters (e.g. Parallel Virtual File System [Carns et al., 2000], Oracle Clustered File System V2 [Fasheh, 2006], Cluster NFS [Warnes, 2000]) or by dedicated file servers (e.g. software-based SMB/CIFS or NFS servers, Network Appliance Filer [Suchomski, 2001], EMC Celerra, Auspex Systems). NAS may be based on different network types regardless of the type of the physical connection (glass fiber or copper), like: Ethernet, Channel Bonded Ethernet, Myrinet or Infiniband. Thus the bandwidth presented in Table 4 is given for each of the mentioned network types without consideration of the overhead introduced by the file sharing protocol of specific network file systems39. Ethernet is the hardware base of the Internet and its standardized bandwidth ranges from 10Mbps, through 100Mbps and 1Gbps (1GbE), up to the newest 10Gbps (10GbE). Channel Bonded Ethernet is analogous to dual-channel RAM but refers to two NICs coupled logically into one channel. Myrinet was the first network allowing bandwidth up to 1Gbps. The current fourth-generation Myrinet supports 10Gbps; in general, it has two fiber-optic connectors (upstream/downstream). Myricom Corp. offers NICs which support both 10GbE and 10Gbps Myrinet. Infiniband uses two types of connectors: the Host Channel Adapter (HCA), used mainly for IPC, and the Target Channel Adapter (TCA), used mainly in I/O subsystems. Myrinet and Infiniband are low-latency and low-protocol-overhead networks in comparison to Ethernet.
SAN stands for Storage Area Network and is a hardware solution delivering high I/O bandwidth. It is more often referred to as block I/O services (block-based storage, in analogy to SCSI/EIDE) rather than file access services (a higher abstraction level). It usually uses one of four types of network for block-level I/O: SCSI over fiber channel, SATA over fiber channel, iSCSI, or the recently proposed ATA over Ethernet (AoE) [Cashin, 2005]. The SCSI40/SATA41 over fiber
39 A network file system is considered to be on a higher logical level of the ISO/OSI network model, so in order to measure its efficiency, an evaluation on the user or application level must be carried out, considering the bandwidth of the underlying network layer.
40 SCSI is the Small Computer System Interface and is the most commonly used 8- or 16-bit parallel interface for connecting storage devices in servers. It ranges from SCSI (5MB/s) up to Ultra-320 SCSI (320MB/s). The newer SAS (Serial Attached SCSI) is also available, being much more flexible (hot-swapping, improved fault tolerance, higher number of devices) and still achieving 300MB/s.
41 SATA stands for Serial Advanced Technology Attachment (in contrast to the original ATA, which was parallel and is now referred to as PATA) and is used to connect devices with a bandwidth of up to 1,5Gbps (but due to the 8/10-bit encoding on the physical layer it achieves only 1,2Gbps). The new version (SATA-2) doubles the bandwidth (respectively 3Gbps/2,4Gbps), and a third version, still under development, should allow a four-times higher bandwidth (6Gbps/4,8Gbps).
channel usually provides a bandwidth of 1Gbps, 2Gbps or 4Gbps, which requires the use of a special Host Bus Adapter (HBA). iSCSI is Internet SCSI (or SCSI over IP), which may use a standard Ethernet NIC (e.g. 10GbE) as the medium. AoE is a network protocol designed for accessing ATA42 storage devices over an Ethernet network, thus enabling cheap SANs over low-cost, standard technologies. AoE simply puts ATA commands into low-level network packets (replacing the ATA ribbon – the wide 40- or 80-line cable – by an Ethernet cable). The only drawback (but also a design goal) is that it is not routable between LANs (and thus very simple).
Direct Attached Storage (DAS), or simply local storage43, comprises devices (disk drives, disk arrays, RAID arrays) attached directly to the computer by one of the standardized interfaces like ATA/ATAPI, SCSI, SATA or SAS. Only the fastest representatives of each of the mentioned standards are given in Table 4. Each hardware implementation of a given standard has its limits, contrary to SATA/SCSI over fiber channel, iSCSI or AoE, where just the protocols are implemented and a different physical layer is used as the carrier, providing higher transfer speed.
Evaluation of storage solutions in context of RETAVIC
From the RETAVIC perspective, each solution has advantages and disadvantages. Memory is very limited in size unless the very expensive NUMA technology is used, which allows for scalability. Besides, its non-permanent characteristic demands that the grabbing hardware stays on-line until all captured data is processed by Phase 2. An advantage is definitely the ease of implementing real-time support (it might be supported directly in the RTOS kernel without special drivers), and it offers extremely high bandwidth – some applications, like capturing with high-speed cameras, may only be realized by using memory directly (e.g. the Mikrotron MC1311 requires at least 6,25Gbps bandwidth without overhead). DAS is limited in scalability
42 ATA is also commonly referred to as Integrated Drive Electronics (IDE), Enhanced IDE (EIDE), ATA Packet Interface (ATAPI), Programmed Input/Output (PIO), Direct Memory Access (DMA), or Ultra DMA (UDMA), which is wrong, because ATA is a standard interface for connecting devices and IDE/EIDE only uses ATA, while PIO, DMA and UDMA are different data access methods. ATAPI is an extension of ATA and has superseded it. ATA/ATAPI ranges from 2,1MB/s up to 133MB/s (with UDMA 6).
43 DAS is also called captive storage or server-attached storage.
Chapter 3 – Design
IV. System Design
(comparing to NAS or SAN), however the real-time support could be provided with only some
efforts (a block-access driver supporting the real-time kernel must be implemented usually for
each specific solution). Moreover, it has a good price-to-benefits rate. SAN is scalable and
reliable solution, which was very expensive up to now, however with introduction of AoE, it
seems to be achieving prices of DAS. But anyway, the real-time support requires sophisticated
implementation within the RTOS considering the network driver coupled with block-access
method (additionally network communication between server and the NAS device must be realtime capable). NAS is relatively cheap and very scalable solution; however demands even more
sophisticated implementation to provide real-time capabilities because none of the existing
network file sharing systems provides real-time and QoS control (it’s due to the missing realtime support in the network layer being used as a base layer for the network file system).
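For a given capturing scenario, the choice can be narrowed down by a trivial check against the peak figures of Table 4 (a back-of-the-envelope comparison only; as noted in the table's footnote, the peak values ignore protocol and real-time overhead, so the real headroom has to be measured at the application level):

    #include <stdio.h>

    /* Trivial feasibility check: does a storage option's peak bandwidth cover
     * the raw camera throughput? Peak values taken from Table 4 (MB/s).      */
    int main(void)
    {
        const struct { const char *name; double peak_mb_s; } option[] = {
            { "DAS: 16x SATA RAID",        150   },
            { "NAS: 10GbE",                1250  },
            { "MEM: dual-channel DDR2",    12207 },
        };
        double camera_mbps = 6250;                 /* e.g. Mikrotron MC1311   */
        double need_mb_s   = camera_mbps / 8;      /* bits -> bytes           */

        for (unsigned i = 0; i < sizeof option / sizeof option[0]; i++)
            printf("%-28s %s\n", option[i].name,
                   option[i].peak_mb_s >= need_mb_s ? "sufficient (peak only)"
                                                    : "insufficient");
        return 0;
    }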
Due to such a variety of cost-versus-efficiency characteristics, the final decision on using one of the given solutions is left to the system designer exploiting the RETAVIC architecture. For the application within this project, NAS has been chosen as the most suitable for testing purposes (limited size / high speed / ease of use).
IV.2.2. Non-real-time Preparation
Phase 2 is responsible for the insertion and update of data in the RETAVIC architecture, i.e. the actual step of putting the media objects into the media server is performed in this non-real-time preparation part. Its main goal is conversion to a particular internal format, i.e. to convert the input media from the source storage format into the internal storage format most suitable for real-time conversion. Secondly, it performs a content analysis and stores the meta-data. Additionally, it is able to archive the origin source, i.e. to keep the input media data in the origin format.
Phase 2 does not require any real-time processing and thus may apply best-effort implementations of the format conversion and complex analysis algorithms. The drawback of the missing real-time requirement is that a potential implementation of the architecture cannot guarantee the time needed for conducting the insert or update operations, i.e. it can only operate in a best-effort manner with respect to the controlling mechanism (e.g. as fast as possible according to the selected thread/process/operation priority). Moreover, the transactional behavior
of the insert and update operations must be provided regardless of whether the real-time properties are considered critical or not.
IV.2.2.1 Archiving origin source
The archiving-origin-source module is optional and does not have to exist in every scenario; however, it is required in order to cover all possible applications. This module was introduced in the last version of the proposed RETAVIC architecture (Figure 4) due to two requirements: exact bit-by-bit preservation of the origin stream, and even more flexible application in the real world, considering simpler delivery with smaller costs in certain use cases (this also required changes in Phase 4). Moreover, in cases where meta-data is incorporated in the source encoded bitstream, the meta-information could be dropped by the decoding process. However, it is now preserved by keeping all the origin bits and bytes whenever required.
The first goal is achieved since the source media bit stream is simply stored completely as the origin binary stream, regardless of the used format, in analogy to the well-known binary large objects (BLOBs). The problem noticed here is the varying lossiness of the decoding process, i.e. the amount of noise in the decoded raw representation always varies due to the different decoder implementations (e.g. having dissimilar rounding functions), even if the considered decoders operate on the same encoded input bitstream. This problem was neglected in earlier versions of the RETAVIC architecture due to the assumption that media data were considered as the source only after being decoded from the lossy format by a lossy decoder. Now, however, this problematic aspect is additionally handled by preserving a bit-by-bit copy of the source (encoded) data. Of course, using only BLOBs without any additional (meta) information is impossible, but in the context of the proposed architecture it makes sense because of the existing relationship between the BLOB and its well-understandable representation in the scalable internal format.
The second goal of achieving higher application flexibility is targeted by giving the user an opportunity to access the source bitstream in the origin format. If the proposed archiving-origin-source module were not present, the origin format would have to be produced in the real-time transcoding phase just like every other requested format. By introducing the module, however, the process may be simplified and the transcoding phase may be skipped. On the other hand, the probability of a user requesting a format exactly the same as the source format is extremely small. Thus it may be useful only in applications where the requirement of preserving the bitwise origin source is explicitly stated.
IV.2.2.2 Conversion to internal format
Here the integration of media objects is achieved by importing different formats, decoding
them to a raw format (uncompressed representation), and then encoding them to the internal format.
This integration is depicted as the conversion module in the middle of Phase 2 in Figure 4. The
media source can either be a lossless binary stream from the media buffer of Phase 1, described
in the previous section IV.2.1, or an origin media file (also called the origin media source)
delivered to the MMDBMS in digital (usually compressed) form from the outside
environment.
If the media data come from the media buffer, the decoding is fairly simple because the format
of the lossless binary stream grabbed by Phase 1 is known exactly. Thus, the decoding is just the
inverse of the previously defined fast and simple lossless encoding (3rd
subsection of IV.2.1 Real-time Capturing) and is similarly easy to handle.
If the media data come from the outside environment as a media file, the decoding is more complex
than in the previous case, because it requires an additional step of detecting the storage format with
all necessary properties (whenever the format with the required parameters was not specified
explicitly), and then the correct decoder has to be chosen for the detected/given digital
representation of the media data. Finally, the decoding is executed. Problems may appear
if the format can be decoded by more than one available decoder; in this case, a selection
scheme must be applied. Some methods have already been proposed, e.g. by Microsoft
DirectShow or the Java Media Framework. Both schemes allow for manual or automatic selection
of the used decoders44.
44 A semi-automatic method would also be possible, where the media application decides on its own if there is just one possible decoder and asks the user for a decision if there is more than one. However, the application must handle this additional use case.
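A minimal sketch of such a selection scheme is given below; the registry, the priority field and all names are assumptions for illustration and are not taken from the DirectShow or JMF APIs.

from typing import Optional

# detected format -> list of (decoder name, priority); higher priority preferred
DECODER_REGISTRY = {
    "MPEG-4 P.2": [("xvid", 10), ("ffmpeg-mpeg4", 5)],
    "MPEG-2 P.2": [("libavcodec-mpeg2", 10)],
}

def select_decoder(detected_format: str, manual_choice: Optional[str] = None) -> str:
    candidates = DECODER_REGISTRY.get(detected_format, [])
    if not candidates:
        raise ValueError(f"no decoder registered for {detected_format}")
    if manual_choice is not None:                 # manual selection by the user
        for name, _ in candidates:
            if name == manual_choice:
                return name
        raise ValueError(f"{manual_choice} cannot decode {detected_format}")
    # automatic selection: the highest-priority decoder wins
    return max(candidates, key=lambda c: c[1])[0]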
Secondly, this part employs an encoding algorithm to produce a lossless scalable binary stream
(according to the next section IV.2.3) and decoder-related control meta-data for each continuous
medium separately. A detailed explanation of each media-specific format and its encoder is
given in the following chapters: for video in V and for audio in VI. One important characteristic
of employing a media-specific encoder is the possibility to exchange (or update) the
internal storage format for each medium at will, simply by switching the current
encoding algorithm to a new one within the Conversion to internal format module. Such an
exchange should not influence the decoding part of this module, but it must be done in
coordination with Phase 3 and Phase 4 of the RETAVIC architecture (some additional
information on how to do this is given in section IV.3.2).
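This exchangeability rests on the encoder being hidden behind a stable interface of the conversion module. A minimal sketch of such an interface is shown below; the class and method names are illustrative assumptions and not the prototype's actual API.

from abc import ABC, abstractmethod

class InternalFormatEncoder(ABC):
    """Illustrative interface for the encoding part of the Conversion to internal
    format module; replacing the internal storage format means plugging in
    another implementation of this interface."""

    @abstractmethod
    def encode(self, raw_frames) -> bytes:
        """Encode raw (uncompressed) media quanta into the lossless,
        scalable internal bitstream."""

    @abstractmethod
    def control_metadata(self) -> dict:
        """Return decoder-related control meta-data gathered while encoding
        (stored in Phase 3 and used by Phase 4)."""

class LLV1Encoder(InternalFormatEncoder):
    def encode(self, raw_frames) -> bytes:
        ...  # layered lossless video coding would go here
    def control_metadata(self) -> dict:
        ...  # e.g. per-frame layer sizes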
IV.2.2.3 Content analysis
The last but not least important aspect of feeding media data into the media server is the
content analysis of these data. This step looks into a media object (MO) and its content and,
based on that, produces the meta-data (described in the next section) which are required later in the
real-time transcoding part (section IV.2.4). The content analysis produces only a certain set of
meta-data. The set is specific to each media type, and the proposed MD sets are presented later;
however, these MD sets are neither finished nor closed, so let us call them initial MD sets. An MD
set may be extended, but due to the close relation between the content analysis and the produced
meta-data set, the content analysis has to be extended in parallel.
Since the format conversion into the internal storage format and the content analysis of the
input data need to be performed just once upon import of a new media file into the system, the
resource consumption of the non-real-time preparation phase is not critical for the behavior of
the media server. As a result, the mentioned operations are designed as best-effort non-real-time
processes that can be run at the lowest priority in order not to negatively influence the
resource-critical real-time transcoding phase.
IV.2.3. Storage
Phase 3 differs from all other phases in one point, namely that it is not a processing phase, i.e. no
processing of media data or meta-data is performed here. It would be possible to employ some
querying or information retrieval here, but this was clearly defined as out of the scope of the
RETAVIC architecture. Moreover, the access methods and I/O operations are also not
research points to cope with here. It is simply taken for granted that an established
storage solution is provided, analogous to one of the usable storage methods like SAN, NAS or
DAS – these methods have already been described in detail in the subsection Media buffer as
temporal storage (of section IV.2.1).
Moreover, the set of storage methods may be extended by considering higher-level
representations of data: e.g. instead of talking about file- or block-oriented methods, it may be
possible to store the media data in another multimedia database management system or on a
media server addressed by a unique resource identifier (URI), or anything similar.
However, the chosen solution also has to offer well-controllable real-time media data access.
Thus, the RETAVIC architecture does not limit the storage method beyond requiring real-time
support for media access (similar to the real-time capturing), i.e. real-time I/O operations
must be provided, e.g. write/store/put/insert&update and read/load/get/select. This real-time
requirement may be hard to implement from the hardware or operating system perspective
[Reuther et al., 2006], but again it is the task of other projects to solve the problem of
real-time access to storage facilities. A few examples of research on real-time access
with QoS support can be found in [Löser et al., 2001a; Reuther and Pohlack, 2003].
IV.2.3.1 Lossless scalable binary stream
All media objects are to be stored within the RETAVIC architecture as lossless scalable streams to
provide application flexibility as well as efficiency and scalability of access and processing. There are
several reasons for storing the media data in such a way.
First, there are applications that require lossless storage of audio and video recordings.
They would not be supported if the architecture assumed a lossy internal format from the
beginning in the design postulation. In contrast, undemanding applications, which do not
require lossless preservation of information, can still benefit from the losslessly stored data
through the extraction of just a subset of the data or through lossy transcoding. Thus, the lossless
requirement for the internal storage format allows covering all application fields and brings
application flexibility to the architecture. From another perspective, every DBMS is obliged
to preserve the information as it was stored by the client and to deliver it back without
any changes. If a lossy internal format were used, such as the FGS-based video codecs described in
section III.1.1.1, information would be lost during the encoding process. Moreover, the
introduced noise would grow with every update due to the well-known tandem coding
problem [Wylie, 1994] of lossy encoders, even if the decoding did not introduce any additional
losses45.
Considering application flexibility and information preservation on the one hand and
efficiency and scalability of access and processing on the other hand, the only possible solution
is to make a lossless storage format scalable. Lossless formats have been designed for their
lossless properties and are thus not as efficient in compression as lossy formats, which usually
exploit perceptual models of the human being46. Moreover, lossless codecs are usually
unscalable, e.g. those given in section III.1.1.2, and process all or nothing, i.e. the origin data is
stored in one coded file, which requires that the system reads the stored data completely and
then also decodes it completely; there is no way to access and decode just a subset of the data.
Because the compressed size delivered by lossless codecs usually ranges from 30% to 70% of the
original due to the requirement of information preservation, a relatively big amount of data
always has to be read, which is not required in all cases (lossless information is not always
needed).
By introducing scalability (e.g. by data layering) into the storage format, scalability of access
is provided, i.e. the ability to access and read just a subset of the compressed data (e.g. just one
layer). It also allows scalability of processing, since handling this smaller set of input data is
usually faster than dealing with the complete data set. As a side effect, the scalability of the
binary storage format also provides lossy information (by delivering a lower quality of the
information) at a smaller compressed size; for example, just one tenth of the compressed data
may be delivered to the user. Examples of scalable and lossless coding schemes usable for video
and audio data are discussed in the Related Work in sections III.1.1.3 and III.1.2.

45 The tandem coding problem will appear if the data is first selected from the MMDBMS and decoded, and next encoded and updated in the MMDBMS. Of course, if the update occurs without selecting and decoding the media data from the MMDBMS, but with data obtained from a different source, the tandem coding problem no longer applies.

46 Anyway, a comparison of lossless to lossy encoders does not really make sense, because the lossless algorithms will always lose with respect to compression efficiency.
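To make the access-scalability argument concrete, the following sketch shows how a reader could fetch only the base layer plus a requested number of enhancement layers instead of the whole lossless stream; the layer index structure and the file name are assumptions for illustration, not the actual LLV1 bitstream syntax.

def read_layers(path: str, layer_index, requested_layers: int) -> list:
    """layer_index: list of (offset, length) pairs per layer, base layer first.
    Reads only the base layer and the requested number of enhancement layers."""
    selected = layer_index[: 1 + requested_layers]       # base + enhancement layers
    chunks = []
    with open(path, "rb") as f:
        for offset, length in selected:
            f.seek(offset)
            chunks.append(f.read(length))                # read just this subset
    return chunks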
IV.2.3.2 Meta data
Multimedia transcoding, and especially audio and video processing, is a very complex and
unpredictable task. The algorithms cannot be controlled during execution with respect to the
amount of processed data and the time constraints, because their behavior depends
on the content of the audio and video, i.e. the functional properties are defined with respect to
coding/compression efficiency as well as to quality according to the human perception system – the
human hearing system (HHS) for audio information [Kahrs and Brandenburg, 1998] and the human
visual system (HVS) for video perception, respectively [Bovik, 2005]. Due to the complexity and
unpredictability of the algorithms, the idea of using meta-data (MD) to ease the transcoding process
and to make the execution controllable is a core solution of the RETAVIC project.
As already mentioned, the MD are generated during the non-real-time preparation (Phase 2) by
the process called content analysis. The content analysis looks into a media object (MO) and its
content and, based on that, produces two types of MD: static and continuous. The two MD types
have different purposes in the generic media transformation framework.
The static MD describe the MO as it is stored and hold information about the structure of the
video and audio binary streams. Static MD thus keep statistical and aggregated data allowing
accurate prediction of the resource allocation for the media transformation in real time. They
must therefore be available before the real transcoding starts. However, static MD are not required
anymore during the transcoding process, so they may be stored separately from the MO.
The continuous MD are time-dependent (like the video and audio data themselves) and are to be
stored together with the media bit stream in order to guarantee real-time delivery of the data. The
continuous MD are meant to help the real-time encoding process by feeding it with
pre-calculated values prepared by the content analysis step. In other words, they are required in
order to reduce the complexity and unpredictability of the real-time encoding process (as
explained in the subsection Real-time transcoding of Section IV.2.4).
A noticeable fact is that the size of the static MD is very small in comparison to the continuous
MD. Moreover, the static MD are sometimes referred to as coarse-granularity data due to their
aggregative properties (e.g. the sum of the existing I-frames, the total number of audio samples);
respectively, the continuous MD are called fine-granularity data because of their close
relation to each quant (or even to a part of a quant) of the MO.
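A minimal sketch of the two MD kinds is given below; the fields are illustrative only, while the actual video-specific attributes are defined in section V.3.

from dataclasses import dataclass, field

@dataclass
class StaticMD:
    """Coarse-granularity, per-stream aggregates used for resource allocation
    before transcoding starts (illustrative fields)."""
    moid: int
    i_frame_sum: int = 0
    p_frame_sum: int = 0
    b_frame_sum: int = 0

@dataclass
class ContinuousMDRecord:
    """Fine-granularity, time-dependent record stored alongside the bitstream
    and consumed quant-by-quant by the real-time encoder (illustrative fields)."""
    frame_no: int
    frame_type: str              # 'I', 'P' or 'B'
    motion_vectors: list = field(default_factory=list)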
IV.2.4. Real-time Delivery
The last but not least phase is Phase 4. The real-time delivery is divided into two processing
channels. The first one is real-time transcoding, which is the key task in achieving format
independence in the RETAVIC architecture. The second one is an extension of the earlier
proposed version of the architecture and provides real-time direct delivery of the stored media
objects to the client application. Both processing channels are further described in the following
subsections.
IV.2.4.1 Real-time transcoding
This part finally provides format independence of the stored media data to the client application. It
employs media transcoding that meets real-time requirements. The processing idea is derived
from the converter graph, which extends the conversion chains and is an abstraction of
analogous technologies like DirectX/DirectShow, the processors with controls of the Java Media
Framework (JMF), media filters, and transformers (for details see II.4.1 Converters and Converter
Graphs). However, the processing in the RETAVIC architecture is fed with additional
processing information, or in other words it is based on additional meta-data (discussed in Meta
data in section IV.2.3). The MD are used for controlling the real-time transcoding process and
for reducing its complexity and unpredictability.
The real-time transcoding is divided into three subsequent tasks, using different classes of
converters, namely: real-time decoding, real-time processing, and real-time encoding. The tasks
representing the converters use pipelining to process media objects, i.e. they pass so-called
media quanta [Meyer-Wegener, 2003; Suchomski et al., 2004] between the consecutive
converters.
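The pipelining of media quanta between the three converter classes can be pictured as in the sketch below; the generator-based interface and the callables standing in for the converters are illustrative assumptions, not the prototype's converter API.

def pipeline(quanta, decoder, operations, encoder):
    """Pass each media quantum through decoding, optional processing and encoding."""
    for quantum in quanta:
        raw = decoder(quantum)          # real-time decoding (MD-controlled)
        for op in operations:           # real-time processing, e.g. resize, volume
            raw = op(raw)
        yield encoder(raw)              # real-time encoding into the requested format

# Usage sketch with identity converters, just to show the data flow:
if __name__ == "__main__":
    encoded = list(pipeline(range(3), decoder=lambda q: q,
                            operations=[lambda q: q], encoder=lambda q: q))
    print(encoded)   # -> [0, 1, 2]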
Real-time decoding
First, the stored bitstreams (this applies to both audio and video data) must be read and
decoded, resulting in the raw intermediate format, i.e. uncompressed media-specific data, e.g. a
sequence of frames47 with pixel values of luminance and chrominance48 for video, or a
sequence of audio samples49 for audio. The layered internal storage formats allow reading the
binary media data selectively, so the system is used efficiently, because only the data
required for decoding in the requested quality are read. The reading operation must of
course support real-time capabilities in this case (as mentioned in the Storage section
IV.2.3). Then the meta-data necessary for the decoding process are read accordingly; depending
on the MD type, real-time capabilities are required or not. Next, the media-specific decoding
algorithms are executed. The algorithms are designed in such a way that they are scalable not
only in data consumption, due to the layered internal formats, but also in processing, i.e. a bigger
amount of data needs more processing power. Obviously, the trade-off between the amount of
data, the amount of required computation, and the data quality must be considered in the
implementation.
Real-time processing
Next, the media-specific data may optionally be processed, i.e. some media conversion
[Suchomski et al., 2004] may be applied. This step covers converting operations on media
streams with respect to real time. These conversions may be grouped into a few classes: quanta
modification, quanta-set modification, quanta rearrangement and multi-stream operation. This grouping
applies to both discussed media types, audio as well as video. The first group, quanta modification,
refers to direct operations conducted on the content of every quant of a media stream. Examples
of video quanta modification are: color conversion, resize (scale up / scale down), blurring,
sharpening, and other video effects. Examples of audio quanta modification are: volume
operations, high- and low-pass filters, and re-quantization (changing the bit resolution per sample,
e.g. from 24 to 16 bits, gives a smaller set of possible values). Quanta-set modification considers
actions on the set of quanta, i.e. the number of quanta in the stream is changed. Examples are the
conversion of the rate of video frames (known as frames per second – fps), in which 25 fps are
transformed to 30 fps (frame-rate upscale) or 50 fps may be halved to 25 fps (frame-rate
downscale by frame dropping), or analogously in audio, where the sample rate may be changed
(e.g. from the studio sample frequency of 96 kHz to the CD standard of 44.1 kHz) by downscaling
or by simply dropping samples. The third category changes neither the quanta themselves nor the
set of involved quanta, but the frame sequence with respect to time. Here time operations like
double- or triple-speed play, slow (stop) play, as well as frame reordering are involved. The fourth
proposed group covers operations on many streams at the same time, i.e. there are always several
streams involved at the input or output of the converter. The most suitable representatives are
mixing (only of the same type of media) and multiplexing (providing an exact copy of the stream,
e.g. for two outputs). Other examples cover the mux operation (merging different types of media),
stream splitting (contrary to mux), and re-synchronization.

47 See term video stream in Appendix A.

48 It may be another uncompressed representation with separate values for the red, green and blue (RGB) colors. By default, the luminance and two chrominance values (red, blue) are assumed. Other modes are not forbidden by the architecture.

49 See term audio stream in Appendix A.
In general, the operations mentioned above are linear operations and do not depend on the
content of the media, i.e. it does not matter how the pixel values are distributed in the frame or
whether the variation of sample values is high. However, the operations do depend on the structure
of the raw intermediate format. For example, the number of input pixels, calculated from width and
height, together with the output resolution influences the time required for the resize operation.
Similarly, the number of samples influences the amount of time spent on making a sound louder, but
the current volume level does not affect the linear processing.
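A simple quanta-modification example illustrates this linearity: the cost of a volume change depends only on the number of samples (the structure of the raw format), never on the sample values themselves. The code below is an illustrative sketch, not part of the prototype.

def change_volume(samples: list, gain: float, bits: int = 16) -> list:
    """Linear quanta modification: one multiplication per sample, so the run
    time scales with the number of samples and is independent of the content."""
    limit = (1 << (bits - 1)) - 1                 # clip to the sample bit resolution
    return [max(-limit - 1, min(limit, int(s * gain))) for s in samples]

# Usage: the same amount of work for silence and for loud content.
quiet = change_volume([0, 0, 0, 0], gain=2.0)
loud = change_volume([12000, -15000, 30000, -32768], gain=0.5)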
Real-time encoding
Finally, the media data is encoded into the user-requested output format, which involves the
respective compression algorithm. Many different compression algorithms for audio
and video are available. The most widely known representative codecs are:
• For video
  - MPEG-2 P.2: ffmpeg MPEG-4 (libavcodec for MPEG-2), bitcontrol MPEG-2, Elecard MPEG-2, InterVideo, Ligos, MainConcept, Pinnacle MPEG-2
  - MPEG-4 P.2: XviD (0.9, 1.0, 1.1), DivX (3.22, 4.12, 5.2.1, 6.0), Microsoft MPEG-4 (v1, v2, v3), 3ivx (4.5), OpenDivX (0.3)
  - H.264 / MPEG-4 P.10 (AVC) [ITU-T Rec. H.264, 2005]: x264, Mpegable AVC, Moonlight H.264, MainConcept H.264, Fraunhofer IIS H.264, Ateme MPEG-4 AVC/H.264, Videosoft H.264, ArcSoft H.264, ATI H.264, Elecard H.264, VSS H.264
  - Windows Media Video 9
  - QuickTime (2, 3, 4)
  - Real Media Video (8, 9, 10)
  - Motion JPEG
  - Motion JPEG 2000
• For audio
  - MPEG-1 P.3 (MP3): Fraunhofer IIS
  - MPEG-2 P.3 (AAC): Fraunhofer IIS, Coding Technologies
  - MPEG-4 SLS: Fraunhofer IIS
  - AAC+: Coding Technologies
  - Lame
  - OGG Vorbis
  - FLAC
  - Monkey’s Audio
All the mentioned codecs are provided for non-real-time systems such as MS Windows, Linux
or Mac OS, in which the ratio of worst-case to average execution time of audio and video
compression may reach even a factor of thousands, so accurate resource allocation for real-time
processing is impossible with these standard algorithms. The variations in processing time
are caused by the content analysis step (due to the complexity of the algorithms), i.e. for video
these are motion-related calculations like prediction, detection and compensation, and for audio
these are filter bank processing (including the Modified DCT or FFT) and perceptual model masking
[Kahrs and Brandenburg, 1998] (some results are given later for each specific medium separately in the
following analysis-related sections: V.1 and VI.1). Within the analysis part of the codec, the most
suitable representation of the intermediate data for the subsequent standard compression algorithms
(like RLE or Huffman coding) is found, in which the similarity of the data is further exploited
(thus making the compression more efficient and leading to a smaller compressed size).
Thus, it is proposed in the RETAVIC architecture to separate the analysis step, and the
unpredictability accompanying it, from the real-time implementation and to put it into the
non-real-time preparation phase. Secondly, the encoder should be extended with support for MD,
which allows exploiting the data produced by the analysis. This idea is analogous to the two-pass
method in compression algorithms [Bovik, 2005; Westerink et al., 1999], where the non-real-time
codec first analyzes a video sequence entirely and then optimizes a second encoding pass of the
same data by using the analysis results. The two-pass codec, however, uses internal structures for
keeping the results and has no possibility to store them outside [Westerink et al., 1999]; the analysis
is done on each run even if the same video is used.
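The difference to a classical two-pass encoder can be sketched as follows: the analysis pass runs once in Phase 2 and persists its results as MD, while the real-time pass only consumes them. The function names and the MD content in this sketch are illustrative assumptions, not the actual algorithms.

def analyze(frames):
    """Non-real-time content analysis (Phase 2): produce continuous MD,
    e.g. frame types and motion vectors, and persist them (illustrative)."""
    return [{"frame_no": i, "frame_type": "I" if i % 30 == 0 else "P",
             "motion_vectors": []} for i, _ in enumerate(frames)]

def rt_encode(frames, continuous_md, quantize):
    """Real-time encoding (Phase 4): no motion search, only the predictable
    steps (DCT, quantization, entropy coding), steered by the stored MD."""
    for frame, md in zip(frames, continuous_md):
        yield quantize(frame, md)      # stands in for DCT/quantization/entropy coding

# Usage: analysis once, arbitrarily many MD-driven real-time encodings later.
frames = ["f%d" % i for i in range(60)]
md = analyze(frames)
bitstream = list(rt_encode(frames, md, quantize=lambda f, m: (m["frame_type"], f)))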
Finally, the transcoded media data are delivered outside the RETAVIC architecture, to the
client application. The delivery may involve real-time encapsulation into a network protocol,
which should be capable of delivering data under real-time constraints. The networking problems
and solutions, such as the Real-time Transport Protocol (RTP) [Schulzrinne et al., 1996], the
Microsoft Media Server (MMS) Protocol [Microsoft Corp., 2007a] or the Real-Time Streaming
Protocol (RTSP) [Schulzrinne et al., 1998], are however not in the scope of the RETAVIC project
and are not further investigated.
IV.2.4.2 Direct delivery
The second delivery channel is called bypass delivery, which is a direct delivery method. The idea
here is very simple: the media data stored by Phase 3 of the RETAVIC architecture are read
from the storage and delivered to the client application. Of course, real-time processing is
necessary here as well, so real-time requirements for reading, analogous to those of real-time
decoding, must be considered.
There are three possible scenarios in direct delivery:
1) Delivering internal storage format
2) Delivering origin source
3) Reusing processed media data
In order to deliver media data in the internal storage format, not much has to be done within
the architecture. All required facilities are already provided by the real-time transcoding part.
Obviously, real-time reading and data provision to the outside of the system must already be
supported by the real-time transcoding, so the bypass delivery should simply make use of them.
If storing the origin source is considered within an application scenario, the archiving origin
source module proposed as optional (in Section IV.2.1) has to be included. Moreover, the capability of
managing the origin source within the storage phase has to be developed; this covers
the adaptation of the media data and meta-data models. If searching facilities are present, they
must also support the origin source. Apart from these issues, the bypass delivery of the origin source is
conducted in a similar way to the internal format.
The third possible scenario considers reusing already processed media data and may be referred to
as caching. This is depicted by the arrow leading back from the real-time encoding to the multimedia
bitstream collection. The idea behind the reuse is to allow storing the encoded
media data after each request for a format not yet present in the multimedia bitstream
collection, in order to speed up further processing of the same request at the cost of storage. As
one can notice, the compromise between the cost of storage and the cost of processing is
crucial and has to be considered, so the application scenario should define whether such a
situation (reusing the processed media data) is really relevant. If it is, extensions analogous
to those for delivering the origin source have to be implemented, with consideration of linking more than one
media bitstream to the media object. Searching of already existing instances of various
formats also has to be provided.
IV.3. Evaluation of the General Idea
Contrary to the framework proposed by [Militzer, 2004] and [Suchomski et al., 2005], the
extended architecture allows for both storing the origin source of the data and converting it to the
internal storage format. The initial RETAVIC approach of keeping the origin source was
completely rejected in [Militzer, 2004] and only the internal format was allowed. This rejection
may, however, limit the potential application field of the RETAVIC framework, and somehow
contradicts the MPEG-7 and MPEG-21 models, in which a master copy (origin instance) of
the media object (media item) is allowed and supported. On the other hand, keeping the origin
instance introduces higher redundancy50, but that is a trade-off between application flexibility
and redundancy in the proposed architecture, which in my opinion is targeted well.
The newest proposal of the RETAVIC framework (Figure 4) is a hybrid of the previous
assumptions regarding many origin formats and the single internal storage format, thus keeping all
the advantages of the previous framework versions (as in [Militzer, 2004] and [Suchomski et al.,
2005]) and at the same time delivering higher flexibility to the applications. Moreover, all the
drawbacks present in the initial proposal of the RETAVIC architecture (as discussed in sections
2.1 and 2.2 of [Militzer, 2004]), like the complexity of many-to-many conversion (of arbitrary
input and output formats), the difficulty of accurately determining the behavior of a black-box
converter (global average, worst- and best-case behavior), unpredictable resource allocation (due to
the lack of influence on the black-box decoding), and no scalability in data access and thus no
adaptivity in the decoding process, are no longer present in the last version of the framework.
IV.3.1. Insert and Update Delay
The only remaining drawback is the delay when storing new or updating old media data in the
MMDBMS. As previously outlined, Phase 1 and Phase 2, even though separated, together serve
as the data insert/update facility: the real-time compression to an intermediate format is
only required for capturing uncompressed live video (part of Phase 1), the media buffer is
required for intermediate data captured in real time regardless of the real-time source (next part
of Phase 1), the archiving of the origin source is only required in some application cases, and the
conversion into the internal format and the content analysis are performed in all application cases in
subsequent steps. Obviously, the last steps of the insert/update facility, namely the conversion
and analysis steps, are computationally complex and run as best-effort processes; thus they
may require quite some time to finish51. Actually, it may take a few hours for longer
audio or video data to be completely converted and analyzed, especially when assuming (1) a
high load of served requests for media data in general and (2) the conversion/analysis processes
running only at low priority (such a setting favors serving many data-requesting
clients simultaneously (1) over uploading/updating clients (2)). To summarize, the delay between the
actual insertion/update time and the moment when the new data become usable in the system
and visible to the client is an unavoidable fact within the proposed RETAVIC framework.

50 The redundancy was not an objective of the RETAVIC architecture; however, it is one of the key factors influencing system complexity and storage efficiency. So, the higher the redundancy is, the more complex the system needed to manage consistency and the less efficient the storage.

51 Please note that during the input or update of data, real-time insertion is in most cases not required according to the Architecture Requirements from Section IV.1. The few cases of supporting real-time capturing are provided by Phase 1. Thus, the unpredictable behavior of the conversion to the internal format and of the analysis process is not considered a drawback.
However, it is believed that this single limitation can be well accepted in reality, considering the still
available support for the few applications demanding real-time capturing. Moreover, considering
points (1) and (2) of the previous paragraph, most of the system's resources would be spent
on real-time transcoding delivering format-converted media data, i.e. inserting new media
data into the MMDBMS is a rather rare case compared to transcoding and transmitting
already stored media content to a client for playback. Consequently, the proposed framework
delivers higher responsiveness to the media-requesting clients due to the assumed trade-off
between delay on insertion and speed-up during delivery.
Additionally, making newly inserted or updated media data available to the clients
as soon as possible is not considered to be an essential highlight of the MMDBMS. It actually
does not matter much for the user whether he or she is able to access newly inserted or updated data
just after the real-time import or a couple of hours later; it is more important
that the MMDBMS can guarantee reliable behavior and delivery on time. Nevertheless, the
newly inserted or updated data should still become accessible within a relatively short period.
IV.3.2. Architecture Independence of the Internal Storage Format
As already mentioned, by employing the media-specific encoder as an atomic and isolated
part of the Conversion to internal format module, the architecture gains the possibility of exchanging
(or updating) the internal storage format without changing the outside interfaces and
functionality, i.e. the data formats understood and the data formats provided by an MMDBMS
designed according to the RETAVIC architecture will stay the same as before the format
update/exchange. Of course, the exchange/update can be done for each medium separately, thus
allowing the fastest possible adoption of new results of the autonomous research on each
media type52.

52 This is exactly the case in research – video processing and audio processing are separate scientific fields and in general do not depend on or relate to each other. Sometimes the achievements from both sides are put together to assemble a new audio-video standard.
IV.3.2.1 Correlation between modules in different phases related to internal format
Figure 7. Correlation between real-time decoding and conversion to internal format.
Before conducting the replacement of the internal format, the correlation between the influenced
modules has to be explained. Figure 7 depicts this correlation by grey arrows with black edges.
The real-time decoding uses two data sources as input: media data and meta-data. The first
is simply the binary stream in the known format which the decoder understands. The second
is used by the self-controlling process being part of the decoder and by the resource
allocation for predicting and allocating the required resources. Thus, if someone decides to
replace the internal media storage format, they must also exchange the real-time decoding module
and, accordingly, the related meta-data. On the other hand, a new format and the related meta-data
have to be prepared by the new encoder and stored in the storage. This encoder has to be placed in
the encoding part of the conversion module of Phase 2. Due to the changes of the meta-data, the
DB schema has to be adapted accordingly for keeping the new set of data. This few-step exchange,
however, should not influence the decoding part of the non-real-time conversion module
nor the remaining modules of the real-time delivery phase.
IV.3.2.2 Procedure of internal format replacement
A step-by-step guide for the replacement of the internal storage format is proposed as follows:
1. prepare real-time decoding – implement it in the RTE of Phase 4 according to the used real-time decoder interface
2. prepare non-real-time encoding – according to the encoder interface of the Conversion to internal format module
3. design changes in the meta-data schema – those reflecting the data required by real-time decoding
4. encode all stored media bitstreams in the new format (on the temporal storage)
5. prepare meta-data for the new format
6. begin transactional update
   a. replace the real-time decoder
   b. replace the encoder in the Conversion to internal format module
   c. update the schema for meta-data
   d. update the meta-data
   e. update all required media bitstreams
7. commit
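A compact sketch of the execution phase (steps 6 and 7) of this procedure is given below; the lock object, the rollback convention and the step callables are illustrative assumptions, not part of the prototype.

from contextlib import contextmanager
import threading

_replacement_lock = threading.Lock()

@contextmanager
def exclusive_update():
    """Block media access exclusively for the duration of the update (step 6)."""
    with _replacement_lock:
        yield

def replace_internal_format(steps):
    """steps: ordered callables for 6a-6e; all succeed or the update is abandoned."""
    with exclusive_update():
        done = []
        try:
            for step in steps:                    # 6a replace decoder ... 6e update bitstreams
                step()
                done.append(step)
        except Exception:
            for step in reversed(done):           # naive rollback of completed steps
                getattr(step, "undo", lambda: None)()
            raise
    # leaving the context releases the lock, i.e. the update is committed (step 7)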
This algorithm is somewhat similar to the distributed 2-phase commit protocol [Singhal and
Shivaratri, 1994]: it has a preparation phase (commit-request: points 1-5) and an execution phase
(commit: points 6-7). It should work without any problems53 for non-critical systems in which
access to the data may be blocked exclusively for a longer time. However, in 24x7 systems such a
transactional update, especially point 6.e), may be hard to conduct. One possibility would then
be to prepare the bitstreams on an exact copy of the storage system, but with updated
bitstreams, and to exchange these systems instead of updating all the required media data. Another
solution would be to skip the transactional update and to perform the update based on a locking
mechanism for the separate media bitstreams, allowing two versions of the real-time decoding to
operate in parallel. This, however, is a more complex solution and has an impact on the control
application of the RT transcoding (because support for more than one real-time decoder per
medium was not investigated).

53 The blocking-protocol disadvantage of the 2-phase commit protocol is not considered here, because the human beings updating the system (the system administrators) are both the coordinators and the cohorts in the meaning of this transaction.
IV.3.2.3 Evaluation of storage format independence
Summarizing, with the possibility of replacing the internal format at will, the RETAVIC
architecture is independent of the internal storage format. As a result, any lossless scalable
format may be used. Thus, also the level of scalability may be chosen at will and depends only
on the format selected for a given application scenario, which makes the application flexibility of the
RETAVIC architecture even higher. Moreover, by assuming lossless properties of the internal
format in the architecture requirements, the number of replacements is not limited, contrary to the
case of lossy formats where the tandem coding problem occurs.
V. VIDEO PROCESSING MODEL
Chapter V introduces the video-specific processing model based on the logical model
proposed in the previous chapter. First, an analysis of a few representatives of the existing
video codecs and a short discussion of the possible solution are presented. Next,
one particular solution is proposed for each phase of the logical model. The video-related
static meta-data are described. Then, LLV1 is proposed as the internal storage format and
the coding algorithm based on [Militzer, 2004; Militzer et al., 2005] is explained and further
detailed. After that, the processing idea is explained: the decoding using its own MD subset and
then the encoding employing its own subset of MD as well. Only the continuous MD are referred to
in the processing part (Section V.5). Finally, the evaluation of the video processing model by
exploiting best-effort prototypes is presented.
The MD set covering the encoding part was first proposed by [Militzer, 2004] and
named coarse- and fine-granular MD (as mentioned in the subsection Meta data of Section IV.2.3,
coarse granularity refers to static MD and, analogously, fine granularity to continuous MD).
The MD set was then extended and refined in [Suchomski and Meyer-Wegener, 2006]. The
extension of the continuous part of the MD supporting adaptivity in LLV1 decoding was proposed by
[Wittmann, 2005].
V.1. Analysis of the Video Codec Representatives
An analysis of the execution time of representatives of DCT-based video encoding such as
FFMPEG54, XVID or DIVX clearly showed that the processing is irregular and unpredictable;
the time spent per frame varies to a large extent [Liu, 2003]. For example, the encoding time per
frame of the three mentioned codecs for exactly the same video data is depicted in Figure 855.
This clearly shows that for various codecs, and even for their different configurations, the
execution time per frame differs, and it is also not constant within one codec, i.e. frame by frame.
Moreover, even with the same input data the behavior of the codecs cannot be directly generalized
into simple behavior functions depending directly on the data, i.e. even for the same scene (frames
between 170 and 300) one codec speeds up, another slows down, and the third one does neither.
It is also clearly noticeable that raising the quality parameter56 to five (Q=5) for XVID, and thus
making the motion estimation and compensation steps more complex, causes the execution time
to vary more from frame to frame and the overall execution to be slower (the blue line lies above
the green one).

54 Whenever the FFMPEG abbreviation is used, the MPEG-4 algorithm within the FFMPEG codec is referred to. FFMPEG is sometimes called a multi-codec implementation because it can also provide differently coded outputs, e.g. MPEG-1, MPEG-2 PS (known as VOB), MJPEG, FLV (Macromedia Flash), RM (Real Media A+V), QT (Quick Time), DV (Digital Video) and others. Moreover, FFMPEG supports many more decoding formats (as input), e.g. all output formats and MPEG-2 TS (used in DVB), FLIC, and other proprietary ones like GXF (General eXchange Format), MXF (Media eXchange Format), NSV (Nullsoft Video). The complete list may be found in Section 6, Supported File Formats and Codecs, of [WWW_FFMPEG, 2003].
[Figure: line chart of the encoding time in seconds per frame (frames 0–400) for DIVX, FFMPEG, XVID Q1 and XVID Q5]

Figure 8. Encoding time per frame for various codecs.57
55 These are curves representing the average encoding time per frame over five benchmark repetitions of the given encoding in exactly the same system configuration.

56 The referred quality parameter (Q) defines the set of parameters used for controlling the motion estimation and compensation process. Five classes have been defined, from 1 to 5, in such a way that the smaller the parameter is, the simpler the algorithms involved.
57 The figure is based on the same data as Figure 4-4 (p.51) in [Liu, 2003], i.e. coming from measurements of the video clip "Clip no 2" with fast motion and fade out. The content of the video clip is depicted in Figure 2-3 (p.23) in [Liu, 2003]. The only difference is that the color conversion from RGB to YV12 was eliminated from the encoding process, i.e. the video data was prepared earlier in the YV12 color scheme (which is the usual form used in video compression).
[Figure: bar chart of the average encoding time per frame in seconds (with min/max bars) for each codec configuration; the deviation, variance and average values shown below the chart are:]

             DIVX   FFMPEG  XVID Q0  XVID Q1  XVID Q2  XVID Q3  XVID Q4  XVID Q5
Deviation     8.4      3.2      3.6      2.4      2.7      4.0      4.9      5.3
Variance     70.9     10.0     13.1      6.0      7.2     15.9     24.3     28.0
Average      34.5     48.2     22.1     19.4     27.4     29.1     31.0     36.1

Figure 9. Average encoding time per frame for various codecs58.
The encoding time was further analyzed and the results are depicted in Figure 9. Obviously,
FFMPEG was the slowest on average (black bullets) and XVID with the quality parameter set to
one59 (Q=1) the fastest. However, the minimal and maximal values of the time spent per frame
are more interesting. Here, the peak-to-average execution time plays a key role: it allows us to
predict the potential loss of resources (i.e. inefficient allocation/use) if worst-case resource
allocation is used for real-time transcoding. The biggest peak-to-average ratio for this specific
video data is achieved by XVID Q0 and amounts to a factor of 3.32. Other factors worth noticing
are the MIN/AVG and MIN/MAX ratios, the variance and the standard deviation. While
MIN/MAX or MIN/AVG may be exploited by the resource allocation algorithm (e.g. for defining
the importance of freeing resources for other processes), the variance and the standard deviation60
inform us about the required time buffer for the real-time processing. So, the more the
frame-to-frame execution time varies, the bigger the variance/deviation is. For example, DIVX
has the biggest variance and deviation, while its average encoding time lands somewhere in the
middle of all codec results. In contrast, FFMPEG is the slowest on average but its variance
is much smaller in comparison to DIVX.

58 All possible configurations of the quality parameter Q in the XVID encoder are presented – from 0 to 5, where 5 means the most sophisticated motion estimation and compensation.

59 The quality parameter set to zero (Q=0) lets the XVID encoder take the default values and/or detect them, which requires additional processing. That is why this execution is a bit slower than the one with Q=1.

60 The variance magnifies the measured differences by its quadratic nature. Thus it is very useful when the importance of constancy is very high and the detection of even small changes is required. One must, however, be careful when the variance is used with fractional values between 0 and 1 (e.g. coming from the difference of compared values), because then the measured error is decreased. In contrast, the standard deviation exposes linear characteristics.
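The ratios and dispersion measures discussed above can be computed directly from per-frame encoding times; the following sketch shows the calculation with made-up sample values, not the measurements from [Liu, 2003].

from statistics import mean, pstdev, pvariance

def timing_profile(frame_times_ms: list) -> dict:
    """Peak-to-average, MIN/AVG, MIN/MAX, standard deviation and variance of
    per-frame encoding times, as used to judge allocation efficiency."""
    avg = mean(frame_times_ms)
    return {
        "peak_to_avg": max(frame_times_ms) / avg,
        "min_to_avg": min(frame_times_ms) / avg,
        "min_to_max": min(frame_times_ms) / max(frame_times_ms),
        "stddev": pstdev(frame_times_ms),
        "variance": pvariance(frame_times_ms),
    }

# Usage with made-up per-frame times (milliseconds):
print(timing_profile([20.0, 22.5, 19.0, 60.0, 21.0]))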
[Figure: stacked per-frame distribution (0–100%) of the XVID encoding time over the stages Motion Estimation, Motion Compensation, DCT, Quant, IDCT, IQuan, Edges, Interpolation, Transfer, Prediction, Coding and Interlacing, for frames 0–399]

Figure 10. Example distribution of time spent on different parts in the XVID encoding process for Clip no 261.
Further investigations in [Liu, 2003] delivered detailed results on certain parts of the mentioned
codecs. According to the obtained values, the most problematic and time-consuming issues in the
processing are the motion estimation (ME) and compensation (MC). The encoding time
distribution calculated for each frame separately62 along the video stream is depicted here for
just two codecs: XVID in Figure 10 and FFMPEG in Figure 11. In the first figure it is clearly
seen that the time required for MC and ME of XVID reaches over 50% of the total time spent
per frame. Interestingly, the MC/ME time rises but the distribution does not vary much from
frame to frame. However, a different behavior can be noticed for FFMPEG, where the processing
time used per frame varies frame by frame, expressed by an angular line (peaks and drops
interchangeably). One more conclusion may be derived when comparing the two graphs,
namely that there is a jump in the processing time starting from the 29th frame of the given
sequence, but the reaction of each encoder is different. XVID adapts its control settings of the
internal processing slowly and distributes them over many frames, while FFMPEG reacts at once.
This is also the reason for the angular (FFMPEG) versus smooth (XVID) curves. It has also been
verified that after encoding there are one intra-frame and 420 inter-frames in XVID, vs.
two intra-frames and 419 inter-frames in FFMPEG. This extra intra-frame in FFMPEG was at
the 251st position, and thus there is also a peak where the MC/ME time was negligible with respect
to the remaining parts.

61 The figure is based on the same data as Figure 4-19 (p.61) in [Liu, 2003], i.e. the same video clip "Clip no 2". The encoder quality parameter Q was set to 5 (option "-Q 5").

62 Please note that these are the percentage values of time used for ME/MC per frame related to the total time per frame, and not the ME/MC time spent per frame.
[Figure: stacked per-frame distribution (0–100%) of the FFMPEG MPEG-4 encoding time over the stages Motion Estimation, Edge MC, DCT, Quantization, IQ+IDCT, Transfer, Interlacing, Simple Frame Det., IP+CS+PI, Picture Header, Huffman, Preprocessing, Postprocessing and Rest, for frames 0–414]

Figure 11. Example distribution of time spent on different parts in the FFMPEG MPEG-4 encoding process for Clip no 263.
Some other measurements from [Liu, 2003] (e.g. Figure 4-19 or 4-21) showed that the execution
times of ME/MC vary much more than those of all other parts of the encoding. This is due to the
dependency of the ME and MC on the video content, which is very hard to define as a
mathematical function. Thus, the ME/MC is the most unpredictable part of the encoding, where
variations of the consumed time exist even between subsequent frames. This effect of ME/MC
fluctuations can partly be seen from the changing shape of the ME/MC curves in both figures.
However, for that purpose it is better to look at the curves depicting the time spent in the exactly
measured values (mentioned in the first sentence of this paragraph).

63 The figure is based on the same data as Figure 4-21 (p.62) in [Liu, 2003]. The video is the same as in the example used for XVID (Figure 10), with the same limitation. The noticeable peak is due to the encoder's decision about a different frame type (I) for frame 250.
Summarizing, the ME/MC steps are problematic in video encoding. The problem lies both in
the complexity and the amount of time required for the processing, and in the
unpredictable behavior and the variations in time spent per frame.
V.2. Assumptions for the Processing Model
The first idea was to skip the ME/MC step completely in order to gain control over the
encoding process, which should work in real time. The encoding would thus be roughly twice as
fast and much more stable with respect to the time spent on each frame. However, a video processing
model using the straightforward removal of the complex ME/MC step would cause a noticeable
drop in coding efficiency and thereby a worse quality of the video information – it is
obvious that such additional complexity in the video encoder yields a higher compression ratio
and reduced data rates for the decoders [Minoli and Keinath, 1993]. Therefore the idea of a pure
exclusion of the ME/MC step from the processing chain was dropped.
Still, the idea of removing the ME/MC step from the real-time encoding seemed to be the only
reasonable way of gaining control over the video processing. Thus another, more
reasonable concept was to move the ME/MC step out of the real-time processing – not
dropping it, but putting it into the non-real-time part – and to use only the data produced by the
removed step as additional input to the real-time encoding. Based on that assumption, it
was necessary to measure how much data, namely motion vectors, is produced by the
ME/MC step. As stated in the published paper [Suchomski et al., 2005], the size overhead
related to this additional information amounts to about 2% of the losslessly encoded video data.
Hence it was clearly acceptable to gain a twice as fast and more stable real-time encoding – yielding a lower worst-case to average ratio of the time consumed per frame, and consequently allowing for more accurate
resource allocation – at only such a small cost of storage.
Finally, the non-real-time preparation phase included the ME/MC steps in the content analysis
of the video data, i.e. the ME/MC steps cover all the related activities of the compression algorithm,
like scene-change detection, frame-type recognition, global motion estimation, division into
macro blocks, detection of motion characteristics and complexity, and the decision about the type
of each macro block. Moreover, if there are ME-related parts applicable only to a
given compression algorithm but not named in the previous sentence, they should also be done
in the non-real-time content analysis step. All these ME activities are means for producing the
meta-data used in the real-time phase, not only for the compression process itself but also for
scheduling and resource allocation.
The remaining parts responsible for motion compensation, like motion vector encoding
and the calculation of texture residuals, and the parts used for compression in DCT-based
algorithms [ITU-T Rec. T.81, 1992], such as DCT, quantization, scanning of the quantized coefficients
(e.g. zig-zag scanning), run-length encoding (RLE), Huffman encoding, and bit-stream building,
are done in the real-time transcoding phase.
This splitting of the video encoding algorithm into two parts, in the non-real-time and real-time
phases, leads to the RETAVIC architecture, which may be regarded as already improved with respect
to processing costs in the design phase, because the model already considers the reduction of
resource consumption. The overall optimization is gained by executing the analysis step just once,
contrary to the complete-algorithm execution (in which the analysis is done each time
for the same video during the un-optimized encoding on demand). Secondly, encoding-related
optimizations are provided by a simplified encoder construction without the analysis
part. This simplification of the encoding algorithm leads to faster execution in real time and thus gives
the possibility of serving a bigger number of potential clients. Additionally, the compression
algorithm should behave more smoothly, which would decrease the buffer sizes.
Earlier attempts at proposing a processing model for video encoding did not consider the
relationship between the behavior and functionality of the encoder and the video data
[Rietzschel, 2002]. VERNER [Rietzschel, 2003], a real-time player and encoder
implemented in DROPS, used a real-time implementation of the XVID codec where the original
source code was directly embedded in the real-time environment and no processing
adaptation was performed, besides treating the encoding operation as the mandatory part and
the post-processing functionality as the optional part [Rietzschel, 2002]. As a result, the proposed
solution failed to integrate the real-time ability and predictability usable in QoS control, due to
the still too high variations of the required processing time per frame within the mandatory part64.
Therefore the novel MD-based approach is investigated in the next sections.
V.3. Video-Related Static MD
As mentioned in Section IV.2.3.2, the static MD are used for resource allocation. The
initial MD set (introduced in IV.2.2.3) related only to video is discussed in this section. Its
coverage is a superset of two sets reflecting the MD required for the two algorithms used within the
prototypical implementation of the RETAVIC architecture, namely the MD useful for the LLV1
and the XVID algorithms, respectively.
It is assumed that the media object (MO)65 defined by Equation (1) belongs to the media
objects set O. The MO is uniquely identifiable by the media identifier (MOID) as defined by
Equation (2), where i and j refer to different MOIDs.

\forall i,\ 1 \le i \le X:\quad mo_i = (type_i, content_i, format_i) \wedge type_i \in T \wedge content_i \in C \wedge format_i \in F \wedge mo_i \in O \qquad (1)

where T denotes the set of possible media types66, C denotes the set of content, F respectively the set
of formats, and X is the number of all media objects in O; none of the sets can be empty.

\forall i\,\forall j,\ 1 \le i \le X \wedge 1 \le j \le X:\quad i \neq j \Rightarrow mo_i \neq mo_j \qquad (2)
In other words, an MO refers to the data of one media type, either video or audio, and
represents exactly one stream of the given media type.
64 The mandatory part is assumed to be exactly and deterministically predictable, which is impossible without considering the data in the case of video encoding. Otherwise, it must be modeled with the worst-case model. Some more details are explained later.

65 The definitions of MMO and MO used here are analogous to those described in [Suchomski et al., 2004]. An MO is defined to have a type, content and format, and an MMO consists of more than one MO, as mentioned in the fundamentals of the related work.

66 The media type set is limited within this work to {V, A}, i.e. to video and audio types.
The multimedia object (MMO) is an abstract representation of the multimedia data and consists
of more than one MO. The MMO belongs to the multimedia objects set M and is defined by
Equation (3):

\forall i,\ 1 \le i \le Y:\quad mmo_i = \{\, mo_j \mid mo_j \in O \,\} \wedge mmo_i \in M \qquad (3)

where Y denotes the number of all multimedia objects in M. Analogously to the MO, the MMO is
uniquely identified by an MMOID, as formally given by Equation (4):

\forall i\,\forall j:\quad i \neq j \Rightarrow mmo_i \neq mmo_j \qquad (4)

Therefore the MO may be related to an MMO by the MMOID67.
Initial static meta-data are defined for the multimedia object (MMO) as depicted in Figure 12.
The static MD describing a specific MO are related to this MO by its identifier (MOID). The MD,
however, differ for the various media types, so the static MD for video are related to the MO
having the video type, and the set (StaticMD_Video) is defined as:

\forall i:\quad MD_V(mo_i) \subset StaticMD\_Video \Leftrightarrow type_i = V \qquad (5)

where type_i denotes the type of the media object mo_i, V is the video type, and MD_V is a complex
function extracting the set of meta-data related to the video media object mo_i. The index V of
MD_V denotes video-specific meta-data, in contrast to the index A, which is used for audio-specific
MD in the later part.
Moreover, the video stream identifier (VideoID) is a one-to-one mapping to the given MOID:

\forall i\ \exists j\ \neg\exists k\,(k \neq j):\quad VideoID_i = MOID_j \wedge VideoID_i = MOID_k \qquad (6)

67 The MO does not have to be related to an MMO, i.e. the reference attribute in the MO pointing to the MMOID is nullable.
The static MD of a video stream include the sums of each type of frame, i.e. there is a sum of
frames calculated separately for each frame type within the video. The frame type (f.type) is
defined as I, P, or B:

\forall i,\ 1 \le i \le N:\quad f_i.type \in \{I, P, B\} \qquad (7)

where I denotes the type of an intra-coded frame (I-frame), P denotes the type of a predicted
inter-coded frame (P-frame), B denotes the type of a bidirectionally predicted inter-coded frame
(B-frame), and N denotes the number of all frames in the video media object.

The sum for I-frames is defined as:

IFrameSum_{mo_i} = \left|\{\, f_j \mid f_j \in mo_i \wedge 1 \le j \le N \wedge f_j.type = I \,\}\right| \qquad (8)
where f_j is the frame at the j-th position and f_j.type denotes the type of the j-th frame. Analogously,
for P- and B-frames (N, f_j and f_j.type are the same as in Equation (8)):

PFrameSum_{mo_i} = \left|\{\, f_j \mid f_j \in mo_i \wedge 1 \le j \le N \wedge f_j.type = P \,\}\right| \qquad (9)

BFrameSum_{mo_i} = \left|\{\, f_j \mid f_j \in mo_i \wedge 1 \le j \le N \wedge f_j.type = B \,\}\right| \qquad (10)
Next, the video static MD are defined per each frame in the video – namely they include the
frame number and frame type, and the calculated sums of each macro block (MB) type. The frame
number represents the position of the frame in the sequence of frames of the given video, and
the frame type specifies the type of this given frame, f_j.type. The sum of macro blocks with
respect to a given MB type is calculated analogously to the frames as in Equations (8), (9)
and (10), but N refers to the total number of MBs per frame and the conditions are replaced
respectively by:

mbs_j = 1 \Leftrightarrow mb_j.type = I \mid mb_j.type = P \mid mb_j.type = B
mbs_j = 0 \Leftrightarrow mb_j.type \neq I \mid mb_j.type \neq P \mid mb_j.type \neq B \qquad (11)

(where the I-, P- and B-conditions are used for the I-, P- and B-MB sums, respectively).
The information about the sums of the different block types is stored with respect to the layer type (in
LLV1 there are four layers defined – one base and three enhancement layers [Militzer et al.,
2005]) in StaticMD_Layer.
Finally, the sums of the different motion vector types are kept per frame in
StaticMD_MotionVector. Nine types of vectors, which may be used in the video, have been recognized
so far [Militzer, 2004]. For each video frame, a motion vector class histogram is created by
counting the number of motion vectors corresponding to each class, and the result is stored in
relation to VectorID, FrameNo and VideoID.
A motion vector (x, y) activates one of nine different interpolations of pixel samples depending on the values of the vector's components [Militzer, 2004]. Therefore, nine types of motion vectors are distinguished [Militzer, 2004]:
mv1 – both x and y are in half-pixel precision; luminance and chrominance samples need to be horizontally and vertically interpolated to achieve prediction values.
mv2 – only the x component is given in half-pel precision; the luminance and chrominance match in the reference frame needs to be horizontally interpolated, and no vertical interpolation is applied to the chrominance samples as long as the y vector component is a multiple of four.
mv3 – as mv2, but y is not a multiple of four; the chrominance samples are additionally vertically filtered.
mv4 – only the y component of the motion vector is in half-pixel precision; the luminance and chrominance pixels in the reference frame need to be vertically interpolated and no horizontal filtering is employed as long as the x vector component is a multiple of four.
mv5 – as mv4, but x is not a multiple of four; the chrominance samples are additionally horizontally filtered.
mv6 to mv9 – both the x and y vector components have full-pel values; the interpolation complexity then depends only on whether the chrominance samples need to be interpolated: if neither of the vector components is a multiple of four, the chrominance samples need to be horizontally and vertically filtered (mv6); if only the y component is a multiple of four, the chrominance samples are horizontally filtered (mv7); if only the x component is a multiple of four, vertical filtering is applied (mv8); and if both x and y are multiples of four, no interpolation is required (mv9).
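A compact way to express this classification is sketched below; it assumes, purely for illustration, that the vector components are given in half-pel units, that "half-pel precision" corresponds to an odd value, and that the multiple-of-four test is applied to the same representation – these assumptions are ours and are not taken from the LLV1 sources.

#include <stdlib.h>

/* Classify a motion vector (x, y), given in half-pel units, into one of the
 * nine interpolation classes mv1..mv9 described above. */
static int mv_class(int x, int y)
{
    int x_half = (abs(x) & 1) != 0;   /* x has a half-pel fraction */
    int y_half = (abs(y) & 1) != 0;   /* y has a half-pel fraction */
    int x_mul4 = (x % 4) == 0;        /* x is a multiple of four   */
    int y_mul4 = (y % 4) == 0;        /* y is a multiple of four   */

    if (x_half && y_half)   return 1;                 /* mv1       */
    if (x_half)             return y_mul4 ? 2 : 3;    /* mv2 / mv3 */
    if (y_half)             return x_mul4 ? 4 : 5;    /* mv4 / mv5 */
    if (!x_mul4 && !y_mul4) return 6;                 /* mv6       */
    if ( y_mul4 && !x_mul4) return 7;                 /* mv7       */
    if ( x_mul4 && !y_mul4) return 8;                 /* mv8       */
    return 9;                                         /* mv9       */
}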
In order to simplify understanding of the initial set of MD, the current definition is mapped to
the relational schema and depicted as relational schema diagram (with primary keys, foreign keys
and integrity constraints for relationships) in Figure 12.
97
Chapter 3 – Design
V. Video Processing Model
Figure 12. Initial static MD set focusing on the video data68.
The video static MD as mapped according to Figure 12 allows for exploiting the potency of the
SQL by calculating the respective sum for given type with simple SQL query (Listing 1):
SELECT VideoID, count(FrameType)
FROM StaticMD_Frame
WHERE FrameType = 0
-- for P-frames: FrameType = 1
-- for B-frames: FrameType = 2
GROUP BY VideoID
ORDER BY VideoID;
Listing 1. Calculation of the sum for the I-frame type from existing MD (query condition for P- and B-frames is given as comments) for all described videos.

68 The figure is based on the Figure 2 of [Suchomski and Meyer-Wegener, 2006].
Also the sum of all types respectively to the type may be calculated in one SQL query (Listing
2):
SELECT VideoID, FrameType, count(FrameType)
FROM StaticMD_Frame
GROUP BY VideoID, FrameType
ORDER BY VideoID, FrameType;
Listing 2. Calculation of all sums according to the frame type of frames for all described videos.
If the sum of all motion vector types along the video for all videos is required, the following
SQL query extracts such information (Listing 3):
SELECT VideoID, VectorID, sum(MVsSum)
FROM StaticMD_MotionVector
GROUP BY VideoID, VectorID
ORDER BY VideoID, VectorID;
Listing 3. Calculation of all MV types per video for all described videos.
Of course, the entity set of StaticMD_Frame must be complete in the sense that all frames existing in the video are described and included in this set. Based on this assumption, the sums included in StaticMD_Video are treated as derived attributes computed by the above SQL statements. However, for optimization reasons and due to the rare updates of the MD set, these values are materialized and included as usual attributes in StaticMD_Video. On the other hand, if StaticMD_Frame were not complete, i.e. did not include all the frames, the sum attributes could not be treated as derived.
V.4. LLV1 as Video Internal Format
LLV1 stands for the Lossless Layered Video One format and was first proposed in [Militzer, 2004]. The detailed description can be found in [Militzer, 2004] and it covers: 1) higher
logical design, 2) advantages of using LLV1, 3) scalable video decoding useful for real-time
transcoding, 4) implementation, and 5) proposed method of decoding time estimation. The
compact description including all the mentioned issues with the related work is published in
[Militzer et al., 2005]. The work of [Wittmann, 2005] focuses on the decoding part of LLV1 only
and describes it in more details, where each step is represented separately by an algorithm. The
analysis, implementation and evaluation in the real-time aspect are also described in [Wittmann, 2005] (they are also extended and described in section IX.3 RT-MD-LLV1 of this work). Due to the available literature, this section will only 1) summarize the most important facts about LLV1 by explaining the algorithm in a simplified way, and 2) refine the mathematical definitions whenever necessary, thus making the previously doubtful explanation more precise. Some further extensions of the LLV1 algorithm are given in the Further Work section in the last chapter of this work.
V.4.1. LLV1 Algorithm – Encoding and Decoding
The LLV1 format fulfills the requirement of losslessness while at the same time gives the media
server flexibility to guarantee the best picture quality possible facing user requirements and QoS
characteristic changes. These two aspects are achieved by layered storage where each layer can
be read selectively. It is possible to limit the access to just a portion of the stored media data
(e.g. just the base layer, which is about 10% of the whole video [Militzer et al., 2005]). Thus, if
the lower resolutions of highly compressed videos are requested, only parts of the data are really
accessed. The other layers, i.e. the spatial and temporal enhancement layers, can be used to
increase the quality of the output, both in terms of image distortion and frame rate. Though
being a layered format designed for real-time decoding, LLV1 is still competitive regarding compression performance compared to other well-known lossless formats69 [Militzer et al., 2005]. The reasons for this competitiveness may derive from its origins – LLV1 is based on the XVID codec and was adapted by exchanging the lossy transform for a lossless one [Tran, 2000] and by refining the variable length encoding (incl. Huffman tables and other escape codes).
The simplified encoding and decoding algorithms are depicted in Figure 13. The input video
sequence (VS IN) is read frame-by-frame from the input and the frame type detection (FTD) is
conducted to find out whether the current frame should be an intra- or inter-coded frame. Usually, the intra-coded type is used when a new scene appears or the difference between subsequent frames crosses a certain threshold. If the frame is assigned to be inter-coded, then the next steps of motion detection (MD) producing the motion vectors (MV) and motion error prediction (MEP)
calculating the motion compensation errors (MCE) are applied. Otherwise, the pixel values (PV) are delivered to the transform step (binDCT). The binDCT is an integer transform using a lifting scheme and is similar in characteristics to the discrete cosine transformation (DCT) but is invertible (i.e. lossless) and produces bigger numbers in the output [Liang and Tran, 2001]. Here, depending on the frame input, either the PV or the MEP values are transformed from the time domain (pixel-values domain) to the frequency domain (coefficients domain).

69 At the time the development began, there were no lossless and layered (or scalable) video codecs available. Thus the conducted benchmarks relate LLV1 only to lossless codecs without scalability characteristics. Nowadays, there is ongoing work on the standardized MPEG-4 Scalable Video Coding (SVC) [MPEG-4 Part X, 2007], which is a much more sophisticated and promising solution, but the official standard was finished in July 2007 and thus could not be used within this work.
Figure 13. Simplified LLV1 algorithm: a) encoding and b) decoding.
As next step the quantization similar to the well-known H.263 quantization from MPEG-4
reference software [MPEG-4 Part V, 2001] is applied on the coefficient values. For each layer
different quantization parameters (QP) are used: QP=3 for quantization base layer (QBL),
QP=2 for quantization enhancement layer one (QEL1), and respectively QP=1 for the second
QEL and QP=0 for the third QEL – this is denoted correspondingly by Q3, Q2, Q1 and Q0.
Subsequently, the variable length encoding (VLE) is applied on the base layer coefficients and the accompanying motion vectors, outputting the BL bitstream. In parallel, the quantization bit plane calculation (QPC) is applied sequentially on the coefficients of QEL1, next of QEL2 and finally of QEL3. The QPC computes the prediction values and the sign values (q-plane) using formulas defined later. Each q-plane is then coded separately by the bit plane variable length encoding (BP VLE), which produces the encoded bitstreams of all QELs.
The decoding is the inverse process of the encoding. The input encoded bitstreams are specified
for the decoding – of course there must be BL and optionally ELs so the decoder accepts 1, 2,
3, or 4 streams. The BL bitstream is decoded by variable length decoding (VLD) and the
quantization coefficients (QC) or respectively quantization motion compensation error (QMCE)
are inverse quantized (IQ). Next the inverse transform step (IbinDCT) is executed. The motion
compensation (MC) using motion vectors (MV) is additionally performed for predicted frames.
The bit-plane variable length decoding (BP VLD) is the first step for the enhancement layers. Secondly, the quantization plane reconstruction (QPR) based on the QC from the layer below is conducted. Finally, the IQ and IbinDCT are conducted only for the highest-quality quantization
plane. The complete decoding algorithm with the detailed explanation is given in Appendix B.
V.4.2. Mathematical Refinement of Formulas
A few details needed to present all the problematic aspects have been missing in the previous papers, e.g. how exactly to calculate the bit plane values and when to store the sign information in the quantization enhancement layer (QEL). The formula given in the previous papers (no. 3.2 on p. 22 in [Militzer, 2004] and no. 1 on p. 440 in [Militzer et al., 2005]) for calculating the coefficient for the enhancement layer in the decoding process looks, after refinement, like the one in Equation (12):
C_i = \begin{cases} 2 \cdot C_{i-1} + P_i & \Leftrightarrow C_{i-1} > 0 \\ 2 \cdot C_{i-1} - P_i & \Leftrightarrow C_{i-1} < 0 \\ P_i \cdot S_i & \Leftrightarrow C_{i-1} = 0 \end{cases} \quad :\ P_i \in \{0,1\} \ \wedge\ S_i \in \{-1,1\} \ \wedge\ i \in \{1,2,3\}    (12)
where i denotes current enhancement layer, Ci is the calculated coefficient of the current i-layer,
Ci-1 is the coefficient of the previous layer (one below), Pi is the predicted value and Si is the
sign information.
Moreover, the formula for calculating the prediction value and sign information in the given bit
plane during the encoding process has been neither officially stated nor published before. It is
now given respectively in Equation (13) and in Equation (14):
P_i = \left| C_i - 2 \cdot C_{i-1} \right| \ \wedge\ i \in \{1,2,3\}    (13)

S_i = \begin{cases} -1 & \Leftrightarrow C_i < 0 \ \wedge\ C_{i-1} = 0 \\ 1 & \Leftrightarrow C_i \ge 1 \ \vee\ C_{i-1} \neq 0 \end{cases} \quad :\ i \in \{1,2,3\}    (14)
Please note that the predicted value and the sign information are stored only for the enhancement layers, so there is no such value where the current layer is the base layer. Moreover, the sign information is stored only when the zeroing of a negative coefficient in the next lower layer occurs, i.e. only if the coefficient of the layer below is zero and the current coefficient is smaller than 0 (i.e. negative).
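The relations (12) to (14) can be illustrated by the following small sketch; the function names and the example values are ours and merely assume that the coefficient of enhancement layer i is the next-finer quantization of the same transform coefficient.

#include <stdio.h>
#include <stdlib.h>

/* Encoder side, Eq. (13) and (14): prediction bit and sign for layer i (1..3). */
static void qpc_encode(int c_cur, int c_below, int *p, int *s)
{
    *p = abs(c_cur - 2 * c_below);               /* prediction value, 0 or 1        */
    *s = (c_cur < 0 && c_below == 0) ? -1 : 1;   /* sign kept only on zeroed parent */
}

/* Decoder side, Eq. (12): reconstruct the coefficient of layer i. */
static int qpr_decode(int c_below, int p, int s)
{
    if (c_below > 0) return 2 * c_below + p;
    if (c_below < 0) return 2 * c_below - p;
    return p * s;
}

int main(void)
{
    /* The same coefficient as seen by QBL (i=0) and QEL1..QEL3 (i=1..3). */
    int c[4] = { -5, -11, -22, -45 };
    for (int i = 1; i <= 3; i++) {
        int p, s;
        qpc_encode(c[i], c[i - 1], &p, &s);
        printf("layer %d: P=%d S=%+d reconstructed=%d\n",
               i, p, s, qpr_decode(c[i - 1], p, s));
    }
    return 0;
}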
The calculation of the coefficient according to the algorithm description above is given by
example in Figure 14, where the QBL stands for quantization base layer and QELx for the
respective quantization enhancement layer (x = {1,2,3}). QBL stores signed coefficients, while
QELs store the predicted values as unsigned bitplanes. Only the blue colored data are really
stored by the LLV1 bitstream.
Figure 14. Quantization layering in the frequency domain in the LLV1 format.
V.4.3. Compression efficiency
The compression efficiency is not the biggest advantage of the LLV1 but still it is comparable to
other lossless codecs. The results of comparison to a set of popular lossless codecs are depicted
for few well-known sequences in Figure 15. The YV12 has been used as internal color space for
all tested compression algorithms to guarantee a fair comparison.
The results –the output file sizes– are normalized to the losslessly-encoded LLV1 output file
sizes (100%) i.e. all the layers are included. As can be seen, LLV1 performs not worse than most
of the other codecs. In fact, LLV1 provides better compression than all other tested formats
besides LOCO-I, which outperforms LLV1 by approximately 9% on average. Of course, the
other codecs have been designed especially for lossless compression and do not provide
scalability features.
[Figure 15 chart: compressed file sizes of HuffYUV, Lossless JPEG, Alparysoft, LLV1 and LOCO-I for the Knightshields (720p), VQEG src20 (NTSC), Paris (CIF) and Silent (QCIF) sequences and the overall average, normalized to LLV1 (0% to 140% scale).]
Figure 15. Compressed file-size comparison normalized to LLV1 ([Militzer et al., 2005]).
V.5. Video Processing Supporting Real-time Execution
This section focuses on MD-supported processing. At first the continuous MD set including
just one attribute applicable during video decoding of the LLV1 is explained.
V.5.1. MD-based Decoding
The LLV1 decoding was designed with the assumption of being scalable in the processing. Moreover, the processing of each layer should be stable, i.e. the time spent per frame along the video was expected to be constant for the given type of frame or macro block. Thus, it should be enough to include just the static MD in the process of decoding for prediction. This assumption was practically tested and then refined: in order to support MD-based decoding, the existing LLV1 had to be extended by one important element placed in the continuous MD set, allowing for even more adaptive decoding, i.e. the granularity of adaptation was enhanced by allowing the decoding to stop in the middle of a frame, which reflects processing on the fine-grained macro block level.
Based on [Wittmann, 2005], the functionality of the existing best-effort LLV1 decoder has been
extended70 by the possibility to store the size occupied by the frame in the compressed
stream for all enhancement layers. The best-effort LLV1 decoder therefore accepts an
additional argument that defines whether the compressed frame sizes should be written to an
additional file as continuous MD71. The original best-effort decoder had no need for this metadata since the whole frame was read for each enhancement layer that had to be decoded and it
did not matter how long it took.
However, if the execution time is limited (e.g. the deadline is approaching), it may happen that the frame cannot be completely decoded from the compressed stream. In such a case, the real-time decoder should leave out some macro blocks at the end of the allocated time (period or timeslice). The problem is that it cannot start decoding the next frame due to the Huffman coding. Therefore, the end of the frame has to be passed as meta-data (in other words, the position of the beginning of the next frame).
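The following sketch only illustrates this use of the per-frame compressed sizes; the helper functions are trivial placeholders and do not reflect the RT-LLV1 decoder interfaces.

#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Placeholder for the real MB decoder, which would parse the Huffman-coded data;
 * it returns 0 when the frame is finished. */
static int decode_next_mb(FILE *bs) { (void)bs; return 0; }

static uint64_t now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000u + (uint64_t)(ts.tv_nsec / 1000);
}

/* frame_start: byte offset of the current frame in the EL bitstream,
 * frame_size : its compressed size taken from the continuous MD.     */
static void decode_frame_bounded(FILE *bs, long frame_start, long frame_size,
                                 uint64_t deadline_us)
{
    while (decode_next_mb(bs) != 0) {
        if (now_us() >= deadline_us) {
            /* Deadline reached: jump directly to the beginning of the next
             * frame instead of parsing the remaining Huffman codes.        */
            fseek(bs, frame_start + frame_size, SEEK_SET);
            return;
        }
    }
}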
V.5.2. MD-based Encoding
The idea of MD-supported encoding is explained along this work by using the MPEG-4 video
coding standard [MPEG-4 Part II, 2004]. This however, does not limit the idea. Different
encoders may reuse or partly adopt the existing MD. It may be also required to extend the MD
set by adding some other parameters than proposed here.
V.5.2.1 MPEG-4 standard as representative
The MPEG-4 video coding standard was chosen as representative due to a few properties which are common in most video encoding techniques. First, it is a transform-based algorithm. There are a few possible domain transforms which could be applied to video processing, such as the Fourier Transform (FT) [Davies, 1984] in its discrete form (DFT) or its fast form (FFT),
Discrete Sine Transform (DST), Discrete Cosine Transform (DCT) [Ahmed et al., 1974],
70 The decoder accepting an additional argument that defines whether this information should be written to an extra MD file is available through the –m <header-file> switch.
71 This functionality should be included on the encoder side (and used during the analysis phase). However, it required less programming effort to include it on the decoder side for decoding existing streams in order to extract the size of frames in the ELs of the considered media objects.
the Laplace Transform (LT) [Davies, 1984] or the Discrete Wavelet Transform (DWT) [Bovik, 2005]. All these transforms, if applied to video compression, should consider two dimensions (2D) in height and width of each frame, but it is also possible to represent a 2D transform as a 1D transform. Out of the available transforms only the DCT was well accepted in video processing because: 1) the DCT is equivalent to a DFT of roughly twice the length (which would be more complex in calculation), 2) the FFT could be a competitor but produces larger coefficients from the same input data (so it is harder to handle by entropy coding and thus less efficient compression can be obtained), 3) the DCT operates on real data with even symmetry (in contrast to the odd DST), 4) there are extremely fast 1-D DCT implementations (in comparison to other transforms) operating on 8x8 matrixes of values (representing the luminance or chrominance)72, and 5) the 2D-DWT does not allow for applying ME/MC techniques for exploiting temporal redundancy73. There exist floating-point and integer implementations of the fast and well-accepted DCT, but to be more precise, the 1-D DCT Type II given by Equation (15) [Rao and Yip, 1990] is the most popular due to a wide range of proposed low-complexity approximations and fast integer-based implementations for many computing platforms [Feig and Winograd, 1992]. The DCT Type II is applied in the MPEG standards as well as in the ITU-T standards and in many other non-standardized codecs. Hence, the resulting coefficients are expected to be similar in the codecs using this type of DCT.
Z_k = \sqrt{\frac{2}{N}}\ c_k \sum_{n=0}^{N-1} x_n \cos\frac{k(2n+1)\pi}{2N}, \quad \text{where}\quad c_k = \begin{cases} \frac{1}{\sqrt{2}} & \Leftrightarrow k = 0 \\ 1 & \Leftrightarrow k \neq 0 \end{cases}    (15)
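Written directly from Equation (15), a naive 8-point 1-D DCT-II looks as follows; this is purely an illustration of the formula (floating-point, no fast factorization) and not the integer binDCT used in LLV1.

#include <math.h>

#define N 8

/* Naive 1-D DCT Type II of an 8-sample row, directly following Equation (15). */
static void dct_ii_1d(const double x[N], double Z[N])
{
    const double pi = acos(-1.0);
    for (int k = 0; k < N; k++) {
        double ck  = (k == 0) ? 1.0 / sqrt(2.0) : 1.0;  /* c_k from Eq. (15) */
        double sum = 0.0;
        for (int n = 0; n < N; n++)
            sum += x[n] * cos(k * (2 * n + 1) * pi / (2.0 * N));
        Z[k] = sqrt(2.0 / N) * ck * sum;
    }
}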
Secondly, the generalization of video coding algorithms includes the two types of frames almost
in all types of transform-based algorithms. The intra-coded and inter-coded frames can be
distinguished. The intra-coded processing does not include any motion algorithms
(estimation/prediction/compensation), while the inter-coded technique uses motion algorithms
intensively and encodes not the coefficients of pixel values (as intra-coded does) but the error
coming from the difference between two compared sets of pixels (usually 8x8 matrixes).
72 In the newest video encoding algorithms some other derivates of the DCT are employed. For example, in MPEG-4 AVC it is possible to use an integer-based DCT, which operates on 4x4 matrixes and is faster in calculating 256 values (4 matrixes vs. 1 matrix in the old DCT).
73 There are also fast 3D-DWTs available, mentioned already in the related work, but due to their inapplicability in real-time they are omitted.
Figure 16. DCT-based video coding of: a) intra-frames, and b) inter-frames.
Thirdly, as can be noticed, there exist common parts in the DCT transform-based algorithms [ITU-T Rec. T.81, 1992], which may still differ a bit in the implementation. These are namely: 1) quantization (the most commonly used types are MPEG-based and H.263-based, which, by the way, may also be used in MPEG-compliant codecs), 2) coefficient scanning according to a few patterns (horizontal, vertical, and the most popular zig-zag), 3) run length encoding, 4) Huffman encoding (often the Huffman tables differ from codec to codec) and, in the case of inter-frames, 5) motion estimation and prediction generating motion vectors and the motion compensation error (also called prediction error), which uses 6) the frame buffer of the previous and/or following frames related to the currently processed frame. A more detailed comparison of MPEG-4 vs. H.263 can be found in Appendix C.
V.5.2.2 Continuous MD set for video encoding
Based on the mentioned commonalities of the DCT-based compression algorithms, few kinds
of information are stored as continuous MD for video, namely: frame coding type, bipred value, MB
width and height, MB mode and priority, three most significant coefficient values and motion vector(s).
Especially fine-granularity information on coefficient values and motion vectors makes the size
relatively large, and thus continuous MD should be compressed along with the media bit stream
(as tests showed yielding about 2% of the LLV1 stream size [Suchomski et al., 2005]). Of
course, the continuous MD are not interleaved with the video data, but are stored as separate
bitstream (e.g. to allow decoding of media data passed in the direct delivery by the standard
decoder). What is important is the time-dependency of the continuous MD due to their close
108
Chapter 3 – Design
V. Video Processing Model
relationship to the given frame, and thus they should be delivered by real-time capable system
along the video stream.
The parts of the proposed continuous MD set have already been proposed in the literature as it
was mentioned at the beginning of section V within this chapter. However, the parts have not
been collected in one place making the continuous MD set comprehensive. As so, the elements
of the continuous MD set are described in the following.
Frame coding type. It holds information on the type of a given frame in the video sequence
(I-, P- or B-frame). The type is detected in the analysis process to eliminate the resource-demanding scene detection and frame-type decision. Furthermore, it allows better prediction of
compression behavior and resource allocation, because the three types of frames are
compressed in very different ways due to the use of dissimilar prediction schemes.
Bipred value. The bipred value is used in addition to the frame coding type and it says whether two vectors have been used to predict a block – if any block in the frame has two vectors, the value is set. This is a useful optimization for bidirectionally-predicted frames (B-frames), i.e. if just one reference frame should be used during encoding, the algorithm may skip the interpolation of the two referred pixel matrixes (with all the accompanying processing).
Macro block width and height. This is just simple information about the number of MB in
two directions, horizontal and vertical respectively. It allows us to calculate the number of MBs
in the frame (we did not assume that it is always the same).
Macro block mode. Similar to frame coding type, the information about macro block (MB)
mode allows distinguishing, if the MB should be coded as intra, inter or bi-directionally
predicted. It also stores more detailed information. For example, if we take into account only bidirectionally predicted MB's, there are the following five modes possible: forward predicted,
backward predicted, direct mode with delta vector74, direct mode without MV and not coded.
Special cases of a bi-directionally predicted with two vectors are considered separately (by bipred
value). If we go further and take into consideration H.264/AVC [ITU-T Rec. H.264, 2005] for similar bidirectional MB's, there can be 23 different types. If we consider the intra type of MB's, there are 25 possible types in H.264/AVC, but only 2 in MPEG-4 (inter MB's have 5 types in both standards).

74 For each luminance block there may be none, one or two vectors, which means that there may be none, 4 x 1 or 4 x 2 vectors for the luminance in an MB. There is however one more „delta vector" mode, i.e. the backward or forward vector is the same for all four luminance blocks.
Such a diversity of MB coding types makes the compression algorithm very complex in order to
find the optimal type. So, using meta-data from the analysis process will significantly speed up
the compression and allows for recognition of the execution path, which is useful in resource
allocation.
Macro block priority. Additionally, if the priority of MB is considered, the processing could be
influenced by calculating at first these MBs with the highest importance, which is done in our
implementation for the intra blocks (they have the highest priority). Moreover, depending on
the complexity of the blocks within the MB, the encoder can assign more or less memory to the
respective blocks in the quantization step [Symes, 2001]. And now, not only complexity of block
but also the importance of current MB could influence the bit allocation algorithm and optimize
the quantization parameter selection, thus influencing positively the overall perceived quality.
Most significant coefficients. The first three coefficients, the DC and two AC coefficients,
are stored in addition, but only for the intra coded blocks. This allows for better processing
control by possible skipping coefficient calculations (DCT) in case of lack of resources – in such
situation it would be possible to just calculate not the real but estimated pixel values and still
deliver acceptable picture quality. Since the DCT in different codecs is expected to work in a similar way, storing the first three coefficients will influence the size of the MD just a little, but on the other hand allows skipping DCT calculations without the cost of dropping the macro block.
In other words, the quality provided in case of skipping the MB with just three coefficients, will
be still higher than zero.
Motion vectors. They could be stored for the whole MB, or for parts of a MB called blocks,
however our current implementation considers each block separately. Of course, only
temporally predicted MB's require associating MV's (MVs do not exist for intra MBs at all).
However, in different compression algorithms there are different numbers of MV's used by one
MB, e.g. in H.264/AVC it is allowed to keep up to 16 MV's per MB. So, if all possible
combinations were searched (using e.g. quarter-pel accuracy), the execution time would explode. Thus pre-calculated MVs help a lot in the real-time encoding, even though in some cases they will not be directly applicable (in such situations they should be adapted before use).
The loading of continuous MD is to be implemented as a part of the real-time encoder. The pseudo code showing how to do the implementation is given in section XVII of Appendix D.
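For illustration only, the elements listed above could be grouped per frame roughly as in the following sketch; the field names and widths are assumptions of this example and do not reflect the actual continuous MD bitstream syntax.

#include <stdint.h>

typedef struct {
    int16_t x, y;                 /* motion vector of one block                     */
} md_mv_t;

typedef struct {
    uint8_t  mode;                /* intra / inter / bidirectional sub-mode         */
    uint8_t  priority;            /* importance used to order MB processing         */
    int16_t  coeff[3];            /* DC and two most significant AC (intra only)    */
    md_mv_t  mv[4];               /* up to one vector per luminance block           */
} md_mb_t;

typedef struct {
    uint8_t  frame_type;          /* I, P or B                                      */
    uint8_t  bipred;              /* set if any block in the frame uses two vectors */
    uint16_t mb_width, mb_height; /* number of MBs horizontally and vertically      */
    md_mb_t *mb;                  /* mb_width * mb_height entries                   */
} md_frame_t;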
V.5.2.3 MD-XVID as proof of concept
XVID is a state-of-the-art open-source implementation of the MPEG-4 standard for visual data [MPEG-4 Part II, 2004], more precisely for rectangular video (natural and artificial) supporting the Advanced Simple Profile. It was chosen to be the base for the meta-data-based
representative encoder due to its good compression [Vatolin et al., 2005], stable execution and
source code availability. It was adapted to support the proposed continuous MD set i.e. the
algorithm given in the previous section has been implemented. The design details about the
MD-based XVID encoder can be found in [Militzer, 2004] and here only the overview is given.
There is the XVID encoder depicted in Figure 17 a), which combines the previously discussed
DCT-algorithm for inter- and intra-frames (in Figure 16) and optimizes the quality through
additional decoding part within the encoder. The first step done by XVID is making decision
about the frame type for applying selectively further steps, which is very crucial for the coding
efficiency. As given in the literature [Henning, 2001], the compression ratio for the MPEG-4 standard depends on the frame type and typically amounts to 7:1 for I-frames, 20:1 for P-frames and 50:1 for B-frames. Of course, these ratios can change depending on the temporal and
spatial similarity i.e. the number of different MD types encoded within the frame and
accompanying respective MVs. Then depending if the P- or B-frame should be processed, the
motion estimation, which produces motion vector data, together with the motion prediction
delivering motion compensation error are applied. Additionally coding reordering must take
place in the case of B-frames, and two reference frames are used instead of one, so in that case the ME/MP complexity is higher than for a P-frame. Next, the standard steps of DCT-based encoding are applied: DCT, quantization, zig-zag scanning, RLE and Huffman encoding.
Due to the mentioned quality optimization by additional decoding loop, the lossy transformed
and quantized AC/DC coefficients are decoded through dequantization, inverse DCT and
finally, if the frame references another frame (precisely, for a P-frame), the motion compensation takes place. Such an encoder extension imitates the client-side behavior of the decoder and gives a common base for the reference frame reconstruction, thus avoiding an additional reference error deriving from the difference between the decoded reference frame and the original reference frame.
A representative of the meta-data-based video encoder built on XVID (called MD-XVID for short) is depicted in Figure 17 b). The lossless meta-data decompression step is left out of the picture for simplicity, but in reality it takes place as suggested in section V.1. The depicted MD represents not both types of MD but only the continuous set as given in section V.5.2.2, i.e. all seven MD elements are included, and for each the arrowed
connector showing the flow of meta data to the given module of the encoding process is
depicted. The Frame Type Decision from Figure 17 a) is exchanged by the much faster MD-Based
Frame Type Selection. The Motion Estimation step present in Figure 17 a) is completely removed in
MD-XVID thanks to MV data flowing to the Motion Prediction step, and the MP is simplified due
to the inflow of the three additional elements of continuous MD set such as bipred value, MB
mode and MB priority.
Finally, the first three and most significant coefficients are delivered to the Quantization step if
and only if they are not calculated beforehand (the X-circle connector symbolizes the logical
XOR operation). This last option allows for delivering the lowest possible quality of the
processed MBs in case of high load situations. For example, if the previous steps like motion
prediction and DCT could not be finished on time or if the most important MBs took too much
time and the remaining MBs are not going to be processed, then just the lowest possible quality
including the first three coefficients and the zeroed rest are further processed and finally
delivered.
Optionally, there could be also an arrowed connector symbolizing the flow of the MB priority
to the Quantization step in the Figure 17 b), thus allowing for even better bit allocation by
extended bit rate control algorithm. However, this option has been neither implemented nor
investigated yet, and as so, it is left for further investigations.
Figure 17. XVID Encoder: a) standard implementation and b) meta-data based
implementation.
V.6. Evaluation of the Video Processing Model through Best-Effort Prototypes
The video processing model has been implemented at first as a best-effort prototype in order to
evaluate the assumed ideas. Here few critical aspects reflecting the core components of the
architecture (from Phase 2: conversion to internal format and content analysis; from Phase 3:
decoding and encoding) have been considered such as: generation of MD set by analysis step,
encoding to internal storage format demonstrating the scalability in the data amount and quality,
decoding from internal format exhibiting the scalability in the processing in relation to the data
quality, and encoding using the prepared MD where the quality of delivered data or the
processing complexity are considered. Finally, the evaluation of complete processing of video
transcoding chain is done in respect to execution time.
The evaluation of static MD is not included in this section, because it can be done only with the help of an environment with well-controlled timing (i.e. it is included in the evaluation of the implementation in the real-time system).
V.6.1. MD Overheads
In the best-effort prototype, the content analysis of Phase 2 generating MD has been integrated
with the conversion step transforming video data from the source format to the internal format.
Here, the implemented LLV1 encoder has been extended by content analysis functionality, such
that the LLV1 encoder delivers the required statistical data describing the structure of the
lossless bit stream used for scheduling as well as the data required for the encoding process in
real-time transcoding, mainly used by the real-time encoder. Depending on the number of LLV1
layers to be decoded and the requested output format properties of the transcoding process, not
all generated MD may be required. In these cases, only the really necessary MD could be
accessed selectively so that a waste of resources is avoided.
The overhead of introducing the continuous meta-data is depicted in Figure 18 a). The average
cost for the 25 well-known sequences, which was calculated as relation of the continuous MD
size to the LLV1 base layer size and multiplied by 100%, was equal to 11.27 %. The measure
related to the base layer was used instead of the relation to the complete LLV1 (i.e. including all layers) because the differences can be noticed more easily; the cost of continuous MD in relation to the complete LLV1 amounts on average to 1.63% with a standard deviation of 0.34%.
[Figure 18 charts: a) cost of continuous MD, size_cMD / size_BL [%], from 0% to 30% for each of the 25 tested sequences (carphone, coastguard, container, foreman, garden, hall_monitor, mobcal, mobile, parkrun, shields, stockholm and tennis in their various resolutions); b) distribution of the costs as the number of results falling below the 5%-wide ranges up to 30%.]
Figure 18. Continuous Meta-Data: a) overhead cost related to LLV1 Base Layer for tested
sequences and b) distribution of given costs.
The difference between the average cost for all sequences and the sequence-specific cost is also
depicted by the thin line with the mark. It’s obvious that the size of continuous MD is not
constant because it heavily depends on the content i.e. on motion characteristics including MVs
and coefficients, and additionally it is influenced by the lossless entropy compression. However,
the precise relationship between continuous MD and the content of the sequence was neither
mathematically defined nor further investigated.
Additionally, the distribution of the MD costs in respect to all tested sequences has been
calculated using frequency of occurrence in the equally distributed ranges of 5% with the
maximum value of 30%75. This distribution is depicted in Figure 18 b). It is clearly visible that in most cases the MD overhead lies in the range between 10% and 15%; the MD overhead is lower than 15% in 80% of the cases (20 out of 25 sequences), and below 25% in 96% of the cases.

75 The ranges are as follows: (0%;5%), <5%;10%), <10%;15%), <15%;20%), <20%;25%), and <25%;30%). There was no continuous MD cost above 30% measured.
Finally, the static MD overhead was calculated. The average cost in relation to the LLV1 BL amounts to 0.72% with a standard deviation below 0.93%, so such a small overhead can easily be neglected with respect to the total storage space for the complete video data. If one considers that the LLV1 BL occupies between 10% and 20% of the losslessly stored video in the LLV1 format, then the overhead can be treated as unnoticeable – below 1‰ of the complete LLV1.

To summarize, the static MD is just a small part of the proposed MD set and the continuous MD plays the key role here. Still, the continuous MD introduces only a very small overhead of 1.6% when related to the complete lossless data, or a small overhead of 11.3% when measured only against the LLV1 base layer. Please note that the LLV1 BL also has a limited level of data quality, as shown in the next section, but the size of the continuous MD is constant for a given sequence regardless of the number of LLV1 layers used in the later processing.
V.6.2. Data Layering and Scalability of the Quality
The scalability of data with respect to quality is an important factor in the RETAVIC architecture. The LLV1 was designed with the purpose of having the data layered in an additive way without unnecessary redundancy in the bit streams. The proposed layering method thus provides many advantages over traditional non-scalable lossless video codecs such as HuffYUV [Roudiak-Gould, 2006], Motion CorePNG, the AlparySoft Lossless Video Codec [WWW_AlparySoft, 2004], Lossless Motion JPEG [Wallace, 1991] or LOCO-I [Weinberger et al., 2000]. Especially in the context of the RETAVIC transformation framework targeting format independence, where a changeable level of quality is a must, the data are not allowed to be accessed on an all-or-nothing basis without scalability, so merely reducing the compressed video file to a manageable size [Dashti et al., 2003] is not enough anymore.
The organization of data blocks allocated on the storage device is expected to follow the LLV1 layering, such that the separate layers can be read efficiently and independently of each other, preferably sequentially, due to the highest throughput efficiency of today's hard drives, which then achieve their peak performance [Sitaram and Dan, 2000]. On the other hand, the data prefetching mechanism must consider the time constraints of the processed quanta of each layer, which is hardly possible with sequential reading for a varying number of data layers being stored separately. The recently proposed rotational-position-aware RT disk scheduling using a dynamic active subset, which exploits QAS (discussed in the related work), can be helpful here [Reuther and Pohlack, 2003]. This algorithm provides a framework optimizing the disk utilization under hard and statistical service guarantees by calculating the set of active (out of all outstanding) requests at each scheduling point such that no deadlines are missed. The optimization within the active set is provided by employing the shortest access time first (SATF) method, which considers the rotational position of the request.
The time constraints for write operations are left out intentionally due to their unimportance to
the RETAVIC architecture in which only the decoding process of the video data stored on the
server must meet real-time access requirements. Since the conversion from input videos into the
internal storage format is assumed to be a non-RT process, it does not require time-constrained
storing mechanism so far.
The major advantage of the layered approach through bit stream separations in LLV1 is the
possibility of defining picture quality levels and temporal resolutions by the user where the
amount of requested as well as really accessed data can vary (Figure 19). Contrary, other scalable
formats designed with network scalability in mind cannot be simply accessed in the scalable
way – in such a case, the video bit stream has to be read completely (usually sequentially) from the storage medium and unnecessary parts can be dropped only during the decoding or transmission process, where there are two possibilities: 1) the entropy decoding takes place to find out the layers, or 2) the bit stream is packetized such that the layers are distinguishable without decoding [MPEG-4 Part I, 2004]. Regardless of the method, all the bits have to be read from the storage medium. Obviously, the lossless content requires significantly higher throughput than in the lossy case. Even though modern hard disks are able to deliver unscaled and losslessly coded content in real-time, it is a waste of resources to always read the complete bitstream even when not required. The scalable optimization in data prefetching for further processing in RETAVIC is simply delivered by the scheme where just a certain subset of data is read, in which the number of layers and thus the amount of data can vary, as shown for the Paris (CIF) sequence in Figure 19
a). There is the base layer taking only 6.0% of the completely-coded LLV1 including all the
layers (Figure 15 b)). The addition of the temporal enhancement layer to the base layer
(BL+TEL) brings the storage space requirement for LLV1-coded Paris sequence to the level of
about 9.6%. Further increase of data amount by successive quantization enhancement layers
(QELs) i.e. QEL1 and QEL2 raises the required size respectively to 32.1% and 65.6% for this
specific sequence. The last enhancement layer (QEL3) needs about 34.4% to make the Paris
sequence lossless.
[Figure 19 charts for the Paris (CIF) sequence compressed in the LLV1 format: a) cumulated size in MB for BL, BL+TEL, BL+TEL+QEL1, BL+TEL+QEL1+QEL2 and all layers (scale 0 to 100 MB); b) percentage of each layer: BL 6.0%, TEL 3.6%, QEL1 22.5%, QEL2 33.5%, QEL3 34.4%.]
Figure 19. Size of LLV1 compressed output: a) cumulated by layers and b) percentage of each
layer.
The LLV1 layering scheme, in which the size of the base layer is noticeably smaller than the volume of all the layers combined (Figure 19), entails a proportionally reduced number of disk requests made by the decoding process. Thus, LLV1 offers new freedom for real-time admission control and disk scheduling, since most of the time users actually request videos at a significantly lower quality than the internally stored lossless representation due to limited network or playback capabilities (e.g. handheld devices or cellular phones). As a result, by using separated bit streams in the layered representation – like the LLV1 format – for internal storage, the waste of resources used for data access can be reduced. Consequently, the separation of bit streams in LLV1 delivers a more efficient use of the limited hard-disk throughput and allows more concurrent client requests to be served.
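As a minimal sketch of how this selective access translates into the amount of data to fetch: given per-layer sizes known from the static MD, the bytes to read for a requested number of layers are simply accumulated (the layer ordering and the fixed count of five layers are assumptions of this example).

#include <stddef.h>

/* layer_size[0..4] = sizes of BL, TEL, QEL1, QEL2, QEL3 as stored separately;
 * n_layers = how many of them the requested quality needs.                   */
static size_t bytes_to_read(const size_t layer_size[5], int n_layers)
{
    size_t total = 0;
    for (int i = 0; i < n_layers && i < 5; i++)
        total += layer_size[i];   /* each layer is a separate bit stream */
    return total;
}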
To verify whether the layering scheme is as efficient for other video sequences as in the case of the Paris sequence (Figure 19), a subset of well-known videos [VQEG (ITU), 2005; WWW_XIPH, 2007] has been encoded and the size required for the data on each level of the LLV1 layering scheme has been compared to the original size of the uncompressed video sequence. The results are depicted in Figure 20. There the base layer was always investigated together with the temporal enhancement layer in order to measure the frame rate of the original frequency resolution. It can
be noticed that the size of the base layer (including TEL) proportionally influences the next layer – the QEL1 built directly on top of BL+TEL. For example, the BL+TEL of the Mobile sequence crosses the 20% line in all three resolutions (CIF, QCIF, CIFN), just as the Mobile's QEL1s are above the average. Contrary, the Mother and Daughter, Foreman and News sequences have BL+TELs as well as QEL1s smaller than the respective average values. Thus it may be derived that the bigger the base layer, the bigger the share of QEL1 in the LLV1-coded sequence. Contrary, the QEL2 and QEL3 are only slightly influenced by the BL+TEL compression size and both are almost constant – the average compression size with respect to the original uncompressed sequence amounts to 19% and 19.2%, respectively, and the standard deviation is equal to 0.47% and 0.41%.
[Figure 20 chart: relation of the LLV1 layers (BL+TEL, QEL1, QEL2, QEL3, ALL) to the uncompressed video, from 0% to 100%, for each of the tested sequences (carphone, coastguard, container, football, foreman, garden, hall_monitor, mobcal, mobile, mother_and_daughter, news, paris, parkrun, shields, stockholm and tennis in their various resolutions).]
Figure 20. Relation of LLV1 layers to origin uncompressed video in YUV format.
Summarizing, it may be deduced that the QEL2 and QEL3 are resistant to any content changes and have an almost constant compression ratio/size, while the QEL1 depends somewhat on the content and the BL+TEL is very content-dependent. For the QEL1 the minimum compression size amounts to 8.2% and the maximum to 17.8%, thus the MAX/MIN ratio reaches 2.16; the BL+TEL, demonstrating even less stability, achieves respectively 3.4% and 31.6%, and a MAX/MIN ratio equal to 9.41. Contrary, the MAX/MIN ratios of QEL2 and QEL3 are equal to 1.11 and 1.10. On the other hand, the content-independent compression may indicate a wrong compression scheme that is not capable of exploiting the signal characteristics and the entropy of the source data on the higher layers, but some more investigation is required to prove such a statement.
Figure 21. Distribution of layers in LLV1-coded sequences showing average with deviation.
Additionally, the distribution of layers in the LLV1-coded sequences has been calculated and is
depicted in Figure 21. The average is calculated for the same set of videos as Figure 20, but this
time the percentage of each quantization layer76 within the LLV1-coded sequence is presented.
The maximum and minimum deviation for each layer is determined – it is given as superscript
and subscript assigned to the given average and depicted as red half-rings filled with the layer's color. The BL+TEL occupies less than two-elevenths of the whole, QEL1 requires a bit more than one-fifth, and the QEL2 and QEL3 each need a bit less than one-third. It is obvious that the percentage of QEL2 and QEL3 is not almost-constant as in the previous case, because the size of LLV1-coded sequences changes according to the higher or lower compression of BL+TEL
and QEL1, whose compression sizes depend on the source data. The small deviation of QEL1 (both negative and positive) confirms its relationship to BL+TEL: if the BL+TEL size changes, the QEL1 size follows these variations such that the percentage of QEL1's size with respect to the changed total size of LLV1 is kept at the same level, varying only within a small range (from -4.2 to +1.7). A higher or lower percentage of BL+TEL is reflected in corresponding losses or gains in the percentage of the two other enhancement layers (QEL2 & QEL3).

76 The base layer together with the temporal enhancement layer is used as the base for the QELs in the benchmarking for the sake of simplicity.
The scalability in the amount of data would be useless if no differentiation of the signal/noise
ratio (SNR) had taken place. The LLV1 is designed such that the SNR scalability is directly
proportional to the number of layers, and thus to the amount of data. The peak-signal-to-noise-ratio (PSNR)77 [Bovik, 2005] values of each frame between decoded streams including different combinations of layers are given in Figure 22 and Figure 23. Four combinations of
quantization layering (temporal layering is again turned off) are investigated:
1) just base quantization layer,
2) base layer and quantization enhancement layer (BL+QEL1),
3) base layer, first and second quantization layer (BL+QEL1+QEL2)
4) all layers
The detailed results for Paris (CIF) are presented in Figure 22 and for Mobile (CIF) respectively
in Figure 23. The quality values of the fourth option, i.e. the decoding of all layers from BL up to QEL3, are not depicted in the graphs since the PSNR values of lossless data are infinite. The PSNR is based on the MSE and is inversely proportional to it, so an MSE of zero is a proof of lossless compression where no difference error between the images exists. Hence, the lossless property was proved by checking whether the mean square error (MSE) is equal to zero. The results show that the difference of the PSNR values between two consecutive layers averages about 5-6 [dB], and this difference in quality between the layers seems to be constant for the various sequences. Moreover, there is no significant variation along the frames, so the perceived quality is also experienced to be constant.

77 In most cases the PSNR value is in accordance with the compression quality. But sometimes this metric does not reflect the presence of some important visual artefacts. For example, the quality of the blocking artefacts cannot be estimated, i.e. the compensation performed by some codec as well as the presence of the "snow" artefacts (strong flicking of the stand-alone pixels) cannot be detected in the compressed video using only the PSNR metric. Moreover, it is difficult to say whether a 2 dB difference is significant or not in some cases.
[Figure 22 chart: picture quality for different layers (Paris - CIF); per-frame PSNR [dB] from 30 to 44 dB of BL, QEL1 and QEL2 over the frames of the sequence.]
Figure 22. Picture quality for different quality layers for Paris (CIF) [Militzer et al., 2005]
[Figure 23 chart: picture quality for different layers (Mobile - CIF); per-frame PSNR [dB] from 30 to 44 dB of BL, QEL1 and QEL2 over the frames of the sequence.]
Figure 23. Picture quality for different quality layers for Mobile (CIF) [Militzer et al., 2005]
Finally, the scalability in the quality is achieved such that a lower layer has a proportionally lower
quality. The PSNR value ranges for the BL from about 32 dB to roughly 34 dB, for the QEL1 it
is between 37 dB and 39 dB, and for the QEL2 respectively 43 dB and 44 dB. The lossless
property of all layers together (from BL up to QEL3) has been confirmed.
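For reference, the PSNR used in these measurements can be computed as in the following small sketch (8-bit samples assumed, peak value 255); it is a generic illustration of the metric, not the evaluation tool used for the figures.

#include <math.h>
#include <stddef.h>

/* Mean square error between two 8-bit planes and the resulting PSNR in dB.
 * An MSE of zero (lossless reconstruction) yields an infinite PSNR.        */
static double psnr_8bit(const unsigned char *ref, const unsigned char *dec, size_t n)
{
    double mse = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = (double)ref[i] - (double)dec[i];
        mse += d * d;
    }
    mse /= (double)n;
    if (mse == 0.0)
        return INFINITY;
    return 10.0 * log10(255.0 * 255.0 / mse);
}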
V.6.3. Processing Scalability in the Decoder
Due to data scalability in LLV1, the quantification of resource requirements in relation to
amount of data seems to be possible. This allows QoS-like definitions of required resources e.g.
processing time, required memory, required throughput, and thus allows better resource
management in a real-time environment (RTE). So, the data scalability enables some control over the decoding complexity with respect to the requested data amount and thus to the data quality (Figure 24). The achieved processing scalability, being proportional to the number of layers and thus to the amount of data, fits the needs of the RETAVIC framework.
Figure 24. LLV1 decoding time per frame of the Mobile (CIF) considering various data layers
[Militzer et al., 2005].
The decoding process of e.g. the base layer and the temporal enhancement layer (BL+TEL) takes only about 25% of the time for decoding the complete lossless information (Figure 24). The LLV1 base layer is expected to provide the minimal quality of the video sequence, so only BL decoding is mandatory. To achieve higher levels of data quality, additional layers have to be decoded at the cost of processing time. Thus, the LLV1 decoding process can be well partitioned into a mandatory part, requiring just very few computational resources, and additional optional parts, requiring the remaining 75% of the resources being spent in total.
Contrary to the decoding of traditional single-layered video, this scalable processing characteristic can be exploited according to the imprecise computation concept [Lin et al., 1987], where at first a low-quality result is delivered and then, according to the available resources and time, additional calculations are done to elevate the accuracy of the information. Moreover, the processing with mandatory and optional computation parts can be scheduled according to QAS [Hamann et al., 2001a].
Moreover, the required computational resources for the decoding process can be smoothly controlled if the decoding is scaled not only by the number of enhancement layers but on a per-frame or even on the macro block level; for example, the complete BL+TEL layers are decoded and QEL1 only partly, where partly means some of all frames, or even only some important MBs of a given frame. In such a case only some frames or parts of frames will have higher quality.
When considering LLV1 decoding versus unscalable decoding, the computational complexity of
decoding a LLV1 stream at a specific quality level (e.g. targeting given PSNR value) is not much
higher than that of decoding a single-layered representation at about the same quality. Such
behavior is achieved due to dequantization and inverse transform being executed only once, no
matter if just one or three quantization enhancement layers are used for decoding. The main
overhead cost for higher quality derives then neither from the dequantization nor from the
inverse transformation but from the amount of data being decoded by entropy decoder, which
also in case of single-layered decoders raises in accordance with the higher quality.
The depicted adaptivity in terms of computational complexity allows a media server to make better use of its resources. By employing LLV1 as the unified source format for real-time transformations, more concurrent client requests with lower quality or a smaller number of clients with higher quality can be served. Thus, the QoS control gets a controlling mechanism based on the construction of the internal storage format and can make decisions according to the established QoS strategy. Such QoS control would not be possible if no scalability were present in the data and in the processing.
Additionally and in analogy to the sophisticated QoS levels defined for the ATM networks, a
minimum guaranteed quality can be defined since there is a possibility of dividing the LLV1
decoding process into mandatory and optional parts. Moreover, if the system had some idle
resources, the quality could be improved according to the QAS such that the conversion
process calculates data from enhancement layers but does not allocate the resources exclusively
and still some other new processes can be admitted. For example, let’s assume that there is
some number of simultaneous clients requesting minimum guaranteed quality and this number
124
Chapter 3 – Design
V. Video Processing Model
is smaller than the maximum number of possible clients requesting minimum quality. Rationally
thinking, such assumption is a standard case, because otherwise the allocation would be refused
and client’s request would be rejected. So, there are still some resources idle which could be
adaptively assigned to running transformations in order to maximize the overall output quality,
and finally deliver the quality higher than the guaranteed minimum.
The comparison to other coding algorithms makes sense only if both the lossless and the scalable properties are assumed. A few wavelet-based video decoders which could have the mentioned characteristics have been proposed [Devos et al., 2003; Mehrseresht and Taubman, 2005]. However, in [Mehrseresht and Taubman, 2005] only the quality evaluation but not the processing complexity is provided. In [Devos et al., 2003] the implementation is just a prototype and not the complete decoding is considered, but just three main parts of the algorithm, i.e. wavelet entropy decoding (WED), inverse discrete wavelet transform (IDWT) and motion compensation (MC). Anyway, the results are far behind those achieved even by the best-effort LLV1, e.g. just the three mentioned parts need from 48 sec. in low quality up to 146 sec. in high quality for Mobile CIF on an Intel Pentium IV 2.0GHz machine. Contrary, the RT-LLV1 decoder requires about 3 sec. for just the base layer and 10 sec. for lossless (most complex) decoding on the PC_RT machine, which is specified in Appendix E (section XVIII.2)78.
Additionally, LLV1 was compared to Kakadu – a well-known JPEG 2000 implementation – and the results of both best-effort implementations are shown in Figure 25. On average LLV1 takes 11 s, while Kakadu needs almost 14 s on the same PC_RT machine, so the processing of LLV1 is less CPU-demanding than Kakadu, even though the last few steps of the LLV1 algorithm, such as inverse quantization (IQ) and inverse binDCT (IbinDCT), have to be executed twice.
78
Please note that PC_RT is an AMD Athlon XP 1800+ running at 1.533 GHz, so the results for LLV1 on a machine using an Intel Pentium IV 2 GHz should be even better.
[Bar chart: Total Execution Time [s] of ten decoding runs (1-10) and their average (AVG) for LLV1 and JPEG2K - KAKADU.]
Figure 25. LLV1 vs. Kakadu – the decoding time measured multiple times and the average.
V.6.4. Influence of Continuous MD on the Data Quality in Encoding
In order to evaluate the influence of continuous MD on the data quality in the encoding process, simulations have been conducted on a set constructed from the well-known standard video clips commonly used for research evaluation [WWW_XIPH, 2007] and from the sequences of the Video Quality Expert Group [WWW_VQEG, 2007]. In order to check how the motion vectors coming from the continuous MD set influence the quality of the output, a best-effort MPEG-4 encoder using the standard full motion estimation step has been compared with an MD-based encoder exploiting the motion vectors directly. The results for these two cases are depicted in Figure 26.
The PSNR value (in dB) given on the Y-axis is averaged per compressed sequence and plotted for a number of various sizes of the compressed bit stream specified on the non-linear X-axis. The non-linear scale is logarithmic, because the PSNR measure is a logarithmic function based on the peak signal value divided by the noise represented by the mean square error (MSE); this way the differences of the PSNR values, especially at lower bit rates, can be observed better.
It can be observed in Figure 26 that the direct use of the MVs from the continuous MD performs very well when targeting low qualities (from 34 to 38 dB); however, if very low quality, i.e. very high compression, is expected, then the direct MV reuse delivers worse results in comparison to the full motion estimation. It is caused by the difference between the currently processed frame and the reconstructed reference frame within the encoder having a higher quantization parameter (for a higher compression ratio), which introduces a higher MSE for the same constant set of MVs. Since the MVs are defined in analogy to the BL of LLV1, which targets a quality of 32-34 dB, errors are introduced also in the case of higher quality (above 40 dB); however, these errors are much smaller than the errors in the case of very low quality (25 to around 32 dB).
Figure 26. Quality value (PSNR in dB) vs. output size of compressed Container (QCIF)
sequence [Militzer, 2004].
Contrary to the undoubted interest in the higher qualities, it is questionable whether the very low qualities (below 32 dB) are of interest to users. Thus another type of graph, commonly known as rate-distortion (R-D) curves, has been used for evaluating the applicability of the MD-based method. The R-D curves depict the quality in comparison to the more readable bit rates expressed in bits per pixel (bpp), which may be directly transformed into the compression ratio; i.e. assuming the YV12 color space, where there are 12 bpp in the uncompressed video source, it can be derived that achieving 0.65 bpp in the compressed domain delivers a compression ratio of 18.46 (a compressed size of 5.5%), and analogically 0.05 bpp brings 240 (about 0.4%) and 0.01 bpp yields 1200 (0.83‰).
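The bpp-to-compression-ratio arithmetic used above is summarized in the small helper below; it is purely illustrative and assumes the 12 bpp YV12 source mentioned in the text.

#include <stdio.h>

int main(void)
{
    const double uncompressed_bpp = 12.0;                 /* YV12 source       */
    const double bpp[] = { 0.65, 0.05, 0.01 };
    for (int i = 0; i < 3; i++) {
        double ratio = uncompressed_bpp / bpp[i];         /* compression ratio */
        double size  = 100.0 * bpp[i] / uncompressed_bpp; /* compressed size % */
        printf("%.2f bpp -> ratio %.1f, size %.2f%%\n", bpp[i], ratio, size);
    }
    return 0;
}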
Figure 27. R-D curves for Tempete (CIF) and Salesman (QCIF) sequences [Suchomski et al., 2005].
The R-D graphs showing the comparison of the standard XVID encoder and the MD-based XVID have been used for comparisons targeting different bit rates and low, but not very low, quality (option Q=2). Considering various bits per pixel of the compressed output, the quality for both processing cases of the two sequences (Tempete and Salesman) is depicted in Figure 27. It can be noticed that the curves overlap in the range of 30 to 40 dB, thus the influence of direct MV reuse is negligible. In the case of higher quality, e.g. around 49 dB, the cost of introducing MD-based encoding in terms of compression efficiency is still very small, i.e. 0.63 bpp is achieved for Tempete instead of 0.61 bpp, which means a compression ratio of 19.1 instead of 19.6. Even better results are achieved for Salesman.
V.6.5. Influence of Continuous MD on the Processing Complexity
The processing complexity is an important factor in video transcoding, especially if applied in real time for delivering format transformations. The continuous meta-data have been designed such that they influence the complexity in a positive manner, i.e. the processing is affected by a simplification (speed-up) of the algorithm and by a stabilization of the processing time counted per frame.
Figure 28. Speed-up effect of applying continuous MD for various bit rates [Suchomski et al.,
2005].
The speed-up effect is noticeable for the different bit rates targeted by the encoder (Figure 28), so the positive effect of using continuous MD covers not only the wide spectrum of qualities being proportional to the achieved bit rate (Figure 27) but also the processing complexity, which is inversely proportional to the bit rate, i.e. lowering the bit rate yields a higher fps and thus a smaller complexity. Please note that the Y-axis of Figure 28 represents frames per second, so the higher the value the better. It is clearly visible that the MD-based encoder outperforms the unmodified XVID by allowing for processing many more frames per second (black line above the gray one). For example, if a bit rate of 0.05 bpp is requested, a speed-up of 1.44 (MD-XVID ~230 fps vs. XVID ~160 fps) is achieved, and respectively a speed-up of 1.47 for the bit rate of 0.2 bpp (MD-XVID ~191 fps vs. XVID ~132 fps); the MD-XVID (145 fps) is 1.32 times faster than XVID for the bit rate of 0.6 bpp (~110 fps).
[Line chart: Good Will Hunting Trailer (DVD), PAL, 1000 kbit/s – processing time [ms] per frame (frames 1 to about 1639) for XviD 1.0, quality 2, with direct MV reuse and with regular ME.]
Figure 29. Smoothing effect on the processing time by exploiting continuous MD [Suchomski
et al., 2005].
The encoding time per frame for one bit rate (1000 kbps) is depicted for a real-world sequence, namely the Good Will Hunting Trailer, in Figure 29. Besides the noticeable speed-up (smaller processing time), a smoothing effect in the processing time can be recognized for the MD-based XVID, i.e. the difference between the MIN and MAX frame-processing time and the variation of the time from frame to frame are smaller (Figure 29). Such a smoothing effect is very helpful in real-time processing, because it makes the behavior of the encoder more predictable, and as a result the scheduling of the compression process is easier to manage. Consequently, the resource requirements can be determined more accurately and stricter buffer techniques for continuous data processing can be adopted.
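The per-frame statistics behind such a predictability argument can be gathered as in the short sketch below; it is a generic helper for measured frame times, not part of the RETAVIC code.

#include <math.h>
#include <stdio.h>

/* Print min, max, mean and average frame-to-frame variation of the
 * measured per-frame encoding times in milliseconds (assumes n >= 2). */
void frame_time_stats(const double *ms, int n)
{
    double min = ms[0], max = ms[0], sum = 0.0, jitter = 0.0;
    for (int i = 0; i < n; i++) {
        if (ms[i] < min) min = ms[i];
        if (ms[i] > max) max = ms[i];
        sum += ms[i];
        if (i > 0) jitter += fabs(ms[i] - ms[i - 1]);
    }
    printf("min %.2f ms, max %.2f ms, mean %.2f ms, avg jitter %.2f ms\n",
           min, max, sum / n, jitter / (n - 1));
}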
Moreover, the smoothing effect will be even more valuable if more complex video sequences are processed. For example, assuming a close-to-worst-case scenario for a usual (non-MD-based) encoder, where very irregular and unpredictable motion exists, the processing time spent per frame fluctuates much more than the one depicted in Figure 29 or in Figure 8. So, the reuse of MVs, the frame type or the MB type in such cases eliminates the numerous unnecessary steps leading to misjudged decisions in the motion prediction, since the MD already delivers almost-perfect results.
V.6.6. Evaluation of Complete MD-Based Video Transcoding Chain
Finally, the evaluation of a simple but complete chain using MD-based transcoding has been conducted. As the RETAVIC framework proposes, the video encoder has been split into two parts: content analysis, which is encapsulated in the LLV1 encoder (in the non-real-time preparation phase), and MD-based encoding (MD-XVID) using the continuous MD delivered from the outside (read from the hard disk). The video data is stored internally in the proposed LLV1 format and the continuous MD are losslessly compressed using entropy coding. The continuous MD decoder is embedded in the MD-XVID encoder, thus the cost of MD decoding is included in the evaluation.
Figure 30. Video transcoding scenario from internal LLV1 format to MPEG-4 SP compatible
(using MD-XVID): a) usual real-world case and b) simplified investigated case.
The video transcoding scenario is depicted in Figure 30. Two cases are presented: a) the usual real-world case, where the delivery through the network to the end-client occurs, and b) the simplified case for investigating only the most interesting parts of the MD-based transcoding (marked in color). So, the four simple steps of the example conversion are performed as depicted in Figure 30 b): the LLV1-coded video together with the accompanying MD is read from the storage, next the compressed video data are adaptively decoded to raw data, then the decoding of the continuous MD and the encoding to MPEG-4 Simple Profile using the MD-based XVID is performed, and finally the MPEG-4 stream is written to the storage. In the real-world situation (a), the MPEG-4 stream is converted into the packetized elementary stream (PES) according to [MPEG-4 Part I, 2004] and sent to the end-client through the network instead of being written to the storage (b).
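A compilable skeleton of the simplified chain (case b) is sketched below; all types and functions are placeholder stubs standing in for the real LLV1 and MD-XVID components, not the RETAVIC API.

#include <stdbool.h>
#include <stdio.h>

typedef struct { int width, height; /* raw or coded picture data */ } frame_t;
typedef struct { int frame_type;    /* motion vectors, MB types  */ } cmd_t;

/* Placeholder stubs standing in for the real components. */
static bool llv1_read_frame(FILE *in, frame_t *coded, cmd_t *md)
{ (void)in; (void)coded; (void)md; return false; }
static void llv1_decode(const frame_t *coded, int layers, frame_t *raw)
{ (void)coded; (void)layers; *raw = (frame_t){0}; }
static void mdxvid_encode(const frame_t *raw, const cmd_t *md, int kbps, FILE *out)
{ (void)raw; (void)md; (void)kbps; (void)out; }

/* Simplified chain: read LLV1 + continuous MD, adaptively decode,
 * encode with the MD-based XVID, and write the MPEG-4 stream. */
void transcode(FILE *in, FILE *out, int layers, int target_kbps)
{
    frame_t coded, raw;
    cmd_t md;
    while (llv1_read_frame(in, &coded, &md)) {      /* step 1: read video + MD   */
        llv1_decode(&coded, layers, &raw);          /* step 2: adaptive decoding */
        mdxvid_encode(&raw, &md, target_kbps, out); /* steps 3-4: encode + write */
    }
}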
Figure 31. Execution time of the various data quality requirements according to the simplified
scenario.
The results of the simplified scenario (b) for the Paris (CIF) sequence are depicted in Figure 31. Here, the execution time covers the area marked in color (Figure 30), i.e. only decoding and encoding are summed up and the read/write operations are not included (they influence the results insignificantly anyway). Four sets of quality requirements specified by the user have been investigated (a configuration sketch is given after the list):
1. low quality – BL and TEL of LLV1 are decoded and the bit rate of the MPEG-4 bitstream targets 200 kbps;
2. average quality – BL+TEL and QEL1 of LLV1 are decoded and the targeted bit rate is equal to 400 kbps;
3. high quality – the layers up to QEL2 of LLV1 are processed and the higher bit rate of 800 kbps is achieved;
4. excellent quality – all LLV1 layers are processed (lossless decoding) and MPEG-4 with a bit rate of 1200 kbps is delivered.
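The four configurations can be captured in a small table-like structure as below; the struct and its field names are assumptions made for illustration only.

typedef enum { Q_LOW, Q_AVERAGE, Q_HIGH, Q_EXCELLENT } quality_t;

typedef struct {
    const char *llv1_layers;   /* LLV1 layers to decode  */
    int target_kbps;           /* MPEG-4 output bit rate */
} quality_cfg_t;

static const quality_cfg_t quality_cfg[] = {
    [Q_LOW]       = { "BL+TEL",                 200  },
    [Q_AVERAGE]   = { "BL+TEL+QEL1",            400  },
    [Q_HIGH]      = { "BL+TEL+QEL1+QEL2",       800  },
    [Q_EXCELLENT] = { "all layers (lossless)",  1200 },
};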
The execution time for each quality configuration is measured per frame in order to see the behavior of the transcoding process. There are still a few peaks present, which may derive from thread preemption in the best-effort system; however, in general the execution time per frame is very stable, and one could even risk the statement that it is almost constant. Obviously, the use of continuous MD allows for a reduced encoder complexity, making the share of the LLV1 decoder in the total execution time bigger than in a chain with the standard XVID encoder. On the other hand, the proven LLV1 processing scalability now influences the total transcoding time significantly and thus allows gaining better control over the amount and quality of the processed data at the encoder's input.
All in all, the whole transcoding chain using the continuous MD shows a much more stable behavior for all considered quality ranges up to the full-content lossless decoding of LLV1 video. Summarizing, the MD-based framework allows gaining more control over the whole transcoding process and speeds up the execution of the video processing chain in comparison to transcoding without any MD.
VI. AUDIO PROCESSING MODEL
In analogy to chapter V, this chapter proposes the audio-specific processing model based on the logical model proposed in chapter IV. The analysis of a few best-known representatives of audio encoders is given at the beginning. Next, the general assumptions for the audio processing are made. Subsequently, the audio-related static MD is defined, and MPEG-4 SLS is proposed as the internal format and described in detail. The MD-based processing is described separately for decoding and encoding. Finally, the evaluation of the decoding part of the processing model is given. This part covers only the MPEG-4 SLS evaluation and its enhancement.
VI.1. Analysis of the Audio Encoders Representatives
Three codecs have been selected for the analysis as representatives of different perceptual transform-based algorithms:
• OggEnc [WWW_OGG, 2006] – the standard encoding application of the Ogg Vorbis coder, being part of the vorbis-tools package,
• FAAC [WWW_FAAC, 2006] – an open-source MPEG-2 AAC compliant encoder,
• LAME [WWW_LAME, 2006] – an open-source MPEG-1 Layer III (MP3) compliant encoder.
All of them are open-source programs and they are recognized as state of the art for their specific audio formats. The decoding part of these lossy coders is not considered important for RETAVIC, since only the encoding part is executed in the real-time environment. Moreover, the most important factors under analysis are the constancy and predictability of the encoding time as well as the encoding speed (deriving from the coding complexity). In all cases the default settings of the encoders have been used.
Different types of audio data have been used in the evaluation process. The data covered the instrumental sounds from the MPEG Sound Quality Assessment Material [WWW_MPEG SQAM, 2006] and an own set of music from commercial CDs and self-created audio [WWW_Retavic - Audio Set, 2006]; however, in the later part only three representatives are used (namely silence.wav, male_speech.wav, and go4_30.wav). These three samples are enough to show the behavior of the encoding programs under different circumstances like silence, speech and music, with high and low volume or different dynamic ranges.
The graphs depicting the behavior of all encoders for each of the selected samples are given such that the encoding time is measured per block of PCM samples, defined as 2048 samples for FAAC and OggEnc and 1152 samples for LAME. This division of samples per block derives from the audio coding algorithm itself and cannot simply be changed (i.e. the block size in LAME is fixed); thus the results cannot simply be compared on one graph, and therefore LAME is depicted separately.
As expected, the silence sample results show a very constant encoding time for all the encoders. The results are depicted in Figure 32 and Figure 33, respectively. FAAC (pink color) needs about 2 ms per block of PCM samples, while OggEnc (blue color) requires roughly 2.4 ms. LAME achieves about 2.1 ms per block; however, due to the smaller block size an artificial time of 3.73 ms can be derived after normalization to the bigger block of 2048 samples (it is not depicted in order to avoid flattening of the curve).
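The block-size normalization mentioned above (LAME's 1152-sample blocks scaled to the 2048-sample blocks of FAAC and OggEnc) amounts to the simple helper sketched here.

#include <stdio.h>

/* Scale a per-block time to a reference block size, e.g. 2.1 ms for a
 * 1152-sample block corresponds to about 3.73 ms per 2048 samples. */
double normalize_block_time(double ms_per_block, int block, int ref_block)
{
    return ms_per_block * (double)ref_block / (double)block;
}

int main(void)
{
    printf("%.2f ms\n", normalize_block_time(2.1, 1152, 2048));
    return 0;
}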
Figure 32. OggEnc and FAAC encoding behavior of the silence.wav (based on [Penzkofer,
2006]).
Figure 33. Behavior of the Lame encoder for all three test audio sequences (based on
[Penzkofer, 2006]).
Figure 34. OggEnc and FAAC encoding behavior of the male_speech.wav (based on [Penzkofer,
2006]).
The results for encoding male_speech are depicted in Figure 33 for LAME and in Figure 34 for FAAC and OggEnc. Contrary to silence, the encoding time fluctuates much more for male_speech. FAAC encodes the blocks in between 1.5 ms and 2 ms and is still relatively constant with respect to this range, since there are no important peaks exceeding these values. The encoding time of LAME fluctuates much more, but still there are just a few peaks above 3 ms. OggEnc behaves unstably, i.e. there are plenty of peaks in both directions, going even beyond 7 ms per block.
Yet another behavior of the audio encoders can be observed for the music example go4_30 (Figure 35). A narrow tunnel with min and max values can still be defined for the FAAC encoder (1.6 to 2.2 ms), but OggEnc is now very unstable, having even more peaks than in the case of male_speech. Also LAME (yellow color in Figure 33) manifests unstable behavior and requires more processing power than in the case of male_speech, i.e. the encoding time ranges from 2 to 4 ms per block of 1152 samples and the average encoding time is longer (yellow curve above the two other curves).
Figure 35. OggEnc and FAAC encoding behavior of the go4_30.wav (based on [Penzkofer,
2006]).
The overall FAAC encoding time for the different files does not differ very much even if different types of audio samples are used. This may ease the prediction of the encoding time. The OggEnc results demonstrate huge peaks in both directions of the time axis for male_speech and for go4_30. These peaks are caused by the variable window length possible in the Ogg Vorbis codec, i.e. the number of required encoding steps is not fixed for a block of 2048 samples and is determined by block analysis. Here, pre-calculated meta-data keeping the window length of every block would be applicable to help predict the execution time. On the other hand, the peaks fluctuate around a relatively constant average value, and usually a positive peak is followed by a negative one. Such behavior could be supported by a small time buffer leading to a constant average time.
Finally, the overall encoding times of the analyzed encoders differ significantly, as shown in Figure 36. The male_speech sample is processed faster because the sequence is shorter than the two others. Still, it has to be pointed out that these codecs are lossy encoders, in which the compression ratio, sound quality and processing time are related to each other, and a different relationship exists for each encoder. So, in order to compare speed, the quality and the produced size should also be considered. The measurement of perceived quality is subjective for each person, so it needs a special environment, hardware settings and experienced listeners to evaluate the quality. Nevertheless, the default configuration settings should be acceptable for a home user. On the other hand, quality and compression ratio were not the point of the behavior evaluation under the default configuration settings of the selected encoders. So the idea of the graph is to show the different behavior depending on the audio source, i.e. for different sequences one encoder is sometimes faster and sometimes slower than its competitors.
Figure 36. Comparison of the complete encoding time of the analyzed codecs.
VI.2. Assumptions for the Processing Model
Considering the results from the previous section, a better understanding of the coding algorithm is required before proposing enhancements in the processing. All the audio codecs under investigation employ the perceptual audio coding idea, which uses a psychoacoustic model79 to specify which parts of the data are really inaudible and can be removed during compression [Sayood, 2006; Skarbek, 1998]. The goal of the psychoacoustic model is to increase the efficiency of compression by maximizing the compression ratio and to make the reconstructed file as close to the source as possible in the sense of perceived quality. Of course, an algorithm using a psychoacoustic model is a lossy coding technique, so there is no bit-exactness between the source and the reconstructed data files.
To make audio compression effective, it is reasonable to know what kinds of sounds a human can actually hear. The distinction between the important-for-perception elements and the less important ones lets the developers concentrate only on the audible parts and ignore the segments which would not be heard anyway. According to this, aggressive compression is possible, or even cutting out some parts of the signal without significant influence on the quality of the heard sound. Most of all, the frequencies which exceed the perception threshold (e.g. very high frequencies) can be removed, while the important-for-human frequencies between 200 Hz and 8 kHz must be reflected precisely during compression, because humans can easily hear a decrease of the sound quality in that frequency range.
When two sounds occur simultaneously, the masking phenomenon takes place [Kahrs and Brandenburg, 1998]. For example, when one sound is very loud and the other one is rather quiet, only the loud sound will be perceived. Also, if a quiet sound occurs within a short interval of time after or before a loud, sudden sound, it won't be heard. The inaudible parts of a signal can be encoded with a low precision or not encoded at all. Psychoacoustics has helped in the development of effective, almost lossless compression in the sense of perceived quality. The removed parts of the signal don't affect the comfort of the sound perception to a high degree, although they may decrease its signal quality.
79
The ear of an average human can detect sounds between 16 Hz and 22 kHz of frequency range and about 120dB of dynamic
range in the intensity domain [Kahrs and Brandenburg, 1998]. The dynamic range is said to be from the lowest level of
hearing (0dB) to the threshold of pain (120-130dB). Human ear is most sensitive to frequencies from the range of 1000 Hz to
3500 Hz and the human’s speech communication takes place between 200 and 8000 Hz [Kahrs and Brandenburg, 1998]. The
psychoacoustics as the research field aims at the investigation of dependencies between sound wave coming into human’s ear
and subjective sensible perception of this wave. The discipline deals with describing and explaining some specific aspects of
auditory perception based on anatomy, physiology of human ear and cognitive psychology. The results of the investigation
are utilized while inventing modern, new compression algorithms [Skarbek, 1998].
VI.3. Audio-Related Static MD
As stated in the Video-Related Static MD (section V.3), the MD are different for various media
types, thus also the static MD for audio are related to the MO having type of audio, for which
the set (StaticMD_Audio) is defined as:
$\forall i,\ 1 \le i \le X,\ mo_i \in O:\quad MD_A(mo_i) \subset StaticMD\_Audio \;\Leftrightarrow\; type_i = A \qquad (16)$
where typei denotes type of the i-th media object moi, A is the audio type, MDA is a function
extracting the set of meta-data related to the audio media object moi, and X is the number of all
media objects in O.
Analogically to video, the audio stream identifier (AudioID) is also a one-to-one mapping to the given MOID:
$\forall i\ \exists j\ \neg\exists k,\ k \ne j:\quad AudioID_i = MOID_j \,\wedge\, AudioID_i = MOID_k \qquad (17)$
The static MD set of the audio stream includes the sum of each window type, analogically to the frame types of video:
$\forall i:\ w_i.type \in WT \;\wedge\; WT = \{ST, SP, L, M, S\} \qquad (18)$
where ST denotes the start window type, SP the stop, L the long, M the medium and S the short window type, and WT denotes the set of these window types.
The sum for each window type is defined as:
$WindowSum_{mo_i}^{w_i.type} = \{\, w_j \mid w_j \in mo_i \,\wedge\, 1 \le j \le N \,\wedge\, w_j.type = w_i.type \,\wedge\, w_j.type \in WT \,\} \qquad (19)$
where $w_i.type$ is one of the window types defined by Equation (18), N denotes the number of all windows, $w_j$ is the window at the j-th position in the audio stream, and $w_j.type$ denotes the type of the j-th window.
The sum of all window types is kept in the respective attributes of StaticMD_Audio. Analogically to the window type, the information about the sum of the different window switching modes along the complete audio stream is kept in StaticMD_SwitchingModes. Eleven types of window switching modes are differentiated up to now, thus there are eleven derived aggregation attributes.
The current definition of initial static MD set is mapped to the relational schema (Figure 37).
Figure 37. Initial static MD set focusing on the audio data.
Analogically to the video static MD, the sum of all window types and window switching modes
may be calculated by the respective SQL queries:
SELECT AudioID, WindowType, count(WindowType)
FROM StaticMD_Window
GROUP BY AudioID, WindowType
ORDER BY AudioID, WindowType;
Listing 4.
Calculation of all sums according to the window type.
SELECT AudioID, WindowSwitchingMode, count(WindowSwitchingMode)
FROM StaticMD_Window
GROUP BY AudioID, WindowSwitchingMode
ORDER BY AudioID, WindowSwitchingMode;
Listing 5.
Calculation of all sums according to the window switching mode.
Of course, the entity set StaticMD_Window must be complete in such a way that all windows existing in the audio stream are included in this set. Based on this assumption, the sums included in StaticMD_Audio and StaticMD_WindowSwitchingSum are treated as derived attributes computed by the above SQL statements. However, due to optimization issues and rare updates of the MD set, these values are materialized and included as usual attributes in StaticMD_Audio and StaticMD_WindowSwitchingSum. If the completeness assumption for StaticMD_Window were not fulfilled, the sum attributes could not be treated as derived.
VI.4. MPEG-4 SLS as Audio Internal Format
Following the system design in section IV, the internal storage format has to be chosen. MPEG-4 SLS is proposed as the suitable format for storing the audio data internally in the RETAVIC architecture. The reasons have been discussed in detail in [Suchomski et al., 2006], where the format requirements have been stated and the evaluation has been conducted with respect to the qualitative aspects considering data scalability as well as the quantitative aspects referring to the processing behavior and resource consumption. Other research considering the MPEG-4 SLS algorithm and its evaluation is covered by the recent MPEG verification report [MPEG Audio Subgroup, 2005], but both works are complementary to each other due to their different assumptions, i.e. they discuss different configuration settings of MPEG-4 SLS and evaluate the format from distinct perspectives. For example, [MPEG Audio Subgroup, 2005] compares the coding efficiency for only two settings, AAC-Core and No-Core, contrary to [Suchomski et al., 2006] where additionally various AAC cores have been used. Secondly, [Suchomski et al., 2006] provides an explicit comparison of coding efficiency and processing complexity (e.g. execution time) to other lossless formats, while [MPEG Audio Subgroup, 2005] reports details on the SLS coding efficiency without comparison and a very detailed analysis of algorithmic complexity. Finally, the worst-case processing complexity usable for a hard-RT DSP implementation is given in [MPEG Audio Subgroup, 2005], whereas [Suchomski et al., 2006] discusses some scalability issues of the processing applicable in different RT models, e.g. cutting off the enhancement part of the bitstream during on-line processing with consideration of the data quality of the output80.
VI.4.1. MPEG-4 SLS Algorithm – Encoding and Decoding
The MPEG-4 Scalable Lossless Audio Coding (SLS) was designed as an extension of MPEG Advanced Audio Coding (AAC). These two technologies combined compose an innovative technology joining scalability and losslessness, which is referred to commercially as High Definition Advanced Audio Coding (HD-AAC) [Geiger et al., 2006] and is based on the standard AAC perceptual audio coder with an additional SLS enhancement layer. Both technologies together ensure that even at lower bit rates a relatively good quality can be delivered. SLS can also be used as a stand-alone application in a non-core mode due to its internal compression engine. The scalability of quality varies from the AAC-coded representation through a wide range of in-between near-lossless levels up to the fully lossless representation. An overview of the simplified SLS encoding algorithm is depicted in Figure 38 for the two possible modes: a) with AAC core and b) as stand-alone SLS mode (without core).
Figure 38. Overview of simplified SLS encoding algorithm: a) with AAC-based core [Geiger et
al., 2006] and b) without core.
80
The bitstream truncation itself is discussed in [MPEG Audio Subgroup, 2005], however from the format and not from the
processing perspective.
VI.4.2. AAC Core
Originally, Advanced Audio Coding (AAC) was invented as an improved successor of the MPEG-1 Layer III (MP3) standard. It is specified in the MPEG-2 standard, Part 7 [MPEG-2 Part VII, 2006] and later in MPEG-4, Part 3 [MPEG-4 Part III, 2005], and can be described as a high-quality multi-channel coder. The psychoacoustic model used is the same as in MPEG-1 Layer III, but there are some significant differences in the algorithm. AAC did not have to be backward compatible with Layer I and Layer II of MPEG-1 as MP3 had to, so it can offer higher quality at lower bitrates. AAC shows great encoding performance at very low bitrates in addition to its efficiency at standard bitrates. As an algorithm, it offers new functionalities like low delay, error robustness and scalability, which are introduced as AAC tools/modules and are used by specific MPEG audio profiles (a detailed description is given in section XXII of Appendix H).
The main improvements were made by adding the Long Term Predictor (LTP) [Sayood, 2006] to reduce the bit rate in the spectral coding block. AAC also supports more sampling frequencies (from 8 kHz to 96 kHz) and introduces additional half-step sampling frequencies (16, 22.05, 24 kHz). Moreover, the LTP is computationally cheaper than its predecessor and is a forward adaptive predictor. The other improvement over MP3 is the increased number of channels – the provision of multi-channel audio with up to 48 channels, but also support for the 5.1 surround sound mode. Generally, the coding and decoding efficiency has been improved, which results in smaller data sizes and better sound quality than MP3 when compared at the same bitrate. The filter bank converts the input audio signal from the time domain into the frequency domain by using the Modified Discrete Cosine Transform (MDCT) [Kahrs and Brandenburg, 1998]. The algorithm supports dynamic switching between two window lengths of 2048 samples and 256 samples. All windows are overlapped by 50% with the neighboring blocks. This results in generating 1024 or respectively 128 spectral coefficients from the samples. For better frequency selectivity the encoder can switch between two different window shapes, a sine-shaped window and a Kaiser-Bessel Derived (KBD) window with improved far-off rejection of its filter response [Kahrs and Brandenburg, 1998].
The standard contains two types of predictors. The first one, the intra-block predictor, also called Temporal Noise Shaping (TNS)81, is applied to input containing transients. The second one, the inter-block predictor, is useful only in stationary conditions. For stationary signals, a further reduction of the bit rate can be achieved by using prediction to reduce the dynamic range of the coefficients. During stationary periods, the coefficients at a certain frequency generally do not change their values to a significant degree between blocks. This fact allows transmitting only the prediction error between subsequent blocks (backward adaptive predictor).
The noiseless coding block uses twelve Huffman codebooks available for two- and four-tuple blocks of quantized coefficients. They are used to maximize the redundancy reduction within the spectral data coding. A codebook is applied to sections of scale-factor bands to minimize the resulting bitrate. The parts of a signal which are referred to as noise are in most cases indistinguishable from other noise-like signals. This fact is exploited by not transmitting the scale-factor bands but transmitting instead the power of the coefficients in the "noise" band. The decoder will generate random noise according to the power of the coefficients and put it in the region where the original noise should be.
VI.4.3. HD-AAC / SLS Extension
The encoding process consists of two main parts (see Figure 38 a). At first the bitstream is encoded by the integer-based invertible AAC coder and the core layer is produced. AAC is a lossy algorithm, so not all the information is encoded. The preservation of all the information is attained thanks to the second step, which delivers the enhancement layer. This layer encodes the residual signal missing in the AAC-encoded signal, i.e. the error between the lossy and the source signal. Moreover, the enhancement layer allows scalability in the sense of quality due to the bit-significance-based bit-plane coding preserving the spectral shape of the quantization noise used by the AAC core [Geiger et al., 2006].
81
TNS is a technique that operates on the spectral coefficients only during pre-echoes. It allows using the prediction in the
frequency domain to shape temporally the introduced quantization noise. The spectrum is filtered, and then the resulting
prediction residual is quantized and coded. Prediction coefficients are transmitted in the bitstream to allow later recovery of
the original signal in the decoder.
The detailed architecture of the HD-AAC is depicted as a block diagram for a) encoding and b) decoding in Figure 39. The HD-AAC exploits the Integer Modified Discrete Cosine Transform (IntMDCT) [Geiger et al., 2004], which is a completely invertible function, to avoid additional errors while performing the transformations. During the encoding process the MDCT produces coefficients, which are later mapped into both layers. The residual signal introduced by the AAC process is calculated by the error mapping process and then mapped into bit-planes in the enhancement layer. The entropy coding of the bit-planes is done by three different methods: Bit-Plane Golomb Coding (BPGC), Context-Based Arithmetic Coding (CBAC) and Low Energy Mode (LEM).
Figure 39. Structure of HD-AAC coder [Geiger et al., 2006]: a) encoder and b) decoder.
The error correction is placed at the beginning of each sample window. The most significant bits (MSB) correspond to the bigger part of the error, while the least significant bits (LSB) to the finer part; obviously, the more bits of the residual are used, the less loss there will be in the output. The scalability is achieved by truncation of the correction value.
The core and enhancement layers are subsequently multiplexed into the HD-AAC bitstream. The decoding process can be done in two ways: by decoding only the perceptually coded AAC part, or by using both AAC and the additional residual information from the enhancement layer.
VI.4.3.1 Integer Modified Discrete Cosine Transform
In all audio coding schemes, one of the most important matters is the transform of the input audio signal from the time domain into the frequency domain. To achieve a block-wise frequency representation of an audio signal, Fourier-related transforms (DCT, MDCT) or filter banks are used. The problem is that these transforms produce floating-point output even for integer input. The reduction of the data rate must be done by quantizing the floating-point data. This means that some additional error will be introduced. For lossless coding, any additional rounding errors must be avoided either by the usage of a very precise quantization, so that the error can be neglected, or by applying a different transform. The Integer Modified Discrete Cosine Transform is an invertible integer approximation of the MDCT [Geiger et al., 2004]. It retains the advantageous properties of the MDCT and also produces integer output values. It provides a good spectral representation of the audio signal, critical sampling and overlapping of blocks.
The following part describes the IntMDCT in some detail and is based on [Geiger et al., 2001; Geiger et al., 2004]. The earlier mentioned properties have been gained by applying the lifting scheme onto the MDCT. Lifting steps divide the signal into a set of steps. The most advantageous quality of the lifting steps is the fact that the inverse transform is a mirror of the forward transform. The divided signal can be more easily operated on by some convolution operations and transformed back into one signal without any rounding errors.
The MDCT can be decomposed into Givens rotations. According to this fact, the decomposed transform can be approximated by a reversible, lossless integer transform. Figure 40 illustrates this decomposition. The MDCT is first decomposed into Givens rotations. To achieve the decomposition it must be divided into three parts: 1) windowing, 2) Time Domain Aliasing Cancellation (TDAC) and 3) the Discrete Cosine Transform of type IV. Of course, TDAC is not used directly, but as an integer approximation deriving from the decomposition into three lifting steps.
Figure 40. Decomposition of MDCT.
The MDCT is defined by [Geiger et al., 2001]:
$X(m) = \sqrt{\tfrac{2}{N}} \sum_{k=0}^{2N-1} w(k)\, x(k) \cos\!\left(\tfrac{(2k+1+N)(2m+1)\pi}{4N}\right) \qquad (20)$
where x(k) is the time domain input, w(k) is the windowing function, N defines the number of calculated spectral lines and m is an integer between 0 and N-1.
Figure 41. Overlapping of blocks.
To achieve both the critical sampling and the overlapping of blocks, subsampling in the frequency domain is performed. This procedure introduces aliasing in the time domain, thus TDAC is used to cancel the aliasing by the "overlap and add" formula employed on two subsequent blocks in the synthesis filter bank. Two succeeding blocks overlap by 50%, so to get one full set of samples, two blocks are needed. A better preservation of all the original data can be achieved by the redundancy of information (see Figure 41). Each set of samples (marked with different colors) is first put into the right part of the corresponding block and into the left part of the succeeding block. For each block, 2N time domain samples are used to calculate N spectral lines. Each block corresponds to one window. The MDCT introduces an aliasing error, which can be cancelled by adding the outputs of the inverse MDCT of two overlapping blocks (as depicted). The windows must fulfill certain conditions in their overlapping parts to ensure this cancellation [Geiger et al., 2001]:
$w(k) = \sin\!\left(\tfrac{\pi}{4N}(2k+1)\right), \quad \text{where } k = 0, \ldots, 2N-1 \qquad (21)$
To decompose the MDCT (with a window length of 2N) into Givens rotations it must be first
decomposed into windowing, time domain aliasing and a Discrete Cosine Transform of Type
IV (DCT-IV) with a length of N. The combination of windowing and TDAC for the
overlapping part of the two subsequent blocks results in application of [Geiger et al., 2001]:
$\begin{pmatrix} w(\tfrac{N}{2}+k) & -w(\tfrac{N}{2}-1-k) \\ w(\tfrac{N}{2}-1-k) & w(\tfrac{N}{2}+k) \end{pmatrix}, \quad \text{where } k = 0, \ldots, \tfrac{N}{2}-1 \qquad (22)$
which is itself a Givens rotation.
The DCT-IV is defined as [Geiger et al., 2001]:
$X(m) = \sqrt{\tfrac{2}{N}} \sum_{k=0}^{N-1} x(k) \cos\!\left(\tfrac{(2k+1)(2m+1)\pi}{4N}\right), \quad \text{where } m = 0, \ldots, N-1 \qquad (23)$
The DCT-IV coefficients build an orthogonal NxN matrix, which means that it can be decomposed into Givens rotations. The rotations applied for windowing and time domain aliasing can also be used in the inverse MDCT in reversed order, with different angles. The inverse of the DCT-IV is the DCT-IV itself. Figure 42 depicts the decomposition of the MDCT and of the inverse MDCT, where 2N samples are first rotated and then transformed by the DCT-IV to achieve N spectral lines.
Figure 42. Decomposition of MDCT by Windowing, TDAC and DCT-IV [Geiger et al.,
2001].
According to the conditions for the TDAC, it can be observed that for certain angles the
MDCT can be completely decomposed into Givens rotations, by an angle α [Geiger et al., 2001]:
$\begin{pmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} x_1\cos\alpha - x_2\sin\alpha \\ x_1\sin\alpha + x_2\cos\alpha \end{pmatrix} \qquad (24)$
The input values x1 and x2 are rotated and thus transformed into values x1cosα – x2sinα and
respectively x1sinα + x2cosα. Moreover, every Givens rotation can be divided into three lifting
steps [Geiger et al., 2001]:
$\begin{pmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{pmatrix} = \begin{pmatrix} 1 & \frac{\cos\alpha - 1}{\sin\alpha} \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ \sin\alpha & 1 \end{pmatrix} \begin{pmatrix} 1 & \frac{\cos\alpha - 1}{\sin\alpha} \\ 0 & 1 \end{pmatrix} \qquad (25)$
Figure 43. Givens rotation and its decomposition into three lifting steps [Geiger et al., 2001].
The Givens rotations are mostly used to introduce zeros in matrices. The operation is
performed to simplify the calculations and thereby reduce the number of required operations
and the overall complexity. It is often used to decompose matrices or to zero components
which are to be annihilated.
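To illustrate why the lifting decomposition yields an invertible integer mapping, the sketch below applies a Givens rotation as three lifting steps with rounding and then undoes it exactly; the rounding function and test values are arbitrary choices, not the SLS reference implementation.

#include <math.h>
#include <stdio.h>

static long r(double v) { return lround(v); }   /* rounding used in lifting */

/* Rotate (x1, x2) by alpha via three lifting steps; sin(alpha) must be != 0. */
void lift_rotate(long *x1, long *x2, double alpha)
{
    double s = sin(alpha), c = (cos(alpha) - 1.0) / s;
    *x1 += r(c * *x2);   /* lifting step 1 */
    *x2 += r(s * *x1);   /* lifting step 2 */
    *x1 += r(c * *x2);   /* lifting step 3 */
}

/* Exact inverse: subtract the same rounded terms in reverse order. */
void lift_rotate_inverse(long *x1, long *x2, double alpha)
{
    double s = sin(alpha), c = (cos(alpha) - 1.0) / s;
    *x1 -= r(c * *x2);
    *x2 -= r(s * *x1);
    *x1 -= r(c * *x2);
}

int main(void)
{
    long a = 123, b = -45;
    lift_rotate(&a, &b, 0.7);
    lift_rotate_inverse(&a, &b, 0.7);
    printf("%ld %ld\n", a, b);   /* prints the original values 123 -45 */
    return 0;
}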
VI.4.3.2 Error mapping
The error mapping process provides the link between the perceptual AAC core and the scalable
lossless enhancement layer of the coder. Instead of encoding all IntMDCT coefficients in the
lossless enhancement layer, the information already coded in AAC is used. Only the resulting
residuals between the IntMDCT spectral values and their equivalents in the AAC layer are
coded.
The residual signal e[k] is given by [Geiger et al., 2006]:
$e[k] = \begin{cases} c[k] - thr(i[k]), & i[k] \ne 0 \\ c[k], & i[k] = 0 \end{cases} \qquad (26)$
where c[k] is an IntMDCT coefficient, thr(i[k]) is the next quantized value closer to zero with
respect to i[k], and i[k] is the AAC quantized value.
If the coefficients belong to a scale-factor band that was quantized and coded in the AAC, the residual coefficients are obtained by subtracting the quantization thresholds from the IntMDCT coefficients. If the coefficients don't belong to a coded band, or are quantized to zero, the residual spectrum is composed from the original IntMDCT values. This process allows better coding efficiency without losing information.
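A direct transcription of Equation (26) is sketched below; the thr() helper is only declared as a placeholder for the quantization threshold delivered by the AAC core.

/* Residual computation of the error mapping, Equation (26). */
static long thr(long i) { return i; /* placeholder for the AAC threshold */ }

void error_mapping(const long *c, const long *i, long *e, int n)
{
    for (int k = 0; k < n; k++)
        e[k] = (i[k] != 0) ? c[k] - thr(i[k]) : c[k];   /* Eq. (26) */
}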
VI.5. Audio Processing Supporting Real-time Execution
VI.5.1. MD-based Decoding
In analogy to LLV1, the SLS decoding was also designed with the goal of scalability; however, contrary to LLV1, the processing can already be adapted to the real-time constraints. Thus, in order to support MD-based decoding, the existing SLS does not have to be extended: real-time adaptive decoding is supported through the possibility of skipping some parts of the extension stream including the residuals for a given window of samples or even a set of windows of samples82. Of course, the extension stream stores not the samples themselves but the compressed residual signal used for the error correction.
Such meta-data allow stopping the decoding in the middle of the currently processed window and continuing the enhancement processing at the subsequent window or at one of the next windows, whenever an unpredictable peak in processing occurs. In other words, this process may be understood as a truncation of the bitstream on the fly during a higher-load period, and not before the start of the decoder execution (as it is in standard SLS right now83).
The MD-based functionality of the existing best-effort SLS is already included, considering the analogical extension proposal for LLV1 (as given in section V.5.1), due to the possibility of storing the size occupied by a data window of the enhancement layer in the compressed domain at the beginning of each window. So, the real-time SLS decoder should only be able to consume this additional information as continuous MD.
Of course, the original best-effort SLS decoder has no need for such MD, because the stream is read and decoded as specified by the input parameters (i.e. if truncated, then beforehand). The discontinuation of the processing of one or a few data windows in the compressed domain is only required under the strict real-time constraints and not in a best-effort system, i.e. if the execution time is limited (e.g. the deadline is approaching) and a peak in the processing occurs (higher coding complexity than predicted), the decoding process of the enhancement layer should be terminated and started again with the next or a later window. Finding the position of the next or a later data window in the compressed stream is not problematic anymore, since the beginning of the next data window is delivered by the mentioned MD. On the other hand, the window positions could be stored in an index, and then a faster localization of the next window would be possible.
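A hedged sketch of this on-the-fly skipping is shown below: since each enhancement data window starts with its compressed size, the decoder can jump over one or more windows with fseek; the 32-bit size field is an assumption made for illustration, not the actual SLS bitstream syntax.

#include <stdint.h>
#include <stdio.h>

/* Skip `count` enhancement-layer data windows; returns 0 on success. */
int skip_enh_windows(FILE *bitstream, int count)
{
    for (int i = 0; i < count; i++) {
        uint32_t size;                      /* size stored at the window start */
        if (fread(&size, sizeof size, 1, bitstream) != 1)
            return -1;
        if (fseek(bitstream, (long)size, SEEK_CUR) != 0)
            return -1;                      /* jump over the residual data */
    }
    return 0;
}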
82
In order to skip few data windows, the information about size of each window has to be read sequentially and fseek operation
has to be executed in steps to read the size of the window at the beginning of the given window.
83
There is an additional application BstTruncation used for truncating the enhancement layer in the coded domain.
VI.5.2. MD-based Encoding
To understand the proposed MD-based extensions for audio coding the general perceptual
coding algorithm together with the reference MPEG-4 audio coding standard are explained at
first. Then the concrete continuous MD are discussed referring to the audio coding.
VI.5.2.1 MPEG-4 standard as representative
Generalization of Perceptual Audio Coding Algorithms
The simplified algorithm of perceptual audio coding is depicted in Figure 44. The input sound in digital form, which is usually Pulse Code Modulation (PCM) audio in 16-bit words, is first divided in the time domain into windows, usually of a constant duration (not depicted). Each window corresponds to a specific number of samples, e.g. one window of a voice/speech signal lasts approx. 20 ms, which corresponds to 160 samples at a sampling rate of 8 kHz (8000 samples/s x 0.02 s) [Skarbek, 1998].
Figure 44. General perceptual coding algorithm [Kahrs and Brandenburg, 1998]: a) encoder and b) decoder.
Then the windows of samples in the time domain are transformed to the frequency domain by the analysis filter bank (analogically to the transform step in video), which decomposes the input signal into its sub-sampled spectral components – frequency sub-bands – being a time-indexed series of coefficients. The analysis filter bank applies the Modified Discrete Cosine Transform (MDCT) or the discrete Fast Fourier Transform (FFT) [Kahrs and Brandenburg, 1998]. The windows can be overlapping or non-overlapping. During this process also a set of parameters is extracted, which gives information about the distribution of the signal, the masking power over the time-frequency plane and the signal mapping used to shape the coding distortion. All these assist in perceptual analysis, perceptual noise shaping and in the reduction of redundancies, and are later needed for quantization and encoding.
The perceptual model, also called the psychoacoustic model, is used to simulate the ability of the human auditory system to perceive different frequencies. Additionally, it models the masking effect, i.e. loud tones masking quieter tones (frequency masking and temporal masking) as well as the quantization noise around their frequency. The masking effect depends on the frequency and the amplitude of the tones. The psychoacoustic model analyzes the parameters from the filter banks to calculate the masking thresholds needed in quantization. After the masking thresholds are calculated, the bit allocation is assigned to the signal representation in each of the frequency sub-bands for a specified bitrate of the data stream.
Analogically to the quantization and coding steps in video, the frequency coefficients are next quantized and coded. The purpose of the quantization is to implement the psychoacoustic threshold while maintaining the required bit rate. The quantization can be uniform (equally distributed) or non-uniform (with a varying quantization step). The coding uses scale factors to determine the noise shaping factors for each frequency sub-band. It is done by scaling the spectral coefficients before quantization, to influence the scale of the quantization noise, and by grouping them into bands corresponding to different frequency ranges. The scale factors' values are found from the perceptual model by using two nested iteration loops. The values of the scale factors are coded by Huffman coding (only the difference between the values of subsequent bands is coded). Finally, the windows of samples are packed into the bitstream according to the required bitstream syntax.
MPEG-1 Layer 3 and MPEG-2/4 AAC
MPEG-1 Layer 3 (shortly MP3) [MPEG-1 Part III, 1993] and MPEG-2 AAC extended by MPEG-4 AAC (shortly AAC) [MPEG-4 Part III, 2005] are the most widespread and most often used coding algorithms for natural audio coding; thus this work discusses these two coding algorithms from the encoder perspective. The MP3 encoding block diagram is depicted in Figure 45, while the AAC one in Figure 46.
Figure 45. MPEG Layer 3 encoding algorithm [Kahrs and Brandenburg, 1998].
Figure 46. AAC encoding algorithm [Kahrs and Brandenburg, 1998].
There are a few noticeable differences between the MP3 and AAC encoding algorithms [Kahrs and Brandenburg, 1998]. MP3 uses the hybrid analysis filter bank (Figure 44) consisting of two cascaded modules, the Polyphase Quadrature Filter (PQF)84 and the MDCT (Figure 45), while AAC uses only the switched MDCT. Secondly, MP3 uses a window size of 1152 (optionally 576) values, while the length of the AAC window is equal to 102485. Thirdly, MPEG-4 AAC uses additional functionality like Temporal Noise Shaping (intra-block predictor), Long-Term Predictor (not depicted), Intensity (Stereo) / Coupling, (inter-block) Prediction and Mid/Side Processing.
The window type detection together with the window/block switching, and the determination of the M/S stereo IntMDCT are the most time-consuming blocks of the encoding algorithm.
VI.5.2.2 Continuous MD set for audio coding
Having the audio coding explained, the definition of continuous meta-data set can be
formulated. There are some differences between coding algorithms; however there are also
some common aspects. Based on these similarities, the following elements of continuous MD
set for audio are defined: window type, window switching mode, and M/S flag.
Window type. It is analogical to the frame type in video; however, here the type is derived not from the processing method but primarily from the audio content. The window type is decided upon the computed MDCT coefficients and their energy. An example algorithm for determining the window type is given in [Youn, 2008]. Depending on the current signal, five window types are defined so far: start, stop, long, medium and short. These can be mapped to the standard AAC window types. Contrary to that, MPEG-4 SLS allows for more window types, but they can still be classified as sub-groups of the mentioned groups and refined on the fly during execution (which is not as expensive as a complete calculation).
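An illustrative transient check is sketched below; it is not the algorithm of [Youn, 2008], only a minimal energy-ratio heuristic with an arbitrary threshold, deciding between long and short windows as it could be done once in the preparation phase and stored as continuous MD.

typedef enum { WIN_LONG, WIN_SHORT } win_t;

/* Energy of one PCM block. */
static double block_energy(const short *pcm, int n)
{
    double e = 0.0;
    for (int i = 0; i < n; i++)
        e += (double)pcm[i] * (double)pcm[i];
    return e;
}

/* Pick a short window when the energy jumps sharply between blocks. */
win_t classify_window(const short *prev, const short *cur, int n)
{
    double ep = block_energy(prev, n), ec = block_energy(cur, n);
    return (ep > 0.0 && ec / ep > 8.0) ? WIN_SHORT : WIN_LONG;
}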
Window switching mode. In the case of different window types coming subsequently, window switching is required. It can be conducted only when the window types of the current and the next frame are known.
84
PQF splits a signal into N = 2^x equidistant sub-bands, allowing an increase of the redundancy removal. There are 32 sub-bands (x=5) for MP3, with the frequency range between 0 and 16 kHz. However, the PQF introduces aliasing for certain frequencies, which has to be removed by a filter overlapping the neighboring sub-bands (e.g. by the MDCT having such a characteristic).
85
There are few window lengths possible in the MPEG-4 AAC [MPEG-4 Part III, 2005]. Generally only two are used: short
window with 128 samples and long window with 1024. The short windows are grouped by 8 in a group, so the total length is
then 1024, and that’s why usually the window size of 1024 is mentioned. The other possible windows are: medium – 512,
long-short – 960, medium-short – 480, short-short – 120, long-ssr – 256, short-ssr – 32.
If the window type is already delivered by the continuous MD, then the window switching mode can be calculated; yet, even though enhancements have been proposed (e.g. [Lee et al., 2005]), this still requires additional processing power. Thus the window switching mode can also be included in the continuous MD. It is proposed to include the same eleven modes as defined by MPEG-486 in the continuous MD.
Mid/Side flag. It holds information on whether M/S processing can be used. There are three modes of using M/S processing:
• 0 – the M/S processing is always turned off, regardless of the incoming signal characteristics;
• 1 – the M/S processing is always turned on, so the left and right channels are always mapped to the M/S signal, which may be inefficient in case of big differences between the channels;
• 2 – the M/S processing chooses dynamically whether each channel signal is processed separately or in (M/S) combination.
Obviously, the last option is the default for encoders supporting the M/S functionality. However, checking each window separately is a CPU-consuming operation.
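A possible per-window record for these continuous MD elements is sketched below; the layout and names are assumptions made for illustration, not the RETAVIC definition.

typedef enum { WT_START, WT_STOP, WT_LONG, WT_MEDIUM, WT_SHORT } window_type_t;

typedef struct {
    window_type_t type;        /* window type (analogical to the frame type) */
    unsigned char switch_mode; /* one of the eleven MPEG-4 switching modes   */
    unsigned char ms_flag;     /* 0 = off, 1 = always on, 2 = per window     */
} audio_cmd_t;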
There is also an additional chance of using meta-data in a continuous manner such that the real-time encoding could possibly be conducted faster. The Dynamic Range Control (DRC) idea is meant here [Cutmore, 1998; MPEG-4 Part III, 2005]. DRC is a process manipulating the dynamic range of an audio signal, thus altering (automatically) the volume of the audio, such that the gain (level) of the audio is reduced when the wave's amplitude crosses a certain threshold (deriving directly from the limitations of the hardware or of the coding algorithm). If the dynamics of the audio is analyzed before it is coded, there is a possibility to exploit an additional "helper" signal along with
86
Types supported by MPEG-4: STATIC_LONG, STATIC_MEDIUM, STATIC_SHORT, LS_STARTSTOP_SEQUENCE, LM_STARTSTOP_SEQUENCE, MS_STARTSTOP_SEQUENCE, LONG_SHORT_SEQUENCE, LONG_MEDIUM_SEQUENCE, MEDIUM_SHORT_SEQUENCE, LONG_MEDIUM_SHORT_SEQUENCE, FFT_PE_WINDOW_SWITCHING.
the digital audio which "predicts" the gain that will shortly be required [Cutmore, 1998]. This allows a modification of the dynamic range of the reproduced audio in a smooth manner. The support for DRC has already been included in MPEG-4 AAC [MPEG-4 Part III, 2005]; however, the DRC meta-data is produced on the encoder side and consumed on the decoder side. There is no possibility to exploit DRC MD on the encoder side. Eliminating such a drawback would allow using the DRC information as a hint for the scale factor calculation87.
VI.6. Evaluation of the Audio Processing Model through Best-Effort Prototypes
VI.6.1. MPEG-4 SLS Scalability in Data Quality
The MPEG-4 SLS scalability in the meaning of data quality has been measured (shown in Figure 47). The SLS enhancement bitstream for different bitrates of the core layer was truncated with the BstTruncation tool to different sizes and then decoded. The resulting audio data have then been compared to the reference files using a PEAQ implementation that gives the Objective Difference Grade (ODG) [PEAQ, 2006]. The closer the ODG value of the difference in perceived sound quality is to zero, the better the quality is. Besides the standard SLS core option, the SLS was also tested in the non-core mode referred to as SLS no Core and for the pure core (i.e. AAC) denoted by SLS Core only (Figure 47). The test set included both the MPEG-based SQAM subset [WWW_MPEG SQAM, 2006] and the private music set [WWW_Retavic - Audio Set, 2006].
Overall, the scalability of SLS is very efficient with respect to the quality gain compared to the added enhancement bits. The biggest gain of ODG is achieved when scaling towards bitrates around 128 kbps. The high-bitrate AAC cores affect the sound quality positively until about 256 kbps, with the disadvantage of no scalability below the bitrate of the AAC core. In the area between near-lossless and lossless bitrates the ODG converges towards zero. The SLS achieves the lossless state at rates of about 600-700 kbps, which is not possible for lossy AAC. The pure basic
87 The MPEG-4 AAC uses the Spectral Line Vector, which is calculated based on the Core Scaling factor and the Integer Spectral Line Vector (IntSLV). The Core Scaling factor is defined as a constant for four possible cases: MDCT_SCALING = 45.254834, MDCT_SCALING_MS = 32, MDCT_SCALING_SHORT = 16, and MDCT_SCALING_SHORT_MS = 11.3137085. The IntSLV is calculated by the MDCT function.
AAC stream (SLS Core only) starts to fall behind the scalable-to-lossless SLS from about 160 kbps
upwards.
[Figure: ODG (0.00 to -3.50) versus bitrate in kbps (core + enhancement layer, 64 to 384) for SLS with 64 kbps core, SLS with 128 kbps core, SLS Core only, and SLS no Core]
Figure 47. Gain of ODG with scalability [Suchomski et al., 2006].
Concluding, the SLS provides very good scalability with respect to quality versus occupied size
and at the same time can still be competitive with other, non-scalable lossless codecs.
VI.6.2. MPEG-4 SLS Processing Scalability
In order to investigate the processing scalability, the decoding speed was compared to the
bitrate of the input SLS bitstream. Figure 48 shows the SLS decoding speed with respect to the
enhancement streams truncated at different bit rates, analogously to the data-quality scalability test.
Obviously, the truncated enhancement streams are decoded faster as the amount of data and the
number of bit planes decrease due to the truncation. Furthermore, the Non-Core SLS streams
are decoded faster than normal SLS, but in both cases the increase in speed is very small—about
a factor of 2 from minimum to maximum enhancement bitrate (whereas the expected behavior
would show much faster decoding for small amounts of data, i.e. the curve should fall off
more steeply). However, the situation changes dramatically when the FAAD2 AAC decoder
[WWW_FAAD, 2006] is used for the decoding of the AAC core and the enhancement layer is
dropped completely. In that case, the decoding is over 100 times faster than real-time. But
even then, if the decoding of enhancement layers takes place, the decoding speed will drop
below 10 times faster than real-time.
[Figure: decoding speed (× real-time, logarithmic scale 1.0 to 1000.0) versus bitrate (64 kbps up to maximum) for SLS no core, SLS with 64 kbps core, and SLS 64 kbps core only]
Figure 48. Decoding speed of SLS version of SQAM with truncated enhancement stream
[Suchomski et al., 2006].
Overall, real processing scalability, as might also be expected, is not given with MPEG-4
SLS. From this point of view it can be reduced to only two steps of scalability—using the
enhancement layer or not—so the scalability of SLS in terms of processing speed cannot
compete with its scalability in terms of sound quality. This is caused by the inverse integer
MDCT (InvIntMDCT) taking the largest part of the overall processing time. Even though its cost is
almost constant and does not depend on the source audio signal, it accounts for between 50% and
70% of the decoding time depending on the source audio. A better and faster implementation has
been proposed [Wendelska, 2007] and is described in section XXIII. MPEG-4 SLS Enhancements
of Appendix H. It can clearly be seen (Figure 110) that the enhancements in the implementation
have brought benefits such as a smaller decoding time.
VII. REAL-TIME PROCESSING MODEL
The time aspect of the data in continuous media such as audio and video influences the
prediction, scheduling and execution methods. If real-time processing is considered as a
basis for the format independence provision in an MMDBMS, the converters participating in the media
transformation process should be treated not as separate units but as parts of the whole execution.
Of course, a converter first must be able to run in the real-time environment (RTE) and, as
such, must be able to execute in the RTOS and to react to the IPC used for controlling real-time processes specific to the given RTOS. Secondly, there exist data dependencies in the
processing, which have not been considered by the previously discussed models [Hamann et al., 2001a;
Hamann et al., 2001b]. So, at first the modeling of continuous multimedia transformations with
the aspects of data dependency is discussed. Then, the real-time issues are discussed in the context
of multimedia processing. Finally, the design of real-time media converters is proposed.
VII.1. Modeling of Continuous Multimedia Transformation
VII.1.1. Converter, Conversion Chains and Conversion Graphs
A simple black-box view of the converter [Schmidt et al., 2003] considers as visible: the data source,
the data destination, the resource utilization and the processing function, which consumes the resources.
The conversion chain described in [Schmidt et al., 2003] assumed that in order to handle all
kinds of conversion a directed, acyclic graph is required (conversion graph), i.e. split and re-join operations are required in order to support multiplexed multimedia streams; however, the
split and re-join operations are treated in [Schmidt et al., 2003] as separate operations lying
outside the defined conversion chain. On the other hand, it was stated that due to the
interaction between only two consecutive converters in most cases, only the conversion chains
shall be investigated. These two assumptions somewhat contradict each other. If a
consistent, sound, and complete conversion model were provided, then the split and re-join
operations would be treated as converters as well, and thus the whole conversion graph should
be considered.
[Figure: black box with m data inputs, a processing function (the converter), n data outputs, and resource utilisation]
Figure 49. Converter model – a black-box representation of the converter (based on [Schmidt
et al., 2003; Suchomski et al., 2004]).
Consequently, the conversion model was extended by [Suchomski et al., 2004] such that the incoming
data may be represented not by only one input but by many inputs, and the outgoing data
not by one output but by many outputs, i.e. the input/output of the
converter may consume/provide one or more media data streams as depicted in Figure 49.
Moreover, it is assumed that the converter may convert from m media streams to n media
streams, where m∈N, n∈N, and m does not have to be equal to n (but it may still happen that
m=n). Due to this extension, the simple media converter could be extended to a multimedia
converter, and the view on the conversion model could be broadened from the conversion
chain to the conversion graph [Suchomski et al., 2004]. Moreover, the model presented in
[Suchomski et al., 2004] is a generalization of the multimedia transformation based on the
previous research (mentioned in the Related Work in section II.4).
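To make this extended black-box view concrete, the following C sketch models a converter with m input and n output stream connections and a resource-utilization descriptor. The identifiers are illustrative only and are not taken from the RETAVIC implementation.

/* Illustrative sketch of the extended converter model (m inputs, n outputs).
 * All names are hypothetical and merely mirror Figure 49. */
typedef struct stream stream_t;          /* one media data stream (opaque here) */

typedef struct {
    double cpu_share;                    /* fraction of CPU bandwidth used      */
    double memory_kb;                    /* buffer memory consumed              */
} resource_utilization_t;

typedef struct converter {
    int        m;                        /* number of data inputs,  m >= 0      */
    int        n;                        /* number of data outputs, n >= 0      */
    stream_t **inputs;                   /* array of m input streams            */
    stream_t **outputs;                  /* array of n output streams           */
    resource_utilization_t usage;        /* resources consumed while processing */

    /* processing function: consumes quanta from the inputs and produces
     * quanta on the outputs; returns 0 on success */
    int (*process)(struct converter *self);
} converter_t;

/* A source is then a converter with m == 0, a sink one with n == 0. */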
The special case of source and sink being represented as converters on the borders of the
conversion graph should still be mentioned. These two types of converters have,
respectively, only an output or only an input within the model. The source
(e.g. file reading) delivers the source data for the consecutive converter, while the sink (e.g.
screen) only consumes the data from the previous converter. The file itself or the file system (its
structure, access method, etc.) as well as the network or the graphical adapter should not be
represented in the conversion graph as fully-fledged converters; they are considered
representatives of the outside environment having just one type of data interconnection and are
mapped to the converter sink or source respectively. Of course, the conversion graph should not be
limited to just one source and one sink (there may be many of them), however there must always
be at least one instance of a source and one instance of a sink present in the conversion
graph. The converter graph is depicted in Figure 50, where (a) presents the meta-model of the
converter graph, (b) shows a model of the converter graph derived from the meta-model
definition, and (c) depicts three examples being instances of a converter graph model i.e. the
particular converters are selected such that the converter graph is functionally correct and
executable (explained later in section VII.2.5.2 Scheduling of the conversion graph).
[Figure: a) meta-model relating CONVERTER and RUNTIME CONNECTION via PRODUCER and CONSUMER roles, with the constraints that a SOURCE may not have any CONSUMER role and a SINK may not have any PRODUCER role; b) converter-graph/converter-chain model built from a converter (source), converters, buffers and a converter (sink); c) three instance examples involving, among others, disk reading, demux of mpeg-sys, mpeg-video-to-DivX and mpeg-audio-to-wma conversion with asf muxing, LLV1 selective reading and LLV1 decoding followed by XVID encoding towards the network, and a DVB-T live stream, internet radio and an archive]
Figure 50. Converter graph: a) meta-model, b) model and c) instance examples.
VII.1.2. Buffers in the Multimedia Conversion Process
Within the discussed conversion graph model the data inputs and data outputs are logically
joined into connections, which are mapped onto buffers between converters. This mapping has to
be designed carefully due to the key problem of the data transmission costs between multimedia
converters. Especially if the converters operate on uncompressed multimedia data, the costs
of copying data between buffers may heavily influence the efficiency of the system.
Thus, the assumption of having zero-copy operations provided by the operating system should
be made. The zero-copy operation could be provided, for example, by buffers shared between
processes, called a global buffer cache [Miller et al., 1998], or in other words by using inter-process
shared or mapped memory for input/output operations through pass-by-reference transfers⁸⁸.
Yet another example of a zero-copy operation, which could deliver possibly efficient data
transfers, is based on page remapping, i.e. “moving” data residing in the main memory across
the protection boundaries and thus avoiding copy operations (e.g. IO-Lite [Pai et al.,
1997]). In contrast, message passing, where the message is formed out of the function
invocation, signal and data packets, is not a zero-copy operation due to the copying of data
from/to local memory into/from the packets transmitted through the communication channel
[McQuillan and Walden, 1975]⁸⁹.
Jitter-constrained periodic stream
It is obvious that the size of the buffer depends on the size of the media quanta and on the
client’s QoS requirements [Schmidt et al., 2003], but still, the buffer size must somehow be
calculated. It could be done e.g. by the statistical approach using jitter-constrained periodic
streams (JCPS) proposed by [Hamann et al., 2001b]. The JCPS is defined as two streams, time
and size [Hamann et al., 2001b]:
88 Another example of a zero-copy operation is the well-known direct memory access (DMA) for hard disks, graphical adapters or network cards used by the OS drivers.
89 Message passing may be implemented with a zero-copy method; however, it is commonly acknowledged that message passing introduces some overhead and is less efficient than shared memory on local processor(s) (e.g. on SMPs) [LeBlanc and Markatos, 1992]. This, however, may not be the case on distributed shared memory (DSM) multiprocessor architectures [Luo, 1997]. Nevertheless, there is some research in the direction of providing software-based distributed shared memory, e.g. over the virtual interface architecture (VIA) for Linux-based clusters, which seems to be a simpler-to-handle and still competitive solution [Rangarajan and Iftode, 2000], or using the openMosix architecture allowing for multi-threaded process migration [Maya et al., 2003].
Time stream:    JCPSt = (T, D, τ, t0)    (27)
where T is the average distance between events, i.e. the length of the period (T > 0), D is the minimum distance between events (0 ≤ D ≤ T), τ is the maximum lateness, i.e. the maximum deviation from the beginning of the period out of all deviations for each period, and t0 is the starting time point (t0 ∈ ℝ);
Size stream:    JCPSs = (S, M, σ, s0)    (28)
where S is the average quantum size (S > 0), M is the minimum quantum size (0 ≤ M ≤ S), σ is the maximum deviation from the accumulated quantum size, and s0 is the initial value (s0 ∈ ℝ) [Hamann et al., 2001b].
Leading time and buffer size calculations
According to the JCPS specification it is possible to calculate the leading time of the producer P
with respect to the consumer C and the minimum size of the buffer as follows [Hamann et al.,
2001b]:

tlead = σC / R + τP = (TP · σC) / SP + τP    (29)

Bmin = ⌈(τC + τP) · R + σP + σC − s0⌉ = ⌈(τC + τP) · SP / TP + σP + σC − s0⌉    (30)
where R is the rate of size to time (i.e. bit rate) of both communicating JCPS streams. Of
course, there is only one P and one C for each buffer.
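For illustration, a small C helper evaluating Formulas (29) and (30) for one producer/consumer pair could look as follows. This is a minimal sketch under the JCPS definitions above; the struct and function names are illustrative, and the choice of the producer's s0 in Formula (30) is an assumption.

/* Minimal sketch of the JCPS-based buffer dimensioning (Formulas 29 and 30). */
#include <math.h>

typedef struct {
    double T;      /* average period length            */
    double D;      /* minimum distance between events  */
    double tau;    /* maximum lateness                 */
    double t0;     /* starting time point              */
    double S;      /* average quantum size             */
    double M;      /* minimum quantum size             */
    double sigma;  /* max. accumulated size deviation  */
    double s0;     /* initial size value               */
} jcps_t;

/* Leading time of producer P with respect to consumer C (Formula 29). */
double jcps_lead_time(const jcps_t *P, const jcps_t *C)
{
    return (P->T * C->sigma) / P->S + P->tau;
}

/* Minimum buffer size between P and C (Formula 30);
 * s0 is taken from the producer's size stream here (assumption). */
double jcps_min_buffer(const jcps_t *P, const jcps_t *C)
{
    double R = P->S / P->T;   /* rate (size over time) of the stream */
    return ceil((C->tau + P->tau) * R + P->sigma + C->sigma - P->s0);
}

Applied to the theoretical Mobile (CIF) case discussed in section VII.1.4 (constant 40 ms period and 1201.5 kb quanta with zero jitter), it indeed yields tlead = 0 and Bmin = 0, which is exactly the shortcoming analyzed there.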
M:N data stream conversion
However, the JCPS considers only the conversion chains based on simple I/O converter i.e. the
converter accepting one input and one output (such as given in [Schmidt et al., 2003]), and no
multiple streams and stream operations are supported. The remedy, which allows to support
additional stream operations such as join, split or multiplex, could be delivered in two ways: 1)
as an extension of the model such that the condition for buffer allocation supports multiple
producers and multiple consumers, or 2) by changing the core assumption i.e. substituting the
165
Chapter 3 – Design
VII. Real-Time Processing Model
used converter model by the one given in [Suchomski et al., 2004] (depicted in Figure 49). In the
first case, the task requires extending the assumption for the producer and for the consumer as:
Pt = Pt1 + Pt2 + … + Ptn,  where Ptn = (TPn, DPn, τPn, t0Pn)    (31)
Ps = Ps1 + Ps2 + … + Psn,  where Psn = (SPn, MPn, σPn, s0Pn)    (32)
Ct = Ct1 + Ct2 + … + Ctn,  where Ctn = (TCn, DCn, τCn, t0Cn)    (33)
Cs = Cs1 + Cs2 + … + Csn,  where Csn = (SCn, MCn, σCn, s0Cn)    (34)
where the first “+” operator symbolizes the function combining the time event streams into one, such
that all events are included and the length of the period (T), the minimum event distance (D), the maximum
lateness (τ) and the starting point (t0) are correctly calculated, and the second “+” operator represents the
addition of the average sizes (S) together with the calculation of the minimum quantum size (M), of the
maximum accumulated deviation of the quantum size (σ), and of the initial value (s0). However, the
definition of these two “+” operators is expected to be very complex and has not been defined yet.
In contrast, the other method, where the new converter model is assumed, does not require
changes in the JCPS buffer model, i.e. the outputs of a converter are treated logically as separate
producers delivering to separate consumers formed by the inputs of the subsequent converter(s). The
drawback is obvious – the converter itself must cope with the synchronization issues of each
input buffer (being a consumer) and each output buffer (being a producer).
VII.1.3. Data Dependency in the Converter
As was proved in the previous work [Liu, 2003] and shortly pointed out within this work, the
behavior of the converter highly depends on the data processed. Even the application of the MD-based transcoding approach does not eliminate but only reduces the variations in the processing
time requirements, making a precise definition of the relationship between data and
processing impossible, i.e. the exact prediction of the process execution time is still hardly
feasible. On the other hand, it reduces the required computations appreciably, e.g. the speed-up
for video ranges between 1.32 and 1.47. Nevertheless, the data influence on the processing has been
investigated to some extent and is presented in the later part of this work in the Real-time Issues in
Context of Multimedia Processing section (subsection VII.2.1).
VII.1.4. Data Processing Dependency in the Conversion Graph
There are also data dependencies within the conversion graph. Let us assume that there are only
two converters: a producer (P) and a consumer (C). P delivers quanta synchronously, i.e. 25
decoded frames per second without delay and with constant size—derived from the color
scheme and resolution or from the number of MBs (each MB consists of 6 blocks having 64 8-bit values)—are written to the buffer for the sequence Mobile (CIF), and C consumes them exactly the
same way as P produces them; in other words, both have no delay in delivery and intake.
According to the JCPS model they can be specified as:
Pt = (TP, DP, τP, t0P) where TP=40[ms], DP=40[ms], τP=0, t0P=0, i.e. Pt = (40, 40, 0, 0)
Ps = (SP, MP, σP, s0P) where SP=1201.5[kb], MP=1201.5[kb], σP=0, s0P=0, i.e. Ps = (1201.5, 1201.5, 0, 0)
and analogously
Ct = (40, 40, 0, 0) and Cs = (1201.5, 1201.5, 0, 0).
The lead time and the minimal buffer can then be calculated according to Formulas (29) and
(30) for this simple theoretical case:
tlead = 0.0 [s] and Bmin = 0 [b]
Even though P and C fulfill the requirement of a constant rate, i.e. R = SP/TP = SC/TC, the above
result would lead to an error in reality, because there is a hidden data dependency between successive
converters⁹⁰ which is not represented within the model but is very important for any type
of scheduling of the conversion process. Namely, C can start consumption only when P
has completed the first quantum, i.e. when the buffer is filled with the amount of bits equal to the needs
of C. Moreover, when considering the times of subsequent quanta consumed by C, all of them
must be produced beforehand by P. Thus, if the scheduling is considered, P should start at
90 According to the JCPS model, the contents of the quanta are irrelevant here, i.e. the converter is not dependent on the data.
least one full quantum (a frame/window of samples) before C can work, i.e. tlead must be equal to
40 ms. Moreover, if the buffer is used as the transport medium it should at least allow for storing
a full frame, which means that it should be at least 1201.5 [kb]. Finally, if P and C are
modeled as stated above, both must occupy the processor exclusively due to the full use of the processing
time (i.e. in both cases 40 [ms] x 25 [fps] = 1 [s]), which means that they run either on two
processors or on two systems.
The data dependency in the conversion graph exists regardless of the scheduling model and the
system, i.e. it does not matter if there is one processor or more. In the case of one processor, the
scheduling must create an execution sequence of converters such that the producer occupies
the processor, always produces the quantum required by the consumer in advance, and shares the
time with the consumer. In the case of parallel processing, the consumer on the next processor can
start only when the producer on the previous one has delivered the data to the buffer. This is
analogous to instruction pipelining, but in contrast it is applied on the thread level,
not on the instruction level.
VII.1.5. Problem with JCPS in Graph Scheduling
If the transcoding graph is considered from the user perspective, the output from the last
converter in the chain must consider the execution time of all the previous converters and must
allow for synchronized data delivery. Moreover, if processing with one CPU is assumed, the
time consumed by each converter with respect to a given quantum should in total not exceed the period
size of the output stream, because otherwise the rule of a constant quantum rate will be broken. For
example, if the Mobile (CIF) sequence is considered with 25 fps in the delivered stream, the period
size is equal to 40 ms—so the sum of the execution times of all the previous
converters on one processor should not exceed this value. Otherwise, the real-time execution will
not be possible. So, the following equation must hold for real-time delivery on one CPU:

TOUT ≥ ∑i=1..n Ti    (35)

where TOUT is the period size of the output JCPS stream requested by the user, Ti is the JCPS
period size for the specific element in the conversion graph and n is the number of converters.
Moreover, the leading times (tlead) of all converters should be summed up and analogously the
buffer sizes should be considered, as follows:

tlead,OUT ≥ ∑i=1..n tlead,i    (36)

Bmin,OUT ≥ ∑i=1..n Bmin,i    (37)
These requirements hold especially for infinite streams. And even though introducing a
buffer would allow for some deviations in execution time, the extra time consumed by
converters which took longer must still be balanced by those which took less time, so that the
average time still fulfills the requirement. On the other hand, a respective buffer size would
allow for some variations in the processing time, but at the same time it introduces a start-up latency in the
delivery process such that the bigger the buffer is, the bigger the introduced latency is. The latency at
startup can simply be calculated for each conversion graph element by:

L = (B / S) · T    (38)

where L means the latency derived from the given buffer and, following the definitions given previously,
B is the buffer size, S is the average quantum size and T is the length of the period.
The latency for the complete converter graph is calculated as the sum of the latencies of each
converter in the conversion graph as follows:

LOUT ≥ ∑i=1..n Li    (39)
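To tie Formulas (35)–(39) together, the following C sketch checks whether a chain of converters modeled by JCPS can meet the user-requested output period on one CPU and accumulates the lower bounds for lead time, buffer size and start-up latency. The types and names are illustrative only.

/* Sketch of the single-CPU feasibility and latency check for a conversion
 * chain (Formulas 35-39). Types and names are illustrative only. */
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    double T;       /* JCPS period of the chain element [ms]         */
    double t_lead;  /* leading time towards its consumer [ms]        */
    double B_min;   /* minimum buffer size towards its consumer [kb] */
    double S;       /* average quantum size produced [kb]            */
} chain_element_t;

typedef struct {
    bool   feasible;        /* Formula 35 satisfied?            */
    double t_lead_total;    /* lower bound from Formula 36 [ms] */
    double B_min_total;     /* lower bound from Formula 37 [kb] */
    double latency_total;   /* lower bound from Formula 39 [ms] */
} chain_check_t;

chain_check_t check_chain(const chain_element_t *e, size_t n, double T_out)
{
    chain_check_t r = { true, 0.0, 0.0, 0.0 };
    double T_sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        T_sum           += e[i].T;
        r.t_lead_total  += e[i].t_lead;
        r.B_min_total   += e[i].B_min;
        r.latency_total += (e[i].B_min / e[i].S) * e[i].T;   /* Formula 38 */
    }
    r.feasible = (T_out >= T_sum);                           /* Formula 35 */
    return r;
}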
An example of a simple transcoding chain is depicted in Figure 51. Here the LLV1 BL+TEL bit
stream is read from the disk, put into a buffer, decoded by the LLV1 decoder, put into a buffer,
encoded by XVID, put into a buffer, and finally the encoded MPEG-4 bit stream is stored on the
disk. The transcoding chain has been executed sequentially in a best-effort OS with exclusive
execution mode (occupying 100% of the CPU) in order to allow modeling with JCPS by
measuring the time spent by each element of the chain for all frames, i.e. at first the selected data
have been read from the storage and buffered completely for all frames, then the first converter
A (LLV1 decoding) processed the data and put the decoded frames into the next buffer, next the second
converter B (XVID encoding) compressed the data and put it into the third buffer, and
finally the write-to-disk operation has been executed.
Figure 51. Simple transcoding used for measuring times and data amounts on best-effort OS
with exclusive execution mode.
[Figure: a) transcoding time per frame (frames 1-15, up to 40 ms) for each chain element; b) each element's participation in the total transcoding time: Converter B Encode 61.4%, Converter A Decode 27.9%, Source Read 4.3%, Sink Store 3.5%, buffers 0.0%, 0.7% and 2.2%]
Figure 52. Execution time of simple transcoding: a) per frame for each chain element and b)
per chain element for total transcoding time.
The measured execution times together with the amount of produced data are listed for the first
15 quanta for each participating chain element in Table 5 and depicted in Figure 52. The time
required for reading (from the buffer) by the given converter can be neglected in the context of the
converter’s time due to the fast memory access employing the caching mechanism, and thus it is
measured together with the converter’s time—only in the case of the source does the measured time
represent the reading from the disk. Secondly, the quantum size read by the next converter is equal
to the one stored in the buffer. Thus the consumer part of each converter is hidden (in order to
avoid repetitions in the table) and the data in the buffers is called Data In, which shall reflect their
consumer characteristic, as opposed to Data Out in the source, converters and sink (being
producers). As can be noticed, the complete chain is evaluated and the execution and
buffering times are calculated. The execution time is the simple sum of the time values for
source, converters and sink, while the buffering time is the sum of the buffer-write times. Finally, the
waiting time required for synchronization is calculated—the synchronization is assumed to be
done with respect to the user specification, i.e. to the requested 25 fps in the output stream, which is
given by the JCPS as the time stream JCPSt=(40,40,0,0).
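The per-quantum summary columns of Table 5 follow directly from these definitions; a trivial helper (illustrative only) makes the relation explicit:

/* Per-quantum summary values as used in Table 5 (illustrative helper). */
typedef struct {
    double exec;    /* source + converter A + converter B + sink times [ms] */
    double buffer;  /* sum of the three buffer-write times [ms]             */
    double wait;    /* synchronization slack within the period [ms]         */
} quant_summary_t;

quant_summary_t summarize_quant(double t_src, double t_buf1, double t_decA,
                                double t_buf2, double t_encB, double t_buf3,
                                double t_sink, double period_ms)
{
    quant_summary_t s;
    s.exec   = t_src + t_decA + t_encB + t_sink;
    s.buffer = t_buf1 + t_buf2 + t_buf3;
    /* e.g. quantum 1 of Table 5: 40 - 29.7 - 1.0 = 9.2/9.3 ms (rounded values) */
    s.wait   = period_ms - s.exec - s.buffer;
    return s;
}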
Finally, the JCPS is calculated for each element of the hypothetical conversion chain. The time
and size streams are both given in Table 6 (analogously to Table 5). The time JCPSs are also
calculated for the execution, buffering and wait values. It can easily be noticed that the
addition of the respective elements of these three time JCPSs will not give the time JCPS
specified by the user. Only the period size (T) can be added. The other attributes (D, τ, t0) have
to be calculated in a different, yet unknown, way (as was mentioned in section VII.1.2 and denoted
by the “+” operator).
Summarizing, JCPS alone is not enough for scheduling the converter graphs, because of the
influence of the data on the converter graph behavior, which is not considered in the JCPS model as
mentioned above. Thus additional MD are required, which might be analogous to trace
information. The trace data are defined as statistical data coming from the analysis of the execution
recorded for each quantum, which is the most suitable for predicting a repetitive execution but
rather expensive. Moreover, JCPS, due to its unawareness of the media data, is not as good as a
model based on trace information.
Thus, the goal could be to minimize the trace information and encapsulate it in the MD set.
Another possible solution is to define the MD set separately in such a way that a schedule of the
processing analogous to the trace could be calculated.
Moreover, the JCPS should be extended by the target frequency of the processed data, because
specifying only the period sizes, as originally proposed, will not provide the expected
synchronous output. Therefore, the following definition of the time JCPS is given:

Time stream:    JCPSt = (T, D, τ, t0, F)    (40)

where the additional parameter F represents the target frequency of the events delivering the
quanta.
Sequence: MOBILE_CIF (356x288x25fps)    Period (1/fps, ms): 40    Number of frames: 15

Quant | Source Read    | Buffer         | Conv. A Decode  | Buffer          | Conv. B Encode  | Buffer         | Sink Store     | Exec  | Buffer | Sync (Wait)
 No   | Time  Data Out | Time  Data In  | Time   Data Out | Time   Data In  | Time   Data Out | Time  Data In  | Time  Data Out | Time  | Time   | Time
      | [ms]  [kb]     | [ms]  [kb]     | [ms]   [kb]     | [ms]   [kb]     | [ms]   [kb]     | [ms]  [kb]     | [ms]  [kb]     | [ms]  | [ms]   | [ms]
  1   | 2.23  44.6     | 0.36  44.6     | 8.03   1201.5   | 0.67   1201.5   | 17.67  35.6     | 0.02  35.6     | 1.8   35.6     | 29.7  | 1.0    | 9.2
  2   | 0.74  14.9     | 0.12  14.9     | 8.32   1201.5   | 0.67   1201.5   | 18.31  11.9     | 0.01  11.9     | 0.6   11.9     | 28.0  | 0.8    | 11.2
  3   | 1.49  29.7     | 0.24  29.7     | 9.07   1201.5   | 0.67   1201.5   | 19.96  23.8     | 0.01  23.8     | 1.2   23.8     | 31.7  | 0.9    | 7.4
  4   | 0.76  15.1     | 0.12  15.1     | 8.34   1201.5   | 0.67   1201.5   | 18.35  12.1     | 0.01  12.1     | 0.6   12.1     | 28.1  | 0.8    | 11.1
  5   | 1.34  26.7     | 0.21  26.7     | 9.01   1201.5   | 0.67   1201.5   | 19.83  21.4     | 0.01  21.4     | 1.1   21.4     | 31.2  | 0.9    | 7.9
  6   | 0.83  16.5     | 0.13  16.5     | 8.21   1201.5   | 0.67   1201.5   | 18.07  13.2     | 0.01  13.2     | 0.7   13.2     | 27.8  | 0.8    | 11.4
  7   | 2.08  41.6     | 0.33  41.6     | 9.03   1201.5   | 0.67   1201.5   | 19.86  33.3     | 0.02  33.3     | 1.7   33.3     | 32.6  | 1.0    | 6.4
  8   | 0.89  17.8     | 0.14  17.8     | 8.18   1201.5   | 0.67   1201.5   | 17.99  14.3     | 0.01  14.3     | 0.7   14.3     | 27.8  | 0.8    | 11.4
  9   | 1.63  32.7     | 0.26  32.7     | 9.01   1201.5   | 0.67   1201.5   | 19.81  26.1     | 0.01  26.1     | 1.3   26.1     | 31.8  | 0.9    | 7.3
 10   | 0.74  14.9     | 0.12  14.9     | 8.23   1201.5   | 0.67   1201.5   | 18.10  11.9     | 0.01  11.9     | 0.6   11.9     | 27.7  | 0.8    | 11.6
 11   | 1.19  23.8     | 0.19  23.8     | 9.05   1201.5   | 0.67   1201.5   | 19.91  19.0     | 0.01  19.0     | 1.0   19.0     | 31.1  | 0.9    | 8.0
 12   | 0.89  17.8     | 0.14  17.8     | 8.09   1201.5   | 0.67   1201.5   | 17.79  14.3     | 0.01  14.3     | 0.7   14.3     | 27.5  | 0.8    | 11.7
 13   | 2.52  50.5     | 0.40  50.5     | 9.01   1201.5   | 0.67   1201.5   | 19.81  40.4     | 0.02  40.4     | 2.0   40.4     | 33.4  | 1.1    | 5.5
 14   | 0.80  15.9     | 0.13  15.9     | 8.21   1201.5   | 0.67   1201.5   | 18.07  12.8     | 0.01  12.8     | 0.6   12.8     | 27.7  | 0.8    | 11.5
 15   | 1.95  39.0     | 0.31  39.0     | 9.03   1201.5   | 0.67   1201.5   | 19.86  31.2     | 0.02  31.2     | 1.6   31.2     | 32.4  | 1.0    | 6.6
Total | 20.1  401.3    | 3.2   401.3    | 128.8  18022.5  | 10.0   18022.5  | 283.4  321.1    | 0.2   321.1    | 16.1  321.1    | 448.3 | 13.4   | 138.3

Table 5. Processing time consumed and amount of data produced by the example transcoding chain for the Mobile (CIF) video sequence.

         | Source        | Buffer        | A             | Buffer        | B             | Buffer        | Sink          | Exec   | Buffer | Wait
         | JCPSt  JCPSs  | JCPSt  JCPSs  | JCPSt  JCPSs  | JCPSt  JCPSs  | JCPSt  JCPSs  | JCPSt  JCPSs  | JCPSt  JCPSs  | JCPSt  | JCPSt  | JCPSt
T / S    | 1.3    26.8   | 0.2    26.8   | 8.6    1201.5 | 0.7    1201.5 | 18.9   21.4   | 0.0    21.4   | 1.1    21.4   | 29.9   | 0.9    | 9.2
D / M    | 0.7    14.9   | 0.1    14.9   | 8.0    1201.5 | 0.7    1201.5 | 17.7   11.9   | 0.0    11.9   | 0.6    11.9   | 27.5   | 0.8    | 5.5
τ / σ    | 0.9    17.8   | 0.1    17.8   | 0.0    0.0    | 0.0    0.0    | 0.0    14.2   | 0.0    14.2   | 0.7    14.2   | 0.0    | 0.2    | 4.0
t0 / s0  | -1.3   -25.1  | -0.2   -25.1  | -0.8   0.0    | 0.0    0.0    | -1.8   -20.1  | 0.0    -20.1  | -1.0   -20.1  | -3.8   | -0.2   | 0.0

Table 6. The JCPS calculated for the respective elements of the conversion graph from Table 5.
VII.1.6. Operations on Media Streams
Media integration (multiplexing)
The multiplexing (also called merging or muxing) of media occurs when two or more
conversion chains are joined by one converter, i.e. when the converter accepts two or more
inputs. In such a case, the problem of synchronization occurs (see later). A typical example
is the multiplexing of audio and video into one synchronous transport stream.
Media demuxing
Demuxing (or demultiplexing) is the inverse operation to muxing. It allows for
separating each media-specific stream from the interleaved multimedia stream. An important
element of the demuxing is the assignment of time stamps to the media quanta in order to allow
synchronization of the media (see later). A typical example is decoding from a multiplexed
stream in order to display the video and audio together.
Media replication
The replication is a simple copy of the (multi)media stream such that the input is mapped to
multiple (at least two) outputs by copying the exact content of the stream. Here, no other
special functionality is required.
Both demuxing and replication are considered one-input, many-outputs converters
according to the converter model defined previously (depicted in Figure 49), and respectively
muxing is considered a many-inputs, one-output converter.
VII.1.7. Media Data Synchronization
The synchronization problem can be solved by using a digital phase-lock loop (DPLL)⁹¹ in
two ways: 1) employing the buffer fullness flag and 2) using time stamps [Sun et al., 2005]
(Chapter VI). The second technique has the advantage of allowing the asynchronous execution
of the producer and the consumer. These two techniques are usually applied in the network area
91 A DPLL is an apparatus/method for generating a digital clock signal which is frequency- and phase-referenced to an external digital data signal. The external digital data signal is typically subject to variations in data frequency and to high-frequency jitter unrelated to changes in the data frequency. A DPLL may consist of a serial shift register receiving digital input samples, a stable local clock signal supplying clock pulses and thus driving the shift register, and a phase corrector circuit adjusting the phase of the regenerated clock to match the received signal.
between the transmitter and the receiving terminals, however they are not limited just to them. Thus,
a global timer, representing the time clock of the media stream, and time stamps assigned
to the processed quanta should be applied in the conversion graph (in analogy to the solution
described for MPEG in [Sun et al., 2005]⁹²). These elements are required, but they are not
sufficient within the conversion graph. Additionally, the target frequencies of all media must be
considered in order to define the synchronization points, which can be calculated integer-based via the
least common multiple (LCM) and the greatest common divisor (GCD)⁹³—otherwise, the
synchronization may include some minor rounding error. If multiple streams are considered
where just a minor difference in quantum frequency exists (e.g. 44.1 kHz vs. 48 kHz, or 25 vs.
24.99 fps), it may be desirable to synchronize more often than the integer-based GCD approach
allows, and then the avoidance of rounding errors is not possible.
In order to explain this more clearly, let us assume that there are two streams, video and audio.
The video should be delivered with a frame rate of 10 fps, which means a target frequency of
10 Hz. The audio should be delivered with a sampling rate of 11.025 kHz (standard phone
quality). However, the stored source data have a QoD higher than the requested QoD, which is
achieved due to the higher frame rate (25 fps) and higher sampling rate (44.1 kHz). The original data
are synchronized at every frame and at every 1764th sample, i.e. GCD(25, 44100)=25, and
25/25=1 and 44100/25=1764. In contrast, the produced data can be integer-synchronized at
every 2nd frame and every 2205th sample, i.e. GCD(10, 11025)=5, and 10/5=2 and
11025/5=2205. If fractional synchronization were used such that the video is synchronized
every frame, then the audio would have to be synchronized exactly in the middle between samples 1102
and 1103 (2205/2=1102.5).
For the other examples mentioned previously, GCD(44100, 48000) is equal to 300, meaning integer
synchronization at every 147th and every 160th sample of the two audio streams respectively (i.e. every
1/300 s), and for the two video streams the GCD of the frame rates (24.99 fps and 25 fps, expressed as
2499 and 2500 in units of 0.01 fps) amounts to 1, meaning integer synchronization only at every 2499th
and 2500th frame respectively (i.e. every 100 s).
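A small helper illustrating the integer-based calculation of synchronization points (a sketch only; it simply applies the GCD rule from the example above to two quantum frequencies given as integers):

/* Integer-based synchronization points for two streams with quantum
 * frequencies f1 and f2 (e.g. frames per second and samples per second).
 * Sketch only; it follows the GCD rule used in the example above. */
typedef struct {
    long every_1;   /* synchronize at every n-th quantum of stream 1 */
    long every_2;   /* synchronize at every n-th quantum of stream 2 */
} sync_points_t;

static long gcd(long a, long b)
{
    while (b != 0) { long t = a % b; a = b; b = t; }
    return a;
}

sync_points_t sync_points(long f1, long f2)
{
    long g = gcd(f1, f2);
    sync_points_t s = { f1 / g, f2 / g };   /* both correspond to 1/g seconds */
    return s;
}

/* Example: sync_points(10, 11025) yields {2, 2205}, i.e. every 2nd frame of
 * the 10 fps video coincides with every 2205th sample of the 11.025 kHz audio. */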
92 MPEG-2 allows for two types of time stamps (TS): decoding (DTS) and presentation (PTS). Both are based on the system time clock (STC) of the encoder, which is analogous to the global timer. The STC of MPEG-2 has a constant frequency of 27 MHz and is also represented in the stream by program clock references (PCR) or system clock references (SCR).
93 In mathematics, the LCM is also known as the smallest common multiple (SCM) and the GCD is also called the greatest common factor (GCF) or highest common factor (HCF).
VII.2. Real-time Issues in Context of Multimedia Processing
VII.2.1. Remarks on JCPS – Data Influence on the Converter Description
JCPS is proposed to be used for modeling the converters applicable in timed-media
processing. However, it was found empirically that for different media data (two different
video or audio streams), the same converter requires the specification of different values according
to the JCPS definition. Moreover, the difference exists in both the time and the size specifications
of the JCPS (Table 7).
                    T / S     D / M     τ / σ     t0 / s0
source JCPSt          133       133         0          0
source JCPSs        67584     67584         0          0
mobile_cif JCPSt      136        90      1686         50
mobile_cif JCPSs    54147     39344    799983      48273
tempete_cif JCPSt     134        87      1637         47
tempete_cif JCPSs   35852     24418    501602      34227

Table 7. JCPS time and size for the LLV1 encoder.
To depict the difference between the calculation backgrounds of the above JCPSs, two graphs
are shown: the cumulated time in Figure 53 and the cumulated size in Figure 54. The values
are cumulated over the number of the frame being processed for the first 20 frames.
[Figure: cumulated time (0-3 s) over the first 20 frames for source, mobile_cif and tempete_cif]
Figure 53. Cumulated time of source period and real processing time.
[Figure: cumulated size (0-1400 kB) over the first 20 frames for source, mobile_cif and tempete_cif]
Figure 54. Cumulated size of source and encoded data.
The Tempete and Mobile sequences used for this comparison have the same resolution and frame
rate and thus the same constant data rate (depicted as source). However, the contents of these
sequences are different. Note that JCPS has been designed under the assumption of no
data dependency.
As can be seen, the JCPSs given for the LLV1 encoder processing video sequences
with the same constant data rate and time requirements but with different contents (Tempete vs. Mobile)
have different values in both the size and the time streams. So, if no data dependency existed,
the JCPS for both time and size would be the same for one converter. However, this
is not the case, and especially the difference in size is noticeable (Figure 54).
VII.2.2. Hard Real-time Adaptive Model of Media Converters
Due to these problems with the pure application of JCPS in the conversion graph, another
solution has been investigated. It is based on imprecise computations and a hard-real-time
assumption, and thus it is called within this work the hard real-time adaptive (HRTA) converter model. As
mentioned in the Related Work (subsection III.2.2.2), imprecise computation allows
for scalability in the processing time by delivering less or more accurate results. If it is
applied such that the minimum time for calculating the lowest quality acceptable (LQA) [ITU-T
Rec. X.642, 1998] is guaranteed, some interesting applications to multimedia processing can
be recognized.
For example, frame skips need not occur at all in the imprecise computation model, i.e. when
the decoding/encoding is designed according to the model, the minimal quality of each frame
can be computed at lower cost and then enhanced according to the available resources, and this
minimal quality can be guaranteed 100%. Moreover, such a model does not require
converter-specific time buffers (besides those used for the reference quanta, which are needed in
all processing models) and does not introduce any additional initial processing delay besides the one
required by the data dependency in the conversion graph (besides the graph-specific time buffer,
which is present anyway in all processing models). As a result, the model allows for average-case
allocation with LQA guarantees and no frame drops, thus achieving a higher perceived
quality of the decoded video, which is an advantage over the all-or-nothing model where in the case
of a deadline miss the frame is skipped.
The hard real-time adaptive model of media converters uses quality-assuring scheduling
(QAS) but is not the same thing. The difference is that the model proposes how to design the
media converter if imprecise computations are to be included, whereas QAS is a tool for
mapping from the converter model to the system implementation.
The HRTA converter model is defined as:

CHRTA = (CM, CO, CD)    (41)

where CHRTA denotes a converter supporting adaptivity and hard real-time constraints, CM
defines the mandatory part of the processing, CO specifies the optional part of the algorithm, and CD
is a delivery part providing the results to the next converter (also called the clean-up step).
In order to provide LQA, CM and CD are always executed, contrary to CO, which is executed
only when idle resources are available. CM is responsible for processing the media data and
coding them according to the LQA definition. CO enhances the results either by repeating the
calculation with higher precision on the same data or by operating on additional data and then
improving the results or calculating additional results. CD decides which results to
deliver, i.e. if no CO has been executed then the output of CM is selected, otherwise the output of
CO is integrated with the output of CM, and finally the results are provided as the output of the
converter.
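The following C sketch illustrates one possible shape of an HRTA converter according to definition (41). All types and names are hypothetical and only mirror the mandatory/optional/delivery split described above.

/* Hypothetical sketch of an HRTA converter for one quantum: mandatory part,
 * optional part (run only on idle resources), and delivery/clean-up part. */
#include <stdbool.h>
#include <stddef.h>

typedef struct quantum quantum_t;            /* one media quantum (opaque)     */

typedef struct {
    quantum_t *base;                         /* LQA result produced by C_M     */
    quantum_t *enhanced;                     /* refined result produced by C_O */
    bool       has_enhanced;
} hrta_result_t;

typedef struct {
    /* C_M: always executed, guarantees the lowest quality acceptable (LQA). */
    quantum_t *(*mandatory)(const quantum_t *in);
    /* C_O: executed only on idle resources; may refine the mandatory result. */
    quantum_t *(*optional)(const quantum_t *in, quantum_t *base);
    /* C_D: always executed, integrates and hands the result to the successor. */
    void       (*deliver)(const hrta_result_t *r, void *next_buffer);
} hrta_converter_t;

void hrta_process(const hrta_converter_t *c, const quantum_t *in,
                  void *next_buffer, bool idle_resources_available)
{
    hrta_result_t r = { NULL, NULL, false };
    r.base = c->mandatory(in);                       /* C_M */
    if (idle_resources_available && c->optional) {   /* C_O */
        r.enhanced = c->optional(in, r.base);
        r.has_enhanced = (r.enhanced != NULL);
    }
    c->deliver(&r, next_buffer);                     /* C_D */
}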
The hard real-time adaptive model of a media converter is flexible and still supports the quant-based all-or-nothing method, i.e. it can be applied at different levels of processing quality such
as dropping information with different granularity. A quantum, regardless of whether it is an audio
sample or a video frame, can be dropped completely in a way that CM is empty—no processing
is defined—, CO does all the processing, and CD simply delivers the output of CO (if it completed
its execution) or informs about the quantum drop (and no LQA is provided). However, the
advantage of the model would be wasted in such a case, because its biggest benefit is the possibility
to influence the conversion algorithm during the processing of a quantum and gain partial results,
thus raising the final quality.
For example, a video frame may be coded only partially by including only a subset of the macro
blocks, while the remaining set of MBs is dropped. In other words, the data dropping is
conducted on the macro-block level. This obviously raises the quality because instead of having
0% of the frame (frame skipped) and losing some resources, the result includes something
between 0% and 100% and no resources are lost.
VII.2.3. Dresden Real-time Operating System as RTE for RETAVIC
The requirement of real-time transformation and QoS control enforces embedding the
conversion graph in a real-time environment (RTE). A reliable RTE can be provided by a
real-time operating system, because then reservations of processing time and of resources can
be guaranteed. A suitable system is the Dresden Real-Time Operating System (DROPS), which
aims at supporting real-time applications with different timing modes. Moreover, it supports
both timesharing (i.e. best-effort) and real-time threads running on the same system, where the
timesharing threads do not influence the real-time threads [Härtig et al., 1998] and are allowed
to execute only on unreserved (idle) resources.
Architecture
DROPS is a modular system (Figure 55) and is based on the Fiasco microkernel offering event-triggered or time-triggered executions. Fiasco is an L4-based microkernel providing fine-grained as well as coarse-grained timer modes for its scheduling algorithm. The fine-grained
timer mode (called one-shot mode) has a precision of 1 µs and thus is able to produce a timer
interrupt for every event at the exact time it occurs. In contrast, the coarse-grained timer mode
generates interrupts periodically and is the default mode. The interrupt period has a granularity
of roughly 1 ms (976 µs), which might be unacceptable for applications
demanding small periods. On the other hand, the fine-grained timer may introduce
additional switching overhead. There are three possible clock types:
• RTC – the real-time clock generates timer interrupts on IRQ8 and is the default mode
• PIT – it generates timer interrupts analogously to the RTC but works on IRQ0; it is advised for use with VMWare machines and does not work with profiling
• APIC – the most advanced mode using the APIC timer, where the next timer interrupt is computed each time for the scheduling context
The RTC and PIT allow only for coarse-grained timers, and only the APIC mode allows for
both timer modes. Thus if the application requires precise scheduling, the APIC mode with the
one-shot option has to be chosen.
Figure 55. DROPS Architecture [Reuther et al., 2006].
Fiasco uses non-blocking synchronization, ensuring that higher-priority threads do not block
waiting on lower-priority threads or the kernel (avoiding priority inversion), and supports static
priorities allowing fully preemptible executions [Hohmuth and Härtig, 2001]. In comparison to
RTLinux, another real-time operating system, Fiasco can deliver a smaller response time on
interrupts for event-triggered executions, such that the maximum response time may be
guaranteed [Mehnert et al., 2003]. Additionally, a real-time application may be executed at a given
scheduled point in time (time-triggered), where DROPS grants the resources to the given
thread, controls the state of the execution and interacts with the thread according to the
scheduled time point, thus allowing to ensure QoS control and accomplishment of the task.
Scheduling
Quality-assuring scheduling (QAS) [Hamann et al., 2001a] is adopted as the scheduling
algorithm for DROPS. The implementation of QAS works such that the preemptive tasks are
ordered according to priorities: from all threads ready to run in the given scheduling
period, always the thread with the highest priority is executed. If a thread with a higher priority
than the current one becomes ready, i.e. the next period for the given real-time thread occurs, the
current thread is stopped (preempted) and the new thread is assigned to the processor. Among
threads with the same priority in the same period a simple round-robin scheduling algorithm is
applied. So, a time quantum and a priority must be assigned to each thread in order to
control the scheduling algorithm on the application level, and if one or both of them are not
assigned, the default values are used. Moreover, a thread with default values is treated as a timesharing (i.e. non-real-time) thread with the lowest priority and no time constraints.
Real-time thread model
A periodic real-time thread in the system is characterized by its period and arbitrarily many
timeslices. The timeslices can be classified into mandatory ones, which always have to be executed,
and optional ones, which improve the quality but can be skipped if necessary. Any
combination of mandatory and optional timeslices is imaginable, as the type of a timeslice is
only subject to the programmer of the thread on the application level, and not the kernel level. Each
timeslice has two required properties: length and priority. The intended summed length of the
timeslices together with the length of the period makes up the thread's reserved context as
shown in Figure 56. In other words, the reserved context has to have the deadline (i.e. the end of
the period) and the intended end of each timeslice defined (if there are three timeslices then three
timeslice ends are defined for each period).
Figure 56. Reserved context and real events for one periodic thread.
Nevertheless, the work required by one timeslice might consume more time than estimated, and
the scheduled thread does not finish the work by the given end of the timeslice, i.e. the timeslice exceeds
the reserved time. The kernel is able to recognize such an event and in reaction it sends a timeslice
overrun (TSO) IPC message connected with the thread's timeslice to the preempter thread
(Figure 57), which runs in parallel to the working thread of the application at the highest
priority. Similarly, the kernel is able to recognize a deadline miss (DLM) and to send the
respective IPC call. A deadline miss happens to a thread in the case where the end of the
period is reached and the thread has not signaled to the kernel that it is waiting for the end of the
period. Normally the kernel does not communicate directly with the working thread.
The scheduling context 0 is meant as the time-sharing part and is always included in any application.
A real-time application additionally requires at least one scheduling context (numbered from 1 to
n). Then SC0 is used as the empty (idle) time-sharing part allowing for waiting for the next
period. The kernel does not distinguish between mandatory and optional timeslices, but only
between scheduling contexts, so it is the responsibility of the application developer to handle the
scheduling contexts according to the timeslice types by defining the reserved context through a
mapping from the timeslice type to the scheduling context. Thus, SC1 is understood as the
mandatory part with the highest priority, SC2…SCn are meant for the optional parts, each with a
lower priority than the previous one, and SC0 is the time-sharing part with the lowest priority.
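Putting the HRTA parts and the DROPS scheduling contexts together, a periodic converter thread could be organized as sketched below. The switch_to_context(), next_period() and optional_budget_left() calls stand in for the corresponding kernel primitives and are purely hypothetical placeholders, as are all other names; the sketch only mirrors the SC0/SC1/SC2…SCn mapping described above.

/* Hypothetical sketch of a periodic HRTA converter thread on DROPS.
 * The extern functions below are placeholders, NOT actual DROPS API names. */
enum { SC_TIMESHARE = 0, SC_MANDATORY = 1, SC_OPTIONAL_FIRST = 2 };

extern void switch_to_context(int sc);     /* placeholder: select scheduling context */
extern void next_period(void);             /* placeholder: wait for start of period  */
extern int  optional_budget_left(void);    /* placeholder: idle time still reserved? */
extern void do_mandatory_part(void);       /* C_M */
extern void do_optional_part(void);        /* C_O */
extern void do_delivery_part(void);        /* C_D */

void converter_thread(void)
{
    for (;;) {
        switch_to_context(SC_MANDATORY);   /* C_M: guaranteed LQA work          */
        do_mandatory_part();

        int sc = SC_OPTIONAL_FIRST;        /* C_O: quality refinement, optional */
        while (optional_budget_left()) {
            switch_to_context(sc++);
            do_optional_part();
        }

        switch_to_context(SC_MANDATORY);   /* C_D: integrate and hand over      */
        do_delivery_part();

        switch_to_context(SC_TIMESHARE);   /* SC0: idle until the next period   */
        next_period();
    }
}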
Figure 57. Communication in DROPS between kernel and application threads (incl.
scheduling context).
VII.2.4. DROPS Streaming Interface
The DROPS streaming interface (DSI) has been designed to support the transfer of large
amounts of timed data, which is typical for multimedia applications. In particular, controlling
facilities enabling real-time applications and an efficient data transfer are provided. The DSI
implements a packet-oriented zero-copy transfer where the traffic model is based on JCPS. It
allows for sophisticated QoS control by supporting non-blocking transport, synchronization,
time-limited validity, data (packet) dropping, and resynchronization. Other RTOSes like QNX
[QNX, 2001] and RTLinux [Barabanov and Yodaiken, 1996] do not provide such a concept for
data streaming.
[Figure: a control application initiates a stream (1. Initiate Stream) and assigns it to a producer and a consumer (2. Assign Stream); the producer and the consumer then exchange data via a shared data area and a shared control area (3. Signaling & Data Transfer)]
Figure 58. DSI application model (based on [Reuther et al., 2006]).
The DSI application model is depicted in Figure 58 in order to explain how the DSI is used in
real-time interprocess communication [Löser et al., 2001b]. Three types of DROPS servers
holding socket references are involved: the control application, the producer, and the consumer. The zero-copy data
transfer is provided through shared memory areas (called streams), which are mapped to the
servers participating in the data exchange—one as producer and the other as consumer. A
control application sets up the connection by creating the stream in its own context and then
assigning it to the producer and the consumer through socket references. The control
application initiates but is not involved in the data transfer—the producer and the consumer are
responsible for arranging the data exchange independently of the control application through
signaling employing the ordered control data area in the form of a ring buffer with a limited number
of packet descriptors [Löser et al., 2001b]. The control data kept in a packet descriptor, such as the
packet position (packet sequence number), the timestamp, and pointers to the start and end position of the
timed data, allow arranging the packets arbitrarily in the shared data area [Löser et al., 2001a].
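For illustration, a packet descriptor and control area along the lines described above might look as follows. This is a sketch only; the actual DSI structures in [Löser et al., 2001a] differ in detail and naming.

/* Sketch of a DSI-like control area: a ring of packet descriptors referring
 * into a shared data area. Field and type names are illustrative only. */
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint32_t seq_no;        /* packet position (sequence number)            */
    uint64_t timestamp;     /* presentation/decoding time of the quantum    */
    size_t   data_start;    /* offset of the packet in the shared data area */
    size_t   data_end;      /* end offset (exclusive)                       */
} packet_descriptor_t;

#define RING_SIZE 64        /* limited number of descriptors (example)      */

typedef struct {
    packet_descriptor_t ring[RING_SIZE];
    volatile uint32_t head; /* next descriptor to be filled by the producer */
    volatile uint32_t tail; /* next descriptor to be read by the consumer   */
} control_area_t;           /* mapped into both producer and consumer       */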
The communication can be done in blocking or non-blocking mode [Löser et al., 2001b].
The non-blocking mode must rely on correct scheduling, the virtual time mechanism, and
polling or signaling in order to avoid data loss, but the communication does not need the
IPC overhead of the un-/blocking mechanism and thus is a few times faster than the blocking
method. On the other hand, the blocking mode handles situations in which the required data is
not yet available (empty buffer) or the processed data cannot be sent (full buffer). Additionally,
the DSI supports handling data omission [Löser et al., 2001a] e.g. when frame dropping in video
processing occurs.
Summarizing, the DSI provides the required streaming interface for the real-time processing of
multimedia data. It delivers an efficient continuous data transfer avoiding the costly copy
operations between processes, which is critical in the RETAVIC architecture. It is delivered
through the fastest static mapping in the non-blocking mode, as opposed to the slower blocking mode.
The possibility of using dynamic mapping is put aside due to its higher costs, which are linear in the
packet size of the transferred data [Löser et al., 2001b].
VII.2.5. Controlling the Multimedia Conversion
Most of the existing software implementations of media converters (e.g. XVID,
FFMPEG) have not been designed for any RTOS but rather for wide-spread best-effort
systems such as Microsoft Windows or Linux. Thus there is no facility for controlling the
converter during the processing with respect to the processing time. The only exception is stopping
the work of the converter; however, this is not a controlling facility in our understanding. By
the controlling facility the minimal interface implementation required for the interaction
between the RTOS, the processing thread and the control application is meant. A proposal of such a
system was given by [Schmidt et al., 2003], thus it is only shortly explained here and the problem
is not further investigated within the RETAVIC project.
VII.2.5.1 Generalized control flow in the converter
The converters differ from each other in the meaning of their processing functions, but their control
flow can be generalized. The generalization of the control flow has been proposed in [Schmidt
et al., 2003] and is depicted in Figure 59. Three parts are distinguished: pre-processing,
control loop and post-processing. The pre-processing allocates memory for the data structures
on the media stream level and arranges the I/O configuration, while the post-processing frees the
allocated memory and cleans up the created references. The control loop iterates over all
incoming media quanta and produces other quanta. In every iteration of the loop a processing function
is executed on the quantum taken from the buffer. This processing function does the
conversion of the media data, and then the output of the function is written to the buffer of the
next converter.
[Figure: pre-processing, followed by a control loop (read a quant – processing function – write a quant), followed by post-processing]
Figure 59. Generalized control flow of the converter [Schmidt et al., 2003].
The time constraints connected with the processing have to be considered here. Obviously, the
processing function occupies most of the time, however the other parts cannot be neglected
(as was already shown in Table 5). The stream-related pre- and post-processing is assumed
to be executed only once during the conversion chain setup. If there were some quant-related
pre- and post-processing, it would become a part of the processing function. The time
constraints are considered in the scheduling of the whole conversion chain in cooperation with
the OS.
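The generalized control flow of Figure 59 translates directly into code; the following sketch (illustrative names only, error handling omitted) shows the shape such a converter wrapper could take:

/* Sketch of the generalized converter control flow (Figure 59). */
#include <stdbool.h>

typedef struct quantum quantum_t;
typedef struct converter_state converter_state_t;

extern converter_state_t *pre_processing(void);             /* stream-level setup     */
extern bool  read_quant(quantum_t **q);                     /* false at end of stream */
extern quantum_t *processing_function(converter_state_t *s, quantum_t *in);
extern void  write_quant(quantum_t *out);                   /* to the next buffer     */
extern void  post_processing(converter_state_t *s);         /* free memory, clean up  */

void run_converter(void)
{
    converter_state_t *state = pre_processing();
    quantum_t *in;
    while (read_quant(&in)) {                    /* control loop over all quanta */
        quantum_t *out = processing_function(state, in);
        write_quant(out);
    }
    post_processing(state);
}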
VII.2.5.2 Scheduling of the conversion graph
Each converter uses resources of the underlying hardware, and to provide guarantees for timely
execution, the converters must be scheduled in the RTE. In general, the scheduling process can be
divided into multiple steps, starting with a user request and ending with a functionally correct
and successfully scheduled conversion chain. The algorithm of the whole scheduling process,
proposed in the cooperative work on memo.REAL [Märcz et al., 2003], is depicted in Figure 60.
Construct the conversion graph
The first step of the scheduling process creates the conversion graph for the requested
transformation. In this graph, the first converter accepts the format of the source media objects
from the source, and the last converter produces the destination format requested by the user
through the sink. Moreover, each successive pair of converters must be compatible so that the
output produced by the producer is accepted as input by the consumer i.e. there is a functionally
correct coupling.
[Figure: scheduling process carried out by the media server in interaction with the resource managers (RTOS)]
Figure 60. Scheduling of the conversion graph [Märcz et al., 2003].
There can be several functionally correct conversion graphs, since some of the
conversions can be performed in arbitrary order or there exist converters with different behavior
but the same processing function. The advantage of this is that the graph with the lowest
resource consumption can be chosen. In fact, some of the graphs may request more resources
than available, and thus cannot be considered for scheduling even if they are functionally
correct.
A method for constructing a transcoding chain from the pool of available converters has been
proposed by [Ihde et al., 2000]. The algorithm uses the capability advertisement (CA) of the
converter represented in a simple Backus/Naur Form (BNF) grammar and allows for media type
transformations. However, the approach does not consider the performance, i.e. the quantitative
properties of the converters, thus no distinction is made between two functionally equal chains.
Predict quant data volume
Secondly, a static resource planning is done individually for various resource usages based on a
description of the resource requirements of each converter. This yields the data volume that a
converter turns over on each resource when processing a single media quant. Examples are the
data volumes that are written to disk, sent through the network, or calculated by a processing
function.
Calculate bandwidth
In the third step, resource requirements in terms of bandwidth are calculated with the help of
the quant rate (e.g. frame, sample, or GOP rate). The quant rate is specified by the client
request, by the media object format description, and by the function of each converter. The
output of this step is the set of bandwidths required on each resource, which may be calculated as shown below in Equation (44).
Check and allocate the resources
Fourth, the available bandwidth of each resource is inquired. With this information, a feasibility
analysis of all conversion graphs is performed:
a) Each conversion graph is tested as to whether the available bandwidth on all resources suffices to fulfill the requirements calculated in the 3rd step. If this is not the case, the graph must be discarded.
b) Based on the calculated bandwidth (from the 3rd step), the runtime for each converter is computed using the data volume processed for a single quant. If the runtime goes beyond the limits defined by the quant rate, the graph is put aside.
c) The final part of the feasibility analysis is to calculate the buffer sizes e.g. according to
[Hamann et al., 2001b] where the execution time follows the JCPS model. If some
buffer sizes emerge as too large for current memory availability, the whole feasibility
analysis must be repeated with another candidate graph.
The details of the feasibility analysis in bandwidth-based scheduling can be found in [Märcz and Meyer-Wegener, 2002]. The available bandwidth of the resource (B_R) is defined with respect to the resource capacity in terms of the maximum data rate (C_R) and the resource utilization (η_R) such that:
B_R = (1 − η_R) ⋅ C_R    (42)
The data volume s_R turned over on the resource for a given quant is measured as [Märcz and Meyer-Wegener, 2002]:

s_{R,i} = C_R ⋅ t_{R,i} = D_{R,i} ⋅ t′_{R,i}    (43)

where t_{R,i} is the time required by the processing function for the i-th quant when the converter uses the resource R exclusively, so that the used resource bandwidth D_{R,i} occupies 100% of the resource. In case of parallel execution with a lower used resource bandwidth (i.e. a percentage below 100%), the longer time t′_{R,i} on the resource is required for the i-th quant.
The required bandwidth Q_R on each resource can be calculated using the desired output quant rate per second of the converter (f_rate) as [Märcz and Meyer-Wegener, 2002]:

Q_R = S(s_R) ⋅ f_rate    (44)

where S is the average size as a function of the given data volume processed on the specific resource (in analogy to the first parameter of JCPSs).
Finally, the feasibility check of the bandwidth allocation is done according to [Märcz and Meyer-Wegener, 2002]:

Σ_x Q_{R,x} ≤ B_R    (45)

where the required bandwidth on the given type of resource R is summed over all x converters and must not exceed the available bandwidth B_R of that resource (i.e. overlapping of bandwidth reservations is not allowed).
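As an illustration, the bandwidth part of the feasibility analysis, i.e. the check of Equation (45) against the available bandwidth of Equation (42), could look as follows. The container layout and function names are assumptions made only for this sketch; the runtime and buffer-size checks of steps b) and c) are not covered here.

```cpp
// Sketch of the bandwidth feasibility check of Equations (42)-(45).
#include <map>
#include <string>
#include <vector>

struct Resource {
    double capacity;     // C_R: maximum data rate of the resource
    double utilization;  // eta_R: current utilization in [0,1]
    double available() const { return (1.0 - utilization) * capacity; }  // (42): B_R
};

// Q_R of one converter on one resource, Eq. (44): average quant size times quant rate
double required_bandwidth(double avg_quant_size, double quant_rate) {
    return avg_quant_size * quant_rate;
}

// Eq. (45): the required bandwidths of all converters on a resource must not
// exceed the available bandwidth B_R of that resource.
bool graph_is_feasible(const std::map<std::string, Resource>& resources,
                       const std::map<std::string, std::vector<double>>& demands) {
    for (const auto& [name, res] : resources) {
        double sum = 0.0;
        if (auto it = demands.find(name); it != demands.end())
            for (double q : it->second) sum += q;
        if (sum > res.available()) return false;  // bandwidth overbooking not allowed
    }
    return true;
}
```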
In case no graph passes all the tests, the client has to be involved to decide on the measures to be taken. For example, it might consider lowering the QoS requirements. In the better case of more than one graph being left, any of them can be selected using different criteria; minimum resource usage would be an obvious criterion. Finally, the bandwidth-based resource reservations should be conveyed to the resource manager of the operating system.
Summarizing, the result of the scheduling process is a set of bandwidths (dynamic resources) and buffer sizes (static resources) required to run the converters with a guaranteed Quality of Service. To avoid resource scheduling by worst-case execution time, one of the already discussed models could be exploited.
VII.2.5.3 Converter's time prediction through division into function blocks
The time required by the processing function, which was mentioned in Equation (43), can be measured physically on the running system. However, this time depends on the content of the video sequences (section VII.2.1), so the measured time may vary from sequence to sequence even when the measurement is done on a per-quant basis. Here, the proposed static meta-data may be very helpful. The idea is to split the converter into basic execution blocks (also called function blocks [Meyerhöfer, 2007]) which process subsets of the given quant (usually iteratively in a loop) and have certain input and output data; for video, the split can be done on the macro-block and then on the block level, e.g. by detailing the LLV1 algorithm in Figure 13 (on p. 101) or the MD-XVID algorithm in Figure 17 b) (on p. 113). The edges of the graphs should represent the control flow rather than the data flow, since the focus is on the processing times; however, the data flow of the multimedia data should not be neglected due to the accompanying transfer costs and buffering, which also influence the final execution time. Moreover, the data dependency between subsequent converters must be considered.
It is natural in media processing that the defined function blocks can internally have multiple alternative and completely separated execution paths. Of course, it may happen that only one execution path exists, or that a nothing-to-do execution path is included. In any case, the execution paths are contextual, i.e. they depend on the processed data, and the decision about which path to take is made based on the current context values of the data. To prove this assumption, the different execution times for different frame types of a given function block should be measured on the same platform (with the statistical errors eliminated).
Additionally, the measurement could deliver the platform-specific factors usable by the
processing time calculation and prediction if executed for the same data on different platforms.
Some details on calculating and measuring the processing time are given in the upcoming
section VII.3 Design of Real-Time Converters.
VII.2.5.4 Adaptation in processing
If any mismatch exists between the predicted time and the actually consumed time, i.e. when the reserved time for the processing function is too small or too big, adaptation in the converter processing can be employed. In other words, the adaptation in the processing during execution is understood as a remedy for the imprecise prediction of the processing time.
Analogically to the hard real-time adaptive model of the media converter, the converter including its processing function has to be reworked such that the defined parts of the HRTA converter model are included. Moreover, a mechanism for coping with overrun situations has to be included in the HRTA-compliant media converter.
VII.2.6. The Component Streaming Interface
The separation of the processing part of the control loop from the OS environment and any data streaming functionality was the first requirement of the conversion graph design. The other one was to make the converter's implementation OS-independent. Both goals are realized by the component streaming interface (CSI) proposed by [Schmidt et al., 2003]. The CSI is an abstraction implementing a multimedia converter interface that provides the converter's processing function with the media quant and pushes the produced quant back to the conversion chain for further data flow to the subsequent converter. This abstraction exempts the converter's processing function from dealing directly with an OS-specific streaming interface like DSI, as depicted in Figure 62 a). An obvious benefit of the CSI is the option to adapt a CSI-based converter to another real-time environment with a different streaming interface without changing the converter code itself.
On the converter side the CSI is implemented in a strictly object-oriented manner. Figure 61
shows the underlying class concept. The converter class itself is an abstract class representing
only basic CSI functionality. For each specific converter implementation the programmer is
supposed to override at least the processing function of the converter. Some examples of
implemented converters (filter, multiplexer) are included in the UML diagram as subclasses of
the Converter class.
[Figure: UML class diagram with ConverterChain, ConverterConnection, the abstract Converter with subclasses Filter and Multiplexer, Sender/Receiver, and JCPSendBuffer/JCPReceiveBuffer.]
Figure 61. Simplified OO-model of the CSI [Schmidt et al., 2003].
[Figure: a) a chain of CSI converters coupled through DSI; b) the control application with ConverterChain and ConverterConnections communicating via IPC with a Converter that uses JCPReceiveBuffer and JCPSendBuffer over DSI buffers.]
Figure 62. Application model using CSI: a) chain of CSI converters and b) the details of control application and converter [Schmidt et al., 2003].
The chain of converters using the CSI interface is depicted in Figure 62 a) and the details of the
application scenario showing the control application managing ConversionChain with
ConverterConnections and the CSI Converters are shown in Figure 62 b) [Schmidt et al., 2003]. The
prototypical Converters run as stand-alone servers under DROPS. In analogy to the model of DSI
application (Figure 58), a control application manages the setup of the conversion chain and
forwards user interaction commands to the affected converter applications using interprocess
communication (IPC). The ConverterConnection object is affiliated with the Converter running in the
run-time RTE and employing the JCPReceiveBuffer and JCPSendBuffer, which help to integrate the
streaming operations of multimedia quanta. The prototypical implementation allowed setting up a conversion chain and playing an AVI video, which was converted online from RGB to YUV color space [Schmidt et al., 2003].
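A rough sketch of how the class concept of Figure 61 could look in C++ is given below; the class names follow the figure, while the method names, signatures and the chain logic are assumptions for illustration and do not reproduce the actual CSI code.

```cpp
// Sketch of the object-oriented CSI concept of Figures 61/62 (illustrative only).
#include <memory>
#include <vector>

struct Quant { /* one media quant exchanged between converters */ };

// Abstract converter exposing only basic CSI functionality; a concrete
// converter is supposed to override at least the processing function.
class Converter {
public:
    virtual ~Converter() = default;
    virtual Quant process(const Quant& in) = 0;   // the processing function
};

class Filter : public Converter {                 // example subclass from the UML diagram
public:
    Quant process(const Quant& in) override { return in; /* a real filter would convert, e.g. RGB to YUV */ }
};

// The chain couples converters so that the output of one is the input of the next;
// buffering and the OS-specific streaming (e.g. DSI under DROPS) stay hidden behind CSI.
class ConverterChain {
public:
    void append(std::unique_ptr<Converter> c) { stages_.push_back(std::move(c)); }
    Quant push(Quant q) {
        for (auto& stage : stages_) q = stage->process(q);
        return q;
    }
private:
    std::vector<std::unique_ptr<Converter>> stages_;
};
```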
The CSI prototype had been selected as the environment for embedding the media converters at the beginning. However, there have been some problems with the versioning of DROPS (see the Implementation chapter), and the CSI implementation has not been maintained anymore due to the early closing of the memo.REAL project. Thus, the CSI has not been used by the prototypes explained in the later part of this work, and the DSI has been used directly. Still, the CSI concepts developed within the memo.REAL project are considered to be useful and applicable in the further implementation of the control application of the RETAVIC architecture.
VII.3. Design of Real-Time Converters
The previously introduced best-effort implementations, evaluated shortly in Evaluation of the Video Processing Model through Best-Effort Prototypes (section V.6), are the groundwork used for the design of the real-time converters. The real-time converters must be able to run in the real-time environment with time restrictions and as such have to meet certain QoS requirements to allow for interaction during processing. When porting a best-effort converter to a real-time converter, the run-time RTE and the selected processing model have to be considered already in the design phase. As the decision on selecting DROPS with DSI has been made, the converters have to be adapted to this RTOS and to its streaming interface. However, before going into the implementation details, the design decisions connected with additional functionality supporting the time constraints are investigated, and proposals to rework the converters' algorithms are given.
A systematic and scientific method of algorithm refinement is exploited during the design phase, in analogy to OS performance tuning with its understanding, measuring and improving steps [Ciliendo, 2006]. These three steps are further separated into the low-level design, delivering the knowledge about the platform-specific factors influencing processing efficiency, and the high-level design, where the mapping of the algorithms on the logical level occurs.
Obviously, the low-level design does not influence the functionality of the processing function but demonstrates the influence of the elements of the run-time environment, which include the hardware and the OS specifics. On the contrary, the high-level design does not cover any run-time-specific issue and may lead to changes of the converter's algorithm providing different functionality and thus to altered output results. Furthermore, the quantitative properties shall be investigated with respect to the processing time after each modification implementing a new concept, in order to prove whether the expected gains are achieved. This investigation is called quantitative time analysis in the performance evaluation [Ciliendo, 2006], and it distinguishes between processing time variations deriving from the algorithmic or implementation-based modifications and other time deviations related to measurement errors (or side effects); the latter can be classified into statistical ones deriving from the measurement environment and structural ones caused by the specifics of the hardware system [Meyerhöfer, 2007].
VII.3.1. Platform-Specific Factors
There are a few classes of the low-level design in which the investigation of the behavior and the capabilities of the run-time environment is conducted. These classes are called platform-specific factors and include: hardware-architecture-dependent processing time, the compiler optimizations, priorities in thread models, multitasking effects, and caching effects.
VII.3.1.1 Hardware architecture influence
The hardware architecture of the computer should first of all allow for running the software-based RTOS, and secondly, it has to be capable of coping with the requested workload. Therefore, the performance evaluation of the computer hardware system is usually conducted by well-defined benchmarks delivering a deterministic machine index (e.g. BogoMips evaluating the CPU in the Linux kernel). Since such indexes are somewhat hard to interpret in relation to multimedia processing and scheduling, and the CPU clock rate cannot say everything about the final performance, a simple video-encoding benchmark has been defined in order to deliver some kind of multimedia platform-specific index usable for the scheduling and processing prediction. In such a case, the obtained measurements should reflect not only the measure of one specific subsystem; instead, they should demonstrate the efficiency of the integrated hardware system with respect to the CPU with its L1, L2 and L3 cache sizes, the pipeline length and branch prediction, the implementation of the instruction set architecture (ISA) with SIMD processor extensions (e.g. 3DNow!, 3DNow!+, MMX, SSE, SSE2, SSE3), and other factors influencing the overall computational power. Thus the index could easily reflect the machine and allow for a simple hardware representation in the prediction algorithm.
The video encoding benchmark has been compiled once and run with the same configuration, i.e. the video data (first four frames of the Carphone QCIF), the binaries of the system modules, the binaries of the encoder, and the parameters have not been changed. The benchmark has been executed on two different test-bed machines called PC (using an Intel Mobile processor) and PC_RT (with an AMD Athlon processor), which are listed in Appendix E. The run-time environment configuration (incl. DROPS modules) is also stated in this appendix in the section Precise measurements in DROPS. The average time per frame of the four-times executed processing is depicted in Figure 63, where the minimum and maximum values are also marked.
[Figure: processing time in ms (about 7 to 10) of frames 1 to 4 for the Pentium 4 and AMD machines.]
Figure 63. Encoding time of the simple benchmark on different platforms.
It can be noticed that the AMD-based machine is faster; however, that was not the point. Based on these time measurements, an index was proposed such that the execution time of the first frame in the first run is considered as the normalization factor, i.e. the other values are normalized to this one and depicted in Figure 64. The average standard deviation counted for such an index with respect to the specific platform and specific frame is smaller than 0.7%, which is interpreted as the measurement error of multiple runs (and may be neglected). However, the average standard deviation counted with respect to the given frame but covering different platforms is equal to 6.1% for all frames and 8.0% for only the predicted frames; such a significant deviation can be regarded neither as a measurement error nor as a side effect.

94 SIMD stands for single instruction multiple data (contrary to SISD, MISD, or MIMD). It usually refers to processor architecture enhancements including floating-point and integer-based multimedia-specific instructions: Multimedia Extensions (MMX; integer-only), Streaming SIMD Extensions (SSE 1 & 2), AltiVec (for Motorola), the 3DNow! family (for AMD), SSE3 (known as Prescott New Instructions, PNI) and SSE4 (known as Nehalem New Instructions, NNI). A short comparison of SIMD processor extensions can be found in [Stewart, 2005].
[Figure: proposed machine index (normalized execution time, about 0.7 to 1.3) of frames 1 to 4 for the Pentium 4 and AMD machines.]
Figure 64. Proposed machine index based on simple multimedia benchmark.
As a result, the proposed index cannot be used in the prediction of processing time, because it does not behave in a similar way on different platforms. The various types of frames are executed in different ways on diverse architectures, so a more sophisticated measure should be developed that allows reflecting the architecture specifics in the scheduling process.
VII.3.1.2 Compiler effects on the processing time
The compiler optimization is another aspect related to the platform-specific factors. It is applicable whenever source code in a higher-level language (e.g. C/C++) has to be translated to a machine-understandable language, i.e. binary or machine code (e.g. assembler). Since the whole development of the real-time converters (including the source code of the RTOS) is done in C/C++, the language-specific compilers have been shortly investigated. The source code of MD-XVID as well as of MD-LLV1 is already assembler-optimized for IA32-compatible systems; however, there is still an option to turn the optimization off (to investigate the speed-up or to support other architectures).
A simple test has been executed to investigate the efficiency of the executable code delivered by
different compilers [Mielimonka, 2006]. The MD-XVID has been compiled with assembler
optimizations for four different versions of the well-known open-source GNU Compiler
Collection (gcc) compiler. Then the executable has been run on the test machine PC_RT and
the execution time has been measured. Additionally, the assembler optimizations have been
investigated by turning them off for the most recent version of the compiler (gcc 3.3.6).
[Figure: a) encoding time in seconds (about 1.46 to 1.64) and b) time normalized to gcc 3.3.6 (about 98% to 108%) for gcc 2.95.4, gcc 3.3.6, gcc 3.3.6 without assembler optimizations, gcc 3.4.5 and gcc 4.0.2.]
Figure 65. Compiler effects on execution time for media encoding using MD-XVID.
The results for the encoding process of the Carphone QCIF sequence are depicted in Figure 65. Part a) shows the execution time measured for the whole sequence, and part b) presents the values normalized to gcc 3.3.6 to better depict the differences between the measured times. It can be noticed that the oldest compiler (2.95.4) is about 6.8% slower than the fastest one. Moreover, the assembler optimizations done by hand are also conducted by the compiler during the compilation process; even though the hand-made optimizations seem to be slightly better (99.93% of the normalization factor), the difference is still within the range of the measurement error of below 0.7% (as discussed in the previous section).
95 IA32 is an abbreviation of the 32-bit Intel™ Architecture.
Finally, it can be concluded that the compiler significantly influences the execution time, and thus only the fastest compiler should be used in the real-time evaluations. Additionally, the decision of no further code optimization by hand is taken, since the gains are small or unnoticeable. Moreover, the use of the fastest compiler not only delivers better and more efficient code, but also keeps the opportunity to use the higher-level source code on platforms other than IA32 (which was the assembler optimization target).
VII.3.1.3 Thread models – priorities, multitasking and caching
Obviously, the thread models and execution modes existing in any OS influence the execution of every application running under the given OS. There are already some advantages usable for multimedia processing present in a real-time system, such as the controlled use of resources, time-based scheduling, or reservation guarantees used by the QoS control. On the other hand, the application must be able to utilize such advantages.
One of the important benefits of DROPS, deriving from its microkernel construction, is the possibility to assign priorities to the device drivers, i.e. the device drivers are treated as user processes. Thus the assignment of a lower priority to a device driver allows lowering or sometimes even avoiding the influence of the device interrupts, which is especially critical for real-time applications using memory actively, due to potential overload situations deriving from the PCI-bus sharing between memory and device drivers and causing deadline misses [Schönberg, 2003].
Multitasking (or preemptive task switching) is another important factor influencing the execution of the application's thread, because the processor timeline is not equal to the application processing timeline, i.e. the thread may be active or inactive for some time. The problem in a real-time application is to provide enough active time to allow the RT application to finish its work before the defined deadlines. However, DROPS uses fixed-priority-based preemptive scheduling (with the round-robin algorithm for the same priority value) for non-real-time threads, analogically to a best-effort system, i.e. the highest-priority thread is activated first and the equal-priority threads share the processor time equally. Thus, an investigation has been conducted where the MD-XVID video encoder (1st thread) and a mathematical program (2nd thread) have been executed with equal priorities in parallel. The execution time according to the processor timeline (not the application timeline) has been measured for the encoder in the parallel environment (Concurrent) and compared to the stand-alone execution (Single); the results are depicted in Figure 66. It is clearly noticeable that the concurrent execution takes much more time for some frames than the single one, namely for those frames where the thread was preempted to the inactive state and was waiting for its turn.

96 This is possible only when a device that is not needed in the multimedia server is not used at all, e.g. drivers for USB or IEEE hubs and connectors are not loaded.
[Figure: execution time in ms (0 to 30) per frame number (1 to 94) for Concurrent and Single execution.]
Figure 66. Preemptive task switching effect (based on [Mielimonka, 2006]).
It is obvious that a multitasking system is required by the multimedia server, where many threads are the usual case, but preemptive task switching as shown in Figure 66 is too dangerous for the real-time applications. Thus a mechanism for QoS control and time-based scheduling is required; luckily, DROPS already provides a controlling mechanism, namely the QAS, which could be applicable here. The QAS may be used together with the admission server controlling the current use of resources (especially CPU time) and reserve the required active time of the given real-time thread by allocating the timeslice within the period provided by the resource.
VII.3.2. Timeslices in HRTA Converter Model
Having discussed the platform-specific factors, the timeslice for each part defined in the HRTA converter model has to be introduced. In contrast to the thread model in DROPS (Figure 57 on p. 182), where only one mandatory timeslice, any number of optional timeslices, and one empty (idle) time-sharing timeslice exist, the time for each part of the HRTA converter model is depicted in Figure 67. According to the definition of the HRTA converter model, two mandatory timeslices and one optional timeslice are defined, where the time t_base_ts of the mandatory base timeslice is defined for C_M, analogically t_enhance_ts of the optional enhancement timeslice for C_O, and t_cleanup_ts of the mandatory cleanup timeslice for C_D.
[Figure: one period divided into t_base_ts, t_enhance_ts, t_cleanup_ts and t_idle_ts.]
Figure 67. Timeslice allocation scheme in the proposed HRTA thread model of the converter.
The last time value, called t_idle_ts, is introduced in analogy to the time-sharing part of the DROPS model. It is a nothing-to-do part of the HRTA-compatible converter and is used for the inactive state in the multitasking system (other threads may be executed in this time), or, if a timeslice overrun happens, the other converter parts can still exploit this idle time. The period depicted in Figure 67 is assumed to be constant and derives directly from the target frequency F of the converter output (i.e. it equals the inverse of the target frame rate, 1/fps), and it definitely does not have to be equal to the length of the period as defined by T in JCPSt.
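As an illustration, the relation between the period and the four time values of Figure 67 could be captured as follows; the struct and helper names are assumptions made only for this sketch.

```cpp
// Sketch of the timeslice allocation of Figure 67 (values in milliseconds).
struct HrtaTimeslices {
    double base_ts;     // mandatory base timeslice for C_M
    double enhance_ts;  // optional enhancement timeslice for C_O
    double cleanup_ts;  // mandatory cleanup timeslice for C_D
};

// The period is constant and derives from the target output frame rate (1/fps).
double period_ms(double fps) { return 1000.0 / fps; }

// Whatever remains of the period is the nothing-to-do (idle) part t_idle_ts,
// usable by other threads or by overrunning parts of the converter.
double idle_ts(const HrtaTimeslices& ts, double fps) {
    return period_ms(fps) - (ts.base_ts + ts.enhance_ts + ts.cleanup_ts);
}
```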
VII.3.3. Precise time prediction
The processing time may be predicted based on statistics and real measurements, but one has to remember that it has already been demonstrated that the multimedia data influence the processing time. Hence, three methods have been investigated during the design of the processing time estimation:
1) Frame-based prediction
2) MB-based prediction
3) MV-based (or block-based) prediction
They are ordered by complexity and accuracy, i.e. the frame-based prediction is the simplest one, but the MV-based one seems to have the highest accuracy. Moreover, the more complex the estimation algorithm is, the more additional meta-data it requires. Thus, these methods are directly related to the different levels of the static MD set as given in Figure 12 (on p. 98).
All these methods depend on both the platform characteristics (as discussed in section VII.3.1) and the converter behavior. The data influence is respected by each method to a smaller or larger extent by using a certain subset of the proposed meta-data. In any case, all of the methods require two steps in the estimation process:
a) Learning phase – where the platform characteristics in the context of the used converter and a given set of video sequences are measured and a machine index is defined (please note that the simple one-value index proposed in VII.3.1.1 is insufficient);
b) Working mode – where the defined machine index together with the prepared static meta-data of the video sequences is combined by the estimation algorithm in order to define the time required for the execution (let us call it the default estimation).
Additionally, the working mode could be used for refining the estimation by extending the meta-data set, delivering the execution trace and storing it back in the MMDBMS. Then the prediction could be based on the trace or on a combination of the trace and the default estimation, and thus would be more accurate. If exactly the same request appears in the future, not only the trace but also the already produced video data could be stored; this, however, produces an additional amount of media data and should only be considered as a trade-off between the processing and storage costs. Neither the estimation refinement by the trace nor the reuse of already-processed data has been further investigated.
VII.3.3.1 Frame-based prediction
This method is relatively simple. It delivers the average execution time per frame considering the distinction between frame types and the video size. The idea of distinguishing between frame types comes directly from the evaluation. The decoding time with respect to the frame type is depicted for a few sequences in different resolutions in Figure 68. In order to make the results comparable, the higher resolutions are normalized as follows: CIF by a factor of 4 and PAL (ITU601) by a factor of 16. Moreover, the B-frames are used only in the video sequences with the "_temp" extension.
[Figure: normalized average decoding time in ms (0 to 12) per frame type (I, P, B) for the QCIF (÷1), CIF (÷4) and ITU601 (÷16) variants of the mother_and_daughter, container, mobile, mobcal, shields and parkrun sequences.]
Figure 68. Normalized average LLV1 decoding time counted per frame type for each sequence.
Analogically, the MD-XVID encoding is depicted only for the representative first forty frames of Container QCIF in Figure 69, where the average of the B-frame encoding is above the average of the P-frame encoding (they amount to about 9.17 ms and 8.36 ms, respectively). Summarizing, it is clearly visible that I-frames are processed fastest and B-frames slowest for both the decoding and the encoding algorithms.
[Figure: frame encoding time in ms (about 6 to 11) over the frame sequence I, B, P, B, P, ..., with the averages of the P-frames and of the B-frames marked.]
Figure 69. MD-XVID encoding time of different frame types for representative number of frames in Container QCIF.
The predicted time for the given resolution is defined as:

T_exec = n ⋅ T^avg    (46)

where T^avg is the machine-specific time vector including the average execution time measured respectively for I-, P- or B-frames during the learning phase, and n is the static MD vector keeping the sum of frames of the given type for the specific video sequence:

n = [n_I, n_P, n_B]
n_I = IFrameSum_{mo_i}
n_P = PFrameSum_{mo_i}
n_B = BFrameSum_{mo_i}    (47)
The distribution of frame types for the investigated video sequences is depicted in Figure 109
(section XIX.1 Frame-based static MD of Appendix F).
Thus, the predicted time for the converter execution may simply be calculated as:

T_exec = n_I ⋅ T_I^avg + n_P ⋅ T_P^avg + n_B ⋅ T_B^avg    (48)
It is not yet distinguished between the types of converters, but it is obvious that T^avg should be measured for each converter separately (as demonstrated in Figure 68 and Figure 69). The prediction error calculated for the example data is depicted in Figure 70. It shows the difference between the total execution time and the total time predicted according to Equation (48) on the left-side Y-axis, and the error given as the ratio between the difference and the real value. The average absolute error is equal to 7.3%, but the maximal deviation has reached almost 11.6% for overestimation and 12.2% for underestimation.
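A minimal sketch of the frame-based prediction of Equations (46)–(48) is given below; the struct names are chosen for illustration, and the average times are assumed to come from the learning phase of the respective converter.

```cpp
// Sketch of the frame-based prediction of Equations (46)-(48).
struct FrameCounts   { int    nI, nP, nB; };   // n = [n_I, n_P, n_B] from the static MD
struct AvgFrameTimes { double tI, tP, tB; };   // T^avg measured in the learning phase [ms]

// Eq. (48): T_exec = n_I*T_I^avg + n_P*T_P^avg + n_B*T_B^avg
double predict_frame_based(const FrameCounts& n, const AvgFrameTimes& t) {
    return n.nI * t.tI + n.nP * t.tP + n.nB * t.tB;
}
```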
[Figure: difference between measured and predicted total time in seconds (left scale, about −6 to 4) and relative error (right scale, about −60% to 40%) for each investigated sequence in QCIF, CIF and ITU601 resolution.]
Figure 70. Difference between measured and predicted time.
Moreover, a different video resolution definitely causes a different average processing time per frame. Thus, it is advised to conduct the learning step on at least two well-known resolutions and then to estimate the scaled video according to the following formula:
T_new = θ_exec ⋅ T_old^avg , where

θ_exec = (1 − 1/log(p_new)) ⋅ (p_new / p_old)          ⇔ p_new > p_old
θ_exec = (1 / (1 − 1/log(p_new))) ⋅ (p_new / p_old)     ⇔ p_new < p_old    (49)
Here T_new and p_new are the time and the number of pixels in the new resolution, and analogically T_old and p_old are those of the original test video (i.e. of the measurement). A linear prediction, where the slope θ is simply the ratio p_new/p_old between the new and the old number of pixels (for upscaling as well as downscaling), may also be applied, but it yields a higher estimation error, as depicted by the thin black lines for the LLV1 decoding in Figure 71. The theta-based prediction performs best for I-frames and in general better when downscaling; however, when upscaling, the predicted time is in most of the cases underestimated.
[Figure: average measured execution time and predicted time in ms (0 to 160) per frame type (I, P, B) for QCIF, CIF and PAL, comparing the measured values with theta-based and linear up- and downscaling predictions.]
Figure 71. Average of measured time and predicted time per frame type.
VII.3.3.2 MB-based prediction
The MB-based method considers the number of different MBs in the frame. There are three
types possible: I-MBs, P-MBs and B-MBs. They are functionally analogical to frame-types
however they are not directly related to the frame-types i.e. I-MBs may appear in all three types
of frames, P-MBs may be included only in P- and B-frames, and B-MBs occur solely in the Bframes. As so, the differentiation not on the frame-level but on MB-level is more reliable for the
average time measurement of the learning phase. The examples depicting the time consumed
per MB are given for different frame-types in Figure 72. Interesting is that, the B-MBs are coded
faster than the I- or P-MBs, while in general the B-frames are coded longer than P-frames—for
204
Chapter 3 – Design
VII. Real-Time Processing Model
a comparison see Figure 68, Figure 69 and Figure 70 in the previous section. The high standard
deviations for P- and B-frames stem from the different MB-types using probably different MVtypes. Anyway, it may be deduced, that if there are more B-MBs in the frame, the faster MBcoding of MD-based XVID and of MD-LVV1 will be; obviously, this may not be true for
standard coders using the motion estimation without meta-data (neither in MD-LLV1 nor in
MD-XVID it is done).
[Figure: per-MB encoding time in μs over the MB number (0 to 95); I-frame: Avg.=43.4, Std.Dev.=4.7, Δ|max−min|=20.4; P-frame: Avg.=42.2, Std.Dev.=9.1, Δ|max−min|=43.8; B-frame: Avg.=30.2, Std.Dev.=7.8, Δ|max−min|=33.8.]
Figure 72. MB-specific encoding time using MD-XVID for Carphone QCIF.
The example distribution of the different MB types for two sequences is depicted in Figure 73 (more investigated examples are depicted in section XIX.2 MB-based static MD of Appendix F). There is a noticeable difference between the pictures because the B-frames do (a) or do not (b) appear in the video sequence. Even if the B-frames appear in the video sequence, it does not have to mean that all MBs within a B-frame are B-MBs; this rule is nicely depicted for the even frames in Figure 73 a), where the B-MBs (in yellow) cover only a part of all MBs in the frame.
[Figure: number of coded MBs (L0) per frame, stacked by I-MBs, P-MBs and B-MBs, for about 95 frames in a) and about 300 frames in b).]
Figure 73. Distribution of different types of MBs per frame in the sequences: a) Carphone QCIF and b) Coastguard CIF (no B-frames).
Now, based on the different amount of each MB-type in the frame, the predicted time for each frame may be calculated as:

T_Fexec = m ⋅ T^MBavg + f(T^Davg)    (50)

where T^MBavg is the time vector including the converter-specific average execution time measured respectively for I-, P- or B-MBs during the learning phase (as shown in Figure 72). It covers the operations for the preparation and support of the MB structure (e.g. zeroing MB matrixes), error prediction and interpolation, transform coding, quantization, and entropy coding of MBs incl. quantized coefficients or quantized error values with MVs (MB-related bitstream coding). It is defined as:

T^MBavg = [T_I^MBavg, T_P^MBavg, T_B^MBavg]    (51)

The vector m is analogical to n (in the previous section), but this static MD vector keeps the sum of MBs of the given type for the specific j-th frame of the video sequence (i.e. of the i-th media object):

m = [m_I, m_P, m_B]
m_I = IMBsSum_{mo_i,j}
m_P = PMBsSum_{mo_i,j}
m_B = BMBsSum_{mo_i,j}    (52)

The function f(T^Davg) returns the average time per frame required for the frame-type-dependent default operations other than MB processing, related to the specific converter before and after the MB-coding:

T^Davg = [T_I^Davg, T_P^Davg, T_B^Davg]    (53)

and the function returns one of these values depending on the type of the processed frame per each converter. Here, T_I^Davg covers the operations required for assigning internal parameters, zeroing frame-related structures, continuous MD decoding and the non-MB-related bitstream coding (e.g. bit-stream structure information like width, height, frame type, and frame number in the sequence); T_P^Davg includes the operations of T_I^Davg and the preparation of one reference frame (inverse quantization, inverse transform and edging); and T_B^Davg includes the operations of T_I^Davg and the preparation of two reference frames. Obviously, not all of the mentioned operations may be included in a converter, e.g. the LLV1 decoder does not need the continuous MD decoding. Moreover, T^Davg is related to the frame size analogically to the frame-based prediction, and thus the scaling should be applied respectively.
[Figure: cumulated time in ms (0 to 10) over the MB-encoding progress (0 to 100) for one I-, one P- and one B-frame of carphone_qcif.]
Figure 74. Cumulated processing time along the execution progress for the MD-XVID encoding (based on [Mielimonka, 2006]).
To explain the meaning of f(T^Davg), the distribution of time within the encoding algorithm has been investigated from the perspective of the MB-coding. The results for three representative frames (of different types) of Carphone QCIF are depicted in Figure 74. Here, the progress between 0 and 1 means the first part of f(T^Davg), and the progress between 99 and 100 denotes the second part of f(T^Davg). The progress between 1 and 99 is the processing time spent on MB-coding. As shown, the P-frames need more time for preparation than I-frames, and B-frames need even more. However, the situation is the opposite in the after-MB-coding phase, where the I-frames need the most processing time and the B-frames the least. This situation is better depicted in Figure 75, in which the time specific to each frame type has been divided into preparation, MB-coding and completion. Obviously, the time for the preparation and completion phases is given by f(T^Davg).
[Figure: percentage split (0% to 100%) of the coding time into preparation, MB-coding and completion for I-, P- and B-frames.]
Figure 75. Average coding time partitioning in respect to the given frame type (based on [Mielimonka, 2006]).
Now, the relation between f(T^Davg) and m ⋅ T^MBavg could be derived for the given frame, such that:

f(T^Davg) = Δ ⋅ m ⋅ T^MBavg ,  Δ = [Δ_I, Δ_P, Δ_B] ∧ Δ_k = (a + b) / (1 − (a + b))    (54)

and

a = 24.7% ∧ b = 17.5% , if k = I ⇔ f.type = I
a = 43.8% ∧ b = 8.3% , if k = P ⇔ f.type = P
a = 64.4% ∧ b = 4.1% , if k = B ⇔ f.type = B    (55)

which may be further detailed as:

f(T_I^Davg) = (24.7% + 17.5%) / (100% − (24.7% + 17.5%)) ⋅ m ⋅ T^MBavg
f(T_P^Davg) = (43.8% + 8.3%) / (100% − (43.8% + 8.3%)) ⋅ m ⋅ T^MBavg
f(T_B^Davg) = (64.4% + 4.1%) / (100% − (64.4% + 4.1%)) ⋅ m ⋅ T^MBavg    (56)
Of course, f(T^Davg) according to the above definitions is only a rough estimation. The values predicted using Equation (50) in combination with the estimates from Equation (56) are presented in relation to the real measured values of Carphone QCIF in Figure 76 and Figure 77. The maximal error of overestimation and underestimation was equal to 15% and 8.6%, respectively, but the average absolute error was equal to 3.93%. Moreover, the error counted on average was positive, which means an over-allocation of resources in most cases.
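The following sketch illustrates the MB-based prediction of Equations (50) and (54)–(56); the type names are chosen for illustration, and the percentage shares a and b are the measured values given in Equation (55).

```cpp
// Sketch of the MB-based prediction of Equations (50) and (54)-(56).
#include <stdexcept>

enum class FrameType { I, P, B };

struct MbCounts   { int    mI, mP, mB; };   // m = [m_I, m_P, m_B] from the static MD
struct AvgMbTimes { double tI, tP, tB; };   // T^MBavg measured in the learning phase [ms]

// m * T^MBavg : time spent on MB-coding of one frame
double mb_coding_time(const MbCounts& m, const AvgMbTimes& t) {
    return m.mI * t.tI + m.mP * t.tP + m.mB * t.tB;
}

// Eq. (54)-(56): the default (preparation + completion) time is estimated as
// Delta_k * m*T^MBavg with Delta_k = (a+b) / (1 - (a+b)).
double default_time(FrameType k, double mb_time) {
    double a, b;                       // shares of the total coding time per frame type
    switch (k) {
        case FrameType::I: a = 0.247; b = 0.175; break;
        case FrameType::P: a = 0.438; b = 0.083; break;
        case FrameType::B: a = 0.644; b = 0.041; break;
        default: throw std::logic_error("unknown frame type");
    }
    return (a + b) / (1.0 - (a + b)) * mb_time;
}

// Eq. (50): predicted time of one frame; the per-sequence total is the sum
// over all frames as in Equation (57).
double predict_frame(FrameType k, const MbCounts& m, const AvgMbTimes& t) {
    double mb = mb_coding_time(m, t);
    return mb + default_time(k, mb);
}
```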
[Figure: measured and predicted processing time in ms (0 to 12) per frame over the I/B/P frame sequence.]
Figure 76. Measured and predicted values for MD-XVID encoding of Carphone QCIF.
Finally, the total predicted time for the converter execution may be calculated as:

T_exec = Σ_{i=0}^{n_I} T_Fexec,i + Σ_{j=0}^{n_P} T_Fexec,j + Σ_{k=0}^{n_B} T_Fexec,k    (57)

where n_I, n_P and n_B are defined by Equation (47).

The total predicted time for the MD-XVID encoding of the Carphone QCIF calculated according to Equations (50), (56) and (57) was equal to 873 ms and the measured one to 836 ms; thus the over-allocation was equal to 4.43% for the given example.
[Figure: per-frame difference in ms (left scale, about −2 to 3) and relative error (right scale, about −10% to 15%) between predicted and measured time.]
Figure 77. Prediction error of MB-based estimation function in comparison to measured values.
VII.3.3.3 MV-based prediction
As can be seen in Figure 72, the time required for each MB-type varies from MB to MB. These variations, measured by the standard deviation, are smallest for the I-MBs (4.7) and almost twice as big for P-MBs (9.1) and B-MBs (7.8) in the obtained results. The main difference in the coding algorithm between intra- and inter-coded macro blocks is the prediction part, which employs nine types of prediction vectors causing different execution paths. This leads to the assumption that the different types of prediction in the case of predicted macro blocks influence the final execution time of each measured MB. Thus the third method, based on the motion vector types, has been investigated, which has led to the function-block decomposition [Meyerhöfer, 2007] for the MD-based encoding, such that the execution code has been measured per MV type separately.
Before going into detailed measurements, the static MD related to motion vectors has to be
explained. The distribution of the motion vector types within the video sequence is depicted in
Figure 78 and the absolute values of MV sums per frame are shown in Figure 79. The detailed
explanation of graphs’ meaning and other examples of MV-based MD are given in section
XIX.3 MV-based static MD of Appendix F.
Figure 78. Distribution of MV-types per frame in the Carphone QCIF sequence.
[Figure: number of MVs per frame, stacked by MV-type (mv1 to mv9 and no_mv), for about 95 frames; a) up to about 120 MVs per frame, b) up to about 200 MVs per frame.]
Figure 79. Sum of motion vectors per frame and MV type in the static MD for Carphone QCIF sequence: a) with no B-frames and b) with B-frames.
The time consumed for the encoding, measured per functional block specific to each MV-type, is depicted in Figure 80. It is clearly visible that the encoding time per MV-type is proportional to the number of MVs of the given type in the frame. The most noticeable is the mv1 execution (black in Figure 80), which can be mapped almost one-to-one onto the absolute values included in the static MD (violet in Figure 79 a)). The other behavior visible at once is caused by mv9 (dark blue in both figures). The results in both remarkable cases prove the linear dependency between the coding time and the number of MVs with respect to the considered MV-type. Moreover, the distribution graph (Figure 78) helps to find out very quickly which MV-types influence the frame encoding time, i.e. for the given example mv1 is the darkest for almost the whole sequence besides the middle part, where mv9 is the darkest. The other MVs, namely mv2 and mv4, also have a key impact in the encoding of the Carphone QCIF (96) example.
[Figure: encoding time in ms (0 to 4) per frame (1 to 95) of the functional blocks specific to the MV-types mv1 to mv9.]
Figure 80. MD-XVID encoding time of MV-type-specific functional blocks per frame for Carphone QCIF (96).
The encoding time has been measured per MV-type, and the average T_i^MVavg is depicted in Figure 81 for each MV-type. This average MV-related time is used as the basis for the prediction calculations:

T_Fexec = v ⋅ T^MVavg + f(T^Davg)    (58)

where f(T^Davg) is the one defined for Equation (50), and T^MVavg is the time vector including the converter-specific average execution time measured respectively for the different MV-types during the learning phase (Figure 81), which includes all operations related to MBs using the given MV-type (e.g. zeroing MB matrixes, error prediction and interpolation, transform coding, quantization, and entropy coding of MBs incl. quantized coefficients or quantized error values with MVs). It is defined as a vector having nine average encoding time values referring to the given MV-types:

T^MVavg = [T_1^MVavg, T_2^MVavg, T_3^MVavg, ..., T_9^MVavg]    (59)

The v is a sum vector keeping the amount of MVs of the given type for the specific j-th frame of the i-th video sequence:

v = [v_1, v_2, v_3, ..., v_9]
v_i = MVsSum_{mo_i,j}(VectorID), 1 ≤ i ≤ 9    (60)

where VectorID is a given MV-type, v_i is the sum of the MVs of the given MV-type, and:

MVsSum_{mo_i,j}(VectorID) = |{ mv_i | mv_i ∈ MV ∧ TYPE(mv_i) = VectorID ∧ 1 ≤ i ≤ V }|    (61)

where V is the total number of MVs in the j-th frame, mv_i is the current motion vector belonging to the set of all motion vectors MV of the j-th frame, and the function TYPE(mv_i) returns the type of mv_i.
[Figure: average encoding time per MB in μs for each MV-type: mv1 = 50, mv2 = 47, mv3 = 39, mv4 = 46, mv5 = 37, mv6 = 46, mv7 = 42, mv8 = 42, mv9 = 31.]
Figure 81. Average encoding time measured per MB using the given MV-type.
The total time for the video sequence is calculated the same way as in Equation (57).
The measured encoding time and the time predicted according to Equations (58) and (57) are presented in Figure 82. The predicted total time is denoted by TotalPredicted, and it is the sum of TstartupPredicted and TcleanupPredicted, both reflecting f(T^Davg), and of TMoCompPredicted, which is the sum of the number of MVs of each type multiplied by the average time calculated per MV-type. Analogically, the real measured total time is denoted by Total, and the respective components are TimeStartup&1stMB, AllMBsBut1st and TimeCleanUp.
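A short sketch of the motion-compensation-related part of the MV-based prediction of Equations (58)–(60) follows; the array types and names are assumptions, and the example averages are the values reported in Figure 81. The frame-type-dependent default time f(T^Davg) would be added as in Equation (50).

```cpp
// Sketch of the MV-based prediction of Equations (58)-(60).
#include <array>

// v = [v_1 ... v_9]: number of MVs of each type in the j-th frame (static MD)
using MvCounts   = std::array<int, 9>;
// T^MVavg: average encoding time per MB using the given MV-type [microseconds]
using MvAvgTimes = std::array<double, 9>;

// v * T^MVavg : the part of the frame time spent on MBs using the given MV-types
double mv_coding_time(const MvCounts& v, const MvAvgTimes& t) {
    double sum = 0.0;
    for (int i = 0; i < 9; ++i) sum += v[i] * t[i];
    return sum;   // add f(T^Davg) of the frame type to obtain T_Fexec, Eq. (58)
}

// Example values of T^MVavg as reported in Figure 81 (mv1 .. mv9, in microseconds).
const MvAvgTimes kFig81Averages = {50, 47, 39, 46, 37, 46, 42, 42, 31};
```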
[Figure: per-frame predicted time (TotalPredicted with components TstartupPredicted, TMoCompPredicted, TcleanupPredicted) and measured time (Total with components TimeStartup&1stMB, AllMBsBut1st, TimeCleanUp) in ms (0 to 10) for frames 1 to 93.]
Figure 82. MV-based predicted and measured encoding time for Carphone QCIF (no B-frames).
The error of the MV-based prediction is shown in Figure 83 as the difference in absolute values and as a percentage of the measured time. The total predicted time was equal to 845 ms and the measured one to 836 ms (as in the MB-based case); thus the difference was equal to 9 ms, which resulted in an overestimation of 1.04%. Hence, the MV-based prediction achieved better results than the MB-based prediction.
[Figure: per-frame difference in ms (left scale, about −2 to 3) and relative error (right scale, about −10% to 15%) between predicted and measured time.]
Figure 83. Prediction error of the MV-based estimation function in comparison to measured values.
VII.3.3.4 The compiler-related time correction
Finally, the calculated time should be corrected by the compiler factor. This factor may derive from the measurements conducted in the previous section (VII.3.1.2). Hence, the compiler-related time correction is defined as:

T_exec_corr = υ ⋅ T_exec    (62)

where υ is a factor representing the relation of the execution times in the different runtime environments of the learning phase and of the working mode, i.e. it is equal to 1 if the same compiler was used in both phases, and otherwise it is calculated as the ratio of the execution times (or of the normalized times with a normalization factor) for the different compilers. For example, let us assume the values presented in section VII.3.1.2, i.e. the execution time of the converter compiled with gcc 2.95.4 is equal to 1.6204 (normalized time 106.78%, normalization factor 1.5174), and with gcc 4.0.2 it is equal to 1.5678 (or 103.32% with the same normalization factor) for the same set of media data. If the code was compiled with gcc 2.95.4 for the learning phase and with gcc 4.0.2 for the working mode, then the value of υ used for the time prediction in the working mode is equal to the ratio 1.5678/1.6204 (or 103.32%/106.78%). Moreover, if the normalization factor for gcc 4.0.2 is different, e.g. 1.5489, then the normalized time for gcc 4.0.2 is equal to 101.22% and the ratio is calculated as (101.22%·1.5489)/(106.78%·1.5174). Please note that the normalization factor does not have to be constant; however, then two values instead of one have to be stored, so it is advised to normalize the execution time by a constant normalization factor. On the other hand, using the measured execution time directly for calculating υ delivers the same results and does not require storing the normalization factor, but at the same time it hides the straightforward information about which compilers are faster and which are slower.
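A tiny sketch of Equation (62), under the assumption that the execution times of the learning phase and of the working mode for the same media data are known:

```cpp
// Sketch of the compiler-related time correction of Equation (62).
// upsilon is the ratio of the working-mode to the learning-phase execution time.
double compiler_factor(double t_working_compiler, double t_learning_compiler) {
    return t_working_compiler / t_learning_compiler;   // e.g. 1.5678 / 1.6204 for gcc 4.0.2 vs. gcc 2.95.4
}

double corrected_time(double t_exec, double upsilon) {
    return upsilon * t_exec;   // T_exec_corr, Eq. (62)
}
```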
VII.3.3.5 Conclusions to precise time prediction
It can easily be noticed that none of the given solutions can predict the exact execution time. Each method delivers estimates burdened with some error. In the case of the frame-based prediction the error is the biggest, it is smaller for the MB-based prediction, and it is the smallest for the MV-based prediction. Moreover, the influence of the compiler's optimization cannot be neglected unless the same compiler is used for producing the executables. Finally, the investigation of the time prediction has led to the conclusion that errors in the prediction cannot really be avoided, and the HRTA converter model is the only one that exploits the predicted time (obtained by any method) at the cost of some drop in the quality of the output data. Obviously, the resulting quality drop is bigger if a less precise prediction method was used.
VII.3.4. Mapping of MD-LLV1 Decoder to HRTA Converter Model
To stick to the hard real-time adaptive converter model and provide a guarantee of the minimal quality and of delivering all the frames, the decoding algorithm had to be split in such a way that the complete base layer without any enhancements is decoded first (before the given deadline), and next the enhancement layers up to the highest one (lossless) are decoded. However, the optimization problem of multiple executions of the inverse quantization (IQ), the inverse lifting scheme i.e. the inverse binDCT (IbinDCT), and the correction of the pixel values in the output video stream for each layer occurred here. So, to avoid the loss of processing time in case of more enhancement layers, the decoding was finally optimized and split into the three parts of C_HRTA as follows:
• C_M – decoding of the complete frame in the BL – the de-quantized and transformed copy of the frame in the BL goes to the frame store (FS – buffer), and the quantized coefficients are used in the further EL computations
• C_O – decoding of the ELs – the decoded bit planes representing the differences between the coefficient values on the different layers are computed by formula (12) (on page 103)
• C_D – cleaning up and delivery – includes the final execution of IQ, IbinDCT and pixel correction for the last enhancement layer, thereby utilizing all readily processed MBs from the optional part, and provides the frame to the consumer.
The time required for the base timeslice (guaranteed BL decoding) is calculated as follows:

t_base_ts = t_avg_base/MB ⋅ m    (63)

where m denotes the number of MBs in one frame and t_avg_base/MB is the average consumed time for one MB of the BL in the given resolution (regardless of the MB-type).

The time required for the cleanup timeslice (guaranteed frame delivery) is calculated as follows:

t_cleanup_ts = t_max_cleanup/MB ⋅ m + t_max_enhance/MB    (64)

where t_max_cleanup/MB is the maximum of the consumed time for the cleanup step of one MB, and t_max_enhance/MB is the maximum time for the enhancement step of one MB in one of the ELs (to care for the last processed MB in the optional part), both with respect to the given resolution.

The time required for the enhancement timeslice (complete execution not guaranteed, but behaving like imprecise computations) is calculated according to:

t_enhance_ts = T − (t_base_ts + t_cleanup_ts)    (65)

where T denotes the length of the period (analogical to T in JCPSt).

Finally, the decoder must check if it can guarantee the minimal QoS, i.e. if the LQA for all the frames is delivered:

t_base_ts + t_enhance_ts ≥ t_max_base/MB ⋅ m    (66)

The check is relatively simple, namely the criterion is whether the maximum decoding time per macro block t_max_base/MB multiplied by the number of MBs in one frame m fits into the sum of the first two timeslices. Only if this check is valid, the resource allocation providing QoS will work and the LQA guarantees may be given. Otherwise, the allocation will be refused.
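A minimal sketch of the timeslice planning and the admission check of Equations (63)–(66) is given below; the function and field names are illustrative assumptions.

```cpp
// Sketch of the timeslice calculation and LQA admission check for the
// MD-LLV1 HRTA decoder, Equations (63)-(66). Times share one unit (e.g. ms).
struct Llv1Timeslices {
    double base_ts;      // t_base_ts, Eq. (63)
    double enhance_ts;   // t_enhance_ts, Eq. (65)
    double cleanup_ts;   // t_cleanup_ts, Eq. (64)
};

Llv1Timeslices plan_timeslices(int mbs_per_frame,           // m
                               double avg_base_per_mb,      // t_avg_base/MB
                               double max_cleanup_per_mb,   // t_max_cleanup/MB
                               double max_enhance_per_mb,   // t_max_enhance/MB
                               double period)               // T
{
    Llv1Timeslices ts;
    ts.base_ts    = avg_base_per_mb * mbs_per_frame;                          // (63)
    ts.cleanup_ts = max_cleanup_per_mb * mbs_per_frame + max_enhance_per_mb;  // (64)
    ts.enhance_ts = period - (ts.base_ts + ts.cleanup_ts);                    // (65)
    return ts;
}

// Eq. (66): the LQA guarantee can only be given if the worst-case base-layer
// decoding of a whole frame fits into the base plus enhancement timeslices.
bool admit(const Llv1Timeslices& ts, int mbs_per_frame, double max_base_per_mb) {
    return ts.base_ts + ts.enhance_ts >= max_base_per_mb * mbs_per_frame;
}
```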
The input values for the formulas are obtained by measurements of the set of videos classified along the given resolutions:
• QCIF – container, mobile, mother_and_daughter
• CIF – container, mobile, mother_and_daughter
• PAL – mobcal, parkrun, shields
It has been decided so because the maximum real values can only be measured by executing the real data using the compiled decoder. The average values of the time per MB have been calculated per frame for all frames in the different sequences of the same resolution, which means that they are burdened with an error at least as high as that of the frame-based prediction. On the other hand, the average values per MB could be calculated according to one of the proposed methods mentioned earlier for the whole sequence and then averaged per MB of the given frame type, which could yield more exact average times for the processed sequence; however, this has not been investigated.
VII.3.5. Mapping of MD-XVID Encoder to HRTA Converter Model
Analogically to the MD-LLV1 HRTA decoder, the MD-XVID encoder has been mapped to the HRTA converter model, but only one time prediction method has been used in the latter case.
VII.3.5.1 Simplification in time prediction
None of the prediction methods discussed in the Precise time prediction section is able to predict the exact execution time, and each of them estimates the time with some bigger or smaller error. Because the differences between the MB-based and the MV-based prediction are relatively small (compare Figure 77 and Figure 83), the simpler method, i.e. the MB-based prediction, has been chosen for calculating the timeslices of the HRTA converter model; what is more, it has been simplified.
Additionally, it has been decided to allow only constant timeslices driven by the output frame frequency with strictly periodic execution. The constancy of the timeslices is derived directly from the average execution time, namely it is based on the maximum average frame time for default operations and on the average MB-specific time. The maximum average frame time is chosen out of the three frame-type-dependent default-operations average times (as given in Equation (53)):
$T_{MAX}^{Davg} = \max\left(T_{I}^{Davg}, \max\left(T_{P}^{Davg}, T_{B}^{Davg}\right)\right)$   (67)
where max(x, y) is defined as [Trybulec and Byliński, 1989]:

$\max(x,y) = \begin{cases} x, & \text{if } x \geq y \\ y, & \text{otherwise} \end{cases} = \frac{1}{2}\cdot\left(\left|x-y\right| + x + y\right)$   (68)
and the average MB-specific time is the mean value:

$T_{AVG}^{MBavg} = \frac{T_{I}^{MBavg} + T_{P}^{MBavg} + T_{B}^{MBavg}}{3}$   (69)
VII.3.5.2 Division of encoding time according to HRTA
Having the above simplification defined, the hard real-time adaptive converter model of the MD-XVID encoder with minimal quality guarantees, including the processing of all frames, could be defined. The encoding algorithm had to be split in such a way that the default operations are completely treated as the mandatory part and the MB-specific encoding is treated partly as mandatory and partly as optional.
The time required for the base timeslice according to CHRTA is calculated as follows:

$t_{base\_ts} = T_{MAX}^{Davg} \cdot \frac{a}{(a+b)}$   (70)

where a and b are defined according to Equation (55).
Analogically, the cleanup timeslice is calculated as follows:
$t_{cleanup\_ts} = T_{MAX}^{Davg} \cdot \frac{b}{(a+b)}$   (71)
The time required for the enhancement timeslice, in which the complete execution of all MBs is not guaranteed, is calculated according to:

$t_{enhance\_ts} = T_{AVG}^{MBavg} \cdot m$   (72)

where m is the number of MBs to be coded.
Both Equations (70) and (71) assume the worst-case condition independently of the processed frame type. Thus an optimization may be introduced that allows "moving" some of the MB-specific processing to the unused time in the base timeslice. Of course, the worst-case assumption must stay untouched in the clean-up step.
As so the additional relaxation condition is proposed:
⎢ t base _ ts − (TFexec ⋅ a) ⎥
TFexec ⋅ a < t base _ ts ⇒ mbase = ⎢
⎥
MBavg
TAVG
⎣
⎦
(73)
and
t enhance _ ts = TAVG
MBavg
⋅ (m − mbase )
(74)
Figure 84 demonstrates the mapping of the MD-XVID to HRTA converter model including the
idea of relaxation according to Equations (73) and (74).
[Figure 84 illustrates the reservation according to CHRTA over the periods n-1, n and n+1 for I-, P- and B-frames, showing the startup, MB-processing (mandatory m, optional o), clean-up (c) and IDLE phases, the TSO events, and the relaxation condition moving m_moved MBs into the base timeslice.]
Figure 84. Mapping of MD-XVID to HRTA converter model.
Chapter 4 – Implementation
As soon as we started programming, we found to our surprise that it wasn’t as easy to get programs right as we had thought. Debugging had to be discovered. I can remember that exact instant when I realized that a large part of my life from then on was going to be spent in finding mistakes in my own programs. Maurice Wilkes
(1979, reminiscing about his early days on EDSAC in the 1940s)
VIII. CORE OF THE RETAVIC ARCHITECTURE
The RETAVIC project was divided into parts, i.e. sub-projects. Each part was meant to be covered by one (or more) student work(s); however, not all sub-projects have been conducted due to the time factor or missing human resources. Finally, the focus was to cover the most important and critical parts of the RETAVIC project, proving the idea of controllable meta-data-based real-time conversion for the most complex type of media (i.e. the video type), which is the base assumption for the format independence of multimedia data in an MMDBMS.
VIII.1. Implemented Best-effort Prototypes
In the first phase, the video transcoding chain has been implemented in the best-effort system, or to be more precise on two best-effort OS platforms, namely Windows XP and Linux. The implementation covered:
• p-domain XVID – extension of XVID code to support p-domain-based bit rate
control algorithm;
• LLV1 codec – including encoding and decoding parts – implementation of the temporal
and quantization layering schemes together with the binDCT algorithm and adaptation
of Huffman tables;
• MD analyzer – produces the static and continuous MD for the given video sequence;
• MD-LLV1 codec – extension of LLV1 to support additional meta-data allowing
skipping frames in the enhancement layers in the coded domain (i.e. bit stream does not
have to be decoded); here both –producing and consuming– parts are implemented;
• MD-XVID codec – extension of XVID to support additional meta-data (e.g. direct
MVs reuse); it also includes some enhancements in the quality such as: 1) rotating diamond
algorithm based on diamond search algorithm [Tourapis, 2002] with only two iterations
for checking if full-pel MD-based MVs are suitable for the converted content, 2) predictor
recheck, which allows for checking MD-based MV against the zero vector and against the
median vector of three MD-based vectors (of the MBs in the neighborhood – left, top,
top-right), and 3) subpel refinement, where the full-pel MC using MD-based MVs is
calculated also for half-pel and q-pel.
Having the codecs implemented, functional and quality evaluations have been conducted, and moreover, some best-effort efficiency benchmarks have been accomplished. All these benchmarks and evaluations have already been reflected by charts and graphs in the previous part of this thesis, i.e. in the Video Processing Model (V) and Real-Time Processing Model (VII) sections.
VIII.2. Implemented Real-time Prototypes
The second phase of implementation covered, at first, the adaptation of the source code of the best-effort prototypes to support the DROPS system and its specific base functions and procedures as non-real-time threads, and secondly the implementation of the real-time threading features allowing for the algorithm division as designed previously. The implementation is based on only two of the previously mentioned best-effort prototypes, i.e. on the MD-LLV1 and MD-XVID codecs, and it covers:
• DROPS-porting of MD-LLV1 – adaptation of the MD-LLV1 codec to support the
DROPS-specific environment; there is no distinction made between the MD-LLV1 implemented in DROPS and in Windows/Linux within this work, since only the OS-specific hacking activities have been conducted and neither algorithmic nor functional
changes have been made; moreover, the implementation in DROPS behaves analogically
to the best-effort system implementation since no real-time support has been included
and even the source code is included in the same CVS tree as the best-effort MD-LLV1;
• DROPS-porting of MD-XVID – exactly analogical to MD-LLV1; the only difference is that it is based on the MD-XVID codec;
• RT-MD-LLV1 decoder – the implementation based on DROPS-porting of MD-LLV1
has covered the real-time issues, i.e. the division of the algorithm into the mandatory and optional parts (which has been explained in the Design of Real-Time Converters section), the implementation of the preempter thread, special real-time logging through the network, etc.; it is described in detail in the following chapters;
• RT-MD-XVID encoder – the implementation covers analogical aspects to RT-MD-LLV1 but is based on the DROPS-porting of MD-XVID (see also the Design of Real-Time Converters section); it is also detailed in the subsequent part.
The real-time implementations allowed evaluating quantitatively the processing time under the
real-time constraints and provided the means of assessing the QoS control of the processing
steps during the real-time execution.
IX. REAL-TIME PROCESSING IN DROPS
IX.1. Issues of Source Code Porting to DROPS
As already mentioned, the MD-LLV1 and MD-XVID codecs had to be ported to the DROPS environment. The porting steps, which are defined below, have been conducted for both implemented converters and as such may be generalized as a guideline for porting the source code of any type of converter implemented in a best-effort system such as Linux or Windows to DROPS.
The porting of source code to DROPS consists of the following steps:
1) Adaptation (if one exists) or development (if not available) of the time measurement facility
2) Adaptation to logging environment for delivering standard and error messages obeying
the real-time constraints
3) Adaptation to L4 environment
The first step covered the time measurement facility, which may be based on functions returning the system clock or directly on the tick counter of the processor. DROPS provides a set of time-related functions giving the values in nanoseconds, such as l4_tsc_to_ns or l4_tsc_to_s_and_ns; however, as has been tested by measurements, they demonstrate some inaccuracy in the conversion from the number of ticks (expressed by l4_time_cpu_t, a 64-bit integer CPU-internal time stamp counter) into nanoseconds, as a trade-off for higher performance. While in non-real-time applications such inaccuracy is acceptable, in the continuous-media conversion using the hard real-time adaptive model, where very small time-point distances (e.g. between the begin and the end of the processing of one MB) are measured, it is intolerable. Therefore, a more accurate implementation also using the processor's tick counter (by exploiting DROPS functions like l4_calibrate_tsc, l4_get_hz, l4_rdtsc) but based on floating-point calculations instead of the integer-based, CPU-specific tsc_to_ns has been employed. This has led to more precise calculations.
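A minimal sketch of such a floating-point conversion is given below; the prototypes of the DROPS functions named in the text are declared here only as assumptions for illustration, and the actual headers, types and signatures may differ.

    /* Assumed prototypes of the DROPS utility functions named above (illustration only). */
    extern void               l4_calibrate_tsc(void);
    extern unsigned int       l4_get_hz(void);          /* CPU frequency in Hz           */
    extern unsigned long long l4_rdtsc(void);           /* 64-bit tick counter reading   */

    static double ns_per_tick;    /* floating-point conversion factor, set once at start-up */

    void timing_init(void)
    {
        l4_calibrate_tsc();                          /* calibrate the tick counter        */
        ns_per_tick = 1e9 / (double)l4_get_hz();     /* Hz -> nanoseconds per tick        */
    }

    /* Converts a tick difference into nanoseconds using floating point instead of
     * the integer-based l4_tsc_to_ns(). */
    double ticks_to_ns(unsigned long long start, unsigned long long end)
    {
        return (double)(end - start) * ns_per_tick;
    }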
Secondly, exhaustive logging is required for analysis purposes, especially for investigating the behavior during the execution of the real-time benchmarks. DROPS provides mechanisms for logging to the screen (LogServer) or to the network (LogNetServer). The first option is not really useful for further analysis due to the limited screen size, so only the second one is applicable here. However, the LogNetServer is based on the OSKit framework [Ford et al., 1997] and can be compiled only with gcc 2.95.x, which has been proved to be the least effective in producing efficient binaries (Figure 65 on p. 196); still, this is not a problem. The real drawback is that logging using the LogNetServer will influence the system under measurement due to the task switching effect (Figure 66 on p. 198) caused by the LOG command being executed by the log DROPS server, which is different from the converter DROPS server97, and so the delivered measures will be distorted. In addition, the LogNetServer, due to its reliance on synchronous IPC over TCP/IP, has unpredictable behavior and thus does not conform to the real-time application model. As a result, logging has to be avoided during real-time thread execution. On the other hand, it cannot simply be dropped, so the solution of using log buffers in memory to save logging messages generated during the real-time phase and flushing them to the network by the LogNetServer during the non-real-time phase is proposed. Such a solution avoids the task switching problems and the unpredictable synchronous communication, delivering an intact real-time execution of the conversion process.
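The deferred-logging idea could look roughly like the following sketch; it is illustrative only, and the buffer size as well as the flush function are placeholders for the LogNetServer-based flush performed in the non-real-time phase.

    #include <stdarg.h>
    #include <stdio.h>

    /* Illustrative in-memory log buffer: messages produced during the real-time
     * phase are only appended to memory; flushing over the network happens later
     * in the non-real-time phase. Sizes and names are placeholders. */
    #define RT_LOG_SIZE (256 * 1024)

    static char   rt_log_buf[RT_LOG_SIZE];
    static size_t rt_log_used;

    void rt_log(const char *fmt, ...)          /* called inside real-time threads */
    {
        va_list ap;
        int n;

        va_start(ap, fmt);
        n = vsnprintf(rt_log_buf + rt_log_used, RT_LOG_SIZE - rt_log_used, fmt, ap);
        va_end(ap);

        if (n > 0 && rt_log_used + (size_t)n < RT_LOG_SIZE)
            rt_log_used += (size_t)n;          /* otherwise the message is dropped */
    }

    void rt_log_flush(void)                    /* called in the non-real-time phase */
    {
        /* Placeholder: hand the buffered text over to the network logger
         * (LogNetServer) here, then reset the buffer. */
        fwrite(rt_log_buf, 1, rt_log_used, stdout);
        rt_log_used = 0;
    }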
Finally, the adaptation to the L4-specific environment used in DROPS has to be conducted. DROPS does not conform to the Portable Operating System Interface for Unix (POSIX), which is a fundamental and standard API for Linux/Unix-based systems. Moreover, DROPS is still under development and kernel changes may occur, so it is important to recognize which version of the DROPS kernel is actually used and then to do the respective adaptations for the chosen system configuration: L4Env_base for the upgraded version 1.2 (called l4v2) or L4Env_Freebsd for the previous version 1.0 (called l4v0) [WWW_DROPS, 2006]. The L4Env_base differs from L4Env_Freebsd in that the latter is based on the OSKit [Ford et al., 1997] while the former has a complete implementation of the fundamental system libraries (e.g. libc, log, names, basic_io, basic_mmap, syslog, time, etc.) on its own [Löser and Aigner, 2007]98.
97 The DROPS servers are referred to here as applications in the user space and outside the microkernel, according to the OS ontology.
As a result, the functions and procedures specific to POSIX often simply work, but sometimes they may require adaptation. For example, the assembler-based, POSIX-compatible SIGILL mechanism used by MD-LLV1 and MD-XVID to determine the SIMD processor extensions (namely MMX and SSE) is not supported on DROPS in L4Env_base mode. Thus, the adaptation of both converters has been done by simply assuming that the used hardware supports these extensions and conducting no additional checks99, which eliminated the use of the problematic command.
Another problem with DROPS is that it still does not support I/O functionality with respect to real-time reads from and writes to the disk, which is an undoubted limitation for a multimedia server. The supported simple file operations are not sufficient due to the missing real-time abilities. So, another practical solution has been employed for delivering the bit streams with video data, static and continuous meta-data as input, i.e. the particular data have been linked after compilation as binaries into the executable, loaded during the DROPS booting process into the memory, which is real-time capable, and accessed through the defined general input interface allowing for the integration of different inputs by calling an input-specific read function. Obviously, such a technique is unacceptable for real-world applications, but it allows for conducting the proof-of-concept of the assumed format independence through real-time transcoding. Another possibility would be using the low-latency real-time Ethernet-based data transmission [Löser and Härtig, 2004], but there only a specific subset of hardware allowing traffic shaping in a closed system has been used100, so it needs to be found out whether this technique may be applied in the wide area of general-purpose systems such as a multimedia server using common off-the-shelf hardware. Moreover, it would require the development of converters being able to read from and write to the real-time capable network (e.g. a DROPS-compliant RTSP/RTP sender or receiver).
98 There are also other modes available like Tiny, Sigma0, L4Env, L4Linux, etc. The detailed specification of all available configurations together with detailed include paths and libraries can be found in [Löser and Aigner, 2007].
99 Such an assumption may be called a hack; however, it reflects the reality, because nowadays almost all processors include the MMX- and SSE-based SIMD extensions in their ISA. Still, there has been the option for turning this SIMD support off by setting the compilation flags XVID_CPU_MMX and/or XVID_CPU_SSE to zero to prohibit the usage of the given ISA subset.
100 The evaluation has been conducted with only three switches (two fast-ethernet and one gigabit) and two Ethernet cards (one fast-ethernet - Intel EEPro/100 and one gigabit - 3Com 3C985B-SX). The support for other hardware is not stated. On the other hand, the evaluation included the real-time-only transmission as well as the shared network between real-time (DROPS) and best-effort (Linux) transmissions and proved the ability of guaranteeing sub-millisecond delays for a network utilization of 93% for fast and 49% for gigabit Ethernet.
IX.2. Process Flow in the Real-Time Converter
Before going into the details of the converter's processing function, the process flow has to be defined. The DSI has already been mentioned as the streaming interface between the converters. This is one of the possible options for the I/O operations required in the process flow of the real-time converter. Another possibility covers memory-mapped binaries (as mentioned in the previous paragraph) for input and an internal buffer (which is real-time capable memory allocated by the converter) for output. The DROPS simple file system and real-time capable Ethernet are yet other I/O options. The memory-mapped binaries and the internal output buffer have been selected for evaluation purposes due to their simplicity of implementation. Moreover, some problems have appeared with the other options: DROPS Simple FS did not support guaranteed real-time I/O operations, and RT-Ethernet required specific hardware support (or would require OS-related activities in the driver development).
The process flow of the real-time converter is depicted in Figure 85. All mentioned I/O options (in gray) and the selected ones (in black) are marked. The abstraction of the I/O can be delivered by the general input/output interface or by the CSI proposed by [Schmidt et al., 2003]. However, the CSI has been left out due to its support for only version 1.0 of DROPS (dependent on the OSKit) – its development has been abandoned due to the closing of the memo.REAL project. So, if the abstraction had to be supported, the only option was to write the general I/O interface such that no unnecessary copy operations appear, e.g. by using pointer hand-over and shared memory. Thus, the general input/output interfaces are nothing else but wrappers delivering information about the pointer to the given shared memory, which is delivered by the previous/subsequent element. This is a bit similar to the CSI described in section VII.2.6 (p. 190).
Figure 85. Process flow in the real-time converter.
The real-time converter uses the general input interface consisting of three functions:
    int initialize_input(int type);
    int provide_input(int type, unsigned char **address_p, unsigned int pos_n, unsigned int size_n);
    int p_read_input(int type, unsigned char **address_p, unsigned int size_n);
where the type defines the input type (being the left-most element of Figure 85) and can be provided by the control application to the converter after building the functionally-correct chain. Obviously, the given type of input has to provide all the required types of data, i.e. media data, static MD, and continuous MD used by the real-time converter (otherwise the input should not be selected for the functionally-correct chain). Then the data in the input buffer is used by calling provide_input, which checks if the requested size of data (size_n) at the given position (pos_n) can be provided by the input, and p_read_input, which based on the size of the quant (size_n) sets address_p (being a pointer) to the correct position of the next quant in the memory (and does not read any data!). The general input interface then forwards the calls to the input-specific implementation based on the type of the input, i.e. for memory-mapped binaries provide_input calls the provide_input_binary function respectively. The nice thing about this is that the real-time converter does not have to know how the input-specific function is implemented but only must know the type which should be called to get the data; thus the flexibility in delivering different transcoding chains by the control application is preserved.
Obviously, the input-specific functions should provide all three fundamental types of I/O functions, analogical to open, read, and lseek in POSIX. For the memory-mapped binaries this is done by: initialize_input_binary (being equivalent to open), provide_input_binary (checks and reads at a given position like read), and set_current_position and get_current_position (counterparts of lseek).
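To illustrate the forwarding, a minimal sketch is given below; the prototypes of provide_input and provide_input_binary follow the text, while the type constant and the internals are assumptions made only for illustration.

    /* Illustrative dispatch of the general input interface to the input-specific
     * implementation; INPUT_BINARY and the assumed prototype of provide_input_binary
     * are placeholders for the real definitions. */
    #define INPUT_BINARY 0   /* memory-mapped binaries linked into the executable */

    extern int provide_input_binary(unsigned char **address_p,
                                    unsigned int pos_n, unsigned int size_n);

    int provide_input(int type, unsigned char **address_p,
                      unsigned int pos_n, unsigned int size_n)
    {
        switch (type) {
        case INPUT_BINARY:
            /* No data is copied here: the binary-specific function only checks the
             * request and sets the pointer into the memory-mapped image. */
            return provide_input_binary(address_p, pos_n, size_n);
        default:
            return -1;   /* unknown input type, i.e. not part of a functionally-correct chain */
        }
    }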
The general output interface is implemented in analogy to the general input interface, i.e. there are functions like initialize_output, provide_output and p_write_output defined. One remark is that the internal output buffer allocates the memory itself based on a constant size provided by a system value during the start-up phase of the transcoding chain; however, it should eventually be provided by the control application as a variable parameter. Moreover, the output type should also fulfill the requirement of accepting the different output data produced by the real-time converter (again, the rule of the functionally-correct chain has to be applied) and should be given to the converter by the control application for calling the output functions properly.
IX.3. RT-MD-LLV1 Decoder
The first step was porting the decoder to the real-time environment by adapting all the standard file access and time measurement functions to those present in DROPS, as stated in the previous sections. Next, the algorithmic changes in the processing function had to be introduced. The timing functions and periods had to be defined in order to obtain the constant framerate requested by the user, in such a way that exactly one frame is provided within one period. Here, a mechanism of stopping the processing of one frame even if not every macro block (MB) was completely decoded had to be introduced. This has been provided by additional meta-data (MD) describing the bit stream structure of each enhancement layer (discussed in section V.5.1 MD-based Decoding), because finding the next frame without Huffman decoding of the remaining MBs of the current frame is not possible. So, the MD allowed for the skip operation and going to the next frame at the binary level, i.e. operating on the encoded bit stream.
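A possible shape of this binary-level skip is sketched below; the layout of the continuous meta-data record (per-frame byte lengths of each enhancement-layer bit stream) and all names are hypothetical and serve only to illustrate the idea.

    /* Hypothetical sketch: the md_frame_info layout is an assumption made only
     * for illustrating the MD-based skip at the binary level. */
    typedef struct {
        unsigned int el_frame_bytes[3];   /* coded size of this frame in QEL1..QEL3  */
    } md_frame_info;

    typedef struct {
        const unsigned char *data;        /* coded EL bit stream                     */
        unsigned int         pos;         /* current byte position                   */
        unsigned int         frame_start; /* byte position where this frame began    */
    } el_bitstream;

    /* Moves the stream pointer to the first byte of the next frame without
     * Huffman-decoding the remaining MBs of the current frame. */
    void skip_to_next_frame(el_bitstream *bs, const md_frame_info *md, int el_no)
    {
        bs->pos         = bs->frame_start + md->el_frame_bytes[el_no - 1];
        bs->frame_start = bs->pos;
    }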
IX.3.1. Setting-up Real-Time Mode
The time assigned to each timeslice is calculated by Equations (63), (64), (65) and (66) given in section VII.3.4 (p. 217) by means of initial measurements with the real-time decoder—example initial values embedded in the source code as constants are listed in Appendix G in the function load_allocation_params(), but they should be included in the machine-dependent and resolution-related static meta-data. These measurements provide the average decoding time per macro block as well as the maximum decoding time per macro block on a given platform. This allows us to avoid a complex analysis of the specific architecture—which is possible due to the use of the HRTA converter model—and delivers huge simplifications in the allocation algorithm. Alternatively, the predicted time could be calculated according to the formulas given in the Precise time prediction section. The code responsible for setting up the timeslices and the real-time mode is given in Figure 86.
set up RT mode:
  createPreempterThread(preemptPrio);
  // set up RT periodic thread
  registerRTMainThread(preemptThread, periodLength);
  // set up timeslices
  addTimeslice(baseLength, basePrio);
  addTimeslice(enhanceLength, enhancePrio);
  addTimeslice(cleanupLength, cleanupPrio);
  // switch to RT mode
  startRT(timeOffset);
  while (running) {
    do_RT_periodic_loop();
  }
Figure 86. Setting up real-time mode (based on [Mielimonka, 2006; Wittmann, 2005]).
IX.3.2. Preempter Definition
The decoder's adaptive ability on the MB level (mentioned in sub-section VII.3.4 of the Design of Real-Time Converters section) requires handling of the time-related IPCs from the DROPS kernel. The timeslice overrun IPC is only relevant for the enhancement timeslice of the main thread. In case of an enhancement timeslice overrun, the rest of the enhancement layer processing has to be stopped and skipped. For the mandatory and cleanup timeslices, timeslice overruns do not affect the processing, i.e. for the base quality processing the enhancement timeslice can additionally be used, and for the cleaning up in the delivery timeslice a timeslice overrun should never happen – otherwise it is the result of erroneous measurements, as the maximum time (worst case) should be allocated for it. The deadline miss IPC (DLM) is absolutely unintended for a hard real-time adaptive system, but it is nevertheless optionally handled by skipping the frame whenever the processing of the current frame is not finished. The system tries to limit the damage by this skipping operation, but a proper processing with the guaranteed quality cannot be assured anymore. It must be clear that a DLM might occur only due to system instability, assuming the correct allocation for the delivery timeslice (which is worst-case-based, compliant with the HRTA converter model). Finally, the preempter thread has been defined and is given as pseudo code in Figure 87 (the full listing is given in Appendix G).
preempter:
  while (running) {
    receive_preemption_ipc(msg);
    switch (msg.type) {
      case TIMESLICE_OVERRUN:
        if (msg.ts_id == ENHANCE_TS) {
          abort_enhance();
        }
        break;
      case DEADLINE_MISS:
        if (frame_processing_finished)
          abort_waiting_for_next_period();
        else {
          skip_frame();
          raise_delivery_error();
        }
        break;
    }
  }
Figure 87. Decoder’s preempter thread accompanying the processing main thread (based on
[Wittmann, 2005]).
IX.3.3. MB-based Adaptive Processing
The use of the abort_enhance() function in the preempter enforces an implementation of a checking function in the enhancement timeslice, in order to recognize when the timeslice overrun occurred and to correctly stop the processing. Therefore, the decoder checks the timeslice overrun (TSO) semaphore before decoding the next MB in the enhancement layer. If the TSO occurred, then all the remaining MBs for this enhancement layer as well as all MBs for higher layers are skipped, i.e. the semaphore blocks them and the MB loop is exited. The prototypical code used for decoding a frame in the given enhancement layer (EL) is listed in Figure 88.
Regardless of the number of already processed MBs in the enhancement timeslice, the current results are processed and arranged by the delivery step. The mandatory and delivery timeslices are required, and accordingly have to be executed completely in a normal processing, but as mentioned already, the deadline miss is handled additionally to cope with erroneous processing. The part responsible for base layer decoding has no time-related functions, i.e. neither timeslice overrun nor deadline miss are checked – the analogical description was given for the preempter pseudo-code listed in Figure 87.
decode_frame_enhance(EL_bitstream, EL_no):
  for (x = 0; x < mb_width; x++) {
    for (y = 0; y < mb_height; y++) {
      if (TSO_enhance)
        return;
      else {
        set_decode_level_mb(EL_no);
        mb_decode_enhancement(EL_bitstream);
      }
    }
  }
Figure 88. Timeslice overrun handling during the processing of enhancement layer (based on
[Wittmann, 2005]).
Finally, the delivery part delivers: 1) the results from the base layer if no MB from the enhancement processing has been produced, or 2) the output of the dequantization and the inverse transform of the MBs prepared by the enhancement timeslice. If at least one enhancement layer has been processed completely, the dequantization and inverse transform are executed for all (i.e. mb_width · mb_height) MBs in the frame, but only once.
IX.3.4. Decoder's Real-Time Loop
The real-time periodic loop demonstrating all the parts of the HRTA converter model is given by the pseudo code listed in Figure 89. It is clearly visible that at first the decoding of the base layer takes place. When it is finished, the context is switched to the next reserved timeslice. Here it does not matter whether the TSO of the mandatory (base) timeslice occurred or not, because only a miss of the enhancement TSO would be critical at this point. But the enhancement TSO is anyway not missed in the context of base layer processing, since the minimal time assumption according to the worst case has been made—for details see condition (66) on page 217. Another interesting event occurs during the enhancement timeslice, namely setting the position to the end of the frame for each decoded enhancement layer. It always occurs: 1) if the decoding of the given EL was finished, the function sets the pointer to the same position, and 2) if the decoding was not finished, the pointer is moved to the end of the frame based on the delivered continuous meta-data, allowing jumping over the skipped MBs in the coded domain of the video stream. The context is switched to the next timeslice just after the enhancement TSO, i.e. at most after the time required for processing exactly one MB; then the delivery step is executed and the processing context is switched to the non-real-time (i.e. idle) part. The delivery step finishes before or exactly at the period deadline (otherwise the situation is erroneous). If it finishes before the deadline, then the converter waits in the idle mode until the beginning of the next period.
do_RT_periodic_loop():
  // BASE_TS
  decode_base(BL_Bitstream);
  next_reservation(ts1);
  // ENHANCE_TS
  for (EL = 1; EL <= desiredLayers; EL++) {
    if (!TSO_Enhance) {
      decode_frame_enhance(EL_Bitstream[EL], EL);
    }
    setPositionToEndOfFrame(EL_Bitstream[EL]);
  }
  next_reservation(ts2);
  // CLEANUP_TS
  if (desiredLayers > 0) {
    decoder_dequant_idc_enhance();
    decoder_put_enhance(outputBuffer);
  }
  else {
    copy_reference_frame(outputBuffer);
  }
  next_reservation(ts3);
  // NON_RT_TS – do nothing until the deadline
  if (!deadline_miss) { // i.e. normal case
    wait_for_next_period();
  }
Figure 89. Real-time periodic loop in the RT-MD-LLV1 decoder.
IX.4. RT-MD-XVID Encoder
Analogically to RT-MD-LLV1, the first step in the real-time implementation of MD-XVID was the adaptation of all standard procedures and functions for file access and time measurements to the DROPS-specific environment. Also the timing functions and periods had to be defined analogically to RT-MD-LLV1. The same MB-based stopping mechanism of the process for each type of frame had to be introduced. Of course, this mechanism has been possible due to the provided continuous meta-data (see section V.5.2 MD-based Encoding), allowing the definition of a compressed emergency output in two ways: 1) by skipping the frame using a special encoded symbol at the very beginning of each frame, or 2) by avoiding the frame skip through exploiting the processed MBs and zeroing only those which have not been processed yet. Additionally, the reuse of the continuous MD and the substitution of the not-yet-coded MBs by refining the first three coefficients and zeroing the rest has been implemented, but it can be applied directly only if no resolution change of the frame is applied within the transcoding process101.
IX.4.1. Setting-up Real-Time Mode
The time assigned to each timeslice according to the HRTA converter model (see VII.3.5 Mapping of MD-XVID Encoder to HRTA Converter Model) has been measured analogically to the decoder; however, this time the values have not been hard-coded within the encoder source code, but provided as outside parameters through command line arguments to the encoder process. Such a solution allowed separating the time prediction mechanism from the real-time encoder. The possible parameters are listed in Table 8. The code responsible for setting up the timeslices and the real-time mode is exactly the same as for the RT-MD-LLV1 (given in Figure 86 on p. 232), such that the variables are set to the values taken from the arguments.
-period_length     Length of the period used by the real-time main thread; given in [ms] => periodLength
-mandatory_length  Length of the mandatory timeslice, delivering LQA under the worst-case assumption; given in [ms] => baseLength
-optional_length   Length of the optional timeslice for improving the video quality; given in [ms] => enhanceLength
-cleanup_length    Length of the delivery timeslice for exploiting the data processed in the mandatory and optional parts; it should be specified according to the worst-case assumption; if zero is specified, the frame skip will always be applied; given in [ms] => cleanupLength
Table 8. Command line arguments for setting up timing parameters of the real-time thread (based on [Mielimonka, 2006])
101 Some kind of indirect application of the first three coefficients is possible in case of a resolution change. Then the MD-based coefficients should be rescaled analogically in the frequency domain, or transformed to the pixel domain and rescaled. However, there exists an undoubted overhead of additional processing, which in case of a timeslice overrun may not be possible.
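The following minimal sketch (not the original encoder code; option handling and error checking are simplified) shows how these arguments could be mapped onto the variables used by the set-up code of Figure 86:

    #include <stdlib.h>
    #include <string.h>

    /* Illustrative parsing of the timing arguments from Table 8 into the
     * variables used in Figure 86; the real encoder may parse them differently. */
    static double periodLength, baseLength, enhanceLength, cleanupLength;   /* [ms] */

    static void parse_timing_args(int argc, char **argv)
    {
        for (int i = 1; i + 1 < argc; i++) {
            if      (!strcmp(argv[i], "-period_length"))    periodLength  = atof(argv[++i]);
            else if (!strcmp(argv[i], "-mandatory_length")) baseLength    = atof(argv[++i]);
            else if (!strcmp(argv[i], "-optional_length"))  enhanceLength = atof(argv[++i]);
            else if (!strcmp(argv[i], "-cleanup_length"))   cleanupLength = atof(argv[++i]);
        }
    }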
IX.4.2. Preempter Definition
Analogically to RT-MD-LLV1, the encoder’s adaptive ability on the MB level also requires
handling of DROPS-kernel’s IPCs informing about the time progress. The timeslice overrun
IPC is only relevant for the optional timeslice of the main thread. For the mandatory timeslice
TSO affects the processing such that the thread state is changed from mandatory to optional
and continues processing in the enhancement mode. Then, in case of the optional TSO the rest
of the enhancement processing has to be stopped and skipped. For the cleanup timeslice TSO
should not happen. If it happens, it means wrong allocation due to erroneous parameters and is
handled by skipping the frame whenever processing of the current frame was not yet finished.
Obviously, the worst-case condition should be used for time allocation of the delivery step. The
deadline miss IPC (DLM) is absolutely unintended for a hard real-time adaptive encoder and
raises delivery error of the encoder but only if the processing of the current frame was not
finished before. Contrary to the decoder, the encoder is stopped immediately in case of DLM.
The pseudo code of the encoder’s preempter thread is listed in Figure 87 (full listing is given in
Appendix G).
preempter:
  while (running) {
    receive_preemption_ipc(msg);
    switch (msg.type) {
      case TIMESLICE_OVERRUN:
        if (msg.ts_id == BASE_TS) {
          next_timeslice();
        } else if (msg.ts_id == ENHANCE_TS) {
          next_timeslice();
        } else if (msg.ts_id == CLEANUP_TS) {
          if (!frame_processing_finished)
            skip_frame();
          next_timeslice();
        }
        break;
      case DEADLINE_MISS:
        if (!frame_processing_finished) {
          raise_delivery_error();
          stop_encoder_immediately();
        }
        break;
    }
  }
Figure 90. Encoder’s preempter thread accompanying the processing main thread.
IX.4.3. MB-based Adaptive Processing
It can easily be noticed that there is a difference between the preempters of the decoder and the encoder. The decoder's preempter changes only the semaphore signaling that a timeslice overrun occurred, and the context of the current timeslice is not changed, i.e. the priority of the execution thread is not changed by the preempter but only after the currently processed MB is finished. In contrast, the encoder's preempter switches the context immediately, because there is no clear separation between the base and enhancement layers as in the decoder, i.e. MBs are encoded in the mandatory part in the same loop as in the enhancement part (for details see VII.3.5 Mapping of MD-XVID Encoder to HRTA Converter Model). The consequence is that not all MBs assigned to the optional part may be processed, because when switching to the clean-up step the MB loop is left in order to allow finishing the frame processing. Again, the deadline miss is handled within this loop, but it should never occur in a correct processing – it may be expected only in case of an erroneous allocation (e.g. when the period deadline occurs before the end of the cleanup timeslice) or system instability. Then, however, a more drastic action is taken than in case of the real-time LLV1 decoder, namely the encoder is stopped at once by returning the error signal XVID_STOP_IT_NOW, such that the real-time processing is interrupted and no further data is delivered.
encode_frame(frameType):
  for (i = 0; i < mb_width*mb_height; i++) {
    // do calculations for given MB
    encode_MB(MB_Type);
    // REALTIME control within the MB loop – inactive in non-RT mode
    if (realtime_switch == REALTIME) {
      if ((MANDATORY) OR (OPTIONAL)) { continue; }
      if (CLEANUP) {
        // leave the MB loop and clean up
        break;
      }
      if (DEADLINE) {
        // only in erroneous allocation e.g. DEADLINE before CLEANUP
        // leave the MB loop & stop immediately
        return XVID_STOP_IT_NOW;
      }
    }
  }
Figure 91. Controlling the MB-loop in real-time mode during the processing of enhancement
layer.
IX.4.4. Encoder's Real-Time Loop
Due to the proposed construction of the preempter (Figure 90) and the real-time support embedded within the code of the MB loop (Figure 91), there is no extra facility for controlling the real-time loop analogical to the one of the RT-MD-LLV1 decoder given in IX.3.4. The functionality responsible for switching the current real-time allocation context is included in the preempter thread, i.e. it calls the next_timeslice() function, which in the given context (i.e. for the subsequent timeslice) executes the DROPS-specific next_reservation() function. The final stage after the clean-up step, which calls the next context of the non-real-time timeslice, in which idle processing by wait_for_next_period() occurs, is executed in all cases except when the XVID_STOP_IT_NOW signal was generated (in other words, not if the deadline miss occurred).
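The switching logic could look roughly like the following sketch; next_reservation() and the timeslice identifiers ts1..ts3 are taken from Figure 89, while the state handling and the extern declarations are assumptions made only for illustration.

    /* Hypothetical sketch of next_timeslice() as called by the encoder's preempter. */
    enum ts_state { BASE_TS_STATE, ENHANCE_TS_STATE, CLEANUP_TS_STATE, NON_RT_TS_STATE };

    extern void next_reservation(int ts_id);   /* DROPS-specific call used in Figure 89 */
    extern int  ts1, ts2, ts3;                 /* reserved timeslice identifiers        */

    static enum ts_state current_ts = BASE_TS_STATE;

    void next_timeslice(void)
    {
        switch (current_ts) {
        case BASE_TS_STATE:    next_reservation(ts1); current_ts = ENHANCE_TS_STATE; break;
        case ENHANCE_TS_STATE: next_reservation(ts2); current_ts = CLEANUP_TS_STATE; break;
        case CLEANUP_TS_STATE: next_reservation(ts3); current_ts = NON_RT_TS_STATE;  break;
        case NON_RT_TS_STATE:  break;   /* idle part: wait_for_next_period() resets the state */
        }
    }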
Chapter 5 – Evaluation and Application
Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital. Aaron Levenstein
(Why People Work)
X. EXPERIMENTAL MEASUREMENTS
Sun, Chen and Chiang have proposed in their book [Sun et al., 2005] conducting the evaluation process in two phases with respect to transcoding: one phase is called static tests, while the other is called dynamic tests. The static tests evaluate only the coding efficiency of the transcoding algorithm, which is obviously done with respect to the data quality (QoD). Typical graphs are PSNR vs. bit rate [kbps] or vs. bits per pixel [bpp] without bit-rate control, or PSNR vs. frame number within the video sequence where a bit-rate control algorithm or a defined quantization parameter is used (as a constant factor) – e.g. by using the p-domain bit rate control algorithm with two network profiles defined: a VBR profile varying in 2-second periods and a constant CBR profile. The dynamic tests are used for evaluating the real-time behavior of the transcoding process. However, [Sun et al., 2005] only proposes the use of a simulation environment, namely a test based on the MPEG-21 P.12 test bed [MPEG-21 Part XII, 2004] is proposed to simulate the real-time behavior of the transcoder. It follows the rules of the workload estimation process, which includes four stages:
1) developing the workload model generating the controlled media streams (e.g. various bitrates, resolutions, framerates) using different statistical distributions;
2) developing an estimation of resource consumption;
3) developing the application for generating the workload according to the specified model and allowing load scalability (e.g. number of media streams);
4) measuring and monitoring the workload and resource consumption.
However, such a model only simulates the real-time behavior, but it will never precisely reflect the real environment covering a set of real media data. Thus it was decided at the very beginning of the RETAVIC project not to use any modeling or simulation environment, but to prepare prototypes running on a real computing system under the selected RTOS with a few selected media sequences recognized by the community of researchers focused on the field of audio-video processing.
X.1. The Evaluation Process
In analogy to the two mentioned phases, the evaluations have been conducted in both directions. The static tests have already been included in the respective sections, for example in Evaluation of the Video Processing Model through Best-Effort Prototypes (section V.6). The dynamic tests covering execution time measurements have further been conducted under best-effort as well as real-time OSes. Those run under best-effort systems, for example depicting the behavior of converters (e.g. in section V.1 Analysis of the Video Codec Representatives), have been used for the imprecise time measurements due to the raised risk of external disruption caused by thread preemption and the potential precision errors of the existing timing functions.
By contrast, the benchmarks executed under the RTOS have delivered the precise execution time measurements, which are especially important in real-time systems. They have been used for quantitative evaluation in two ways under DROPS: without scheduling (non-real-time mode) and with scheduling (real-time mode). The dynamic tests under the RTOS in non-RT mode have already been exploited once in the previous Design of Real-Time Converters section (VII.3) for evaluating the accuracy of the prediction algorithms. The other dynamic benchmarks under the RTOS in RT as well as non-RT mode are discussed in the following subsections. They evaluate quantitatively the real-time behavior of the converters, respectively including the scheduling according to the HRTA converter model (with TSO / DLM monitoring) or including the execution time in the non-RT mode for comparisons with the best-effort implementations.
Since the RETAVIC architecture is a complex system requiring a big team of developers to get it implemented completely, only a few parts have been evaluated. Of course, these evaluations derive directly from the implementation, i.e. they are conducted only for the implemented prototypes explained in the previous chapter. Each step in the evolution of the implemented prototypes has to be tested as to whether it fulfills the demanded quantitative requirements, which can only be checked by real measurements. In addition to that, the real-time converter itself needs to make measurements for analyzing the performance of the underlying system. The following sections discuss measurements as a base for time prediction, calibration and admission in the run-up to the encoding process. The time trace, which can be delivered by time recording during the real-time operation, can be used to recognize overrun situations and to adjust the processing in the future.
X.2. Measurement Accuracy – Low-level Test Bed Assumptions
The temporal progress of a computer program is expected to be directly related to the binary code and the processed data, i.e. the execution should go through the same sequence of processor instructions for a given combination of input data (functional correctness), and thus the time spent on a given set of subsequent steps should be the same. However, it has been proven that there exist many factors that can influence the execution time [Bryant and O'Hallaron, 2003]. These factors have to be eliminated as much as possible in every precise evaluation process, but beforehand they have to be identified and classified according to their possible impacts.
X.2.1. Impact Factors
There are two levels of impact factors according to the time scale of the duration of computer system events [Bryant and O'Hallaron, 2003], and they are depicted in Figure 92. The microscopic granularity covers such events as processor instructions (e.g. addition of integers, floating-point multiplication or floating-point division) and is measured in nanoseconds on a Gigahertz machine102. The deviations existing here derive from impacts such as the branch prediction logic and the processor caches [Bryant and O'Hallaron, 2003]. A completely different type of impact can be classified on the macroscopic level and covers external events, for example disc access operations, refresh of the screen, keystrokes or other devices (network card, USB controller, etc.). The duration of macroscopic events is measured in milliseconds, which means that they last about one million times longer than microscopic events. An external event generates an interrupt (IRQ) to activate the scheduler [Bryant and O'Hallaron, 2003], and if the IRQ handler has a higher priority than the current thread according to the scheduling policy, the preemption of the task occurs (for more details see section VII.3.1.3 Thread models – priorities, multitasking and caching), which unquestionably generates errors in the results of the evaluation process.
Figure 92. Logarithmic time scale of computer events [Bryant and O'Hallaron, 2003].
As mentioned, the impact factors being sources of irregularities have to be minimized to get highly accurate measurements. The platform-specific factors discussed in section VII.3.1 are already considered in the design phase, and especially the third subsection considers the underlying OS's factors. Contrary to the best-effort operating systems, some of the impacts can be eliminated in DROPS. Due to the closed run-time system as described in the XVIII.3 section of Appendix E, the macroscopic events such as device, keystroke or even disc access interrupts have been eliminated, thus achieving a level acceptable for the evaluation of the transcoding processes.
102 Obviously, they may be measured respectively in one-tenth or even one-hundredth of nanoseconds for faster processors, and in hundreds of nanoseconds or in microseconds on megahertz machines.
X.2.2. Measuring Disruptions Caused by Impact Factors
On the other hand, not all impacts can be eliminated, especially those caused by the microscopic events. In that case, they should be measured and then considered in the final results. A simple method applicable here is to measure the same procedure many times and to consider warm-up, measurement and cool-down phases [Fortier and Michel, 2002].
X.2.2.1 Deviations in CPU Cycles Frequency
An example of measuring the CPU frequency103, which may influence the time measurement of the transcoding process, is depicted in Figure 93. Here the warm-up as well as the cool-down phase covers 10 executions, and the 90 measured values in-between are selected as results for evaluation. Two processors have been evaluated: a) AMD Athlon 1800+ and b) Intel Pentium Mobile 2 GHz.
Figure 93. CPU frequency measurements in kHz for: a) AMD Athlon 1800+ and b) Intel Pentium Mobile 2 GHz.
The maximum absolute deviation and the standard deviation were equal respectively to 1.078 kHz and 0.255 kHz for the first processor (Figure 93 a) – please note that the scale range on the graph is on the level of 100 kHz (being 1/10000 of a GHz) to show the aberrations. Thus the absolute error, being on the level of 1/100,000th of a percent (i.e. 1·10⁻⁷ of the measured value), is negligible with respect to the one-second period. On the other hand, the maximum absolute deviation can be expressed in absolute values in nanoseconds. The measurement period of one second is divided into 10⁹ nanoseconds, and during this period there are on average 1,533,086,203 clock ticks plus-minus 1,078 clock ticks. That makes the error expressed in nanoseconds equal to 703 ns, and this value is considered later on.
103 The frequency instability is a well-known impact factor and it can be recognized by the CPU performance evaluation by simple looping of an assembler-based CPU-counter read during a one-second period derived from the timing functions. The responsible prototypical source code is implemented within the MD-LLV1 codec in the utils subtree in the timer.c module (functions: init_timer() and get_freq()).
Another situation can be observed for the second processor. Here the noticeable short drop of frequency to about 1.33 GHz is caused by the internal processor controlling facility (such as energy saving and overheating protection). Even if this drop is omitted, i.e. only 70 values instead of 90 are considered, both the maximum absolute deviation and the standard deviation are a few hundred times higher than for the previous processor and are equal respectively to 777.591 kHz and 203.497 kHz. The absolute error on the level of 1/100 of a percent (i.e. 1·10⁻⁵) is bigger than the previous one but could still be ignored; however, the unpredictable controlling facility generates impacts of a big frequency change (of about 33%) that are too complex for consideration in the measurements and are not in the scope of this work. Thus the precise performance evaluations have been executed on a processor working with exactly one frequency (e.g. AMD Athlon), where no frequency switching is possible as in Intel Pentium Mobile processors104.
104 The frequency switching is probably the reason for the problems during the attempt at defining the machine index described previously in the VII.3.1.1 Hardware architecture influence section (on p. 193).
Finally, the error caused by the deviations in CPU cycles frequency definitely influences the time
estimation of the microscopic events but can be neglected for the macroscopic events. Further,
depending on the investigated event type (micro vs. macro), this error should or should not be
considered.
X.2.2.2 Deviations in the Transcoding Time
An example of executing the coding of the same data repeatedly is used for recognizing the measurement errors of the transcoding. Here the impact factors have been analyzed in the context
of the application by determining variations of the MD-XVID encoder under DROPS for the
Coastguard CIF and Foreman QCIF sequences and for 100 executions for each sequence (with
warm-up and cool-down phases each having 10 executions). The interruptions by external macroscopic events have been eliminated by running only the critical parts of DROPS
(detailed in the XVIII.3 section of Appendix E). Next, three independent frames (i.e. 100th,
200th, and 300th) out of all frames in the sequence have been selected for the comparison of the
execution time and recognition of the level of the deviations. The results are listed in Table 9.
Sequence          Frame    Average         Maximum          Standard        Maximum         Standard
                  Number   Execution       Absolute         Deviation       Absolute        Error
                           Time [ns]       Deviation [ns]   [ns]            Error [%]       [%]
Foreman QCIF      100      8 669 832       158 198          40 972          1.82%           0.47%
Foreman QCIF      200      8 501 218       148 223          39 212          1.74%           0.46%
Foreman QCIF      300      8 549 199       162 409          43 531          1.90%           0.51%
Coastguard CIF    100      31 398 821      251 021          66 213          0.80%           0.21%
Coastguard CIF    200      30 814 549      260 892          70 911          0.85%           0.23%
Coastguard CIF    300      30 231 784      269 940          64 432          0.89%           0.21%
Table 9. Deviations in the frame encoding time of the MD-XVID in DROPS caused by microscopic factors (based on [Mielimonka, 2006]).
It can be seen that the maximum absolute error is now on the level of one percent (i.e. 1·10⁻²), while the standard error (derived from the standard deviation) is on the level of a few tenths of a percent. Obviously, these errors derive only from the microscopic impact factors, since the maximum absolute deviation is a thousand times smaller than the duration of a macroscopic event; if a macroscopic event had appeared during the execution of the performance test, the absolute deviation would be on the level of milliseconds.
Contrary to the error of the CPU frequency expressed in thousands of nanoseconds, the maximum absolute deviation of the frame encoding time is expressed in hundreds of thousands of nanoseconds. Thus the share of the CPU frequency fluctuation error in the frame encoding error amounts to a few tenths of a percent, i.e. 703 ns vs. 158,198 ns gives about 0.44% of the encoding error. In other words, the standard error measured here is a few hundred times bigger than the one derived from the measurements based on the clock cycle counter and the CPU frequency.
Additionally, the standard error calculated here is on the same level as the average of the
standard error obtained in the previous multiple-execution measurements of the machine index
being equal to or smaller than 0.7% (see section VII.3.1.1 Hardware architecture influence), which
may be a proof of correctness of the measurement.
Due to the facts given above, the CPU frequency deviations, being an undoubted burden to the microscopic events, are neglected in the frame-based transcoding time evaluations. Secondly, the time values measured per frame can be represented by numbers having only the first four most significant digits.
X.2.3. Accuracy and Errors – Summary
Finally, the accuracy of the measurements in the context of the frame-based transcoding is on the level of a few tenths of a percent. The impact factors of the macroscopic events have been eliminated, which was proved by errors being a thousand times smaller than the duration of a macroscopic event. Moreover, the encoding task per frame in QCIF takes roughly the same time as a macroscopic event.
The measurement inaccuracy is mainly caused by microscopic events being influenced by the branch prediction logic and the processor caches. The fluctuations of the CPU frequency are unimportant for the real-time frame-based transcoding evaluations due to their minor influence on the final error of the transcoding time counted per frame, and they are treated as spin-offs or side-effects.
However, if the level of measurements goes beneath the per-frame resolution (e.g. time measured per MB), the CPU frequency fluctuations may gain in importance and thus influence the results. In such a case, the frequency errors should not be treated as spin-offs but considered as meaningful for the results. These facts have to be kept in mind during the evaluation process.
X.3. Evaluation of RT-MD-LLV1
The experimental system on which the measurements have been conducted is the PC_RT
described in Appendix E and the DROPS configuration is given in section XVIII.3 of this
appendix.
X.3.1. Checking Functional Consistency with MD-LLV1
To investigate the RT-MD-LLV1 converter behavior with respect to the quality of the requested data, four tests have been executed for Container CIF, where the time per frame spent on each timeslice type according to the HRTA converter model has been measured. The results are depicted collectively in one graph (Figure 94).
Figure 94. Frame processing time per timeslice type depending on the quality level for Container CIF (based on [Wittmann, 2005]).
Each quality level is responding to the number of processed QELs i.e. none, one, two or all
three. The time spent in base (“mandatory_”), enhancement (“optional_”) and cleanup (“delivery_”)
timeslices are depicted. The lowest quality level is referred by “_QBL” extension in the graph.
The mandatory_QBL curve (dick and black) is on the level of 5.4ms per frame (it’s covered by
other curve i.e. by delivery_QEL1), and respectively the optional_QBL on 0.03ms and the
delivery_QBL on 0.8ms, which obviously is correct since in the lowest quality no calculations are
done in the optional part. The higher quality (“_QELx”) required more time than the lowest
quality. Taking more time by higher-quality processing was expected for the enhancement and
cleanup timeslices, but not for the base timeslice since it processes exactly the same amount of
249
Chapter 5 – Evaluation and Application
X. Experimental Measurements
base layer data. On the other hand, this difference between mandatory_QBL and
mandatory_QELx may stem from the optimizations of the RT-MD-LLV1, where no frame is
prepared for further enhancing if only the base quality is requested. It is also noticeable that the
processing of the base timeslice took the same time for all three higher quality levels
(mandatory_QELx).
Summarizing, it is clearly visible that the RT-MD-LLV1 decoder behaves, as assumed, similarly to
the best-effort implementation of the LLV1 decoder (see Figure 24 on p. 123), i.e. the higher the
requested quality is, the more time for the processing of the enhancement layers should be
allocated by the optional_QELn timeslice (see the respective optional curves of QEL1, QEL2, and
QEL3). What is more, the curves are almost constant (flat) lines, thus the processing is more
stable and better predictable.
X.3.2. Learning Phase for RT Mode
As the allocation is based on average and maximum execution times on the macro block level (see
VII.3.4 Mapping of MD-LLV1 Decoder to HRTA on p. 216), the time consumed by the different
timeslices was measured. This was done by setting the framerate down to a value at which all MBs
in the highest quality level (up to QEL3) could easily be processed, such that the reserved
processing time was even ten times bigger than the average case; e.g. for CIF and QCIF videos this
was 10 fps (i.e. a period equal to 100 ms) and for PAL video 4 fps (i.e. a period equal to 250 ms).
Of course, such a configuration wastes resources by leaving them idle. The period had the
following timeslices (as defined in VII.3.2 Timeslices in HRTA) assigned: 30% of the period for
the base timeslice, the next 30% for the enhancement timeslice, 20% for the cleanup timeslice and
the remaining 20% was used for the idle non-RT part (i.e. waiting for the next period).
The time really consumed by each timeslice has been measured per frame and normalized by
MB according to the number of MBs specific to each resolution (e.g. QCIF => 99 MBs) to
allow the comparison of different resolutions. This average time per MB for each frame is shown in
Figure 95 for the base timeslice, in Figure 96 for the enhancement timeslice, and in Figure 97
for the cleanup timeslice. Please notice that for the enhancement timeslice (Figure 96) the time
is measured for one MB but through all quality enhancement layers (up to QEL3), i.e. the given
time is the sum of the times spent in QEL1, QEL2 and QEL3 per one MB and leads to the lossless
video data.
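The normalization itself is straightforward; the following minimal sketch in C only illustrates the assumed computation (the function name and the example values are hypothetical and not taken from the RT-MD-LLV1 sources):

/* Normalize a measured per-frame timeslice duration to an average time per
 * macro block, so that sequences of different resolutions can be compared. */
static double time_per_mb_us(double frame_time_us, int width, int height)
{
    /* a macro block covers 16x16 luma pixels, e.g. QCIF (176x144) => 99 MBs */
    int mbs_per_frame = (width / 16) * (height / 16);
    return frame_time_us / mbs_per_frame;
}

/* Usage example: time_per_mb_us(1723.0, 176, 144) yields roughly 17.4 us per MB. */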
[Figure 95: line chart of the normalized time [µs] per MB over the frame number (0 to 115) for the
sequences container_cif, container_qcif, mobile_qcif, mother_and_daughter_qcif and shields_itu601.]
Figure 95. Normalized average time per MB for each frame consumed in the base timeslice
(based on [Wittmann, 2005]).
[Figure 96: line chart of the normalized time [µs] per MB over the frame number (0 to 115) for the
same five sequences as in Figure 95.]
Figure 96. Normalized average time per MB for each frame consumed in the enhancement
timeslice (based on [Wittmann, 2005]).
As can be seen in all three figures above, the average execution time per MB for the same
video does not vary much across the frames. There are some minor deviations in the curves,
but in general the curves are almost constant for each video. Still, there is a noticeable difference
between the videos, but it cannot be deduced that this difference is directly related to the resolution,
as one could expect.
[Figure 97: line chart of the normalized time [µs] per MB over the frame number (0 to 115) for the
same five sequences as in Figure 95.]
Figure 97. Normalized average time per MB for each frame consumed in the cleanup timeslice
(based on [Wittmann, 2005]).
To investigate the existing differences in more detail, the average and maximum times have been
calculated and the difference between them, in relation to the average, expressed on a percentage
basis. The detailed results of the execution times of all frames in a given video are listed in Table 10.
The smallest difference between average and maximum time (Δ%) can be noticed for the cleanup
timeslice. For the enhancement timeslice the differences between the average case and the worst case
are bigger, and for the base timeslice the differences are the biggest, reaching up to 20% of the average
time. On the other hand, a worst-case time being only 20% bigger than the average case is
already a very good achievement considering video decoding and its complexity.
Time per MB [µs]
Video Sequence Name        |  Base TS                |  Enhancement TS         |  Cleanup TS
                           |  avg    max    Δ%       |  avg    max    Δ%       |  avg    max    Δ%
container_cif              |  17.39  18.23   4.83%   |  39.15  40.71   3.98%   |  14.51  14.80  2.00%
container_qcif             |  18.97  20.69   9.07%   |  45.97  49.95   8.66%   |  12.50  12.74  1.92%
mobile_qcif                |  29.35  30.35   3.41%   |  54.10  55.31   2.24%   |  12.67  12.99  2.53%
mother_and_daughter_qcif   |  19.47  23.24  19.36%   |  45.80  48.65   6.22%   |  13.10  13.17  0.53%
shields_itu601             |  20.33  23.92  17.66%   |  39.20  45.26  15.46%   |  14.25  14.88  4.42%
Table 10.   Time per MB for each sequence: average for all frames, maximum for all frames,
and the difference (max-avg) in relation to the average (based on [Wittmann, 2005]).
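For reference, the Δ% column follows directly from the measured averages and maxima; a minimal C sketch of the assumed computation (the helper name is hypothetical):

#include <stdio.h>

/* Relative worst-case overhead in percent, as assumed for the delta column:
 * how much the maximum exceeds the average, in relation to the average. */
static double delta_percent(double avg_us, double max_us)
{
    return (max_us - avg_us) / avg_us * 100.0;
}

int main(void)
{
    /* e.g. the base timeslice of mother_and_daughter_qcif: 19.47 vs. 23.24 us */
    printf("%.2f%%\n", delta_percent(19.47, 23.24)); /* prints 19.36% */
    return 0;
}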
There are, however, differences between the videos, e.g. the average cases for Container differ
much from Mobile (both in QCIF), i.e. 17.39 vs. 29.35 µs. This could be
explained by the different number of coded blocks in the videos, namely in Mobile roughly
550 blocks have been coded (out of 594, which is 6 blocks · 99 MBs) per frame for the base
layer, and in Container only about 350 coded blocks, whereas the applied normalization
considered the resolution (i.e. the constant number of MBs) and not the really coded blocks.
X.3.3. Real-time Working Mode
Now, with the average and maximum times measured in the learning phase, an allocation is
possible for each video. The framerate can be set to an appropriate level in accordance with the
user request and the capabilities and characteristics of the decoding algorithm. When setting the
framerate gradually higher, the quality naturally drops, because the timeslice for the
enhancement layer processing becomes smaller, i.e. the optional part consumes less and less
time. There is, however, a limit where the framerate cannot be raised anymore and the allocation is
refused in order to guarantee the base layer quality being the lowest acceptable quality (for
details see the check condition given by Equation (66) on p. 217). The measurements of the
percentage of processed MBs for the different enhancement layers are depicted in Figure 98 for
Mobile CIF, in Figure 99 for Container QCIF and in Figure 100 for Parkrun ITU601 (PAL).
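The admission decision follows the spirit of the check condition mentioned above; the sketch below is only an illustration under simplified assumptions (the worst-case base-layer time per MB times the number of MBs must fit into the base timeslice share of the period) and not the exact form of Equation (66); the names are hypothetical:

/* Simplified admission check for the RT decoder allocation (illustration only). */
static int allocation_feasible(double wc_base_us_per_mb, int mbs_per_frame,
                               double fps, double base_share /* e.g. 0.3 */)
{
    double period_us = 1000000.0 / fps;          /* e.g. 25 fps => 40000 us */
    double base_ts_us = base_share * period_us;  /* base timeslice length   */
    double wc_frame_us = wc_base_us_per_mb * mbs_per_frame;
    return wc_frame_us <= base_ts_us;            /* otherwise refuse        */
}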
[Figure 98: chart “RT-LLV1 decoding of Mobile CIF”: percentage of decoded MBs-QELs in the
optional timeslice for the enhancement layers EL1, EL2 and EL3 over the framerate (25 to 61 fps);
the highest framerate is marked as refused allocation.]
Figure 98. Percentage of decoded MBs for enhancement layers for Mobile CIF with increasing
framerate (based on [Wittmann, 2005]).
[Figure 99: chart “RT-LLV1 decoding of Container QCIF”: percentage of decoded MBs-QELs in the
optional timeslice for the enhancement layers EL1, EL2 and EL3 over the framerate (110 to 291 fps);
the highest framerate is marked as refused allocation.]
Figure 99. Percentage of decoded MBs for enhancement layers for Container QCIF with
increasing framerate (based on [Wittmann, 2005]).
[Figure 100: chart “RT-LLV1 decoding of Parkrun ITU601”: percentage of decoded MBs-QELs in the
optional timeslice for the enhancement layers EL1, EL2 and EL3 over the framerate (6 to 15 fps);
the highest framerate is marked as refused allocation.]
Figure 100. Percentage of decoded MBs for enhancement layers for Parkrun ITU601 with
increasing framerate (based on [Wittmann, 2005]).
For framerates small enough (e.g. 29fps or less for Mobile CIF, 122fps for Container QCIF and
6fps for Parkrun ITU601), the complete video can be decoded with all enhancement layers,
achieving the lossless reconstruction of the video. For higher framerates the quality has to be
adapted by means of leaving out the remaining unprocessed MBs of the enhancement layers.
Thus the quality may be changed not only according to the levels of certain layers (such as the
PSNR values for each enhancement layer presented already in section V.6.2, where the levels
achieve respectively about 32 dB, 38 dB, 44 dB and ∞ dB and the difference is roughly equal
to 6 dB), but fine-grain scalability is also achievable. Since the LLV1 enhancement
algorithm is based on bit planes, the amount of processed bits from the bit plane (deriving from
the number of processed MBs) is directly proportional to the gained quality expressed in
PSNR values, assuming an equal distribution of error bit values in the bit plane. For
example, the framerate of 36 fps for Mobile CIF allows achieving more than 44 dB, because the
complete QEL2 and 5% of QEL3 are decoded, and if the framerate targets 40 fps then roughly
41 dB are obtained (QEL1 completely and about 50% of QEL2).
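Under the linearity assumption above, the reachable quality for a given allocation can be estimated by interpolating between the approximate layer PSNR levels quoted before; the following C sketch only illustrates this rule of thumb (the anchor values of 32 dB and 6 dB per layer are the approximations from the text, the function itself is hypothetical):

#include <math.h>

/* Rough PSNR estimate from the number of completely decoded enhancement
 * layers and the fraction of the next one; lossless (all three QELs) has an
 * infinite PSNR by definition. Illustration only. */
static double estimate_psnr_db(int complete_layers /* 0..3 */,
                               double fraction_of_next /* 0.0..1.0 */)
{
    const double base_db = 32.0;  /* base layer only                         */
    const double step_db = 6.0;   /* approximate gain per enhancement layer  */
    if (complete_layers >= 3)
        return INFINITY;          /* lossless reconstruction                 */
    return base_db + step_db * (complete_layers + fraction_of_next);
}

/* e.g. estimate_psnr_db(1, 0.5) gives about 41 dB, matching the 40 fps example. */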
After setting the framerate too high, the feasibility check according to Equation (66) will fail,
because no base layer may be processed completely. Such a situation occurs when setting the
framerate to 61 fps for Mobile CIF, and respectively to 291 fps for Container QCIF and 15 fps
for Parkrun ITU601, on the given test bed. Thus the allocation is refused for such framerates,
since the LQA cannot be guaranteed. Of course, the user is to be informed about this fact
resulting from the lack of resources.
X.4. Evaluation of RT-MD-XVID
X.4.1. Learning Phase for RT-Mode
The learning phase has been implicitly discussed in the Precise time prediction section, i.e. the
measurements have been conducted to find the calculation factors for each of the
mentioned prediction methods. The graphs depicting those measurements are given in: Figure
68, Figure 69, Figure 72, Figure 75, Figure 76, Figure 80, Figure 81 and Figure 82. However,
no quality aspects have been investigated there, neither with respect to the duration of the
mandatory timeslice nor with respect to the requested lowest quality acceptable (LQA). Thus some
more benchmarks have been conducted.
At first, the complete coding (without loss of any MB) of the videos has been conducted. The
implemented RT-MD-XVID has been used for this purpose; however, no time restrictions have
been specified, i.e. the best-effort execution mode has been used such that neither timeslice
overruns nor deadline misses nor encoding interruptions have occurred. The goal was to
measure the average encoding time per frame with minimum and maximum values, and to find
out the relevant deviations. The minimum value represents the fastest encoding and reflects the
I-frame encoding time, while the maximum value is the slowest encoding and, depending on the use
of B-frames, indicates the P- or B-frames. The results are depicted in Figure 101.
[Figure 101: chart “Encoding Time per Frame [ms] - Average (Min/Max) and Deviation” with two
panels: a) the average encoding time per frame with min/max marks per sequence, and b) the
deviation in percent (0% to 6% scale). The plotted values are:
Sequence                 Average [ms]   Deviation [ms]
Carphone QCIF (IP)        8.60           0.44
Carphone QCIF (IPB)       8.91           0.47
Coastguard CIF (IP)      31.12           0.72
Football CIFN (IP)       27.14           0.80
Foreman QCIF (IP)         8.86           0.36
Mobile QCIF (IP)          9.17           0.26
Mobile CIFN (IP)         28.46           1.18]
Figure 101. Encoding time per frame of various videos for RT-MD-XVID: a) average and b)
deviation.
The deviation is on the level of 2.3% to 5.2%. An interesting fact is that for higher resolutions
peaks occur (see the maximum values) but the total execution is more stable (with a smaller
percentage deviation). Anyway, the real-time-implemented MD-based encoding is, in comparison
to the best-effort standard encoding (see Figure 9), much more stable and predictable, since the
deviation is on a lower level (up to 5.2% versus 6.6% to 24.4%) and the max/min ratio is
noticeably lower (the slowest frame requires 1.2 to 1.33 times the processing of the fastest frame,
in contrast to 1.93 to 4.28 times).
Although the above results prove the positive influence of the MD on the processing stability and thus
allow for using the average coding time in the prediction, the most interesting question is whether the
time constraints can be kept during the real-time processing. If it is assumed that a one-CPU system is
given (e.g. the one used for the above tests, see section XVIII.3 in Appendix E), that the
encoding can use only half of the CPU's processing time (the other half is meant for decoding
and other operations), and that the frame rate of the test videos is equal to 25 fps, there are only 20 ms per
frame available (out of the 40 ms period). Consequently, only the QCIF encoding can be executed
within such a time without any loss of MBs (complete encoding). In the other cases (CIF), the
computer is not powerful enough to encode the videos completely in real time. The situation is
analogous for the sequence with the higher frame rate (CIFN), i.e. the available 50% of CPU time maps to
16.67 ms out of the 33.33 ms period (due to 30 fps). Then the test bed machine is also not capable
of encoding the complete frame, so quality losses are indispensable.
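The time budget used in this reasoning is just the period scaled by the CPU share; a minimal C sketch (the helper name is hypothetical):

/* Available encoding time per frame for a given frame rate and CPU share. */
static double encode_budget_ms(double fps, double cpu_share)
{
    double period_ms = 1000.0 / fps;  /* 25 fps => 40 ms, 30 fps => 33.33 ms */
    return cpu_share * period_ms;     /* 50% => 20 ms resp. 16.67 ms         */
}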
Thus, the next investigation covers the analysis of the time required for a specified quality expressed by
a given amount of processed MBs. As it was already mentioned in Mapping of MD-XVID
Encoder to HRTA Converter Model (section VII.3.5), the mandatory and cleanup timeslices are
calculated according to the worst-case processing time selected as the maximum of the frame-type-dependent default operations as defined in Equations (70) and (71). Since the time required for the default
operations is the smallest for I-frames and the biggest for B-frames, there is always some idle
time in the base timeslice for processing at least some of the MBs (see relaxation condition (73)).
[Figure 102: chart over the requested LQA as a percentage of coded MBs (x-axis, 10% to 90%),
showing the achieved quality as a percentage of coded MBs (left scale, curves “P-only Quality (l.s.)”
and “P&B Quality (l.s.)”) and the worst-case encoding time per frame [ms] (right scale, curves
“P-only Time (r.s.)” and “P&B Time (r.s.)”).]
Figure 102. Worst-case encoding time per frame and achieved average quality vs. requested
Lowest Quality Acceptable (LQA) for Carphone QCIF.
On the other hand, the definition allows delivering all frames, but it may happen that the B-frames
have no MBs coded at all, the P-frames have only a few MBs coded, and the I-frames a bit more
than the P-frames. Since the MB coding is generally treated as optional, there is no possibility to
define the minimum quality per frame that should be delivered. Thus the worst-case execution
time has been measured per LQA defined at different levels, i.e. different amounts of MBs
to be always coded were requested and the worst encoding time per frame was measured. This
worst-case time is depicted in Figure 102 (right scale). The quality steps are equal to every 10%
of all MBs (as given on the X-axis). Having such LQA-specific worst-case times, guarantees of
delivering the requested LQA can be given. Please note that the values for the video with P-frames
only are lower than for the P&B-frames.
Now, if the measured worst-case time of a given LQA is used as the basis for the base timeslice
calculation, then the achieved quality will definitely be higher (of course assuming the relaxation
condition, which allows for moving some of the MB-coding part into the base timeslice).
To prove the really obtained quality with guaranteed LQA, the encoding was executed once
again, not with the base timeslice as calculated according to (70) but with the timeslice set to the
measured LQA-specific worst-case execution time. The really achieved quality is depicted in
Figure 102 (left scale). It is clearly visible that the obtained quality is higher than the LQA.
Moreover, the bigger the requested LQA is, the smaller the differences between achieved quality and
LQA are, e.g. for LQA=90% the achieved quality is equal to 95% for the P-only and the P&B encoded
video, and for LQA=10% the P-only video achieves 28% and the P&B encoding achieves 40%.
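The allocation step described above essentially reduces to a lookup of the LQA-specific worst-case time determined in the learning phase; a possible shape of such a lookup is sketched below (the 10% step granularity follows the experiment above, while the names and the code itself are only an illustration):

#define LQA_STEPS 9   /* worst-case times measured for 10%, 20%, ..., 90% */

/* Pick the base timeslice from the LQA-specific worst-case times. */
static double base_ts_for_lqa(const double wc_ms[LQA_STEPS], double lqa)
{
    int idx = (int)(lqa * 10.0 + 0.5) - 1;    /* 0.10 -> 0, ..., 0.90 -> 8 */
    if (idx < 0) idx = 0;
    if (idx >= LQA_STEPS) idx = LQA_STEPS - 1;
    return wc_ms[idx];                        /* guarantees the requested LQA */
}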
X.4.2. Real-time Working Mode
Finally, the time-constrained execution fully exploiting the characteristics of the HRTA
converter model has to be conducted. Here the encoding is driven by the available processor
time and by the requested quality. Of course, the quality is to be mapped onto the predicted time
according to the learning phase, and it must be checked whether the execution is feasible at all. For the
following tests, the feasibility check was assumed to be always positive, since the goal was to
analyze the quality drop with respect to the time constraints (analogously to the previously tested
RT-MD-LLV1). Moreover, not all but only the most interesting configurations of the timeslice division
are depicted, namely, for the investigated resolutions the respective mandatory and optional
timeslices have been chosen such that the ranges depict the quality drop phase for the first fifty
frames of each video sequence. In other words, timeslices too big, where no quality drop
occurs, or too small, where none of the MBs is processed, are omitted. Additionally, the
cleanup time is not included in the graphs, since its worst-case assumption allowed finishing all
the frames correctly. The results are depicted in Figure 103, Figure 104, and Figure 105.
Figure 103 shows two sequences with QCIF resolution (i.e. 99 MBs) for two configurations of
mandatory and optional timeslices:
1) t_base_ts = 6.6 ms and t_enhance_ts = 3 ms
2) t_base_ts = 5 ms and t_enhance_ts = 3 ms
The first configuration was prepared with the assumption of running three encoders in parallel
within the limited period size of 20 ms, and the second one assumes the parallel execution of four
encoders. As can be seen, the RT-MD-XVID is able to encode on average 57.2 MBs for
Mobile QCIF and 65.4 MBs for Carphone QCIF in the first configuration, with the standard
deviation being equal to 2.3 and 5.0 MBs, respectively. This results in the processing of about 58%
and 66% of all MBs, and according to [Militzer, 2004] this should yield a PSNR quality of
about 80% to 85% in comparison to the completely coded frame. If the defined enhancement
timeslice is used, the encoder will process all the MBs for both video sequences.
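Both configurations follow a simple budgeting rule: the mandatory timeslices of all concurrently admitted encoders must fit into the shared per-period budget, while the enhancement parts remain optional. A minimal sketch of such a check (the 20 ms budget and the function name are assumptions for illustration only):

/* Do n parallel encoders with the given mandatory timeslice fit into a shared
 * per-period CPU budget? Enhancement timeslices are optional and thus not
 * part of the guarantee. */
static int parallel_encoders_fit(int n, double t_base_ts_ms, double budget_ms)
{
    return n * t_base_ts_ms <= budget_ms;
}

/* e.g. parallel_encoders_fit(3, 6.6, 20.0) and parallel_encoders_fit(4, 5.0, 20.0)
 * both hold, matching configurations 1) and 2) above. */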
For the second configuration, the RT-MD-XVID encodes on average 21.4 MBs of Mobile QCIF
and 24.6 MBs of Carphone QCIF. The standard deviations amount to 1.2 MBs for both videos,
which is also visible in the flattened curves. So, processing of only 21% and 25% of
all MBs, respectively, results in a very low but still acceptable quality [Militzer, 2004]105. And even if the
enhancement part is used, the encoded frame will sometimes (Carphone QCIF) or always (Mobile
QCIF) have some MBs still not coded. Based on the above results, an acceptable-quality range of
the encoding time, with a minimum of 5 ms and a maximum of 9.6 ms, can be derived.
105 The minimal threshold of 20% of all MBs shall be considered for frame encoding.
Figure 103. Time-constrained RT-MD-XVID encoding for Mobile QCIF and Carphone QCIF.
Analogously to the QCIF resolution, the evaluation of the RT-MD-XVID encoding has been
conducted for the CIFN (i.e. 330 MBs) and CIF (i.e. 396 MBs) resolutions, but here it was
assumed that only one encoder works at a time. So, the test timeslices are defined as follows:
1) for CIFN t_base_ts = 16.6 ms and t_enhance_ts = 5 ms
2) for CIF t_base_ts = 20 ms and t_enhance_ts = 10 ms
The results are depicted in Figure 104. The achieved quality was better for the CIF than for the
CIFN sequence, i.e. the amount of coded MBs reached 149.8 MBs for Coastguard CIF vs. 86.8
MBs for Mobile CIFN (respectively 37.8% and 26.3%) and the standard deviation was on
the level of 9.9 and 4.8 MBs. The disadvantage of CIFN derives directly from the smaller
timeslice, which is caused by the additional cost of having 5 frames more per second. On the other
hand, the content may also influence the results. Anyway, the encoder was not able to encode all
MBs for either video, even when the defined enhancement timeslice was used. In the case of
Coastguard CIF, 12% to 21% of all MBs per frame were missing only for the first few frames.
In contrast, the encoding of Mobile CIFN produced only 38% to 53% of all MBs per frame along
the whole sequence.
Figure 104. Time-constrained RT-MD-XVID encoding for Mobile CIFN and Coastguard CIF.
The last experiment has targeted the B-frame processing. Therefore the RT-MD-XVID has
been executed with the additional option -use_bframes. Assumptions analogous to the QCIF
case without B-frames have been made, i.e.:
1) t_base_ts = 6.6 ms and t_enhance_ts = 3 ms
2) t_base_ts = 5 ms and t_enhance_ts = 3 ms
The results of the RT-MD-XVID encoding for the two QCIF sequences are depicted in Figure 105.
The minimal threshold of processed MBs of Carphone QCIF in the mandatory part has been
reached only for the I- and P-frames and only for the bigger timeslice (first configuration), but the
additional enhancement allowed for finishing all the MBs for all the frames. There have been
45.1 MBs processed on average within the base timeslice; however, there is a big difference
between the frame-related processing times, expressed through the standard deviation at the high
level of 19.8 MBs. Such behavior may yield noticeable quality changes in the output.
Figure 105. Time-constrained RT-MD-XVID encoding for Carphone QCIF with B-frames.
The execution of the RT-MD-XVID with the second configuration ended up with even worse
results. Here the mandatory timeslice was not able to deliver any MBs for the B-frames. If the
processing of the other frame types were treated as a separate subset, it would be comparable to
the previous results (Figure 103). In reality, however, the average is calculated over all the
frames and thus the processing of the mandatory part is unsatisfactory (i.e. an average of 12.2 processed
MBs leads to a quality as low as 12.4%). Using the enhancement layer allows processing of 84
MBs (84.4%) on average, but still with a high standard deviation of 15.3 MBs (15.4%).
As a result, the use of B-frames is not advised, since the jitter in the number of processed MBs
results in quality deviations, which contradicts the assumption of delivering a quality level as
constant as possible, since humans classify fluctuations between good and low as lower quality
even if the average is higher [Militzer et al., 2003].
Considering all investigated resolutions and frame rates, the processing complexity of the CIF
and CIFN sequences circumscribes the limits of real-time MD-based video encoding, at least for
the used test bed. The rejection of B-frame processing makes the processing simpler, more
predictable and more efficient; nevertheless, it is still possible to use B-frame processing in some
specific applications requiring higher compression at the cost of higher quality oscillations or at the
cost of additional processing power.
XI. COROLLARIES AND CONSEQUENCES
XI.1. Objective Selection of Application Approach based on Transcoding Costs
Before going into details about specific fields of application, some general remarks on the format
independence approach are given. As stated in Related Work, there are three possible
solutions to provide at least some kind of multi-format or multi-quality media delivery. It is an
important issue to recognize which method is really required for the considered application. Some
research has already been done in this direction. In general, two aspects are always
considered: dynamic and static transformations. The dynamic transformation refers to any type of
processing (e.g. transcoding, adaptation) done on the fly, i.e. during real-time transmission on
demand. The static approach considers the off-line preparation of the multimedia data (regardless
whether it is a multi-copy or a scalable solution).
Based on those two views, the trade-off between storage and processing is investigated. In
[Lum and Lau, 2002], a method is proposed which finds an optimum considering the CPU
processing (i.e. dynamic) cost, also called the transcoding overhead, and the I/O storage (i.e. static)
cost of pre-adapted content. The hybrid model selectively prepares a sub-set of quality variants of the
data and leaves the remaining qualities to the dynamic algorithm, which uses this prepared sub-set as a
base for its calculations. The authors proposed a content adaptation framework allowing for
content negotiation and delivery realization; the latter is responsible for adapting the
content by employing the transcoding relation graph (TRG) having transcoding costs on edges
and data versions on nodes, a modified greedy-selection algorithm supporting time and space,
and the simple linear cost model [Lum and Lau, 2002], which is defined as:
t_j = m × |v_i| + c
(75)
where t_j is the processing time of version j of the data transformed from version i (v_i), c is the fixed
overhead of synthesizing any content (independent of the content size), m is the transcoding time per
unit of the source content, and |·| is the size operator. The authors claim that the algorithm can be
applied to any type of data; however, in the case of audio and video the cost defined as in Equation
(75) will not work, because the transcoding costs are influenced not only by the amount of data
but also by the content. Moreover, neither the frequency of data use (popularity) nor the storage
bandwidth for accessing the multimedia data is considered.
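For illustration, the evaluation of such a linear cost estimate on one TRG edge could look as follows; this is only a sketch of the model from [Lum and Lau, 2002] as summarized above, with hypothetical names, not code from that work:

/* Linear transcoding cost estimate for one TRG edge, cf. Equation (75):
 * t_j = m * |v_i| + c. */
struct trg_edge {
    double m_per_unit;  /* transcoding time per unit of the source content */
    double c_fixed;     /* fixed synthesis overhead, independent of size   */
};

static double transcoding_cost(const struct trg_edge *e, double source_size)
{
    return e->m_per_unit * source_size + e->c_fixed;
}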
Another solution is proposed in [Shin and Koh, 2004], where the authors consider the skewed
access patterns to multimedia objects (i.e. the frequency of use or popularity) and the storage costs
(bandwidth vs. space). The solution is proved by simulating a QoS-adaptive VOD streaming
service with Poisson-distributed requests and the popularity represented by a Zipf distribution
using various parameters (from 0.1 to 0.4). However, it did not consider the transcoding costs at
all.
Based on the two mentioned solutions, a best-suited cost algorithm could be defined in analogy
to the first proposal [Lum and Lau, 2002] and applied to the audio and video data, but at first both
storage size and bandwidth shall be considered as in [Shin and Koh, 2004], the transcoding cost
has to be refined, and the popularity of the data should be respected. Then, the applicability of the
real-time transcoding-based approach could be evaluated based not only on subjective
requirements but also on objective measures. Since this was not in the scope of the RETAVIC
project, this idea is left for further research.
XI.2. Application Fields
The RETAVIC architecture, if fully implemented, could find application in many fields. Only
some of these fields are discussed within this section, and the focus is put on the three most
important aspects.
The undoubtedly most important application area is the archiving of multimedia data with the
purpose of long-time storage, in which the group of potential users is represented by TV
producers and national and private broadcasting companies, including TV and radio. Private and
national museums and libraries also belong to this group, as they are interested in storing high-quality
multimedia data, where the separation of the internal format from the presentation format could
allow for very-long-time storage. The main interest here is to provide the available collections
and assets in a digital form through the Internet to usual end-users with average-to-high quality
expectations, or through intranets or other distribution media to researchers in
archeology, fine art and other fields.
A second possible application targets scientific databases and simulations. Here the generated
high-quality video precisely demonstrating processor-demanding simulations can be stored
without loss of information. Such videos can be created by various users from chemical,
physical or bio-engineering laboratories, gene research, or other modeling and development
centers. Of course, the costs of conducting the calculations and simulations should be much
higher than the costs of recording and storage. The time required for calculating the final results
should also be considered, since the stored multimedia data can be accessed relatively fast.
A very similar application to the above are multimedia databases for industrial
research, manufacturing and applied sciences. The difference to the previous fields is
that the multimedia data comes not from artificial simulations but from real recordings,
which are usually referred to as natural audio-video. The first example, where various data
qualities may be used, is the medical domain, in which advanced, untypical and complex
surgery is recorded without loss of information and then distributed among physicians,
medicine students and professors. Another example is the recording of scientific experiments,
where the execution costs are very high, thus the lossless audio-video recording is critical for
further analysis. Here microscope and telescope observations, where three high-resolution
cameras deliver the RGB signals separately, are considered significant. Yet another example is
the application to industrial inspection, namely there may be a critical system requiring short
but very-high-quality periodic observations. Finally, industrial research testing new
prototypes may require very high quality in very short periods of time by employing high-speed
high-resolution cameras. Of course, the data generated by such cameras should be kept in a
lossless state in order to combine it with the other sensor data.
Some other general application fields without specific examples can also be found. Among them,
a few are worth listing:
• Analysis of fast-moving or explosive scenes
  - Analysis of fast-moving machine elements
  - Optimization of manufacturing machines
  - Tests of explosive materials
  - Crash test experiments
  - Airbag developments
• Shock and vibration analysis
• Material and quality control
  - Material forming analysis
  - Elastic deformations
• High-definition promotional and slow-motion recordings for movies and television
Finally, a partial application in video-on-demand systems can be imagined as well. Here, however,
not the distribution chain including network issues like caching or media proxies is meant, but
rather the high-end multimedia data centers, where the multimedia data is prepared according
to the quality requirements of a class of end-user systems (which is represented as one user
exposing specific requirements). Of course, there may also be a few classes of devices which are
interested in the same content during multicasting or broadcasting. However, if a VoD system
is based on unicast communication, the RETAVIC architecture is not advised at all.
XI.3. Variations of the RETAVIC Architecture
The RETAVIC architecture, beside the direct application in the mentioned fields, could be used as
a source for other variants. Three possible ideas can be proposed here. The first one is to use the
RETAVIC architecture not as the extension of the MMDBMS but just for media
servers, providing them with additional functionality like audio-video transcoding. Such a conversion
option could be a separate extra module. It could work with the standard format of the media
server (usually one lossy format), which can be treated as the internal storage format in RETAVIC.
Then only the format-specific decoding would have to be implemented, as described in Evaluation of
storage format independence (p. 86).
Next, a non-real-time implementation of the RETAVIC architecture could be possible.
Here, however, an OS-specific scheduling functionality analogous in its behavior to QAS
[Hamann et al., 2001a] has to be guaranteed. Moreover, the precise timing functionality allowing
for controlling the HRTA-compliant converters should be provided. For example, [Nilsson,
2004] discusses the possibility of using fine-grained timing under the MS Windows NT family106, in which
the timing resolution goes beneath the standard timer with intervals of 10 ms, i.e. at first to the
level of 1 ms, and then to 100 ns units. Besides timing problems, also some additional issues like tick
frequency, synchronization, protection against system time changes, interrupts (IRQs), thread
preemptiveness and the avoidance of preemption through thread priority (which is limited to a few
possible classes) are discussed [Nilsson, 2004]. Even though the discussed OS could provide
an acceptable time resolution for controlling HRTA converters, the scheduling problem still
requires additional extensions.
Finally, a mixture of the mentioned variations is also possible, thus the direct application
on the media-server-specific OS would be possible. Then the need of using an RTOS could be
eliminated. Obviously, the real-time aspects would have to be supported by additional extensions in the
best-effort OS, analogously to the discussion of the previous variation.
106 In the article this covers: MS Windows NT 4.0, MS Windows 2000 and MS Windows XP.
Chapter 6 – Summary
If we knew what it was we were doing, it would not be called research, would it? Albert Einstein
XII. CONCLUSIONS
The research on format-independence provision for multimedia database systems conducted
during the course of this work has brought many answers but even more questions. The main
contribution of this dissertation is the RETAVIC architecture exploiting the meta-data-based
real-time transcoding and the lossless scalable formats providing quality-dependent processing.
RETAVIC is discussed in many aspects, of which the design is the most important part. However, the
other aspects such as implementation, evaluation and applications are not neglected.
The system design included the requirements, the conceptual model and its evaluation. The video
and audio processing models covered the analysis of codec representatives, the statement of
assumptions, the specification of static and continuous media-type-related meta-data, the presentation
of peculiar media formats, and the evaluation of the models. The need for a lossless, scalable binary
format caused the Layered Lossless Video format (LLV1) [Militzer et al., 2005] to be designed
and implemented within this project. The attached evaluation proved that the proposed internal
formats for the lossless media data storage are scalable in data quality and processing, and that
the meta-data-assisted transcoding is an acceptable solution with lower complexity and still
acceptable quality. The real-time processing part discussed aspects connected with multimedia
processing with respect to its time-dependence and with the design of real-time media converters
including the hard-real-time adaptive converter model, the evaluation of three proposed prediction
methods and the mapping of the best-effort meta-data-based converters to HRTA-compliant
converters.
The implementation depicted the key elements of the programming phase, covering the pseudo
code for the important algorithms and the most critical source code examples. Additionally, it
described the systematic course of source code porting from a best-effort to the RTOS-specific
environment, which has been referred to as a guideline of source code porting to DROPS for any type of
converter implemented in a best-effort system such as Linux or Windows.
The evaluation has proved that time-constrained media transcoding executed under the DROPS
real-time operating system is possible. The prototypical real-time implementation of the critical
parts of the transcoding chain for video, i.e. the real-time video coders, has been evaluated with
respect to functional, quantitative and qualitative properties. The results have shown the
possibility of controlling the decoding and encoding processes according to the quality
specification and the workload limitation of the available resources. The workload borders of the
test bed machine have been reached already by processing the sequences in CIF resolution.
Finally, this work delivered the analysis of requirements for an internal media storage format for
audio and video, the review of the format independence support in current multimedia management
systems (incl. MMDBMS and media servers) and the discussion of various transcoding
architectures being potentially applicable for the purpose of format-independent delivery.
Imagination is more important than knowledge. Knowledge is limited. Imagination encircles the world. Albert Einstein
(1929, Interview by George S. Viereck in Philadelphia Saturday Evening Post)
XIII. FURTHER WORK
There are a few directions in which further work can be conducted. One is to improve the audio
and video formats themselves, e.g. the compression efficiency or the processing optimization. Another
can be the refinement of the internal storage within the RETAVIC architecture; here proposals of
new formats could be expected. Yet another could be the improvement of the RETAVIC
architecture.
An extension of the RETAVIC architecture is a different aspect than a variant of the
architecture. The extension is meant here as an improvement, enhancement, upgrade, or
refinement. As such, one enhancement could be proposed in the real-time delivery phase in
the direct processing channel, namely a module for bypassing static and continuous MD.
Currently, the RETAVIC architecture sends only multimedia data in the requested format, but it
may assist in providing other formats by intermediate caching proxies. Such an extension could be
referred to as Network-Assisted RETAVIC and it would allow pushing the MD-based
conversion to the borders of the network (e.g. to media gateways or proxies). For example, the MD
could be transmitted together with the audio and video streams in order to allow building
cheaper-in-processing converting proxies, which do not have to use worst-case scheduling for
transcoding anymore. The proxies should be built in a way analogous to the real-time transcoding
of the real-time delivery phase (with all the issues discussed in the related sections).
Of course, there is an obvious limitation of such a solution: it is not applicable to live transmission
(which may not be the case in current proxies with the worst-case assumption), until there is a
method for predictable real-time MD creation. Moreover, there are two evident disadvantages:
1) the load in the network segment between the central server and the proxies will be higher, and 2)
a more sophisticated and distributed management of the proxies is required (e.g. changing the internal
format, extending the MD set, adding or changing supported encoders). On the other hand, the
transcoding may be applied closer to the client and may reduce the load of the central server.
This is especially important if there is a class of clients in the same network segment having the
same format requirements.
Another possible extension is the refinement of the internal storage format by exchanging
the old and no longer efficient format for a newer and better one, as described in Evaluation of
storage format independence. For example, MPEG-4 SVC [MPEG-4 Part X, 2007] could be
applied as the internal storage format for video. A hybrid of WavPack/Vorbis investigated in
[Penzkofer, 2006] could be applied as the internal audio format in systems where a lower level of
audio scalability is expected (just two layers).
The last aspect mentioned in this section, which could be investigated in the future, is
an extension of the LLV1 format. Here two functional changes and one processing
optimization can be planned: a) a new temporal layering, b) a new quantization layering and c) an
optimization in the decoding of the lossless stream.
Figure 106. Newly proposed temporal layering in the LLV1 format.
The first functional change covers a new proposal of temporal layering, which is depicted in
Figure 106. The idea behind it is using P-frames instead of B-frames (due to the instability in
the processing of B-frames) in the temporal enhancement layer in order to gain a smoother
decoding process and thus a better prediction of the real-time processing. This, however, may
introduce some losses in the compression efficiency of the generated bitstream. Thus the trade-off
between the coding efficiency and the gain in processing stability should be investigated in
detail.
The next change in the LLV1 algorithm refers to a different division into quantization
enhancement layers. The cross-layer switching of the coefficients between the enhancement
bitplanes allows for reconstructing the most important DC/AC coefficients first. Since the
coefficients produced by the binDCT are ordered in a zig-zag scan according to their importance,
it could be possible to realize an analogous 3-D zig-zag scanning across the enhancement bitplanes,
also according to the coefficient importance; namely, it can work as follows (a sketch of this
reordering is given after the list):
• In a current LLV1 bitplane, the value stores the difference to the next quantization layer for
  each coefficient.
• Assume the first three values from each enhancement layer are taken, namely: c1QEL1,
  c2QEL1, c3QEL1, c1QEL2, c2QEL2, c3QEL2, c1QEL3, c2QEL3, c3QEL3.
• Then:
  - the values at the first position of each layer, c1QEL1, c1QEL2, c1QEL3, are ordered as the
    first three values of the new first enhancement layer: c1QEL1, c2QEL1, c3QEL1,
  - the values at the second position of each layer, c2QEL1, c2QEL2, c2QEL3, are ordered as the
    second-next three values of the new first enhancement layer: c4QEL1, c5QEL1, c6QEL1,
  - the values at the third position of each layer, c3QEL1, c3QEL2, c3QEL3, are ordered as the
    third-next three values of the new first enhancement layer: c7QEL1, c8QEL1, c9QEL1.
• Next, always groups of three values of each layer are taken (ciQEL1, ciQEL2, ciQEL3) and
  assigned respectively to the next elements of the current QEL.
• If the current QEL is complete, the values are assigned to the next QEL.
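The following C sketch only illustrates the described interleaving of bitplane values across the quantization enhancement layers; it is not taken from the LLV1 sources and the names are hypothetical:

#define NUM_QELS 3

/* Interleave the per-coefficient bitplane values of the three quantization
 * enhancement layers: for each coefficient position i, the values of QEL1,
 * QEL2 and QEL3 are emitted before moving on to position i+1. */
static void interleave_qels(int *qel[NUM_QELS], int coeffs_per_layer,
                            int *out /* NUM_QELS * coeffs_per_layer values */)
{
    int i, l, n = 0;
    for (i = 0; i < coeffs_per_layer; i++)
        for (l = 0; l < NUM_QELS; l++)
            out[n++] = qel[l][i];
    /* out[] is then cut again into NUM_QELS new layers of coeffs_per_layer
     * values each; the first new layer now holds the most important data. */
}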
Such a reorganization would definitely raise the quality with respect to the amount of coded bits
(provided that the assumption about the coefficient importance holds), i.e. it would raise the coding
efficiency if certain levels of quality are considered. In other words, the data quality is
distributed not linearly (as now) but with the important values cumulated at the beginning of the
whole enhancement part. This method causes a higher complexity in the coding algorithm, since
there is no clear separation into one-layer-specific processing. The algorithm shall be further
investigated to prove the data quality and processing changes.
Finally, the simple processing optimization of the LLV1 decoding can be evaluated.
Theoretically, if the lossless stream is requested and QEL3 is encoded (decoded), the
quantization (inverse quantization) step may be omitted completely. This is caused by the
format assumption, i.e. the last enhancement layer produces the coefficients which are quantized
with a quantization step equal to 1, which means that the quantized value is equal to the
unquantized value. So, the (de-)quantization step is not required anymore. This introduces yet
another case in the processing and was not checked during the development of LLV1.
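A minimal sketch of this shortcut in the block reconstruction (hypothetical names; the inverse quantization is reduced to a plain multiplication here only for brevity):

/* When the lossless level is requested and QEL3 has been decoded, its
 * quantization step is 1, so the inverse quantization is an identity and
 * can be skipped. Illustration only. */
static void dequantize_block(int *coeffs, int num_coeffs,
                             int quant_step, int lossless_qel3)
{
    int i;
    if (lossless_qel3 && quant_step == 1)
        return;                          /* nothing to do for the last QEL */
    for (i = 0; i < num_coeffs; i++)
        coeffs[i] *= quant_step;         /* simplified inverse quantization */
}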
Appendix A
XIV. GLOSSARY OF DEFINITIONS
The list of definitions and terms is divided into three groups: data-related, processing-related
and quality-related terms. Each group is ordered in a logical (not alphabetical) sequence, i.e. the
most fundamental terms come first, followed by more complex definitions, such that a
subsequent definition may refer to a previous term, but the previous terms are not based on
the following ones.
XIV.1. Data-related Terms
•
media data – text, image (natural pictures, 2D and 3D graphics, 3D pictures), audio
(natural sound incl. human voice, synthetic), and visual (natural video, 2D and 3D
animation, 3D video);
•
media object – MO – special digital representation of media data; it has type, content
and format (with structure and coding scheme);
•
multimedia data – a collection of media data where more than one type of media
data is involved (e.g. movie, musical video-clip);
•
multimedia object – MMO – collection of MOs; it represents multimedia data in
digital form;
•
meta data – MD – description of an MO or an MMO;
•
quant107 (based on [Gemmel et al., 1995]) – a portion of digital data that is treated as
one logical unit occurring at a given time (e.g. sample, frame, window of samples e.g.
1024, or group of pictures – GOP, but also combination of samples and frames);
•
timed data – a type of data which depends on time108, i.e. the data (e.g. a
quant) is usable if and only if it occurs at a given point of time; in other
words, a too early or too late occurrence makes the data invalid or unusable; other terms
equivalent to timed within this work are the following: time-constrained, time-dependent;
•
continuous data – time-constrained data ordered in such a way that continuity of parts
of data, which are related to each other, is present; they may be periodic or sporadic
(irregular);
•
data stream (or shortly stream) – is a digital representation of the continuous data;
most similar definition is from [ANS, 2001] such as “a sequence of digitally encoded
signals used to represent information in transmission”;
•
audio stream – a sequence of audio samples, i.e. a sequence of numerical values
representing the magnitude of the audio signal at identical intervals. The direct
equivalent is the pulse-code modulation (PCM) of the audio. There are also extensions
of PCM such as Differential (or Delta) PCM (DPCM) or Adaptive DPCM (ADPCM),
which represent not the value of the magnitude but the differences between these
values. In DPCM, it is simply the difference between the current and the previous value,
and in ADPCM it is almost the same, but the size of the quantization step additionally
varies (thus allowing a more accurate digital representation of small values in
comparison to high values of the analog signal); a small DPCM sketch is given at the
end of this group of terms;
•
video stream – it may be a sequence of half-frames if the interlaced mode is required;
by default, the full-frame (one picture) mode is assumed;
•
continuous MO – an MO that has properties analogous to continuous data; in other
words it is a data stream where the continuous data is exactly one type of media data;
107 Quanta is the plural of quant.
108 There is also another research area of database systems which discusses timed data [Schlesinger, 2004]. However, a
different perspective on the “timed” issue is presented there and completely different aspects are discussed (global views in
Grid computing and their problems with data coming from snapshots at different points in time).
•
audio stream, video stream – used interchangeably for a continuous MO of type
audio or of type video;
•
multimedia stream – a data stream including quanta of more than one MO type; an
audio-video (AV) stream is the most common case; it’s also called continuous MMO;
•
container format – is a (multi)media file format for storing media streams; it may be
used for one or many types of media (depending on the format container specification);
it may be designed for storage e.g. RIFF AVI [Microsoft Corp., 2002c] and MPEG-2
Program Stream (PS) [MPEG-2 Part I, 2000], or optimized for transmission e.g.
Advanced Systems Format (ASF) [Microsoft Corp., 2007b] or MPEG-2 Transport
Stream (TS) with packetized elementary streams (PES) [MPEG-2 Part I, 2000];
•
compression/coding scheme – a binary compressed/encoded representation of the
media stream for exactly one specific media type; it is usually described by a four-character
code (FOURCC) being a registered (or well-recognized) abbreviation of the
name of the compression/coding algorithm;
•
Lossless Layered Video One (LLV1) – a scalable video format having four
quantization layers and two temporal layers, allowing the storage of a YUV 4:2:0 video source
without any loss of information. An upper layer extends the lower layer by storing
additional information, i.e. the upper layer relies on data from the layer below.
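As referenced in the audio stream entry above, a minimal DPCM sketch in C (illustration only, not code from this work; a possible 16-bit overflow of the differences is ignored for brevity):

/* DPCM: each output value is the difference between the current and the
 * previous PCM sample; decoding accumulates the differences again. */
static void dpcm_encode(const short *pcm, short *diff, int n)
{
    short prev = 0;
    int i;
    for (i = 0; i < n; i++) {
        diff[i] = (short)(pcm[i] - prev);
        prev = pcm[i];
    }
}

static void dpcm_decode(const short *diff, short *pcm, int n)
{
    short prev = 0;
    int i;
    for (i = 0; i < n; i++) {
        pcm[i] = (short)(prev + diff[i]);
        prev = pcm[i];
    }
}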
XIV.2. Processing-related Terms
•
transformation – a process of moving data from one state to another (transforming); it
is the most general term; the terms conversion/converting are equivalents within this
work; the transformation may be lossy, where the loss of information is possible, or
lossless, where no loss of information occurs;
•
multimedia conversion – transformation, which refers to many types of media data;
there are three categories of conversion: media-type, format and content changers;
•
coding – the altering of the characteristics of a signal to make the signal more suitable
for an intended application (…) [ANS, 2001]; decoding is an inverse process to coding;
coding and decoding are conversions;
•
converter – a processing element (e.g. computer program) that applies conversion to
the processed data;
•
coder/decoder – a converter used for coding/decoding; the term encoder also refers to a coder;
•
codec – acronym for coder-decoder [ANS, 2001] i.e. an assembly consisting of an
encoder and a decoder in one piece of equipment or a piece of software capable of
encoding to a coding scheme and decoding from this scheme;
•
(data) compression – a special case (or a subset) of coding used for 1) increasing the
amount of data that can be stored in a given domain, such as space, time, or frequency,
or contained in a given message length or 2) reducing the amount of storage space
required to store a given amount of data, or reducing the length of message required to
transfer a given amount of information [ANS, 2001]; decompression is an inverse
process to compression, but not necessarily its mathematical inverse;
•
compression efficiency – a general non-quantitative term reflecting the efficiency of a
compression algorithm, such that a more compressed output with a smaller size is
understood as a better effectiveness of the algorithm; it is also often referred to as coding
efficiency;
•
compression ratio109 – the uncompressed (original) to compressed (processed) size; the
bigger the value, the better; the compression ratio is usually bigger than 1; however, it
may occur that the value is lower than 1 (in that case the compression algorithm is not
able to compress anymore, e.g. if already compressed data is compressed again, and
should not be applied);
•
compression size109 [in %] – the compressed (processed) to uncompressed (original) size
of the data multiplied by 100%; the smaller the value, the better; the compression size
usually ranges between more than 0% and 100%; a value higher than 100% corresponds
to a compression ratio lower than 1;
•
transcoding – according to [ANS, 2001] it’s a direct digital-to-digital conversion from
one encoding scheme to a different encoding scheme without returning the signals to
analog form; however within this work it’s defined in a more general way as a
109 The definition “compression rate” is not used here due to its unclearness, i.e. in many papers it is referred to once as
compression ratio and otherwise as compression size. Moreover, the compression ratio and size are obvious properties of
the data, but they derive directly from the processing (data compression), and as such they are classified as processing-related terms.
conversion from one encoding scheme to a different one, where normally at least two
different codecs have to be involved; it is also referred to as heterogeneous transcoding;
there are also other special cases of transcoding distinguished in the later parts of this
work;
•
transcoding efficiency – analogous to the coding efficiency defined before, but with
respect to transcoding;
•
transcoder – a device or system that converts one bit stream into another bit stream
that possesses a more desirable set of parameters [Sun et al., 2005];
•
cascade transcoder – a transcoder that fully decodes and then fully encodes the data
stream; in other words, it is a decoder-encoder combination;
•
adaptation – a subset of transcoding, where only one encoding scheme is involved and
no coding scheme or bit-stream syntax is changed e.g. there is MPEG-2 to MPEG-2
conversion used for lowering the quality (i.e. bit rate reduction, spatial or temporal
resolution decrease); it is also known as homogeneous or unary-format transcoding;
•
chain of converters – a directed, one-path, acyclic graph consisting of few converters;
•
graph of converters – a directed, acyclic graph consisting of few interconnected chains
of converters;
•
conversion model of (continuous) MMO – a model provided for conversion
independent of hardware, implementation and environment; JCPS (event and data)
[Hamann et al., 2001b] or hard real-time adaptive (described later) models are suggested
to be most suitable here; continuous is often omitted due to the assumption that audio and
video are usually involved in context of this work;
•
(error) drift – erroneous effect in successively predicted frames caused by loss of data,
regardless if intentional or unintentional, causing mismatch between reference quant
used for prediction of next quant and origin quant used before; it is defined in [Vetro,
2001] for video transcoding as “blurring or smoothing of successively predicted frames
caused by the loss of high frequency data, which creates a mismatch between the actual
reference frame used for prediction in the encoder and the degraded reference frame
used for prediction in the transcoder or decoder”;
XIV.3. Quality-related Terms
•
quality – a specific characteristic of an object which allows one to compare (objectively or
subjectively) two objects and say which one has a higher level of excellence; usually it
refers to the essence of an object; however, in computer science it may also refer to a set
of characteristics; another common definition is “degree to which a set of inherent
characteristics fulfils requirements” [ISO 9000, 2005];
•
objective quality – the quality that is measured by the facts using quantitative methods
where the metric110 has an uncertainty according to metrology theory; the idea behind
the objective measures is to emulate subjective quality assessment results by the metrics
and quantitative methods e.g. for the psycho-acoustic listening test [Rohdenburg et al.,
2005];
•
subjective quality – the quality that is measured by the end user and heavily depends
on his experience and perception capabilities; an example of standardized methodology
for subjective quality evaluation used in speech processing can be found in [ITU-T Rec.
P.835, 2003];
•
Quality-of-Service – QoS – a set of qualities related to the collective behavior of one
or more objects [ITU-T Rec. X.641, 1997] i.e. as an assessment of a given service based
on characteristics; it is assumed within this work, that it is objectively measured;
•
Quality-of-Data – QoD – the objectively-measured quality of the stored MO or MMO;
it is assumed to be constant in respect to the given (M)MO111;
•
transformed QoD – T(QoD) – the objectively-measured quality requested by the user;
it may be equal to QoD or worse (but not better) e.g. lower resolution requested;
•
Quality-of-Experience – QoE – subjectively-assessed quality perceived with some
level of experience by the end-user (also called subjective QoS), which depends on
QoD, T(QoD), QoS and human factors; QoE is a well-defined term reflecting the
subjective quality given above;
110
A metric is a scale of measurement defined in terms of a standard (i.e. well-defined unit).
111
The QoD may change only when (M)MO has scalable properties i.e. QoD will scale according to the amount of accessed
data (which is enforced by the given coding scheme).
Appendix B
XV. DETAILED ALGORITHM FOR LLV1 FORMAT
XV.1. The LLV1 decoding algorithm
To understand how the LLV1 bitstream is processed and how the reconstruction of the video from all the layers is performed, the detailed decoding algorithm is presented in Figure 107. The input for the decoding process is defined by the user, i.e. he specifies how many layers the decoder should decode. Thus, the decoder accepts the base layer binary stream (required) and up to three optional QELs. Since the QELs depend on the BL, the video properties as well as other structural data are encoded only within the BL bitstream in order to avoid redundancy.
Three loops can be distinguished in the core of the algorithm: the frame loop, the macro block loop and the block-based processing. The first one is the outermost loop and is responsible for processing all the frames in the encoded binary stream. For each frame, the frame type is extracted from the BL. Four types are possible: intra-coded, forward-predicted, bi-directionally predicted and skipped frames. Depending on the frame type, further actions are performed. The next inner distinguishable part is called the macro block loop. The MB type and the coded block patterns (CBPs) of macro blocks for all layers (requested by the user) are extracted. Based on that, just some or all blocks are processed within the innermost block loop. In the case of an inter MB (forward or bi-directionally predicted), before getting into the block loop, the motion vectors are additionally decoded and the motion-compensated frame is created by calculating the reference sample interpolation, which uses an input reference frame from the frame buffer of the BL.
Figure 107. LLV1 decoding algorithm.
The block loops are executed for the base layer first and then for the enhancement layers. In contrast to the base layer, however, not all steps are executed for all the enhancement layers – only the enhancement layer executed as the last one includes all the steps. The quantization plane (q-plane) reconstruction is the step required to calculate the coefficient values by applying Equation (12) (on p. 103) and using data from the bit plane of the QEL. Dequantization and inverse binDCT are executed once, if only the BL was requested, or at most twice, if any other QEL was requested. The reconstruction of the base layer is required in both cases anyway, because the frames from the BL are used for the reference sample interpolation mentioned earlier. In the case of intra blocks an additional step is applied, namely motion error compensation, i.e. the correction of pixel values of interpolated frames by the motion error extracted from the BL or QEL, respectively.
Appendix C
XVI. COMPARISON OF MPEG-4 AND H.263 STANDARDS
XVI.1. Algorithmic differences and similarities
This section briefly describes the most important differences between the H.263 and the MPEG-4 standards for natural video coding. The differences are organized by features of the standards, according to the part of the encoding process where they are used: Motion Estimation and Compensation, Quantization, Coefficient Re-scanning and Variable Length Coding. Finally, the features are discussed that provide enhanced functionality not specifically related to the previous categories.
Motion Estimation and Compensation: The most interesting tools in this section are without
doubt Quarter Pixel Motion Compensation (Qpel), Global Motion Compensation (GMC),
Unrestricted Motion Vectors (UMV) and Overlapped Block Motion Compensation. Quarter
Pixel Motion Compensation is a feature unique to MPEG-4, allowing the motion compensation
process to search for a matching block using ¼ pixel accuracy and thus enhancing the
compression efficiency. Global Motion Compensation defines a global transformation (warping)
of the reference picture used as a base for motion compensation. This feature is implemented in
both standards with some minor differences, and it is especially useful when coding global
motion on a scene, such as zooming in/out. Unrestricted Motion Vectors allow the Motion
Compensation process to search for a match for a block in the reference picture using larger
search ranges, and it is implemented in both standards now. Overlapped Block Motion
Compensation has been introduced in H.263 (Annex F) [ITU-T Rec. H.263+, 1998] as a feature
to provide better concealment when errors occur in the reference frame, and to enhance the
perceptual visual quality of the video.
DCT: The Discrete Cosine Transformation algorithm used by any video coding standard is specified to comply with the IEEE standard 1180-1990 (which defines statistical accuracy requirements for the inverse DCT). Both standards are the same in this respect.
Quantization: DCT Coefficient Prediction and MPEG-4 Quantization are the most important features in this category. DCT Coefficient Prediction allows DCT coefficients in a block to be spatially predicted from a neighboring block in order to reduce the number of bits needed to represent them and to enhance the compression efficiency. Both standards specify DCT coefficient prediction now. MPEG-4 Quantization is unique to the MPEG-4 standard; unlike the basic quantization method, which uses a fixed-step quantizer for every DCT coefficient in a block, MPEG-4 uses a weighted quantization table method, in which each DCT coefficient in a block is quantized differently according to a weight table. This table can be customized to achieve better compression depending on the characteristics of the video being coded. H.263 adds one further operation after the quantization of the DCT coefficients, the Deblocking Filter mode, which is particularly efficient at improving the visual quality of video coded at low bit rates by removing blocking effects.
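To make the contrast concrete, the following minimal C sketch compares a fixed-step quantizer with a weight-matrix quantizer on a hypothetical block excerpt; the scaling and rounding rules of the actual H.263 and MPEG-4 specifications are deliberately simplified, and the coefficient and weight values are invented for illustration only.

/* Simplified contrast of a fixed-step quantizer with a weight-matrix quantizer.
 * The exact scaling/rounding of the real standards is intentionally omitted. */
#include <stdio.h>

/* every coefficient uses the same step derived from the quantizer scale */
static int quant_fixed(int coef, int qscale) {
    return coef / (2 * qscale);
}

/* each coefficient position i gets its own effective step via a weight table,
 * so high-frequency coefficients can be quantized more coarsely              */
static int quant_weighted(int coef, int qscale, const int *weights, int i) {
    return (16 * coef) / (weights[i] * qscale);
}

int main(void) {
    /* hypothetical 8-coefficient excerpt of a zig-zag ordered DCT block */
    const int coefs[8]   = { 620, -48, 37, 25, -18, 12, 9, 5 };
    const int weights[8] = {   8,  17, 18, 21,  23, 25, 27, 28 }; /* example table */
    const int qscale = 8;

    printf(" i   coef   fixed   weighted\n");
    for (int i = 0; i < 8; i++)
        printf("%2d %6d %7d %10d\n", i, coefs[i],
               quant_fixed(coefs[i], qscale),
               quant_weighted(coefs[i], qscale, weights, i));
    return 0;
}

With the weight table, low-frequency coefficients keep more precision than high-frequency ones, which is the additional degree of freedom the MPEG-4 method provides.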
Coefficient re-scanning: The use of alternate scan modes (vertical and horizontal), besides the
common zig-zag DCT coefficient reordering scheme, is a feature now available to both
standards. These scan modes are used in conjunction with DCT coefficient prediction to
achieve better compression efficiency.
Variable-length coding: Unlike earlier standards, which used only a single VLC table for
coding run-length coded (quantized) DCT coefficients in both intra- and inter-frames, MPEG-4
and H.263 specify the use of a different Huffman VLC table for coding intra-pictures,
enhancing the compression efficiency of the standards. H.263 goes a little bit further by
allowing some inter-frames to be coded using the Intra VLC table (H.263 Annex S) [ITU-T Rec.
H.263+, 1998].
Next, additional and new features are discussed, which do not fall in the basic encoding
categories, but define new capabilities of the standards.
Arbitrary-Shaped-Object coding (ASO): Defines algorithms to enable the coding of a scene as a collection of several objects. These objects can then be coded separately, allowing coding with higher quality for the more important objects of the scene and higher compression for unimportant details. ASO coding is unique to MPEG-4; H.263 does not offer any comparable capability.
Scalable Coding: Scalable coding allows encoding of a scene using several layers. One base layer contains a low-quality / low-resolution version of the scene, while the enhancement layers code the residual error and consecutively refine the quality and resolution of the image. MPEG-4 and H.263 introduce three types of scalability. Temporal Scalability layers enhance the temporal resolution of a coded scene (e.g. from 15 fps to 25 fps). Spatial Scalability layers enhance the resolution of a coded scene (e.g. from QCIF to CIF). SNR Scalability (also known as FGS, Fine Granularity Scalability) layers enhance the Signal-to-Noise Ratio of a coded scene. Although both standards support scalable coding, they differ in the approach used to support this capability. In contrast to H.263, MPEG-4 implements SNR scalability by using only one enhancement layer to code the reconstruction error from the base layer. This enhancement layer can be used to refine the scene progressively by truncating the layer according to the capabilities / restrictions of the decoding client to achieve good quality under QoS restrictions.
Error-resilient coding: Error resilience coding features have been introduced in MPEG-4 and H.263 to be able to effectively detect, reduce and conceal errors in the video stream caused by transmission over error-prone communication channels. These features are especially intended for low bit-rate video, but are not restricted to that case. Features such as Reversible Variable Length Codes (unique to MPEG-4), data partitioning and slices (video packets) fall into this category to enable better error detection, recovery and concealment.
Real-time coding: There are tools that enable better control for the encoding application to adapt to changing QoS restrictions and bandwidth conditions. These tools use a backward channel from the decoder to the encoder, so that the latter can change encoding settings to better control the quality of the video. Reduced Resolution Coding (MPEG-4; Reduced Resolution Update in H.263) is a feature used by both standards. It enables the encoder to code a downsampled image in order to reach a given bit rate without causing the comparably bigger loss of visual quality that occurs when dropping frames in the encoding process. MPEG-4 uses
NEWPRED, which enables the encoder to select a different reference picture for the motion compensation process if the current one leads to errors in decoding. H.263 defines a better version of this feature, Enhanced Reference Picture Selection (H.263++, Annex U) [ITU-T Rec. H.263++, 2000], which offers the same capability as NEWPRED, but adds the possibility of using multiple reference frames in the motion compensation of P and B pictures.
XVI.2. Application-oriented comparison
In a case study based on the proposed algorithm, many existing solutions as well as possible applications in the near future have been analyzed. This has resulted in four general and most common comparison scenarios: Baseline, Compression efficiency, Realtime and Scalable coding. Together with the discussion of the standards, some examples of suitable applications for each comparison scenario are given (thus making it reasonable to the reader).
Baseline: Here the basic encoding tools proposed by each standard are compared, i.e. MPEG-4 Simple versus H.263 Baseline. It is only a theoretical comparison, since H.263 Baseline, being an earlier standard and a starting point for MPEG-4 too, lacks many of the tools already used in the MPEG-4 Simple profile. This scenario is suitable for applications which do not require high-quality video or high compression efficiency and use relatively error-free communication channels, with the advantage of widespread compatibility and a cheap, low-complexity implementation. A typical application for this would be capturing video for low-level or mid-range digital cameras, home grabbing, or popular cheap hardware (simple TV cards). Because of the limitations of H.263 Baseline, an MPEG-4 Simple Profile compliant coding solution would be better for this case. However, H.263 Baseline combined with Advanced Intra Coding (H.263 Annex I) [ITU-T Rec. H.263+, 1998] offers almost the same capabilities with similar complexity. So the choice between any of these solutions is a matter of taste, although a series of in-depth tests and benchmarks of available implementations could shed a better light on which standard performs better in this case (some of the results are available in [Topivata et al., 2001; Vatolin et al., 2005; WWW_Doom9, 2003]).
Compression efficiency: This is one of the main comparison scenarios. Here the comparison
of H.263 and MPEG-4 regarding the tools they offer to achieve high compression efficiency is
proposed, that is, the tools that help to encode a scene with a given quality with the least
possible amount of bits. For this scenario, the MPEG-4 Advanced Simple Profile is compared against H.263’s High Latency Profile. In this application scenario the focus is on achieving high compression and good visual quality. Typical representative applications for this scenario include home-user digital video at VCD and DVD qualities, the digital video streaming and downloading (Video on Demand) industry, the High Definition TV (Digital TV) broadcasting industry, surgery in hospitals, digital libraries and museums, multimedia encyclopedias, video sequences in computer and console games, etc. The standard of choice for this type of application is MPEG-4, as it offers better compression tools, such as MPEG-4 Quantization, Quarter Pixel Motion Compensation and B-frames (H.263 can support B-frames, but only in scalability mode).
Realtime: This is another interesting scenario for comparison, where the tools that each standard offers for dealing with real-time encoding and decoding of a video stream are examined. For this scenario, the MPEG-4 Advanced Real-Time Streaming profile is compared with H.263’s Conversational Internet Profile. The focus in this scenario lies not on compression or high resolution, but on manageable complexity for real-time encoding and on the error detection, correction and concealment features typical for a real-time communication scenario, where transmission errors are more probable. Applications in this scenario make use of video with low to medium resolution and usually low constant bitrates (CBR) to facilitate its live transmission. Video conferencing is a good representative of this application scenario. Video is coded in real time, and there is a continuous CBR communication channel between encoder and decoder, so that information is exchanged for controlling and monitoring purposes. Other applications which need ‘live’ encoding of video material, such as video telephony, process monitoring applications, surveillance applications, network video recording (NVR), web cams and mobile video applications (GPRS phones, military systems), live TV transmissions, etc., can make use of video encoding solutions for the realtime scenario. Both H.263 and MPEG-4 have put effort into developing features for this type of application. However, H.263 is still the standard of choice here, since it offers tools specially designed to deliver better video at low resolutions and low bitrates. Features such as the Deblocking Filter and Enhanced Reference Picture Selection make H.263 a better choice than MPEG-4.
Scalable coding: Last but not least, a comparison of both standards according to the tools they provide for scalable coding is proposed. Scalable coding is an attractive alternative to real-time encoding for satisfying Quality-of-Service restrictions. In this scenario the MPEG-4 Simple Scalable and FGS scalable profiles are compared to H.263 Baseline + Advanced Intra Coding (Annex I) + Scalability (Annex O) [ITU-T Rec. H.263+, 1998]. The goal is to compare the ability of both standards to provide good quality and flexible adaptation to a particular QoS level by using enhancement layers. The general idea of scalable coding is to encode the video just once, but serve the video at several quality / resolution levels. The desired QoS shall be achieved by sending more or fewer enhancement layers according to the network’s bandwidth conditions. Scalable coding is designed to suit a large variety of applications, due to its ability to encode and send video at various bit rates. Video-on-Demand applications can make use of the features offered by this scenario and offer low (e.g. modem) / medium (e.g. ISDN) / high (e.g. ADSL) bitrate versions of music videos or movie trailers, without having to keep three different versions of the video stream, one for each bit rate. Other types of applications that benefit from this scenario are those where the communication channel used to transmit video does not offer a constant bandwidth, so that the video bit rate has to adapt to the changing conditions of the network; streaming applications over mobile channels come to mind. Although H.263 offers features to support scalable coding, these features are not as powerful as those offered by MPEG-4. Of special interest here is the new SNR scalability approach of MPEG-4, which is much more flexible than former scalability solutions. One of the potential problems of scalable coding, however, is the limited availability of open and commercial encoders and decoders supporting it at present, since most MPEG-4 compliant products only comply with the Simple or Advanced Simple Profile (compression efficiency), and most H.263 products target only the mobile / real-time low bit-rate market.
After defining a comparison scenario, the addressed area of the problem can be presented by a graph of ranges (or graph of coverage). Such graphs can show the dependencies between the application requirements and the area covered by the scenario. As an example (Figure 108), a graph of quality range vs. bandwidth requirements (with roughly estimated H.263 and MPEG-4 functions of behavior) is depicted.
Figure 108. Graph of ranges – quality vs. bandwidth requirements
XVI.3. Implementation analysis
In this part we have a look at current implementations of the MPEG-4 and the H.263 standard (we have chosen one representative for each of them). However, we do not go into much detail, because many ad-hoc comparisons are publicly available; for example, [WWW_Doom9, 2003] compared seven different implementations in 2003 and [Vatolin et al., 2005] compared a different set of seven codecs in 2005. Besides, we do not want to provide yet another benchmark and performance evaluation description.
One of the disadvantages of this type of comparison is that the current implementations of each standard target different application markets. MPEG-4 compliant applications are mostly compliant with the Simple Profile (SP) or the Advanced Simple Profile (ASP). Some of these applications are open source, but most are commercial products. Other MPEG-4 profiles do not offer such a variety of current solutions on the market, and for many of them it is even very hard to find more than one company offering solutions for that specific profile. H.263 is exclusively a low bit-rate encoder. There are few non-commercial products based on the standard, and even the reference implementation, now maintained by the University of British Columbia (UBC), has become a commercial product. Even for research purposes, obtaining the source of the encoder is subject to a payment. Both H.263 and MPEG-4 use many algorithms whose patents and rights are held by commercial companies, and as such, one must be very careful not to break copyright agreements.
XVI.3.1. MPEG-4
This is a list of products based on the MPEG-4 standard (as owners declare):
• 3viX: SP, ASP
• On2 VP6: SP, ASP
• Ogg Theora (VP3-based): SP
• DivX 5.x: SP, ASP
• XVID 0.9: SP, ASP
• Dicas mpegable: SP, ASP, Streaming technology
• QuickTime MPEG-4: SP, Streaming technology
• Sorenson MPEG-4 Pro: SP, ASP
• UB Video: SP, ASP
XVID [WWW_XVID, 2003] is open source. As such, it was easier to analyze this product and test its compliance with MPEG-4. The results of the analysis of the source code (version 0.9, stable) show that XVID is at the moment only an SP-compliant encoder. However, the development version of the codec aims for ASP compliance. The ARTS profile should be included in later versions of XVID.
One of the missing parts is the ability to generate an MPEG-4 System stream; only the MPEG-4 Video stream is generated. On the other hand, the video stream may be encapsulated in the AVI container. Moreover, there are tools available to extract the MPEG-4 compliant stream and encapsulate it in an MPEG-4 System stream.
XVI.3.2. H.263
This is only a short list of products based on the H.263 standard (as owners declare):
• Telenor TMN 3.0
• Scalar VTC: H.263+
• UBC H.263 Library 0.3
As representative for the H.263 standard we chose the Telenor TMN 3.0 encoder, which was the reference implementation for the standard. This version is, however, somewhat obsolete in
comparison to the new features introduced by H.263+++ [ITU-T Rec. H.263+++, 2005]. The
software itself only supports the annexes proposed by the H.263 standard document (Version 1
in 1995) and the following annexes from the H.263+ standard document (Version 2): K, L, P,
Q, R [ITU-T Rec. H.263+, 1998].
Appendix D
XVII. LOADING CONTINUOUS METADATA INTO ENCODER
The pseudo code showing how to load the continuous MD into the encoder for each frame is presented in Listing 6. The continuous MD are stored in this case as a binary compressed stream, so at first a decompression using simple Huffman decoding has to be applied (not depicted in the listing). The resulting stream is then nothing else than a sequence of bits, where the given position(s) is (are) mapped to a certain value of the continuous MD element.
LoadMetaData
  R  bipred (1 bit) {0,1}
  R  frame_type (2 bits) {I_VOP, P_VOP, B_VOP}
  if !I_VOP
    oR fcode (FCODEBITS) => length, height
    if bipred
      oR bcode (BCODEBITS) => b_length, b_height
    endif
  endif
  R  mb_width (MBWIDTHBITS)
  R  mb_height (MBHEIGHTBITS)
  // do for all macro blocks
  for 0..mb_height
    for 0..mb_width
      // def. MACROBLOCK pMB
      R  pMB->mode (MODEBITS)
      R  pMB->priority (PRIORITYBITS)
      // do for all blocks
      for i=0..5
        if I_VOP
          oR pMB->DC_coeff[i] (12)
          oR pMB->AC_coeff[i] (12)
          oR pMB->AC_coeff[i+6] (12)
        elseif (B_VOP && MODE_FORWARD) ||
               (P_VOP && (MODE_INTER || MODE_INTRA))
          oR MVECTOR(x,y) (length,height)
          if !B_VOP && MODE_INTRA
            oR pMB->DC_coeff[i] (12)
            oR pMB->AC_coeff[i] (12)
            oR pMB->AC_coeff[i+6] (12)
          elseif B_VOP && MODE_BACKWARD
            oR MVECTOR(x,y) (b_length,b_height)
          endif
        else
          // for direct mode: B_VOP && MODE_DIRECT
          // and other (unsupported yet) modes
          // do nothing with MD bitstream
        endif
        if bipred && (B_VOP || (P_VOP && !MODE_INTRA))
          oR MVECTOR(x,y) (b_length,b_height)
        endif
      endfor
    endfor
  endfor
endLoadMetaData
Listing 6. Pseudo code for loading the continuous MD.
The rows marked with R always read the data (highlighted in the original listing), while the rows marked with oR perform an optional read (not always included in the stream; it depends on the previously read values). The size of the read data in bits is defined in round brackets just after the given MD property (in bold). The size may be represented by a constant in capitals; these relate respectively: FCODEBITS to the maximum size of the forward MV, BCODEBITS to the maximum size of the backward MV, MBWIDTHBITS and MBHEIGHTBITS to the maximum number of MBs in width and in height, MODEBITS to the number of supported MB types, and PRIORITYBITS to the total number of MBs in the frame. The values are given in curly brackets if the domain is strictly defined. The sign “=>” means that the read MD attribute allows for calculating other elements used later on.
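As an illustration of how the R and oR operations of Listing 6 can be realized, the following minimal C bit reader reads fixed-width fields from a byte buffer; the MSB-first bit order, the struct layout and the mapping of I_VOP to the value 0 are assumptions made only for this sketch and are not taken from the RETAVIC sources.

/* Minimal MSB-first bit reader sketch for the R / oR operations of Listing 6. */
#include <stdio.h>
#include <stdint.h>

typedef struct {
    const uint8_t *buf;
    size_t         len;    /* buffer length in bytes */
    size_t         pos;    /* current bit position   */
} bitreader_t;

/* R: always read 'nbits' bits and return them as an unsigned value */
static uint32_t read_bits(bitreader_t *br, int nbits) {
    uint32_t v = 0;
    for (int i = 0; i < nbits && br->pos < 8 * br->len; i++, br->pos++) {
        int bit = (br->buf[br->pos >> 3] >> (7 - (br->pos & 7))) & 1;
        v = (v << 1) | bit;
    }
    return v;
}

/* oR: read only when the previously decoded context says the field is present */
static uint32_t read_bits_opt(bitreader_t *br, int present, int nbits) {
    return present ? read_bits(br, nbits) : 0;
}

int main(void) {
    const uint8_t stream[] = { 0xB4, 0x32 };          /* 1011 0100 0011 0010   */
    bitreader_t br = { stream, sizeof stream, 0 };

    uint32_t bipred     = read_bits(&br, 1);          /* R bipred (1 bit)      */
    uint32_t frame_type = read_bits(&br, 2);          /* R frame_type (2 bits) */
    uint32_t fcode      = read_bits_opt(&br, frame_type != 0 /* !I_VOP, assumed */, 3);

    printf("bipred=%u frame_type=%u fcode=%u\n", bipred, frame_type, fcode);
    return 0;
}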
Appendix E
XVIII. TEST BED
Many resources have been used for conducting the RETAVIC project. They have been allocated depending on the tasks defined in Section X.1 (The Evaluation Process). Due to the type of measurements, three general groups are distinguished: non-real-time processing of high load, imprecise measurements in non-real-time, and precise measurements in real-time. These three groups have been using different equipment, and the details are given for each group separately.
XVIII.1. Non-real-time processing of high load
The goal of the test bed for non-real-time processing of high load is to measure the functional aspects of the examined algorithms, i.e. the compression efficiency, the quality, the dependencies between achieved bitrates and quality, and the scalability of the codecs.
The specially designed MultiMonster server served this purpose. The server was a powerful cluster built of one “queen bee” and eight “bees”. The detailed cluster specification is given in Table 11, and the hardware details about the queen bee and the bees are given in Table 12 and Table 13, respectively.
MULTIMONSTER CLUSTER
CONTROL: Administrative Management Console (19’’ LCD + keyboard + mouse), Switch 16x
BEES: 8x MM Processing Server, each with 2x Intel Pentium 4 2.66GHz (only 1 processor installed)
QUEEN BEE: 1x MM System Server, 2x Intel Xeon 2.8GHz, RAID storage attached, OS management and configuration, cluster tools: ClusterNFS & OpenMOSIX
NETWORK: Switch 1Gbps, 24 ports
STORAGE: EasyRAID System 3.2TB, 16x WD 200GB 7.2kRPM 2MB cache, 8.9ms; effective storage: RAID Level 5 => 1.3TB, RAID Level 3 => 1.3TB
POWER: UPS 3kVA (just for QUEEN BEE and STORAGE)
Total available processors: real 10 / virtually seen 12 (Xeon HyperThreading)
Table 11. Configuration of the MultiMonster cluster.
QUEEN BEE
CPU model name: 2x Intel(R) Xeon(TM) CPU 2.80GHz
CPU clock (MHz): 2785
Cache size: 512 KB
Memory (MB): 2560
CPU Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
CPU Speed (BogoMips): 5570.56
Network Card: Intel Corp. 82543GC Gigabit Ethernet Controller (Fiber) (rev 02); 2x Broadcom Corporation NetXtreme BCM5703 Gigabit Ethernet (rev 02)
RAID Bus Controller: Compaq Computer Corporation Smart Array 5i/532 (rev 01)
Table 12. The hardware configuration for the queen-bee server.
BEE
CPU model name: Intel(R) Pentium(R) 4 CPU 2.66GHz
CPU clock (MHz): 2658
Cache size: 512 KB
Memory (MB): 512
CPU Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
CPU Speed (BogoMips): 5308.41
Network Card: 2x Broadcom Corporation NetXtreme BCM5702 Gigabit Ethernet (rev 02)
Table 13. The hardware configuration for the bee-machines.
The operating system is an adapted version of Linux SuSE 8.2. The special OpenMOSIX kernel in version 2.4.22 was patched to support the Broadcom gigabit network cards (bcm5700 v.6.2.17-1). Moreover, it was extended by the special ClusterNFS v.3.0 functionality. Special kernel configuration options have been applied, among others: support for 64GB RAM and symmetric multiprocessing (SMP). The software used within the testing environment covers: OpenMOSIX tools (such as view, migmon, mps, etc.), self-written scripts for cluster management (cexce, xcexec, cping, creboot, ckillall, etc.), transcode (0.6.12) and audio-video processing software (codecs, libraries, etc.).
There are also other tools available which have been written by students to support the RETAVIC project, to name just a few: the audio-video conversion benchmark AVCOB (based on transcode 0.6.12) by Shu Liu [Liu, 2003], file analysis extensions for MPEG-4 (based on tcprobe 0.6.12) by Xinghua Liang, the LLV1 codec with analyzer and transcoder (based on XviD) by Michael Militzer, the MultiMonster Multimedia Server by Holger Velke, Jörg Meier and Marc Iseler (based on JBoss AS and Java Servlet technology), and automation scripts for benchmarking audio codecs by Florian Penzkoffer.
Finally, there are also a few applications by the author, such as: a PSNR measurement tool, a YUV2AVI converter, a YUV presenter, and a web application presenting some of the results (written mainly in PHP).
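As an illustration of the kind of objective measurement such a PSNR tool performs, the following minimal C sketch computes the textbook PSNR = 10 log10(255^2 / MSE) over two equally sized 8-bit luminance planes; it is not the author's implementation, and the sample data is synthetic.

/* Minimal PSNR sketch for two 8-bit luminance planes (compile with -lm). */
#include <stdio.h>
#include <math.h>
#include <stdint.h>

static double psnr_luma(const uint8_t *ref, const uint8_t *dist, size_t n) {
    double mse = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = (double)ref[i] - (double)dist[i];
        mse += d * d;
    }
    mse /= (double)n;
    if (mse == 0.0) return INFINITY;                 /* identical frames */
    return 10.0 * log10((255.0 * 255.0) / mse);
}

int main(void) {
    /* synthetic 4x4 example instead of a real QCIF/CIF frame */
    const uint8_t ref[16]  = { 16, 32, 48, 64, 80, 96,112,128,
                              144,160,176,192,208,224,240,255 };
    const uint8_t dist[16] = { 18, 30, 48, 66, 80, 94,112,130,
                              144,158,176,194,208,222,240,253 };
    printf("PSNR = %.2f dB\n", psnr_luma(ref, dist, 16));
    return 0;
}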
XVIII.2. Imprecise measurements in non-real-time
This part was used for the first proof of concept with respect to the expected behavior of the audio-video processing algorithms. It was applied in a best-effort system (Linux or Windows), so some error has been allowed. The goal was not to obtain exact measurements, but rather to show the differences between the standard and the developed algorithms and to justify the relevance of the proposed ideas.
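A minimal sketch of such a best-effort wall-clock measurement is shown below; gettimeofday() readings include scheduling and interrupt jitter, which is precisely why the resulting numbers are treated as imprecise. The measured function is only a placeholder.

/* Best-effort wall-clock timing sketch (Linux). */
#include <stdio.h>
#include <sys/time.h>

static void work_under_test(void) {
    volatile long s = 0;
    for (long i = 0; i < 10 * 1000 * 1000; i++) s += i;  /* stands in for a decode call */
}

int main(void) {
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    work_under_test();
    gettimeofday(&t1, NULL);
    long us = (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec);
    printf("elapsed: %ld us\n", us);
    return 0;
}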
To provide imprecise measurements of the analyzed audio and video processing algorithms that are still burdened with only a relatively small error, isolation from the network has to be applied to avoid unpredictable outside influence. Moreover, to prove the behavior of the processing on diverse processor architectures, different computer configurations had to be used in some cases. Thus a few other configurations have been employed, as listed below in Table 14 and Table 15.
PC_RT
CPU model name: AMD Athlon(tm) XP 1800+
CPU clock (MHz): 1533
Cache size: 256 KB
Memory (MB): 512
CPU Flags: fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscal mp mmxext 3dnowext 3dnow
CPU Speed (BogoMips): 3022.84
Network Card: 3com 3c905 100BaseTX
Table 14. The configuration of PC_RT.

PC
CPU model name: Intel(R) Pentium(R) 4 Mobile CPU 1.60GHz
CPU clock (MHz): 1596
Cache size: 512 KB
Memory (MB): 512
CPU Flags: fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
CPU Speed (BogoMips): 2396.42
Network Card: not used
Table 15. The configuration of PC.
XVIII.3. Precise measurements in DROPS
The precise measurements in the real-time system are done under the closed system. To provide comparability of the measurements, exactly one computer has been used. Its detailed configuration is listed in the previous section as PC_RT in Table 14.
Due to the DROPS requirements, only a specific type of network card could be used. The network card used was based on the 3Com 3c905 chip. Three identical machines with such network cards have been configured for the development under DROPS; however, the real-time measurements have always been conducted on the same machine (faui6p7). This allowed minimizing the error even more during the measurement process.
If not explicitly stated in the in-place description, a general ordered set of modules from DROPS has been used. The set included the following modules in sequence:
• rmgr – the resource manager with the sigma0 option (reference to the root pager) for handling physical memory, interrupts and tasks, and loading the kernel
• modaddr 0x0b000000 – allowing for allocating higher addresses of memory
• fiasco_apic – the microkernel with the APIC one-shot mode configured and scheduling
• sigma0 – the root pager, which possesses all the available memory at the beginning and makes it available to other processes
• log_net_elinkvortex – the network logging server using the supported 3Com network card driver; in very rare cases, the standard log was used instead of the network log_net_elinkvortex
• names – the server for registering names of running modules; it was also used to control the boot sequence of the modules (because grub does not provide such functionality)
• dm_phys – the simple dynamic memory (data space) manager providing memory parts for demanding tasks
• simple_ts – the simple, generic task server required for additional address space creation during runtime (L4 tasks)
• real_time_application – the real-time audio-video decoding or encoding application; it runs as a demanding L4 task (thus needs simple_ts and dm_phys)
Such a configuration of a constant set of OS modules allowed achieving a very stable and predictable benchmarking environment, where no outside processes could influence the real-time measurements. The only possible remaining source of interrupts could be the network log server used in DROPS for grabbing the measurement values. However, it did not send the output at once, but only after the real-time execution was finished. Thus, the possible interrupts generated by the network card on a given IRQ have been eliminated from the measured values.
Appendix F
XIX. STATIC META-DATA FOR FEW VIDEO SEQUENCES
The attributes’ values of the entities in the initial static MD set have been calculated for a few video sequences. Instead of representing these values in tables, which would occupy an enormous amount of space, they are demonstrated in graphical form. There are three levels of values depicted, in analogy to the natural hierarchy of the initial static MD set proposed in the thesis (Figure 12 on p. 98)112. The frame-based static MD represent the StaticMD_Video subset, the MB-based static MD refer to the StaticMD_Frame subset, and the MV-based (or block-based) static MD are connected with the StaticMD_MotionVector (or StaticMD_Layer) subset. Each of these levels is depicted in an individual section.
XIX.1. Frame-based static MD
The sequences under investigation have usually been prepared in two versions (Figure 109): the normal one, where only I- and P-frames have been defined, and the one with the _temp extension, where B-frames have additionally been included. The distribution of P- and B-frames within the video sequences was enforced by the LLV1 temporal scalability layer, i.e. an equal number of P- and B-frames appeared. The sums of each frame type (IFramesSum, PFramesSum, and BFramesSum) are shown as the distribution in the video sequence.
112 The division into three levels has been used in the precise time prediction during the design of the real-time processing model.
(Chart: percentage distribution of I-, P- and B-frames, from 0% to 100%, for the sequences parkrun_itu601, shields_itu601, mobcal_itu602_temp, mobcal_itu601, mobile_cif, mobile_cif_temp, container_cif, container_cif_temp, mother_and_daugher_cif_temp, mother_and_daugher_cif, mobile_qcif_temp, mobile_qcif, container_qcif, container_qcif_temp, mother_and_daugher_qcif, mother_and_daugher_qcif_temp.)
Figure 109. Distribution of frame types within the used set of video sequences.
XIX.2. MB-based static MD
The MB-based static MD have been prepared for a few video sequences in analogy to the frame-based calculations. There are three types of macro blocks distinguished, and the respective sums (IMBsSum, PMBsSum, and BMBsSum) included in the StaticMD_Frame subset of the initial static MD set are calculated for each frame in the video sequence and depicted in the charts below.
(Charts: number of coded I-, P- and B-MBs per frame in layer L0 for the sequences carphone_qcif_96, carphone_qcif_96_temp, coastguard_cif, coastguard_cif_temp, coastguard_qcif, coastguard_qcif_temp, container_cif, mobile_cif, mobile_cifn_140, mobile_cifn_140_temp, mobile_qcif and mobile_qcif_temp.)
XIX.3. MV-based static MD
The MV-based static MD have been prepared for a few videos, again in analogy to the previous sections. There are nine types of MVs distinguished, as described in the Video-Related Static MD section (V.3). The respective frame-specific sum (MVsSum) is kept in relation to the MV type in the StaticMD_MotionVector subset of the initial static MD set. Besides the nine types, there is one more value called no_mv. This value refers to the macro blocks in which no motion vector is stored, and thus the MB is intra-coded. Please note that no_mv is different from the zero MV (i.e. x=0 and y=0), because in the case of no_mv neither the backward-predicted nor the bi-directionally-predicted interpolation occurs, while in the other case one of these is applied.
XIX.3.1. Graphs with absolute values
The charts below depict the absolute number of MVs per frame, depending on the type. The sum of all ten cases (nine MVs + no_mv) is constant for the sequences having only I- or P-MBs (or frames), because either one MV or no_mv is assigned per MB. In contrast, two MVs are assigned to the bi-directionally predicted MBs, so the total number may vary between the number of MBs (no B-MBs) and twice the number of MBs (only B-MBs).
(Charts: absolute number of MVs per frame by type (mv1 to mv9 plus no_mv) for the sequences carphone_qcif_96, carphone_qcif_96_temp, coastguard_cif, coastguard_cif_temp, coastguard_qcif, coastguard_qcif_temp, container_cif, mobile_cif, mobile_cifn_140, mobile_cifn_140_temp, mobile_qcif and mobile_qcif_temp.)
XIX.3.2. Distribution graphs
The distribution graphs for the same sequences are depicted below. The small rectangles (bars) depict the sum of the given type of vector within the frame, such that the white color is equal to zero and the black one is equal to all MVs in the frame. Of course, the darker the color, the more MVs of a given type exist in the frame. The frame numbers run along the X-axis, starting with the first frame on the left side and ending with the last frame on the right. The bar width depends on the number of frames presented in the histogram. The MV types are assigned along the Y-axis, starting with mv1 at the top, going step-by-step down to mv9, and having no_mv at the very bottom. Thus, it is easily noticeable that for the first frame of each video sequence the bottom-left rectangle is dark and all nine rectangles above it in the same column are white; this is due to the use of only closed GOPs in each sequence, and thus there is always an I-frame at the beginning, which has no MVs at all (because only I-MBs are included).
(Distribution histograms for: carphone_qcif_96, carphone_qcif_96_temp, coastguard_cif, coastguard_cif_temp, coastguard_qcif, coastguard_qcif_temp, container_qcif, container_qcif_temp, mobile_cifn_140, mobile_cifn_140_temp, mobile_qcif, mobile_qcif_temp and mobile_cif.)
Appendix G
This appendix includes full listings of the real-time functions for the meta-data-based DROPS-implemented converters, i.e. the RT-MD-LLV1 decoder and the RT-MD-XVID encoder.
XX. FULL LISTING OF IMPORTANT REAL-TIME FUNCTIONS IN RT-MD-LLV1
XX.1. Function preempter_thread()
#if REALTIME
static void preempter_thread (void){
  l4_rt_preemption_t _dw;
  l4_msgdope_t _result;
  l4_threadid_t id1, id2;
  extern l4_threadid_t main_thread_id;
  extern volatile char timeslice_overrun_optional;
  extern volatile char timeslice_overrun_mandatory;
  extern volatile char deadline_miss;
  extern int no_base_tso;
  extern int no_enhance_tso;
  extern int no_clean_tso;
  extern int no_deadline_misses;

  id1 = L4_INVALID_ID;
  id2 = L4_INVALID_ID;

  while (1) {
    // wait for preemption IPC
    if (l4_ipc_receive(l4_preemption_id(main_thread_id),
                       L4_IPC_SHORT_MSG, &_dw.lh.low, &_dw.lh.high,
                       L4_IPC_NEVER, &_result) == 0){
      if (_dw.p.type == L4_RT_PREEMPT_TIMESLICE) {
        /* this is timeslice 1 ==> mandatory */
        if (_dw.p.id == 1){
          /* mark this TSO */
          timeslice_overrun_mandatory = 1;
          /* count tso */
          no_base_tso++;
        }
        /* this is timeslice 2 ==> optional */
        else if (_dw.p.id == 2){
          /* mark this TSO for main thread */
          timeslice_overrun_optional = 1;
          /* count tso */
          no_enhance_tso++;
        }
        /* this is timeslice 3 ==> mandatory */
        else if (_dw.p.id == 3){
          /* count tso */
          no_clean_tso++;
        }
      }
      /* this is a deadline miss !
       * => we're really in trouble! */
      else if (_dw.p.type == L4_RT_PREEMPT_DEADLINE){
        /* mark deadline miss */
        deadline_miss=1;
        /* count tso */
        no_deadline_misses++;
      }
    }
    else LOG("Preempt-receive returned %x", L4_IPC_ERROR(_result));
  }
}
#endif /*REALTIME*/
XX.2. Function load_allocation_params()
#if REALTIME
/* load parameters for allocation */
void load_allocation_params(void){
  /* list with allocations (must be defined as -D with Makefile) */
#ifdef _qcif
  file                       = "_qcif";
  max_base_per_MB            = 0.0;
  avg_base_per_MB_base       = 27.15;
  max_base_per_MB_base       = 28.17;
  avg_base_per_MB_enhance    = 29.35;
  max_base_per_MB_enhance    = 30.35;
  max_enhance_per_MB         = 18.44;
  avg_cleanup_per_MB_base    = 2.22;
  max_cleanup_per_MB_base    = 3.05;
  avg_cleanup_per_MB_enhance = 13.10;
  max_cleanup_per_MB_enhance = 13.17;
#elif defined _cif
  file                       = "_cif";
  max_base_per_MB            = 0.0;
  avg_base_per_MB_base       = 21.36;
  max_base_per_MB_base       = 22.14;
  avg_base_per_MB_enhance    = 25.12;
  max_base_per_MB_enhance    = 25.97;
  max_enhance_per_MB         = 15.17;
  avg_cleanup_per_MB_base    = 2.02;
  max_cleanup_per_MB_base    = 2.07;
  avg_cleanup_per_MB_enhance = 14.66;
  max_cleanup_per_MB_enhance = 15.30;
#elif defined _itu601
  file                       = "_itu601";
  max_base_per_MB            = 0.0;
  avg_base_per_MB_base       = 18.59;
  max_base_per_MB_base       = 23.54;
  avg_base_per_MB_enhance    = 22.71;
  max_base_per_MB_enhance    = 27.69;
  max_enhance_per_MB         = 15.09;
  avg_cleanup_per_MB_base    = 1.60;
  max_cleanup_per_MB_base    = 1.72;
  avg_cleanup_per_MB_enhance = 14.37;
  max_cleanup_per_MB_enhance = 14.96;
#else
  file                       = "unknown_video";
  max_base_per_MB            = 0.0;
  avg_base_per_MB_base       = 27.15;
  max_base_per_MB_base       = 28.17;
  avg_base_per_MB_enhance    = 29.35;
  max_base_per_MB_enhance    = 30.35;
  max_enhance_per_MB         = 18.44;
  avg_cleanup_per_MB_base    = 2.22;
  max_cleanup_per_MB_base    = 3.05;
  avg_cleanup_per_MB_enhance = 14.96;
  max_cleanup_per_MB_enhance = 15.30;
#endif
}
#endif /*REALTIME*/
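The following small usage sketch is not part of the thesis code; it only shows how per-MB constants like those above could be turned into per-frame budgets. The CIF macro block count (22 x 18 = 396) is standard, whereas interpreting the constants directly as time units per MB is an assumption made for this illustration.

/* Hypothetical usage sketch: per-frame budgets from per-MB constants (_cif branch). */
#include <stdio.h>

int main(void) {
    const int mbs_per_frame = 22 * 18;                 /* CIF resolution             */
    const double max_base_per_MB_enhance    = 25.97;   /* values from the _cif case  */
    const double max_enhance_per_MB         = 15.17;
    const double max_cleanup_per_MB_enhance = 15.30;

    printf("base budget   : %.1f\n", max_base_per_MB_enhance    * mbs_per_frame);
    printf("enhance budget: %.1f\n", max_enhance_per_MB         * mbs_per_frame);
    printf("cleanup budget: %.1f\n", max_cleanup_per_MB_enhance * mbs_per_frame);
    return 0;
}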
XXI. FULL LISTING OF IMPORTANT REAL-TIME FUNCTIONS IN RT-MD-XVID
XXI.1. Function preempter_thread()
#if REALTIME
static void preempter_thread (void){
  l4_rt_preemption_t _dw;
  l4_msgdope_t _result;
  l4_threadid_t id1, id2;
  extern l4_threadid_t main_thread_id;
  extern volatile char deadline_miss;
  extern int no_deadline_misses;

  id1 = L4_INVALID_ID;
  id2 = L4_INVALID_ID;

  while (1) {
    // wait for preemption IPC
    if (l4_ipc_receive(l4_preemption_id(main_thread_id),
                       L4_IPC_SHORT_MSG, &_dw.lh.low, &_dw.lh.high,
                       L4_IPC_NEVER, &_result) == 0){
      if (_dw.p.type == L4_RT_PREEMPT_TIMESLICE) {
        // this is timeslice 1 ==> mandatory
        if (_dw.p.id == 1){
          realtime_mode = OPTIONAL;
          l4_rt_next_reservation(1, &left);
        }
        // this is timeslice 2 ==> optional
        else if (_dw.p.id == 2){
          realtime_mode = MANDATORY_CLEANUP;
          l4_rt_next_reservation(2, &left);
        }
        // this is timeslice 3 ==> mandatory
        else if (_dw.p.id == 3){
          realtime_mode = DEADLINE;
          l4_rt_next_reservation(3, &left);
        }
      }
      // this is a deadline miss !
      // => we're really in trouble!
      else if (_dw.p.type == L4_RT_PREEMPT_DEADLINE){
        // mark deadline miss
        deadline_miss=1;
        // count tso
        no_deadline_misses++;
      }
    }
    else
      LOG("Preempt-receive returned %x", L4_IPC_ERROR(_result));
  }
}
#endif /*REALTIME*/
Appendix H
This appendix covers audio-specific aspects such as MPEG-4 tools and profiles and MPEG-4
SLS enhancements.
XXII. MPEG-4 AUDIO TOOLS AND PROFILES
(Table 16: a large matrix mapping the MPEG-4 Audio Object Types (AAC main, AAC LC, AAC SSR, AAC LTP, SBR, AAC Scalable, TwinVQ, CELP, HVXC, TTSI, the synthetic and MIDI object types, the ER variants, SSC and MPEG Layer-1/2/3) to the coding functionality (tools/modules) they use; the individual cell assignments are specified in [MPEG-4 Part III, 2005] and cannot be reproduced here.)
Table 16. MPEG Audio Object Type Definition based on Tools/Modules [MPEG-4 Part III, 2005].
Explanation of the abbreviations used in Table 16 (for a detailed description of the Tools/Modules readers are referred to [MPEG-2 Part VII, 2006; MPEG-4 Part III, 2005] and the respective MPEG-4 standard amendments):
• LC – Low Complexity
• ER – Error Robust
• SSR – Scalable Sample Rate
• BSAC – Bit Sliced Arithmetic Coding
• LTP – Long Term Predictor
• SBR – Spectral Band Replication
• TwinVQ – Transform-domain Weighted Interleaved Vector Quantization
• HILN – Harmonic and Individual Lines plus Noise
• SSC – Sinusoidal Coding
• LD – Low Delay
• CELP – Code Excited Linear Prediction
• TNS – Temporal Noise Shaping
• HVXC – Harmonic Vector Excitation Coding
• SA – Structured Audio
• SASBF – Structured Audio Sample Bank Format
• TTSI – Text-to-Speech Interface
• MIDI – Musical Instrument Digital Interface
• HE – High Efficiency
• PS – Parametric Stereo
(Table 17: a matrix showing which of the selected Audio Object Types (AAC main, object type ID 1; AAC LC, ID 2; AAC SSR, ID 3; AAC LTP, ID 4; ER AAC LD, ID 23) are used in the Main, Scalable, High Quality, Low Delay, Natural, Mobile Audio Internetworking, AAC and High Efficiency AAC Audio Profiles; the individual assignments are specified in [MPEG-4 Part III, 2005] and cannot be reproduced here.)
Table 17. Use of a few selected Audio Object Types in MPEG Audio Profiles [MPEG-4 Part III, 2005].
XXIII. MPEG-4 SLS ENHANCEMENTS
This section is based on the work conducted within the joint master thesis project [Wendelska, 2007] in cooperation with Dipl.-Math. Ralf Geiger from Fraunhofer IIS.
XXIII.1. Investigated versions - origin and enhancements
Origin:
v0 – origin version (not used due to printf overhead)
v01 – origin version (printf commented out for measurements)
New interpolations:
v1 – new interpolateValue1to7 (but incomplete measurements – only 2 sequences checked)
v2 – new interpolateFromCompactTable (1st method)
Vectorizing Headroom (see the sketch after this list):
v3 – new vector msbHeadroomINT32 (with old interpolateFromCompactTable)
Vectorizing the 2-level loop of srfft_fixpt:
v4 – new vector 1st loop of srfft_fixpt (with old msbHeadroomINT32)
v5 – old 1st loop of srfft_fixpt, new vector 2nd loop of srfft_fixpt
v6 – new vectors: 1st and 2nd loop of srfft_fixpt
New interpolation and vectorizing Headroom:
v7 – new interpolateFromCompactTable (1st method) and new vector msbHeadroomINT32 [incl. v2 & v3]
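The reference function msbHeadroomINT32 is not reproduced here; the following generic scalar headroom computation for a 32-bit fixed-point sample is only an assumption about the kind of per-sample work that the vectorized versions v3 and v7 address, namely counting the redundant sign bits that can be shifted out without overflow.

/* Generic scalar headroom computation for a 32-bit sample (illustrative only;
 * its exact correspondence to the reference msbHeadroomINT32 is an assumption). */
#include <stdio.h>
#include <stdint.h>

static int headroom_int32(int32_t x) {
    if (x < 0) x = ~x;                 /* treat negative values symmetrically */
    if (x == 0) return 31;             /* 0 and -1 have maximal headroom      */
    int n = 0;
    while ((x & 0x40000000) == 0) {    /* count redundant bits below the sign bit */
        x <<= 1;
        n++;
    }
    return n;
}

int main(void) {
    int32_t samples[] = { 1, -1, 0x3FFFFFFF, -0x40000000, 123456 };
    for (int i = 0; i < 5; i++)
        printf("headroom(%11ld) = %d\n", (long)samples[i], headroom_int32(samples[i]));
    return 0;
}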
XXIII.2. Measurements
(Chart: average execution times, with minimum/maximum deviations, of versions v01, v2, v3, v4, v5, v6 and v7 for the sequences bach44m, barcelona32s, chopin44s, jazz48m and jazz48s.)
Execution times of the different versions for the five compared sequences. Only the times of v2, v3 and v7 are smaller than the origin time of the unoptimized code for all sequences. The minimum and maximum time measured is depicted as deviations from the average of a twelve-time execution. Out of all 420 measurements, there are only 3 cases having a difference between MAX and AVG over 5.8% (namely 15.5%, 10.7%, 7.5%), and only 2 cases of a MIN to AVG difference being over 5.0% (namely 6.1% and 5.1%). These five measurements are influenced by outside factors and thus are treated as irrelevant.
(Chart: cumulated average execution time over all five sequences for versions v01 to v7.)
The average execution time was cumulated over all sequences for the respective versions. It clearly shows that a smaller time is achieved only by v2, v3 and v7.
(Chart: execution time of each version relative to the origin time v01, per sequence and on average.)
The execution time of each version in comparison to the origin time (v01) demonstrates the gain for each sequence and for all on average. The v2 version needs only 96.6%, v3 only 79.8% and v7 only 76.4% of the origin time of the unoptimized version. Thus the v7 version finally delivers the best speed-up, ranging from 1.26 to 1.35 for the different sequences and being equal to 1.31 on average.
(Chart: speed-up ratio of each version versus the origin, per sequence and on average.)
XXIII.3. Overall Final Improvement
The final benchmark has been conducted in comparison to the origin. Figure 110 presents the overall speedup ratios for both the encoder and the decoder, i.e. the gained percentage of the processing time of the final code version versus the original one. The total execution time of the encoder was decreased by 21%-36%, depending on the input file, while the decoder's total time decreased only by 15%-25%. All the successfully vectorized functions and operations together obtained about an 18% speedup of the total execution time and about 28% of the IntMDCT time compared to the original code version. The improvement in the accumulated execution time of the IntMDCT calculations, being the focus of the project, was noticeably larger than the overall results, i.e. the IntMDCT-encoding speedup achieved 42%-48% and the InvIntMDCT required 45%-50% less time, respectively. As a result, a decrease of the main optimization target by roughly a factor of two has been achieved.
Figure 110. Percentage of the total gained time between the original code version and the final version [Wendelska, 2007].
Bibliography
[Ahmed et al., 1974] Ahmed, N., Natarajan, T., Rao, K. R.: Discrete Cosine Transform. IEEE
Trans. Computers Vol. xx, pp.90-93, 1974.
[ANS, 2001] ANS: American National Standard T1.523-2001: Telecom Glossary 2000. Alliance for
Telecommunications Industry Solutions (ATIS) Committee T1A1, National Telecommunications and
Information Administration's Institute for Telecommunication Sciences (NTIA/ITS) Approved by ANSI, Feb. 28th, 2001.
[Assunncao and Ghanbari, 1998] Assunncao, P. A. A., Ghanbari, M.: A Frequency-Domain Video
Transcoder for Dynamic Bit-Rate Reduction of MPEG-2 Bit Streams. IEEE Trans. Circuits and
Systems for Video Technology Vol. 8(8), pp.953-967, 1998.
[Astrahan et al., 1976] Astrahan, M. M., Blasgen, M. W., Chamberlin, D. D., Eswaran, K. P., Gray,
J. N., Griffiths, P. P., King, W. F., Lorie, R. A., McJones, P. R., Mehl, J. W., Putzolu, G. R.,
Traiger, I. L., Wade, B., V., W.: System R: A Relational Approach to Database Management.
ACM Transactions on Database Systems Vol. 1(2), pp.97-137, 1976.
[Auspex, 2000] Auspex: A Storage Architecture Guide. White Paper, Santa Clara (CA), USA,
Auspex Systems, Inc., 2000.
[Barabanov and Yodaiken, 1996] Barabanov, Yodaiken: Real-Time Linux. Linux Journal Vol., 1996.
[Bente, 2004] Bente, N.: Comparison of Multimedia Servers Available on Nowadays Market—
Hardware and Software. Study Project, Database Systems Chair. FAU Erlangen-Nuremberg,
Erlangen, Germany
[Berthold and Meyer-Wegener, 2001] Berthold, H., Meyer-Wegener, K.: Schema Design and
Query Processing in a Federated Multimedia Database System. 6th International Conference on
Cooperative Information Systems (CoopIS'01), in Lecture Notes in Computer Science Vol.2172, Trento,
Italy, Springer Verlag, Sep. 2001.
[Bovik, 2005] Bovik, A. C.: Handbook of Image and Video Processing (2nd Ed.), Academic Press,
ISBN 0-12-119792-1, 2005.
[Bryant and O'Hallaron, 2003] Bryant, R. E., O'Hallaron, D. R.: Computer Systems – A Programmer's Perspective. Chapter IX. Measuring Program Execution Time, Prentice Hall, ISBN 0-13-034074-X, 2003.
[Campbell and Chung, 1996] Campbell, S., Chung, S.: Database Approach for the Management of
Multimedia Information. Multimedia Database Systems. Ed.: K. Nwosu, Kluwer Academic
Publishers, ISBN 0-7923-9712-6, 1996.
[Candan et al., 1996] Candan, K. S., Subrahmanian, V. S., Rangan, P. V.: Towards a Theory of Collaborative Multimedia. IEEE International Conference on Multimedia Computing and Systems (ICMCS'96), Hiroshima, Japan, Jun. 1996.
[Carns et al., 2000] Carns, P. H., Ligon III, W. B., Ross, R. B., Thakur, R.: PVFS: A Parallel File
System For Linux Clusters. 4th Annual Linux Showcase and Conference, Atlanta (GA), USA, Oct.
2000.
[Cashin, 2005] Cashin, E.: Kernel Korner - ATA Over Ethernet: Putting Hard Drives on the
LAN. Linux Journal Vol. 134, 2005.
[Chamberlin et al., 1981] Chamberlin, D. D., Astrahan, M. M., Blasgen, M. W., Gray, J. N., King,
W. F., Lindsay, B. G., Lorie, R., Mehl, J. W., Price, T. G., Putzolu, F., Selinger, P. G.,
Schkolnick, M., Slutz, D. R., Traiger, I. L., Wade, B. W., Yost, R. A.: A History and Evaluation
of System R. Communications of the ACM Vol. 24(10), pp.632-646, 1981.
[Ciliendo, 2006] Ciliendo, E.: Linux-Tuning: Performance-Tuning für Linux-Server. iX Vol.
01/06, pp.130-132, 2006.
[CODASYL Systmes Committee, 1969] CODASYL Systmes Committee: A Survey of
Generalized Data Base Management Systems. Technical Report (PB 203142), May 1969.
[Codd, 1970] Codd, E. F.: A Relational Model of Data for Large Shared Data Banks.
Communications of the ACM Vol. 13(6), pp.377-387, 1970.
[Codd, 1995] Codd, E. F.: "Is Your DBMS Really Relational?" and "Does Your DBMS Run By
the Rules?" ComputerWorld, (Part 1: October 14, 1985, Part 2: October 21, 1985). Vol. xx, 1995.
[Connolly and Begg, 2005] Connolly, T. M., Begg, C. E.: Database Systems: A Practical Approach
to Design, Implementation, and Management (4th Ed.). Essex, England, Pearson Education
Ltd., ISBN 0-321-21025-5, 2005.
[Curran and Annesley, 2005] Curran, K., Annesley, S.: Transcoding Media for Bandwidth
Constrained Mobile Devices. International Journal of Network Management Vol. 15, pp.75-88, 2005.
[Cutmore, 1998] Cutmore, N. A. F.: Dynamic Range Control in a Multichannel Environment.
Journal of the Audio Engineering Society Vol. 46(4), pp.341-347, 1998.
[Dashti et al., 2003] Dashti, A., Kim, S. H., Shahabi, C., Zimmermann, R.: Streaming Media Server
Design, Prentice Hall Ptr, ISBN 0-13-067038-3, 2003.
[Davies, 1984] Davies, B.: Integral Transforms and Their Applications (Applied Mathematical
Sciences), Springer, ISBN 0-387-96080-5, 1984.
[Dennis and Van Horn, 1966] Dennis, J. B., Van Horn, E. C.: Programming semantics for
multiprogrammed computations. Communications of the ACM Vol. 9(3), pp.143-155, 1966.
[Devos et al., 2003] Devos, H., Eeckhaut, H., Christiaens, M., Verdicchio, F., Stroobandt, D.,
Schelkens, P.: Performance requirements for reconfigurable hardware for a scalable wavelet
video decoder. CD-ROM Proceedings of the ProRISC / IEEE Benelux Workshop on Circuits, Systems
and Signal Processing, STW, Utrecht, Nov. 2003.
[Ding and Guo, 2003] Ding, G.-g., Guo, B.-l.: Improvement to Progressive Fine Granularity
Scalable Video Coding 5th International Conference on Computational Intelligence and Multimedia
Applications (ICCIMA'03) Xi'an, China, Sep. 2003.
[Dingeldein, 1995] Dingeldein, D.: Multimedia interactions and how they can be realized. SPIE
Photonics West Symposium, Multimedia Computing and Networking, San José (CA), USA, SPIE Vol.
2417, pp.46-53, Mar. 1995.
[Dogan, 2001] Dogan, S.: Video Transcoding for Multimedia Communication Networks. PhD
Thesis. University of Surrey, Guildford, United Kingdom. Oct. 2001.
[Effelsberg and Steinmetz, 1998] Effelsberg, W., Steinmetz, R.: Video Compression Techniques.
Heidelberg, Germany, dpunkt Verlag, 1998.
[Eisenberg and Melton, 2001] Eisenberg, A., Melton, J.: SQL Multimedia and Application
Packages (SQL/MM). SIGMOD Record Vol. 30(4), 2001.
[El-Rewini et al., 1994] El-Rewini, H., Lewis, T. G., Ali, H. H.: Task Scheduling in Parallel and
Distributed Systems. New Jersey, USA, PTR Prentice Hall, ISBN 0-13-099235-6, 1994.
[Eleftheriadis and Anastassiou, 1995] Eleftheriadis, A., Anastassiou, D.: Constrained and General
Dynamic Rate Shaping of Compressed Digital Video. 2nd IEEE International Conference on Image
Processing (ICIP'95), Arlington (VA), USA, IEEE, Oct. 1995.
[Elmasri and Navathe, 2000] Elmasri, R., Navathe, S. B.: Fundamentals of Database Systems.
Reading (MA), USA, Addison Wesley Longman Inc., ISBN 0-8053-1755-4, 2000.
[Fasheh, 2006] Fasheh, M.: OCFS2: The Oracle Clustered File System, Version 2, retrieved on
21.07.2006, 2006, from http://oss.oracle.com/projects/ocfs2/dist/documentation/fasheh.pdf,
2006.
[Feig and Winograd, 1992] Feig, E., Winograd, S.: Fast Algorithms for the Discrete Cosine
Transform. IEEE Trans. Signal Processing Vol. 40(9), pp.2174-2193, 1992.
[Ford et al., 1997] Ford, B., van Maren, K., Lepreau, J., Clawson, S., Robinson, B., Turner, J.: The
FLUX OS Toolkit: Reusable Components for OS Implementation. 6th IEEE Workshop on Hot
Topics in Operating Systems, Cape Cod (MA), USA, May 1997.
[Fortier and Michel, 2002] Fortier, P. J., Michel, H. E.: Computer Systems Performance
Evaluation and Prediction. Burlington (MA), USA, Digital Press, ISBN 1-55558-260-9, 2002.
[Fry and Sibley, 1976] Fry, J. P., Sibley, E. H.: Evolution of Data-Base Management Systems.
ACM Computing Surveys (CSUR) Vol. 8(1), pp.7-42, 1976.
[Geiger et al., 2001] Geiger, R., Sporer, T., Koller, J., Brandenburg, K.: Audio Coding Based On
Integer Transforms. 111th AES Convention, New York (NY), USA, AES, Sep. 2001.
[Geiger et al., 2004] Geiger, R., Yokotani, Y., Schuller, G., Herre, J.: Improved Integer Transforms
using Multi-Dimensional Lifting. IEEE International Conference on Acoustics, Speech, and Signal
Processing (ICASSP'04), Montreal (Quebec), Canada, IEEE, May, 11-17th, 2004.
[Geiger et al., 2006] Geiger, R., Yu, R., Herre, J., Rahardja, S. S., Kim, S.-W., Lin, X., Schmidt, M.:
ISO / IEC MPEG-4 High-Definition Scalable Advanced Audio Coding. 120th Convention of
Audio Engineering Society (AES), Paris, France, AES No. 6791, May 2006.
[Gemmel et al., 1995] Gemmel, D. J., Vin, H. M., Kandlur, D. D., Rangan, P. V., Rowe, L. A.:
Multimedia Storage Servers: A Tutorial. IEEE Computer Vol. 28(5), pp.40-49, 1995.
[Gibson et al., 1998] Gibson, J. D., Berger, T., Lookabaugh, T., Lindbergh, D., Baker, R. L.:
Digital Compression for Multimedia: Principles and Standards. London, UK, Academic Press,
1998.
[Hamann, 1997] Hamann, C.-J.: On the Quantitative Specification of Jitter Constrained Periodic
Streams. 5th International Symposium on Modeling, Analysis and Simulation of Computer and
Telecommunication Systems MASCOTS’97, Haifa, Israel, Jan. 1997.
[Hamann et al., 2001a] Hamann, C.-J., Löser, J., Reuther, L., Schönberg, S., Wolter, J., Härtig, H.:
Quality-Assuring Scheduling – Using Stochastic Behavior to Improve Resource Utilization.
22nd IEEE Real-Time Systems Symposium (RTSS 2001), London, UK, Dec. 2001.
[Hamann et al., 2001b] Hamann, C.-J., Märcz, A., Meyer-Wegener, K.: Buffer Optimization in
Realtime Media Streams using Jitter-Constrained Periodic Streams. SFB 358 - G3 - 01/2001
Technical Report. TU Dresden, Dresden, Germany. Jan. 2001.
[Härtig et al., 1997] Härtig, H., Hohmuth, M., Liedtke, J., Schönberg, S.: The performance of μ-kernel-based systems. 16th ACM Symposium on Operating Systems Principles, Saint Malo, France,
1997.
[Härtig et al., 1998] Härtig, H., Baumgartl, R., Borriss, M., Hamann, C.-J., Hohmuth, M., Mehnert,
F., Reuther, L., Schönberg, S., Wolter, J.: DROPS — OS Support for Distributed Multimedia
Applications. 8th ACM SIGOPS European Workshop (SIGOPS EW'98), Sintra, Portugal, Sep.
1998.
[Henning, 2001] Henning, P. A.: Taschenbuch Multimedia (2nd Ed.). München, Germany, Carl
Hanser Verlag, ISBN 3-446-21751-7, 2001.
[Hohmuth and Härtig, 2001] Hohmuth, M., Härtig, H.: Pragmatic Nonblocking Synchronization
for Real-time Systems. USENIX Annual Technical Conference, Boston (MA), USA, Jun. 2001.
[IBM Corp., 1968] IBM Corp.: Information Management Systems/360 (IMS/360) - Application
Description Manual, New York (NY), USA, IBM Corp. Form No. H20-0524-1 White Plains,
1968.
[IBM Corp., 2003] IBM Corp.: DB2 Universal Database: Image, Audio, and Video Extenders Administration and Programming, Version 8. 1st Ed., Jun. 2003.
[Ihde et al., 2000] Ihde, S. C., Maglio, P. P., Meyer, J., Barrett, R.: Intermediary-based Transcoding
Framework. Poster Proc. of 9th Intl. World Wide Web Conference (WWW9), 2000.
[Imaizumi et al., 2002] Imaizumi, S., Takagi, A., Kiya, H.: Lossless Inter-frame Video Coding
using Extended JPEG2000. International Technical Conference on Circuits Systems, Computers and
Communications (ITC CSCC '02), Phuket, Thailand, Jul. 2002.
[ISO 9000, 2005] ISO 9000: Standard 9000:2005 – Quality Management Systems – Fundamentals
and Vocabulary, ISO Technical Committee 176 / SC1, Sep. 2005.
[ITU-T Rec. H.262, 2000] ITU-T Rec. H.262: Information Technology – Generic Coding of
Moving Pictures and Associated Audio Information: Video. Recommendation H.262, ITU-T, Feb.
2000.
[ITU-T Rec. H.263+, 1998] ITU-T Rec. H.263+: Video coding for low bit rate communication
(called H.263+). Recommendation H.263, ITU-T, Feb. 1998.
[ITU-T Rec. H.263++, 2000] ITU-T Rec. H.263++: Video coding for low bit rate
communication - Annex U,V,W (called H.263++). Recommendation H.263, ITU-T, Nov. 2000.
[ITU-T Rec. H.263+++, 2005] ITU-T Rec. H.263+++: Video coding for low bit rate
communication - Annex X and unified specification document (called H.263+++).
Recommendation H.263, ITU-T, Jan. 2005.
[ITU-T Rec. H.264, 2005] ITU-T Rec. H.264: Advanced Video Coding for Generic Audiovisual
Services. Recommendation H.264 & ISO/IES 14496-10 AVC, ITU-T & ISO/IES, Mar. 2005.
[ITU-T Rec. P.835, 2003] ITU-T Rec. P.835: Subjective Test Methodology for Evaluating Speech
Communication Systems that include Noise Suppression Algorithm. Recommendation P.835,
ITU-T, Nov. 2003.
[ITU-T Rec. T.81, 1992] ITU-T Rec. T.81: Information Technology – Digital Compression and
Coding of Continuous-Tone Still Images – Requirements and Guidelines. ITU-T
Recommendation T.81 and ISO/IEC International Standard 10918-1, JPEG (ITU-T CCITT SG-7
and ISO/IEC JTC-1/SC-29/WG-10), Sep. 1992.
[ITU-T Rec. X.641, 1997] ITU-T Rec. X.641: Information technology – Quality of Service:
Framework. Recommendation X.641, ITU-T, Dec. 1997.
[ITU-T Rec. X.642, 1998] ITU-T Rec. X.642: Information technology – Quality of Service: Guide
to Methods and Mechanisms. Recommendation X.642, ITU-T, Sep. 1998.
[Jaeger et al., 1999] Jaeger, T., Elphinstone, K., Liedke, J., Panteleenko, V., Park, Y.: Flexible
Access Control Using IPC Redirection. 7th Workshop on Hot Topics in Operating Systems (HOTOS),
Rio Rico (AZ), USA, IEEE Computer Society, Mar. 1999.
[Jankiewicz and Wojciechowski, 2004] Jankiewicz, K., Wojciechowski, M.: Standard SQL/MM:
SQL Multimedia and Application Packages. IX Seminarium PLUG "Przetwarzanie zaawansowanych
struktur danych: Oracle interMedia, Spatial, Text i XML DB", Warsaw, Poland, Stowarzyszenie
Polskiej Grupy Użytkowników systemu Oracle, Mar. 2004.
[JTC1/SC32, 2007] JTC1/SC32: ISO/IEC 13249: 2002 Information technology -- Database
languages -- SQL multimedia and application packages. ISO/IEC 13249 3rd Ed., ISO/IEC,
2007.
[Käckenhoff et al., 1994] Käckenhoff, R., Merten, D., Meyer-Wegener, K.: "MOSS as Multimedia
Object Server - Extended Summary". Multimedia: Advanced Teleservices and High Speed
Communication Architectures, Proc. 2nd Int. Workshop - IWACA '94 (Heidelberg, Sept. 26-28, 1994),
Ed. R. Steinmetz, Lecture Notes in Computer Science Vol.868, Heidelberg, Germany, Springer-Verlag, 1994.
[Kahrs and Brandenburg, 1998] Kahrs, M., Brandenburg, K.: Applications of Digital Signal
Processing to Audio and Acoustics, Kluwer Academic Publishers, ISBN 0-7923-8130-0, 1998.
[Kan and Fan, 1998] Kan, K.-S., Fan, K.-C.: Video Transcoding Architecture with Minimum
Buffer Requirement for Compressed MPEG-2 Bitstream. Signal Processing Vol. 67(2), pp.223-235, 1998.
[Keesman et al., 1996] Keesman, G., Hellinghuizen, R., Hoeksema, F., Heideman, G.:
Transcoding of MPEG Bitstream. Signal Processing - Image Communication Vol. 8(6), pp.481-500,
1996.
[Khoshafian and Baker, 1996] Khoshafian, S., Baker, A.: MultiMedia and Imaging Databases,
Morgan Kaufmann, ISBN 1-55860-312-3, 1996.
[King et al., 2004] King, R., Popitsch, N., Westermann, U.: METIS: a Flexible Database
Foundation for Unified Media Management. ACM Multimedia 2004 (ACMMM'04), New York
(NY), USA, Oct. 2004.
[Knutsson et al., 2003] Knutsson, B., Lu, H., Mogul, J., Hopkins, B.: Architecture and
Performance of Server-Directed Transcoding. ACM Transactions on Internet Technology (TOIT)
Vol. 3(4), pp.392-424, 2003.
[Kuhn and Suzuki, 2001] Kuhn, P., Suzuki, T.: MPEG-7 Metadata for Video Transcoding: Motion
and Difficulty Hint. SPIE Conference on Storage and Retrieval for Multimedia Databases, San Jose
(CA), USA, SPIE Vol. 4315, 2001.
[LeBlanc and Markatos, 1992] LeBlanc, T. J., Markatos, E. P.: Shared Memory vs. Message
Passing in Shared-Memory Multiprocessors. 4th IEEE Symposium on Parallel and Distributed
Processing, Arlington (TX), USA Dec. 1992.
[Lee et al., 2005] Lee, C.-J., Lee, K.-S., Park, Y.-C., Youn, D.-H.: Adaptive FFT Window
Switching for Psychoacoustic Model in MPEG-4 AAC, Seoul, Korea, Yonsei University Digital
Signal Processing Lab, pp.553, Jul. 2005.
[LeGall, 1991] LeGall, D.: MPEG: A Video Compression Standard for Multimedia Applications.
Communications of the ACM Vol. 34(4), pp.46-58, 1991.
[Li and Shen, 2005] Li, K., Shen, H.: Coordinated Enroute Multimedia Object Caching in
Transcoding Proxies for Tree Networks. ACM Transactions on Multimedia Computing,
Communications and Applications Vol. 1(3), pp.289-314, 2005.
[Li, 2001] Li, W.: Overview of Fine Granularity Scalability in MPEG-4 Video Standard. IEEE
Trans. Circuits and Systems for Video Technology Vol. 11(3), pp.301-317, 2001.
[Liang and Tran, 2001] Liang, J., Tran, T. D.: Fast Multiplierless Approximation of the DCT with
the Lifting Scheme. IEEE Trans. Signal Processing Vol. 49(12), pp.3032-3044, 2001.
[Liebchen et al., 2005] Liebchen, T., Moriya, T., Harada, N., Kamamoto, Y., Reznik, Y.: The
MPEG-4 Audio Lossless Coding (ALS) Standard - Technology and Applications. 119th AES
Convention, New York (NY), USA, Oct. 2005.
[Liedtke, 1996] Liedtke, J.: L4 Reference Manual 486 Pentium Pentium Pro Version 2.0. Research
Report RC 20549, Yorktown Heights (NY), USA, IBM T. J. Watson Research Center, Sep.
1996.
[Lin et al., 1987] Lin, K. J., Natarajan, S., Liu, J. W. S.: Imprecise Results: Utilizing Partial
Computations in Real-Time Systems. 8th IEEE Real-Time Systems Symposium (RTSS '87), San
Jose (CA), USA, Dec. 1987.
[Lindner et al., 2000] Lindner, W., Berthold, H., Binkowski, F., Heuer, A., Meyer-Wegener, K.:
Enabling Hypermedia Videos in Multimedia Database Systems Coupled with Realtime Media
Servers. International Symposium on Database Engineering & Applications (IDEAS), Yokohama,
Japan, Sep. 2000.
[Liu, 2003] Liu, S.: Audio-Video Conversion Benchmark “AVCOB” – Analysis, Design and
Implementation. Master Thesis, Database Systems Chair. FAU Erlangen-Nuremberg, Erlangen,
Germany, 2003.
[Löser et al., 2001a] Löser, J., Härtig, H., Reuther, L.: A Streaming Interface for Real-Time
Interprocess Communication. Technical Report TUD-FI01-09, Operating Systems Group. TU
Dresden, Dresden, Germany. Aug. 2001.
[Löser et al., 2001b] Löser, J., Härtig, H., Reuther, L.: Position Summary: A Streaming Interface
for Real-Time Interprocess Communication. 8th Workshop on Hot Topics in Operating Systems
(HotOS-VIII), Schloss Elmau in Bavaria, Germany, May 2001.
[Löser and Härtig, 2004] Löser, J., Härtig, H.: Low-latency Hard Real-Time Communication over
Switched Ethernet. 16th Euromicro Conference on Real-Time Systems (ECRTS'04), Catania (Sicily),
Italy, Jun.-Jul. 2004.
[Löser and Aigner, 2007] Löser, J., Aigner, R.: Building Infrastructure for DROPS (BID)
Specification. Publicly-Available Specification, Operating Systems Group. TU Dresden, Dresden,
Germany. Apr. 25th, 2007.
[Lum and Lau, 2002] Lum, W. Y., Lau, F. C. M.: On Balancing between Transcoding Overhead
and Spatial Consumption in Content Adaptation. 8th Intl. Conf. on Mobile Computing and
Networking, Atlanta (GA), USA, ACM, Sep. 2002.
[Luo, 1997] Luo, Y.: Shared Memory vs. Message Passing: the COMOPS Benchmark Experiment.
Internal Report, Los Alamos (NM), USA, Los Alamos National Laboratory (Scientific
Computing Group CIC-19), Apr. 1997.
[Märcz and Meyer-Wegener, 2002] Märcz, A., Meyer-Wegener, K.: Bandwidth-based Converter
Description for Realtime Scheduling at Application Level in Media Servers. SDA Workshop
2002, Dresden, Germany, pp.10, Mar. 2002.
[Märcz et al., 2003] Märcz, A., Schmidt, S., Suchomski, M.: Scheduling Data Streams in
memo.REAL. Internal Communication, TU Dresden / FAU Erlangen, pp.8, Jan. 2003.
[Marder and Robbert, 1997] Marder, U., Robbert, G.: The KANGAROO Project. Proc. 3rd Int.
Workshop on Multimedia Information Systems, Como, Italy, Sep. 1997.
[Marder, 2000] Marder, U.: VirtualMedia: Making Multimedia Database Systems Fit for Worldwide Access. 7th Conference on Extending Database Technology (EDBT'00) - PhD Workshop,
Konstanz, Germany, Mar. 2000.
[Marder, 2001] Marder, U.: On Realizing Transformation Independence in Open, Distributed
Multimedia Information Systems. Datenbanksysteme in Büro, Technik und Wissenschaft (BTW),
pp.424-433, 2001.
[Marder, 2002] Marder, U.: Multimedia Metacomputing in webbasierten multimedialen
Informationssystemen. PhD Thesis. University of Kaiserslautern, Kaiserslautern, Germany. 2002.
[Margaritidis and Polyzos, 2000] Margaritidis, M., Polyzos, G.: On the Application of Continuous
Media Filters over Wireless Networks. IEEE Int. Conf. on Multimedia and Expo (ICME'00), New
York (NY), USA, IEEE Computer Society, Aug. 2000.
[Marovac, 1983] Marovac, N.: On Interprocess Interaction in Distributed Architectures. ACM
SIGARCH Computer Architecture News Vol. 11(4), pp.17-22, 1983.
[Maya et al., 2003] Maya, Anu, Asmita, Snehal, Krushna (MAASK): MigShm - Shared Memory
over openMosix. Project Report on MigShm, from http://mcaserta.com/maask/Migshm_Report.pdf, Apr. 2003.
[McQuillan and Walden, 1975] McQuillan, J. M., Walden, D. C.: Some Considerations for a High
Performance Message-based Interprocess Communication System. ACM SIGCOMM/SIGOPS
Workshop on Interprocess Communications - Applications, Technologies, Architectures, and Protocols for
Computer Communication, 1975.
[Mehnert et al., 2003] Mehnert, F., Hohmuth, M., Härtig, H.: Cost and Benefit of Separate
Address Spaces in Real-Time Operating Systems. 23rd IEEE Real-Time Systems Symposium
(RTSS'03), Austin, Texas, USA, Dec. 2003.
[Mehrseresht and Taubman, 2005] Mehrseresht, N., Taubman, D.: An efficient content-adaptive
motion compensated 3D-DWT with enhanced spatial and temporal scalability. Preprint submitted
to IEEE Transactions on Image Processing, May 2005.
[Meyer-Wegener, 2003] Meyer-Wegener, K.: Multimediale Datenbanken - Einsatz von
Datenbanktechnik in Multimedia-Systemen (2. Auflage). Wiesbaden, Germany, B. G. Teubner
Verlag / GWV Fachverlag GmbH, ISBN 3-519-12419-X, 2003.
[Meyerhöfer, 2007] Meyerhöfer, M. B.: Messung und Verwaltung von Softwarekomponenten für
die Performancevorhersage. PhD Thesis, Database Systems Chair. FAU Erlangen-Nuremberg,
Erlangen. 2007.
[Microsoft Corp., 2002a] Microsoft Corp.: Introducing DirectShow for Automotive. MSDN
Library - Mobil and Embedded Development Documentation, retrieved on Feb. 10th, 2002a.
[Microsoft Corp., 2002b] Microsoft Corp.: The Filter Graph and Its Components. MSDN Library
- DirectX 8.1 C++ Documentation, retrieved on Feb. 10th, 2002b.
[Microsoft Corp., 2002c] Microsoft Corp.: AVI RIFF File Reference. MSDN Library - DirectX 9.0
DirectShow Appendix, retrieved on Nov. 22nd, from
http://msdn.microsoft.com/archive/en-us/directx9_c/directx/htm/avirifffilereference.asp,
2002c.
[Microsoft Corp., 2007a] Microsoft Corp.: [MS-MMSP]: Microsoft Media Server (MMS) Protocol
Specification. MSDN Library, retrieved on Mar. 10th, from http://msdn2.microsoft.com/en-us/library/cc234711.aspx, 2007a.
[Microsoft Corp., 2007b] Microsoft Corp.: Overview of the ASF Format. MSDN Library - Windows
Media Format 11 SDK, retrieved on Jan. 21st, from
http://msdn2.microsoft.com/en-us/library/aa390652.aspx, 2007b.
[Mielimonka, 2006] Mielimonka, A.: The Real-Time Implementation of XVID Encoder in
DROPS Supporting QoS for Video Streams. Study Project, Database Systems Chair. FAU
Erlangen-Nuremberg, Erlangen, Germany. Sep. 2006.
[Militzer et al., 2003] Militzer, M., Suchomski, M., Meyer-Wegener, K.: Improved p-Domain Rate
Control and Perceived Quality Optimizations for MPEG-4 Real-time Video Applications. 11th
ACM International Conference of Multimedia (ACM MM'03), Berkeley (CA), USA, Nov. 2003.
[Militzer, 2004] Militzer, M.: Real-Time MPEG-4 Video Conversion and Quality Optimizations
for Multimedia Database Servers. Diploma Thesis, Database Systems Chair. FAU Erlangen-Nuremberg, Erlangen, Germany. Jul. 2004.
[Militzer et al., 2005] Militzer, M., Suchomski, M., Meyer-Wegener, K.: LLV1 – Layered Lossless
Video Format Supporting Multimedia Servers During Realtime Delivery. Multimedia Systems and
Applications VIII in conjunction with Optics East, Boston (MA), USA, SPIE Vol. 6015, pp.436-445,
Oct. 2005.
[Miller et al., 1998] Miller, F. W., Keleher, P., Tripathi, S. K.: General Data Streaming. 19th IEEE
Real-Time Systems Symposium (RTSS), Madrid, Spain, Dec. 1998.
[Minoli and Keinath, 1993] Minoli, D., Keinath, R.: Distributed Multimedia Through Broadband
Communication. Norwood, UK, Artech House, ISBN 0-89006-689-2, 1993.
[Mohan et al., 1999] Mohan, R., Smith, J. R., Li, C.-S.: Adapting Multimedia Internet Content for
Universal Access. IEEE Trans. Multimedia Vol. 1(1), pp.104-114, 1999.
[Morrison, 1997] Morrison, G.: Video Transcoders with Low Delay. IEICE Transactions on
Communications Vol. E80-B(6), pp.963-969, 1997.
[Mostefaoui et al., 2002] Mostefaoui, A., Favory, L., Brunie, L.: SIRSALE: a Large Scale Video
Indexing and Content-Based Retrieving System. ACM Multimedia 2002 (ACMMM'02), Juan-les-Pins, France, Dec. 2002.
[MPEG-1 Part III, 1993] MPEG-1 Part III: ISO/IEC 11172-3:1993 Information technology –
Coding of moving pictures and associated audio for digital storage media at up to
about 1,5 Mbit/s – Part 3: Audio. ISO/IEC 11172-3 Audio, MPEG (ISO/IEC JTC-1/SC-29/WG-11), 1993.
[MPEG-2 Part I, 2000] MPEG-2 Part I: ISO/IEC 13818-1:2000 Information technology –
Generic coding of moving pictures and associated audio information – Part 1: Systems.
ISO/IEC 13818-1 Systems, MPEG (ISO/IEC JTC-1/SC-29/WG-11), Dec. 2000.
[MPEG-2 Part II, 2001] MPEG-2 Part II: ISO/IEC 13818-2:2000 Information technology –
Generic coding of moving pictures and associated audio information – Part 2: Video. ISO/IEC
13818-2 Video, MPEG (ISO/IEC JTC-1/SC-29/WG-11), Dec. 2000.
[MPEG-2 Part VII, 2006] MPEG-2 Part VII: ISO/IEC 13818-7:2006 Information technology –
Generic coding of moving pictures and associated audio information – Part 7: Advanced
Audio Coding (AAC). ISO/IEC 13818-7 AAC Ed. 4, MPEG (ISO/IEC JTC-1/SC-29/WG-11), Jan. 2006.
[MPEG-4 Part I, 2004] MPEG-4 Part I: ISO/IEC 14496-1:2004 Information technology –
Coding of audio-visual objects – Part 1: Systems (3rd Ed.). ISO/IEC 14496-1 3rd Ed., MPEG
(ISO/IEC JTC-1/SC-29/WG-11), Nov. 2004.
[MPEG-4 Part II, 2004] MPEG-4 Part II: ISO/IEC 14496-2:2004 Information technology –
Coding of audio-visual objects – Part 2: Visual (3rd Ed.). ISO/IEC 14496-2 3rd Ed., MPEG
(ISO/IEC JTC-1/SC-29/WG-11), Jun. 2004.
[MPEG-4 Part III, 2005] MPEG-4 Part III: ISO/IEC 14496-3:2005 Information technology –
Coding of audio-visual objects – Part 3: Audio (3rd Ed.). ISO/IEC 14496-3:2005, MPEG Audio
Subgroup (ISO/IEC JTC-1/SC-29/WG-11), Dec. 2005.
[MPEG-4 Part III FDAM5, 2006] MPEG-4 Part III FDAM5: ISO/IEC 14496-3:2005/Amd.3:2006 Scalable Lossless Coding (SLS). ISO/IEC 14496-3 Amendment 3, MPEG
Audio Subgroup (ISO/IEC JTC-1/SC-29/WG-11), Jun. 2006.
[MPEG-4 Part IV Amd 8, 2005] MPEG-4 Part IV Amd 8: ISO/IEC 14496-4:2004/Amd.8:2005
High Efficiency Advanced Audio Coding, audio BIFS, and Structured Audio Conformance.
ISO/IEC 14496-4 Amendment 8, MPEG Audio Subgroup (ISO/IEC JTC-1/SC-29/WG-11),
May 2005.
[MPEG-4 Part V, 2001] MPEG-4 Part V: ISO/IEC 14496-5:2001 Information technology –
Coding of audio-visual objects – Part 5: Reference Software (2nd Ed.). ISO/IEC 14496-5
Software for Visual Part, MPEG (ISO/IEC JTC-1/SC-29/WG-11), 2001.
[MPEG-4 Part X, 2007] MPEG-4 Part X: ISO/IEC 14496-10:2005/FPDAM 3 Information
technology – Coding of audio-visual objects – Part 10: Advanced Video Coding – Amendment
3: Scalable Video Coding. ISO/IEC 14496-10 Final Proposal Draft, MPEG (ISO/IEC JTC-1/SC-29/WG-11), Jan. 2007.
[MPEG-7 Part III, 2002] MPEG-7 Part III: ISO/IEC 15938-3 Information Technology –
Multimedia Content Description Interface – Part 3: Visual. ISO/IEC 15938-3, MPEG
(ISO/IEC JTC-1/SC-29/WG-11), Apr. 2002.
[MPEG-7 Part V, 2003] MPEG-7 Part V: ISO/IEC 15938-5 Information Technology –
Multimedia Content Description Interface – Part 5: Multimedia Description Schemes.
ISO/IEC 15938-5 Chapter 8 Media Description Tools, MPEG (ISO/IEC JTC-1/SC-29/WG-11),
2003.
[MPEG-21 Part I, 2004] MPEG-21 Part I: ISO/IEC 21000-1 Information Technology –
Multimedia Framework (MPEG-21) – Part 1: Vision, Technologies and Strategy. ISO/IEC
21000-1 2nd Ed., MPEG (ISO/IEC JTC-1/SC-29/WG-11), Nov. 2004.
[MPEG-21 Part II, 2005] MPEG-21 Part II: ISO/IEC 21000-2 Information Technology –
Multimedia Framework (MPEG-21) – Part 2: Digital Item Declaration. ISO/IEC 21000-2 2nd
Ed., MPEG (ISO/IEC JTC-1/SC-29/WG-11), Oct. 2005.
[MPEG-21 Part VII, 2004] MPEG-21 Part VII: ISO/IEC 21000-7 Information Technology –
Multimedia Framework (MPEG-21) – Part 7: Digital Item Adaptation. ISO/IEC 21000-7,
MPEG (ISO/IEC JTC-1/SC-29/WG-11), Oct. 2004.
[MPEG-21 Part XII, 2004] MPEG-21 Part XII: MPEG N5640 - ISO/IEC 21000-12 Information
Technology – Multimedia Framework (MPEG-21) – Part 12: Multimedia Test Bed for
Resource Delivery. ISO/IEC 21000-12 Working Draft 2.0, MPEG (ISO/IEC JTC-1/SC-29/WG-11), Oct. 2004.
[MPEG Audio Subgroup, 2005] MPEG Audio Subgroup: Verification Report on MPEG-4 SLS
(MPEG2005/N7687). MPEG Meeting "Nice'05", Nice, France, Oct. 2005.
[Nilsson, 2004] Nilsson, J.: Timers: Implement a Continuously Updating, High-Resolution Time
Provider for Windows. MSDN Magazine. Vol. 3, 2004.
[Oracle Corp., 2003] Oracle Corp.: Oracle interMedia User's Guide. Ver. 10g Release 1 (10.1) - Chapter 7, Section 7.4 Supporting Media Data Processing, 2003.
[Östreich, 2003] Östreich, T.: Transcode – Linux Video Stream Processing Tool, retrieved on July
10th, from http://www.theorie.physik.uni-goettingen.de/~ostreich/transcode/
(http://www.transcoding.org/cgi-bin/transcode), 2003.
[Pai et al., 1997] Pai, V., Druschel, P., Zwaenepoel, W.: IO-Lite: A Unified I/O Buffering and
Caching System. Technical Report TR97-294, Computer Science. Rice University, Houston (TX),
USA. 1997.
[Pasquale et al., 1993] Pasquale, J., Polyzos, G., Anderson, E., Kompella, V.: Filter Propagation in
Dissemination Trees: Trading Off Bandwidth and Processing in Continuous Media Networks.
4th Intl. Workshop ACM Network and Operating System Support for Digital Audio and Video
(NOSSDAV'93), pp.259-268, Nov. 1993.
[PEAQ, 2006] PEAQ: ITU-R BS.1387-1 – Implementation PQ-Eval-Audio, Part of "AFsp
Library and Tools V8 R1" (ftp://ftp.tsp.ece.mcgill.ca/TSP/AFsp/AFsp-v8r1.tar.gz), Jan.
2006.
[Penzkofer, 2006] Penzkofer, F.: Real-Time Audio Conversion and Format Independence for
Multimedia Database Servers. Study Project, Database Systems Chair. FAU Erlangen-Nuremberg,
Erlangen, Germany. Jul. 2006.
[Posnak et al., 1996] Posnak, E. J., Vin, H. M., Lavender, R. G.: Presentation Processing Support
for Adaptive Multimedia Applications. SPIE Multimedia Computing and Networking, San Jose
(CA), USA, SPIE Vol. 2667, pp.234-245, Jan. 1996.
[Posnak et al., 1997] Posnak, E. J., Lavender, R. G., Vin, H. M.: An Adaptive Framework for
Developing Multimedia Software Components. Communications of the ACM Vol. 40(10), pp.43-47, 1997.
[QNX, 2001] QNX: QNX Neutrino RTOS (Ver.6.1). QNX Software Systems Ltd., 2001.
[Rakow et al., 1995] Rakow, T., Neuhold, E., Löhr, M.: Multimedia Database Systems – The
Notions and the Issues. GI-Fachtagung, Dresden, Germany, Datenbanksysteme in Büro,
Technik und Wissenschaft (BTW), 1995.
[Rangarajan and Iftode, 2000] Rangarajan, M., Iftode, L.: Software Distributed Shared Memory
over Virtual Interface Architecture - Implementation and Performance. 4th Annual Linux
Conference, Atlanta (GA), USA, Oct. 2000.
[Rao and Yip, 1990] Rao, K. R., Yip, P.: Discrete Cosine Transform: Algorithms, Advantages,
Applications. San Diego (CA), USA, Academic Press, Inc., ISBN 0-12-580203-X, 1990.
[Reuther and Pohlack, 2003] Reuther, L., Pohlack, M.: Rotational-Position-Aware Real-Time Disk
Scheduling Using a Dynamic Active Subset (DAS). 24th IEEE International Real-Time Systems
Symposium, Cancun, Mexico, Dec. 2003.
[Reuther et al., 2006] Reuther, L., Aigner, R., Wolter, J.: Building Microkernel-Based Operating
Systems: DROPS - The Dresden Real-Time Operating System (Lecture Notes Summer Term
2006), retrieved on Nov. 25th, 2006, from http://os.inf.tu-dresden.de/Studium/KMB/
(http://os.inf.tu-dresden.de/Studium/KMB/Folien/09-DROPS/09-DROPS.pdf), 2006.
[Rietzschel, 2002] Rietzschel, C.: Portierung eines Video-Codecs auf DROPS. Study Project,
Operating Systems Group - Institute for System Architecture. TU Dresden, Dresden, Germany. Dec.
2002.
[Rietzschel, 2003] Rietzschel, C.: VERNER ein Video EnkodeR uNd -playER für DROPS. Master
Thesis, Operating Systems Group - Institute for System Architecture. TU Dresden, Dresden, Germany.
Sep. 2003.
[Rohdenburg et al., 2005] Rohdenburg, T., Hohmann, V., Kollmeier, B.: Objective Perceptual
Quality Measures for the Evaluation of Noise Reduction Schemes. International Workshop on
Acoustic Echo and Noise Control '05, High Tech Campus, Eindhoven, The Netherlands, Sep.
2005.
[Roudiak-Gould, 2006] Roudiak-Gould, B.: HuffYUV v2.1.1 Manual. Description and Source
Code, retrieved on Jul. 15th, 2006, from
http://neuron2.net/www.math.berkeley.edu/benrg/huffyuv.html, 2006.
[Sayood, 2006] Sayood, K.: Introduction to Data Compression (3rd Ed.). San Francisco (CA),
USA, Morgan Kaufman, 2006.
[Schaar and Radha, 2000] Schaar, M., Radha, H.: MPEG M6475: Motion-Compensation Based
Fine-Granular Scalability (MC-FGS). MPEG Meeting, MPEG (ISO/IEC JTC-1/SC-29/WG-11), Oct. 2000.
[Schäfer et al., 2003] Schäfer, R., Wiegand, T., Schwarz, H.: The emerging H.264/AVC Standard.
EBU Technical Review, Berlin, Germany, Heinrich Hertz Institute, Jan. 2003.
[Schelkens et al., 2003] Schelkens, P., Andreopoulos, Y., Barbarien, J., Clerckx, T., Verdicchio, F.,
Munteanu, A., van der Schaar, M.: A comparative study of scalable video coding schemes
utilizing wavelet technology. SPIE Photonics East - Wavelet Applications in Industrial Processing, SPIE
Vol. 5266, Providence (RI), USA, 2003.
[Schlesinger, 2004] Schlesinger, L.: Qualitätsgetriebene Konstruktion globaler Sichten in Grid-organisierten Datenbanksystemen. PhD Thesis, Database Systems Chair. FAU Erlangen-Nuremberg, Erlangen. Jul. 2004.
[Schmidt et al., 2003] Schmidt, S., Märcz, A., Lehner, W., Suchomski, M., Meyer-Wegener, K.:
Quality-of-Service Based Delivery of Multimedia Database Objects without Compromising
Format Independence. 9th International Conference on Distributed Multimedia Systems (DMS'03),
Miami (FL), USA, Sep. 2003.
[Schönberg, 2003] Schönberg, S.: Impact of PCI-Bus Load on Applications in a PC Architecture.
24th IEEE International Real-Time Systems Symposium, Cancun, Mexico, Dec. 2003.
[Schulzrinne et al., 1996] Schulzrinne, H., Casner, S., Frederick, R., Jacobson, V.: RTP: A transport
protocol for Real-Time Applications. RFC 1889, Jan. 1996.
[Schulzrinne et al., 1998] Schulzrinne, H., Rao, A., Lanphier, R.: Real Time Streaming Protocol
(RTSP). RFC 2326, Apr. 1998.
[Shin and Koh, 2004] Shin, I., Koh, K.: Cost Effective Transcoding for QoS Adaptive Multimedia
Streaming. Symposium on Applied Computing (SAC'04), Nicosia, Cyprus, ACM, Mar. 2004.
[Singhal and Shivaratri, 1994] Singhal, M., Shivaratri, N.: Advanced Concepts in Operating
Systems, McGraw-Hill, ISBN 978-0070575721, 1994.
[Sitaram and Dan, 2000] Sitaram, D., Dan, A.: Multimedia Servers: Applications, Environments
and Design, Morgan Kaufmann, ISBN 1-55860-430-8, 2000.
[Skarbek, 1998] Skarbek, W.: Multimedia. Algorytmy i standardy kompresji. Warszawa, PL,
Akademicka Oficyna Wydawnicza PLJ, 1998.
[Sorial et al., 1999] Sorial, H., Lynch, W. E., Vincent, A.: Joint Transcoding of Multiple MPEG
Video Bitstreams. IEEE International Symposium on Circuits and Systems (ISCAS'99), Orlando (FL),
USA, May 1999.
[Spier and Organick, 1969] Spier, M. J., Organick, E. I.: The multics interprocess communication
facility. 2nd ACM Symposium on Operating Systems Principles, Princeton (NJ), USA, 1969.
[Steinberg, 2004] Steinberg, U.: Quality-Assuring Scheduling in the Fiasco Microkernel. Master
Thesis, Operating Systems Group. TU Dresden, Dresden, Germany. Mar. 2004.
[Stewart, 2005] Stewart, J.: An Investigation of SIMD Instruction Sets. Study Project, School of
Information Technology and Mathematical Sciences. University of Ballarat, Ballarat (Victoria), Australia.
Nov. 2005.
[Suchomski, 2001] Suchomski, M.: The Application of Specialist Program Suites in Network
Servers Efficiency Research. Master Thesis. New University of Lisbon and Wroclaw University
of Technology, Monte de Caparica - Lisbon, Portugal and Wroclaw, Poland. Jul. 2001.
[Suchomski et al., 2004] Suchomski, M., Märcz, A., Meyer-Wegener, K.: Multimedia Conversion
with the Focus on Continuous Media. Transformation of Knowledge, Information and Data. Ed.: P.
van Bommel. London, UK, Information Science Publishing. Chapter XI, ISBN 1-59140-528-9,
2004.
[Suchomski et al., 2005] Suchomski, M., Militzer, M., Meyer-Wegener, K.: RETAVIC: Using
Meta-Data for Real-Time Video Encoding in Multimedia Servers. ACM NOSSDAV '05,
Skamania (WA), USA, Jun. 2005.
[Suchomski and Meyer-Wegener, 2006] Suchomski, M., Meyer-Wegener, K.: Format
Independence of Audio and Video in Multimedia Database Systems. 5th International Conference on
Multimedia and Network Information Systems 2006 (MiSSI '06), Wroclaw, Poland, Oficyna
Wydawnicza Politechniki Wroclawskiej, pp.201-212, Sep. 2006.
[Suchomski et al., 2006] Suchomski, M., Meyer-Wegener, K., Penzkofer, F.: Application of
MPEG-4 SLS in MMDBMSs – Requirements for and Evaluation of the Format. Audio
Engineering Society (AES) 120th Convention, Paris, France, AES Preprint No. 6729, May 2006.
[Sun et al., 1996] Sun, H., Kwok, W., Zdepski, J.: Architectures for MPEG Compressed Bitstream
Scaling. IEEE Trans. Circuits and Systems for Video Technology Vol. 6(2), 1996.
[Sun et al., 2005] Sun, H., Chen, X., Chiang, T.: Digital Video Transcoding for Transmission and
Storage. Boca Raton (FL), CRC Press. Chapter 11: 391-413, 2005.
[Sun Microsystems Inc., 1999] Sun Microsystems Inc.: Java Media Framework API Guide (Nov.
19th, 1999), retrieved on Jan. 10th, 2003, from http://java.sun.com/products/java-media/jmf/2.1.1/guide/, 1999.
[Suzuki and Kuhn, 2000] Suzuki, T., Kuhn, P.: A proposal for segment-based transcoding hints.
ISO/IEC M5847, Noordwijkerhout, Netherlands, Mar. 2000.
[Symes, 2001] Symes, P.: Video Compression Demystified, McGraw-Hill, ISBN 0-07-136324-6,
2001.
[Tanenbaum, 1995] Tanenbaum, A. S.: Moderne Betriebssysteme - 2nd Ed., Prentice Hall
International: 78-88, ISBN 3-446-18402-3, 1995.
[Topivata et al., 2001] Topivata, P., Sullivan, G., Joch, A., Kossentini, F.: Performance evaluation
of H.26L, TML 8 vs. H.263++ and MPEG-4. Technical Report N18, ITU-T Q6/SG16 (VCEG),
Sep. 2001.
[Tourapis, 2002] Tourapis, A. M.: Enhanced predictive zonal search for single and multiple frame
motion estimation. SPIE Visual Communications and Image Processing, SPIE Vol. 4671, Jan. 2002.
[Tran, 2000] Tran, T. D.: The BinDCT – Fast Multiplierless Approximation of the DCT. IEEE
Signal Processing Letters Vol. 7(6), 2000.
[Trybulec and Byliński, 1989] Trybulec, A., Byliński, C.: Some Properties of Real Numbers Operations: min, max, square, and square root. Mizar Mathematical Library (MML) - Journal of
Formalized Mathematics Vol. 1, 1989.
[van Doorn and de Vries, 2000] van Doorn, M. G. L. M., de Vries, A. P.: The Psychology of
Multimedia Databases. 5th ACM Conference on Digital Libraries, San Antonio (TX), USA, 2000.
[Vatolin et al., 2005] Vatolin, D., Kulikov, D., Parshin, A., Kalinkina, D., Soldatov, S.: MPEG-4
Video Codecs Comparison, retrieved on Mar. 2005, from
http://www.compression.ru/video/codec_comparison/mpeg-4_en.html, 2005.
[Vetro et al., 2000] Vetro, A., Sun, H., Divakaran, A.: Adaptive Object-Based Transcoding using
Shape and Motion-Based Hints. ISO/IEC M6088, Geneva, Switzerland, May 2000.
[Vetro, 2001] Vetro, A.: Object-Based Encoding and Transcoding. PhD Thesis, Electrical Engineering.
Polytechnic University, Brooklyn (NY), USA. Jun. 2001.
[Vetro et al., 2001] Vetro, A., Sun, H., Wang, Y.: Object-Based Transcoding for Adaptable Video
Content Delivery. IEEE Transactions on Circuits and Systems for Video Technology Vol. 11(3),
pp.387-401, 2001.
[Vetro, 2003] Vetro, A.: Transcoding Scalable Coding & Standardized Metadata. International
Workshop Very Low Bitrate Video (VLBV) Vol. 2849, Urbana (IL), USA, Sep. 2003.
[Vetro et al., 2003] Vetro, A., Christopoulos, C., Sun, H.: Video Transcoding Architectures and
Techniques: An Overview. IEEE Signal Processing Magazine Vol. 20(2), pp.18-29, 2003.
[Vetro, 2004] Vetro, A.: MPEG-21 Digital Item Adaptation: Enabling Universal Media Access.
IEEE Multimedia Vol. 11, pp.84-87, 2004.
[VQEG (ITU), 2005] VQEG (ITU): Tutorial - Objective Perceptual Assessment of Video
Quality: Full Reference Television, ITU Video Quality Expert Group, Mar. 2004.
[Wallace, 1991] Wallace, G. K.: The JPEG Still Picture Compression Standard. Communications of
the ACM Vol. 34, pp.30-34, 1991.
[Wang et al., 2004] Wang, Y., Huang, W., Korhonen, J.: A Framework for Robust and Scalable
Audio Streaming. ACM Multimedia '04 (ACMMM'04), New York (NY), USA, ACM, Oct.
2004.
[Warnes, 2000] Warnes, G. R.: A Recipe for a diskless MOSIX cluster using Cluster-NFS,
retrieved on May 10th, 2000, from http://clusternfs.sourceforge.net/Recipe.pdf, 2000.
[Weinberger et al., 2000] Weinberger, M. J., Seroussi, G., Sapiro, G.: The LOCO-I Lossless Image
Compression Algorithm, Principles and Standardization into JPEG-LS. IEEE Trans. on Image
Processing Vol. 9(8), pp.1309-1324, 2000.
[Wen et al., 2003] Wen, J.-R., Li, Q., Ma, W.-Y., Zhang, H.-J.: A Multi-paradigm Querying
Approach for a Generic Multimedia Database Management System. ACM SIGMOD Record
Vol. 32(1), pp.26-34, 2003.
[Wendelska, 2007] Wendelska, J. A.: Optimization of the MPEG-4 SLS Implementation for
Scalable Lossless Audio Coding. Diploma Thesis, Database Systems Chair. FAU Erlangen-Nuremberg, Erlangen, Germany. Aug. 2007.
[Westerink et al., 1999] Westerink, P. H., Rajagopalan, R., Gonzales, C. A.: Two-pass MPEG-2
variable-bit-rate encoding. IBM Journal of Research and Development - Digital Multimedia Technology
Vol. 43(4), pp.471, 1999.
[Wittmann and Zitterbart, 1997] Wittmann, R., Zitterbart, M.: Towards Support for
Heterogeneous Multimedia Communications. 6th IEEE Workshop on Future Trends of Distributed
Computing Systems, Bologna, Italy, Nov. 2000.
[Wittmann, 2005] Wittmann, R.: A Real-Time Implementation of a QoS-aware Decoder for the
LLV1 Format. Study Project, Database Systems Chair. FAU Erlangen-Nuremberg, Erlangen,
Germany. Nov. 2005.
[Wu et al., 2001] Wu, F., Li, S., Zhang, Y.-Q.: A Framework for Efficient Progressive Fine
Granularity Scalable Video Coding. IEEE Trans. Circuits and Systems for Video Technology Vol.
11(3), pp.332-344, 2001.
[WWW_AlparySoft, 2004] WWW_AlparySoft: Lossless Video Codec - Ver. 2.0 Alpha, retrieved
on Dec. 17th, 2004, from http://www.alparysoft.com/products.php?cid=8, 2004.
[WWW_Doom9, 2003] WWW_Doom9: Codec shoot-out 2003 – 1st Installment, retrieved on
Apr. 10th, 2003, from http://www.doom9.org/codecs-103-1.htm, 2003.
[WWW_DROPS, 2006] WWW_DROPS: The Dresden Real-Time Operating System Project,
retrieved on Oct. 23rd, 2006, from http://os.inf.tu-dresden.de/drops/, 2006.
[WWW_FAAC, 2006] WWW_FAAC: FAAC - Freeware Advanced Audio Coder (Ver. 1.24),
retrieved on Dec. 10th, 2006, from http://sourceforge.net/projects/faac/, 2006.
[WWW_FAAD, 2006] WWW_FAAD: FAAD - Freeware Advanced Audio Coder (Ver. 2.00),
retrieved on Nov. 10th, 2006, from http://www.audiocoding.com, 2006.
[WWW_FFMPEG, 2003] WWW_FFMPEG: FFmpeg Documentation, retrieved on Nov. 23rd,
2006, from http://ffmpeg.mplayerhq.hu/ffmpeg-doc.html, 2003.
[WWW_FLAC, 2006] WWW_FLAC: Free Lossless Audio Codec (FLAC), retrieved on Feb. 28th,
2006, from http://flac.sourceforge.net/, 2006.
[WWW_LAME, 2006] WWW_LAME: Lame Version 3.96.1, “Lame Ain’t an MP3 Encoder”,
http://lame.sourceforge.net, retrieved on Dec. 10th, 2006, 2006.
[WWW_MA, 2006] WWW_MA: Monkey’s Audio - A Fast and Powerful Lossless Audio
Compressor, retrieved on Sep. 23rd, 2006, from http://www.monkeysaudio.com/, 2006.
[WWW_MPEG SQAM, 2006] WWW_MPEG SQAM: MPEG Sound Quality Assessment
Material -- Subset of EBU SQAM, retrieved on Nov. 15th, from http://www.tnt.uni-hannover.de/project/mpeg/audio/sqam/, 2006.
[WWW_OGG, 2006] WWW_OGG: Ogg (libogg), Vorbis (libvorbis) and OggEnc (vorbis-tools)
Version 1.1, retrieved on Dec. 10th, 2006, from http://www.xiph.org/vorbis/, 2006.
[WWW_Retavic - Audio Set, 2006] WWW_Retavic - Audio Set: Evaluation Music Set, retrieved
on Jan. 10th, from http://www6.informatik.uni-erlangen.de/research/projects/retavic/audio/,
2006.
[WWW_VQEG, 2007] WWW_VQEG: Official Website of Video Quality Expert Group - Test
Video Sequences, retrieved on Feb. 14th, 2007, from http://www.its.bldrdoc.gov/vqeg/
(ftp://vqeg.its.bldrdoc.gov/, mirror with thumbnails: http://media.xiph.org/vqeg/TestSeqences/), 2007.
[WWW_WP, 2006] WWW_WP: WavPack - Hybrid Lossless Audio Compression, retrieved on
Feb. 26th, 2006, from http://www.wavpack.com/, 2006.
[WWW_XIPH, 2007] WWW_XIPH: Xiph.org Test Media - Derf's Collection of Test Video
Clips, retrieved on Feb. 14th, from http://media.xiph.org/video/derf/, 2007.
[WWW_XVID, 2003] WWW_XVID: XVID MPEG-4 Video Codec v.1.0, retrieved on Apr. 3rd,
2003, from http://www.xvid.org, 2003.
[Wylie, 1994] Wylie, F.: Tandem Coding of Digital Audio Data Compression Algorithms. 96th
Convention of Audio Engineering Society (AES), Belfast, N. Ireland, AES No. 3784, Feb. 1994.
[Yeadon, 1996] Yeadon, N. J.: Quality of Service Filtering for Multimedia Communications. PhD
Thesis. Lancaster University, Lancaster, UK, 1996.
[Youn, 2008] Youn, J.: Method of Making a Window Type Decision Based on MDCT Data in
Audio Encoding. U.S. Patent, United States Patent and Trademark Office (PTO), Sony Corporation (Tokyo, JP), 2008.