Mustafa Serkan Kiranyaz
Advanced Techniques for Content-Based Management of
Multimedia Databases
Tampere 2005
Abstract
Digital multimedia collections are evolving at a tremendous pace as the modus operandi for information creation, exchange and storage in our modern era. This brings an urgent need for means and ways to manage them. Earlier attempts such as text-based indexing and information retrieval systems show severe limitations and require an infeasible amount of manual labor. Efforts are thus focused on the content-based management area; however, we are still at the early stages of developing techniques that guarantee efficiency and effectiveness in content-based multimedia systems. The peculiar nature of multimedia information, such as the difficulty of semantic indexing, complex multimedia identification, and the difficulty of adapting to different applications, has kept the techniques in this area in a premature state.
This thesis takes a global approach to the management of multimedia databases by providing advanced techniques encapsulated within a generic framework, MUVIS. These techniques are intended to cover the entire range of management functionalities for multimedia collections, such as indexing, browsing, retrieval, summarization and efficient access to multimedia items. In addition to serving as the host architecture for the techniques developed, MUVIS is designed as a framework that provides a flexible basis for developing and testing novel descriptors, which are purposefully detached from the core of the system. Furthermore, it supports widely used multimedia formats, last-generation codecs, types and parameters in order to test and improve the robustness of the techniques and descriptors against such variations. Special care has been taken in the user interface design, especially to ensure scalable video management.
A significant contribution of this thesis is a robust framework structure specifically in the area of audio-based multimedia management, which is highly promising but comparatively immature with respect to its visual counterpart. The efforts are first focused on an automatic and optimized audio content classification and segmentation method that follows a hierarchic approach with a new fuzzy modeling concept. The proposed technique achieves solid robustness and a high level of accuracy. The proposed audio-based multimedia indexing and retrieval framework supports the dynamic integration of audio feature extraction modules during the indexing and retrieval phases. The experimental results show that audio retrieval achieves equal or better performance compared with the visual query.
For the evaluation of any technique in the context of multimedia management, the ultimate measure of performance is user satisfaction. Especially in databases without an indexing structure, the retrieval times of the traditional query methodology grow in proportion to the database size. Therefore, this thesis presents a simple yet efficient query method, the Progressive Query, developed to achieve several innovative retrieval capabilities for databases with or without an indexing structure. In particular, it provides enhanced user interaction and query browsing capabilities to ensure solid control and a better relevance feedback scheme. It achieves superior performance in terms of speed, minimum system requirements, user interaction and the possibility of better retrievals, as compared to the traditional query method based on exhaustive search.
In order to accomplish the primary objective of any query operation, that is, the retrieval of the most relevant items at the earliest possible time, an efficient indexing scheme, the Hierarchical Cellular Tree, is then introduced. It is specifically designed to cope with the indexing requirements of multimedia databases, such as the presence of multiple, possibly high-dimensional features, the need for dynamic construction and editing, the prevention of corruption in large-scale databases, and robustness against the limited discrimination and deficiencies of low-level features. The earliest retrieval of the most relevant items is then shown to be feasible by implementing the Progressive Query jointly over the proposed indexing structure.
Another major retrieval scenario is database browsing. Browsing is a flexible operation which nevertheless requires continuous interaction and feedback from the user. It is often the initial operation needed to locate the example item for a query-by-example operation, and its main purpose is therefore to access the items of interest efficiently even though the definition of those items may not be crystal clear. The thesis presents an effective browsing scheme that is fully compliant with the Hierarchical Cellular Tree indexing structure. By exploiting the hierarchical structure of this indexing scheme, two important features for browsing efficiency are achieved: a hierarchical summarization, or mental picture, of the database for the user's perception, and a guided navigation scheme that the user can follow among the database items. Finally, the interaction between the two retrieval scenarios, query by example and database browsing, is accomplished through a proper user interface design, so that the user has the continuous option of switching back and forth between them.
Preface
"The definition of success--To laugh much; to win respect of intelligent persons and the affections of
children; to earn the approbation of honest critics and endure the betrayal of false friends; to appreciate beauty; to find the best in others; to give one's self; to leave the world a little better, whether by
a healthy child, a garden patch, or a redeemed social condition.; to have played and laughed with enthusiasm, and sung with exultation; to know even one life has breathed easier because you have
lived--this is to have succeeded."
Ralph Waldo Emerson
The research presented in this thesis has been carried out at the Signal Processing Institute of
Tampere University of Technology, Finland as a part of the MUVIS project.
First and foremost, I wish to express my deepest gratitude to my supervisor, Professor Moncef Gabbouj, for his guidance, constant support and patience during all these years and, above all, his belief in me when it was most needed.
I would like to thank Professor Erkki Oja from the Laboratory of Computer and Information Science at Helsinki University of Technology and Professor Jaakko Sauvola, head of the Media Team at the University of Oulu, the reviewers of this thesis, for their constructive feedback and helpful comments.
In addition, I would like to thank my MSc supervisor, Professor Levent Onural from Bilkent University, for being the first true light in my career and for guiding me through the labyrinths of knowledge and wisdom during the earlier years in this fascinating field.
Over the years I have had the privilege to work with a wonderful group of people, my colleagues and all my friends here. What we have achieved together is much more than any individual achievement, and I strongly believe that together we have really built something significant. I thank them all from the bottom of my heart.
More thanks are due to Vivre Larmila, Elina Orava and Ulla Siltaloppi for their
kind help whenever needed.
Warmest thanks go to all my close friends, especially Erdogan Özdemir, Burak Kirman, Esin and Olcay Güldogan, Aytaç Sen, in short the members of our small Turkish
community in Tampere, and also all my buddies abroad, Kerem Ayhan, Güner Aktürk, Alper Yildirim, Özgür Güleryüz, Utku Asim, Ugur Türkoglu, Tunç Bostanci, Emre Aksu,
Kerem Çaglar, and the rest of them not mentioned here but unforgotten, for their spiritual
support and friendship within all these years.
The financial support provided by the Tampere Graduate School of Information Science and Engineering (TISE) is also gratefully acknowledged.
Last but not least, I wish to express my warmest thanks to my parents, Gönül and Yavuz, and to my brother, Sertaç, for their endless love and support, and for always being near me despite the physical distance that has separated us all these years. As the light and color in those moments when everything seemed in vain and everything faded out, I would like to dedicate this thesis to my beloved family.
Tampere, June 2005
Serkan Kiranyaz.
Contents
Abstract ....................................................................................................................................iii
Preface ....................................................................................................................................... v
Contents...................................................................................................................................vii
List of Publications................................................................................................................... x
List of Acronyms ....................................................................................................................xii
List of Tables.......................................................................................................................... xiv
List of Figures ......................................................................................................................... xv
1. Introduction ....................................................................................................................... 1
1.1. CONTENT-BASED MULTIMEDIA MANAGEMENT ............................................................. 2
1.2. OUTLINE OF THE THESIS................................................................................................. 4
1.3. PUBLICATIONS ............................................................................................................... 5
2. MUVIS Framework .......................................................................................................... 7
MUVIS Framework .......................................................................................................... 7
2.1. MUVIS OVERVIEW........................................................................................................ 8
2.1.1. Block Diagram of the System ................................................................................. 8
2.1.2. MUVIS Multimedia Family .................................................................................... 9
2.1.3. MUVIS Applications............................................................................................. 10
2.1.3.1 AVDatabase..................................................................................................................10
2.1.3.2 DbsEditor......................................................................................................................11
2.1.3.3 MBrowser.....................................................................................................................12
2.1.4. MUVIS Databases ................................................................................................ 14
2.2. INDEXING AND RETRIEVAL SCHEME ............................................................................ 15
2.2.1. Indexing Methods ................................................................................................. 15
2.2.2. Retrieval Methods................................................................................................. 15
2.3. FEATURE EXTRACTION FRAMEWORK........................................................................... 16
2.3.1. Aural Feature Extraction: AFeX.......................................................................... 16
2.3.2. Visual Feature Extraction: FeX ........................................................................... 17
2.4. VIDEO SUMMARISATION .............................................................................................. 18
2.4.1. Scene Analysis by MST......................................................................................... 19
2.4.2. Scene Analysis by NNE ........................................................................................ 21
2.4.3. Video Summarization Experiments ...................................................................... 23
2.4.4. Scalable Video Management................................................................................ 25
2.4.4.1 ROI Access and Query .................................................................................................26
2.4.4.2 Visual Query of Video..................................................................................................27
3. Unsupervised Audio Classification and Segmentation ................................................ 31
3.1. AUDIO CLASSIFICATION AND SEGMENTATION – AN OVERVIEW................................ 31
3.2. SPECTRAL TEMPLATE FORMATION .............................................................................. 35
3.2.1. Forming the MDCT Template from MP3/AAC Bit-Stream.................................. 36
3.2.1.1 MP3 and AAC Overview .............................................................................................36
3.2.1.2 MDCT Template Formation .........................................................................................37
3.2.2. Spectral Template Formation in Generic Mode .................................................. 41
3.3. FEATURE EXTRACTION ................................................................................................ 41
3.3.1. Frame Features .................................................................................................... 42
3.3.1.1 Total Frame Energy Calculation...................................................................42
3.3.1.2 Band Energy Ratio Calculation ....................................................................42
3.3.1.3 Fundamental Frequency Estimation .............................................................42
3.3.1.4 Subband Centroid Frequency Estimation .....................................................44
3.3.2. Segment Features ................................................................................................. 45
3.3.2.1 Dominant Band Energy Ratio.......................................................................45
3.3.2.2 Transition Rate vs. Pause Rate .....................................................................45
3.3.2.3 Fundamental Frequency Segment Feature....................................................46
3.3.2.4 Subband Centroid Segment Feature .............................................................47
3.3.3. Perceptual Modeling in Feature Domain ............................................................ 48
3.4. GENERIC AUDIO CLASSIFICATION AND SEGMENTATION ............................................... 49
3.4.1. Step 1: Initial Classification................................................................................. 50
3.4.2. Step 2.................................................................................................................... 51
3.4.3. Step 3.................................................................................................................... 52
3.4.4. Step 4.................................................................................................................... 54
3.4.4.1 Intra Segmentation by Binary Division ........................................................................55
3.4.4.2 Intra Segmentation by Breakpoints Detection..............................................................56
3.5. EXPERIMENTAL RESULTS ............................................................................................. 57
3.5.1. Feature Discrimination and Fuzzy Modeling ...................................................... 57
3.5.2. Overall Classification and Segmentation Performance....................................... 58
4. Audio-Based Multimedia Indexing and Retrieval ....................................................... 61
4.1. AUDIO INDEXING AND RETRIEVAL – AN OVERVIEW .................................................... 61
4.2. A GENERIC AUDIO INDEXING SCHEME ........................................................................ 64
4.2.1. Unsupervised Audio Classification and Segmentation ........................................ 65
4.2.2. Audio Framing ..................................................................................................... 66
4.2.3. A Sample AFeX Module Implementation: MFCC................................................ 67
4.2.4. Key-Framing via MST Clustering ........................................................................ 70
4.3. AUDIO RETRIEVAL SCHEME ......................................................................................... 71
4.4. EXPERIMENTAL RESULTS ............................................................................................. 74
4.4.1. Classification and Segmentation Effect on Overall Performance ....................... 75
4.4.1.1 Accuracy.......................................................................................................................75
4.4.1.2 Speed ............................................................................................................................76
4.4.1.3 Disk Storage .................................................................................................................77
4.4.2. Experiments on Audio-Based Multimedia Indexing and Retrieval ...................... 77
5. Progressive Query: A Novel Retrieval Scheme ............................................................ 81
5.1. QUERY TECHNIQUES - AN OVERVIEW.......................................................................... 81
5.2. PROGRESSIVE QUERY ................................................................................................... 83
5.2.1. Periodic Sub-Query Formation............................................................................ 85
5.2.1.1 Atomic Sub-Query........................................................................................................85
5.2.1.2 Fractional Sub-Query ...................................................................................................85
5.2.1.3 Sub-Query Fusion Operation........................................................................................87
5.2.2. PQ in Indexed Databases ..................................................................................... 89
5.3. HIGH PRECISION PQ – THE NEW APPROACH ................................................................. 90
5.4. EXPERIMENTAL RESULTS ............................................................................................. 92
5.4.1. PQ in MUVIS........................................................................................................ 92
5.4.2. PQ versus NQ....................................................................................................... 93
5.4.3. PQ versus HP PQ................................................................................................. 99
5.4.4. Remarks and Evaluation .................................................................................... 100
6. A Novel Indexing Scheme: Hierarchical Cellular Tree ............................................. 101
6.1. DATABASE INDEXING METHODS – AN OVERVIEW ..................................................... 102
6.2. HCT FUNDAMENTALS ................................................................................................ 107
6.2.1. Cell Structure ..................................................................................................... 108
6.2.1.1 MST Formation ..........................................................................................108
6.2.1.2 Cell Nucleus ...............................................................................................110
6.2.1.3 Cell Compactness Feature ..........................................................................110
6.2.1.4 Cell Mitosis ................................................................................................111
6.2.2. Level Structure ................................................................................................... 112
6.2.3. HCT Operations ................................................................................................. 113
6.2.3.1 Item Insertion Algorithm for HCT .............................................................................113
6.2.3.2 Item Removal Algorithm for HCT .............................................................................116
6.2.4. HCT Indexing ..................................................................................................... 117
6.2.4.1 HCT Incremental Construction ..................................................................................118
6.2.4.2 HCT (Periodic) Fitness Check....................................................................................118
6.3. PQ OVER HCT ........................................................................................................... 119
6.3.1. QP Formation from HCT ................................................................................... 120
6.3.2. PQ Operation over HCT .................................................................................... 122
6.4. HCT BROWSING ......................................................................................................... 123
6.5. EXPERIMENTAL RESULTS ........................................................................................... 129
6.5.1. Performance Evaluation of HCT Indexing......................................................... 129
6.5.1.1 Statistical Analysis .....................................................................................................131
6.5.1.2 Performance Evaluation .............................................................................................132
6.5.2. PQ over HCT ...................................................................................................... 132
6.5.3. Remarks and Evaluation .................................................................................... 134
7. Conclusions .................................................................................................................... 137
Bibliography.......................................................................................................................... 141
List of Publications
This thesis is written on the basis of the following publications. In the text, these publications
are referred to as Publications [P1], …, [P14].
[P1] S. Kiranyaz, A. F. Qureshi and M. Gabbouj, “A Generic Audio Classification And
Segmentation Approach For Multimedia Indexing and Retrieval”, IEEE Transactions
on Speech and Audio Processing, in Print.
[P2] S. Kiranyaz, M. Gabbouj, “A Novel Multimedia Retrieval Technique: Progressive
Query (WHY WAIT?)”, IEE Proceedings Vision, Image and Signal Processing, in
Print.
[P3] M. Gabbouj and S. Kiranyaz, “Audio-Visual Content-Based Multimedia Indexing and
Retrieval - the MUVIS Framework”, In Proc. of the 6th International Conference on
Digital Signal Processing and its Applications, DSPA 2004, pp. 300-306, Moscow,
Russia, March 31 - April 2, 2004.
[P4] S. Kiranyaz, K. Caglar, E. Guldogan, O. Guldogan, and M. Gabbouj, “MUVIS: A
Content-Based Multimedia Indexing and Retrieval Framework”, In Proc. of the Seventh International Symposium on Signal Processing and its Applications, ISSPA 2003,
pp. 1-8, Paris, France, 1-4 July 2003.
[P5] S. Kiranyaz, M. Aubazac, M. Gabbouj, “Unsupervised Segmentation and Classification
over MP3 and AAC Audio Bitstreams”, In Proc. of WIAMIS Workshop, pp. 338-345,
London, England, 2003.
[P6] S. Kiranyaz, K. Caglar, B. Cramariuc, and M. Gabbouj, “Unsupervised Scene Change
Detection Techniques In Feature Domain Via Clustering and Elimination”, In Proc. of
the IWDC 2002 Conference on Advanced Methods for Multimedia Signal Processing,
Capri, Italy, September 2002.
[P7] O. Guldogan, E. Guldogan, S. Kiranyaz, K. Caglar, and M. Gabbouj, "Dynamic Integration of Explicit Feature Extraction Algorithms into MUVIS Framework", In Proc. of
the 2003 Finnish Signal Processing Symposium, FINSIG'03, pp. 120-123, Tampere,
Finland, 19-20 May 2003.
[P8] S. Kiranyaz, A. F. Qureshi, and M. Gabbouj, “A Fuzzy Approach Towards Perceptual
Classification and Segmentation of MP3/AAC Audio”, In Proc. of the First International Symposium on Control, Communications and Signal Processing, ISCCSP 2004,
pp.727-730, Hammamet, Tunisia, 21-24 March, 2004.
[P9] S. Kiranyaz and M. Gabbouj, “A Dynamic Content-based Indexing Method for Multimedia Databases: Hierarchical Cellular Tree”, In Proc. of IEEE International Conference on Image Processing, ICIP 2005, Paper ID: 2896, Genova, Italy, September,
2005, To Appear.
[P10] S. Kiranyaz and M. Gabbouj, “Hierarchical Cellular Tree: An Efficient Indexing
Method for Browsing and Navigation in Multimedia Databases”, In Proc. of European
Signal Processing Conference, Eusipco 2005, Paper ID: 1063, Antalya, Turkey, September, 2005, To Appear.
[P11] E. Guldogan, O. Guldogan, S. Kiranyaz, K. Caglar and M. Gabbouj, “Compression Effects on Color and Texture based Multimedia Indexing and Retrieval”, In Proc. of
IEEE International Conference on Image Processing, ICIP 2003, Barcelona, Spain,
September 2003.
[P12] I. Ahmad, S. Abdullah, S. Kiranyaz, M. Gabbouj, “Content-Based Image
Retrieval on Mobile Devices”, In Proc. of SPIE (Multimedia on Mobile Devices),
5684, San Jose, US, 16-20 January 2005, To Appear.
[P13] I. Ahmad, S. Kiranyaz, M. Gabbouj, “Progressive Query Technique for Image Retrieval on Mobile Devices”, In Proc. of Fourth International Workshop on
Content-Based Multimedia Indexing, Riga, Latvia, 21-23 June, 2005, To Appear.
[P14] S. Kiranyaz, M. Ferreira and M. Gabbouj, “A Novel Feature Extraction Method based
on Segmentation over Edge Field for Multimedia Indexing and Retrieval”, In Proc. of
WIAMIS Workshop, Montreux, Switzerland, 13-15 April, 2005.
List of Acronyms
2D        2 Dimensional
AAC       (MPEG-2,4) Advanced Audio Codec
AFeX      Audio Feature Extraction
API       Application Programming Interface
AV        Audio-Visual
AVI       Audio Video Interleaved (Microsoft ©)
BER       Band Energy Ratio
CBIR      Content-based Image Retrieval
CPU       Central Processing Unit
Ds        Descriptors
DBER      Dominant Band Energy Ratio
DFT       Discrete Fourier Transform
DLL       Dynamic Link Library
DSs       Description Schemes
FeX       Feature Extraction
FF        Fundamental Frequency
FFT       Fast Fourier Transform
FT        Fourier Transform
FV        Feature Vector
GLCM      Gray Level Co-occurrence Matrix
GMM       Gaussian Mixture Model
GUI       Graphical User Interface
HCT       Hierarchical Cellular Tree
HMM       Hidden Markov Model
HP PQ     High Precision Progressive Query
HPS       Harmonic Product Spectrum
HSV       Hue, Saturation and (Luminance) Value
HVS       Human Visual System
ISO       International Organization for Standardization
JPEG      Joint Photographic Experts Group
KF        Key-Frame
kNN       k Nearest Neighbours
MAM       Metric Access Method
MDCT      Modified Discrete Cosine Transform
MFCC      Mel-Frequency Cepstrum Coefficients
MPEG      Moving Picture Experts Group
MP3       MPEG Layer 3
MST       Minimum Spanning Tree
MUVIS     Multimedia Video Indexing and Retrieval System
NQ        Normal Query
NNE       Nearest Neighborhood Elimination
P         Precision
PCA       Principal Component Analysis
PCM       Pulse Code Modulation
PR        Pause Rate
PSQ       Progressive Sub-Query
PQ        Progressive Query
ROI       Region of Interest
R         Recall
RGB       Red, Green and Blue
SAM       Spatial Access Method
SC        Subband Centroid
SD        Scene Detection
SOM       Self Organizing Maps
TFE       Total Frame Energy
TR        Transition Rate
TS-SOM    Tree Structured Self Organizing Maps
QBE       Query by Example
QBH       Query by Humming
QTT       Query Total Time
QP        Query Path
UI        User Interface
ZCR       Zero Crossing Rate
List of Tables
Table I. MUVIS Multimedia Family ......................................................................................... 9
Table II: The ground-truth table for automatic scene detection algorithms in 10 sample video
sequences.................................................................................................................................. 25
Table III: The MDCT template array dimension with respect to Compression Type and
Windowing Mode..................................................................................................................... 37
Table IV: Transition Penalization Table. ................................................................................. 46
Table V: Generic Decision Table............................................................................................. 54
Table VI: Error Distribution Table for Bit-Stream Mode........................................................ 59
Table VII: Error Distribution Table for Generic Mode............................................................ 59
Table VIII: QTT (Query Total Time) in seconds of 10 aural retrieval examples from Real
World database......................................................................................................................... 76
Table IX: PR values of 10 Aural/Visual Retrieval (via QBE) Experiments in Open Video
Database. .................................................................................................................................. 77
Table X: PR values of 10 Aural/Visual Retrieval (via QBE) Experiments in Real World
Database. .................................................................................................................................. 77
Table XI: Statistics obtained from the ground level of HCT bodies extracted from the sample
MUVIS databases................................................................................................................... 130
Table XII: Retrieval times (in msec) for 10 visual and aural query operations performed per
query type............................................................................................................................... 134
List of Figures
Figure 1: General structure of MUVIS framework .................................................................... 9
Figure 2: MUVIS AVDatabase application creating a video database in real time ................. 11
Figure 3: MUVIS DbsEditor application. ................................................................................ 12
Figure 4: MUVIS MBrowser application with an image (progressive) query performed. ...... 13
Figure 5: AFeX Module interaction with MUVIS applications. .............................................. 16
Figure 6: FeX module interaction with MUVIS applications. ................................................. 18
Figure 7: MST clustering illustration. ...................................................................................... 19
Figure 8: Number of scene frames versus threshold sketch for Figure 9................................. 22
Figure 9: Key-Frames (top) and Unsupervised Scene Frames by NNE (bottom-up) and MST
(bottom-down).......................................................................................................................... 24
Figure 10: Key-Frames (top) and Semi-Automatic 3 Scene Frames by NNE (bottom-up) and
MST (bottom-down)................................................................................................................. 24
Figure 11: ROI Selection and Rendering from the key-frames of a video clip........................ 26
Figure 12: ROI (Visual) Query of the example in Figure 11. .................................................. 27
Figure 13: Different error types in classification ..................................................................... 35
Figure 14: Classification and Segmentation Framework ......................................................... 35
Figure 15: MP3 Long Window MDCT template array formation from MDCT subband
coefficients. .............................................................................................................................. 39
Figure 16: MP3 Short Window MDCT template array formation from MDCT subband
coefficients ............................................................................................................................... 39
Figure 17: AAC Long Window MDCT template array formation from MDCT subband
coefficients ............................................................................................................................... 40
Figure 18: AAC Short Window MDCT template array formation from MDCT subband
coefficients. .............................................................................................................................. 40
Figure 19: Generic Mode Spectral Template Formation.......................................................... 41
Figure 20: FF detection within a harmonic frame.................................................................... 44
Figure 21: Perceptual Modeling in Feature Domain. ............................................................... 48
Figure 22: The flowchart of the proposed approach. ............................................................... 50
Figure 23: Operational Flowchart for Step 1............................................................................ 51
Figure 24: Operational Flowchart for Step 2............................................................................ 52
Figure 25: Operational Flowchart for Step 3............................................................................ 53
Figure 26: Intra Segmentation by Binary Division in Step 4. .................................................. 55
Figure 27: Windowed SC standard deviation sketch (white) in a speech segment. Breakpoints
are successfully detected with Roll-Down algorithm and music sub-segment is extracted..... 56
Figure 28: Frame and Segment Features on a sample classification and segmentation........... 58
Figure 29: Audio Indexing Flowchart. ..................................................................................... 65
Figure 30: A sample audio classification conversion............................................................... 66
Figure 31: The derivation of mel-scaled filterbank amplitudes. .............................................. 68
Figure 32: An illustrative clustering scheme............................................................................ 70
Figure 33: KF Rate (%) Plots ................................................................................................... 71
Figure 34: A class-based audio query illustration showing the distance calculation per audio
frame......................................................................................................................................... 74
Figure 35: PR curves of an aural retrieval example within Real World database indexed with
(left) and without (right) using classification and segmentation algorithm. ............................ 76
Figure 36: Three visual (left) and aural (right) retrievals in Open Video database. The top-left
clip is the query. ....................................................................................................................... 79
Figure 37: Progressive Query Overview. ................................................................................ 84
Figure 38: Formation of a Periodic Sub-Query........................................................................ 87
Figure 39: A sample fusion operation between subsets X and Y. ........................................... 88
Figure 40: Query path formation in a hypothetical indexing structure.................................... 89
Figure 41: HP PQ Overview.................................................................................................... 91
Figure 42: MBrowser GUI showing a PQ operation where 10th PSQ is currently active (or set
manually).................................................................................................................................. 92
Figure 43: Memory usage for PQ and NQ. .............................................................................. 93
Figure 44: PQ retrievals of a query image (left-top) within three PSQs. t_p = 0.2 sec ........... 95
Figure 45: Aural PQ retrievals of a video clip (left-top) in 12 PSQs (only the 1st, 6th and 12th
are shown). t_p = 5 sec ............................................................................................................. 95
Figure 46: Visual PQ retrievals of a video clip (left-top) in 4 PSQs. t_p = 3 sec .................... 96
Figure 47: Aural PQ Overall Retrieval Time and PSQ Number vs. PQ Period. ..................... 98
Figure 48: Visual PQ Overall Retrieval Time and PSQ Number vs. PQ Period..................... 98
Figure 49: PSQ and PQ retrieval times for the sample retrieval example given in Figure 45. 99
Figure 50: PSQ retrieval times for single and multi-threaded (HP) PQ schemes. t_p = 5 sec .. 99
Figure 51: A sample dynamic item (5) insertion into a 4-node MST. ................................... 110
Figure 52: A sample mitosis operation over a mature cell C. ................................................ 111
Figure 53: A sample 3-level HCT body. ................................................................................ 112
Figure 54: M-Tree rationale that is used to determine the most suitable nucleus (routing)
object for two possible cases. Note that in both cases the rationale fails to track the closest
nucleus object on the lower level. .......................................................................................... 115
Figure 55: QP formation on a sample HCT body. ................................................................. 122
Figure 56: HCT Browsing GUI on MUVIS-MBrowser Application. .................................... 125
Figure 57: An HCT Browsing example, which starts from the third level within Corel_1K
MUVIS database. The user navigates among the levels shown with the lines through ground
level. ....................................................................................................................................... 127
Figure 58: Another HCT Browsing example, which starts from the third level within Texture
MUVIS database. The user navigates among the levels shown with the lines through ground
level. ....................................................................................................................................... 128
Figure 59: QP plot of a sample image query in Corel_1K database...................................... 133
Figure 60: QP plot of a sample video query in Sports database. ........................................... 133
Chapter 1

Introduction
Multimedia as a generic term involves the combination of two or more media types to effectively create a sequence of events that communicates information, usually with both sound and visual support. Multimedia technology can then generally be defined as the combined use of several methods of storage, sensory transmission and, finally, consumption of information at a terminal. Under this definition, multimedia technology is old and widely used, comprising radio, television, performance art, and many printed and educational materials. All of these systems involve the use of multiple sensory formats to facilitate the transmittal of information. It is in the digital age that the term multimedia has taken on the definition and level of prestige that it currently enjoys. The advent of digital technologies has increased multimedia capabilities and potential to unprecedented levels. Digital multimedia is then defined as the process of employing a variety of digital items, possibly synchronized and perhaps embedded within one another, or within an application, to present and transmit information. The digital item is used as the generic term for any type of digitized information, such as still images and audio and video clips.
Digital multimedia technologies, which provide powerful means to acquire and incorporate knowledge from various sources for a broad range of applications, have a strong impact on daily life and have changed our way of learning, thinking and living. The rapid growth of multimedia technology over the last decade has brought about fundamental changes to computing, entertainment and education, and it has presented our computerized society with opportunities and challenges that in many cases are unprecedented. As the use of digital multimedia increases, effective data storage and management become increasingly important. In fields that use large quantities of multimedia data, there is an urge to minimize the volume of data stored while meeting the often conflicting demand for accurate data representation. In addition, a multimedia collection or a single digital item needs to be managed such that users have efficient access, search and browsing capabilities and can effectively consume the required data. Therefore, the rest of the thesis will focus on the development of efficient techniques for browsing, indexing, retrieval, etc., in short, the “management” of multimedia data.
1.1. CONTENT-BASED MULTIMEDIA MANAGEMENT
As the revolutionary advances of the information era continue into the new millennium, the generation and dissemination of digital multimedia content is witnessing phenomenal growth. Especially with the advances in storage technology and the advent of the World Wide Web, there has been an explosion in the amount and complexity of digital information being generated, analyzed, stored, accessed and transmitted. However, this rate of growth has not been matched by the simultaneous emergence of technologies that can manage the content efficiently. State-of-the-art systems for content management lag far behind the expectations of their users. The users mainly expect these systems to perform analysis at the same level of complexity and semantics that a human would perceive while analyzing the content. Herein lies the reason why no commercially successful systems for content management exist yet. Humans assimilate information at a semantic level and do so with remarkable ease, thanks to human intelligence, the natural presence of knowledge and years of experience. The human ability to apply knowledge to the task of sifting through large volumes of auditory, somatic, proprioceptive or visual data and extracting only the relevant information is indeed amazing. The troika of sensory organs, short-term and long-term memory, and the ability to learn and reason based on sensory inputs (through supervised or unsupervised training) is the mainstay of the human ability to perform semantic analysis on multimedia data.
Hence, to make use of this vast amount of multimedia data, there is an undeniable need to develop techniques to efficiently retrieve items of interest from large multimedia repositories based on their content. Due to the difficulty of capturing the content with textual annotations, and because the high degree of manual effort required makes that approach unscalable to large data sets, content-based retrieval over low-level features has become a promising research direction. Using low-level features, query by example (QBE) based retrieval performs relatively well for images, audio clips and even video clips. As impressive as these systems may be, it is nevertheless obvious that they do not really address many multimedia management requirements from the point of view of consumers, companies and professionals. Such systems, for instance, might be good at finding sunset images using color histograms, but they do not appreciably help a user who is really looking for a frame of “Tom Cruise at the Oscars”. Only when such systems reach the point of offering video and audio “content” understanding, not just similarity matching, will we be able to manage multimedia by semantic content. Unfortunately, this problem seems to be unfeasibly difficult, if not impossible, at least for the moment. Although there has been some progress in automatic techniques such as speech recognition, text identification and face recognition, they can only be used for a limited set of multimedia collections and they are far from being robust and reliable. Short of addressing the hard Artificial Intelligence problems in this context, how can we advance the state of the art? How should the management of the ever-increasing, vast amounts of multimedia collections be performed efficiently, employing “feasible” methodologies with the available features, so as not to get lost among them?
The current state of the art in multimedia management systems centers around two fundamental approaches to the organization of semantics: first, the manual and semi-manual annotation techniques loosely defined as metadata initiatives, and second, the automatic extraction of information based on the visual and aural characteristic properties of multimedia items. The well-known drawbacks and limitations of the first approach, such as the infeasibility of applying it to large collections and its user (annotator) dependency, make it usable only in some specific areas and applications, and it is still far from being a global solution to the management problem. The second approach, although more promising, also lacks “feasible” techniques providing robust and reliable solutions. At the heart of any technique for semantic analysis is the age-old question: where does true “content” meaning lie? Does it lie in the relationships among the objects and the audiovisual properties in any one audio section, image or scene? Or does it “emerge” from the real-world context history of the multimedia objects and their potential users, the humans? Researchers who believe the former tend to focus more on automated techniques for feature extraction, content analysis and retrieval. Those favoring the latter tend to focus more on advanced applications with enhanced user interface design, querying and relevance feedback strategies. However, there is common agreement in the field that these two approaches need to converge, since the solution to this old question is not unique but is rather shared between the two approaches. So far, the most promising framework that integrates the two approaches appears to be the standardization activity MPEG-7 [26], formally named “Multimedia Content Description Interface”. It is the standard that describes multimedia content so users can search, browse and retrieve content more efficiently and effectively than they could using today's mainly text-based search engines such as Yahoo, Google, etc. Yet MPEG-7 aims to standardize only a common interface for describing multimedia items (“the bits about the bits”), that is, a common syntax for the descriptors (Ds) and description schemes (DSs). Consequently, MPEG-7 neither standardizes the (automatic) extraction of audiovisual descriptions or features, nor does it specify the multimedia management techniques and methodologies that can make use of the descriptions. In this context it is only a common language that can be used among such techniques whenever needed.
In this thesis, we intend to address the problem from the point where MPEG-7 leaves off, that is, by bringing a global approach to the management of multimedia databases and providing efficient techniques encapsulated within a feasible framework, MUVIS - Multimedia Video Indexing and Retrieval System [P3], [P4], [P6], [P11], [43]. The proposed techniques span the entire range of management functionalities, such as indexing, browsing, retrieval, summarization and efficient access to multimedia items. Developing novel features and improving the performance of existing ones is another objective, and therefore a feature extraction framework has been embedded into MUVIS from the early stages. Via this framework, several visual and aural feature extraction techniques can be involved in any of the management activities in real time. The primary motivation for designing a global framework structure that supports several digital formats, codecs, types and parameters lies in the following fact: the semantic content is totally independent of the various formats and representations of digital multimedia. Therefore, it is crucial to have a framework or test-bed structure such as MUVIS in order to develop techniques that are robust against such variations. From the content-based multimedia retrieval point of view, the audio information can be even more important than the visual part, since it is mostly unique and remains significantly stable over the entire duration of the content. However, audio-based studies lag far behind their visual counterpart, and the development of robust and reliable aural content management systems is still in its infancy. Therefore, the current efforts in this thesis are especially focused on audio-based content analysis and multimedia management.
1.2. OUTLINE OF THE THESIS
The thesis is organized as follows. Chapter 2 presents a general overview of the MUVIS framework, into which all the advanced techniques proposed for content-based multimedia management are either embedded or over which they are implemented. In particular, this chapter discusses the global approach to the management and handling of multimedia items during their entire lifetime, from (real-time) capture or insertion into a MUVIS database via conversion, to their efficient consumption (retrieval, browsing, etc.) by the user. Furthermore, particular emphasis is given to the generic feature extraction framework, which allows efficient development and real-time integration of feature extraction algorithms into the system, and to the video management architecture, which provides hierarchical handling, querying and summarization capabilities. All of the author's MUVIS-related publications are [P3], [P4], [P6], [P7] and [P11].
The issues related to audio-based multimedia management are addressed in Chapters 3 and 4. In Chapter 3, we focus on generic and automatic audio classification and segmentation for audio-based multimedia indexing and retrieval applications. In particular, we present a fuzzy approach towards a hierarchic audio classification and global segmentation framework based on automatic audio analysis, which provides robust, bi-modal and parameter-invariant classification over automatically extracted audio segments. This chapter is mainly based on the author's original publications [P1], [P5] and [P8]. Chapter 4 presents a generic and robust audio-based multimedia indexing and retrieval framework, which supports the dynamic integration of audio feature extraction modules during the indexing and retrieval phases and therefore provides a test-bed platform for developing robust and efficient aural feature extraction techniques. A sample audio feature extraction technique is also developed in this chapter. Furthermore, this framework is built on the high-level content classification and segmentation presented in Chapter 3, in order to improve the speed and accuracy of aural retrieval. The work presented in this chapter is based on the author's publication [P4].
In Chapters 5 and 6, novel indexing, retrieval and browsing schemes are proposed. Chapter 5 first presents a novel multimedia retrieval technique called Progressive Query (PQ). PQ is designed to bring an effective solution especially for querying large multimedia databases. In addition, PQ produces intermediate retrieval results during the execution of the query, which converge to the full-scale search result in a faster way and with no minimum system requirements. The original work on PQ was published in [P2]. Chapter 6 presents a novel indexing technique called Hierarchical Cellular Tree (HCT), which is designed to bring an effective solution especially for the indexing of large multimedia databases and, furthermore, to provide an enhanced browsing capability that enables users to perform a “guided tour” among the database items. The proposed indexing scheme is then optimized for the query method introduced in Chapter 5, the Progressive Query, in order to maximize the retrieval efficiency from the user's point of view. Chapter 6 is mainly based on the author's original publications [P9] and [P10].
Due to the amount and diversity of the subjects, a proper introduction with an extensive literature survey, experimental results and concluding remarks is given within each chapter. Finally, the conclusions of the thesis are drawn in Chapter 7.
1.3. PUBLICATIONS
The majority of the author's contributions to the field of content-based management of multimedia databases is distributed among Chapters 2 to 6. As mentioned earlier, MUVIS is the host framework into which all the proposed techniques are either embedded or over which they are implemented; therefore, the structure of the thesis follows the natural development phases of MUVIS. The main contributions can be summarized in the following points:
• The design and implementation of the MUVIS system, in [P3] and [P4].
• The hierarchical video management, in [P6].
• The design of a dynamic feature extraction framework within MUVIS, in [P7].
• A generic and robust audio classification and segmentation scheme, in [P1], [33], [P5] and [P8].
• An audio-based multimedia indexing and retrieval framework, in [P4].
• A novel multimedia retrieval technique, the Progressive Query, in [P2] and [34].
• An efficient indexing and browsing method for content-based retrieval in multimedia databases, the Hierarchical Cellular Tree, in [P9] and [P10].
• An ongoing study of compression effects on color and texture based multimedia indexing and retrieval, in [P11].
• The extension of the MUVIS framework to mobile platforms, M-MUVIS, in [P12].
• An implementation of the Progressive Query technique on mobile platforms, in [P13].
• A novel feature and shape extraction method based on segmentation over the Canny edge field [13] for multimedia indexing and retrieval, in [P14].
Note that the last four contributions are deliberately excluded from this thesis; however, they are closely linked with the work presented here and will be used by the associated co-authors in their doctoral theses.
Chapter 2

MUVIS Framework
MUVIS was initially created as a Java application in the late 90s to provide an indexing and retrieval framework for large image databases using visual and semantic features such as color, texture and shape. Based upon the experience and feedback from this first system [16], a new framework has been developed, which aims to bring a unified and global approach to indexing, browsing and querying of various digital multimedia types such as audio/video clips and digital images. The primary motivation behind developing this new version of MUVIS is to achieve a unified and global framework and a robust set of applications for capturing, recording, indexing and retrieval, combined with browsing and various other visual and semantic capabilities. The current MUVIS system has been developed as a framework that brings a unified and generic solution to content-based multimedia indexing and retrieval. Variations in formats, representations and other parameters in today's digital multimedia world, such as codec types, file formats, and capture and encoding parameters, may significantly affect indexing and retrieval. Therefore, covering a wide range of multimedia types and especially the last-generation multimedia codecs, MUVIS is developed to provide an efficient framework structure upon which robust algorithms can be implemented, tested, configured and compared against each other. Furthermore, it supports three types of browsing, five levels of hierarchic video representation and summarization and, most important of all, the MUVIS framework supports the explicit integration of aural and visual feature extraction algorithms. This gives third parties a significant advantage in independently developing and testing their feature extraction modules. In short, the MUVIS framework supports the following processing capabilities and properties:
• An effective framework structure, which provides an application independent basis in order to develop audio and visual feature extraction techniques that are dynamically integrated to and used by the MUVIS applications for indexing and retrieval.
• Real-time audio and video capturing, encoding and recording.
• Multimedia conversions into one of the convertible formats that MUVIS supports.
• Scalable video management.
• Video summarization via scene frame extraction from the shot frames available in the video bit-stream.
• A novel Progressive Query mechanism, which provides faster query results along with the query process, lets the user browse among the queries obtained and stops an ongoing query in case the results obtained so far are satisfactory.
• An enhanced retrieval scheme based on explicit visual and aural queries initiated from any MUVIS database that includes audio/video clips and still images.
• Multimedia format and type conversions, such as MPEG-1 video (with or without the presence of MPEG-1 Layer 2 audio) to one of the MUVIS video (and audio) formats, supported so far in order to append alien multimedia files to a MUVIS database.
• A novel indexing and browsing method: Hierarchical Cellular Tree (HCT) and implementation of PQ over HCT.
• Audio content analysis and audio-based multimedia indexing and retrieval.
This chapter is organized as follows. Section 2.1 introduces the MUVIS system architecture,
primary applications, MUVIS multimedia family and databases. The visual/aural indexing
and retrieval schemes are discussed in Section 2.2. The dynamic visual and aural feature extraction frameworks, FeX and AFeX, are explained in Section 2.3. Finally, the scalable video
management and summarization are discussed in Section 2.4.
2.1. MUVIS OVERVIEW
2.1.1. Block Diagram of the System
As shown in Figure 1, the MUVIS framework is based upon three applications, each of which has different responsibilities and functionalities. AVDatabase is mainly responsible for real-time audio/video database creation, with which audio/video clips are captured, (possibly) encoded and recorded in real time from any peripheral audio and video devices connected to a computer. DbsEditor performs the indexing of the multimedia databases; its main task is therefore the offline feature extraction process over the multimedia collections. MBrowser is the primary media browser and retrieval application, into which the PQ technique is integrated as the primary retrieval (QBE) scheme. NQ is the alternative query scheme within MBrowser. Both PQ (Sequential and over HCT) and NQ can be used for retrieval of the multimedia primitives with respect to their similarity to a queried media item (an audio/video clip, a video frame or an image). In order to query an (external) audio/video clip, it should first be appended (an offline operation) to a MUVIS database upon which the query can then be performed, since its unknown duration might cause impractical indexing times for an online query process. There is no such necessity for images; any digital image (inclusive or exclusive to the active database) can be queried within the active database. The similarity distances are calculated by the particular functions, each of which is implemented in the corresponding visual/aural feature extraction (FeX or AFeX) modules.
Figure 1: General structure of the MUVIS framework (block diagram of the FeX/AFeX modules and API, the DbsEditor, MBrowser and AVDatabase applications, and the image, video and hybrid MUVIS databases).
2.1.2. MUVIS Multimedia Family
MUVIS databases are formed using the variety of multimedia types belonging to the MUVIS multimedia family given in Table I. The associated MUVIS application allows the user to create an audio/video MUVIS database in real time via capturing, or by converting clips into one of the specified formats within the MUVIS multimedia family. Since the supported audio and video formats are among the most popular and widely used ones, a native clip in a supported format can be inserted directly into a MUVIS database without any conversion. The same holds for images, but if a conversion is required by the user, any image can be converted into one of the "Convertible" image types presented in Table I.
Table I. MUVIS Multimedia Family

MUVIS Audio
    Codecs:           MP3, AAC, G721, G723, PCM
    Channel Number:   Mono, Stereo
    Sampling Freq.:   16, 22.050, 24, 32, 44.1 kHz
    File Formats:     MP3, AAC, AVI, MP4

MUVIS Video
    Codecs:           H.263+, MPEG-4, YUV 4:2:0, RGB 24
    Frame Rate:       1..25 fps
    Frame Size:       Any
    File Formats:     AVI, MP4

MUVIS Image Types
    Convertible Formats:      JPEG, JPEG 2000, BMP, TIFF, PNG
    Non-convertible Formats:  PCX, GIF, PCT, TGA, EPS, WMF, PGM
2.1.3. MUVIS Applications
MUVIS applications are developed for the Windows OS family. Figure 1 illustrates the primary
applications within the current MUVIS framework. In the following subsections we review
the basic features of each application while emphasizing their role in the overall system.
2.1.3.1 AVDatabase
The AVDatabase application is specifically designed for creating audio/video databases by collecting audio/video files in real time via capturing from a peripheral video/audio device, as shown in Figure 2. An audio/video clip may include only video information, only audio information, or both video and audio interlaced. Several video and audio encoding techniques can be used with any of the encoding parameters specified in Table I.
Video can be captured from any peripheral video source (i.e. PC camera, TV card, etc.) in one of the following formats: YV12 (I420) → YUV 4:2:0, RGB24 (or RGB32), and YUY2 (YUYV) or UYVY. If the capture format is other than YUV 4:2:0, the frame is first converted to YUV 4:2:0 format for encoding. Capturing parameters such as video frame-rate and frame size can be set by the user during the recording phase. The captured video is then encoded in real time with the user-specified parameters given in Table I, recorded into a supported file format and finally appended to the active MUVIS database. Video encoding parameters such as bit-rate, frame-rate, forced-intra rate (if enabled), etc. can be defined at recording time. The supported file formats handle the synchronization of video with respect to the encoding time-stamp of each frame.
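As an illustration of the capture-format conversion just described, the following sketch converts an interleaved RGB24 frame into planar YUV 4:2:0 (I420) using the standard BT.601 conversion formulas with 2x2 chroma subsampling. This is not the actual MUVIS implementation; the buffer layout, function name and rounding behavior are assumptions made for the example.

#include <algorithm>
#include <cstdint>
#include <vector>

// Convert an interleaved RGB24 frame (R,G,B per pixel) into planar YUV 4:2:0 (I420).
// Width and height are assumed to be even, as required by the 2x2 chroma subsampling.
void Rgb24ToI420(const uint8_t* rgb, int width, int height,
                 std::vector<uint8_t>& y, std::vector<uint8_t>& u, std::vector<uint8_t>& v)
{
    y.resize(width * height);
    u.resize(width * height / 4);
    v.resize(width * height / 4);
    auto clamp8 = [](double x) { return static_cast<uint8_t>(std::min(255.0, std::max(0.0, x))); };

    for (int j = 0; j < height; ++j)
        for (int i = 0; i < width; ++i)
        {
            const uint8_t* p = rgb + 3 * (j * width + i);
            double R = p[0], G = p[1], B = p[2];
            // BT.601 luma for every pixel
            y[j * width + i] = clamp8(0.299 * R + 0.587 * G + 0.114 * B);
            // Chroma: one U and one V sample per 2x2 block (the top-left pixel is used here)
            if ((j % 2 == 0) && (i % 2 == 0))
            {
                int c = (j / 2) * (width / 2) + (i / 2);
                u[c] = clamp8(-0.169 * R - 0.331 * G + 0.500 * B + 128.0);
                v[c] = clamp8( 0.500 * R - 0.419 * G - 0.081 * B + 128.0);
            }
        }
}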
Audio(-only) files can also be captured, encoded, recorded and appended in real time, similarly to video. For audio encoding, last-generation audio encoders such as MP3 and AAC are used, giving significantly high quality even at very low bit-rates. ADPCM encoders such as G721 and G723 can also be used for low complexity. Furthermore, audio can be recorded in raw (PCM 16b) format. The compressed audio bit-stream is then recorded into an audio-only file format (container) such as MP3 and AAC, or possibly interlaced with video, such as AVI (Microsoft ©) and MP4 (MPEG-4 File Format v1 and v2). It is also possible to store a standalone audio bit-stream in AVI and MP4 files.
Figure 2: MUVIS AVDatabase application creating a video database in real time
2.1.3.2 DbsEditor
As shown in Figure 3, the DbsEditor application is designed to handle indexing and any editing task for MUVIS databases. Audio/video clips can be created in real time by the AVDatabase application; alternatively, available clips can be directly appended to a MUVIS database provided that their formats are supported (see Table I). Alien formats (e.g. MPEG-1) can first be converted by DbsEditor to one of the supported formats and then appended.
Feature extraction is the primary task of the DbsEditor application. It is basically achieved by extracting and appending new features to any MUVIS database. Hence DbsEditor can add features to, and also remove features from, any type of MUVIS database. The overall functionalities of DbsEditor can be listed as follows:
• Appending new audio/video clips and still images to any MUVIS database and removing such multimedia items from the database.
• Dynamic integration and management of feature extraction (FeX and AFeX) modules.
• Extracting new features or removing existing features of a database by using available
FeX and AFeX modules.
• Converting alien audio/video files for appending into any MUVIS database.
• Previewing any audio/video clip or image in a database.
• Displaying statistical information about a database and/or the items in a database.
• Hierarchical Cellular Tree (HCT) based visual and aural indexing.
Figure 3: MUVIS DbsEditor application.
2.1.3.3 MBrowser
MBrowser is the main media retrieval and browser terminal. In the most basic form, it has all
the capabilities of an advanced multimedia player (or viewer) and an efficient multimedia database browser. Furthermore, it allows users to access any multimedia item within a MUVIS
database easily, efficiently and in any of the designed hierarchic levels, especially for the
video clips. It supports five levels of video display hierarchy: single frame, shot frames (key-frames), scene frames, a video segment and the entire video clip.
MBrowser has a built-in search and query engine, which is capable of searching a database of any multimedia type for primitives similar to a queried media item (a video clip, a frame or an image). In order to query an audio/video clip, it should first be appended to the MUVIS database upon which the query will be performed. There is no such requirement for images; any digital image (inclusive or exclusive to the active database) can be
queried within the active database. Query retrieval is based on comparing the similarity distances between the queried media item's feature vector(s) and the feature vectors of the multimedia primitives available in the database, and performing a ranking operation afterwards. The
similarity distances will be calculated by the particular functions each of which is implemented in the corresponding feature extraction module.
The novel retrieval scheme, the Progressive Query (PQ), has been developed and integrated into the MBrowser application. It provides instantaneous query results along with the query process, lets the user browse around the sub-query retrievals, and stops an ongoing query in case the results obtained so far are satisfactory. Chapter 5 covers the details of the PQ method. One example PQ instance is shown in Figure 4.
MBrowser provides the following additional functionalities:
• Video summarization via scene detection and key-frame browsing.
• Random access support for audio/video clips.
• Displaying any crucial information (i.e. database features, parameters, status, etc.) related to the active database and user commands.
• Visualization of the feature vectors of images and video key-frames.
• Various browsing options: random, forward/backward and aural or visual HCT (if the database is indexed via HCT).
Figure 4: MUVIS MBrowser application with an image (progressive) query performed.
2.1.4. MUVIS Databases
MUVIS system supports the following types of multimedia databases:
• Audio/Video databases include only audio/video clips and associated indexing information.
• Image databases include only still images and associated indexing information.
• Hybrid databases include both audio/video clips and images, and associated indexing
information.
As illustrated in Figure 1, audio/video databases can be created using both AVDatabase, and
DbsEditor applications and image databases can be created using only DbsEditor. Hybrid databases can be created by appending images to video databases using DbsEditor, or by appending audio/video clips to image databases using either DbsEditor or AVDatabase.
The experiments and simulations performed in the thesis use the following 8 sample MUVIS databases:
1) Open Video Database: This database contains 1500 video clips, each of which is
downloaded from “The Open Video Project” web site [48]. The clips are quite old (from the
1960s) but contain color video with sound. The total duration of the database is around 46
hours. The spoken language is English.
2) Real World Audio/Video Database: There are 800 audio-only and video clips in the
database with a total duration of over 36 hours. They are captured from several TV channels
and the content is distributed among News, Advertisements, Talk Shows, Cartoons, etc. The
speech is distributed among English, Turkish, French, Swedish, Arabic and Finnish languages.
3) Sports Hybrid Database: There are 200 video clips with a total duration of 12 hours
and mainly carrying sports content such as Football, Tennis and Formula-1. There are also
495 images (in GIF and JPEG formats) showing instances from Football matches and other
sport tournaments. The spoken language is mostly Finnish.
4) Music Audio Database: There are 550 MP3 music files that are among Classical,
Techno, Rock, Metal, Pop and some other native music types.
5) Corel_1K Image Database: There are 1000 medium resolution (384x256 pix) images
from diverse contents such as wild life, city, buses, horses, mountains, beach, food, African
natives, etc.
6) Corel_10K Image Database: There are 10000 low resolution images (in thumbnail
size) from diverse contents such as wild life, city, buses, horses, mountains, beach, food, African natives, etc.
7) Shape Image Database: There are 1500 black and white (binary) images that mainly
represent the shapes of different objects such as animals, cars, accessories, geometric objects,
etc.
8) Texture Image Database: There are 1760 texture images representing the pure textures
from several materials and products.
2.2. INDEXING AND RETRIEVAL SCHEME
2.2.1. Indexing Methods
The indexing of a MUVIS database is performed by the DbsEditor application in three steps. The first, mandatory, step is collecting the multimedia items within a MUVIS database. This creates a default sequential indexing where the database items are numbered (indexed) sequentially. The other two steps are optional and are required only to enable (fast) query and (HCT) browsing functionalities. The second step is to extract visual/aural features using the available FeX and AFeX modules. Once at least one visual or aural feature has been extracted for a database, the database can, as the third step, be further indexed by HCT.
The HCT indexing scheme has recently been developed for MUVIS databases. It has mainly the following characteristics:
• Dynamic (Incremental) indexing scheme.
• Parameter invariant (None or minimum parameter dependency)
• Dynamic cell size formation.
• Hierarchic structure with fast (i.e. ~O(n log n)) indexing formation.
• Similar items are grouped into cells via Mitosis operation(s).
• Optimized for PQ.
Once a MUVIS database is indexed by HCT, the most relevant items can be retrieved faster via “PQ over HCT”. Moreover, an enhanced browsing scheme, the so-called HCT Browsing, is enabled under the MBrowser application. However, even without HCT indexing, PQ can still be performed over the sequential indexing, in which case it is called “Sequential PQ”. The structural details of the HCT indexing method are explained in Chapter 6.
2.2.2. Retrieval Methods
There are two retrieval schemes available under MBrowser application for the multimedia
items within a MUVIS database: browsing and query-by-example (QBE). MBrowser provides
three different browsing methods: sequential, random and HCT. The former two require only
the first (mandatory) indexing step; however for HCT browsing all three indexing steps
should be performed. Depending on the database type and the features present, there are two
different HCT browsing types: visual and aural. If both feature types are present, both browsing types can be performed on a hybrid or a video database. However, for image databases,
only visual HCT browsing is possible. The details about HCT browsing methods are covered
in Chapter 6.
There are mainly two QBE methods available: Normal Query (NQ) and Progressive
Query (PQ). NQ is the most basic QBE operation and mainly works as follows: using the
available aural or visual features (or both) of the queried multimedia item (i.e. an image, a
video clip, an audio clip, etc.) and all the database items, the similarity distances are calculated and then merged to obtain a unique similarity distance per database item. Ranking the
items according to their similarity distances (to the queried item) over the entire database
yields the query result. NQ for QBE is computationally slow, costly and CPU intensive, especially for large multimedia databases. Therefore, the Progressive Query (PQ) is implemented in MUVIS to provide an alternative query technique. As its name implies, PQ provides query results along with the query process, lets the user browse around the queries obtained, and stops the ongoing query in case the results obtained so far are satisfactory. As expected, PQ and NQ produce the same (final) retrieval results at the end, but PQ yields faster retrievals than NQ, especially if the database is HCT indexed and PQ is performed over HCT.
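For illustration, a minimal sketch of the NQ ranking described above is given below. It assumes that the per-(sub-)feature distances have already been computed by the corresponding FeX/AFeX modules; the structure names and the weighted merging rule are assumptions of the example rather than the actual MUVIS code.

#include <algorithm>
#include <utility>
#include <vector>

// One database item with its per-(sub-)feature distances to the queried item,
// as returned by the GetDistance functions of the FeX/AFeX modules.
struct ItemDistances {
    int id;                          // database index of the item
    std::vector<double> distances;   // one distance per (sub-)feature
};

// Normal Query (NQ): merge the per-feature distances of every database item into a
// single similarity distance (a weighted sum here) and rank the whole database by it.
std::vector<std::pair<int, double>> NormalQuery(const std::vector<ItemDistances>& items,
                                                const std::vector<double>& weights)
{
    std::vector<std::pair<int, double>> ranked;
    ranked.reserve(items.size());
    for (const auto& it : items) {
        double d = 0.0, wsum = 0.0;
        for (size_t f = 0; f < it.distances.size() && f < weights.size(); ++f) {
            d += weights[f] * it.distances[f];
            wsum += weights[f];
        }
        ranked.emplace_back(it.id, wsum > 0.0 ? d / wsum : d);
    }
    // Smallest merged distance first: the most similar items head the result list.
    std::sort(ranked.begin(), ranked.end(),
              [](const auto& a, const auto& b) { return a.second < b.second; });
    return ranked;
}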
2.3. FEATURE EXTRACTION FRAMEWORK
As mentioned in the previous section, multimedia items can be collected via several methods, such as real-time recording or conversion, and then appended to any MUVIS database. Once items are available within a MUVIS database, the associated features are extracted and stored along with the items to complete the sequential indexing scheme for that database. In order to provide both visual and aural indexing schemes, MUVIS provides visual and aural feature extraction frameworks designed in such a way that feature extraction modules can be developed independently and integrated into the MUVIS system dynamically (at run time). This is the basis of the framework structure, which allows third-party feature extraction modules to be integrated into MUVIS without knowing the details of the MUVIS applications. In the following sections we explain the details of the aural and visual feature extraction frameworks.
Figure 5: AFeX module interaction with MUVIS applications (the AFex_*.DLL exports AFex_Bind, AFex_Init, AFex_Extract, AFex_GetDistance and AFex_Exit, which are called by DbsEditor and MBrowser through the AFeX API declared in AFex_API.h).
2.3.1. Aural Feature Extraction: AFeX
AFeX framework mainly supports dynamic audio feature extraction module integration for
audio clips. Figure 5 shows the API functions and linkage between MUVIS applications and a sample AFeX module. Each audio feature extraction algorithm should be implemented as a
Dynamically Linked Library (DLL) with respect to AFeX API. AFeX API provides the necessary handshaking and information flow between a MUVIS application and an AFeX module.
Five API function properties (name and types) are declared in AFex_API.h. The creator of an
AFeX module should implement all specified API functions, which are described as follows:
• AFex_Bind: Used for the handshaking operation between a MUVIS application and an AFeX module. The AFeX module fills a specific structure to introduce itself to the application. This function is called only once at the beginning, just after the application links the AFeX module at run time.
• AFex_Init: The feature extraction parameters are given to initialize the AFeX module. The AFeX module performs the necessary initialization operations, i.e. memory allocation, table creation, etc. This function is called for the initialization of a unique sub-feature extraction operation. A new sub-feature can be created by using a different set of feature parameters.
• AFex_Extract: Used to extract the features of an audio frame (buffer). It returns the feature vectors, which should be normalized in such a way that the total length of the vector is between 0.0 and 1.0. This normalization is required for merging multiple (sub-)features while querying in MBrowser.
• AFex_Exit: Used for resetting and terminating the AFeX module operation. It frees the entire memory space allocated in the AFex_Bind function. Additionally, if AFex_Init has already been called, this function resets the AFeX module to perform further feature extraction operations. This function is called at least once when the MUVIS application is terminated, but it might be called at the end of each AFeX operation per sub-feature extraction.
• AFex_GetDistance: Used to obtain the similarity measure by calculating the distance between two feature vectors; the appropriate distance measurement algorithm should therefore be implemented in this function. The resulting distance is returned as a double precision number.
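To make the module interface concrete, a minimal sketch of what such an AFeX-style header could look like is given below. The exact structure members, argument lists and calling conventions of the real AFex_API.h are not reproduced here; all types and signatures in this sketch are assumptions for illustration only.

/* Illustrative AFeX-style module interface (hypothetical types and signatures). */
#ifndef AFEX_API_SKETCH_H
#define AFEX_API_SKETCH_H

typedef struct {
    char   name[64];        /* feature name the module announces to the application */
    int    numParams;       /* number of sub-feature parameters                      */
    double params[16];      /* default parameter values                              */
} AFexModuleInfo;

/* Handshake: the module fills the info structure; called once after the DLL is linked. */
int  AFex_Bind(AFexModuleInfo* info);

/* Initialize one sub-feature extraction with a particular parameter set. */
int  AFex_Init(const double* params, int numParams);

/* Extract a (unit-normalized) feature vector from one audio frame of PCM samples. */
int  AFex_Extract(const short* frame, int frameLen, int sampleRate,
                  double* featureVector, int* vectorLen);

/* Distance between two feature vectors; the module defines its own metric. */
double AFex_GetDistance(const double* v1, const double* v2, int vectorLen);

/* Reset the module and release everything it allocated. */
int  AFex_Exit(void);

#endif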
The details about audio indexing along with a sample AFeX module implementation
are explained in Chapter 4.
2.3.2. Visual Feature Extraction: FeX
Visual features are extracted from two visual media types in MUVIS framework: video clips
and images. Features of video clips are extracted from the key-frames of the video clips. During real-time recording phase, AVDatabase may optionally and separately store the uncompressed (original) key-frames of a video clip along with the video bit-stream. If the original
key-frames are present they are used for visual feature extraction process. If not, DbsEditor
can extract the key-frames from the video bit-stream and use them instead. The key-frames
are the INTRA frames in the MPEG-4 or H.263 bit-stream. In most cases, a shot detection algorithm is used to select the INTRA frames during the encoding stage, but sometimes a forced-intra scheme might be present in order to prevent possible degradations. Image features, on the other hand, are simply extracted from their 24-bit RGB frame buffer, which is obtained by decoding the image.
Figure 6: FeX module interaction with MUVIS applications (the Fex_*.DLL exports Fex_Bind, Fex_Init, Fex_Extract, Fex_GetDistance and Fex_Exit, which are called by DbsEditor and MBrowser through the FeX API declared in Fex_API.h).
The rest of the implementation details of the FeX structure are similar to AFeX: each visual FeX module should be implemented as a Dynamic Link Library (DLL) with respect to the FeX API and stored in an appropriate directory. The FeX API provides the necessary handshaking and information flow between a MUVIS application and the feature extraction module. Figure 6 summarizes the API functions and linkage between MUVIS applications and an illustrative FeX module.
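As an illustration of what the extraction and distance functions of a visual FeX module might do, the sketch below computes a coarse, unit-normalized RGB histogram from a 24-bit RGB frame buffer and compares two such vectors with a Euclidean distance. The bin count, function names and signatures are assumptions of the example and do not reproduce the actual FeX API.

#include <cmath>
#include <cstdint>
#include <vector>

// Extract a 4x4x4 = 64-bin RGB histogram from an interleaved 24-bit RGB buffer
// and normalize it to unit (L2) length, as MUVIS expects for feature merging.
std::vector<double> ExtractRgbHistogram(const uint8_t* rgb, int width, int height)
{
    std::vector<double> hist(64, 0.0);
    for (int p = 0; p < width * height; ++p) {
        int r = rgb[3 * p + 0] >> 6;   // 2 bits per channel -> 4 levels
        int g = rgb[3 * p + 1] >> 6;
        int b = rgb[3 * p + 2] >> 6;
        hist[(r << 4) | (g << 2) | b] += 1.0;
    }
    double norm = 0.0;
    for (double h : hist) norm += h * h;
    norm = std::sqrt(norm);
    if (norm > 0.0)
        for (double& h : hist) h /= norm;
    return hist;
}

// Distance function of the module: Euclidean distance between two feature vectors.
double GetDistance(const std::vector<double>& v1, const std::vector<double>& v2)
{
    double d = 0.0;
    for (size_t i = 0; i < v1.size() && i < v2.size(); ++i)
        d += (v1[i] - v2[i]) * (v1[i] - v2[i]);
    return std::sqrt(d);
}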
2.4. VIDEO SUMMARIZATION
Video summarization is achieved within the MUVIS framework by performing scene analysis over the key-frames, which are available either explicitly or implicitly along with the video stream. Video representation and summarization rely on the video structure, which can be described hierarchically in three segmentation layers: shots, scenes and the entire video. In the shot layer, one or more shot frames can be chosen as key-frames in order to represent the shot segments. For video summarization purposes the key-frames might include quite a large amount of redundancy, such as the repetition of similar shots during the run-time of the video. Therefore, by eliminating such redundancy, scene frames should be extracted out of the key-frames to achieve an efficient video summarization.
In order to extract scene frames from key-frames, we developed the following techniques, which are integrated into the MBrowser application: Scene Detection (SD) by Minimum Spanning Tree (MST) clustering and SD by Nearest Neighborhood Elimination (NNE). The video encoder (MPEG-4 or H.263+) segments the video sequence into shots, usually by a shot detection algorithm implemented along with the encoding process, and encodes the first frame of each shot interval, the key-frame, as an INTRA frame (I-frame). Using the key-frames available in the video bit-stream, both techniques apply feature extraction directly to the key-frames in order to obtain an inter-similarity measure between them. Any feature extraction method (based on color, texture, motion, DCT coefficients, etc.) can be used for the proposed techniques. In this work, for any particular video, we used two color features based on HSV and YUV color histograms, and two texture features (Gabor [40] and GLCM [49]) available within the active MUVIS database. These features are used for the inter-similarity measure between the key-frames of the video stream for both techniques (MST and NNE). The work presented here is based on the author's publication [P6].
2.4.1. Scene Analysis by MST
MST is an efficient and widely used clustering technique [29]. The key-frames of a particular
video represent the nodes in a MST. In order to find the weight (similarity measure) of the
branch between two nodes, we use the normalized Euclidean Distance (D) between associated
feature vectors x and y, defined as:

\[
D = \frac{\lVert x - y \rVert_2}{1 + \lVert x \rVert_2 + \lVert y \rVert_2}
\tag{1}
\]
By using the distance between two vectors as the weight of the MST branches, the tree
is formed. Then the tree branches are sorted from the largest branch weight to the smallest
one. At the beginning, the whole tree is a single cluster. In order to create a new cluster, it is
sufficient to break the (next) largest branch. Figure 7 illustrates a sample MST example with
three clusters.
Figure 7: MST clustering illustration (a sample MST whose branches form three clusters; the mean branch weight is 3).
Each time a branch is broken and thus a new cluster is created, the nodes inside the cluster can be assumed to show higher similarity to each other than to the nodes outside the cluster. Knowing that those nodes represent the key-frames, in this way we group all the key-frames that are
similar to each other. Once a sufficient number of clusters are achieved, it is straightforward
to choose one or more key-frame(s) as the scene frame(s), which can represent the scene
(cluster). Therefore, for an efficient scene detection operation, it is important to know where
to stop clustering.
In supervised mode, the user sets a threshold value against which each branch weight is compared in order to decide whether or not to break the branch. However, such a value might not yield the optimal number of scene frames that should be extracted. An alternative is to perform the scene detection in a semi-automatic way, that is, the user directly sets the number of scene frames required for summarization and the corresponding threshold value is found accordingly. Again the user may not know the optimal number of scene frames beforehand, but at least the video will be summarized by the specified number of scene frames at the end. We developed an unsupervised (automatic) technique in
which this threshold value is found adaptively. In a sample case as shown in Figure 7, the optimal threshold value should be between 3 and 8 since there are two significantly long (8 and
9) inter-cluster branches and short intra-cluster branches. So the aim is to break the longest of
the two branches and stop clustering. For this the threshold value should be somewhere in between the smallest inter-cluster (broken) branch and the largest intra-cluster (unbroken)
branch. This might justify the usage of the mean of all the branch weights (mean = 3) as the
threshold value since it satisfies this condition. There might, however, be circumstances where this method leads to redundant cluster creation and hence redundant scene frame detections. The probability of redundant scene detection increases especially when there are significantly more intra-cluster branches than inter-cluster branches. For example, in the sample MST in Figure 7, if there happened to be one or more additional intra-cluster branches with weights below 3, the mean value would drop below 3 and, as a result, there would be 4 clusters and node 5 would become a new but redundant scene frame. In order to avoid such a case, we propose a 2-step weighted root mean square algorithm to determine the threshold value as follows:
• Step 1: Calculate mean (µ) and variance (σ) of the branch weights (wi).
• Step 2: Calculate the threshold value as the weighted root mean square of the branch
weights as follows:
\[
\mathrm{Threshold} = \sqrt{\frac{\sum_{i=1}^{N} \left(k_i \, w_i\right)^{2}}{N'}}, \qquad
N' = \sum_{i=1}^{N} k_i, \qquad
k_i = \begin{cases}
\mathrm{(int)}\!\left(\dfrac{w_i - \mu}{\sigma}\right) & \text{if } w_i \ge \mu \\[4pt]
1 & \text{if } w_i < \mu
\end{cases}
\tag{2}
\]

where N is the number of branches.
As clearly seen in this equation, large weight values affect the threshold value more
significantly than the smaller ones. Furthermore, the weight values around the mean (in a
window of variance) have no effect at all on the threshold value. As a result we can summarize the unsupervised (automatic) SD by MST algorithm as follows:
I. Extract key-frames and their feature vectors.
II. Use feature vectors as the nodes of the MST and a similarity distance measure (i.e.
Euclidean) as the branch weights (distance between nodes).
III. Form MST and sort branches from biggest to smallest weights.
IV. Find the mean and variance of the branch weights and the Threshold according to
equation (2).
V. Break the next largest branch if its weight is bigger than the Threshold value. If not, stop clustering and proceed to VI.
VI. For each cluster choose appropriate one or more key-frame(s) to represent the cluster (scene) and thus extract scene frames.
VII. The total number of clusters gives the optimal number of scenes. Finally, scene
frame(s) extracted from each cluster can be used for the summarization of the video
sequence.
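A compact sketch of steps I-VII, under the assumption that the key-frame feature vectors are already available, could look as follows. The MST is built with Prim's algorithm and clusters are obtained by cutting every branch whose weight exceeds the threshold of equation (2); sigma is taken here as the standard deviation of the branch weights. This is an illustrative reconstruction of the procedure, not the MUVIS implementation.

#include <cmath>
#include <limits>
#include <vector>

// Normalized Euclidean distance of equation (1).
double NormDist(const std::vector<double>& x, const std::vector<double>& y)
{
    double dxy = 0.0, nx = 0.0, ny = 0.0;
    for (size_t i = 0; i < x.size(); ++i) {
        dxy += (x[i] - y[i]) * (x[i] - y[i]);
        nx  += x[i] * x[i];
        ny  += y[i] * y[i];
    }
    return std::sqrt(dxy) / (1.0 + std::sqrt(nx) + std::sqrt(ny));
}

// Unsupervised SD by MST (steps I-VII): returns one cluster (scene) label per key-frame.
std::vector<int> SceneDetectMST(const std::vector<std::vector<double>>& keyFrames)
{
    const int n = static_cast<int>(keyFrames.size());
    if (n < 2) return std::vector<int>(n, 0);

    // Prim's algorithm: best[v] / parent[v] end up describing the MST branch of node v.
    std::vector<double> best(n, std::numeric_limits<double>::max());
    std::vector<int>    parent(n, -1);
    std::vector<bool>   inTree(n, false);
    best[0] = 0.0;
    for (int it = 0; it < n; ++it) {
        int u = -1;
        for (int v = 0; v < n; ++v)
            if (!inTree[v] && (u < 0 || best[v] < best[u])) u = v;
        inTree[u] = true;
        for (int v = 0; v < n; ++v) {
            if (inTree[v]) continue;
            double w = NormDist(keyFrames[u], keyFrames[v]);
            if (w < best[v]) { best[v] = w; parent[v] = u; }
        }
    }

    // Threshold of equation (2): weighted root mean square of the branch weights.
    std::vector<double> w(best.begin() + 1, best.end());
    double mu = 0.0, sigma = 0.0;
    for (double wi : w) mu += wi;
    mu /= w.size();
    for (double wi : w) sigma += (wi - mu) * (wi - mu);
    sigma = std::sqrt(sigma / w.size());
    double num = 0.0, den = 0.0;
    for (double wi : w) {
        double ki = (sigma > 0.0 && wi >= mu) ? std::floor((wi - mu) / sigma) : 1.0;
        num += (ki * wi) * (ki * wi);
        den += ki;
    }
    double threshold = (den > 0.0) ? std::sqrt(num / den) : mu;

    // Cut every MST branch heavier than the threshold; the remaining connected
    // components are the clusters (scenes).
    std::vector<int> root(n);
    for (int v = 0; v < n; ++v) root[v] = v;
    auto find = [&root](int v) { while (root[v] != v) v = root[v] = root[root[v]]; return v; };
    for (int v = 1; v < n; ++v)
        if (best[v] <= threshold) root[find(v)] = find(parent[v]);
    std::vector<int> labels(n);
    for (int v = 0; v < n; ++v) labels[v] = find(v);
    return labels;   // key-frames sharing a label belong to the same scene
}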
2.4.2. Scene Analysis by NNE
In this technique, the idea is to keep only one key-frame as a scene frame and eliminate all the
other similar ones. This is basically achieved without pre-clustering or any kind of tree forming. Initially the first key-frame of the video sequence is chosen as the first scene frame. By
using the same normalized similarity measure as in the previous section, similar key-frames to
the current scene frame are eliminated from the list of key-frames. In a supervised mode, a
fixed threshold value is used to find which key-frames are to be eliminated. Once all the similar key-frames are eliminated, the next “present” key-frame in the time-line is chosen as the
second scene frame since it is not eliminated and therefore, it is a new scene frame. Then all
the existing key-frames are compared with this new scene frame and similar ones are again
eliminated. The algorithm proceeds until the last key-frame has been processed, always promoting the next surviving key-frame to a scene frame.
Similar to SD by MST, this technique can be implemented in three modes: supervised, semi-automatic and unsupervised (automatic). As follows from the previous sub-section, it is straightforward to perform the supervised and semi-automatic modes for this technique. In order to achieve unsupervised (automatic) SD by NNE, the optimal threshold value used for key-frame elimination should be found by an adaptive mechanism. As in the MST method, this value should lead the algorithm to the optimal number of scenes to represent (summarize) the video clip. Since the similarity measure between two key-frames is
found by normalized Euclidean distance, this threshold value is somewhere in between 0 and
1.0. Thus, starting from a threshold of 0, increasing threshold values yield a decreasing number of scene frames.
For the supervised (i.e. fixed threshold) mode, the NNE algorithm can be summarized as follows:
I. Choose the first/next key-frame as the first/next scene frame.
II. Compare the normalized similarity measure between the current scene frame and the next key-frame with the threshold value: if it is less than the threshold, eliminate that key-frame; otherwise keep it.
III. Once all the key-frames in the list have been compared with the current scene frame, proceed to the next surviving key-frame. If such a key-frame exists, go to I; otherwise terminate the iterations.
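A minimal sketch of this supervised elimination loop, reusing the NormDist function of the earlier MST sketch, is shown below. It is an illustration under the stated assumptions, not the MUVIS code.

#include <vector>

// Supervised SD by NNE: returns the indices of the key-frames kept as scene frames.
// 'keyFrames' holds the key-frame feature vectors in time-line order and 'threshold'
// is the fixed elimination threshold (between 0 and 1 for the normalized distance).
std::vector<int> SceneDetectNNE(const std::vector<std::vector<double>>& keyFrames,
                                double threshold)
{
    std::vector<int>  sceneFrames;
    std::vector<bool> eliminated(keyFrames.size(), false);

    for (size_t s = 0; s < keyFrames.size(); ++s) {
        if (eliminated[s]) continue;                    // skip already eliminated key-frames
        sceneFrames.push_back(static_cast<int>(s));     // next survivor becomes a scene frame
        // Eliminate every remaining key-frame that is similar to the new scene frame.
        for (size_t k = s + 1; k < keyFrames.size(); ++k)
            if (!eliminated[k] && NormDist(keyFrames[s], keyFrames[k]) < threshold)
                eliminated[k] = true;
    }
    return sceneFrames;
}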
Figure 8: Number of scene frames versus threshold sketch for Figure 9.
For the semi-automatic mode, the number of scene frames (N_S) is given and the appropriate threshold value can be found adaptively as follows:
I. Set the two boundary threshold values thr_low^0 = 0 and thr_high^0 = 1, and let i = 0, where i denotes the iteration number.
II. Let threshold^i = (thr_low^i + thr_high^i) / 2.
III. Run SD by NNE and extract the scene frames. Let the number of scene frames be n^i.
IV. If n^i = N_S then stop and terminate the iteration.
V. If n^i > N_S then set thr_low^(i+1) = threshold^i; otherwise set thr_high^(i+1) = threshold^i.
VI. Increment the iteration number (i → i+1) and go to II.
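The bisection above can be sketched directly on top of the supervised routine. The iteration cap is an assumption added so the loop always terminates when the exact count N_S cannot be reached.

#include <vector>

// Semi-automatic SD by NNE: bisect the threshold until the requested number of
// scene frames (targetScenes) is obtained, or the iteration budget is exhausted.
std::vector<int> SceneDetectNNESemiAuto(const std::vector<std::vector<double>>& keyFrames,
                                        int targetScenes, int maxIterations = 20)
{
    double low = 0.0, high = 1.0;
    std::vector<int> scenes;
    for (int i = 0; i < maxIterations; ++i) {
        double threshold = 0.5 * (low + high);
        scenes = SceneDetectNNE(keyFrames, threshold);
        int n = static_cast<int>(scenes.size());
        if (n == targetScenes) break;            // step IV: exact match found
        if (n > targetScenes) low = threshold;   // too many scenes -> raise the threshold
        else                  high = threshold;  // too few scenes  -> lower the threshold
    }
    return scenes;
}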
Finally, in the unsupervised (automatic) mode, the algorithm adaptively determines the optimal number of scene frames. In the ideal case the key-frames that belong to the same scene (cluster) should be semantically similar to each other, so they are expected to give closer distance values in the feature domain than key-frames from other scenes (clusters). Therefore, the automatic mode runs the SD by NNE algorithm using several (i.e. 40) threshold values that are logarithmically distributed between 0 and 1.0. This yields a number-of-scene-frames versus threshold plot, an example of which is shown in Figure 8. Starting from 0 and moving along the increasing threshold values, the elimination of the close (intra-scene) key-frames is soon completed, after which one can expect a region in which no elimination occurs until the inter-scene frame eliminations begin. In order to have a robust algorithm, we detect the lowest-gradient section with the largest width, as shown in Figure 8. Once this region is found, its scene frame number yields the optimal number of scenes with which the video clip can be represented (summarized), and the scene frames are extracted accordingly.
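The automatic mode could be sketched as follows, on top of the supervised routine above. The small epsilon used to start the logarithmic threshold spacing and the "widest flat run" flatness criterion are assumptions of the example; the exact spacing and plateau detection used in MUVIS are not reproduced here.

#include <cmath>
#include <vector>

// Unsupervised SD by NNE: run the supervised routine over logarithmically spaced
// thresholds and keep the result from the widest low-gradient (flat) region of the
// scene-count versus threshold curve.
std::vector<int> SceneDetectNNEAuto(const std::vector<std::vector<double>>& keyFrames,
                                    int numThresholds = 40, double epsilon = 1e-3)
{
    std::vector<int> counts(numThresholds);
    std::vector<std::vector<int>> results(numThresholds);
    for (int t = 0; t < numThresholds; ++t) {
        // Thresholds grow logarithmically from epsilon up to 1.0.
        double thr = epsilon * std::pow(1.0 / epsilon, t / double(numThresholds - 1));
        results[t] = SceneDetectNNE(keyFrames, thr);
        counts[t]  = static_cast<int>(results[t].size());
    }
    // Find the widest run of consecutive thresholds giving the same scene count.
    int bestStart = 0, bestLen = 1, start = 0;
    for (int t = 1; t < numThresholds; ++t) {
        if (counts[t] != counts[start]) start = t;
        if (t - start + 1 > bestLen) { bestLen = t - start + 1; bestStart = start; }
    }
    return results[bestStart];
}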
2.4.3. Video Summarization Experiments
In the experiments we use color and texture features such as HSV, YUV color histogram bins
and GLCM-based texture features [49]. As mentioned in the previous section, the similarity
measure is the normalized Euclidean distance.
The first experiment is unsupervised (automatic) scene frame detection. For the SD by MST method, in order to have results comparable with the SD by NNE method, only the one key-frame with the minimum time-line index in each cluster is chosen as the scene frame for that cluster. First, a monotonous dialog between two people in a 5-minute clip is deliberately chosen so as to have a clear idea of the exact number of scenes present; Figure 9 shows the results of the automatic scene-detection experiment using the two proposed methods. Another similar example, shown in Figure 10, is used for the semi-automatic scene detection experiment with the number of scene frames set to 3. The top sections of these figures show the key-frames extracted by the encoder. The bottom two rows display the results of the two methods (upper row: SD by NNE; lower row: SD by MST).
Figure 9: Key-Frames (top) and Unsupervised Scene Frames by NNE (bottom-up) and MST
(bottom-down).
Figure 10: Key-Frames (top) and Semi-Automatic 3 Scene Frames by NNE (bottom-up) and
MST (bottom-down).
A ground-truth experimental set-up is used in order to test the performance of the proposed automatic scene detection algorithms. Ten video clips, captured in QCIF size at 25 fps and encoded with both H.263+ and MPEG-4 (simple profile) at several bit-rates (i.e. >128 Kb/s), are used for evaluation. Their content varies among sports programs, talk shows, commercials and news, and their durations vary approximately between 2 and 7 minutes. YUV histogram based shot-detection methods in the encoders extract the key-frames, including some additional forced-intra key-frames (period of 1000). A group of 8 people evaluated the scene frames detected by both methods in automatic mode and the
results are shown in Table II.
Table II: The ground-truth table for automatic scene detection algorithms in 10 sample video sequences.

Seq. #   # of KFs   # of Scenes Detected   False Detection    Missed        True number
                       MST      NNE          MST     NNE      MST    NNE    of scenes
SEQ1        23          4        4            0       0        0      0         4
SEQ2        21          9        7            4       2        0      0         5
SEQ3        30          5        4            1       1        0      1         4
SEQ4        27          4        5            0       0        2      1         6
SEQ5        31         13       10            4       2        0      1         9
SEQ6        63         12        9            3       2        3      5        12
SEQ7        38          5        7            1       3        1      1         5
SEQ8        39         12        9            4       2        2      3        10
SEQ9        17         15        8            2       0        0      5        13
SEQ10       26          9        8            0       0        8      9        17
Both methods achieve high accuracy for the talk show and news sequences, where there is a clean distinction between scene frames. On the other hand, the accuracy is significantly degraded in the examples where such a “clean” scene distinction is not encountered, such as sports programs and commercials. In fact, in such examples the evaluation group also found it difficult to reach a consensus on the number of scenes. Several experiments using both methods in semi-automatic mode show that they can extract the required number of scene frames in almost all cases with high accuracy. The semi-automatic mode is much more robust to variations in scene content and it takes significantly less time than the automatic mode. Therefore, when the content is fuzzy with respect to scene extraction, the semi-automatic mode should be preferred; otherwise, the automatic mode can be used in both methods.
2.4.4. Scalable Video Management
The term “Video Management” involves the efficient consumption of digital video clips, possibly in a large multimedia database. Scalability is the primary property needed for the efficient management of video information, since a video clip can have an indefinite duration and possibly bear multiple and mixed content. In this context, in addition to the summarization scheme presented earlier, a hierarchical representation is further required for the retrieval of a region of interest (ROI) of the video. Thus, the user can access and also query a ROI of a video clip.
2.4.4.1 ROI Access and Query
ROI can be a single (key-) frame or a group of shots represented by their key-frames. The
built-in key-frame summarization and scene analysis are presented in the previous subsections. This is a static representation (summarization) of a video clip, that is, the key-frames
and the extracted scene frames are shown to the user. In the ROI approach, the user can further define the region over the key-frames using MBrowser and then MBrowser can retrieve it
by rendering this ROI section of the video clip alone (possibly together with the synchronized
audio) via its random access capability. Figure 11 shows a sample ROI selection from the
key-frame summary of a video clip with duration of roughly two minutes, containing around
90 seconds of commercials and 30 seconds of F-1 race at the end. Note that only the last 11
key-frames (in the 2nd page of Key-Frames Display Dialog) are shown out of 32.
Figure 11: ROI selection (“From”/“To” markers) and rendering (“Play ROI”) from the key-frames of a video clip.
The example given in Figure 11 is only one way of defining a ROI, which is through
the key-frame summarization of the video clip. MBrowser also provides various other ways of
ROI setting directly from the main GUI window such as using an enhanced slider bar with
two pointers (i.e. “from” and “to”) or directly browsing and setting the appropriate ROI
boundary key-frames (or frames) using the (key-) frame browser buttons. These GUI primitives (buttons and the slider bar) and the ROI options in the “Video Query Options Dialog”
can be seen in Figure 12.
Once the ROI is set, the next step can be the ROI (visual) query. In other words, the user may want to retrieve all video clips in the database containing key-frames similar to those in the ROI of the query video. Figure 12 shows one ROI query example where the ROI is set as shown in Figure 11. The retrieval results on the right side of the MBrowser GUI window clearly indicate that only the specified content within the ROI (i.e. “F-1 race with BRIDGESTONE flag”) is successfully retrieved. In this way a “content purification” can be achieved from a possibly long video clip bearing multiple contents, and thus only the “content of interest” is retrieved accordingly.
Figure 12: ROI (Visual) Query of the example in Figure 11 (the MBrowser GUI with its frame browser buttons — previous/next frame and previous/next key-frame — and the slider bar with two pointers).
2.4.4.2 Visual Query of Video
As mentioned earlier, the visual features of the key-frames of the video clips are used for performing the visual query operation. The key-frames are the first frame within a shot and they
are detected by shot detectors, which are usually embedded into the video encoders. Once all
video clips are collected within a MUVIS database, then feature extraction is performed for
each key-frame of every video clip in the database by the appropriate FeX modules. Then any
video clip can be queried within the database by simply calculating the (dis-) similarity distance using the features of the key-frames between the query and the database clips and a
ranking operation is performed afterwards. The main question is how to calculate a similarity
distance between the query and any clip since their durations or the number of the key-frames
might vary significantly and the content they bear may not be necessarily homogeneous,
rather mixed and incoherent. For instance if the query clip has only one minute duration, how
feasible can it be to compare it with a one hour clip containing various different visual content? The basic approach in MUVIS is to find the best matching key-frame for each key-frame
in the query clip. Therefore, this turns out to be equivalent to searching for the entire query
clip within each database clip to find their best matching occurrences. This yields, for example, searching for a one minute query clip inside a one hour clip and if a content-wise matching of one minute segment (or some group of key-frames) can be found within the one hour
clip, then both clips can be claimed to be “similar”. Note that we require neither a matching criterion for the (temporal) order of the query key-frames nor their continuity (vicinity) in the one-hour clip, since both can vary and change while still yielding similar content.
In order to formulate the aforementioned similarity distance calculation analytically, let NoS be the number of feature sets existing in a database and let NoF(s) be the number of sub-features per feature set, where 0 ≤ s < NoS. Let the similarity distance function be SD(x(s,f), y(s,f)), where x and y are the associated feature vectors of feature index s and sub-feature index f. Let i be the index of the key-frames in the query clip Q and QFV_i(s,f) be its sub-feature vector of feature index s and sub-feature index f. Similarly, let j be the index of the key-frames of a database clip C and CFV_j(s,f) be its sub-feature vector. For all key-frames in C, the particular key-frame that gives the minimum distance to key-frame i of the queried clip is found (D_i(s,f)) and used for the calculation of the total sub-feature similarity distance (D(s,f)) between the two clips. Here ROI(Q) represents the group of key-frames in the ROI of the query clip Q; by default it is the entire clip unless set manually by the user. Since the visual feature vectors are unit normalized, the total query similarity distance (QD_C) between the clips Q and C in the database can be calculated by a weighted linear combination, where the weight W(s,f) per sub-feature f of a feature set s can be set by the user to find an optimum merging setting for all the visual features present in the database. The following equation formalizes the calculation of QD_C.
\[
D_i(s,f) = \min_{j \in C}\bigl\{ SD\bigl(QFV_i(s,f),\, CFV_j(s,f)\bigr) \bigr\}, \qquad
D(s,f) = \sum_{i \in ROI(Q)} D_i(s,f), \qquad
QD_C = \frac{\sum_{s}^{NoS} \sum_{f}^{NoF(s)} W(s,f)\, D(s,f)}{\sum_{s}^{NoS} \sum_{f}^{NoF(s)} W(s,f)}
\tag{3}
\]
The weight normalization in the calculation of QD_C is needed for two reasons: first, to preserve the unit normalization in the final similarity distance, and second, to negate the effect of a biased similarity distance calculation for clips missing one or more features. The lack of features can occur as a consequence of the abrupt stopping of an ongoing feature extraction operation (e.g. the user can stop it manually for some reason) and can be corrected at any time by performing a “Consistency Check” operation for that database. Obviously, the lack of some features for a clip would yield a smaller QD_C value if such weight normalization were not applied.
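The clip-to-clip distance of equation (3) can be sketched as follows, assuming that the per-key-frame sub-feature vectors of the query ROI and of a database clip are given, and reusing the GetDistance function from the earlier FeX sketch. The data layout (key-frames indexed first, flattened sub-feature index second) is an assumption of the example.

#include <algorithm>
#include <limits>
#include <vector>

// Feature vectors of one clip: clip[i][f] is the feature vector of key-frame i
// for the flattened (feature set, sub-feature) index f.
using ClipFeatures = std::vector<std::vector<std::vector<double>>>;

// Equation (3): for every sub-feature f, sum over the query (ROI) key-frames the
// minimum distance to any key-frame of the database clip, then merge the
// per-sub-feature totals with the user weights W(f) and normalize by the weight sum.
double QueryDistance(const ClipFeatures& queryROI, const ClipFeatures& dbClip,
                     const std::vector<double>& W, int numSubFeatures)
{
    double num = 0.0, den = 0.0;
    for (int f = 0; f < numSubFeatures; ++f) {
        double Dsf = 0.0;
        for (const auto& qkf : queryROI) {
            double Di = std::numeric_limits<double>::max();
            for (const auto& ckf : dbClip)
                Di = std::min(Di, GetDistance(qkf[f], ckf[f]));   // best-matching key-frame
            Dsf += Di;
        }
        num += W[f] * Dsf;
        den += W[f];
    }
    return den > 0.0 ? num / den : num;
}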
Chapter 3
Unsupervised Audio Classification and Segmentation

Audio information often plays an essential role in understanding the semantic content of
digital media and in certain cases, e.g. audio-only clips, audio might even be the only source of information. Hence, audio information has recently been used for content-based multimedia indexing and retrieval. Audio may also provide significant advantages over its visual counterpart, especially if the content can be extracted in accordance with the human auditory perceptual system. This, on the other hand, requires efficient and generic audio (content) analysis that yields a robust and semantic classification and segmentation. Therefore, in this chapter we focus on generic and automatic audio classification and segmentation for audio-based multimedia indexing and retrieval applications. The next section presents an overview of audio classification and segmentation studies in the literature, their performance, and especially their major limitations and drawbacks. In this way we can then introduce and justify the philosophy behind the generic approach of the proposed audio classification and segmentation scheme. Having defined the objectives and the basic key points, the remaining sections detail the proposed method. The chapter is organized as follows. Section 3.2 is devoted to the description of the common spectral template formation depending on the mode. Then the hierarchic approach adopted for the overall feature extraction scheme and the perceptual modeling in the feature domain are introduced in Section 3.3. Section 3.4 describes the proposed framework with its hierarchical steps. Finally, the experimental results, the evaluation of the proposed algorithm and conclusive remarks are reported in Section 3.5.
3.1. AUDIO CLASSIFICATION AND SEGMENTATION – AN OVERVIEW
During the recent years, there have been many studies on automatic audio classification and
segmentation using several techniques and features. Traditionally, the most common approach
is speech/music classification, in which the highest accuracy has been achieved, especially when the segmentation information is known beforehand (i.e. manual segmentation). Saunders [58] developed a real-time speech/music classifier for audio in radio FM receivers based
on features such as zero crossing rate (ZCR) and short-time energy. A 2.4 s window size was
used and the primary goal of low computational complexity was achieved. Zhang and Kuo
[77] developed a content-based audio retrieval system, which performs audio classification
into basic types such as speech, music and noise. In their latter work [78] using a heuristic approach and pitch tracking techniques, they introduced more audio classes such as songs,
mixed speech over music. Scheirer and Slaney [59] proposed a different approach for the
speech/music discriminator systems particularly for ASR. El-Maleh et al. [27] presented a
narrowband (i.e. audio sampled at 8 KHz) speech/music discriminator system using a new set
of features. They achieved a low frame delay of 20 ms, which makes it suitable for real-time
applications. A more sophisticated approach has been proposed by Srinivasan et al. [65]. They
tried to categorize the audio into mixed class types such as music with speech, speech with
background noise, etc. They reported over 80% classification accuracy. Lu et al. [39] presented an audio classification and segmentation algorithm for video structure parsing, using a
one-second window to discriminate speech, music, environmental sound and silence. They
proposed new features such as band periodicity to enhance the classification accuracy.
Although audio classification has been mostly realized in the uncompressed domain,
with the emerging MPEG audio content, several methods have been reported for audio classification on MPEG-1 (Layer 2) encoded audio bit-stream [30], [46], [52], [68]. The last years
have shown a widespread usage of MPEG Layer 3 (MP3) audio [10], [23], [25], [51], as well as a proliferation of video content carrying MP3 audio. The ongoing research on perceptual audio coding has yielded a more efficient successor called (MPEG-2/4) Advanced Audio
Coding (AAC) [10], [24]. AAC has various similarities with its predecessor but promises significant improvement in coding efficiency. In a previous work [P8], we introduced an automatic segmentation and classification method over MP3 (MPEG–1, 2, 2.5 Layer-3) and AAC
bit-streams. In this work, using a generic MDCT template formation extracted from both MP3
and AAC bit-streams, an unsupervised classification over globally extracted segments is
achieved using a hierarchical structure over the common MDCT template.
Audio content extraction via classification and segmentation enables the design of efficient indexing schemes for large-scale multimedia databases. There are, however, several shortcomings of the simple speech/music classifiers addressed so far in terms of extracting real semantic content, especially for multimedia clips that present various content variations. For instance, most speech/music discriminators work on digital audio signals in the uncompressed domain, with a fixed capturing parameter set. Obviously, large multimedia databases may contain digital audio in different formats (compressed/uncompressed), encoding schemes (MPEG Layer-2, MP3, AAC, ADPCM, etc.), capturing and encoding parameters (i.e. sampling frequency, bits per sample, sound volume level, bit-rate, etc.) and durations. Therefore, the underlying audio content extraction scheme should be robust (invariant) to such variations, since the content is independent of the underlying parameters that the digital multimedia world presents. For example, the same speech content may be represented by an audio signal sampled at 8 kHz or 32 kHz, in stereo or mono, compressed by AAC or stored in (uncompressed) PCM format, lasting 15 seconds or 10 minutes, etc.
A comparative study of several statistical, HMM and GMM and neural network based
training models was carried out by Bugatti et al. [12]. Although such approaches may achieve
a better accuracy for a limited set of collections, they are usually restricted to a focused domain and hence do not provide a generic solution for the massive content variations that a
large multimedia database may contain.
Another important drawback of many existing systems is the lack of global segmentation. Since classification and segmentation are closely related and dependent problems, an
integrated and well-tuned classification and segmentation approach is required. Due to technical difficulties or low-delay requirements, some systems tend to rely on manual segmentation
(or simply work over a short audio file having a single audio content type). The other existing
systems use several segment features that are estimated over audio segments with a fixed duration (0.5 – 5 seconds) to accomplish a classification per segment. Although fixing the segment size brings many practical advantages in the implementation and henceforth improves
the classification accuracy, its performance may suffer either from the possibly high resolution required by the content or from the lack of sufficient statistics needed to estimate the
segment features due to the limited time span of the segment. An efficient and more natural
solution is to extract global segments within which the content is kept stationary so that the
classification method can achieve an optimum performance within the segment.
Almost all of the systems so far addressed do not have a bi-modal structure. That is,
they are either designed in bit-stream mode where the bit-stream information is directly used
(without decoding) for classification and segmentation, or in generic mode where the temporal and spectral information is extracted from the PCM samples and the analysis is performed
afterwards. Usually, the former case is applied for improved computational speed and the latter for higher accuracy. A generic bi-modal structure, which supports both modes (possibly to
some extent), is obviously needed in order to provide feasible solutions for the audio-based
indexing of large multimedia databases. Such a framework can, for instance, work in bitstream mode whenever the enhanced speed is required, especially for long clips for which the
generic mode is not a feasible option for the underlying hardware or network conditions; or it
can work in the generic mode whenever feasible or required.
Due to content variations, most of the existing works addressing just “speech and music” categories may not be satisfactory for the purpose of an efficient audio indexing scheme.
The main reason for this is the presence of mixed audio types, such as speech with music,
speech with noise, music with noise, environmental noise, etc. There might even be difficult
cases where a pure class type ends up with an erroneous classification due to several factors.
For the sake of audio indexing overall performance, either new class types for such potential
audio categories should be introduced, or such “mixed” or “erroneous” cases should be collected under a certain class category (e.g. fuzzy) so that special treatment can be applied while
indexing such audio segments. Since high accuracy is an important and basic requirement for the audio analysis systems used for indexing and retrieval, introducing more class types might cause degradations in performance and is hence not considered a feasible option most of the time for such generic solutions.
In order to overcome the aforementioned problems and shortcomings, in this chapter
we present a generic audio classification and segmentation framework especially suitable for
audio-based multimedia indexing and retrieval systems. The proposed approach has been integrated into the MUVIS system [P4], [P6], [43]. The proposed method is automatic and uses
no information from the video signal. It also provides a robust (invariant) solution for digital audio files with various capturing/encoding parameters and modes, such as sampling frequencies (i.e. 8 kHz up to 48 kHz), channel modes (i.e. mono, stereo, etc.), compression bit-rates (i.e. 8 kbps up to 448 kbps), sound volume level, file duration, etc. In order to increase accuracy, a fuzzy approach has been integrated within the framework. The main process is self-learning: it logically builds on the extracted information throughout its execution to produce a reliable final result. The proposed method proceeds through logical hierarchic steps and iterations, based on certain rules that are applied on the basis of a perceptual evaluation of the classification features and the behavior of the process. Therefore, the overall design structure is made suitable for human aural perception and, to this end, the proposed framework works on perceptual rules whenever possible. The objective is to achieve a classification scheme that ensures a decision-making approach suited to human content perception.
The proposed method has a bi-modal structure, which supports both bit-stream mode
for MP3 and AAC audio, and generic mode for any audio type and format. In both modes,
once a common spectral template is formed from the input audio source, the same analytical
procedure is performed afterwards. The spectral template is obtained from MDCT coefficients
of the MP3 granules or AAC frames in bit-stream mode, and is hence called the MDCT template.
The power spectrum obtained from the FFT of the PCM samples within temporal frames forms
the spectral template for the generic mode.
In order to improve the performance and, most importantly, the overall accuracy,
the classification scheme produces only 4 class types per audio segment: speech, music, fuzzy
or silent. Speech, music and silent are the pure (unique) class types. The class type of a segment is defined as fuzzy if it is either not classifiable as a pure class due to some potential uncertainties or anomalies in the audio source or it exhibits features from more than one pure
class. For audio-based indexing and retrieval in the MUVIS system, a pure class content is only
searched throughout the associated segments of the audio items in the database having the
same (matching) pure class type, such as speech or music. All silent segments and silent
frames within non-silent segments are discarded from the audio indexing. As mentioned earlier, special care is taken for fuzzy content, that is, during the retrieval phase, fuzzy content is
compared with all relevant content types of the database (i.e. speech, music and fuzzy) since it
might, by definition, contain a mixture of pure class types, background noise, aural effects,
etc. Therefore, for the proposed method, any erroneous classification on pure classes is intended to be detected as fuzzy, so as to avoid significant retrieval errors (mismatches) due to
such potential misclassification. In this context, three prioritized error types of classification,
illustrated in Figure 13, are defined:
• Critical Errors: These errors occur when one pure class is misclassified into another pure class. Such errors significantly degrade the overall performance of an indexing and retrieval scheme.
• Semi-critical Errors: These errors occur when a fuzzy class is misclassified as one of the pure class types. These errors moderately affect the performance of retrieval.
• Non-critical Errors: These errors occur when a pure class is misclassified as a fuzzy class. The effect of such errors on the overall indexing/retrieval scheme is negligible.
Figure 13: Different error types in classification (critical: one pure class misclassified as the other pure class; semi-critical: fuzzy misclassified as a pure class; non-critical: a pure class misclassified as fuzzy).
3.2. SPECTRAL TEMPLATE FORMATION
In this section, we focus on the formation of the generic spectral template, which is the initial
and pre-requisite step in order to provide a bi-modal solution.
As shown in Figure 14, the spectral template is formed either from the MP3/AAC encoded bit-stream in bit-stream mode or the power spectrum of the PCM samples in the generic mode. Basically, this template provides spectral domain coefficients, SPEQ(w, f),
(MDCT coefficients in bit-stream mode or power spectrum in generic mode) with the corresponding frequency values FL(f) for each granule/frame. By using FL(f) entries, all spectral
features and any corresponding threshold values can be fixed independently from the sampling frequency of the audio signal. Once the common spectral template is formed the granule
features can be extracted accordingly and thus, the primary framework can be built on a
common basis, independent from the underlying audio format and the mode used.
Figure 14: Classification and Segmentation Framework (the audio stream is converted into the spectral template, via MDCT in bit-stream mode or the power spectrum in generic mode; granule features, segmentation and segment features then feed the classification decision: speech, music, fuzzy or silence).
3.2.1. Forming the MDCT Template from MP3/AAC Bit-Stream
3.2.1.1 MP3 and AAC Overview
MPEG audio is a group of coding standards that specify a high performance perceptual coding scheme to compress audio signals into several bit-rates. Coding is performed in several
steps and some of them are common for all three layers. A perceptual model computes time/frequency domain masking thresholds using psychoacoustic rules. The spectral components are then quantized, which introduces quantization noise.
MP3 is the most complex MPEG layer. It is optimized to provide the highest audio
quality at low bit-rates. Layer 3 encoding process starts by dividing the audio signal into
frames, each of which corresponds to one or two granules. The granule number within a single frame is determined by the MPEG phase. Each granule consists of 576 PCM samples.
Then a polyphase filter bank (also used in Layer 1 and Layer 2) divides each granule into 32
equal-width frequency subbands, each of which carries 18 (subsampled) samples. The main
difference between MPEG Layer 3 and the other layers is that an additional MDCT transform
is performed over the subband samples to enhance spectral resolution. A short windowing
may be applied to increase the temporal resolution, in such a way that the 18 samples in a
subband are divided into three short windows of 6 samples each. The MDCT is then performed over
each (short) window individually, and the final 18 MDCT coefficients are obtained as a result
of three groups of 6 coefficients. There are three windowing modes in MPEG Layer 3 encoding scheme: Long Windowing Mode, Short Windowing Mode and Mixed Windowing Mode. In
Long Windowing Mode, MDCT is applied directly to the 18 samples in each of the 32 subbands. In Short Windowing Mode, all of 32 subbands are short windowed as mentioned above.
In Mixed Windowing Mode, the first two lower subbands are long windowed and the remaining 30 higher subbands are short windowed. Once MDCT is applied to each subband of a
granule according to the windowing mode, the scaled and quantized MDCT coefficients are
then Huffman coded and thus the MP3 bit-stream is formed.
There are three MPEG phases concerning MP3: MPEG-1, MPEG-2 and MPEG 2.5.
MPEG-1 Layer 3 supports sampling rates of 32, 44.1 and 48 kHz and bit-rates from 32 to 448
kbps. It performs encoding on both mono and stereo audio, but not multi-channel surround
sound. One MPEG-1 Layer 3 frame consists of two granules (1152 PCM samples). During
encoding, different windowing modes can be applied to each granule. MPEG-2 Layer 3 is a
backwards compatible extension to MPEG-1 with up to five channels, plus one low frequency
enhancement channel. Furthermore, it provides support for lower sampling rates such as 16,
22.05 and 24 kHz for bit-rates as low as 8 kbps up to 320 kbps. One MPEG-2 Layer 3 frame
consists of one granule (576 PCM samples). MPEG 2.5 is an unofficial MPEG audio extension created by the Fraunhofer Institute to improve performance at lower bit-rates; it allows sampling rates of 8, 11.025 and 12 kHz.
AAC and MP3 have a broadly similar structure. Nevertheless, compatibility with the other
MPEG audio layers has been removed and AAC has no granule structure within its frames,
whereas MP3 might contain one or two granules per frame depending on the MPEG phase as
mentioned before. AAC supports a wider range of sampling rates (from 8 kHz to 96 kHz) and
up to 48 audio channels. Furthermore, it works at bit-rates from 8 kbps for mono speech up to in
excess of 320 kbps. A direct MDCT transformation is performed over the samples without
dividing the audio signal in 32 subbands as in MP3 encoding. Moreover, the same tools (psychoacoustic filters, scale factors and Huffman coding) are applied to reduce the number of bits
used for encoding. Similar to MP3 coding scheme, two windowing modes are applied before
MDCT is performed in order to achieve a better time/frequency resolution: Long Windowing
Mode or Short Windowing Mode. In Long Windowing Mode MDCT is directly applied over
1024 PCM samples. In Short Windowing Mode, an AAC frame is first divided into 8 short
windows each of which contains 128 PCM samples and MDCT is applied to each short window individually. Therefore, in Short Windowing Mode, there are 128 frequency lines and
hence the spectral resolution is decreased by 8 times whilst increasing the temporal resolution
by 8. AAC also introduces a technique called “Temporal Noise Shaping” (TNS), which improves the
speech quality especially at low bit-rates. More detailed information about MP3 and AAC can
be found in [10].
The structural similarity of MP3 and AAC in the MDCT domain makes it feasible to develop
generic algorithms that cover both formats. The proposed algorithm therefore exploits this
similarity to form a common spectral template based on the MDCT coefficients. This template allows a common classification and segmentation technique that uses compressed-domain audio features, as explained in the next subsection.
3.2.1.2 MDCT Template Formation
The bit-stream mode uses the compressed domain audio features in order to perform classification and segmentation directly from the compressed bit-stream. Audio features are extracted
using the common MDCT sub-band template. Hence the MDCT template is simply a variable-size MDCT double array, MDCT(w, f), along with a variable-size frequency line array
FL(f), which represents the real frequency value of each row entry in the MDCT array.
The index w represents the window number and the index f represents the line frequency index. Table III lists the array dimensions NoW and NoF with respect to the associated windowing modes of MP3 and AAC.
Table III: The MDCT template array dimensions with respect to Compression Type and Windowing Mode

Compression Type & Windowing Mode   NoW   NoF
MP3 Long Window                      1     576
MP3 Short Window                     3     192
MP3 Mixed Window                     3     216
AAC Long Window                      1    1024
AAC Short Window                     8     128
Let $f_s$ be the sampling frequency. Then, according to Nyquist's theorem, the maximum
frequency ($f_{BW}$) of the audio signal will be $f_{BW} = f_s / 2$. Since both AAC and MP3 use
linearly spaced frequency lines, the real frequency values $FL(f)$ can be obtained using the following expression:

$$
FL(f) =
\begin{cases}
(f+1)\times\dfrac{f_{BW}}{NoF} & \text{Short or Long Win. Mode}\\[2ex]
\begin{cases}
(f+1)\times\dfrac{f_{BW}}{576} & f < 36\\[1ex]
\dfrac{f_{BW}}{16} + (f-35)\times\dfrac{f_{BW}}{192} & f \ge 36
\end{cases} & \text{MP3 Mixed Win. Mode}
\end{cases}
\qquad (4)
$$
where f is the index from 0 to the corresponding NoF given in Table III.
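As an illustration, a minimal Python/NumPy sketch of (4) is given below; the function name and the mode labels are ours, and the NoF values follow Table III.

```python
import numpy as np

def frequency_lines(f_s, mode):
    """Build the FL(f) array of real frequency values per Eq. (4).
    mode is one of the Table III rows: 'mp3_long', 'mp3_short',
    'mp3_mixed', 'aac_long' or 'aac_short' (labels are ours)."""
    nof = {'mp3_long': 576, 'mp3_short': 192, 'mp3_mixed': 216,
           'aac_long': 1024, 'aac_short': 128}[mode]
    f_bw = f_s / 2.0                      # Nyquist bandwidth
    f = np.arange(nof)
    if mode != 'mp3_mixed':               # short or long window modes
        return (f + 1) * f_bw / nof
    fl = np.empty(nof)
    low = f < 36                          # first two long-windowed subbands
    fl[low] = (f[low] + 1) * f_bw / 576
    fl[~low] = f_bw / 16 + (f[~low] - 35) * f_bw / 192
    return fl

# e.g. frequency_lines(44100, 'mp3_long')[0] is about 38.3 Hz
```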
The MDCT template array is formed from the absolute values of the MDCT subband
coefficients, which are (Huffman) decoded from the MP3/AAC bit-stream per MP3 granule or
AAC frame. For each MP3 granule, the MDCT subband coefficients are given in the form of
a matrix of 32 lines, representing the frequency subbands, with 18 columns each of which for
every coefficient as shown in Figure 15. In case of short window, there are three windows
within a granule containing 6 coefficients. The template matrix formation for short window
MP3 granules is illustrated in Figure 16. In order to process the same algorithm for both encoding schemes, we apply a similar template formation structure to AAC frames. So in case
of long window AAC frame, 1024 MDCT coefficient array is divided into 32 groups of 32
MDCT coefficients and the template matrix for AAC is formed by taking into account that the
number of MDCT coefficients for a subband is not 18 (as in MP3) but now 32. Figure 17 illustrates AAC long window template formation. In case of short window AAC frame, 1024
coefficients are divided into 8 windows of 128 coefficients each. We divide these 128 coefficients in 32 subbands and fill the matrix with 4 coefficients in every subband in order to have
the same template as the MP3 short window case. Figure 18 shows how the subbands are arranged and the template array is formed by this technique.
Figure 15: MP3 Long Window MDCT template array formation from MDCT subband coefficients.
Figure 16: MP3 Short Window MDCT template array formation from MDCT subband coefficients.
Figure 17: AAC Long Window MDCT template array formation from MDCT subband coefficients.
Figure 18: AAC Short Window MDCT template array formation from MDCT subband coefficients.
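To complement Figures 15–18, the following sketch (ours) shows how the (NoW, NoF) template of Table III can be assembled. It assumes the decoder already delivers the absolute MDCT values of one MP3 granule as a 32×18 subband matrix (with the three short windows stored as consecutive groups of 6 per subband), or of one AAC frame as a flat array of 1024 values; the exact decoder output layout is an assumption.

```python
import numpy as np

def mdct_template(abs_mdct, codec, short_window):
    """Arrange absolute MDCT values of one MP3 granule / AAC frame
    into the (NoW, NoF) template array of Table III (a sketch)."""
    abs_mdct = np.asarray(abs_mdct, dtype=float)
    if codec == 'mp3':
        sb = abs_mdct.reshape(32, 18)            # 32 subbands x 18 coefficients
        if not short_window:
            return sb.reshape(1, 576)            # NoW = 1, NoF = 576
        # three short windows of 6 coefficients in every subband
        wins = sb.reshape(32, 3, 6).transpose(1, 0, 2)
        return wins.reshape(3, 192)              # NoW = 3, NoF = 192
    else:  # 'aac'
        if not short_window:
            return abs_mdct.reshape(1, 1024)     # NoW = 1, NoF = 1024
        return abs_mdct.reshape(8, 128)          # NoW = 8, NoF = 128
```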
3.2.2. Spectral Template Formation in Generic Mode
In the generic mode, the spectral template is formed from the FFT of the PCM samples within
a frame that has a fixed temporal duration. In bit-stream mode, the frame (temporal) duration
varies since the granule/frame size is fixed (i.e. 576 in MP3, 1024 in AAC long window
mode). In generic mode, however, we have the possibility to extract either fixed-size or fixed-duration frames depending on the feature type. For analysis compatibility purposes, it is a
common practice to fix the (analysis) frame duration. If, however, fixed spectral resolution is
required (i.e. for fundamental frequency estimation), the frame size (hence the FFT window
size) can also be kept constant by increasing the frame size by zero padding or simply using
the samples from neighbor frames.
Figure 19: Generic Mode Spectral Template Formation (the audio stream passes through frame extraction and FFT, producing the power spectrum PSPQ(w, f) and the frequency line array FL(f) according to the audio parameters).
As shown in Figure 19, the generic mode spectral template consists of a variable size
power spectrum double array, PSPQ(w, f), along with a variable size frequency line array
FL ( f ) , which represents the real frequency value of each row entry in the PSPQ array. The
index w represents the window number and the index f represents the line frequency index. In
generic mode, NoW = 1 and NoF is the number of frequency lines within the spectral bandwidth: $NoF = 2^{\lceil \log_2 (fr_{dur} \cdot f_s) \rceil - 1}$, where $\lceil \cdot \rceil$ is the ceiling operator, $fr_{dur}$ is the duration of
one audio (analysis) frame and $f_s$ is the sampling frequency.
Note that for both modes, the template is formed independently from the number of
channels (i.e. stereo/mono) in the audio signal. If the audio is stereo, both channels are averaged and used as the signal before the frame extraction is processed.
3.3. FEATURE EXTRACTION
As shown in Figure 14, a hierarchic approach has been adopted for the overall feature extraction scheme in the proposed framework. First the frame (or granule) features are extracted using the spectral template, and the segment features are derived afterwards in order to accomplish classification of the segments. In the following sub-sections we focus on the extraction of the individual frame features.
3.3.1. Frame Features
Granule features are extracted from the spectral template, SPEQ(w, f) and FL(f), where SPEQ
can be assigned to MDCT or PSPQ depending on the current working mode.
We use some classical features such as Band Energy Ratio (BER), Total Frame Energy
(TFE) and Subband Centroid (SC). We also developed a novel feature called Transition
Rate (TR) and tested it against its conventional counterpart, Pause Rate (PR). Since both PR
and TR are segment features by definition, they will be introduced in the next section. Finally
we proposed an enhanced Fundamental Frequency (FF) detection algorithm, which is based
on the well-known HPS (Harmonic Product Spectrum) technique [47].
3.3.1.1 Total Frame Energy Calculation
Total Frame Energy (TFE) can be calculated using (5). It is the primary feature to detect silent granules/frames. Silence detection is also used for the extraction of TR, which is one of
the main segment features.
$$
TFE_j = \sum_{w}^{NoW} \sum_{f}^{NoF} \big( SPEQ_j(w, f) \big)^2 \qquad (5)
$$
3.3.1.2 Band Energy Ratio Calculation
Band Energy Ratio (BER) is the ratio between the total energies of two spectral regions that
are separated by a single cut-off frequency. The spectral regions fully cover the spectrum of
the input audio signal. Given a cut-off frequency value $f_c$ ($f_c \le f_{BW}$), let $f_{f_c}$ be the
line frequency index where $FL(f_{f_c}) \le f_c < FL(f_{f_c}+1)$; BER for a granule/frame j
can then be calculated using (6).
$$
BER_j(f_c) = \frac{\displaystyle\sum_{w}^{NoW} \sum_{f=0}^{f_{f_c}} \big( SPEQ_j(w, f) \big)^2}{\displaystyle\sum_{w}^{NoW} \sum_{f=f_{f_c}}^{NoF} \big( SPEQ_j(w, f) \big)^2} \qquad (6)
$$
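Both (5) and (6) reduce to a few array operations over the template. A small sketch (ours) is given below, where speq is the (NoW, NoF) template of one granule/frame and fl is the corresponding FL(f) array; the disjoint split around the cut-off line is a simplification.

```python
import numpy as np

def total_frame_energy(speq):
    """Eq. (5): sum of squared spectral coefficients over all windows/lines."""
    return float(np.sum(speq ** 2))

def band_energy_ratio(speq, fl, f_c=500.0):
    """Eq. (6): energy up to the cut-off f_c over the energy above it."""
    split = np.searchsorted(fl, f_c, side='right')   # lines with FL(f) <= f_c
    low = np.sum(speq[:, :split] ** 2)
    high = np.sum(speq[:, split:] ** 2)
    return float(low / (high + 1e-12))               # guard against division by zero
```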
3.3.1.3 Fundamental Frequency Estimation
If the input audio signal is harmonic over a fundamental frequency (i.e. there exists a series of
major frequency components that are integer multiples of a fundamental frequency value), the
real Fundamental Frequency (FF) value can be estimated from the spectral coefficients
(SPEQ(w, f)). Therefore, we apply an adaptive peak-detection algorithm over the spectral
template to check whether a sufficient number of peaks around the integer multiples of a certain
frequency (a candidate FF value) can be found. The algorithm works in three steps:
• Adaptive extraction of all the spectral peaks,
• Candidate FF peak extraction via Harmonic Product Spectrum (HPS),
• Multiple peak search and Fundamental Frequency (FF) verification.
Human speech in particular has most of its energy at lower bands (i.e. f < 500 Hz),
and hence the absolute value of the peaks in this range might be significantly greater than the
peaks in the higher frequency bands. This brings the need for an adaptive design in order to
detect the major spectral peaks in the spectrum. We therefore apply a non-overlapped partitioning scheme over the spectrum, and the major peaks are then extracted within each partition. Let $N_P$ be the number of partitions, each of which has a bandwidth of $(f_{BW}/N_P)$ Hz.
In order to detect peaks in a partition, the absolute mean value is first calculated from the
spectral coefficients in the partition and if a spectral coefficient is significantly bigger than the
mean value (e.g. greater than twice the mean value), it is chosen as a new peak and this process is repeated for all the partitions. The maximum spectral coefficient within a partition is
always chosen as a peak even if it does not satisfy the aforementioned rule. This is done to ensure that at least one peak is detected per partition. One of the main advantages of the partition-based peak detection is that the amount of redundant spectral data is significantly reduced in favor of the major peaks, which are the main concern for the FF estimation
scheme.
The candidate FF peaks are obtained via HPS. If a frame is harmonic with a certain
FF value, HPS can detect this value. However there might be two potential problems. First
HPS can be noisy if the harmonic audio signal is a noisy and mixed signal with significant
non-harmonic components. In this case HPS will not extract the FF value as the first peak in
the harmonic product, but as the second or higher order peak. For this reason we consider a
reasonable number (i.e. 5) of the highest peaks extracted from the harmonic product as the
candidate FF values. Another potential problem is that HPS does not provide whether or not
the audio frame is harmonic, since it always produces some harmonic product peak values
from a given spectrum. The harmonicity should therefore be searched and verified among the
peak values extracted in the previous step.
The multiple peak verification is a critical process for fundamental frequency (FF)
calculation. Due to the limited spectral resolution, one potential problem is that a multiple of the FF may not fall exactly on a frequency line where a spectral coefficient exists. Let the linear frequency spacing between two consecutive spectral coefficients be
$\Delta f = FL(f) - FL(f-1) = f_{BW}/NoF$ and let the real FF value lie in the
$\{-\Delta f/2, +\Delta f/2\}$ neighborhood of a spectral coefficient at the frequency $FL(f)$. Then the
minimum window width to search for the $n$th (possible) peak will be $W(n) = n \times \Delta f$. Another
problem is the pitch-shifting phenomenon of the harmonics that especially occurs in the harmonic
patterns of speech. Terhardt [66] proposed stretched templates for the detection of the
pitch (fundamental frequency) and one simple analytical description of the template stretching
is given in the following expression.
$$
f_n = n^{\sigma} f_0 \;\Rightarrow\; W(n) = (n^{\sigma} - n)\, f_0 \quad \forall n = 2, 3, 4, \ldots \qquad (7)
$$

where $\sigma$ is the stretch factor with a nominal value of 1 and $f_0$ is the perceived fundamental
frequency. Practically, $\sigma \approx 1.01$ for human speech and the stretching can therefore be approximated as a linear function (i.e. $f_n \cong n \sigma f_0$). Due to such harmonic shifts and the limited spectral resolution of the spectral template, an adaptive search window is applied in order not to
miss a (multiple) peak on a multiple frequency line. On the other hand, false detections might
occur if the window width is chosen larger than necessary. We developed, tested and used the
following search window template for detection:

$$
W(n) = n \, \Delta f \, \sigma \quad \forall n = 2, 3, 4, \ldots \qquad (8)
$$
Note that the search window width is proportional both to the sampling frequency of the
audio signal and to the stretch factor $\sigma$, and inversely proportional to the total number of frequency lines; together these give a good measure of resolution and provide a stretched-template modeling of the perceived FF value.
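A sketch of the three-step FF estimation is given below (ours); spec is one window of the spectral template and fl the corresponding FL(f) array. The number of HPS stages, the harmonics checked and the hit threshold are illustrative choices, not the thesis settings.

```python
import numpy as np

def detect_peaks(spec, n_part=32, factor=2.0):
    """Step 1: adaptive, partition-based peak picking (indices into spec)."""
    peaks = []
    bounds = np.linspace(0, len(spec), n_part + 1).astype(int)
    for a, b in zip(bounds[:-1], bounds[1:]):
        if b <= a:
            continue
        part = spec[a:b]
        idx = np.where(part > factor * part.mean())[0] + a
        peaks.extend(idx.tolist())
        peaks.append(a + int(part.argmax()))     # keep at least one peak per partition
    return np.unique(np.array(peaks))

def estimate_ff(spec, fl, n_hps=4, n_cand=5, sigma=1.01, min_hits=3):
    """Steps 2-3: HPS candidates, then multiple-peak verification."""
    spec = np.asarray(spec, dtype=float)
    hps = spec.copy()
    for k in range(2, n_hps + 1):                # harmonic product of decimated spectra
        hps[:len(spec) // k] *= spec[::k][:len(spec) // k]
    cand = np.argsort(hps)[-n_cand:][::-1]       # candidate FF line indices
    peaks = detect_peaks(spec)
    delta_f = fl[1] - fl[0]
    for c in cand:
        f0 = fl[c]
        hits = 0
        for n in range(2, 7):                    # look for harmonics 2..6
            win = n * delta_f * sigma            # search window W(n), Eq. (8)
            if np.any(np.abs(fl[peaks] - n * f0) <= win):
                hits += 1
        if hits >= min_hits:
            return f0                            # verified fundamental frequency
    return 0.0                                   # frame judged non-harmonic
```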
Figure 20 illustrates a sample peak detection applied on the spectral coefficients of an
audio frame with a sampling frequency of 44100 Hz. Therefore, $f_{BW}$ = 22050 Hz, but the
sketch shows only up to around 17000 Hz for the sake of illustration. The lower subplot shows the
overall peaks detected in the first step (red), the 5 candidate peaks extracted in the second step via the
HPS algorithm (black), the multiple peaks found (blue) and finally the FF value estimated
accordingly ($FF = FL(18) = 1798$ Hz in this example).
Figure 20: FF detection within a harmonic frame (power spectrum curve with the overall spectral peaks, the 5 HPS-detected peaks, the harmonic peaks found via FF, and the estimated FF peak at 1798 Hz).
3.3.1.4 Subband Centroid Frequency Estimation
Subband Centroid (SC) is the first moment of the spectral distribution (spectrum); in the compressed domain it can be estimated as the balancing frequency value of the absolute spectral
values. Using the spectral template arrays, the SC value ($f_{SC}$) can be calculated as follows:
$$
f_{SC} = \frac{\displaystyle\sum_{w}^{NoW}\sum_{f}^{NoF} \big( SPEQ(w, f) \times FL(f) \big)}{\displaystyle\sum_{w}^{NoW}\sum_{f}^{NoF} SPEQ(w, f)} \qquad (9)
$$
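Per (9), the subband centroid of a granule/frame is simply a weighted mean of the frequency lines; a minimal sketch (ours, assuming speq already holds absolute spectral values):

```python
import numpy as np

def subband_centroid(speq, fl):
    """Eq. (9): spectral balancing frequency of one granule/frame."""
    weights = speq.sum(axis=0)                  # accumulate over the windows
    return float(np.sum(weights * fl) / np.sum(weights))
```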
3.3.2. Segment Features
Segment Features are extracted from the frame (or granule) features and mainly used for the
classification of the segment. A segment, by definition, is a temporal window, which lasts a
certain duration within an audio clip. There are basically two types: silent and non-silent segments. The non-silent segments are subject to further classification using their segment features. As mentioned before, the objective is to extract global segments, each of which should
contain content that remains stationary, in the semantic sense, over its entire time span. Practically,
there is no upper bound on segment duration; for instance, a segment may cover the whole audio
clip if it contains a single audio category (e.g. MP3 music clips). However, there is and
should be a practical lower bound on segment duration, within which a perceivable content
can exist (i.e. > 0.6 seconds).
Total Frame Energy (TFE) is the only feature used for the detection of the silent segments. The segment features are then used to classify the non-silent segments and will be presented in the following sections.
3.3.2.1 Dominant Band Energy Ratio
Since the energy is concentrated mostly on the lower frequencies for human speech, an audio
frame can be classified as speech or music by comparing its Band Energy Ratio (BER) value
with an empirical threshold. This is an unreliable process when applied on a per-frame basis,
but within a segment it can turn out to be an initial classifier by using the dominant (winning)
class type within the segment. Experimental results show that Dominant Band Energy Ratio
(DBER), as a segment feature, does not achieve as high accuracy as the other major segment
features, but it usually gives consistent results for similar content. Therefore, we use DBER
for the initial steps of the main algorithm, mainly for merging the immature segments into
more global segments if their class types match with respect to DBER. One of the requirements of segment merging is that the neighboring segments have the same class type, and DBER is
consistent in giving the same (right or wrong) result for the same content.
3.3.2.2 Transition Rate vs. Pause Rate
Pause Rate (PR) is a well-known feature for speech/music discrimination; it is basically
the ratio of the number of silent granules/frames to the total number of granules/frames in
a non-silent segment. Due to natural pauses or unsound consonants that occur within any
speech content, speech has a certain PR level that is usually lacking in music content. Therefore, if this ratio is over a threshold ($T_{PR}$), the segment is classified as a
speech segment, otherwise as music.
PR usually achieves a significant performance in discriminating speech from music.
However, its performance degrades for fast speech (without a sufficient amount
of pauses), in the presence of background noise, or when the speech segment is quite short (i.e. < 3 seconds).
Since PR only depends on the amount (number) of silent frames within a segment, it can lead to critical misclassifications.
In natural human speech, due to the presence of the unsound consonants, the frequency of occurrence of silent granules is generally high even though their total time
span (duration) might be still low as in the case of a fast speech. On the other hand, in some
classical music clips, there might be one or a few intentional silent sections (silent passes) that
may cause misclassification of the whole segment (or clip) as speech due to the long duration
of such passes. These erroneous cases lead us to introduce an improved measure, Transition
Rate (TR), which is based on the transitions that occur between consecutive frames. TR can be
formulated for a segment as in (10).
$$
TR(S) = \frac{NoF + \sum_{i}^{NoF} TP_i}{2\, NoF} \qquad (10)
$$

where NoF is the number of frames within segment S, i is the frame index and $TP_i$ is the transition penalization factor that can be obtained from the following table:
Table IV: Transition Penalization Table.

Transition: fr_i → fr_{i+1}      TP_i
silent → non-silent               +1
non-silent → silent               +1
silent → silent                   +1
non-silent → non-silent           −1
Note that although the total amount of silent frames is low in fast speech or in short
speech segments, the transition rate will still be high due to their frequent occurrence.
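A sketch of (10) with the penalization of Table IV (ours), taking a boolean per-frame silence mask as input:

```python
import numpy as np

def transition_rate(silent_flags):
    """Eq. (10) with the penalization of Table IV.
    silent_flags: boolean array, True where the frame is silent."""
    s = np.asarray(silent_flags, dtype=bool)
    nof = len(s)
    # +1 for every transition except non-silent -> non-silent, which counts -1
    tp = np.where(~s[:-1] & ~s[1:], -1, 1)
    return (nof + tp.sum()) / (2.0 * nof)
```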
3.3.2.3 Fundamental Frequency Segment Feature
Fundamental Frequency (FF) is another well-known music/speech discriminator due to the
fact that music is generally more harmonic than speech. Pure speech contains a sequence
of harmonic tonals (vowels) and inharmonic consonants. In speech, due to the presence of inharmonic consonants, the natural pauses and the low-bounded FF values (i.e. < 500 Hz), the
average FF value within a segment tends to be quite low. Since the presence of continuous
instrumental notes results in large harmonic sections with unbounded FF occurrences in music,
the average FF value tends to be quite high. However, the average FF value alone might result in classification failures in some exceptional cases. Experiments show that such misclas-
sifications occur especially in some harmonic female speech segments or in some hard-rock
music clips with saturated beats and bass drums.
In order to improve the discrimination power of the FF segment feature, we develop an
enhanced segment feature based on a conditional mean, which verifies strict FF tracking (continuity) within a window. The FF value of a particular frame is introduced
into the mean summation only if its nearest neighbors are also harmonic; otherwise it is discarded.
The conditional mean based FF segment feature is formulated in (11).
$$
FF(S) = \frac{\sum_{i}^{NoF} FF_i^{\,c}}{NoF}, \qquad
FF_i^{\,c} =
\begin{cases}
FF_i & \text{if } FF_j \neq 0 \;\; \forall j \in NN(i)\\
0 & \text{otherwise}
\end{cases}
\qquad (11)
$$

where $FF_i$ is the FF value of the $i$th frame in segment S and j represents the index of the frames in the nearest neighbor frame set of the $i$th frame (i.e. $i-3 \le NN(i) \le i+3$).
Due to frequent discontinuities in the harmonicity, such as pauses (silent frames) and
consonants in a typical speech segment, the conditional mean results in a significantly low
FF segment value for pure speech. Music segments, on the other hand, tend to have higher FF
segment values due to their continuous harmonic nature, even though the beats or bass drums
might cause some partial losses in the FF tracking. The experimental results confirm the significant improvement obtained from the conditional mean based FF segment feature, and thus
the FF segment feature becomes one of the major features used for the classification of the
final (global) segments.
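A direct sketch of the conditional mean in (11) (ours), where ff holds the per-frame FF values with 0 marking non-harmonic frames:

```python
import numpy as np

def ff_segment_feature(ff, k=3):
    """Eq. (11): conditional mean of per-frame FF values. A frame contributes
    only if all frames in its +/- k neighborhood are harmonic (FF != 0)."""
    ff = np.asarray(ff, dtype=float)
    nof = len(ff)
    total = 0.0
    for i in range(nof):
        lo, hi = max(0, i - k), min(nof, i + k + 1)
        if np.all(ff[lo:hi] != 0):     # strict FF tracking in the neighborhood
            total += ff[i]
    return total / nof
```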
3.3.2.4 Subband Centroid Segment Feature
Due to the presence of both voiced (vowels) and unvoiced (consonants) parts in a speech
segment, the average Subband Centroid (SC) value tends to be low with a significantly higher
standard deviation, and vice versa for music segments. However, the experimental results show
that some music types can also present quite low SC average values; thus the SC segment feature used to perform classification is the standard deviation alone, with one exception: the
mean of SC within a segment is only used when it reaches such a high value (forced classification by SC) that it indicates the presence of music with certainty.
Both SC segment features are extracted by smoothly sliding a short window through
the frames of the non-silent segment. The standard deviation of the SC is calculated using the local windowed mean and windowed standard deviation of SC in the segment, and is formulated as
in (12).
$$
\sigma^{SC}(S) = \sqrt{\frac{\sum_{j}^{NoF}\big(SC_j - \mu_j^{SC}\big)^2}{NoF}}, \qquad
\mu_i^{SC} = \frac{\sum_{j \in W_i} SC_j}{NoW}
\qquad (12)
$$

where $\mu_i^{SC}$ is the windowed SC mean of the $i$th frame calculated within a window $W_i$ with NoW frames, and $\sigma^{SC}(S)$ is the SC segment feature of the segment S with NoF frames.
Such adaptive calculation of the segment feature improves the discrimination between
speech and music and therefore, SC is used as the third major feature within the final classification scheme.
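A sketch of (12) is given below (ours); the sliding window length is an illustrative value, and the square root follows the "standard deviation" reading of the formula.

```python
import numpy as np

def sc_segment_feature(sc, now=10):
    """Eq. (12): deviation of per-frame subband centroids from their
    windowed means, computed by sliding a short window of `now` frames."""
    sc = np.asarray(sc, dtype=float)
    nof = len(sc)
    dev2 = 0.0
    for j in range(nof):
        lo, hi = max(0, j - now // 2), min(nof, j + now // 2 + 1)
        mu = sc[lo:hi].mean()              # windowed SC mean around frame j
        dev2 += (sc[j] - mu) ** 2
    return float(np.sqrt(dev2 / nof))
```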
3.3.3. Perceptual Modeling in Feature Domain
The primary approach in the classification and segmentation framework is based on perceptual modeling in the feature domain, which is mainly applied to the major segment features: FF, SC and TR. Depending on the nature of the segment feature, the model provides a
perceptual-rule based division of the feature space, as shown in Figure 21. A forced classification occurs if a particular feature yields such an extreme value that there is perceptual
certainty about the content identification. In that case it overrides all the other features
so that the final decision is made with respect to that feature alone. Note that the occurrence
of a forced classification, its region boundaries and its class category depend on the nature of
the underlying segment feature. In this context each major segment feature has the following
forced-classification definitions:
• TR has a forced speech classification region above 15% due to the fact that only pure
speech can yield such a high value within a segment formation.
• FF has a forced music classification with respect to its mean value when it is above 2 kHz. This makes sense since only pure and excessively harmonic music content can yield such an extreme mean value.
• SC has two forced-classification regions, one for music and the other for speech content. The forced music classification occurs when the SC mean exceeds 2 kHz, and the forced speech classification occurs when the primary SC segment feature, the adaptive $\sigma^{SC}$ value, exceeds 1200 Hz.
Figure 21: Perceptual Modeling in Feature Domain (forced-classification regions at the extremes, pure speech/music regions, and a fuzzy region in between).
Although the model supports both lower and upper forced-classification regions, only
the upper regions are used so far. However, we keep the lower region in case further experimentation justifies its usage in the future.
The region below forced classification is where the natural discrimination into
one of the pure classes, such as speech or music, occurs. For all segment features the lower boundary
of this region is tuned so that the feature has a typical value that can be expected from
a pure class type, but is still quite far from providing the certainty to decide the final classification
alone.
Finally, there may be a fuzzy region where the feature value is no longer reliable due to
various possibilities, such as the audio class type being not pure but mixed, or some background
noise being present and causing ‘blurring’ of the segment features. For those segment features that
are examined and approved for the fuzzy approach, a fuzzy region is therefore formed and tuned experimentally to deal with such cases. There is, on the other hand, another advantage of having
a fuzzy region between the regions where the real discrimination occurs: the fuzzy region prevents most of the critical errors that might occur due to noisy jumps from one (pure class)
region to another. Such noisy cases or anomalies can be handled within the fuzzy region,
and a critical error turns into a non-critical error for the sake of audio-based indexing and
retrieval. Furthermore, there are still other features that might help to clarify the classification
at the end. Experimental results show that the FF and SC segment features are suitable for fuzzy
region modeling. However, TR cannot provide a reliable fuzzy model due to its nature. Although
TR probably achieves the highest reliability in distinguishing speech from the other
class types, it is practically blind to the categorization of any non-speech content (e.g. fuzzy
from music, or music from speech with significant background noise). Therefore, fuzzy
modeling is not applied to the TR segment feature, in order to prevent such erroneous cases.
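As a rough sketch of this perceptual model (ours), each per-feature decision can be expressed as a mapping from the feature value to a class vote plus a forced-classification flag. Only the forced-classification thresholds (15% for TR, 2 kHz for the FF and SC means, 1200 Hz for the adaptive SC deviation) come from the text; the remaining region boundaries below are illustrative placeholders, not the thesis values.

```python
def tr_vote(tr, music_boundary=0.05):
    """TR vote: forced speech above 15%; below that a plain speech/music
    split (no fuzzy region for TR). music_boundary is a placeholder."""
    if tr > 0.15:
        return 'speech', True                   # forced classification
    return ('speech' if tr > music_boundary else 'music'), False

def ff_vote(ff_mean, fuzzy_lo=250.0, fuzzy_hi=700.0):
    """FF vote: forced music above 2 kHz, fuzzy region in between.
    fuzzy_lo / fuzzy_hi are illustrative placeholders."""
    if ff_mean > 2000.0:
        return 'music', True                    # forced classification
    if ff_mean > fuzzy_hi:
        return 'music', False
    return ('fuzzy' if ff_mean > fuzzy_lo else 'speech'), False

def sc_vote(sc_mean, sc_std, fuzzy_lo=400.0, fuzzy_hi=800.0):
    """SC vote: forced music if the mean exceeds 2 kHz, forced speech if the
    adaptive deviation exceeds 1200 Hz; fuzzy boundaries are placeholders."""
    if sc_mean > 2000.0:
        return 'music', True
    if sc_std > 1200.0:
        return 'speech', True
    if sc_std > fuzzy_hi:
        return 'speech', False
    return ('fuzzy' if sc_std > fuzzy_lo else 'music'), False
```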
3.4. GENERIC AUDIO CLASSIFICATION AND SEGMENTATION
The proposed approach is mainly developed based on the aforementioned fact: automatic audio segmentation and classification are mutually dependent problems. A good segmentation
requires good classification and vice versa. Therefore, without any prior knowledge or supervising mechanism, the proposed algorithm proceeds iteratively: starting from granule/frame-based classification and an initial segmentation, the iterative steps are carried out until
a global segmentation, and thus a successful classification per segment, is achieved at the
end. Figure 22 illustrates the 4-step iterative approach to the audio classification and segmentation problem.
Figure 22: The flowchart of the proposed approach (Step 1: initial classification per granule; Step 2: initial segmentation and intermediate classification; Step 3: primary segmentation and primary classification; Step 4: further (intra) segmentation).
3.4.1. Step 1: Initial Classification
In this first step the objective is to extract silent and non-silent frames and then obtain an
initial categorization of the non-silent frames in order to proceed with an initial segmentation
in the next step. Since the frame-based classification is only needed for an initial
segmentation, there is no need to introduce fuzzy classification at this stage. Therefore, each
granule/frame is classified into one of three categories: speech, music or silent. Silence detection is performed per granule/frame by applying a threshold ($T_{TFE}$) to the total energy as given
in (5). $T_{TFE}$ is calculated adaptively in order to take the audio sound volume into account. The minimum ($E_{min}$), maximum ($E_{max}$) and average ($E_{\mu}$) granule/frame energy values are first calculated over the entire audio clip. An empirical all-mute test is performed to
ensure the presence of audible content. Two conditions are checked:
I. $E_{max}$ > Minimum Audible Frame Energy Level.
II. $E_{max} \gg E_{min}$.
If these conditions are not met, the entire clip is considered all mute and no further steps are necessary. Once the presence of some non-silent granules/frames is confirmed, $T_{TFE}$ is calculated according to (13).
$$
T_{TFE} = E_{min} + \lambda_s \times (E_{\mu} - E_{min}), \quad \text{where } 0 < \lambda_s \le 1 \qquad (13)
$$
where $\lambda_s$ is the silence coefficient, which determines the silence threshold value between
$E_{min}$ and $E_{\mu}$. If the total energy of a granule/frame is below $T_{TFE}$, it is classified as
silent, otherwise as non-silent. If a granule/frame is not classified as silent, the BER is then calculated for a cut-off frequency of 500 Hz, due to the fact that most of the speech energy is concentrated below 500 Hz. If the BER value of a frame is over a threshold (i.e. 2%), that granule/frame
is classified as music, otherwise as speech. Figure 23 summarizes the operations performed in
Step 1.
Figure 23: Operational Flowchart for Step 1 (each frame passes through silence detection; non-silent frames are then labeled speech or music via BER).
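A compact sketch of this first step is given below (ours); only the value of $\lambda_s$ is an illustrative choice, while the 500 Hz cut-off and the 2% BER threshold come from the text.

```python
import numpy as np

def initial_classification(tfe, ber, lambda_s=0.2, ber_thresh=0.02):
    """Step 1 sketch: per-frame silent / speech / music labels.
    tfe: per-frame total energies (Eq. 5); ber: per-frame BER at 500 Hz (Eq. 6).
    lambda_s is an illustrative silence coefficient."""
    tfe = np.asarray(tfe, dtype=float)
    e_min, e_mu = tfe.min(), tfe.mean()
    t_tfe = e_min + lambda_s * (e_mu - e_min)      # Eq. (13)
    labels = []
    for e, b in zip(tfe, ber):
        if e < t_tfe:
            labels.append('silent')
        elif b > ber_thresh:                       # rule as stated in the text
            labels.append('music')
        else:
            labels.append('speech')
    return labels
```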
3.4.2. Step 2
In this step, using the frame-based features extracted and the classifications performed in the previous step, the first attempt at segmentation is made and the first segment features are extracted from the initial segments formed. To begin with, silent and non-silent segmentations are performed. In the previous step, all the silent granules/frames have already
been found, so the silent granules/frames are merged to form silent segments. An empirical
minimum interval (i.e. 0.2 sec) is used: a merged run of silent granules/frames is assigned as a silent segment only if its duration is greater than
this threshold. All parts left between silent segments can then be considered as non-silent
segments. Once all non-silent segments are formed, the classification of these segments
is performed using DBER and TR. The initial segmentation and segment classification (via
DBER) is illustrated in Figure 24, with a sample segmentation and classification example at
the bottom. Note that different class types might be assigned independently via DBER
and TR for the non-silent segments. This is done on purpose, since the initial segment classification performed in this step with such a twofold structure is merely a preparation for the
further (towards a global) segmentation efforts that will be presented in the next step.
Figure 24: Operational Flowchart for Step 2 (silent and non-silent segmentation, followed by classification of the separated non-silent segments via DBER and TR, with a sample segment labeling shown at the bottom).
3.4.3. Step 3
This is the primary step, where most of the effort towards classification and global segmentation is concentrated, as illustrated in Figure 25. The first part of this step is devoted to a merging process in order to obtain more global segments. The silent segments extracted in the
previous step might be ordinary local pauses during natural speech, or they can be the borderline between one segment and another one with a different class type. In the former case,
such silent segments might be quite small and negligible for the purpose of segmentation, yet they reduce the duration of the non-silent segments and hence might lead to
erroneous calculations of the major segment features. Therefore, they need to be eliminated
to yield a better (global) segmentation, which would in turn result in a better classification. There
are two conditions for eliminating a silent segment and merging its non-silent neighbors:
i. its duration is below a threshold value,
ii. the class types of the neighboring non-silent segments, extracted from DBER and TR, both match.
After merging some of the non-silent segments, the overall segmentation scheme changes
and the features have to be re-extracted over the new (merged) non-silent segments. For all the non-silent segments, PR and DBER are re-calculated and the segments are then re-classified. This new classification of non-silent segments may result in classification
types that allow further silent segments to be eliminated (in the first pass they may not be
eliminated because the neighbor classification types did not match). An iterative loop is therefore
applied to eliminate all possible small silent segments. The iteration is carried out until all small
silent segments are eliminated and the non-silent segments are merged into global segments,
each of which has a unique classification type.
Due to the perceptual modeling in the feature domain, any segment feature may fall into
its forced-classification region and override the common decision process with its decision
alone. Otherwise, a decision look-up table is applied for the final classification, as given in
Table V. This table is formed by considering all possible class type combinations. Note that
the majority rule dominates within this table: if the majority of the segment
features favor a class type, then that class type is assigned to the segment. If a consensus cannot be reached, the segment is set as fuzzy.
Figure 25: Operational Flowchart for Step 3 (iterative merging driven by DBER and TR until the number of silent segments is stable, filtering of small non-silent segments, and the final decision process over the global segments using FF, TR and SC, yielding speech, music, fuzzy or silent).
Table V: Generic Decision Table.

TR       FF       SC       Decision
speech   speech   speech   Speech
speech   speech   music    Speech
speech   speech   fuzzy    Speech
speech   music    speech   Speech
speech   music    music    Music
speech   music    fuzzy    Fuzzy
speech   fuzzy    speech   Speech
speech   fuzzy    music    Fuzzy
speech   fuzzy    fuzzy    Fuzzy
music    speech   speech   Speech
music    speech   music    Music
music    speech   fuzzy    Fuzzy
music    music    speech   Music
music    music    music    Music
music    music    fuzzy    Music
music    fuzzy    speech   Fuzzy
music    fuzzy    music    Music
music    fuzzy    fuzzy    Fuzzy
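The majority rule that generates Table V can be written compactly as follows (a sketch, ours); forced classifications are assumed to have been handled before this point.

```python
from collections import Counter

def decision(tr_vote, ff_vote, sc_vote):
    """Majority rule of Table V: the class favored by at least two of the
    three segment features wins, otherwise the segment is set as fuzzy."""
    counts = Counter([tr_vote, ff_vote, sc_vote])
    label, n = counts.most_common(1)[0]
    return label if n >= 2 else 'fuzzy'

# e.g. decision('speech', 'music', 'fuzzy') -> 'fuzzy'
```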
3.4.4. Step 4
This step is dedicated to the intra-segmentation analysis and mainly performs some post-processing to improve the overall segmentation scheme. Once the final classification and segmentation is finished in Step 3 (Section 3.4.3), non-silent segments with a significantly long duration
might still need to be partitioned into new segments if they consist of two or more sub-segments (without any silent part in between) with different class types. For example, within a
long segment there might be sub-segments that include both pure music and pure speech content without a silent separation in between. Due to the lack of a (sufficiently long) silent segment separation, the early steps fail to detect those sub-segments and therefore,
a further (intra) segmentation is performed in order to separate those sub-segments.
We developed two approaches to perform intra segmentation. The first approach divides the segment into two and uses the SC segment feature to check for a significant
difference between the two halves (unbalanced sub-segments). The second one attempts to detect the
boundaries of any potential sub-segment with a different class type by using the Subband Centroid
(SC) frame feature. Generally speaking, the first method is more robust in detecting the major
changes since it uses the segment feature, which is usually robust to noise. However, it sometimes introduces a significant offset in the exact location of the sub-segments and therefore
causes severe degradations in the temporal resolution and segmentation accuracy. This problem is mostly solved in the second method, but it might increase the number of false detections
of the sub-segments, especially when the noise level is high. In the following sub-sections both
methods will be explained in detail.
3.4.4.1 Intra Segmentation by Binary Division
The first part in this method tests if the non-silent segment is significantly longer than a given
threshold (i.e. 4 seconds). We then divide the segment into two sub-segments and
test whether their SC segment feature values differ significantly from each other. If not,
we keep the parent segment and stop. Otherwise, we perform the same operation over the two
child segments and follow the one which is less balanced (the one with the higher SC
value difference between its left and right child segments). The iteration is carried out until
the child segment is small enough to break the iteration loop. This gives the sub-segment
boundary, and Step 3 is then re-performed over the sub-segments in order to make the most
accurate classification possible. If Step 3 does not assign a different class type to the potential sub-segments detected, the initial parent segment is kept unchanged; this means a
false detection was made in Step 4. Figure 26 illustrates the algorithm in detail. The
function local_change() performs SC based classification for the right and left child segments
within the parent segment (without dividing) and returns the absolute SC difference between
them.
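A sketch of the binary-division search is given below (ours); sc_diff stands for the local_change() of the text, and the duration and balance thresholds are illustrative placeholders, not the thesis values.

```python
def binary_division(seg_start, seg_end, sc_diff, min_len=1.0, max_len=4.0,
                    balance_thr=300.0):
    """Sketch of intra segmentation by binary division (Figure 26).
    sc_diff(a, b): absolute SC-feature difference between the left and right
    halves of the interval [a, b). Returns a candidate sub-segment boundary
    (same time units as the inputs) or None if the segment looks balanced."""
    if seg_end - seg_start <= max_len:
        return None                                  # segment not long enough
    if sc_diff(seg_start, seg_end) < balance_thr:
        return None                                  # segment is balanced
    lo, hi = seg_start, seg_end
    while hi - lo > min_len:                         # iterate on the child segments
        mid = 0.5 * (lo + hi)
        left_change = sc_diff(lo, mid)
        right_change = sc_diff(mid, hi)
        # follow the less balanced (larger local change) child segment
        if left_change >= right_change:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)                           # candidate boundary; re-run Step 3
```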
Figure 26: Intra Segmentation by Binary Division in Step 4 (a long segment is tested for balance via local_change(); if unbalanced it is divided into two, the less balanced child is followed, and when the child is no longer long the recursion stops and Step 3 is re-applied).
3.4.4.2 Intra Segmentation by Breakpoints Detection
This is a more traditional approach used in many similar works. The same test as in the
previous approach is performed to check whether the segment has a sufficiently long duration. If
so, using a robust frame feature (i.e. SC), a series of breakpoints (certain frame locations)
is detected where the class types indicated by the SC frame feature alternate with
respect to the class type of the parent segment. The alternation starts with a
particular frame whose SC feature value indicates a class type different from
the class type of the parent segment; it may then swing back to the original class type of its parent after a while, or it may persist up to the parent segment boundary. The SC segment feature is the windowed standard deviation used for the classification of the segment. Keeping the same analogy, we use a windowed standard deviation calculated per frame, and by comparing it with the
SC segment feature the breakpoints can be detected. The windowed standard deviation of
SC, $\sigma_i^{SC}$, for frame i can be calculated as in (14).
$$
\sigma_i^{SC} = \sqrt{\frac{\sum_{j \in W_i}\big(SC_j - \mu_i^{SC}\big)^2}{NoW}}, \qquad
\mu_i^{SC} = \frac{\sum_{j \in W_i} SC_j}{NoW}
\qquad (14)
$$
The pair of breakpoints can thus be detected by comparing $\sigma_i^{SC}$ for all the frames within the
segment with the SC segment feature $\sigma^{SC}$ (i.e. $\sigma_i^{SC} > \sigma^{SC} \rightarrow \sigma_{i+NoF_{SS}}^{SC} < \sigma^{SC}$). A sample breakpoint detection is shown in Figure 27.
Figure 27: Windowed SC standard deviation sketch (white) in a speech segment. Breakpoints are successfully detected with the Roll-Down algorithm and the music sub-segment is extracted.
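A minimal sketch of the breakpoint search is given below (ours); it detects frames where the windowed SC deviation of (14) crosses the SC segment feature of the parent segment, and omits the Roll-Down refinement mentioned in Figure 27.

```python
import numpy as np

def breakpoints(sc_std_frames, sc_std_segment):
    """Sketch of breakpoint detection: frames whose windowed SC deviation
    (Eq. 14) crosses the SC segment feature of the parent segment.
    Returns (start, end) index pairs of candidate sub-segments."""
    above = np.asarray(sc_std_frames) > sc_std_segment
    pairs, start = [], None
    for i in range(1, len(above)):
        if above[i] != above[i - 1]:        # indicated class type alternates
            if start is None:
                start = i                   # opening breakpoint
            else:
                pairs.append((start, i))    # closing breakpoint
                start = None
    return pairs
```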
3.5. EXPERIMENTAL RESULTS
For the experiments in this section the following MUVIS databases are mainly used (see subsection 2.1.4): Open Video, Real World Audio/Video and Music Audio databases. All experiments are carried out on a Pentium-4 3.06 GHz computer with 2048 MB memory. The
evaluation of the performance is carried out subjectively using only the samples containing
straightforward (obvious) content. In other words, if there is any subjective ambiguity on the
result such as an insignificant (evaluator) doubt on the class type of a particular segment (e.g.
speech or fuzzy?) or the relevancy of some of the audio-based retrieval results of an aural
query, etc., then that sample is simply discarded from the evaluation. Therefore, the experimental results presented in this section depend only on the decisive subjective evaluation via
ground truth and yet they are meant to be evaluator-independent (i.e. same subjective decisions are guaranteed to be made by different evaluators).
The experiments carried out are detailed and reported in the next two subsections.
Subsection 3.5.1 presents the performance evaluation of the enhanced frame features, their
discrimination factors and especially the proposed fuzzy modeling and the final decision
process. The accuracy analysis via error distributions and the performance evaluation of the
overall segmentation and classification scheme is given in subsection 3.5.2.
3.5.1. Feature Discrimination and Fuzzy Modeling
The overall performance of the proposed framework mainly depends on the discrimination
factors of the extracted frame and segment features. Furthermore, the control over the decisions based on each segment feature plays an important role in the final classification. In order to have an effective control on the decision making criteria, we only used certain number
of features giving significant discrimination for different audio content due to their improved
design, instead of having too many traditional ones. As shown in a sample automatic classification and segmentation example in Figure 28, almost all of the features provide a clean distinction between pure speech and music content. Also intra segmentation via breakpoints detection works successfully as shown in the upper part of Figure 28 and false breakpoints are
all eliminated.
Figure 28: Frame and Segment Features on a sample classification and segmentation (the overall classification and segmentation with intra segmentations via breakpoint detection on top, and the FF, SC and TR frame/segment features below).
Yet each feature used still has its weak and strong points. For instance, TR is perceptually blind in the discrimination between fuzzy and music content, as explained before.
Similarly, the FF segment feature might fail to detect harmonic content if the music type is
Techno or Hard Rock with saturated beats and bass drums. In the current framework, such a
weak point of a particular feature can still be compensated by the others during the final
decision process. One particular example can be seen in Figure 28: the FF segment feature
wrongly classifies the last (music) segment as speech. However, the overall classification result is still accurate (music), as can be seen at the top of Figure 28, since both the SC and TR features
favor music, which overrides the FF feature (by majority rule) in the end.
3.5.2. Overall Classification and Segmentation Performance
The evaluation of the proposed framework is carried out over standalone MP3 and AAC audio clips, and over AVI and MP4 files containing MPEG-4 video along with MP3, AAC, ADPCM
(G721 and G723 in 3-5 bits/sample) and PCM (8 and 16 bits/sample) audio. These files are
chosen from Open, Real World and Music databases as mentioned before. The duration of
the clips varies from a few minutes up to 2 hours. The clips are captured using several sampling
frequencies from 16 kHz to 44.1 kHz so that both the MPEG-1 and MPEG-2 phases are tested
for Layer 3 (MP3) audio. Both MPEG-4 and MPEG-2 AAC are recorded with the Main and
Low Complexity profiles (object types). TNS (Temporal Noise Shaping) and M/S coding
schemes are disabled for AAC. Around 70% of the clips are stereo and the rest are mono. The
total number of files used in the experiments is above 500; in total, the method is
applied to 260 (> 15 hours) MP3, 100 (> 5 hours) AAC and 200 (> 10 hours) PCM (uncompressed) audio clips. Neither the classification and segmentation parameters such as threshold
values, window duration, etc., nor any part of the algorithm are changed for the aforementioned variations, in order to test the robustness of the algorithm. The error distribution results, which belong to both the bit-stream and generic modes, are provided in Table VI and Table VII. These results are formed based on the deviation of the specific content from the ground-truth classification, which is based on subjective evaluation as explained earlier.
Table VI: Error Distribution Table for Bit-Stream Mode.

BS Type   Speech Critical   Speech Non-Critical   Music Critical   Music Non-Critical   Fuzzy Semi-Critical
MP3       2.0 %             0.5 %                 5.8 %            10.3 %               24.5 %
AAC       1.2 %             0.2 %                 0.5 %            8.0 %                17.6 %
Table VII: Error Distribution Table for Generic Mode.

Speech Critical   Speech Non-Critical   Music Critical   Music Non-Critical   Fuzzy Semi-Critical
0.7 %             4.9 %                 5.1 %            22.0 %               23.4 %
In fact, for each and every particular audio clip within the database, the output classification and especially the segmentation result are completely different. Furthermore, as classification and segmentation are mutually dependent tasks, there should be a method of evaluating and reporting the results accordingly. Owing to the aforementioned reasons, neither size
nor the number of segments can be taken as a unit on which errors are calculated. Therefore,
in order to report the combined effect of both classification and segmentation, the output error
distribution, $\varepsilon_{c^*}$, is calculated and reported in terms of the total misclassification-time of a
specific class, $c^*$, per total ground-truth time (total actual time) of that content within the database, formulated as follows:
$$
\varepsilon_{c^*}(\%) = 100 \times \frac{\displaystyle\sum_{D} t(c), \;\; c \in (C - c^*)}{\displaystyle\sum_{D} t(c^*)} \qquad (15)
$$
where C represents the set of elements from all class types, t represents time, while D represents the experimental database. For example in Table VI, 2.0% for MP3 critical errors in
speech basically means that if there were 100 minutes of speech content in the database, two
minutes of it are misclassified as music. This error calculation approach ensures that the results stay independent of the effects of segmentation, i.e. the errors are calculated and
reported in the same way for cases such as the misclassification of a whole segment or the misclassification of a fraction of a segment due to wrong intra-segmentation.
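A sketch of how (15) can be evaluated over a set of labelled segments is given below (ours; the interval representation is an assumption, and the split into critical, semi-critical and non-critical types would additionally condition on the predicted class).

```python
def misclassification_time(intervals, target):
    """Sketch of Eq. (15): percentage of the ground-truth time of class
    `target` that ends up labelled as some other class.
    intervals: iterable of (duration, ground_truth_label, predicted_label)."""
    total = sum(d for d, gt, _ in intervals if gt == target)
    wrong = sum(d for d, gt, pred in intervals if gt == target and pred != target)
    return 100.0 * wrong / total if total else 0.0
```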
From the analysis of the simulation results, we can see that the primary objective of
the proposed scheme, i.e. the minimization of critical classification errors, is successfully achieved. As a trade-off of this achievement, most of the errors are semi-critical and
sometimes, as intended, non-critical. Semi-critical errors, despite having relatively higher
values, are still tolerable, especially considering the fact that the contribution of fuzzy content
to the overall size of a multimedia database (also in the experimental database) is normally less than 5%. This basically means that the overall effect of these high values on the
indexing and retrieval efficiency is minor. The moderately valued non-critical errors are, as the
name suggests, not critical with respect to the audio-based multimedia retrieval performance because of the indexing and retrieval scheme.
As a result, we have achieved our primary goal of minimizing the critical errors in audio content classification by introducing fuzzy modeling in the feature domain, and we have shown the important role of a global and perceptually
meaningful segmentation for accurate classification (and vice versa) in this context. Furthermore, we have shown that high accuracy can be achieved with a sufficient number of enhanced features, all designed according to perceptual rules in a well-controlled
manner, rather than with a large number of traditional features. The proposed work achieves significant
advantages and superior performance over existing approaches for automatic audio content
analysis. In the next chapter, we present an audio-based multimedia indexing and
retrieval framework into which this approach is integrated, together with the performance improvements achieved especially for aural retrievals in large-scale multimedia databases.
Chapter 4

Audio-Based Multimedia Indexing and Retrieval
Rapid increase in the amount of digital audio collections, presenting the various formats, types, durations and other parameters to which the digital multimedia world refers, demands
a generic framework for robust and efficient indexing and retrieval based on the aural content.
As mentioned earlier, from the content-based multimedia retrieval point of view the audio information can be even more important than the visual part since it is mostly unique and significantly stable within the entire duration of the content and therefore, the audio can be a
promising basis for the content-based management of those multimedia collections accompanied by an audio track. Accordingly, in this chapter we focus our efforts on developing a generic and robust audio-based multimedia indexing and retrieval framework, which has been
developed and tested under the MUVIS system [P3]. First, an overview of the audio indexing and
retrieval works both in the literature and in commercial systems, along with their major
limitations and drawbacks, is presented in the next section. In addition, the design issues and
the basic innovative properties of the proposed method are justified accordingly. Afterwards, the proposed audio indexing and retrieval framework is presented in
Sections 4.2 and 4.3 respectively. Finally, the experimental results, the performance evaluation
4.1. AUDIO INDEXING AND RETRIEVAL – AN OVERVIEW
The studies on content-based audio retrieval are still in their early stages. Traditional keyword-based search engines such as Google, Yahoo, etc. usually cannot provide successful audio retrievals since they require costly (and usually manual) annotations that are obviously impractical for large multimedia collections. In recent years, promising content-based audio retrieval techniques that might be categorized into two major paradigms have emerged. In the first paradigm, the "Query by Humming" (QBH) approach is applied to music retrieval. There are many studies in the literature, such as [2], [8], [18], [28], [35], [38], [41], [42]. However, this approach has the disadvantage of being feasible only when the audio data is music stored in
some symbolic format or polyphonic transcription (i.e. MIDI). Moreover it is not suitable for
various music genres such as Trance, Hard-Rock, Techno and several others. Such a limited
approach obviously cannot be a generic solution for the audio retrieval problem. The second
paradigm is the well-known “Query by Example” (QBE) technique, which is also common
for visual retrievals of the multimedia items. This is a more global approach, which is adopted
by several research studies and implementations. One of the most popular systems is MuscleFish [44]. The designers, Wold et al. [74] proposed a fundamental approach to retrieve sound
clips based on their content extracted using several acoustic features. In this approach, an N
dimensional feature vector is built where each dimension is used to carry one of the acoustic
features such as pitch, brightness, harmonicity, loudness, bandwidth, etc. and it is used for
similarity search for a query sound. The main drawback of this approach is that it is a supervised algorithm that is feasible only for some limited sub-set of an audio collection and hence cannot provide an adequate and global approach for general audio indexing. Furthermore, the sound clips must contain a unique content with a short duration. It does not address the retrieval problem in a generic case such as audio files carrying several temporally mixed content types along with longer and varying durations. Foote in [21] proposed an approach for the
representation of an audio clip using a template, which characterizes the content. First, all the
audio clips are converted to a 16 KHz, 16-bits-per-sample representation. For template
construction the audio clip is first divided into overlapped frames with a fixed duration and a
13-dimensional feature vector based on 12 mel frequency cepstrum coefficients (MFCC) and
one spectral energy is formed for training a tree-based Vector Quantizer. For retrieval of a
query audio clip, it is first converted into the template and then template matching is applied
and ranked to generate the retrieval list. This is again a supervised method designed to work
for short sound files with a single content, fixed audio parameters and a fixed file format (i.e. au). It achieves an average retrieval precision within a wide range, from 30% to 75%, for different audio classes. Li and Khokar [32] proposed a wavelet-based approach for short sound file
retrievals and presented a demo using the MuscleFish database. They achieved around 70%
recall rate for diverse audio classes. Spevak and Favreau presented the SoundSpotter [64] prototype system for content-based audio section retrieval within an audio file. In their work the
user selects a specific passage (section) within an audio clip and also sets the number of retrievals. The system then retrieves the similar passages within the same audio file by performing a pattern matching of the feature vectors and a ranking operation afterwards.
All the aforementioned systems and techniques achieve a certain performance; however, they present further limitations and drawbacks. First, the limited number of features extracted from the aural data often fails to represent the perceptual content of the audio data, which is usually subjective information. Second, the similarity matching in the query process is based on the computation of the (dis-) similarity distance between the query and each item in the database, followed by a ranking operation. Therefore, especially for large databases it may turn out to be such a costly operation that the retrieval time becomes infeasible for a particular search engine or application. Third, all of the aforementioned techniques are designed to work with pre-fixed audio parameters (i.e. with a fixed format, sampling rate, bits per sample, etc.).
Obviously, large multimedia databases may contain digital audio that is in different formats
(compressed or uncompressed), encoding schemes (MPEG Layer-2 [25], [51], MP3, [10],
[23], [25], [51], AAC [10], [23], [24], ADPCM, etc.), other capturing, encoding and acoustic
parameters (i.e. sampling frequency, bits per sample, sound volume level, bit-rate, etc.) and
durations. It is a fact that the aural content is totally independent from such parameters. For
example, the same speech content can be represented by an audio signal sampled at 16 KHz
or 44.1 KHz, in stereo or mono, compressed by MP3 in 64 Kb/s, or by AAC 24 Kb/s, or simply in (uncompressed) PCM format, lasting 15 seconds or 10 minutes, etc. However, if not
designed accordingly, the feature extraction techniques are often affected drastically by such
parameters and therefore, the efficiency and the accuracy of the indexing and retrieval operations will both be degraded as a result. Finally, they are mostly designed either for short sound
files bearing a unique content or manually selected (short) sections. However, in a multimedia
database, each clip can contain multiple content types, which are temporally (and also spatially) mixed with indefinite durations. Even the same content type (i.e. speech or music) may
be produced by different sources (people, instruments, etc.) and should therefore, be analyzed
accordingly.
In order to overcome the aforementioned problems and shortcomings, in this chapter
we introduce a generic audio indexing and retrieval framework, which is developed and tested
under the MUVIS system presented in Chapter 2. The primary objective in this framework
is therefore, to provide a robust and adaptive basis, which performs audio indexing according
to the audio class type (speech, music, etc.), audio content (the speaker, the subject, the environment, etc.) and the sound perception as closely modeled as possible to the human auditory
perception mechanism. Furthermore, the proposed framework is designed in such a way that
various low-level audio feature extraction methods can be used. For this purpose the aforementioned Audio Feature eXtraction (AFeX) modular structure can support various AFeX
modules.
In order to achieve efficiency in terms of retrieval accuracy and speed, the proposed scheme uses high-level audio content information obtained from the efficient, robust and automatic (unsupervised) audio classification and segmentation algorithm presented in Chapter 3, during both the indexing and retrieval processes. In this context, this information is also made available to all AFeX modules, so that a particular module can optionally use the classification and segmentation information in order to tune and optimize its feature extraction process according to a particular class type. During the feature extraction operation, the feature vectors are extracted from each individual segment with a different class type and are stored and retrieved separately. This makes more sense from a content-based retrieval point of view since it brings the advantage of performing the similarity comparison between the frames within the segments with a matching class type and therefore, avoids potential similarity mismatches and reduces the indexing and, most important of all, the (query) retrieval times significantly.
In the audio retrieval operations the classification-based indexing scheme is fully exploited in order to achieve low-complexity and robust query results. The aforementioned AFeX framework supports merging of several audio feature sets and associated sub-features once the similarity distance per sub-feature is calculated. During the similarity distance calculation a penalization mechanism is applied in order to penalize clips that are not fully (only partly) matched. For example, if a clip with both speech and music parts is queried, all the clips missing one of the existing class types (say a music-only clip) will be penalized by the amount of coverage of the missing class (speech) in the queried clip. This gives priority to the clips matching all classes and therefore, ensures a more reliable retrieval. Another mechanism is applied for normalization due to the variations of the audio frame duration among the sub-features. Such variations change the number of frames within a class type and hence make the overall sub-feature similarity distance dependent on the audio frame duration. Such dependency would negate any sub-feature merging attempts and therefore, normalization per audio frame is applied. The MUVIS framework internally provides a weighted merging scheme in order to achieve a "good" blend of the available audio features.
4.2. A GENERIC AUDIO INDEXING SCHEME
As mentioned in the previous chapter, when dealing with digital audio there are several requirements to be fulfilled and the most important of them is the fact that the aural content is
entirely independent from the digital audio capture parameters (i.e. sound volume, sampling
frequency, etc.), audio file type (i.e. AVI, MP3, etc.), encoder type (MP3, AAC, etc.), encoding parameters (i.e. bit-rate, etc.) and other variations such as duration and sound volume
level. Therefore, similar to the audio classification and segmentation operation, the overall structure of the audio-based indexing and retrieval framework is designed to provide a preemptive robustness (independence) with respect to such parameters and variations.
As shown in Figure 29, audio indexing is applied to each multimedia item in a MUVIS database containing audio, and it is accomplished in several steps. The classification and
segmentation of the audio stream is the first step. As a result of this step the entire audio clip
is segmented into 4 class types and the audio frames among three class types (speech, music
and fuzzy) are used for indexing. Silent frames are simply discarded since they do not carry
any audio content information. The frame conversion is applied in step 2 due to the (possible) difference between the frame durations used in classification and segmentation and those used in the later AFeX operations. The boundary frames, which contain more than one class type, are assigned as uncertain and are also discarded from indexing since their content is not pure but mixed and hence does not provide clean content information. The remaining speech, music and fuzzy frames (within their corresponding segments) are each subjected to audio feature extraction (AFeX) modules and their corresponding feature vectors are indexed into descriptor files separately after a clustering (key-framing) operation via Minimum Spanning Tree (MST) Clustering [29]. In the following sub-sections we detail each of the indexing steps.
[Figure 29: Audio Indexing Flowchart. Steps: (1) classification and segmentation per granule/frame, with silence discarded; (2) audio framing and classification conversion into valid classes (speech, music, fuzzy), with uncertain frames discarded; (3) AFeX operation per frame; (4) KF extraction via MST clustering, yielding the KF feature vectors used for audio indexing.]
4.2.1. Unsupervised Audio Classification and Segmentation
In order to achieve suitable content analysis, the first step is to perform accurate classification and segmentation over the entire audio clip. As explained in Chapter 3, the developed algorithm is a generic audio classification and segmentation method especially suitable for audio-based multimedia indexing and retrieval systems. It has a multimodal structure, which supports both a bit-stream mode for MP3 and AAC audio, and a generic mode for any audio type and format. In both modes, once a common spectral template is formed from the input audio source, the same analytical procedure can be performed afterwards. It is also automatic (unsupervised) in the sense that no training or feedback (from the video part or human interference) is required. It further provides a robust (invariant) solution for digital audio files with various capturing/encoding parameters and modes. In order to achieve a certain robustness level, a fuzzy approach has been integrated within the technique.
Furthermore, in order to improve the performance and, most important of all, the overall accuracy, the classification scheme produces only 4 class types per audio segment: speech, music, fuzzy or silent. Speech, music and silent are the pure class types. The class type of a
segment is defined as fuzzy if either it is not classifiable as a pure class due to some potential
uncertainties or anomalies in the audio source or it exhibits features from more than one pure
class. The primary use of such classification and segmentation scheme is the following: For
audio based indexing and retrieval, a pure class content is only searched throughout the associated segments of the audio items in the database having the same (matching) pure class
type, such as speech or music. All silent segments and silent frames within non-silent segments can be discarded from the audio indexing. Special care is taken for the fuzzy content,
that is, during the retrieval phase, the fuzzy content is compared with all relevant content types
of the database (i.e. speech, music and fuzzy) since it might, by definition, contain a mixture of
pure class types, background noise, aural effects, etc. Therefore, for the proposed method, any
erroneous classification on pure classes is intended to be detected as fuzzy, so as to avoid significant retrieval errors (mismatches) due to such potential misclassification.
4.2.2. Audio Framing
As mentioned in the previous section, there are three valid audio segment types: speech, music and fuzzy. Since segmentation and classification are performed on a per-granule/frame basis, such as per MP3 granule or AAC frame, a conversion is needed to achieve a generic audio framing for indexing purposes. The entire audio clip is first divided into user- or model-defined audio frames, each of which will be given a classification type as a result of the previous step. In order to assign a class type to an audio frame, all the granules/frames within or neighboring that frame must have a single class type, which is then assigned to the frame; otherwise, the frame is assigned as uncertain.
[Figure 30: A sample audio classification conversion. Per-granule/frame labels (M: music, S: speech, X: silence) are converted into a final per-audio-frame classification of Music, Uncertain, Speech and Uncertain.]
Since the uncertain frames are mixed and hence are all transition frames (i.e. music to speech, speech to silence, etc.), the feature extraction would yield an unclear feature vector that does not contain clean content characteristics at all. Therefore, these frames are removed from the indexing operation thereafter.
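The conversion itself is straightforward; the following is a hedged sketch, assuming the per-granule labels are available as a plain list and that the frame size is an integer number of granules (neither is the actual MUVIS interface).

```python
# Illustrative sketch: per-granule class labels converted into per-frame labels.
# A frame keeps a class only if every granule inside it (and its boundary
# neighbours) agrees; otherwise the frame is marked "uncertain".

def frames_from_granules(granule_labels, granules_per_frame):
    frames = []
    for start in range(0, len(granule_labels), granules_per_frame):
        chunk = granule_labels[start:start + granules_per_frame]
        if start > 0:                                        # left boundary neighbour
            chunk = [granule_labels[start - 1]] + chunk
        if start + granules_per_frame < len(granule_labels):  # right boundary neighbour
            chunk = chunk + [granule_labels[start + granules_per_frame]]
        frames.append(chunk[0] if len(set(chunk)) == 1 else "uncertain")
    return frames

labels = ["M"] * 8 + ["X", "X"] + ["S"] * 8 + ["X", "X"]
print(frames_from_granules(labels, 4))
# ['M', 'uncertain', 'uncertain', 'S', 'uncertain']
```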
4.2.3. A Sample AFeX Module Implementation: MFCC
MFCC stands for Mel-Frequency Cepstrum Coefficients [55]; they are widely used in several speech and speaker recognition systems because they provide a decorrelated, perceptually-oriented observation vector in the cepstral domain and therefore, they are well matched to the human audio perception system. This is the main reason why we use them for audio-based multimedia indexing and retrieval: to achieve a similarity measure close to ordinary human audio perception criteria, such as 'sounds like', with additional higher-level content discrimination via classification (i.e. speech, music, etc.).
The MFCC AFeX module performs several steps to extract MFCC per audio frame. First, the incoming frames are Hamming windowed in order to enhance the harmonic nature of the vowels in speech and of voiced consonants (sounds from instruments, effects, etc.) in music. In addition, the Hamming window reduces the effects of the discontinuities and edges introduced during the framing process; such windowing effects become especially significant in the logarithmic domain. The Hamming window is half of a cosine wave shifted upwards, as given in the following expression:
$w(k) = 0.54 - 0.46 \cos\!\left( 2\pi \frac{k-1}{N-1} \right)$   (16)
where N is the size of the window, which is equal to the size (number of PCM samples) of the audio frame. Rather than performing the filtering in the time domain, the audio frame is zero-padded to a power-of-two size and the FFT is applied, so that the filterbank can be applied by plain multiplication in the spectral domain. The mel (melody) scaled filterbank is a series of band filters whose central frequencies are uniformly distributed in the mel-frequency (mel(f)) domain, where
$mel(f) = m_f = 1127 \log\!\left(1 + \frac{f}{700}\right) \quad \text{and} \quad f = 700\,\left(e^{m_f / 1127} - 1\right)$   (17)
Figure 31 illustrates a sample mel-scaled filterbank in the frequency domain. The number of bands is reduced for the sake of clarity. The shape of the band filters in the filterbank can be a Hamming window or a plain triangular shape. As clearly seen in Figure 31, the resolution is high for low frequencies and low for higher frequencies. This matches the nature of the human ear and is one of the main reasons for using the mel scale. Once the filtering is applied, the energy is calculated per band and the Cepstral Transform is applied to the band energy values. The Cepstral Transform is a discrete cosine transform of the log filterbank amplitudes:
$c_i = \left(\frac{2}{P}\right)^{1/2} \sum_{j=1}^{P} \log m_j \cdot \cos\!\left( \frac{\pi\, i}{P} \,(j - 0.5) \right)$   (18)
where 0 < i ≤ P and P is the number of filter banks. A subset of the c_i is then used as the feature vector for this frame.
[Figure 31: The derivation of mel-scaled filterbank amplitudes. The band filters m_1 ... m_j ... m_P are shown over the frequency axis, together with the energy computed in each band.]
As mentioned in the previous sections, any AFeX module should provide generic feature vectors independent from the following variations:
• Sampling Frequency.
• Number of audio channels (mono/stereo).
• Sound Volume level.
By using audio data from only one channel for AFeX operation, the effect of multiple
audio channels can be avoided. However, we need normalization during the calculation of the
energy per filterbank in order to neutralize the effects of sampling frequency and volume
variations. Let f_S be the sampling frequency. According to the Nyquist theorem, the bandwidth of the signal will be f_BW = f_S / 2. The frequency resolution (Δf) per FFT spectral line will then be:

$\Delta f = \frac{f_{BW}}{N_{FL}/2} = \frac{f_S}{N_{FL}}$   (19)
Let T be the duration (in milliseconds) of the incoming audio frames. Then the number of PCM samples within an audio frame will be N = T · f_S / 1000. An audio clip sampled with different sampling frequencies will result in different per-band energy calculations, because the number of samples within the frame varies; therefore, the band energy values should be normalized by a generic coefficient λ, where λ ~ N.
Sound Volume (V) can be approximated as the absolute average level within the audio
frame such as:
$V \cong \frac{\sum_{i}^{N} |x_i|}{N}$   (20)
Similarly, an audio clip with different volume levels will result in different per-band energy calculations and therefore, the energy values should be normalized by λ where λ ~ V. The overall normalization will be:

$\lambda \sim \lambda_V \cdot \lambda_f \sim V \cdot N \;\Rightarrow\; \lambda \sim \sum_{i}^{N} |x_i|$   (21)
During the calculation of the band energies under each filterbank, the energy values are divided by λ to prevent both volume and sampling frequency effects on the Cepstrum coefficient calculation. As shown in Figure 31, the filterbank central frequencies are uniformly distributed over the mel scale. Let f_CF^i be the center frequency of the i-th filter bank; then the filterbank central frequencies can be obtained from the following equation:

$mel(f_{CF}^{\,i}) = \frac{i \cdot mel(f_{BW})}{P}$   (22)
So it is clear that the central frequencies will also be dependent on the sampling frequency (f_BW = f_S / 2). This brings the problem that audio clips with different sampling frequencies will have filterbanks with different central frequencies, and hence the feature vectors (MFCC) will be totally uncorrelated since they are derived directly from the band energy values of each filter bank. In order to fix the filterbank locations, we use a fixed cut-off frequency that corresponds to the maximum sampling frequency value used within MUVIS. The minimum and maximum sampling frequencies within the proposed audio indexing framework are 16 KHz and 44.1 KHz; therefore,

$mel(f_{CF}^{\,i}) = \frac{i \cdot mel(f_{FCO})}{P}, \quad \text{where } f_{FCO} \geq 22050\ \text{Hz}$   (23)
Setting the central frequencies with the formula above ensures the use of the same filterbank for all audio clips. Nevertheless, only the audio clips sampled at 44.1 KHz will use all the filters (assuming f_FCO = 22050 Hz), whilst audio clips sampled at lower frequencies will produce band energy values whose highest bands (m_j for j > M) are automatically set to 0, since those bands lie outside the bandwidth of the audio signal. This would yield erroneous results in the calculation of the MFCC, since the latter are nothing but DCT transforms of the logarithm of the band energy values. In order to prevent this, only some portion of the band energy values that are common for all possible sampling frequencies
(within MUVIS) are used. In order to achieve this, the minimum possible M value is found using the lowest (f_S = 16 KHz ⇒ f_BW = 8 KHz) and the highest (f_S = 44.1 KHz ⇒ f_BW = 22.050 KHz) sampling frequency values for MUVIS. Substituting mel(8000) = 2840.03 and mel(22050) = 3923.35 into (23), the bound for M can be stated
as: M ≤ P · 0.7238. Therefore, having a filterbank that contains P band filters, we use M of them for the calculation of the MFCC. In this way, only a common range of MFCC is used, in order to negate the effect of the varying sampling frequencies of the audio clips within the database.
For indexing, only the static values of the Cepstral Transform coefficients (c_i) are used. The first coefficient is not used within the feature vector since it is a noisy estimate of the frame energy and hence does not contain reliable information. The remaining M−1 coefficients out of P (c_i, ∀ 1 < i ≤ M) are used to form an MFCC feature vector.
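As a rough illustration of the steps above, the following NumPy sketch computes an MFCC vector for one mono PCM frame. It follows the Hamming windowing of (16), the λ normalization of (21), the fixed cut-off filterbank of (23), the common-band bound M ≤ 0.7238·P and the DCT of (18); it is not the MUVIS AFeX module, and P = 26, the rectangular band summation (instead of triangular or Hamming-shaped filters) and the function interface are assumptions made for brevity.

```python
import numpy as np

def mel(f):
    return 1127.0 * np.log(1.0 + f / 700.0)           # Eq. (17): Hz -> mel

def inv_mel(m):
    return 700.0 * (np.exp(m / 1127.0) - 1.0)         # Eq. (17): mel -> Hz

def mfcc_frame(x, fs, P=26, f_fco=22050.0):
    """MFCC vector for one mono PCM frame x sampled at fs (assumed interface)."""
    N = len(x)
    lam = np.sum(np.abs(x))                                          # lambda ~ V * N, Eq. (21)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))     # Hamming, Eq. (16)
    n_fft = int(2 ** np.ceil(np.log2(N)))                            # zero-pad to power of two
    spec = np.abs(np.fft.rfft(x * w, n_fft)) ** 2
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)

    # Band edges fixed by the cut-off f_FCO (Eq. 23), independent of fs;
    # rectangular bands are used here for brevity instead of shaped filters.
    edges = inv_mel(np.arange(P + 1) * mel(f_fco) / P)
    energies = np.zeros(P)
    for i in range(P):
        band = (freqs >= edges[i]) & (freqs < edges[i + 1])
        energies[i] = np.sum(spec[band]) / max(lam, 1e-12)           # normalized by lambda

    M = int(np.floor(0.7238 * P))                                    # bands common to 16-44.1 KHz
    logm = np.log(np.maximum(energies[:M], 1e-12))
    j = np.arange(1, M + 1)
    c = np.array([np.sqrt(2.0 / P) * np.sum(logm * np.cos(np.pi * i * (j - 0.5) / P))
                  for i in range(M)])                                # DCT of log energies, Eq. (18)
    return c[1:]                                                     # drop c_0 (noisy frame energy)

frame = np.random.randn(1764)                  # e.g. a 40 ms frame at 44.1 KHz
print(mfcc_frame(frame, 44100).shape)          # (17,) for P = 26, M = 18
```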
4.2.4. Key-Framing via MST Clustering
The number of audio frames is proportional to the duration of the audio clip, and once the AFeX operation is performed, this may result in a massive number of feature vectors, many of which are probably redundant, because the sounds within an audio clip are immensely repetitive and most of the time entirely alike. In order to achieve an efficient audio-based retrieval within an acceptable time, only the feature vectors of frames from different sounds should be stored for indexing purposes. This is indeed a similar situation to the visual feature extraction scheme, where only the visual feature vectors of the Key-Frames (KFs) are stored for indexing. There is, however, one difference: in the visual case the KFs are known before the feature extraction phase, but in the aural case there is no such physical "frame" structure and the audio is instead framed uniformly with some fixed duration, so we need to obtain the features of each frame beforehand in order to perform the Key-Frame analysis. This is why the AFeX operation is performed first (over the valid frames) and the audio KFs are extracted afterwards.
[Figure 32: An illustrative clustering scheme. The feature vectors of audio frames corresponding to similar sounds (e.g. 'S', 'p', 'ee', 'ch', 'b', 'a', 'L') form clusters over the minimum spanning tree.]
In order to achieve an efficient KF extraction, the audio frames which have similar sounds (and therefore, similar feature vectors) should first be clustered, and one or more frames from each cluster should be chosen as KFs. An illustrative example is shown in Figure 32. Here the problem is to determine the number of clusters that should be extracted over a particular clip. This number will in fact vary with the content of the audio. For instance, a monolog speech will have fewer KFs than an action movie. For this we define the KF rate, that is, the ratio of the number of KFs to the total number of valid frames within a certain audio class type. Once a practical KF rate is set, the number of clusters can easily be calculated, and eventually this number will be proportional to the duration of the clip. However, longer clips have a higher chance of bearing similar sounds; especially if the content is mostly speech, similar sounds (vowels and unvoiced parts) will be repeated over time. Therefore, the KF rate can be set dynamically via an empirical Key-Framing model, which is shown in Figure 33.
Figure 33: KF Rate (%) Plots
Once the number of KFs (KFno) is set, the audio frames are clustered using the Minimum Spanning Tree (MST) clustering technique. Every node in the MST is the feature vector of a unique audio frame and the distance between the nodes is calculated using the AFeX module's AFeX_GetDistance() function. Once the MST is formed, the KFno−1 longest branches are broken and, as a result, KFno clusters are obtained. By taking one (i.e. the first) frame from each cluster as a KF, the feature vectors of the KFs are then used for indexing.
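A hedged sketch of this key-framing step is given below; a plain Euclidean distance stands in for the AFeX module's AFeX_GetDistance() function, and Prim's algorithm is used to build the MST. Both are implementation choices of the sketch rather than of MUVIS.

```python
import numpy as np

def extract_keyframes(features, kf_no):
    """Pick kf_no key-frame indices by MST clustering of frame feature vectors."""
    n = len(features)
    dist = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)

    # Prim's algorithm: grow the MST from node 0 and record its n-1 edges.
    in_tree = {0}
    best = {v: (dist[0, v], 0) for v in range(1, n)}   # cheapest link into the tree
    edges = []                                         # (weight, u, v)
    while len(in_tree) < n:
        v = min(best, key=lambda t: best[t][0])
        w, u = best.pop(v)
        edges.append((w, u, v))
        in_tree.add(v)
        for t in best:
            if dist[v, t] < best[t][0]:
                best[t] = (dist[v, t], v)

    # Break the (kf_no - 1) longest branches: keep only the n - kf_no shortest edges.
    edges.sort(key=lambda e: e[0])
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for _, u, v in edges[: n - kf_no]:
        parent[find(u)] = find(v)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(min(c) for c in clusters.values())   # first frame of each cluster as KF

feats = np.random.rand(50, 12)                         # e.g. 50 frames of 12-dim MFCC vectors
print(extract_keyframes(feats, kf_no=5))               # indices of 5 key-frames
```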
4.3. AUDIO RETRIEVAL SCHEME
As explained in the previous sections, the audio part of any multimedia item within a MUVIS
database is indexed using one or more AFeX modules that are dynamically linked to the MUVIS application. The indexing scheme uses the audio classification per segment information
to improve the effectiveness in such a way that during an audio-based query scheme, the
matching (same audio class types) audio frames will be compared with each other via the
similarity measurement.
In order to accomplish an audio based query within MUVIS, an audio clip is chosen
from a multimedia database and queried through the database if at least one or more audio
features are extracted for the database. Let NoS be the number of feature sets existing in a database and let NoF(s) be the number of sub-features per feature set, where 0 ≤ s < NoS. As mentioned before, sub-features are obtained by changing the AFeX module parameters or the audio frame size during the audio feature extraction process. Let the similarity distance function be SD(x(s,f), y(s,f)), where x and y are the associated feature vectors of feature index s and sub-feature index f. Let i be the index of the audio frames within class C_q of the queried clip. Due to the aforementioned reasons, the similarity distance is only calculated between a sub-feature vector of this frame (i.e. QFV_i^{C_q}(s,f)) and an audio frame (index j) of the same class type from a clip (index c) within the database. For all the frames that have the same class type (∀j ⇒ j ∈ C_q), the one audio frame which gives the minimum distance to audio frame i in the queried clip is found (D_i(s,f)) and used for the calculation of the total sub-feature similarity distance (D(s,f)) between the two clips. Therefore, the particular frames and sections of the query audio are only compared with their corresponding (matching) frames and sections of a clip in the database, and this internal search then provides the necessary retrieval robustness against abrupt content variations within the audio clips and particularly their indefinite durations. Figure 34 illustrates the class matching and minimum distance search mechanisms during the similarity distance calculations per sub-feature.
Furthermore, two factors should be applied during the calculation of D ( s , f ) in order to
achieve unbiased and robust results:
• Penalization: If no audio frames with class type C_q can be found in clip c, then a penalization is applied during the calculation of D(s,f). Let N_Q(s,f) be the number of valid frames in the queried clip and let N_Q^∅(s,f) be the number of frames that are not included in the calculation of the total sub-feature similarity distance due to the mismatches of their class types. Let N_Q^Θ(s,f) be the number of the remaining frames, which will all be used in the calculation of the total sub-feature similarity distance. Therefore, N_Q(s,f) = N_Q^∅(s,f) + N_Q^Θ(s,f), and the class mismatch penalization can be formulated as follows:

$P_Q^C(s,f) = 1 + \frac{N_Q^{\varnothing}(s,f)}{N_Q(s,f)}$   (24)
If all the class types of the queried clip match the class types of the database clip c, then N_Q^∅(s,f) = 0 ⇒ P_Q^C(s,f) = 1, and in this case naturally no penalization is applied in the calculation of D(s,f).
• Normalization: Due to the possible variation of the audio frame duration for a sub-feature, the number of frames having a certain class type might change, and this results in a biased (frame-count dependent) sub-feature similarity distance calculation. In order to prevent this, D(s,f) should be normalized by the total number of frames for each sub-feature (N_Q(s,f)). This yields a normalized D(s,f) calculation, which is nothing but the sub-feature similarity distance per audio frame. Since the audio feature vectors are normalized, the total query similarity distance (QD_c) between the queried clip and clip c in the database is calculated as a weighted sum. The weights W(s,f) per sub-feature f of a feature set s can be used for experimentation in order to find an optimum merging scheme for the audio features available in the database. The following equation formalizes the calculation of QD_c.
$D_i(s,f) = \begin{cases} \min_{j \in C_q} \left[ SD\!\left( QFV_i^{C_q}(s,f),\, DFV_j^{C_q}(s,f) \right) \right] & \text{if } j \in C_q \\ 0 & \text{if } j \notin C_q \end{cases}$

$D(s,f) = \frac{P_Q^C(s,f)}{N_Q(s,f)} \cdot \sum_{C_q} \sum_{i \in C_q} D_i(s,f)$

$QD_c = \sum_{s}^{NoS} \sum_{f}^{NoF(s)} W(s,f) \cdot D(s,f)$   (25)
The calculation of QD_c as in equation (25) is only valid if there is at least one matching class type between the queried clip and the database clip c. If no matching class types exist, then QD_c is assigned as ∞ and hence the clip will be placed among the least matching clips (at the end of the query ranking list). This is an expected outcome, since the two clips have nothing in common with respect to a high-level content analysis, i.e. their audio class types per segment do not match.
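The following sketch illustrates the class-matched distance calculation of (24)-(25) for one database clip; the index layout (a dictionary mapping class types to lists of KF feature vectors per sub-feature) and the Euclidean stand-in for SD() are assumptions of the sketch, not the MUVIS implementation.

```python
import numpy as np

def sub_feature_distance(query_idx, db_idx):
    """D(s,f) between the queried clip and one database clip for one sub-feature.

    query_idx / db_idx: {class_type: list of KF feature vectors} (assumed layout)."""
    n_q = sum(len(v) for v in query_idx.values())            # N_Q(s,f)
    n_mismatch = 0                                            # N_Q^0(s,f)
    total = 0.0
    for cls, q_frames in query_idx.items():
        d_frames = db_idx.get(cls)
        if not d_frames:
            n_mismatch += len(q_frames)                       # class missing in this clip
            continue
        d = np.asarray(d_frames)
        for q in q_frames:                                    # D_i(s,f): closest frame of
            total += np.min(np.linalg.norm(d - np.asarray(q), axis=1))   # the matching class
    if n_mismatch == n_q:
        return float("inf")                                   # no matching class type at all
    penalty = 1.0 + n_mismatch / n_q                          # Eq. (24)
    return penalty * total / n_q                              # per-frame normalized, Eq. (25)

def query_distance(query, clip, weights):
    """QD_c: weighted sum of D(s,f) over all sub-features, Eq. (25)."""
    return sum(weights[sf] * sub_feature_distance(query[sf], clip[sf]) for sf in query)
```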
Figure 34: A class-based audio query illustration showing the distance calculation per audio frame.
4.4. EXPERIMENTAL RESULTS
All the sample MUVIS databases were presented in sub-section 2.1.4. For the experiments in
this section the following MUVIS databases are mainly used among them: Open Video, and
Real World Audio/Video databases.
All experiments are carried out on a Pentium-4 3.06 GHz computer with 2048 MB
memory. All the sample MUVIS databases are indexed aurally using MFCC AFeX module
and visually using color (HSV and YUV color histograms), texture (Gabor [40] and GLCM
[49]) and edge direction (Canny [13]) FeX modules. The feature vectors are unit normalized
and equal weights are used for merging sub-features from each of the FeX modules while calculating total (dis-) similarity distance. During the encoding and capturing phases, the acoustic parameters, codecs and the sound volume are kept varying among the potential values
given in Table I. Furthermore, the clips in both databases have varying durations, between 30 seconds and 3 minutes. The evaluation of the performance is carried out subjectively, using only the samples containing unique content. In other words, if there is any subjective ambiguity in the result, such as a significant doubt about the relevancy of any of the audio-based retrieval results from an aural (or a visual) query, then that sample experiment is simply discarded from the evaluation. Therefore, the experimental results presented in this section depend only on decisive subjective evaluation via ground truth, and yet they are meant to be evaluator-independent (i.e. the same subjective decisions are guaranteed to be made by different evaluators).
For the analytical notion of performance along with the subjective evaluation, we used
the traditional PR (Precision-Recall) performance metric measured under relevant (and unbiased) conditions, notably using the aforementioned ground truth methodology. Note that recall, R, and precision, P, are defined as:
$R = \frac{RR}{TR} \quad \text{and} \quad P = \frac{RR}{N}$   (26)
where RR is the number of relevant items retrieved (i.e. correct matches) among total number
of relevant items, TR. N is the total number of items (relevant + irrelevant) retrieved. For
practical considerations, we fixed N as 12. Recall is usually used in conjunction with the precision, which measures the fractional precision (accuracy) within retrieval and both can often
be traded-off (i.e. one can achieve high precision versus low recall rate or vice versa.).
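As a small worked example of (26), assuming a query with TR = 9 relevant items in the database and RR = 8 of them retrieved among the fixed N = 12 (the values are illustrative only):

```python
RR, TR, N = 8, 9, 12                            # illustrative values only
recall, precision = RR / TR, RR / N             # Eq. (26)
print(round(recall, 2), round(precision, 2))    # 0.89 0.67
```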
This section is organized as follows: First the effect of classification and segmentation
(Step 1) over the total (indexing and) retrieval performance will be examined in the next subsection. Afterwards, a more generic performance evaluation will be realized based on various
aural retrieval experiments and especially the aural retrieval performance will be compared
with the visual counterpart in an analytical and subjective (via visual inspection) way.
4.4.1. Classification and Segmentation Effect on Overall Performance
Several experiments are carried out in order to assess the performance effects of the audio
classification and segmentation algorithm. The sample databases are indexed with and without the presence of audio classification and segmentation scheme, which is basically a matter
of including/excluding Step-1 (the classification and segmentation module) from the indexing
scheme. Extended experiments on audio based multimedia query retrievals using the audio
classification and segmentation during the indexing and retrieval stages show that significant
gain is achieved due to filtering the perceptually relevant audio content from a semantic point
of view. The improvements in the retrieval process can be described based on each of the following factors:
4.4.1.1 Accuracy
Since only multimedia clips containing matching (same) audio content are compared with each other (i.e. speech with speech, music with music, etc.) during the query process, the probability of erroneous retrievals is reduced. The accuracy improvements are observed
within a 0-35% range for the average retrieval precision. A typical PR curve for an audio-based retrieval of a 2-minute multimedia clip bearing pure speech content within the Real World database is shown in Figure 35. Note that in the left part of Figure 35, 8 relevant clips are retrieved within 12 retrievals via the classification and segmentation based retrieval. Without classification and segmentation, one relevant item is clearly missed on the right side.
[Figure 35: PR curves of an aural retrieval example within the Real World database indexed with (left) and without (right) using the classification and segmentation algorithm.]
4.4.1.2 Speed
The elimination of silent parts from the indexing scheme reduces the amount of data for indexing and retrieval and hence improves the overall retrieval speed. Moreover, the filtering of
irrelevant (different) class types during the retrieval process significantly improves the speed
by reducing the CPU time needed for similarity distance measurements and the sorting process afterwards. In order to verify this expectation experimentally and obtain a range for speed
improvement, we have performed several aural queries on the Real World database indexed with and without the application of the classification and segmentation algorithm (i.e. Step-1 in the indexing scheme). Among these retrievals we have chosen 10 that have the same precision level, in order to have an unbiased measure. Table VIII presents the total retrieval time (the time elapsed from the moment the user initiates an aural query until the query is completed and the results are displayed on the screen) for both cases. As a result, the query speed improvements are observed within a 7-60% range whilst keeping the same retrieval precision level.
Table VIII: QTT (Query Total Time) in seconds of 10 aural retrieval examples from Real World database.

Aural Retrieval No.                              1        2        3        4        5        6        7        8        9        10
QTT (without classification and segmentation)   47.437   28.282   42.453   42.703   43.844   42.687   46.782   45.814   44.406   41.5
QTT (with classification and segmentation)      30.078   26.266   19.64    39.141   18.016   16.671   31.312   30.578   20.006   37.39
QTT Reduction (%)                                36.59    7.12     53.73    8.34     58.9     60.94    33.06    33.25    54.94    9.90
4.4.1.3 Disk Storage
Less data needs to be computed and recorded for the audio descriptors, for the same reason given before. Furthermore, the silent parts are totally discarded from the indexing structure. Yet it is difficult to give an exact analytical figure showing how much disk space can be saved by performing the classification and segmentation based indexing, because this eventually depends on the content itself and particularly the amount of silent parts that the database items contain. A direct comparison between the audio descriptor file sizes of the same databases indexed with and without the proposed method shows that a reduction of above 30% can be obtained.
4.4.2. Experiments on Audio-Based Multimedia Indexing and Retrieval
For analytic evaluation, 10 aural and visual QBE (Query by Example) retrievals are performed according to the experimental conditions explained earlier. We only consider the first
12 retrievals (i.e. N=12) and both precision and recall values are given in Table IX and Table
X.
Table IX: PR values of 10 Aural/Visual Retrieval (via QBE) Experiments in Open Video Database.

Query No:             1      2      3      4      5      6      7      8      9      10
Visual  Precision     0.66   0.75   0.25   0.25   0.66   1      0.58   1      0.83   1
        Recall        0.66   0.75   0.33   0.25   0.8    1      0.58   1      0.83   1
Aural   Precision     1      1      1      1      0.83   1      1      0.8    1      1
        Recall        1      1      1      1      1      1      1      0.8    1      1
Table X: PR values of 10 Aural/Visual Retrieval (via QBE) Experiments in Real World Database.

Query No:             1      2      3      4      5      6      7      8      9       10
Visual  Precision     1      0.25   1      0.25   0.83   0.33   1      0.41   0.08    0.16
        Recall        1      0.5    1      0.25   1      0.33   1      0.41   0.125   0.66
Aural   Precision     1      0.5    1      0.5    0.83   0.75   1      0.75   0.25    0.25
        Recall        1      1      1      0.5    1      0.75   1      0.75   0.375   1
As the PR results clearly indicate in Table IX and Table X, in almost all the retrieval
experiments performed, the aural queries achieved “equal or better” performance than their
visual counterpart although there is only one feature (MFCC) used as the aural descriptor
against a “good” blend of several visual features.
Figure 36 shows three sample retrievals via visual and aural queries from Open Video
database using MUVIS MBrowser application. The query (the first key-frame in the clip) is
shown on the top-left side of each figure. The first example (a) is a documentary about a sharpshooting competition and hunting in the USA. The audio is mostly speech with occasional music and environmental noise (fuzzy). Among 12 retrievals, the visual query (left) retrieved
three relevant clips (P=R=0.25) whereas the aural query retrieved all relevant ones (i.e.
P=R=1.0). The second (b) example is a cartoon with several characters and the aural content
is mostly speech (dialogs between the cartoon characters) with long intervals of music. It can
be easily seen that within 12 retrievals, the visual query (left) retrieved three relevant clips
among 9 (P=0.25 and R=0.33) whereas the aural query retrieved all relevant ones within the
first 9 ranks (i.e. P=R=1.0). Finally, the third example (c) is a commercial with an audio content that is speech with music in the background (fuzzy). Similarly, it is obvious that among 12
retrievals, the visual query retrieved 9 relevant clips (i.e. P=R=0.75) whereas the aural query
retrieved all of them (i.e. P=R=1.0). All three examples show that the aural query can outperform the visual counterpart especially when there is significant variation in the visual scenery, lighting conditions, background or object motions, camera effects, etc., whereas the audio has the advantage of being usually stable and unique with respect to the content.
[Figure 36 (a)-(c): Three visual (left) and aural (right) retrievals in Open Video database. The top-left clip is the query.]
Chapter 5
Progressive Query: A Novel Retrieval Scheme
One of the challenges in the development of a content-based multimedia indexing and retrieval application is to achieve an efficient retrieval scheme. Developers and users who are accustomed to making queries and thus retrieving any multimedia item from a large-scale database can be frustrated by the long query times and by the memory and minimum system requirements. This chapter is about a novel multimedia retrieval technique, called Progressive Query (PQ), which is designed to bring an effective solution especially for querying large multimedia databases and furthermore to provide periodic query retrievals along with the ongoing query process. In this way it achieves a series of sub-query results that finally converge to the full-scale search retrieval, faster and with no minimum system requirements. In order to present the problems encountered and the major limitations in the multimedia retrieval area, a generic overview of the traditional indexing and retrieval methods will be given in the next section. Afterwards, the generic PQ design philosophy and the implementation details will be presented in Section 5.2. A new multi-thread programming approach, which achieves High Precision in the periodicity of the PQ retrievals and is therefore called HP PQ, will be introduced in Section 5.3. Finally, the experimental results, the performance evaluation and conclusive remarks about PQ are reported in Section 5.4.
5.1. QUERY TECHNIQUES - AN OVERVIEW
The usual approach for indexing is to map database primitives into some high-dimensional vector space, the so-called feature domain. The feature domain may consist of several types of features such as visual, aural, motion, etc., as long as the database contains items from which those particular features can be extracted. Among the many variations, careful selection of the feature sets allows capturing the semantics of the database items. Especially for large-scale multimedia databases, the number of features extracted from the raw data is often kept large due to the naïve expectation that it helps to capture the semantics better. Content-based
similarity between two database items can then be assumed to correspond to the (dis-) similarity distance of their feature vectors. Henceforth, the retrieval of similar database items with respect to a given query (item) can be transformed into the problem of finding the database items whose feature vectors are close to the query feature vector. This is the so-called query-by-example (QBE), which is one of the most common retrieval schemes. The basic QBE operation is the so-called Normal Query (NQ), which works as follows: using the available aural or visual features (or both) of the queried multimedia item (i.e. an image, a video clip, an audio clip, etc.) and all the database items, the similarity distances are calculated and then merged to obtain a unique similarity distance per database item. Ranking the items according to their similarity distances (to the queried item) over the entire database yields the query result.
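A minimal sketch of such a Normal Query is given below; the dictionary-of-arrays feature storage, the Euclidean distances and the equal weights are assumptions made for illustration only.

```python
import numpy as np

def normal_query(query_features, db_features, weights):
    """Exhaustive QBE: one merged similarity distance per item, then a full ranking.

    query_features: {feature: vector}; db_features: {feature: (n_items, dim) array}."""
    n_items = len(next(iter(db_features.values())))
    scores = np.zeros(n_items)
    for feat, q in query_features.items():
        scores += weights[feat] * np.linalg.norm(db_features[feat] - q, axis=1)
    return np.argsort(scores)                    # item indices, best match first

db = {"color": np.random.rand(1000, 64), "texture": np.random.rand(1000, 48)}
q  = {"color": db["color"][42], "texture": db["texture"][42]}
print(normal_query(q, db, {"color": 0.5, "texture": 0.5})[:12])   # top-12 retrievals
```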
Such an exhaustive search for QBE is costly and CPU intensive, especially for large-scale multimedia databases, since the number of similarity distance calculations is proportional to the database size. This fact brought a need for indexing techniques that organize the database structure in such a way that the query time and the amount of I/O accesses can be reduced. During the past three decades, several indexing techniques have been proposed; many of them are covered in the next chapter. Along with these indexing techniques, certain query techniques are needed to speed up the QBE process. The most common query techniques, developed especially for those indexing techniques, are as follows:
• Range Queries: Given a query object, Q, and a maximum similarity distance range, ε, the range query selects all indexed database items, Q_i, such that SD(Q, Q_i) < ε.
• kNN Queries: Given a query object, Q, and an integer number k > 0, the kNN query selects the k database items which have the shortest similarity distance from Q (a small sketch of both selections follows this list).
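Assuming the similarity distances SD(Q, Q_i) have already been computed into an array, the two selections can be sketched as follows (the array-based interface is an assumption of the sketch):

```python
import numpy as np

def range_query(distances, eps):
    """All items whose similarity distance to Q is below eps."""
    return np.where(distances < eps)[0]

def knn_query(distances, k):
    """The k items with the shortest similarity distance to Q."""
    return np.argsort(distances)[:k]

distances = np.random.rand(1000)          # SD(Q, Q_i) for a 1000-item database
print(range_query(distances, 0.05))       # items within range eps = 0.05
print(knn_query(distances, 12))           # 12 nearest neighbours
```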
Unfortunately, both query techniques may fail to provide an efficient retrieval scheme from the user's point of view due to their strict parameter dependency. For instance, range queries require a distance parameter, ε, which the user may not be able to provide prior to a query process, since it is not obvious how to find a suitable range value if the database contains various types of features and feature subsets. Similarly, the parameter k in a kNN query may be hard to determine, since it can be too small in case the database can provide many more similar (relevant) items than required, or too big if the number of similar objects is only a small fraction of the required number k, which means unnecessary CPU time has been wasted for that query process. Both query techniques often require several trials to converge to a successful retrieval result, and this alone might remove the speed benefit of the underlying indexing scheme, if there is any.
As mentioned before, the other alternative is the so-called Normal Query (NQ), which
makes a sequential (exhaustive) search due to lack of an indexing scheme. NQ for QBE is
costly and CPU intensive especially for large-scale multimedia databases; however, it yields
such a final and decisive result that no further trials are needed. Still, all QBE alternatives
have some common drawbacks. First of all, the user has to wait until all (or some) of the similarity distances are calculated and the searched database items are ranked accordingly. Naturally, this might take a significant time if the database size (or k, ε) is large and the database
Progressive Query: A Novel Retrieval Scheme
83
contains a rich set of aural and visual features, which might further reduce the efficiency on
the indexing process. Furthermore, any abrupt stopping (i.e. manual stop by the user) during
the query process will cause total loss of retrieval information and essentially nothing can be
saved out of the query operation so far performed. In order to speed up the query process, it is
a common application design procedure to hold all features of database items into the system
memory first and then perform the calculations. Therefore, the growth in the size of the database and of the feature set will not only (proportionally) increase the query time (the time needed to complete a query) but may also increase the minimum system requirements, such as memory capacity and CPU power.
In order to eliminate such drawbacks and provide a faster query scheme, we develop a
novel retrieval scheme, the Progressive Query (PQ), which is implemented under the MUVIS system to provide a basis for multimedia retrieval and to test the performance of the technique. PQ is a retrieval (via QBE) technique which can be performed over databases
with or without the presence of an indexing structure. Therefore, it can be an alternative to
NQ where both produce (converge to) the same result at the end. When the database has an
indexing structure, PQ can replace kNN and range queries whenever a query path over which
PQ proceeds, can be formed. As its name implies, PQ provides intermediate query results
during the query process. The user may browse these results and may stop the ongoing query
in case the results obtained so far are satisfactory and hence no further time should unnecessarily be wasted. As expected, PQ and NQ will converge to the same (final) retrieval results at
the end. Furthermore, PQ may perform the overall query process faster (within a shorter total
query time) than NQ. Since PQ provides a series of intermediate results, each of which is obtained from a (smaller) sub-set within the database, the chance of retrieving relevant database items that would otherwise not be retrieved via NQ may be increased. Indeed, some experimental results show that it is quite probable to achieve even better retrieval performance within an intermediate sub-query than in the final query state.
It is a known fact that significant performance improvements of content-based multimedia retrieval systems can be achieved by using a technique known as Relevance Feedback
[16], [17], [56], which allows the user to rate (intermediary) retrieval results. This helps to
tune the ranking and retrieval parameters and hence yields a better retrieval result at the end.
The traditional query techniques addressed so far (NQ, kNN and range queries) allow such feedback only after the query is completed. Since PQ provides the user with intermediate results during the query process, relevance feedback may already be applied to these intermediate results, possibly yielding satisfactory retrieval results faster from the user's point of view.
5.2. PROGRESSIVE QUERY
The principal idea behind the design of PQ is to partition the database items into sub-sets within which individual (sub-) queries can be performed. A sub-query is thus a fractional query process that is performed over some sub-set of the database items. Once a sub-query is completed over a particular sub-set, its incremental retrieval results (belonging only to that sub-set) are fused (merged) with the last overall retrieval result to obtain a new overall retrieval result, which covers all the items the PQ operation has processed since the beginning of the operation. Note that this is a continuous operation, which proceeds incrementally, sub-set by sub-set, covering more and more groups of items within the database. Each time a new sub-query operation is completed, PQ updates the retrieval results presented to the user. Since the previous (overall) query results are re-used to obtain the next (overall) retrieval result via fusion, the time-consuming query operation is only performed over the (next) partitioned group of items instead of over all the items that PQ has covered so far.
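A conceptual sketch of this loop is shown below for the simplest case of fixed-size sequential sub-sets; the callable interfaces (sub_query, fuse, report) are placeholders for the operations described in this section, and the time-based (periodic) sizing of the sub-sets is introduced later in Section 5.2.1.

```python
# Conceptual sketch of the PQ loop over a database without an indexing structure:
# sub-sets are taken sequentially, each sub-query result is fused with the overall
# result, and every fused result is reported as an intermediate retrieval.

def progressive_query(query, database, subset_size, sub_query, fuse, report):
    overall = []                                    # current Progressive Sub-Query result
    for start in range(0, len(database), subset_size):
        subset = database[start:start + subset_size]
        partial = sub_query(query, subset)          # ranked results for this sub-set only
        overall = fuse(overall, partial)            # merge with the results covered so far
        report(overall)                             # intermediate retrieval shown to user
    return overall                                  # converges to the Normal Query result
```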
The order of the database items processed is a matter of the indexing structure of the
database. If the database is not indexed at all, simply a sequential or random order can be chosen. In case the database has an indexing structure, a query path can be formed in order to retrieve the most relevant items at the beginning during a PQ operation. Since there are various
indexing schemes addressed in the previous section, for the sake of simplicity, we shall first
explain the basics of PQ for a database with no indexing structure.
[Figure 37: Progressive Query Overview. The multimedia database is partitioned into sub-sets that are queried periodically (at t, 2t, 3t, 4t, ...); each periodic sub-query result is fused with the previous one to form the progressive sub-query results 1, 1+2, 1+2+3, ...]
Another important factor is to determine the size of each sub-set (i.e. the number of
items within a sub-set where sub-query operation is performed) that is most convenient from
the user’s point of view. A straightforward solution is to let the user fix the sub-set size (say
e.g. 25). This would mean that the user wants updates every time 25 items are covered during
the ongoing PQ operation. However, this also brings the problem of uncertainty because the
user cannot know how much time a sub-query will take beforehand since the sub-query time
will vary due to the factors such as the amount of features present in the database and the
speed of the computer where it is running, etc. So the PQ retrieval updates might be too fast
or too slow for the user. In order to avoid such uncertainties, the proposed PQ scheme is designed over periodic sub-queries as shown in Figure 37 with a user defined period value
(t = t_p). The period (time) is obviously a more natural choice, since the user can then expect the retrieval results to be updated every t_p seconds no matter what database is involved or what computer is used. Without loss of generality, in databases without an indexing structure, PQ is designed to perform the sub-set partitioning sequentially in a forward direction (i.e.
starting from the first item to the last one).
5.2.1. Periodic Sub-Query Formation
In order to achieve periodic sub-queries, we need to define the following additional sub-query
compositions.
5.2.1.1 Atomic Sub-Query
The smallest sub-set size on which a sub-query is performed. Here we assume that the atomic sub-query time is not significant compared to the periodic sub-query time. Atomic sub-queries are the only sub-query type that has a fixed sub-set size (S_ASQ). They are only used during the first periodic query, in order to provide an initial sub-query rate (t_r^0), that is, the time spent for the retrieval of a single database item, formulated as follows:

$t_r^0 = \frac{t_{ASQ}}{N_{ASQ}} \quad \text{if } N_{ASQ} > 0$   (27)
where t_ASQ is the total time spent for the atomic sub-query and N_ASQ is the number of database items which are involved (used) in the query operation. Without an indexing structure, note that 0 ≤ N_ASQ ≤ S_ASQ, since the initial database items might not belong to the ongoing query type. For example, in a multimedia database there might be video-only clips and audio-only clips (and clips with both media types). So, for a visual query, those audio-only clips will be totally discarded, and if the initial atomic query sub-set covers such audio-only clips then naturally N_ASQ ≤ S_ASQ. In case N_ASQ = 0, one or more atomic sub-queries have to be performed until we get a valid t_r^0 value (i.e. N_ASQ > 0).
5.2.1.2 Fractional Sub-Query
This can be any sub-query performed over a sub-set whose size is smaller than or equal to the sub-set size of the periodic sub-query. In other words, a fractional sub-query time might be less than or equal to a periodic sub-query time.
As explained earlier, periodic sub-queries are periodic over time and a mechanism is needed to ensure this periodicity. This mechanism works over atomic and fractional sub-queries; it performs the fusion operation over as many atomic and fractional sub-queries as necessary. First, it starts with an atomic sub-query to obtain a valid (initial) sub-query per-item time, t_r^0, and it keeps going with atomic queries until a valid t_r^0 value is obtained. Once it is obtained, one or more fractional sub-queries will be performed to complete the first periodic sub-query. The size of the fractional query (N_FSQ^0) can then be estimated as:

$N_{FSQ}^{0} \cong \frac{t_p^0}{t_r^0}$   (28)

where t_p^0 = t_p − t_Σ^a is the time left for completing the first periodic sub-query and t_Σ^a is the total time spent for all atomic queries so far performed. Afterwards, the fractional sub-query can be performed within a sub-set of N_FSQ items. Once the fractional sub-query is completed,
the total time (t_Σ) spent from the beginning of the operation until now is compared with the required time period of the q-th (so far q = 0) periodic query, t_p^q, where q is the periodic sub-query index. If the t_Σ value is not within a close neighborhood (i.e. δt_w < 0.5 sec.) of t_p^q (i.e. t_Σ < t_p^q − δt_w), then the operation continues with a new fractional sub-query until the condition is met. For the new fractional sub-query and for all the latter fractional sub-query operations, the t_r value is re-estimated (updated) from the former operations such as:

$N_{FSQ}^{i} = \frac{t_p^q - t_\Sigma}{t_r^{\,i}}, \qquad t_r^{\,i+1} = \frac{t_\Sigma}{\sum_{i \in FSQ} N_{FSQ}^{i}} \;\;\text{if}\;\; t_\Sigma < t_p^q - \delta t_w \;\text{and}\; \sum_{i \in FSQ} N_{FSQ}^{i} > 0$   (29)
Once one or more fractional sub-queries form the $q$th periodic sub-query, the next ($q+1$st) periodic sub-query is formed with an updated period value that compensates for the offset accumulated with respect to the required PQ period:

$$t_p^{q+1} = t_p + (t_p^q - t_\Sigma) \qquad (30)$$

where $t_p$ is the required period and $(t_p^q - t_\Sigma)$ is the offset time, i.e. the difference between the required period time for the $q$th periodic sub-query and the total (actual) time spent so far. The flowchart of the formation of a periodic sub-query is shown in Figure 38.
[Figure 38 flowchart: atomic sub-queries are repeated (only at the beginning of the first periodic sub-query) until $N_{ASQ} > 0$; then the size of the next fractional sub-query is calculated, the fractional sub-query is performed and fused, and this loop continues until $t_\Sigma > t_p - \delta t_w$, yielding the periodic sub-query with $t_\Sigma \cong t_p$.]
Figure 38: Formation of a Periodic Sub-Query.
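As a complement to the flowchart, the following minimal sketch shows how one periodic sub-query could be assembled in practice. It is only an illustration of equations (27)-(30), not the MUVIS implementation: the rank and periodic_sub_query names are hypothetical, items are reduced to scalar values, and the time bookkeeping is kept per period for brevity.

import time, heapq

def rank(query, sub_set):
    # Stand-in for a feature-based sub-query: rank the valid items of 'sub_set'
    # by their (dis-)similarity distance to 'query' (invalid items are None).
    valid = [x for x in sub_set if x is not None]
    return sorted((abs(x - query), x) for x in valid), len(valid)

def periodic_sub_query(query, items, start, t_p_q, t_p, S_ASQ=10, delta_tw=0.5):
    """Form the next periodic sub-query over items[start:] (cf. Figure 38).
    Returns (ranked sub-query result, next start index, next period t_p^{q+1})."""
    t0, results, pos, n_valid = time.time(), [], start, 0

    # Atomic sub-queries: repeat until at least one valid item is processed (Eq. 27).
    while n_valid == 0 and pos < len(items):
        ranked, n_valid = rank(query, items[pos:pos + S_ASQ])
        results = list(heapq.merge(results, ranked))        # sub-query fusion
        pos += S_ASQ
    t_r = max((time.time() - t0) / max(n_valid, 1), 1e-9)   # initial per-item rate

    # Fractional sub-queries until the period is (nearly) filled (Eqs. 28-29).
    while time.time() - t0 < t_p_q - delta_tw and pos < len(items):
        n_fsq = max(1, int((t_p_q - (time.time() - t0)) / t_r))
        ranked, n = rank(query, items[pos:pos + n_fsq])
        results = list(heapq.merge(results, ranked))
        pos, n_valid = pos + n_fsq, n_valid + n
        t_r = max((time.time() - t0) / max(n_valid, 1), 1e-9)  # updated rate (Eq. 29)

    # Remove the accumulated offset from the next period (Eq. 30).
    return results, pos, t_p + (t_p_q - (time.time() - t0))

A full PQ run would call this routine repeatedly, fusing each returned result with the previous PSQ retrieval, until the whole database (or query path) is consumed or the user stops the query.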
5.2.1.3 Sub-Query Fusion Operation
The overall PQ operation is carried out over Progressive Sub-Queries (PSQs). In principle, it
can be stated that PQ is a (periodic) series of PSQ results. As shown in Figure 37, a new PSQ
retrieval is formed each time a periodic sub-query is completed and then it is fused with the
previous PSQ retrieval result. Once a PSQ is realized, the results are shown to the user and
saved during the lifetime of the ongoing PQ so that the user can access them at any time. The
user is shown updated retrieval results each time a new PSQ is completed. The first PSQ is
the first periodic sub-query performed. The fusion operation is a process of fusing two sorted
sub-query results to achieve one (fused) sub-query result. Since both of the sub-query results
are already sorted with respect to the similarity distances, simply comparing the consecutive
items in each of the sub-query lists can perform the fusion operation. Let $n_1$ and $n_2$ be the number of items in the first and second sub-set, respectively. Since there are $n_1 + n_2$ items to be inserted into the fused list one at a time, the fusion operation can take at most $n_1 + n_2$ comparisons. This (worst) case occurs whenever the items from both lists are evenly interleaved with respect to each other. On the other hand, if the maximum-valued item (i.e. the last item) in the smaller list is less than the minimum-valued item (i.e. the first item) in the bigger list, the number of comparisons will not exceed the number of items in the smaller list: once all of its items are compared with the (smallest) first item in the bigger list and henceforth inserted into the fused list, no further comparisons are needed. Note that this is a direct consequence of the fact that both lists are sorted (from minimum to maximum) beforehand and one of them is now fully depleted. Therefore, the fusion operation takes a minimum of $\min(n_1, n_2)$ comparisons. A sample fusion operation is illustrated in Figure 39. Note that the subsets X and Y contain 12 and 6 items, respectively, and the fusion operation performs only 12 comparisons.
[Figure 39 data: X = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2} and Y = {0.01, 0.05, 0.15, 0.25, 0.55, 0.65} are fused into {0.01, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.4, 0.5, 0.55, 0.6, 0.65, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2}.]
Figure 39: A sample fusion operation between subsets X and Y.
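Because both inputs are already sorted, the fusion reduces to a single linear merge. The snippet below is a minimal sketch (operating on bare distance values rather than (distance, item) records) that reproduces the Figure 39 example and counts the comparisons.

def fuse(first, second):
    # Merge two ascending lists of similarity distances into one ascending list,
    # counting comparisons; min(n1, n2) <= comparisons <= n1 + n2.
    fused, i, j, comparisons = [], 0, 0, 0
    while i < len(first) and j < len(second):
        comparisons += 1
        if first[i] <= second[j]:
            fused.append(first[i]); i += 1
        else:
            fused.append(second[j]); j += 1
    # One list is depleted: append the remainder of the other without comparing.
    fused.extend(first[i:]); fused.extend(second[j:])
    return fused, comparisons

X = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2]
Y = [0.01, 0.05, 0.15, 0.25, 0.55, 0.65]
fused, n_cmp = fuse(X, Y)
print(n_cmp)   # 12 comparisons, as reported for Figure 39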
Since the fusion operation is nothing but merging two arbitrarily sized sub-sets (retrieval results), it can be applied during each phase of a PQ operation. For instance, the atomic
sub-queries are fused with fractional sub-queries and several fractional sub-queries are fused
to obtain a periodic sub-query. Fusing the periodic sub-query with the previous PSQ retrieval
produces a new PSQ retrieval, covering a larger part of the database. If the user does not stop
the ongoing PQ operation, it will eventually cover the entire database at the end and therefore,
it generates the overall retrieval result of the queried item. In this case PQ generates the exact
same retrieval result as NQ since both of them perform the same operation, i.e. searching a
queried item through the entire database and ranking the database items accordingly.
5.2.2. PQ in Indexed Databases
In the previous sections, PQ is presented over databases without an indexing structure and in
this context it is compared with a traditional query type, NQ. As a retrieval process, PQ can
also be performed over indexed databases as long as a query path can be formed over the
clusters (partitions) of the underlying indexing structure. Obviously, a query path is nothing but
a special sequence of the database items, and when the database lacks an indexing structure, it
can be formed in any convenient way such as sequentially (starting from the 1st item towards
the last one, or vice versa) or randomly. Otherwise, the most advantageous way to perform
PQ is to use the indexing information so that the most relevant items can be retrieved in earlier PSQ steps. Since the technical details of various available indexing schemes are beyond
the scope of this chapter, we only show the hypothetical formation of the query path and runtime evaluation of PQ over this path. In the next chapter, the real implementation of PQ over
an MAM-based indexing structure will be presented in detail.
Figure 40: Query path formation in a hypothetical indexing structure.
As briefly mentioned earlier, the primary objective of indexing structures is to partition the feature domain into (tree-based) clusters such that CPU time and I/O accesses are reduced via pruning of the redundant tree nodes. Figure 40 shows a hypothetical clustering
scheme and the formation of the query path over which PQ will proceed during its run-time.
This illustration shows 4 clusters (partitions or nodes), which contain a certain number of
items (features) and the query path is formed according to the relative (similarity) distance to
the queried item and its parent cluster. Therefore, PQ will give priority to cluster A (the
host), then B (the closest), C, D, etc. Note that the query path might differ from the final retrieval order depending on the accuracy of the indexing scheme. For instance, the query path gives priority to B2 over C4, but C4 may contain more relevant items (i.e. items more similar to the query item) than B2; when the retrieval results are formed, those items will eventually be ranked higher and presented earlier to the user by PQ.
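As a purely illustrative sketch of this idea, a query path could be formed by ordering a flat list of clusters according to the distance of their nucleus items to the query; the cluster dictionaries, the Euclidean distance and the form_query_path name below are assumptions, not the actual indexing code (real structures are hierarchical, as detailed in the next chapter).

import math

def distance(a, b):
    # Euclidean (dis-)similarity distance between two feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def form_query_path(query, clusters):
    """Order clusters by the distance of their nucleus to the query item and
    concatenate their members, yielding the query path over which PQ proceeds."""
    ordered = sorted(clusters, key=lambda c: distance(query, c["nucleus"]))
    path = []
    for cluster in ordered:          # closest cluster first: A, then B, C, D, ...
        path.extend(cluster["items"])
    return path

clusters = [
    {"nucleus": (0.9, 0.1), "items": ["C1", "C2"]},
    {"nucleus": (0.1, 0.2), "items": ["A1", "A2", "A3"]},
    {"nucleus": (0.3, 0.3), "items": ["B1", "B2"]},
]
print(form_query_path((0.1, 0.1), clusters))  # ['A1', 'A2', 'A3', 'B1', 'B2', 'C1', 'C2']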
At this point, one can implement two different approaches. The overall query path can be formed immediately after the query is initiated, and PQ then evolves over it with its regular periodic retrievals. This approach is only recommended for small and medium-sized databases where the complete query path formation takes insignificant time. Otherwise, the query path should be formed dynamically (incrementally) along with the PQ run-time process, and the time spent for it should be taken into account in the adaptive period calculation given in (30). In this case, the adaptive period calculation for the ($q+1$)st periodic sub-query should be reformulated as follows:
$$t_p^{q+1} = t_p + (t_p^q - t_\Sigma - t_{QP}^q) \qquad (31)$$

where $t_{QP}^q$ is the time spent for forming the query path during the formation of the $q$th periodic sub-query.
As a result, PQ in indexed databases makes more sense than being strictly dependent on an unknown parameter such as k in a kNN query or ε in a range query, which might cause a deficiency in the retrieval performance, such as the occasional need to perform multiple queries before arriving at a satisfactory result. On the other hand, there exists a certain degree of similarity (or analogy) between PQ and those conventional query techniques. For instance, each PSQ retrieval can be seen as a particular kNN (or range) query retrieval with only one difference: the parameter k (or ε) is not fixed beforehand but grows dynamically over time along with the lifetime of PQ, and the user has the opportunity to fix it (stop PQ) whenever satisfactory retrievals are obtained.
5.3. HIGH PRECISION PQ – THE NEW APPROACH
The PQ operation presented so far is designed as a single process (thread). For this reason it takes a pre-emptive approach to adaptively arranging the periodic sub-set sizes, and this can yield timing shifts of up to 10% from the required period value, $t_p$, which can be avoided by redesigning it on a parallel-processing basis. In order to achieve such high precision on the periodic PSQ retrievals, the entire PQ process is divided between two threads controlled with a timer. Figure 41 illustrates how the multi-threaded approach is integrated over the PQ parts to perform HP PQ. Two distinct threads are formed to perform the following tasks:
• Periodic Sub-Query Thread:
  – Load features from disc into memory.
  – Calculate similarity distance of each item with respect to query item.
  – Sort items within periodic sub-query set.
• PQ Formation Thread:
  – Suspend Periodic Sub-Query thread when timer kicks.
  – Apply sub-query fusion to form next PSQ.
  – Release Periodic Sub-Query thread.
  – Render next PSQ retrievals on screen.
So the main idea is to leave all the time-consuming tasks to the Periodic Sub-Query thread and keep the other (PQ Formation) thread in suspended mode until the timer kicks. Once the timer activates the PQ Formation thread, it can immediately form the PSQ retrieval with only a single fusion operation and render the retrieval results on the screen.
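A rough prototype of this two-thread arrangement is sketched below. It is hypothetical and heavily simplified (scalar distances, a plain lock standing in for the suspend/release mechanism, Python's cooperative threading instead of true parallelism) and is not the MUVIS implementation of HP PQ.

import threading, heapq, time

class HighPrecisionPQ:
    """Toy HP PQ: a worker thread ranks one sub-set at a time, and the timer-driven
    formation loop periodically fuses the pending results into the next PSQ."""
    def __init__(self, query, items, period=1.0, chunk=500):
        self.query, self.items = query, items
        self.period, self.chunk = period, chunk
        self.lock = threading.Lock()          # used to suspend/release the worker
        self.pending, self.psq, self.done = [], [], False

    def _worker(self):                        # Periodic Sub-Query thread
        for start in range(0, len(self.items), self.chunk):
            with self.lock:                   # yields to the formation loop on each tick
                ranked = sorted(abs(x - self.query)
                                for x in self.items[start:start + self.chunk])
                self.pending = list(heapq.merge(self.pending, ranked))
        self.done = True

    def run(self):                            # PQ Formation thread (timer-driven)
        worker = threading.Thread(target=self._worker)
        worker.start()
        while not self.done:
            time.sleep(self.period)           # wait for the next period tick
            with self.lock:                   # suspend the worker, fuse, release
                self.psq = list(heapq.merge(self.psq, self.pending))
                self.pending = []
            print("PSQ update, best distance so far:",
                  self.psq[0] if self.psq else None)
        worker.join()
        return list(heapq.merge(self.psq, self.pending))

# Example: final = HighPrecisionPQ(0.3, [i / 1e5 for i in range(200000)], period=0.5).run()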
[Figure 41 diagram: the Periodic Sub-Query thread processes Sub-Set 1, 2, 3, ..., N of the multimedia database; at each timer tick (t, 2t, 3t, 4t, ...) the PQ Formation thread fuses the new periodic sub-query result with the previous one, producing the progressive sub-query results 1, 1+2, 1+2+3, ....]
Figure 41: HP PQ Overview.
5.4. EXPERIMENTAL RESULTS
5.4.1. PQ in MUVIS
As mentioned in Chapter 2, MBrowser is the primary media browser and retrieval application into which the PQ technique is integrated as the primary retrieval (QBE) scheme. NQ is the alternative query scheme within MBrowser. Both PQ and NQ can be used for ranking the multimedia primitives with respect to their similarity to a queried media item (an audio/video clip,
a video frame or an image). In order to query an audio/video clip, it should first be appended
to a MUVIS database upon which the query will be performed. There is no such necessity for
images; any digital image (inclusive or exclusive to the active database) can be queried within
the active database. The similarity distances will be calculated by the particular functions,
each of which is implemented in the corresponding visual/aural feature extraction (FeX or
AFeX) modules.
[Figure 42 annotations: the queried MM clip, the PQ Browser Handle (10th PSQ currently active), the PQ Knob, the Prev/Next buttons and the first 12 retrieval results.]
Figure 42: MBrowser GUI showing a PQ operation where 10th PSQ is currently active (or set
manually).
As shown in Figure 42, the MBrowser GUI is designed to support all the functionalities that PQ provides. Once a MUVIS database is loaded into MBrowser, the user can browse
among the database items, choose any item and then initiate a query. The basic query parameters such as query type (PQ or NQ), query genre (aural or visual), PQ update period (time),
the (visual and aural) set of features and their individual parameters (i.e. feature weights), etc.
can be set prior to a query operation. When a (sub-) query is completed the retrieval results
are then presented to the user page by page. Each page renders 12 ranked results in the descending order (from left to right and from top to bottom) and the user can browse the pages
back and forth using the Next and Prev buttons on the bottom-right side of the figure (the first
page with 12-best retrieval results is shown on the right side of Figure 42). If NQ is chosen,
then the user has to wait till the whole process is completed but if PQ is chosen then the retrieval results will be updated periodically (with the user-defined period value) each time a
new PSQ is accomplished. The current PSQ number (10) is shown on the PQ Browser Handle
and this handle can also be used to browse manually among the retrieved PSQ results during
(or after) an ongoing PQ operation. In the snapshot shown in Figure 42, a video clip is chosen
within a MUVIS (video) database and visual PQ is performed. Currently the 1st page (12-best
results) of the 10th PSQ retrieval results is shown on the GUI window of MBrowser.
Four of the sample MUVIS databases that were presented in sub-section 2.1.4 are used in the experiments performed in this section: Open Video, Real World, Sports video and
Shape image databases. All experiments are carried out on a Pentium-4 3.06 GHz computer
with 2048 MB memory. In order to have unbiased comparisons between PQ and NQ, the experiments are performed using the same queried multimedia item with the same instance of
MBrowser application.
5.4.2. PQ versus NQ
As explained earlier, PQ and NQ eventually converge to the same retrieval result at the end.
Also in the abovementioned scenarios they are both designed to perform exhaustive search
over the entire database within MUVIS. However, PQ has several advantages over NQ in the
following aspects:
• System Memory Requirement: In a NQ operation, the memory requirement is proportional to the database size and the number of features present. Due to the partitioning of the database into sub-sets, PQ reduces the memory requirement roughly by a factor of the number of PSQ operations performed. After each periodic sub-query operation, the memory used for the feature vectors of that sub-set is no longer needed and can be reused for the next periodic sub-query. Figure 43 illustrates the memory usage of the retrieval example shown in Figure 45, performed first by PQ and afterwards by NQ.
Figure 43: Memory usage for PQ and NQ.
We also observe that, especially in very large-scale databases containing a rich set of features, the memory required by NQ might exceed the system memory. Two possible outcomes may then occur. The operating system may handle it by using virtual memory (i.e. disc) if the excess memory requirement is not too high; in this case, the operational speed of the NQ operation will be drastically degraded and PQ can outperform NQ several times over with respect to the overall retrieval time. The other possibility is that the NQ operation cannot be completed at all, since the system is not capable of providing the excessive memory required by NQ; in this case PQ is the only feasible query operation.
• “Earlier and Better” Retrieval Results: Along with the ongoing process, PQ allows intermediate query results (PSQ steps), which might sometimes show equal or ‘even better’ performance than the final (overall) retrieval result, as illustrated by the typical examples given in Figure 44 and Figure 45.
In Figure 44, an image retrieval example within the Shape database via PQ, using the Canny [13] Edge Histogram feature, is shown. We use $t_p = 0.2$ sec and the PQ operation is completed in three PSQ series (i.e. PQ #1, #2 and #3). This is one particular example in which an intermediate PSQ retrieval yields a better performance than the final PQ retrieval (which is the same as the retrieval result of NQ). In this example, the first 12-best retrieval results in PQ #1 are obviously better (more relevant in terms of shape similarity to the queried shape) than the ones in PQ #3 (NQ).
In Figure 45, a video retrieval example within the Real World database via aural PQ, using MFCC (Mel-Frequency Cepstral Coefficients) as the audio features, is shown. We use $t_p = 5$ sec and the PQ operation is completed in 12 PSQ series, but only three PSQ retrievals (i.e. PQ #1, #6 and #12) are shown. Note that PQ #6 and the later retrieval results up to PQ #12 are identical, which means that the PQ operation produces the final retrieval result (which NQ would produce) in an earlier (intermediate) PSQ retrieval.
Such “earlier and even better” retrieval results can be explained by the fact that searching for an item in a smaller data set usually yields better (detection or retrieval) performance than searching in a larger set. This is obviously an advantage for PQ, since it proceeds within sub-queries performed over (smaller) sub-sets whereas NQ always has to proceed through the entire database. Furthermore, for databases that are not indexed, such as in the examples given, this basically means that the relevant items happen to fall within the part of the database covered by the earlier PSQ steps of the ongoing PQ operation. When the database has a solid indexing structure and a query path can be formed according to the relevancy of the queried
item, the user eventually gets relevant retrieval results in a fraction of the time that is needed
for a typical NQ operation.
Figure 44: PQ retrievals of a query image (left-top) within three PSQs. $t_p = 0.2$ sec.
Figure 45: Aural PQ retrievals of a video clip (left-top) in 12 PSQs (only the 1st, 6th and 12th are shown). $t_p = 5$ sec.
• Query Accessibility: This is the major advantage that PQ provides. Stopping an ongoing query operation is an important capability from the user's point of view. As shown in Figure 42, by pressing the PQ Knob during an ongoing PQ operation, the user can stop it at any time (i.e. when the results obtained so far are satisfactory). Of course, NQ can also be stopped, but no retrieval result will be available afterwards since operations such as the similarity distance calculations or the sorting are likely not completed.
Another important accessibility option that PQ offers is the so-called PSQ Browsing. Whether a PQ is stopped abruptly or completed at the end, the user can still browse among the PSQ retrievals and visit any retrieval page of a particular PSQ, since the retrieval results stay alive unless a new PQ is initiated or the application is terminated. This is obviously a significant requirement especially when better results are obtained in an earlier PSQ step than in the later ones, as mentioned before. On the other hand, this might still be a desirable option even if the earlier PSQ results are not better than, but merely comparable to, the later ones, and this could still be relevant to the user. One particular example is shown in Figure 46: a video retrieval example from the Open Video database via visual PQ using several color (YUV, HSV, etc.), texture (GLCM [49]) and shape (Canny Edge [13] Histogram) features. We use $t_p = 3$ sec and the PQ operation is completed in 4 PSQ series. Note that in this particular example a retrieval performance evaluation is difficult to accomplish among the 4 PSQ retrievals, since their relevancy to the queried item is subjective.
[Figure 46 annotations: PSQ arrival times of approximately t = 2.884 s (PQ #1), t = 5.893 s (PQ #2), t = 9.342 s (PQ #3) and t = 10.099 s (PQ #4).]
Figure 46: Visual PQ retrievals of a video clip (left-top) in 4 PSQs. $t_p = 3$ sec.
The most important accessibility advantage that PQ can provide is that it can further
improve the efficiency of any relevance feedback mechanism in certain ways. An ordinary
relevance feedback technique works as follows: the user initiates a query and after the query
is completed, the user gives some feedback to the system about the retrieval results according
to their relevancy (and/or irrelevancy) with respect to the queried item. Afterwards, a new
query is initiated in order to get better retrieval results and this might be repeated iteratively
until satisfactory results are obtained. This is an especially time-consuming process since at each iteration the user has to wait until the query operation is completed. Due to the enhanced accessibility options that PQ provides, significant improvements can be achieved for the user with the following scenarios. First, the user can give relevance (and irrelevance) feedback during the query process, and the incoming progressive retrievals can thus be tuned progressively; that is, during an ongoing query process the user can provide one or more relevance feedbacks at any time (within the lifetime of PQ). Alternatively, the user can stop an ongoing PQ, provide relevance feedback with respect to the (intermediate) retrievals via PSQ Browsing, and re-initiate a new (fine-tuned) PQ. Basically, any relevance feedback technique can be applied along with PQ since in both scenarios PQ only provides the necessary basis for the (user) accessibility to employ the relevance feedback but otherwise stays independent of the internal structure of any individual technique employed.
• Overall Retrieval Time (Query Speed): The overall query time is the time elapsed
from the beginning of the query to the end of the operation. For NQ, this is obviously the total
time from the moment the user initiates the query until the results are ranked and displayed on
the screen. However, for PQ, since the retrieval is a continuous process with PSQ series, the
overall retrieval means that PQ proceeds over the entire database and its process is finally
completed. As mentioned earlier, at this point both PQ and NQ should generate identical retrieval results for a particular queried item.
There are basically three major processes in NQ: Loading database features (from the
disc) to the system memory, calculating the (dis-) similarity distances from the features and
sorting (ranking) the database items according to their similarity distances. When PQ performs overall retrieval, the first two processes will take exactly the same time but the sorting
will be faster due to the following fact. Let $n$ be the number of items in the database; if, for example, Quick Sort is applied, the number of comparisons will be $O(n \log n)$ in the average case and $O(n^2)$ in the worst case. Assume that we perform PQ in only two PSQ series, and let $n_1$ be the number of items in the first sub-set and $n_2$ the number of items in the second one, where $n = n_1 + n_2$. In both the average and the worst-case scenario,

$$O(n_1^2) + O(n_2^2) < O(n^2) \quad \textrm{(worst case)}$$
$$O(n_1 \log n_1) + O(n_2 \log n_2) < O(n \log n) \quad \textrm{(avg. case)} \qquad (32)$$
Here we did not take into account the time spent for the fusion operation, since in the worst case it requires only $n = n_1 + n_2$ comparisons, an $O(n)$ operation, which is negligible. So PQ effectively applies a faster sorting scheme, especially if the worst-case scenario is considered.
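As a hypothetical worked example of (32): for a database of $n = 10{,}000$ items split into two equal sub-sets ($n_1 = n_2 = 5{,}000$), the worst-case sorting cost drops from $n^2 = 10^8$ to $n_1^2 + n_2^2 = 5 \times 10^7$ comparisons (a factor of two), while the average case improves only marginally, from $n \log_2 n \approx 1.33 \times 10^5$ to $2\,n_1 \log_2 n_1 \approx 1.23 \times 10^5$ comparisons; the $O(n)$ fusion cost of at most $10^4$ comparisons is negligible next to either figure.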
It can be shown by deduction that the PQ time becomes slightly faster as the number of PSQ operations increases (i.e. with a smaller sub-set size or $t_p$ value). In order to verify this, several aural PQ retrieval experiments in the Real World database have been performed with different
$t_p$ values. In order to get an unbiased measure, the experiments for each $t_p$ value are repeated 5 times and the median of the 5 overall retrieval times is taken into account. PQ total
execution time (overall retrieval time) and the number of PSQ updates are plotted in Figure
47. The same experiment is repeated for visual PQ operation and the result is shown in Figure
48. Note that if PQ is completed with only one PSQ, then it basically performs a NQ operation and therefore NQ retrieval time can also be examined in both figures.
As experimentally verified, PQ's overall retrieval time is 0–25% faster than that of NQ (depending on the number of PSQ series), provided that the NQ memory requirement does not exceed the system memory.
Figure 47: Aural PQ Overall Retrieval Time and PSQ Number vs. PQ Period.
Figure 48: Visual PQ Overall Retrieval Time and PSQ Number vs. PQ Period.
It is also observed that the actual PSQ retrieval times are generally in a close neighborhood of the $t_p$ (user-defined period) value. One typical example showing PSQ arrival times for the
PQ example shown in Figure 45 is plotted in Figure 49.
Figure 49: PSQ and PQ retrieval times for the sample retrieval example given in Figure 45.
5.4.3. PQ versus HP PQ
The experimental results so far presented are all for the single-thread PQ implementation.
However, most of the results, such as memory usage, speed and overall retrieval time, are also valid for the HP PQ scheme. The outcome of the new implementation scheme, HP PQ, differs only in the timing accuracy of the PSQ retrieval times. As expected, HP PQ provides precise timing (periodicity) of the PSQ retrievals, as can be seen in the comparative retrieval plot in Figure 50.
Figure 50: PSQ retrieval times for the single-threaded and multi-threaded (HP) PQ schemes. $t_p = 5$ sec.
5.4.4. Remarks and Evaluation
In accordance with the experimental results presented so far, the following conclusive remarks can be made about the innovative properties of PQ:
• PQ is an efficient retrieval technique (via QBE), which works with both indexed and
non-indexed databases.
• In this context, it is the only query method that may provide “faster” retrievals without requiring any special “indexing” structure.
• In the same context, it is the only query method that provides a “browsing” capability between instances (PSQs) of the ongoing query. The user can browse the PSQ retrievals at any time, i.e. during or after the query process.
• In databases without an indexing structure, it achieves several improvements such as relaxed system requirements (in terms of memory, CPU power, etc.), “early and even better” retrieval results, user-friendly query accessibility options (i.e. PQ can be stopped when satisfactory results are obtained, PSQ retrievals can be browsed at any time, etc.), reduced overall timing (in case PQ is completed), etc. As mentioned earlier, for some large-scale databases it is the “only” feasible query process, whereas NQ might yield an infeasible waiting period or require excessive memory.
• It can also be applied efficiently to indexed databases (to get the most relevant results the earliest) and in this case it shows a “dynamic kNN query” behavior where k increases with time; hence the user has the advantage of assigning it by seeing (and judging) the results. This is obviously a significant advantage with respect to traditional kNN or range queries since the user cannot know a “good” k value (or range value) beforehand and these values depend directly on the content distribution of the database and the relevancy of the queried item.
• The most important advantage above all is that it provides continuous user interaction with the ongoing query operation. The user can see the results obtained so far, immediately evaluate them and feed “relevance feedback” into the system, or simply waste no time once satisfactory results are obtained (query stop).
• Finally, in addition to all the aforementioned advantages and performance improvements that PQ provides, PQ does not impose any significant drawbacks on the system or the user. This means there is no practical cost to using PQ.
In the next chapter, an efficient implementation of PQ along with an indexing scheme
will be presented and therefore, the theoretical expectation about the earliest retrievals of the
most relevant items will be verified accordingly.
Chapter 6
A Novel Indexing Scheme: Hierarchical Cellular Tree
Especially for the management of large multimedia databases, there are certain requirements that should be fulfilled in order to improve the retrieval feasibility and efficiency.
Above all, such databases need to be indexed in some way and traditional methods are no
longer adequate. It is clear that the nature of the search mechanism is influenced heavily by
the underlying architecture and indexing system employed by the database. Therefore, this
chapter addresses this problem and presents a novel indexing technique, Hierarchical Cellular Tree, which is designed to bring an effective solution especially for indexing multimedia
databases and furthermore to provide an enhanced browsing capability, which enables the user to
make a guided tour within the database. A pre-emptive cell search mechanism is introduced in
order to prevent the corruption of large multimedia item collections due to the misleading
item insertions, which might occur otherwise. In addition, similar items are focused within appropriate cellular structures, which are subject to mitosis operations when dissimilarity emerges as a result of irrelevant item insertions. Mitosis operations ensure that the cells are kept in a focused and compact form, yet the cells can grow to any size as long as the compactness prevails. The proposed indexing scheme is then optimized for the query method proposed earlier, the Progressive Query, in order to maximize the retrieval efficiency from the user's point of view. In order to provide a better understanding of the concept of indexing for multimedia databases, the next section is devoted to an overview of the traditional indexing structures, their limitations and drawbacks. The philosophy and the design fundamentals behind the proposed HCT structure can then be introduced, especially by focusing the attention on a particular indexing structure, the M-tree, which is the most promising indexing structure published so far. Afterwards, the generic HCT architecture and implementation details are introduced in Section 6.2. Section 6.3 is devoted to the PQ operation over the HCT indexing structure. A novel browsing scheme, HCT Browsing, is introduced in Section 6.4. Section 6.5 reports the experimental results along with some example demonstrations and accordingly presents the conclusive remarks.
6.1. DATABASE INDEXING METHODS – AN OVERVIEW
During the last decade, several content-based indexing and retrieval techniques and applications have been developed, such as the MUVIS system [P4], [P6], [43], Photobook [50], VisualSEEk [63], Virage [75], and VideoQ [15], all of which are designed to bring a framework structure for handling, and especially retrieving, multimedia items such as digital images, audio and/or video clips. As explained in the previous chapter, database primitives are
mapped into some high dimensional feature domain, which may consist of several types of
features such as visual, aural, etc. as long as the database contains such items from which
those particular features can be extracted. A particular feature set models the contents of the
multimedia item into a set of semantic attributes which can then be managed and processed by
conventional database management systems. In this way the content-based similarity between
two database items can be estimated by calculating the (dis-) similarity distance of their feature vectors. Henceforth, content-based similarity retrieval according to a query item can be
carried out by similarity measurements, which produce a ranking order of similar multimedia
items within the database. This is the general query-by-example (QBE) scenario, however it is
also costly and CPU intensive especially for large multimedia databases since the number of
similarity distance calculations is proportional to the database size. This fact brought a need
for indexing techniques, which will organize the database structure in such a way that the
query time and I/O access amount could be reduced. For the past three decades, researchers
proposed several indexing techniques that are formed mostly in a hierarchical tree structure
that is used to cluster (or partition) the feature space. Initial attempts such as KD-Trees [4]
used space-partitioning methods that divide the feature space into predefined hyperplanes regardless of the distribution of the feature vectors. Each region is mutually disjoint and their
union covers the entire space. In R-tree [22] the feature space is divided according to the distribution of the database items and the region overlapping can be introduced as a result. Both
KD-tree and R-tree are the first examples of Spatial Access Methods (SAMs). Afterwards
several enhanced SAMs have been proposed. The R*-tree [3] provides consistently better performance than the R-tree and R+-tree [61] by introducing a policy called “forced reinsert”.
R*-tree also improves the node splitting policy of the R-tree by taking overlapping area and
region parameters into consideration. Lin et al. proposed TV-tree [45], which uses so-called
telescope vectors. These vectors can be dynamically shortened assuming that only dimensions
with high variance are important for the query process and therefore low variance dimensions
can be neglected. Berchtold et al. [5] introduced X-tree, which is particularly designed for indexing higher dimensional data. X-tree avoids overlapping of region bounding boxes in the
directory structure by using a new organization of the directory and, as a result, X-tree outperforms both the TV-tree and the R*-tree significantly. It is up to 450 times faster than the R-tree and between 4 and 12 times faster than the TV-tree when the dimension is higher than two, and it also provides faster insertion times. Still, bounding rectangles can overlap in higher dimensions. In order to
prevent this, White and Jain proposed the SS-tree [73], an alternative to R-tree structure,
which uses minimum bounding spheres instead of rectangles. Even though SS-tree outperforms R*-tree, the overlapping in the high dimensions still occurs. Thereafter, several other
SAM variants are proposed such as SR-tree [31], S²-Tree [71], Hybrid-Tree [14], A-tree [57],
IQ-tree [5], Pyramid Tree [6], NB-tree [20], etc. Especially for content-based indexing and
retrieval in large-scale multimedia databases, SAMs have several drawbacks and significant
limitations. By definition an SAM-based indexing scheme partitions and works over a single
feature space. However, a multimedia database can have several feature types (visual, aural,
etc.), each of which might also have multiple feature subsets. Furthermore, SAMs assume that
query operation time and complexity are only related to accessing a disk page (I/O access
time) containing the feature vector. This is obviously not a trivial assumption for multimedia
databases and, consequently, no attempt has been made in the design of SAMs to reduce the
similarity distance computations (CPU time). In order to provide a more general approach to
similarity indexing for multimedia databases, several efficient Metric Access Methods
(MAMs) are proposed. The generality of MAMs comes from the fact that any MAM employs
the indexing process by assuming only the availability of a similarity distance function, which
satisfies three trivial rules: symmetry, non-negativity and triangular inequality. Therefore, a
multimedia database might have several feature types along with various numbers of feature
sub-sets, all of which are in different multi-dimensional feature spaces. As long as such a similarity distance function, which is usually treated as a “black box” by the underlying MAM, exists, the database can be indexed by any MAM. Several MAMs have been proposed so far. Yianilos [76] presented the vp-tree, which is based on partitioning the feature vectors (data points) into two groups according to their similarity distances with respect to a reference point, the so-called vantage point. Bozkaya and Ozsoyoglu [9] proposed an extension of the vp-tree, the so-called mvp-tree (multiple vantage point tree), which basically assigns m vantage points to a node with a fan-out of $m^2$. They reported a 20% to 80% reduction in similarity distance computations compared to vp-trees.
Brin [11] introduced the Geometric Near-Neighbor Access Tree (GNAT) indexing structure, which chooses k split points at the top level; each of the remaining feature vectors is associated with the closest split point. GNAT is then built recursively and the parameter k is chosen to be a different value for each feature set depending on its cardinality. The MAMs addressed so far present several shortcomings. Contrary to SAMs, these metric trees are designed only to reduce the number of similarity distance computations, paying
no attention to I/O costs (disk page accesses). They are also intrinsically static methods in the
sense that the tree structure is built once and new insertions are not supported. Furthermore,
all of them build the indexing structure from top to bottom and hence the resulting tree is not
guaranteed to be balanced. Ciaccia et al. [19] proposed M-tree to overcome such problems.
M-tree is a balanced and dynamic tree, which is built from bottom to top, creating a new root
level only when necessary. The node size is a fixed number, M, and therefore, the tree height
depends on M and the database size. Its performance optimization concerns both CPU computational time for similarity distances and I/O costs for disk page accesses for feature vectors of
the database items. Recently, Traina et al. [67] proposed the Slim-tree, an enhanced variant of the M-tree, which is designed to improve performance by minimizing the overlaps between nodes. They introduced two factors, the “fat-factor” and the “bloat-factor”, to measure the degree of
overlap, and proposed the usage of a Minimum Spanning Tree (MST) for splitting the node. Another slightly enhanced M-tree structure, the so-called M+-tree, can be found in [79].
In summary, the indexing structures addressed so far are all designed to speed up
any QBE process by using some multidimensional index structure. However, all of them have
significant drawbacks for the indexing of large-scale multimedia databases. As mentioned before, SAMs are, by nature, not suitable for this purpose since their design concerns only single
feature space partitioning; whereas, the query process should be performed over several features and feature sub-sets extracted for the proper indexing of the multimedia database. The
static MAMs so far addressed do not support dynamic changes (new insertions or deletions);
whereas this is an essential requirement during the incremental construction of the database.
Even though the M-tree and its variants provide dynamic database access, the incremental construction of the indexing tree could lead, depending on the order of the objects, to significantly varying performances during the querying phase. Moreover, MAM performance also deteriorates significantly with an increasing number of database items, and the choice of the pre-fixed node capacity, M, affects the tree structure and hence the indexing performance. So far, all indexing methods (MAMs and SAMs), while providing good results in low-dimensional feature spaces (i.e. d < 10 for SAMs), do not scale up well to high-dimensional spaces due to the phenomenon known as “the curse of dimensionality”. Recent studies [72] show that most of the indexing schemes even become less efficient than sequential (exhaustive) search for high dimensions. Such degradations and shortcomings prevent a widespread usage of such indexing
structures especially on multimedia collections.
Furthermore, in multimedia databases, the discrimination power of the visual and aural descriptors (features) is quite limited. They have significant drawbacks in terms of representing content similarity (or providing a reliable similarity measure). This is a major problem if a dynamic indexing algorithm such as the M-tree relies on a clustering scheme that depends on assigning a single nucleus item (the routing object) and then grouping a set of similar items around it. No matter how accurately a nucleus item is chosen, due to the aforementioned unreliable and highly variant descriptors (features), irrelevant (dissimilar) items can end up in a cluster or an insufficient number of similar items can be clustered together. In the mvp-tree, this problem is addressed by introducing multiple vantage (nucleus) items.
In order to overcome the problems and provide efficient solutions to the aforementioned shortcomings, especially for multimedia databases, we develop a MAM-based, dynamic and self-organized indexing scheme, the Hierarchical Cellular Tree (HCT). As its name
implies, HCT has a hierarchic structure, which is formed into one or more levels. Each level is
capable of holding one or more “cells”. The cell structure is nothing but the counterpart of the “node” structure in the M-tree. The reason we use a different name is that the cells further contain a tree structure, a Minimum Spanning Tree (MST), whose nodes refer to the database objects (their database representations and basically their descriptors). Among all available indexing structures, the M-tree shows the highest structural similarity to HCT, such
as:
• Both indexing schemes are MAM-based and have a similar hierarchical structure, i.e. levels.
• They are both created dynamically, from bottom to top. The tree grows one level upwards when the number of cells becomes two in the top level (due to a mitosis operation).
• A similar concept of representing each cell in the lower level with a nucleus (routing) object in the higher level.
However, there are several major differences in their design philosophies and objectives, such as:
• M-tree is a generic MAM, designed to achieve a balanced tree with a low I/O cost in a
large data set. HCT is on the other hand designed for indexing multimedia databases where
the content variation is seldom balanced and it is especially optimized for compactness
(highly focused cells).
• M-tree works over nodes with a maximum (fixed size) capacity, M. Therefore, the
performance depends on a “good” choice of this parameter with respect to the database size
and thus, M-tree construction significantly varies with it. Especially for multimedia databases
the database size is dynamic and unknown most of the time. Furthermore, the content variation among the database items is quite volatile and unknown beforehand, too. There might be a group of similar items whose number exceeds several multiples of M, and hence too many nodes will unnecessarily be used to represent them. So with a static M setting, there is a danger of a saturated number of nodes representing a group of similar items and, therefore, of significant indexing degradations due to excessively crowded levels and an unnecessarily long M-tree height. Another potential danger in such circumstances is losing a minority cluster of (similar) objects due to the excessive domination of a large number of objects with similar content. Such minor clusters will therefore lose their chance of representation on the higher levels and this will cause misleading insertions of similar items. This is the main cause of corruption due to the “crowd effect” in large databases. HCT, on the other hand, has no limit on the cell size as long as the cell keeps a definite “compactness” measure. So HCT will not drastically suffer from the “crowd effect” and the resultant corruption, since it clusters similar objects into one (or a minimum number of) cell(s) and hence provides an equal chance of representation on the higher levels for both minor and major groups of items.
• In the M-tree, the cell compactness is measured only with respect to the distance from the routing (nucleus) object to the farthest object, the so-called covering radius. Due to the aforementioned unreliability of such a single measure for the cell compactness, HCT uses all items and their minimum distances to the cell (instead of a single nucleus item alone) to come up with a regularization function that represents a dynamic model of the cell compactness. During the lifetime of the HCT body (i.e. with incoming item insertions, removals, internal transfers, and other events), this function dynamically updates the current cell compactness feature, which is then compared with a certain statistically driven level threshold value to decide whether or not the cell should be split (mitosis).
• The split policies and objectives are also different between M-tree and HCT. First of
all, the M-tree performs a split operation only when the cell size reaches M, without paying attention to the current status (i.e. compactness) of the cell. For the split operation, the M-tree first tries to find suitable nucleus (routing) objects and then forms the child cells around them. There are several methods for promoting those nucleus objects; the one used to preserve compactness, the so-called m_RAD (minimum sum of RADii) algorithm, is also the most complex one, requiring $O(N^2)$ distance computations within the node each time a split occurs. Once the nucleus objects are found, there are two distribution alternatives for the formation of the child nodes: the Generalized Hyperplane method [19] is used to optimize the compactness by assigning the objects to the nearest nucleus (routing) object, whereas the Balanced method is used to obtain child nodes that are as balanced as possible. On the other hand, HCT performs a mitosis (split) operation over a cell only when the cell has reached a certain degree of maturity and only if the current compactness feature indicates that the cell should undergo a mitosis operation to preserve the compactness level that the cell's (owner) level requires. Therefore, mitosis is one of the major operations for preserving/enhancing the overall compactness. Similar to natural mitosis occurring in organic cells, the most sparse (dissimilar) object or group of objects is detached from the other group and thus more and more similar groups are kept within the same cell, which provides an increasing similarity focus (compactness) in the long term. HCT first performs the mitosis operation to split the cell into two child cells and afterwards assigns the most suitable nucleus items for them accordingly. Due to the presence of the MST formed within each cell, there is no extra cost for the mitosis operation, since the MST is used to decide at which branch the partition should be executed.
• Although both indexing structures are built dynamically by incremental (one by one)
item insertions, performed with a cell search from top to bottom, the insertion processes differ
significantly in terms of cell search operations. The M-tree insertion operation is based on the Most-Similar Nucleus (MS-Nucleus) cell search, which relies on a simple heuristic assuming that the closest nucleus item (aka “routing object”) yields the best sub-tree during the descent and finally the best (target) cell to which the item is appended. In this chapter, we will show that this is not always a valid assumption and is therefore a potential cause of corruption, since it can lead to sub-optimum insertions, especially for large databases, due to the “crowd effect”. Furthermore, the incremental construction of an M-tree could lead to different structures depending on the order of item insertions. HCT is designed to perform an optimum search for the target cell to which the incoming item should belong. This search, the so-called Pre-emptive cell search, verifies during the descent, at each level, all the paths that are likely to yield a better nucleus item (and hence a better cell one level lower) in an iterative way. In this way, along with the mitosis operation, this search algorithm further improves the cell compactness at each level.
• The M-tree has a conservative structure that might cause degradations in due time. For example, the cell nucleus (routing object) is not changed after an insertion or removal operation even though another item might now be a more suitable candidate for being the cell nucleus and hence a better representative of that cell on the higher level. Another example is the static allocation of new cell nuclei after a mitosis operation; the new cell nuclei are always assigned to their parent's owner cell in the higher level. On the contrary, HCT has a totally dynamic approach. Any operation (insertion, removal or mitosis) can change the current cell nucleus to a new (better) one, in which case the old nucleus is removed and the new one is inserted into the most suitable cell in the upper level – not necessarily into the owner cell of the old one, but into the optimum one that can be found at that instance of the HCT body. Similarly, when mitosis occurs within a cell at a certain level, the old (parent) nucleus item is removed from its owner cell and instead two new nucleus items are inserted into the higher level independently (i.e. the old parent nucleus has no effect on this). Via such dynamic updates towards “best possible” assignments, and by further applying the Pre-emptive cell search algorithm for item insertions, which is generic for any level, corruption can therefore be avoided and the HCT body is continuously kept intact.
Along with the indexing techniques addressed so far, certain query techniques have to
be used to speed up a QBE process within indexed databases. The most common query techniques are kNN and range queries, which are explicitly introduced in Section 5.1, along with
their limitations and drawbacks. A simple yet efficient retrieval scheme, the Progressive
Query (PQ), is then proposed. PQ is a retrieval (via query) technique, which can be performed over databases with or without the presence of an indexing structure. Therefore, it can
be an alternative to Normal Query (NQ) with the exhaustive search where both of them produce (converge to) the same result at the end. When the database has an indexing structure,
PQ can replace kNN and range queries whenever a Query Path (QP) over which PQ proceeds,
can be formed. Instead of relying on some unknown parameter such as k (the number of relevant items for kNN) or ε (the range value for range queries), PQ provides periodic query results (with a user-defined time period) along with the query process, lets the user browse the retrievals obtained, and allows the ongoing query to be stopped in case the results obtained so far are satisfactory, so that no further time is unnecessarily wasted. Therefore, the proposed (HCT) indexing technique has been designed to work in harmony with PQ in order to evaluate the retrieval performance in the end, i.e. how fast the most relevant items can be retrieved or how efficiently HCT can provide a QP for a particular query item.
6.2. HCT FUNDAMENTALS
HCT is a dynamic, cell–based and hierarchically structured indexing method, which is purposefully designed for PQ operations and advanced browsing capabilities within large-scale
multimedia databases. It is mainly a hierarchical clustering method where the items are partitioned depending on their relative distances and stored within cells on the basis of their similarity proximity. The similarity distance function implementation is a black box for HCT.
Furthermore, HCT is a self-organized tree, which is implemented by genetic programming
principles. This basically means that the operations are not externally controlled; instead, each operation such as item insertion, removal, mitosis, etc. is carried out according to some internal rules within a certain level, and its outcome may uncontrollably initiate some other operations on the other levels. Yet all such “reactions” are bound to terminate in a limited time; that is, for any action (i.e. an item insertion), its consequent reactions cannot last indefinitely, due to the fact that each of them can occur only in a higher level and any HCT body naturally has a limited number of levels. In the following sub-sections, we will detail the basic structural
components of the HCT body and then explain the indexing operations in an algorithmic way.
6.2.1. Cell Structure
A cell is the basic container structure in which similar database items are stored. Ground-level cells contain all the database items. Each cell further carries an MST whose nodes span the cell items. This internal MST stores the minimum (dis-)similarity distance of each individual item to the rest of the items in the cell. So this scheme resembles the mvp-tree [9]
structure; however, instead of using some (pre-fixed) number of items, all cell items are now
used as vantage points for any (other) cell item. These item-cell distance statistics are mainly
used to extract the cell compactness feature. In this way we can have a better idea about the
similarity proximity of any item instead of comparing it only with a single item (i.e. the cell
nucleus) and hence a better compactness feature. The compactness feature calculation is also
a black-box implementation and we use a regularization function obtained from the statistical
analysis using the MST and some cell data. This dynamic feature can then be used to decide
whether or not to perform mitosis within a cell at any instant. If permission for mitosis is
granted, the MST is again used to decide where the partition should occur and the longest
branch is a natural choice. Thus an optimum decision can be made to enhance the overall
compactness of the cell with no additional computational cost. Furthermore, the MST is used
to find out an optimum nucleus item after any operation is completed within the cell.
In HCT, the cell size is kept flexible, which means there is no fixed cell size that cannot be exceeded. However, there is a maturity concept for the cells in order to prevent a mitosis operation before the cell reaches a certain level of maturity. Otherwise, we cannot obtain reliable information about whether or not the cell is ready for mitosis, since there is simply not enough statistical data gathered from the cell items and its MST. Therefore, using an argument similar to that for organic cells, a maturity cell size (e.g. 6) is set for all the cells in the HCT body (level independent). Note that this should not be confused with the parameter M of the M-tree, where M is used to enforce mitosis for a cell of size M no matter what the cell condition (i.e. compactness) is; M is the maximum size that a cell can have. In our case, however, we set the minimum size of a cell as a pre-requisite condition for a mitosis operation. This is not a critical parameter: it neither affects the overall performance of HCT nor needs to be proportional to the database size or any other parameter, as is the case in an M-tree.
6.2.1.1 MST Formation
A Minimum Spanning Tree (MST) connects all nodes of a graph with the minimum total branch weight. However, note that this does not mean that all paths in it are minimal. The MST is therefore the (minimal) set of branches required to connect the graph. A further constraint is that the MST should contain no cycles.
There are several MST construction algorithms, such as Kruskal's [60] and Prim's [54]. These are, however, static algorithms; that is, all items with their relative (similarity) distances with respect to each other should be known beforehand. The construction of an MST requires $O(N^2)$ computations, where $N$ is the number of items. In our case, the cell, and hence its MST, should be constructed dynamically, since items can be inserted at any time during the lifetime of the HCT body, and it would be infeasible to re-construct the MST from scratch each time a new item is inserted into a particular cell, as this would require $O(N^3)$ computations. To avoid this, we modify the traditional (static) MST construction into a dynamic one using the following DynamicMST algorithm. Let item-I be the next item to be inserted into the MST. The DynamicMST algorithm can then be expressed as follows:
DynamicMST (item-I)
  - Create a new MST node for item-I: node-I.
  - Extract the distance table between node-I and the existing MST nodes.
  - Find the closest node to node-I, node-C, and connect the two nodes with a branch.
  - Create an array for connected nodes, ArrayCN[], and put node-C into it.
  - CheckBranches(ArrayCN[]).

CheckBranches (ArrayCN[])
  - For all the nodes in ArrayCN (say node-C) do:
      - For all the nodes that node-C is connected to (say node-j) do:
          - Create a candidate branch between node-j and node-I: branch(node-I, node-j).
          - If |branch(node-C, node-j)| > |branch(node-I, node-j)| then do:
              - Delete branch(node-C, node-j) from the MST.
              - Insert branch(node-I, node-j) into the MST.
              - Insert node-j into ArrayCN.
          - Else do:
              - Delete branch(node-I, node-j).
      - End loop.
  - End loop.
  - If ArrayCN is empty then return; else CheckBranches(ArrayCN[]).
The function CheckBranches checks all the branches of the nodes stacked in ArrayCN and, if any particular branch (to a particular node) gives a longer distance than the possible (candidate) branch between that particular node and node-I, then that longer (existing) branch is cut and replaced by the candidate branch. All the nodes that are connected to node-I in such a fashion are then put into the array ArrayCN and CheckBranches is called again. The operation continues recursively until all the nodes in ArrayCN are consumed (i.e. the size of the array is 0).
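For illustration only, the following Python sketch mirrors the DynamicMST / CheckBranches idea; the adjacency-dictionary MST representation and the dist() function are assumptions made for the sketch and are not part of MUVIS.

def dynamic_mst_insert(mst, dist, new_item):
    # mst: adjacency dict {item: {neighbour: branch_weight}} (assumed representation)
    if not mst:                                 # empty cell: first node, no branches
        mst[new_item] = {}
        return
    d = {node: dist(new_item, node) for node in mst}   # distance table to all nodes
    closest = min(d, key=d.get)                 # connect to the closest node first
    mst[new_item] = {closest: d[closest]}
    mst[closest][new_item] = d[closest]
    pending = [closest]                         # ArrayCN: nodes whose branches are checked
    while pending:
        node_c = pending.pop()
        for node_j, w in list(mst[node_c].items()):
            if node_j == new_item:
                continue
            if w > d[node_j]:                   # candidate branch via the new node is shorter
                del mst[node_c][node_j]
                del mst[node_j][node_c]
                mst[new_item][node_j] = d[node_j]
                mst[node_j][new_item] = d[node_j]
                pending.append(node_j)          # re-check the branches of node_j

The stack replaces the recursion of CheckBranches; only the N distances to the new item are computed, so the cost per insertion stays linear in the cell size.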
By means of the proposed dynamic insertion technique, the computational cost is still O(N^2) and the existing MST is re-used and updated only whenever necessary. A sample dynamic insertion operation is illustrated in Figure 51.
Figure 51: A sample dynamic item (5) insertion into a 4-node MST.
6.2.1.2 Cell Nucleus
The cell nucleus is the item which represents the owner cell on the higher level(s). Since, during the top-down cell search for an item insertion, these nucleus items are used to decide the cell into which the item should be inserted, it is essential to promote the best item for this representation at any instant. When there is only one item in the cell, it is obviously the nucleus item of that cell; otherwise, the nucleus item is assigned using the cell MST as the item having the maximum number of branches (connections to other items). This heuristic makes sense since the nucleus is then the item to which the largest number of other items have their closest (minimal) connection. Contrary to the static nucleus assignment of some other MAM-based indexing schemes such as M-tree, the cell nucleus is dynamically verified and, if necessary, updated whenever an operation is performed over the cell in order to maintain the best representation of the cell; thanks to the MST, there is no additional computational cost for doing so. For example, in Figure 51 the nucleus is item 2 before the insertion since it is the only node which has two branches (connections). However, after the insertion, item 5 should be promoted as the new nucleus item since it now has more branches than any other item within the cell.
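As a small illustrative sketch over the same adjacency-dictionary MST representation assumed above (not an actual MUVIS routine), the nucleus re-assignment reduces to picking the node with the largest number of MST branches:

def select_nucleus(mst):
    # Return the item with the maximum number of MST branches (connections);
    # ties are broken arbitrarily, and a single-item cell trivially returns its only item.
    return max(mst, key=lambda item: len(mst[item]))

Applied to the example of Figure 51, this rule would promote item 5 right after the insertion, exactly as described above.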
6.2.1.3 Cell Compactness Feature
As mentioned earlier, this is the feature that represents the compactness of the cell items, i.e. how tight (focused) the clustering of the items within the cell is. Furthermore, the implementation of the regularization function used to calculate the cell compactness feature is a black box for HCT. In this sub-section we present the parameters of the function used in the experiments that are reported in this thesis.
A similar argument can be made for the extraction of the cell compactness feature. Instead of using only the distance values of all the items in the cell with respect to the nucleus item, it is more reliable and convenient to use the (minimum) distance of each item with respect to the rest of the items in the cell, which is basically nothing but the branch weights of the cell MST. Once a cell reaches maturity (a pre-requisite for the compactness feature calculation), a regularization function, f, can be expressed using the following statistical cell parameters:
CF_C = f(μ_C, σ_C, r_C, max(w_C), N_C) ≥ 0        (33)

where μ_C and σ_C are the mean and standard deviation of the MST branch weights, w_C, of the cell C, r_C is the covering radius, that is, the distance from the nucleus within which all the cell items lie, and N_C (> N_M for a mature cell) is the number of items in cell C. A compact cell is obtained if all these parameters are minimized. Accordingly, the regularization function should be implemented such that the compactness feature, CF_C, decreases as these parameters decrease. In the limit, the highest compactness is achieved when CF_C = 0, which means that all cell items are identical.
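As an illustration, a minimal Python sketch of this computation is given below. It assumes the cell statistics are available as a list of MST branch weights plus the items' distances to the nucleus, and it uses the product-type regularization function later reported for the experiments (Eq. (35)); the exact form of f should therefore be read as one possible choice, not a fixed part of HCT.

import math

def cell_compactness(branch_weights, nucleus_distances, K=1000.0):
    # One possible regularization f(mu_C, sigma_C, r_C, max(w_C), N_C):
    # smaller values mean a more compact (focused) cell, and the value is 0
    # only when all cell items are identical.
    n_c = len(nucleus_distances)                 # number of items in the cell
    if not branch_weights:                       # single-item cell: trivially compact
        return 0.0
    mu_c = sum(branch_weights) / len(branch_weights)
    var = sum((w - mu_c) ** 2 for w in branch_weights) / len(branch_weights)
    sigma_c = math.sqrt(var)
    r_c = max(nucleus_distances)                 # covering radius of the cell
    w_max = max(branch_weights)
    # product form of Eq. (35); sqrt(N_C) allows larger cells only when their
    # items are proportionally more similar
    return K * mu_c * sigma_c * r_c * w_max * math.sqrt(n_c)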
Similar to the continuous updates of the nucleus item, the CF_C value is also updated each time an operation is performed over the cell. The new CF_C value is then compared with the current level compactness threshold, CThr_L, which is dynamically calculated within each level; if the cell is mature and not compact enough, i.e. CF_C > CThr_L, mitosis is granted for that cell. The dynamic calculation of CThr_L for level L will be explained in the next section.
6.2.1.4 Cell Mitosis
As explained earlier, there are two conditions necessary for a mitosis operation: maturity, i.e., N_C > N_M, and insufficient cell compactness, i.e., CF_C > CThr_L. Both conditions are checked after an operation (e.g. item insertion or removal) occurs within the cell in order to signal a mitosis operation.
Due to the presence of the MST within each cell, mitosis has no computational cost in terms of similarity distance calculations. The cell is simply split by breaking the longest branch in the MST and each of the newborn child cells is formed from one of the MST partitions. A sample mitosis operation is shown in Figure 52.
Figure 52: A sample mitosis operation over a mature cell C.
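A minimal sketch of the split itself, again over the adjacency-dictionary MST assumed in the earlier sketches, is shown below; it cuts the longest branch and collects the two resulting partitions by a plain traversal, with no additional similarity-distance computations.

def mitosis(mst):
    # Split a (mature, at least 2-item) cell MST into two child partitions by
    # breaking its longest branch; returns the two item sets.
    u, v = max(((a, b) for a in mst for b in mst[a]),
               key=lambda e: mst[e[0]][e[1]])     # longest branch (u, v)
    del mst[u][v]
    del mst[v][u]                                 # break the longest branch
    def partition(start):                         # collect all nodes reachable from start
        seen, stack = {start}, [start]
        while stack:
            for nxt in mst[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen
    return partition(u), partition(v)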
6.2.2. Level Structure
The HCT body is hierarchically partitioned into one or more levels, as in the sample shown in Figure 53. In this example there are three levels that are used to index 18 items. Apart from the top level, each level contains a varying number of cells that are created by mitosis operations occurring at that level. The top level contains a single cell; when this cell splits, a new top level holding a single new cell is created above it. As mentioned earlier, the nucleus item of each cell in a particular level is represented on the higher level.
Figure 53: A sample 3-level HCT body.
Each level is responsible for keeping logs of the operations performed on it, such as the number of mitosis operations, the statistics about the compactness feature of the cells, etc.
Note that each level dynamically tries to maximize the compactness of its cells, although this is not a straightforward process since the incoming items may not show any similarity to the items already present in the cells; such dissimilar item insertions therefore cause a temporary degradation of the overall (average) compactness of the level. So each level, while analyzing the effects of the (recent) incoming items on the overall level compactness, should take the necessary management steps towards improving compactness in due time (i.e. with future insertions). Within a period of time (i.e. during a number of insertions, or after a number of mitosis operations), each level updates its compactness threshold according to the compactness feature statistics of the mature cells into which an item was inserted. Therefore, the CThr_L value for a particular level L can be estimated as follows:
CThr_L = (k_0 / P) · Σ_{C ∈ S_P, N_C > N_M} CF_C = k_0 · μ_{CF_C},  ∀ C ∈ S_P        (34)
where S_P is the set of mature cells upon which the last P insertions were performed and 0 < k_0 ≤ 1 is the compactness enhancement rate, which determines how much enhancement is targeted for the next P insertions beginning from the moment of the latest CThr_L setting. If k_0 = 1, the aim is to keep the current level of compactness intact, so no enhancement is targeted for the future insertions. On the other hand, in the limit k_0 → 0, the cells split each time they reach maturity and in this case the HCT split policy becomes identical to that of the M-tree.
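A minimal sketch of this threshold update is given below; it assumes the level simply keeps, for the last P insertions that landed in mature cells, the compactness values of those cells in a list, so the update reduces to a scaled mean as in Eq. (34).

def update_level_threshold(recent_mature_cf, k0=0.5):
    # recent_mature_cf: CF_C values recorded at the insertions (within the last P)
    # that went into mature cells of this level; k0 is the enhancement rate.
    if not recent_mature_cf:          # nothing recorded yet: keep the old threshold
        return None
    return k0 * sum(recent_mature_cf) / len(recent_mature_cf)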
6.2.3. HCT Operations
There are mainly three HCT operations: cell mitosis, item insertion and removal. Cell mitosis
can only happen after any of the other two HCT operations occurs and is already covered in
sub-section 6.2.1.4. Both item insertion and removal are generic HCT operations that are identical for any level. Insertions are performed one item at a time; item removals, however, can be performed on a cell basis, i.e., any number of items in the same cell can be removed simultaneously. In the following sub-sections, we will present the algorithmic details of both
operations.
6.2.3.1 Item Insertion Algorithm for HCT
Item insertion is a level-based operation and is implemented on a per-item basis. Let nextItem be the item to be inserted into a target level indicated by a number, levelNo. The insertion algorithm, Insert (nextItem, levelNo), first performs a novel search algorithm, the Pre-emptive cell search, which recursively descends the HCT from the top to the target level in order to locate the most suitable cell for nextItem. Once the target cell is located in the target level, the item is inserted into the cell and the cell then becomes subject to a generic post-processing check. First, the cell is examined for a mitosis operation as explained earlier: if the cell is mature and yields a worse compactness than required (i.e. CF_C > CThr_L), mitosis occurs and two new (child) cells are produced on the same level. Hence, the parent cell is removed from the cell queue of the level and the two child cells are inserted instead. Accordingly, the old nucleus item is removed from the upper level and the two new nucleus items are inserted into the upper level by consecutively calling the Insert function with levelNo+1 for both of the (nucleus) items. This resembles a genetic process, in which an independent operation deterministically produces another one in an iterative way. Note that these operations are separate from each other, but the outcome of one may initiate the other.
On the other hand, if mitosis is not performed (for instance, the cell is still compact enough after the insertion), another post-processing step is applied to verify the need for a cell nucleus change. As explained earlier, the nucleus item of the owner cell can also change after an insertion operation; in this case, the old nucleus is first removed from the upper level and the new one is inserted by calling Insert with levelNo+1 for the new nucleus item. The Insert algorithm can be expressed as follows:
Insert (nextItem, levelNo)
- Let the top level number be topLevelNo and the single cell in the top level be cell-T.
- If (levelNo > topLevelNo) then do:
  - Create a new top level, level-T, with number = topLevelNo + 1.
  - Create a new cell in level-T: cell-T.
  - Append nextItem into cell-T.
  - Return.
- Let the owner (target) cell in level levelNo be cell-O.
- If (levelNo = topLevelNo) then do:
  - Assign cell-O = cell-T.
- Else do:
  - Create a cell array for Pre-emptive cell search, ArrayCS[], and put cell-T into it.
  - Assign cell-O = PreemptiveCellSearch (ArrayCS[], nextItem, topLevelNo).
- Append nextItem into cell-O.
- Check cell-O for post-processing:
  - If cell-O is split then do:
    - Let item-O, item-N1 and item-N2 be the old nucleus item (parent) and the new nucleus items (2 children).
    - Remove (item-O, levelNo+1)
    - Insert (item-N1, levelNo+1)
    - Insert (item-N2, levelNo+1)
  - Else if the nucleus item is changed within cell-O then do:
    - Let item-O and item-N be the old and new nucleus items.
    - Remove (item-O, levelNo+1)
    - Insert (item-N, levelNo+1)
- Return.
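To make the post-processing check concrete, the following sketch combines the helper sketches given earlier (cell_compactness, select_nucleus, mitosis). The cell object and the on_split / on_nucleus_change callbacks are purely hypothetical and only illustrate how mitosis and nucleus changes would propagate to the upper level; this is not the MUVIS implementation.

def post_process_insertion(cell, level_threshold, maturity_size,
                           on_split, on_nucleus_change):
    # Generic post-processing applied to a cell after an item insertion.
    old_nucleus = cell.nucleus                     # hypothetical cell attributes
    cf = cell_compactness(cell.branch_weights(), cell.nucleus_distances())
    if len(cell.items) > maturity_size and cf > level_threshold:
        part1, part2 = mitosis(cell.mst)           # split along the longest branch
        child1, child2 = cell.split_into(part1, part2)
        # the old nucleus leaves the upper level; the two new nuclei are inserted there
        on_split(old_nucleus, select_nucleus(child1.mst), select_nucleus(child2.mst))
    else:
        new_nucleus = select_nucleus(cell.mst)
        if new_nucleus != old_nucleus:             # nucleus change propagates upwards
            cell.nucleus = new_nucleus
            on_nucleus_change(old_nucleus, new_nucleus)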
The function PreemptiveCellSearch implements the Pre-emptive cell search algorithm for finding the optimum (owner) cell on the level at which the insertion occurs. The traditional cell search technique, MS-Nucleus, which is used in M-tree and its derivatives, depends on a simple heuristic which assumes that the closest nucleus (routing) item (object) tracks the best sub-tree during the descent and finally yields the best (owner) cell. Let d( ) be the similarity distance function, O be the item (object) to be inserted, and O_N^i and r(O_N^i) be the nucleus object and its covering radius for the i-th cell, C_i, respectively. In particular, M-tree distinguishes two cases:

Case 1. If there is no nucleus item for which d(O, O_N^i) ≤ r(O_N^i), ∀C_i, the choice is made so as to minimize the increase of the covering radius, i.e. Δ_i = d(O, O_N^i) − r(O_N^i), ∀C_i, among all the nucleus objects that are in the owner cell C.

Case 2. If there exists a nucleus item for which d(O, O_N^i) ≤ r(O_N^i), ∀C_i, then its sub-tree is tracked on the lower level. If multiple sub-trees (nucleus objects) with this property exist, then the one to which object O is closest is chosen.

Both cases fail to track the closest (most similar) objects on the lower level, as shown in the sample illustration in Figure 54. In this figure, O_N^1 and O_N^2 are the nucleus (routing) objects representing the lower-level cells C_1 and C_2 on the upper level. In both cases the MS-Nucleus technique tracks down the sub-tree of O_N^2, that is, the cell C_2, as a result of the cases expressed above. However, on the lower level the closest (most similar) object is item c (since d_1 < d_2), which is a member of the cell C_1.
(In the illustration, Case 1 corresponds to Δ_2 < Δ_1 ⇒ O → C_2 and Case 2 to d_2 < r(O_N^2) ⇒ O → C_2.)
Figure 54: M-Tree rationale that is used to determine the most suitable nucleus (routing) object for two possible cases. Note that in both cases the rationale fails to track the closest nucleus object on the lower level.
A novel pre-emptive cell search algorithm is developed to perform a pre-emptive analysis on the upper level to find out all possible nucleus objects that are likely to yield the closest (most similar) objects on the lower level. Note that on the upper level we have no information about the items in the cells C_1 and C_2, yet we can set appropriate pre-emptive criteria to fetch all possible nucleus items whose cells should be analyzed to track the closest item (item c in this particular example) on the lower level. Let d_min be the distance to the closest nucleus item (on the upper level). The rationale of the pre-emptive cell search can be expressed as follows:
Case 1. If there is no nucleus item for which d(O, O_N^i) ≤ r(O_N^i), ∀C_i, then fetch all nucleus items whose cells on the lower level may provide the closest object, i.e. those with Δ_i = d(O, O_N^i) − r(O_N^i) ≤ d_min, ∀C_i, among all nucleus objects that are in the owner cell C.

Case 2. If there exist one or more nucleus item(s) for which d(O, O_N^i) ≤ r(O_N^i), ∀C_i, then consider all of them, since any of their owner cells on the lower level may provide the closest object.

Since the criterion in Case 1 already covers Case 2, the former alone can be used as the only criterion to fetch all nucleus items for tracking. Accordingly, the pre-emptive cell search algorithm, PreemptiveCellSearch, can be expressed as follows:
PreemptiveCellSearch (ArrayCS[], nextItem, curLevelNo)
- By searching over all O_N^i, with O_N^i ∈ C_i and ∀C_i ∈ ArrayCS, find the most similar item, item-MS, and d_min.
- If (curLevelNo = levelNo + 1) then do:
  - Let cell-MS be the owner cell of item-MS in the (target) level (with level number levelNo).
  - Return cell-MS.
- Create a new array for cell search: NewArrayCS[] = ∅.
- For ∀O_N^i, with O_N^i ∈ C_i and ∀C_i ∈ ArrayCS, do:
  - If (Δ_i = d(O, O_N^i) − r(O_N^i) ≤ d_min) then do:
    - Find the owner cell of the (nucleus) item O_N^i in the lower level: cell-C_N^i.
    - Append cell-C_N^i into NewArrayCS[].
- End loop.
- Return PreemptiveCellSearch (NewArrayCS[], nextItem, curLevelNo−1).
While descending towards the target level, such a pre-emptive analysis is applied at each level to fetch all nucleus items whose owner cells may provide the "most similar" nucleus item for the lower level. Pre-emptive cell search terminates its recursion one level above the target level and returns the owner cell, on the target level, of the (final) most similar nucleus item; this is the cell into which nextItem should be inserted. This achieves an optimum insertion scheme in the sense that the owner cell found on the target level has the closest nucleus item with respect to the item to be inserted (i.e. nextItem). Therefore, along with the mitosis operations, which are used to improve the compactness of a cell, the Pre-emptive cell search based item insertion algorithm further improves the cell compactness.
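The following Python sketch illustrates the pre-emptive descent in an iterative form; the cell/item attributes (cell.items, item.child_cell, item.covering_radius) and the dist() function are hypothetical names introduced only for this sketch.

def preemptive_cell_search(query, top_cell, top_level, target_level, dist):
    # Descend from the top level down to target_level + 1, keeping at every step
    # each nucleus item whose lower-level cell may still hold the closest item.
    candidates, level = [top_cell], top_level    # ArrayCS: cells to inspect
    while True:
        best_child, d_min, scored = None, float("inf"), []
        for cell in candidates:
            for item in cell.items:              # nucleus items of lower-level cells
                d = dist(query, item)
                scored.append((item, d))
                if d < d_min:
                    best_child, d_min = item.child_cell, d
        if level == target_level + 1:
            return best_child                    # owner cell of item-MS on the target level
        candidates = [item.child_cell for item, d in scored
                      if d - item.covering_radius <= d_min]   # Delta_i <= d_min
        level -= 1

This mirrors the recursion of PreemptiveCellSearch; the target level itself is never searched here, exactly as in the pseudocode, since the search only has to deliver the owner cell on that level.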
6.2.3.2 Item Removal Algorithm for HCT
This is another level-based operation, which does not require any cell search. However, upon its completion it may cause several post-processing operations affecting the overall HCT body. As explained earlier, if multiple items are to be removed from a particular (target) level, the items are first grouped into one or more sub-sets according to their owner cells and then each sub-set can be conveniently removed from the HCT body within a single operation. Therefore, without loss of generality, we will introduce the algorithmic steps assuming that all the items for the removal operation already belong to a single owner cell.
A significant post-processing complication occurs if all items in a particular cell are removed: the cell dies and is therefore completely removed from the host level. If a single cell is then left on the target level, it automatically becomes the new top level and the level above (with its single cell holding a single, removed, item) is also removed from the HCT body; hence the HCT height is reduced by one. The remaining post-processing steps are similar to the ones given for the item insertion algorithm: the owner cell may undergo a mitosis operation and, furthermore, any change in the cell nucleus due to item(s) removal, cell mitosis or cell death may require insertions of the new nucleus item(s) and/or removal of the old one(s).
Let ArrayIR[] be the array for the items (which belong to a single owner cell, say cell-O) to be removed from the (target) level with number levelNo. The Remove algorithm can
then be expressed as follows:
Remove (ArrayIR[], levelNo)
- Let the top level number be topLevelNo and the single cell in the top level be cell-T.
- Let the owner (target) cell in level levelNo be cell-O.
- Remove the items in ArrayIR from cell-O.
- Check cell-O for post-processing:
  - If cell-O is depleted (cell death) then do:
    - If (levelNo = topLevelNo) then do:
      - Remove cell-O = cell-T.
      - Remove the top level from the HCT body.
    - Else do:
      - Let item-O be the old nucleus item.
      - Remove (item-O, levelNo+1)
  - Else if cell-O is split then do:
    - Let item-O, item-N1 and item-N2 be the old nucleus item and the two new nucleus items.
    - Remove (item-O, levelNo+1)
    - Insert (item-N1, levelNo+1)
    - Insert (item-N2, levelNo+1)
  - Else if the nucleus item is changed within cell-O then do:
    - Let item-O and item-N be the old and new nucleus items.
    - Remove (item-O, levelNo+1)
    - Insert (item-N, levelNo+1)
- Return.
6.2.4. HCT Indexing
In order to index (construct an HCT body for) a multimedia database, the database should contain at least one feature extracted according to the genre of its multimedia items, i.e., “visual”
for images and video clips and “aural” for audio and video clips. According to the features
present in a database, two different genres of indexing can be performed: visual and aural indexing (HCT). If a database is indexed both visually and aurally, apart from the exclusive similarity distance implementations there is no difference from the algorithmic point of view; exactly the same algorithmic approach as presented in this chapter is applied in both cases. If
there are multiple features and/or sub-features present, then any suitable combination of these
features can be used for indexing. Once indexing operation is completed, both genres of query
and browsing operations (visual and aural) can be performed over that database.
There are mainly two distinct operations for HCT indexing: the incremental construction of the HCT body and an optional periodic fitness check operation over it. In the following
sub-sections, we will present the algorithmic details of both operations.
6.2.4.1 HCT Incremental Construction
Let G represent the indexing genre, which can equally be visual or aural for a multimedia database. Let ArrayI<G> be the array containing items that are to be appended to the database, D, according to the indexing genre. Initially D may or may not have an HCT indexing
body. If it does not, then all (valid) items within D will be inserted into ArrayI<G> and a new
HCT body is constructed; otherwise, the available HCT body should be first loaded, activated
and updated for the newcomers present in ArrayI<G>. Accordingly the HCT indexing body
construction algorithm, HCTIndexing, can be expressed as follows:
HCTIndexing (ArrayI<G>, G, D)
- Load and activate the HCT indexing body in genre G for database D.
- For ∀O_G^i, with O_G^i ∈ ArrayI<G>, do: // for all items in the array, perform incremental insertion
  - Insert (O_G^i, 0) // insert the i-th item into the HCT body (ground level)
- End loop.
6.2.4.2 HCT (Periodic) Fitness Check
HCT fitness check is an optional operation that can be performed periodically during or after
the indexing operation. The objective is to reduce the “crowd effect” by removing redundant
immature cells from the HCT body. Due to the insertion ordering of the items, one or a few minor groups of items may form a cell at the initial stages of the incremental construction operation. Later on, some other major cells might become suitable containers for those items that got trapped within those immature cells. So the idea is to remove all immature cells and feed their items back to the system, expecting that some other mature cell might now be a better host for them. Note that after such an item is inserted into the most suitable cell on the level, the host cell may still refuse it if the insertion causes a significant degradation in cell compactness and hence makes the cell split as a result; in that case, the original part of the host cell and the new item are each assigned to one of the two newborn cells. These items are, in fact, "minority cases": no other (similar) cell exists to accept them, so they eventually have to form a new immature cell and bind themselves into it.
As a result, the basic idea is to reduce the number of immature cells that are making the level "crowded" whilst respecting such minority cases. The obvious expectation from this operation is to increase the percentage of mature cells, along with their item coverage, in a particular level without causing a significant degradation of the overall compactness. Note that the periodic fitness check is applied to each level of the HCT body except the top level, since the top level contains only one cell. The HCT fitness check algorithm, FitnessCheck, can be expressed as follows:
FitnessCheck ( )
- Let l represent the level index.
- l = topLevelNo − 1 // start from the highest level possible
- While l ≥ 0, do: // for all levels, perform the fitness check
  - Let ArrayIR[] be the array for the removed items (which belong to an immature owner cell, C).
  - For ∀C with N_C < N_M and C ∈ Level(l), do: // for all immature cells of level l
    - For ∀O_C^i with O_C^i ∈ C, do: // for all items in cell C
      - Append O_C^i to ArrayIR[].
    - End loop.
    - Remove (ArrayIR[], l)
    - For ∀O_C^i with O_C^i ∈ ArrayIR, do: // for all array items, perform incremental insertion
      - Insert (O_C^i, l) // insert the i-th item back into the l-th level of the HCT body
    - End loop.
  - End loop.
  - Set l → l − 1 // continue with the lower level
- End loop.
Note that the fitness check is performed on all of the levels except the top level, in decreasing order (higher levels are handled earlier than lower levels). The reason is that an (incremental) insertion operation on a particular level requires a (Pre-emptive) cell search performed on all of the higher levels; performing the fitness check on those levels first therefore improves the performance of the fitness check operations performed on the lower levels.
6.3. PQ OVER HCT
As presented in Chapter 5, Progressive Query (PQ) is a novel retrieval scheme, which presents Progressive Sub-Query (PSQ) retrieval results periodically to the user and allows the user to interact with the ongoing query process. Compared with other traditional query techniques, such as the exhaustive-search-based Normal Query (NQ), which can only be used for databases without any indexing structure, or kNN and range queries for indexed databases, PQ presents significant innovative features; therefore, HCT is designed and optimized especially for PQ.
PQ can be executed over databases without an indexing structure and in this context it
presents an alternative to the traditional query type, NQ, which is usually performed for the
sake of simplicity. As a retrieval process PQ can also be performed over indexed databases as
long as a query path (QP) can be formed over the clusters (partitions) of the underlying indexing structure. Obviously QP is nothing but a special sequence of the database items and it can
be formed in any convenient way such as sequentially (starting from the 1st item towards the
last one, or vice versa) or randomly when the database lacks an indexing structure. Otherwise
the most advantageous way to perform PQ is to use the indexing information so that the most
relevant items can be retrieved in earlier periodic steps of PQ.
PQ operation over HCT is executed synchronously over two parallel processes: HCT
tracer and a generic process for PSQ formation using the latest QP segment. HCT tracer is a
recursive algorithm, which traces among the HCT levels in order to form a QP (segment) for
the next PSQ update. When the time allocated for this operation is completed, this process is
paused and the next PSQ retrieval result is formed and presented to the user. Then HCT tracer
is re-activated for the next PSQ update and both processes stay alive unless the user stops the
PQ or the entire PQ process is completed (i.e., when all the indexed database items have been covered).
6.3.1. QP Formation from HCT
As mentioned briefly, QP is formed segment by segment for each and every PSQ update.
Once a QP segment is formed, the periodic sub-query results are obtained within this segment
(group of items) and then the retrieval result (the sorted list of items) is fused with the last
PSQ update to form the next PSQ retrieval result. Starting from the top level, the HCT tracer algorithm recursively traces among the levels and their cells according to the similarity of the cell nuclei. This is similar to the MS-Nucleus cell search process, only this time it does not stop its execution when it finds the "most similar" cell on the ground (target) level, but continues its sweep by visiting the 2nd most similar, then the 3rd most similar, and so on, while appending all the items of the ground-level cells it visits to the current QP segment. For each cell it visits on an intermediate level (any level except the ground level), the HCT tracer forms a priority (item) queue, which ranks the cell items according to their similarity
with respect to the query item. Note that these items are nothing but the nucleus items of the
cells on the lower level and hence on the lower level, the cell “tracing” order is determined
according to the priority queue that is formed on the upper level using their representative
(nucleus) items. When the tracing operation is completed among the cells on the lower level (i.e. when the priority queue is depleted for the cell in a particular level), the HCT tracer retreats to the upper-level cell where it originated. The process is terminated when the priority queue of the top-level cell is depleted, which means that the whole HCT body has been traced. Within the implementation of the HCT tracer, we further develop an internal structure that prevents redundant similarity distance calculations; that is, similarity distances between items of the cells on intermediate levels are calculated only once and re-used on the lower levels whenever needed. In fact, this is a general property of the PQ operation: all the (computationally) costly operations
such as similarity distance calculations, loading features of the items from disc to the system
memory, etc. are performed only once and shared between the processes whenever needed.
The following HCTtracer algorithm implements the HCT tracer operation, which basically extracts the next QP segment into a generic array, ArrayQP[]. It is initially called with the top-level number (topLevelNo) and an item (item-MS) from the single cell on the top level: HCTtracer (ArrayQP[], topLevelNo, item-MS). Let item-MS be the (next) most similar item to the query item, item-Q, on the (target) level indicated by the number levelNo. The HCTtracer algorithm can then be expressed as follows:
HCTtracer (ArrayQP[], levelNo, item-MS)
- Let cell-MS be the owner cell of item-MS.
- If (levelNo = 0) then do: // if this is the ground level
  - Append all items in cell-MS into ArrayQP[].
  - Return.
- Else do: // if this is an intermediate level
  - Create the priority queue of cell-MS: queue-MS.
  - For ∀O_N^i ∈ queue-MS, do: // for all sorted (nucleus) items
    - HCTtracer (ArrayQP[], levelNo−1, O_N^i)
- Return.
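A generator-based Python sketch of the tracer is shown below; the cell/item attributes and dist() are hypothetical, as in the earlier sketches. Writing the tracer as a generator makes the pause/resume behaviour explicit: the consuming PQ process simply stops pulling items when a PSQ update is due and resumes afterwards.

def hct_tracer(query, cell, level, dist):
    # Depth-first trace of the HCT body that yields ground-level items in
    # (approximately) increasing dissimilarity order, forming the QP segment.
    if level == 0:                                 # ground level: emit the whole cell
        for item in cell.items:
            yield item
        return
    # priority queue of the cell items (the nucleus items of lower-level cells)
    for item in sorted(cell.items, key=lambda it: dist(query, it)):
        yield from hct_tracer(query, item.child_cell, level - 1, dist)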
Note that this algorithm is executed as a separate process (thread) and can be paused
externally from the main PQ process when the time comes for the next PSQ update. An example HCT tracer process is illustrated in Figure 55 for the sample HCT body shown in
Figure 53.
(The resulting query path for the query item Q in this example is: b, r, p, s, f, c, i, e, j, k, l, m, n, d, g, h, a, o.)
Figure 55: QP formation on a sample HCT body.
6.3.2. PQ Operation over HCT
Once QP segments are formed, the PQ operation executed over the HCT body becomes similar to the sequential PQ illustrated in Figure 37. There are two main differences: each database sub-set shown in Figure 37 should now be replaced by the particular QP segment created by the HCT tracer process, and the period value of a particular (say, the (q+1)-st) periodic sub-query should be reformulated according to expression (31).
In order to present the overall PQ algorithm over an HCT body, let HCTfile be the HCT file where the HCT body is stored along with the database for which it is extracted, and let t = t_p^0 be the user-defined (initial) period value. The PQoverHCT algorithm can then be expressed as follows:
PQoverHCT (HCTfile, t_p^0)
- Load the HCTfile to activate the HCT body of the database.
- Create a timer, which signals this process every t = t_p^q milliseconds.
- Create a process (thread) for the HCT tracer.
- Set q = 0.
- While (timer<t_p^q> ticks) do:
  - Pause the HCT tracer process.
  - Retrieve the QP segment as a periodic sub-query result.
  - Fuse the periodic sub-query result with the last PSQ result to form the next PSQ update.
  - Render the next PSQ update to the user.
  - Update the t_p^q value for the next ((q+1)-st) PSQ period as given in expression (31). Reset the timer <t_p^(q+1)>.
  - Set q → q+1.
  - Re-activate the HCT tracer process.
- End loop.
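A condensed Python sketch of this loop is given below, re-using the generator-based tracer sketched in the previous sub-section instead of an explicitly paused thread; render() and next_period() (standing in for the period update of expression (31)) are hypothetical callables, so this only illustrates the control flow.

import time

def pq_over_hct(query, hct, dist, t_p0, render, next_period):
    # Collect QP items for one period at a time, fuse them into progressive
    # sub-query (PSQ) results and render each PSQ update to the user.
    tracer = hct_tracer(query, hct.top_cell, hct.top_level, dist)
    psq, segment = [], []                        # (distance, item) pairs
    q, t_p = 0, t_p0
    deadline = time.monotonic() + t_p / 1000.0
    for item in tracer:                          # consume the current QP segment
        segment.append((dist(query, item), item))
        if time.monotonic() >= deadline:         # period expired: next PSQ update
            psq = sorted(psq + segment, key=lambda p: p[0])
            render([it for _, it in psq])
            q, t_p = q + 1, next_period(t_p, q)  # (q+1)-st period, expression (31)
            deadline = time.monotonic() + t_p / 1000.0
            segment = []
    if segment:                                  # final update once the HCT is exhausted
        psq = sorted(psq + segment, key=lambda p: p[0])
        render([it for _, it in psq])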
6.4. HCT BROWSING
Generally speaking, there are two ways of retrieval from a (multimedia) database: through a query process, such as query by example (QBE), and through browsing. A query is a search-and-retrieve type of process and is bound to strictly defined rules and algorithmic steps. It is a retrieval race against time, so as to provide the "most relevant" results in the "earliest" possible time given an "example" query item. However, such a scheme, by its nature, has limitations and drawbacks. The user may not have a definitive idea of what he/she is looking for, and even if he/she has a clear idea about the query item, finding a relevant example may not be trivial. Therefore, the problem of locating one or more initial query examples can be addressed by a particular browsing scheme. Database browsing, on the other hand, is a loose process, which usually requires continuous interaction and feedback from the user; it is therefore a kind of free navigation and exploration among database items. Yet it has a purpose of its own: to reach (classes of) items in a systematic and efficient way even though the items may be rather vaguely defined. Therefore, it is the browsing algorithm's responsibility to organize the database in such a way that the "unknown" parameters of any browsing action can be resolved as efficiently as possible. Since browsing requires the capability of handling the entire database, a particular visualization system (for visual databases) and tool(s) for navigation should be provided; otherwise, browsing can turn out to be a disorienting process for the user. For this reason, it is essential to provide an organized (perhaps hierarchical) map of the entire database, along with the current status of the user (e.g. a "You are here!" sign), during the browsing process.
In order to assist browsing, database items should be organized in some way. Particularly for large databases, a hierarchical representation of the database may provide a natural
support for free-navigation between the “levels” of the hierarchy such as traversing in and out
among the levels. Several browsing methodologies are proposed in the literature. Koikkalainen and Oja introduced TS-SOM [36] that is used in PicSOM [37] as a CBIR indexing
structure. TS-SOM provides a tree-structured vector quantization algorithm. Other similar
SOM-based approaches are introduced by Zhang and Zhong [80], and Sethi and Coman [62].
All SOM-based indexing methods rely on training of the levels using the feature vectors and
each level has a pre-fixed node size that has to be arranged according to the size of the database. This brings a significant limitation, that is, they are all static indexing structures, which
do not allow dynamic construction or updates for a particular database. Retraining and costly reorganizations are required each time the content of the image database changes (i.e. new insertions or deletions), which means rebuilding the whole indexing structure from scratch.
In the previous section, we presented the implementation of an efficient query method, PQ, over the proposed indexing scheme, HCT. Moreover, HCT can provide a basis for an efficient browsing scheme, namely HCT Browsing, which is designed to address such limitations and problems. As explained earlier, HCT is a dynamic indexing structure, which allows incremental item insertions and removals. It has virtually no parameter dependency, not even on the database size. The cell sizes are also dynamic and can vary according to the coverage or the amount of a particular content, all of which is intended to be stored in one (or the least possible number of) ground-level cell(s). The hierarchic structure of HCT can be used to present the user with an overview of what lies under the current level. Therefore, provided that a user-friendly GUI is designed, HCT Browsing can turn into a guided tour among the database items. When the user initiates it, it shows the items in the cell at the top level, which gives a first clue about what the database contains or, more specifically, a brief summary of it. The next logical step for the user is to choose an item of interest among the items in this cell and start the tour downwards. This is the first functionality that HCT Browsing provides: to choose an item on the upper level and trace down to see the cell it represents (as the nucleus item of that cell). As long as the item chosen belongs to the current cell, the "level down" option leads to the lower-level cell that is represented by that item; otherwise, the first cell on the lower level is shown by default. The opposite functionality, "level up", leads to the upper-level cell, which is the owner (cell) of the nucleus item of the host cell at the current level; it works at all levels except the top level. This is also a useful functionality to see and visit cells similar to a particular cell. Therefore, the (slight) variations of a particular content can be visited using both of the level functionalities.
In addition to such inter-level navigation options, HCT Browsing provides inter-cellular navigation within a certain level. The user can visit the cells sequentially, in forward
or backward direction, one cell at a time. This is especially useful when the user does not have
any particular target content in mind and he/she may be just “looking” for interesting items.
So in such an indefinite or “open-ended” exploration task, navigating through the consecutive
A Novel Indexing Scheme: Hierarchical Cellular Tree 125
cells in a certain level will summarize the overall database content and the amount of summarization obviously depends on the “height” of the navigation performed, that is, simply the
current level number.
Figure 56: HCT Browsing GUI on the MUVIS-MBrowser Application. (The annotated GUI components are the nucleus item, level controls, cell controls, cell items, HCT info, cell MST info, item navigator buttons, and Prev/Next buttons.)
Figure 56 shows a snapshot of the MUVIS application, MBrowser, where HCT Browsing is
implemented. Depending on the HCT indexing genre (i.e. visual or aural), the aforementioned
functionalities of HCT Browsing are supported by means of a Control Window along with the GUI of MBrowser. In this example, the database used is Corel_1K with 1000 images. Color
(HSV and YUV histograms) and texture (Gray Level Co-occurrence Matrix) features are extracted and HCT indexing is then performed. As shown in the bottom-left part of the figure,
the Control Window allows the user to perform inter-level and inter-cellular navigations. Furthermore, it gives some logs and useful information to the expert users about the HCT body in
general and particularly about the current cell and its MST. So any (expert) user can examine
the cell compactness, which items are connected to each other within MST, the nucleus item,
and the most important of all, whether or not the current cell is compact and mature. In this
example, by comparing the cell compactness feature with the current level compactness threshold value (CThr_L = 246.3 and CF_C = 24.86 for this cell), it can easily be deduced that this is a highly focused cell. As a compact cell, its MST branch weights are quite low and lie within a close neighborhood, as expected. Additional important information that can be obtained from the Control Window is the reliability and discrimination power of the visual (or aural) features, assessed by visually (or aurally) inspecting the relevancy of the cell items along with their (minimal) connections within the cell.
Two examples of HCT Browsing with inter-level navigations are shown in Figure 57 and Figure 58. In both illustrations, the user starts browsing from the 3rd level of a 5-level HCT body. Due to space limitations, only the portion of the HCT body where the browsing operation is performed is shown. Note that in both examples the HCT indexing scheme provides more and more "narrowed" content in the descending order of the levels. For example, the user chooses an "outdoor, city, architecture" content on the third level, which yields a cell carrying "outdoor, city, beach and buses" content on the second level. He/she then chooses a multi-color "bus" and, navigating down into the first level, reaches a cell which contains mostly "buses" of different colors; finally, choosing a "red bus" image (nucleus item) yields the cell of "red buses" on the ground level. Another example can be seen through: "outdoor, city, architecture" → "outdoor, city, beach and buses" → "beach and mountains" → "beaches". A similar series of examples can also be seen in the sample HCT Browsing operation within a texture database: the cells become more and more compact (focused) in the descending order of levels, and the ground-level cells achieve a "good" clustering of texture images showing high similarity.
Figure 57: An HCT Browsing example, which starts from the third level within the Corel_1K MUVIS database. The user navigates among the levels (Level 3 down to Level 0), shown with the lines, through to the ground level.
Figure 58: Another HCT Browsing example, which starts from the third level within the Texture MUVIS database. The user navigates among the levels (Level 3 down to Level 0), shown with the lines, through to the ground level.
6.5. EXPERIMENTAL RESULTS
The experiments performed in this chapter use 7 different MUVIS multimedia databases as
presented in section 2.1.4:
1) Open Video Database
2) Real World Audio/Video Database
3) Sports Hybrid Database
4) Corel_1K Image Database
5) Corel_10K Image Database
6) Shape Image Database
7) Texture Image Database
All experiments are carried out on a Pentium-4 3.06 GHz computer with 2048 MB
memory. In order to have unbiased evaluations, each query experiment is performed using the same queried multimedia item with the same instance of the MBrowser application. The evaluation of the retrieval results by PQ is performed subjectively using the ground-truth method, i.e. a group of people evaluates the query results of a certain set of retrieval experiments and all the group members unanimously agree on the retrieval performance. Among these, a certain set of examples was chosen and is presented in this section for visual inspection and verification.
6.5.1. Performance Evaluation of HCT Indexing
In this section, performance evaluation is made based on the cell search algorithm and the fitness check. The fitness check operation is performed only once as a post-processing step after
the completion of the incremental construction of the HCT body. To examine the “quality”
and “compactness” of the clustering, especially at the ground level where the entire database
is stored, some statistics are used as shown in Table XI, in the left column. First, we will explain the details of the statistical data and analysis performed over the key algorithms of the
HCT indexing operation. Afterwards, the performance evaluation is presented based on the
statistical data given in Table XI.
Table XI: Statistics obtained from the ground level of HCT bodies extracted from the sample MUVIS databases.

Statistics (Level 0), given per cell search algorithm (MS-Nucleus or Pre-emptive) and fitness check (FC) status. Columns, left to right: Real World Video, Real World Audio, Sports Image, Sports Video, Corel_1K, Corel_10K, Shape, Texture.

Mature Cell %
  MS-Nucleus, before FC:    7.246     9.052    15.929     0.000    15.962    18.228    13.694    31.783
  MS-Nucleus, after FC:    17.526    14.286    28.261    15.686    19.324    34.654    50.877    44.444
  Pre-emptive, before FC:   5.848     6.792    12.308     1.389    16.393    13.937    12.760    28.114
  Pre-emptive, after FC:    9.938    12.389    25.510    10.909    24.631    25.618    19.048    41.048

Item % in Mature Cells
  MS-Nucleus, before FC:   24.062    21.303    44.040     0.000    39.700    47.865    41.143    64.091
  MS-Nucleus, after FC:    37.748    30.451    61.818    32.000    44.600    65.116    75.214    72.898
  Pre-emptive, before FC:  23.179    20.802    40.404     3.500    44.200    40.729    42.357    61.705
  Pre-emptive, after FC:   36.865    30.952    63.030    27.000    54.400    56.465    52.571    70.000

Average Compactness
  MS-Nucleus, before FC:  173.328  2299.667   255.417       NaN   152.672   289.694    23.636     0.038
  MS-Nucleus, after FC:   192.968  2687.994   304.267    10.989   145.581   384.193   157.74      0.048
  Pre-emptive, before FC: 112.148  2172.990   134.299    11.586    82.440    65.020    12.097     0.011
  Pre-emptive, after FC:  128.164  2096.814   195.721     9.378    99.905    77.587    14.104     0.015

Average Covering Radius
  MS-Nucleus, before FC:    1.238     2.448     1.089       NaN     1.049     1.206     0.580     0.127
  MS-Nucleus, after FC:     1.229     2.449     1.156     0.588     1.042     1.293     0.817     0.133
  Pre-emptive, before FC:   1.068     2.357     0.961     0.501     0.935     0.863     0.504     0.098
  Pre-emptive, after FC:    1.093     2.514     1.077     0.505     0.986     0.925     0.539     0.107

Average Broken Branch Weight
  MS-Nucleus, before FC:    1.015     2.351     1.025     0.558     1.014     1.200     0.668     0.147
  MS-Nucleus, after FC:     1.034     2.387     1.119     0.585     1.031     1.255     0.707     0.149
  Pre-emptive, before FC:   0.890     2.278     0.915     0.517     0.861     0.794     0.540     0.109
  Pre-emptive, after FC:    0.883     2.281     0.980     0.531     0.872     0.805     0.549     0.109

Average Cell Size
  MS-Nucleus, before FC:    3.283     3.440     4.381     3.226     4.695     4.619     4.459     6.822
  MS-Nucleus, after FC:     4.670     3.931     5.380     3.922     4.831     6.231     8.187     8.148
  Pre-emptive, before FC:   2.649     3.011     3.808     2.778     4.098     4.097     4.154     6.263
  Pre-emptive, after FC:    2.814     3.531     5.051     3.636     4.926     5.321     5.128     7.686

Average Mature Cell Size
  MS-Nucleus, before FC:   10.900     8.095    12.111       NaN    11.676    12.128    13.395    13.756
  MS-Nucleus, after FC:    10.059     8.379    11.769     8.000    11.150    11.708    12.103    13.365
  Pre-emptive, before FC:  10.500     9.222    12.500     7.000    11.050    11.973    13.791    13.747
  Pre-emptive, after FC:   10.438     8.821    12.480     9.000    10.880    11.727    14.154    13.106

HCT Construction Time (seconds)
  MS-Nucleus, before FC:   168.079   4359.849     1.847    79.600     2.310    46.162    23.365    3.000
  MS-Nucleus, after FC:    286.906   7674.978     3.034   158.287     4.147    72.456    31.020    3.989
  Pre-emptive, before FC:  541.073  15110.948     1.925   193.918     3.054   450.105    67.631    3.196
  Pre-emptive, after FC:  1548.585  31708.918     3.868   394.462     5.673   926.525   117.51     4.403
6.5.1.1 Statistical Analysis
Table XI presents several statistics per fitness check status (before and after) and per cell
search algorithm: the proposed Pre-emptive vs. the traditional MS-Nucleus. The first two statistics, the percentage of mature cells and their overall item coverage, are mainly chosen to show the effect of both algorithms on the maturity of the cells. Furthermore, the effect of the fitness check can be clearly seen by examining these measures. The other three averaging statistics, compactness, covering radius (nucleus distance) and broken branch weight, are about the "quality" of the indexing scheme, that is, how focused (compact) the obtained cells are. For the (HCT) indexing of these databases the following regularization function is used:
CF_C = f(μ_C, σ_C, r_C, max(w_C), N_C) = K · μ_C · σ_C · r_C · max(w_C) · √N_C        (35)

where K is a scaling coefficient, μ_C and σ_C are the mean and standard deviation of the MST branch weights, w_C, of the cell C, r_C is the covering radius, that is, the distance from the nucleus within which all the cell items lie, and N_C (> N_M for a mature cell) is the number of items in cell C. Once the indexing operation is completed, the average compactness is calculated using the CF_C values of all mature cells on the ground level. We used N_M = 6 for maturity and K = 1000 in
the experiments. The covering radius is the conventional analytical expression of cell compactness; therefore, its average over the entire level is expected to be low for indexing operations targeting high quality. Finally, the average broken branch weight is obtained from a log of all mitosis operations performed so far on the ground level. It is an alternative criterion for measuring the overall level compactness, and a similar argument can therefore be made for this statistical parameter.
Since Pre-emptive cell search is primarily designed to improve the overall level compactness, its corresponding statistics are expected to be lower (indicating more focused cells), and significantly so as the database size grows. This directly concerns scalability; thus, among the experimental databases given in Table XI, comparing these statistics for the Corel_1K and Corel_10K databases allows the scalability performance of each algorithm to be evaluated.
The average mature cell size is used to ensure that the compactness evaluation is not biased by differences in the actual mature cell size. As given in equation (35), the CF_C value for a cell C is proportional to the (square root of the) size of C; that is, if a cell is to carry more and more items, then the items should be more and more similar (focused) in order not to cause the cell to split. In other words, increasing the cell size can only be compensated for (is only allowed) by more focused cell items (i.e. comparatively low values of μ_C, σ_C, r_C, max(w_C)) in order to keep the cell in one piece. Finally, the HCT construction time represents the basic cost of each operation as the (CPU) computational time.
6.5.1.2 Performance Evaluation
The numerical results given in Table XI show that, regardless of which cell search algorithm is used, the fitness check operation usually improves the amount (percentage) of mature cells and also the number of items stored in these cells significantly, without degrading the overall compactness drastically. Such effects naturally cause a significant increase in the average cell size; however, the average mature cell size is only slightly affected. This means that the fitness check operation does not change the natural maturity level within the HCT body, so the average size of similar item groups is kept intact.
As expected, Pre-emptive cell search achieves a major compactness improvement with respect to the MS-Nucleus algorithm. Its advantage increases further when the database is larger and the features are more discriminative. Consider, for example, the statistics obtained from Corel_1K and Corel_10K in the "before fitness check" status (to discard the effect of the fitness check): the Pre-emptive cell search algorithm yields an average compactness factor of 82.44, compared with 152.672 for MS-Nucleus (see Table XI). Not surprisingly, the gap in average compactness becomes even larger in the case of the larger database, e.g. Corel_10K (65.020 for Pre-emptive search vs. 289.694 for MS-Nucleus). Note that this is an unbiased comparison since both algorithms yield close values of average mature cell size for both databases. Therefore, due to the reasons explained earlier, the MS-Nucleus cell search algorithm induces corruption proportional to the database size. On the other hand, the Pre-emptive cell search algorithm is not degraded by the increasing database size and therefore achieves a significant scalability in this respect. This can be verified by comparing the average compactness and covering radius values obtained (82.440 vs. 65.020 and 0.935 vs. 0.863): the compactness level is, on the contrary, improved with the increasing database size, whereas the opposite is true for the MS-Nucleus cell search algorithm.
Apart from the database size, the reliability (discrimination factor) of the feature(s) is also important. More discriminative features yield more robust similarity distance measures, which in turn lead to more focused cells obtained by the Pre-emptive cell search algorithm. Among the features used in the experiments that are reported in this thesis, the most reliable ones are the texture features (GLCM [49] and Gabor [40]). Hence, the (second) highest difference in terms of compactness (more than three times) between the Pre-emptive and MS-Nucleus cell search algorithms is seen in the Texture database (0.011 vs. 0.038, in Table XI).
The cost of using both the fitness check and Pre-emptive cell search is the increased computational time for the construction of the HCT indexing structure. However, since indexing is an off-line process that is performed only once during the creation of the database, this cost is compensated for by the accuracy and time gains in query and browsing, both of which are online processes that are performed many times during the lifetime of the database.
6.5.2. PQ over HCT
Two different performance evaluations can be performed for PQ operations over the HCT indexing structure. First, the relevancy of the QP along which PQ proceeds can be examined from some typical QP (similarity distance) plots. These plots indicate whether the order of the items within the QP is formed in accordance with the similarity to the query item, so that the most similar items can be retrieved earliest. In Figure 59, a query image has a group of 97 relevant images among the 1000 images in the database, and in Figure 60 a query video has a group of 21 relevant video clips among 200 video clips. It can be seen from the figures that the HCT tracer successfully captures all of the relevant items at the beginning of the QP. Therefore, the PQ operation will present them (first) to the user immediately after the query operation is initiated. Another important remark concerns the "trend" of the QP plots: the QP proceeds in increasing order of dissimilarity, as intended.
Figure 59: QP plot of a sample image query in Corel_1K database.
Figure 60: QP plot of a sample video query in Sports database.
The second performance evaluation concerns the speed (or timing) of the PQ over HCT operation compared with both the sequential PQ and NQ. To this end, we performed 10 visual and 10 aural retrieval experiments using all three query methods and measured the time to retrieve at least 90% of all relevant items, which were subjectively determined using the ground-truth method within each database. We used t_p ≤ 3 sec and the results are presented in Table XII.
It is not surprising that, over the 20 query operations performed, PQ over HCT achieves the fastest retrievals, yielding the retrieval result within the first periodic update of PQ except for 4 aural retrievals. In those examples the audio features could not provide sufficient discrimination, and hence the (MS-Nucleus-like) cell search within the HCT tracer fails to track the optimum sub-tree at the beginning. Note that this is the main and expected problem causing corruption in the indexing phase, as explained earlier.
Table XII: Retrieval times (in msec) for 10 visual and aural query operations performed per query type.

Query items: Q1=428, Q2=466, Q3=381, Q4=705, Q5=617, Q6=291, Q7=417, Q8=784, Q9=277, Q10=603.

Visual
  NQ:           19624   9407   5938   5922   7776   6774   7290   7799   4954   7039
  Seq. PQ:      12007   5974   3997   6000   6003   4004   6002   2003   2001   3002
  PQ over HCT:   3003   2003   2002   3139   2501   1501   3002   2003   1504   2001

Aural
  NQ:           37675  29820  31977  36652  54869  39905  48553  58897  84921  61080
  Seq. PQ:      21004  21005  18006  36023  45003  18050  30023  24003  82998  57057
  PQ over HCT:   3001   3000  12003   1501   1500  12033   3002   9000  12000   9002
6.5.3. Remarks and Evaluation
As a brief summary, the following innovative properties achieved by HCT can be listed:
• HCT is a dynamic, parameter-independent indexing structure with flexible cell (node) sizes, which is optimized to achieve cells that are as focused as possible using visual and aural descriptors with limited discrimination factors.
• By means of the flexible cell size property, one cell (or a minimum number of cells) is used to store a group of similar items, which in effect reduces the degradations caused by the "crowd effect" within the HCT body.
• During their lifetime, the cells are under the close surveillance of their levels in order to enhance the compactness, using mitosis operations whenever necessary to rid a cell of dissimilar item(s). Furthermore, for item insertions, an optimum cell search technique (Pre-emptive) is used to determine the most suitable (similar) cell on each level.
• HCT is also intrinsically dynamic, meaning that the cell and level parameters and primitives are subject to continuous update operations to provide the most reliable environment. For example, a cell nucleus item is changed whenever a better candidate is available and, once a new nucleus item is assigned, its owner cell on the upper level is found by a cell search instead of re-using the old nucleus item's owner cell. Such dynamic internal behavior keeps the HCT body intact by preventing the potential sources of corruption.
• By means of the MST within each cell, the optimum nucleus item can be assigned whenever necessary and with no cost. Furthermore, an optimum split can be made when the mitosis operation is performed (again with no cost). Most important of all, the MST provides a reliable compactness measure via "cell similarity" for any item instead of relying on only a single (nucleus) item. In this way a better judgment can be made as to whether or not a particular item is suitable for a mature cell.
• HCT is particularly designed to work with PQ in order to provide the earliest possible retrievals of relevant items.
• Finally, the HCT indexing body can be used for efficient browsing and navigation among database items. The user is guided at each level by the nucleus items, and several hierarchic levels of summarization help the user to form a "mental picture" of the entire database.
Experimental results presented earlier demonstrate that HCT achieves all the above-mentioned properties and capabilities in an automatic and parameter-invariant way. It further achieves significant improvements in cell compactness and shows no sign of corruption as the database size grows. Assuming a stable content distribution, the cells keep approximately the same level of compactness when the database size is increased significantly (e.g. 10 times). The analysis obtained from different databases suggests that HCT usually yields better clustering performance when the discrimination factor of the features is sharper and they provide better relevancy from the user's point of view.
Chapter 7
Conclusions
Multimedia management has been and always will be a formidable challenge of human-centric computing. It requires efficient consumption of multimedia primitives along
with several distinct operations performed during their life cycle. These operations, such as content analysis, indexing, retrieval, summarization, etc., should all be integrated into an efficient framework in order to achieve a global and generic basis for management. In particular, the variations in digital multimedia parameters and formats increase the complexity of the problem for a global approach. On the other hand, the current level of Artificial Intelligence makes training-based methods, such as recognition and identification, infeasible as a generic solution. Moreover, traditional early attempts such as text-based query methods are also far from bringing a content-based solution to the problem, mainly for two reasons: first, text annotations are user-dependent and may vary among different people; second, their feasibility is limited to small databases, since annotation requires a significant amount of manual labor. Therefore, designing feasible, yet global and generic
techniques for content-based multimedia management becomes the primary objective of this
thesis. Having defined the primary motivation behind developing the MUVIS framework as
such, MUVIS is further designed as a self-sufficient test-bed platform for developing modular
aural and visual feature extraction methods, novel indexing and retrieval techniques, efficient
browsing capabilities through its user-friendly GUI design, in addition to several other functionalities such as scalable video management, summarization, etc.
The contributions of the thesis can be summarized in four main parts: audio content analysis together with content-based audio indexing and retrieval; the query technique called Progressive Query; the Hierarchical Cellular Tree as an efficient retrieval, indexing and browsing technique for multimedia databases; and the successful integration of all of these into the modular MUVIS framework.
Content-based multimedia indexing and retrieval has by now been an active research topic for roughly two decades. Within this context, several efficient methodologies have been developed for visual items such as images and video; however, the audio counterpart is still in its infancy, despite the fact that it can yield better retrieval performance due to its unique and stable nature with respect to the content. Therefore, we focus our attention on
automatic audio content analysis and efficient aural indexing and retrieval framework design.
The former technique is especially designed for and within the context of the latter, yet it
achieves a significant classification and segmentation performance within a bi-modal and unsupervised (fully automatic) structure. By using it as the initial step, the proposed audio-based
multimedia indexing and retrieval framework then becomes a major alternative, showing
equal or better performance with respect to the visual counterpart. The positive experimental
results may lead one to predict that audio may be the key to closing the “semantic gap”, which
exists today between low-level features and the real semantic content of the audio-visual world. Consequently, future research in this field will focus on extensions and improvements of the techniques developed in this thesis. Particular emphasis will be placed on
additional features and enhanced summarization models extracted from the aural content.
The major part of the management of the multimedia collections obviously requires
powerful indexing, retrieval via query, and browsing techniques. Although much work has been done on the development of such techniques, as shown by the extensive literature review in this thesis, most of the current techniques and systems have significant limitations and drawbacks, especially for large multimedia databases. In this context, the thesis first focuses on the problem of an efficient query methodology. The proposed Progressive Query has been developed to achieve several innovative retrieval capabilities for databases with or without an indexing structure. Since the ultimate measure of any retrieval system's performance is the satisfaction of its user, the most important property achieved is the continuous user interaction, which provides solid control and an enhanced relevance feedback mechanism during the ongoing query operation. The experiments performed on databases
without an indexing structure clearly demonstrate the superior performance achieved by PQ
in terms of speed, minimum system requirements, user interaction and possibility of better
retrievals as compared with the traditional and the only available query scheme, the NQ.
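As a purely conceptual illustration of this progressive behavior (a sketch under assumed chunk size and distance function, not the actual MUVIS implementation), a periodic retrieval loop can be written as follows: the database is scanned in sub-sets and an updated ranking is reported after each sub-set, so that the user can inspect the results, give relevance feedback or stop the query at any time.

    import heapq

    def progressive_query(query, database, dist, chunk_size=200, top_k=12):
        # Conceptual Progressive Query: yield an updated top-k ranking after
        # each processed sub-set instead of waiting for a full scan (as NQ does).
        best = []                                  # heap of (-distance, item index)
        for start in range(0, len(database), chunk_size):
            for idx in range(start, min(start + chunk_size, len(database))):
                entry = (-dist(query, database[idx]), idx)
                if len(best) < top_k:
                    heapq.heappush(best, entry)
                elif entry > best[0]:              # closer than the current worst
                    heapq.heapreplace(best, entry)
            # Periodic retrieval result: ready for display and relevance feedback.
            yield sorted((-d, idx) for d, idx in best)

    # Toy usage with 1-D features and absolute difference as the distance:
    db = [0.3, 0.8, 0.1, 0.55, 0.2, 0.9]
    for step, ranking in enumerate(progressive_query(0.25, db, lambda a, b: abs(a - b),
                                                     chunk_size=2, top_k=3), 1):
        print("after sub-set", step, ranking)      # (distance, index) pairs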
Yet the ultimate objective, the retrieval of the most relevant items at the earliest possible time, cannot be fulfilled without an efficient indexing scheme. It is a fact that existing indexing methods have not coped with the requirements set by multimedia databases, such as the presence of multiple features possibly in high dimensions, dynamic construction capability, the need to prevent corruption in large-scale databases, robustness against the low discrimination power of low-level features, etc. A novel indexing structure, the Hierarchical Cellular Tree, is presented to fulfil these requirements. Moreover, it is experimentally verified in this thesis that it can work in harmony with PQ to retrieve, at an early stage, the items most relevant to the user's query.
Visualization of multimedia primitives and query by example (searching for a specific item of interest) are the two complementary sides of a typical retrieval scheme. The user may want to switch back and forth between the two modes, as provided by the enhanced GUI of MUVIS. Furthermore, the user may need an efficient browsing scheme to reach the item(s) of interest, which can then be used for a query operation. However, the user may have neither a definitive idea about what exactly the item of interest must be, nor where it can be found; hence, the navigation process should be entirely guided. Such a “guided tour” among the database
items, along with a hierarchical summarization of the database, is provided as a side feature of the HCT indexing body. Depending on the level of discrimination that the features can provide, it is shown that the HCT indexing structure scales well with the database size, and this can yield a better browsing capability for the user. Due to the subjectivity and the human factor in any browsing operation, the performance analysis and evaluations are currently limited to the statistical measures taken from the HCT indexing body. A crucial future task will be to develop a common setting for benchmarking indexing and browsing so that the performance of HCT can be assessed.
The current status of the techniques proposed in this thesis opens several interesting options and possibilities for further research. All the techniques within the context of multimedia management are designed and implemented to provide generic, automatic and global solutions. This is a direct consequence of the massive size and variations that can be seen in today's multimedia collections. Even though these techniques are developed independently of any particular multimedia type, database, application or environment, in a specific application domain such a general approach might not be optimal, and some modifications are likely to be justified.
Finally, the management of the ever-increasing, massive multimedia databases will
still be a great challenge in the future. It is obvious that it can never be done manually, yet the
occurrence of the so-called “semantic gap” with the fully automatic methods is unavoidable.
Therefore, the efforts will be focused on the development of feasible methods for narrowing
the “semantic gap” by devising better and possibly higher-level descriptors, as well as by designing
a semi-automatic framework for providing semantic annotation for the multimedia content
with a certain degree of human interaction. The ultimate goal in the latter is to improve the
level of interactivity and interoperability with smarter GUI designs, and equally to minimize
the amount of manual work. In order to accomplish this, some studies have already been
started to improve the accuracy and performance of the fully automatic techniques, while adding new capabilities such as interactive editing and semantic identification.