Mustafa Serkan Kiranyaz

Advanced Techniques for Content-Based Management of Multimedia Databases

Tampere 2005

Abstract

Digital multimedia collections are evolving at a tremendous pace as the modus operandi for information creation, exchange and storage in our modern era. This brings an urgent need for means to manage them. Earlier attempts, such as text-based indexing and information retrieval systems, show drastic limitations and require an infeasible amount of manual labour. Efforts have therefore shifted to content-based management; however, we are still at an early stage in the development of techniques that guarantee efficiency and effectiveness in content-based multimedia systems. The peculiar nature of multimedia information, such as the difficulty of semantic indexing, the complexity of multimedia identification, and the difficulty of adapting to different applications, has kept the techniques in this area in a premature state.

This thesis takes a global approach to the management of multimedia databases by providing advanced techniques encapsulated within a generic framework, MUVIS. These techniques are intended to cover the entire range of management functionalities for multimedia collections, such as indexing, browsing, retrieval, summarization and efficient access to multimedia items. MUVIS, in addition to being the host architecture for the techniques developed, is designed as a framework that provides a flexible basis for developing and testing novel descriptors, which are purposefully detached from the core of the system. Furthermore, it supports widely used multimedia formats, last-generation codecs, types and parameters, so that the robustness of the techniques and descriptors against such variations can be tested and improved. Special care has been taken in its user interface design, in particular to ensure scalable video management.

A significant contribution of this thesis is a robust framework structure in the area of audio-based multimedia management, which is highly promising yet comparatively immature with respect to its visual counterpart. The efforts are first focused on an automatic and optimized audio content classification and segmentation method following a hierarchic approach with a new fuzzy modeling concept. The proposed technique achieves solid robustness and a high accuracy level. The proposed audio-based multimedia indexing and retrieval framework supports dynamic integration of audio feature extraction modules during the indexing and retrieval phases. The experimental results show that audio retrieval achieves equal or better performance compared with visual queries.

For the evaluation of any technique within the context of multimedia management, the ultimate measure of performance is user satisfaction. Especially in databases without an indexing structure, retrieval times under the traditional query methodology grow in proportion to the database size. Therefore, this thesis presents a simple yet efficient query method, the Progressive Query, which is developed to achieve several innovative retrieval capabilities for databases with or without an indexing structure. It provides enhanced user interaction and query browsing capabilities to ensure solid control and a better relevance feedback scheme.
It achieves superior performance in terms of speed, minimal system requirements, user interaction and the possibility of better retrievals compared with the traditional query method based on exhaustive search. In order to accomplish the primary objective of any query operation, namely the retrieval of the most relevant items at the earliest possible time, an efficient indexing scheme, the Hierarchical Cellular Tree, is then introduced. It is specifically designed to cope with the indexing requirements of multimedia databases, such as the presence of multiple features possibly in high dimensions, dynamic construction and editing capability, the need to prevent corruption in large-scale databases, and robustness against the limited discrimination power and deficiencies of low-level features. The earliest retrieval of the most relevant items is then shown to be feasible by running the Progressive Query jointly over the proposed indexing structure.

Another major retrieval scenario is database browsing. Browsing is a flexible operation that nevertheless requires continuous interaction and feedback from the user. It is likely to be the initial operation needed to locate the example item for a query-by-example operation; its main purpose is therefore to access the items of interest in an efficient way, even though the definition of those items may not be crystal clear. The thesis presents an effective browsing scheme that is fully compliant with the Hierarchical Cellular Tree indexing structure. By using the hierarchical structure of this indexing scheme, two features important for browsing efficiency can be achieved: a hierarchical summarization, that is, a mental picture of the database for the user's perception, and a guided navigation scheme that the user can follow among the database items. Finally, the interaction between the two retrieval scenarios, query by example and database browsing, is accomplished by a proper user interface design so that the user has the continuous option of switching back and forth between them.

Preface

"The definition of success--To laugh much; to win respect of intelligent persons and the affections of children; to earn the approbation of honest critics and endure the betrayal of false friends; to appreciate beauty; to find the best in others; to give one's self; to leave the world a little better, whether by a healthy child, a garden patch, or a redeemed social condition; to have played and laughed with enthusiasm, and sung with exultation; to know even one life has breathed easier because you have lived--this is to have succeeded."
Ralph Waldo Emerson

The research presented in this thesis has been carried out at the Signal Processing Institute of Tampere University of Technology, Finland, as a part of the MUVIS project. First and foremost, I wish to express my deepest gratitude to my supervisor, Professor Moncef Gabbouj, for his guidance, constant support and patience during all these years and, above all, his belief in me when it was most needed. I would like to thank the reviewers of this thesis, Professor Erkki Oja from the Laboratory of Computer and Information Science at Helsinki University of Technology and Professor Jaakko Sauvola, head of the Media Team at the University of Oulu, for their constructive feedback and helpful comments.
In addition, I would like to thank my MSc supervisor, Professor Levent Onural from Bilkent University, for being the first true light in my career and for guiding me through the labyrinths of knowledge and wisdom during my earlier years in this fascinating field.

Over the years I have had the privilege to work with a wonderful group of people, my colleagues and all my friends here. The sum of our achievements together is much more than any individual achievement, and I strongly believe that together we have really built something significant. I thank all of them with all my heart. More thanks are due to Vivre Larmila, Elina Orava and Ulla Siltaloppi for their kind help whenever needed. Warmest thanks go to all my close friends, especially Erdogan Özdemir, Burak Kirman, Esin and Olcay Güldogan, and Aytaç Sen, in short the members of our small Turkish community in Tampere, and also all my buddies abroad, Kerem Ayhan, Güner Aktürk, Alper Yildirim, Özgür Güleryüz, Utku Asim, Ugur Türkoglu, Tunç Bostanci, Emre Aksu, Kerem Çaglar, and the rest of them, not mentioned here but unforgotten, for their spiritual support and friendship throughout all these years. The financial support provided by the Tampere Graduate School of Information Science and Engineering (TISE) is also gratefully acknowledged.

Last but not least, I wish to express my warmest thanks to my parents, Gönül and Yavuz, and to my brother, Sertaç, for their endless love and support, and for always being near me despite the physical distance that has separated us for all these years. For being the light and color in those moments when everything seems in vain and everything fades out, I dedicate this thesis to my beloved family.

Tampere, June 2005
Serkan Kiranyaz

Contents

Abstract
Preface
Contents
List of Publications
List of Acronyms
List of Tables
List of Figures

1. Introduction
   1.1. CONTENT-BASED MULTIMEDIA MANAGEMENT
   1.2. OUTLINE OF THE THESIS
   1.3. PUBLICATIONS

2. MUVIS Framework
   2.1. MUVIS OVERVIEW
      2.1.1. Block Diagram of the System
      2.1.2. MUVIS Multimedia Family
      2.1.3. MUVIS Applications
         2.1.3.1 AVDatabase
         2.1.3.2 DbsEditor
         2.1.3.3 MBrowser
      2.1.4. MUVIS Databases
   2.2. INDEXING AND RETRIEVAL SCHEME
      2.2.1. Indexing Methods
      2.2.2. Retrieval Methods
   2.3. FEATURE EXTRACTION FRAMEWORK
      2.3.1. Aural Feature Extraction: AFeX
      2.3.2. Visual Feature Extraction: FeX
   2.4. VIDEO SUMMARIZATION
      2.4.1. Scene Analysis by MST
      2.4.2. Scene Analysis by NNE
      2.4.3. Video Summarization Experiments
      2.4.4. Scalable Video Management
         2.4.4.1 ROI Access and Query
         2.4.4.2 Visual Query of Video

3. Unsupervised Audio Classification and Segmentation
   3.1. AUDIO CLASSIFICATION AND SEGMENTATION - AN OVERVIEW
   3.2. SPECTRAL TEMPLATE FORMATION
      3.2.1. Forming the MDCT Template from MP3/AAC Bit-Stream
         3.2.1.1 MP3 and AAC Overview
         3.2.1.2 MDCT Template Formation
      3.2.2. Spectral Template Formation in Generic Mode
   3.3. FEATURE EXTRACTION
      3.3.1. Frame Features
         3.3.1.1 Total Frame Energy Calculation
         3.3.1.2 Band Energy Ratio Calculation
         3.3.1.3 Fundamental Frequency Estimation
         3.3.1.4 Subband Centroid Frequency Estimation
      3.3.2. Segment Features
         3.3.2.1 Dominant Band Energy Ratio
         3.3.2.2 Transition Rate vs. Pause Rate
         3.3.2.3 Fundamental Frequency Segment Feature
         3.3.2.4 Subband Centroid Segment Feature
      3.3.3. Perceptual Modeling in Feature Domain
   3.4. GENERIC AUDIO CLASSIFICATION AND SEGMENTATION
      3.4.1. Step 1: Initial Classification
      3.4.2. Step 2
      3.4.3. Step 3
      3.4.4. Step 4
         3.4.4.1 Intra Segmentation by Binary Division
         3.4.4.2 Intra Segmentation by Breakpoints Detection
   3.5. EXPERIMENTAL RESULTS
      3.5.1. Feature Discrimination and Fuzzy Modeling
      3.5.2. Overall Classification and Segmentation Performance

4. Audio-Based Multimedia Indexing and Retrieval
   4.1. AUDIO INDEXING AND RETRIEVAL - AN OVERVIEW
   4.2. A GENERIC AUDIO INDEXING SCHEME
      4.2.1. Unsupervised Audio Classification and Segmentation
      4.2.2. Audio Framing
      4.2.3. A Sample AFeX Module Implementation: MFCC
      4.2.4. Key-Framing via MST Clustering
   4.3. AUDIO RETRIEVAL SCHEME
   4.4. EXPERIMENTAL RESULTS
      4.4.1. Classification and Segmentation Effect on Overall Performance
         4.4.1.1 Accuracy
         4.4.1.2 Speed
         4.4.1.3 Disk Storage
      4.4.2. Experiments on Audio-Based Multimedia Indexing and Retrieval

5. Progressive Query: A Novel Retrieval Scheme
   5.1. QUERY TECHNIQUES - AN OVERVIEW
   5.2. PROGRESSIVE QUERY
      5.2.1. Periodic Sub-Query Formation
         5.2.1.1 Atomic Sub-Query
         5.2.1.2 Fractional Sub-Query
         5.2.1.3 Sub-Query Fusion Operation
      5.2.2. PQ in Indexed Databases
   5.3. HIGH PRECISION PQ - THE NEW APPROACH
   5.4. EXPERIMENTAL RESULTS
      5.4.1. PQ in MUVIS
      5.4.2. PQ versus NQ
      5.4.3. PQ versus HP PQ
      5.4.4. Remarks and Evaluation

6. A Novel Indexing Scheme: Hierarchical Cellular Tree
   6.1. DATABASE INDEXING METHODS - AN OVERVIEW
   6.2. HCT FUNDAMENTALS
      6.2.1. Cell Structure
         6.2.1.1 MST Formation
         6.2.1.2 Cell Nucleus
         6.2.1.3 Cell Compactness Feature
         6.2.1.4 Cell Mitosis
      6.2.2. Level Structure
      6.2.3. HCT Operations
         6.2.3.1 Item Insertion Algorithm for HCT
         6.2.3.2 Item Removal Algorithm for HCT
      6.2.4. HCT Indexing
         6.2.4.1 HCT Incremental Construction
         6.2.4.2 HCT (Periodic) Fitness Check
   6.3. PQ OVER HCT
      6.3.1. QP Formation from HCT
      6.3.2. PQ Operation over HCT
   6.4. HCT BROWSING
   6.5. EXPERIMENTAL RESULTS
      6.5.1. Performance Evaluation of HCT Indexing
         6.5.1.1 Statistical Analysis
         6.5.1.2 Performance Evaluation
      6.5.2. PQ over HCT
      6.5.3. Remarks and Evaluation

7. Conclusions

Bibliography

List of Publications

This thesis is written on the basis of the following publications. In the text, these publications are referred to as Publications [P1], ..., [P14].

[P1] S. Kiranyaz, A. F. Qureshi and M. Gabbouj, "A Generic Audio Classification and Segmentation Approach for Multimedia Indexing and Retrieval", IEEE Transactions on Speech and Audio Processing, in print.

[P2] S. Kiranyaz and M. Gabbouj, "A Novel Multimedia Retrieval Technique: Progressive Query (WHY WAIT?)", IEE Proceedings Vision, Image and Signal Processing, in print.

[P3] M. Gabbouj and S. Kiranyaz, "Audio-Visual Content-Based Multimedia Indexing and Retrieval - the MUVIS Framework", In Proc. of the 6th International Conference on Digital Signal Processing and its Applications, DSPA 2004, pp. 300-306, Moscow, Russia, March 31 - April 2, 2004.

[P4] S. Kiranyaz, K. Caglar, E. Guldogan, O. Guldogan, and M. Gabbouj, "MUVIS: A Content-Based Multimedia Indexing and Retrieval Framework", In Proc. of the Seventh International Symposium on Signal Processing and its Applications, ISSPA 2003, pp. 1-8, Paris, France, 1-4 July 2003.

[P5] S. Kiranyaz, M. Aubazac, and M. Gabbouj, "Unsupervised Segmentation and Classification over MP3 and AAC Audio Bitstreams", In Proc. of WIAMIS Workshop, pp. 338-345, London, England, 2003.
[P6] S. Kiranyaz, K. Caglar, B. Cramariuc, and M. Gabbouj, "Unsupervised Scene Change Detection Techniques in Feature Domain via Clustering and Elimination", In Proc. of the IWDC 2002 Conference on Advanced Methods for Multimedia Signal Processing, Capri, Italy, September 2002.

[P7] O. Guldogan, E. Guldogan, S. Kiranyaz, K. Caglar, and M. Gabbouj, "Dynamic Integration of Explicit Feature Extraction Algorithms into MUVIS Framework", In Proc. of the 2003 Finnish Signal Processing Symposium, FINSIG'03, pp. 120-123, Tampere, Finland, 19-20 May 2003.

[P8] S. Kiranyaz, A. F. Qureshi, and M. Gabbouj, "A Fuzzy Approach Towards Perceptual Classification and Segmentation of MP3/AAC Audio", In Proc. of the First International Symposium on Control, Communications and Signal Processing, ISCCSP 2004, pp. 727-730, Hammamet, Tunisia, 21-24 March 2004.

[P9] S. Kiranyaz and M. Gabbouj, "A Dynamic Content-Based Indexing Method for Multimedia Databases: Hierarchical Cellular Tree", In Proc. of IEEE International Conference on Image Processing, ICIP 2005, Genova, Italy, September 2005, to appear.

[P10] S. Kiranyaz and M. Gabbouj, "Hierarchical Cellular Tree: An Efficient Indexing Method for Browsing and Navigation in Multimedia Databases", In Proc. of European Signal Processing Conference, EUSIPCO 2005, Antalya, Turkey, September 2005, to appear.

[P11] E. Guldogan, O. Guldogan, S. Kiranyaz, K. Caglar, and M. Gabbouj, "Compression Effects on Color and Texture Based Multimedia Indexing and Retrieval", In Proc. of IEEE International Conference on Image Processing, ICIP 2003, Barcelona, Spain, September 2003.

[P12] I. Ahmad, S. Abdullah, S. Kiranyaz, and M. Gabbouj, "Content-Based Image Retrieval on Mobile Devices", In Proc. of SPIE (Multimedia on Mobile Devices), vol. 5684, San Jose, USA, 16-20 January 2005, to appear.

[P13] I. Ahmad, S. Kiranyaz, and M. Gabbouj, "Progressive Query Technique for Image Retrieval on Mobile Devices", In Proc. of Fourth International Workshop on Content-Based Multimedia Indexing, Riga, Latvia, 21-23 June 2005, to appear.

[P14] S. Kiranyaz, M. Ferreira, and M. Gabbouj, "A Novel Feature Extraction Method Based on Segmentation over Edge Field for Multimedia Indexing and Retrieval", In Proc. of WIAMIS Workshop, Montreux, Switzerland, 13-15 April 2005.
List of Acronyms

2D      2-Dimensional
AAC     (MPEG-2/4) Advanced Audio Codec
AFeX    Audio Feature Extraction
API     Application Programming Interface
AV      Audio-Visual
AVI     Audio Video Interleaved (Microsoft)
BER     Band Energy Ratio
CBIR    Content-Based Image Retrieval
CPU     Central Processing Unit
Ds      Descriptors
DBER    Dominant Band Energy Ratio
DFT     Discrete Fourier Transform
DLL     Dynamic Link Library
DSs     Description Schemes
FeX     Feature Extraction
FF      Fundamental Frequency
FFT     Fast Fourier Transform
FT      Fourier Transform
FV      Feature Vector
GLCM    Gray Level Co-occurrence Matrix
GMM     Gaussian Mixture Model
GUI     Graphical User Interface
HCT     Hierarchical Cellular Tree
HMM     Hidden Markov Model
HP PQ   High Precision Progressive Query
HPS     Harmonic Product Spectrum
HSV     Hue, Saturation and (Luminance) Value
HVS     Human Visual System
ISO     International Organization for Standardization
JPEG    Joint Photographic Experts Group
KF      Key-Frame
kNN     k Nearest Neighbours
MAM     Metric Access Method
MDCT    Modified Discrete Cosine Transform
MFCC    Mel-Frequency Cepstrum Coefficients
MPEG    Moving Picture Experts Group
MP3     MPEG Layer 3
MST     Minimum Spanning Tree
MUVIS   Multimedia Video Indexing and Retrieval System
NQ      Normal Query
NNE     Nearest Neighborhood Elimination
P       Precision
PCA     Principal Component Analysis
PCM     Pulse Coded Modulation
PR      Pause Rate
PSQ     Progressive Sub-Query
PQ      Progressive Query
ROI     Region of Interest
R       Recall
RGB     Red, Green and Blue
SAM     Spatial Access Method
SC      Subband Centroid
SD      Scene Detection
SOM     Self Organizing Maps
TFE     Total Frame Energy
TR      Transition Rate
TS-SOM  Tree Structured Self Organizing Maps
QBE     Query by Example
QBH     Query by Humming
QTT     Query Total Time
QP      Query Path
UI      User Interface
ZCR     Zero Crossing Rate

List of Tables

Table I: MUVIS Multimedia Family
Table II: The ground-truth table for automatic scene detection algorithms in 10 sample video sequences
Table III: The MDCT template array dimension with respect to Compression Type and Windowing Mode
Table IV: Transition Penalization Table
Table V: Generic Decision Table
Table VI: Error Distribution Table for Bit-Stream Mode
Table VII: Error Distribution Table for Generic Mode
Table VIII: QTT (Query Total Time) in seconds of 10 aural retrieval examples from the Real World database
Table IX: PR values of 10 Aural/Visual Retrieval (via QBE) Experiments in the Open Video Database
Table X: PR values of 10 Aural/Visual Retrieval (via QBE) Experiments in the Real World Database
Table XI: Statistics obtained from the ground level of HCT bodies extracted from the sample MUVIS databases
Table XII: Retrieval times (in msec) for 10 visual and aural query operations performed per query type

List of Figures

Figure 1: General structure of MUVIS framework
Figure 2: MUVIS AVDatabase application creating a video database in real time
Figure 3: MUVIS DbsEditor application
Figure 4: MUVIS MBrowser application with an image (progressive) query performed
Figure 5: AFeX module interaction with MUVIS applications
Figure 6: FeX module interaction with MUVIS applications
Figure 7: MST clustering illustration
Figure 8: Number of scene frames versus threshold sketch for Figure 9
Figure 9: Key-Frames (top) and Unsupervised Scene Frames by NNE (bottom-up) and MST (bottom-down)
Figure 10: Key-Frames (top) and Semi-Automatic 3 Scene Frames by NNE (bottom-up) and MST (bottom-down)
Figure 11: ROI Selection and Rendering from the key-frames of a video clip
Figure 12: ROI (Visual) Query of the example in Figure 11
Figure 13: Different error types in classification
Figure 14: Classification and Segmentation Framework
Figure 15: MP3 Long Window MDCT template array formation from MDCT subband coefficients
Figure 16: MP3 Short Window MDCT template array formation from MDCT subband coefficients
Figure 17: AAC Long Window MDCT template array formation from MDCT subband coefficients
Figure 18: AAC Short Window MDCT template array formation from MDCT subband coefficients
Figure 19: Generic Mode Spectral Template Formation
Figure 20: FF detection within a harmonic frame
Figure 21: Perceptual Modeling in Feature Domain
Figure 22: The flowchart of the proposed approach
Figure 23: Operational Flowchart for Step 1
Figure 24: Operational Flowchart for Step 2
Figure 25: Operational Flowchart for Step 3
Figure 26: Intra Segmentation by Binary Division in Step 4
Figure 27: Windowed SC standard deviation sketch (white) in a speech segment. Breakpoints are successfully detected with the Roll-Down algorithm and the music sub-segment is extracted
Figure 28: Frame and Segment Features on a sample classification and segmentation
Figure 29: Audio Indexing Flowchart
Figure 30: A sample audio classification conversion
Figure 31: The derivation of mel-scaled filterbank amplitudes
Figure 32: An illustrative clustering scheme
Figure 33: KF Rate (%) Plots
Figure 34: A class-based audio query illustration showing the distance calculation per audio frame
Figure 35: PR curves of an aural retrieval example within the Real World database indexed with (left) and without (right) using the classification and segmentation algorithm
Figure 36: Three visual (left) and aural (right) retrievals in the Open Video database. The top-left clip is the query
Figure 37: Progressive Query Overview
Figure 38: Formation of a Periodic Sub-Query
Figure 39: A sample fusion operation between subsets X and Y
Figure 40: Query path formation in a hypothetical indexing structure
Figure 41: HP PQ Overview
Figure 42: MBrowser GUI showing a PQ operation where the 10th PSQ is currently active (or set manually)
Figure 43: Memory usage for PQ and NQ
Figure 44: PQ retrievals of a query image (left-top) within three PSQs (t_p = 0.2 sec)
Figure 45: Aural PQ retrievals of a video clip (left-top) in 12 PSQs; only the 1st, 6th and 12th are shown (t_p = 5 sec)
Figure 46: Visual PQ retrievals of a video clip (left-top) in 4 PSQs (t_p = 3 sec)
Figure 47: Aural PQ Overall Retrieval Time and PSQ Number vs. PQ Period
Figure 48: Visual PQ Overall Retrieval Time and PSQ Number vs. PQ Period
Figure 49: PSQ and PQ retrieval times for the sample retrieval example given in Figure 45
Figure 50: PSQ retrieval times for single and multi-threaded (HP) PQ schemes (t_p = 5 sec)
Figure 51: A sample dynamic item (5) insertion into a 4-node MST
Figure 52: A sample mitosis operation over a mature cell C
Figure 53: A sample 3-level HCT body
Figure 54: M-Tree rationale that is used to determine the most suitable nucleus (routing) object for two possible cases. Note that in both cases the rationale fails to track the closest nucleus object on the lower level
Figure 55: QP formation on a sample HCT body
Figure 56: HCT Browsing GUI on the MUVIS MBrowser application
Figure 57: An HCT Browsing example, which starts from the third level within the Corel_1K MUVIS database. The user navigates among the levels shown with the lines through the ground level
Figure 58: Another HCT Browsing example, which starts from the third level within the Texture MUVIS database. The user navigates among the levels shown with the lines through the ground level
Figure 59: QP plot of a sample image query in the Corel_1K database
Figure 60: QP plot of a sample video query in the Sports database

Chapter 1

Introduction

Multimedia, as a generic term, involves the combination of two or more media types to effectively create a sequence of events that communicates information, usually with both sound and visual support. Multimedia technology can then generally be defined as the combined use of several methods of storage, sensory transmission and, finally, consumption employed to deliver information to a terminal. Under this definition, multimedia technology is old and widely used, comprising radio, television, performance art, and many printed and educational materials. All of these systems involve the use of multiple sensory formats to facilitate the transmittal of information. It is in the digital age, however, that the term multimedia has taken on the definition and level of prestige that it currently enjoys. The advent of digital technologies has increased multimedia capabilities and potential to unprecedented levels. Digital multimedia is then defined as the process of employing a variety of digital items, possibly synchronized and perhaps embedded within one another, or within an application, to present and transmit information. The term digital item is used generically for any type of digitized information, such as still images, audio and video clips.
Digital multimedia technologies, which provide powerful means to acquire and incorporate knowledge from various sources for a broad range of applications, have a strong impact on daily life and have changed our way of learning, thinking and living. The rapid advance of multimedia technology over the last decade has brought about fundamental changes to computing, entertainment and education, and it has presented our computerized society with opportunities and challenges that are in many cases unprecedented. As the use of digital multimedia increases, effective data storage and management become increasingly important. In fields that use large multimedia collections, there is a pressing need to minimize the volume of data stored while meeting the often conflicting demand for accurate data representation. In addition, a multimedia collection, or even a single digital item, needs to be managed such that users have efficient access, search and browsing capabilities and can effectively consume the required data. Therefore, the rest of the thesis focuses on the development of efficient techniques for browsing, indexing, retrieval and so on; in short, the "management" of multimedia data.

1.1. CONTENT-BASED MULTIMEDIA MANAGEMENT

As the revolutionary advances of the information era continue into the new millennium, the generation and dissemination of digital multimedia content continues to witness phenomenal growth. Especially with the advances in storage technology and the advent of the World Wide Web, there has been an explosion in the amount and complexity of digital information being generated, analyzed, stored, accessed and transmitted. However, this rate of growth has not been matched by the simultaneous emergence of technologies that can manage the content efficiently. State-of-the-art systems for content management lag far behind the expectations of their users. The users mainly expect these systems to perform analysis at the same level of complexity and semantics that a human would perceive while analyzing the content. Herein lies the reason why no commercially successful systems for content management exist yet. Humans assimilate information at a semantic level and do so with remarkable ease thanks to human intelligence, the natural presence of knowledge and years of experience. The human ability to apply knowledge to the task of sifting through large volumes of auditory, somatic, proprioceptive or visual data and extracting only the relevant information is indeed amazing. The troika of sensory organs, short-term and long-term memory, and the ability to learn and reason based on sensory inputs (through supervised or unsupervised training) is the mainstay of the human ability to perform semantic analysis on multimedia data. Hence, to make use of this vast amount of multimedia data, there is an undeniable need to develop techniques that efficiently retrieve items of interest from large multimedia repositories based on their content. Because textual annotations capture content only poorly and do not scale to large data sets, owing to the high degree of manual effort required to perform the annotations, content-based retrieval over low-level features has become a promising research direction. Using low-level features, query-by-example (QBE) based retrieval performs relatively well for images, audio clips and even video clips.
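The mechanics of such QBE retrieval can be stated compactly: each database item is represented by one or more pre-extracted low-level feature vectors, and retrieval amounts to ranking the items by their feature-space distance to the feature vector of the query example. The following minimal sketch illustrates only this idea; the item names, feature vectors and the Euclidean (L2) distance used here are illustrative placeholders, not the descriptors or distance functions of MUVIS or any particular system.

```cpp
// Minimal query-by-example (QBE) sketch: rank items by feature distance.
#include <algorithm>
#include <cmath>
#include <iostream>
#include <string>
#include <vector>

struct Item {
    std::string name;
    std::vector<double> feature;   // pre-extracted low-level feature vector
};

// Euclidean (L2) distance between two feature vectors of equal length.
double L2(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(s);
}

int main() {
    // A miniature "database" of items with already extracted features.
    std::vector<Item> database = {
        {"sunset_1", {0.9, 0.4, 0.1}},
        {"forest_2", {0.1, 0.8, 0.2}},
        {"sunset_3", {0.8, 0.5, 0.2}},
        {"city_4",   {0.3, 0.3, 0.9}},
    };
    // The query example: another sunset-like feature vector.
    std::vector<double> query = {0.85, 0.45, 0.15};

    // QBE: rank every database item by its distance to the query example.
    std::sort(database.begin(), database.end(),
              [&](const Item& x, const Item& y) {
                  return L2(x.feature, query) < L2(y.feature, query);
              });
    for (const Item& it : database)
        std::cout << it.name << "  d = " << L2(it.feature, query) << "\n";
    return 0;
}
```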
As impressive as these systems may be, it is obvious that they do not really address many multimedia management requirements from the consumers', companies' and professionals' points of view. Such systems, for instance, might be good at finding sunset images using color histograms, but they do not appreciably help a user who is really looking for a frame of "Tom Cruise at the Oscars". Only when such systems reach the point of offering true video and audio "content" understanding, not just similarity matching, will we be able to manage multimedia by semantic content. Unfortunately, this problem seems to be unfeasibly difficult, if not impossible, at least for the moment. Although there has been some progress in automatic techniques such as speech recognition, text identification and face recognition, they can only be used for a limited set of multimedia collections and they are far from being robust and reliable. Short of addressing the hard Artificial Intelligence problems in this context, how can we advance the state of the art? How should the management of ever-increasing, vast multimedia collections be performed efficiently by employing "feasible" methodologies with the available features, so as not to get lost among them?

The current state of the art in multimedia management systems centers around two fundamental approaches to the organization of semantics. The first comprises the manual and semi-manual annotation techniques loosely referred to as metadata initiatives; the second is the automatic extraction of information based on the visual and aural characteristics of multimedia items. The well-known drawbacks and limitations of the first approach, such as the infeasibility of its application over large collections and its user (annotator) dependency, make it usable only in some specific areas and applications, and it is far from being the global solution to the management problem. The second approach, although more promising, also lacks "feasible" techniques that provide robust and reliable solutions. At the heart of any technique for semantic analysis is the age-old question: where does true "content" meaning lie? Does it lie in the relationships among the objects and the audiovisual properties in any one audio section, image or scene? Or does it "emerge" from the real-world context and history of the multimedia objects and their potential users, the humans? Researchers who believe the former tend to focus more on automated techniques for feature extraction, content analysis and retrieval. Those favoring the latter tend to focus more on advanced applications with enhanced user interface design, querying and relevance feedback strategies. However, there is common agreement in the field that these two approaches need to converge, since the answer to this old question is not unique but is rather shared between the two. So far, the most promising framework that integrates the two approaches appears to be the standardization activity MPEG-7 [26], formally named the "Multimedia Content Description Interface". It is the standard that describes multimedia content so that users can search, browse and retrieve content more efficiently and effectively than they could using today's mainly text-based search engines such as Yahoo, Google, etc.
Yet MPEG-7 aims to standardize only a common interface for describing multimedia items ("the bits about the bits"), that is, a common syntax for the descriptions (Ds) and description schemes (DSs). Consequently, MPEG-7 neither standardizes the (automatic) extraction of audiovisual descriptions or features, nor does it specify the efficient multimedia management techniques and methodologies that can make use of the descriptions. In this context it is only a common language that such techniques can use whenever needed. In this thesis, we intend to address the problem from the point where MPEG-7 leaves off, that is, by bringing a global approach to the management of multimedia databases and providing efficient techniques encapsulated within a feasible framework, MUVIS (Multimedia Video Indexing and Retrieval System) [P3], [P4], [P6], [P11], [43]. The proposed techniques span the entire set of management functionalities, such as indexing, browsing, retrieval, summarization and efficient access to multimedia items. Developing novel features or improving the performance of existing ones is another objective; therefore, a feature extraction framework has been embedded into MUVIS from the early stages. Via this framework, several visual and aural feature extraction techniques can be employed in any of the management activities in real time. The primary motivation for designing a global framework structure that supports several digital formats, codecs, types and parameters lies in the following fact: the semantic content is totally independent of the various formats and representations of digital multimedia. Therefore, it is crucial to have a framework or test-bed structure such as MUVIS in order to develop techniques that are robust against such variations.

From the content-based multimedia retrieval point of view, the audio information can be even more important than the visual part, since it is mostly unique and significantly stable over the entire duration of the content. However, audio-based studies lag far behind their visual counterpart, and the development of robust and reliable aural content management systems is still in its infancy. Therefore, the current efforts in this thesis are especially focused on audio-based content analysis and multimedia management.

1.2. OUTLINE OF THE THESIS

The thesis is organized as follows. Chapter 2 presents a general overview of the MUVIS framework, into which all the advanced techniques proposed for content-based multimedia management are either embedded or over which they are implemented. In particular, this chapter discusses the global approach to the management and handling of multimedia items during their entire lifetime, from (real-time) capturing or insertion into a MUVIS database via conversion, to their efficient consumption (retrieval, browsing, etc.) by the user. Furthermore, particular emphasis is given to the generic feature extraction framework, which allows efficient development and real-time integration of feature extraction algorithms into the system, and to the video management architecture, which provides hierarchical handling, querying and summarization capabilities. All of the author's MUVIS-related publications are contained in [P3], [P4], [P6], [P7] and [P11]. The issues related to audio-based multimedia management are covered in Chapters 3 and 4.
In Chapter 3, we focus on generic and automatic audio classification and segmentation for audio-based multimedia indexing and retrieval applications. In particular, we present a fuzzy approach towards hierarchic audio classification and a global segmentation framework based on automatic audio analysis, providing robust, bi-modal and parameter-invariant classification over automatically extracted audio segments. This chapter is mainly based on the author's original publications [P1], [P5] and [P8]. Chapter 4 presents a generic and robust audio-based multimedia indexing and retrieval framework, which supports the dynamic integration of audio feature extraction modules during the indexing and retrieval phases and therefore provides a test-bed platform for developing robust and efficient aural feature extraction techniques. A sample audio feature extraction technique is also developed in this chapter. Furthermore, this framework builds on the high-level content classification and segmentation presented in Chapter 3 in order to improve the speed and accuracy of aural retrieval. The work presented in this chapter is based on the author's publication [P4].

In Chapters 5 and 6, novel indexing, retrieval and browsing schemes are proposed. Chapter 5 first presents a novel multimedia retrieval technique called the Progressive Query (PQ). PQ is designed to bring an effective solution especially for querying large multimedia databases. In addition, PQ produces intermediate retrieval results during the execution of the query, which finally converge to the full-scale search result in a faster way and with no minimum system requirements. The original work on PQ was published in [P2]. Chapter 6 presents a novel indexing technique, called the Hierarchical Cellular Tree (HCT), which is designed to bring an effective solution especially for the indexing of large multimedia databases and, furthermore, to provide an enhanced browsing capability that enables users to perform a "guided tour" among the database items. The proposed indexing scheme is then optimized for the query method introduced in Chapter 5, the Progressive Query, in order to maximize the retrieval efficiency from the user's point of view. Chapter 6 is mainly based on the author's original publications [P9] and [P10]. Due to the amount and diversity of the subjects, each chapter contains its own introduction with an extensive literature survey, experimental results and concluding remarks. Finally, the conclusions of the thesis are drawn in Chapter 7.

1.3. PUBLICATIONS

The majority of the author's contributions to the field of content-based management of multimedia databases are shared among Chapters 2 to 6. As mentioned earlier, MUVIS is the host framework into which all the proposed techniques are either embedded or over which they are implemented; therefore, the structure of the thesis follows the natural development phases of MUVIS. The main contributions can be summarized in the following points:

• The design and implementation of the MUVIS system, in [P3] and [P4].
• The hierarchical video management, in [P6].
• The design of a dynamic feature extraction framework within MUVIS, in [P7].
• A generic and robust audio classification and segmentation scheme, in [P1], [33], [P5] and [P8].
• An audio-based multimedia indexing and retrieval framework, in [P4].
• A novel multimedia retrieval technique, the Progressive Query, in [P2] and [34].
• An efficient indexing and browsing method for content-based retrieval in multimedia databases, the Hierarchical Cellular Tree, in [P9] and [P10].
• An ongoing study on compression effects on color and texture based multimedia indexing and retrieval, in [P11].
• The extension of the MUVIS framework to mobile platforms, M-MUVIS, in [P12].
• An implementation of the Progressive Query technique on mobile platforms, in [P13].
• A novel feature and shape extraction method based on segmentation over the Canny edge field [13] for multimedia indexing and retrieval, in [P14].

Note that the last four contributions are deliberately excluded from this thesis; however, they are closely linked with the work presented here and will be used by the associated co-authors in their doctoral theses.

Chapter 2

MUVIS Framework

MUVIS was initially created as a Java application in the late 1990s to provide an indexing and retrieval framework for large image databases using visual and semantic features such as color, texture and shape. Based upon the experience and feedback gained from this first system [16], a new framework has been developed, which aims to bring a unified and global approach to the indexing, browsing and querying of various digital multimedia types such as audio/video clips and digital images. The primary motivation behind developing this new version of MUVIS is to achieve a unified and global framework and a robust set of applications for capturing, recording, indexing and retrieval, combined with browsing and various other visual and semantic capabilities.

The current MUVIS system has been developed as a framework to bring a unified and generic solution to content-based multimedia indexing and retrieval. Variations in formats, representations and other parameters in today's digital multimedia world, such as codec types, file formats, and capture and encoding parameters, may significantly affect indexing and retrieval. Therefore, by covering a wide range of the multimedia family and especially the last-generation multimedia codecs, MUVIS provides an efficient framework structure upon which robust algorithms can be implemented, tested, configured and compared against each other. Furthermore, it supports three types of browsing and five levels of hierarchic video representation and summarization and, most important of all, the MUVIS framework supports the explicit integration of aural and visual feature extraction algorithms. This brings a significant advantage for third parties, who can independently develop and test their feature extraction modules. In short, the MUVIS framework supports the following processing capabilities and properties:

• An effective framework structure, which provides an application-independent basis for developing audio and visual feature extraction techniques that are dynamically integrated into and used by the MUVIS applications for indexing and retrieval.
• Real-time audio and video capturing, encoding and recording.
• Multimedia conversions into one of the convertible formats that MUVIS supports.
• Scalable video management.
• Video summarization via scene frame extraction from the shot frames available in the video bit-stream.
• A novel Progressive Query mechanism, which provides query results progressively as the query proceeds, lets the user browse among the results obtained so far, and allows an ongoing query to be stopped once the results are satisfactory.
• An enhanced retrieval scheme based on explicit visual and aural queries initiated from any MUVIS database that includes audio/video clips and still images.
• Multimedia format and type conversions, such as MPEG-1 video (with or without MPEG-1 Layer 2 audio) into one of the MUVIS video (and audio) formats, in order to append alien multimedia files to a MUVIS database.
• A novel indexing and browsing method, the Hierarchical Cellular Tree (HCT), and the implementation of PQ over HCT.
• Audio content analysis and audio-based multimedia indexing and retrieval.

This chapter is organized as follows. Section 2.1 introduces the MUVIS system architecture, the primary applications, the MUVIS multimedia family and databases. The visual/aural indexing and retrieval schemes are discussed in Section 2.2. The dynamic visual and aural feature extraction frameworks, FeX and AFeX, are explained in Section 2.3. Finally, scalable video management and summarization are discussed in Section 2.4.

2.1. MUVIS OVERVIEW

2.1.1. Block Diagram of the System

As shown in Figure 1, the MUVIS framework is based upon three applications, each of which has different responsibilities and functionalities. AVDatabase is mainly responsible for real-time audio/video database creation, with which audio/video clips are captured, (possibly) encoded and recorded in real time from any peripheral audio and video devices connected to a computer. DbsEditor performs the indexing of the multimedia databases and therefore the offline feature extraction process over the multimedia collections is its main task. MBrowser is the primary media browser and retrieval application, into which the PQ technique is integrated as the primary retrieval (QBE) scheme. NQ is the alternative query scheme within MBrowser. Both PQ (Sequential and over HCT) and NQ can be used for the retrieval of multimedia primitives with respect to their similarity to a queried media item (an audio/video clip, a video frame or an image). Due to their unknown duration, which might cause impractical indexing times for an online query process, an (external) audio/video clip should first be appended (an offline operation) to a MUVIS database before a query can be performed on it. There is no such necessity for images; any digital image (inside or outside the active database) can be queried within the active database. The similarity distances are calculated by the particular functions implemented in the corresponding visual/aural feature extraction (FeX or AFeX) modules.

[Figure 1: General structure of the MUVIS framework — a block diagram showing the FeX and AFeX modules connected through the FeX & AFeX API to DbsEditor (database management, HCT indexing, multimedia insertion/removal, conversions, FeX/AFeX management), AVDatabase (real-time capturing, encoding and recording for audio/video database creation) and MBrowser (display, PQ and NQ querying, HCT browsing, video summarization) operating over image, video and hybrid databases.]

2.1.2. MUVIS Multimedia Family

MUVIS databases are formed using the variety of multimedia types belonging to the MUVIS multimedia family given in Table I. The associated MUVIS application allows the user to create an audio/video MUVIS database in real time via capturing, or by converting into any of the specified formats within the MUVIS multimedia family.
Since the supported audio and video formats are among the most popular and widely used ones, a native clip in a supported format can be inserted directly into a MUVIS database without any conversion. This is also true for images, but if a conversion is nevertheless required by the user, any image can be converted into one of the "convertible" image types presented in Table I.

Table I. MUVIS Multimedia Family

MUVIS Audio
  Codecs: MP3, AAC, G721, G723, PCM
  Channel number: Mono, Stereo
  Sampling frequency: 16, 22.05, 24, 32, 44.1 kHz
  File formats: MP3, AAC, AVI, MP4

MUVIS Video
  Codecs: H.263+, MPEG-4, YUV 4:2:0, RGB 24
  Frame rate: 1..25 fps
  Frame size: Any
  File formats: AVI, MP4

MUVIS Image Types
  Convertible formats: JPEG, JPEG 2K, BMP, TIFF, PNG
  Non-convertible formats: PCX, GIF, PCT, TGA, EPS, WMF, PGM

2.1.3. MUVIS Applications

MUVIS applications are developed for the Windows OS family. Figure 1 illustrates the primary applications within the current MUVIS framework. In the following subsections we review the basic features of each application while emphasizing their role in the overall system.

2.1.3.1 AVDatabase

The AVDatabase application is specifically designed for creating audio/video databases by collecting real-time audio/video files via capturing from a peripheral video/audio device, as shown in Figure 2. An audio/video clip may include only video information, only audio information, or both video and audio interlaced. Several video and audio encoding techniques can be used with any of the encoding parameters specified in Table I. Video can be captured from any peripheral video source (i.e. PC camera, TV card, etc.) in one of the following formats: YV12 (I420), i.e. YUV 4:2:0, RGB24 (or RGB32), and YUY2 (YUYV) or UYVY. If the capture format is other than YUV 4:2:0, the frame is first converted to YUV 4:2:0 format for encoding. Capturing parameters such as video frame rate and frame size can be set by the user during the recording phase. The captured video is then encoded in real time with the user-specified parameters given in Table I, recorded into a supported file format and finally appended to the active MUVIS database. Video encoding parameters such as bit-rate, frame rate, forced-intra rate (if enabled), etc. can be defined at recording time. The supported file formats handle the synchronization of the video with respect to the encoding time-stamp of each frame.

Audio-only files can also be captured, encoded, recorded and appended in real time, similar to video. For audio encoding, we also use last-generation audio encoders such as MP3 and AAC, which give significantly high quality even at very low bit-rates. ADPCM encoders such as G721 and G723 can also be used for low complexity. Furthermore, audio can be recorded in raw (PCM 16-bit) format. The compressed audio bit-stream is then recorded into an audio-only file format (container) such as MP3 and AAC, or possibly interlaced with video into a container such as AVI (Microsoft ©) or MP4 (MPEG-4 File Format v1 and v2). It is also possible to store a standalone audio bit-stream in AVI and MP4 files.

[Figure 2: MUVIS AVDatabase application creating a video database in real time.]

2.1.3.2 DbsEditor

As shown in Figure 3, the DbsEditor application is designed to handle indexing and any editing task for the MUVIS databases. Audio/video clips can be created by the AVDatabase application in real time. On the other hand, available clips can be directly appended to a MUVIS database provided that their formats are supported, see Table I. Alien formats (e.g.
MPEG-1) can first be converted by DbsEditor to one of the supported formats and then appended. Feature extraction is the primary task of the DbsEditor application. This is basically achieved by extracting and appending new features to any MUVIS database. Hence, DbsEditor can add features to, and also remove features from, any type of MUVIS database. The overall functionalities of DbsEditor can be listed as follows:

• Appending new audio/video clips and still images to any MUVIS database and removing such multimedia items from the database.
• Dynamic integration and management of feature extraction (FeX and AFeX) modules.
• Extracting new features or removing existing features of a database by using the available FeX and AFeX modules.
• Converting alien audio/video files into any MUVIS database.
• Preview of any audio/video clip or image in a database.
• Display of statistical information about a database and/or the items in a database.
• Hierarchical Cellular Tree (HCT) based visual and aural indexing.

[Figure 3: MUVIS DbsEditor application.]

2.1.3.3 MBrowser

MBrowser is the main media retrieval and browsing terminal. In its most basic form, it has all the capabilities of an advanced multimedia player (or viewer) and an efficient multimedia database browser. Furthermore, it allows users to access any multimedia item within a MUVIS database easily, efficiently and at any of the designed hierarchic levels, especially for video clips. It supports five levels of video display hierarchy: single frame, shot frames (key-frames), scene frames, a video segment and the entire video clip.

MBrowser has a built-in search and query engine, which is capable of searching a database of any multimedia type for multimedia primitives that are similar to a queried media item (a video clip, a frame or an image). In order to query an audio/video clip, it should first be appended to a MUVIS database upon which the query will be performed. There is no such requirement for images; any digital image (inside or outside the active database) can be queried within the active database. Query retrieval is based on comparing the similarity distances between the queried media item's feature vector(s) and the feature vectors of the multimedia primitives available in the database, and performing a ranking operation afterwards. The similarity distances are calculated by the particular functions implemented in the corresponding feature extraction modules. The novel retrieval scheme, the Progressive Query (PQ), has been developed and integrated into the MBrowser application. It provides instantaneous query results along with the query process and lets the user browse around the sub-query retrievals and stop an ongoing query in case the results obtained so far are satisfactory. Chapter 5 will cover the details of the PQ method. One example PQ instance is shown in Figure 4. MBrowser provides the following additional functionalities:

• Video summarization via scene detection and key-frame browsing.
• Random access support for audio/video clips.
• Display of any crucial information (i.e. database features, parameters, status, etc.) related to the active database and user commands.
• Visualization of the feature vectors of images and video key-frames.
• Various browsing options: random, forward/backward, and aural or visual HCT (if the database is indexed via HCT).

[Figure 4: MUVIS MBrowser application with an image (progressive) query performed.]

2.1.4. MUVIS Databases
The MUVIS system supports the following types of multimedia databases:

• Audio/video databases include only audio/video clips and the associated indexing information.
• Image databases include only still images and the associated indexing information.
• Hybrid databases include both audio/video clips and images, and the associated indexing information.

As illustrated in Figure 1, audio/video databases can be created using both the AVDatabase and DbsEditor applications, whereas image databases can be created using only DbsEditor. Hybrid databases can be created by appending images to video databases using DbsEditor, or by appending audio/video clips to image databases using either DbsEditor or AVDatabase. The experiments and simulations performed in this thesis use the following 8 sample MUVIS databases:

1) Open Video Database: This database contains 1500 video clips, each of which is downloaded from "The Open Video Project" web site [48]. The clips are quite old (from the 1960s) but contain color video with sound. The total duration of the database is around 46 hours. The spoken language is English.
2) Real World Audio/Video Database: There are 800 audio-only and video clips in the database with a total duration of over 36 hours. They are captured from several TV channels and the content is distributed among news, advertisements, talk shows, cartoons, etc. The speech is distributed among the English, Turkish, French, Swedish, Arabic and Finnish languages.
3) Sports Hybrid Database: There are 200 video clips with a total duration of 12 hours, mainly carrying sports content such as football, tennis and Formula-1. There are also 495 images (in GIF and JPEG formats) showing instances from football matches and other sport tournaments. The spoken language is mostly Finnish.
4) Music Audio Database: There are 550 MP3 music files from classical, techno, rock, metal, pop and some other native music genres.
5) Corel_1K Image Database: There are 1000 medium-resolution (384x256 pixels) images with diverse content such as wildlife, cities, buses, horses, mountains, beaches, food, African natives, etc.
6) Corel_10K Image Database: There are 10000 low-resolution images (thumbnail size) with diverse content such as wildlife, cities, buses, horses, mountains, beaches, food, African natives, etc.
7) Shape Image Database: There are 1500 black-and-white (binary) images that mainly represent the shapes of different objects such as animals, cars, accessories, geometric objects, etc.
8) Texture Image Database: There are 1760 texture images representing pure textures from several materials and products.

2.2. INDEXING AND RETRIEVAL SCHEME

2.2.1. Indexing Methods

The indexing of a MUVIS database is performed by the DbsEditor application in three steps. The first, mandatory, step is collecting the multimedia items within a MUVIS database. This creates a default sequential indexing in which the database items are numbered (indexed) sequentially. The other two steps are optional and are required only to enable (fast) query and (HCT) browsing functionalities. The second step is to extract visual/aural features using the available FeX and AFeX modules. As long as at least one visual or aural feature has been extracted for a database, the database can then, as the third step, be further indexed by HCT.

The HCT indexing scheme has recently been developed for MUVIS databases. It has mainly the following characteristics:

• Dynamic (incremental) indexing scheme.
• Parameter invariant (no or minimal parameter dependency).
• Dynamic cell size formation.
• Hierarchic structure with fast indexing formation (i.e. ~O(n log n)).
• Similar items are grouped into cells via mitosis operation(s).
• Optimized for PQ.

Once a MUVIS database is indexed by HCT, the most relevant items can be retrieved faster via "PQ over HCT". Moreover, an enhanced browsing scheme, the so-called HCT browsing, is enabled in the MBrowser application. However, even without HCT indexing, PQ can still be performed over the sequential indexing, in which case it is called "Sequential PQ". The structural details of the HCT indexing method are explained in Chapter 6.

2.2.2. Retrieval Methods

There are two retrieval schemes available in the MBrowser application for the multimedia items within a MUVIS database: browsing and query-by-example (QBE). MBrowser provides three different browsing methods: sequential, random and HCT. The former two require only the first (mandatory) indexing step; for HCT browsing, however, all three indexing steps should be performed. Depending on the database type and the features present, there are two different HCT browsing types: visual and aural. If both feature types are present, both browsing types can be performed on a hybrid or a video database. For image databases, only visual HCT browsing is possible. The details about the HCT browsing methods are covered in Chapter 6.

There are mainly two QBE methods available: Normal Query (NQ) and Progressive Query (PQ). NQ is the most basic QBE operation and works as follows: using the available aural or visual features (or both) of the queried multimedia item (i.e. an image, a video clip, an audio clip, etc.) and of all the database items, the similarity distances are calculated and then merged to obtain a unique similarity distance per database item. Ranking the items according to their similarity distances (to the queried item) over the entire database yields the query result. NQ for QBE is computationally slow, costly and CPU intensive, especially for large multimedia databases. Therefore, the Progressive Query (PQ) is implemented in MUVIS to provide an alternative query technique. As its name implies, PQ provides query results along with the query process and lets the user browse around the sub-query results obtained and stop the ongoing query in case the results obtained so far are satisfactory. As expected, PQ and NQ produce the same (final) retrieval results at the end, but PQ yields faster retrievals than NQ, especially if the database is HCT indexed and PQ is performed over HCT.

2.3. FEATURE EXTRACTION FRAMEWORK

As mentioned in the previous section, multimedia items can be collected via several methods, such as real-time recording or conversion, and then appended to any MUVIS database. Once items are available within a MUVIS database, the associated features are extracted and stored along with the items to complete the sequential indexing scheme for that database. In order to provide both visual and aural indexing schemes, MUVIS provides both visual and aural feature extraction frameworks, designed in such a way that feature extraction modules can be developed independently and integrated into the MUVIS system dynamically (at run-time). This is the basis of the framework structure, which allows third-party feature extraction modules to be integrated into MUVIS without knowing the details of the MUVIS applications. In the following sections we explain the details of the aural and visual feature extraction frameworks.
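To make the run-time integration concrete, the sketch below shows one plausible way a MUVIS application could load such a module on the Windows platform. It is only an illustration under simplified assumptions: the module file name and the export signatures are hypothetical, and only the function-name convention (Fex_Bind, Fex_Extract, etc., described in the following sections) follows the actual FeX/AFeX API.

```cpp
// Illustrative sketch of run-time FeX module integration on Windows.
// Hypothetical, simplified signatures; the real FeX API structures differ.
#include <windows.h>
#include <cstdio>

typedef int (*FexBindFn)(void* moduleInfo);                          // assumed signature
typedef int (*FexExtractFn)(const unsigned char* rgbFrame, int w,
                            int h, double* featVec);                 // assumed signature

int main()
{
    // DbsEditor/MBrowser would scan a module directory for Fex_*.dll files.
    HMODULE hModule = LoadLibrary(TEXT("Fex_ColorHistogram.dll"));    // hypothetical name
    if (hModule == NULL) { std::printf("FeX module not found.\n"); return -1; }

    // Resolve the entry points exported by the module.
    FexBindFn    Fex_Bind    = (FexBindFn)GetProcAddress(hModule, "Fex_Bind");
    FexExtractFn Fex_Extract = (FexExtractFn)GetProcAddress(hModule, "Fex_Extract");
    if (Fex_Bind == NULL || Fex_Extract == NULL) { FreeLibrary(hModule); return -1; }

    Fex_Bind(NULL);          // handshake once after linking
    // ... Fex_Init / Fex_Extract / Fex_GetDistance calls during indexing and retrieval ...

    FreeLibrary(hModule);    // unload when the application exits
    return 0;
}
```

Because the application depends only on the exported function names, a new descriptor can be dropped into the module directory and used for indexing and retrieval without recompiling any MUVIS application.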
2.3.1. Aural Feature Extraction: AFeX

The AFeX framework mainly supports dynamic audio feature extraction module integration for audio clips. Figure 5 shows the API functions and the linkage between the MUVIS applications and a sample AFeX module. Each audio feature extraction algorithm should be implemented as a Dynamically Linked Library (DLL) with respect to the AFeX API. The AFeX API provides the necessary handshaking and information flow between a MUVIS application and an AFeX module. Five API function properties (names and types) are declared in AFex_API.h. The creator of an AFeX module should implement all the specified API functions, which are described as follows:

[Figure 5: AFeX module interaction with MUVIS applications — DbsEditor and MBrowser call the AFex_Bind, AFex_Init, AFex_Extract, AFex_GetDistance and AFex_Exit functions exported by an AFex_*.DLL and declared in AFex_API.h.]

• AFex_Bind: Used for the handshaking operation between a MUVIS application and an AFeX module. The AFeX module fills a specific structure to introduce itself to the application. This function is called only once, at the beginning, just after the application links the AFeX module at run-time.
• AFex_Init: The feature extraction parameters are given to initialize the AFeX module. The AFeX module performs the necessary initialization operations, i.e. memory allocation, table creation, etc. This function is called for the initialization of a unique sub-feature extraction operation. A new sub-feature can be created by using a different set of feature parameters.
• AFex_Extract: Used to extract the features of an audio frame (buffer). It returns the feature vectors, which should be normalized in such a way that the total length of each vector lies between 0.0 and 1.0. This normalization is required for merging multiple (sub-)features while querying in MBrowser.
• AFex_Exit: Used for resetting and terminating the AFeX module operation. It frees the entire memory space allocated in the AFex_Bind function. Additionally, if AFex_Init has already been called, this function resets the AFeX module so that it can perform further feature extraction operations. This function is called at least once when the MUVIS application is terminated, but it might also be called at the end of each AFeX operation per sub-feature extraction.
• AFex_GetDistance: This function is used to obtain the similarity measure by calculating the distance between two feature vectors; therefore, the appropriate distance measurement algorithm should be implemented in this function. The resulting distance is returned as a double precision number.

The details about audio indexing, along with a sample AFeX module implementation, are explained in Chapter 4. A schematic skeleton of such a module is sketched below.
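The skeleton below only illustrates the calling convention just described; the parameter types, the module name and the structure passed to AFex_Bind are simplified assumptions and do not reproduce the actual AFex_API.h declarations.

```cpp
// Hypothetical skeleton of an AFeX module compiled, e.g., as AFex_Energy.dll.
// Signatures and types are simplified for illustration; the genuine
// AFex_API.h declarations used in MUVIS differ in detail.
#include <cmath>

extern "C" {

// Handshake: introduce the module (name, feature dimension, default
// parameters) to the application. Called once right after run-time linking.
__declspec(dllexport) int AFex_Bind(void* moduleDescriptor)
{
    (void)moduleDescriptor;            // would be filled with the module info
    return 0;
}

// Initialize one sub-feature with a concrete parameter set.
__declspec(dllexport) int AFex_Init(const double* params, int nParams)
{
    (void)params; (void)nParams;       // e.g. allocate tables, buffers
    return 0;
}

// Extract a feature vector from one audio frame; the vector length is
// kept within [0, 1] so that sub-features can be merged during a query.
__declspec(dllexport) int AFex_Extract(const float* frame, int frameLen,
                                       double* featVec, int dim)
{
    double energy = 0.0;
    for (int i = 0; i < frameLen; ++i) energy += frame[i] * frame[i];
    for (int d = 0; d < dim; ++d) featVec[d] = 0.0;
    featVec[0] = energy / (1.0 + energy);           // toy placeholder feature
    return 0;
}

// Similarity measure between two feature vectors (here plain Euclidean).
__declspec(dllexport) double AFex_GetDistance(const double* x, const double* y, int dim)
{
    double d2 = 0.0;
    for (int i = 0; i < dim; ++i) d2 += (x[i] - y[i]) * (x[i] - y[i]);
    return std::sqrt(d2);
}

// Release all resources allocated by the module.
__declspec(dllexport) int AFex_Exit(void) { return 0; }

} // extern "C"
```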
2.3.2. Visual Feature Extraction: FeX

Visual features are extracted from two visual media types in the MUVIS framework: video clips and images. The features of video clips are extracted from their key-frames. During the real-time recording phase, AVDatabase may optionally and separately store the uncompressed (original) key-frames of a video clip along with the video bit-stream. If the original key-frames are present, they are used for the visual feature extraction process. If not, DbsEditor can extract the key-frames from the video bit-stream and use them instead. The key-frames are the INTRA frames in the MPEG-4 or H.263 bit-stream. In most cases, a shot detection algorithm is used to select the INTRA frames during the encoding stage, but sometimes a forced-intra scheme might be present in order to prevent possible degradations. Image features, on the other hand, are simply extracted from the 24-bit RGB frame buffer obtained by decoding the image.

[Figure 6: FeX module interaction with MUVIS applications — DbsEditor and MBrowser call the Fex_Bind, Fex_Init, Fex_Extract, Fex_GetDistance and Fex_Exit functions exported by a Fex_*.DLL and declared in Fex_API.h.]

The rest of the implementation details of the FeX structure are similar to AFeX: each visual FeX module should be implemented as a Dynamic Link Library (DLL) with respect to the FeX API and stored in an appropriate directory. The FeX API provides the necessary handshaking and information flow between a MUVIS application and the feature extraction module. Figure 6 summarizes the API functions and the linkage between the MUVIS applications and an illustrative FeX module.

2.4. VIDEO SUMMARIZATION

Video summarization is basically achieved within the MUVIS framework by performing scene analysis over the key-frames, which are available either explicitly or implicitly along with the video stream. Video representation and summarization rely on the video structure, which can be hierarchically described in three segmentation layers: shots, scenes and the entire video. In the shot layer, one or more shot frames can be chosen as key-frames in order to represent the shot segments. For video summarization purposes, the key-frames might include quite a large amount of redundancy, such as repetitions of similar shots during the run-time of the video. Therefore, by eliminating such redundancy, scene frames should be extracted from the key-frames to achieve efficient video summarization. In order to extract scene frames from key-frames, we developed the following techniques, which are integrated into the MBrowser application: Scene Detection (SD) by Minimum Spanning Tree (MST) clustering and SD by Nearest Neighborhood Elimination (NNE). The video encoder (MPEG-4 or H.263+) segments the video sequence into shots, usually by a shot detection algorithm implemented along with the encoding process, and encodes the first frame of each shot interval, the key-frame, as an INTRA frame (I-frame). Using the available key-frames encoded in the video bit-stream, both techniques apply feature extraction directly to the key-frames in order to obtain an inter-similarity measure between them. Any feature extraction method (based on color, texture, motion, DCT coefficients, etc.) can be used for the proposed techniques. In this work, for any particular video, we used two color features based on HSV and YUV color histograms, and two texture features (Gabor [40] and GLCM [49]) available within the active MUVIS database. These features are used for the inter-similarity measure between the key-frames of the video stream for both techniques (MST and NNE). The work presented here is based on the author's publication [P6].

2.4.1. Scene Analysis by MST

MST is an efficient and widely used clustering technique [29]. The key-frames of a particular video represent the nodes of an MST.
In order to find the weight (similarity measure) of the branch between two nodes, we use the normalized Euclidean distance D between the associated feature vectors x and y:

$$D = \frac{\left\| x - y \right\|_2}{1 + \left\| x \right\|_2 + \left\| y \right\|_2} \qquad (1)$$

By using the distance between two vectors as the weight of the MST branches, the tree is formed. Then the tree branches are sorted from the largest branch weight to the smallest one. At the beginning, the whole tree is a single cluster. In order to create a new cluster, it is sufficient to break the (next) largest branch. Figure 7 illustrates a sample MST with three clusters.

[Figure 7: MST clustering illustration — nodes 0–10 connected by branches; most branch weights are small (1–3), while the two inter-cluster branches have weights 8 and 9; the mean branch weight is 3.]

Each time a branch is broken and thus a new cluster is created, the nodes inside the cluster can be assumed to show higher similarity to each other than to the nodes outside the cluster. Since those nodes represent the key-frames, in this way we group all the key-frames that are similar to each other. Once a sufficient number of clusters is achieved, it is straightforward to choose one or more key-frame(s) as the scene frame(s) representing the scene (cluster). Therefore, for an efficient scene detection operation, it is important to know where to stop clustering. In supervised methods, the user sets a threshold value for the branch weight comparison, which is then used to decide whether or not to break a branch. However, such values might not achieve the optimal number of scene frames that should be extracted. Another alternative is to perform the scene detection in a semi-automatic way, that is, the user directly sets the number of scene frames required for summarization and the corresponding threshold value is found accordingly. Yet again, the user may not be able to determine the optimal number of scene frames beforehand, but at least the video will be summarized by the specified number of scene frames at the end.

We developed an unsupervised (automatic) technique in which this threshold value is found adaptively. In the sample case shown in Figure 7, the optimal threshold value should be between 3 and 8, since there are two significantly long inter-cluster branches (8 and 9) and only short intra-cluster branches. So the aim is to break these two long branches and then stop clustering. For this, the threshold value should lie somewhere between the smallest inter-cluster (broken) branch and the largest intra-cluster (unbroken) branch. This might justify using the mean of all the branch weights (mean = 3) as the threshold value, since it satisfies this condition. There might, however, be circumstances in which this method leads to some redundant cluster creation and hence redundant scene frame detections. The probability of redundant scene detection increases especially when there are significantly more intra-cluster branches than inter-cluster branches. For example, in the sample MST in Figure 7, if there happened to be one or more intra-cluster branches with weights below 3, the mean value would drop below 3 and, as a result, there would be 4 clusters and node 5 would become a new but redundant scene frame. In order to avoid such a case, we propose a 2-step weighted root-mean-square algorithm to determine the threshold value as follows:

• Step 1: Calculate the mean (µ) and variance (σ) of the branch weights (wi).
• Step 2: Calculate the threshold value as the weighted root mean square of the branch weights:

$$\mathrm{Threshold} = \sqrt{\frac{\sum_{i}^{N}\left(k_i \cdot w_i\right)^2}{N'}}, \qquad N' = \sum_{i}^{N} k_i, \qquad k_i = \begin{cases} (\mathrm{int})\,\dfrac{w_i - \mu}{\sigma} & \text{if } w_i \ge \mu \\[4pt] 1 & \text{if } w_i < \mu \end{cases} \qquad (2)$$

where N is the number of branches. As clearly seen in this equation, large weight values affect the threshold value more significantly than the smaller ones. Furthermore, the weight values around the mean (within a window of the variance) have no effect at all on the threshold value. As a result, we can summarize the unsupervised (automatic) SD by MST algorithm as follows:

I. Extract the key-frames and their feature vectors.
II. Use the feature vectors as the nodes of the MST and a similarity distance measure (i.e. Euclidean) as the branch weights (distances between nodes).
III. Form the MST and sort the branches from the largest to the smallest weight.
IV. Find the mean and variance of the branch weights and the Threshold according to equation (2).
V. Break the next largest branch if it is larger than the Threshold value. If not, stop clustering and proceed to step VI.
VI. For each cluster, choose one or more appropriate key-frame(s) to represent the cluster (scene) and thus extract the scene frames.
VII. The total number of clusters gives the optimal number of scenes.

Finally, the scene frame(s) extracted from each cluster can be used for the summarization of the video sequence. A compact sketch of this procedure is given below.
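The following sketch illustrates steps I–VII under simplifying assumptions: the key-frame feature vectors are taken as given, the MST is built with a plain Prim scheme, equations (1) and (2) are implemented as reconstructed above, and the clusters are recovered with a union–find pass over the unbroken branches. It is an illustration, not the MUVIS implementation itself.

```cpp
// Sketch of unsupervised scene detection by MST clustering (steps I-VII).
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

struct Edge { int a, b; double w; };

// Eq. (1): normalized Euclidean distance between two feature vectors.
static double normDist(const std::vector<double>& x, const std::vector<double>& y)
{
    double dxy = 0.0, nx = 0.0, ny = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        dxy += (x[i] - y[i]) * (x[i] - y[i]);
        nx  += x[i] * x[i];
        ny  += y[i] * y[i];
    }
    return std::sqrt(dxy) / (1.0 + std::sqrt(nx) + std::sqrt(ny));
}

// Eq. (2): weighted root mean square of the branch weights (sigma taken as in Step 1).
static double breakThreshold(const std::vector<Edge>& mst)
{
    if (mst.empty()) return 0.0;
    double mu = 0.0, sigma = 0.0;
    for (const Edge& e : mst) mu += e.w;
    mu /= mst.size();
    for (const Edge& e : mst) sigma += (e.w - mu) * (e.w - mu);
    sigma /= mst.size();
    double num = 0.0, nPrime = 0.0;
    for (const Edge& e : mst) {
        double k = (e.w >= mu) ? std::floor((e.w - mu) / sigma) : 1.0;
        num    += (k * e.w) * (k * e.w);
        nPrime += k;
    }
    return (nPrime > 0.0) ? std::sqrt(num / nPrime) : mu;
}

static int findRoot(std::vector<int>& p, int i) { while (p[i] != i) i = p[i] = p[p[i]]; return i; }

// Returns one cluster label per key-frame; one scene frame is then picked per cluster.
std::vector<int> sceneClustersByMST(const std::vector<std::vector<double> >& keyFrameFeatures)
{
    const int n = static_cast<int>(keyFrameFeatures.size());
    // Prim's algorithm on the complete graph of key-frames.
    std::vector<int> inTree(n, 0), parent(n, 0);
    std::vector<double> best(n, 1e18);
    std::vector<Edge> mst;
    best[0] = 0.0;
    for (int iter = 0; iter < n; ++iter) {
        int u = -1;
        for (int v = 0; v < n; ++v)
            if (!inTree[v] && (u < 0 || best[v] < best[u])) u = v;
        inTree[u] = 1;
        if (u != 0) { Edge e = { parent[u], u, best[u] }; mst.push_back(e); }
        for (int v = 0; v < n; ++v) {
            if (!inTree[v]) {
                double w = normDist(keyFrameFeatures[u], keyFrameFeatures[v]);
                if (w < best[v]) { best[v] = w; parent[v] = u; }
            }
        }
    }
    // Break every branch larger than the adaptive threshold; the unbroken
    // branches define the clusters (recovered here with union-find).
    const double thr = breakThreshold(mst);
    std::vector<int> comp(n);
    std::iota(comp.begin(), comp.end(), 0);
    for (const Edge& e : mst)
        if (e.w <= thr) comp[findRoot(comp, e.a)] = findRoot(comp, e.b);
    std::vector<int> label(n);
    for (int v = 0; v < n; ++v) label[v] = findRoot(comp, v);
    return label;
}
```

In the experiments of Section 2.4.3, the key-frame with the minimum time-line index within each cluster is then taken as that cluster's scene frame.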
2.4.2. Scene Analysis by NNE

In this technique, the idea is to keep only one key-frame as a scene frame and to eliminate all the other similar ones. This is achieved without pre-clustering or any kind of tree forming. Initially, the first key-frame of the video sequence is chosen as the first scene frame. Using the same normalized similarity measure as in the previous section, the key-frames similar to the current scene frame are eliminated from the list of key-frames. In supervised mode, a fixed threshold value is used to decide which key-frames are to be eliminated. Once all the similar key-frames are eliminated, the next remaining key-frame in the time-line is chosen as the second scene frame, since it was not eliminated and is therefore a new scene frame. Then all the existing key-frames are compared with this new scene frame and the similar ones are again eliminated. The algorithm proceeds up to the elimination of the last key-frame, always promoting the next surviving key-frame to a scene frame. Similar to SD by MST, this technique can be implemented in three modes: unsupervised (automatic), semi-automatic and supervised. As mentioned in the previous subsection, it is straightforward to perform the supervised and semi-automatic modes for this technique. In order to achieve unsupervised (automatic) SD by NNE, the optimal threshold value used for key-frame elimination should be found by an adaptive mechanism. Similar to the MST method, this value should lead the algorithm to the optimal number of scenes representing (summarizing) the video clip. Since the similarity measure between two key-frames is the normalized Euclidean distance, this threshold value lies somewhere between 0 and 1.0, and starting from 0, increasing threshold values yield decreasing numbers of scene frames.

For the supervised (i.e. fixed threshold) mode, the NNE algorithm can be summarized as follows:

I. Choose the first/next key-frame as the first/next scene frame.
II. Compare the normalized similarity measure between the current scene frame and the next key-frame with the threshold value: if it is less than the threshold, eliminate that key-frame; otherwise keep it.
III. Once all the key-frames in the list have been compared with the current scene frame, proceed to the next surviving key-frame. If such a key-frame exists, go to step I; otherwise, terminate the iterations.

[Figure 8: Number of scene frames versus threshold sketch for the clip of Figure 9.]

For the semi-automatic mode, the number of scene frames (N_S) is given and the appropriate threshold value can be found adaptively as follows:

I. Set the two boundary threshold values thr_low^0 = 0 and thr_high^0 = 1, and let i = 0, where i denotes the iteration number.
II. Let threshold^i = (thr_low^i + thr_high^i) / 2.
III. Run SD by NNE and extract the scene frames. Let the number of scene frames be n_i.
IV. If n_i = N_S, stop and terminate the iteration.
V. If n_i > N_S, set thr_low^(i+1) = threshold^i; otherwise set thr_high^(i+1) = threshold^i.
VI. Increment the iteration number (i → i+1) and go to step II.

Finally, in unsupervised (automatic) mode, the algorithm adaptively determines the optimal number of scene frames. In the ideal case, the key-frames belonging to the same scene (cluster) should be semantically similar to each other, so they are expected to give closer distance values in the feature domain than key-frames from other scenes (clusters). Therefore, the automatic mode runs the SD by NNE algorithm using several (i.e. 40) threshold values that are logarithmically distributed between 0 and 1.0. This yields a plot of the number of scene frames versus the threshold, an example of which is shown in Figure 8. Starting from 0, along with the increasing threshold values the elimination of the close (intra-scene) key-frames is soon completed, and one can then expect a region in which no elimination is possible until the inter-scene frame eliminations occur. In order to have a robust algorithm, we try to detect the lowest-gradient section with the largest width, as shown in Figure 8. Once this region is found, its scene frame number yields the optimal number of scenes with which the video clip can be represented (summarized), and the scene frames are extracted accordingly. A sketch of the supervised elimination pass and the semi-automatic threshold search is given below.
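The sketch below implements the supervised elimination pass (steps I–III) literally on the key-frame feature vectors and wraps it in the semi-automatic bisection of the threshold; dist() is the normalized Euclidean distance of Eq. (1), and the iteration cap is an added safeguard rather than part of the description above.

```cpp
// Sketch of supervised SD by NNE and the semi-automatic threshold bisection.
#include <cmath>
#include <cstddef>
#include <vector>

// Eq. (1): normalized Euclidean distance between two feature vectors.
static double dist(const std::vector<double>& x, const std::vector<double>& y)
{
    double dxy = 0.0, nx = 0.0, ny = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        dxy += (x[i] - y[i]) * (x[i] - y[i]); nx += x[i] * x[i]; ny += y[i] * y[i];
    }
    return std::sqrt(dxy) / (1.0 + std::sqrt(nx) + std::sqrt(ny));
}

// Supervised SD by NNE for a fixed threshold: returns the number of scene frames.
static int runNNE(const std::vector<std::vector<double> >& kf, double threshold)
{
    const int n = static_cast<int>(kf.size());
    std::vector<int> alive(n, 1);
    int scenes = 0;
    for (int s = 0; s < n; ++s) {
        if (!alive[s]) continue;                        // step I: next surviving key-frame
        ++scenes;
        for (int j = s + 1; j < n; ++j)                 // step II: eliminate similar key-frames
            if (alive[j] && dist(kf[s], kf[j]) < threshold) alive[j] = 0;
    }                                                   // step III: proceed to the next survivor
    return scenes;
}

// Semi-automatic mode: bisection on the threshold until N_S scene frames are obtained.
double findThresholdForScenes(const std::vector<std::vector<double> >& kf, int targetScenes)
{
    double thrLow = 0.0, thrHigh = 1.0, threshold = 0.5;
    for (int i = 0; i < 32; ++i) {                      // iteration cap added as a safeguard
        threshold = 0.5 * (thrLow + thrHigh);           // step II
        int n = runNNE(kf, threshold);                  // step III
        if (n == targetScenes) break;                   // step IV
        if (n > targetScenes) thrLow = threshold;       // step V: too many scenes -> raise threshold
        else                  thrHigh = threshold;
    }
    return threshold;
}
```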
2.4.3. Video Summarization Experiments

In the experiments, we use color and texture features such as HSV and YUV color histogram bins and GLCM-based texture features [49]. As mentioned in the previous section, the similarity measure is the normalized Euclidean distance. The first experiment is unsupervised (automatic) scene frame detection. For the SD by MST method, in order to have results comparable with the SD by NNE method, only the key-frame with the minimum time-line index in each cluster is chosen as the scene frame for that cluster. First, a monotonous dialog between two people in a 5-minute clip is deliberately chosen in order to have a clear idea of the exact number of scenes available; Figure 9 shows the results of the automatic scene-detection experiment using the two proposed methods. Another similar example, shown in Figure 10, is used for the semi-automatic scene detection experiment with the number of scene frames set to 3. The top sections of these figures show the key-frames extracted by the encoder. The bottom two rows display the results of the two methods (upper row: SD by NNE, lower row: SD by MST).

[Figure 9: Key-frames (top) and unsupervised scene frames by NNE (middle row) and MST (bottom row).]

[Figure 10: Key-frames (top) and semi-automatic 3 scene frames by NNE (middle row) and MST (bottom row).]

A ground-truth experimental set-up is used to test the performance of the proposed automatic scene detection algorithms over 10 video clips, which are captured in QCIF size at 25 fps and encoded with both H.263+ and MPEG-4 (simple profile) at several bit-rates (i.e. >128 Kb/s). Their content varies among sports programs, talk shows, commercials and news, and their duration varies approximately between 2 and 7 minutes. The YUV-histogram-based shot-detection methods in the encoders extract the key-frames, including some additional forced-intra key-frames (period of 1000). A group of 8 people evaluated the scene frames detected by both methods in automatic mode, and the results are shown in Table II.

Table II: The ground-truth table for the automatic scene detection algorithms over 10 sample video sequences.

Seq. (# of KFs)  | Scenes detected (MST / NNE) | False detections (MST / NNE) | Missed (MST / NNE) | True # of scenes
SEQ1  (23 KFs)   |  4 / 4                      |  0 / 0                       |  0 / 0             |  4
SEQ2  (21 KFs)   |  9 / 7                      |  4 / 2                       |  0 / 0             |  5
SEQ3  (30 KFs)   |  5 / 4                      |  1 / 1                       |  0 / 1             |  4
SEQ4  (27 KFs)   |  4 / 5                      |  0 / 0                       |  2 / 1             |  6
SEQ5  (31 KFs)   | 13 / 10                     |  4 / 2                       |  0 / 1             |  9
SEQ6  (63 KFs)   | 12 / 9                      |  3 / 2                       |  3 / 5             | 12
SEQ7  (38 KFs)   |  5 / 7                      |  1 / 3                       |  1 / 1             |  5
SEQ8  (39 KFs)   | 12 / 9                      |  4 / 2                       |  2 / 3             | 10
SEQ9  (17 KFs)   | 15 / 8                      |  2 / 0                       |  0 / 5             | 13
SEQ10 (26 KFs)   |  9 / 8                      |  0 / 0                       |  8 / 9             | 17

Both methods achieve a high accuracy for the talk show and news sequences, where there is a clean distinction between scene frames. On the other hand, the accuracy is significantly degraded in the examples where such a "clean" scene distinction is not present, such as sports programs and commercials. In fact, for such examples the evaluation group also found it difficult to reach a consensus on the number of scenes. Several experiments using both methods in semi-automatic mode show that they can extract the required number of scene frames in almost all cases with high accuracy. The semi-automatic mode is much more robust to variations in scene content and takes significantly less time than the automatic mode. Therefore, when the content is fuzzy with respect to scene extraction, the semi-automatic mode should be used; otherwise the automatic mode can be used, in both methods.

2.4.4. Scalable Video Management

The term "video management" involves the efficient consumption of digital video clips, possibly in a large multimedia database. Scalability is the primary property needed for efficient management of video information, since any video clip can have indefinite duration, possibly bearing multiple and mixed content. In this context, in addition to the summarization scheme presented earlier, a hierarchical representation is further required for the retrieval of a region of interest (ROI) of the video. Thus, the user can access and also query a ROI of a video clip.

2.4.4.1 ROI Access and Query

A ROI can be a single (key-)frame or a group of shots represented by their key-frames. The built-in key-frame summarization and scene analysis were presented in the previous subsections. This is a static representation (summarization) of a video clip, that is, the key-frames and the extracted scene frames are shown to the user. In the ROI approach, the user can further define a region over the key-frames using MBrowser, and MBrowser can then retrieve it by rendering this ROI section of the video clip alone (possibly together with the synchronized audio) via its random access capability.
Figure 11 shows a sample ROI selection from the key-frame summary of a video clip of roughly two minutes, containing around 90 seconds of commercials and 30 seconds of a Formula-1 race at the end. Note that only the last 11 of 32 key-frames (in the second page of the Key-Frames Display Dialog) are shown.

[Figure 11: ROI selection and rendering from the key-frames of a video clip — the dialog provides "ROI From" and "ROI To" markers and a "Play Video for ROI" command.]

The example given in Figure 11 is only one way of defining a ROI, namely through the key-frame summarization of the video clip. MBrowser also provides various other ways of setting a ROI directly from the main GUI window, such as using an enhanced slider bar with two pointers (i.e. "from" and "to") or directly browsing and setting the appropriate ROI boundary key-frames (or frames) using the (key-)frame browser buttons. These GUI primitives (buttons and the slider bar) and the ROI options in the "Video Query Options Dialog" can be seen in Figure 12. Once the ROI is set, the next step can be the ROI (visual) query. In other words, the user may want to retrieve all video clips in the database containing key-frames similar to those in the ROI of the query video. Figure 12 shows one ROI query example where the ROI is set as shown in Figure 11. The retrieval results on the right side of the MBrowser GUI window clearly indicate that only the specified content within the ROI (i.e. "F-1 race with BRIDGESTONE flag") is retrieved. In this way, "content purification" can be achieved from a video clip bearing multiple contents and possibly having a long duration, so that only the "content of interest" is retrieved.

[Figure 12: ROI (visual) query of the example in Figure 11 — the MBrowser GUI shows the frame browser buttons (previous/next frame, previous/next key-frame), the ROI markers and the slider bar with two pointers.]

2.4.4.2 Visual Query of Video

As mentioned earlier, the visual features of the key-frames of the video clips are used for performing the visual query operation. The key-frames are the first frames within the shots and are detected by shot detectors, which are usually embedded into the video encoders. Once all video clips are collected within a MUVIS database, feature extraction is performed for each key-frame of every video clip in the database by the appropriate FeX modules. Any video clip can then be queried within the database by calculating the (dis-)similarity distance between the query and the database clips using the features of their key-frames, and performing a ranking operation afterwards. The main question is how to calculate a similarity distance between the query and any clip, since their durations or numbers of key-frames might vary significantly, and the content they bear is not necessarily homogeneous but possibly mixed and incoherent. For instance, if the query clip has a duration of only one minute, how feasible is it to compare it with a one-hour clip containing a variety of visual content? The basic approach in MUVIS is to find the best matching key-frame for each key-frame of the query clip. Therefore, this turns out to be equivalent to searching for the entire query clip within each database clip to find their best matching occurrences. This amounts, for example, to searching for a one-minute query clip inside a one-hour clip; if a content-wise match of the one-minute segment (or of some group of key-frames) can be found within the one-hour clip, then both clips can be claimed to be "similar".
Note that we are looking neither for a matching criterion on the (temporal) order of the query key-frames, nor for their continuation (vicinity) within the one-hour clip, since both of these can vary and change while still yielding a similar content.

In order to formulate the aforementioned similarity distance calculation analytically, let NoS be the number of feature sets existing in a database and let NoF(s) be the number of sub-features per feature set, where 0 ≤ s < NoS. Let the similarity distance function be SD(x(s, f), y(s, f)), where x and y are the associated feature vectors of the feature index s and the sub-feature index f. Let i be the index of the key-frames in the query clip Q and QFV_i(s, f) be its sub-feature vector for the feature index s and the sub-feature index f. Similarly, let j be the index of the key-frames of a database clip C and CFV_j(s, f) be its sub-feature vector. For all key-frames in C, the particular key-frame giving the minimum distance to the key-frame i of the queried clip is found (D_i(s, f)) and used for the calculation of the total sub-feature similarity distance D(s, f) between the two clips. Here ROI(Q) represents the group of key-frames in the ROI of the query clip Q; by default it is the entire clip unless set manually by the user. Since the visual feature vectors are unit normalized, the total query similarity distance QD_C between the clip Q and a clip C in the database can be calculated by a weighted linear combination; the weights W(s, f) per sub-feature f of a feature set s can be set by the user to find optimum merging settings for all the visual features present in the database. The following equations formalize the calculation of QD_C:

$$D_i(s,f) = \min_{j \in C}\left\{ SD\!\left[\, QFV_i(s,f),\, CFV_j(s,f) \,\right] \right\}, \qquad D(s,f) = \sum_{i \in ROI(Q)} D_i(s,f)$$

$$QD_C = \frac{\displaystyle\sum_{s}^{NoS} \sum_{f}^{NoF(s)} W(s,f) \cdot D(s,f)}{\displaystyle\sum_{s}^{NoS} \sum_{f}^{NoF(s)} W(s,f)} \qquad (3)$$

The weight normalization in the calculation of QD_C is needed for two reasons: first, to preserve the unit normalization in the final similarity distance, and second, to negate the effect of a biased similarity distance calculation for clips missing one or more features. The lack of features can occur as a consequence of the abrupt stopping of an ongoing feature extraction operation (e.g. the user may stop it manually for some reason) and can be corrected at any time by performing a "Consistency Check" operation on that database. Obviously, the lack of some features for a clip would yield a smaller QD_C value unless such weight normalization is applied. A sketch of this key-frame matching scheme is given below.
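The sketch below illustrates the computation of D(s, f) in Eq. (3) for a single sub-feature pair, under the simplifying assumptions that SD is the plain Euclidean distance and that the key-frame feature vectors of both clips are already extracted; the full QD_C is then obtained by the W(s, f)-weighted, normalized sum of these values over all features and sub-features.

```cpp
// Sketch of the clip-to-clip sub-feature distance D(s,f) of Eq. (3):
// every key-frame of the query ROI is matched against its best (minimum
// distance) key-frame in the database clip, and the minima are accumulated.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

static double SD(const std::vector<double>& x, const std::vector<double>& y)
{
    double d2 = 0.0;
    for (std::size_t k = 0; k < x.size(); ++k) d2 += (x[k] - y[k]) * (x[k] - y[k]);
    return std::sqrt(d2);
}

// QFV: key-frame feature vectors inside ROI(Q); CFV: key-frame vectors of clip C.
double subFeatureDistance(const std::vector<std::vector<double> >& QFV,
                          const std::vector<std::vector<double> >& CFV)
{
    double D = 0.0;
    for (std::size_t i = 0; i < QFV.size(); ++i) {
        double Di = std::numeric_limits<double>::max();
        for (std::size_t j = 0; j < CFV.size(); ++j)      // best match in the database clip
            Di = std::min(Di, SD(QFV[i], CFV[j]));
        D += Di;                                          // accumulate over ROI(Q)
    }
    return D;   // D(s,f); QD_C is the weighted, normalized sum of these values
}
```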
Chapter 3
Unsupervised Audio Classification and Segmentation

Audio information often plays an essential role in understanding the semantic content of digital media and, in certain cases, audio might even be the only source of information, e.g. in audio-only clips. Hence, audio information has recently been used for content-based multimedia indexing and retrieval. Audio may also provide significant advantages over its visual counterpart, especially if the content can be extracted in accordance with the human auditory perceptual system. This, on the other hand, requires an efficient and generic audio (content) analysis that yields a robust and semantic classification and segmentation. Therefore, in this chapter we focus our attention on generic and automatic audio classification and segmentation for audio-based multimedia indexing and retrieval applications.

The next section presents an overview of the audio classification and segmentation studies in the literature and an analysis of their performance, especially their major limitations and drawbacks. In this way we can then introduce and justify the philosophy behind the generic approach of the proposed audio classification and segmentation scheme. Having defined the objectives and the basic key points, the remaining sections detail the proposed method. The chapter is organized as follows. Section 3.2 is devoted to the description of the common spectral template formation, depending on the mode. Then the hierarchic approach adopted for the overall feature extraction scheme and the perceptual modeling in the feature domain are introduced in Section 3.3. Section 3.4 describes the proposed framework with its hierarchical steps. Finally, the experimental results, the evaluation of the proposed algorithm and the concluding remarks are reported in Section 3.5.

3.1. AUDIO CLASSIFICATION AND SEGMENTATION – AN OVERVIEW

During recent years, there have been many studies on automatic audio classification and segmentation using several techniques and features. Traditionally, the most common approach is speech/music classification, in which the highest accuracy has been achieved, especially when the segmentation information is known beforehand (i.e. manual segmentation). Saunders [58] developed a real-time speech/music classifier for audio in radio FM receivers based on features such as the zero crossing rate (ZCR) and short-time energy. A 2.4 s window size was used and the primary goal of low computational complexity was achieved. Zhang and Kuo [77] developed a content-based audio retrieval system, which performs audio classification into basic types such as speech, music and noise. In their later work [78], using a heuristic approach and pitch tracking techniques, they introduced more audio classes such as songs and mixed speech over music. Scheirer and Slaney [59] proposed a different approach for speech/music discriminator systems, particularly for ASR. El-Maleh et al. [27] presented a narrowband (i.e. audio sampled at 8 kHz) speech/music discriminator system using a new set of features. They achieved a low frame delay of 20 ms, which makes it suitable for real-time applications. A more sophisticated approach has been proposed by Srinivasan et al. [65]. They tried to categorize audio into mixed class types such as music with speech, speech with background noise, etc. They reported over 80% classification accuracy. Lu et al. [39] presented an audio classification and segmentation algorithm for video structure parsing using a one-second window to discriminate speech, music, environmental sound and silence. They proposed new features such as band periodicity to enhance the classification accuracy.

Although audio classification has mostly been realized in the uncompressed domain, with the emergence of MPEG audio content several methods have been reported for audio classification on MPEG-1 (Layer 2) encoded audio bit-streams [30], [46], [52], [68]. The last years have shown a widespread usage of MPEG Layer 3 (MP3) audio [10], [23], [25], [51], as well as a proliferation of video content carrying MP3 audio. The ongoing research on perceptual audio coding has yielded a more efficient successor called (MPEG-2/4) Advanced Audio Coding (AAC) [10], [24]. AAC has various similarities with its predecessor but promises a significant improvement in coding efficiency.
In a previous work [P8], we introduced an automatic segmentation and classification method over MP3 (MPEG-1, 2, 2.5 Layer-3) and AAC bit-streams. In that work, using a generic MDCT template formed from both MP3 and AAC bit-streams, an unsupervised classification over globally extracted segments is achieved using a hierarchical structure over the common MDCT template.

Audio content extraction via classification and segmentation enables the design of efficient indexing schemes for large-scale multimedia databases. There might, however, be several shortcomings of the simple speech/music classifiers addressed so far in terms of extracting real semantic content, especially for multimedia clips that present various content variations. For instance, most speech/music discriminators work on digital audio signals in the uncompressed domain, with a fixed set of capturing parameters. Obviously, large multimedia databases may contain digital audio in different formats (compressed/uncompressed), encoding schemes (MPEG Layer-2, MP3, AAC, ADPCM, etc.), capturing and encoding parameters (i.e. sampling frequency, bits per sample, sound volume level, bit-rate, etc.) and durations. Therefore, the underlying audio content extraction scheme should be robust (invariant) to such variations, since the content is independent of the underlying parameters that the digital multimedia world presents. For example, the same speech content may be represented by an audio signal sampled at 8 kHz or 32 kHz, in stereo or mono, compressed by AAC or stored in (uncompressed) PCM format, lasting 15 seconds or 10 minutes, etc. A comparative study of several statistical, HMM, GMM and neural network based training models was carried out by Bugatti et al. [12]. Although such approaches may achieve better accuracy for a limited set of collections, they are usually restricted to a focused domain and hence do not provide a generic solution for the massive content variations that a large multimedia database may contain.

Another important drawback of many existing systems is the lack of global segmentation. Since classification and segmentation are closely related and dependent problems, an integrated and well-tuned classification and segmentation approach is required. Due to technical difficulties or low-delay requirements, some systems tend to rely on manual segmentation (or simply work over a short audio file having a single audio content type). The other existing systems use several segment features that are estimated over audio segments with a fixed duration (0.5–5 seconds) to accomplish a classification per segment. Although fixing the segment size brings many practical advantages in the implementation and thus improves the classification accuracy, its performance may suffer either from the possibly high resolution required by the content or from the lack of sufficient statistics needed to estimate the segment features due to the limited time span of the segment. An efficient and more natural solution is to extract global segments within which the content is kept stationary, so that the classification method can achieve an optimum performance within the segment.

Almost all of the systems addressed so far lack a bi-modal structure.
That is, they are designed either in a bit-stream mode, where the bit-stream information is used directly (without decoding) for classification and segmentation, or in a generic mode, where the temporal and spectral information is extracted from the PCM samples and the analysis is performed afterwards. Usually, the former is applied for improved computational speed and the latter for higher accuracy. A generic bi-modal structure, which supports both modes (possibly to some extent), is obviously needed in order to provide feasible solutions for the audio-based indexing of large multimedia databases. Such a framework can, for instance, work in bit-stream mode whenever enhanced speed is required, especially for long clips for which the generic mode is not a feasible option under the given hardware or network conditions; or it can work in the generic mode whenever feasible or required.

Due to content variations, most of the existing works addressing just the "speech" and "music" categories may not be satisfactory for the purpose of an efficient audio indexing scheme. The main reason for this is the presence of mixed audio types, such as speech with music, speech with noise, music with noise, environmental noise, etc. There might even be difficult cases where a pure class type ends up with an erroneous classification due to several factors. For the sake of the overall audio indexing performance, either new class types for such potential audio categories should be introduced, or such "mixed" or "erroneous" cases should be collected under a certain class category (e.g. fuzzy) so that special treatment can be applied while indexing such audio segments. Since high accuracy is an important and basic requirement for the audio analysis systems used for indexing and retrieval, introducing more class types might cause degradations in performance and is hence not considered a feasible option most of the time for such generic solutions.

In order to overcome the aforementioned problems and shortcomings, in this chapter we present a generic audio classification and segmentation framework especially suitable for audio-based multimedia indexing and retrieval systems. The proposed approach has been integrated into the MUVIS system [P4], [P6], [43]. The proposed method is automatic and uses no information from the video signal. It also provides a robust (invariant) solution for digital audio files with various capturing/encoding parameters and modes, such as sampling frequency (i.e. 8 kHz up to 48 kHz), channel mode (i.e. mono, stereo, etc.), compression bit-rate (i.e. 8 kbps up to 448 kbps), sound volume level, file duration, etc. In order to increase accuracy, a fuzzy approach has been integrated within the framework. The main process is self-learning; it logically builds on the extracted information throughout its execution to produce a reliable final result. The proposed method proceeds through logical hierarchic steps and iterations, based on certain perceptual rules that are applied on the basis of the perceptual evaluation of the classification features and the behavior of the process. Therefore, the overall design structure is made suitable for human aural perception and, to this end, the proposed framework works on perceptual rules whenever possible. The objective is to achieve a classification scheme that ensures a decision-making approach consistent with human content perception.
The proposed method has a bi-modal structure, which supports both a bit-stream mode for MP3 and AAC audio and a generic mode for any audio type and format. In both modes, once a common spectral template is formed from the input audio source, the same analytical procedure is performed afterwards. The spectral template is obtained from the MDCT coefficients of MP3 granules or AAC frames in bit-stream mode and is hence called the MDCT template. The power spectrum obtained from the FFT of the PCM samples within temporal frames forms the spectral template for the generic mode. In order to improve the performance and, most important of all, the overall accuracy, the classification scheme produces only 4 class types per audio segment: speech, music, fuzzy or silent. Speech, music and silent are the pure (unique) class types. The class type of a segment is defined as fuzzy if it is either not classifiable as a pure class, due to some potential uncertainties or anomalies in the audio source, or if it exhibits features from more than one pure class. For audio-based indexing and retrieval in the MUVIS system, a pure class content is only searched throughout the associated segments of the audio items in the database having the same (matching) pure class type, such as speech or music. All silent segments, and all silent frames within non-silent segments, are discarded from the audio indexing. As mentioned earlier, special care is taken for fuzzy content: during the retrieval phase, fuzzy content is compared with all relevant content types of the database (i.e. speech, music and fuzzy), since it might, by definition, contain a mixture of pure class types, background noise, aural effects, etc. Therefore, with the proposed method, any erroneous classification of pure classes is intended to be detected as fuzzy, so as to avoid significant retrieval errors (mismatches) due to such potential misclassification. In this context, three prioritized error types of classification, illustrated in Figure 13, are defined:

• Critical errors: These errors occur when one pure class is misclassified as another pure class. Such errors significantly degrade the overall performance of an indexing and retrieval scheme.
• Semi-critical errors: These errors occur when a fuzzy class is misclassified as one of the pure class types. These errors moderately affect the retrieval performance.
• Non-critical errors: These errors occur when a pure class is misclassified as a fuzzy class. The effect of such errors on the overall indexing/retrieval scheme is negligible.

[Figure 13: Different error types in classification — critical (speech ↔ music), semi-critical (fuzzy → speech or music) and non-critical (speech or music → fuzzy).]

3.2. SPECTRAL TEMPLATE FORMATION

In this section, we focus on the formation of the generic spectral template, which is the initial and prerequisite step in providing a bi-modal solution. As shown in Figure 14, the spectral template is formed either from the MP3/AAC encoded bit-stream in bit-stream mode or from the power spectrum of the PCM samples in the generic mode. Basically, this template provides spectral domain coefficients, SPEQ(w, f), (MDCT coefficients in bit-stream mode or the power spectrum in generic mode) together with the corresponding frequency values FL(f) for each granule/frame. By using the FL(f) entries, all spectral features and any corresponding threshold values can be fixed independently of the sampling frequency of the audio signal. Once the common spectral template is formed, the granule features can be extracted accordingly and thus the primary framework can be built on a common basis, independent of the underlying audio format and the mode used.
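As a small illustration of why the FL(f) entries make the features sampling-rate independent, the sketch below builds the frequency axis for both modes. The bin-to-frequency mappings used here (FFT bin spacing fs/N, MDCT line spacing fs/(2·576)) are standard approximations stated as assumptions, not the exact formulas of the thesis.

```cpp
// Sketch of a common spectral template: coefficients SPEQ(w, f) plus the
// frequency value FL(f) of every spectral line, so that any threshold can be
// expressed in Hz and reused for all sampling rates and both modes.
#include <vector>

struct SpectralTemplate {
    std::vector<std::vector<double> > SPEQ;  // one row per granule/frame
    std::vector<double> FL;                  // frequency (Hz) of each spectral line f
};

// Generic mode: FL(f) from an N-point FFT at sampling rate fs (bin spacing fs/N).
std::vector<double> frequencyLinesFFT(int N, double fs)
{
    std::vector<double> FL(N / 2);
    for (int f = 0; f < N / 2; ++f) FL[f] = f * fs / N;
    return FL;
}

// Bit-stream mode: FL(f) for the 576 MDCT lines of an MP3 granule
// (approximate line spacing fs / (2 * 576)).
std::vector<double> frequencyLinesMDCT(double fs)
{
    std::vector<double> FL(576);
    for (int f = 0; f < 576; ++f) FL[f] = (f + 0.5) * fs / (2.0 * 576.0);
    return FL;
}
```

A spectral feature defined over a fixed frequency band in Hz can then be computed by summing SPEQ(w, f) over the lines whose FL(f) falls inside that band, regardless of the clip's sampling rate or of the mode in which the template was produced.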
Once the common spectral template is formed the granule features can be extracted accordingly and thus, the primary framework can be built on a common basis, independent from the underlying audio format and the mode used. Spectral Template Bit-Stream Mode Segment Features Decision Segmentation Granule Ext. Audio Stream Segment Classification MDCT Generic Mode Granule Features Music Fuzzy Speech Power Spectrum Figure 14: Classification and Segmentation Framework Silence 36 3.2.1. Forming the MDCT Template from MP3/AAC Bit-Stream 3.2.1.1 MP3 and AAC Overview MPEG audio is a group of coding standards that specify a high performance perceptual coding scheme to compress audio signals into several bit-rates. Coding is performed in several steps and some of them are common for all three layers. There is a perceptual encoding scheme that is used for time/frequency domain masking by a certain threshold value computed using psychoacoustics rules. The spectral components are all quantized and a quantization noise is therefore introduced. MP3 is the most complex MPEG layer. It is optimized to provide the highest audio quality at low bit-rates. Layer 3 encoding process starts by dividing the audio signal into frames, each of which corresponds to one or two granules. The granule number within a single frame is determined by the MPEG phase. Each granule consists of 576 PCM samples. Then a polyphase filter bank (also used in Layer 1 and Layer 2) divides each granule into 32 equal-width frequency subbands, each of which carries 18 (subsampled) samples. The main difference between MPEG Layer 3 and the other layers is that an additional MDCT transform is performed over the subband samples to enhance spectral resolution. A short windowing may be applied to increase the temporal resolution in such a way that 18 PCM samples in a subband is divided into three short windows with 6 samples. Then MDCT is performed over each (short) window individually and the final 18 MDCT coefficients are obtained as a result of three groups of 6 coefficients. There are three windowing modes in MPEG Layer 3 encoding scheme: Long Windowing Mode, Short Windowing Mode and Mixed Windowing Mode. In Long Windowing Mode, MDCT is applied directly to the 18 samples in each of the 32 subbands. In Short Windowing Mode, all of 32 subbands are short windowed as mentioned above. In Mixed Windowing Mode, the first two lower subbands are long windowed and the remaining 30 higher subbands are short windowed. Once MDCT is applied to each subband of a granule according to the windowing mode, the scaled and quantized MDCT coefficients are then Huffman coded and thus the MP3 bit-stream is formed. There are three MPEG phases concerning MP3: MPEG-1, MPEG-2 and MPEG 2.5. MPEG-1 Layer 3 supports sampling rates of 32, 44.1 and 48 kHz and bit-rates from 32 to 448 kbps. It performs encoding on both mono and stereo audio, but not multi-channel surround sound. One MPEG-1 Layer 3 frame consists of two granules (1152 PCM samples). During encoding, different windowing modes can be applied to each granule. MPEG-2 Layer 3 is a backwards compatible extension to MPEG-1 with up to five channels, plus one low frequency enhancement channel. Furthermore, it provides support for lower sampling rates such as 16, 22.05 and 24 kHz for bit-rates as low as 8 kbps up to 320 kbps. One MPEG-2 Layer 3 frame consists of one granule (576 PCM samples). MPEG 2.5 is an unofficial MPEG audio extension, which was created by Fraunhofer Institute to improve performance at lower bit-rates. 
At lower bit-rates, this extension allows sampling rates of 8, 11.025 and 12 kHz. AAC and MP3 have largely similar structures. Nevertheless, compatibility with the other MPEG audio layers has been removed and AAC has no granule structure within its frames, whereas MP3 may contain one or two granules per frame depending on the MPEG phase, as mentioned before. AAC supports a wider range of sampling rates (from 8 kHz to 96 kHz) and up to 48 audio channels. Furthermore, it works at bit-rates from 8 kbps for mono speech up to, and in excess of, 320 kbps. A direct MDCT transformation is performed over the samples without dividing the audio signal into 32 subbands as in MP3 encoding. Moreover, the same tools (psychoacoustic filters, scale factors and Huffman coding) are applied to reduce the number of bits used for encoding. Similar to the MP3 coding scheme, two windowing modes are applied before the MDCT is performed in order to achieve a better time/frequency resolution: Long Windowing Mode or Short Windowing Mode. In Long Windowing Mode the MDCT is directly applied over 1024 PCM samples. In Short Windowing Mode, an AAC frame is first divided into 8 short windows, each of which contains 128 PCM samples, and the MDCT is applied to each short window individually. Therefore, in Short Windowing Mode there are 128 frequency lines and hence the spectral resolution is decreased by a factor of 8 whilst the temporal resolution is increased by 8. AAC also introduces a new technique, so-called "Temporal Noise Shaping", which improves the speech quality especially at low bit-rates. More detailed information about MP3 and AAC can be found in [10]. The structural similarity between MP3 and AAC in the MDCT domain makes it feasible to develop generic algorithms that cover both formats. The proposed algorithm in this chapter uses this similarity as an advantage to form a common spectral template based on MDCT coefficients. This template allows us to achieve a common classification and segmentation technique that uses the compressed domain audio features, as explained in the next subsection.

3.2.1.2 MDCT Template Formation

The bit-stream mode uses the compressed domain audio features in order to perform classification and segmentation directly from the compressed bit-stream. Audio features are extracted using the common MDCT sub-band template. Hence the MDCT template is nothing but a variable size MDCT double array, MDCT(w, f), along with a variable size frequency line array FL(f), which represents the real frequency value of each row entry in the MDCT array. The index w represents the window number and the index f represents the line frequency index. Table III gives the array dimensions NoW and NoF with respect to the associated window modes of MP3 and AAC.

Table III: The MDCT template array dimensions with respect to Compression Type and Windowing Mode.

  Compression Type & Windowing Mode     NoW    NoF
  MP3 Long Window                         1    576
  MP3 Short Window                        3    192
  MP3 Mixed Window                        3    216
  AAC Long Window                         1   1024
  AAC Short Window                        8    128

Let $f_s$ be the sampling frequency. Then, according to Nyquist's theorem, the maximum frequency ($f_{BW}$) of the audio signal is $f_{BW} = f_s / 2$. Since both AAC and MP3 use linearly spaced frequency lines, the real frequency values can be obtained from FL(f) using the following expression:

$$ FL(f) = \begin{cases} (f+1)\times\dfrac{f_{BW}}{NoF} & \text{Short or Long Win. Mode} \\[4pt] (f+1)\times\dfrac{f_{BW}}{576} & f < 36,\ \text{MP3 Mixed Win. Mode} \\[4pt] \dfrac{f_{BW}}{16} + (f-35)\times\dfrac{f_{BW}}{192} & f \ge 36,\ \text{MP3 Mixed Win. Mode} \end{cases} \quad (4) $$

where f is the index running from 0 to the corresponding NoF given in Table III. The MDCT template array is formed from the absolute values of the MDCT subband coefficients, which are (Huffman) decoded from the MP3/AAC bit-stream per MP3 granule or AAC frame. For each MP3 granule, the MDCT subband coefficients are given in the form of a matrix of 32 lines, representing the frequency subbands, with 18 columns, one for each coefficient, as shown in Figure 15. In the short window case, there are three windows within a granule, each containing 6 coefficients per subband. The template matrix formation for short window MP3 granules is illustrated in Figure 16. In order to run the same algorithm for both encoding schemes, we apply a similar template formation structure to AAC frames. In the case of a long window AAC frame, the 1024-coefficient MDCT array is divided into 32 groups of 32 MDCT coefficients and the template matrix for AAC is formed by taking into account that the number of MDCT coefficients per subband is no longer 18 (as in MP3) but 32. Figure 17 illustrates the AAC long window template formation. In the case of a short window AAC frame, the 1024 coefficients are divided into 8 windows of 128 coefficients each. We divide these 128 coefficients into 32 subbands and fill the matrix with 4 coefficients per subband in order to obtain the same template layout as in the MP3 short window case. Figure 18 shows how the subbands are arranged and the template array is formed by this technique.

Figure 15: MP3 Long Window MDCT template array formation from MDCT subband coefficients.

Figure 16: MP3 Short Window MDCT template array formation from MDCT subband coefficients.

Figure 17: AAC Long Window MDCT template array formation from MDCT subband coefficients.

Figure 18: AAC Short Window MDCT template array formation from MDCT subband coefficients.

3.2.2. Spectral Template Formation in Generic Mode

In the generic mode, the spectral template is formed from the FFT of the PCM samples within a frame that has a fixed temporal duration. In bit-stream mode, the frame (temporal) duration varies since the granule/frame size is fixed (i.e. 576 samples in MP3, 1024 in AAC Long Windowing Mode). In generic mode, however, we have the possibility to extract either fixed-size or fixed-duration frames depending on the feature type. For analysis compatibility purposes, it is common practice to fix the (analysis) frame duration. If, however, a fixed spectral resolution is required (i.e. for fundamental frequency estimation), the frame size (hence the FFT window size) can also be kept constant by enlarging the frame via zero padding or simply by using samples from the neighboring frames.
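To make the template formation concrete, the following is a minimal sketch (not taken from the MUVIS implementation; the function names, the 30 ms frame length and the NumPy usage are our own illustrative choices) of how the frequency line array of (4) and a generic-mode power-spectrum frame could be computed.

```python
import numpy as np

def frequency_lines(f_s, no_f, mixed_mp3=False):
    """FL(f) of (4) for f = 0..NoF-1, with f_BW = f_s / 2 (Nyquist)."""
    f_bw = f_s / 2.0
    f = np.arange(no_f)
    if not mixed_mp3:
        # Short or Long Windowing Mode: linearly spaced lines over the full bandwidth
        return (f + 1) * f_bw / no_f
    # MP3 Mixed Windowing Mode (NoF = 216): the first 36 lines keep the long-window
    # resolution; the remaining lines use the short-window resolution, offset by the
    # bandwidth of the two long-windowed subbands (f_BW / 16).
    return np.where(f < 36,
                    (f + 1) * f_bw / 576.0,
                    f_bw / 16.0 + (f - 35) * f_bw / 192.0)

def generic_mode_frame_template(pcm_frame, f_s):
    """Power-spectrum template (NoW = 1) of one fixed-duration analysis frame."""
    n_fft = 2 ** int(np.ceil(np.log2(len(pcm_frame))))     # next power of two
    no_f = n_fft // 2                                       # lines within the bandwidth
    spectrum = np.abs(np.fft.rfft(pcm_frame, n_fft)) ** 2   # power spectrum of the frame
    pspq = spectrum[1:no_f + 1].reshape(1, no_f)            # SPEQ(w, f) with w = 0
    return pspq, frequency_lines(f_s, no_f)

# Example: a 30 ms frame of a 440 Hz tone sampled at 44.1 kHz
fs = 44100
t = np.arange(int(0.03 * fs)) / fs
pspq, fl = generic_mode_frame_template(np.sin(2 * np.pi * 440 * t), fs)
print(pspq.shape, fl[np.argmax(pspq[0])])   # the strongest line lands near 440 Hz
```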
Figure 19: Generic Mode Spectral Template Formation (Audio Stream → Frame Extraction, driven by the audio parameters → FFT → Template Extraction, yielding the power spectrum PSPQ(w, f) and the frequency line array FL(f)).

As shown in Figure 19, the generic mode spectral template consists of a variable size power spectrum double array, PSPQ(w, f), along with a variable size frequency line array FL(f), which represents the real frequency value of each row entry in the PSPQ array. The index w represents the window number and the index f represents the line frequency index. In generic mode, NoW = 1 and NoF is the number of frequency lines within the spectral bandwidth:

$$ NoF = 2^{\lceil \log_2 (fr_{dur} \cdot f_s) \rceil - 1} $$

where $\lceil \cdot \rceil$ is the ceiling operator, $fr_{dur}$ is the duration of one audio (analysis) frame and $f_s$ is the sampling frequency. Note that for both modes, the template is formed independently of the number of channels (i.e. stereo/mono) in the audio signal. If the audio is stereo, both channels are averaged and the average is used as the signal before the frame extraction is performed.

3.3. FEATURE EXTRACTION

As shown in Figure 14, a hierarchic approach has been adopted for the overall feature extraction scheme in the proposed framework. First the frame (or granule) features are extracted using the spectral template, and the segment features are derived afterwards in order to accomplish the classification of the segments. The following sub-sections focus on the extraction of the individual frame features.

3.3.1. Frame Features

Granule features are extracted from the spectral template, SPEQ(w, f) and FL(f), where SPEQ is assigned to MDCT or PSPQ depending on the current working mode. We use some classical features such as Band Energy Ratio (BER), Total Frame Energy (TFE) and Subband Centroid (SC). We also developed a novel feature, the so-called Transition Rate (TR), and tested it against its conventional counterpart, the Pause Rate (PR). Since both PR and TR are segment features by definition, they will be introduced in the next section. Finally, we propose an enhanced Fundamental Frequency (FF) detection algorithm, which is based on the well-known HPS (Harmonic Product Spectrum) technique [47].

3.3.1.1 Total Frame Energy Calculation

Total Frame Energy (TFE) can be calculated using (5). It is the primary feature used to detect silent granules/frames. Silence detection is also used for the extraction of TR, which is one of the main segment features.

$$ TFE_j = \sum_{w}^{NoW} \sum_{f}^{NoF} \bigl( SPEQ_j(w, f) \bigr)^2 \quad (5) $$

3.3.1.2 Band Energy Ratio Calculation

Band Energy Ratio (BER) is the ratio between the total energies of two spectral regions that are separated by a single cut-off frequency. The two regions fully cover the spectrum of the input audio signal. Given a cut-off frequency value $f_c$ ($f_c \le f_{BW}$), let $f_{f_c}$ be the line frequency index for which $FL(f_{f_c}) \le f_c < FL(f_{f_c}+1)$. The BER of a granule/frame j can then be calculated using (6):

$$ BER_j(f_c) = \frac{\displaystyle \sum_{w}^{NoW} \sum_{f=0}^{f_{f_c}} \bigl( SPEQ_j(w, f) \bigr)^2}{\displaystyle \sum_{w}^{NoW} \sum_{f=f_{f_c}}^{NoF} \bigl( SPEQ_j(w, f) \bigr)^2} \quad (6) $$
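As a small illustration of (5) and (6), the sketch below computes TFE and BER on a spectral template under the same assumptions as the previous listing ('speq' is a NoW × NoF array and 'fl' the matching frequency line array; the 500 Hz cut-off is simply the example value used later in Step 1).

```python
import numpy as np

def total_frame_energy(speq):
    """TFE of (5): the energy summed over all windows and frequency lines."""
    return float(np.sum(speq ** 2))

def band_energy_ratio(speq, fl, f_cut=500.0):
    """BER of (6): energy below the cut-off line over the energy from that line up."""
    f_fc = int(np.searchsorted(fl, f_cut, side='right') - 1)  # FL(f_fc) <= f_c < FL(f_fc + 1)
    low = np.sum(speq[:, :f_fc + 1] ** 2)
    high = np.sum(speq[:, f_fc:] ** 2)
    return float(low / high) if high > 0 else float('inf')

# e.g. with the template of the previous sketch:
# print(total_frame_energy(pspq), band_energy_ratio(pspq, fl, 500.0))
```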
3.3.1.3 Fundamental Frequency Estimation

If the input audio signal is harmonic over a fundamental frequency (i.e. there exists a series of major frequency components at integer multiples of a fundamental frequency value), the real Fundamental Frequency (FF) value can be estimated from the spectral coefficients SPEQ(w, f). We therefore apply an adaptive peak-detection algorithm over the spectral template to check whether a sufficient number of peaks around the integer multiples of a certain frequency (a candidate FF value) can be found. The algorithm basically works in 3 steps:

• adaptive extraction of all the spectral peaks,
• candidate FF peak extraction via the Harmonic Product Spectrum (HPS),
• multiple peak search and Fundamental Frequency (FF) verification.

Human speech in particular has most of its energy in the lower bands (i.e. f < 500 Hz) and hence the absolute values of the peaks in this range might be significantly greater than those of the peaks in the higher frequency bands. This brings the need for an adaptive design in order to detect the major spectral peaks over the whole spectrum. We therefore apply a non-overlapping partitioning scheme over the spectrum and the major peaks are then extracted within each partition. Let $N_P$ be the number of partitions, each of which has a bandwidth of $f_{BW} / N_P$ Hz. In order to detect the peaks in a partition, the absolute mean value is first calculated from the spectral coefficients in the partition, and if a spectral coefficient is significantly bigger than this mean value (e.g. greater than twice the mean), it is chosen as a new peak; this process is repeated for all the partitions. The maximum spectral coefficient within a partition is always chosen as a peak even if it does not satisfy the aforementioned rule. This is done to ensure that at least one peak is detected per partition. One of the main advantages of the partition-based peak detection is that the amount of redundant spectral data is significantly reduced in favor of the major peaks, which are the main concern of the FF estimation scheme.

The candidate FF peaks are obtained via HPS. If a frame is harmonic with a certain FF value, HPS can detect this value. However, there are two potential problems. First, HPS can be noisy if the harmonic audio signal is a noisy, mixed signal with significant non-harmonic components. In this case HPS will not extract the FF value as the first peak in the harmonic product, but as the second or a higher order peak. For this reason we consider a reasonable number (i.e. 5) of the highest peaks extracted from the harmonic product as the candidate FF values. Another potential problem is that HPS does not indicate whether or not the audio frame is harmonic at all, since it always produces some harmonic product peak values from a given spectrum. The harmonicity should therefore be searched for and verified among the peak values extracted in the previous step.

The multiple peak verification is a critical process for the fundamental frequency (FF) calculation. Due to the limited spectral resolution, one potential problem is that a multiple (harmonic) peak may not fall exactly on the frequency line of a spectral coefficient. Let the linear frequency spacing between two consecutive spectral coefficients be $\Delta f = FL(f) - FL(f-1) = f_{BW} / NoF$ and let the real FF value lie in the $\{-\Delta f/2, +\Delta f/2\}$ neighborhood of a spectral coefficient at the frequency FL(f). Then the minimum window width to search for the n-th (possible) peak is $W(n) = n \times \Delta f$. Another problem is the pitch-shifting phenomenon of the harmonics that occurs especially in the harmonic patterns of speech. Terhardt [66] proposed stretched templates for the detection of the pitch (fundamental frequency), and one simple analytical description of the template stretching is given by the following expression:

$$ f_n = n^{\sigma} f_0 \;\Rightarrow\; W(n) = (n^{\sigma} - n)\, f_0 \quad \forall n = 2, 3, 4, \ldots \quad (7) $$

where $\sigma$ is the stretch factor with a nominal value of 1 and $f_0$ is the perceived fundamental frequency. Practically $\sigma \approx 1.01$ for human speech, and the relation can therefore be approximated as a linear function (i.e. $f_n \cong n \sigma f_0$). Due to such harmonic shifts and the limited spectral resolution of the spectral template, an adaptive search window is applied in order not to miss a (multiple) peak on a multiple frequency line. On the other hand, false detections might occur if the window width is chosen larger than necessary. We developed, tested and used the following search window template for the detection:

$$ W(n) = n \,\sigma \,\Delta f \quad \forall n = 2, 3, 4, \ldots \quad (8) $$

Note that the search window width is proportional to both the sampling frequency of the audio signal and the stretch factor $\sigma$, and inversely proportional to the total number of frequency lines, which gives a good measure of the resolution and provides a stretched template model for the perceived FF value. Figure 20 illustrates a sample peak detection applied on the spectral coefficients of an audio frame with a sampling frequency of 44100 Hz. Therefore $f_{BW}$ = 22050 Hz, but the sketch only shows the range up to around 17000 Hz for the sake of illustration. The lower subplot shows the overall peaks detected in the first step (red), the 5 candidate peaks extracted in the second step via the HPS algorithm (black), the multiple (harmonic) peaks found (blue) and finally the FF value estimated accordingly (FF = FL(18) = 1798 Hz in this example).

Figure 20: FF detection within a harmonic frame (power spectrum curve with the overall spectral peaks, the 5 HPS candidate peaks and the harmonic peaks located via the estimated FF peak at 1798 Hz).
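The three-step FF estimation can be sketched as follows. This is an illustrative reading of the procedure, not the exact MUVIS code: the number of partitions, the number of HPS harmonics and the minimum number of verified multiples are assumptions, while the 2×-mean peak rule, the 5 HPS candidates and the search window of (8) follow the text.

```python
import numpy as np

def detect_peaks(spectrum, n_partitions=16):
    """Step 1: adaptive, partition-based peak picking over one spectral row."""
    peaks = set()
    for part in np.array_split(np.arange(len(spectrum)), n_partitions):
        mag = spectrum[part]
        peaks.update(part[mag > 2.0 * mag.mean()].tolist())  # well above the partition mean
        peaks.add(int(part[np.argmax(mag)]))                 # always keep the partition maximum
    return np.array(sorted(peaks))

def hps_candidates(spectrum, n_candidates=5, n_harmonics=4):
    """Step 2: highest peaks of the harmonic product spectrum as candidate FF lines."""
    hps = spectrum.astype(float).copy()
    for k in range(2, n_harmonics + 1):
        hps[:len(spectrum) // k] *= spectrum[::k][:len(spectrum) // k]
    return np.argsort(hps)[-n_candidates:][::-1]

def estimate_ff(spectrum, fl, sigma=1.01, min_harmonics=3):
    """Step 3: accept a candidate only if enough of its multiples fall on detected peaks."""
    delta_f = fl[1] - fl[0]                                  # linear line spacing
    peak_freqs = fl[detect_peaks(spectrum)]
    for cand in hps_candidates(spectrum):
        f0, hits = float(fl[cand]), 0
        for n in range(2, 8):
            window = n * sigma * delta_f                     # search window W(n) of (8)
            if np.any(np.abs(peak_freqs - n * f0) <= window / 2.0):
                hits += 1
        if hits >= min_harmonics:
            return f0                                        # harmonic frame: FF found
    return 0.0                                               # treated as non-harmonic

# usage with the generic-mode template of the earlier sketch:
# print(estimate_ff(pspq[0], fl))
```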
3.3.1.4 Subband Centroid Frequency Estimation

Subband Centroid (SC) is the first moment of the spectral distribution (spectrum); in the compressed domain it can be estimated as the frequency value that balances the absolute spectral values. Using the spectral template arrays, the SC value ($f_{SC}$) can be calculated as follows:

$$ f_{SC} = \frac{\displaystyle \sum_{w}^{NoW} \sum_{f}^{NoF} SPEQ(w, f) \times FL(f)}{\displaystyle \sum_{w}^{NoW} \sum_{f}^{NoF} SPEQ(w, f)} \quad (9) $$
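A one-function sketch of the Subband Centroid of (9), under the same template conventions as the earlier listings (illustrative only):

```python
import numpy as np

def subband_centroid(speq, fl):
    """Subband Centroid of (9): the frequency balancing the absolute spectral mass."""
    weights = np.abs(speq)                       # |SPEQ(w, f)| over all windows
    return float(np.sum(weights * fl[np.newaxis, :]) / np.sum(weights))

# e.g. subband_centroid(pspq, fl) stays low for the 440 Hz tone of the first sketch
# and rises for broadband (noise-like or unvoiced) content.
```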
3.3.2. Segment Features

Segment features are extracted from the frame (or granule) features and are mainly used for the classification of the segment. A segment, by definition, is a temporal window that lasts a certain duration within an audio clip. There are basically two types: silent and non-silent segments. The non-silent segments are subject to further classification using their segment features. As mentioned before, the objective is to extract global segments, each of which should contain content that is stationary over its time span from the semantic point of view. Practically, there is no upper bound on the segment duration. For instance, a segment may cover the whole audio clip if it contains a unique audio category (i.e. MP3 music clips). However, there is, and should be, a practical lower bound on the segment duration within which a perceivable content can exist (i.e. > 0.6 seconds). Total Frame Energy (TFE) is the only feature used for the detection of the silent segments. The segment features that are then used to classify the non-silent segments are presented in the following sections.

3.3.2.1 Dominant Band Energy Ratio

Since the energy of human speech is concentrated mostly at the lower frequencies, an audio frame can be classified as speech or music by comparing its Band Energy Ratio (BER) value with an empirical threshold. This is an unreliable process when applied on a per-frame basis, but within a segment it can serve as an initial classifier by using the dominant (winning) class type within the segment. Experimental results show that the Dominant Band Energy Ratio (DBER), as a segment feature, does not achieve as high an accuracy as the other major segment features, but it usually gives consistent results for similar content. Therefore, we use DBER in the initial steps of the main algorithm, mainly for merging the immature segments into more global segments if their class types match with respect to DBER. One of the requirements for segment merging is that the neighboring segments have the same class type, and DBER is consistent in giving the same (right or wrong) result for the same content.

3.3.2.2 Transition Rate vs. Pause Rate

Pause Rate (PR) is a well-known feature for speech/music discrimination; it is basically the ratio of the number of silent granules/frames to the total number of granules/frames in a non-silent segment. Due to the natural pauses and unvoiced consonants that occur within any speech content, speech exhibits a certain PR level that is usually lacking in music content. Therefore, if this ratio is over a threshold ($T_{PR}$), the segment is classified as a speech segment, otherwise as music. PR usually achieves a significant performance in discriminating speech from music. However, its performance degrades when there is fast speech (without a sufficient amount of pauses), a background noise, or when the speech segment is quite short (i.e. < 3 seconds). Since PR is only related to the amount (number) of silent frames within a segment, it can lead to critical misclassifications. In natural human speech, due to the presence of unvoiced consonants, the frequency of occurrence of silent granules is generally high even though their total time span (duration) might still be low, as in the case of fast speech. On the other hand, some classical music clips may contain one or a few intentional silent sections (silent passages) that can cause the whole segment (or clip) to be misclassified as speech due to the long duration of such passages. These erroneous cases led us to introduce an improved measure, the Transition Rate (TR), which is based on the transitions occurring between consecutive frames. TR can be formulated for a segment S as in (10):

$$ TR(S) = \frac{NoF + \sum_{i} TP_i}{2 \, NoF} \quad (10) $$

where NoF is the number of frames within segment S, i is the frame index and $TP_i$ is the transition penalization factor obtained from the following table:

Table IV: Transition Penalization Table.

  Transition: fr_i → fr_(i+1)       TP_i
  silent → non-silent                +1
  non-silent → silent                +1
  silent → silent                    +1
  non-silent → non-silent            -1

Note that although the total amount of silent frames is low in a fast speech or in a short speech segment, the transition rate will still be high due to their frequent occurrence.
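The Transition Rate of (10) together with the penalization of Table IV can be sketched as follows; the per-frame silence labels are assumed to come from the silence detection described later in Step 1 (Section 3.4.1), and the example pattern is purely illustrative.

```python
def transition_rate(is_silent):
    """TR of (10); is_silent holds per-frame booleans of one non-silent segment, in order."""
    no_f = len(is_silent)
    tp_sum = 0
    for prev, cur in zip(is_silent, is_silent[1:]):
        # Table IV: a non-silent -> non-silent pair contributes -1;
        # every pair involving a silent frame contributes +1.
        tp_sum += -1 if (not prev and not cur) else +1
    return (no_f + tp_sum) / (2.0 * no_f)

# A fast-speech pattern with frequent but short pauses still yields a high TR:
print(transition_rate([False, True, False, False, True, False, True, False]))  # 0.8125
```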
3.3.2.3 Fundamental Frequency Segment Feature

Fundamental Frequency (FF) is another well-known music/speech discriminator, due to the fact that music is in general more harmonic than speech. Pure speech contains a sequence of harmonic tonals (vowels) and inharmonic consonants. In speech, due to the presence of inharmonic consonants, the natural pauses and the low-bounded FF values (i.e. < 500 Hz), the average FF value within a segment tends to be quite low. Since the presence of continuous instrumental notes results in large harmonic sections with unbounded FF occurrences in music, the average FF value tends to be quite high. However, the average FF value alone might lead to classification failures in some exceptional cases. Experiments show that such misclassifications occur especially in some harmonic female speech segments or in some hard-rock music clips with saturated beats and bass drums. In order to improve the discrimination obtained from the FF segment feature, we developed an enhanced segment feature based on a conditional mean, which basically verifies strict FF tracking (continuity) within a window. The FF value of a particular frame is therefore included in the mean summation only if its nearest neighbors are also harmonic; otherwise it is discarded. The conditional-mean based FF segment feature is formulated in (11):

$$ FF(S) = \frac{\sum_{i}^{NoF} FF_i^{c}}{NoF} \quad \text{where} \quad FF_i^{c} = \begin{cases} FF_i & \text{if } FF_j \neq 0 \ \ \forall j \in NN(i) \\ 0 & \text{otherwise} \end{cases} \quad (11) $$

where $FF_i$ is the FF value of the i-th frame in segment S and j represents the index of the frames in the nearest-neighbor frame set of the i-th frame (i.e. $i-3 \le NN(i) \le i+3$). Due to the frequent discontinuities in the harmonicity, such as pauses (silent frames) and consonants, in a typical speech segment, the conditional mean results in a significantly low FF segment value for pure speech. Music segments, on the other hand, tend to have higher FF segment values due to their continuous harmonic nature, even though the beats or bass drums might cause some partial losses in the FF tracking. The experimental results confirm the significant improvement obtained from the conditional-mean based FF segment feature, and the FF segment feature has thus become one of the major features that we use for the classification of the final (global) segments.

3.3.2.4 Subband Centroid Segment Feature

Due to the presence of both voiced (vowels) and unvoiced (consonants) parts in a speech segment, the average Subband Centroid (SC) value tends to be low with a significantly higher standard deviation, and vice versa for music segments. However, the experimental results show that some music types can also exhibit quite low SC average values, and thus the SC segment feature used to perform the classification is the standard deviation alone, with one exception: the mean of SC within a segment is only used when it has such a high value (forced classification by SC) that it indicates the presence of music with certainty. Both SC segment features are extracted by smoothly sliding a short window through the frames of the non-silent segment. The standard deviation of SC within a segment is calculated using the local windowed mean of SC and is formulated as in (12):

$$ \sigma_{SC}(S) = \sqrt{\frac{\sum_{j}^{NoF} \bigl( SC_j - \mu_j^{SC} \bigr)^2}{NoF}} \quad \text{where} \quad \mu_i^{SC} = \frac{\sum_{j \in W_i} SC_j}{NoW} \quad (12) $$

where $\mu_i^{SC}$ is the windowed SC mean of the i-th frame calculated within a window $W_i$ of NoW frames, and $\sigma_{SC}(S)$ is the SC segment feature of the segment S with NoF frames. Such an adaptive calculation of the segment feature improves the discrimination between speech and music, and therefore SC is used as the third major feature within the final classification scheme.

3.3.3. Perceptual Modeling in Feature Domain

The primary approach in the classification and segmentation framework is based on perceptual modeling in the feature domain, which is mainly applied to the major segment features: FF, SC and TR.
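Before turning to the perceptual model, here is a compact sketch of the two segment features just defined, the conditional-mean FF of (11) and the windowed SC deviation of (12); the ±3-frame neighborhood follows the text, whereas the 7-frame SC window and the function names are illustrative assumptions.

```python
import numpy as np

def ff_segment_feature(ff, nn=3):
    """Conditional mean of (11): a frame's FF counts only if all frames within
    +-nn of it are harmonic (non-zero FF)."""
    ff = np.asarray(ff, dtype=float)
    cond = np.zeros_like(ff)
    for i in range(len(ff)):
        lo, hi = max(0, i - nn), min(len(ff), i + nn + 1)
        if np.all(ff[lo:hi] != 0):               # strict FF tracking in the neighborhood
            cond[i] = ff[i]
    return float(cond.mean())

def sc_segment_feature(sc, win=7):
    """Windowed standard deviation of (12), computed over the segment frames."""
    sc = np.asarray(sc, dtype=float)
    devs = []
    for i in range(len(sc)):
        w = sc[max(0, i - win // 2): i + win // 2 + 1]   # local window W_i
        devs.append((sc[i] - w.mean()) ** 2)
    return float(np.sqrt(np.mean(devs)))

# Frequent zero-FF frames (pauses, consonants) pull ff_segment_feature down for speech,
# while the voiced/unvoiced alternation pushes sc_segment_feature up.
```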
Depending on the nature of the segment feature, the model provide a perceptual-rule based division in the feature space as shown in Figure 21. The forcedclassification occurs if that particular feature results such an extreme value that perceptual certainty about content identification is occurred. Therefore, it overrides all the other features so that the final decision is made with respect to that feature alone. Note that the occurrence of a forced classification, its region boundaries and its class category depend on the nature of the underlying segment feature. In this context each major segment feature has the following forced-classification definitions: • TR has a forced speech classification region above 15% due to the fact that only pure speech can yield such a high value within a segment formation. • FF has a forced music classification with respect to its mean value that is above 2 KHz. This makes sense since only pure and excessively harmonic music content can yield such an extreme mean value. • SC has two forced-classification regions, one for music and the other for speech content. The forced music classification occurs when the SC mean exceeds 2 KHz and the forced speech classification occurs when the primary segment feature of SC, the adaptive σ SC value, exceeds 1200 Hz. FORCED CLASSIFICATION Music/Speech FUZZY REGION Speech/Music FORCED CLASSIFICATION Figure 21: Perceptual Modeling in Feature Domain. Unsupervised Audio Classification and Segmentation 49 Although the model supports both lower and upper forced-classification regions, only the upper regions are so far used. However we tend to keep the lower region in case the further experimentations might approve the usage of that region in the future. The region below forced-classification is where the natural discrimination occurs into one of the pure classes such as speech or music. For all segment features the lower boundary of this region is tuned so that the feature would have a typical value that can be expected from a pure class type but still quite far away having a certainty to decide the final classification alone. Finally there may be a fuzzy region where the feature value is no longer reliable due to various possibilities such as the audio class type is not pure, rather mixed or some background noise is present causing ‘blurring’ on the segment features. So for those segment features that are examined and approved for the fuzzy approach, a fuzzy region is formed and tuned experimentally to deal with such cases. There is, on the other hand, another advantage of having a fuzzy region between the regions where the real discrimination occurs. The fuzzy region prevents most of the critical errors, which might occur due to noisy jumps from one (pure class) region on to another. Such noisy cases or anomalies can be handled within the fuzzy region and a critical error turns out to be a non-critical error for the sake of audio-based indexing and retrieval. Furthermore, there are still other features that might help to clarify the classification at the end. Experimental results show that FF and SC segment features are suitable for fuzzy region modeling. However TR cannot provide a reliable fuzzy model due to its nature. Although TR can achieve probably the highest reliability of distinguishing speech from the other class types, it is practically blind of categorization of any other non-speech content (i.e. fuzzy from music, music from speech with significant background noise, etc.). 
Therefore, fuzzy modeling is not applied to TR segment feature to prevent such erroneous cases. 3.4. GENERIC AUDIO CLASSIFICATION AND SEGMENTATION The proposed approach is mainly developed based on the aforementioned fact: automatic audio segmentation and classification are mutually dependent problems. A good segmentation requires good classification and vice versa. Therefore, without any prior knowledge or supervising mechanism, the proposed algorithm proceeds in an iterative way, starting from granule/frame based classification and initial segmentation, the iterative steps are carried out until a global segmentation and thus a successful classification per segment can be achieved at the end. Figure 22 illustrates the 4-steps iterative approach to the audio classification and segmentation problem. 50 Initial Classification (per granule) Initial Segmentation Intermediate Classification Primary Segmentation Primary Classification Further (Intra) Segmentation Step 1 Step 2 Step 3 Step 4 Figure 22: The flowchart of the proposed approach. 3.4.1. Step 1: Initial Classification As for the first step the objective is to extract silent and non-silent frames and then obtain an initial categorization for the non-silent frames in order to proceed with an initial segmentation on the next step. Since the frame-based classification is nothing but only needed for an initial segmentation, there is no need of introducing fuzzy classification in this step. Therefore, each granule/frame is classified in one of three categories: speech, music or silent. Silence detection is performed per granule/frame by applying a threshold ( TTFE ) to the total energy as given in (5). TTFE is calculated adaptively in order to take the audio sound volume effect into account. The minimum ( Emin ), maximum ( Emax) and average ( E µ ) granule/frame energy values are first calculated from the entire audio clip. An empirical all-mute test is performed to ensure the presence of the audible content. Two conditions are checked: I. E max > Minimum Audible Frame Energy Level. II. E max >> Emin . Otherwise the entire clip is considered as all mute and hence no further steps are necessary. Once the presence of some non-silent granules/frames is confirmed then TTFE is calculated according to (13). TTFE = E min + λs × ( E µ − E min ), where 0 < λs ≤ 1 (13) λs is the silence coefficient, which determines the silence threshold value between Emin and E µ . If the total energy of a granule/frame is below TTFE , then it is classified as silent, otherwise non-silent. If a granule/frame is not classified as silent, the BER is then calculated for a cut-off frequency of 500 Hz due to the fact that most of speech energy is concentrated below 500 Hz. If BER value for a frame is over a threshold (i.e. 2%) that granule/frame is classified as music, otherwise speech. Figure 23 summarizes the operation performed in Step 1. Unsupervised Audio Classification and Segmentation Frame 1 Frame 2 Frame 3 51 Frame NoF ..... speech Silent Detection non-silent BER music silent Frame Classification speech speech silent music .... music Figure 23: Operational Flowchart for Step 1. 3.4.2. Step 2 In this step, using the frame-based features extracted and classifications performed in the previous step the first attempts for the segmentation has been initiated and the first segment features are extracted from the initial segments formed. To begin with silent and non-silent segmentations are performed. 
In the previous step, all the silent granules/frames have already been found. So the silent granules/frames are merged to form silent segments. An empirical minimum interval (i.e. 0.2 sec.) is used to assign a segment as a silent segment if sufficient number of silent granules/frames merges to a segment, which has the duration greater than this threshold. All parts left between silent segments can then be considered as non-silent segments. Once all non-silent segments are formed, then the classification of these segments is performed using DBER and TR. The initial segmentation and segment classification (via DBER) is illustrated in Figure 24 with a sample segmentation and classification example at the bottom. Note that there might be different class types assigned independently via DBER and TR for the non-silent segments. This is done on purpose since the initial segment classification performed in this step with such twofold structure is nothing but a preparation for the further (towards a global) segmentation efforts that will be presented in the next step. 52 Silent Segmentation Non-Silent Segmentation Separated Non-Silent Segments DBER TR Classified Segments S M S M M S M X X X S M X S S S M X S S X silent segment M M M X X X X X S S S S S S ... silent segment X S S S S S X X X X ... Figure 24: Operational Flowchart for Step 2. 3.4.3. Step 3 This is the primary step where most of the efforts towards classification and global segmentation are summed up, as illustrated in Figure 25. The first part of this step is devoted to a merging process to obtain more global segments at the end. The silent segments extracted in the previous step might be ordinary local pauses during a natural speech or they can be the borderline from one segment to another one with a different class type. If the former argument is true, such silent segments might still be quite small and negligible for the sake of segmentation since they reduce the duration of the non-silent segments and hence they might lead to erroneous calculations for the major segment features. Therefore, they need to be eliminated to yield a better (global) segmentation that would indeed result in a better classification. There are two conditions in order to eliminate a silent segment and merge its non-silent neighbors. i. Its duration is below a threshold value, ii. Neighbor non-silent segment class types extracted from DBER and TR are both matching After merging some of non-silent segments, the overall segmentation scheme is changed and the features have to be re-extracted over the new (emerged via merging) non-silent segments. For all the non-silent segments, PR and DBER are re-calculated and then they are reclassified. This new classification of non-silent segments may result into such classification types that allow us to eliminate further silent segments (In the first step they may not be Unsupervised Audio Classification and Segmentation 53 eliminated because the neighbor classification types did not match). So an iterative loop is applied to eliminate all possible small silent segments. The iteration is carried out till all small silent segments are eliminated and non-silent segments are merged to have global segments, which have a unique classification type. Due to the perceptual modeling in the feature domain any segment feature may fall into forced-classification region and overrides the common decision process with its decision alone. Otherwise a decision look-up table is applied for the final classification as given in Table V. 
This table is formed by considering all possible class type combinations. Note that the majority rule is dominantly applied within this table, that is, if the majority of the segment features favor a class type, then that class type is assigned to the segment. If a common consensus cannot be reached, the segment is set as fuzzy.

Figure 25: Operational Flowchart for Step 3 (the merging process driven by DBER and TR iterates while the number of silent segments keeps changing; once stable, small non-silent segments are filtered and the global segments are classified as speech, music, fuzzy or silent through the FF, TR and SC decision process).

Table V: Generic Decision Table.

  TR       FF       SC       Decision
  speech   speech   speech   Speech
  speech   speech   music    Speech
  speech   speech   fuzzy    Speech
  speech   music    speech   Speech
  speech   music    music    Music
  speech   music    fuzzy    Fuzzy
  speech   fuzzy    speech   Speech
  speech   fuzzy    music    Fuzzy
  speech   fuzzy    fuzzy    Fuzzy
  music    speech   speech   Speech
  music    speech   music    Music
  music    speech   fuzzy    Fuzzy
  music    music    speech   Music
  music    music    music    Music
  music    music    fuzzy    Music
  music    fuzzy    speech   Fuzzy
  music    fuzzy    music    Music
  music    fuzzy    fuzzy    Fuzzy
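The decision logic of this step can be summarized in a short sketch: forced classifications override everything, otherwise each major feature casts a vote and the majority rule of Table V applies, with disagreement falling back to fuzzy. The forced-classification thresholds are those quoted in Section 3.3.3; the per-feature voting boundaries are illustrative placeholders, not the tuned values of the framework.

```python
def classify_segment(tr, ff_mean, sc_std, sc_mean):
    # Forced-classification regions (Section 3.3.3) override all other features.
    if tr > 0.15:                     # only pure speech yields such a high transition rate
        return "speech"
    if ff_mean > 2000.0 or sc_mean > 2000.0:
        return "music"
    if sc_std > 1200.0:
        return "speech"
    # Per-feature votes; the region boundaries below are illustrative placeholders.
    # TR never votes fuzzy, as discussed at the end of Section 3.3.3.
    votes = [
        "speech" if tr > 0.08 else "music",
        "music" if ff_mean > 600.0 else ("fuzzy" if ff_mean > 250.0 else "speech"),
        "speech" if sc_std > 700.0 else ("fuzzy" if sc_std > 350.0 else "music"),
    ]
    for label in ("speech", "music"):
        if votes.count(label) >= 2:   # majority rule of Table V
            return label
    return "fuzzy"                    # no consensus
```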
3.4.4. Step 4

This step is dedicated to the intra-segmentation analysis and mainly performs some post-processing to improve the overall segmentation scheme. Once the final classification and segmentation of Step 3 (Section 3.4.3) is finished, non-silent segments with a significantly long duration might still need to be partitioned into new segments if they consist of two or more sub-segments (without any silent part in between) with different class types. For example, a long segment might contain sub-segments with both pure music and pure speech content without a silent separation in between. Due to the lack of a (sufficiently long) silent separation, the earlier steps fail to detect those sub-segments and therefore a further (intra) segmentation is performed in order to separate them. We developed two approaches for intra segmentation. The first approach divides the segment into two and uses the SC segment feature to detect the presence of a significant difference between the halves (unbalanced sub-segments). The second one attempts to detect the boundaries of any potential sub-segment with a different class type by using the Subband Centroid (SC) frame feature. Generally speaking, the first method is more robust in detecting the major changes, since it uses the segment feature, which is usually robust to noise. However, it sometimes introduces a significant offset in the exact location of the sub-segments and therefore causes severe degradations in the temporal resolution and the segmentation accuracy. This problem is mostly solved in the second method, but the second method may increase the amount of false sub-segment detections, especially when the noise level is high. Both methods are explained in detail in the following sub-sections.

3.4.4.1 Intra Segmentation by Binary Division

This method first tests whether the non-silent segment is significantly longer than a given threshold (i.e. 4 seconds). We then divide the segment into two sub-segments and test whether their SC segment feature values differ significantly from each other. If not, we keep the parent segment and stop. Otherwise we perform the same operation over the two child segments and look for the one that is less balanced (the one with the higher SC value difference between its left and right child-segments). The iteration is carried out until the child segment is small enough to break the iteration loop. This gives the sub-segment boundary, and Step 3 is then re-performed over the sub-segments in order to make the most accurate classification possible. If Step 3 does not assign a different class type to the potential sub-segments detected, the initial parent segment is kept unchanged; this means that a false detection occurred in Step 4. Figure 26 illustrates the algorithm in detail. The function local_change() performs an SC-based classification of the right and left child segments within the parent segment (without dividing it) and returns the absolute SC difference between them.

Figure 26: Intra Segmentation by Binary Division in Step 4 (a long segment is recursively split in two; local_change() is evaluated on each half and the less balanced child is divided further until it becomes too short, after which Step 3 is re-run over the detected sub-segments).

3.4.4.2 Intra Segmentation by Breakpoints Detection

This is a more traditional approach used in many similar works. The same test as in the previous approach is performed to check whether the segment has a sufficiently long duration. If so, using a robust frame feature (i.e. SC), a series of breakpoints (certain frame locations) is detected where the class types of the associated frames, according to the SC frame feature, alternate with respect to the class type of the parent segment. The alternation starts with a particular frame whose SC feature value indicates a class type different from that of the parent segment; the class type may then swing back to the parent's class type after a while, or the alternation may end at the parent segment boundary. The SC segment feature is the windowed standard deviation used for the classification of the segment. Keeping the same analogy, we use a windowed standard deviation calculated per frame and, by comparing it with the SC segment feature, the breakpoints can be detected. The windowed standard deviation of SC for frame i, $\sigma_i^{SC}$, can be calculated as in (14):

$$ \sigma_i^{SC} = \sqrt{\frac{\sum_{j \in W_i} \bigl( SC_j - \mu_i^{SC} \bigr)^2}{NoW}} \quad \text{where} \quad \mu_i^{SC} = \frac{\sum_{j \in W_i} SC_j}{NoW} \quad (14) $$

A pair of breakpoints can then be detected by comparing $\sigma_i^{SC}$ for all the frames within the segment with the SC segment feature $\sigma_{SC}(S)$, i.e. $\sigma_i^{SC} > \sigma_{SC}(S)$ marks the opening breakpoint and $\sigma_{i+NoF_{SS}}^{SC} < \sigma_{SC}(S)$ marks the closing breakpoint of a sub-segment of $NoF_{SS}$ frames. A sample breakpoint detection is shown in Figure 27.

Figure 27: Windowed SC standard deviation sketch (white) in a speech segment. The breakpoints are successfully detected against the SC segment threshold with the Roll-Down algorithm and the music sub-segment is extracted.

3.5. EXPERIMENTAL RESULTS

For the experiments in this section the following MUVIS databases are mainly used (see subsection 2.1.4): Open Video, Real World Audio/Video and Music Audio databases. All experiments are carried out on a Pentium-4 3.06 GHz computer with 2048 MB memory. The performance evaluation is carried out subjectively, using only the samples containing straightforward (obvious) content. In other words, if there is any subjective ambiguity about the result, such as an (evaluator) doubt about the class type of a particular segment (e.g. speech or fuzzy?) or about the relevancy of some of the audio-based retrieval results of an aural query, then that sample is simply discarded from the evaluation.
Therefore, the experimental results presented in this section depend only on the decisive subjective evaluation via ground truth and yet they are meant to be evaluator-independent (i.e. same subjective decisions are guaranteed to be made by different evaluators). The experiments carried out are detailed and reported in the next two subsections. Subsection 3.5.1 presents the performance evaluation of the enhanced frame features, their discrimination factors and especially the proposed fuzzy modeling and the final decision process. The accuracy analysis via error distributions and the performance evaluation of the overall segmentation and classification scheme is given in subsection 3.5.2. 3.5.1. Feature Discrimination and Fuzzy Modeling The overall performance of the proposed framework mainly depends on the discrimination factors of the extracted frame and segment features. Furthermore, the control over the decisions based on each segment feature plays an important role in the final classification. In order to have an effective control on the decision making criteria, we only used certain number of features giving significant discrimination for different audio content due to their improved design, instead of having too many traditional ones. As shown in a sample automatic classification and segmentation example in Figure 28, almost all of the features provide a clean distinction between pure speech and music content. Also intra segmentation via breakpoints detection works successfully as shown in the upper part of Figure 28 and false breakpoints are all eliminated. 58 speech Overall Classification & Segmentation speech silent Intra Segmentations via Breakpoints Det. Time Pointer music music music FF Frame (white) & Segment (red) Features SC Frame (white) and Segment (red) Features TR Segment Feature Figure 28: Frame and Segment Features on a sample classification and segmentation. Yet there are still weak and strong points of each feature used. For instance TR is perceptually blind in the discrimination between fuzzy and music content as explained before. Similarly, FF segment feature might fail to detect harmonic content if the music type is Techno or Hard Rock with saturated beats and base drums. In the current framework, such a weak point of a particular feature can still be avoided by the help of the others during the final decision process. One particular example can be seen in Figure 28: FF segment feature wrongly classifies the last (music) segment, as speech. However, the overall classification result is still accurate (music) as can be seen on top of Figure 28 since both SC and TR features favor music which overrides the FF feature (by majority rule) in the end. 3.5.2. Overall Classification and Segmentation Performance The evaluation of the proposed framework is carried out over the standalone MP3, AAC audio clips, AVI and MP4 files containing MPEG-4 video along with MP3, AAC, ADPCM (G721 and G723 in 3-5 bits/sample) and PCM (8 and 16 bits/sample) audio. These files are chosen from Open, Real World and Music databases as mentioned before. The duration of the clips vary from few minutes up to 2 hours. The clips are captured using several sampling frequencies from 16 KHz to 44.1 KHz so that both MPEG 1 and MPEG 2 phases are tested for Layer 3 (MP3) audio. Both MPEG-4 and MPEG-2 AAC are recorded with the Main and Low Complexity profiles (object types). TNS (Temporal Noise Shaping) and M/S coding schemes are disabled for AAC. Around 70% of the clips are stereo and the rest are mono. 
The total number of files used in the experiments is above 500; in total, the method is applied to 260 (> 15 hours) MP3, 100 (> 5 hours) AAC and 200 (> 10 hours) PCM (uncompressed) audio clips. Neither the classification and segmentation parameters, such as threshold values and window durations, nor any part of the algorithm are changed for the aforementioned variations, in order to test the robustness of the algorithm. The error distribution results, belonging to the bit-stream and generic modes respectively, are provided in Table VI and Table VII. These results are formed based on the deviation of the specific content from the ground-truth classification, which is obtained by subjective evaluation as explained earlier.

Table VI: Error Distribution Table for Bit-Stream Mode.

             Speech                     Music                      Fuzzy
  BS Type    Critical   Non-Critical    Critical   Non-Critical    Semi-Critical
  MP3        2.0 %      0.5 %           5.8 %      10.3 %          24.5 %
  AAC        1.2 %      0.2 %           0.5 %      8.0 %           17.6 %

Table VII: Error Distribution Table for Generic Mode.

             Speech                     Music                      Fuzzy
             Critical   Non-Critical    Critical   Non-Critical    Semi-Critical
             0.7 %      4.9 %           5.1 %      22.0 %          23.4 %

In fact, for each particular audio clip within the database, the output classification and especially the segmentation result are completely different. Furthermore, as classification and segmentation are mutually dependent tasks, there should be a method of evaluating and reporting the results accordingly. Owing to these reasons, neither the size nor the number of segments can be taken as the unit on which the errors are calculated. Therefore, in order to report the combined effect of both classification and segmentation, the output error distribution, $\varepsilon_{c^*}$, is calculated and reported in terms of the total misclassification-time of a specific class, $c^*$, per total ground-truth time (total actual time) of that content within the database, formulated as follows:

$$ \varepsilon_{c^*} (\%) = 100 \times \frac{\displaystyle \Bigl( \sum_{c \,\in\, (C - c^*)} t(c) \Bigr)_D}{\displaystyle \Bigl( \sum t(c^*) \Bigr)_D} \quad (15) $$

where C represents the set of all class types, t represents time, and D represents the experimental database. For example, in Table VI the 2.0% of MP3 critical errors in speech basically means that if there were 100 minutes of speech content in the database, two minutes of it were misclassified as music. This error calculation approach ensures that the results stay independent of the effects of segmentation, i.e. the errors are calculated and reported in the same way for cases such as the misclassification of a whole segment or the misclassification of a fraction of a segment due to a wrong intra-segmentation. From the analysis of the simulation results, we can see that the primary objective of the proposed scheme, i.e. the minimization of critical errors in the classification accuracy, is successfully achieved. As a compromise for this achievement, most of the errors are semi-critical and sometimes, as intended, non-critical. Semi-critical errors, despite having relatively higher values, are still tolerable, especially considering the fact that the contribution of fuzzy content to the overall size of a multimedia database (also in the experimental database) is normally less than 5%. This basically means that the overall effect of these high values on the indexing and retrieval efficiency is minor. The moderately valued non-critical errors are, as the name suggests, not critical with respect to the audio-based multimedia retrieval performance because of the indexing and retrieval scheme adopted.
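For completeness, the error measure in (15) can be sketched as follows; the record format, the function name and the demo durations are illustrative, and restricting the numerator to a single wrong class would reproduce an individual entry of Tables VI and VII.

```python
def error_distribution(records, c_star):
    """Error measure of (15); records holds (ground_truth, assigned, seconds) triples."""
    miss = sum(t for gt, out, t in records if gt == c_star and out != c_star)
    total = sum(t for gt, out, t in records if gt == c_star)
    return 100.0 * miss / total if total else 0.0

demo = [("speech", "speech", 540.0), ("speech", "music", 11.0),
        ("speech", "fuzzy", 27.0), ("music", "music", 880.0)]
print(round(error_distribution(demo, "speech"), 1))   # 6.6 -> 6.6% of the speech time misclassified
```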
As a result, we have achieved good results with respect to our primary goal of being able to minimize the critical errors in audio content classification by introducing fuzzy modeling in the feature domain and shown the important role of having the global and perceptually meaningful segmentation on the accurate classification (and vice versa) in this context. Furthermore, we have shown that high accuracy can be achieved with sufficient number of enhanced features that are all designed according to the perceptual rules in a well-controlled manner, rather than using a large number of features. The proposed work achieves significant advantages and superior performance over existing approaches for automatic audio content analysis. In the next chapter, we will be presenting an audio-based multimedia indexing and retrieval framework, where this approach is basically integrated into and the performance improvements achieved especially for the aural retrievals in the large-scale multimedia databases. 61 Chapter 4 Audio-Based Multimedia Indexing and Retrieval R apid increase in the amount of the digital audio collections presenting various formats, types, durations, and other parameters that the digital multimedia world refers, demands a generic framework for robust and efficient indexing and retrieval based on the aural content. As mentioned earlier, from the content-based multimedia retrieval point of view the audio information can be even more important than the visual part since it is mostly unique and significantly stable within the entire duration of the content and therefore, the audio can be a promising part for the content-based management for those multimedia collections accompanied with an audio track. Accordingly in this chapter we focus our efforts on developing a generic and robust audio-based multimedia indexing and retrieval framework, which has been developed and tested under MUVIS system [P3]. First an overview for the audio indexing and retrieval works both in the literature and in the commercial systems, along with the major limitations and drawbacks are presented in the next section. In addition the design issues and the basic innovative properties of the proposed method will be justified accordingly. Afterwards, the proposed audio indexing and retrieval framework will be presented in parts within Sections 4.2 and 4.3 respectively. Finally the experimental results, the performance evaluation and conclusive remarks about the proposed framework are all reported in Section 4.4. 4.1. AUDIO INDEXING AND RETRIEVAL – AN OVERVIEW The studies on content-based audio retrievals are still in early stages. Traditional key-word based search engines such as Google, Yahoo, etc. usually cannot provide successful audio retrievals since they require costly (and usually manual) annotations that are obviously unpractical for large multimedia collections. In recent years, promising content-based audio retrieval techniques that might be categorized into two major paradigms have emerged. In the first paradigm, the “Query by Humming” (QBH) approach is tried for music retrievals. There are many studies in the literature, such as [2], [8], [18], [28], [35], [38], [41], [42]. However this approach has the disadvantage of being feasible only when the audio data is music stored in 62 some symbolic format or polyphonic transcription (i.e. MIDI). Moreover it is not suitable for various music genres such as Trance, Hard-Rock, Techno and several others. 
Such a limited approach obviously cannot be a generic solution for the audio retrieval problem. The second paradigm is the well-known “Query by Example” (QBE) technique, which is also common for visual retrievals of the multimedia items. This is a more global approach, which is adopted by several research studies and implementations. One of the most popular systems is MuscleFish [44]. The designers, Wold et al. [74] proposed a fundamental approach to retrieve sound clips based on their content extracted using several acoustic features. In this approach, an N dimensional feature vector is built where each dimension is used to carry one of the acoustic features such as pitch, brightness, harmonicity, loudness, bandwidth, etc. and it is used for similarity search for a query sound. The main drawback of this approach is that it is a supervised algorithm that is only feasible to some limited sub-set of audio collection and hence cannot provide an adequate and global approach for general audio indexing. Furthermore, the sound clips must contain a unique content with a short duration. It does not address the retrieval problem in a generic case such as audio files carrying several and temporally mixed content along with longer and varying durations. Foote in [21] proposed an approach for the representation of an audio clip using a template, which characterizes the content. First, all the audio clips are converted in 16 KHz with 16 bits per sample representation. For template construction the audio clip is first divided into overlapped frames with a fixed duration and a 13-dimensional feature vector based on 12 mel frequency cepstrum coefficients (MFCC) and one spectral energy is formed for training a tree-based Vector Quantizer. For retrieval of a query audio clip, it is first converted into the template and then template matching is applied and ranked to generate the retrieval list. This is again a supervised method designed to work for short sound files with a single-content, fixed audio parameters and file format (i.e. au). It achieves an average retrieval precision within a long range from 30% to 75% for different audio classes. Li and Khokar [32] proposed a wavelet-based approach for the short sound file retrievals and presented a demo using the MuscleFish database. They achieved around 70% recall rate for diverse audio classes. Spevak and Favreau presented the SoundSpotter [64] prototype system for content-based audio section retrieval within an audio file. In their work the user selects a specific passage (section) within an audio clip and also sets the number of retrievals. The system then retrieves the similar passages within the same audio file by performing a pattern matching of the feature vectors and a ranking operation afterwards. All the aforementioned systems and techniques achieved a certain performance; however present further limitations and drawbacks. First the limited amount of features extracted from the aural data often fails to represent the perceptual content of the audio data, which is usually subjective information. Second, the similarity matching in the query process is based on the computation of the (dis-) similarity distance between a query and each item in the database and a ranking operation afterwards. Therefore, especially for large databases it may turn out to be such a costly operation that the retrieval time becomes infeasible for a particular search engine or application. 
Third, all of the aforementioned techniques are designed to work in pre-fixed audio parameters (i.e. with a fixed format, sampling rate, bits per sample, etc.). Audio-Based Multimedia Indexing and Retrieval 63 Obviously, large multimedia databases may contain digital audio that is in different formats (compressed or uncompressed), encoding schemes (MPEG Layer-2 [25], [51], MP3, [10], [23], [25], [51], AAC [10], [23], [24], ADPCM, etc.), other capturing, encoding and acoustic parameters (i.e. sampling frequency, bits per sample, sound volume level, bit-rate, etc.) and durations. It is a fact that the aural content is totally independent from such parameters. For example, the same speech content can be represented by an audio signal sampled at 16 KHz or 44.1 KHz, in stereo or mono, compressed by MP3 in 64 Kb/s, or by AAC 24 Kb/s, or simply in (uncompressed) PCM format, lasting 15 seconds or 10 minutes, etc. However, if not designed accordingly, the feature extraction techniques are often affected drastically by such parameters and therefore, the efficiency and the accuracy of the indexing and retrieval operations will both be degraded as a result. Finally, they are mostly designed either for short sound files bearing a unique content or manually selected (short) sections. However, in a multimedia database, each clip can contain multiple content types, which are temporally (and also spatially) mixed with indefinite durations. Even the same content type (i.e. speech or music) may be produced by different sources (people, instruments, etc.) and should therefore, be analyzed accordingly. In order to overcome the aforementioned problems and shortcomings, in this chapter we introduce a generic audio indexing and retrieval framework, which is developed and tested under the MUVIS system presented as in Chapter 2. The primary objective in this framework is therefore, to provide a robust and adaptive basis, which performs audio indexing according to the audio class type (speech, music, etc.), audio content (the speaker, the subject, the environment, etc.) and the sound perception as closely modeled as possible to the human auditory perception mechanism. Furthermore, the proposed framework is designed in such a way that various low-level audio feature extraction methods can be used. For this purpose the aforementioned Audio Feature eXtraction (AFeX) modular structure can support various AFeX modules. In order to achieve efficiency in terms of retrieval accuracy and speed, the proposed scheme uses high-level audio content information obtained from an efficient, robust and automatic (unsupervised) audio classification and segmentation algorithm that is presented in Chapter 3, during both in indexing and retrieval processes. In this context, it is also optional for all AFeX modules so that a particular module can use the classification and segmentation information in order to tune and optimize its feature extraction process according to a particular class type. During the feature extraction operation, the feature vectors are extracted from each individual segment with a different class type and stored and retrieved separately. This makes more sense for content-based retrieval point of view since it brings the advantage to perform the similarity comparison between the frames within the segments with a matching class type and therefore, avoids any potential similarity mismatches and reduces the indexing and most important of all, the (query) retrieval times significantly. 
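As a simple illustration of this indexing principle (a sketch, not the MUVIS implementation; the segment representation and labels below are assumptions made for the example), the per-frame feature vectors of one item can be grouped by the class type of the segment they belong to, so that a later query only compares frames against the buckets with a matching class type:

from collections import defaultdict

def build_class_index(segments):
    # segments: iterable of (class_type, frame_feature_vectors) pairs, where
    # class_type is 'speech', 'music', 'fuzzy' or 'silent' (illustrative labels).
    index = defaultdict(list)
    for class_type, vectors in segments:
        if class_type == 'silent':          # silent segments carry no content information
            continue
        index[class_type].extend(vectors)
    return index

item_index = build_class_index([
    ('speech', [[0.1, 0.2], [0.3, 0.1]]),
    ('silent', [[0.0, 0.0]]),
    ('music',  [[0.7, 0.5]]),
])
print(sorted(item_index))                   # ['music', 'speech'] -- only matching buckets are searched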
In the audio retrieval operations the classification based indexing scheme is entirely used in order to achieve low-complexity and robust query results. The aforementioned AFeX framework supports merging of several audio feature sets and associated sub-features once 64 the similarity distance per sub-feature is calculated. During the similarity distance calculation a penalization mechanism is developed in order to penalize not fully (partly) matched clips. For example if a clip with both speech and music parts is queried, all the clips missing one of the existing class type (say a music-only clip) will be penalized by the amount of the missing class (speech) coverage in the queried clip. This will give the priority to the clips with the entire class matching and therefore, ensure a more reliable retrieval. Another mechanism is applied for normalization due to the variations of the audio frame duration in the sub-features. This will change the number of frames within a class type and hence brings the dependency of the overall sub-feature similarity distance to the audio frame duration. Such dependency will negate any sub-feature merging attempts and therefore, normalization per audio per frame is applied. MUVIS framework internally provides a weighted merging scheme in order to achieve a “good” blend of the available audio features. 4.2. A GENERIC AUDIO INDEXING SCHEME As mentioned in the previous chapter, when dealing with digital audio there are several requirements to be fulfilled and the most important of them is the fact that the aural content is entirely independent from the digital audio capture parameters (i.e. sound volume, sampling frequency, etc.), audio file type (i.e. AVI, MP3, etc.), encoder type (MP3, AAC, etc.), encoding parameters (i.e. bit-rate, etc.) and other variations such as duration and sound volume level. Therefore, similar to the audio classification and segmentation operation, the overall structure of audio-based indexing and retrieval framework is designed to provide a preemptive robustness (independency) to such parameters and variations. As shown in Figure 29, audio indexing is applied to each multimedia item in a MUVIS database containing audio, and it is accomplished in several steps. The classification and segmentation of the audio stream is the first step. As a result of this step the entire audio clip is segmented into 4 class types and the audio frames among three class types (speech, music and fuzzy) are used for indexing. Silent frames are simply discarded since they do not carry any audio content information. The frame conversion is applied in step 2 due to the (possible) difference occurred in frame durations used in classification and segmentation and the latter AFeX operations. The boundary frames, which contain more than one class types are assigned as uncertain and also discarded from indexing since their content is not pure, rather mixed and hence do not provide a clean content information. The remaining speech, music and fuzzy frames (within their corresponding segments) are each subjected to audio feature extraction (AFeX) modules and their corresponding feature vectors are indexed into descriptor files separately after a clustering (key-framing) operation via Minimum Spanning Tree (MST) Clustering [29]. In the following sub-sections we will detail each of the indexing steps. Audio-Based Multimedia Indexing and Retrieval 65 Audio Stream Classification & Segmentation per granule / frame. 
1 Silence 2 Audio Framing in Valid Classes Uncertain Speech Music Fuzzy Audio Framing & Classification Conversion Speech Music Fuzzy AFeX Module(s) 3 ... AFeX Operation per frame Speech 0 4 KF Extraction via MST Clustering Music 2 10 1 20 5 3 Fuzzy 9 6 15 MST 7 KF Feature Vectors Audio Indexing Figure 29: Audio Indexing Flowchart. 4.2.1. Unsupervised Audio Classification and Segmentation In order to achieve suitable content analysis, the first step is to perform accurate classification and segmentation over the entire audio clip. As explained in Chapter 3 the developed algorithm is a generic audio classification and segmentation especially suitable for audio-based multimedia indexing and retrieval systems. It has a multimodal structure, which supports both bit-stream mode for MP3 and AAC audio, and a generic mode for any audio type and format. In both modes, once a common spectral template is formed from the input audio source, the same analytical procedure can be performed afterwards. It is also automatic (unsupervised) in a way that no training or feedback (from the video part or human interference) is required. It further provides robust (invariant) solution for the digital audio files with various capturing/encoding parameters and modes. In order to achieve a certain robustness level, a fuzzy approach has been integrated within the technique. Furthermore, in order to improve the performance and most important of all, the overall accuracy, the classification scheme produces only 4 class types per audio segment: speech, music, fuzzy or silent. Speech, music and silent are the pure class types. The class type of a 66 segment is defined as fuzzy if either it is not classifiable as a pure class due to some potential uncertainties or anomalies in the audio source or it exhibits features from more than one pure class. The primary use of such classification and segmentation scheme is the following: For audio based indexing and retrieval, a pure class content is only searched throughout the associated segments of the audio items in the database having the same (matching) pure class type, such as speech or music. All silent segments and silent frames within non-silent segments can be discarded from the audio indexing. Special care is taken for the fuzzy content, that is, during the retrieval phase, the fuzzy content is compared with all relevant content types of the database (i.e. speech, music and fuzzy) since it might, by definition, contain a mixture of pure class types, background noise, aural effects, etc. Therefore, for the proposed method, any erroneous classification on pure classes is intended to be detected as fuzzy, so as to avoid significant retrieval errors (mismatches) due to such potential misclassification. 4.2.2. Audio Framing As mentioned in the previous section, there are three valid audio segments: speech, music and fuzzy. Since segmentation and classification are performed per granule/frame basis, such as per MP3 granule or AAC frame, a conversion is needed to achieve a generic audio framing for indexing purposes. The entire audio clip is first divided into a user or model-defined audio frames, each of which will have a classification type as a result of the previous step. In order to assign a class type to an audio frame, all the granules/frames within or neighbor of that frame should have a unique class type to which it is assigned; otherwise, it will be assigned as uncertain. 
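A minimal sketch of this conversion rule is given below (illustrative Python; the granule-to-frame mapping and the one-granule neighbourhood are simplifying assumptions, not the exact MUVIS parameters): an audio frame keeps a class type only if every granule inside and adjacent to it agrees on that type; otherwise it is marked uncertain.

def classify_frames(granule_labels, granules_per_frame):
    # granule_labels: per-granule class types from Step 1 (e.g. 'speech', 'music', 'silent')
    frames = []
    n = len(granule_labels)
    for start in range(0, n, granules_per_frame):
        lo = max(0, start - 1)                           # include one neighbouring granule on each side
        hi = min(n, start + granules_per_frame + 1)
        window = set(granule_labels[lo:hi])
        frames.append(window.pop() if len(window) == 1 else 'uncertain')
    return frames

labels = ['music'] * 8 + ['silent'] * 2 + ['speech'] * 7
print(classify_frames(labels, granules_per_frame=4))
# ['music', 'uncertain', 'uncertain', 'speech', 'speech']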
Figure 30: A sample audio classification conversion (per-granule/frame class labels M: Music, S: Speech, X: Silence are mapped onto per-audio-frame class labels; frames covering mixed labels are marked as uncertain).

Since the uncertain frames are mixed and hence correspond to transitions (i.e. music to speech, speech to silence, etc.), feature extraction over them would result in unclear feature vectors that do not contain clean content characteristics at all. Therefore, these frames are removed from the indexing operation.

4.2.3. A Sample AFeX Module Implementation: MFCC

MFCC stands for Mel-Frequency Cepstrum Coefficients [55]. They are widely used in several speech and speaker recognition systems due to the fact that they provide a decorrelated, perceptually-oriented observation vector in the cepstral domain and are therefore well suited to the human audio perception system. This is the main reason we use them for audio-based multimedia indexing and retrieval: to achieve a similarity measure close to the ordinary human audio perception criterion of 'sounds like', with additional higher-level content discrimination via classification (i.e. speech, music, etc.). The MFCC AFeX module performs several steps to extract the MFCC per audio frame. First the incoming frames are Hamming windowed in order to enhance the harmonic nature of the vowels in speech and of the voiced sounds (instruments, effects, etc.) in music. In addition, the Hamming window reduces the effects of the discontinuities and edges that are introduced during the framing process; such windowing effects would otherwise be significant, especially in the logarithmic domain. The Hamming window is half a cosine wave shifted upwards, as given in the following expression:

$$w(k) = 0.54 - 0.46\,\cos\!\left(\frac{2\pi (k-1)}{N-1}\right) \qquad (16)$$

where N is the size of the window, which is equal to the size (number of PCM samples) of the audio frames. Instead of filtering in the time domain, the audio frame is zero-padded to a power-of-2 size and the FFT is applied, so that the filterbank can be applied in the spectral domain by plain multiplication. The mel (melody) scaled filterbank is a series of band filters whose central frequencies are uniformly distributed in the mel-frequency (mel(f)) domain, where

$$m_f = \text{mel}(f) = 1127\,\log\!\left(1 + \frac{f}{700}\right) \quad \text{and} \quad f = 700\left(e^{m_f/1127} - 1\right) \qquad (17)$$

Figure 31 illustrates a sample mel-scaled filterbank in the frequency domain (the number of bands is reduced for the sake of clarity). The shape of the band filters in the filterbank can be a Hamming window or a plain triangular shape. As clearly seen in Figure 31, the resolution is high for low frequencies and low for higher frequencies, which is in tune with the nature of the human ear and is one of the main reasons for using the mel scale. Once the filtering is applied, the energy is calculated per band and the Cepstral Transform is applied to the band energy values. The Cepstral Transform is a discrete cosine transform of the log filterbank amplitudes:

$$c_i = \left(\frac{2}{P}\right)^{1/2} \sum_{j=1}^{P} \log m_j \cdot \cos\!\left(\frac{\pi\, i}{P}\,(j - 0.5)\right) \qquad (18)$$

where $0 < i \le P$, P is the number of filter bands and $m_j$ is the energy under the j-th band. A subset of the $c_i$ is then used as the feature vector for this frame.

Figure 31: The derivation of mel-scaled filterbank amplitudes (the energy in each band, $m_1, \ldots, m_j, \ldots, m_P$, is computed under band filters whose width grows with frequency).

As mentioned in the previous sections, any AFeX module should provide generic feature vectors independent of the following variations:
• Sampling frequency.
• Number of audio channels (mono/stereo).
• Sound volume level.
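The computation chain of equations (16)-(18) can be summarized by the following sketch (illustrative Python with NumPy; the rectangular band shapes, the number of bands and the number of coefficients are simplifying assumptions rather than the actual AFeX module parameters):

import numpy as np

def mel(f):                               # Hz -> mel, as in (17)
    return 1127.0 * np.log(1.0 + f / 700.0)

def inv_mel(m):                           # mel -> Hz
    return 700.0 * (np.exp(m / 1127.0) - 1.0)

def mfcc(frame, fs, n_bands=26, n_coeffs=12, f_cutoff=22050.0):
    N = len(frame)
    window = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))   # Hamming window (16)
    n_fft = 1 << int(np.ceil(np.log2(N)))                               # zero-pad to a power of 2
    spectrum = np.abs(np.fft.rfft(frame * window, n_fft)) ** 2
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)

    # band centre frequencies uniformly spaced on the mel scale, as in (22)/(23);
    # rectangular bands stand in for the triangular/Hamming-shaped filters
    centres = inv_mel(np.arange(1, n_bands + 1) * mel(f_cutoff) / n_bands)
    edges = np.concatenate(([0.0], centres))
    energies = np.array([spectrum[(freqs >= edges[i]) & (freqs < edges[i + 1])].sum()
                         for i in range(n_bands)]) + 1e-12              # avoid log(0)

    # cepstral transform (DCT) of the log band energies, as in (18)
    i_idx = np.arange(1, n_coeffs + 1)[:, None]
    j_idx = np.arange(1, n_bands + 1)[None, :]
    dct = np.sqrt(2.0 / n_bands) * np.cos(np.pi * i_idx * (j_idx - 0.5) / n_bands)
    return dct @ np.log(energies)

print(mfcc(np.random.randn(1764), fs=44100).shape)                      # one 40 ms frame -> (12,)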
By using audio data from only one channel for the AFeX operation, the effect of multiple audio channels can be avoided. However, normalization is needed during the calculation of the energy per filterbank band in order to neutralize the effects of sampling frequency and volume variations. Let $f_S$ be the sampling frequency. According to the Nyquist theorem, the bandwidth of the signal is $f_{BW} = f_S / 2$. The frequency resolution ($\Delta f$) per FFT spectral line is then:

$$\Delta f = \frac{f_{BW}}{N_{FL}/2} = \frac{f_S}{N_{FL}} \qquad (19)$$

Let T be the duration (in milliseconds) of the incoming audio frames. Then the number of PCM samples within an audio frame is $N = T \cdot f_S / 1000$. An audio clip sampled at different sampling frequencies results in different energy-per-band calculations, because the number of samples within the frame varies; therefore, the band energy values should be normalized by a generic coefficient $\lambda$ where $\lambda \sim N$. The sound volume (V) can be approximated as the absolute average level within the audio frame:

$$V \cong \frac{1}{N}\sum_{i}^{N} |x_i| \qquad (20)$$

Similarly, an audio clip with different volume levels results in different energy-per-band calculations and therefore the energy values should also be normalized by $\lambda$ where $\lambda \sim V$. The overall normalization is then:

$$\lambda \sim \lambda_V \cdot \lambda_f \sim V \cdot N \;\Rightarrow\; \lambda \sim \sum_{i} |x_i| \qquad (21)$$

During the calculation of the band energies under each filterbank band, the energy values are divided by $\lambda$ to remove both the volume and the sampling frequency effects from the Cepstrum coefficient calculation. As shown in Figure 31, the filterbank central frequencies are uniformly distributed over the mel scale. Let $f^i_{CF}$ be the center frequency of the $i$th filter band; then the filterbank central frequencies can be obtained from the following equation:

$$\text{mel}(f^i_{CF}) = \frac{i \cdot \text{mel}(f_{BW})}{P} \qquad (22)$$

So it is clear that the central frequencies are also dependent on the sampling frequency ($f_{BW} = f_S / 2$). This brings the problem that audio clips with different sampling frequencies will have filterbanks with different central frequencies, and hence the feature vectors (MFCC) will be totally uncorrelated, since they are derived directly from the band energy values of each filter band. In order to fix the filterbank locations, we use a fixed cut-off frequency that corresponds to the maximum sampling frequency value used within MUVIS. The minimum and maximum sampling frequencies within the proposed audio indexing framework are 16 KHz and 44.1 KHz; therefore,

$$\text{mel}(f^i_{CF}) = \frac{i \cdot \text{mel}(f_{FCO})}{P}, \quad \text{where } f_{FCO} \ge 22050 \text{ Hz} \qquad (23)$$

Setting the central frequencies by the formula above ensures the use of the same filterbank for all audio clips. Nevertheless, only the audio clips sampled at 44.1 KHz will use all the filters (assuming $f_{FCO} = 22050$ Hz), whilst clips sampled at lower frequencies will produce band energy values whose highest bands ($m_j$ where $j > M$) are automatically set to 0, since those bands lie outside the bandwidth of the audio signal. This would yield erroneous results in the calculation of the MFCC, since the latter are nothing but DCT transforms of the logarithm of the band energy values. In order to prevent this, only the portion of band energy values that is common to all possible sampling frequencies (within MUVIS) is used.
In order to achieve this, the minimum possible M value is found using the lowest ($f_S = 16$ KHz $\Rightarrow f_{BW} = 8$ KHz) and the highest ($f_S = 44.1$ KHz $\Rightarrow f_{BW} = 22.05$ KHz) sampling frequency values for MUVIS. Substituting $\text{mel}(8000) = 2840.03$ and $\text{mel}(22050) = 3923.35$ into (23), the bound for M can be stated as $M \le P \cdot 0.7238$. Therefore, having a filterbank that contains P band filters, we use M of them for the calculation of the MFCC. In this way only a common range of MFCC is used, in order to negate the effect of the varying sampling frequencies of the audio clips within the database. For indexing, only the static values of the Cepstral Transform coefficients ($c_i$) are used. The first coefficient is not used within the feature vector, since it is a noisy estimate of the frame energy and hence does not contain reliable information. The remaining M−1 coefficients over P ($c_i,\ \forall\, 1 < i \le M$) are used to form an MFCC feature vector.

4.2.4. Key-Framing via MST Clustering

The number of audio frames is proportional to the duration of the audio clip, and once the AFeX operation is performed this may result in a massive number of feature vectors, many of which are redundant because the sounds within an audio clip are immensely repetitive and most of the time entirely alike. In order to achieve an efficient audio-based retrieval within an acceptable time, only the feature vectors of the frames from different sounds should be stored for indexing purposes. This is indeed similar to the visual feature extraction scheme where only the visual feature vectors of the Key-Frames (KFs) are stored for indexing. There is however one difference: in the visual case the KFs are known before the feature extraction phase, but in the aural case there is no such physical "frame" structure and the audio is framed uniformly with some fixed duration, so we need to obtain the features of each frame beforehand in order to perform the Key-Frame analysis. This is why the AFeX operation is performed (over the valid frames) first and the audio KFs are extracted afterwards.

Figure 32: An illustrative clustering scheme (feature vectors of frames carrying similar sounds, e.g. 'S', 'p', 'ee', 'ch', 'L', 'a', 'b', grouped into clusters over an MST).

In order to achieve an efficient KF extraction, the audio frames which have similar sounds (and therefore similar feature vectors) should first be clustered and one or more frames from each cluster should be chosen as KFs. An illustrative example is shown in Figure 32. Here the problem is to determine the number of clusters that should be extracted over a particular clip. This number will in fact vary with the content of the audio; for instance, a monolog speech will have fewer KFs than an action movie. For this we define the KF rate as the ratio of the number of KFs to the total number of valid frames within a certain audio class type. Once a practical KF rate is set, the number of clusters can easily be calculated, and eventually this number will be proportional to the duration of the clip. However, longer clips have a higher chance of bearing similar sounds; especially if the content is mostly speech, similar sounds (vowels and unvoiced parts) will be repeated over time. Therefore, the KF rate can be set dynamically via an empirical Key-Framing model that is shown in Figure 33.
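The key-frame selection just outlined can be sketched as follows (illustrative Python using SciPy; the Euclidean distance merely stands in for the AFeX module's AFeX_GetDistance() function, and the feature dimensions are arbitrary): an MST is built over the per-frame feature vectors, its KFno−1 longest branches are removed, and the first frame of each resulting cluster is kept as a key frame.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def select_key_frames(features, kf_no):
    dists = squareform(pdist(features))              # pairwise frame distances
    mst = minimum_spanning_tree(dists).toarray()
    # drop the kf_no - 1 longest MST edges to obtain kf_no clusters
    longest = np.argsort(mst, axis=None)[::-1][:kf_no - 1]
    mst.flat[longest] = 0.0
    n_clusters, labels = connected_components(mst, directed=False)
    # one representative (the first frame) per cluster becomes a key frame
    return [int(np.flatnonzero(labels == c)[0]) for c in range(n_clusters)]

frames = np.random.rand(50, 12)                      # 50 frames, 12-dimensional MFCC vectors
print(select_key_frames(frames, kf_no=5))            # indices of 5 key frames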
Figure 33: KF Rate (%) plots of the empirical Key-Framing model.

Once the number of KFs (KFno) is set, the audio frames are clustered using the Minimum Spanning Tree (MST) clustering technique. Every node in the MST is the feature vector of a unique audio frame and the distance between the nodes is calculated using the AFeX module's AFeX_GetDistance() function. Once the MST is formed, the KFno−1 longest branches are broken and as a result KFno clusters are obtained. Taking one frame (i.e. the first) of each cluster as a KF, the feature vectors of the KFs are then used for indexing.

4.3. AUDIO RETRIEVAL SCHEME

As explained in the previous sections, the audio part of any multimedia item within a MUVIS database is indexed using one or more AFeX modules that are dynamically linked to the MUVIS application. The indexing scheme uses the per-segment audio classification information to improve the effectiveness in such a way that, during an audio-based query, the matching (same audio class type) audio frames are compared with each other via the similarity measurement. In order to accomplish an audio-based query within MUVIS, an audio clip is chosen from a multimedia database and queried through the database, provided that at least one audio feature has been extracted for the database. Let NoS be the number of feature sets existing in a database and let NoF(s) be the number of sub-features per feature set, where $0 \le s < NoS$. As mentioned before, sub-features are obtained by changing the AFeX module parameters or the audio frame size during the audio feature extraction process. Let the similarity distance function be $SD(x(s,f), y(s,f))$, where x and y are the associated feature vectors of the feature index s and the sub-feature index f. Let i be the index of the audio frames within the class $C_q$ of the queried clip. Due to the aforementioned reasons, the similarity distance is only calculated between a sub-feature vector of this frame (i.e. $QFV_i^{C_q}(s,f)$) and an audio frame (index j) of the same class type from a clip (index c) within the database. Among all the frames that have the same class type ($\forall j \Rightarrow j \in C_q$), the one audio frame which gives the minimum distance to the audio frame i in the queried clip is found ($D_i(s,f)$) and used for the calculation of the total sub-feature similarity distance ($D(s,f)$) between the two clips. Therefore, the particular frames and sections of the query audio are only compared with their corresponding (matching) frames and sections of a clip in the database, and this internal search provides the necessary retrieval robustness against abrupt content variations within the audio clips and particularly against their indefinite durations. Figure 34 illustrates the class matching and minimum distance search mechanisms during the similarity distance calculations per sub-feature. Furthermore, two factors should be applied during the calculation of $D(s,f)$ in order to achieve unbiased and robust results:

• Penalization: If no audio frames with class type $C_q$ can be found in clip c, then a penalization is applied during the calculation of $D(s,f)$. Let $N_Q(s,f)$ be the number of valid frames in the queried clip and let $N_Q^{\emptyset}(s,f)$ be the number of frames that are not included in the calculation of the total sub-feature similarity distance due to the mismatch of their class types. Let $N_Q^{\Theta}(s,f)$ be the number of the remaining frames, which are all used in the calculation of the total sub-feature similarity distance. Therefore, $N_Q(s,f) = N_Q^{\emptyset}(s,f) + N_Q^{\Theta}(s,f)$ and the class mismatch penalization can be formulated as follows:

$$P_Q^C(s,f) = 1 + \frac{N_Q^{\emptyset}(s,f)}{N_Q(s,f)} \qquad (24)$$

If all the class types of the queried clip match the class types of the database clip c, then $N_Q^{\emptyset}(s,f) = 0 \Rightarrow P_Q^C(s,f) = 1$ and this case naturally applies no penalization to the calculation of $D(s,f)$.

• Normalization: Due to the possible variation of the audio frame duration for a sub-feature, the number of frames having a certain class type might change, and this results in a biased (frame-number dependent) sub-feature similarity distance calculation. In order to prevent this, $D(s,f)$ should be normalized by the total number of frames for each sub-feature ($N_Q(s,f)$). This yields a normalized $D(s,f)$ calculation, which is nothing but the sub-feature similarity distance per audio frame.

Since the audio feature vectors are normalized, the total query similarity distance ($QD_c$) between the queried clip and the clip c in the database is calculated as a weighted sum. The weights $W(s,f)$ per sub-feature f of a feature set s can be used for experimentation in order to find an optimum merging scheme for the audio features available in the database. The following equations formalize the calculation of $QD_c$:

$$D_i(s,f) = \begin{cases} \min_{j \in C_q} SD\!\left(QFV_i^{C_q}(s,f),\, DFV_j^{C_q}(s,f)\right) & \text{if } j \in C_q \\ 0 & \text{if } j \notin C_q \end{cases}$$
$$D(s,f) = \frac{P_Q^C(s,f)}{N_Q(s,f)} \sum_{q} \sum_{i \in C_q} D_i(s,f) \qquad (25)$$
$$QD_c = \sum_{s}^{NoS} \sum_{f}^{NoF(s)} W(s,f) \cdot D(s,f)$$

The calculation of $QD_c$ as in equation (25) is only valid if there is at least one matching class type between the queried clip and the database clip c. If no matching class types exist, then $QD_c$ is assigned as $\infty$ and hence the clip is placed among the least matching clips (at the end of the query ranking list). This is an expected case, since the two clips have nothing in common with respect to a high-level content analysis, i.e. their audio class types per segment do not match.

Figure 34: A class-based audio query illustration showing the distance calculation per audio frame.

4.4. EXPERIMENTAL RESULTS

All the sample MUVIS databases were presented in sub-section 2.1.4. For the experiments in this section, the following MUVIS databases are mainly used: the Open Video and Real World audio/video databases. All experiments are carried out on a Pentium-4 3.06 GHz computer with 2048 MB memory. All the sample MUVIS databases are indexed aurally using the MFCC AFeX module and visually using color (HSV and YUV color histograms), texture (Gabor [40] and GLCM [49]) and edge direction (Canny [13]) FeX modules. The feature vectors are unit normalized and equal weights are used for merging the sub-features from each of the FeX modules while calculating the total (dis-)similarity distance. During the encoding and capturing phases, the acoustic parameters, codecs and the sound volume are varied among the potential values given in Table I. Furthermore, the clips in both databases have varying durations between 30 seconds and 3 minutes. The evaluation of the performance is carried out subjectively using only the samples containing unique content.
In other words, if there is any subjective ambigu- Audio-Based Multimedia Indexing and Retrieval 75 ity on the result such as a significant doubt on the relevancy of any of the audio-based retrieval results from an aural (or a visual) query, etc., then that sample experimentation is simply discarded from the evaluation. Therefore, the experimental results presented in this section depend only on the decisive subjective evaluation via ground truth and yet they are meant to be evaluator-independent (i.e. same subjective decisions are guaranteed to be made by different evaluators). For the analytical notion of performance along with the subjective evaluation, we used the traditional PR (Precision-Recall) performance metric measured under relevant (and unbiased) conditions, notably using the aforementioned ground truth methodology. Note that recall, R, and precision, P, are defined as: R= RR RR and P = TR N (26) where RR is the number of relevant items retrieved (i.e. correct matches) among total number of relevant items, TR. N is the total number of items (relevant + irrelevant) retrieved. For practical considerations, we fixed N as 12. Recall is usually used in conjunction with the precision, which measures the fractional precision (accuracy) within retrieval and both can often be traded-off (i.e. one can achieve high precision versus low recall rate or vice versa.). This section is organized as follows: First the effect of classification and segmentation (Step 1) over the total (indexing and) retrieval performance will be examined in the next subsection. Afterwards, a more generic performance evaluation will be realized based on various aural retrieval experiments and especially the aural retrieval performance will be compared with the visual counterpart in an analytical and subjective (via visual inspection) way. 4.4.1. Classification and Segmentation Effect on Overall Performance Several experiments are carried out in order to assess the performance effects of the audio classification and segmentation algorithm. The sample databases are indexed with and without the presence of audio classification and segmentation scheme, which is basically a matter of including/excluding Step-1 (the classification and segmentation module) from the indexing scheme. Extended experiments on audio based multimedia query retrievals using the audio classification and segmentation during the indexing and retrieval stages show that significant gain is achieved due to filtering the perceptually relevant audio content from a semantic point of view. The improvements in the retrieval process can be described based on each of the following factors: 4.4.1.1 Accuracy Since only multimedia clips, containing matching (same) audio content are to be compared with each other (i.e. speech with speech, music with music, etc.) during the query process, the probability of erroneous retrievals is reduced. The accuracy improvements are observed 76 within 0-35% range for the average retrieval precision. One typical PR curve for an audiobased retrieval of a 2 minutes multimedia clip bearing pure speech content within Real World database is shown in Figure 35. Note that in the left part of Figure 35, 8 relevant clips are retrieved within 12 retrievals via classification and segmentation based retrieval. Without classification and segmentation, one relevant item is clearly missed on the right side. 
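For reference, the PR values reported in this section follow (26); a minimal sketch with N fixed to 12 retrievals, as in these experiments, is given below (the example numbers are purely illustrative):

def precision_recall(relevant_retrieved, total_relevant, n_retrieved=12):
    # precision: fraction of the retrieved items that are relevant
    # recall:    fraction of all relevant items that were retrieved
    return relevant_retrieved / n_retrieved, relevant_retrieved / total_relevant

p, r = precision_recall(relevant_retrieved=8, total_relevant=10)
print(round(p, 2), round(r, 2))      # 0.67 0.8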
Figure 35: PR curves of an aural retrieval example within the Real World database indexed with (left) and without (right) the classification and segmentation algorithm.

4.4.1.2 Speed

The elimination of silent parts from the indexing scheme reduces the amount of data for indexing and retrieval and hence improves the overall retrieval speed. Moreover, the filtering of irrelevant (different) class types during the retrieval process significantly improves the speed by reducing the CPU time needed for the similarity distance measurements and the sorting process afterwards. In order to verify this expectation experimentally and obtain a range for the speed improvement, we have performed several aural queries on the Real World database indexed with and without the application of the classification and segmentation algorithm (i.e. Step-1 in the indexing scheme). Among these retrievals we have chosen 10 which have the same precision level in order to have an unbiased measure. Table VIII presents the total retrieval time (the time from the moment the user initiates an aural query until the query is completed and the results are displayed on the screen) for both cases. As a result, query speed improvements within the 7-60% range are observed whilst keeping the same retrieval precision level.

Table VIII: QTT (Query Total Time) in seconds of 10 aural retrieval examples from Real World database.

Aural Retrieval No.                              1       2       3       4       5       6       7       8       9       10
QTT (without classification and segmentation)   47.437  28.282  42.453  42.703  43.844  42.687  46.782  45.814  44.406  41.5
QTT (with classification and segmentation)      30.078  26.266  19.64   39.141  18.016  16.671  31.312  30.578  20.006  37.39
QTT Reduction (%)                                36.59   7.12    53.73   8.34    58.9    60.94   33.06   33.25   54.94   9.90

4.4.1.3 Disk Storage

Less data needs to be recorded for the audio descriptors, by the same reasoning given before. Furthermore, the silent parts are totally discarded from the indexing structure. Yet it is difficult to give an exact analytical figure showing how much disk space can be saved by performing the classification and segmentation based indexing, because this eventually depends on the content itself and particularly on the amount of silent parts that the database items contain. A direct comparison between the audio descriptor file sizes of the same databases indexed with and without the proposed method shows that a reduction above 30% can be obtained.

4.4.2. Experiments on Audio-Based Multimedia Indexing and Retrieval

For analytic evaluation, 10 aural and visual QBE (Query by Example) retrievals are performed according to the experimental conditions explained earlier. We only consider the first 12 retrievals (i.e. N=12) and both precision and recall values are given in Table IX and Table X.

Table IX: PR values of 10 Aural/Visual Retrieval (via QBE) Experiments in Open Video Database.

Query No:            1     2     3     4     5     6     7     8     9     10
Visual   Precision   0.66  0.75  0.25  0.25  0.66  1     0.58  1     0.83  1
         Recall      0.66  0.75  0.33  0.25  0.8   1     0.58  1     0.83  1
Aural    Precision   1     1     1     1     0.83  1     1     0.8   1     1
         Recall      1     1     1     1     1     1     1     0.8   1     1

Table X: PR values of 10 Aural/Visual Retrieval (via QBE) Experiments in Real World Database.
Query No: Visual Aural Precision Recall Precision Recall 1 2 3 4 5 6 1 0.25 1 0.25 0.83 0.33 1 0.5 1 0.25 1 0.33 1 0.5 1 0.5 0.83 0.75 1 1 1 0.5 1 0.75 7 8 1 1 1 1 0.41 0.41 0.75 0.75 9 10 0.08 0.16 0.125 0.66 0.25 0.25 0.375 1 As the PR results clearly indicate in Table IX and Table X, in almost all the retrieval experiments performed, the aural queries achieved “equal or better” performance than their visual counterpart although there is only one feature (MFCC) used as the aural descriptor against a “good” blend of several visual features. 78 Figure 36 shows three sample retrievals via visual and aural queries from Open Video database using MUVIS MBrowser application. The query (the first key-frame in the clip) is shown on the top-left side of each figure. The first (a) example is a documentary about sharpshooting competition and hunting in the USA. The audio is mostly speech with an occasional music and environmental noise (fuzzy). Among 12 retrievals, the visual query (left) retrieved three relevant clips (P=R=0.25) whereas the aural query retrieved all relevant ones (i.e. P=R=1.0). The second (b) example is a cartoon with several characters and the aural content is mostly speech (dialogs between the cartoon characters) with long intervals of music. It can be easily seen that within 12 retrievals, the visual query (left) retrieved three relevant clips among 9 (P=0.25 and R=0.33) whereas the aural query retrieved all relevant ones within the first 9 ranks (i.e. P=R=1.0). Finally, the third example (c) is a commercial with an audio content that is speech with music in the background (fuzzy). Similarly, it is obvious that among 12 retrievals, the visual query retrieved 9 relevant clips (i.e. P=R=0.75) whereas the aural query retrieved all of them (i.e. P=R=1.0). All three examples show that the aural query can outperform the visual counterpart especially when there is a significant variation in the visual scenery, lightning conditions, background or object motions, camera effects, etc. whereas, the audio has the advantage of being usually stable and unique with respect to the content. (a) Audio-Based Multimedia Indexing and Retrieval 79 (b) (c) Figure 36: Three visual (left) and aural (right) retrievals in Open Video database. The top-left clip is the query. 80 81 Chapter 5 Progressive Query: A Novel Retrieval Scheme O ne of the challenges in the development of content-based multimedia indexing and retrieval application is to achieve an efficient retrieval scheme. The developers and users who are accustomed to making queries and thus retrieving any multimedia item from a large scale database can be frustrated by the long query times and memory and minimum system requirements. This chapter is about a novel multimedia retrieval technique, called Progressive Query (PQ), which is designed to bring an effective solution especially for querying large multimedia databases and furthermore to provide periodic query retrievals along with the ongoing query process. In this way it achieves a series of sub-query results that will finally converge to the full-scale search retrieval in a faster and with no minimum system requirements. In order to address the problems encountered and present the major limitations on the multimedia retrieval area, a generic overview for the traditional indexing and retrieval methods will be introduced in the next section. Afterwards, the generic PQ design philosophy and the implementation details will be presented in Section 5.2. 
A new multi-thread programming approach, which achieves High Precision on the periodicity on the PQ retrievals and therefore, called as HP PQ will be introduced in Section 5.3. Finally the experimental results, the performance evaluation and conclusive remarks about PQ are reported in 5.4. 5.1. QUERY TECHNIQUES - AN OVERVIEW The usual approach for indexing is to map database primitives into some high dimensional vector space that is so called feature domain. The feature domain may consist of several types of features such as visual, aural, motion, etc. as long as the database contains such items from which those particular features can be extracted. Among so many variations, careful selection of the feature sets allows capturing the semantics of the database items. Especially for largescale multimedia databases the number of features extracted from the raw data is often kept large due to the naïve expectation that it helps to capture the semantics better. Content-based 82 similarity between two database items can then be assumed to correspond to the (dis-) similarity distance of their feature vectors. Henceforth, the retrieval of a similar database items with respect to a given query (item) can be transformed into the problem of finding such database items that gives feature vectors, which are close to the query feature vector. This is so-called query-by-example (QBE), which is one of the most common retrieval schemes. The basic QBE operation is so called Normal Query (NQ), and works as follows: using the available aural or visual features (or both) of the queried multimedia item (i.e. an image, a video clip, an audio clip, etc.) and all the database items, the similarity distances are calculated and then merged to obtain a unique similarity distance per database item. Ranking the items according to their similarity distances (to the queried item) over the entire database yields the query result. Such an exhaustive search for QBE is costly and CPU intensive especially for largescale multimedia databases since the number of similarity distance calculations is proportional with the database size. This fact brought a need for indexing techniques, which will organize the database structure in such a way that the query time and I/O access amount could be reduced. During the past three decades, several indexing techniques have been proposed. Many of these techniques are covered in the next chapter. Along with these indexing techniques certain query techniques are needed to speed up the QBE process. The most common query techniques developed especially those indexing techniques are as follows: • Range Queries: Given a query object, Q, and a maximum similarity distance range, ε, the range query selects all indexed database items, Q i , such that SD (Q, Q i ) < ε. • kNN Queries: Given a query object, Q, and an integer number k > 0, kNN query selects the k database items, which have the shortest similarity distance from Q. Unfortunately, both query techniques may not provide efficient retrieval scheme from the user’s point of view due to their strict parameter dependency. For instance, range queries require a distance parameter, ε, where the user may not be able to provide such a number prior to a query process since it is not obvious to find out a suitable range value if the database contains various types of features and feature subsets. 
Similarly, parameter k in a kNN query may be hard to determine since it can be too small in case the database may provide many more similar (relevant) items than required, or too big if the number of similar objects is only a small fraction of the required number k, which means unnecessary CPU time has been wasted for that query process. Both query techniques often require several trials to converge to a successful retrieval result and this alone might remove the speed benefit of the underlying indexing scheme, if there is any. As mentioned before, the other alternative is the so-called Normal Query (NQ), which makes a sequential (exhaustive) search due to lack of an indexing scheme. NQ for QBE is costly and CPU intensive especially for large-scale multimedia databases; however, it yields such a final and decisive result that no further trials are needed. Still, all QBE alternatives have some common drawbacks. First of all, the user has to wait until all (or some) of the similarity distances are calculated and the searched database items are ranked accordingly. Naturally, this might take a significant time if the database size (or k, ε) is large and the database Progressive Query: A Novel Retrieval Scheme 83 contains a rich set of aural and visual features, which might further reduce the efficiency on the indexing process. Furthermore, any abrupt stopping (i.e. manual stop by the user) during the query process will cause total loss of retrieval information and essentially nothing can be saved out of the query operation so far performed. In order to speed up the query process, it is a common application design procedure to hold all features of database items into the system memory first and then perform the calculations. Therefore, the growth in the size of the database and the set of features will not only (proportionally) increase the query time (the time needed for completing a query) but it might also increase the minimum system memory requirements such as memory capacity and CPU power. In order to eliminate such drawbacks and provide a faster query scheme, we develop a novel retrieval scheme, the Progressive Query (PQ), which is implemented under MUVIS system to provide a basis for multimedia retrieval and to test the performance of the technique. PQ is a retrieval (via QBE) technique, which can be performed over the databases with or without the presence of an indexing structure. Therefore, it can be an alternative to NQ where both produce (converge to) the same result at the end. When the database has an indexing structure, PQ can replace kNN and range queries whenever a query path over which PQ proceeds, can be formed. As its name implies, PQ provides intermediate query results during the query process. The user may browse these results and may stop the ongoing query in case the results obtained so far are satisfactory and hence no further time should unnecessarily be wasted. As expected, PQ and NQ will converge to the same (final) retrieval results at the end. Furthermore, PQ may perform the overall query process faster (within a shorter total query time) than NQ. Since PQ provides a series of intermediate results, each of which obtained from a (smaller) sub-set within the database, the chance of retrieving relevant database items that would not be retrieved otherwise via NQ, might be increased. Approvingly some experimental results show that it is quite probable to achieve even better retrieval performance within an intermediate sub-query than the final query state. 
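The two conventional query types discussed above can be sketched as follows, assuming that the similarity distances of the candidate items to the query have already been computed as (item, distance) pairs (names and values are illustrative):

def range_query(distances, eps):
    # all items whose similarity distance to the query is below eps
    return [(item, d) for item, d in distances if d < eps]

def knn_query(distances, k):
    # the k items with the shortest similarity distance to the query
    return sorted(distances, key=lambda pair: pair[1])[:k]

candidates = [('a', 0.9), ('b', 0.2), ('c', 0.45), ('d', 1.3)]
print(range_query(candidates, eps=0.5))   # [('b', 0.2), ('c', 0.45)]
print(knn_query(candidates, k=3))         # [('b', 0.2), ('c', 0.45), ('a', 0.9)]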
It is a known fact that significant performance improvements of content-based multimedia retrieval systems can be achieved by using a technique known as Relevance Feedback [16], [17], [56], which allows the user to rate (intermediary) retrieval results. This helps to tune the ranking and retrieval parameters and hence yields a better retrieval result at the end. Traditional, query techniques so far addressed (NQ, kNN and range queries) may allow such a feedback only after the query is completed. Since PQ provides the user with intermediate results during the query process, relevance feedback may be applied already to these intermediate results, yielding possibly faster satisfactory retrieval results from the user’s point of view. 5.2. PROGRESSIVE QUERY The principal idea behind the design of PQ is to partition the database items into some subsets within which individual (sub-) queries can be performed. Therefore, a sub-query is a fractional query process that is performed over any sub-set of database items. Once a sub-query is completed over a particular sub-set, the incremental retrieval results (belonging only to that 84 sub-set) should be fused (merged) with the last overall retrieval result to obtain a new overall retrieval result, which belongs to the items where PQ operation so far covers from the beginning of the operation. Note that this is a continuous operation, which proceeds incrementally, sub-set by sub-set, by covering more and more group of items within the database. Each time a new sub-query operation is completed, PQ updates the retrieval results to the user. Since the previous (overall) query results are used to obtain the next (overall) retrieval result via fusion, the time consuming query operation is only performed over the (next) partitioned group of items instead of all the items where PQ covered so far. The order of the database items processed is a matter of the indexing structure of the database. If the database is not indexed at all, simply a sequential or random order can be chosen. In case the database has an indexing structure, a query path can be formed in order to retrieve the most relevant items at the beginning during a PQ operation. Since there are various indexing schemes addressed in the previous section, for the sake of simplicity, we shall first explain the basics of PQ for a database with no indexing structure. t 2t 3t 4t time MM Dbs. Sub-Set 1 Sub-Set 2 Sub-Set 3 Periodic Sub-Query Results 1 Sub-Set N 2 3 Sub-Query Fusion Sub-Query Fusion 4 Progressive Sub-Query Result 1 1+2 1+2+3 Figure 37: Progressive Query Overview. Another important factor is to determine the size of each sub-set (i.e. the number of items within a sub-set where sub-query operation is performed) that is most convenient from the user’s point of view. A straightforward solution is to let the user fix the sub-set size (say e.g. 25). This would mean that the user wants updates every time 25 items are covered during the ongoing PQ operation. However, this also brings the problem of uncertainty because the user cannot know how much time a sub-query will take beforehand since the sub-query time will vary due to the factors such as the amount of features present in the database and the speed of the computer where it is running, etc. So the PQ retrieval updates might be too fast or too slow for the user. In order to avoid such uncertainties, the proposed PQ scheme is designed over periodic sub-queries as shown in Figure 37 with a user defined period value ( t = t p ). 
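The overall PQ mechanism can be sketched as follows (illustrative Python; the fixed sub-set size, the distance function and the item representation are simplifications of the scheme described above, and the period-based timing that replaces the fixed sub-set size is explained next):

import heapq

def progressive_query(query_distance, database, subset_size):
    # Consume the database sub-set by sub-set; after each sub-query, fuse the
    # (sorted) incremental result with the running overall result and report it.
    overall = []
    for start in range(0, len(database), subset_size):
        subset = database[start:start + subset_size]
        sub_result = sorted((query_distance(item), item) for item in subset)
        overall = list(heapq.merge(overall, sub_result))   # fusion of two sorted lists
        yield overall                                      # one progressive result per sub-set

for psq in progressive_query(lambda x: abs(x - 42), list(range(100)), subset_size=25):
    print(psq[:3])                                         # best three matches after each update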
The period (time) is obviously the more natural choice, since the user can then expect the retrieval results to be updated every $t_p$ seconds no matter what database is involved or what computer is used. Without loss of generality, in databases without an indexing structure, PQ is designed to perform the sub-set partitioning sequentially in the forward direction (i.e. starting from the first item towards the last one).

5.2.1. Periodic Sub-Query Formation

In order to achieve periodic sub-queries, we need to define the following additional sub-query compositions.

5.2.1.1 Atomic Sub-Query

The atomic sub-query is the smallest sub-set size on which a sub-query is performed. Here we assume that the atomic sub-query time is not significant compared to the periodic sub-query time. Atomic sub-queries are the only sub-query type with a fixed sub-set size ($S_{ASQ}$). They are only used during the first periodic query, in order to provide an initial sub-query rate ($t_r^0$), that is, the time spent for the retrieval of a single database item, formulated as follows:

$$t_r^0 = \frac{t_{ASQ}}{N_{ASQ}} \quad \text{if } N_{ASQ} > 0 \qquad (27)$$

where $t_{ASQ}$ is the total time spent for the atomic sub-query and $N_{ASQ}$ is the number of database items which are involved (used) in the query operation. Without an indexing structure, note that $0 \le N_{ASQ} \le S_{ASQ}$, since the initial database items might not belong to the ongoing query type. For example, in a multimedia database there might be video-only clips and audio-only clips (and clips with both media types). So for a visual query, the audio-only clips are totally discarded, and if the initial atomic query sub-set covers such audio-only clips then naturally $N_{ASQ} \le S_{ASQ}$. In case $N_{ASQ} = 0$, one or more atomic sub-queries have to be performed until we get a valid $t_r^0$ value (i.e. $N_{ASQ} > 0$).

5.2.1.2 Fractional Sub-Query

A fractional sub-query is any sub-query performed over a sub-set whose size is smaller than or equal to the sub-set size of the periodic sub-query. In other words, a fractional sub-query time is less than or equal to a periodic sub-query time. As explained earlier, the periodic sub-queries are periodic over time, and a mechanism is needed to ensure this periodicity. This mechanism works over atomic and fractional sub-queries; it performs the fusion operation over as many atomic and fractional sub-queries as necessary. First, it starts with an atomic sub-query to obtain a valid (initial) sub-query per-item time, $t_r^0$, and it keeps going with atomic queries until a valid $t_r^0$ value is obtained. Once it is obtained, one or more fractional sub-queries are performed to complete the first periodic sub-query. The size of the fractional sub-query ($N_{FSQ}^0$) can then be estimated as:

$$N_{FSQ}^0 \cong \frac{t_p^0}{t_r^0} \qquad (28)$$

where $t_p^0 = t_p - t_\Sigma^a$ is the time left for completing the first periodic sub-query and $t_\Sigma^a$ is the total time spent for all atomic sub-queries performed so far. Afterwards, the fractional sub-query can be performed over a sub-set of $N_{FSQ}^0$ items. Once the fractional sub-query is completed, the total time ($t_\Sigma$) spent from the beginning of the operation until now is compared with the required time period of the $q$th (so far q=0) periodic query, $t_p^q$, where q is the periodic sub-query index. If the $t_\Sigma$ value is not within a close neighborhood (i.e. $\delta t_w < 0.5$ sec.) of $t_p^q$ (i.e. $t_\Sigma < t_p^q - \delta t_w$), then the operation continues with a new fractional sub-query until the condition is met.
For the new fractional sub-query, and for all the later fractional sub-query operations, the $t_r$ value is re-estimated (updated) from the former operations as follows:

$$N_{FSQ}^i = \frac{t_p^q - t_\Sigma}{t_r^i}, \qquad t_\Sigma < t_p^q - \delta t_w \;\Rightarrow\; t_r^{i+1} = \frac{t_\Sigma}{\sum_{i \in FSQ} N_{FSQ}^i} \quad \text{if } \sum_{i \in FSQ} N_{FSQ}^i > 0 \qquad (29)$$

Once one or more fractional sub-queries form the $q$th periodic query, due to the offset that has occurred from the period of PQ, the next periodic sub-query (the q+1st) is formed with an updated (offset-removed) period value:

$$t_p^{q+1} = t_p + (t_p^q - t_\Sigma) \qquad (30)$$

where $t_p$ is the required period and $(t_p^q - t_\Sigma)$ is the offset time, which is the time difference between the required period time for the $q$th periodic sub-query and the total (actual) time spent so far. The flowchart of the formation of a periodic sub-query is shown in Figure 38.

Figure 38: Formation of a Periodic Sub-Query (an atomic sub-query, performed only at the beginning of the first periodic sub-query, is followed by fractional sub-queries and fusion operations until $t_\Sigma > t_p - \delta t_w$, at which point the periodic sub-query with $t_\Sigma \cong t_p$ is complete).

5.2.1.3 Sub-Query Fusion Operation

The overall PQ operation is carried out over Progressive Sub-Queries (PSQs). In principle, it can be stated that PQ is a (periodic) series of PSQ results. As shown in Figure 37, a new PSQ retrieval is formed each time a periodic sub-query is completed, by fusing it with the previous PSQ retrieval result. Once a PSQ is realized, the results are shown to the user and saved during the lifetime of the ongoing PQ so that the user can access them at any time. The user is shown updated retrieval results each time a new PSQ is completed. The first PSQ is simply the first periodic sub-query performed. The fusion operation is the process of fusing two sorted sub-query results into one (fused) sub-query result. Since both sub-query results are already sorted with respect to the similarity distances, the fusion operation can be performed by simply comparing the consecutive items of the two sub-query lists. Let $n_1$ and $n_2$ be the number of items in the first and second sub-set, respectively. Since there are $n_1 + n_2$ items to be inserted into the fused list one at a time, the fusion operation can take at most $n_1 + n_2$ comparisons. This (worst) case occurs whenever the items from the two lists are evenly interleaved with respect to each other. On the other hand, if the maximum valued item (i.e. the last item) in the smaller list is less than the minimum valued item (i.e. the first item) in the bigger list, the number of comparisons will not exceed the number of items in the smaller list, because once all of its items have been compared with the (smallest) first item of the bigger list and inserted into the fused list, no more comparisons are needed. Note that this is a direct consequence of the fact that both lists are sorted (from minimum to maximum) beforehand and one of them is now fully depleted. Therefore, the fusion operation takes a minimum of $\min(n_1, n_2)$ comparisons. A sample fusion operation is illustrated in Figure 39. Note that the subsets X and Y contain 12 and 6 items, respectively, and the fusion operation performs only 12 comparisons.
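The fusion step, together with the comparison count discussed above, can be sketched as follows (illustrative Python); the example reproduces the subsets of Figure 39 and indeed performs 12 comparisons:

def fuse_sorted(x, y):
    # merge two sorted similarity-distance lists, counting the comparisons made
    merged, i, j, comparisons = [], 0, 0, 0
    while i < len(x) and j < len(y):
        comparisons += 1
        if x[i] <= y[j]:
            merged.append(x[i]); i += 1
        else:
            merged.append(y[j]); j += 1
    merged.extend(x[i:]); merged.extend(y[j:])       # one list is depleted, append the rest
    return merged, comparisons

X = [round(0.1 * k, 1) for k in range(1, 13)]        # 0.1, 0.2, ..., 1.2
Y = [0.01, 0.05, 0.15, 0.25, 0.55, 0.65]
fused, comps = fuse_sorted(X, Y)
print(comps)                                          # 12 comparisons, as in Figure 39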
Figure 39: A sample fusion operation between subsets X (X1 = 0.1, X2 = 0.2, ..., X12 = 1.2) and Y (Y1 = 0.01, Y2 = 0.05, Y3 = 0.15, Y4 = 0.25, Y5 = 0.55, Y6 = 0.65), showing the fused (sorted) list X~Y.

Since the fusion operation is nothing but the merging of two arbitrarily sized sub-sets (retrieval results), it can be applied during each phase of a PQ operation. For instance, the atomic sub-queries are fused with fractional sub-queries, and several fractional sub-queries are fused to obtain a periodic sub-query. Fusing the periodic sub-query with the previous PSQ retrieval produces a new PSQ retrieval, covering a larger part of the database. If the user does not stop the ongoing PQ operation, it will eventually cover the entire database and therefore generate the overall retrieval result for the queried item. In this case PQ generates exactly the same retrieval result as NQ, since both of them perform the same operation, i.e. searching a queried item through the entire database and ranking the database items accordingly.

5.2.2. PQ in Indexed Databases

In the previous sections, PQ has been presented over databases without an indexing structure, and in this context it has been compared with a traditional query type, NQ. As a retrieval process, PQ can also be performed over indexed databases as long as a query path can be formed over the clusters (partitions) of the underlying indexing structure. Obviously, a query path is nothing but a special sequence of the database items, and when the database lacks an indexing structure it can be formed in any convenient way, such as sequentially (starting from the 1st item towards the last one, or vice versa) or randomly. Otherwise, the most advantageous way to perform PQ is to use the indexing information so that the most relevant items can be retrieved in the earlier PSQ steps. Since the technical details of the various available indexing schemes are beyond the scope of this chapter, we only show the hypothetical formation of the query path and the runtime evaluation of PQ over it. In the next chapter, the real implementation of PQ over a MAM-based indexing structure will be presented in detail.

Figure 40: Query path formation in a hypothetical indexing structure.

As briefly mentioned earlier, the primary objective of indexing structures is to partition the feature domain into such (tree-based) clusters that CPU time and I/O accesses are shortened via pruning of the redundant tree nodes. Figure 40 shows a hypothetical clustering scheme and the formation of the query path over which PQ proceeds during its run-time. This illustration shows 4 clusters (partitions or nodes), which contain a certain number of items (features), and the query path is formed according to the relative (similarity) distance to the queried item and its parent cluster. Therefore, PQ gives priority to cluster A (the host), then B (the closest), C, D, etc. Note that the query path might differ from the final retrieval result depending on the accuracy of the indexing scheme. For instance, the query path gives priority to B2 in the periodic search with respect to C4, but C4 may contain more relevant items (i.e.
items more similar to the query item) than B2, and when the retrieval results are formed they will eventually be ranked higher and presented earlier to the user by PQ. At this point, one can implement two different approaches: the overall query path can be formed immediately after the query is initiated and PQ then evolves over it with its natural supply of periodic retrievals. This approach is only recommended for small and medium sized databases where the complete query path formation takes insignificant time. Otherwise, the query path should be formed dynamically (incrementally) along with the PQ runtime process and the time spent for it should be taken into account during the adaptive calculation of the period given in (31). In this case, the adaptive period calculation for the (q+1)-th periodic sub-query should be reformulated as follows:

$$t_p^{q+1} = t_p + (t_p^q - t_\Sigma^q - t_{QP}^q) \qquad (31)$$

where $t_{QP}^q$ is the time spent for forming the query path during the formation of the q-th periodic sub-query. As a result, PQ in indexed databases makes more sense than being strictly dependent on an unknown parameter such as k in a kNN query or ε in a range query, which might cause a deficiency in the retrieval performance, such as the occasional need to perform multiple queries in order to come up with a satisfactory result at the end. On the other hand, there exists a certain degree of similarity (or analogy) between PQ and those conventional query techniques. For instance, each PSQ retrieval can be seen as a particular kNN (or range) query retrieval with only one difference: the parameter k (or ε) is not fixed beforehand; rather, it dynamically changes (grows) over time along with the lifetime of PQ, and the user has the opportunity to fix it (stop PQ) whenever satisfactory retrievals are obtained.

5.3. HIGH PRECISION PQ – THE NEW APPROACH

The PQ operation presented so far is designed as a single process (thread). For this reason it has a pre-emptive approach for adaptively arranging the periodic sub-set sizes, and this yields up to 10% timing shifts from the required period value, $t_p$, which can be avoided by changing the design into a parallel processing basis. In order to achieve such a high precision on the periodic PSQ retrievals, the entire PQ process is divided between two threads controlled with a timer. Figure 41 illustrates how the multithreaded approach is integrated over the PQ parts to perform HP PQ. Two distinct threads are formed to perform the following tasks:

• Periodic Sub-Query Thread:
  – Load features from disc into memory.
  – Calculate the similarity distance of each item with respect to the query item.
  – Sort the items within the periodic sub-query set.
• PQ Formation Thread:
  – Suspend the Periodic Sub-Query thread when the timer kicks.
  – Apply sub-query fusion to form the next PSQ.
  – Release the Periodic Sub-Query thread.
  – Render the next PSQ retrievals on screen.

So the main idea is to leave all the time consuming tasks to the Periodic Sub-Query thread and keep the other (PQ Formation) thread in suspended mode until the timer kicks. Once the timer activates the PQ Formation thread, it can immediately form the PSQ retrieval with only a single fusion operation and render the retrieval results on the screen.

Figure 41: HP PQ Overview. (Timeline: the Periodic Sub-Query thread produces sub-sets 1, 2, 3, ..., N; at each timer tick t, 2t, 3t, 4t, ... the PQ Formation thread fuses the newly completed sub-set into the Progressive Sub-Query result 1, 1+2, 1+2+3, ...)
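The two-thread organization can be sketched as follows. This is a minimal illustration of the idea rather than the actual MUVIS implementation: the feature-access callbacks, the sub-set sizing and the Event-based suspend/release (used here to stand in for true thread suspension) are all assumptions made only for the sketch.

```python
import threading, queue, time, heapq

def hp_pq(query, database, feature_of, distance, period=5.0, pages=12):
    """Sketch of HP PQ: a worker thread computes and sorts periodic sub-query
    results; a timer-driven loop fuses each finished sub-set into the PSQ."""
    sub_results = queue.Queue()              # finished periodic sub-query results
    run_flag = threading.Event(); run_flag.set()
    done = threading.Event()

    def periodic_subquery_thread():
        batch = []
        for item in database:
            run_flag.wait()                  # "released" by the formation loop
            d = distance(feature_of(query), feature_of(item))
            batch.append((d, item))
            if len(batch) >= max(1, len(database) // 10):   # hypothetical sub-set size
                sub_results.put(sorted(batch, key=lambda t: t[0])); batch = []
        if batch:
            sub_results.put(sorted(batch, key=lambda t: t[0]))
        done.set()

    threading.Thread(target=periodic_subquery_thread, daemon=True).start()

    psq = []                                 # progressive sub-query result so far
    while not (done.is_set() and sub_results.empty()):
        time.sleep(period)                   # the timer "kicks"
        run_flag.clear()                     # suspend the worker
        while not sub_results.empty():       # sub-query fusion (single merge per sub-set)
            psq = list(heapq.merge(psq, sub_results.get(), key=lambda t: t[0]))
        run_flag.set()                       # release the worker
        print("PSQ update:", psq[:pages])    # render the next PSQ retrievals
    return psq
```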
5.4. EXPERIMENTAL RESULTS

5.4.1. PQ in MUVIS

As mentioned in Chapter 2, MBrowser is the primary media browser and retrieval application into which the PQ technique is integrated as the primary retrieval (QBE) scheme. NQ is the alternative query scheme within MBrowser. Both PQ and NQ can be used for ranking the multimedia primitives with respect to their similarity to a queried media item (an audio/video clip, a video frame or an image). In order to query an audio/video clip, it should first be appended to a MUVIS database upon which the query will be performed. There is no such necessity for images; any digital image (inclusive or exclusive to the active database) can be queried within the active database. The similarity distances are calculated by the particular functions, each of which is implemented in the corresponding visual/aural feature extraction (FeX or AFeX) modules.

Figure 42: MBrowser GUI showing a PQ operation where the 10th PSQ is currently active (or set manually). (Annotated elements: the queried MM clip, the PQ Browser Handle, the PQ Knob, the Prev/Next page buttons and the first page of 12 retrieval results.)

As shown in Figure 42, the MBrowser GUI is designed to support all the functionalities that PQ provides. Once a MUVIS database is loaded into MBrowser, the user can browse among the database items, choose any item and then initiate a query. The basic query parameters, such as the query type (PQ or NQ), the query genre (aural or visual), the PQ update period (time), the (visual and aural) set of features and their individual parameters (i.e. feature weights), etc., can be set prior to a query operation. When a (sub-)query is completed, the retrieval results are presented to the user page by page. Each page renders 12 ranked results in descending order of relevance (from left to right and from top to bottom) and the user can browse the pages back and forth using the Next and Prev buttons on the bottom-right side of the figure (the first page with the 12 best retrieval results is shown on the right side of Figure 42). If NQ is chosen, then the user has to wait till the whole process is completed, but if PQ is chosen the retrieval results are updated periodically (with the user-defined period value) each time a new PSQ is accomplished. The current PSQ number (10) is shown on the PQ Browser Handle and this handle can also be used to browse manually among the retrieved PSQ results during (or after) an ongoing PQ operation. In the snapshot shown in Figure 42, a video clip is chosen within a MUVIS (video) database and visual PQ is performed. Currently the 1st page (12 best results) of the 10th PSQ retrieval results is shown on the GUI window of MBrowser. Four of the sample MUVIS databases that were presented in sub-section 2.1.4 are used in the experiments performed in this section: the Open Video, Real World, Sports video and Shape image databases. All experiments are carried out on a Pentium-4 3.06 GHz computer with 2048 MB memory. In order to have unbiased comparisons between PQ and NQ, the experiments are performed using the same queried multimedia item with the same instance of the MBrowser application.

5.4.2. PQ versus NQ

As explained earlier, PQ and NQ eventually converge to the same retrieval result at the end. Also, in the abovementioned scenarios they are both designed to perform an exhaustive search over the entire database within MUVIS.
However, PQ has several advantages over NQ in the following aspects:

• System Memory Requirement: In a NQ operation the memory requirement is proportional to the database size and the number of features present. Due to the partitioning of the database into sub-sets, PQ reduces the memory requirement by a factor of the number of PSQ operations performed. After each periodic sub-query operation, the memory used for the feature vectors in that sub-set is no longer needed and can be re-used for the next periodic sub-query. Figure 43 illustrates the memory usage of the retrieval example shown in Figure 45, first by a PQ and then by a NQ.

Figure 43: Memory usage for PQ and NQ.

We also observe that especially in very large-scale databases containing a rich set of features, NQ might exceed the system memory. Two possible outcomes may eventually occur as a result. The operating system may handle it by using virtual memory (i.e. disc) if the excessive memory requirement is not too high. In this case, the operational speed of the NQ operation will be drastically degraded and PQ can eventually outperform NQ several times over with respect to the overall retrieval time. The other possibility is that the NQ operation cannot be completed at all, since the system is not capable of providing the excessive memory required by NQ, and in this case PQ is the only feasible query operation.

• "Earlier and Better" Retrieval Results: Along with the ongoing process, PQ allows intermediate query results (PSQ steps), which might sometimes show equal or 'even better' performance than the final (overall) retrieval result, as in the typical examples given in Figure 44 and Figure 45. In Figure 44, an image retrieval example within the Shape database via PQ using the Canny [13] Edge Histogram feature is shown. We use $t_p$ = 0.2 sec and the PQ operation is completed in three PSQ series (i.e. PQ #1, #2 and #3). This is one particular example in which an intermediate PSQ retrieval yields a better performance than the final PQ retrieval (which is the same as the retrieval result of NQ). In this example, the first 12-best retrieval results in PQ #1 are obviously better (more relevant in terms of shape similarity to the queried shape) than the ones in PQ #3 (NQ). In Figure 45, a video retrieval example within the Real World database via aural PQ using MFCC (Mel-Frequency Cepstral Coefficients) as the audio features is shown. We use $t_p$ = 5 sec and the PQ operation is completed in 12 PSQ series, but only three PSQ retrievals (i.e. PQ #1, #6 and #12) are shown. Note that PQ #6 and the subsequent retrieval results up to PQ #12 are identical, which means that the PQ operation produces the final retrieval result (which NQ would produce) in an earlier (intermediate) PSQ retrieval. Such "earlier and even better" retrieval results can be attributed to the fact that searching an item in a smaller data set usually yields better (detection or retrieval) performance than searching in a larger set. This is obviously an advantage for PQ since it proceeds within sub-queries performed in (smaller) sub-sets, whereas NQ always has to proceed through the entire database. Furthermore, for databases that are not indexed, such as in the examples given, this basically means that the order of the relevant items coincides with the progress of the ongoing PQ operation in the earlier PSQ steps.
When the database has a solid indexing structure and a query path can be formed according to the relevancy of the queried item, the user eventually gets relevant retrieval results in a fraction of the time needed for a typical NQ operation.

Figure 44: PQ retrievals of a query image (left-top) within three PSQs. $t_p$ = 0.2 sec.

Figure 45: Aural PQ retrievals of a video clip (left-top) in 12 PSQs (only the 1st, 6th and 12th are shown). $t_p$ = 5 sec.

• Query Accessibility: This is the major advantage that PQ provides. Stopping an ongoing query operation is an important capability from the user's point of view. As shown in Figure 42, by pressing the PQ Knob during an ongoing PQ operation, the user can stop it at any time (i.e. when the results so far are satisfactory). Of course, NQ can also be stopped, but no retrieval result will be available afterwards since operations such as the similarity distance calculations or the sorting are likely not completed. Another important accessibility option that PQ offers is so-called PSQ Browsing. Whether stopped abruptly or completed at the end, the user can still browse among the PSQ retrievals and visit any retrieval page of a particular PSQ, since the retrieval results stay alive unless a new PQ is initiated or the application is terminated. This is obviously a significant requirement especially when better results are obtained in an earlier PSQ step than in the later ones, as mentioned before. On the other hand, this might still be a desirable option even if the earlier PSQ results are not better but merely comparable to the later ones; this could still be relevant to the user. One particular example is shown in Figure 46, that is, a video retrieval example from the Open Video database via visual PQ using several color (YUV, HSV, etc.), texture (GLCM [49]) and shape (Canny Edge [13] Histogram) features. We use $t_p$ = 3 sec and the PQ operation is completed in 4 PSQ series. Note that in this particular example a retrieval performance evaluation is difficult to accomplish among the 4 PSQ retrievals, since their relevancy to the queried item is subjective.

Figure 46: Visual PQ retrievals of a video clip (left-top) in 4 PSQs. $t_p$ = 3 sec. (Panels PQ #1–PQ #4 with their arrival times.)

The most important accessibility advantage that PQ can provide is that it can further improve the efficiency of any relevance feedback mechanism in certain ways. An ordinary relevance feedback technique works as follows: the user initiates a query and, after the query is completed, gives some feedback to the system about the retrieval results according to their relevancy (and/or irrelevancy) with respect to the queried item. Afterwards, a new query is initiated in order to get better retrieval results, and this might be repeated iteratively until satisfactory results are obtained. This is an especially time consuming process since at each iteration the user has to wait until the query operation is completed. Due to the enhanced accessibility options that PQ provides, significant improvements can be achieved for the user in the following scenarios: First, the user can employ relevant (and irrelevant) feedbacks during the query process and the incoming progressive retrievals can thus be tuned progressively. This means that during an ongoing query process the user can employ one or more relevance feedbacks at any time (within the lifetime of PQ).
Another alternative is that the user can stop an ongoing PQ, employ the relevance feedbacks with respect to the (intermediate) retrievals via PSQ Browsing, and re-initiate a new (fine-tuned) PQ. Basically, any relevance feedback technique can be applied along with PQ since in both scenarios PQ only provides the necessary basis for the (user) accessibility to employ the relevance feedback, but otherwise stays independent from the internal structure of any individual technique employed.

• Overall Retrieval Time (Query Speed): The overall query time is the time elapsed from the beginning of the query to the end of the operation. For NQ, this is obviously the total time from the moment the user initiates the query until the results are ranked and displayed on the screen. However, for PQ, since the retrieval is a continuous process with PSQ series, overall retrieval means that PQ proceeds over the entire database and its process is finally completed. As mentioned earlier, at this point both PQ and NQ should generate identical retrieval results for a particular queried item. There are basically three major processes in NQ: loading the database features (from the disc) into the system memory, calculating the (dis-)similarity distances from the features, and sorting (ranking) the database items according to their similarity distances. When PQ performs the overall retrieval, the first two processes take exactly the same time, but the sorting will be faster due to the following fact. Let n be the number of items in the database. If, for example, Quick Sort is applied, then the number of comparisons will be $O(n \log n)$ in the average case and $O(n^2)$ in the worst case. Assume that we perform PQ in only two PSQ series, and let $n_1$ be the number of items in the first sub-set and $n_2$ the number of items in the second one, where $n = n_1 + n_2$. In both the average and the worst-case scenario,

$$O(n_1^2) + O(n_2^2) < O(n^2) \;\; \text{(worst case)}, \qquad O(n_1 \log n_1) + O(n_2 \log n_2) < O(n \log n) \;\; \text{(average case)} \qquad (32)$$

Here we did not take into account the time spent for the fusion operation, since in the worst case it only requires $n = n_1 + n_2$ comparisons, an $O(n)$ operation, which is negligible. So PQ effectively applies a faster sorting scheme, especially if the worst-case scenario is considered. It can be shown by deduction that the PQ time becomes slightly shorter as the number of PSQ operations increases (i.e. with a smaller sub-set size or $t_p$ value). In order to verify this, several aural PQ retrieval experiments in the Real World database have been performed with different $t_p$ values. In order to get an unbiased measure, the experiments for each $t_p$ value are repeated 5 times and the median of the 5 overall retrieval times is taken into account. The PQ total execution time (overall retrieval time) and the number of PSQ updates are plotted in Figure 47. The same experiment is repeated for the visual PQ operation and the result is shown in Figure 48. Note that if PQ is completed with only one PSQ, then it basically performs a NQ operation and therefore the NQ retrieval time can also be examined in both figures. As experimentally verified, PQ's overall retrieval time is 0–25% faster than NQ retrievals (depending on the number of PSQ series) if the NQ memory requirement does not exceed the system memory.

Figure 47: Aural PQ Overall Retrieval Time and PSQ Number vs. PQ Period.

Figure 48: Visual PQ Overall Retrieval Time and PSQ Number vs. PQ Period.
It is also observed that the real PSQ retrieval times are, in general, in the close neighborhood of the (user-defined) period value $t_p$. One typical example showing the PSQ arrival times for the PQ example shown in Figure 45 is plotted in Figure 49.

Figure 49: PSQ and PQ retrieval times for the sample retrieval example given in Figure 45.

5.4.3. PQ versus HP PQ

The experimental results presented so far are all for the single-thread PQ implementation. However, most of the results, such as memory, speed and overall retrieval time, are also valid for the HP PQ scheme. The outcome of the new PQ implementation scheme, HP PQ, differs only in the timing accuracy of the PSQ retrieval times. As expected, HP PQ provides precise timing (periodicity) of the PSQ retrievals, as can be seen in the comparative retrieval plot in Figure 50.

Figure 50: PSQ retrieval times for the single and multi-threaded (HP) PQ schemes. $t_p$ = 5 sec.

5.4.4. Remarks and Evaluation

In accordance with the experimental results presented so far, the following conclusive remarks can be made about the innovative properties of PQ:

• PQ is an efficient retrieval technique (via QBE), which works with both indexed and non-indexed databases.
• In this context it is the unique query method which may provide "faster" retrievals without requiring any special "indexing" structures.
• In the same context, it is the unique query method that provides a "browsing" capability between the instances (PSQs) of the ongoing query. The user can browse the PSQ retrievals at any time, i.e. during or after the query process.
• In databases without an indexing structure, it achieves several improvements such as loose system requirements (in terms of memory, CPU power, etc.), "early and even better" retrieval results, user-friendly query accessibility options (i.e. PQ can be stopped once satisfactory results are obtained, PSQ retrievals can be browsed at any time, etc.), reduced overall timing (in case PQ is completed), etc. As mentioned earlier, for some large-scale databases it is the "only" feasible query process, whereas NQ might yield an infeasible waiting period or require excessive memory, etc.
• It can also be applied to indexed databases efficiently (to get the most relevant results the earliest) and in this case it shows a "dynamic kNN query" behavior where k increases with time; hence the user has the advantage of assigning it by seeing (and judging) the results. This is obviously a significant advantage with respect to traditional kNN or range queries, since the user cannot know a "good" k value (or range value) beforehand and these values depend directly on the content distribution of the database and the relevancy of the queried item.
• The most important advantage above all is that it provides continuous user interaction with the ongoing query operation. The user can see the results obtained so far, immediately evaluate them and feed "relevance feedback" back into the system, or simply waste no time if the results obtained so far are satisfactory (query stop).
• Finally, in addition to all the aforementioned advantages and performance improvements that PQ provides, PQ does not have any significant drawbacks for the system or the user. This means there is no practical cost for using PQ.
In the next chapter, an efficient implementation of PQ along with an indexing scheme will be presented and therefore the theoretical expectation about the earliest retrieval of the most relevant items will be verified accordingly.

Chapter 6

A Novel Indexing Scheme: Hierarchical Cellular Tree

Especially for the management of large multimedia databases, there are certain requirements that should be fulfilled in order to improve the retrieval feasibility and efficiency. Above all, such databases need to be indexed in some way, and traditional methods are no longer adequate. It is clear that the nature of the search mechanism is influenced heavily by the underlying architecture and indexing system employed by the database. Therefore, this chapter addresses this problem and presents a novel indexing technique, the Hierarchical Cellular Tree, which is designed to bring an effective solution especially for indexing multimedia databases and furthermore to provide an enhanced browsing capability, which enables the user to make a guided tour within the database. A pre-emptive cell search mechanism is introduced in order to prevent the corruption of large multimedia item collections due to the misleading item insertions that might occur otherwise. In addition to this, similar items are focused within appropriate cellular structures, which will be subject to mitosis operations when dissimilarity emerges as a result of irrelevant item insertions. Mitosis operations ensure that the cells are kept in a focused and compact form, and yet the cells can grow to any dimension as long as the compactness prevails. The proposed indexing scheme is then optimized for the query method proposed earlier, the Progressive Query, in order to maximize the retrieval efficiency from the user's point of view. In order to provide a better understanding of the indexing operation for multimedia databases, the next section is devoted to an overview of the traditional indexing structures, their limitations and drawbacks. Hence, the philosophy and the design fundamentals behind the proposed HCT structure can then be introduced, especially by focusing the attention on a particular indexing structure, the M-tree, which is the most promising indexing structure published so far. Afterwards, the generic HCT architecture and implementation details are introduced in Section 6.2. Section 6.3 is devoted to the PQ operation over the HCT indexing structure. A novel browsing scheme, HCT Browsing, is introduced in Section 6.4. Section 6.5 reports the experimental results along with some example demonstrations and accordingly presents the conclusive remarks.

6.1. DATABASE INDEXING METHODS – AN OVERVIEW

During the last decade several content-based indexing and retrieval techniques and applications have been developed, such as the MUVIS system [P4], [P6], [43], Photobook [50], VisualSEEk [63], Virage [75], and VideoQ [15], all of which are designed to bring a framework structure for the handling and especially the retrieval of multimedia items such as digital images, audio and/or video clips. As explained in the previous chapter, database primitives are mapped into some high dimensional feature domain, which may consist of several types of features such as visual, aural, etc., as long as the database contains items from which those particular features can be extracted.
A particular feature set models the contents of the multimedia item into a set of semantic attributes, which can then be managed and processed by conventional database management systems. In this way the content-based similarity between two database items can be estimated by calculating the (dis-)similarity distance of their feature vectors. Henceforth, content-based similarity retrieval according to a query item can be carried out by similarity measurements, which produce a ranking order of similar multimedia items within the database. This is the general query-by-example (QBE) scenario; however, it is also costly and CPU intensive, especially for large multimedia databases, since the number of similarity distance calculations is proportional to the database size. This fact brought a need for indexing techniques, which organize the database structure in such a way that the query time and the I/O access amount can be reduced. Over the past three decades, researchers have proposed several indexing techniques, mostly formed as hierarchical tree structures used to cluster (or partition) the feature space. Initial attempts such as KD-Trees [4] used space-partitioning methods that divide the feature space along predefined hyperplanes regardless of the distribution of the feature vectors. Each region is mutually disjoint and their union covers the entire space. In the R-tree [22] the feature space is divided according to the distribution of the database items, and region overlapping can be introduced as a result. Both the KD-tree and the R-tree are the first examples of Spatial Access Methods (SAMs). Afterwards, several enhanced SAMs have been proposed. The R*-tree [3] provides consistently better performance than the R-tree and R+-tree [61] by introducing a policy called "forced reinsert". The R*-tree also improves the node splitting policy of the R-tree by taking overlapping area and region parameters into consideration. Lin et al. proposed the TV-tree [45], which uses so-called telescope vectors. These vectors can be dynamically shortened, assuming that only dimensions with high variance are important for the query process and therefore low-variance dimensions can be neglected. Berchtold et al. [5] introduced the X-tree, which is particularly designed for indexing higher dimensional data. The X-tree avoids the overlapping of region bounding boxes in the directory structure by using a new organization of the directory and, as a result, the X-tree outperforms both the TV-tree and the R*-tree significantly. It is 450 times faster than the R-tree and between 4 and 12 times faster than the TV-tree when the dimension is higher than two, and it also provides faster insertion times. Still, bounding rectangles can overlap in higher dimensions. In order to prevent this, White and Jain proposed the SS-tree [73], an alternative to the R-tree structure, which uses minimum bounding spheres instead of rectangles. Even though the SS-tree outperforms the R*-tree, the overlapping in high dimensions still occurs. Thereafter, several other SAM variants have been proposed, such as the SR-tree [31], S²-Tree [71], Hybrid-Tree [14], A-tree [57], IQ-tree [5], Pyramid Tree [6], NB-tree [20], etc. Especially for content-based indexing and retrieval in large-scale multimedia databases, SAMs have several drawbacks and significant limitations. By definition, a SAM-based indexing scheme partitions and works over a single feature space.
However, a multimedia database can have several feature types (visual, aural, etc.), each of which might also have multiple feature subsets. Furthermore, SAMs assume that query operation time and complexity are only related to accessing a disk page (I/O access time) containing the feature vector. This is obviously not a trivial assumption for multimedia databases and, consequently, no attempt has been made in the design of SAMs to reduce the similarity distance computations (CPU time). In order to provide a more general approach to similarity indexing for multimedia databases, several efficient Metric Access Methods (MAMs) have been proposed. The generality of MAMs comes from the fact that any MAM performs the indexing process by assuming only the availability of a similarity distance function, which satisfies three trivial rules: symmetry, non-negativity and the triangular inequality. Therefore, a multimedia database might have several feature types along with various numbers of feature sub-sets, all of which lie in different multi-dimensional feature spaces. As long as such a similarity distance function exists (it is usually treated as a "black box" by the underlying MAM), the database can be indexed by any MAM. Several MAMs have been proposed so far. Yianilos [76] presented the vp-tree, which is based on partitioning the feature vectors (data points) into two groups according to their similarity distances with respect to a reference point, the so-called vantage point. Bozkaya and Ozsoyoglu [9] proposed an extension of the vp-tree, the so-called mvp-tree (multiple vantage point), which basically assigns m vantage points to a node with a fan-out of m². They reported a 20% to 80% reduction of similarity distance computations compared to vp-trees. Brin [11] introduced the Geometric Near-Neighbor Access Tree (GNAT) indexing structure, which chooses a number k of split points at the top level, and each of the remaining feature vectors is associated with the closest split point. GNAT is then built recursively and the parameter k is chosen to be a different value for each feature set depending on its cardinality. The MAMs addressed so far present several shortcomings. Contrary to SAMs, these metric trees are designed only to reduce the number of similarity distance computations, paying no attention to I/O costs (disk page accesses). They are also intrinsically static methods in the sense that the tree structure is built once and new insertions are not supported. Furthermore, all of them build the indexing structure from top to bottom and hence the resulting tree is not guaranteed to be balanced. Ciaccia et al. [19] proposed the M-tree to overcome such problems. The M-tree is a balanced and dynamic tree, which is built from bottom to top, creating a new root level only when necessary. The node size is a fixed number, M, and therefore the tree height depends on M and the database size. Its performance optimization concerns both the CPU computational time for similarity distances and the I/O costs for disk page accesses to the feature vectors of the database items. Recently, Traina et al. [67] proposed the Slim-tree, an enhanced variant of the M-tree, which is designed to improve the performance by minimizing the overlaps between nodes. They introduced two factors, the "fat-factor" and the "bloat-factor", to measure the degree of overlap and proposed the usage of the Minimum Spanning Tree (MST) for splitting the node. Another slightly enhanced M-tree structure, the so-called M+-tree, can be found in [79].
In summary, the indexing structures addressed so far are all designed to speed up a QBE process by using some multidimensional index structure. However, all of them have significant drawbacks for the indexing of large-scale multimedia databases. As mentioned before, SAMs are by nature not suitable for this purpose, since their design concerns only single feature space partitioning, whereas the query process should be performed over the several features and feature sub-sets extracted for the proper indexing of the multimedia database. The static MAMs addressed so far do not support dynamic changes (new insertions or deletions), whereas this is an essential requirement during the incremental construction of the database. Even though the M-tree and its variants provide dynamic database access, the incremental construction of the indexing tree can lead, depending on the order of the objects, to significantly varying performances during the querying phase. Moreover, MAM performance also deteriorates significantly with an increasing number of database items, and the choice of the prefixed node capacity, M, affects the tree structure and hence the performance of indexing. So far, all indexing methods (MAMs and SAMs), while providing good results on low dimensional feature spaces (i.e. d < 10 for SAMs), do not scale up well to high dimensional spaces due to the so-called "curse of dimensionality". Recent studies [72] show that most of the indexing schemes even become less efficient than sequential indexing for high dimensions. Such degradations and shortcomings prevent a widespread usage of such indexing structures, especially on multimedia collections. Furthermore, in multimedia databases the discrimination factor of the visual and aural descriptors (features) is quite limited. They mainly have significant drawbacks in terms of representing the content similarity (or providing a reliable similarity measure). This is a major problem if a dynamic indexing algorithm such as the M-tree relies on a clustering scheme which depends on assigning a single nucleus item (the routing object) and then grouping a set of similar items around it. No matter how accurately a nucleus item is chosen, due to the aforementioned fact that the descriptors (features) present are unreliable and highly variant, irrelevant (dissimilar) items can be chosen in a cluster or an insufficient number of similar items can be clustered. In the MVP-tree, this problem is addressed by introducing multiple vantage (nucleus) items. In order to overcome the problems and provide efficient solutions to the aforementioned shortcomings, especially for multimedia databases, we develop a MAM-based, dynamic and self-organized indexing scheme, the Hierarchical Cellular Tree (HCT). As its name implies, HCT has a hierarchic structure, which is formed into one or more levels. Each level is capable of holding one or more "cells". The cell structure is nothing but an analogue of the "node" structure in the M-tree. The reason we call it by a different name is that the cells further contain a tree structure, a Minimum Spanning Tree (MST), which refers to the database objects (their database representations and basically their descriptors) as its MST nodes. Among all the available indexing structures, the M-tree shows the highest structural similarity to HCT, such as:

• Both indexing schemes are MAM-based and have a similar hierarchical structure, i.e. levels.
• They are both created dynamically, from bottom to top.
The tree grows one level upwards when the number of cells in the top level becomes two (due to a mitosis operation).
• A similar concept of representing each cell in the lower level by a nucleus (routing) object in the higher level.

However, there are several major differences in their design philosophies and objectives, such as:

• The M-tree is a generic MAM, designed to achieve a balanced tree with a low I/O cost in a large data set. HCT, on the other hand, is designed for indexing multimedia databases, where the content variation is seldom balanced, and it is especially optimized for compactness (highly focused cells).
• The M-tree works over nodes with a maximum (fixed) capacity, M. Therefore, the performance depends on a "good" choice of this parameter with respect to the database size and thus the M-tree construction varies significantly with it. Especially for multimedia databases, the database size is dynamic and unknown most of the time. Furthermore, the content variation among the database items is quite volatile and unknown beforehand, too. There might be a group of similar items whose number exceeds several multiples of M, and hence too many nodes will unnecessarily be used to represent them. So with a static M setting there is a danger of a saturated number of nodes representing a group of similar items and, therefore, of causing significant indexing degradations due to excessively crowded levels and an unnecessarily large M-tree height. Another potential danger in such circumstances is losing a minority cluster of (similar) objects due to the excessive domination of a larger number of objects with similar content. Such minor clusters will therefore lose the chance of representation on the higher levels, and this will cause misleading insertions of similar items. This is the main reason for the corruption due to the "crowd effect" in large databases. HCT, on the other hand, has no limit for the cell size as long as the cell keeps a definite "compactness" measure. So HCT will not drastically suffer from the "crowd effect" and the resultant corruption, since it clusters similar objects into one (or a minimum number of) cell(s) and hence provides an equal representation chance for both minor and major groups of items on the higher levels.
• In the M-tree, the cell compactness is only measured with respect to the distance of the routing (nucleus) object to the farthest object, the so-called covering radius. Due to the aforementioned unreliability of such a single measure for the cell compactness, HCT uses all items and their minimum distances to the cell (instead of a single nucleus item alone) to come up with a regularization function that represents a dynamic model of the cell compactness. During the lifetime of the HCT body (i.e. with incoming item insertions, removals, internal transfers, events, etc.), this function dynamically updates the current cell compactness feature, which is then compared to a certain statistically driven level threshold value to decide whether or not the cell should be split (mitosis).
• The split policies and objectives are also different between the M-tree and HCT. First of all, the M-tree performs a split operation only when the cell size reaches M, without paying attention to the current status (i.e. compactness) of the cell. For the split operation, the M-tree first tries to find suitable nucleus (routing) objects and then forms the child cells around them.
There are several methods for promoting those nucleus objects; the one used to preserve compactness is the so-called m_RAD (the minimum sum of RADii) algorithm, which is also the most complex one, requiring $O(N^2)$ distance computations within the node each time a split occurs. Once the nucleus objects are found, there are two distribution alternatives for the formation of the child nodes: the Generalized Hyperplane method [19] is used to optimize the compactness by assigning the objects to the nearest nucleus (routing) object, whereas the Balanced method is used to obtain child nodes that are as balanced as possible. On the other hand, HCT performs a mitosis (split) operation over a cell only when the cell has reached a certain degree of maturity and only if the current compactness feature indicates that the cell should undergo a mitosis operation to preserve the compactness level that the cell's (owner) level requires. Therefore, mitosis is one of the major operations for preserving/enhancing the overall compactness. Similar to natural mitosis occurring in organic cells, the most sparse (dissimilar) object or group of objects is detached from the other group, and thus more and more similar groups are kept within the same cell, which provides an increasing similarity focus (compactness) in the long term. HCT first performs the mitosis operation to split the cell into two child cells and afterwards assigns the most suitable nucleus items for them accordingly. Due to the presence of an MST formation within each cell, there is no cost for the mitosis operation, since the MST is used to decide from which branch the partition should be executed.
• Although both indexing structures are built dynamically by incremental (one by one) item insertions, performed with a cell search from top to bottom, the insertion processes differ significantly in terms of the cell search operations. The M-tree insertion operation is based on the Most-Similar Nucleus (MS-Nucleus) cell search, which depends on a simple heuristic assuming that the closest nucleus item (aka "routing object") yields the best sub-tree during the descent and finally the best (target) cell to which the item is appended. In this chapter, we will show that this is not always a valid assumption and is therefore a potential cause of corruption, since it can lead to sub-optimal insertions, especially for large databases, due to the "crowd effect". Furthermore, the incremental construction of an M-tree could lead to different structures depending on the order of item insertions. HCT is designed to perform an optimum search for the target cell to which the incoming item should belong. This search, the so-called Pre-emptive cell search, verifies, during the descent at each level, all the paths that are likely to yield a better nucleus item (and hence a better cell one level below) in an iterative way. In this way, along with the mitosis operation, this search algorithm further improves the cell compactness factor at each level.
• The M-tree has a conservative structure that might cause degradations in due time. For example, the cell nucleus (routing object) is not changed after an insertion or removal operation, even though another item might now be a more suitable candidate for being the cell nucleus and hence a better representative of that cell on the higher level. Another example is the static allocation of new cell nuclei after a mitosis operation: the new cell nuclei are always assigned to their parent's owner cell in the higher level.
On the contrary, HCT has a totally dynamic approach. Any operation (insertion, removal or mitosis) can change the current cell nucleus to a new (better) one, in which case the old nucleus is removed and the new one is inserted into the most suitable cell in the upper level – not necessarily into the owner cell of the old one, but into the optimum one that can be found at that instant of the HCT body. Similarly, when mitosis occurs within a cell at a certain level, the old (parent) nucleus item is removed from its owner cell and instead two new nucleus items are inserted into the higher level independently (i.e. the old parent nucleus has no effect on this). Via such dynamic updates towards "best possible" assignments and by further applying the Pre-emptive cell search algorithm for item insertions, which is generic for any level, corruption can therefore be avoided and the HCT body is continuously kept intact.

Along with the indexing techniques addressed so far, certain query techniques have to be used to speed up a QBE process within indexed databases. The most common query techniques are kNN and range queries, which are explicitly introduced in Section 5.1 along with their limitations and drawbacks. A simple yet efficient retrieval scheme, the Progressive Query (PQ), is then proposed. PQ is a retrieval (via query) technique which can be performed over databases with or without the presence of an indexing structure. Therefore, it can be an alternative to the Normal Query (NQ) with its exhaustive search, where both of them produce (converge to) the same result at the end. When the database has an indexing structure, PQ can replace kNN and range queries whenever a Query Path (QP), over which PQ proceeds, can be formed. Instead of relying on some unknown parameters such as k (the number of relevant items for kNN) or ε (the range value for range queries), PQ provides periodic (with a user-defined time period) query results along with the query process and lets the user browse among the results obtained and stop the ongoing query in case the results obtained so far are satisfactory, so that no further time is unnecessarily wasted. Therefore, the proposed (HCT) indexing technique has been designed to work in harmony with PQ in order to evaluate the retrieval performance in the end, i.e. how fast the most relevant items can be retrieved, or how efficiently HCT can provide a QP for a particular query item.

6.2. HCT FUNDAMENTALS

HCT is a dynamic, cell-based and hierarchically structured indexing method, which is purposefully designed for PQ operations and advanced browsing capabilities within large-scale multimedia databases. It is mainly a hierarchical clustering method where the items are partitioned depending on their relative distances and stored within cells on the basis of their similarity proximity. The similarity distance function implementation is a black box for HCT. Furthermore, HCT is a self-organized tree, which is implemented by genetic programming principles. This basically means that the operations are not externally controlled; instead, each operation such as item insertion, removal, mitosis, etc. is carried out according to some internal rules within a certain level, and their outcomes may uncontrollably initiate some other operations on the other levels. Yet all such "reactions" are bound to end in a limited time; that is, for any action (i.e.
an item insertion), its consequent reactions cannot last indefinitely, due to the fact that each of them can occur only on a higher level and any HCT body naturally has a limited number of levels. In the following sub-sections, we will detail the basic structural components of the HCT body and then explain the indexing operations in an algorithmic way.

6.2.1. Cell Structure

A cell is the basic container structure in which similar database items are stored. Ground-level cells contain all the database items. Each cell further carries an MST in which the items are spanned via its nodes. This internal MST stores the minimum (dis-)similarity distance of each individual item to the rest of the items in the cell. In this respect the scheme resembles the MVP-tree [9] structure; however, instead of using some (pre-fixed) number of items, all cell items are now used as vantage points for any (other) cell item. These item–cell distance statistics are mainly used to extract the cell compactness feature. In this way we can have a better idea about the similarity proximity of any item instead of comparing it only with a single item (i.e. the cell nucleus), and hence a better compactness feature. The compactness feature calculation is also a black-box implementation and we use a regularization function obtained from the statistical analysis of the MST and some cell data. This dynamic feature can then be used to decide whether or not to perform mitosis within a cell at any instant. If permission for mitosis is granted, the MST is again used to decide where the partition should occur, and the longest branch is a natural choice. Thus an optimum decision can be made to enhance the overall compactness of the cell with no additional computational cost. Furthermore, the MST is used to find an optimum nucleus item after any operation completed within the cell. In HCT, the cell size is kept flexible, which means there is no fixed cell size that cannot be exceeded. However, there is a maturity concept for the cells in order to prevent a mitosis operation before the cell reaches a certain level of maturity. Otherwise, we cannot obtain reliable information on whether or not the cell is ready for mitosis, since there is simply not enough statistical data gathered from the cell items and its MST. Therefore, using a similar argument as for organic cells, a maturity cell size (e.g. 6) is set for all the cells in the HCT body (level independent). Note that this should not be compared with the parameter M of the M-tree, where M is used to enforce the mitosis of a cell with size M no matter what the cell condition (i.e. compactness) is: M is the maximum size that a cell can have, whereas in our case we set the minimum size of a cell as a pre-requisite condition for a mitosis operation. This is not a significant parameter; it neither affects the overall performance of HCT nor needs to be proportional to the database size or any other parameter, as is the case in an M-tree.

6.2.1.1 MST Formation

A Minimum Spanning Tree (MST) connects all the nodes of a graph in such a way that the total branch weight is minimal. Note, however, that this does not mean that all paths are minimal. The MST is, therefore, the (minimal) set of branches required to connect the graph; a further constraint is that the MST contains no cycles. There are several MST construction algorithms, such as Kruskal's [60] and Prim's [54].
These are, however, static algorithms; that is, all items with their relative (similarity) distances with respect to each other should be known beforehand. The construction of an MST requires an $O(N^2)$ computational cost, where N is the number of items. In our case the cell, and hence its MST, should be constructed dynamically, since items can be inserted at any time during the lifetime of the HCT body, and it would be infeasible to re-construct the MST from scratch each time a new item is inserted into a particular cell, as this would require $O(N^3)$ computations. To avoid this, we modify the traditional (static) MST construction algorithm into a dynamic one using the following DynamicMST algorithm. Let item-I be the next item to be inserted into the MST. The DynamicMST algorithm can then be expressed as follows:

DynamicMST (item-I)
- Create a new MST node for item-I: node-I
- Extract the distance table between node-I and the MST nodes.
- Find the closest node to node-I: node-C, and connect the two nodes with a branch.
- Create an array for connected nodes: ArrayCN[], and put node-C into it.
- CheckBranches (ArrayCN[])

CheckBranches (ArrayCN[])
- For all the nodes in ArrayCN (say node-C) do:
  - For all the nodes that node-C is connected to (say node-j) do:
    - Create a candidate branch between node-j and node-I: branch(node-I, node-j)
    - If ( |branch(node-C, node-j)| > |branch(node-I, node-j)| ) then do:
      - Delete branch(node-C, node-j) from the MST.
      - Insert branch(node-I, node-j) into the MST.
      - Insert node-j into ArrayCN.
    - Else do:
      - Delete branch(node-I, node-j).
  - End loop.
- End loop.
- If ArrayCN is empty then Return.
- Else CheckBranches (ArrayCN[]).

The function CheckBranches checks all the branches of the nodes stacked in ArrayCN and, if any particular branch (to a particular node) gives a longer distance than the possible (candidate) branch between that particular node and node-I, then that longer (existing) branch is cut and the candidate branch replaces it. All the nodes that become connected to node-I in this fashion are then put into ArrayCN and CheckBranches is called again. The operation continues recursively until all the nodes in ArrayCN are consumed (i.e. the size of the array is 0). By means of the proposed dynamic insertion technique, the computational cost is still $O(N^2)$; the existing MST is kept and updated only when necessary. A sample dynamic insertion operation is illustrated in Figure 51.

Figure 51: A sample dynamic item (5) insertion into a 4-node MST.
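A minimal sketch of the DynamicMST insertion (together with the nucleus selection discussed in the next sub-section) is given below. It is an illustration only: the class layout, the adjacency representation and the distance callback are assumptions made for the sketch, not the actual HCT data structures.

```python
from collections import defaultdict

class DynamicMST:
    """Illustrative dynamic MST for one cell; `dist` is the black-box
    similarity distance function taking two feature vectors."""

    def __init__(self, dist):
        self.dist = dist
        self.adj = defaultdict(set)   # node id -> set of connected node ids
        self.items = {}               # node id -> item (feature vector)

    def insert(self, node_i, item):
        self.items[node_i] = item
        if len(self.items) == 1:      # first item: nothing to connect yet
            return
        # connect node-I to its closest existing node (node-C)
        node_c = min((n for n in self.items if n != node_i),
                     key=lambda n: self.dist(item, self.items[n]))
        self._connect(node_i, node_c)
        self._check_branches([node_c], node_i)

    def _check_branches(self, array_cn, node_i):
        new_cn = []
        for node_c in array_cn:
            for node_j in list(self.adj[node_c]):
                if node_j == node_i:
                    continue
                # replace the existing branch if the candidate branch is shorter
                if (self.dist(self.items[node_c], self.items[node_j]) >
                        self.dist(self.items[node_i], self.items[node_j])):
                    self._disconnect(node_c, node_j)
                    self._connect(node_i, node_j)
                    new_cn.append(node_j)
        if new_cn:                    # recurse until no more branches are replaced
            self._check_branches(new_cn, node_i)

    def nucleus(self):
        """The cell nucleus: the node having the maximum number of MST branches."""
        if not self.adj:
            return next(iter(self.items), None)
        return max(self.adj, key=lambda n: len(self.adj[n]))

    def _connect(self, a, b):
        self.adj[a].add(b); self.adj[b].add(a)

    def _disconnect(self, a, b):
        self.adj[a].discard(b); self.adj[b].discard(a)
```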
6.2.1.2 Cell Nucleus

The cell nucleus is the item which represents the owner cell on the higher level(s). Since, during the top-down cell search for an item insertion, these nucleus items are used to decide the cell into which the item should be inserted, it is essential to promote the best item for this representation at any instant. When there is only one item in the cell, it is obviously the nucleus item of that cell; otherwise, the nucleus item is assigned, using the cell MST, as the item having the maximum number of branches (connections to other items). This heuristic makes sense since it is the unique item to which most of the items have the closest proximity. Contrary to the static nucleus assignment of some other MAM-based indexing schemes such as the M-tree, the cell nucleus is dynamically verified and, if necessary, updated whenever an operation is performed over the cell, in order to maintain the best representation of the cell; moreover, there is no computational cost for this. For example, in Figure 51 the nucleus is item 2 before the insertion, since it is the only node that has two branches (connections). However, after the insertion, item 5 should be promoted as the new nucleus item since it now has more branches than any other item within the cell.

6.2.1.3 Cell Compactness Feature

As mentioned earlier, this is the feature which represents the compactness of the cell items, i.e. how tight (focused) the clustering of the items within the cell is. Furthermore, the regularization function implementation for the calculation of the cell compactness feature is a black box for HCT. In this sub-section we present the parameters of this function used in the experiments reported in this thesis. A similar argument as for the nucleus assignment can be made for the extraction of the cell compactness feature: instead of using only the distance values of all the items in the cell with respect to the nucleus item, it is more reliable and convenient to use the (minimum) distance of each item with respect to the cell (in fact, to the rest of the items in the cell), which is basically nothing but the branch weights of the cell MST. Once a cell reaches maturity (a pre-requisite for the compactness feature calculation), a regularization function, f, can be expressed using the following statistical cell parameters:

$$CF_C = f(\mu_C, \sigma_C, r_C, \max(w_C), N_C) \ge 0 \qquad (33)$$

where $\mu_C$ and $\sigma_C$ are the mean and standard deviation of the MST branch weights, $w_C$, of the cell C, $r_C$ is the covering radius, that is, the distance from the nucleus within which all the cell items lie, and $N_C$ ($> N_M$ for a mature cell) is the number of items in cell C. A compact cell is obtained if all these parameters are minimized. Accordingly, the regularization function should be implemented so as to minimize the compactness feature, $CF_C$. In the limit, the highest compactness is achieved when $CF_C = 0$, which means that all cell items are identical. Similar to the continuous updates of the nucleus item, the $CF_C$ value is also updated each time an operation is performed over the cell. The new $CF_C$ value is then compared with the current level compactness threshold, $CThr_L$, which is dynamically calculated within each level, and if the cell is mature and not compact enough, i.e. $CF_C > CThr_L$, mitosis is granted for that cell. The dynamic calculation of $CThr_L$ for level L will be explained in the next section.

6.2.1.4 Cell Mitosis

As explained earlier, there are two conditions necessary for a mitosis operation: maturity, i.e. $N_C > N_M$, and cell compactness, i.e. $CF_C > CThr_L$. Both conditions are checked after an operation (e.g. item insertion or removal) occurs within the cell in order to signal a mitosis operation. Due to the presence of the MST within each cell, mitosis has no computational cost in terms of similarity distance calculations. The cell is simply split by breaking the longest branch in the MST, and each of the newborn child cells is formed from one of the MST partitions. A sample mitosis operation is shown in Figure 52.

Figure 52: A sample mitosis operation over a mature cell C.
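The mitosis step itself amounts to cutting the longest MST branch and taking the two resulting connected components as the child cells. The following is a minimal sketch under assumed data structures (a plain adjacency map and a branch-weight lookup); the small example at the end is hypothetical and does not reproduce the exact cell of Figure 52.

```python
def mitosis(adjacency, weight):
    """Cut the longest MST branch and return the two item partitions
    (child cells). `adjacency` maps each item to its MST neighbours and
    `weight(a, b)` returns the branch weight (illustrative stand-ins)."""
    # find the longest branch (each branch appears twice in the adjacency map)
    a, b = max(((u, v) for u in adjacency for v in adjacency[u]),
               key=lambda e: weight(*e))

    def component(start, blocked):
        seen, stack = {start}, [start]
        while stack:                   # walk the MST without crossing the cut branch
            u = stack.pop()
            for v in adjacency[u]:
                if (u, v) != blocked and (v, u) != blocked and v not in seen:
                    seen.add(v); stack.append(v)
        return seen

    return component(a, (a, b)), component(b, (a, b))

# Small hypothetical example: the branch of weight 8 is cut, splitting the cell.
adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
w = {(1, 2): 2, (2, 3): 8, (3, 4): 1}
weight = lambda u, v: w.get((u, v), w.get((v, u)))
print(mitosis(adj, weight))   # -> ({1, 2}, {3, 4})
```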
6.2.2. Level Structure

The HCT body is hierarchically partitioned among one or more levels, as in the sample example shown in Figure 53. In this example there are three levels that are used to index 18 items. Apart from the top level, each level contains a varying number of cells that are created by mitosis operations occurring at that level. The top level contains a single cell and, when this cell splits, a new top level is created above it. As mentioned earlier, the nucleus item of each cell in a particular level is represented on the higher level.

Figure 53: A sample 3-level HCT body (Level 2 = top level, Level 1, Level 0 = ground level).

Each level is responsible for keeping logs about the operations performed on it, such as the number of mitosis operations, the statistics about the compactness feature of the cells, etc. Note that each level dynamically tries to maximize the compactness of its cells, although this is not a straightforward process, since the incoming items may not show any similarity to the items already present in the cells; such dissimilar item insertions will therefore cause a temporary degradation of the overall (average) compactness of the level. So each level, while analyzing the effects of the (recent) incoming items on the overall level compactness, should employ the necessary management steps towards improving compactness in due time (i.e. with future insertions). Within a period of time (i.e. during a number of insertions or after some number of mitosis operations), each level updates its compactness threshold according to the compactness feature statistics of the mature cells into which an item was inserted. Therefore, the $CThr_L$ value for a particular level L can be estimated as follows:

$$CThr_L = \frac{k_0}{P} \sum_{\substack{C \in S_P \\ N_C > N_M}} CF_C = k_0\, \mu_{CF_C}\,, \quad \forall C \in S_P \qquad (34)$$

where $S_P$ is the set of mature cells upon which the P most recent insertions were performed and $0 < k_0 \le 1$ is the compactness enhancement rate, which determines how much enhancement is targeted for the next P insertions beginning from the moment of the latest $CThr_L$ setting. If $k_0 = 1$, the trend is built upon keeping the current level of compactness intact, and so no enhancement is targeted for the future insertions. On the other hand, if $k_0 = 0$, the cells will split each time they reach maturity and in this case the HCT split policy becomes identical to that of the M-tree.
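Since the regularization function f in (33) is treated as a black box, the following sketch only illustrates the mechanics around it: a hypothetical (unvalidated) combination of the cell statistics for $CF_C$, the threshold update of (34) and the two mitosis conditions of sub-section 6.2.1.4. The function names and the particular form of f are assumptions, not the settings used in the reported experiments.

```python
import statistics

def compactness_feature(branch_weights, covering_radius, n_items):
    """A hypothetical instance of the regularization function f in (33).
    Smaller values mean a more compact (more focused) cell. n_items (N_C)
    is part of f's signature in (33) but is not used in this illustration."""
    mu = statistics.mean(branch_weights)
    sigma = statistics.pstdev(branch_weights)
    return mu + sigma + max(branch_weights) + covering_radius

def level_threshold(cf_of_recent_mature_cells, k0=0.9):
    """Level compactness threshold as in (34): k0 times the mean CF_C of the
    mature cells that received the last P insertions."""
    if not cf_of_recent_mature_cells:
        return float("inf")          # nothing to compare against yet
    return k0 * statistics.mean(cf_of_recent_mature_cells)

def mitosis_granted(cf_cell, n_cell, cthr_level, n_maturity=6):
    """The two conditions of sub-section 6.2.1.4: maturity and insufficient compactness."""
    return n_cell > n_maturity and cf_cell > cthr_level
```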
6.2.3. HCT Operations

There are mainly three HCT operations: cell mitosis, item insertion and item removal. Cell mitosis can only happen after one of the other two HCT operations occurs, and was already covered in sub-section 6.2.1.4. Both item insertion and removal are generic HCT operations that are identical for any level. Insertions are performed one item at a time, whereas item removals can be performed on a cell basis, i.e. any number of items in the same cell can be removed simultaneously. In the following sub-sections, we present the algorithmic details of both operations.

6.2.3.1 Item Insertion Algorithm for HCT

Item insertion is a level-based operation and is implemented on a per-item basis. Let nextItem be the item to be inserted into a target level indicated by a number, levelNo. The insertion algorithm, Insert(nextItem, levelNo), first performs a novel search algorithm, the Pre-emptive cell search, which recursively descends the HCT from the top level to the target level in order to locate the most suitable cell for nextItem. Once the target cell is located in the target level, the item is inserted into the cell, and the cell then becomes subject to a generic post-processing check. First, the cell is examined for a mitosis operation, as explained earlier. If the cell is mature and yields a worse compactness than required (i.e. CF_C > CThr_L), mitosis occurs and two new (child) cells are produced on the same level. The parent cell is then removed from the cell queue of the level and the two child cells are inserted instead. Accordingly, the old nucleus item is removed from the upper level and the two new nucleus items are inserted into the upper level by consecutively calling the Insert(·, levelNo+1) function for both (nucleus) items. This is an example of a self-propagating process in which one operation deterministically produces another in an iterative way; the processes are separate from each other, but the outcome of one may initiate the other.

On the other hand, if mitosis is not performed (for instance, because the cell is still compact enough after the insertion), another post-processing step is applied to verify the need for a cell nucleus change. As explained earlier, the nucleus item of the owner cell can also change after an insertion operation; in this case, the old nucleus is first removed from the upper level and the new one is inserted by calling Insert(·, levelNo+1) for the new nucleus item. The Insert algorithm can be expressed as follows:

Insert (nextItem, levelNo)
Ø Let the top level number be topLevelNo and the single cell in the top level be cell-T
Ø If (levelNo > topLevelNo) then do:
   o Create a new top level: level-T with number = topLevelNo+1
   o Create a new cell in level-T: cell-T
   o Append nextItem into cell-T.
   o Return.
Ø Let the owner (target) cell in level levelNo be cell-O
Ø If (levelNo = topLevelNo) then do:
   o Assign cell-O = cell-T
Ø Else do:
   o Create a cell array for Pre-emptive cell search: ArrayCS[], and put cell-T into it
   o Assign cell-O = PreemptiveCellSearch(ArrayCS[], nextItem, topLevelNo)
Ø Append nextItem into cell-O.
Ø Check cell-O for post-processing:
   o If cell-O is split then do:
      § Let item-O be the old (parent) nucleus item and item-N1, item-N2 the two new (child) nucleus items
      § Remove(item-O, levelNo+1)
      § Insert(item-N1, levelNo+1)
      § Insert(item-N2, levelNo+1)
   o Else if the nucleus item is changed within cell-O then do:
      § Let item-O and item-N be the old and new nucleus items.
      § Remove(item-O, levelNo+1)
      § Insert(item-N, levelNo+1)
Ø Return.

The function PreemptiveCellSearch implements the Pre-emptive cell search algorithm for finding the optimum (owner) cell on the level at which the insertion occurs. The traditional cell search technique, MS-Nucleus, which is used in the M-tree and its derivatives, depends on a simple heuristic: the closest nucleus (routing) item (object) is assumed to yield the best sub-tree during the descent, and the item is finally appended to the best (owner) cell found. Let d(·) be the similarity distance function, O the item (object) to be inserted, and O_N^i and r(O_N^i) the nucleus object and its covering radius for the i-th cell, C_i, respectively. In particular, the M-tree considers two cases:
Case 1. If no nucleus item for which d(O, O_N^i) ≤ r(O_N^i), ∀C_i, exists, the choice is made so as to minimize the increase of the covering radius, i.e. Δ_i = d(O, O_N^i) − r(O_N^i), ∀C_i, among all the nucleus objects in the owner cell C.

Case 2. If there exists a nucleus item for which d(O, O_N^i) ≤ r(O_N^i), ∀C_i, then its sub-tree is tracked in the lower level. If multiple sub-trees (nucleus objects) with this property exist, the one to which object O is closest is chosen.

Both cases fail to track the closest (most similar) object in the lower level, as shown in the sample illustration in Figure 54. In this figure, O_N^1 and O_N^2 are the nucleus (routing) objects representing the lower-level cells C_1 and C_2 on the upper level. In both cases the MS-Nucleus technique tracks down the sub-tree of O_N^2, that is, the cell C_2, as a result of the cases expressed above. However, in the lower level the closest (most similar) object is item c (since d_1 < d_2), which is a member of the cell C_1.

Figure 54: M-tree rationale used to determine the most suitable nucleus (routing) object in the two possible cases (Case 1: Δ_2 < Δ_1 ⇒ O → C_2; Case 2: d_2 < r(O_N^2) ⇒ O → C_2). Note that in both cases the rationale fails to track the closest nucleus object on the lower level.

A novel pre-emptive cell search algorithm is developed to perform a pre-emptive analysis on the upper level in order to find all nucleus objects that are likely to yield the closest (most similar) objects on the lower level. Note that on the upper level we have no information about the items in the cells C_1 and C_2; yet we can set appropriate pre-emptive criteria to fetch all the nucleus items whose cells should be analyzed to track the closest item (item c in this particular example) on the lower level. Let d_min be the distance to the closest nucleus item (on the upper level). The rationale of the pre-emptive cell search can be expressed as follows:

Case 1. If no nucleus item for which d(O, O_N^i) ≤ r(O_N^i), ∀C_i, exists, then fetch all nucleus items whose cells on the lower level may provide the closest object, i.e. those with Δ_i = d(O, O_N^i) − r(O_N^i) ≤ d_min, ∀C_i, among all nucleus objects in the owner cell C.

Case 2. If one or more nucleus item(s) for which d(O, O_N^i) ≤ r(O_N^i), ∀C_i, exist, then consider all of them, since any of their owner cells on the lower level may provide the closest object.

Since Case 1 implies Case 2, the former alone can be used as the only criterion to fetch all nucleus items for tracking. Accordingly, the pre-emptive cell search algorithm, PreemptiveCellSearch, can be expressed as follows:

PreemptiveCellSearch (ArrayCS[], nextItem, curLevelNo)
Ø By searching ∀O_N^i : O_N^i ∈ C_i ∧ ∀C_i ∈ ArrayCS, find the most similar item, item-MS, and d_min.
Ø If (curLevelNo = levelNo + 1) then do:
   o Let cell-MS be the owner cell of item-MS in the (target) level (with level number levelNo)
   o Return cell-MS
Ø Create a new array for cell search: NewArrayCS[] = ∅
Ø For ∀O_N^i : O_N^i ∈ C_i ∧ ∀C_i ∈ ArrayCS, do:
   o If (Δ_i = d(O, O_N^i) − r(O_N^i) ≤ d_min) then do:
      § Find the owner cell of (nucleus) item O_N^i in the lower level: cell-C_N^i
      § Append cell-C_N^i into NewArrayCS[]
Ø End loop.
Ø Return PreemptiveCellSearch(NewArrayCS[], nextItem, curLevelNo−1)
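The following Python sketch restates this recursion under simplifying assumptions: each candidate cell exposes its items, and each item (being the nucleus of a cell on the level below) carries a `covering_radius` and a reference `cell_below`. These attribute names and the `dist` callback are illustrative, not taken from MUVIS.

```python
def preemptive_cell_search(cells, next_item, cur_level_no, target_level_no, dist):
    """Sketch of the Pre-emptive cell search. `cells` are the candidate cells on
    the current level; every item in them is the nucleus of a cell below."""
    # Rank every nucleus item in the candidate cells against next_item.
    ranked = [(dist(next_item, o), o) for c in cells for o in c.items]
    d_min, item_ms = min(ranked, key=lambda t: t[0])

    if cur_level_no == target_level_no + 1:
        # One level above the target: item_ms represents the owner cell into
        # which next_item will be inserted.
        return item_ms.cell_below

    # Pre-emptive criterion: keep every sub-tree that may still contain the
    # closest item on the lower level, i.e. d(O, O_N^i) - r(O_N^i) <= d_min.
    candidates = [o.cell_below for d, o in ranked
                  if d - o.covering_radius <= d_min]
    return preemptive_cell_search(candidates, next_item,
                                  cur_level_no - 1, target_level_no, dist)
```

The only difference from MS-Nucleus is the candidate set: instead of committing to the single closest nucleus, every nucleus whose lower bound d(O, O_N^i) − r(O_N^i) does not exceed d_min is retained, so the sub-tree holding the truly closest item cannot be discarded on the way down.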
At each level, while descending towards the target level, such a pre-emptive analysis fetches all nucleus items whose owner cells may provide the "most similar" nucleus item for the lower level. The Pre-emptive cell search terminates its recursion one level above the target level and returns the (final) most similar nucleus item together with its owner cell on the target level, into which nextItem should be inserted. This achieves an optimum insertion scheme in the sense that the owner cell found on the target level contains the closest nucleus item with respect to the item to be inserted (i.e. nextItem); therefore, along with the mitosis operations, which are used to improve the compactness of a cell, the Pre-emptive cell search based item insertion algorithm further improves the cell compactness.

6.2.3.2 Item Removal Algorithm for HCT

This is another level-based operation, which does not require any cell search. However, upon its completion it may cause several post-processing operations affecting the overall HCT body. As explained earlier, if multiple items are to be removed from a particular (target) level, the items are first grouped into one or more sub-sets according to their owner cells, and each sub-set can then be conveniently removed from the HCT body within a single operation. Therefore, without loss of generality, we introduce the algorithmic steps assuming that all the items to be removed already belong to a single owner cell.

A significant post-complication occurs if all items in a particular cell are removed: the cell dies and is therefore completely removed from the host level. If a single cell is left on the target level, that level automatically becomes the new top level, and the level above (with its single cell holding a single, now removed, item) is also removed from the HCT body; hence the HCT height is reduced by one. The remaining post-processing steps are similar to the ones given for the item insertion algorithm: the owner cell can undergo a mitosis operation, and any change in the cell nucleus due to item removal, cell mitosis or cell death may require insertions of the new nucleus item(s) and/or removal of the old one(s). Let ArrayIR[] be the array of items (belonging to a single owner cell, say cell-O) to be removed from the (target) level with number levelNo. The Remove algorithm can then be expressed as follows:

Remove (ArrayIR[], levelNo)
Ø Let the top level number be topLevelNo and the single cell in the top level be cell-T
Ø Let the owner (target) cell in level levelNo be cell-O
Ø Remove the items in ArrayIR[] from cell-O
Ø Check cell-O for post-processing:
   o If cell-O is depleted (cell death) then do:
      § If (levelNo = topLevelNo) then do:
         • Remove cell-O = cell-T
         • Remove the top level from the HCT body
      § Else do:
         • Let item-O be the old nucleus item
         • Remove(item-O, levelNo+1)
   o Else if cell-O is split then do:
      § Let item-O be the old nucleus item and item-N1, item-N2 the two new nucleus items.
      § Remove(item-O, levelNo+1)
      § Insert(item-N1, levelNo+1)
      § Insert(item-N2, levelNo+1)
   o Else if the nucleus item is changed within cell-O then do:
      § Let item-O and item-N be the old and new nucleus items.
      § Remove(item-O, levelNo+1)
      § Insert(item-N, levelNo+1)
Ø Return.
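Since removals are applied cell by cell, a caller holding an arbitrary set of items to delete from a level would first partition them by owner cell. A minimal sketch of this grouping step, assuming each item exposes an `owner_cell` attribute and `remove_fn` stands for the Remove algorithm above:

```python
from collections import defaultdict

def remove_items(items, level_no, remove_fn):
    """Group items of a target level by their owner cell, so that each group is
    removed from the HCT body in a single Remove call."""
    by_cell = defaultdict(list)
    for item in items:
        by_cell[item.owner_cell].append(item)
    for cell_items in by_cell.values():
        remove_fn(cell_items, level_no)
```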
6.2.4. HCT Indexing

In order to index (construct an HCT body for) a multimedia database, the database should contain at least one feature extracted according to the genre of its multimedia items, i.e. "visual" for images and video clips and "aural" for audio and video clips. According to the features present in a database, two different genres of indexing can be performed: visual and aural indexing (HCT). If a database is indexed both visually and aurally, then apart from the genre-specific similarity distance implementation there is no difference from the algorithmic point of view; the exact same algorithmic approach presented in this chapter is applied in both cases. If multiple features and/or sub-features are present, any suitable combination of these features can be used for indexing. Once the indexing operation is completed, both genres of query and browsing operations (visual and aural) can be performed over that database.

There are mainly two distinct operations for HCT indexing: the incremental construction of the HCT body and an optional periodic fitness check over it. In the following sub-sections, we present the algorithmic details of both operations.

6.2.4.1 HCT Incremental Construction

Let G represent the indexing genre, which can equally be visual or aural for a multimedia database, and let ArrayI<G> be the array containing the items that are to be appended to the database, D, according to the indexing genre. Initially, D may or may not have an HCT indexing body. If it does not, all (valid) items within D are inserted into ArrayI<G> and a new HCT body is constructed; otherwise, the available HCT body is first loaded, activated and updated for the newcomers present in ArrayI<G>. Accordingly, the HCT indexing body construction algorithm, HCTIndexing, can be expressed as follows:

HCTIndexing (ArrayI<G>, G, D)
Ø Load and activate the HCT indexing body in genre G for database D.
Ø For ∀O_G^i ∈ ArrayI<G>, do:   // For all items in the array, perform incremental insertion.
   § Insert(O_G^i, 0)   // Insert the i-th item into the ground level of the HCT body.
Ø End loop.

6.2.4.2 HCT (Periodic) Fitness Check

The HCT fitness check is an optional operation that can be performed periodically during or after the indexing operation. The objective is to reduce the "crowd effect" by removing redundant immature cells from the HCT body. Due to the insertion ordering of the items, one or a few minor groups of items may form a cell at the initial stages of the incremental construction. Later on, some other major cells might become suitable containers for the items that got trapped within those immature cells. So the idea is to remove all immature cells and feed their items back to the system, expecting that some other mature cell might now be a better host for them. Note that even after such an item is inserted into the most suitable cell on the level, the host cell may still refuse it if its insertion causes a significant degradation in cell compactness and hence makes the cell split; in such a case, the original part of the host cell and the newcomer are separated into the two newborn cells. These are in fact "minority cases": no other (similar) cell exists to accept them, so they eventually have to form a new immature cell of their own. As a result, the basic idea is to reduce the immature cells that make the level "crowded" whilst respecting such minority cases.
The obvious expectation from this operation is to increase the percentage of mature cells, along with their item coverage, in a particular level without causing significant degradation of the overall compactness. Note that the periodic fitness check is applied to each level of the HCT body except the top level, since the latter contains only one cell. The HCT fitness check algorithm, FitnessCheck, can be expressed as follows:

FitnessCheck ( )
Ø Let l represent the level index.
Ø l = topLevelNo − 1   // Start from the highest level possible
Ø While l ≥ 0, do:   // For all levels, perform the fitness check
   o Let ArrayIR[] be the array for the removed items (which belong to an immature owner cell, C)
   o For ∀C : N_C < N_M ∧ C ∈ Level(l)   // for all immature cells of level l
      § For ∀O_C^i ∈ C   // for all items in cell C
         • Append O_C^i to ArrayIR[].
      § End loop.
      § Remove(ArrayIR[], l)
      § For ∀O_C^i ∈ ArrayIR, do:   // For all array items, perform incremental insertion
         • Insert(O_C^i, l)   // Insert the i-th item back into the l-th level of the HCT body.
      § End loop.
   o End loop.
   o Set l ← l − 1   // continue with the lower level
Ø End loop.

Note that the fitness check is applied to all levels except the top level, in decreasing order (higher levels are handled before lower levels). The reason is that an (incremental) insertion on a particular level requires a (Pre-emptive) cell search performed over all of the higher levels; performing the fitness check on them first therefore improves the performance of the fitness check operations performed on the lower levels.

6.3. PQ OVER HCT

As presented in Chapter 5, the Progressive Query (PQ) is a novel retrieval scheme which presents Progressive Sub-Query (PSQ) retrieval results periodically to the user and allows the user to interact with the ongoing query process. Compared with other traditional query techniques, such as the exhaustive-search-based Normal Query (NQ), which can only be used for databases without any indexing structure, or kNN and range queries for indexed databases, PQ presents significant innovative features; therefore, HCT is designed and optimized especially for PQ.

PQ can be executed over databases without an indexing structure, and in this context it presents an alternative to the traditional query type, NQ, which is usually performed for the sake of simplicity. As a retrieval process, PQ can also be performed over indexed databases as long as a query path (QP) can be formed over the clusters (partitions) of the underlying indexing structure. Obviously, the QP is nothing but a special sequence of the database items, and it can be formed in any convenient way, such as sequentially (starting from the first item towards the last one, or vice versa) or randomly, when the database lacks an indexing structure. Otherwise, the most advantageous way to perform PQ is to use the indexing information so that the most relevant items can be retrieved in the earliest periodic steps of PQ. The PQ operation over HCT is executed synchronously over two parallel processes: the HCT tracer, and a generic process for PSQ formation using the latest QP segment. The HCT tracer is a recursive algorithm which traces among the HCT levels in order to form a QP segment for the next PSQ update. When the time allocated for this operation is completed, the process is paused, and the next PSQ retrieval result is formed and presented to the user.
Then the HCT tracer is re-activated for the next PSQ update, and both processes stay alive until the user stops the PQ or the entire PQ process is completed (i.e. when all of the indexed database items are covered).

6.3.1. QP Formation from HCT

As mentioned briefly, the QP is formed segment by segment for each and every PSQ update. Once a QP segment is formed, the periodic sub-query results are obtained within this segment (group of items), and the retrieval result (the sorted list of items) is then fused with the last PSQ update to form the next PSQ retrieval result. Starting from the top level, the HCT tracer algorithm recursively traces among the levels and their cells according to the similarity of the cell nuclei. This is similar to the MS-Nucleus cell search process, except that this time the sweep does not stop when the "most similar" cell on the ground (target) level is found; it continues by visiting the 2nd most similar, then the 3rd, and so on, while appending all the items of the ground-level cells it visits to the current QP segment. Starting from the top level, for each cell it visits on an intermediate level (any level except the ground level), the HCT tracer forms a priority (item) queue, which ranks the cell items according to their similarity to the query item. Note that these items are nothing but the nucleus items of the cells on the lower level; hence, on the lower level, the cell "tracing" order is determined by the priority queue formed on the upper level from their representative (nucleus) items. When the tracing among the cells on the lower level is completed (i.e. when the priority queue of the cell in a particular level is depleted), the HCT tracer retreats to the upper-level cell from which it originated. The process terminates when the priority queue of the top-level cell is depleted, which means that the whole HCT body has been traced. Within the implementation of the HCT tracer, we further develop an internal structure that prevents redundant similarity distance calculations; that is, similarity distances between items of the cells on intermediate levels are calculated only once and reused on the lower levels whenever needed. In fact, this is a general property of the PQ operation: all the (computationally) costly operations, such as similarity distance calculations, loading item features from disc into system memory, etc., are performed only once and shared between the processes whenever needed. The following HCTtracer algorithm implements the HCT tracer operation, which basically extracts the next QP segment into a generic array, ArrayQP[]. It is initially called with the top-level number (topLevelNo) and an item (item-MS) from the single cell on the top level: HCTtracer(ArrayQP[], topLevelNo, item-MS). Let item-MS be the (next) most similar item to the query item, item-Q, on the (target) level indicated by a number, levelNo. The HCTtracer algorithm can then be expressed as follows:

HCTtracer (ArrayQP[], levelNo, item-MS)
Ø Let cell-MS be the owner cell of item-MS.
Ø If (levelNo = 0) then do:   // if this is the ground level
   o Append all items in cell-MS into ArrayQP[].
   o Return.
Ø Else do:   // if this is an intermediate level
   o Create the priority queue of cell-MS: queue-MS.
   o For ∀O_N^i ∈ queue-MS, do:   // for all sorted (nucleus) items
      § HCTtracer(ArrayQP[], levelNo−1, O_N^i)
Ø Return.
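A compact Python rendering of this recursion may make the traversal order easier to follow; the attribute names (`owner_cell`, `items`) and the `dist` callback are assumptions of this sketch, and the pause/resume thread mechanics of the actual implementation are omitted.

```python
def hct_tracer(array_qp, level_no, item_ms, item_q, dist):
    """Sketch of the HCT tracer recursion: appends ground-level cells to the
    query path in decreasing order of nucleus similarity to the query item."""
    cell_ms = item_ms.owner_cell
    if level_no == 0:                          # ground level: emit the whole cell
        array_qp.extend(cell_ms.items)
        return
    # Intermediate level: the cell items are nuclei of lower-level cells; trace
    # their sub-trees in order of similarity to the query item.
    for nucleus in sorted(cell_ms.items, key=lambda o: dist(item_q, o)):
        hct_tracer(array_qp, level_no - 1, nucleus, item_q, dist)
```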
Note that the HCTtracer algorithm is executed as a separate process (thread) and can be paused externally by the main PQ process when the time comes for the next PSQ update. An example HCT tracer process is illustrated in Figure 55 for the sample HCT body shown in Figure 53.

Figure 55: QP formation on a sample HCT body for a query item Q (the resulting query path is QP(Q) = b, r, p, s, f, c, i, e, j, k, l, m, n, d, g, h, a, o).

6.3.2. PQ Operation over HCT

Once the QP segments are formed, the PQ operation executed over the HCT body becomes similar to the sequential PQ illustrated in Figure 37. There are two main differences: each database sub-set shown in Figure 37 is now replaced by the particular QP segment created by the HCT tracer process, and a particular (say the (q+1)-st) periodic sub-query period value should be reformulated according to expression (31). In order to present the overall PQ algorithm over an HCT body, let HCTfile be the name of the file in which the HCT body is stored along with the database for which it was extracted, and let t = t_p^0 be the user-defined period value. The PQoverHCT algorithm can then be expressed as follows:

PQoverHCT (HCTfile, t_p^0)
Ø Load the HCTfile to activate the HCT body of the database.
Ø Create a timer, which signals this process every t = t_p^0 milliseconds.
Ø Create a process (thread) for the HCT tracer.
Ø Set q = 0.
Ø While (timer<t_p^q> ticks) do:
   o Pause the HCT tracer process.
   o Retrieve the QP segment as a periodic sub-query result.
   o Fuse the periodic sub-query result with the last PSQ result to form the next PSQ update.
   o Render the next PSQ update to the user.
   o Update the t_p value for the next ((q+1)-st) PSQ period as given in expression (31). Reset the timer<t_p^(q+1)>.
   o Set q ← q+1.
   o Re-activate the HCT tracer process.
Ø End loop.
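The two-process structure above can be mimicked in a few lines of Python. In this sketch the tracer runs as a daemon thread, the QP segment is simply snapshotted under a lock instead of pausing the thread, the period is kept fixed rather than updated via expression (31), and `trace_segment`, `fuse` and `render` are assumed callbacks standing for the HCT tracer, the result fusion and the GUI rendering.

```python
import threading
import time

def pq_over_hct(hct, query_item, t_p_ms, trace_segment, fuse, render):
    """Sketch of PQ over HCT: a tracer thread extends the query path while a
    periodic loop turns each new QP segment into the next PSQ update."""
    qp_segment, lock, done = [], threading.Lock(), threading.Event()

    def tracer():
        for item in trace_segment(hct, query_item):   # yields items in QP order
            with lock:
                qp_segment.append(item)
        done.set()

    threading.Thread(target=tracer, daemon=True).start()
    psq_result, q = [], 0
    while not done.is_set():
        time.sleep(t_p_ms / 1000.0)                    # wait one PSQ period
        with lock:                                     # snapshot the new segment
            segment, qp_segment[:] = list(qp_segment), []
        psq_result = fuse(psq_result, segment, query_item)
        render(psq_result, q)                          # present the PSQ update
        q += 1
    render(fuse(psq_result, qp_segment, query_item), q)   # final update
```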
6.4. HCT BROWSING

Generally speaking, there are two ways of retrieval from a (multimedia) database: through a query process, such as query by example (QBE), and browsing. A query is a search-and-retrieve type of process and is bound to strictly defined rules and algorithmic steps. It is a retrieval race against time, so as to provide the "most relevant" results in the "earliest" possible time given an "example" query item. However, such a scheme, by its nature, has limitations and drawbacks. The user may not have a definitive idea of what he/she is looking for, and even with a clear idea about the query item, finding a relevant example may not be trivial. Therefore, the problem of locating one or more initial query examples can be addressed by a particular browsing scheme. Database browsing, on the other hand, is a loose process, which usually requires continuous interaction and feedback from the user; it is therefore a kind of free navigation and exploration among the database items. Yet it has a purpose of its own: to reach (classes of) items in a systematic and efficient way, even though those items may be only vaguely defined. Therefore, it is the browsing algorithm's responsibility to organize the database in such a way that the "unknown" parameters of any browsing action can be resolved as efficiently as possible.

Since browsing requires the capability of handling the entire database, a particular visualization system (for visual databases) and tool(s) for navigation should be provided; otherwise, browsing can turn out to be a disorienting process for the user. For this reason, it is essential to provide an organized (perhaps hierarchical) map of the entire database, along with the current status of the user (e.g. a "You are here!" sign), during the browsing process.

In order to assist browsing, the database items should be organized in some way. Particularly for large databases, a hierarchical representation of the database may provide natural support for free navigation between the "levels" of the hierarchy, such as traversing in and out among the levels. Several browsing methodologies have been proposed in the literature. Koikkalainen and Oja introduced TS-SOM [36], which is used in PicSOM [37] as a CBIR indexing structure; TS-SOM provides a tree-structured vector quantization algorithm. Other similar SOM-based approaches were introduced by Zhang and Zhong [80], and Sethi and Coman [62]. All SOM-based indexing methods rely on training of the levels using the feature vectors, and each level has a pre-fixed node size that has to be arranged according to the size of the database. This brings a significant limitation: they are all static indexing structures, which do not allow dynamic construction or updates for a particular database. Retraining and costly reorganizations, i.e. rebuilding the whole indexing structure from scratch, are required each time the content of the image database changes (i.e. new insertions or deletions occur).

In the previous section, we presented an efficient query method, the PQ implementation over the proposed indexing scheme, HCT. Moreover, HCT can provide the basis for an efficient browsing scheme, namely HCT Browsing, which is designed to address such limitations and problems. As explained earlier, HCT is a dynamic indexing structure, which allows incremental item insertions and removals. It has virtually no parameter dependency, not even on the database size. The cell sizes are also dynamic, and can vary according to the coverage or the amount of a particular content, all of which is intended to be stored in one (or the least possible number of) cell(s) on the ground level. The hierarchic structure of HCT can be used to present the user with an overview of what lies under the current level. Therefore, provided that a user-friendly GUI is designed, HCT Browsing can turn into a guided tour among the database items. When the user initiates it, it is designed to show the items in the cell at the top level, which gives the first clue about what the database contains, or more specifically a brief summary of it. The next logical step for the user is to choose an item of interest among the items in this cell and start the tour downwards. So this is the first functionality that HCT Browsing provides: choosing an item on an upper level and tracing down to see the cell it represents (as the nucleus item of that cell). As long as the chosen item belongs to the current cell, the "level down" option leads to the lower-level cell represented by that item; otherwise, the first cell on the lower level is shown by default.
The opposite functionality, "level up", leads to the upper-level cell, which is the owner cell of the nucleus item of the host cell at the current level; it works at all levels except the top level. This is also a useful functionality for seeing and visiting the cells similar to a particular cell. Therefore, the (slight) variations of a particular content can be visited using both level functionalities. In addition to such inter-level navigation options, HCT Browsing provides inter-cellular navigation within a certain level. The user can visit the cells sequentially, in the forward or backward direction, one cell at a time. This is especially useful when the user does not have any particular target content in mind and may just be "looking" for interesting items. In such an indefinite or "open-ended" exploration task, navigating through the consecutive cells in a certain level summarizes the overall database content, and the amount of summarization obviously depends on the "height" of the navigation performed, that is, simply the current level number.

Figure 56: HCT Browsing GUI in the MUVIS MBrowser application (showing the nucleus item, the level and cell controls, the cell items, HCT info, the item navigator (Prev/Next) buttons, and the cell MST info).

Figure 56 shows a snapshot of the MUVIS application, MBrowser, where HCT Browsing is implemented. Depending on the HCT indexing genre (i.e. visual or aural), the aforementioned functionalities of HCT Browsing are supported by means of a Control Window along with the GUI of MBrowser. In this example, the database used is Corel_1K with 1000 images. Color (HSV and YUV histograms) and texture (Gray Level Co-occurrence Matrix) features are extracted and HCT indexing is then performed. As shown in the bottom-left part of the figure, the Control Window allows the user to perform inter-level and inter-cellular navigations. Furthermore, it gives some logs and useful information to expert users about the HCT body in general, and particularly about the current cell and its MST. So any (expert) user can examine the cell compactness, which items are connected to each other within the MST, the nucleus item, and, most important of all, whether or not the current cell is compact and mature. In this example, by comparing the cell compactness feature with the current level compactness threshold (CThr_L = 246.3 and CF_C = 24.86 for this cell), it can easily be deduced that this is a highly focused cell. As a compact cell, its MST branch weights are quite low and lie within a close neighborhood, as expected. Additional important information that can be obtained from the Control Window is the reliability and discrimination factor of the visual (or aural) features, by visually (or aurally) inspecting the relevancy of the cell items along with their (minimal) connections to the cell.

Two examples of HCT Browsing with inter-level navigations are shown in Figure 57 and Figure 58. In both illustrations, the user starts browsing from the 3rd level of a 5-level-high HCT body. Due to space limitations, only the portion of the HCT body where the browsing operation is performed is shown. Note that in both examples, the HCT indexing scheme provides more and more "narrowed" content in the descending order of the levels. For example, the user chooses an "outdoor, city, architecture" content on the third level, which yields a cell carrying "outdoor, city, beach and buses" content on the second level.
He/she then chooses a multi-color "bus" image; navigating down to the first level yields a cell that contains mostly "buses" of different colors, and finally choosing a "red bus" image (nucleus item) yields the cell of "red buses" on the ground level. Another example can be followed through: "outdoor, city, architecture" → "outdoor, city, beach and buses" → "beach and mountains" → "beaches". A similar series of examples can also be seen in the sample HCT Browsing operation within a texture database: the cells get more and more compact (focused) in the descending order of the levels, and the ground-level cells achieve a "good" clustering of texture images showing high similarity.

Figure 57: An HCT Browsing example starting from the third level within the Corel_1K MUVIS database. The user navigates among the levels (Level 3 down to Level 0) along the lines shown, through to the ground level.

Figure 58: Another HCT Browsing example starting from the third level within the Texture MUVIS database. The user navigates among the levels (Level 3 down to Level 0) along the lines shown, through to the ground level.

6.5. EXPERIMENTAL RESULTS

The experiments performed in this chapter use 7 different MUVIS multimedia databases, as presented in Section 2.1.4: 1) the Open Video database, 2) the Real World audio/video database, 3) the Sports hybrid database, 4) the Corel_1K image database, 5) the Corel_10K image database, 6) the Shape image database, and 7) the Texture image database. All experiments are carried out on a Pentium 4 3.06 GHz computer with 2048 MB of memory. In order to have unbiased evaluations, each query experiment is performed using the same queried multimedia item with the same instance of the MBrowser application. The evaluation of the retrieval results of PQ is performed subjectively using the ground-truth method, i.e. a group of people evaluates the query results of a certain set of retrieval experiments, and only those on which all the group members unanimously agree about the retrieval performance are retained. Among these, a certain set of examples was chosen and is presented in this section for visual inspection and verification.

6.5.1. Performance Evaluation of HCT Indexing

In this section, the performance evaluation is based on the cell search algorithm and the fitness check. The fitness check operation is performed only once, as a post-processing step after the completion of the incremental construction of the HCT body. To examine the "quality" and "compactness" of the clustering, especially at the ground level where the entire database is stored, the statistics listed in the left column of Table XI are used. First, we explain the details of the statistical data and the analysis performed over the key algorithms of the HCT indexing operation. Afterwards, the performance evaluation is presented based on the statistical data given in Table XI.

Table XI: Statistics obtained from the ground level of HCT bodies extracted from the sample MUVIS databases.
(Rows: statistic × cell search algorithm × fitness check (FC) status. Columns: Real World Video | Real World Audio | Sports Image | Sports Video | Corel_1K | Corel_10K | Shape | Texture. NaN entries correspond to cases where no mature cells were formed.)

Mature Cell %
  MS-Nucleus, before FC:   7.246 |  9.052 | 15.929 |  0.000 | 15.962 | 18.228 | 13.694 | 31.783
  MS-Nucleus, after FC:   17.526 | 14.286 | 28.261 | 15.686 | 19.324 | 34.654 | 50.877 | 44.444
  Pre-emptive, before FC:  5.848 |  6.792 | 12.308 |  1.389 | 16.393 | 13.937 | 12.760 | 28.114
  Pre-emptive, after FC:   9.938 | 12.389 | 25.510 | 10.909 | 24.631 | 25.618 | 19.048 | 41.048

Item % in Mature Cells
  MS-Nucleus, before FC:  24.062 | 21.303 | 44.040 |  0.000 | 39.700 | 47.865 | 41.143 | 64.091
  MS-Nucleus, after FC:   37.748 | 30.451 | 61.818 | 32.000 | 44.600 | 65.116 | 75.214 | 72.898
  Pre-emptive, before FC: 23.179 | 20.802 | 40.404 |  3.500 | 44.200 | 40.729 | 42.357 | 61.705
  Pre-emptive, after FC:  36.865 | 30.952 | 63.030 | 27.000 | 54.400 | 56.465 | 52.571 | 70.000

Average Compactness
  MS-Nucleus, before FC:  173.328 | 2299.667 | 255.417 |    NaN | 152.672 | 289.694 |  23.636 | 0.038
  MS-Nucleus, after FC:   192.968 | 2687.994 | 304.267 | 10.989 | 145.581 | 384.193 | 157.74  | 0.048
  Pre-emptive, before FC: 112.148 | 2172.990 | 134.299 | 11.586 |  82.440 |  65.020 |  12.097 | 0.011
  Pre-emptive, after FC:  128.164 | 2096.814 | 195.721 |  9.378 |  99.905 |  77.587 |  14.104 | 0.015

Average Covering Radius
  MS-Nucleus, before FC:  1.238 | 2.448 | 1.089 |   NaN | 1.049 | 1.206 | 0.580 | 0.127
  MS-Nucleus, after FC:   1.229 | 2.449 | 1.156 | 0.588 | 1.042 | 1.293 | 0.817 | 0.133
  Pre-emptive, before FC: 1.068 | 2.357 | 0.961 | 0.501 | 0.935 | 0.863 | 0.504 | 0.098
  Pre-emptive, after FC:  1.093 | 2.514 | 1.077 | 0.505 | 0.986 | 0.925 | 0.539 | 0.107

Average Broken Branch Weight
  MS-Nucleus, before FC:  1.015 | 2.351 | 1.025 | 0.558 | 1.014 | 1.200 | 0.668 | 0.147
  MS-Nucleus, after FC:   1.034 | 2.387 | 1.119 | 0.585 | 1.031 | 1.255 | 0.707 | 0.149
  Pre-emptive, before FC: 0.890 | 2.278 | 0.915 | 0.517 | 0.861 | 0.794 | 0.540 | 0.109
  Pre-emptive, after FC:  0.883 | 2.281 | 0.980 | 0.531 | 0.872 | 0.805 | 0.549 | 0.109

Average Cell Size
  MS-Nucleus, before FC:  3.283 | 3.440 | 4.381 | 3.226 | 4.695 | 4.619 | 4.459 | 6.822
  MS-Nucleus, after FC:   4.670 | 3.931 | 5.380 | 3.922 | 4.831 | 6.231 | 8.187 | 8.148
  Pre-emptive, before FC: 2.649 | 3.011 | 3.808 | 2.778 | 4.098 | 4.097 | 4.154 | 6.263
  Pre-emptive, after FC:  2.814 | 3.531 | 5.051 | 3.636 | 4.926 | 5.321 | 5.128 | 7.686

Average Mature Cell Size
  MS-Nucleus, before FC:  10.900 |  8.095 | 12.111 |   NaN | 11.676 | 12.128 | 13.395 | 13.756
  MS-Nucleus, after FC:   10.059 |  8.379 | 11.769 | 8.000 | 11.150 | 11.708 | 12.103 | 13.365
  Pre-emptive, before FC: 10.500 |  9.222 | 12.500 | 7.000 | 11.050 | 11.973 | 13.791 | 13.747
  Pre-emptive, after FC:  10.438 |  8.821 | 12.480 | 9.000 | 10.880 | 11.727 | 14.154 | 13.106

HCT Construction Time (seconds)
  MS-Nucleus, before FC:   168.079 |  4359.849 | 1.847 |  79.600 | 2.310 |  46.162 |  23.365 | 3.000
  MS-Nucleus, after FC:    286.906 |  7674.978 | 3.034 | 158.287 | 4.147 |  72.456 |  31.020 | 3.989
  Pre-emptive, before FC:  541.073 | 15110.948 | 1.925 | 193.918 | 3.054 | 450.105 |  67.631 | 3.196
  Pre-emptive, after FC:  1548.585 | 31708.918 | 3.868 | 394.462 | 5.673 | 926.525 | 117.51  | 4.403

6.5.1.1 Statistical Analysis

Table XI presents several statistics per fitness check status (before and after) and per cell search algorithm: the proposed Pre-emptive versus the traditional MS-Nucleus. The first two statistics, the percentage of mature cells and their overall item coverage, are mainly chosen to show the effect of both algorithms on the maturity of the cells; furthermore, the effect of the fitness check can be clearly seen by examining these measures. The other three averaging statistics, compactness, nucleus distance (covering radius) and broken branch weight, concern the "quality" of the indexing scheme, that is, how focused (compact) the obtained cells are. For the (HCT) indexing of these databases, the following regularization function is used:

CF_C = f(µ_C, σ_C, r_C, max(w_C), N_C) = K µ_C σ_C r_C max(w_C) √N_C    (35)

where K is a scaling coefficient, and µ_C and σ_C are the mean and standard deviation of the MST branch weights, w_C, of the cell C.
r_C is the covering radius, that is, the distance from the nucleus within which all the cell items lie, and N_C (> N_M for a mature cell) is the number of items in cell C. Once the indexing operation is completed, the average compactness is calculated using the CF_C values of all mature cells on the ground level. We used N_M = 6 for maturity and K = 1000 in the experiments. The average covering radius is the conventional way to express cell compactness analytically; therefore, its average over the entire level is expected to be low for indexing operations targeting high quality. Finally, the average broken branch weight is obtained from a log of all mitosis operations performed so far on the ground level. It is an alternative criterion for measuring the overall level compactness, and a similar argument can therefore be made for this statistical parameter. Since the Pre-emptive cell search is primarily designed to improve the overall level compactness, its corresponding statistics are expected to be lower (indicating more focused cells), and significantly lower as the database size grows. This touches on the subject of scalability; thus, among the experimental databases given in Table XI, comparing these statistics for the Corel_1K and Corel_10K databases allows the scalability of each algorithm to be evaluated. The average mature cell size is used to ensure that the compactness evaluation is not biased by differences in mature cell size. As given in equation (35), the CF_C value for a cell C is proportional to the (square root of the) size of C; that is, if a cell is to carry more and more items, then the items should be more and more similar (focused) in order not to cause the cell to split. In other words, increasing the cell size must be compensated for (is only allowed) by more focused cell items (i.e. comparatively low values of µ_C, σ_C, r_C, max(w_C)) in order to keep the cell in one piece. Finally, the HCT construction time represents the basic cost of each operation as (CPU) computational time.

6.5.1.2 Performance Evaluation

The numerical results given in Table XI show that, regardless of which cell search algorithm is used, the fitness check operation usually improves the amount (percentage) of mature cells, and also the number of items stored in these cells, significantly, without degrading the overall compactness drastically. Such effects naturally cause a significant increase in the average cell size; however, the average mature cell size is only slightly affected. This means that the fitness check operation does not change the natural maturity level within the HCT body, so that the average size of similar item groups is kept intact. As expected, the Pre-emptive cell search achieves a major compactness improvement with respect to the MS-Nucleus algorithm. Its advantage increases further when the database is larger and the features are better (have a higher discrimination factor). Consider, for example, the statistics obtained from Corel_1K and Corel_10K in the "before fitness check" status (to discard the effect of the fitness check): the Pre-emptive cell search algorithm yields an average compactness of 82.44, compared with 152.672 for MS-Nucleus (see Table XI). Not surprisingly, the improvement in average compactness is even larger for the bigger database, Corel_10K (65.020 for Pre-emptive search vs. 289.694 for MS-Nucleus). Note that this is an unbiased comparison, since both algorithms yield close values for the average mature cell size on both databases.
Therefore, for the reasons explained earlier, the MS-Nucleus cell search algorithm induces corruption proportional to the database size. The Pre-emptive cell search algorithm, on the other hand, is not degraded by the increasing database size and therefore achieves significant scalability in this respect. This can be verified by comparing the average compactness and covering radius values obtained (82.440 vs. 65.020 and 0.935 vs. 0.863): the compactness level is, on the contrary, improved with the increasing database size, whereas the opposite is true for the MS-Nucleus cell search algorithm. Apart from the database size, the reliability (discrimination factor) of the feature(s) is also an important factor. Improved feature discrimination yields more robust similarity distance measures, which in turn leads to more focused cells obtained by the Pre-emptive cell search algorithm. Among the features used in the experiments reported in this thesis, the most reliable ones are the texture features (GLCM [49] and Gabor [40]). Hence, the (second) highest difference in compactness (more than three times) between the Pre-emptive and MS-Nucleus cell search algorithms can be seen in this (the Texture) database (0.011 vs. 0.038 in Table XI). The cost of using both the fitness check and the Pre-emptive cell search is the increased computational time for the construction of the HCT indexing structure. However, since indexing is an off-line process that is performed only once during the creation of the database, this cost is compensated for by the accuracy and time gains in query and browsing, both of which are online processes that are performed several times during the lifetime of the database.

6.5.2. PQ over HCT

Two different performance evaluations can be performed for PQ operations over the HCT indexing structure. First, the relevancy of the QP along which PQ proceeds can be examined from some typical QP (similarity distance) plots. These plots indicate whether the order of the items within the QP is formed in accordance with their similarity to the query item, so that the most similar items can be retrieved earliest. In Figure 59, a query image has a group of 97 relevant images among the 1000 images in the database, and in Figure 60 a query video has a group of 21 relevant video clips among 200 video clips. It can be seen from the figures that the HCT tracer successfully captures all of the relevant items at the beginning of the QP. Therefore, the PQ operation will present them (first) to the user immediately after the query operation is initiated. Another important remark concerns the "trend" of the QP plots: the QP traces along the increasing order of dissimilarity, as intended.

Figure 59: QP plot of a sample image query in the Corel_1K database.

Figure 60: QP plot of a sample video query in the Sports database.

The second performance evaluation concerns the speed (or timing) of the PQ over HCT operation compared with both the Sequential PQ and NQ. To this end, we performed 10 visual and 10 aural retrieval experiments using all three query methods, and we measured the time to retrieve at least 90% of all relevant items, which are subjectively determined using the ground-truth method within each database. We used t_p ≤ 3 sec, and the results are presented in Table XII.
It is not surprising that, over the 20 query operations performed, PQ over HCT achieves the fastest retrievals, and it yields the retrieval result in the first periodic update of PQ except for 4 aural retrievals. In those examples the audio feature could not provide sufficient discrimination, and hence the cell search within the HCT tracer (MS-Nucleus) fails to track the optimum sub-tree at the beginning. Note that this is the main and expected problem causing corruption in the indexing phase, as explained earlier.

Table XII: Retrieval times (in msec) for 10 visual and 10 aural query operations performed per query type (columns: Visual NQ | Visual Seq. PQ | Visual PQ over HCT | Aural NQ | Aural Seq. PQ | Aural PQ over HCT).

Q1=428:   19624 | 12007 | 3003 | 37675 | 21004 |  3001
Q2=466:    9407 |  5974 | 2003 | 29820 | 21005 |  3000
Q3=381:    5938 |  3997 | 2002 | 31977 | 18006 | 12003
Q4=705:    5922 |  6000 | 3139 | 36652 | 36023 |  1501
Q5=617:    7776 |  6003 | 2501 | 54869 | 45003 |  1500
Q6=291:    6774 |  4004 | 1501 | 39905 | 18050 | 12033
Q7=417:    7290 |  6002 | 3002 | 48553 | 30023 |  3002
Q8=784:    7799 |  2003 | 2003 | 58897 | 24003 |  9000
Q9=277:    4954 |  2001 | 1504 | 84921 | 82998 | 12000
Q10=603:   7039 |  3002 | 2001 | 61080 | 57057 |  9002

6.5.3. Remarks and Evaluation

As a brief summary, the following innovative properties achieved by HCT can be listed:
• HCT is a dynamic, parameter-independent, flexible cell (node) size indexing structure, which is optimized to achieve cells that are as focused as possible using visual and aural descriptors with limited discrimination factors.
• By means of the flexible cell size property, one cell (or the minimum number of cells) is used to store a group of similar items, which in effect reduces the degradations caused by the "crowd effect" within the HCT body.
• During their lifetime, the cells are under the close surveillance of their levels in order to enhance compactness, using mitosis operations whenever necessary to rid a cell of dissimilar item(s). Furthermore, for item insertions, an optimum cell search technique (Pre-emptive) is used to determine the most suitable (similar) cell on each level.
• HCT is also intrinsically dynamic, meaning that the cell and level parameters and primitives are subject to continuous update operations to provide the most reliable environment. For example, a cell nucleus item is changed whenever a better candidate is available, and once a new nucleus item is assigned, its owner cell on the upper level is found by a cell search instead of reusing the old nucleus's owner cell. Such dynamic internal behaviour keeps the HCT body intact by preventing the potential sources of corruption.
• By means of the MST within each cell, the optimum nucleus item can be assigned whenever necessary and at no cost. Furthermore, the optimum split management can be carried out when the mitosis operation is performed (again at no cost). Most important of all, the MST provides a reliable compactness measure via the "cell similarity" of any item, instead of relying on only a single (nucleus) item; in this way a better judgment can be made as to whether or not a particular item is suitable for a mature cell.
• HCT is particularly designed to work with PQ in order to provide the earliest possible retrieval of relevant items.
• Finally, the HCT indexing body can be used for efficient browsing and navigation among the database items. The user is guided at each level by the nucleus items, and the several hierarchic levels of summarization help the user to form a "mental picture" of the entire database.
Experimental results presented earlier demonstrate that HCT achieves all the above-mentioned properties and capabilities in an automatic and parameter-invariant way. It further achieves significant improvements in cell compactness and shows no sign of corruption as the database size grows. Assuming a stable content distribution, the cells keep approximately the same level of compactness when the database size is increased significantly (i.e. tenfold). The analysis obtained from different databases suggests that HCT usually yields better clustering performance when the discrimination factor of the features is sharper and they provide better relevancy from the user's point of view.

Chapter 7 Conclusions

Multimedia management has been, and always will be, a formidable challenge of human-centric computing. It requires efficient consumption of multimedia primitives along with several distinct operations performed during their life cycle. These operations, such as content analysis, indexing, retrieval, summarization, etc., should all be involved in an efficient framework in order to achieve a global and generic basis for efficient management. In particular, the variations in digital multimedia parameters and formats increase the complexity of the problem for a global approach. On the other hand, the current level of Artificial Intelligence makes training-based methods, such as recognition and identification, infeasible for yielding a generic solution. Moreover, traditional early attempts such as text-based query methods are also far from bringing a content-based solution to the problem, for two main reasons: first, text annotations are user dependent and may vary among different people; second, their feasibility is limited to small databases, since annotation requires a significant amount of laborious manual work. Therefore, designing feasible, yet global and generic, techniques for content-based multimedia management became the primary objective of this thesis. Having defined the primary motivation behind developing the MUVIS framework as such, MUVIS is further designed as a self-sufficient test-bed platform for developing modular aural and visual feature extraction methods, novel indexing and retrieval techniques, and efficient browsing capabilities with its user-friendly GUI design, in addition to several other functionalities such as scalable video management, summarization, etc. The contributions of the thesis can then be summarized in four main parts: audio content analysis and content-based audio indexing and retrieval; the query technique called Progressive Query; the Hierarchical Cellular Tree as an efficient retrieval, indexing and browsing technique for multimedia databases; all of which have been successfully integrated into a modular framework, MUVIS.

At present, content-based multimedia indexing and retrieval has been an active research topic for roughly two decades. Within this context, several efficient methodologies have been developed for visual items such as images and video; however, the audio counterpart is still in its infancy, despite the fact that it can yield a better retrieval performance due to its unique and stable nature with respect to the content. Therefore, we focus our attention on automatic audio content analysis and on the design of an efficient aural indexing and retrieval framework.
The former technique is especially designed for and within the context of the latter, yet it achieves a significant classification and segmentation performance within a bi-modal and unsupervised (fully automatic) structure. By using it as the initial step, the proposed audio-based multimedia indexing and retrieval framework then becomes a major alternative, showing equal or better performance with respect to the visual counterpart. The positive experimental results may lead one to predict that audio may be the key to closing the “semantic gap”, which exists today between the low-level feature and the real semantic content of the audio-visual world. Consequently, future research studies in this field shall focus on extensions and improvements of the techniques developed in this thesis. Particular emphasis will be placed on additional features and enhanced summarization models extracted from the aural content. The major part of the management of the multimedia collections obviously requires powerful indexing, retrieval via query and browsing techniques. Although much work has been done for the development of such techniques, as it is shown with the extensive literature search in this thesis, most of the current techniques and systems have significant limitations and drawbacks especially for large multimedia databases. In this context, the thesis first focuses on the problem of an efficient query methodology. The proposed Progressive Query has been developed to achieve several innovative retrieval capabilities for databases with or without an indexing structure. Since the ultimate measure of any retrieval system performance is the satisfaction of the system user, the most important property achieved is therefore its continuous user interaction, which provides a solid control and enhanced relevance feedback mechanism along with the ongoing query operation. The experiments performed on databases without an indexing structure clearly demonstrate the superior performance achieved by PQ in terms of speed, minimum system requirements, user interaction and possibility of better retrievals as compared with the traditional and the only available query scheme, the NQ. Yet the ultimate objective, the retrieval of the most relevant items in the earliest possible time, cannot be fulfilled without the imminent role of an efficient indexing scheme. It is a fact that existing indexing methods have not coped with the requirements set by multimedia databases, such as the presence of multiple features possibly in high dimensions, the dynamic construction capability, the need for prevention of the corruption for the large-scale databases, robustness against low discrimination power of low-level features, etc. A novel indexing structure, the Hierarchical Cellular Tree, is presented to accomplish these requirements. Moreover, it is experimentally verified in this thesis that it can work in harmony with PQ to retrieve at an early stage the most relevant items to the user’s query. Visualization of multimedia primitives and the query by example (searching a specific item of interest) operations are the flip sides of a typical retrieval scheme. The user may want to switch back and forth between the two modes, as provided by the enhanced GUI of MUVIS. Furthermore, the user may need an efficient browsing scheme to reach the item(s) of interest that can then be used for a query operation. 
However, the user may have neither a definitive idea about what exactly the item of interest must be, nor where it can be found, hence, a navigation process should be entirely guided. Such a “guided tour” among the database Conclusions 139 items along with a hierarchical summarization of the database is provided as a side feature of the HCT indexing body. Depending of the level of discrimination that the features can provide, it is shown that HCT indexing structure can well scale with the database size and this can yield a better browsing capability for the user. Due to the subjectivity and the human factor in any browsing operation, the performance analysis and evaluations are currently limited to the statistical measures taken from the HCT indexing body. A crucial future task will be to develop a common setting for benchmarking of indexing and browsing so that the performance of HCT can be assessed. The current status of the techniques proposed in this thesis promotes several interesting options and possibilities for further research. All the techniques within the context of multimedia management are designed and implemented to provide generic, automatic and global solutions. This is due to the direct consequences of the massive size and variations that can be seen in today’s multimedia collections. Even though they are developed independent from a certain multimedia, database, application and environment type, in a specific application domain, such a general approach might not be the optimal and some modifications are likely to be justified. Finally, the management of the ever-increasing, massive multimedia databases will still be a great challenge in the future. It is obvious that it can never be done manually, yet the occurrence of the so-called “semantic gap” with the fully automatic methods is unavoidable. Therefore, the efforts will be focused on the development of feasible methods for narrowing the “semantic gap” by having better and possibly higher level descriptors as well as designing a semi-automatic framework for providing semantic annotation for the multimedia content with a certain degree of human interaction. The ultimate goal in the latter is to improve the level of interactivity and interoperability with smarter GUI designs, and equally to minimize the amount of manual work. In order to accomplish this, some studies have already been started to improve the accuracy and performance of the fully automatic techniques whilst appending new capabilities such as interactive editing and semantic identification. 140 141 Bibliography [1] R. M. Aarts and R. T. Dekkers, “A Real-Time Speech-Music Dicriminator”, J. Audio Eng. Soc., vol. 47, No 9, pp. 720-725, September 1999. [2] D. Bainbridge, Extensible optical music recognition. PhD thesis, Department of Computer Science, University of Canterbury, New Zealand, 1997. [3] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger, “The R*-tree: An efficient and robust access method for points and rectangles”, In Proc. of ACM SIGMOD Int. Conf. on Management of Data, Atlantic City, US, pp. 322-331. 1990. [4] J. L. Bentley, “Multidimensional binary search trees used for associative searching”, In Proc. of Communications of the ACM, v.18 n.9, pp.509-517, September 1975. [5] S. Berchtold, C. Bohm, H. V. Jagadish, H.-P. Kriegel, J. Sander, ‘Independent Quantization: An Index Compression Technique for High-Dimensional Data Spaces’, In Proc. of the 16th Int. Conf. on Data Engineering, San Diego, USA, pp.577-588, Feb. 2000. [6] S. Berchtold , C. 
Bibliography
[1] R. M. Aarts and R. T. Dekkers, "A Real-Time Speech-Music Discriminator", J. Audio Eng. Soc., vol. 47, no. 9, pp. 720-725, September 1999.
[2] D. Bainbridge, Extensible Optical Music Recognition, PhD thesis, Department of Computer Science, University of Canterbury, New Zealand, 1997.
[3] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger, "The R*-tree: An efficient and robust access method for points and rectangles", In Proc. of ACM SIGMOD Int. Conf. on Management of Data, Atlantic City, US, pp. 322-331, 1990.
[4] J. L. Bentley, "Multidimensional binary search trees used for associative searching", Communications of the ACM, vol. 18, no. 9, pp. 509-517, September 1975.
[5] S. Berchtold, C. Böhm, H. V. Jagadish, H.-P. Kriegel, and J. Sander, "Independent Quantization: An Index Compression Technique for High-Dimensional Data Spaces", In Proc. of the 16th Int. Conf. on Data Engineering, San Diego, USA, pp. 577-588, February 2000.
[6] S. Berchtold, C. Böhm, and H.-P. Kriegel, "The pyramid-technique: towards breaking the curse of dimensionality", In Proc. of the 1998 ACM SIGMOD International Conference on Management of Data, pp. 142-153, Seattle, Washington, US, June 1-4, 1998.
[7] S. Berchtold, D. A. Keim, and H.-P. Kriegel, "The X-tree: An index structure for high-dimensional data", In Proc. of the 22nd International Conference on Very Large Databases (VLDB), 1996.
[8] S. Blackburn and D. DeRoure, "A Tool for Content Based Navigation of Music", In Proc. of ACM Multimedia, 1998.
[9] T. Bozkaya and Z. M. Ozsoyoglu, "Distance-Based Indexing for High-Dimensional Metric Spaces", In Proc. of ACM SIGMOD, pp. 357-368, 1997.
[10] K.-H. Brandenburg, "MP3 and AAC Explained", In Proc. of the AES 17th International Conference, Florence, Italy, September 1999.
[11] S. Brin, "Near Neighbor Search in Metric Spaces", In Proc. of the International Conference on Very Large Databases (VLDB), pp. 574-584, 1995.
[12] A. Bugatti, A. Flammini, and P. Migliorati, "Audio Classification in Speech and Music: A Comparison Between a Statistical and a Neural Approach", EURASIP Journal on Applied Signal Processing, vol. 2002, no. 4, pp. 372-378, April 2002.
[13] J. Canny, "A Computational Approach to Edge Detection", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 8, no. 6, pp. 679-698, November 1986.
[14] K. Chakrabarti and S. Mehrotra, "The hybrid tree: An index structure for high dimensional feature spaces", In Proc. of the Int. Conf. on Data Engineering, pp. 440-447, February 1999.
[15] S.-F. Chang, W. Chen, J. Meng, H. Sundaram, and D. Zhong, "VideoQ: An Automated Content Based Video Search System Using Visual Cues", In Proc. of ACM Multimedia, Seattle, 1997.
[16] F. A. Cheikh, "MUVIS: A System for Content-Based Image Retrieval", PhD thesis, Tampere University of Technology, Tampere, Finland, April 2004.
[17] F. A. Cheikh, B. Cramariuc, and M. Gabbouj, "Relevance feedback for shape query refinement", In Proc. of the IEEE International Conference on Image Processing, ICIP 2003, Barcelona, Spain, September 14-17, 2003.
[18] T.-C. Chou, A. L. P. Chen, and C.-C. Liu, "Music Databases: Indexing Techniques and Implementation", In Proc. of the 1996 International Workshop on Multi-Media Database Management Systems (IW-MMDBMS '96), p. 46, August 14-16, 1996.
[19] P. Ciaccia, M. Patella, and P. Zezula, "M-tree: an efficient access method for similarity search in metric spaces", In Proc. of the International Conference on Very Large Databases (VLDB), pp. 426-435, Athens, Greece, August 1997.
[20] M. J. Fonseca and J. A. Jorge, "Indexing High-Dimensional Data for Content-Based Retrieval in Large Databases", In Proc. of the Eighth International Conference on Database Systems for Advanced Applications (DASFAA '03), pp. 267-274, Kyoto, Japan, March 26-28, 2003.
[21] J. T. Foote, "Content-Based Retrieval of Music and Audio", In Proc. of SPIE, vol. 3229, pp. 138-147, 1997.
[22] A. Guttman, "R-trees: a dynamic index structure for spatial searching", In Proc. of ACM SIGMOD, pp. 47-57, 1984.
[23] ISO/IEC 11172-3, Coding of Moving Pictures and Associated Audio for Digital Storage Media up to about 1.5 Mbit/s, Part 3: Audio, 1992.
[24] ISO/IEC CD 14496-3 Subpart 4: 1998, Coding of Audiovisual Objects, Part 3: Audio, 1998.
[25] ISO/IEC 13818-3:1997, Information technology - Generic coding of moving pictures and associated audio information - Part 3: Audio, 1997.
[26] ISO/IEC JTC1/SC29/WG11, "Overview of the MPEG-7 Standard Version 5.0", March 2001.
[27] K. El-Maleh, M. Klein, G. Petrucci, and P. Kabal, "Speech/Music Discrimination for Multimedia Applications", In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 2445-2448, Istanbul, Turkey, 2000.
[28] A. Ghias, J. Logan, D. Chamberlin, and B. C. Smith, "Query By Humming", In Proc. of ACM Multimedia 95, pp. 231-236, 1995.
[29] R. L. Graham and P. Hell, "On the history of the minimum spanning tree problem", Annals of the History of Computing, vol. 7, pp. 43-57, 1985.
[30] R. Jarina, N. Murphy, N. O'Connor, and S. Marlow, "Speech-music discrimination from MPEG-1 bitstream", in V. V. Kluev and N. E. Mastorakis (eds.), Advances in Signal Processing, Robotics and Communications, WSES Press, pp. 174-178, 2001.
[31] N. Katayama and S. Satoh, "The SR-tree: an index structure for high-dimensional nearest neighbor queries", In Proc. of the 1997 ACM SIGMOD International Conference on Management of Data, pp. 369-380, Tucson, Arizona, US, May 11-15, 1997.
[32] A. Khokhar and G. Li, "Content-based Indexing and Retrieval of Audio Data using Wavelet", In Proc. of ICME, 2000.
[33] S. Kiranyaz, A. F. Qureshi, and M. Gabbouj, "A Generic Audio Classification and Segmentation Approach for Multimedia Indexing and Retrieval", In Proc. of EWIMT 2004, IEE European Workshop on the Integration of Knowledge, Semantics and Digital Media Technology, London, U.K., November 2004.
[34] S. Kiranyaz and M. Gabbouj, "A Novel Multimedia Retrieval Technique: Progressive Query (WHY WAIT?)", In Proc. of the WIAMIS Workshop, Lisboa, Portugal, April 2004.
[35] K. Lemström and P. Laine, "Musical Information Retrieval using Musical Parameters", In Proc. of the International Computer Music Conference, Ann Arbor, 1998.
[36] P. Koikkalainen and E. Oja, "Self-organizing hierarchical feature maps", In Proc. of the International Joint Conference on Neural Networks, San Diego, CA, 1990.
[37] J. T. Laaksonen, J. M. Koskela, S. P. Laakso, and E. Oja, "PicSOM - content-based image retrieval with self-organizing maps", Pattern Recognition Letters, vol. 21, no. 13-14, pp. 1199-1207, December 2000.
[38] L. Lu, H. You, and H. J. Zhang, "A New Approach to Query by Humming in Music Retrieval", In Proc. of ICME 2001, Tokyo, August 2001.
[39] L. Lu, H. Jiang, and H. J. Zhang, "A Robust Audio Classification and Segmentation Method", In Proc. of ACM Multimedia 2001, pp. 203-211, Ottawa, Canada, 2001.
[40] W. Y. Ma and B. S. Manjunath, "A Comparison of Wavelet Transform Features for Texture Image Annotation", In Proc. of the IEEE International Conf. on Image Processing, 1995.
[41] R. J. McNab, L. A. Smith, I. H. Witten, C. L. Henderson, and S. J. Cunningham, "Towards the digital music library: tune retrieval from acoustic input", In Proc. of ACM Digital Libraries '96, pp. 11-18, 1996.
[42] R. J. McNab, L. A. Smith, D. Bainbridge, and I. H. Witten, "The New Zealand Digital Library MELody inDEX", http://www.dlib.org/dlib/may97/meldex/05written.html, May 1997.
[43] MUVIS. http://muvis.cs.tut.fi/
[44] Muscle Fish LLC. http://www.musclefish.com/
[45] K. Lin, H. V. Jagadish, and C. Faloutsos, "The TV-tree: an index for high dimensional data", Very Large Databases (VLDB) Journal, vol. 3, no. 4, pp. 517-543, 1994.
[46] Y. Nakajima, Y. Lu, M. Sugano, A. Yoneyama, H. Yanagihara, and A. Kurematsu, "A Fast Audio Classification from MPEG Coded Data", In Proc. of the Int. Conf. on Acoustics, Speech and Signal Proc., vol. 6, pp. 3005-3008, Phoenix, AZ, March 1999.
[47] M. Noll, "Pitch Determination of Human Speech by the Harmonic Product Spectrum, the Harmonic Sum Spectrum, and a Maximum Likelihood Estimate", In Proc. of the Symposium on Computer Processing in Communications, pp. 770-797, Polytechnic Inst. of Brooklyn, 1969.
[48] Open Video Project. http://www.open-video.org/
[49] M. Partio, B. Cramariuc, M. Gabbouj, and A. Visa, "Rock Texture Retrieval Using Gray Level Co-occurrence Matrix", In Proc. of the 5th Nordic Signal Processing Symposium, October 2002.
[50] A. Pentland, R. W. Picard, and S. Sclaroff, "Photobook: tools for content based manipulation of image databases", In Proc. of SPIE (Storage and Retrieval for Image and Video Databases II), vol. 2185, pp. 34-37, 1994.
[51] D. Pan, "A Tutorial on MPEG/Audio Compression", IEEE Multimedia, pp. 60-74, 1995.
[52] S. Pfeiffer, J. Robert-Ribes, and D. Kim, "Audio Content Extraction from MPEG Encoded Sequences", In Proc. of the Fifth Joint Conference on Information Sciences, vol. II, pp. 513-516, New Jersey, US, 1999.
[53] S. Pfeiffer, S. Fischer, and W. Effelsberg, "Automatic Audio Content Analysis", In Proc. of the ACM International Conference on Multimedia, pp. 21-30, 1996.
[54] R. C. Prim, "Shortest Connection Networks and Some Generalizations", Bell System Technical Journal, vol. 36, pp. 1389-1401, November 1957.
[55] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Prentice Hall, 1993.
[56] Y. Rui, T. S. Huang, and S. Mehrotra, "Relevance feedback techniques in interactive content-based image retrieval", In Proc. of IS&T and SPIE Storage and Retrieval of Image and Video Databases VI, San Juan, PR, pp. 762-768, June 1997.
[57] Y. Sakurai, M. Yoshikawa, S. Uemura, and H. Kojima, "The A-tree: An Index Structure for High-Dimensional Spaces Using Relative Approximation", In Proc. of the 26th International Conference on Very Large Data Bases, pp. 516-526, September 10-14, 2000.
[58] J. Saunders, "Real Time Discrimination of Broadcast Speech/Music", In Proc. of ICASSP-96, vol. II, pp. 993-996, Atlanta, May 1996.
[59] E. Scheirer and M. Slaney, "Construction and Evaluation of a Robust Multifeature Speech/Music Discriminator", In Proc. of the IEEE Int. Conf. on Acoustics, Speech, and Signal Proc., pp. 1331-1334, Munich, Germany, April 1997.
[60] R. Sedgewick, Algorithms, Addison-Wesley Publishing Company, Inc., New York, reprinted with corrections, August 1989.
[61] T. K. Sellis, N. Roussopoulos, and C. Faloutsos, "The R+-Tree: A Dynamic Index for Multi-Dimensional Objects", In Proc. of the 13th International Conference on Very Large Data Bases, pp. 507-518, September 1-4, 1987.
[62] I. K. Sethi and I. Coman, "Image retrieval using hierarchical self-organizing feature map", Pattern Recognition Letters, vol. 20, pp. 1337-1345, 1999.
[63] J. R. Smith and S.-F. Chang, "VisualSEEk: a fully automated content-based image query system", In Proc. of ACM Multimedia, Boston, November 1996.
[64] C. Spevak and E. Favreau, "Soundspotter - a prototype system for content-based audio retrieval", In Proc. of the COST G-6 Conf. on Digital Audio Effects (DAFX-02), Hamburg, Germany, September 2002.
[65] S. Srinivasan, D. Petkovic, and D. Ponceleon, "Towards robust features for classifying audio in the CueVideo System", In Proc. of the Seventh ACM International Conf. on Multimedia, pp. 393-400, Ottawa, Canada, 1999.
[66] E. Terhardt, "Pitch Shifts of Harmonics, An Explanation of the Octave Enlargement Phenomenon", In Proc. of the 7th International Congress on Acoustics, pp. 621-624, 1971.
[67] C. Traina Jr., A. J. M. Traina, B. Seeger, and C. Faloutsos, "Slim-trees: High performance metric trees minimizing overlap between nodes", In Proc. of EDBT 2000, pp. 51-65, Konstanz, Germany, March 2000.
[68] G. Tzanetakis and P. Cook, "Sound Analysis Using MPEG Compressed Audio", In Proc. of the IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP 2000), vol. II, pp. 761-764, Istanbul, Turkey, 2000.
[69] The Open Video Project. http://www.open-video.org/
[70] J. K. Uhlmann, "Satisfying General Proximity/Similarity Queries with Metric Trees", Information Processing Letters, vol. 40, pp. 175-179, 1991.
[71] H. Wang and C.-S. Perng, "The S²-Tree: An Index Structure for Subsequence Matching of Spatial Objects", In Proc. of the 5th Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD), Hong Kong, 2001.
[72] R. Weber, H.-J. Schek, and S. Blott, "A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces", In Proc. of the 24th International Conference on Very Large Databases, pp. 194-205, August 24-27, 1998.
[73] D. White and R. Jain, "Similarity Indexing with the SS-tree", In Proc. of the 12th IEEE Int. Conf. on Data Engineering, pp. 516-523, 1996.
[74] E. Wold, T. Blum, D. Keislar, and J. Wheaton, "Content-based Classification, Search, and Retrieval of Audio", IEEE Multimedia Magazine, pp. 27-36, Fall 1996.
[75] Virage. www.virage.com
[76] P. N. Yianilos, "Data structures and algorithms for nearest neighbor search in general metric spaces", In Proc. of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 311-321, Austin, Texas, US, January 25-27, 1993.
[77] T. Zhang and C.-C. J. Kuo, "Video Content Parsing Based on Combined Audio and Visual Information", In Proc. of SPIE 1999, vol. IV, pp. 78-89, 1999.
[78] T. Zhang and C.-C. J. Kuo, "Hierarchical Classification of Audio Data for Archiving and Retrieving", In Proc. of the IEEE Int. Conf. on Acoustics, Speech, and Signal Proc., pp. 3001-3004, Phoenix, March 1999.
[79] X. Zhou, G. Wang, J. X. Yu, and G. Yu, "M+-tree: a new dynamical multidimensional index for metric spaces", In Proc. of the Fourteenth Australasian Database Conference on Database Technologies 2003, pp. 161-168, Adelaide, Australia, February 2003.
[80] H. Zhang and D. Zhong, "A scheme for visual feature based image indexing", In Proc. of the SPIE/IS&T Conf. on Storage and Retrieval for Image and Video Databases III, vol. 2420, pp. 36-46, San Jose, CA, February 9-10, 1995.