Extending Social Videos for Personal Consumption
Extending Social Videos for Personal Consumption

A pre-synopsis report submitted in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY

by
Nithya Sudhakar
(Roll No. 05405701)

Under the guidance of
Prof. Sharat Chandran

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY BOMBAY
2015

To my dear daughter.

Abstract

Videos are rich in content; nevertheless, their massive size makes it challenging to find the specific content a user desires. With the fast growth of the internet and of video processing techniques, videos can now be searched efficiently. However, the user is presented with a huge volume of search results, making it difficult to browse through all the videos until they find the ones of interest. Good visualizations, appropriate filters, the ability to reach a specific part of a video, and customized genre-specific applications help the user skim videos quickly, thereby reducing the user's time. In this thesis, we propose applications which extract this information. The applications designed in this thesis are:

Applications

• Lead Star Detection. Lead stars are the significant people in a video. Examples of lead stars are the man of the match in sports, the lead actors of movies, and the guest and host of TV shows. We propose an unsupervised approach to quickly detect lead stars. We use audio highlights to detect prominent regions where lead stars are likely to be present. Our method detects lead stars with reasonable accuracy in real time, which is faster than the prior work [19].

• AutoMontage. Photographs capture memories of special events. However, it is often challenging to capture good expressions of different people all at the same time. In this application, we propose a technique to create a photo album from event videos in which expressive faces of people are automatically merged to create photo shots. As opposed to prior work [6], our method automatically selects base photos and good facial expressions from various frames and stitches them together. This makes the complete process automated, thereby making it easier for end-users.

• Cricket Delivery Detection. Cricket is a fascinating and favorite game in India. Detecting deliveries in a cricket video gives rise to many related applications, such as analyzing the deliveries of a particular player and automatically generating cricket highlights. In this work, we improve the detection accuracy of prior work [45] and also add browse-by-player, so that the deliveries of a particular player can be enjoyed by a viewer or analyzed by a cricket coach.

In the process of building new applications for better video browsing, we realized the necessity of a technique that can address the needs of multiple applications. The technique designed in this thesis is:

Technique

Hierarchical Model. Structured representation of visual information results in easier and speedier video applications. However, most of the models that exist today are either designed for specific applications or capture similarity in a multi-dimensional space, which makes it difficult for applications to use them as is. Summarizing such similarity information at various levels makes it easy for applications to use this information directly, without additional computation. We propose a hierarchical model for representing video, so that designing applications is much easier and consumes less time. We have demonstrated this through new applications.
In summary, in the process of building video applications that help users quickly browse or filter video results, we have proposed a model to support the various applications that require a similarity measure over video.

Contents

Abstract
List of Tables
List of Figures

1 Introduction
1.1 Background and Motivation
1.2 Problem Statement
1.3 Our Contributions
1.3.1 Fast Lead Star Detection in Entertainment Videos
1.3.2 AutoMontage: Photo Sessions Made Easy!
1.3.3 Cricket Delivery Detection
1.3.4 Hierarchical Summarization for Easy Video Applications
1.4 Future Work

2 Fast Lead Star Detection in Entertainment Videos
2.1 Related Work
2.2 Our Strategy
2.3 Methodology
2.3.1 Audio Highlight Detection
2.3.2 Finding & Tracking Potential People
2.3.3 Face Dictionary Formation
2.4 Applications
2.4.1 Lead Actor Detection in Motion Pictures
2.4.2 Player of the Match Identification
2.4.3 Host Guest Detection
2.5 Experiments
2.5.1 Lead Actor Detection in Motion Pictures
2.5.2 Player of the Match Identification
2.5.3 Host Guest Detection

3 AutoMontage: Photo Sessions Made Easy!
3.1 Related Work
3.2 Technical Contributions
3.3 Methodology
3.3.1 Face Detection and Expression Measurement
3.3.2 Base Photo
3.3.3 Montage
3.4 Experiments

4 Cricket Delivery Detection
4.1 Prior Work
4.2 Problem Statement
4.3 Methodology
4.3.1 Pop-up Advertisement Detection and Removal
4.3.2 Green Dominance Detection for Field View Classification
4.3.3 Smoothing of View Sequence Using Moving Window
4.3.4 Player Detection
4.4 Experiments and Results
4.5 Conclusion and Future Work

5 Hierarchical Summarization for Easy Video Applications
5.0.1 Related Work
5.0.2 Technical Contributions
5.1 Methodology
5.1.1 Overview
5.1.2 H1 Dictionary Formation and Representation
5.1.3 H2 Dictionary Formation and Representation
5.1.4 H3 Dictionary Formation and Representation
5.1.5 Building Global Dictionary
5.2 Applications
5.2.1 Suggest a Video
5.2.2 Video Classification
5.2.3 Logical Turning Point Detection
5.2.4 Potential Candidate Identifier for Audio Video Mix
5.3 Experiments
5.3.1 Individual Evaluation of H1 and H2
5.3.2 Evaluation of Alternate Features
5.3.3 Evaluation of Video Classification
5.3.4 Evaluation of Logical Turning Point Detection
5.3.5 Evaluation of Potential Remix Candidates

6 Conclusion & Future Work
6.1 Future Work

List of Tables

2.1 Time taken for computing lead actors for popular movies.
2.2 Time taken for computing key players from BBC MOTD highlights for premier league 2007-08.
2.3 Time taken for identifying the host of the TV show Koffee With Karan. Two episodes are combined into a single video and given as input.
4.1 Comparison of methods across different matches.
4.2 Result of detected deliveries for different types of matches.
5.1 Classification using only H1. With limited information, we are able to glean classification hints.
5.2 Video suggestion using popular features.

List of Figures

1.1 A scenario where the user browses through various video results to reach a video of interest.
1.2 A sample scenario where the user is provided with more filtering options and visual representations along with the video results. The suggestions and visual representations vary based on the input video and its genre.
1.3 Applications developed or addressed in our work.
1.4 Lead star detection. This is exemplified in sports by the player of the match; in movies, stars are identified; and in TV shows, the guest and host are located.
1.5 An automatic photo montage created by assembling "good" expressions from a video (randomly picked from YouTube). This photo did not exist in any of the input frames.
1.6 Illustration of various problems where the work in [45] fails.
1.7 Retrieval applications designed using our Hierarchical Dictionary Summarization method. "Suggest a Video" suggests Terminator for the Matrix sequels. "Story Detection" segments the movie A Walk to Remember into the logical turning points of its plot. "Audio Video Mix" generates a "remix" song from a given set of music videos.
2.1 Lead star detection. This is exemplified in sports by the player of the match; in movies, stars are identified; and in TV shows, the guest and host are located.
2.2 Our strategy for lead star detection. We detect lead stars by considering segments that involve a significant change in audio level. However, this by itself is not enough!
2.3 Strategy further unfolded. We first detect scenes accompanied by a change in audio level. Next we look for faces in these important scenes and, to further confirm their suitability, track the faces in subsequent frames. Finally, a face dictionary representing the lead stars is formed by clustering the confirmed faces.
2.4 Illustration of highlight detection from the audio signal. We detect highlights by considering segments with a significantly low RMS ratio.
2.5 Illustration of data matrix formation. The tracked faces are stacked together to form the data matrix.
2.6 Face dictionary formed for the movie Titanic. Lead stars are highlighted in blue (third image) and red (last image). Note that this figure does not indicate the frequency of detection.
2.7 Illustration of face dictionary formation.
2.8 Lead actors detected for the movie Titanic.
2.9 Lead stars detected from the highlights of the soccer match Liverpool vs Havant & Waterlooville. The first image is erroneously detected as a face. The other results represent players and the coach.
2.10 Lead actors identified for popular movies appear on the right.
2.11 Key players detected from BBC Match of the Day highlights for premier league 2007-08.
2.12 Host detection for the TV show "Koffee with Karan". Two episodes of the show are combined and given as input. The first person in the detected list (sorted by weight) gives the host.
3.1 An automatic photo montage created by assembling "good" expressions from a video (randomly picked from YouTube). This photo did not exist in any of the input frames.
3.2 A schematic of our method. In the first step, detected faces are tracked and grouped together. In the next step, frames are analyzed to detect a base photo frame, which can either be a mosaic of frames or a single frame with the maximum number of good-expression faces from the video. In the last step, faces with good expressions are patched onto the base photo to form the required photo montage.
3.3 Facial expression measured as deviation from neutral expression.
3.4 Illustration of the montage creation process. Best expressions from various previously selected frames are patched to create a new frame.
3.5 AutoMontage created by generating a mosaic of the YouTube video available in [4].
3.6 AutoMontage created from the family stack of images in [6]. We are able to generate the photo montage automatically, as opposed to the method in [6].
3.7 Photo montage example having acceptable expressions. The YouTube video is available in [3].
3.8 Photo montage example having acceptable expressions. A wedding reception video.
4.1 Typical views in the game of cricket: (a) Pitch View (b) Ground View (c) Non-Field View. Note that the ground view may include non-trivial portions of the pitch.
4.2 Block diagram of our system.
4.3 Illustration of various problems where Binod Pal's work fails.
4.4 Detecting advertisement boundary.
4.5 Scoreboard detection.
4.6 Extraction of features at block level for improving view detection.
4.7 Retrieval of deliveries involving only certain players.
5.1 Retrieval applications designed using our Hierarchical Dictionary Summarization method. "Suggest a Video" suggests Terminator for the Matrix sequels. "Story Detection" segments the movie A Walk to Remember into the logical turning points of its plot. "Audio Video Mix" generates a "remix" song from a given set of music videos.
5.2 Illustration of the H-Video abstraction.
5.3 Pixel-level feature extraction filters.
5.4 H1 dictionary formation.
5.5 H1 representation of a frame (bottom).
5.6 Features extracted from an H1 frame.
5.7 H2 dictionary formation.
5.8 Representation of H2 units.
5.9 Features extracted from an H2 shot.
5.10 Feature extraction from H2 shots using a shot filter bank to form the H3 dictionary.
5.11 H3 representation of the video.
5.12 Given a set of movies of interest (in red), a movie which is similar to them is suggested (in green).
5.13 Movies are classified into drama and action genres.
5.14 Story line of the movie A Walk to Remember split by story units.
Characters are introduced in the first story unit; in the second story unit, the girl and the boy take part in a drama; in the last part, they fall in love and get married. In spite of the story units being of very different lengths, our method successfully identifies them.
5.15 Classification using only H2. With only limited information, and with a smaller allowed error, fine details were captured. With a larger allowed error, broad categories were captured.
5.16 Sample result of classifying movie trailers into the categories Drama, Action and Romance. In most cases, our model classified the trailers correctly.
5.17 Sample frames from identified remix candidates. In each set, the top row corresponds to an original video song and the second row to the remix candidate. The content and sequencing of the first and second rows match, suggesting the effectiveness of our method.

Chapter 1

Introduction

A picture speaks a thousand words, and a video is thousands of pictures. Videos let people articulate their ideas creatively. The abundance of videos available via video sharing websites makes video freely accessible to everyone. Videos are great sources of information and entertainment, and they cover areas such as news, education, sports, movies, and events.

1.1 Background and Motivation

In the early era of the internet, when web search and video search were first introduced, the only way to make content searchable was to annotate each web page and video with all possible tags. As the volume of content increased, this approach stopped scaling. Researchers therefore developed ways of training classifiers to recognize real-world concepts such as indoor, outdoor, swimming, car, and airplane, so that annotations could be extracted automatically from video frames. Though this approach increased the number of tags and reduced human effort, searchability was restricted to the trained concepts, and training an exhaustive list of concepts was again tedious. To overcome this, bag-of-words features were introduced: analogous to words in text, they describe the various objects in a scene visually, and an image could then be used as the query instead of traditional text. This, however, left the user with the problem of finding a suitable image to represent their intent.

Over the years, with the tremendous growth of video sharing websites, people have uploaded huge collections of videos with their own annotations. By combining bag-of-words features with the available tags, annotation has reached a level where videos are easily searchable. However, as the user gets closer to the video results, there are various expectations for which no sufficient annotation exists, so the user browses through results and recommendations until they find videos of their interest. The size of videos poses additional challenges, as the user has to download several videos and browse through them to check whether they satisfy the requirement. This scenario is illustrated in Figure 1.1.

Figure 1.1: A scenario where the user browses through various video results to reach a video of interest.

This demands a whole realm of applications that help the user quickly get close to the video they want.
In this thesis we propose applications such as lead star detection and story-line generation, which help users get a quick idea of a video without having to go through it. We also propose applications that enable the user to view selected parts of a video instead of the complete video; example applications we have explored here include cricket delivery detection and logical turning point detection. The closer a user gets to a specific genre, the more genre-specific applications make sense compared to generic ones. One example we have built is AutoMontage, which generates photos by merging different expressions into a single pleasing output. Further, we noticed great potential for genre-specific applications that commonly need similarity information. We have proposed a technique, called the Hierarchical Model, which enables this class of similarity applications. We have demonstrated it through supporting applications such as logical turning point detection, story-line generation, movie classification, and audio-video remix candidate search.

Figure 1.2: A sample scenario where the user is provided with more filtering options and visual representations (genre and duration filters, lead casts, customized videos, and similar videos) along with the video results. Note that the suggestions and visual representations vary based on the input video and its genre.

1.2 Problem Statement

The objective of the thesis is to design applications and a technique that help users reach the specific videos of their interest easily. A sample scenario in which the user is provided with various options is illustrated in Figure 1.2. As the space of such applications is wide, we establish the following specific problems:

1. The ability to summarize a video, so that the user can quickly visualize its content.
2. The ability to browse specific parts of a video instead of the whole video.
3. Designing genre-specific video applications that are of great use within their genre.
4. Designing a technique that supports similarity applications suiting various needs, such as auto suggestions and story-line generation.

1.3 Our Contributions

In this thesis, we propose applications and a technique that achieve the following, so that the user can quickly filter video results, skim through a video, or explore related ones.

1. Summarize videos in terms of lead stars, allowing the user to narrow down results based on the stars of their interest.
2. Summarize a video in terms of its story-line, so that the user can get a quick overview before viewing it.
3. Segment a video, allowing users to browse specific parts instead of the whole video.
4. Design genre-specific video applications, such as AutoMontage, cricket delivery detection, and remix candidate generation, which are of great use in those genres.
5. Design a technique that can support a class of similarity applications, including several of the applications listed above.
We achieve these functionalities using various techniques, ranging from building appropriate dictionaries to view-specific classification and merging expressions to create good photos. We classify these video applications into three major categories: segmentation, summarization, and classification. We present the applications we have developed in these categories along with a model that supports applications from all of them. A visual summary of the work is presented in Figure 1.3. In this section, we give an overview of these applications and of the model.

Figure 1.3: Applications developed or addressed in our work (segmentation, summarization, and classification, supported by the Hierarchical Model).

1.3.1 Fast Lead Star Detection in Entertainment Videos

Video can be summarized in many forms. One natural possibility that has been well explored is extracting key frames from shots or scenes and then creating thumbnails. Another natural alternative, surprisingly ill-explored, is to locate the "lead stars" around whom the action revolves. This is illustrated in Figure 1.4. Though few and far between, the available techniques for detecting lead stars are usually video specific.

Figure 1.4: Lead star detection. This is exemplified in sports by the player of the match; in movies, stars are identified; and in TV shows, the guest and host are located.

We present a generalized method for detecting snippets around lead actors in entertainment videos. This application is useful for locating the action around the 'player of the match' in sports videos, lead actors in movies and TV shows, and guest-host snippets in TV talk shows. Additionally, as our method uses audio cues, we are able to compute lead stars in real time, which is faster than state-of-the-art techniques with comparable accuracy.

1.3.2 AutoMontage: Photo Sessions Made Easy!

Group photograph sessions are tedious; it is difficult to get acceptable expressions from all people at the same time. The larger the group, the harder it gets. Ironically, we miss many expressions in the scene while the group assembles, or reassembles, for the photographs. As a result, people have started shooting videos instead. However, going through a video is time consuming. A solution is to automatically extract an acceptable, possibly stretched, photo montage and present it to the user. In this application, we automate this process: we extract faces, assess their quality, and paste them back at the correct positions to create a pleasing memory. This scenario is illustrated in Figure 1.5.

Figure 1.5: An automatic photo montage created by assembling "good" expressions from a video (randomly picked from YouTube). This photo did not exist in any of the input frames.

In this work, we contribute a frame analyzer that detects camera panning motion to generate candidate frames, an expression analyzer for the detected faces, and a photo patcher that enables seamless placement of faces in group photos.

1.3.3 Cricket Delivery Detection

Cricket is the most popular sport in India and is played globally in seventeen countries.
Meaningful analysis and mining of cricket videos yields many interesting applications, such as analyzing the balls faced by a particular player or comparing fast bowlers with slow bowlers. Presenting such specific clips gives the user the choice to select the ones they are interested in.

Prior work [45] proposed an approach for the automatic segmentation of broadcast cricket video into meaningful structural units. First, the input video frames are classified into two semantic concepts: play and break segment frames. Play segment frames are then further classified into a set of well-defined views. Sequences of views are then analyzed to extract, in cricketing terminology, "balls". However, this method fails when the scoreboard is displaced by an advertisement, and its pitch detection fails in cases where some other rectangular region, such as a player's dress, is present. In our method, we improve this approach by adding advertisement detection and removal logic. We also enhance the pitch detection by splitting frames into 3x3 grids and training the classifier with grid-level information in addition to frame-level information. Further, each delivery is examined for faces and the key players are identified, so as to support features such as browse-by-player and finding the interesting deliveries of a given player.

Figure 1.6: Illustration of various problems where the work in [45] fails.

1.3.4 Hierarchical Summarization for Easy Video Applications

With the growing use of videos, the demand for video applications has become intense. Most existing methods that analyze the semantics of a video build specific models, for example ones that aim at event detection or targeted video albumization. These might be called application-specific works, and they are useful in their own right. Here, however, we propose a video abstraction framework that unifies the creation of various applications, rather than any one application itself. Specifically, we present a dictionary summarization of a video that provides abstractions at various hierarchical levels, such as pixels, frames, shots, and the complete video. We illustrate the usability of our model with four different "apps", as shown in Figure 1.7. Our model (termed H-Video) makes the task of creating applications easier. Our method learns semantic dictionaries at three different levels: pixel patches, frames, and shots. A video is represented in the form of learned dictionary units that reveal semantic similarity and video structure. The main intention of this model is to provide these semantic dictionaries, so that comparing video units at different levels, within the same video and across different videos, becomes easier.

Figure 1.7: Retrieval applications designed using our Hierarchical Dictionary Summarization method. "Suggest a Video" suggests Terminator for the Matrix sequels. "Story Detection" segments the movie A Walk to Remember into the logical turning points of its plot. "Audio Video Mix" generates a "remix" song from a given set of music videos.

The benefits of H-Video include the following: (i) The model advocates leveraging prior offline processing at run time; as a consequence, applications run fast. (ii) The model is built in an unsupervised fashion. As no application-specific assumption is made, many retrieval applications can use this model and its features; this can potentially save an enormous amount of computation time spent in learning. (iii) The model represents the learned information using a hierarchical dictionary. This allows the video to be represented as indexes into dictionary elements, which makes life easier for the developer of a new retrieval application, as similarity information is available as a one-dimensional array. In other words, our model does not demand a deep video understanding background from application developers. (iv) We have illustrated our model through several applications.
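The full construction appears in Chapter 5; as a rough illustration of the idea of dictionary-based representation, the sketch below builds a small patch dictionary with k-means and encodes each frame as a one-dimensional array of dictionary indexes. It is a minimal sketch under simplifying assumptions of our own (grayscale patches as features, k-means via scikit-learn as the clustering step, histogram intersection as the similarity measure); it is not the exact H1/H2/H3 construction of Chapter 5.

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_patches(frame, patch=8):
    """Cut a grayscale frame (2-D array) into non-overlapping patch x patch blocks."""
    h, w = frame.shape
    blocks = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            blocks.append(frame[y:y + patch, x:x + patch].ravel())
    return np.array(blocks, dtype=np.float32)

def build_patch_dictionary(frames, n_words=64, patch=8):
    """Cluster patches from many frames; the cluster centers act as the dictionary."""
    data = np.vstack([extract_patches(f, patch) for f in frames])
    return KMeans(n_clusters=n_words, n_init=10, random_state=0).fit(data)

def encode_frame(frame, km, patch=8):
    """Represent a frame as a 1-D array of dictionary indexes (one index per patch)."""
    return km.predict(extract_patches(frame, patch))

def frame_similarity(f1, f2, km, n_words=64, patch=8):
    """Compare two frames through their dictionary-usage histograms."""
    h1 = np.bincount(encode_frame(f1, km, patch), minlength=n_words).astype(float)
    h2 = np.bincount(encode_frame(f2, km, patch), minlength=n_words).astype(float)
    h1 /= max(h1.sum(), 1.0)
    h2 /= max(h2.sum(), 1.0)
    return float(np.minimum(h1, h2).sum())   # histogram intersection in [0, 1]
```

Once frames (and, analogously, shots) are reduced to index arrays like this, application developers can compare video units without touching the pixels themselves, which is the point the benefits above make.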
1.4 Future Work

The future scope of this work is to enhance the hierarchical model to learn concepts automatically from a set of positive and negative examples, using machine learning techniques such as random forests or random trees. This would serve as a great means of learning a particular tag and applying it to all other videos exhibiting the same concept. The model can also be extended to matching parts of videos instead of whole videos. The future scope of this work also includes more applications for finding videos of the user's interest.

Chapter 2

Fast Lead Star Detection in Entertainment Videos

Suppose an avid cricket fan or coach wants to learn exactly how Flintoff repeatedly got Hughes "out." Or a movie buff wants to watch an emotional scene involving his favorite heroine in a Hollywood movie. Clearly, in such scenarios, you want to skip frames that are "not interesting." One possibility that has been well explored is extracting key frames from shots or scenes and then creating thumbnails. Another natural alternative, and the emphasis in this work, is to determine frames around what we call lead stars.

A lead star in an entertainment video is the actor who, most likely, appears in many significant frames. We define lead stars for other videos as well. For example, the lead star of a soccer match is the hero, the player of the match, who has scored the "important" goals; intuitively, he is the one the audience has paid to come and see. Similarly, the lead star of a talk show is the guest who has been invited, or, for that matter, the hostess. This work presents how to detect lead stars in entertainment videos. Moreover, like the various video summarization methods [13, 55, 75], lead stars are a natural way of summarizing a video. (Multiple lead stars are of course allowed.)

Figure 2.1: Lead star detection. This is exemplified in sports by the player of the match; in movies, stars are identified; and in TV shows, the guest and host are located.

2.1 Related Work

Researchers have explored various video-specific applications of lead star detection: anchor detection in news video [66], lead casts in comedy sitcoms [16], summarizing meetings [62], guest-host detection [24, 55], and locating the lecturer in smart rooms by tracking the face and head [74]. Fitzgibbon [18] uses affine invariant clustering to detect the cast listing of a movie. As the original algorithm had quadratic runtime, the authors used a hierarchical strategy to speed up the clustering that is central to their method. Foucher and Gagnon [19] used spatial clustering techniques for clustering actor faces. Their method detects the actors' clusters in an unsupervised way, with a computation time of about 23 hours for a motion picture.

2.2 Our Strategy

Although the lead actor has been defined using a pictorial or semantic concept, an important observation is that the significant frames in an entertainment video are often accompanied by a change in the audio intensity level. It is no doubt true that not all frames containing the lead actors involve significant audio differences. Our interest is not at the frame level, however.
Note, though, that the advent of important scenes and important people bears a strong correlation to the audio level. We surveyed around one hundred movies and found that it is rarely, if ever, the case that the lead star does not appear in the audio-highlighted sections, although the nature of the audio may change from scene to scene. And, as alluded to above, once the lead star has entered the shot, the subsequent frames may well contain normal audio levels.

Figure 2.2: Our strategy for lead star detection. We detect lead stars by considering segments that involve a significant change in audio level. However, this by itself is not enough!

Our method is built upon this concept. We detect lead stars by considering such important scenes of the video. To reduce false positives and false negatives, our method clusters the faces of each important scene separately and then combines the results. Unlike the method in [18], our method provides a natural segmentation for clustering. Our method is shown to considerably reduce the computation time of the previously mentioned state of the art [19] for computing the lead stars of a motion picture (by a factor of 50). We apply this method to sports videos to identify the player of the match, to motion pictures to find heroes and heroines, and to TV shows to detect the guest and host.

2.3 Methodology

As mentioned, the first step is to find the important scenes, i.e., those with audio highlights. Once such important scenes are identified, they are further examined for potential faces. Once a potential face is found in a frame, subsequent frames are analyzed to rule out false alarms using concepts from tracking. At this point, several regions are identified as faces. Such confirmed faces are grouped into clusters to identify the lead stars.

Figure 2.3: Strategy further unfolded. We first detect scenes accompanied by a change in audio level. Next we look for faces in these important scenes and, to further confirm their suitability, track the faces in subsequent frames. Finally, a face dictionary representing the lead stars is formed by clustering the confirmed faces.

2.3.1 Audio Highlight Detection

Figure 2.4: Illustration of highlight detection from the audio signal. We detect highlights by considering segments with a significantly low RMS ratio.

The intensity of a segment of an audio signal is summarized by its root-mean-square (RMS) value. The audio track of a video is divided into windows of equal size and the RMS value is computed for each window. From the resulting RMS sequence, the RMS ratio of successive items is computed. Let x_n denote the audio segments of fixed window size, with 0 < n < L, where L is the number of audio segments. The audio highlight indicator function H(x_n) is defined as

H(x_n) = \begin{cases} 1 & \text{if } \mathrm{rms}(x_n)/\mathrm{rms}(x_{n+1}) < th \\ 0 & \text{otherwise} \end{cases}    (2.1)

The RMS ratio is marked as low when its value is below a user-defined threshold th. Using this indicator, we define the function A that marks the audio highlight region:

A(n) = \begin{cases} 1 & \text{if } \exists\, m,\ n - tw \le m \le n + tw,\ \text{such that } H(x_m) = 1 \\ 0 & \text{otherwise} \end{cases}    (2.2)

In our implementation, based on our experiments, we have used th = 5 and tw = 2. The video frames corresponding to such windows are considered 'important.'
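The computation in Eqs. (2.1) and (2.2) is straightforward to implement. The following is a minimal sketch in Python/NumPy, assuming the audio track has already been decoded into a one-dimensional array of samples; the helper names and the window length are our own, while th and tw mirror the values quoted in the text.

```python
import numpy as np

def rms(x):
    """Root-mean-square value of one audio window."""
    return float(np.sqrt(np.mean(np.square(x, dtype=np.float64))))

def audio_highlights(samples, win_len, th=5.0, tw=2):
    """Return a boolean array A, one entry per window, following Eqs. (2.1)-(2.2)."""
    n_win = len(samples) // win_len
    windows = samples[:n_win * win_len].reshape(n_win, win_len)
    r = np.array([rms(w) for w in windows]) + 1e-9      # avoid division by zero

    # Eq. (2.1): indicator that the RMS ratio of successive windows falls below th
    H = (r[:-1] / r[1:]) < th

    # Eq. (2.2): a window is a highlight if any indicator fires within +/- tw windows
    A = np.zeros(n_win, dtype=bool)
    for n in range(n_win - 1):
        if H[n]:
            A[max(0, n - tw):min(n_win, n + tw + 1)] = True
    return A
```

The frames belonging to windows with A(n) = 1 are the 'important' frames passed on to the face detection stage described next.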
2.3.2 Finding & Tracking Potential People

Figure 2.5: Illustration of data matrix formation. The tracked faces are stacked together to form the data matrix.

Once important scenes are marked, we seek to identify the people in the corresponding video segments. Fortunately, there are well-understood algorithms that detect faces in an image frame. We select a random frame within the window and detect faces using the Viola & Jones face detector [59]. Every face detected in the current frame is then voted on for confirmation by attempting to track it in the subsequent frames of the window. Confirmed faces are stored in a data matrix for the next processing step: the confirmed faces from each highlight F_i are stored in the corresponding data matrix F, as illustrated in Figure 2.5.

2.3.3 Face Dictionary Formation

In this step, the confirmed faces are grouped based on their features. There is a variety of algorithms for dimensionality reduction and subsequent grouping. We observe that Principal Component Analysis (PCA) has been used successfully for face recognition. We use PCA to extract feature vectors P_i from F_i, and we use the k-means algorithm for clustering. We determine the number of clusters K using the following steps.

    Let P_i be the PCA vector of face matrix F_i, where 1 ≤ i ≤ N
    pdist(i, j) ← cosine distance between PCA vectors P_i and P_j
    F ← {1, ..., N}; K ← 0
    while F ≠ ∅ do
        K ← K + 1
        G ← {any element of F}
        repeat
            size ← |G|
            newG ← {}
            for all elements f in G do
                newG ← newG ∪ {k : pdist(f, k) < t, 1 ≤ k ≤ N}
            end for
            G ← newG
        until size = |G|
        F ← F \ G
    end while

The representative faces of the clusters formed for the movie Titanic are shown in Figure 2.6. Our method has successfully detected the lead actors of the movie. As can be noticed, along with lead stars there are patches that have been misclassified as faces, and there are also non-lead stars present alongside the highlighted lead stars. We therefore refine these clusters further and select the prominent clusters which represent lead stars.

Figure 2.6: Face dictionary formed for the movie Titanic. Lead stars are highlighted in blue (third image) and red (last image). Note that this figure does not indicate the frequency of detection.

Let C_j represent each cluster in the final set of clusters, where 1 ≤ j ≤ K, with clusters considered in decreasing order of size. The number of elements in cluster C_j is

N_j = |C_j|    (2.3)

At this point, we have a dictionary of faces, but not all faces belong to lead actors. We use the following parameters to shortlist the faces that form the lead stars.

1. The number of faces in the cluster. If a cluster (presumably of the same face) has a large cardinality, we give it a high weightage. The weight function S_1, which selects the top clusters, is defined as

S_1(j) = \begin{cases} 1 & \text{if } j = 1 \\ 1 & \text{if } S_1(j-1) = 1 \text{ and } N_j \ge t_s N_{j-1} \\ 0 & \text{otherwise} \end{cases}    (2.4)

In our experiments, we have used t_s = 0.5, which helps in identifying the prominent clusters.

2. The position of the faces with respect to the center of the image. Lead stars are expected to appear near the center. Let H and W be the height and width of the video frame, respectively, and let (X_i, Y_i) be the center coordinates of face F_i. The function measuring the position of the faces in cluster C_j is computed as

S_2(j) = \sum_{i=1}^{N_j} \left( (W/2 - |W/2 - X_i|) + (H/2 - |H/2 - Y_i|) \right)    (2.5)

3. The area of the detected faces. S_3(j) is calculated as the sum of the areas of all faces in cluster C_j. Again, lead stars typically occupy a significant portion of the image.

The weighted average of these parameters is used to select the lead star clusters. Let SW_w be the weight of function S_w; then the shortlisting function S(j) is defined as

S(j) = \sum_{w=1}^{3} SW_w \, S_w(j)    (2.6)

The clusters with S(j) > \mu_S are selected as lead star clusters L_j, and the lead star LR_j of cluster C_j is detected as the face with minimum distance to the cluster center:

LR_j = F_i \ : \ \sqrt{(\mu_{L_j} - P_i)^2} = \min_{x=1}^{N_j} \sqrt{(\mu_{L_j} - P_x)^2}    (2.7)

Figure 2.7: Illustration of face dictionary formation.
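The shortlisting of Eqs. (2.3)-(2.7) reduces to a few array operations once the clusters are available. The following is a rough sketch, assuming the clusters are given as dictionaries of PCA vectors, face centers, and face areas sorted by decreasing cluster size, and using equal weights SW_w; the normalization of S_2 and S_3 is also our own addition so that the three terms are comparable.

```python
import numpy as np

def score_clusters(clusters, frame_w, frame_h, ts=0.5):
    """clusters: list of dicts with keys 'pca' (n_i x d array), 'centers' (n_i x 2 array),
    'areas' (n_i array), sorted by decreasing cluster size. Returns per-cluster scores."""
    sizes = np.array([len(c['pca']) for c in clusters], dtype=float)

    # S1 (Eq. 2.4): prefix of clusters whose size does not drop below ts * previous size
    s1 = np.zeros(len(clusters))
    for j, n in enumerate(sizes):
        if j == 0 or (s1[j - 1] == 1 and n >= ts * sizes[j - 1]):
            s1[j] = 1
        else:
            break

    # S2 (Eq. 2.5): closeness of the faces to the frame center
    s2 = np.array([np.sum((frame_w / 2 - np.abs(frame_w / 2 - c['centers'][:, 0])) +
                          (frame_h / 2 - np.abs(frame_h / 2 - c['centers'][:, 1])))
                   for c in clusters])

    # S3: total face area per cluster
    s3 = np.array([np.sum(c['areas']) for c in clusters])

    # Normalize S2 and S3 to [0, 1] (our choice), then Eq. (2.6) with equal weights
    s2 = s2 / max(s2.max(), 1e-9)
    s3 = s3 / max(s3.max(), 1e-9)
    return (s1 + s2 + s3) / 3.0

def representative_faces(clusters, scores):
    """Clusters scoring above the mean are lead-star clusters; the face closest to the
    cluster mean in PCA space is the representative, per Eq. (2.7)."""
    reps = []
    for c, s in zip(clusters, scores):
        if s > scores.mean():
            mu = c['pca'].mean(axis=0)
            reps.append(int(np.argmin(np.linalg.norm(c['pca'] - mu, axis=1))))
        else:
            reps.append(None)
    return reps
```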
2.4 Applications

In this section, we demonstrate our technique using three applications: Lead Actor Detection in Motion Pictures, Player of the Match Identification, and Host Guest Detection. As applications go, Player of the Match Identification has not been well explored, considering the enormous interest in it. In the other two applications, our technique detects lead stars faster than state-of-the-art techniques, which makes our method practical and easy to use.

2.4.1 Lead Actor Detection in Motion Pictures

In motion pictures, detecting the hero, heroine, and villain has many interesting benefits. A person reviewing a movie can skip the scenes where the lead actors are not present. A profile of the lead actors can also be generated. Significant scenes containing many lead actors can be used for summarizing the video. In our method, the face dictionary formed contains the lead actors. These face dictionaries are good enough in most cases. However, for more accurate results, the algorithm scans a few frames of every shot to determine the faces occurring in the shot; the actors who appear in a large number of shots are identified as lead actors. The lead actors for the movie Titanic obtained after scanning the entire movie are shown in Figure 2.8.

Figure 2.8: Lead actors detected for the movie Titanic.

2.4.2 Player of the Match Identification

In sports, highlights and key frames [55] are the two main methods used for summarization. We summarize sports using the player of the match, capturing the star players. Detecting and tracking players over a complete sports video does not yield the player of the match: a star player may play for a shorter time yet score more, as opposed to players who attempt many times and do not score. Analyzing the players when there is a score therefore leads to the identification of the star players. This is easily achieved by our technique, as detecting highlights yields exciting scenes such as scores.

Figure 2.9: Lead stars detected from the highlights of the soccer match Liverpool vs Havant & Waterlooville. The first image is erroneously detected as a face. The other results represent players and the coach.

The lead sports stars detected from the soccer match Liverpool vs Havant & Waterlooville are presented in Figure 2.9. The key players of the match are detected.

2.4.3 Host Guest Detection

In TV interviews and other TV programs, detecting the host and guest of the program is key information used in video retrieval. Javed et al. [24] have proposed a method for this which removes the commercials and then exploits the structure of the program to detect the guest and host. Their algorithm uses the inherent structure of the interview, namely that the host appears for a shorter duration than the guest. However, this is not always the case, especially when the hosts are equally popular, as in TV shows like Koffee With Karan. In competition shows, the host is shown for a longer duration than the guests or judges. Our algorithm detects hosts and guests as lead stars. To distinguish hosts from guests, we detect lead stars over multiple episodes and combine the results. Intuitively, the lead stars that recur over multiple episodes are the hosts, and the other lead stars detected in specific episodes are the guests.
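The combination step can be as simple as intersecting the per-episode lead-star identities. A minimal sketch, assuming each episode's lead stars have already been matched across episodes (for instance by nearest-neighbor matching of their PCA vectors) and assigned integer identities; the matching step itself is not shown, and the identities in the example are hypothetical.

```python
def split_hosts_and_guests(episodes):
    """episodes: list of sets of lead-star identities, one set per episode.
    Identities present in every episode are hosts; the rest are per-episode guests."""
    hosts = set.intersection(*episodes) if episodes else set()
    guests = [ep - hosts for ep in episodes]
    return hosts, guests

# Example: episode 1 detects identities {0, 1, 2}, episode 2 detects {0, 3, 4}
hosts, guests = split_hosts_and_guests([{0, 1, 2}, {0, 3, 4}])
# hosts == {0}; guests == [{1, 2}, {3, 4}]
```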
2.5 Experiments

We have implemented our system in Matlab. We tested our method on an Intel Core Duo processor, 1.6 GHz, with 2 GB RAM. We have conducted experiments on 7 popular motion pictures, 9 soccer match highlights, and two episodes of a TV show, summing to a total of 19 hours 23 minutes of video. Our method detected the lead stars in all the videos in an average of 14 minutes per one-hour video. The method [19] in the literature computes the lead stars of a motion picture in 23 hours, whereas we compute them in an average of 30 minutes. We now provide more details.

2.5.1 Lead Actor Detection in Motion Pictures

Table 2.1: Time taken for computing lead actors for popular movies.

No.  Movie Name                             Duration  Lead Star Detection  Refinement
                                            (hh:mm)   (hh:mm)              (hh:mm)
1    The Matrix                             02:16     00:22                00:49
2    The Matrix Reloaded                    02:13     00:38                01:25
3    Matrix Revolutions                     02:04     00:45                01:07
4    Eyes Wide Shut                         02:39     00:12                00:29
5    Austin Powers in Goldmember            01:34     00:04                01:01
6    The Sisterhood of the Traveling Pants  01:59     00:16                00:47
7    Titanic                                03:18     01:01                01:42
     Total                                  16:03     03:18                07:20

We ran our experiments on the 7 box-office hit movies listed in Table 2.1, which sum to 16 hours of video. The lead stars of all these movies are computed in 3 hours 18 minutes, so the average computation time for a movie is around 30 minutes. From Table 2.1, we see that the best computation time is 4 minutes, for Austin Powers in Goldmember, which is 1 hour 34 minutes long. Among the roughly two-hour movies, the worst computation time is 45 minutes, for Matrix Revolutions (2 hours 4 minutes); the longest overall is 1 hour 1 minute for Titanic (3 hours 18 minutes). For movies like Eyes Wide Shut and Austin Powers in Goldmember the computation is faster, as there are fewer audio highlights, whereas action movies like the Matrix sequels and Titanic take more time as there are many audio highlights. This causes the variation in computation time among movies.

The lead actors detected are shown in Figure 2.10. The topmost star is highlighted in red and the next top star in blue. As can be noticed in the figure, the topmost stars are detected for most of the movies. Since the definition of "top" is subjective, it could be said that in some cases the top stars are not detected. Further, in some cases the program identifies the same actor multiple times; this could be due to disguise or to pose variation. The result is further refined for better accuracy, as mentioned in Section 2.4.1.

Figure 2.10: Lead actors identified for popular movies appear on the right.

2.5.2 Player of the Match Identification

We have conducted experiments on the nine soccer match highlights taken from the BBC and listed in Table 2.2. Our method on average takes half the duration of the video. Note, however, that these timings are for sections that have already been manually edited by the BBC staff. If the method were run on a routine full soccer match, we expect our running time to be a lower percentage of the entire video. The results of key player detection are presented in Figure 2.11. The key players of the match are identified for all the matches.

Figure 2.11: Key players detected from BBC Match of the Day match highlights for premier league 2007-08.

2.5.3 Host Guest Detection

We conducted our experiment on the TV show Koffee with Karan. Two different episodes of the show were combined and fed as input. Our method identified the host in 4 minutes for a video of duration 1 hour 29 minutes, which is faster than the method proposed by Javed et al. [24]. The result of our method for the TV show Koffee with Karan is presented in Figure 2.12.
Our method has successfully identified the host.

Table 2.2: Time taken for computing key players from BBC MOTD highlights for premier league 2007-08.

No.  Soccer Match                          Duration  Computation
                                           (hh:mm)   (hh:mm)
1    Barnsley vs Chelsea                   00:02     00:01
2    Birmingham City vs Arsenal            00:12     00:04
3    Tottenham vs Arsenal                  00:21     00:07
4    Chelsea vs Arsenal                    00:14     00:05
5    Chelsea vs Middlesbrough              00:09     00:05
6    Liverpool vs Arsenal                  00:12     00:05
7    Liverpool vs Havant & Waterlooville   00:15     00:06
8    Liverpool vs Middlesbrough            00:09     00:05
9    Liverpool vs Newcastle United         00:18     00:04
     Total                                 01:52     00:43

Table 2.3: Time taken for identifying the host of the TV show Koffee With Karan. Two episodes are combined into a single video and given as input.

TV Show             Duration (hh:mm)  Computation (hh:mm)
Koffee With Karan   01:29             00:04

Figure 2.12: Host detection for the TV show "Koffee with Karan". Two episodes of the show are combined and given as input. The first person in the detected list (sorted by weight) gives the host.

Chapter 3

AutoMontage: Photo Sessions Made Easy!

It is no longer only professional photographers who take pictures. Almost anyone has a good camera, and often takes a lot of photographs. Group photographs as a photo session at reunions, conferences, weddings, and so on are de rigueur. It is difficult, however, for novice photographers to capture good expressions at the right time and realize a consolidated acceptable picture. A video shoot of the same scene ensures that expressions are not missed. Sharing the video, however, may not be the best solution: besides the obvious bulk of the video, poor expressions ("false positives") are also willy-nilly captured and might prove embarrassing. A good compromise is to produce a mosaiced photograph assembling good expressions and discarding poor ones. This can be achieved by cumbersome manual editing; in this work, we provide an automated solution, illustrated in Figure 3.1. The photo shown has been created from a random YouTube video excerpt and does not exist in any frame of the original video.

Figure 3.1: An automatic photo montage created by assembling "good" expressions from a video (randomly picked from YouTube). This photo did not exist in any of the input frames.

3.1 Related Work

Research on measuring the quality of facial expressions has appeared elsewhere, in applications such as detecting patients' expressions to sense pain [40], measuring children's facial expressions during problem solving [37], and analyzing empathetic interactions in group meetings [28]. Our work focuses on generating an acceptable photo montage from a photo-session video and is oriented towards the targeted goal of discarding painful expressions and recognizing memorable ones. With regard to photo editing, researchers have come up with many interesting applications, such as organizing photos based on the people present in them [31][34][32][12], correcting an image with closed eyes to open eyes [38], and morphing photos [36]. Many of these methods either require manual intervention or involve a different and enlarged problem scope, resulting in more complicated algorithms that render them inapplicable to our problem. The closest work to ours is presented in [6], where a user selects the part they want from each photo; these parts are then merged using graph cuts to create a final photo. Our work differs from [6] in a few ways. Faces are selected automatically by determining pleasing facial expressions (based on an offline machine learning strategy).
Video input is allowed, enabling a larger corpus of acceptable faces, and the complete process is automated, thereby making it easier for end-users.

3.2 Technical Contributions

The technical contributions of this work include:

• A frame analyzer that detects camera panning motion to generate candidate frames
• An expression analyzer for the detected faces
• A photo patcher that enables seamless placement of faces in group photos

Details of these steps appear in the following section.

3.3 Methodology

In this section, we present a high-level overview of our method, followed by the details. The steps involved in AutoMontage creation are:

1. Face detection and expression measurement
2. Base photo selection
3. Montage creation

First, we detect faces in all the frames and track them based on their positions. Then we measure the facial expression of every detected face and select a manageable subset. The next step is to identify plausible frames which can be merged to create a mosaic; frames that are "far away" in time, or otherwise unrelated, cannot be merged. In the last phase, the faces with the best expressions are selected and substituted into the mosaiced image using graph cuts and blending. Figure 3.2 illustrates these steps.

Figure 3.2: A schematic of our method (face detection, face position tracking, facial expression measurement, base photo selection, montage creation). In the first step, detected faces are tracked and grouped together. In the next step, frames are analyzed to detect a base photo frame, which can either be a mosaic of frames or a single frame with the maximum number of good-expression faces from the video. In the last step, faces with good expressions are patched onto the base photo to form the required photo montage.

3.3.1 Face Detection and Expression Measurement

In our work, we first detect faces in the video using the method in [60]. Measuring facial expression is, as expected, critical. Facial expression is measured as deviation from neutral expressions, as illustrated in Figure 3.3. We have manually collected around one hundred neutral-expression faces from various wedding videos for training our system.

Figure 3.3: Facial expression measured as deviation from neutral expression.

Offline Alignment. Intuitively, alignment is achieved using the positions of the eyes as the reference. Neutral faces are aligned using the following steps:

1. Color space conversion: the face is converted from RGB to the TSL (Tint, Saturation, and Luminance) color space.
2. Skin regions, with values Is, are detected.
3. In the non-skin region, where Is = 0, regions in the top half of the face are examined for two symmetrical and almost spherical regions, which represent the eyes.
4. Non-skin regions in the bottom half of the face are examined for the occurrence of the mouth. When a horizontal region centered between the eye positions is found, it is detected as the mouth.

The extracted rectangular regions are measured relative to the positions of the eyes and mouth to achieve alignment.

Neutral expression. A sparse quantification of neutral expression is achieved using dimensionality reduction techniques. In brief:

1. Faces are contrast enhanced.
2. The mean image is computed and subtracted from the neutral faces.
3. The actual dimensionality reduction is performed using an SVD factorization.
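A minimal sketch of this offline step is given below, assuming the neutral faces have already been aligned and converted to equal-size grayscale crops; the contrast-enhancement step is omitted, and the number of retained components is an illustrative choice of ours. The projection it produces feeds into the deviation measures defined next.

```python
import numpy as np

def train_neutral_space(neutral_faces, n_components=20):
    """neutral_faces: list of aligned grayscale crops of identical shape.
    Returns the mean face, the principal directions, and the projections P
    of the training faces onto that space."""
    X = np.array([f.astype(np.float64).ravel() for f in neutral_faces])
    mean_face = X.mean(axis=0)
    Xc = X - mean_face                         # mean-centered training faces
    # SVD factorization: rows of Vt are the principal directions (eigenfaces)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    basis = Vt[:n_components]                  # (n_components, n_pixels)
    P = Xc @ basis.T                           # projections of the neutral faces
    return mean_face, basis, P

def project_face(face, mean_face, basis):
    """Projection p of an aligned test face onto the neutral-expression eigenspace."""
    return (face.astype(np.float64).ravel() - mean_face) @ basis.T
```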
For any test face x_t, the facial expression measure is computed as its deviation from the stored principal component vectors. As in the training phase, the face is first aligned and enhanced, and the mean-centered vector is then projected onto the neutral-face eigenspace to obtain a projection p. The Euclidean distance between the projection p and the mean of the projections P of the neutral faces is used to measure the quality of an expression:

\delta_1 = \left\| \, p - \frac{1}{N} \sum_{i=1}^{N} P_i \, \right\|

We also compute the minimum Euclidean distance between the projection p and each neutral-face projection in P:

\delta_2 = \min_{i=1}^{N} \left\| \, p - P_i \, \right\|

The facial expression measure is the arithmetic mean of \delta_1 and \delta_2:

\delta = \frac{\delta_1 + \delta_2}{2}

3.3.2 Base Photo

We detect the camera movement direction by tracking the positions and sizes of the faces. For each shot, we accumulate frames until the camera direction changes; the resulting clusters of frames serve as the base photo. In case there is little or no camera movement, the scene is considered static and the frame having the maximum number of good-expression faces is selected as the base photo.

3.3.3 Montage

The montage is illustrated in Figure 3.4 and created as follows:

1. As a rough indicator of the desired position, the detected face's coordinates are mapped to the corresponding coordinates in the mosaiced image using the parameters computed by the mosaicing algorithm.
2. Multiple faces which have similar coordinates are grouped, and the face with the maximum facial expression measure (δ) is selected.
3. A broad alignment with the body position is also made.
4. Given these tentative positions, graph cut [8] is used to find the accurate boundary of the inserted face. In brief, the base boundaries are tied to the source node and assigned a high weight; in-between nodes are assigned the absolute difference of gradient level.
5. Around the graph-cut segmentation, image blending is performed between the base photo and the selected face.

Figure 3.4: Illustration of the montage creation process. Best expressions from various previously selected frames are patched to create a new frame.

Figure 3.5: AutoMontage created by generating a mosaic of the YouTube video available in [4].

3.4 Experiments

To compare our method, we ran our experiments against the stack of images provided in [6]. As can be seen, the output photo had good expressions for most of the people. The result is presented in Figure 3.6. Note that we have generated the photo montage automatically, without the user input required by the original method.

Figure 3.6: AutoMontage created from the family stack of images in [6]. We are able to generate the photo montage automatically, as opposed to the method in [6].

Our system has also been tested on other group photo sessions collected from YouTube. Our algorithm successfully created photo montages from all these videos. Examples are presented in Figure 3.5, Figure 3.7, and Figure 3.8. In Figure 3.6, though most of the faces look good, there is an artifact in the third person from the top-left corner. This is introduced by the patching scheme when the best expressions of multiple faces were substituted. This scenario demands more accurate detection of faces to avoid such artifacts.

Figure 3.7: Photo montage example having acceptable expressions. The YouTube video is available in [3].

Figure 3.8: Photo montage example having acceptable expressions. A wedding reception video.

Chapter 4

Cricket Delivery Detection

Cricket is one of the most popular sports in the world after soccer. Played globally in more than a dozen countries, it is followed by over a billion people.
One would therefore expect, following other trends in global sports, that there would be meaningful analysis and mining of cricket videos. There has been some interesting work in this area (for example, [23, 27, 45]). However, by and large, the amount of computer vision research does not seem to be commensurate with the interest and the revenue in the game. Possible reasons could be the complex nature of the game and the variety of views one can see, as compared to games such as tennis and soccer. Further, the long duration of the game might inhibit the use of inefficient algorithms.

Segmenting a video into its meaningful units is very useful for the structural analysis of the video, and in applications such as content-based retrieval. One meaningful unit corresponds to the semantic notion of "deliveries" or "balls" (virtually all cricket games are made up of 6-ball overs). Consider:

Figure 4.1: Typical views in the game of cricket (a) Pitch View (b) Ground View (c) Non-Field View. Note that the ground view may include non-trivial portions of the pitch.

• A bowling coach might be interested in analyzing the balls faced only by, say, Steve Smith from the entire match. If we could temporally segment a cricket match into balls, then it would be possible to watch only these portions.
• The games administrator might want to figure out how many minutes are consumed by slow bowlers as compared to fast bowlers. Currently such information is available only through the manual process of segmenting a long video (more than 7 hours).

4.1 Prior Work

Indeed, the problem of segmenting a cricket video into meaningful scenes is addressed in [27]. Specifically, the method uses the manual commentaries available for cricket videos to segment a cricket video into its constituent balls. Once segmented, the video is annotated by the text for higher-level content access. A hierarchical framework and algorithms for cricket event detection and classification are proposed in [23]. The authors use a hierarchy of classifiers to detect various views present in the game of cricket. The views with which they are concerned include real-time, replay, field view, non-field view, pitch view, long view, boundary view, close-up and crowd.

Despite the interesting methods in these works – useful in their own right – there are some challenges. The authors in [23] have only worked on view classification without addressing the problem of video segmentation. Our work closely resembles the method in [27], and is inspired by it from a functional point of view. It differs dramatically in the complexity of the solution, and in the basis for the temporal segmentation. Specifically, as the title of [27] indicates, that work is text-driven.

In Binod's work [45], he addresses the problem of segmenting cricket videos into the meaningful structural unit termed a ball. Intuitively, a ball starts with the frame in which "a bowler is running towards the pitch to release the ball." The end of the ball is not necessarily the start of the next ball; the ball is said to end when the batsman has dealt with it. A ball thus spans a variety of views. The end frame might be a close-up of a player or players (e.g. a celebration), the audience, a replay, or even an advertisement. Because of this varied nature of views, he has used both the domain knowledge of the game (views and their sequence) as well as TV production rules (location of the scoreboard, its presence and absence).
In his approach, the input video frames are first classified into two semantic concepts: play and break. The break segment includes replays, graphics and advertisements. The views that we get in the game of cricket include close-ups of the players, and a variety of views that are defined as follows (examples of various views are presented in Figure 4.6):

• Pitch View (P): bowler run-up, ball bowled, ball played.
• Ground View (G): ball tracking by the camera, ball fielded and returned.
• Non-Field View (N): any view that is not P or G. This includes views such as the close-up of the batsman, bowler, umpire, and the audience.

Further, Binod defines Field Views (F) as the union of P and G. Using this vocabulary, he creates an alphabetical sequence (see the block diagram of our system in Fig. 4.2). Representative views are shown in Fig. 4.1. Note that our particular classification is not intended to be definitive or universal; it may even appear subjective. The process of creating this sequence is:
• Segment the video frames into play and break segments.
• Classify play-segment frames into F or N.
• Classify F frames into P or G.
• Smooth the resulting sequence.
• Extract balls using a rule-based system.

In summary, his contribution lies in modeling the ball as a sequence of relevant views and coming up with appropriate techniques to obtain the vocabulary in the sequence. However, his method fails when views are misclassified due to pop-up ads, or when there are too few frames in a particular view. In our work, we have focused on improving accuracy in such cases and, further, we identify the key players of each ball to help in indexing the match.

Figure 4.2: Block diagram of our system

4.2 Problem Statement

The cricket delivery detection method proposed by Binod Pal et al. [45] fails when views are misclassified. We propose improvements to handle these cases:
1. Eliminate pop-up advertisements
2. Enhance the pitch view detection logic to eliminate false positives
3. Enhance the smoothing logic so that it does not smooth out the required views
4. Associate each delivery with the prominent player

Figure 4.3: Illustration of various problems where Binod Pal's work fails

4.3 Methodology

In our approach we improve on Binod Pal's work and address the issues presented in the problem statement.

4.3.1 Pop-up Advertisement Detection and Removal

Whenever the view classifier detects an advertisement, we further examine the frame to check whether it is a pop-up ad and eliminate it. Pop-up advertisements are relatively static when compared to game play, where there is more activity; we use this to detect advertisements and eliminate them. From the training set we extract the static component image from the motion vectors, i.e., the regions with negligible motion. From this, we detect the prominent regions at the left and bottom of the image and then smooth the region boundaries to get the ad boundary. When this method fails to detect the advertisement boundary, we use the scoreboard to determine it: we look for the scoreboard boundary in the first quadrant of the image and, once the scoreboard position is detected, advertisement boundaries are derived from the displacement distance.
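The following is a minimal sketch of the "static region" idea behind the pop-up ad detector. The thesis works from motion vectors and detects regions at the left and bottom of the frame; this sketch approximates that with per-pixel temporal variance over a window of frames. The implementation in the thesis is in Matlab, and the function names, thresholds, and margin fraction below are assumptions for illustration only.

```python
import numpy as np

def static_region_mask(frames, motion_threshold=2.0):
    """Given a list of grayscale frames (2-D uint8 arrays) from a play segment,
    return a boolean mask of pixels that barely change over time.
    Pop-up ads and scoreboards tend to fall inside this mask."""
    stack = np.stack([f.astype(np.float32) for f in frames], axis=0)
    temporal_std = stack.std(axis=0)                 # per-pixel variation over time
    return temporal_std < motion_threshold

def candidate_ad_box(static_mask, margin_fraction=0.25):
    """Look for a large static block in the bottom margin of the frame and
    return its bounding box, or None. A similar scan of the left margin
    could be added for vertically placed ads."""
    h, w = static_mask.shape
    bottom = static_mask[int(h * (1 - margin_fraction)):, :]
    rows, cols = np.nonzero(bottom)
    if rows.size < 0.05 * bottom.size:               # too small to be an ad
        return None
    top = int(h * (1 - margin_fraction)) + rows.min()
    return (top, cols.min(), h - 1, cols.max())      # (y0, x0, y1, x1)
```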
4.3.2 Green Dominance Detection for Field View Classification

In the prior work [45], there were false positives in pitch detection when a player's dress color is green. To reduce such false positives, we have divided the frame into 9 blocks and fed the 128-bin hue histogram of each block as additional features to the view classifier. As the green ground area surrounds the pitch and the pitch is at the center, this helped in reducing the false positives.

Figure 4.4: Detecting Advertisement Boundary

Figure 4.5: Scoreboard Detection

4.3.3 Smoothing of View Sequence Using Moving Window

In the prior work, smoothing was done by dividing the view sequence into windows of 5 frames and assigning the most frequently occurring view to all the frames in that window. As this method can remove some views which fall across window boundaries, we use a running window instead of dividing the view sequence: each frame is considered the center of a window, and the most frequently occurring view in that window is assigned to the frame. This approach does not eliminate views which originally fell across windows.

Figure 4.6: Extraction of features at block level for improving view detection

4.3.4 Player Detection

Once deliveries are detected, further filtering of the video is possible. For example, we consider classifying deliveries based on a particular player, such as the Australian cricket captain Steve Smith. We can use a standard frontal face detector to detect the faces of bowlers. On the other hand, faces of batsmen are confounded by the usual use of helmets, so we look for additional information in the form of connected regions (in the skin color range) and square regions which have more edges. For each delivery, face detection is performed on 100 frames before and after the actual delivery. This buffer of 200 frames is assessed for recognizing faces of the key players associated with the delivery. The usual method of dimensionality reduction and clustering is used to prune the set. With this background, if an input image of a particular player is given, the image is mapped to the internal low-dimensional representation and the closest cluster is found using a kd-tree. The corresponding deliveries in which that player appears are displayed.

4.4 Experiments and Results

We have conducted our experiments on one day international matches, test matches and T20 matches, summing to 50 hours of video. We have compared our pitch/ground view classification against a classifier using SIFT features. The features proposed by us produced promising results, presented in Table 4.1.

Table 4.1: Comparison of methods across different matches (pitch/ground view classification)
Method          Precision   Recall
Our Method      0.982       0.973
SIFT features   0.675       0.647

Our method successfully detected deliveries with an overall precision of 0.935 and recall of 0.957. The results are presented in Table 4.2.

Table 4.2: Results of delivery detection for different types of matches
Match Type      Precision   Recall
ODI             0.932       0.973
Test Match      0.893       0.931
T20             0.982       0.967

The filtered retrieval of deliveries is presented in Fig. 4.7.

Figure 4.7: Retrieval of deliveries involving only certain players (input player and retrieved deliveries).

4.5 Conclusion and Future Work

We have improved the accuracy of the temporal segmentation of cricket videos proposed by Binod Pal and compared it across different forms of cricket (50-over ODI, 20-20 (T20) and test matches). We are also able to provide a browse-by-player capability to traverse deliveries by player.
Chapter 5 Hierarchical Summarization for Easy Video Applications

Say you want to find an action scene of your favorite hero. Or want to watch a romantic movie with a happy ending. Or say you saw an interesting trailer, and want to watch related movies. Finding these videos in the ocean of videos available has become noticeably difficult, and requires a trustworthy friend or editor. Can we quickly design computer "apps" that act like this friend? This work presents an abstraction for making this happen.

We present a model which summarizes the video in terms of dictionaries at different hierarchical levels — pixel, frame, and shot. This makes it easier to create applications that summarize videos and address complex queries like the ones listed above. The abstraction leads to a toolkit that we use to create several different applications demonstrated in Figure 5.1. In the "Suggest a Video" application, from a set of action movies, three Matrix sequels were given as input; the movie Terminator was found to be the closest suggested match. In the "Story Detection" application, the movie A Walk to Remember is segmented into three parts; user experience suggests that these parts correspond to prominent changes in the twist and plot of the movie. In the "Audio Video Mix" application, given a song with a video as a backdrop, the application finds another video with a similar song and video; the application thus can be used to generate a "remix" for the input song. This application illustrates the ability of the data representation to find a video which closely matches both content and tempo.

Figure 5.1: Retrieval applications designed using our Hierarchical Dictionary Summarization method. "Suggest a Video" suggests Terminator for Matrix sequels. "Story Detection" segments a movie into logical turning points in the plot of the movie A Walk to Remember. "Audio Video Mix" generates a "remix" song from a given set of music videos.

5.0.1 Related Work

Methods like hierarchical hidden Markov models [65, 67] and latent semantic analysis [43, 64] have been used to build models over basic features to learn semantics. In the method proposed by Lexing Xie et al. [65], the structure of the video is learned in an unsupervised way using a two-level hidden Markov model: the higher level corresponds to semantic events while the lower level corresponds to variations within the same event. Gu Xu et al. [67] proposed multi-level hidden Markov models using detectors and connectors to learn videos.

Spatio-temporal words [29, 43, 63, 64] are interest points identified in space and time. Ivan Laptev [29] uses a spatio-temporal Laplacian operator over spatial and temporal scales to detect events. Niebles et al. [43] use probabilistic latent semantic analysis (pLSA) on spatio-temporal words to capture semantics. Shu-Fai Wong detects spatio-temporal interest points using global information [63] and then uses pLSA [64] to learn relationships between the semantics, revealing structural relationships. Soft quantization [22], which accounts for the distance to a number of codewords, has been considered for scene classification; Fisher Vectors have also been used for classification [25, 53], showing significant improvement over bag-of-words (BOW) methods. A detailed survey of various video summarization models can be found in [9]. Similar to [25, 53], we use local descriptors and form visual dictionaries. However, unlike [25, 53], we preserve more information instead of extracting only specific information.
In addition to building a dictionary at the pixel level, we extend this dictionary to the frame and shot levels, forming a hierarchical dictionary. Having similarity information available at various granularities is the key to creating applications that need features at the desired level.

5.0.2 Technical Contributions

In this paper, we propose a Hierarchical Dictionary Model (termed H-Video) to make the task of creating applications easier. Our method learns semantic dictionaries at three different levels — pixel patches, frames, and shots. The video is represented in the form of learned dictionary units that reveal semantic similarity and video structure. The main intention of this model is to provide these semantic dictionaries, so that comparison of video units at different levels, within the same video and across different videos, becomes easier. The benefits of H-Video include the following:

(i) The model advocates leveraging prior offline processing time at run time. As a consequence, applications run fast.
(ii) The model is built in an unsupervised fashion. As no application-specific assumption is made, many retrieval applications can use this model and its features. This can potentially save an enormous amount of computation time spent in learning.
(iii) The model represents learned information using a hierarchical dictionary. This allows the video to be represented as indexes into dictionary elements, which makes it easier for the developer of a new retrieval application, as similarity information is available as a one-dimensional array. In other words, our model does not demand a deep video-understanding background from application developers.
(iv) We have illustrated our model through several applications. Figure 5.1 illustrates these applications.

Figure 5.2: Illustration of the H-Video Abstraction (dictionary formation and the dictionary-based representation at the pixel, frame, and shot levels)

5.1 Methodology

We first give an overview of the process, which is illustrated in Figure 5.2.

5.1.1 Overview

Our model first extracts local descriptors like color and edge from pixel patches. (Color and edge descriptors are simply examples.) We then build a dictionary, termed the H1 dictionary, out of these features. At this point, the video could, in principle, be represented in terms of this H1 dictionary; we refer to each frame of this representation as an H1 frame. We then extract slightly higher level features, such as the histogram of H1 dictionary units, the number of regions in these H1 frames and so on, and form a new H2 dictionary. The H2 dictionary is yet another representation of the video and captures the type of objects and their distribution in the scene; in this way, it captures the nature of the scene. The video could also be represented using this H2 dictionary; we refer to each shot in this representation as an H2 shot. Further, we extract features based on the histogram and position of H2 dictionary units and build yet another structure, the H3 dictionary. This dictionary represents the type of shots occurring in the video.
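To make the overview concrete, the hierarchical representation can be pictured as nested arrays of dictionary indices, one level per granularity. The following is a minimal sketch under that reading; the class and field names (HVideoRepresentation, h1_frames, and so on) are hypothetical, and the thesis implementation is in Matlab, not Python.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class HVideoRepresentation:
    """Video represented as indices into the three learned dictionaries.

    h1_frames[f][p] : H1 dictionary index of patch p in frame f
    h2_shots[s][f]  : H2 dictionary index of frame f in shot s
    h3_video[s]     : H3 dictionary index of shot s in the video
    """
    h1_frames: List[List[int]] = field(default_factory=list)
    h2_shots: List[List[int]] = field(default_factory=list)
    h3_video: List[int] = field(default_factory=list)

    def shot_similarity(self, other: "HVideoRepresentation") -> float:
        """Crude example of why flat index arrays are convenient: shots of two
        videos built over the same global dictionaries can be compared simply
        by the overlap of their H3 units."""
        a, b = set(self.h3_video), set(other.h3_video)
        return len(a & b) / max(1, len(a | b))
```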
The video is now represented in terms of this H3 dictionary to form the H3 video.

5.1.2 H1 Dictionary Formation and Representation

For forming the H1 dictionary, at each pixel the surrounding 8 × 8 window of pixels is considered. We extract local descriptors from this moving window using pyramidal Gaussian features, neighborhood layout features, and edge filters. As an example, we use the filters listed in Fig. 5.3, which we have found to be effective in capturing color and shape information. A different set of features, such as SIFT or HOG, can also be used based on the requirements of the application. These features are extracted from the complete video. We use principal component analysis to determine the number of clusters; the top principal components are chosen based on the allowed error. We then use k-means to obtain the H1 dictionary. This is illustrated in Figure 5.4.

However, as clustering very long videos can be time consuming, one alternative is to do this process in stages. For example, we first form an H1 dictionary for each frame, combine dictionaries from a sequence of frames, and cluster again to form a more representative dictionary. Several dictionaries from temporal stages in the video are then combined to form a global H1 dictionary. This step-by-step approach of building the dictionary makes the computation easy and scalable.

Figure 5.3: Pixel-Level Feature Extraction Filters

Figure 5.4: H1 Dictionary Formation

Once the global H1 dictionary is available, we process the video again to remove duplicates or near duplicates and form a less redundant dictionary-based representation. The complete video is then represented using these dictionary units (left hand side of Fig. 5.2). Each frame in this representation is referred to as an H1 frame. H1 frames may also be thought of as a segmentation, but in addition to the segmentation, the dictionary units carry information about the nature of the objects. This process is demonstrated in Figure 5.5.

Figure 5.5: H1 Representation of a frame (bottom)

5.1.3 H2 Dictionary Formation and Representation

H2 Formation: From each H1 frame, conglomerate features like a histogram, the number of regions and the distance between them are captured. The list of features used is shown in Figure 5.6. A subset of these features can also be chosen based on the application.

Figure 5.6: Features extracted from an H1 frame

Similar to H1 dictionary formation, these features are clustered to form an H2 dictionary. We use the step-by-step dictionary building approach, wherein dictionaries for a sequence of frames are first built and then clustered again to form the global dictionary. This is illustrated in Figure 5.7. Using the H2 dictionary, we represent the video in terms of H2 units; we refer to each shot in this representation as an H2 shot. This is illustrated in Figure 5.8.

Figure 5.7: H2 dictionary formation

Changes in the H2 units correspond to dynamically changing shots, whereas more or less similar H2 units correspond to relatively static content. Representing the video as a one-dimensional array of dictionary units makes the comparison of video elements easier. H2 units capture higher-level details and are simpler for comparison purposes. Generally, the information needed by similarity applications is captured at this level.

5.1.4 H3 Dictionary Formation and Representation

Features capturing the distribution of H2 units, such as histograms and the distance between blocks, are considered.
The details of the various features captured are provided in Figure 5.9. These extracted features are clustered to form the H3 dictionary, wherein the dictionary is first computed at the shot level, and the dictionary units are then clustered again to form the global H3 dictionary. This is illustrated in Figure 5.10. Using this dictionary, the video is represented as a one-dimensional sequence of H3 units, where each unit corresponds to a shot, as illustrated in Figure 5.11. This representation will typically be useful in applications involving a huge collection of videos; applications can quickly narrow down to the correct interest group. This level will also be useful in segmentation and classification applications; for example, while classifying shots of a lecture video into professor, student and board slides, this level will be helpful.

Figure 5.8: Representation of H2 Units

Figure 5.9: Features extracted from an H2 shot

Figure 5.10: Feature extraction from H2 shots using a shot filter bank to form the H3 Dictionary

Figure 5.11: H3 representation of the video

5.1.5 Building a global dictionary

Some applications require videos to be compared at multiple dictionary levels. For this purpose, we cluster the H1 clusters of multiple input videos in the database together, and construct an H1 dictionary at the database level, rather than for an individual video. Using this notion of a global H1 dictionary, the H2 units and H3 units are generated again. (Input videos from the database are randomly sampled.)

5.2 Applications

In this section, we demonstrate the efficiency and applicability of our model through four retrieval applications — video classification, suggest a video, detecting logical turning points and searching for candidates for audio video mix. The applications designed using our model are fast, as they use pre-computed information. Typically these applications take a minute to compute the required information. This illustrates the efficiency and applicability of this model for content-based video retrieval.

5.2.1 Suggest a Video

Websites like imdb.com present related movies when the user is exploring a specific movie. This is a good source for users to find new or forgotten movies they would be interested in. Currently such suggestions are mostly based on text annotations and user browsing history, which the user may want to turn off due to privacy considerations. If visual features of the video can also be used to solve this problem, better suggestions can evolve. This scenario is illustrated in Figure 5.12, where the movies which the user was interested in are highlighted in red and a suggested movie is highlighted in green; the green one is presumably closest to the user's interest.

In this application, we aim to automatically suggest videos based on the video similarity of the various dictionary units. There are many ways to compare videos, such as comparing the feature vectors of the dictionaries, computing the amount of overlap in dictionary units, computing correlation, and so on. In this case, we take the sequence of H2 dictionary units and substitute each unit with the corresponding dictionary feature vector to compute similarity. We take the cross correlation of these H2 representation features to compute the similarity between videos; videos matching with a value above a threshold are suggested to the user.

Figure 5.12: Given a set of movies of interest (in red), a movie which is similar to them is suggested (in green).
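The following is a minimal sketch of the cross-correlation comparison described above, assuming each video has already been reduced to a sequence of H2 feature vectors over a shared global dictionary. The function names, the collapse of each shot's feature vector to a single trace, and the normalization details are assumptions for illustration; the thesis implementation is in Matlab.

```python
import numpy as np

def h2_similarity(seq_a, seq_b):
    """Compare two videos given their H2 unit sequences, each an array of shape
    (num_shots, feature_dim) holding the dictionary feature vector substituted
    for every H2 unit. Returns a scalar similarity score."""
    a = np.asarray(seq_a, dtype=np.float64)
    b = np.asarray(seq_b, dtype=np.float64)

    # Collapse each shot's feature vector to a single trace per video and
    # normalize, so the cross-correlation measures shape rather than magnitude.
    ta = (a.mean(axis=1) - a.mean()) / (a.std() + 1e-9)
    tb = (b.mean(axis=1) - b.mean()) / (b.std() + 1e-9)

    # Peak of the normalized cross-correlation over all temporal offsets.
    xcorr = np.correlate(ta, tb, mode="full") / min(len(ta), len(tb))
    return float(xcorr.max())

def suggest(query_seqs, candidates, threshold=0.5):
    """Suggest candidate videos whose similarity to any query exceeds a threshold."""
    return [name for name, seq in candidates.items()
            if any(h2_similarity(q, seq) > threshold for q in query_seqs)]
```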
5.2.2 Video Classification

In this application, we identify the various genres a movie could belong to and suggest this classification. This will reduce the effort involved in tedious annotation and help users choose appropriate genres easily. In our approach, we train a random forest model for each of the genres using the H2 and H3 histograms of the sample set. Given a new movie and a genre, the model outputs whether the movie belongs to that genre or not.

5.2.3 Logical Turning Point Detection

We define a Logical Turning Point in the video as a point where the characteristics of the objects in the video change drastically. Detecting such places helps in summarizing the video effectively. An example is shown in Figure 5.14; as illustrated in the figure, the characteristics of the people appearing in each of the differing story units are different. We consider shots within a moving window of 5 shots and compute the semantic overlap of H1 and H2 units between the shots. When the amount of overlap between shots is low in a specific window of shots, we detect that as a logical turning point; a short sketch of this overlap test is given below, after the experimental setup. As the logical turning points capture the important events in the video, this can be used in applications like advertisement placement and preparing the story line.

Figure 5.13: Movies are classified into drama and action genres

5.2.4 Potential Candidate Identifier for Audio Video Mix

Remix is the process of generating a new video from existing videos by changing either the audio or the video. Remixes are typically performed on video songs for variety in entertainment. The remix candidates need to have a similar phase of scene change to generate a pleasing output. We use cross correlation of H2 units to identify whether they have the same phase of change. Once the closest candidate is identified, we replace the remix candidate's audio with the original video's audio.

Figure 5.14: Story line of the movie A Walk to Remember split by story units. Characters are introduced in the first story unit; in the second story unit, the girl and the boy are participating in a drama; in the last part, they fall in love and get married. In spite of the lengths of the stories being very different, our method successfully identifies these different story units.

5.3 Experiments

In this section, we first evaluate the individual performance of the constituents of the H-Video model itself. Next, in creating dictionaries, we compare the usage of popular features such as SIFT. Finally, the performance of applications that may use more than one of the three levels, H1, H2, or H3, is evaluated.

Data: We have collected around 100 movie trailers from YouTube, twenty five full-length movies, and a dozen music video clips.

Computational Time: Our implementation is in Matlab. Typically a full-length movie takes around two hours for feature extraction. Building local dictionaries for a movie takes around 10 hours. Building the global dictionary, which is extracted from the local dictionaries of multiple videos, takes around 6 hours. Note that building local dictionaries and a global dictionary (left hand side of Fig. 5.2) are one-time jobs. Once these are built, the dictionaries are directly used to create the right hand side of Fig. 5.2. In other words, the relevant model building typically takes two hours, which is no different from the average runtime of the video itself. Once the model is constructed, each access operation typically takes only around 10 seconds per video.
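As promised in Section 5.2.3, here is a minimal sketch of the moving-window overlap test for logical turning points. It assumes each shot has already been reduced to a collection of H1/H2 dictionary indices; the function names and the exact overlap measure (Jaccard) are assumptions rather than the thesis implementation, which is in Matlab.

```python
def jaccard(a, b):
    """Overlap between two sets of dictionary indices."""
    a, b = set(a), set(b)
    return len(a & b) / max(1, len(a | b))

def turning_points(shot_units, window=5, low_overlap=0.2):
    """shot_units[i] is the collection of dictionary units seen in shot i.
    A shot is flagged as a logical turning point when the average overlap
    with the other shots in its window drops below `low_overlap`."""
    points = []
    half = window // 2
    for i in range(half, len(shot_units) - half):
        neighbours = [shot_units[j] for j in range(i - half, i + half + 1) if j != i]
        avg_overlap = sum(jaccard(shot_units[i], n) for n in neighbours) / len(neighbours)
        if avg_overlap < low_overlap:
            points.append(i)
    return points
```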
5.3.1 Individual Evaluation of H1 and H2

To illustrate the effectiveness of the H1 dictionary, we considered the classification problem and collected videos from the categories "car", "cricket" and "news anchor". We collected 6 videos from each category, summing to a total of 18 videos. We computed the H1 dictionary for each of these videos, formed a global H1 dictionary for the given dataset, and represented all videos in terms of this global dictionary. For testing purposes, we randomly selected two videos from each category as training data and used the remaining videos as the test set, testing each against the three categories. The recall and precision of classification using only H1 is provided in Table 5.1.

Table 5.1: Classification using only H1. With limited information, we are able to glean classification hints.
Category       Precision   Recall
Car            1.00        0.75
Cricket        0.67        1.00
News Anchor    1.00        0.75

To evaluate the effectiveness of only the H2 dictionary units, we built our model on a TV interview show; this had scenes of individuals as well as groups of people. When we built H2 dictionaries with the allowed error set to "top principal component / 1000", we got six categories capturing different scenes, people and their positions. As we relaxed the allowed error, this resulted in two categories, distinguishing individual scenes from groups of people. This result is presented in Fig. 5.15. Hence applications can tune the allowed-error parameter to suit their requirements.

Figure 5.15: Classification using only H2: (a) H2 dictionary units with smaller allowed error, (b) H2 dictionary units with larger allowed error. With only limited information, and with a smaller allowed error, fine details were captured. With a larger allowed error, broad categories were captured.

5.3.2 Evaluation of Alternate Features

In this section we evaluate popular features like SURF and SIFT and contrast them with the color & edge features used in this paper. Given any feature set, one may do a "direct comparison" (which takes longer), or do our proposed H-Video model-based comparison (which takes far less time). This experiment is performed on the "Suggest a Video" problem using only trailers of movies as the database. The result is presented in Table 5.2. When the H-Video model is used, we use H2 as the basis of comparison. We observe that the use of the hierarchical model helped improve the accuracy for SIFT and for the color & edge features; the accuracy was almost the same when using SURF features.

Table 5.2: Video suggestion using popular features
Method        Direct Comparison   H-Model Comparison   Percentage Improvement
SURF          54%                 54%                  0%
SIFT          29%                 53%                  83%
Color, Edge   48%                 59%                  23%

In producing these statistics, we used information available on imdb.com as the ground truth. One problem in using imdb.com is that the truth is limited to the top twelve related movies only. We therefore added transpose and transitive relationships as well. (In transpose relationships, if a movie A is deemed to be related to B, we mark B as related to A. In transitivity, if a movie A is related to B, and B is related to C, then we mark A as related to both B and C. The transitive closure is maintained via recursion.)

5.3.3 Evaluation of Video Classification

We considered three genres for video classification. We took the category annotations of 100 movie trailers and, for each category, considered 30% of the data for training and the remaining as the test set. We built the H-Video model for these videos, extracted the H2 and H3 representations, and classified using the random forest model.
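A minimal sketch of this per-genre classification step is shown below, assuming the H2 and H3 histograms have already been computed and using scikit-learn's RandomForestClassifier as a stand-in. The thesis implementation is in Matlab, and the variable names here are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_genre_models(features, genre_labels, n_trees=100):
    """features: array of shape (num_movies, dim), each row the concatenated
    H2 and H3 histograms of one trailer.
    genre_labels: dict mapping genre name -> binary array (1 if the movie
    belongs to that genre). One binary random forest is trained per genre."""
    models = {}
    for genre, y in genre_labels.items():
        clf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
        clf.fit(features, y)
        models[genre] = clf
    return models

def predict_genres(models, movie_features):
    """Return the list of genres whose model accepts the new movie."""
    x = np.asarray(movie_features).reshape(1, -1)
    return [genre for genre, clf in models.items() if clf.predict(x)[0] == 1]
```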
Example output is shown in Fig. 5.16.

5.3.4 Evaluation of Logical Turning Point Detection

Typically, drama movies have three logical parts: first the characters are introduced, then they get together, and then a punchline is presented towards the end. Considering this as ground truth, we validated the detected logical turning points. The logical turning points were detected with a precision of 1.0 and a recall of 0.75.

5.3.5 Evaluation of Potential Remix Candidates

We conducted experiments on 20 song clips, where the aim is to find the best remix candidates. Our algorithm found two pairs of songs which are the best candidates for a remix in the given set. Sample frames from the matched videos are presented in Fig. 5.17.

Figure 5.16: Sample result of classifying movie trailers into the categories Drama, Action and Romance. In most of the cases, our model has classified the movie trailers correctly.

Figure 5.17: Sample frames from the identified remix candidates: (a) Remix Candidates – Set 1, (b) Remix Candidates – Set 2. In each set, the top row corresponds to an original video song and the second row corresponds to the remix candidate. The content and sequencing of the first and second rows match, suggesting the effectiveness of our method.

Chapter 6 Conclusion & Future Work

Videos revolutionize many of the ways we receive and use information every day. The availability of online resources has changed many things. The usage of videos has grown dramatically, demanding better ways of retrieving them. Many diverse and exciting initiatives demonstrate how visual content can be learned from video. Yet at the same time, reaching videos online using visual content has arguably been less radical, especially in comparison to text search. There are many complex reasons for this slow pace of change, including the lack of proper tags and the huge computation time taken for learning specific topics. In this thesis, we have presented ways to optimize computation time, which is the most important obstacle to retrieving videos. Next, we have focused on producing visual summaries, so that users can quickly find whether a video is of interest to them. We have also presented ways to slice and dice videos so that users can quickly reach the segment they are interested in.

1. Lead Star Detection: Detecting lead stars has numerous applications, such as identifying the player of the match, detecting the lead actor and actress in motion pictures, and guest/host identification. Computational time has always been a bottleneck for using this technique. In our work, we have presented a faster method to solve this problem with comparable accuracy. This makes our algorithm usable in practice.

2. AutoMontage: Photo Sessions Made Easy!: With the increased usage of cameras by novices, tools that make photo sessions easier are becoming increasingly valuable. Our work successfully creates a photo montage from photo session videos, combining the best expressions into a single photo.

3. Cricket Delivery Detection: Inspired by prior work [45], we approached the problem of temporal segmentation of cricket videos. As the view classification accuracy is highly important, we have proposed a few corrective measures to improve the accuracy by introducing a pop-up ad eliminator and finer features for view classification. Further, we associated each delivery with its key players, providing a browse-by-player feature.

4.
Hierarchical Model: In traditional video retrieval systems, relevant features are extracted from the video and applications are built using the extracted features. For multimedia database retrieval systems, there is typically a plethora of applications required to satisfy user needs. A unified model which uses fundamental notions of similarity would therefore be valuable in reducing the computation time for building applications. In this work, we have proposed a novel model called H-Video, which provides the semantic information needed by retrieval applications. In summary, both the creation time (programmer time) and the runtime (computer time) of the resulting applications are reduced. First, our model provides the semantic information of a video in a simple way, so that it is easy for programmers. Second, due to the suggested pre-processing of long video data, runtime is reduced. We have built four applications as examples to demonstrate our model.

6.1 Future Work

The future scope of this work is to enhance the hierarchical model to learn concepts automatically from a set of positive and negative examples using machine learning techniques like random forests or random trees. This would serve as a great source for learning a particular tag and applying it to all other videos which have the same concept. This model can also be extended to matching parts of videos instead of whole videos. The hierarchical model can also be used to analyze video types and integrate with other applications. For example, video types pertaining to an actor can be learnt to create a profile of that actor. The learnt video type, when associated with an appropriate tag, can serve complex queries related to the actor. The hierarchical model can also be used to identify photo session videos, so that automontage can be applied automatically to create a photo album.

The future scope of this work also includes developing more applications to find videos of the user's interest. Video sharing websites create automatic playlists based on the user's choices. In such a list, the videos seem to be based on the browsing patterns of users. However, this has the drawback of mixing different types of videos, whereas the user may be interested in a specific type of video. Hence an application that learns video types from the recommendations and provides filtering options based on various attributes will help the user narrow down their interest. A few attributes that could be used for such filtering are the lead actors, the number of people in the video, video types from the hierarchical model, popular tags associated with videos, and recency.

In conclusion, we feel that generic models with quick retrieval times combined with user-centric applications have much unexplored potential. They will become the favored methods for reaching videos in the near future.

Thesis related Publications

1. Nithya Manickam, Sharat Chandran. Automontage: Photo Sessions Made Easy. In IEEE ICIP 2013, pp. 1321-1325.
2. Nithya Manickam, Sharat Chandran: "Automontage," Filed Indian Patent 350/MUM/2013, February 2013.
3. Nithya Manickam, Sharat Chandran. Fast Lead Star Detection in Entertainment Videos. In IEEE WACV 2009, pp. 1-6.
4. Nithya Manickam, Sharat Chandran. Hierarchical Summarization for Easy Video Applications. IAPR MVA 2015.
5. Binod Pal, Nithya Manickam, Sharat Chandran: "Cricket Delivery Detection," Filed Indian Patent 319/MUM/2015, January 2015.

Thesis related Publications (In preparation or review)

1. Nithya Manickam, Binod Pal, Sharat Chandran. Cricket Delivery Detection. Submitted to IEEE ICIP 2015.
71 72 Bibliography [1] Internet movie database. http://www.imdb.com. [2] Trec video retrieval evaluation. www-nlpir.nist.gov/projects/ trecvid. [3] Youtube video 94rotary convention_youth-hub committee 17th. rotaractors. http://www. youtube.com/watch?v=KQ73-P9HGiE. [Last visited on 07/06/2013]. xiii, 35 [4] Youtube video jc class of 1976 reunion-group photo session. http://www.youtube. com/watch?v=9FZTh_BkRD4. [Last visited on 07/06/2013]. xiii, 33 [5] Youtube video search. http://www.youtube.com. [6] A. Agarwala, M. Dontcheva, M. Agrawala, S. Drucker, A. Colburn, B. Curless, D. Salesin, and M. Cohen. Interactive digital photomontage. ACM Transactions on Graphics, 23(3):294–302, Aug 2004. iii, xiii, 28, 34 [7] C. Bang, S.-C. Chenl, and M.-L. Shyu. Pixso: a system for video shot detection. pages 1320 – 1324, December 2003. [8] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11):1222– 1239, 2001. 33 [9] S.-F. Chang, W.-Y. Ma, and A. Smeulders. Recent advances and challenges of semantic image/video search. In Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, volume 4, pages IV–1205–IV–1208, April 2007. 49 [10] J. Chen and Q. Ji. A hierarchical framework for simultaneous facial activity tracking. In IEEE International Conference on Automatic Face Gesture Recognition and Workshops, pages 679–686, March 2011. [11] M. Covell and S. Ahmad. Analysis by synthesis dissolve detection. pages 425 – 428, 2002. [12] M. Das and A. Loui. Automatic face-based image grouping for albuming. In IEEE International Conference on Systems, Man and Cybernetics, volume 4, pages 3726–3731, 73 Oct. 2003. 28 [13] A. Doulamis and N. Doulamis. Optimal content-based video decomposition for interactive video navigation. Circuits and Systems for Video Technology, IEEE Transactions on, 14(6):757–775, June 2004. 11 [14] D. P. W. Ellis and G. E. Poliner. Identifying ‘cover songs’ with chroma features and dynamic programming beat tracking. In Identifying ‘Cover Songs’ with Chroma Features and Dynamic Programming Beat Tracking, volume 4, April 2007. [15] A. A. et. al. Ibm research trecvid-2005 video retrieval system. November 2005. [16] M. Everingham and A. Zisserman. situation comedies. Automated visual identification of characters in In ICPR ’04: Proceedings of the Pattern Recognition, 17th International Conference on (ICPR’04) Volume 4, pages 983–986, Washington, DC, USA, 2004. IEEE Computer Society. 12 [17] T. Fang, X. Zhao, O. Ocegueda, S. Shah, and I. Kakadiaris. 3d facial expression recognition: A perspective on promises and challenges. In IEEE International Conference on Automatic Face Gesture Recognition and Workshops, pages 603–610, March 2011. [18] A. W. Fitzgibbon and A. Zisserman. On affine invariant clustering and automatic cast listing in movies. In ECCV ’02: Proceedings of the 7th European Conference on Computer Vision-Part III, pages 304–320, London, UK, 2002. Springer-Verlag. 12, 13 [19] S. Foucher and L. Gagnon. Automatic detection and clustering of actor faces based on spectral clustering techniques. In Proceedings of the Fourth Canadian Conference on Computer and Robot Vision, pages 113–122, 2007. iii, 12, 13, 22 [20] B. Funt and G. Finlayson. Color constant color indexing. Pattern Analysis and Machine Intelligence, 17:522 – 529, 1995. [21] M. Furini, F. Geraci, M. Montangero, and M. Pellegrini. 
On using clustering algorithms to produce video abstracts for the web scenario. In Consumer Communications and Networking Conference, 2008. CCNC 2008. 5th IEEE, pages 1112–1116, Jan. 2008. [22] J. C. Gemert, J.-M. Geusebroek, C. J. Veenman, and A. W. Smeulders. Kernel codebooks for scene categorization. In Proceedings of the 10th European Conference on Computer Vision: Part III, ECCV ’08, pages 696–709, Berlin, Heidelberg, 2008. Springer-Verlag. 49 [23] K. M. H., P. K., and S. S. Semantic event detection and classification in cricket video sequence. In Indian Conference on Computer Vision Graphics and Image Processing, pages 382–389, 2008. 37, 38 74 [24] O. Javed, Z. Rasheed, and M. Shah. A framework for segmentation of talk and game shows. Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, 2:532–537 vol.2, 2001. 12, 21, 24 [25] H. Jegou, M. Douze, C. Schmid, and P. Perez. Aggregating local descriptors into a compact image representation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 3304–3311, June 2010. 49 [26] J. Jeon, V. Lavrenko, and R. Manmatha. Automatic image annotation and retrieval using cross-media relevance models. In SIGIR ’03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, pages 119– 126, New York, NY, USA, 2003. ACM. [27] P. S. K., P. Saurabh, and J. C. V. Text driven temporal segmentation of cricket videos. In Indian Conference on Computer Vision Graphics and Image Processing, pages 433–444, 2006. 37, 38 [28] S. Kumano, K. Otsuka, D. Mikami, and J. Yamato. Analyzing empathetic interactions based on the probabilistic modeling of the co-occurrence patterns of facial expressions in group meetings. In IEEE International Conference on Automatic Face Gesture Recognition and Workshops, pages 43–50, March 2011. 28 [29] I. Laptev. On space-time interest points. International Journal of Computer Vision, 64(23):107–123, 2005. 49 [30] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 2, pages 2169–2178, 2006. [31] S.-H. Lee, J.-W. Han, O.-J. Kwon, T.-H. Kim, and S.-J. Ko. Novel face recognition method using trend vector for a multimedia album. In IEEE International Conference on Consumer Electronics, pages 490–491, Jan. 2012. 28 [32] C.-H. Li, C.-Y. Chiu, C.-R. Huang, C.-S. Chen, and L.-F. Chien. Image content clustering and summarization for photo collections. In IEEE International Conference on Multimedia and Expo, pages 1033–1036, July 2006. 28 [33] D. Li and H. Lu. Avoiding false alarms due to illumination variation in shot detection. pages 828 – 836, October 2000. [34] J. Li, J. H. Lim, and Q. Tian. Automatic summarization for personal digital photos. In Fourth Pacific Rim Conference on Information, Communications and Signal Processing 75 and Proceedings of the Joint Conference of the Fourth International Conference on Multimedia, volume 3, pages 1536–1540, Dec. 2003. 28 [35] R. Lienhart and A. Zaccarin. A system for reliable dissolve detection in videos. volume III, pages 406 – 409, 2001. [36] S. H. Lim, Q. Lin, and A. Petruszka. Automatic creation of face composite images for consumer applications. In IEEE International Conference on Acoustics Speech and Signal Processing, pages 1642–1645, March 2010. 28 [37] G. Littlewort, M. Bartlett, L. Salamanca, and J. Reilly. 
Automated measurement of children’s facial expressions during problem solving tasks. In IEEE International Conference on Automatic Face Gesture Recognition and Workshops, pages 30–35, March 2011. 28 [38] Z. Liu and H. Ai. Automatic eye state recognition and closed-eye photo correction. In 19th International Conference on Pattern Recognition, pages 1–4, Dec. 2008. 28 [39] H. Lu and Y. Tan. An effective postrefinement method for shot boundary detection. CirSysVideo, 15:1407 – 1421, November 2005. [40] P. Lucey, J. F. Cohn, I. Matthews, S. Lucey, S. Sridharan, J. Howlett, and K. M. Prkachin. Automatically detecting pain in video through facial action units. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 41(3):664–674, 2011. 27 [41] S. Malassiotis and F. Tsalakanidou. Recognizing facial expressions from 3d video: Current results and future prospects. In IEEE International Conference on Automatic Face Gesture Recognition and Workshops, pages 597–602, March 2011. [42] C. Ngo, T. Pong, and R. Chin. Detection of gradual transitions through temporal slice analysis. pages 36 – 41, 1999. [43] J. C. Niebles, H. Wang, and L. Fei-Fei. Unsupervised learning of human action categories using spatial-temporal words. International Journal of Computer Vision, 79(3):299–318, 2008. 48, 49 [44] S. O’Hara, Y. M. Lui, and B. Draper. Unsupervised learning of human expressions, gestures, and actions. In IEEE International Conference on Automatic Face Gesture Recognition and Workshops, pages 1–8, March 2011. [45] B. Pal and S. Chandran. Sequence based temporal segmentation of cricket videos. In Sequence based Temporal Segmentation of Cricket Videos, 2010. iv, xi, 7, 8, 37, 38, 40, 41, 68 [46] N. Patel and I. Sethi. Video shot detection and characterization for video databases. Pattern 76 Recognition, 30:583 – 592, April 1997. [47] C. Petersohn. Fraunhofer hhi at trecvid 2004. shot boundary detection system. November 2004. [48] Z. Rasheed and M. Shah. Scene detection in hollywood movies and tv shows. volume II, pages 343 – 348, June 2003. [49] S. Shahraray. Scene change detection and content-based sampling of video sequence. pages 2 – 13, February 1995. [50] R. Shaw and P. Schmitz. Community annotation and remix: a research platform and pilot deployment. In HCM ’06: Proceedings of the 1st ACM international workshop on Human-centered multimedia, pages 89–98, New York, NY, USA, 2006. ACM. [51] H. Soyel and H. Demirel. Improved sift matching for pose robust facial expression recognition. In IEEE International Conference on Automatic Face Gesture Recognition and Workshops, pages 585–590, March 2011. [52] G. Stratou, A. Ghosh, P. Debevec, and L.-P. Morency. Effect of illumination on automatic expression recognition: A novel 3d relightable facial database. In IEEE International Conference on Automatic Face Gesture Recognition and Workshops, pages 611–618, March 2011. [53] C. Sun and R. Nevatia. Large-scale web video event classification by use of fisher vectors. In Applications of Computer Vision (WACV), 2013 IEEE Workshop on, pages 15–22, Jan 2013. 49 [54] D. Swanberg, C. Shu, and R. Jain. Knowledge guided parsing in video database. pages 13 – 24, May 1993. [55] Y. Takahashi, N. Nitta, and N. Babaguchi. Video summarization for large sports video archives. Multimedia and Expo, 2005. ICME 2005. IEEE International Conference on, pages 1170–1173, July 2005. 11, 12, 20 [56] D. Tjondronegoro, Y.-P. P. Chen, and B. Pham. Highlights for more complete sports video summarization. Multimedia, IEEE, 11(4):22–37, Oct.-Dec. 
2004. [57] H. Tong, J. He, M. Li, C. Zhang, and W.-Y. Ma. Graph based multi-modality learning. In MULTIMEDIA ’05: Proceedings of the 13th annual ACM international conference on Multimedia, pages 862–871, New York, NY, USA, 2005. ACM. [58] G. Tzimiropoulos, S. Zafeiriou, and M. Pantic. Principal component analysis of image gradient orientations for face recognition. In IEEE International Conference on Automatic Face Gesture Recognition and Workshops, pages 553–558, March 2011. 77 [59] P. Viola and M. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, May 2004. 16 [60] P. A. Viola and M. J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004. 29 [61] T. Vlachos. Cut detection in video sequences using phase correlation. Signal Processing Letters, pages 173 – 175, July 2000. [62] A. Waibel, M. Bett, and M. Finke. Meeting browser: Tracking and summarizing meetings. In Proceedings DARPA Broadcast News Transcription and Understanding Workshop, pages 281–286, February 1998. 12 [63] S.-F. Wong and R. Cipolla. Extracting spatiotemporal interest points using global information. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8, Oct. 2007. 49 [64] S.-F. Wong, T.-K. Kim, and R. Cipolla. Learning motion categories using both semantic and structural information. In Computer Vision and Pattern Recognition, 2007. CVPR ’07. IEEE Conference on, pages 1–6, June 2007. 48, 49 [65] L. Xie, S.-F. Chang, A. Divakaran, and H. Sun. Unsupervised discovery of multilevel statistical video structures using hierarchical hidden markov models. In Multimedia and Expo, 2003. ICME ’03. Proceedings. 2003 International Conference on, volume 3, pages III–29–32 vol.3, July 2003. 48 [66] D. Xu, X. Li, Z. Liu, and Y. Yuan. Anchorperson extraction for picture in picture news video. Pattern Recogn. Lett., 25(14):1587–1594, 2004. 12 [67] G. Xu, Y.-F. Ma, H.-J. Zhang, and S.-Q. Yang. An hmm-based framework for video semantic analysis. Circuits and Systems for Video Technology, IEEE Transactions on, 15(11):1422–1433, Nov. 2005. 48 [68] C. Yeo, Y.-W. Zhu, Q. Sun, and S.-F. Chang. A framework for sub-window shot detection. pages 84 – 91, 2005. [69] H.-W. Yoo, H.-J. Ryoo, and D.-S. Jang. Gradual shot boundary detection using localized edge blocks. Multimedia Tools and Applications, 28:283 – 300, 2006. [70] G. Yuliang and X. De. A solution to illumination variation problem in shot detection. pages 81 – 84, November 2004. [71] R. Zabih, J. Miller, and K. Mai. Feature-based algorithms for detecting and classifying scene breaks. 1995. [72] H. Zhang, A. Kankanhalli, and S. Smoliar. Automatic partitioning of full-motion video. 78 ACM Multimedia Systems, 1:10 – 28, 1993. [73] Y. Zhang, L. Gao, and S. Zhang. Feature-based automatic portrait generation system. In WRI World Congress on Computer Science and Information Engineering, volume 3, pages 6–10, 31 2009-april 2 2009. [74] Z. Zhang, G. Potamianos, A. W. Senior, and T. S. Huang. Joint face and head tracking inside multi-camera smart rooms. Signal, Image and Video Processing, 1:163–178, 2007. 12 [75] Y. Zhuang, Y. Rui, T. Huang, and S. Mehrotra. Adaptive key frame extraction using unsupervised clustering. In Proceedings of the International Conference on Image Processing, volume 1, pages 866–870, 1998. 11 79 80 Acknowledgments I would like to express my special appreciation and thanks to my advisor Prof. Sharat Chandran, you have been a tremendous mentor for me. 
I would like to thank you for encouraging my research and for allowing me to grow as a research scholar. I would also like to thank my committee members Prof. Subhasis Chaudhuri, Prof. Shabbir Merchant for serving as my committee members even at hardship. I would like to thank CSE Department Staff members for their constant support. More than academic support, I have many, many people to thank for listening to and, at times, having to tolerate me over the past years. I express my gratitude and appreciation for their friendship. ViGIL lab members have been unwavering in their personal and professional support during the time I spent at the University. I would like to thank my colleagues from Amazon, who have been of great help. I would especially like to thank my mentor Dr. Arati Deo, for all the support. Your advice on both research as well as on my career have been priceless. A special thanks to my family. Words cannot express how grateful I am to my family members for all of the sacrifices that you have made on my behalf. I would also like to thank all of my friends who supported me, and incented me to strive towards my goal. I would like to express appreciation to my beloved husband Sudhakar who was always my support in the moments when there was no one to answer my queries. At the end I would like to thank my beloved daughter T.S. Harini for her love, understanding and prayers. Nithya Sudhakar Date: Nithya Sudhakar 82