INDIAN INSTITUTE OF TECHNOLOGY BOMBAY
Department of Computer Science and Engineering

SYNOPSIS of the Ph.D. thesis entitled

EXTENDING SOCIAL VIDEOS FOR PERSONAL CONSUMPTION

Proposed to be submitted in partial fulfillment of the degree of DOCTOR OF PHILOSOPHY of the INDIAN INSTITUTE OF TECHNOLOGY, BOMBAY

by
NITHYA SUDHAKAR

Supervisor: Prof. Sharat Chandran
Department of Computer Science and Engineering
INDIAN INSTITUTE OF TECHNOLOGY BOMBAY

Motivation

A picture speaks a thousand words, and a video is thousands of pictures. Videos let people articulate their ideas creatively, and the abundance of material on video sharing websites makes video freely accessible to everyone. Videos are a great source of information and entertainment, covering areas such as news, education, sports, movies and events.

In the early era of the internet, when web search and video search were first introduced, the only way to make content searchable was to annotate every web page and video with all possible tags. As the volume of content grew, this approach no longer worked. Researchers therefore came up with interesting ways of training classifiers to recognize real-world concepts such as indoor, outdoor, swimming, car or airplane, so that annotations could be extracted automatically from video frames. Although this increased the number of tags and reduced human effort, searchability was restricted to the concepts that had been trained, and training an exhaustive list of concepts was again a tedious process. To overcome this, the bag-of-words feature was introduced: analogous to text, it provides a visual description of the various objects in a scene, and an image can be used as the query instead of traditional text. However, this leaves the user with the problem of finding a suitable image to represent their thought.

Over the years, with the tremendous growth of video sharing websites, people have uploaded huge collections of videos with their own annotations. By combining bag-of-words features with the various available tags, annotation has reached a level at which videos are easily searchable. However, as the user gets closer to the video results, they have various expectations for which no sufficient annotation exists. The user therefore browses through results and recommendations until they find videos of their interest. The size of videos poses an additional challenge, as the user has to download several videos and skim through them to check whether they satisfy the requirement. This scenario is illustrated in Figure 1.

Figure 1: A scenario where a user browses through various video results to reach the video of his interest.

Figure 2: A sample scenario where the user is provided with more filtering options and visual representations along with the video results (filters on duration, genre, lead cast, customized clips such as "sixes and boundaries by India" or "last two overs of Sri Lankan bowling", interviews by host or guest, similar videos and sequels). Please note that the suggestions and visual representations vary based on the input video and its genre.
This demands a whole realm of applications that help the user quickly get close to the video they want. In this thesis we propose applications such as lead star detection and story-line generation, which give users a quick idea about a video without having to go through it. We also propose applications that let the user watch selected parts instead of the complete video; examples we have explored in this direction include cricket delivery detection and logical turning point detection. The closer the user gets to a specific genre, the more genre-specific applications make sense compared to generic ones. One example we have tried is AutoMontage, which generates a photograph by merging different expressions into a pleasing output. We further noticed great potential for such genre-specific applications, which commonly need similarity information. We have therefore proposed a technique called the Hierarchical Model, which enables a family of similarity applications, and we have demonstrated it through supporting applications such as logical turning point detection, story-line generation, movie classification and audio-video remix candidate search.

Problem Statement

The objective of the thesis is to design applications and techniques that help users reach a specific video of their interest easily. A sample scenario in which the user is provided with various options is illustrated in Figure 2. As the space of such applications is wide, we address the following problems specifically:

1. Summarize a video so that the user can quickly visualize its content.
2. Allow the user to browse through specific parts of a video instead of the whole video.
3. Design genre-specific video applications that are of great use in their genre.
4. Design a technique that can support similarity applications for various needs such as auto-suggestion and story-line generation.

Our Contributions

In this thesis, we propose applications and a technique that achieve the following, so that the user can quickly filter video results, skim through a video, or explore related ones.

Figure 3: Applications developed or addressed in our work (AutoMontage; lead star detection with key player detection, lead actor identification and guest-host detection; storyline generation; shot type segmentation; logical turning point detection; cricket delivery detection; movie classification; suggest a video; remix candidate identification; built around the Hierarchical Model and grouped under segmentation, summarization and suggestion).

1. Summarize videos in terms of lead stars, allowing the user to narrow down results based on the stars of their interest.
2. Summarize a video in terms of its story-line, so that the user can get a quick overview before viewing it.
3. Segment the video, allowing the user to browse through specific parts instead of the whole video.
4. Design genre-specific video applications such as AutoMontage, cricket delivery detection and remix candidate identification, which are of great use in their specific genres.
5. Design a technique that can support a family of similarity applications, including several of the applications listed above.

We achieve these functionalities using techniques ranging from building appropriate dictionaries to specific view classification and merging expressions to create good photographs. We have classified these video applications into three major categories: segmentation, summarization and classification.
We present the applications we have developed in these categories, along with a model that supports applications from all of them. A visual summary of the work is presented in Figure 3. In the remainder of this synopsis, we give an overview of these applications and of the model.

Fast Lead Star Detection in Entertainment Videos

Suppose an avid cricket fan or coach wants to learn exactly how Flintoff repeatedly got Hughes "out." Or a movie buff wants to watch an emotional scene involving his favorite heroine in a Hollywood movie. Clearly, in such scenarios, you want to skip frames that are "not interesting." One possibility that has been well explored is extracting key frames in shots or scenes and then creating thumbnails. Another natural alternative, the emphasis in this work, is to determine frames around what we call lead stars. A lead star in an entertainment video is the actor who, most likely, appears in many significant frames. We define lead stars in other videos as well. For example, the lead star in a soccer match is the hero, or the player of the match, who has scored "important" goals; intuitively, he is the one the audience has paid to see. Similarly, the lead star in a talk show is the invited guest or, for that matter, the host. This work presents how to detect lead stars in entertainment videos. Moreover, like various video summarization methods Zhuang et al. [1998]; Takahashi et al. [2005]; Doulamis and Doulamis [2004], lead stars are a natural way of summarizing video. (Multiple lead stars are of course allowed.)

Figure 4: Lead star detection. This is exemplified in sports by the player of the match; in movies, stars are identified; and in TV shows, the guest and host are located.

Related Work

Researchers have explored various video-specific applications for lead star detection: anchor detection in news video Xu et al. [2004], lead casts in comedy sitcoms Everingham and Zisserman [2004], summarizing meetings Waibel et al. [1998], guest-host detection Takahashi et al. [2005]; Javed et al. [2001], and locating the lecturer in smart rooms by tracking the face and head Zhang et al. [2007]. Fitzgibbon and Zisserman [2002] use affine-invariant clustering to detect the cast listing of a movie; as the original algorithm had quadratic runtime, the authors used a hierarchical strategy to speed up the clustering that is central to their method. Foucher and Gagnon [2007] used spectral clustering techniques for clustering actor faces; their method detects the actors' clusters in an unsupervised way, with a computation time of about 23 hours for a motion picture.

Our Strategy

Although the lead actor is defined using a pictorial or semantic concept, an important observation is that significant frames in an entertainment video are often accompanied by a change in the audio intensity level. It is true, no doubt, that not all frames containing the lead actors involve significant audio differences; our interest, however, is not at the frame level. The advent of important scenes and important people bears a strong correlation with the audio level. We surveyed around one hundred movies and found that it is rarely, if ever, the case that the lead star does not appear in audio-highlighted sections, although the nature of the audio may change from scene to scene. And, as alluded to above, once the lead star has entered the shot, the frames may well contain normal audio levels.
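To make the first step of this strategy concrete, the sketch below flags "audio highlighted" segments as windows whose short-time energy deviates strongly from the running average. This is only a minimal sketch under assumed choices (window length, a 2-sigma threshold, the function name); the thesis does not prescribe these values.

```python
# Minimal sketch: flag candidate "important scenes" from the audio track.
# Assumption: the audio has been extracted to a mono/stereo WAV file; the
# window length and the 2-sigma threshold are illustrative, not the thesis settings.
import numpy as np
from scipy.io import wavfile

def audio_highlight_segments(wav_path, win_sec=1.0, z_thresh=2.0):
    rate, samples = wavfile.read(wav_path)
    samples = samples.astype(np.float64)
    if samples.ndim > 1:                      # stereo -> mono
        samples = samples.mean(axis=1)
    win = int(rate * win_sec)
    n_win = len(samples) // win
    # short-time energy per window
    energy = np.array([np.sum(samples[i * win:(i + 1) * win] ** 2) for i in range(n_win)])
    z = (energy - energy.mean()) / (energy.std() + 1e-9)
    # windows whose energy deviates strongly from the average are "highlights"
    flagged = np.where(np.abs(z) > z_thresh)[0]
    return [(i * win_sec, (i + 1) * win_sec) for i in flagged]

if __name__ == "__main__":
    for start, end in audio_highlight_segments("movie_audio.wav"):
        print(f"candidate important scene: {start:.0f}s - {end:.0f}s")
```

Only the faces detected within (and tracked around) such segments are then clustered to form the face dictionary, as described next.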
Figure 5: Our strategy for lead star detection. We detect lead stars by considering segments that involve a significant change in audio level. However, this by itself is not enough!

Our method is built upon this concept: we detect lead stars by considering such important scenes of the video. To reduce false positives and false negatives, our method clusters the faces of each important scene separately and then combines the results. Unlike the method in Fitzgibbon and Zisserman [2002], our method provides a natural segmentation for clustering. It is shown to considerably reduce the computation time of the previously mentioned state of the art Foucher and Gagnon [2007] for computing the lead star in a motion picture (by a factor of about 50). We apply this method to sports videos to identify the player of the match, to motion pictures to find heroes and heroines, and to TV shows to detect the guest and host.

Figure 6: Strategy further unfolded. We first detect scenes accompanied by a change in audio level. Next we look for faces in these important scenes and, to confirm their suitability, track the faces in subsequent frames. Finally, a face dictionary representing the lead stars is formed by clustering the confirmed faces.

Methodology

As mentioned, the first step is to find important scenes that have audio highlights. Once such scenes are identified, they are further examined for potential faces. When a potential face is found in a frame, subsequent frames are analyzed to weed out false alarms using concepts from tracking. At this point, several regions are confirmed as faces; these confirmed faces are grouped into clusters to identify the lead stars.

Applications

In this section, we demonstrate our technique using three applications: Lead Actor Detection in Motion Pictures, Player of the Match Identification, and Host-Guest Detection. As applications go, player of the match identification has not been well explored considering the enormous interest in it. In the other two applications, our technique detects lead stars faster than the state-of-the-art techniques, which makes our method practical and easy to use.

Lead Actor Detection in Motion Pictures

In motion pictures, detecting the hero, heroine and villain has many interesting benefits. A person reviewing a movie can skip the scenes where the lead actors are not present, a profile of the lead actors can be generated, and significant scenes containing many lead actors can be used for summarizing the video. In our method, the face dictionary that is formed contains the lead actors. These face dictionaries are good enough in most cases; however, for more accurate results, the algorithm scans through a few frames of every shot to determine the faces occurring in the shot, and the actors who appear in a large number of shots are identified as lead actors. The result for the movie Titanic after scanning through the entire movie is shown in Figure 7.

Figure 7: Lead actors detected for the movie Titanic.

Player of the Match Identification

In sports, highlights and key frames Takahashi et al. [2005] are the two main methods used for summarization. We summarize sports using the player of the match, capturing the star players. Detecting and tracking players across the complete sports video does not yield the player of the match: star players may play for a shorter time and score more, as opposed to players who attempt many times and do not score. Analyzing the players when there is a score therefore leads to the identification of the star players.
This is easily achieved by our technique, since detecting highlights yields exciting scenes such as scores. The result of lead sports stars detected from a soccer match, Liverpool vs Havant & Waterlooville, is presented in Figure 8; the key players of the match are detected.

Host-Guest Detection

In TV interviews and other TV programs, detecting the host and guest of the program is key information used in video retrieval. Javed et al. [2001] have proposed a method for this which removes the commercials and then exploits the structure of the program to detect the guest and host.

Figure 8: Lead stars detected for the highlights of a soccer match, Liverpool vs Havant & Waterlooville. The first image is erroneously detected as a face; the other results represent players and the coach.

The algorithm uses the inherent structure of the interview, namely that the host appears for a shorter duration than the guest. However, this is not always the case, especially when the hosts are equally popular, as in TV shows like Koffee With Karan; in competition shows, the host is shown for a longer duration than the guests or judges. Our algorithm detects hosts and guests as lead stars. To distinguish hosts from guests, we detect lead stars over multiple episodes and combine the results: intuitively, the lead stars recurring over multiple episodes are the hosts, and the other lead stars detected in specific episodes are the guests.

Experiments

We have implemented our system in Matlab and tested it on an Intel Core Duo processor, 1.6 GHz, with 2 GB RAM. We have conducted experiments on 7 popular motion pictures, 9 soccer match highlights and two episodes of TV shows, summing up to a total of 19 hours 23 minutes of video. Our method detected the lead stars in all the videos in an average of 14 minutes for one hour of video. The method of Foucher and Gagnon [2007] computes the lead star of a motion picture in 23 hours, whereas we compute it in an average of 30 minutes. We now provide more details.

Lead Actor Detection in Motion Pictures

We ran our experiments on the 7 box-office hit movies listed in Table 1, which sum up to 16 hours of video. The lead stars in all these movies are computed in 3 hours 18 minutes, so the average computation time per movie is around 30 minutes. From Table 1, we see that the best computation time is 4 minutes, for Austin Powers in Goldmember, which is 1 hour 42 minutes long. The worst computation time is 45 minutes, for Matrix Revolutions, of duration 2 hours 4 minutes. For movies like Eyes Wide Shut and Austin Powers in Goldmember, the computation is faster as there are fewer audio highlights, whereas action movies like the Matrix sequels and Titanic take more time as there are many audio highlights. This causes the variation in computation time among movies.

Table 1: Time taken for computing lead actors for popular movies.

No.  Movie Name                              Duration  Lead Star Detection  Refinement
                                             (hh:mm)   (hh:mm)              (hh:mm)
1    The Matrix                              02:16     00:22                00:49
2    The Matrix Reloaded                     02:13     00:38                01:25
3    Matrix Revolutions                      02:04     00:45                01:07
4    Eyes Wide Shut                          02:39     00:12                00:29
5    Austin Powers in Goldmember             01:34     00:04                01:01
6    The Sisterhood of the Traveling Pants   01:59     00:16                00:47
7    Titanic                                 03:18     01:01                01:42
     Total                                   16:03     03:18                07:20

Figure 9: Lead actors identified for popular movies appear on the right.

The lead actors detected are shown in Figure 9.
The topmost star is highlighted in red and the next top star in blue. As can be seen in the figure, the topmost stars are detected in most of the movies. Since the definition of "top" is subjective, it could be said that in some cases the top stars are not detected. Further, in some cases the program identifies the same actor multiple times; this could be due to disguise or due to pose variation. The result is further refined for better accuracy, as mentioned earlier.

Player of the Match Identification

We have conducted experiments on 9 soccer match highlights taken from the BBC and listed in Table 2. Our method on average takes half the duration of the video. Note, however, that these timings are for sections that have already been manually edited by the BBC staff; if the method were run on a routine full soccer match, we expect the running time to be a lower percentage of the entire video.

Figure 10: Key players detected from BBC Match of the Day highlights for the 2007-08 Premier League.

The results of key player detection are presented in Figure 10. The key players of the match are identified for all the matches.

Host-Guest Detection

We conducted our experiment on the TV show Koffee with Karan. Two different episodes of the show were combined and fed as input. Our method identified the host in 4 minutes for a video of duration 1 hour 29 minutes, which is faster than the method proposed by Javed et al. [2001].

Table 2: Time taken for computing key players from BBC MOTD highlights for the 2007-08 Premier League.

No.  Soccer Match                          Duration  Computation
                                           (hh:mm)   (hh:mm)
1    Barnsley vs Chelsea                   00:02     00:01
2    Birmingham City vs Arsenal            00:12     00:04
3    Tottenham vs Arsenal                  00:21     00:07
4    Chelsea vs Arsenal                    00:14     00:05
5    Chelsea vs Middlesbrough              00:09     00:05
6    Liverpool vs Arsenal                  00:12     00:05
7    Liverpool vs Havant & Waterlooville   00:15     00:06
8    Liverpool vs Middlesbrough            00:09     00:05
9    Liverpool vs Newcastle United         00:18     00:04
     Total                                 01:52     00:43

Table 3: Time taken for identifying the host in the TV show Koffee With Karan. Two episodes are combined into a single video and given as input.

TV show             Duration  Computation
                    (hh:mm)   (hh:mm)
Koffee With Karan   01:29     00:04

The result of our method for the TV show Koffee with Karan is presented in Figure 11. Our method has successfully identified the host.

Figure 11: Host detection for the TV show "Koffee with Karan". Two episodes of the show are combined and given as input. The first person in the detected list (sorted by weight) gives the host.

AutoMontage: Photo Sessions Made Easy!

It is no longer only professional photographers who take pictures. Almost anyone has a good camera and often takes a lot of photographs. Group photographs as a photo session at reunions, conferences, weddings, and so on are de rigueur. It is difficult, however, for novice photographers to capture good expressions at the right time and realize a consolidated, acceptable picture. A video shoot of the same scene ensures that expressions are not missed. Sharing the video, however, may not be the best solution: besides the obvious bulk of the video, poor expressions ("false positives") are also willy-nilly captured and might prove embarrassing. A good compromise is to produce a mosaiced photograph assembling the good expressions and discarding the poor ones. This can be achieved by cumbersome manual editing; in this work, we provide an automated solution, illustrated in Figure 12.
The photo shown has been created from a random YouTube video excerpt, and it does not exist in any frame of the original video.

Figure 12: An automatic photo montage created by assembling "good" expressions from a video (randomly picked from YouTube). This photo did not exist in any of the input frames.

Related Work

Research on measuring the quality of facial expressions has appeared elsewhere, in applications such as detecting patients' expressions of pain Lucey et al. [2011], measuring children's facial expressions during problem solving Littlewort et al. [2011], and analyzing empathetic interactions in group meetings Kumano et al. [2011]. Our work focuses on generating an acceptable photo montage from a photo-session video and is oriented towards the targeted goal of discarding painful expressions and recognizing memorable ones. With regard to photo editing, researchers have come up with many interesting applications, such as organizing photos based on the person present in them Lee et al. [2012]; Li et al. [2003]; Li et al. [2006]; Das and Loui [2003], correcting an image with closed eyes to open eyes Liu and Ai [2008], and morphing photos Lim et al. [2010]. Many of these methods either require manual intervention or involve a different and enlarged problem scope, resulting in more complicated algorithms that are inapplicable to our problem. The closest work to ours is presented in Agarwala et al. [2004], where a user selects the desired part from each photo; these parts are then merged using graph cuts to create the final photo. Our work differs from Agarwala et al. [2004] in a few ways: faces are selected automatically by determining pleasing facial expressions (based on an offline machine learning strategy), video input is allowed, enabling a larger corpus of acceptable faces, and the complete process is automated, thereby making it easier for end users.

Technical Contributions

The technical contributions of this work include:

• A frame analyzer that detects camera panning motion to generate candidate frames
• An expression analyzer for detected faces
• A photo patcher that enables seamless placement of faces in group photos

Details of these steps appear below.

Methodology

In this section, we present a high-level overview of our method followed by the details. The steps involved in automontage creation are:

1. Face detection and expression measurement
2. Base photo selection
3. Montage creation

First, we detect faces in all the frames and track them based on their positions. Then we measure the facial expression of every detected face and select a manageable subset. The next step is to identify plausible frames that can be merged to create a mosaic; frames in a video that are "far away" in time, or unrelated frames, cannot be merged. In the last phase, the faces with the best expressions are selected and substituted into the mosaiced image using graph cuts and blending. Figure 13 illustrates these steps.

Figure 13: A schematic of our method (face detection, face position tracking, facial expression measurement, base photo selection, montage creation). In the first step, detected faces are tracked and grouped together. In the next step, frames are analyzed to detect a base photo frame, which can be either a mosaic of frames or a single frame with the maximum number of good-expression faces from the video. In the last step, faces with good expressions are patched onto the base photo to form the required photo montage.
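The following sketch illustrates the overall idea of the last two steps. It is not the thesis implementation: the offline-trained expression model is replaced by OpenCV's stock Haar smile detector, and the graph cut and blending step is replaced by OpenCV's seamlessClone; file names and parameters are assumptions.

```python
# Illustrative sketch only: a stock Haar smile detector stands in for the
# learned expression analyzer, and cv2.seamlessClone stands in for the
# graph cut + blending step described in the text.
import cv2
import numpy as np

face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
smile_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_smile.xml")

def expression_score(gray_face):
    """Crude stand-in for the expression analyzer: count smile detections."""
    smiles = smile_cascade.detectMultiScale(gray_face, scaleFactor=1.7, minNeighbors=20)
    return len(smiles)

def automontage(video_path, base_frame_idx=0):
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()

    base = frames[base_frame_idx].copy()
    gray_base = cv2.cvtColor(base, cv2.COLOR_BGR2GRAY)
    # For each face position in the base photo, search all frames for the
    # best-scoring face at roughly the same position and patch it in.
    for (x, y, w, h) in face_cascade.detectMultiScale(gray_base, 1.1, 5):
        best, best_score = None, -1
        for frame in frames:
            roi = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
            score = expression_score(roi)
            if score > best_score:
                best, best_score = frame[y:y + h, x:x + w], score
        mask = np.full(best.shape, 255, dtype=np.uint8)
        center = (x + w // 2, y + h // 2)
        base = cv2.seamlessClone(best, base, mask, center, cv2.NORMAL_CLONE)
    return base

# cv2.imwrite("automontage.jpg", automontage("photo_session.mp4"))
```

A real implementation would also track face positions across frames (as the schematic above indicates) rather than assuming faces stay at fixed coordinates.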
Experiments

To compare our method, we ran our experiments against the stack of images provided in Agarwala et al. [2004]. As can be seen, the output photo has good expressions for most of the people; the result is presented in Figure 14. Note that, unlike the original method, we generate the photo montage automatically, without user input. Our system has also been tested on other group photo sessions collected from YouTube, and our algorithm successfully created photo montages from all these videos. Examples are presented in Figures 14, 15 and 16. In Figure 14, though most of the faces look good, there is an artifact on the third person from the top-left corner. This is introduced by the patching scheme when the best expressions of multiple faces are substituted; this scenario demands more accurate face detection to avoid such artifacts.

Figure 14: AutoMontage created from the family image stack of Agarwala et al. [2004]. We are able to generate the photo montage automatically, as opposed to the method in Agarwala et al. [2004].

Figure 15: Photo montage example with acceptable expressions. The YouTube video is available at http://www.youtube.com/watch?v=KQ73-P9HGiE.

Figure 16: Photo montage example with acceptable expressions, from a wedding reception video.

Cricket Delivery Detection

Cricket is one of the most popular sports in the world after soccer. Played globally in more than a dozen countries, it is followed by over a billion people. One would therefore expect, following other trends in global sports, that there would be meaningful analysis and mining of cricket videos. There has been some interesting work in this area (for example, K. et al. [2006]; H. et al. [2008]; Pal and Chandran [2010]). However, by and large, the amount of computer vision research does not seem commensurate with the interest and revenue in the game. Possible reasons are the complex nature of the game and the variety of views one can see, compared to games such as tennis and soccer. Further, the long duration of the game might inhibit the use of inefficient algorithms.

Segmenting a video into its meaningful units is very useful for structure analysis of the video, and in applications such as content-based retrieval. One meaningful unit corresponds to the semantic notion of "deliveries" or "balls" (virtually all cricket games are made up of 6-ball overs). Consider:

Figure 17: Typical views in the game of cricket: (a) Pitch View, (b) Ground View, (c) Non-Field View. Note that the ground view may include non-trivial portions of the pitch.

• A bowling coach might be interested in analyzing only the balls faced by, say, Steve Smith in an entire match. If we can temporally segment a cricket match into balls, it becomes possible to watch only these portions.
• A games administrator might want to figure out how many minutes are consumed by slow bowlers as compared to fast bowlers. Currently such information is available only by manually segmenting a long video (more than 7 hours).

Prior Work

Indeed, the problem of segmenting a cricket video into meaningful scenes is addressed in K. et al. [2006]. Specifically, the method uses the manual commentaries available for a cricket video to segment it into its constituent balls. Once segmented, the video is annotated with the text for higher-level content access. A hierarchical framework and algorithms for cricket event detection and classification are proposed in H. et al. [2008].
The authors use a hierarchy of classifiers to detect the various views present in the game of cricket. The views they are concerned with include real-time, replay, field view, non-field view, pitch view, long view, boundary view, close-up and crowd. Despite the interesting methods in these works, useful in their own right, there are some challenges. The authors in H. et al. [2008] have only worked on view classification without addressing the problem of video segmentation. Our work closely resembles the method in K. et al. [2006] and is inspired by it from a functional point of view, but it differs dramatically in the complexity of the solution and in the basis for the temporal segmentation; specifically, as the title in K. et al. [2006] indicates, that work is text-driven.

In Pal and Chandran [2010], Binod addresses the problem of segmenting cricket videos into the meaningful structural unit termed the ball. Intuitively, a ball starts with the frame in which "a bowler is running towards the pitch to release the ball." The end of the ball is not necessarily the start of the next ball; the ball is said to end when the batsman has dealt with it. A ball corresponds to a variety of views: the end frame might be a close-up of a player or players (e.g., a celebration), the audience, a replay, or even an advertisement. Because of this varied nature of views, he used both domain knowledge of the game (views and their sequence) and TV production rules (the location of the scoreboard, and its presence or absence). In his approach, the input video frames are first classified into two semantic concepts, play and break; the break segment includes replays, graphics and advertisements. The views in the game of cricket include close-ups of the players, and a variety of views defined as follows (examples are presented in Figure 19):

• Pitch View (P): bowler run-up, ball bowled, ball played.
• Ground View (G): ball tracked by the camera, ball fielded and returned.
• Non-Field View (N): any view that is not P or G. This includes views such as close-ups of the batsman, bowler or umpire, and the audience.

Further, Binod defines Field Views (F) as the union of P and G. Using this vocabulary, he creates an alphabetical sequence (see the block diagram of the system in Figure 18; representative views are shown in Figure 17). Note that this particular classification is not intended to be definitive or universal, and may even appear subjective. The sequence is created as follows (a sketch of the final rule-based step appears below):

• Segment the video frames into play and break segments.
• Classify play-segment frames into F or N.
• Classify F frames into P or G.
• Smooth the resulting sequence.
• Extract balls using a rule-based system.

Figure 18: Block diagram of the system (scoreboard position detection, play and break detection, field/non-field view classification based on the dominant grass color ratio, and a pitch/ground view classifier trained on color, edge and shape features).

In summary, his contribution lies in modeling the ball as a sequence of relevant views and coming up with appropriate techniques to obtain the vocabulary in the sequence. However, his method fails when views are misclassified due to a pop-up advertisement, or when there are too few frames in a particular view. In our work, we focus on improving accuracy in such cases; further, we identify the key players of each ball to help in indexing the match.
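As referenced in the list above, the rule-based extraction can be pictured, in its simplest form, as pattern matching over the per-frame view labels: a candidate ball starts with a sustained pitch-view run (the bowler's run-up) and extends through the following field views. The sketch below is a minimal illustration under assumed labels and thresholds; the actual rules in Pal and Chandran [2010] are richer.

```python
# Minimal sketch of rule-based ball extraction, assuming each frame has already
# been labelled P (pitch), G (ground), N (non-field) or B (break). The pattern
# and the minimum run length are illustrative assumptions, not the published rules.
import re

def extract_balls(view_sequence, min_pitch_run=25, fps=25):
    """Return (start_sec, end_sec) spans that look like deliveries."""
    pattern = re.compile(r"P{%d,}[PG]*" % min_pitch_run)
    balls = []
    for m in pattern.finditer(view_sequence):
        balls.append((m.start() / fps, m.end() / fps))
    return balls

# Toy sequence: 40 non-field frames, a 30-frame run-up, 20 ground frames, close-ups.
toy = "N" * 40 + "P" * 30 + "G" * 20 + "N" * 50 + "B" * 10
print(extract_balls(toy))   # one detected delivery, roughly frames 40-90
```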
Problem Statement

The cricket delivery detection method proposed by Pal and Chandran [2010] fails when views are misclassified. We propose improvements to handle these cases:

1. Eliminate pop-up advertisements.
2. Enhance the pitch view detection logic to eliminate false positives.
3. Enhance the smoothing logic so that it does not smooth out required views.
4. Associate each delivery with the prominent player.

Figure 19: Illustration of the various cases in which Binod Pal's method fails.

Methodology

In our approach we build on Binod Pal's work and address the issues listed in the problem statement.

Pop-up Advertisement Detection and Removal

Whenever the view classifier detects an advertisement, we further examine the frame to check whether it is a pop-up ad, and eliminate it. Pop-up advertisements are relatively static compared to the game play, where there is more activity; we use this to detect and eliminate them. From the training set we extract, using the motion vectors, the static component image that has negligible motion. From this we detect the prominent regions at the left and bottom of the image and then smooth the region boundaries to obtain the ad boundary. When this method fails to detect the advertisement boundary, we use the scoreboard to determine it: we look for the scoreboard boundary in the first quadrant of the image, and once the scoreboard position is detected, the advertisement boundaries are derived from its displacement.

Green Dominance Detection for Field View Classification

In the prior work Pal and Chandran [2010], there were false positives in pitch detection when a player's clothing is green. To reduce such false positives, we divide the frame into 9 blocks and feed a 128-bin hue histogram of each block as additional features to the view classifier. As the green ground area surrounds the pitch and the pitch is at the center, this helps reduce the false positives.

Smoothing of the View Sequence Using a Moving Window

In the prior work, smoothing was done by dividing the view sequence into windows of 5 frames and assigning the most frequently occurring view to all the frames in that window. As this can remove views that fall across window boundaries, we use a running window instead: each frame is taken as the center of a window, and the most frequent view within that window is assigned to the frame. This approach does not eliminate views that originally fell across window boundaries. (A short sketch of this filter appears at the end of this section.)

Player Detection

Once deliveries are detected, further filtering of the video is possible. For example, we consider retrieving deliveries based on a particular player, such as the Australian cricket captain Steve Smith. We can use a standard frontal face detector to detect the faces of bowlers. The faces of batsmen, on the other hand, are confounded by the usual use of helmets, so we look for additional information in the form of connected regions (in the skin-color range) and square regions with many edges. For each delivery, face detection is performed on 100 frames before and after the actual delivery; this buffer of 200 frames is assessed to recognize the faces of the key players associated with the delivery. The usual methods of dimensionality reduction and clustering are used to prune the set. With this background, when an input image of a particular player is given, the image is mapped to the internal low-dimensional representation, the closest cluster is found using a kd-tree, and the corresponding deliveries in which that player appears are displayed.
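Returning to the running-window smoothing referenced above, it can be written as a sliding mode filter over the per-frame view labels. The window size of 5 follows the prior work; the tie-breaking behaviour and function name are our own illustrative choices.

```python
# Sketch of the running-window smoothing: each frame takes the most frequent
# view label in a window centred on it (window of 5, as in the prior work).
from collections import Counter

def smooth_views(labels, window=5):
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        lo, hi = max(0, i - half), min(len(labels), i + half + 1)
        smoothed.append(Counter(labels[lo:hi]).most_common(1)[0][0])
    return "".join(smoothed)

# A short pitch-view burst that straddles a fixed 5-frame block boundary
# survives the running window, whereas block-wise smoothing could erase it.
print(smooth_views("NNNNPPPNNN"))   # -> "NNNNPPPNNN"
```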
Experiments and Results

We have conducted our experiments on one-day international matches, test matches and T20 matches, summing up to 50 hours of video. We have compared our pitch/ground view classification against SIFT features; the features proposed by us produce promising results, as presented in Table 4.

Figure 20: Retrieval of deliveries involving only certain players (input player on the left, retrieved deliveries on the right).

Table 4: Comparison of methods for pitch/ground view classification across different matches.

Method         Precision  Recall
Our Method     0.982      0.973
SIFT features  0.675      0.647

Our method successfully detected deliveries with an overall precision of 0.935 and recall of 0.957. The result is presented in Table 5.

Table 5: Results of delivery detection for different types of matches.

Match Type   Precision  Recall
ODI          0.932      0.973
Test Match   0.893      0.931
T20          0.982      0.967

The filtered retrieval of deliveries is presented in Figure 20.

Hierarchical Summarization for Easy Video Applications

Say you want to find an action scene of your favorite hero. Or you want to watch a romantic movie with a happy ending. Or say you saw an interesting trailer and want to watch related movies. Finding these videos in the ocean of available videos has become noticeably difficult, and requires a trustworthy friend or editor. Can we quickly design computer "apps" that act like this friend? This work presents an abstraction for making this happen. We present a model that summarizes the video in terms of dictionaries at different hierarchical levels: pixel, frame, and shot. This makes it easier to create applications that summarize videos and address complex queries like the ones listed above. The abstraction leads to a toolkit that we use to create the several different applications demonstrated in Figure 21. In the "Suggest a Video" application, three Matrix sequels from a set of action movies were given as input, and the movie Terminator was found to be the closest suggested match. In the "Story Detection" application, the movie A Walk to Remember is segmented into three parts; user experience suggests that these parts correspond to prominent changes in the twist and plot of the movie. In the "Audio Video Mix" application, given a song with a video as a backdrop, the application finds another video with a similar song and video; it can thus be used to generate a "remix" for the input song. This application illustrates the ability of the data representation to find a video that closely matches both content and tempo.

Related Work

Methods like hierarchical hidden Markov models Xie et al. [2003]; Xu et al. [2005] and latent semantic analysis Niebles et al. [2008]; Wong et al. [2007] have been used to build models over basic features to learn semantics. In the method proposed by Xie et al. [2003], the structure of the video is learned in an unsupervised way using a two-level hidden Markov model: the higher level corresponds to semantic events, while the lower level corresponds to variations within the same event. Xu et al. [2005] proposed multi-level hidden Markov models using detectors and connectors to learn videos. Spatio-temporal words Laptev [2005]; Niebles et al. [2008]; Wong and Cipolla [2007]; Wong et al. [2007] are interest points identified in space and time. Laptev [2005] uses a spatio-temporal Laplacian operator over spatial and temporal scales to detect events. Niebles et al.
[2008] use probabilistic latent semantic analysis (pLSA) on spatio-temporal words to capture semantics. Wong and Cipolla [2007] detect spatio-temporal interest points using global information and then use pLSA Wong et al. [2007] to learn relationships between the semantics, revealing structural relationships. Soft quantization Gemert et al. [2008], which accounts for the distance from a number of codewords, has been considered for classifying scenes, and Fisher Vectors have been used in classification Sun and Nevatia [2013]; Jegou et al. [2010], showing significant improvement over bag-of-words methods. A detailed survey of various video summarization models can be found in Chang et al. [2007].

Figure 21: Retrieval applications designed using our Hierarchical Dictionary Summarization method (Shot Type Segmentation, Storyline Generation, Logical Turning Point Detection, Remix Candidate Identification, Movie Classification, Suggest a Video). "Suggest a Video" suggests Terminator for the Matrix sequels. "Logical Turning Point Detection" segments a movie at logical turning points in the plot of the movie A Walk to Remember. "Storyline Generation" generates the storyline of the movie based on the logical turning points. "Remix Candidate Identification" generates a "remix" song from a given set of music videos. "Shot Type Segmentation" segments a TV show into close-up views and chatting views.

Similar to Sun and Nevatia [2013]; Jegou et al. [2010], we use local descriptors and form visual dictionaries. Unlike them, however, we preserve more information instead of extracting only specific information. In addition to building a dictionary at the pixel level, we extend this dictionary to the frame and shot levels, forming a hierarchical dictionary. Having similarity information available at various granularities is the key to creating applications that need features at the desired level.

Technical Contributions

In this work, we propose a Hierarchical Dictionary Model (termed H-Video) to make the task of creating applications easier. Our method learns semantic dictionaries at three different levels: pixel patches, frames, and shots. The video is represented in the form of learned dictionary units that reveal semantic similarity and video structure. The main intention of this model is to provide these semantic dictionaries so that the comparison of video units at different levels, within the same video and across different videos, becomes easier. The benefits of H-Video include the following: (i) the model advocates run-time leveraging of prior offline processing, so applications run fast; (ii) the model is built in an unsupervised fashion, and as no application-specific assumption is made, many retrieval applications can use this model and its features, potentially saving an enormous amount of computation time otherwise spent in learning; (iii) the model represents the learned information using a hierarchical dictionary, which allows the video to be represented as indexes into dictionary elements, making life easier for the developer of a new retrieval application since similarity information is available as a one-dimensional array; in other words, our model does not demand a deep video-understanding background from application developers; (iv) we have illustrated our model through several applications, shown in Figure 21.

Methodology

We first give an overview of the process, which is illustrated in Figure 22.
Figure 22: Illustration of the H-Video abstraction (dictionary formation on the left, dictionary representation on the right: feature extraction and clustering produce the H1, H2 and H3 dictionaries, which in turn yield the H1 representation of pixels in a frame, the H2 representation of frames in a shot, and the H3 representation of shots in a video).

Our model first extracts local descriptors, such as color and edge, from pixel patches (color and edge descriptors are simply examples). We then build a dictionary, termed the H1 dictionary, from these features. At this point, the video could in principle be represented in terms of this H1 dictionary; we refer to each frame of this representation as an H1 frame. We then extract slightly higher-level features, such as the histogram of H1 dictionary units, the number of regions in these H1 frames, and so on, and form a new H2 dictionary. The H2 dictionary is yet another representation of the video and captures the type of objects and their distribution in the scene; in this way, it captures the nature of the scene. The video can also be represented using this H2 dictionary; we refer to each shot in this representation as an H2 shot. Further, we extract features based on the histogram and position of H2 dictionary units and build yet another structure, the H3 dictionary. This dictionary represents the types of shots occurring in the video, and the video represented in terms of it forms the H3 video.

Applications

In this section, we demonstrate the efficiency and applicability of our model through four retrieval applications: video classification, suggest a video, detecting logical turning points, and searching for audio-video mix candidates. The applications designed using our model are fast, as they use pre-computed information; typically these applications take a minute to compute the required information. This illustrates the efficiency and applicability of the model for content-based video retrieval.

Suggest a Video

Websites like imdb.com present alternative, related movies when the user is exploring a specific movie. This is a good source for users to find new or forgotten movies they would be interested in. Currently such suggestions are mostly based on text annotations and the user's browsing history, which the user may want to turn off for privacy reasons. If visual features of the video can also be used, better suggestions can evolve. This scenario is illustrated in Figure 23, where the movies the user was interested in are highlighted in red and a suggested movie is highlighted in green; the green one is presumably closest to the user's interest.

Figure 23: Given a set of movies of interest (in red), a movie that is similar to them is suggested (in green).

In this application, we aim to automatically suggest videos based on the similarity of the various dictionary units. There are many ways to compare videos, such as comparing the feature vectors of the dictionaries, computing the amount of overlapping dictionary units, computing correlation, and so on. In this case, we use the sequence of H2 dictionary units, substitute each unit with the corresponding dictionary feature vector, and take the cross-correlation of the resulting H2 representations to compute the similarity between videos. A video whose matching value is above a threshold is suggested to the user.
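A minimal sketch of this comparison is given below, assuming each video has already been reduced to a sequence of H2 feature vectors (one per shot). The normalisation, padding and threshold are our own illustrative choices rather than the settings used in the thesis.

```python
# Sketch of the "Suggest a Video" comparison over H2 representations.
import numpy as np

def h2_similarity(seq_a, seq_b):
    """Peak normalised cross-correlation between two H2 feature sequences."""
    a = np.asarray(seq_a, dtype=float).ravel()
    b = np.asarray(seq_b, dtype=float).ravel()
    a = (a - a.mean()) / (a.std() + 1e-9)
    b = (b - b.mean()) / (b.std() + 1e-9)
    corr = np.correlate(a, b, mode="full") / max(len(a), len(b))
    return corr.max()

def suggest(query_seqs, candidates, threshold=0.5):
    """Return candidate names whose peak correlation with any query exceeds the threshold."""
    scores = {name: max(h2_similarity(q, seq) for q in query_seqs)
              for name, seq in candidates.items()}
    return [name for name, s in scores.items() if s >= threshold]
```

With the Matrix trailers as query_seqs and a dictionary of other trailers as candidates, the highest-scoring entry would play the role of the Terminator suggestion shown in Figure 21.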
Video Classification

In this application, we identify the various genres a movie could belong to and suggest this classification. This reduces the effort involved in tedious annotation and helps users choose appropriate genres easily. In our approach, we train a random forest model for each genre using the H2 and H3 histograms of the sample set. Given a new movie and a genre, the model outputs whether the movie belongs to that genre or not.

Figure 24: Movies classified into the drama and action genres.

Logical Turning Point Detection

We define a logical turning point in a video as a point where the characteristics of the objects in the video change drastically. Detecting such places helps in summarizing the video effectively; an example is shown in Figure 25. As illustrated in the figure, the characteristics of the people occurring in each of the differing stories are different. We consider shots within a moving window of 5 shots and compute the semantic overlap of H1 and H2 units between the shots. When the amount of overlap between shots is low in a specific window, we mark that as a logical turning point. As logical turning points capture the important events in a video, they can be used in applications such as advertisement placement and preparing the story line.

Figure 25: Story line of the movie A Walk to Remember split into story units. Characters are introduced in the first story unit; in the second story unit, the girl and the boy are participating in a drama; in the last part, they fall in love and get married. In spite of the very different lengths of the stories, our method successfully identifies these story units.

Potential Candidate Identifier for Audio Video Mix

A remix is the process of generating a new video from existing videos by changing either the audio or the video. Remixes are typically made of video songs for variety in entertainment. The remix candidates need to have a similar phase in the change of scenes to generate a pleasing output. We use the cross-correlation of H2 units to identify whether two videos have the same phase of change. Once the closest candidate is identified, we replace the remix candidate's audio with the original video's audio.

Experiments

In this section, we first evaluate the individual performance of the constituents of the H-Video model itself. Next, in creating dictionaries, we compare the use of popular features such as SIFT. Finally, the performance of applications that may use more than one of the three levels, H1, H2, or H3, is evaluated.

Data: We have collected around 100 movie trailers from YouTube, twenty-five full-length movies, and a dozen music video clips.

Computational Time: Our implementation is in Matlab. Typically, a full-length movie takes around two hours for feature extraction. Building the local dictionaries for a movie takes around 10 hours. Building the global dictionary, which is extracted from the local dictionaries of multiple videos, takes around 6 hours. Note that building the local dictionaries and the global dictionary (left-hand side of Fig. 22) are one-time jobs; once they are built, the dictionaries are used directly to create the right-hand side of Fig. 22. In other words, the relevant model building typically takes two hours, which is no different from the average runtime of the video itself.
Once the model is constructed, each access operation typically takes only around 10 seconds per video.

Individual Evaluation of H1 and H2

To illustrate the effectiveness of the H1 dictionary, we considered the classification problem and collected videos from the categories "car", "cricket" and "news anchor", 6 videos from each category, for a total of 18 videos. We computed the H1 dictionary for each of these videos, formed a global H1 dictionary for the dataset, and represented all videos in terms of this global dictionary. For testing purposes, we randomly selected two videos from each category as training data and used the remainder as the test set, classifying each test video into one of the three categories. The precision and recall of classification using only H1 are provided in Table 6.

Table 6: Classification using only H1. With limited information, we are able to glean classification hints.

Category     Precision  Recall
Car          1.00       0.75
Cricket      0.67       1.00
News Anchor  1.00       0.75

To evaluate the effectiveness of the H2 dictionary units alone, we built our model on a TV interview show, which had scenes of individuals as well as groups of people. When we built H2 dictionaries with the allowed error set to "top principal component / 1000", we obtained six categories capturing different scenes, people and their positions. As we relaxed the allowed error, we obtained two categories, distinguishing individual scenes from groups of people. This result is presented in Fig. 26. Applications can therefore tune the allowed-error parameter to suit their requirements.

Figure 26: Classification using only H2: (a) H2 dictionary units with smaller allowed error, (b) H2 dictionary units with larger allowed error. With only limited information and a smaller allowed error, fine details were captured; with a larger allowed error, broad categories were captured.

Evaluation of Alternate Features

In this section we evaluate popular features like SURF and SIFT, and contrast them with the color and edge features used in this work. Given any feature set, one may do a "direct comparison" (which takes longer) or our proposed H-Video model-based comparison (which takes far less time). This experiment is performed on the "Suggest a Video" problem using only movie trailers as the database; the result is presented in Table 7. When the H-Video model is used, we use H2 as the basis of comparison. We observe that the use of the hierarchical model improved the accuracy for SIFT and for the color and edge features; the accuracy was almost the same when using SURF features. In producing these statistics, we used the information available on imdb.com as the ground truth. One problem with imdb.com is that the truth is limited to the top twelve related movies; we therefore added transpose and transitive relationships as well. (In transpose relationships, if a movie A is deemed to be related to B, we mark B as related to A. In transitivity, if a movie A is related to B, and B is related to C, then we mark A as related to both B and C. The transitive closure is maintained via recursion.)

Table 7: Video suggestion using popular features.

Method       Direct Comparison  H-Model  Percentage Improvement
SURF         54%                54%      0%
SIFT         29%                53%      83%
Color, Edge  48%                59%      23%

Evaluation of Video Classification

We have considered three genres for video classification. We took the category annotations of 100 movie trailers and, for each category, used 30% of the data for training and the remainder as the test set. We built the H-Video model for these videos, extracted the H2 and H3 representations, and classified them using the random forest model.
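The per-genre classification described above can be sketched as a set of one-vs-rest random forests over the H2 and H3 histograms. The sketch below uses scikit-learn as a stand-in for the thesis' Matlab implementation; the histogram length and forest parameters are illustrative assumptions.

```python
# Sketch of per-genre classification over "H2 + H3 histogram" feature vectors.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_genre_models(features, genre_labels, genres, n_trees=100):
    """Train one binary random forest per genre (one-vs-rest)."""
    models = {}
    for genre in genres:
        y = np.array([genre in labels for labels in genre_labels])
        clf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
        clf.fit(features, y)
        models[genre] = clf
    return models

def predict_genres(models, feature_vector):
    """Return every genre whose model accepts the trailer."""
    x = feature_vector.reshape(1, -1)
    return [g for g, clf in models.items() if clf.predict(x)[0]]

# Toy usage with random stand-in histograms of length 64:
rng = np.random.default_rng(0)
X = rng.random((30, 64))
labels = [set(rng.choice(["Drama", "Action", "Romance"],
                         size=rng.integers(1, 3), replace=False)) for _ in range(30)]
models = train_genre_models(X, labels, ["Drama", "Action", "Romance"])
print(predict_genres(models, rng.random(64)))
```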
Example output is shown in Fig. 27.

Figure 27: Sample result of classifying movie trailers into the categories Drama, Action and Romance. In most cases, our model classifies the trailers correctly.

Evaluation of Logical Turning Point Detection

Typically, drama movies have three logical parts: first the characters are introduced, then they get together, and then a punchline is presented towards the end. Considering this as ground truth, we validated the detected logical turning points. The logical turning points were detected with a precision of 1.0 and a recall of 0.75.

Evaluation of Potential Remix Candidates

We have conducted experiments on 20 song clips, where the aim is to find the best remix candidates. Our algorithm found the two pairs of songs that are the best candidates for a remix in the given set. Sample frames from the matched videos are presented in Fig. 28.

Figure 28: Sample frames from the identified remix candidates, (a) Set 1 and (b) Set 2. In each set, the top row corresponds to an original video song and the second row to the remix candidate. The content and sequencing of the first and second rows match, suggesting the effectiveness of our method.

Conclusion and Future Work

Videos revolutionize many of the ways we receive and use information every day. The availability of online resources has changed many things, and the usage of videos has transformed dramatically, demanding better ways of retrieving them. Many diverse and exciting initiatives demonstrate how visual content can be learned from video. Yet, at the same time, reaching videos online using visual content has arguably been less radical, especially in comparison to text search. There are many complex reasons for this slow pace of change, including the lack of proper tags and the huge computation time required to learn specific topics. In this thesis, we have presented ways to optimize computation time, which is the most important obstacle to retrieving videos. Next, we have focused on producing visual summaries, so that users can quickly tell whether a video is of interest to them. We have also presented ways to slice and dice videos so that the user can quickly reach the segment they are interested in.

1. Lead Star Detection: Detecting lead stars has numerous applications, such as identifying the player of the match, detecting the lead actor and actress in motion pictures, and guest-host identification. Computation time has always been a bottleneck for using this technique. In our work, we have presented a faster method with comparable accuracy, which makes the algorithm usable in practice.

2. AutoMontage: Photo Sessions Made Easy!: With the increased usage of cameras by novices, tools that ease photography sessions are becoming increasingly valuable. Our work successfully creates a photo montage from photo-session videos, combining the best expressions into a single photo.

3. Cricket Delivery Detection: Inspired by prior work Pal and Chandran [2010], we approached the problem of temporal segmentation of cricket videos. As the view classification accuracy is highly important, we proposed a few corrective measures to improve it, introducing a pop-up ad eliminator and finer features for view classification. Further, we associated each delivery with its key players, providing a browse-by-player feature.
4. Hierarchical Model: In traditional video retrieval systems, relevant features are extracted from the video and applications are built using the extracted features. For multimedia database retrieval systems, there is typically a plethora of applications required to satisfy user needs. A unified model that uses fundamental notions of similarity is therefore valuable in reducing the computation time for building applications. In this work, we have proposed a novel model called H-Video, which provides the semantic information needed by retrieval applications. In summary, both creation (programmer time) and runtime (computer time) of the resulting applications are reduced: first, our model provides the semantic information of a video in a simple way, making it easy for programmers; second, due to the suggested pre-processing of long video data, runtime is reduced. We have built four applications as examples to demonstrate our model.

Future Work

The future scope of this work is to enhance the hierarchical model to learn concepts automatically from a set of positive and negative examples using machine learning techniques such as random forests or random trees. This would serve as a great aid in learning a particular tag and applying it to all other videos that share the same concept. The model can also be extended to matching parts of videos instead of whole videos. The hierarchical model can further be used to analyze video types and to integrate with other applications. For example, video types pertaining to an actor can be learnt to create a profile of that actor; the learnt video type, when associated with an appropriate tag, can serve complex queries related to the actor. The hierarchical model can also be used to identify photo-session videos, so that AutoMontage can be applied automatically to create a photo album.

The future scope of this work also includes developing more applications to find videos of the user's interest. Video sharing websites create automatic playlists based on the user's choices; in such lists, the videos seem to be based on the browsing patterns of users. However, this has the drawback of mixing different types of video, whereas the user may be interested in a specific type. Hence, an application that learns video types from the recommendations and provides filtering options based on various attributes will help the user narrow down their interest. A few attributes that could be used for such filtering are lead actors, the number of people in the video, video types from the hierarchical model, popular tags associated with videos, and recency.

In conclusion, we feel that generic models with quick retrieval times, combined with user-centric applications, have much unexplored potential. They will become the favored methods for reaching videos in the near future.

References

Agarwala, A., Dontcheva, M., Agrawala, M., Drucker, S., Colburn, A., Curless, B., Salesin, D., and Cohen, M. (2004). Interactive digital photomontage. ACM Transactions on Graphics, 23(3):294–302.

Chang, S.-F., Ma, W.-Y., and Smeulders, A. (2007). Recent advances and challenges of semantic image/video search. In Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, volume 4, pages IV–1205–IV–1208.

Das, M. and Loui, A. (2003). Automatic face-based image grouping for albuming. In IEEE International Conference on Systems, Man and Cybernetics, volume 4, pages 3726–3731.

Doulamis, A. and Doulamis, N. (2004). Optimal content-based video decomposition for interactive video navigation. Circuits and Systems for Video Technology, IEEE Transactions on, 14(6):757–775.
Future Work

One direction for future work is to enhance the hierarchical model so that it learns concepts automatically from sets of positive and negative examples, using machine learning techniques such as random forests or random trees. This would make it possible to learn a particular tag once and apply it to all other videos that share the same concept. The model can also be extended to match parts of videos instead of whole videos. The hierarchical model can further be used to analyze video types and to integrate with other applications. For example, the video types pertaining to an actor can be learnt to build a profile of that actor; when a learnt video type is associated with an appropriate tag, it can serve complex queries related to that actor. The hierarchical model can also be used to identify photo-session videos, so that AutoMontage can be applied automatically to create a photo album.

The future scope of this work also includes developing more applications that help users find videos of interest. Video sharing websites create automatic playlists based on the user's choices, and the videos in such lists appear to be driven by users' browsing patterns. This has the drawback of mixing different types of video, whereas the user may be interested in one specific type. An application that learns video types from the recommendations and provides filtering options over various attributes would therefore help users narrow down their interest. Attributes that could be used for such filtering include the lead actors, the number of people in the video, the video types from the hierarchical model, the popular tags associated with the videos, and recency; a minimal sketch of such a filter is given below.
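As a concrete illustration of the attribute filter described above, the following is a minimal sketch. It assumes each recommended video already carries the listed attributes as metadata; the `VideoMeta` record, its field names, and the query interface are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class VideoMeta:
    """Hypothetical per-video attributes gathered from the hierarchical model
    and from user-supplied tags."""
    title: str
    video_type: str                # e.g. "interview", "music", "sports"
    lead_actors: set = field(default_factory=set)
    num_people: int = 0
    tags: set = field(default_factory=set)
    upload_date: date = date.today()

def filter_recommendations(videos, video_type=None, actor=None,
                           max_people=None, required_tags=(), newer_than=None):
    """Keep only the recommended videos that satisfy every given constraint;
    constraints left as None or empty are ignored."""
    result = []
    for v in videos:
        if video_type and v.video_type != video_type:
            continue
        if actor and actor not in v.lead_actors:
            continue
        if max_people is not None and v.num_people > max_people:
            continue
        if not set(required_tags) <= v.tags:
            continue
        if newer_than and v.upload_date < newer_than:
            continue
        result.append(v)
    return result

# e.g. filter_recommendations(recs, video_type="interview", newer_than=date(2024, 1, 1))
```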
In conclusion, we feel that generic models with quick retrieval times, combined with user-centric applications, have much unexplored potential. They will become the favored methods for reaching videos in the near future.

References

Agarwala, A., Dontcheva, M., Agrawala, M., Drucker, S., Colburn, A., Curless, B., Salesin, D., and Cohen, M. (2004). Interactive digital photomontage. ACM Transactions on Graphics, 23(3):294–302.

Chang, S.-F., Ma, W.-Y., and Smeulders, A. (2007). Recent advances and challenges of semantic image/video search. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume 4, pages IV-1205–IV-1208.

Das, M. and Loui, A. (2003). Automatic face-based image grouping for albuming. In IEEE International Conference on Systems, Man and Cybernetics, volume 4, pages 3726–3731.

Doulamis, A. and Doulamis, N. (2004). Optimal content-based video decomposition for interactive video navigation. IEEE Transactions on Circuits and Systems for Video Technology, 14(6):757–775.

Everingham, M. and Zisserman, A. (2004). Automated visual identification of characters in situation comedies. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR), volume 4, pages 983–986, Washington, DC, USA. IEEE Computer Society.

Fitzgibbon, A. W. and Zisserman, A. (2002). On affine invariant clustering and automatic cast listing in movies. In Proceedings of the 7th European Conference on Computer Vision (ECCV), Part III, pages 304–320, London, UK. Springer-Verlag.

Foucher, S. and Gagnon, L. (2007). Automatic detection and clustering of actor faces based on spectral clustering techniques. In Proceedings of the Fourth Canadian Conference on Computer and Robot Vision, pages 113–122.

Gemert, J. C., Geusebroek, J.-M., Veenman, C. J., and Smeulders, A. W. (2008). Kernel codebooks for scene categorization. In Proceedings of the 10th European Conference on Computer Vision (ECCV), Part III, pages 696–709, Berlin, Heidelberg. Springer-Verlag.

H., K. M., K., P., and S., S. (2008). Semantic event detection and classification in cricket video sequence. In Indian Conference on Computer Vision, Graphics and Image Processing, pages 382–389.

Javed, O., Rasheed, Z., and Shah, M. (2001). A framework for segmentation of talk and game shows. In Proceedings of the Eighth IEEE International Conference on Computer Vision (ICCV), volume 2, pages 532–537.

Jegou, H., Douze, M., Schmid, C., and Perez, P. (2010). Aggregating local descriptors into a compact image representation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3304–3311.

K., P. S., Saurabh, P., and V., J. C. (2006). Text driven temporal segmentation of cricket videos. In Indian Conference on Computer Vision, Graphics and Image Processing, pages 433–444.

Kumano, S., Otsuka, K., Mikami, D., and Yamato, J. (2011). Analyzing empathetic interactions based on the probabilistic modeling of the co-occurrence patterns of facial expressions in group meetings. In IEEE International Conference on Automatic Face and Gesture Recognition and Workshops, pages 43–50.

Laptev, I. (2005). On space-time interest points. International Journal of Computer Vision, 64(2-3):107–123.

Lee, S.-H., Han, J.-W., Kwon, O.-J., Kim, T.-H., and Ko, S.-J. (2012). Novel face recognition method using trend vector for a multimedia album. In IEEE International Conference on Consumer Electronics, pages 490–491.

Li, C.-H., Chiu, C.-Y., Huang, C.-R., Chen, C.-S., and Chien, L.-F. (2006). Image content clustering and summarization for photo collections. In IEEE International Conference on Multimedia and Expo, pages 1033–1036.

Li, J., Lim, J. H., and Tian, Q. (2003). Automatic summarization for personal digital photos. In Joint Conference of the Fourth International Conference on Information, Communications and Signal Processing and the Fourth Pacific Rim Conference on Multimedia, volume 3, pages 1536–1540.

Lim, S. H., Lin, Q., and Petruszka, A. (2010). Automatic creation of face composite images for consumer applications. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1642–1645.

Littlewort, G., Bartlett, M., Salamanca, L., and Reilly, J. (2011). Automated measurement of children's facial expressions during problem solving tasks. In IEEE International Conference on Automatic Face and Gesture Recognition and Workshops, pages 30–35.

Liu, Z. and Ai, H. (2008). Automatic eye state recognition and closed-eye photo correction. In 19th International Conference on Pattern Recognition, pages 1–4.

Lucey, P., Cohn, J. F., Matthews, I., Lucey, S., Sridharan, S., Howlett, J., and Prkachin, K. M. (2011). Automatically detecting pain in video through facial action units. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 41(3):664–674.

Niebles, J. C., Wang, H., and Fei-Fei, L. (2008). Unsupervised learning of human action categories using spatial-temporal words. International Journal of Computer Vision, 79(3):299–318.

Pal, B. and Chandran, S. (2010). Sequence based temporal segmentation of cricket videos.

Sun, C. and Nevatia, R. (2013). Large-scale web video event classification by use of Fisher vectors. In IEEE Workshop on Applications of Computer Vision (WACV), pages 15–22.

Takahashi, Y., Nitta, N., and Babaguchi, N. (2005). Video summarization for large sports video archives. In IEEE International Conference on Multimedia and Expo (ICME), pages 1170–1173.

Waibel, A., Bett, M., and Finke, M. (1998). Meeting browser: Tracking and summarizing meetings. In Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, pages 281–286.

Wong, S.-F. and Cipolla, R. (2007). Extracting spatiotemporal interest points using global information. In IEEE 11th International Conference on Computer Vision (ICCV), pages 1–8.

Wong, S.-F., Kim, T.-K., and Cipolla, R. (2007). Learning motion categories using both semantic and structural information. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–6.

Xie, L., Chang, S.-F., Divakaran, A., and Sun, H. (2003). Unsupervised discovery of multilevel statistical video structures using hierarchical hidden Markov models. In IEEE International Conference on Multimedia and Expo (ICME), volume 3, pages III-29–III-32.

Xu, D., Li, X., Liu, Z., and Yuan, Y. (2004). Anchorperson extraction for picture in picture news video. Pattern Recognition Letters, 25(14):1587–1594.

Xu, G., Ma, Y.-F., Zhang, H.-J., and Yang, S.-Q. (2005). An HMM-based framework for video semantic analysis. IEEE Transactions on Circuits and Systems for Video Technology, 15(11):1422–1433.

Zhang, Z., Potamianos, G., Senior, A. W., and Huang, T. S. (2007). Joint face and head tracking inside multi-camera smart rooms. Signal, Image and Video Processing, 1:163–178.

Zhuang, Y., Rui, Y., Huang, T., and Mehrotra, S. (1998). Adaptive key frame extraction using unsupervised clustering. In Proceedings of the International Conference on Image Processing, volume 1, pages 866–870.