Synopsis - Department of Computer Science and Engineering
INDIAN INSTITUTE OF TECHNOLOGY BOMBAY
Department of Computer Science and Engineering
SYNOPSIS
of the Ph. D. thesis entitled
EXTENDING SOCIAL VIDEOS FOR PERSONAL CONSUMPTION
Proposed to be submitted in
partial fulfillment of the degree of
DOCTOR OF PHILOSOPHY
of the
INDIAN INSTITUTE OF TECHNOLOGY, BOMBAY
BY
NITHYA SUDHAKAR
Supervisor: Prof. Sharat Chandran
Department of Computer Science and Engineering
INDIAN INSTITUTE OF TECHNOLOGY BOMBAY
Motivation
A picture speaks a thousand words, and a video is thousands of pictures. Videos let people articulate their ideas creatively. The abundance of videos available via video sharing websites makes video freely accessible to everyone. Videos are great sources of information and entertainment, and they cover various areas like news, education, sports, movies, and events.
In the early era of the internet, when web search and video search were first introduced, the only way to make a web page or video searchable was to annotate it manually with all possible tags. As the volume of content grew, this method no longer worked. Researchers therefore devised ways of training on images to learn real-world concepts such as indoor, outdoor, swimming, car, and airplane, so that these annotations could be extracted automatically from video frames. Although this approach increased the number of tags and reduced human effort, searchability was restricted to the objects that had been trained, and training an exhaustive list was again a tedious process. To overcome this, the bag-of-words feature was introduced: analogous to text, it provides a visual description of the various objects in a scene, and an image is used as the query instead of traditional text. However, this leaves the user with the problem of finding a suitable image to represent their thought.
Over the years, with the tremendous growth of video sharing websites, people have uploaded a huge collection of videos with their own annotations. By combining bag-of-words features with the various tags available, tag annotation has reached a level at which videos are easily searchable. However, once the user gets closer to the video results, they have various expectations for which there is no sufficient annotation. The user therefore browses through results and recommendations until they find videos of interest. The size of videos poses an additional challenge, as the user has to download several videos and browse through them to check whether they satisfy the requirement. This scenario is illustrated in Figure 1.
Figure 1: A scenario where a user browses through various video results to reach a video of their interest.
Figure 2: A sample scenario where the user is provided with more filtering options and visual representations along with the video results. Please note that the suggestions and visual representations vary based on the input video and its genre. (The figure shows filter panels such as events, genres, duration, lead casts, customized videos, similar videos, and average customer reviews.)
This demands a whole realm of applications that help the user quickly get close to the video they want. In this thesis we propose various applications, such as lead star detection and story-line generation, which help users get a quick idea about a video without having to go through it. We also propose applications which enable the user to go through selected parts instead of the complete video. Example applications we have explored in this direction include cricket delivery detection and logical turning point detection. As the user gets closer to a specific genre, genre-specific applications make more sense than generic ones. One example that we have tried is AutoMontage, which generates photos by merging different expressions to form a pleasing output. Further, we noticed a great possibility for such genre-specific applications, which commonly need similarity information. We have proposed a technique called the Hierarchical Model, which enables a specific genre of similarity applications. We have demonstrated it through supporting applications like logical turning point detection, story-line generation, movie classification, and audio video remix candidate search.
Problem Statement
The objective of this thesis is to design applications and a technique that help users reach the specific video of their interest easily. A sample scenario in which the user is provided with various options is illustrated in Figure 2. As the possibilities for designing such applications are wide, we specifically establish the following problems:
1. The ability to summarize a video, so that the user can quickly visualize its content
2. The ability to browse through specific parts of a video instead of the whole video
3. Designing genre-specific video applications which are of great use in that genre
4. Designing a technique which can support similarity applications suiting various needs like auto suggestions and story-line generation
Our Contributions
In this thesis, we propose applications and a technique that achieve the following, so that users can quickly filter video results, skim through a video, or explore related ones.
Figure 3: Applications developed or addressed in our work, grouped under segmentation, summarization, and suggestion: lead star detection (lead actors identification, key players detection, guest host detection), AutoMontage, cricket delivery detection, shot type segmentation, logical turning point detection, storyline generation, movie classification, remix candidate identification, suggest a video, and the Hierarchical Model.
1. Summarize videos in terms of lead stars, allowing the user to narrow down based on the stars of their interest
2. Summarize the video in terms of its story-line, so that the user can get a quick overview of the video before viewing it
3. Segment the video, allowing users to browse through specific parts of the video instead of the whole video
4. Design genre-specific video applications like AutoMontage, cricket delivery detection, and remix candidate generation, which are of great use in those specific genres
5. Design a technique which can support a genre of similarity applications, including some of the applications listed above
We achieve these functionalities using various techniques, ranging from building appropriate dictionaries to specific view classification and merging expressions to create good photos. We have classified these video applications into three major categories: segmentation, summarization, and classification. We present the various applications we have developed in these categories and present a model which supports applications from all of them. A visual summary of the work is presented in Figure 3. In the following sections, we present an overview of these applications and the model.
Fast Lead Star Detection in Entertainment Videos
Suppose an avid cricket fan or coach wants to learn exactly how Flintoff repeatedly got Hughes “out.” Or a movie buff wants to watch an emotional scene involving his favorite heroine in a Hollywood movie. Clearly, in such scenarios, you want to skip frames that are “not interesting.” One possibility that has been well explored is extracting key frames in shots or scenes and then creating thumbnails. Another natural alternative – the emphasis in this work – is to determine frames around what we call lead stars. A lead star in an entertainment video is the actor who, most likely, appears in many significant frames. We define lead stars in other videos as well. For example, the lead star in a soccer match is the hero, or the player of the match, who has scored “important” goals. Intuitively, he is the one the audience has paid to come and see. Similarly, the lead star in a talk show is the guest who has been invited, or, for that matter, the hostess. This work presents how to detect lead stars in entertainment videos. Moreover, like various video summarization methods Zhuang et al. [1998]; Takahashi et al. [2005]; Doulamis and Doulamis [2004], lead stars are a natural way of summarizing a video. (Multiple lead stars are, of course, allowed.)
Figure 4: Lead star detection. This is exemplified in sports by the player of the match; in movies,
stars are identified; and in TV shows, the guest and host are located.
Related Work
Researchers have explored various video-specific applications of lead star detection – anchor detection in news video Xu et al. [2004], lead casts in comedy sitcoms Everingham and Zisserman [2004], summarizing meetings Waibel et al. [1998], guest host detection Takahashi et al. [2005]; Javed et al. [2001], and locating the lecturer in smart rooms by tracking the face and head Zhang et al. [2007]. Fitzgibbon and Zisserman [2002] use affine invariant clustering to detect cast listings from movies. As the original algorithm has quadratic runtime, the authors used a hierarchical strategy to improve the clustering speed that is central to their method. Foucher and Gagnon [2007] used spectral clustering techniques for clustering actor faces. Their method detects the actors' clusters in an unsupervised way, with a computation time of about 23 hours for a motion picture.
Our Strategy
Although the lead actor has been defined using a pictorial or semantic concept, an important observation is that the significant frames in an entertainment video are often accompanied by a change in the audio intensity level. It is true, no doubt, that not all frames containing the lead actors involve significant audio differences. Our interest is not at the frame level, however. The advent of important scenes and important people bears a strong correlation with the audio level. We surveyed around one hundred movies and found that it is rarely, if ever, the case that the lead star does not appear in audio-highlighted sections, although the nature of the audio may change from scene to scene. And, as alluded to above, once the lead star has entered the shot, the frames may well contain normal audio levels.
Figure 5: Our strategy for lead star detection. We detect lead stars by considering segments that
involve significant change in audio level. However, this by itself is not enough!
Our method is built upon this concept. We detect lead stars by considering such important scenes of the video. To reduce false positives and negatives, our method clusters the faces for each important scene separately and then combines the results. Unlike the method in Fitzgibbon and Zisserman [2002], our method provides a natural segmentation for clustering. Our method is shown to considerably reduce the computation time of the previously mentioned state-of-the-art Foucher and Gagnon [2007] for computing the lead star in a motion picture (by a factor of 50). We apply this method to sports videos to identify the player of the match, to motion pictures to find heroes and heroines, and to TV shows to detect guests and hosts.
Figure 6: Strategy further unfolded. We first detect scenes accompanied by a change in audio level. Next we look for faces in these important scenes and, to further confirm suitability, track faces in subsequent frames. Finally, a face dictionary representing the lead stars is formed by clustering the confirmed faces.
Methodology
As mentioned, the first step is to find important scenes which have audio highlights. Once such important scenes are identified, they are further examined for potential faces. Once a potential face is found in a frame, subsequent frames are further analyzed for false alarms using concepts from tracking. At this point, several areas are identified as faces. Such confirmed faces are grouped into clusters to identify the lead stars.
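As an illustration only, the following Python sketch shows how these stages could be wired together. The per-frame audio energy input, the cascade face detector, and the k-means face dictionary are assumptions (the thesis implementation is in Matlab), and the tracking-based confirmation step is omitted for brevity.

```python
import numpy as np
import cv2  # OpenCV, assumed available for face detection and clustering

def audio_highlights(energy, fps, win_s=2.0, k=2.5):
    """Return frame indices whose smoothed audio energy exceeds mean + k*std.
    `energy` is a 1-D array of per-frame audio energy (hypothetical input)."""
    win = max(1, int(win_s * fps))
    smoothed = np.convolve(energy, np.ones(win) / win, mode="same")
    return np.where(smoothed > smoothed.mean() + k * smoothed.std())[0]

def lead_star_dictionary(frames, energy, fps, n_stars=3):
    """Cluster faces found in audio-highlighted frames into a small face dictionary."""
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    crops = []
    for idx in audio_highlights(energy, fps):
        gray = cv2.cvtColor(frames[idx], cv2.COLOR_BGR2GRAY)
        for (x, y, w, h) in cascade.detectMultiScale(gray, scaleFactor=1.1,
                                                     minNeighbors=5):
            crops.append(cv2.resize(gray[y:y + h, x:x + w], (32, 32)).flatten())
    if not crops:
        return []
    data = np.float32(crops)
    # k-means over confirmed faces; cluster centres act as the face dictionary
    _, _, centers = cv2.kmeans(
        data, min(n_stars, len(data)), None,
        (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0),
        5, cv2.KMEANS_PP_CENTERS)
    return centers
```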
Applications
In this section, we demonstrate our technique using three applications – Lead Actor Detection in Motion Pictures, Player of the Match Identification, and Host Guest Detection. As applications go, Player of the Match Identification has not been well explored, considering the enormous interest. In the other two applications, our technique detects lead stars faster than the state-of-the-art techniques, which makes our method practical and easy to use.
Lead Actor Detection in Motion Pictures In motion pictures, detecting the hero, heroine, and villain has many interesting benefits. A person reviewing a movie can skip the scenes where the lead actors are not present. A profile of the lead actors can also be generated. Significant scenes containing many lead actors can be used for summarizing the video.
In our method, the face dictionary formed contains the lead actors. These face dictionaries are good enough in most cases. However, for more accurate results, the algorithm scans through a few frames of every shot to determine the faces occurring in the shot. The actors who appear in a large number of shots are identified as the lead actors. The lead actors detected for the movie Titanic after scanning through the entire movie are shown in Figure 7.
Figure 7: Lead actors detected for the movie Titanic.
Player of the Match Identification In sports, highlights and key frames Takahashi et al. [2005] are the two main methods used for summarization. We summarize sports using the player of the match, capturing the star players.
Detecting and tracking players over the complete sports video does not yield the player of the match. Star players may play for a shorter time and score more, as opposed to players who attempt many times and do not score. Analyzing the players when there is a score therefore leads to the identification of the star players. This is easily achieved by our technique, as detecting highlights yields exciting scenes such as scores.
The lead sports stars detected from a soccer match, Liverpool vs Havant & Waterlooville, are presented in Figure 8. The key players of the match are detected.
Figure 8: Lead stars detected from the highlights of a soccer match, Liverpool vs Havant & Waterlooville. The first image is erroneously detected as a face; the other results represent players and the coach.
Host Guest Detection In TV interviews and other TV programs, detecting the host and guest of the program is key information used in video retrieval. Javed et al. [2001] have proposed a method for the same which removes the commercials and then exploits the structure of the program to detect the guest and host. The algorithm relies on the inherent structure of the interview, namely that the host appears for a shorter duration than the guest. However, this is not always the case, especially when the hosts are equally popular, as in TV shows like Koffee With Karan. In the case of competition shows, the host is shown for a longer duration than the guests or judges.
Our algorithm detects hosts and guests as lead stars. To distinguish hosts from guests, we detect lead stars on multiple episodes and combine the results. Intuitively, the lead stars recurring over multiple episodes are hosts, and the other lead stars detected in specific episodes are guests.
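A minimal sketch of this episode-combination step, assuming each episode has already been reduced to a set of recognized lead-star identities (the labels below are hypothetical):

```python
def split_hosts_and_guests(episode_stars):
    """episode_stars: list of sets, one per episode, of detected lead-star identities.
    Stars appearing in every episode are treated as hosts; the rest as guests."""
    hosts = set.intersection(*episode_stars) if episode_stars else set()
    guests = {ep_idx: stars - hosts for ep_idx, stars in enumerate(episode_stars)}
    return hosts, guests

# Example: two episodes of a talk show
episodes = [{"host_k", "guest_a"}, {"host_k", "guest_b"}]
hosts, guests = split_hosts_and_guests(episodes)
print(hosts)   # {'host_k'}
print(guests)  # {0: {'guest_a'}, 1: {'guest_b'}}
```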
Experiments
We have implemented our system in Matlab. We tested our method on an Intel Core Duo processor, 1.6 GHz, with 2 GB RAM. We have conducted experiments on 7 popular motion pictures, 9 soccer match highlights, and two episodes of TV shows, summing to a total of 19 hours 23 minutes of video. Our method detected lead stars in all the videos in an average of 14 minutes per one-hour video. The method of Foucher and Gagnon [2007] in the literature computes the lead star of a motion picture in 23 hours, whereas we compute the lead star of a motion picture in an average of 30 minutes. We now provide more details.
Lead Actor Detection in Motion Pictures
We ran our experiments on 7 box-office hit movies, listed in Table 1. Together these sum to about 16 hours of video. The lead stars in all these movies were computed in 3 hours 18 minutes, so the average computation time for a movie is around 30 minutes. From Table 1, we see that the best computation time is 4 minutes, for the movie Austin Powers in Goldmember, which is 1 hour 42 minutes in duration. The worst computation time is 45 minutes, for the movie Matrix Revolutions of duration 2 hours 4 minutes. For movies like Eyes Wide Shut and Austin Powers in Goldmember, the computation is faster as there are fewer audio highlights, whereas movies with many audio highlights, like the Matrix sequels and Titanic, take more time. This causes the variation in computation time among movies.
Table 1: Time taken for computing lead actors for popular movies.

No.  Movie Name                              Duration   Detecting Lead Stars   Refinement
                                             (hh:mm)    (hh:mm)                (hh:mm)
1    The Matrix                              02:16      00:22                  00:49
2    The Matrix Reloaded                     02:13      00:38                  01:25
3    Matrix Revolutions                      02:04      00:45                  01:07
4    Eyes Wide Shut                          02:39      00:12                  00:29
5    Austin Powers in Goldmember             01:34      00:04                  01:01
6    The Sisterhood of the Traveling Pants   01:59      00:16                  00:47
7    Titanic                                 03:18      01:01                  01:42
     Total                                   16:03      03:18                  07:20
Figure 9: Lead actors identified for popular movies appear on the right.
The lead actors detected are shown in Figure 9. The topmost star is highlighted in red and the next top star in blue. As can be seen in the figure, the topmost stars are detected in most of the movies. Since the definition of “top” is subjective, in some cases top stars are not detected. Further, in some cases the program identifies the same actor multiple times; this could be due to disguise or to pose variation. The result is further refined for better accuracy, as mentioned earlier.
Player of the Match Identification We have conducted experiments on soccer match highlights taken from the BBC, listed in Table 2. Our method on average takes half the duration of the video. Note, however, that these timings are for sections that have already been manually edited by the BBC staff. If the method were run on a routine full soccer match, we expect our running time to be a lower percentage of the entire video.
Figure 10: Key players detected from BBC Match of the Day match highlights for the Premier League 2007-08.
The results of key player detection are presented in Figure 10. The key players of the match are identified for all the matches.
Host Guest Detection We conducted our experiment on the TV show Koffee with Karan. Two different episodes of the show were combined and fed as input. Our method identified the host in 4 minutes for a video of duration 1 hour 29 minutes. Our method is faster than the method proposed by Javed et al. [2001].
Table 2: Time taken for computing key players from BBC MOTD highlights for the Premier League 2007-08.

No.  Soccer Match                          Duration   Computation
                                           (hh:mm)    (hh:mm)
1    Barnsley vs Chelsea                   00:02      00:01
2    Birmingham City vs Arsenal            00:12      00:04
3    Tottenham vs Arsenal                  00:21      00:07
4    Chelsea vs Arsenal                    00:14      00:05
5    Chelsea vs Middlesbrough              00:09      00:05
6    Liverpool vs Arsenal                  00:12      00:05
7    Liverpool vs Havant & Waterlooville   00:15      00:06
8    Liverpool vs Middlesbrough            00:09      00:05
9    Liverpool vs Newcastle United         00:18      00:04
     Total                                 01:52      00:43
Table 3: Time taken for identifying the host in the TV show Koffee With Karan. Two episodes are combined into a single video and given as input.

TV show             Duration   Computation
                    (hh:mm)    (hh:mm)
Koffee With Karan   01:29      00:04
The result of our method for the TV show Koffee with Karan is presented in Figure 11. Our method has successfully identified the host.
Figure 11: Host detection for the TV show “Koffee with Karan”. Two episodes of the show are combined and given as input. The first person in the detected list (sorted by weight) is the host.
AutoMontage: Photo Sessions Made Easy!
It is no longer only professional photographers who take pictures. Almost anyone has a good camera and often takes a lot of photographs. Group photographs as a photo session at reunions, conferences, weddings, and so on are de rigueur. It is difficult, however, for novice photographers to capture good expressions at the right time and realize a consolidated acceptable picture.
A video shoot of the same scene ensures that expressions are not missed. Sharing the video, however, may not be the best solution. Besides the obvious bulk of the video, poor expressions (“false positives”) are also willy-nilly captured and might prove embarrassing. A good compromise is to produce a mosaiced photograph assembling good expressions and discarding poor ones. This can be achieved by cumbersome manual editing; in this work, we provide an automated solution, illustrated in Figure 12. The photo shown has been created from a random YouTube video excerpt and does not exist in any frame of the original video.
Figure 12: An automatic photo montage created by assembling “good” expressions from a video
(randomly picked from YouTube). This photo did not exist in any of the input frames.
Related Work
Research in measuring the quality of facial expressions has appeared elsewhere, in applications such as detecting the expressions of medical patients to sense pain Lucey et al. [2011], measuring children's facial expressions during problem solving Littlewort et al. [2011], and analyzing empathetic interactions in group meetings Kumano et al. [2011]. Our work focuses on generating an acceptable photo montage from a photo-session video and is oriented towards the targeted goal of discarding painful expressions and recognizing memorable expressions.
In regard to photo editing, researchers have come up with many interesting applications, like organizing photos based on the people present in them Lee et al. [2012]; Li et al. [2003]; Li et al. [2006]; Das and Loui [2003], correcting an image with closed eyes to one with open eyes Liu and Ai [2008], and morphing photos Lim et al. [2010]. Many of these methods either require manual intervention or involve a different and enlarged problem scope, resulting in more complicated algorithms that render them inapplicable to our problem.
The closest work to ours is presented in Agarwala et al. [2004], where a user selects the part they want from each photo. These parts are then merged using the technique of graph cuts to create a final photo. Our work differs from Agarwala et al. [2004] in a few ways. Faces are selected automatically by determining pleasing facial expressions (based on an offline machine learning strategy), video input is allowed, enabling a larger corpus of acceptable faces, and the complete process is automated, thereby making it easier for end users.
Technical Contributions
The technical contributions of this work include:
• A frame analyzer that detects camera panning motion to generate candidate frames
• An expression analyzer for detected faces
• A photo patcher that enables seamless placement of faces in group photos
Details of these steps appear in the following section.
Methodology
In this section, we present a high-level overview of our method, followed by the details. The steps involved in AutoMontage creation are:
1. Face detection and expression measurement
2. Base photo selection
3. Montage creation
First, we detect faces in all the frames and track them based on their positions. Then we measure the facial expression of each detected face and select a manageable subset. The next step is to identify plausible frames which can be merged to create a mosaic; frames of a video that are “far away” in time, or otherwise unrelated frames, cannot be merged. In the last phase, faces with the best expressions are selected and substituted into the mosaiced image using graph cuts and blending. Figure 13 illustrates these steps.
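As a rough illustration of the final patching phase, the sketch below assumes that faces have already been detected, tracked, and scored for expression (the scores and the track format are hypothetical); OpenCV's seamless cloning is used here as a stand-in for the graph-cut and blending step of the actual method.

```python
import numpy as np
import cv2

def best_faces(face_tracks):
    """face_tracks: dict person_id -> list of (frame_idx, bbox, expression_score).
    Pick the highest-scoring face observation for each tracked person."""
    return {pid: max(obs, key=lambda o: o[2]) for pid, obs in face_tracks.items()}

def patch_montage(frames, base_idx, face_tracks):
    """Paste each person's best face onto the base frame with seamless cloning,
    standing in for the graph-cut + blending step of the actual method."""
    montage = frames[base_idx].copy()
    for pid, (fidx, (x, y, w, h), _) in best_faces(face_tracks).items():
        if fidx == base_idx:
            continue  # base frame already shows this person's best expression
        src = frames[fidx][y:y + h, x:x + w]
        mask = 255 * np.ones(src.shape[:2], dtype=np.uint8)
        center = (x + w // 2, y + h // 2)
        montage = cv2.seamlessClone(src, montage, mask, center, cv2.NORMAL_CLONE)
    return montage
```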
Figure 13: A schematic of our method (face detection, face position tracking, facial expression measurement, base photo selection, montage creation). In the first step, detected faces are tracked and grouped together. In the next step, frames are analyzed to detect a base photo frame, which can either be a mosaic of frames or a single frame with the maximum number of good-expression faces from the video. In the last step, faces with good expressions are patched into the base photo to form the required photo montage.
Experiments
To compare our method, we ran our experiments on the stack of images provided in Agarwala et al. [2004]. As can be seen, the output photo generated has good expressions for most of the people. The result is presented in Figure 14. Note that we generate the photo montage automatically, without user input, in contrast to the original method.
Our system has also been tested on other group photo sessions collected from YouTube. Our algorithm successfully created photo montages from all these videos. Examples are presented in Figures 14, 15, and 16.
In Figure 14, though most of the faces look good, there is an artifact on the third person from the top-left corner. This was introduced by the patching scheme when the best expressions of multiple faces were substituted. This scenario demands more accurate face detection to avoid such artifacts.
Figure 14: AutoMontage created from the family stack of images of Agarwala et al. [2004]. We are able to generate the photo montage automatically, as opposed to the method in Agarwala et al. [2004].
Figure 15: Photo montage example with acceptable expressions. The YouTube video is available at http://www.youtube.com/watch?v=KQ73-P9HGiE.
Figure 16: Photo montage example with acceptable expressions. A wedding reception video.
Cricket Delivery Detection
Cricket is one of the most popular sports in the world after soccer. Played globally in more than a dozen countries, it is followed by over a billion people. One would therefore expect, following other trends in global sports, that there would be meaningful analysis and mining of cricket videos.
There has been some interesting work in this area (for example, K. et al. [2006]; H. et al. [2008]; Pal and Chandran [2010]). However, by and large, the amount of computer vision research does not seem to be commensurate with the interest and the revenue in the game. Possible reasons could be the complex nature of the game and the variety of views one can see, as compared to games such as tennis and soccer. Further, the long duration of the game might inhibit the use of inefficient algorithms. Segmenting a video into its meaningful units is very useful for the structural analysis of the video, and in applications such as content-based retrieval. One meaningful unit corresponds to the semantic notion of “deliveries” or “balls” (virtually all cricket games are made up of 6-ball overs). Consider:
• A bowling coach might be interested in analyzing only the balls faced by, say, Steve Smith in an entire match. If we could temporally segment a cricket match into balls, it would be possible to watch only these portions.
• The game's administrator might want to figure out how many minutes are consumed by slow bowlers as compared to fast bowlers. Currently such information is available only through the manual process of segmenting a long video (more than 7 hours).
Figure 17: Typical views in the game of cricket: (a) Pitch View, (b) Ground View, (c) Non-Field View. Note that the ground view may include non-trivial portions of the pitch.
Prior Work
Indeed, the problem of segmenting a cricket video into meaningful scenes is addressed in K. et al. [2006]. Specifically, the method uses the manual commentaries available for a cricket video to segment it into its constituent balls. Once segmented, the video is annotated with the text for higher-level content access. A hierarchical framework and algorithms for cricket event detection and classification are proposed in H. et al. [2008]. The authors use a hierarchy of classifiers to detect the various views present in the game of cricket. The views with which they are concerned are real-time, replay, field view, non-field view, pitch view, long view, boundary view, close-up, crowd, and so on.
Despite the interesting methods in these works – useful in their own right – there are some challenges. The authors in H. et al. [2008] have only worked on view classification, without addressing the problem of video segmentation. Our work closely resembles the method in K. et al. [2006], and is inspired by it from a functional point of view. It differs dramatically in the complexity of the solution and in the basis for the temporal segmentation. Specifically, as the title of K. et al. [2006] indicates, that work is text-driven.
In Binod's work Pal and Chandran [2010], he addresses the problem of segmenting cricket videos into the meaningful structural unit termed a ball. Intuitively, a ball starts with the frame in which “a bowler is running towards the pitch to release the ball.” The end of the ball is not necessarily the start of the next ball; the ball is said to end when the batsman has dealt with it. A ball corresponds to a variety of views: the end frame might be a close-up of a player or players (e.g., a celebration), the audience, a replay, or even an advertisement.
Because of this varied nature of views, he uses both domain knowledge of the game (views and their sequence) and TV production rules (location of the scoreboard, its presence and absence). In his approach, the input video frames are first classified into two semantic concepts: play and break. The break segments include replays, graphics, and advertisements. The views that we get in the game of cricket include close-ups of the players and a variety of views defined as follows (examples of the various views are presented in Figure 19):
• Pitch View (P): bowler run-up, ball bowled, ball played.
• Ground View (G): ball tracked by the camera, ball fielded and returned.
• Non-Field View (N): any view that is not P or G. This includes views such as close-ups of the batsman, bowler, umpire, and the audience.
Further, Binod defines Field Views (F) as the union of P and G. Using this vocabulary, he creates an alphabetical sequence (see the block diagram of the system in Fig. 18). Representative views are shown in Fig. 17. Note that this particular classification is not intended to be definitive or universal; it may even appear subjective. The process of creating this sequence is as follows (a small sketch of the final extraction step appears after the list):
• Segment the video frames into play and break segments.
• Classify play segment frames into F or N.
• Classify F view frames into P or G.
• Smooth the resulting sequence.
• Extract balls using a rule-based system.
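To make the rule-based extraction concrete, here is a small Python sketch operating on a per-frame view-label sequence. The single rule used below (a sufficiently long pitch-view run, followed to the end of the field views) is an assumed simplification of the full rule set in Pal and Chandran [2010], and the minimum run length is a placeholder parameter.

```python
def extract_balls(views, min_pitch_run=10):
    """Return (start, end) frame indices of deliveries from a per-frame view sequence.
    Views are labelled 'P' (pitch), 'G' (ground), or 'N' (non-field).
    A ball starts at a long enough pitch-view run and ends when non-field views resume."""
    balls, i, n = [], 0, len(views)
    while i < n:
        if views[i] == 'P':
            j = i
            while j < n and views[j] == 'P':
                j += 1
            if j - i >= min_pitch_run:          # a genuine bowler run-up / delivery
                k = j
                while k < n and views[k] in ('P', 'G'):
                    k += 1                       # follow the ball being fielded
                balls.append((i, k))
                i = k
                continue
        i += 1
    return balls
```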
Figure 18: Block diagram of the system. A cricket video is first processed for scoreboard position detection and play/break detection; break segments are discarded, play frames are classified into field and non-field views using the dominant grass color ratio, and field views are further classified into pitch and ground views by a classifier trained on color, edge, and shape features.
In summary, his contribution lies in modeling the ball as a sequence of relevant views and devising appropriate techniques to obtain the vocabulary in the sequence.
However, his method fails when views are misclassified due to pop-up ads or too few frames in a particular view. In our work, we focus on improving accuracy in such cases and, further, we identify the key players of each ball to help in indexing the match.
Problem Statement
The cricket delivery detection method proposed by Pal and Chandran [2010] fails when views are misclassified. We propose the following improvements to handle these cases:
1. Eliminate pop-up advertisements
2. Enhance the pitch view detection logic to eliminate false positives
3. Enhance the smoothing logic so that it does not smooth out the required views
4. Associate each delivery with the prominent player
Figure 19: Illustration of various problems where Binod Pal's work fails.
Methodology
In our approach, we improve on Binod Pal's work and address the issues presented in the problem statement.
Pop-up Advertisement Detection and Removal
Whenever the view classifier detects an advertisement, we further examine the frame to check whether it is a pop-up ad and eliminate it. Pop-up advertisements are relatively static compared to the game play, where there is more activity. We use this to detect advertisements and eliminate them. From the training set we extract the static component image from the motion vectors, which has negligible motion. From this, we detect the prominent regions at the left and bottom of the image and then smooth the region boundaries to get the ad boundary.
When this method fails to detect the advertisement boundary, we use the scoreboard to determine it. We look for the scoreboard boundary in the first quadrant of the image. Once the scoreboard position is detected, the advertisement boundaries are detected based on the displacement distance.
Green Dominance Detection for Field View Classification In the prior work Pal and Chandran [2010], there were false positives in pitch detection when a player's clothing is green. To reduce such false positives, we divide the frame into 9 blocks and feed a 128-bin hue histogram of each block as additional features to the view classifier. As the green ground area surrounds the pitch and the pitch is at the center, this helps reduce false positives.
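A sketch of the blockwise hue-histogram feature is given below; it assumes OpenCV's 0-179 hue range and equal-sized blocks, and the exact binning and normalization in our implementation may differ.

```python
import numpy as np
import cv2

def block_hue_histograms(frame_bgr, grid=3, bins=128):
    """Divide the frame into grid x grid blocks and compute a hue histogram per block.
    The concatenated histograms are appended to the view classifier's feature vector."""
    hue = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)[:, :, 0]
    h, w = hue.shape
    feats = []
    for r in range(grid):
        for c in range(grid):
            block = hue[r * h // grid:(r + 1) * h // grid,
                        c * w // grid:(c + 1) * w // grid]
            hist, _ = np.histogram(block, bins=bins, range=(0, 180))
            feats.append(hist / max(1, hist.sum()))   # normalize each block
    return np.concatenate(feats)                      # length grid*grid*bins
```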
Smoothing of View Sequence Using Moving Window
In the prior work, smoothing was done by dividing the view sequence into windows of 5 frames and assigning the most frequently occurring view to all the frames in that window. As this method can remove views which fall across window boundaries, we use a running window instead of dividing the view sequence. Each frame is considered the center of a window, and the most frequently occurring view in that window is assigned to the frame. This approach does not eliminate views which originally fell across window boundaries.
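The running-window smoothing can be sketched as follows; the window size of 5 matches the prior work, while the tie-breaking behavior shown is an assumption.

```python
from collections import Counter

def smooth_view_sequence(views, half_window=2):
    """Running-window smoothing: each frame takes the most frequent view label
    in the window centred on it (window size 2*half_window + 1 = 5 by default),
    instead of chopping the sequence into disjoint windows of 5."""
    smoothed = []
    for i in range(len(views)):
        window = views[max(0, i - half_window): i + half_window + 1]
        smoothed.append(Counter(window).most_common(1)[0][0])
    return smoothed

# Example: a short ground-view run straddling a fixed-window boundary is preserved
print(smooth_view_sequence(list("PPPPGGGNNNN")))
```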
Player Detection
Once deliveries are detected, further filtering of the video is possible. For example, we consider classifying deliveries based on a particular player, such as the Australian cricket captain Steve Smith.
We can use a standard frontal face detector to detect the faces of bowlers. Faces of batsmen, on the other hand, are confounded by the usual use of helmets. So we look for additional information in the form of connected regions (in the skin color range) and square regions which have more edges. For each delivery, face detection is performed for 100 frames before and after the actual delivery. This buffer of 200 frames is assessed to recognize the faces of key players associated with the delivery.
The usual method of dimensionality reduction and clustering is used to prune the set. With this background, if an input image of a particular player is given, the image is mapped to the internal low-dimensional representation and the closest cluster is found using a kd-tree. The corresponding deliveries in which that player appears are displayed.
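The retrieval step can be sketched as below, with PCA via SVD standing in for the dimensionality reduction and a SciPy kd-tree for the nearest-neighbor lookup; the descriptor dimensions and the number of neighbors are assumed values.

```python
import numpy as np
from scipy.spatial import cKDTree

def build_player_index(face_vectors, delivery_ids, n_components=32):
    """face_vectors: (N, D) array of face descriptors gathered around deliveries.
    delivery_ids: length-N list mapping each face to its delivery.
    Projects faces to a low-dimensional space (PCA via SVD) and builds a kd-tree."""
    mean = face_vectors.mean(axis=0)
    centred = face_vectors - mean
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    basis = vt[:n_components].T
    return cKDTree(centred @ basis), basis, mean, list(delivery_ids)

def deliveries_for_player(query_face, index, basis, mean, delivery_ids, k=20):
    """Map the query face into the same low-dimensional space and return the
    deliveries associated with its nearest stored faces."""
    q = (query_face - mean) @ basis
    _, idxs = index.query(q, k=min(k, len(delivery_ids)))
    return sorted({delivery_ids[i] for i in np.atleast_1d(idxs)})
```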
Experiments and Results
We have conducted our experiments on one-day international matches, test matches, and T20 matches, summing to 50 hours of video. We compared our pitch/ground view classification against SIFT features. The features proposed by us produce promising results, presented in Table 4.
Figure 20: Retrieval of deliveries involving only certain players (an input player image on the left, retrieved deliveries on the right).
Table 4: Comparison of methods for pitch/ground view classification across different matches.

Method          Precision   Recall
Our Method      0.982       0.973
SIFT features   0.675       0.647
Our method successfully detected deliveries with an overall precision of 0.935 and recall of 0.957. The results are presented in Table 5.
Table 5: Results of detected deliveries for different types of matches.

Match Type   Precision   Recall
ODI          0.932       0.973
Test Match   0.893       0.931
T20          0.982       0.967
The filtered retrieval of deliveries is presented in Fig. 20.
Hierarchical Summarization for Easy Video Applications
Say you want to find an action scene of your favorite hero, or want to watch a romantic movie with a happy ending. Or say you saw an interesting trailer and want to watch related movies. Finding these videos in the ocean of available videos has become noticeably difficult and requires a trustworthy friend or editor. Can we quickly design computer “apps” that act like this friend?
This work presents an abstraction for making this happen. We present a model which summarizes the video in terms of dictionaries at different hierarchical levels – pixel, frame, and shot. This makes it easier to create applications that summarize videos and address complex queries like the ones listed above.
The abstraction leads to a toolkit that we use to create several different applications, demonstrated in Figure 21. In the “Suggest a Video” application, three Matrix sequels were given as input from a set of action movies; the movie Terminator was found to be the closest suggested match. In the “Story Detection” application, the movie A Walk to Remember is segmented into three parts; user experience suggests that these parts correspond to prominent changes in the twists and plot of the movie. In the “Audio Video Mix” application, given a song with a video as a backdrop, the application finds another video with a similar song and video; the application can thus be used to generate a “remix” for the input song. This application illustrates the ability of the data representation to find a video which closely matches both content and tempo.
Related Work
Methods like hierarchical hidden Markov models Xie et al. [2003]; Xu et al. [2005] and latent semantic analysis Niebles et al. [2008]; Wong et al. [2007] have been used to build models over basic features to learn semantics. In the method proposed by Xie et al. [2003], the structure of the video is learned in an unsupervised way using a two-level hidden Markov model. The higher level corresponds to semantic events, while the lower level corresponds to variations within the same event. Xu et al. [2005] propose multi-level hidden Markov models using detectors and connectors to learn videos.
Spatio-temporal words Laptev [2005]; Niebles et al. [2008]; Wong and Cipolla [2007]; Wong et al. [2007] are interest points identified in space and time. Laptev [2005] uses a spatio-temporal Laplacian operator over spatial and temporal scales to detect events. Niebles et al. [2008] use probabilistic latent semantic analysis (pLSA) on spatio-temporal words to capture semantics. Wong and Cipolla [2007] detect spatio-temporal interest points using global information and then use pLSA Wong et al. [2007] for learning the relationships between the semantics, revealing structural relationships. Soft quantization Gemert et al. [2008], which accounts for the distance from a number of codewords, has been considered for classifying scenes, and Fisher Vectors have been used in classification Sun and Nevatia [2013]; Jegou et al. [2010], showing significant improvement over BOW methods. A detailed survey of various video summarization models can be found in Chang et al. [2007].
Figure 21: Retrieval applications designed using our Hierarchical Dictionary Summarization method. “Suggest a Video” suggests Terminator for the Matrix sequels. “Logical Turning Point Detection” segments a movie into logical turning points in the plot of the movie A Walk to Remember. “Storyline Generation” generates a storyline of the movie based on logical turning points. “Remix Candidate Identification” generates a “remix” song from a given set of music videos. “Shot Type Segmentation” segments a TV show into close-up views and chatting views.
Similar to Sun and Nevatia [2013]; Jegou et al. [2010], we use local descriptors and form visual dictionaries. However, unlike Sun and Nevatia [2013]; Jegou et al. [2010], we preserve more information instead of extracting only specific information. In addition to building a dictionary at the pixel level, we extend this dictionary to the frame and shot levels, forming a hierarchical dictionary. Having similarity information available at various granularities is the key to creating applications that need features at the desired level.
Technical Contributions
In this work, we propose a Hierarchical Dictionary Model (termed H-Video) to make the task of creating applications easier. Our method learns semantic dictionaries at three different levels – pixel patches, frames, and shots. A video is represented in the form of learned dictionary units that reveal semantic similarity and video structure. The main intention of this model is to provide these semantic dictionaries, so that comparing video units at different levels, within the same video and across different videos, becomes easier.
The benefits of H-Video include the following:
(i) The model advocates run-time leveraging of prior offline processing. As a consequence, applications run fast.
(ii) The model is built in an unsupervised fashion. As no application-specific assumption is made, many retrieval applications can use this model and its features. This can potentially save an enormous amount of computation time otherwise spent in learning.
(iii) The model represents the learned information using a hierarchical dictionary. This allows a video to be represented as indexes into dictionary elements, which makes it easier for the developer of a new retrieval application, as similarity information is available as a one-dimensional array. In other words, our model does not demand a deep video understanding background from application developers.
(iv) We have illustrated our model through several applications; Figure 21 illustrates these applications.
Methodology
We first give an overview of the process which is illustrated in Figure 22.
Figure 22: Illustration of the H-Video abstraction. On the dictionary formation side, features extracted from frames are clustered to form the H1 dictionary, features of the H1 representations of the frames in a shot are clustered to form the H2 dictionary, and features of the H2 representations of the shots in a video are clustered to form the H3 dictionary. On the dictionary representation side, pixels in a frame, frames in a shot, and shots in a video are represented in terms of the H1, H2, and H3 dictionaries, respectively.
Our model first extracts local descriptors such as color and edge from pixel patches (color and edge descriptors are simply examples). We then build a dictionary, termed the H1 dictionary, out of these features. At this point, the video could, in principle, be represented in terms of this H1 dictionary; we refer to each frame of this representation as an H1 frame. We then extract slightly higher-level features, such as the histogram of H1 dictionary units, the number of regions in these H1 frames, and so on, and form a new H2 dictionary. The H2 dictionary is yet another representation of the video and captures the types of objects and their distribution in the scene; in this way, it captures the nature of the scene. The video can also be represented using this H2 dictionary; we refer to each shot in this representation as an H2 shot. Further, we extract features based on the histogram and position of H2 dictionary units and build yet another structure, the H3 dictionary. This dictionary represents the types of shots occurring in the video. The video is now represented in terms of this H3 dictionary to form the H3 video.
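A compact sketch of the dictionary formation is given below, using k-means as the clustering step; the dictionary sizes and the histogram features are placeholders rather than the exact choices made in the thesis.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_h1(patch_features, k1=256):
    """Cluster per-patch descriptors (e.g. color/edge) into the H1 dictionary."""
    return KMeans(n_clusters=k1, n_init=4, random_state=0).fit(patch_features)

def frame_to_h1_histogram(frame_patches, h1):
    """Represent a frame as a histogram over H1 dictionary units."""
    labels = h1.predict(frame_patches)
    return np.bincount(labels, minlength=h1.n_clusters) / max(1, len(labels))

def build_h2(frame_histograms, k2=64):
    """Cluster H1 frame histograms into the H2 dictionary (types of frames)."""
    return KMeans(n_clusters=k2, n_init=4, random_state=0).fit(frame_histograms)

def shot_to_h2_histogram(shot_frame_histograms, h2):
    """Represent a shot as a histogram over the H2 units of its frames."""
    labels = h2.predict(shot_frame_histograms)
    return np.bincount(labels, minlength=h2.n_clusters) / max(1, len(labels))

def build_h3(shot_histograms, k3=16):
    """Cluster H2 shot histograms into the H3 dictionary (types of shots)."""
    return KMeans(n_clusters=k3, n_init=4, random_state=0).fit(shot_histograms)
```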
Applications
In this section, we demonstrate the efficiency and applicability of our model through four retrieval applications – video classification, suggest a video, logical turning point detection, and searching for audio video mix candidates.
The applications designed using our model are fast, as they use pre-computed information. Typically these applications take a minute to compute the required information. This illustrates the efficiency and applicability of the model for content-based video retrieval.
Suggest a Video
Websites like imdb.com present alternate related movies when the user is exploring a specific movie. This is a good way for users to find new or forgotten movies they would be interested in. Currently such suggestions are mostly based on text annotations and user browsing history, which the user may want to turn off due to privacy considerations. If visual features of the video can also be used, better suggestions can evolve.
This scenario is illustrated in Figure 23, where the movies the user was interested in are highlighted in red and a suggested movie is highlighted in green; the green one is presumably closest to the user's interest.
Figure 23: Given a set of movies of interest (in red), a movie which is similar to them is suggested (in green).
In this application, we aim to automatically suggest videos based on the similarity of the various dictionary units. There are many ways to compare videos, such as comparing the feature vectors of the dictionaries, computing the number of overlapping dictionary units, computing correlation, and so on. In this case, we take the sequence of H2 dictionary units and substitute each unit with the corresponding dictionary feature vector. We then take the cross-correlation of the H2 representation features to compute the similarity between videos. A video matching with a value above a threshold is suggested to the user.
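The comparison can be sketched as follows; the z-normalization, the peak-correlation score, and the threshold are assumptions, since the synopsis does not fix these details.

```python
import numpy as np

def h2_signal(h2_unit_sequence, h2_feature_vectors):
    """Replace each H2 unit index in a video's sequence by the corresponding
    dictionary feature vector, giving a (T, D) signal."""
    return np.stack([h2_feature_vectors[u] for u in h2_unit_sequence])

def video_similarity(sig_a, sig_b):
    """Peak normalized cross-correlation between two H2 signals, averaged over
    feature dimensions; signals are z-normalized per dimension first."""
    def znorm(x):
        return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)
    a, b = znorm(sig_a), znorm(sig_b)
    scores = []
    for d in range(a.shape[1]):
        xc = np.correlate(a[:, d], b[:, d], mode="full")
        scores.append(xc.max() / min(len(a), len(b)))
    return float(np.mean(scores))

def suggest(input_sig, candidates, threshold=0.5):
    """Return names of candidate videos whose similarity exceeds the threshold."""
    return [name for name, sig in candidates.items()
            if video_similarity(input_sig, sig) > threshold]
```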
Video Classification In this application, we identify the various genres a movie could belong to and suggest this classification. This reduces the effort involved in tedious annotation and helps users choose appropriate genres easily. In our approach, we train a random forest model for each genre using the H2 and H3 histograms of the sample set. Given a new movie and a genre, the model outputs whether the movie belongs to that genre or not.
Figure 24: Movies are classified into the drama and action genres.
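As an illustration of how such per-genre models could be trained, assuming the H2 and H3 histograms are already concatenated into one feature vector per movie, here is a sketch using scikit-learn's random forest (the forest size is an assumed value):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_genre_models(h23_features, genre_labels):
    """Train one binary random forest per genre.
    h23_features: (N, D) array of concatenated H2 and H3 histograms per movie.
    genre_labels: dict genre -> length-N array of 0/1 membership labels."""
    return {g: RandomForestClassifier(n_estimators=100, random_state=0).fit(
                h23_features, y)
            for g, y in genre_labels.items()}

def genres_of(movie_feature, models):
    """Return the genres whose model predicts membership for this movie."""
    x = np.asarray(movie_feature).reshape(1, -1)
    return [g for g, m in models.items() if m.predict(x)[0] == 1]
```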
Logical Turning Point Detection We define a logical turning point in a video as a point where the characteristics of the objects in the video change drastically. Detecting such places helps in summarizing the video effectively; see Figure 25. As illustrated in the figure, the characteristics of the people occurring in each of the differing stories are different.
We consider shots within a moving window of 5 shots and compute the semantic overlap of H1 and H2 units between the shots. When the amount of overlap between shots is low in a specific window of shots, we detect that as a logical turning point.
As the logical turning points capture the important events in the video, this can be used in applications like advertisement placement and preparing the story line.
Figure 25: Story line of the movie A Walk to Remember split into story units. Characters are introduced in the first story unit; in the second story unit, the girl and the boy are participating in a drama; in the last part, they fall in love and get married. In spite of the lengths of the stories being very different, our method successfully identifies these different story units.
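A sketch of this detector is given below; Jaccard overlap over dictionary-unit sets stands in for the semantic overlap measure, and the threshold is an assumed value.

```python
import numpy as np

def shot_overlap(units_a, units_b):
    """Jaccard overlap between the dictionary units present in two shots."""
    a, b = set(units_a), set(units_b)
    return len(a & b) / max(1, len(a | b))

def logical_turning_points(shot_units, window=5, threshold=0.15):
    """shot_units: list, per shot, of the H1/H2 unit indices occurring in it.
    A turning point is flagged at the centre of a window whose average pairwise
    overlap falls below the threshold."""
    points = []
    for s in range(len(shot_units) - window + 1):
        block = shot_units[s:s + window]
        overlaps = [shot_overlap(block[i], block[j])
                    for i in range(window) for j in range(i + 1, window)]
        if np.mean(overlaps) < threshold:
            points.append(s + window // 2)
    return points
```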
Potential Candidate Identifier for Audio Video Mix
A remix is the process of generating a new video from existing videos by changing either the audio or the video. Remixes are typically performed on video songs for variety in entertainment. The remix candidates need to have a similar phase in the change of scene to generate a pleasing output.
We use the cross-correlation of H2 units to identify whether two videos have the same phase of change. Once the closest candidate is identified, we replace the remix candidate's audio with the original video's audio.
Experiments
In this section, we first evaluate the individual performance of the constituents of the H-Video model itself. Next, in creating dictionaries, we compare the use of popular features such as SIFT. Finally, the performance of applications that may use more than one of the three levels, H1, H2, or H3, is evaluated.
Data: We have collected around 100 movie trailers from YouTube, twenty-five full-length movies, and a dozen music video clips.
Computational Time: Our implementation is in Matlab. Typically a full-length movie takes around two hours for feature extraction. Building local dictionaries for a movie takes around 10 hours. Building the global dictionary, which is extracted from the local dictionaries of multiple videos, takes around 6 hours. Note that building the local dictionaries and the global dictionary (left-hand side of Fig. 22) are one-time jobs. Once these are built, the dictionaries are used directly to create the right-hand side of Fig. 22. In other words, the relevant model building typically takes two hours, which is no different from the average runtime of the video itself. Once the model is constructed, each access operation typically takes only around 10 seconds per video.
Individual Evaluation of H1 and H2
To illustrate the effectiveness of the H1 dictionary, we considered the classification problem and collected videos from the categories “car”, “cricket”, and “news anchor”. We collected 6 videos from each category, for a total of 18 videos. We computed the H1 dictionary for each of these videos, formed a global H1 dictionary for the dataset, and represented all videos in terms of this global dictionary. For testing purposes, we randomly selected two videos from each category as training data and the remaining videos as the test set, and classified each test video into one of the three categories.
The recall and precision of classification using only H1 are provided in Table 6.
Table 6: Classification using only H1. With limited information, we are able to glean classification hints.

Category      Precision   Recall
Car           1.00        0.75
Cricket       0.67        1.00
News Anchor   1.00        0.75
To evaluate the effectiveness of only the H2 dictionary units, we built our model on a TV interview show; this had scenes of individuals as well as groups of people. When we built H2 dictionaries with the allowed error set to “top principal component / 1000”, we obtained six categories capturing different scenes, people, and their positions. As we relaxed the allowed error, the result was two categories, distinguishing scenes of individuals from groups of people. This result is presented in Fig. 26. Hence applications can tune the allowed-error parameter to suit their requirements.

Figure 26: Classification using only H2: (a) H2 dictionary units with smaller allowed error; (b) H2 dictionary units with larger allowed error. With only limited information and a smaller allowed error, fine details were captured. With a larger allowed error, broad categories were captured.

Table 7: Video suggestion using popular features.

Features      Direct Comparison   H-Model   Percentage Improvement
SURF          54%                 54%       0%
SIFT          29%                 53%       83%
Color, Edge   48%                 59%       23%
Evaluation of Alternate Features
In this section we evaluate popular features like SURF and SIFT, and contrast them with the color and edge features used in this work. Given any feature set, one may do a “direct comparison” (which takes longer), or use our proposed H-Video model-based comparison (which takes far less time). This experiment is performed on the “Suggest a Video” problem using only trailers of movies as the database. The result is presented in Table 7.
When the H-Video model is used, we use H2 as the basis of comparison. We observe that the use of the hierarchical model helped improve the accuracy for SIFT and for the color and edge features; the accuracy was almost the same when using SURF features. In producing these statistics, we used the information available on imdb.com as ground truth. One problem in using imdb.com is that the truth is limited to the top twelve related titles only. We therefore added transpose and transitive relationships as well. (In a transpose relationship, if a movie A is deemed to be related to B, we mark B as related to A. In transitivity, if a movie A is related to B, and B is related to C, then we mark A as related to both B and C. The transitive closure is maintained via recursion.)
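The closure computation can be sketched as follows; this iterative version is equivalent to the recursive formulation mentioned above, and the input format (movie name to set of related movies) is an assumption.

```python
def relation_closure(related):
    """related: dict movie -> set of related movies (e.g. from imdb.com top-12 lists).
    Returns the symmetric (transpose) and transitive closure of the relation."""
    # Symmetric closure: if A is related to B, mark B as related to A.
    graph = {m: set(r) for m, r in related.items()}
    for m, rs in related.items():
        for r in rs:
            graph.setdefault(r, set()).add(m)
    # Transitive closure: repeatedly add neighbours of neighbours until stable.
    changed = True
    while changed:
        changed = False
        for m in list(graph):
            extra = set().union(*(graph.get(r, set()) for r in graph[m])) - {m}
            if not extra <= graph[m]:
                graph[m] |= extra
                changed = True
    return graph
```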
Evaluation of Video Classification
We considered three genres for video classification. We took the category annotations of 100 movie trailers and, for each category, used 30% of the data for training and the rest as the test set. We built the H-Video model for these videos, extracted the H2 and H3 representations, and classified them using the random forest model. Example output is shown in Fig. 27.
Figure 27: Sample Result of classifying movie trailers for categories Drama, Action and Romance. In most of
the cases, our model has classified movie trailers correctly.
Evaluation of Logical Turning Point Detection
Typically drama movies have three logical parts: first the characters are introduced, then they get together, and finally a punchline is presented towards the end. Considering this as ground truth, we validated the detected logical turning points. The logical turning points were detected with a precision of 1.0 and a recall of 0.75.
Evaluation of Potential Remix Candidates
We have conducted experiments on 20 song clips, where the aim is to find the best remix candidates. Our algorithm found two pairs of songs which are the best candidates for remixing in the given set. Sample frames from the matched videos are presented in Fig. 28.
Figure 28: Sample frames from the identified remix candidates: (a) Remix Candidates – Set 1; (b) Remix Candidates – Set 2. In each set, the top row corresponds to an original video song and the second row corresponds to the remix candidate. The content and sequencing of the first and second rows match, suggesting the effectiveness of our method.
Conclusion and Future Work
Videos have revolutionized many of the ways we receive and use information every day. The availability of online resources has changed many things, and the usage of videos has transformed dramatically, demanding better ways of retrieving them.
Many diverse and exciting initiatives demonstrate how visual content can be learned from video. Yet, at the same time, reaching videos online using visual content has arguably been less radical, especially in comparison to text search. There are many complex reasons for this slow pace of change, including the lack of proper tags and the huge computation time taken for learning specific topics.
In this thesis, we have presented ways to optimize computation time, which is the most important obstacle to retrieving videos. Next, we have focused on producing visual summaries, so that users can quickly decide whether a video is of interest to them. We have also presented ways to slice and dice videos so that users can quickly reach the segment they are interested in.
1. Lead Star Detection: Detecting lead stars has numerous applications, such as identifying the player of the match, detecting the lead actor and actress in motion pictures, and guest host identification. Computation time has always been a bottleneck for using this technique. In our work, we have presented a faster method to solve this problem with comparable accuracy. This makes our algorithm usable in practice.
2. AutoMontage: Photo Sessions Made Easy!: With the increased usage of cameras by novices, tools that make photo sessions easy are becoming increasingly valuable. Our work successfully creates a photo montage from photo-session videos, combining the best expressions into a single photo.
3. Cricket Delivery Detection: Inspired by prior work Pal and Chandran [2010], we approached the problem of temporal segmentation of cricket videos. As the view classification accuracy is highly important, we proposed a few corrective measures to improve the accuracy by introducing a pop-up ad eliminator and finer features for view classification. Further, we associated each delivery with the key players of that delivery, providing a browse-by-player feature.
4. Hierarchical Model: In traditional video retrieval systems, relevant features are extracted from
the video and applications are built on those extracted features. Multimedia database retrieval
systems typically require a plethora of applications to satisfy user needs. A unified model based
on fundamental notions of similarity is therefore valuable for reducing the computation time
needed to build such applications.
In this work, we have proposed a novel model called H-Video, which provides the semantic
information needed by retrieval applications. In summary, both the creation time (programmer
time) and the runtime (computer time) of the resulting applications are reduced. First, our model
exposes the semantic information of a video in a simple way, making it easy for programmers to use.
Second, the suggested pre-processing of long video data reduces runtime. We have built four
applications as examples to demonstrate our model; a toy sketch of the underlying idea follows this list.
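As a toy illustration only (this is not the actual H-Video implementation), the sketch below conveys the general idea: shot-level information is pre-computed once into a hierarchy, and several lightweight applications query the same index, so that both programmer effort and repeated computation are reduced. All class, attribute, and label names here are hypothetical.

    from dataclasses import dataclass, field

    @dataclass
    class Shot:
        start: float                                 # seconds
        end: float
        faces: list = field(default_factory=list)    # e.g. face-track identifiers
        labels: list = field(default_factory=list)   # e.g. coarse view/scene labels

    @dataclass
    class VideoIndex:
        # Hypothetical pre-computed hierarchy: video -> shots -> per-shot data.
        shots: list

        def lead_faces(self, top_k=2):
            # Application 1: rank faces by how many shots they appear in.
            counts = {}
            for shot in self.shots:
                for f in shot.faces:
                    counts[f] = counts.get(f, 0) + 1
            return sorted(counts, key=counts.get, reverse=True)[:top_k]

        def segments_with_label(self, label):
            # Application 2: return time segments carrying a given label.
            return [(s.start, s.end) for s in self.shots if label in s.labels]

    # Two different "applications" answered from one pre-processing pass.
    index = VideoIndex(shots=[
        Shot(0.0, 12.0, faces=["A", "B"], labels=["interview"]),
        Shot(12.0, 30.0, faces=["A"], labels=["interview"]),
        Shot(30.0, 41.0, faces=["C"], labels=["music"]),
    ])
    print(index.lead_faces())                  # ['A', 'B']
    print(index.segments_with_label("music"))  # [(30.0, 41.0)]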
Future Work
One future direction is to enhance the hierarchical model to learn concepts automatically from a set of
positive and negative examples using machine learning techniques such as random forests or random
trees. This would be a powerful way to learn a particular tag and apply it to all other videos that
share the same concept. The model can also be extended to match parts of videos instead of whole
videos.
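A minimal sketch of such concept learning, assuming fixed-length feature vectors have already been extracted per video by the hierarchical model (the features, dataset sizes, and threshold below are invented placeholders), might look as follows.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Hypothetical fixed-length feature vectors per video, e.g. pooled
    # bag-of-words, face-count, and motion statistics taken from the
    # hierarchical model's pre-computed shot-level features.
    rng = np.random.default_rng(0)
    positive = rng.random((40, 128))   # videos annotated with the concept
    negative = rng.random((60, 128))   # videos known not to have it

    X = np.vstack([positive, negative])
    y = np.array([1] * len(positive) + [0] * len(negative))

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X, y)

    # Apply the learnt concept to unlabelled videos and propagate the tag.
    unlabelled = rng.random((5, 128))
    scores = clf.predict_proba(unlabelled)[:, 1]
    for i, s in enumerate(scores):
        if s > 0.5:
            print("video %d: tag with the learnt concept (score %.2f)" % (i, s))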
The hierarchical model can also be used to analyse video types and to integrate with other applications.
For example, video types pertaining to an actor can be learnt to create a profile of that actor; once
associated with appropriate tags, the learnt video types can serve complex queries related to the actor.
The hierarchical model can also identify photo-session videos, so that AutoMontage can be applied
automatically to create a photo album.
The future scope of this work also includes developing more applications to find videos of interest to
the user. Video sharing websites create automatic playlists based on a user's choices; the videos in
such a list appear to be based on the user's browsing patterns. However, this has the drawback of
mixing different types of videos, whereas the user may be interested in a specific type. An application
that learns video types from the recommendations and provides filtering options based on various
attributes would therefore help users narrow down their interest. A few attributes that could be used
for such filtering are the lead actors, the number of people in the video, the video types from the
hierarchical model, popular tags associated with the videos, and recency; a small sketch of this kind
of filtering is shown below.
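A minimal sketch of such attribute-based filtering, assuming these attributes are already available for each recommended video (the records, field names, and thresholds are invented for illustration), is given below.

    from datetime import datetime, timedelta

    # Hypothetical attribute records for recommended videos, combining
    # metadata with video types produced by the hierarchical model.
    recommendations = [
        {"title": "Match highlights", "video_type": "cricket",
         "lead_actors": [], "num_people": 22, "tags": ["sports"],
         "uploaded": datetime(2024, 5, 1)},
        {"title": "Guest interview", "video_type": "interview",
         "lead_actors": ["Host X", "Guest Y"], "num_people": 2,
         "tags": ["talk show"], "uploaded": datetime(2024, 6, 20)},
    ]

    def filter_videos(videos, video_type=None, actor=None,
                      max_people=None, tag=None, max_age_days=None):
        # Keep only videos that satisfy every filter the user has set.
        now = datetime(2024, 7, 1)  # fixed date so the example is reproducible
        kept = []
        for v in videos:
            if video_type and v["video_type"] != video_type:
                continue
            if actor and actor not in v["lead_actors"]:
                continue
            if max_people is not None and v["num_people"] > max_people:
                continue
            if tag and tag not in v["tags"]:
                continue
            if max_age_days is not None and now - v["uploaded"] > timedelta(days=max_age_days):
                continue
            kept.append(v)
        return kept

    titles = [v["title"] for v in filter_videos(recommendations,
                                                video_type="interview",
                                                max_age_days=30)]
    print(titles)  # ['Guest interview']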
In conclusion, we feel that generic models with quick retrieval times, combined with user-centric
applications, have much unexplored potential and will become the favored methods for reaching videos
in the near future.
References
Agarwala, A., Dontcheva, M., Agrawala, M., Drucker, S., Colburn, A., Curless, B., Salesin, D., and
Cohen, M. (2004). Interactive digital photomontage. ACM Transactions on Graphics, 23(3):294–
302. 16, 17, 18
Chang, S.-F., Ma, W.-Y., and Smeulders, A. (2007). Recent advances and challenges of semantic
image/video search. In Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE
International Conference on, volume 4, pages IV–1205–IV–1208. 27
Das, M. and Loui, A. (2003). Automatic face-based image grouping for albuming. In IEEE
International Conference on Systems, Man and Cybernetics, volume 4, pages 3726–3731. 15
Doulamis, A. and Doulamis, N. (2004). Optimal content-based video decomposition for interactive
video navigation. Circuits and Systems for Video Technology, IEEE Transactions on, 14(6):757–
775. 7
Everingham, M. and Zisserman, A. (2004). Automated visual identification of characters in situation
comedies. In ICPR ’04: Proceedings of the Pattern Recognition, 17th International Conference on
(ICPR’04) Volume 4, pages 983–986, Washington, DC, USA. IEEE Computer Society. 7
Fitzgibbon, A. W. and Zisserman, A. (2002). On affine invariant clustering and automatic cast listing
in movies. In ECCV ’02: Proceedings of the 7th European Conference on Computer Vision-Part
III, pages 304–320, London, UK. Springer-Verlag. 7, 8
Foucher, S. and Gagnon, L. (2007). Automatic detection and clustering of actor faces based on spectral
clustering techniques. In Proceedings of the Fourth Canadian Conference on Computer and Robot
Vision, pages 113–122. 8, 11
Gemert, J. C., Geusebroek, J.-M., Veenman, C. J., and Smeulders, A. W. (2008). Kernel codebooks
for scene categorization. In Proceedings of the 10th European Conference on Computer Vision:
Part III, ECCV ’08, pages 696–709, Berlin, Heidelberg. Springer-Verlag. 25
Kolekar, M. H., Palaniappan, K., and Sengupta, S. (2008). Semantic event detection and classification
in cricket video sequence. In Indian Conference on Computer Vision Graphics and Image Processing,
pages 382–389. 19, 20
Javed, O., Rasheed, Z., and Shah, M. (2001). A framework for segmentation of talk and game
shows. Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference
on, 2:532–537 vol.2. 7, 10, 13
Jegou, H., Douze, M., Schmid, C., and Perez, P. (2010). Aggregating local descriptors into a compact
image representation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference
on, pages 3304–3311. 27
Pramod Sankar, K., Pandey, S., and Jawahar, C. V. (2006). Text driven temporal segmentation of
cricket videos. In Indian Conference on Computer Vision Graphics and Image Processing, pages
433–444. 19, 20
Kumano, S., Otsuka, K., Mikami, D., and Yamato, J. (2011). Analyzing empathetic interactions based
on the probabilistic modeling of the co-occurrence patterns of facial expressions in group meetings.
In IEEE International Conference on Automatic Face Gesture Recognition and Workshops, pages
43–50. 15
Laptev, I. (2005). On space-time interest points. International Journal of Computer Vision,
64(2–3):107–123. 25
Lee, S.-H., Han, J.-W., Kwon, O.-J., Kim, T.-H., and Ko, S.-J. (2012). Novel face recognition
method using trend vector for a multimedia album. In IEEE International Conference on Consumer
Electronics, pages 490–491. 15
Li, C.-H., Chiu, C.-Y., Huang, C.-R., Chen, C.-S., and Chien, L.-F. (2006). Image content clustering
and summarization for photo collections. In IEEE International Conference on Multimedia and
Expo, pages 1033–1036. 15
Li, J., Lim, J. H., and Tian, Q. (2003). Automatic summarization for personal digital photos. In Fourth
Pacific Rim Conference on Information, Communications and Signal Processing and Proceedings
of the Joint Conference of the Fourth International Conference on Multimedia, volume 3, pages
1536–1540. 15
Lim, S. H., Lin, Q., and Petruszka, A. (2010). Automatic creation of face composite images
for consumer applications. In IEEE International Conference on Acoustics Speech and Signal
Processing, pages 1642–1645. 15
Littlewort, G., Bartlett, M., Salamanca, L., and Reilly, J. (2011). Automated measurement of
children’s facial expressions during problem solving tasks. In IEEE International Conference on
Automatic Face Gesture Recognition and Workshops, pages 30–35. 15
Liu, Z. and Ai, H. (2008). Automatic eye state recognition and closed-eye photo correction. In 19th
International Conference on Pattern Recognition, pages 1–4. 15
Lucey, P., Cohn, J. F., Matthews, I., Lucey, S., Sridharan, S., Howlett, J., and Prkachin, K. M. (2011).
Automatically detecting pain in video through facial action units. IEEE Transactions on Systems,
Man, and Cybernetics, Part B, 41(3):664–674. 15
Niebles, J. C., Wang, H., and Fei-Fei, L. (2008). Unsupervised learning of human action categories
using spatial-temporal words. International Journal of Computer Vision, 79(3):299–318. 25
Pal, B. and Chandran, S. (2010). Sequence based temporal segmentation of cricket videos. 19, 20,
21, 22, 36
Sun, C. and Nevatia, R. (2013). Large-scale web video event classification by use of fisher vectors. In
Applications of Computer Vision (WACV), 2013 IEEE Workshop on, pages 15–22. 27
Takahashi, Y., Nitta, N., and Babaguchi, N. (2005). Video summarization for large sports video
archives. Multimedia and Expo, 2005. ICME 2005. IEEE International Conference on, pages 1170–
1173. 7, 10
Waibel, A., Bett, M., and Finke, M. (1998). Meeting browser: Tracking and summarizing meetings. In
Proceedings DARPA Broadcast News Transcription and Understanding Workshop, pages 281–286.
7
Wong, S.-F. and Cipolla, R. (2007). Extracting spatiotemporal interest points using global
information. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages
1–8. 25
Wong, S.-F., Kim, T.-K., and Cipolla, R. (2007). Learning motion categories using both semantic
and structural information. In Computer Vision and Pattern Recognition, 2007. CVPR ’07. IEEE
Conference on, pages 1–6. 25
Xie, L., Chang, S.-F., Divakaran, A., and Sun, H. (2003). Unsupervised discovery of multilevel
statistical video structures using hierarchical hidden markov models. In Multimedia and Expo,
2003. ICME ’03. Proceedings. 2003 International Conference on, volume 3, pages III–29–32 vol.3.
25
Xu, D., Li, X., Liu, Z., and Yuan, Y. (2004). Anchorperson extraction for picture in picture news
video. Pattern Recogn. Lett., 25(14):1587–1594. 7
Xu, G., Ma, Y.-F., Zhang, H.-J., and Yang, S.-Q. (2005). An hmm-based framework for video semantic
analysis. Circuits and Systems for Video Technology, IEEE Transactions on, 15(11):1422–1433. 25
Zhang, Z., Potamianos, G., Senior, A. W., and Huang, T. S. (2007). Joint face and head tracking inside
multi-camera smart rooms. Signal, Image and Video Processing, 1:163–178. 7
Zhuang, Y., Rui, Y., Huang, T., and Mehrotra, S. (1998). Adaptive key frame extraction using
unsupervised clustering. In Proceedings of the International Conference on Image Processing,
volume 1, pages 866–870. 7