Extending Social Videos for Personal Consumption

A pre-synopsis report submitted in partial fulfillment of
the requirements for the degree of
DOCTOR OF PHILOSOPHY
by
Nithya Sudhakar
(Roll No. 05405701)
Under the guidance of
Prof. Sharat Chandran
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY BOMBAY
2015
To my dear daughter..
Abstract
Videos are rich in content; nevertheless, their massive size makes it challenging to find the specific content a user wants. With the fast growth of the internet and of video processing techniques, videos can now be searched efficiently. However, the user is presented with a huge volume of search results, making it really difficult to browse through all the videos until they find the ones of interest.
Presenting good visualizations, appropriate filters, the ability to reach a specific part of a video, and customized genre-specific applications will help the user skim videos quickly, thereby reducing the user's time. In this thesis, we propose applications which extract and present such information.
The details of the applications designed in this thesis are:
Applications
• Lead Star Detection Lead stars are significant people in a video. Examples of lead stars are the man of the match in sports, the lead actors of movies, and the guest and host of TV shows. We propose an unsupervised approach to quickly detect lead stars. We use audio highlights to detect prominent regions where lead stars are likely to be present. Our method detects lead stars with reasonable accuracy in real time, which is faster than the prior work [19].
• Automontage Photographs capture memories of special events. However, it is often challenging to capture good expressions of different people all at the same time. In this application, we propose a technique to create a photo album from event videos, where expressive faces of people are automatically merged to create photo shots. As opposed to prior work [6], our method automatically selects base photos and good face expressions from various frames and stitches them together. This makes the complete process automated, thereby making it easier for end-users.
• Cricket Delivery Detection Cricket is a fascinating and favorite game in India. Detecting deliveries in cricket videos gives rise to many related applications, such as analyzing the deliveries of a particular player and automatically generating cricket highlights. In this work, we have improved the detection accuracy of prior work [45] and also added browsing by player, so that the deliveries of a particular player can be enjoyed by a viewer or analyzed by a cricket coach.
In the process of building new applications for better video browsing, we realized the necessity of a technique which can address the needs of multiple applications. The technique designed in this thesis is:
Technique
Hierarchical Model Structured representation of visual information results in easier and speedier video applications. However, most of the models that exist today are either designed for specific applications or capture similarity in a multi-dimensional space, which makes it difficult for applications to use them as is. Summarizing such similarity information at various levels makes it easy for applications to use it directly, without additional computation. We propose a hierarchical model for representing video, so that designing applications is much easier and consumes less time. We have demonstrated this through new applications.
In summary, in the process of building video applications that help users quickly browse or filter video results, we have proposed a model to support various applications which require a similarity measure of video.
Contents

Abstract
List of Tables
List of Figures

1 Introduction
  1.1 Background and Motivation
  1.2 Problem Statement
  1.3 Our Contributions
      1.3.1 Fast Lead Star Detection in Entertainment Videos
      1.3.2 AutoMontage: Photo Sessions Made Easy!
      1.3.3 Cricket Delivery Detection
      1.3.4 Hierarchical Summarization for Easy Video Applications
  1.4 Future Work

2 Fast Lead Star Detection in Entertainment Videos
  2.1 Related Work
  2.2 Our Strategy
  2.3 Methodology
      2.3.1 Audio Highlight Detection
      2.3.2 Finding & Tracking Potential People
      2.3.3 Face Dictionary Formation
  2.4 Applications
      2.4.1 Lead Actor Detection in Motion Pictures
      2.4.2 Player of the Match Identification
      2.4.3 Host Guest Detection
  2.5 Experiments
      2.5.1 Lead Actor Detection in Motion Pictures
      2.5.2 Player of the Match Identification
      2.5.3 Host Guest Detection

3 AutoMontage: Photo Sessions Made Easy!
  3.1 Related Work
  3.2 Technical Contributions
  3.3 Methodology
      3.3.1 Face Detection and Expression Measurement
      3.3.2 Base Photo
      3.3.3 Montage
  3.4 Experiments

4 Cricket Delivery Detection
  4.1 Prior Work
  4.2 Problem Statement
  4.3 Methodology
      4.3.1 Pop-up Advertisement Detection and Removal
      4.3.2 Green Dominance Detection for Field View Classification
      4.3.3 Smoothing of View Sequence Using Moving Window
      4.3.4 Player Detection
  4.4 Experiments and Results
  4.5 Conclusion and Future Work

5 Hierarchical Summarization for Easy Video Applications
      5.0.1 Related Work
      5.0.2 Technical Contributions
  5.1 Methodology
      5.1.1 Overview
      5.1.2 H1 Dictionary Formation and Representation
      5.1.3 H2 Dictionary Formation and Representation
      5.1.4 H3 Dictionary Formation and Representation
      5.1.5 Building Global Dictionary
  5.2 Applications
      5.2.1 Suggest a Video
      5.2.2 Video Classification
      5.2.3 Logical Turning Point Detection
      5.2.4 Potential Candidate Identifier for Audio Video Mix
  5.3 Experiments
      5.3.1 Individual Evaluation of H1 and H2
      5.3.2 Evaluation of Alternate Features
      5.3.3 Evaluation of Video Classification
      5.3.4 Evaluation of Logical Turning Point Detection
      5.3.5 Evaluation of Potential Remix Candidates

6 Conclusion & Future Work
  6.1 Future Work
List of Tables

2.1 Time taken for computing lead actors for popular movies.
2.2 Time taken for computing key players from BBC MOTD highlights for premier league 2007-08.
2.3 Time taken for identifying host in a TV show Koffee With Karan. Two episodes are combined into a single video and given as input.
4.1 Comparison of methods across different matches.
4.2 Result of detected deliveries for different types of matches.
5.1 Classification using only H1. With limited information, we are able to glean classification hints.
5.2 Video suggestion using popular features.
List of Figures

1.1 A scenario where user is browsing through various video results to reach video of his interest.
1.2 A sample scenario where user is provided with more filtering options and visual representation along with video results. Please note that the suggestions and visual representations vary based on the input video and its genre.
1.3 Applications developed or addressed in our work.
1.4 Lead star detection. This is exemplified in sports by the player of the match; in movies, stars are identified; and in TV shows, the guest and host are located.
1.5 An automatic photo montage created by assembling "good" expressions from a video (randomly picked from YouTube). This photo did not exist in any of the input frames.
1.6 Illustration of various problems where work [45] fails.
1.7 Retrieval applications designed using our Hierarchical Dictionary Summarization method. "Suggest a Video" suggests Terminator for Matrix sequels. "Story Detection" segments a movie into logical turning points in the plot of the movie A Walk to Remember. "Audio Video Mix" generates a "remix" song from a given set of music videos.
2.1 Lead star detection. This is exemplified in sports by the player of the match; in movies, stars are identified; and in TV shows, the guest and host are located.
2.2 Our strategy for lead star detection. We detect lead stars by considering segments that involve significant change in audio level. However, this by itself is not enough!
2.3 Strategy further unfolded. We first detect scenes accompanied by change in audio level. Next we look for faces in these important scenes, and to further confirm the suitability track faces in subsequent frames. Finally, a face dictionary representing the lead stars is formed by clustering the confirmed faces.
2.4 Illustration of highlight detection from audio signal. We detect highlights by considering segments that involve significant low RMS ratio.
2.5 Illustration of data matrix formation. The tracked faces are stacked together to form the data matrix.
2.6 Face dictionary formed for the movie Titanic. Lead stars are highlighted in blue (third image), and red (last image). Note that this figure does not indicate the frequency of detection.
2.7 Illustration of face dictionary formation.
2.8 Lead actors detected for the movie Titanic.
2.9 Lead star detected for the highlights of a soccer match Liverpool vs Havant & Waterlooville. The first image is erroneously detected as a face. The other results represent players and the coach.
2.10 Lead actors identified for popular movies appear on the right.
2.11 Key players detected from BBC Match of the Day match highlights for premier league 2007-08.
2.12 Host detection of TV show "Koffee with Karan". Two episodes of the show are combined and given as input. The first person in the detected list (sorted by the weight) gives the host.
3.1 An automatic photo montage created by assembling "good" expressions from a video (randomly picked from YouTube). This photo did not exist in any of the input frames.
3.2 A schematic of our method. In the first step, detected faces are tracked and grouped together. In the next step frames are analyzed to detect a base photo frame which can either be a mosaic of frames, or a single frame with maximum number of good expression faces from the video. In the last step, faces with good expression are patched to the base photo to form the required photo montage.
3.3 Facial expression measured as deviation from neutral expression.
3.4 Illustration of the montage creation process. Best expressions from various frames previously selected are patched to create a new frame.
3.5 AutoMontage created by generating a mosaic of the Youtube video available in [4].
3.6 AutoMontage created from family stack of images [6]. We are able to automatically generate the photo montage as opposed to the method in [6].
3.7 Photo Montage example having acceptable expressions. Youtube video is available in [3].
3.8 Photo Montage example having acceptable expressions. A wedding reception video.
4.1 Typical views in the game of cricket (a) Pitch View (b) Ground View (c) Non-Field View. Note that the ground view may include non-trivial portions of the pitch.
4.2 Block diagram of our system.
4.3 Illustration of various problems where Binod Pal's work fails.
4.4 Detecting Advertisement Boundary.
4.5 Scoreboard Detection.
4.6 Extraction of features at block level for improving view detection.
4.7 Retrieval of deliveries involving only certain players.
5.1 Retrieval applications designed using our Hierarchical Dictionary Summarization method. "Suggest a Video" suggests Terminator for Matrix sequels. "Story Detection" segments a movie into logical turning points in the plot of the movie A Walk to Remember. "Audio Video Mix" generates a "remix" song from a given set of music videos.
5.2 Illustration of the H-Video Abstraction.
5.3 Pixel-Level Feature Extraction Filters.
5.4 H1 Dictionary Formation.
5.5 H1 Representation of a frame (bottom).
5.6 Features extracted from H1 frame.
5.7 H2 dictionary formation.
5.8 Representation of H2 Units.
5.9 Features extracted from H2 shot.
5.10 Feature extraction from H2 shots using shot filter bank to form H3 Dictionary.
5.11 H3 representation of the video.
5.12 Given a set of movies of interest (in red), a movie which is similar to them is suggested (in green).
5.13 Movies are classified into drama and action genres.
5.14 Story line of the movie A Walk to Remember split by story units. Characters are introduced in the first story unit; in the second story unit, the girl and the boy are participating in a drama; in the last part, they fall in love and get married. In spite of the length of stories being very different, our method successfully identifies these different story units.
5.15 Classification using only H2. With only limited information, and with smaller allowed error, fine details were captured. With larger allowed error, broad categories were captured.
5.16 Sample result of classifying movie trailers for categories Drama, Action and Romance. In most of the cases, our model has classified movie trailers correctly.
5.17 Sample frames from identified remix candidates. In each set, the top row corresponds to an original video song and the second row corresponds to the remix candidate. The content and sequencing of the first and second rows match, suggesting the effectiveness of our method.
Chapter 1
Introduction
A picture speaks a thousand words, and a video is thousands of pictures. Videos let people articulate their ideas creatively. The abundance of videos available via video sharing websites makes video freely accessible to everyone. Videos are great sources of information and entertainment, covering areas such as news, education, sports, movies and events.
1.1
Background and Motivation
In the early era of the internet, when the concepts of web search and video search were introduced, the only way to make content searchable was to annotate each and every web page and video with all possible tags so that people could search them. Over a period of time, as the volume increased, this method no longer worked. So researchers came up with interesting ways of training on images to learn real world concepts like indoor, outdoor, swimming, car, air plane and so on, so that these annotations could be extracted automatically from video frames. Though this approach increased the number of tags and reduced human effort, searchability was restricted to the trained objects, and training an exhaustive list was again a tedious process.
To overcome this, the bag-of-words feature was introduced. Similar to words in text, it provides a visual description of the various objects in the scene, and an image is used as the query input instead of traditional text. However, this left another problem: the user has to find a suitable image to represent their thought.
Over the years, with the tremendous growth of video sharing websites, people have uploaded a huge collection of videos with their own annotations. By combining bag-of-words features with the various tags available, tag annotations have reached a level where videos are easily searchable. However, as the user gets closer to the video results, they have various expectations for which there is no sufficient annotation. So the user browses through results and recommendations until they find videos of their interest. The size of videos poses additional challenges, as the user has to download various videos and browse through them to check whether they satisfy the requirement. This scenario is illustrated in Figure 1.1.
Figure 1.1: A scenario where user is browsing through various video results to reach video of
his interest.
This demands a whole realm of applications which help the user quickly get close to the video they want. In this thesis we propose applications like lead star detection and story-line generation, which give users a quick idea about a video without having to go through it. We also propose applications which enable the user to go through selected parts instead of the complete video; example applications we have explored in this genre include cricket delivery detection and logical turning point detection. As the user gets closer to a specific genre, genre specific applications make more sense than generic ones. One example that we have tried is AutoMontage, which generates photos by merging different expressions to form a pleasing output. Further, we noticed a great possibility of such genre specific applications which commonly need similarity information. We have proposed a technique called the Hierarchical Model, which enables a specific genre of similarity applications. We have demonstrated it through supporting applications like logical turning point detection, story-line generation, movie classification and audio video remix candidate search.
Figure 1.2: A sample scenario where user is provided with more filtering options and visual representation along with video results. Please note that the suggestions and visual representations vary based on the input video and its genre.
1.2
Problem Statement
The objective of the thesis is to design applications and a technique which will help users reach the specific videos of their interest easily. A sample scenario in which the user is provided with various options is illustrated in Figure 1.2. As the possibilities for designing such applications are wide, we establish the following specific problems:
1. The ability to summarize a video, so that the user can quickly visualize its content.
2. The ability to browse through specific parts of a video instead of browsing through the whole video.
3. Designing genre specific video applications which are of great use in their genre.
4. Designing a technique which can support similarity applications suiting various needs, like auto suggestions and story-line generation.
1.3
Our Contributions
In this thesis, we propose applications and a technique which achieve the following, so that the user can quickly filter video results, skim through a video, or explore related ones.
1. Summarize videos in terms of lead stars, allowing the user to narrow down results based on the stars of their interest.
2. Summarize a video in terms of its story-line, so that the user can get a quick overview of videos before viewing them.
3. Segment a video, allowing users to browse through specific parts instead of the whole video.
4. Design genre specific video applications like AutoMontage, cricket delivery detection and remix candidate generation, which are of great use in those specific genres.
5. Design a technique which can support a genre of similarity applications, like some of the applications listed above.
We achieve these functionalities using various techniques, ranging from building appropriate dictionaries to specific view classification or merging expressions to create good photos. We have classified these video applications into three major categories: segmentation, summarization and classification. We present the various applications we have developed in these categories and present a model which supports applications from all these categories. A visual summary of the work is presented in Figure 1.3. In this section, we present an overview of these applications and the model.
Figure 1.3: Applications developed or addressed in our work.
1.3.1
Fast Lead Star Detection in Entertainment Videos
Video can be summarized in many forms. One natural possibility that has been well explored is extracting key frames from shots or scenes and then creating thumbnails. Another natural alternative that has been surprisingly ill-explored is to locate the "lead stars" around whom the action revolves. This is illustrated in Figure 1.4. Though scarce and far between, the available techniques for detecting lead stars are usually video specific.
Figure 1.4: Lead star detection. This is exemplified in sports by the player of the match; in
movies, stars are identified; and in TV shows, the guest and host are located.
We present a generalized method for detecting snippets around lead actors in entertainment videos. This application is useful in locating action around the 'player of the match' in sports videos, lead actors in movies and TV shows, and guest-host snippets in TV talk shows. Additionally, as our method uses audio cues, we are able to compute lead stars in real time, faster than state-of-the-art techniques, with comparable accuracy.
1.3.2
AutoMontage: Photo Sessions Made Easy!
Group photograph sessions are tedious; it is difficult to get acceptable expressions from all people at the same time. The larger the group size, the harder it gets. Ironically, we miss many expressions in the scene while the group assembles, or reassembles, during the taking of the photographs. As a result, people have started using videos. However, going through a video is time consuming.
A solution to the problem is to automatically extract an acceptable, possibly stretched, photo montage and present it to the user. In this application, we automate this process: we extract faces, assess their quality, and paste them back at the correct positions to create a pleasing memory. This scenario is illustrated in Figure 1.5.
In this work, we have contributed a frame analyzer that detects camera panning motion to generate candidate frames, an expression analyzer for detected faces, and a photo patcher that enables seamless placement of faces in group photos.
Figure 1.5: An automatic photo montage created by assembling "good" expressions from a video (randomly picked from YouTube). This photo did not exist in any of the input frames.
1.3.3
Cricket Delivery Detection
Cricket is the most popular sport in India and is played globally in seventeen countries. Meaningful analysis and mining of cricket videos yields many interesting applications, like analyzing the balls faced by a player or comparing fast bowlers and slow bowlers. Presenting such specific videos gives the user a choice to select the ones they are interested in.
Prior work [45] has proposed an approach for automatic segmentation of broadcast cricket video
into meaningful structural units. First, the input video frames are classified into two semantic
concepts: play and break segment frames. Play segment frames are then further classified into
a set of well defined views. Sequences of views are then analyzed to extract, in cricketing
terminology, "balls". However, this method fails when there is a displacement of the score board due to an advertisement. The pitch detection algorithm used by them also fails in cases where some other rectangle-like region, such as a player's dress, is present.
In our method, we improve upon this approach by adding advertisement detection and removal logic. We also enhance the pitch detection algorithm by splitting the frames into 3x3 grids and training the classifier using grid level information in addition to frame level information. Further, each delivery is examined for faces and key players are identified, so as to support features like browsing by player and finding interesting deliveries of that player.
Figure 1.6: Illustration of various problems where work [45] fails
1.3.4
Hierarchical Summarization for Easy Video Applications
With the growing use of videos, the demand for video applications has become intense. Most existing methods that analyze the semantics of a video build specific models; for example, ones that aim at event detection, or targeted video albumization. These might be called application specific works, and they are useful in their own right. In this technique, however, we propose a video abstraction framework that unifies the creation of various applications, rather than one application itself. Specifically, we present a dictionary summarization of a video that provides abstractions at various hierarchical levels such as pixels, frames, shots, and the complete video. We illustrate the usability of our model with four different "apps" as shown in Figure 1.7.
Our model (termed H-Video) makes the task of creating applications easier. Our method learns semantic dictionaries at three different levels: pixel patches, frames, and shots. A video is represented in the form of learned dictionary units that reveal semantic similarity and video structure. The main intention of this model is to provide these semantic dictionaries, so that comparison of video units at different levels, within the same video and across different videos, becomes easier.
The benefits of H-Video include the following:
(i) The model advocates run-time leveraging of prior offline processing time. As a consequence,
applications run fast.
(ii) The model is built in an unsupervised fashion. As no application specific assumption is made, many retrieval applications can use this model and its features. This can potentially save an enormous amount of computation time spent in learning.
Figure 1.7: Retrieval applications designed using our Hierarchical Dictionary Summarization method. "Suggest a Video" suggests Terminator for Matrix sequels. "Story Detection" segments a movie into logical turning points in the plot of the movie A Walk to Remember. "Audio Video Mix" generates a "remix" song from a given set of music videos.
(iii) The model represents learned information using a hierarchical dictionary. This allows a video to be represented as indexes to dictionary elements, which makes it easier for the developer of a new retrieval application, as similarity information is available as a one dimensional array. In other words, our model does not demand a deep video understanding background from application developers.
(iv) We have illustrated our model through several applications.
1.4
Future Work
The future scope of this work is to enhance the hierarchical model to learn concepts automatically from a set of positive and negative examples using machine learning techniques like random forests or random trees. This would serve as a great source for learning a particular tag and applying it to all other videos which have the same concept. The model can also be extended to matching parts of videos instead of whole videos. The future scope of this work also includes more applications to find videos of the user's interest.
Chapter 2
Fast Lead Star Detection in Entertainment
Videos
Suppose an avid cricket fan or coach wants to learn exactly how Flintoff repeatedly got Hughes
“out.” Or a movie buff wants to watch an emotional scene involving his favorite heroine in a
Hollywood movie. Clearly, in such scenarios, you want to skip frames that are “not interesting.”
One possibility that has been well explored is extracting key frames in shots or scenes and then
creating thumbnails. Another natural alternative – the emphasis in this work – is to determine
frames around, what we call, lead stars. A lead star in an entertainment video is the actor who,
most likely, appears in many significant frames. We define lead stars in other videos also. For
example, the lead star in a soccer match is the hero, or the player of the match, who has scored
“important” goals. Intuitively, he is the one the audience has paid to come and see. Similarly the
lead star in a talk show is the guest who has been invited, or, for that matter, the hostess. This
work presents how to detect lead stars in entertainment videos. Moreover, like various video summarization methods [13, 55, 75], lead stars are a natural way of summarizing video. (Multiple lead stars are of course allowed.)
Figure 2.1: Lead star detection. This is exemplified in sports by the player of the match; in
movies, stars are identified; and in TV shows, the guest and host are located.
2.1
Related Work
Researchers have explored various video specific applications for lead star detection: anchor
detection in news video [66], lead casts in comedy sitcoms [16], summarizing meetings [62],
guest host detection [24, 55] and locating the lecturer in smart rooms by tracking the face and
head [74].
Fitzgibbon [18] uses affine invariant clustering to detect the cast listing of a movie. As the original algorithm had quadratic runtime, the authors used a hierarchical strategy to improve the clustering speed that is central to their method. Foucher and Gagnon [19] used spatial clustering techniques for clustering actor faces. Their method detects the actors' clusters in an unsupervised way, with a computation time of about 23 hours for a motion picture.
2.2
Our Strategy
Although the lead actor has been defined using a pictorial or semantic concept, an important observation is that the significant frames in an entertainment video are often accompanied by a change in the audio intensity level. It is true, no doubt, that not all frames containing the lead actors involve significant audio differences. Our interest is not at the frame level, however. Note that the advent of important scenes and important people bears a strong correlation to the audio level. We surveyed around one hundred movies and found that it is rarely, if at all, the case that the lead star does not appear in the audio highlighted sections, although the nature of the audio may change from scene to scene. And as alluded to above, once the lead star has entered the shot, the frames may well contain normal audio levels.
Figure 2.2: Our strategy for lead star detection. We detect lead stars by considering segments
that involve significant change in audio level. However, this by itself is not enough!
Our method is built upon this concept. We detect lead stars by considering such important scenes of the video. To reduce false positives and negatives, our method clusters the faces for each important scene separately and then combines the results. Unlike the method in [18], our method provides a natural segmentation for clustering. Our method is shown to considerably reduce the computation time of the previously mentioned state-of-the-art [19] for computing the lead star in a motion picture (by a factor of 50). We apply this method to sports videos to identify the player of the match, to motion pictures to find heroes and heroines, and to TV shows to detect guests and hosts.
2.3
Methodology
As mentioned, the first step in the problem is to find important scenes which have audio
highlights. Once such important scenes are identified, they are further examined for potential
faces. Once a potential face is found in a frame, subsequent frames are further analyzed for
false alarms using concepts from tracking. At this point, several areas are identified as faces.
Such confirmed faces are grouped into clusters to identify the lead stars.
Figure 2.3: Strategy further unfolded. We first detect scenes accompanied by change in audio
level. Next we look for faces in these important scenes, and to further confirm the suitability
track faces in subsequent frames. Finally, a face dictionary representing the lead stars is formed
by clustering the confirmed faces.
2.3.1
Audio Highlight Detection
Figure 2.4: Illustration of highlight detection from audio signal. We detect highlights by
considering segments that involve significant low RMS ratio.
The intensity of a segment of an audio signal is summarized by the root-mean-square value. The
audio track of a video is divided into windows of equal size and the rms value is computed for
each audio window. From the resulting rms sequence, the rms ratio is computed for successive
items in the sequence.
Let x_n be the n-th audio segment of fixed window size, where 0 < n < L and L is the number of audio segments. The audio highlight indicator function H(x_n) is defined as

    H(x_n) = [ rms(x_n) / rms(x_{n+1}) < th ]                                  (2.1)

that is, H(x_n) = 1 when the rms ratio of successive windows falls below the user defined threshold th, and 0 otherwise. Using this indicator, we define the function A that marks the audio highlight region:

    A(n) = 1  if H(x_m) = 1 for some m with (n - tw) <= m <= (n + tw)
           0  otherwise                                                        (2.2)

In our implementation, based on our experiments, we have used th = 5 and tw = 2. The video frames corresponding to such windows are considered 'important.'
2.3.2
Finding & Tracking Potential People
Figure 2.5: Illustration of data matrix formation. The tracked faces are stacked together to
form the data matrix.
Once important scenes are marked, we seek to identify people in the corresponding video
segment. Fortunately there are well understood algorithms that detect faces in an image frame.
We select a random frame within the window and detect faces using the Viola & Jones face
detector [59].
Every face detected in the current frame is then voted on for confirmation by attempting to track it in subsequent frames of the window. Confirmed faces are stored in a data matrix for the next step in processing: the confirmed faces from each highlight i are stored in the corresponding data matrix F_i, as illustrated in Figure 2.5.
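A minimal sketch of this detect-confirm-stack step appears below (illustrative Python using OpenCV, not the thesis' Matlab code). The Viola & Jones detector is OpenCV's Haar cascade; the tracking-based confirmation is approximated here by re-detecting faces in the following frames and requiring overlapping boxes, a simple stand-in for the tracker. The canonical face size and the number of confirmation frames are assumptions for illustration.

import cv2
import numpy as np

FACE_SIZE = (64, 64)  # assumed canonical face size for rows of the data matrix

def boxes_overlap(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    return ix * iy > 0.5 * min(aw * ah, bw * bh)

def confirmed_faces(frames, n_confirm=3):
    """frames: list of BGR frames from one highlighted window.
    Returns the data matrix F_i, one flattened grayscale face per row."""
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    rows = []
    gray0 = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for box in cascade.detectMultiScale(gray0, 1.1, 5):
        # vote for the face by checking the next few frames of the window
        votes = sum(
            1 for f in frames[1:1 + n_confirm]
            if any(boxes_overlap(box, b) for b in cascade.detectMultiScale(
                cv2.cvtColor(f, cv2.COLOR_BGR2GRAY), 1.1, 5)))
        if votes == n_confirm:
            x, y, w, h = box
            rows.append(cv2.resize(gray0[y:y + h, x:x + w], FACE_SIZE).flatten())
    return (np.vstack(rows) if rows
            else np.empty((0, FACE_SIZE[0] * FACE_SIZE[1])))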
2.3.3
Face Dictionary Formation
In this step, the confirmed faces are grouped based on their features. There are a variety
of algorithms for dimensionality reduction, and subsequent grouping. We observe that the
Principal Component Analysis (PCA) method has been successfully used for face recognition.
We use PCA to extract a feature vector P_i from F_i, and we use the k-means algorithm for clustering. We determine the number of clusters K using the following steps.
Let P_i be the PCA vector of face matrix F_i, where 1 <= i <= N
pdist(i, j) <- cosine distance between PCA vectors P_i and P_j
F <- {1, ..., N}
K <- 0
while F is not empty do
    K <- K + 1
    G <- { any element of F }
    repeat
        size <- |G|
        newG <- {}
        for all elements f in G do
            newG <- newG ∪ { k : pdist(f, k) < t, 1 <= k <= N }
        end for
        G <- newG
    until size = |G|
    F <- F \ G
end while

Here t is a user chosen cosine-distance threshold.
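An illustrative Python version of this grouping is given below (a sketch, not the thesis' implementation); t is the user chosen cosine-distance threshold.

import numpy as np
from scipy.spatial.distance import cosine

def estimate_k(P, t):
    """P: sequence of PCA feature vectors, one per confirmed face.
    Returns K, the number of groups found by transitive closure under t."""
    n = len(P)
    unassigned = set(range(n))
    K = 0
    while unassigned:
        K += 1
        group = {unassigned.pop()}
        while True:
            # pull in every face within distance t of some member of the group
            grown = set(group)
            for f in group:
                grown |= {k for k in range(n) if cosine(P[f], P[k]) < t}
            if len(grown) == len(group):
                break
            group = grown
        unassigned -= group
    return K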
The representative faces for the clusters formed for the movie Titanic are shown in Figure 2.6. Our method has successfully detected the lead actors in the movie. As can be noticed, along with lead stars, there are patches that have been misclassified as faces. There are also non lead stars present along with the highlighted lead stars. We further refine these clusters and select prominent clusters which represent the lead stars.
Figure 2.6: Face dictionary formed for the movie Titanic. Lead stars are highlighted in blue
(third image), and red (last image). Note that this figure does not indicate the frequency of
detection.
Let C_j represent each cluster in the final set of clusters, where 1 <= j <= K. The number of elements in cluster C_j is represented as

    N_j = |C_j|                                                                (2.3)
At this point, we have a dictionary of faces, but not all faces belong to lead actors. We use the following parameters to shortlist the faces that form the lead stars.

1. The number of faces in the cluster. If a cluster (presumably of the same face) has a large cardinality, we give this cluster a high weightage. The weight function S1, which selects the top clusters, is defined as

    S1(j) = 1  if j = 1
            1  if S1(j-1) = 1 and N_j >= ts * N_{j-1}
            0  otherwise                                                       (2.4)

In our experiments, we have used ts = 0.5, which helps in identifying prominent clusters.
2. Position of the face with respect to the center of the image. Lead stars are expected to be in the center of the image. Let H and W be the height and width of the video frame respectively, and let (X_i, Y_i) be the center coordinates of face F_i. The function measuring the position of the faces in cluster C_j is computed as

    S2(j) = Σ_{i=1}^{N_j} ( (W/2 - |W/2 - X_i|) + (H/2 - |H/2 - Y_i|) )        (2.5)
3. The area of the detected faces: S3(j) is calculated as the sum of the areas of all faces in the cluster C_j. Again, lead stars typically occupy a significant portion of the image.

The weighted average of these parameters is used to select the lead star clusters. Let SW_w be the weight of the function S_w; then the shortlisting function S(j) is defined as

    S(j) = Σ_{w=1}^{3} SW_w * S_w(j)                                           (2.6)
The clusters with S(j) > µ_S are selected as lead star clusters L_j, and the lead star LR_j for cluster C_j is detected as the face with minimum distance to the cluster center:

    LR_j = F_i  such that  sqrt((µ_{L_j} - P_i)²) = min_{x=1..N_j} sqrt((µ_{L_j} - P_x)²)      (2.7)
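The sketch below shows one way to implement Eqs. (2.4)-(2.7) (illustrative Python, not the thesis' Matlab code). The per-cluster data layout and the equal weights SW_w are assumptions for illustration; clusters are assumed to be sorted by size, largest first.

import numpy as np

def shortlist_lead_stars(clusters, W, H, weights=(1.0, 1.0, 1.0), ts=0.5):
    """clusters: list of dicts, sorted by size (largest first), each with
    'faces': list of (x_center, y_center, area) and 'pca': array of PCA vectors."""
    sizes = [len(c["faces"]) for c in clusters]

    # S1: cascade on cluster cardinality, Eq. (2.4)
    S1 = [1]
    for j in range(1, len(clusters)):
        S1.append(1 if S1[-1] == 1 and sizes[j] >= ts * sizes[j - 1] else 0)

    # S2: closeness of the cluster's faces to the frame center, Eq. (2.5)
    S2 = [sum((W / 2 - abs(W / 2 - x)) + (H / 2 - abs(H / 2 - y))
              for x, y, _ in c["faces"]) for c in clusters]

    # S3: total area of the faces in the cluster
    S3 = [sum(a for _, _, a in c["faces"]) for c in clusters]

    # Weighted combination, Eq. (2.6)
    S = [weights[0] * a + weights[1] * b + weights[2] * c
         for a, b, c in zip(S1, S2, S3)]

    lead = []
    for j, c in enumerate(clusters):
        if S[j] > np.mean(S):
            # Representative face: PCA vector closest to the cluster center, Eq. (2.7)
            center = np.mean(np.asarray(c["pca"]), axis=0)
            rep = int(np.argmin(np.linalg.norm(np.asarray(c["pca"]) - center, axis=1)))
            lead.append((j, rep))
    return lead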
Figure 2.7: Illustration of face dictionary formation.
2.4
Applications
In this section, we demonstrate our technique using three applications: Lead Actor Detection in Motion Pictures, Player of the Match Identification and Host Guest Detection. As applications go, Player of the Match Identification has not been well explored, considering the enormous interest in it. In the other two applications, our technique detects lead stars faster than state-of-the-art techniques, which makes our method practical and easy to use.
2.4.1
Lead Actor Detection in Motion Pictures
In motion pictures, detecting the hero, heroine and villain has many interesting benefits. A person reviewing a movie can skip the scenes where lead actors are not present. A profile of the lead actors can also be generated. Significant scenes containing many lead actors can be used for summarizing the video.
In our method, the face dictionary formed contains the lead actors. These face dictionaries are good enough in most of the cases. However, for more accurate results, the algorithm scans through a few frames of every shot to determine the faces occurring in the shot. The actors who appear in a large number of shots are identified as lead actors. The result of lead actor detection for the movie Titanic after scanning through the entire movie is shown in Figure 2.8.
Figure 2.8: Lead actors detected for the movie Titanic.
2.4.2
Player of the Match Identification
In sports, highlights and key frames [55] are the two main methods used for summarization. We summarize sports using the player of the match, capturing the star players.
Detecting and tracking players throughout a complete sports video does not yield the player of the match. Star players may play for a shorter time and score more, as opposed to players who attempt many times and do not score. So analyzing the players when there is a score leads to the identification of the star players. This is easily achieved by our technique, as detecting highlights picks out exciting scenes such as scores.
Figure 2.9: Lead star detected for the highlights of a soccer match Liverpool vs Havant & Waterlooville. The first image is erroneously detected as a face. The other results represent players and the coach.
The result of lead sports stars detected from a soccer match Liverpool vs Havant & Waterlooville
is presented in the Figure 2.9. The key players of the match are detected.
2.4.3
Host Guest Detection
In TV interviews and other TV programs, detecting the host and guest of the program is key information used in video retrieval. Javed et al. [24] have proposed a method for this which removes the commercials and then exploits the structure of the program to detect the guest and host. The algorithm relies on the inherent structure of the interview, namely that the host appears for a shorter duration than the guest. However, this is not always the case, especially when the hosts are equally popular, as in TV shows like Koffee With Karan. In the case of competition shows, the host is shown for a longer duration than the guests or judges.
Our algorithm detects hosts and guests as lead stars. To distinguish hosts from guests, we detect lead stars on multiple episodes and combine the results. Intuitively, the lead stars that recur over multiple episodes are the hosts, and the other lead stars detected in specific episodes are the guests.
2.5
Experiments
We have implemented our system in Matlab. We tested our method on an Intel Core Duo processor, 1.6 GHz, with 2GB RAM. We have conducted experiments on 7 popular motion pictures, 9 soccer match highlights and two episodes of TV shows, summing up to a total of 19 hours 23 minutes of video. Our method detected lead stars in all the videos in an average of 14 minutes for a one hour video. The method [19] in the literature computes the lead star of a motion picture in 23 hours, whereas we compute the lead stars for a motion picture in an average of 30 minutes. We now provide more details.
2.5.1
Lead Actor Detection in Motion Pictures
Table 2.1: Time taken for computing lead actors for popular movies.

No.  Movie Name                             Duration  Computation Time for   Computation Time
                                            (hh:mm)   Detecting Lead Stars   for Refinement
                                                      (hh:mm)                (hh:mm)
1    The Matrix                             02:16     00:22                  00:49
2    The Matrix Reloaded                    02:13     00:38                  01:25
3    Matrix Revolutions                     02:04     00:45                  01:07
4    Eyes Wide Shut                         02:39     00:12                  00:29
5    Austin Powers in Goldmember            01:34     00:04                  01:01
6    The Sisterhood of the Traveling Pants  01:59     00:16                  00:47
7    Titanic                                03:18     01:01                  01:42
     Total                                  16:03     03:18                  07:20
We ran our experiments on the 7 box-office hit movies listed in Table 2.1, which sum up to 16 hours of video. The lead stars in all these movies are computed in 3 hours 18 minutes, so the average computation time for a movie is around 30 minutes. From Table 2.1, we see that the best computation time is 4 minutes, for the movie Austin Powers in Goldmember, which is 1 hour 34 minutes in duration. The worst computation time is 45 minutes, for the movie Matrix Revolutions, of duration 2 hours 4 minutes. For movies like Eyes Wide Shut and Austin Powers in Goldmember, the computation is faster as there are fewer audio highlights, whereas movies with many audio highlights, like Titanic and the Matrix sequels, take more time. This causes the variation in computation time among movies.
The lead actors detected are shown in Figure 2.10. The topmost star is highlighted in red and the next top star is highlighted in blue. As can be noticed in the figure, the topmost stars are detected in most of the movies. Since the definition of "top" is subjective, it could be said that in some cases top stars are not detected. Further, in some cases the program identifies the same actor multiple times; this could be due to disguise, or due to pose variation. The result is further refined for better accuracy as mentioned in Section 2.4.1.
Figure 2.10: Lead actors identified for popular movies appear on the right.
2.5.2
Player of the Match Identification
We have conducted experiments on the soccer match highlights taken from BBC and listed in Table 2.2. Our method on average takes half the duration of the video. Note, however, that these timings are for sections that have already been manually edited by the BBC staff. If the method were run on a routine full soccer match, we expect our running time to be a lower percentage of the entire video.
The results of key player detection are presented in Figure 2.11. The key players of the match are identified for all the matches.
Figure 2.11: Key players detected from BBC Match of the Day match highlights for premier
league 2007-08.
2.5.3
Host Guest Detection
We conducted our experiment on the TV show Koffee with Karan. Two different episodes of the show were combined and fed as input. Our method identified the host in 4 minutes for a video of duration 1 hour 29 minutes, which is faster than the method proposed by Javed et al. [24].
The result of our method for the TV show Koffee with Karan is presented in Figure 2.12. Our method has successfully identified the host.
Table 2.2: Time taken for computing key players from BBC MOTD highlights for premier league 2007-08.

No.  Soccer Match                         Duration  Computation
                                          (hh:mm)   (hh:mm)
1    Barnsley vs Chelsea                  00:02     00:01
2    Birmingham City vs Arsenal           00:12     00:04
3    Tottenham vs Arsenal                 00:21     00:07
4    Chelsea vs Arsenal                   00:14     00:05
5    Chelsea vs Middlesbrough             00:09     00:05
6    Liverpool vs Arsenal                 00:12     00:05
7    Liverpool vs Havant & Waterlooville  00:15     00:06
8    Liverpool vs Middlesbrough           00:09     00:05
9    Liverpool vs Newcastle United        00:18     00:04
     Total                                01:52     00:43
Table 2.3: Time taken for identifying host in a TV show Koffee With Karan. Two episodes are combined into a single video and given as input.

TV show              Duration  Computation
                     (hh:mm)   (hh:mm)
Koffee With Karan    01:29     00:04
Figure 2.12: Host detection of TV show “Koffee with Karan". Two episodes of the show are
combined and given as input. The first person in the detected list (sorted by the weight) gives
the host.
Chapter 3
AutoMontage: Photo Sessions Made Easy!
It is no longer only professional photographers who take pictures. Almost anyone has a good camera, and often takes a lot of photographs. Group photographs taken as a photo session at reunions, conferences, weddings, and so on are de rigueur. It is difficult, however, for novice photographers to capture good expressions at the right time and realize a consolidated acceptable picture.
A video shoot of the same scene ensures that expressions are not missed. Sharing the video, however, may not be the best solution. Besides the obvious bulk of the video, poor expressions ("false positives") are also willy-nilly captured and might prove embarrassing. A good compromise is to produce a mosaiced photograph assembling good expressions and discarding poor ones. This can be achieved by cumbersome manual editing; in this work, we provide an automated solution, illustrated in Figure 3.1. The photo shown has been created from a random YouTube video excerpt and does not exist in any frame of the original video.
3.1
Related Work
Research on measuring the quality of facial expressions has appeared elsewhere, in applications such as detecting medical patients' expressions to sense pain [40], measuring children's facial expressions during problem solving [37], and analyzing empathetic interactions in group meetings [28]. Our work focuses on generating an acceptable photo montage from a photo-session video and is oriented towards the targeted goal of discarding painful expressions and recognizing memorable ones.
Figure 3.1: An automatic photo montage created by assembling "good" expressions from a video (randomly picked from YouTube). This photo did not exist in any of the input frames.
In regard to photo editing, researchers have come up with many interesting applications, like organizing photos based on the person present in the photos [31][34][32][12], correcting an image with closed eyes to open eyes [38], and morphing photos [36]. Many of these methods either require manual intervention, or involve a different and enlarged problem scope resulting in more complicated algorithms, rendering them inapplicable to our problem.
The closest work to ours is presented in [6], where a user selects the part they want from each photo. These parts are then merged using the technique of graph cuts to create a final photo. Our work differs from [6] in a few ways. Faces are selected automatically by determining pleasing face expressions (based on an offline machine learning strategy). Video inputs are allowed, enabling a larger corpus of acceptable faces, and the complete process is automated, thereby making it easier for end-users.
3.2
Technical Contributions
The technical contributions of this work include:
• A frame analyzer that detects camera panning motion to generate candidate frames
• An expression analyzer for detected faces
• A photo patcher that enables seamless placement of faces in group photos
Details of these steps appear in the following section.
3.3
Methodology
In this section, we present a high level overview of our method followed by the details. The steps involved in auto montage creation are:
1. Face detection and expression measurement
2. Base photo selection
3. Montage creation
First, we detect faces in all the frames and track them based on their positions. Then we measure the facial expression of all detected faces and select a manageable subset. The next step is to identify plausible frames which can be merged to create a mosaic; frames that are "far away" in time, or otherwise unrelated, cannot be merged. In the last phase, faces with the best expressions are selected and substituted into the mosaiced image using the techniques of graph cut and blending. Figure 3.2 illustrates these steps.
3.3.1
Face Detection and Expression Measurement
In our work, we first detect faces in the video using the method in [60]. Measuring facial expression is, as expected, critical. Facial expression is measured as the deviation from a neutral expression, as illustrated in Figure 3.3. We have manually collected around one hundred neutral expression faces from various wedding videos for training our system.
Figure 3.2: A schematic of our method. In the first step, detected faces are tracked and grouped
together. In the next step frames are analyzed to detect a base photo frame which can either be
a mosaic of frames, or a single frame with maximum number of good expression faces from the
video. In the last step, faces with good expression are patched to the base photo to form the
required photo montage.
Offline Alignment - Intuitively, alignment is achieved using the position of the eyes as the
reference. Neutral faces are aligned using the following steps:
1. Color space conversion: The face is converted from RGB to TSL (Tint, Saturation, and
Luminance) color space.
2. Skin regions with values Is are detected.
3. In the non-skin region where Is = 0, regions in the top half of the face are examined for two symmetrical and almost spherical regions which represent the eyes.
4. Non-skin regions in the bottom half of the face are examined for the occurrence of the mouth. When a horizontal region in between the eyes is found, it is detected as the mouth.
The extracted rectangular regions are measured relative to the positions of the eyes and mouth to achieve alignment.
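Once the eye centres are located, one common way to realize the final alignment is a similarity transform that maps them onto fixed canonical positions; the sketch below is illustrative, and the target eye coordinates and output size are assumed values rather than those used in the thesis.

import cv2
import numpy as np

CANONICAL_SIZE = (64, 64)                                  # assumed output (width, height)
LEFT_EYE_DST, RIGHT_EYE_DST = (20.0, 24.0), (44.0, 24.0)   # assumed target eye positions

def align_face(face_img, left_eye, right_eye):
    """face_img: cropped face; left_eye, right_eye: (x, y) eye centres in the crop."""
    (lx, ly), (rx, ry) = left_eye, right_eye
    angle = np.degrees(np.arctan2(ry - ly, rx - lx))       # tilt of the eye line
    scale = ((RIGHT_EYE_DST[0] - LEFT_EYE_DST[0])
             / max(np.hypot(rx - lx, ry - ly), 1e-6))      # bring eyes to target spacing
    centre = ((lx + rx) / 2.0, (ly + ry) / 2.0)
    M = cv2.getRotationMatrix2D(centre, angle, scale)
    # shift so the eye midpoint lands on the canonical midpoint
    M[0, 2] += (LEFT_EYE_DST[0] + RIGHT_EYE_DST[0]) / 2.0 - centre[0]
    M[1, 2] += (LEFT_EYE_DST[1] + RIGHT_EYE_DST[1]) / 2.0 - centre[1]
    return cv2.warpAffine(face_img, M, CANONICAL_SIZE)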
Figure 3.3: Facial expression measured as deviation from neutral expression
Neutral expression - A sparse quantification of neutral expression is achieved using dimensionality reduction techniques. In brief,
1. Faces are contrast enhanced.
2. The mean image is computed and subtracted from the neutral faces.
3. The actual dimensionality reduction is done using SVD factorization.
For any test face x_t, the facial expression measure is computed as the deviation from the stored principal component vectors. As in the training phase, the face is first aligned and enhanced, and then the mean centered vector is projected onto the neutral face eigenspace to obtain a projection p.
The Euclidean distance between the projection p and the mean of the projections P of the neutral eigenfaces is used to measure the quality of an expression:
    δ1 = || p - (1/N) Σ_{i=1}^{N} P_i ||

We also compute the minimum Euclidean distance between the projection p and each neutral eigenface projection in P:

    δ2 = min_{i=1..N} || p - P_i ||

The facial expression measure is computed as the arithmetic mean of δ1 and δ2:

    δ = (δ1 + δ2) / 2
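A compact sketch of this measure follows (illustrative Python, not the thesis' Matlab code); the number of principal components kept is an assumed parameter.

import numpy as np

class ExpressionMeasure:
    def __init__(self, neutral_faces, n_components=20):
        # neutral_faces: (N, D) matrix of aligned, contrast enhanced neutral faces
        X = np.asarray(neutral_faces, dtype=np.float64)
        self.mean = X.mean(axis=0)
        _, _, Vt = np.linalg.svd(X - self.mean, full_matrices=False)
        self.basis = Vt[:n_components]            # principal directions
        self.P = (X - self.mean) @ self.basis.T   # projections of the neutral faces

    def delta(self, face):
        # project the (aligned, enhanced) test face into the neutral eigenspace
        p = (np.asarray(face, dtype=np.float64) - self.mean) @ self.basis.T
        d1 = np.linalg.norm(p - self.P.mean(axis=0))             # delta_1
        d2 = float(np.min(np.linalg.norm(self.P - p, axis=1)))   # delta_2
        return (d1 + d2) / 2.0                                   # delta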
3.3.2
Base Photo
We detect the camera movement direction by tracking the positions and sizes of the faces. For each shot, we accumulate frames until the camera direction changes. The resulting clusters of frames serve as the base photo. In case there is little or no camera movement, the scene is considered static, and the frame having the maximum number of good expression faces is selected as the base photo.
3.3.3 Montage
The montage is illustrated in Figure 3.4 and created as follows:
1. As a rough indicator of the desired position, the detected face’s coordinates are mapped
to the corresponding coordinates in the mosaiced image using the computed parameters
from the mosaicing algorithm.
2. Simultaneously, multiple faces which have similar coordinates are grouped, and the face with the maximum facial expression measure (δ) is selected (see the sketch after this list).
Figure 3.4: Illustration of the montage creation process. Best expressions from various frames
previously selected are patched to create a new frame.
3. A broad alignment with the body position is also made.
4. Given these tentative positions, graph cut [8] is used to find an accurate boundary for the inserted face. In brief, the base boundaries are tied to the source node and assigned a high weight, while in-between nodes are assigned the absolute difference of gradient level.
5. Around the graph-cut segmentation, image blending is performed between the base photo and the selected face.
Figure 3.5: AutoMontage created by generating a mosaic of the YouTube video available in [4].
3.4 Experiments

To compare our method, we ran our experiments on the stack of images provided in [6]. The result is presented in Figure 3.6. As can be seen, the generated photo has good expressions for most of the people. Note that, unlike the original method, we generated the photo montage automatically, without user input.
Figure 3.6: AutoMontage created from the family image stack of [6]. We are able to automatically generate the photo montage, as opposed to the method in [6].
Our system has also been tested on other group photo sessions collected from YouTube. Our algorithm successfully created photo montages from all these videos. Examples are presented in Figure 3.6, Figure 3.7, and Figure 3.8.
In Figure 3.6, though most of the faces look good, there is an artifact on the third person from the top-left corner. This is introduced by the patching scheme when the best expressions of multiple faces were substituted. This scenario demands more accurate detection of faces to avoid such artifacts.
Figure 3.7: Photo montage example having acceptable expressions. The YouTube video is available in [3].
Figure 3.8: Photo Montage example having acceptable expressions. A wedding reception video.
Chapter 4
Cricket Delivery Detection
Cricket is one of the most popular sports in the world after soccer. Played globally in more than
a dozen countries, it is followed by over a billion people. One would therefore expect, following
other trends in global sports, that there would be a meaningful analysis and mining of cricket
videos.
There has been some interesting work in this area (for example, [23, 27, 45]). However, by
and large, the amount of computer vision research does not seem to be commensurate with
the interest and the revenue in the game. Possible reasons could be the complex nature of
the game, and the variety of views one can see, as compared to games such as tennis, and
soccer. Further, the long duration of the game might inhibit the use of inefficient algorithms.
Segmenting a video into its meaningful units is very useful for the structure analysis of the
video, and in applications such as content based retrieval. One meaningful unit corresponds to
the semantic notion of “deliveries” or “balls” (virtually all cricket games are made up of 6-ball
overs). Consider:
Figure 4.1: Typical views in the game of cricket (a) Pitch View (b) Ground View (c) Non-Field
View. Note that the ground view may include non-trivial portions of the pitch.
• A bowling coach might be interested in analyzing the balls faced only by, say, Steve Smith
from the entire match. If we could temporally segment a cricket match into balls, then it
is possible to watch only these portions.
• The game's administrators might be able to figure out how many minutes are consumed by slow bowlers as compared to fast bowlers. Currently, such information is available only through the manual process of segmenting a long video (more than 7 hours).
4.1 Prior Work
Indeed, the problem of segmenting a cricket video into meaningful scenes is addressed in [27].
Specifically, the method uses the manual commentaries available for cricket video, to segment
a cricket video into its constituent balls. Once segmented, the video is annotated by the text for
higher-level content access. A hierarchical framework and algorithms for cricket event detection
and classification are proposed in [23]. The authors use a hierarchy of classifiers to detect the various views present in the game of cricket. The views with which they are concerned are real-time, replay, field view, non-field view, pitch view, long view, boundary view, close-up, and crowd.
Despite the interesting methods in these works – useful in their own right – there are some
challenges. The authors in [23] have only worked on view classification without addressing the
problem of video segmentation. Our work closely resembles that of the method in [27], and is
inspired by it from a functional point of view. It differs dramatically in the complexity of the solution, and in the basis for the temporal segmentation. Specifically, as the title of [27] indicates,
the work is text-driven.
In Binod Pal's work [45], he addresses the problem of segmenting cricket videos into the meaningful structural unit termed a ball. Intuitively, a ball starts with the frame in which "a bowler is running towards the pitch to release the ball." The end of the ball is not necessarily the start of the next ball; the ball is said to end when the batsman has dealt with the ball. The end of a ball corresponds to a variety of views: the end frame might be a close-up of a player or players (e.g. a celebration), the audience, a replay, or even an advertisement.
Because of this varied nature of views, he has used both the domain knowledge of the game (views and their sequence), as well as TV production rules (the location of the scoreboard, and its presence and absence). In his approach, the input video frames are first classified into two semantic concepts: play and break. The break segment includes replays, graphics, and advertisements. The views that we get in the game of cricket include close-ups of the players, and a variety of views that are defined as follows; examples of the various views are presented in Figure 4.6.
• Pitch Views (P) bowler run-up, ball bowled, ball played.
• Ground View (G) ball tracking by camera, ball fielded and returned.
• Non-Field Views (N) Any view that is not P or G. This includes views such as the close-up of a batsman, bowler, or umpire, and the audience.
Further, Binod Pal defines Field Views (F) as the union of P and G. Using this vocabulary, he creates an alphabetical sequence (see the block diagram of our system in Fig. 4.2); representative views are shown in Fig. 4.1. Note that this particular classification is not intended to be definitive or universal, and may even appear subjective. The process of creating this sequence is as follows (a sketch of the final rule-based step is given after this list):
• Segment the video frames into play and break segments.
• Classify play segment frames into F or N.
• Classify F view frames into P or G.
• Smooth the resulting sequence.
• Extract balls using a rule-based system.
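The exact rules used in [45] are not reproduced here; the sketch below only illustrates the flavor of such a rule-based extractor on a per-frame view string, assuming one label per frame ('P', 'G', 'N', or 'B' for break) and a hypothetical minimum pitch-view run length. A ball is taken to start at a sufficiently long run of P frames and to end when the following field-view activity gives way to non-field or break frames.

import re

def extract_balls(view_seq, min_pitch_run=20):
    """view_seq: per-frame labels, e.g. "BBPPPP...GGGNNN...", one character per frame.
    Returns a list of (start_frame, end_frame) index pairs, one per detected ball."""
    balls = []
    # a ball: a long run of pitch view, then any mix of field views, ended by non-field/break
    pattern = re.compile(r"P{%d,}[PG]*" % min_pitch_run)
    for m in pattern.finditer(view_seq):
        balls.append((m.start(), m.end() - 1))
    return balls

# toy usage: two deliveries separated by non-field frames
seq = "B" * 10 + "P" * 30 + "G" * 40 + "N" * 25 + "P" * 28 + "G" * 15 + "N" * 50
print(extract_balls(seq))   # [(10, 79), (105, 147)]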
In summary, his contribution lies in modeling the ball as a sequence of relevant views and coming up with appropriate techniques to obtain the vocabulary in the sequence.

However, his method fails when views are misclassified due to pop-up advertisements, or when there are too few frames in a particular view. In our work, we focus on improving accuracy in such cases and, further, we identify the key players of each ball to help in indexing the match.
[Figure 4.2 here: the pipeline takes a cricket video through scoreboard position detection, play/break detection, field/non-field view classification (using the dominant grass color ratio), and a pitch/ground view classifier trained on color, edge, and shape features.]
Figure 4.2: Block diagram of our system
4.2 Problem Statement

The cricket delivery detection method proposed by Binod Pal et al. [45] fails when views are misclassified. We propose improvements to handle these cases:
1. Eliminate pop-up advertisement
2. Enhance the pitch view detection logic to eliminate false positives
3. Enhance smoothing logic, so that it doesn’t smooth out the required views
4. Associate each delivery with the prominent player
Figure 4.3: Illustration of various problems where Binod Pal's work fails
4.3 Methodology

In our approach, we build on Binod Pal's work and address the issues presented in the problem statement.
4.3.1 Pop-up Advertisement Detection and Removal
Whenever the view classifier detects an advertisement, we further examine the frame to check whether it is a pop-up ad and eliminate it. Pop-up advertisements are relatively static compared to the game play, where there is more activity. We use this to detect such advertisements and eliminate them. From the training set, we extract a static-component image from the motion vectors, keeping regions with negligible motion. From this, we detect the prominent regions at the left and bottom of the image and then smooth the region boundaries to obtain the ad boundary.

When this method fails to detect the advertisement boundary, we use the scoreboard to determine it. We look for the scoreboard boundary in the first quadrant of the image. Once the scoreboard position is detected, the advertisement boundary is inferred from its displacement distance.
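A minimal sketch of the low-motion cue is shown below, assuming plain frame differencing as a cheap stand-in for the motion vectors mentioned above; the thresholds and margin fractions are hypothetical tuning parameters.

import cv2
import numpy as np

def low_motion_mask(frames, diff_thresh=8):
    # frames: list of grayscale frames from one play segment
    acc = np.zeros(frames[0].shape, np.float32)
    for prev, cur in zip(frames[:-1], frames[1:]):
        acc += (cv2.absdiff(cur, prev) > diff_thresh).astype(np.float32)
    motion_rate = acc / (len(frames) - 1)
    return motion_rate < 0.02            # pixels that almost never change

def popup_ad_candidate(frames, margin=0.25):
    mask = low_motion_mask(frames).astype(np.uint8) * 255
    h, w = mask.shape
    # pop-up ads are looked for along the left and bottom margins of the frame
    margin_mask = np.zeros_like(mask)
    margin_mask[:, : int(w * margin)] = 255
    margin_mask[int(h * (1 - margin)) :, :] = 255
    candidate = cv2.bitwise_and(mask, margin_mask)
    candidate = cv2.morphologyEx(candidate, cv2.MORPH_CLOSE, np.ones((9, 9), np.uint8))
    contours, _ = cv2.findContours(candidate, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > 0.01 * h * w]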
4.3.2 Green Dominance Detection for Field View Classification
In the prior work [45], there were false positives for pitch detection when the player’s dress
color is green. To reduce such false positives, we divided the frame into 9 blocks and fed a 128-bin hue histogram of each block as additional features to the view classifier. As the green ground region surrounds the pitch and the pitch lies at the center of the frame, this helped in reducing the false positives.

Figure 4.4: Detecting Advertisement Boundary

Figure 4.5: Scoreboard Detection
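The block-wise hue-histogram feature described above can be sketched as follows; the 3 x 3 grid and 128 bins follow the text, while the normalization and OpenCV hue range are implementation assumptions.

import cv2
import numpy as np

def block_hue_histograms(frame_bgr, grid=3, bins=128):
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    hue = hsv[:, :, 0]                        # OpenCV hue range is [0, 180)
    h, w = hue.shape
    feats = []
    for by in range(grid):
        for bx in range(grid):
            block = hue[by * h // grid : (by + 1) * h // grid,
                        bx * w // grid : (bx + 1) * w // grid]
            hist, _ = np.histogram(block, bins=bins, range=(0, 180))
            feats.append(hist / max(hist.sum(), 1))   # normalized per-block histogram
    return np.concatenate(feats)               # 9 * 128 = 1152-dimensional feature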
4.3.3 Smoothing of View Sequence Using Moving Window
In the prior work, smoothing was done by dividing the view sequence into windows of 5 frames and assigning the most frequently occurring view to all the frames in that window. As this method can remove views which fall across window boundaries, we use a running window instead of disjoint windows. Each frame is considered the center of a window, and the most frequently occurring view in that window is assigned to the frame. This approach does not eliminate views which originally fell across window boundaries.
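A minimal sketch of this running-window majority smoothing, assuming a per-frame list of view labels and a window of 5 frames:

from collections import Counter

def smooth_views(labels, window=5):
    """Assign to each frame the most frequent label in the window centered on it."""
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        lo, hi = max(0, i - half), min(len(labels), i + half + 1)
        smoothed.append(Counter(labels[lo:hi]).most_common(1)[0][0])
    return smoothed

print(smooth_views(list("PPPGPPGGGNGNN")))
# ['P', 'P', 'P', 'P', 'P', 'G', 'G', 'G', 'G', 'G', 'N', 'N', 'N']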
Figure 4.6: Extraction of features at block level for improving view detection
4.3.4 Player Detection
Once deliveries are detected, further filtering of the video is possible. For example, we consider classifying deliveries based on a particular player, such as the Australian cricket captain Steve Smith.

We can use a standard frontal face detector to detect the faces of bowlers. The faces of batsmen, on the other hand, are confounded by the usual use of helmets, so we look for additional information in the form of connected regions in the skin color range and square regions which have more edges. For each delivery, face detection is performed on the 100 frames before and after the actual delivery. This buffer of 200 frames is assessed for recognizing the faces of the key players associated with the delivery. The usual method of dimensionality reduction and clustering is used to prune the set. With this background, when an input image of a particular player is given, the image is mapped to the internal low-dimensional representation and the closest cluster is found using a kd-tree. The corresponding deliveries in which that player appears are displayed.
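The kd-tree lookup can be sketched as below, assuming the low-dimensional face embeddings and their associated delivery indices are already available; the embedding function itself (the projection of detected faces) is outside this snippet.

import numpy as np
from scipy.spatial import cKDTree

class PlayerIndex:
    def __init__(self, face_embeddings, delivery_ids):
        # face_embeddings: (num_faces, d) low-dimensional face vectors
        # delivery_ids: delivery index associated with each face
        self.tree = cKDTree(face_embeddings)
        self.delivery_ids = np.asarray(delivery_ids)

    def deliveries_for(self, query_embedding, k=25):
        # find the k nearest stored faces and return their (unique) deliveries
        _, idx = self.tree.query(query_embedding, k=k)
        return sorted(set(self.delivery_ids[np.atleast_1d(idx)].tolist()))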
4.4 Experiments and Results

We have conducted our experiments on one-day international matches, test matches, and T20 matches, summing up to 50 hours of video. We compared our pitch/ground view classification against one using SIFT features. The features proposed by us produced promising results, presented in Table 4.1.
Table 4.1: Comparison of methods across different matches

Pitch/Ground View Classification    Precision    Recall
Our Method                          0.982        0.973
SIFT features                       0.675        0.647
Our method successfully detected deliveries with an overall precision of 0.935 and recall of 0.957. The results are presented in Table 4.2.
Table 4.2: Result of detected deliveries for different types of matches

Match Type    Precision    Recall
ODI           0.932        0.973
Test Match    0.893        0.931
T20           0.982        0.967
The filtered retrieval of deliveries is presented in Fig 4.7.
Figure 4.7: Retrieval of deliveries involving only certain players.
4.5 Conclusion and Future Work

We have improved the accuracy of the temporal segmentation of cricket videos proposed by Binod Pal and compared results across different forms of cricket (50-over ODIs, Twenty20 (T20), and test matches). We are also able to provide a browse-by-player capability to traverse deliveries by player.
Chapter 5
Hierarchical Summarization for Easy Video Applications
Say you want to find an action scene of your favorite hero. Or you want to watch a romantic movie with a happy ending. Or say you saw an interesting trailer, and want to watch related movies. Finding these videos in the ocean of videos available has become noticeably difficult, and requires a trustworthy friend or editor. Can we quickly design computer "apps" that act like this friend?

This work presents an abstraction for making this happen. We present a model which summarizes the video in terms of dictionaries at different hierarchical levels: pixel, frame, and shot. This makes it easier to create applications that summarize videos and address complex queries like the ones listed above.
The abstraction leads to a toolkit that we use to create several different applications demonstrated in Figure 5.1. In the “Suggest a Video” application, from a set of action movies, three
Matrix sequels were given as input; the movie Terminator was found to be the closest suggested
match. In the “Story Detection” application, the movie A Walk to Remember is segmented into
three parts; user experience suggests that these parts correspond to prominent changes in the
twist and plot of the movie. In the “Audio Video Mix" application, given a song with a video as
a backdrop, the application finds another video with a similar song and video; the application
thus can be used to generate a “remix” for the input song. This application illustrates the ability
of the data representation to find a video which closely matches both content and tempo.

Figure 5.1: Retrieval applications designed using our Hierarchical Dictionary Summarization method. "Suggest a Video" suggests Terminator for the Matrix sequels. "Story Detection" segments a movie into logical turning points in the plot of the movie A Walk to Remember. "Audio Video Mix" generates a "remix" song from a given set of music videos.
5.0.1 Related Work
Methods like hierarchical hidden Markov models [65, 67] and latent semantic analysis [43, 64] have been used to build models over basic features to learn semantics. In the method proposed by Lexing Xie et al. [65], the structure of the video is learned in an unsupervised way using a two-level hidden Markov model. The higher level corresponds to semantic events while the lower level corresponds to variations within the same event. Gu Xu et al. [67] proposed multi-level hidden Markov models using detectors and connectors to learn videos.
Spatio-temporal words [29, 43, 63, 64] are interest points identified in space and time. Ivan Laptev [29] uses a spatio-temporal Laplacian operator over spatial and temporal scales to detect events. Niebles et al. [43] use probabilistic latent semantic analysis (pLSA) on spatio-temporal words to capture semantics. Shu-Fai Wong detects spatio-temporal interest points using global information [63] and then uses pLSA [64] to learn relationships between the semantics, revealing structural relationships. Soft quantization [22], which accounts for the distance to a number of codewords, has been considered for classifying scenes within an image, and Fisher vectors have been used in classification [25, 53], showing significant improvement over bag-of-words (BOW) methods. A detailed survey of various video summarization models can be found in [9].
Similar to [25, 53], we use local descriptors and form visual dictionaries. However, unlike [25, 53], we preserve more information instead of extracting only specific information. In addition to building a dictionary at the pixel level, we extend this dictionary to the frame and shot levels, forming a hierarchical dictionary. Having similarity information available at various granularities is the key to creating applications that need features at the desired level.
5.0.2 Technical Contributions
In this chapter, we propose a Hierarchical Dictionary Model (termed H-Video) to make the task of creating applications easier. Our method learns semantic dictionaries at three different levels: pixel patches, frames, and shots. The video is represented in the form of learned dictionary units that reveal semantic similarity and video structure. The main intention of this model is to provide these semantic dictionaries, so that comparing video units at different levels, both within the same video and across different videos, becomes easier.
The benefits of H-Video include the following
(i) The model advocates run-time leveraging of prior offline processing time. As a consequence,
applications run fast.
(ii) The model is built in an unsupervised fashion. As no application-specific assumption is made, many retrieval applications can use this model and its features. This can potentially save an enormous amount of computation time otherwise spent in learning.
[Figure 5.2 here: in the dictionary formation phase, features extracted at the pixel, frame, and shot levels are clustered to form the H1, H2, and H3 dictionaries; in the dictionary representation phase, the video is re-expressed as H1 representations of pixels in a frame, H2 representations of frames in a shot, and H3 representations of shots in the video.]
Figure 5.2: Illustration of the H-Video Abstraction
(iii) The model represents learned information using a hierarchical dictionary. This allows the video to be represented as indexes to dictionary elements. This makes it easier for the developer of a new retrieval application, as similarity information is available as a one-dimensional array. In other words, our model doesn't demand a deep video-understanding background from application developers.
(iv) We have illustrated our model through several applications. Figure 5.1 illustrates these
applications.
5.1 Methodology
We first give an overview of the process which is illustrated in Figure 5.2.
5.1.1 Overview
Our model first extracts local descriptors like color and edge from pixel patches. (Color and edge descriptors are simply examples.) We then build a dictionary, termed the H1 dictionary, out of these features. At this point, the video could, in principle, be represented in terms of this H1 dictionary. We refer to each frame of this representation as an H1 frame. We then extract slightly higher-level features, such as the histogram of H1 dictionary units, the number of regions in these H1 frames, and so on, and form a new H2 dictionary. The H2 dictionary is yet another representation of the video and captures the type of objects and their distribution in the scene; in this way, it captures the nature of the scene. The video could also be represented using this H2 dictionary. We refer to each shot in this representation as an H2 shot. Further, we extract features based on the histogram and position of H2 dictionary units and build yet another structure, the H3 dictionary. This dictionary represents the type of shots occurring in the video. The video is now represented in terms of this H3 dictionary to form the H3 video.
5.1.2 H1 Dictionary Formation and Representation
For forming the H1 dictionary, at each pixel the surrounding 8 × 8 window of pixels is considered. We extract local descriptors from this moving window using pyramidal Gaussian features, neighborhood layout features, and edge filters.

As an example, we use the filters listed in Fig. 5.3, which we have found to be effective in capturing color and shape information. A different set of features, such as SIFT or HOG, can also be used based on the requirements of the application.

These features are extracted from the complete video. We use principal component analysis to determine the number of clusters: the top principal components are chosen based on the allowed error. We then use k-means to obtain the H1 dictionary. This is illustrated in Figure 5.4.
However, as clustering very long videos could be time consuming, one alternative is to do this process in stages. For example, we first form an H1 dictionary for each frame, combine dictionaries from a sequence of frames, and cluster again to form a more representative dictionary. Several dictionaries from temporal stages in the video are then combined to form a global H1 dictionary. This step-by-step approach of building the dictionary makes the computation easy and scalable.

Figure 5.3: Pixel-Level Feature Extraction Filters
Figure 5.4: H1 Dictionary Formation
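A minimal sketch of the H1 dictionary formation, assuming the per-pixel descriptors have already been extracted into a feature matrix; selecting the number of clusters from the PCA spectrum via an explained-variance threshold is one plausible reading of the allowed-error rule, not the thesis's exact criterion.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import MiniBatchKMeans

def build_h1_dictionary(descriptors, allowed_error=0.05, seed=0):
    # descriptors: (num_patches, d) local color/edge descriptors from 8x8 windows
    pca = PCA().fit(descriptors)
    # number of clusters taken as the number of components needed to keep
    # (1 - allowed_error) of the variance (an assumption, see above)
    k = int(np.searchsorted(np.cumsum(pca.explained_variance_ratio_),
                            1.0 - allowed_error)) + 1
    km = MiniBatchKMeans(n_clusters=k, random_state=seed).fit(descriptors)
    return km                               # km.cluster_centers_ is the H1 dictionary

def h1_represent(km, frame_descriptors):
    # re-express each pixel patch of a frame as the index of its dictionary unit
    return km.predict(frame_descriptors)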
Once the global H1 dictionary is available, we process the video again to remove duplicate or near-duplicate units and form a less redundant dictionary-based representation. The complete video is then represented using these dictionary units (left-hand side of Fig. 5.2). Each frame in this representation is referred to as an H1 frame. H1 frames may also be thought of as a segmentation, but in addition to the segmentation, the dictionary units carry information about the nature of the objects. This process is demonstrated in Figure 5.5.
5.1.3 H2 Dictionary Formation and Representation
H2 Formation: From each H1 frame, conglomerate features like a histogram, the number of
regions, and the distance between them are captured. The list of features used is shown in Figure 5.6. A subset of these features can also be chosen based on the needs of the application.

Figure 5.5: H1 Representation of a frame (bottom)
Figure 5.6: Features extracted from H1 frame
Similar to H1 dictionary formation, these features are clustered to form an H2 dictionary. We use the step-by-step dictionary building approach, wherein dictionaries are first built for sequences of frames and then clustered again to form the global dictionary. This is illustrated in Figure 5.7.

Using the H2 dictionary, we represent the video in terms of H2 units. We refer to each shot in
this representation as an H2 shot. This is illustrated in Figure 5.8. Changes in the H2 units correspond to dynamically changing shots, whereas similar consecutive H2 units correspond to relatively static content. Representing the video as a one-dimensional array of dictionary units makes the comparison of video elements easier. H2 units capture higher-level details and are simpler for comparison purposes. Generally, the information needed by similarity applications is captured at this level.

Figure 5.7: H2 dictionary formation
5.1.4 H3 Dictionary Formation and Representation
Features capturing the distribution of H2 units, such as histograms and the distance between blocks, are considered. The details of the various features captured are provided in Figure 5.9.

These extracted features are clustered to form the H3 dictionary, wherein the dictionary is first computed at the shot level, and the dictionary units are again clustered to form the global H3 dictionary. This is illustrated in Figure 5.10.

Using this dictionary, the video is represented as a one-dimensional sequence of H3 units,
where each unit corresponds to a shot, as illustrated in Figure 5.11. This representation will typically be useful in applications involving a huge collection of videos; applications can quickly narrow down to the correct interest group. This level will also be useful in segmentation and classification applications. For example, while classifying the shots of a lecture video into professor, student, and board slides, this level is helpful.

Figure 5.8: Representation of H2 Units

Figure 5.9: Features extracted from H2 shot
Figure 5.10: Feature extraction from H2 shots using shot filter bank to form H3 Dictionary
Figure 5.11: H3 representation of the video
5.1.5 Building a Global Dictionary

Some applications require videos to be compared at multiple dictionary levels. For this purpose, we cluster the H1 clusters of more than one input video in the database together, and construct an H1 dictionary at the database level rather than for an individual video. Using this notion of a global H1 dictionary, the H2 and H3 units are generated again. (Input videos from the database are randomly sampled.)
5.2 Applications

In this section, we demonstrate the efficiency and applicability of our model through four retrieval applications: video classification, suggest a video, detecting logical turning points, and searching for candidates for an audio-video mix.

The applications designed using our model are fast, as they use pre-computed information. Typically, these applications take a minute to compute the required information. This illustrates the efficiency and applicability of this model for content-based video retrieval.
5.2.1 Suggest a Video

Websites like imdb.com present alternate, related movies when the user is exploring a specific movie. This is a good way for users to find new or forgotten movies they would be interested in. Currently, such suggestions are mostly based on text annotations and the user's browsing history, which the user may want to turn off due to privacy considerations. If visual features of the video can also be used to solve this problem, better suggestions can evolve.

This scenario is illustrated in Figure 5.12, where the movies the user was interested in are highlighted in red and the suggested movie is highlighted in green; the green one is presumably closest to the user's interest.

In this application, we aim to automatically suggest videos based on the similarity of the various dictionary units. There are many ways to compare videos, such as comparing the feature vectors of the dictionaries, computing the amount of overlapping dictionary units, or computing correlation. In this case, we take the sequence of H2 dictionary units and substitute each unit with the corresponding dictionary feature vector to compute similarity. We take the cross correlation of the H2 representation features to compute the similarity between videos. Videos whose match value is above a threshold are suggested to the user.
Figure 5.12: Given a set of movies of interest (in red), a movie which is similar to them is suggested (in
green).
5.2.2 Video Classification

In this application, we identify the various genres a movie could belong to and suggest this classification. This reduces the effort involved in tedious annotation and helps users choose appropriate genres easily. In our approach, we train a random forest model for each genre using the H2 and H3 histograms of the sample set. Given a new movie and a genre, the model outputs whether the movie belongs to that genre or not.
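The per-genre classifiers can be sketched as below, assuming each movie is summarized by the concatenation of its H2 and H3 dictionary-unit histograms; the scikit-learn random forest is used as a stand-in for whatever forest implementation was actually employed.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def movie_feature(h2_units, h3_units, h2_size, h3_size):
    # histogram of dictionary-unit indices at the H2 and H3 levels
    h2_hist = np.bincount(h2_units, minlength=h2_size) / max(len(h2_units), 1)
    h3_hist = np.bincount(h3_units, minlength=h3_size) / max(len(h3_units), 1)
    return np.concatenate([h2_hist, h3_hist])

def train_genre_models(features, genre_labels):
    """features: (num_movies, d); genre_labels: dict genre -> binary label vector.
    Returns one binary random forest per genre."""
    return {genre: RandomForestClassifier(n_estimators=100, random_state=0).fit(features, y)
            for genre, y in genre_labels.items()}

def predict_genres(models, feature):
    return [g for g, clf in models.items() if clf.predict(feature.reshape(1, -1))[0] == 1]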
5.2.3 Logical Turning Point Detection

We define a logical turning point in a video as a point where the characteristics of the objects in the video change drastically. Detecting such places helps in summarizing the video effectively.
Figure 5.13: Movies are classified into drama and action genres

An example is shown in Figure 5.14; as illustrated in the figure, the characteristics of the people occurring in each of the story units are different.
We consider shots within a moving window of 5 shots and compute the semantic overlap of H1 and H2 units between the shots. When the amount of overlap between shots is low within a window, we detect that as a logical turning point.
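A minimal sketch of this detector, assuming each shot is summarized by the set of H1 and H2 dictionary units it contains; the Jaccard overlap and the threshold are assumptions standing in for the thesis's exact overlap measure.

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / max(len(a | b), 1)

def turning_points(shot_units, window=5, threshold=0.2):
    """shot_units: list of per-shot collections of H1/H2 dictionary-unit indices.
    Flags a turning point where average overlap within the window drops below threshold."""
    points = []
    for i in range(len(shot_units) - window + 1):
        win = shot_units[i : i + window]
        overlaps = [jaccard(win[j], win[j + 1]) for j in range(window - 1)]
        if sum(overlaps) / len(overlaps) < threshold:
            points.append(i + window // 2)      # mark the center shot of the window
    return points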
As the logical turning points capture the important events in video, this can be used in
applications like advertisement placement and preparing the story line.
5.2.4 Potential Candidate Identifier for Audio Video Mix

Remixing is the process of generating a new video from existing videos by changing either the audio or the video. Remixes are typically performed on video songs to add variety for entertainment. The remix candidates need to have a similar phase of scene change to generate a pleasing output.
Figure 5.14: Story line of the movie A Walk to Remember split into story units. Characters are introduced in the first story unit; in the second story unit, the girl and the boy participate in a drama; in the last part, they fall in love and get married. In spite of the lengths of the stories being very different, our method successfully identifies these different story units.
We use the cross correlation of H2 units to identify whether two videos have the same phase of change. Once the closest candidate is identified, we replace the remix candidate's audio with the original video's audio.
5.3 Experiments

In this section, we first evaluate the individual performance of the constituents of the H-Video model itself. Next, in creating dictionaries, we compare the usage of popular features such as SIFT. Finally, the performance of applications that may use more than one of the three levels, H1, H2, or H3, is evaluated.
Data: We have collected around 100 movie trailers from YouTube, twenty-five full-length movies, and a dozen music video clips.
Computational Time: Our implementation is in Matlab. Typically, a full-length movie takes around two hours for feature extraction. Building the local dictionaries for a movie takes around 10 hours. Building the global dictionary, which is extracted from the local dictionaries of multiple videos, takes around 6 hours. Note that building the local dictionaries and the global dictionary (left-hand side of Fig. 5.2) are one-time jobs. Once these are built, the dictionaries are directly used to create the right-hand side of Fig. 5.2. In other words, the relevant model building typically takes two hours, which is no different from the average runtime of the video itself. Once the model is constructed, each access operation typically takes only around 10 seconds per video.
5.3.1 Individual Evaluation of H1 and H2

To illustrate the effectiveness of the H1 dictionary, we considered the classification problem and collected videos from the categories "car", "cricket", and "news anchor". We collected 6 videos from each category, summing up to a total of 18 videos. We computed the H1 dictionary for each of these videos, formed a global H1 dictionary for the dataset, and represented all videos in terms of this global dictionary. For evaluation, we randomly selected two videos from each category as training data, used the remaining videos as the test set, and classified each test video into one of the three categories.
The recall and precision of classification using only H1 are provided in Table 5.1.
Table 5.1: Classification using only H1. With limited information, we are able to glean classification hints.

Category       Precision    Recall
Car            1.00         0.75
Cricket        0.67         1.00
News Anchor    1.00         0.75
To evaluate the effectiveness of only the H2 dictionary units, we built our model on a TV interview show; this had scenes of individuals as well as groups of people. When we built the H2 dictionaries with the allowed error set to "top principal component / 1000", we got six categories
capturing different scenes, people, and their positions. As we relaxed the allowed error, it resulted in two categories, separating scenes of individuals from scenes of groups of people. This result is presented in Fig. 5.15. Hence, applications can tune the allowed-error parameter to suit their requirements.

Figure 5.15: Classification using only H2. With only limited information, and with a smaller allowed error, fine details were captured; with a larger allowed error, broad categories were captured. (a) H2 dictionary units with smaller allowed error. (b) H2 dictionary units with larger allowed error.

Table 5.2: Video suggestion using popular features

Methods        Direct Comparison    H-Model    Percentage Improvement
SURF           54%                  54%        0%
SIFT           29%                  53%        83%
Color, Edge    48%                  59%        23%
5.3.2 Evaluation of Alternate Features

In this section, we evaluate popular features like SURF and SIFT, and contrast them with the color & edge features used in this work. Given any feature set, one may do a "direct comparison" (which takes a longer time), or do our proposed H-Video model-based comparison (which takes far less time). This experiment is performed on the "Suggest a Video" problem using only trailers of movies as the database. The result is presented in Table 5.2.
When the H-Video model is used, we use H2 as the basis of comparison. We observe that the use of the hierarchical model helped improve the accuracy for SIFT and for the color & edge features; the accuracy was almost the same when using SURF features. In producing these
statistics, for the ground truth we have used information available on imdb.com. One problem
in using imdb.com is that the truth is limited to the top twelve only. We therefore have added
transpose and transitive relationships as well. (In transpose relationships, if a movie A is deemed
to be related to B, we mark B as related to A. In transitivity, if a movie A is related to B, and B
is related to C, then we mark A as related to both B and C. The transitive closure is maintained
via recursion.)
5.3.3 Evaluation of Video Classification

We have considered three genres for video classification. We took the category annotations of 100 movie trailers and, for each category, used 30% of the data for training and the remaining data as the testing set. We built the H-Video model for these videos, extracted the H2 and H3 representations, and classified them using the random forest model. Example output is shown in Fig. 5.16.
5.3.4 Evaluation of Logical Turning Point Detection
Typically, drama movies have three logical parts: first the characters are introduced, then they get together, and then a punchline is presented towards the end. Considering this as ground truth, we validated the detected logical turning points. The logical turning points were detected with a precision of 1.0 and a recall of 0.75.
5.3.5 Evaluation of Potential Remix Candidates
We have conducted experiments on 20 song clips, where the aim is to find the best remix candidates. Our algorithm found two pairs of songs which are the best candidates for remix in the given set.
Figure 5.16: Sample Result of classifying movie trailers for categories Drama, Action and Romance.
In most of the cases, our model has classified movie trailers correctly.
Sample frames from matched video are presented in Fig. 5.17.
(a) Remix Candidates – Set 1
(b) Remix Candidates – Set 2
Figure 5.17: Sample frames from the identified remix candidates are presented. In each set, the top row corresponds to an original video song and the second row corresponds to the remix candidate. The content and sequencing of the first and second rows match, suggesting the effectiveness of our method.
Chapter 6
Conclusion & Future Work
Videos revolutionize many of the ways we receive and use information every day. The
availability of online resources has changed many things. Usage of videos has transformed
dramatically, demanding better ways of retrieving them.
Many diverse and exciting initiatives demonstrate how visual content can be learned from video. Yet, at the same time, reaching videos online using visual content has arguably been less radical, especially in comparison to text search. There are many complex reasons for this slow pace of change, including the lack of proper tags and the huge computation time required for learning specific topics.
In this thesis, we have presented ways to optimize computation time, which is the most important obstacle to retrieving videos. Next, we have focused on producing visual summaries, so that users can quickly determine whether a video is of interest to them. We have also presented ways to slice and dice videos so that users can quickly reach the segment they are interested in.
1. Lead Star Detection: Detecting lead stars has numerous applications, such as identifying the player of the match, detecting the lead actor and actress in motion pictures, and guest/host identification. Computational time has always been a bottleneck for using this technique. In our work, we have presented a faster method to solve this problem with comparable accuracy. This makes our algorithm usable in practice.
2. AutoMontage: Photo Sessions Made Easy!: With the increased usage of cameras by novices, tools that make photo sessions easier are becoming increasingly valuable. Our work successfully creates photo montages from photo session videos and combines the best expressions into a single photo.
3. Cricket Delivery Detection: Inspired by prior work [45], we approached the problem of temporal segmentation of cricket videos. As the view classification accuracy is highly important, we have proposed a few corrective measures to improve the accuracy by introducing a pop-up ad eliminator and finer features for view classification. Further, we associated each delivery with the key players of that delivery, providing a browse-by-player feature.
4. Hierarchical Model: In traditional video retrieval systems, relevant features are extracted from the video and applications are built using the extracted features. For multimedia database retrieval systems, there is typically a plethora of applications required for satisfying user needs. A unified model which uses fundamental notions of similarity would therefore be valuable to reduce the computation time for building applications.
In this work, we have proposed a novel model called H-Video, which provides the semantic information needed by retrieval applications. In summary, both the creation (programmer time) and the runtime (computer time) of the resulting applications are reduced. First, our model provides the semantic information of a video in a simple way, so that it is easy for programmers. Second, due to the suggested pre-processing of long video data, the runtime is reduced. We have built four applications as examples to demonstrate our model.
6.1 Future Work
The future scope of this work includes enhancing the hierarchical model to learn concepts automatically from a set of positive and negative examples using machine learning techniques such as random forests or random trees. This would serve as a great aid in learning a particular tag and applying it to all other videos which contain the same concept. The model can also be extended to matching parts of videos instead of whole videos.
The hierarchical model can also be used to analyze video types and to integrate with other applications. For example, video types pertaining to an actor can be learnt to create a profile of that actor. The video types learnt, when associated with appropriate tags, can serve complex queries related to the actor. The hierarchical model can also be used to identify photo session videos, so that AutoMontage can be applied automatically to create a photo album.
The future scope of this work also includes developing more applications to find videos of interest to the user. Video sharing websites create automatic playlists based on the user's choices. In such a list, the videos seem to be based on the browsing patterns of users. However, this has the drawback of mixing different types of video, whereas the user may be interested in a specific type. Hence, an application that learns video types from the recommendations and provides filtering options based on various attributes will help the user narrow down their interest. A few attributes that could be used for such filtering are the lead actors, the number of people in the video, video types from the hierarchical model, popular tags associated with the videos, and recency.
In conclusion, we feel that generic models with quick retrieval time combined with user centric
applications have much unexplored potential. They will become the favored methods for
reaching videos in the near future.
Thesis related Publications
1. Nithya Manickam, Sharat Chandran. Automontage: Photo Sessions Made Easy. In IEEE ICIP 2013, pp. 1321-1325.
2. Nithya Manickam, Sharat Chandran: "Automontage," Filed Indian patent 350/MUM/2013, February 2013.
3. Nithya Manickam, Sharat Chandran. Fast Lead Star Detection in Entertainment
Videos. In IEEE WACV 2009, pp. 1-6.
4. Nithya Manickam, Sharat Chandran. Hierarchical Summarization for Easy Video
Applications. IAPR MVA 2015.
5. Binod Pal, Nithya Manickam, Sharat Chandran: "Cricket Delivery Detection," Filed Indian Patent 319/MUM/2015, January 2015.
Thesis related Publications (In preparation or review)
1. Nithya Manickam, Binod Pal, Sharat Chandran. Cricket Delivery Detection. Submitted
to IEEE ICIP 2015.
Bibliography
[1] Internet movie database. http://www.imdb.com.
[2] Trec video retrieval evaluation. www-nlpir.nist.gov/projects/trecvid.
[3] Youtube video 94rotary convention_youth-hub committee 17th. rotaractors. http://www.youtube.com/watch?v=KQ73-P9HGiE. [Last visited on 07/06/2013]. xiii, 35
[4] Youtube video jc class of 1976 reunion-group photo session. http://www.youtube.com/watch?v=9FZTh_BkRD4. [Last visited on 07/06/2013]. xiii, 33
[5] Youtube video search. http://www.youtube.com.
[6] A. Agarwala, M. Dontcheva, M. Agrawala, S. Drucker, A. Colburn, B. Curless, D. Salesin,
and M. Cohen.
Interactive digital photomontage.
ACM Transactions on Graphics,
23(3):294–302, Aug 2004. iii, xiii, 28, 34
[7] C. Bang, S.-C. Chenl, and M.-L. Shyu. Pixso: a system for video shot detection. pages
1320 – 1324, December 2003.
[8] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph
cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11):1222–
1239, 2001. 33
[9] S.-F. Chang, W.-Y. Ma, and A. Smeulders. Recent advances and challenges of semantic
image/video search. In Acoustics, Speech and Signal Processing, 2007. ICASSP 2007.
IEEE International Conference on, volume 4, pages IV–1205–IV–1208, April 2007. 49
[10] J. Chen and Q. Ji. A hierarchical framework for simultaneous facial activity tracking. In
IEEE International Conference on Automatic Face Gesture Recognition and Workshops,
pages 679–686, March 2011.
[11] M. Covell and S. Ahmad. Analysis by synthesis dissolve detection. pages 425 – 428,
2002.
[12] M. Das and A. Loui. Automatic face-based image grouping for albuming. In IEEE
International Conference on Systems, Man and Cybernetics, volume 4, pages 3726–3731,
Oct. 2003. 28
[13] A. Doulamis and N. Doulamis. Optimal content-based video decomposition for interactive
video navigation. Circuits and Systems for Video Technology, IEEE Transactions on,
14(6):757–775, June 2004. 11
[14] D. P. W. Ellis and G. E. Poliner. Identifying ‘cover songs’ with chroma features and
dynamic programming beat tracking. In Identifying ‘Cover Songs’ with Chroma Features
and Dynamic Programming Beat Tracking, volume 4, April 2007.
[15] A. A. et. al. Ibm research trecvid-2005 video retrieval system. November 2005.
[16] M. Everingham and A. Zisserman. Automated visual identification of characters in situation comedies. In ICPR '04: Proceedings of the Pattern Recognition, 17th International Conference on (ICPR'04) Volume 4, pages 983–986, Washington, DC, USA, 2004. IEEE Computer Society. 12
[17] T. Fang, X. Zhao, O. Ocegueda, S. Shah, and I. Kakadiaris.
3d facial expression
recognition: A perspective on promises and challenges. In IEEE International Conference
on Automatic Face Gesture Recognition and Workshops, pages 603–610, March 2011.
[18] A. W. Fitzgibbon and A. Zisserman. On affine invariant clustering and automatic cast
listing in movies. In ECCV ’02: Proceedings of the 7th European Conference on Computer
Vision-Part III, pages 304–320, London, UK, 2002. Springer-Verlag. 12, 13
[19] S. Foucher and L. Gagnon. Automatic detection and clustering of actor faces based on
spectral clustering techniques. In Proceedings of the Fourth Canadian Conference on
Computer and Robot Vision, pages 113–122, 2007. iii, 12, 13, 22
[20] B. Funt and G. Finlayson. Color constant color indexing. Pattern Analysis and Machine
Intelligence, 17:522 – 529, 1995.
[21] M. Furini, F. Geraci, M. Montangero, and M. Pellegrini. On using clustering algorithms
to produce video abstracts for the web scenario. In Consumer Communications and
Networking Conference, 2008. CCNC 2008. 5th IEEE, pages 1112–1116, Jan. 2008.
[22] J. C. Gemert, J.-M. Geusebroek, C. J. Veenman, and A. W. Smeulders. Kernel codebooks
for scene categorization. In Proceedings of the 10th European Conference on Computer
Vision: Part III, ECCV ’08, pages 696–709, Berlin, Heidelberg, 2008. Springer-Verlag.
49
[23] K. M. H., P. K., and S. S. Semantic event detection and classification in cricket video
sequence. In Indian Conference on Computer Vision Graphics and Image Processing,
pages 382–389, 2008. 37, 38
[24] O. Javed, Z. Rasheed, and M. Shah. A framework for segmentation of talk and game
shows. Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International
Conference on, 2:532–537 vol.2, 2001. 12, 21, 24
[25] H. Jegou, M. Douze, C. Schmid, and P. Perez. Aggregating local descriptors into a
compact image representation. In Computer Vision and Pattern Recognition (CVPR), 2010
IEEE Conference on, pages 3304–3311, June 2010. 49
[26] J. Jeon, V. Lavrenko, and R. Manmatha. Automatic image annotation and retrieval using
cross-media relevance models. In SIGIR ’03: Proceedings of the 26th annual international
ACM SIGIR conference on Research and development in informaion retrieval, pages 119–
126, New York, NY, USA, 2003. ACM.
[27] P. S. K., P. Saurabh, and J. C. V. Text driven temporal segmentation of cricket videos. In
Indian Conference on Computer Vision Graphics and Image Processing, pages 433–444,
2006. 37, 38
[28] S. Kumano, K. Otsuka, D. Mikami, and J. Yamato. Analyzing empathetic interactions
based on the probabilistic modeling of the co-occurrence patterns of facial expressions
in group meetings.
In IEEE International Conference on Automatic Face Gesture
Recognition and Workshops, pages 43–50, March 2011. 28
[29] I. Laptev. On space-time interest points. International Journal of Computer Vision, 64(2-3):107–123, 2005. 49
[30] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching
for recognizing natural scene categories. In Computer Vision and Pattern Recognition,
2006 IEEE Computer Society Conference on, volume 2, pages 2169–2178, 2006.
[31] S.-H. Lee, J.-W. Han, O.-J. Kwon, T.-H. Kim, and S.-J. Ko. Novel face recognition
method using trend vector for a multimedia album. In IEEE International Conference
on Consumer Electronics, pages 490–491, Jan. 2012. 28
[32] C.-H. Li, C.-Y. Chiu, C.-R. Huang, C.-S. Chen, and L.-F. Chien.
Image content
clustering and summarization for photo collections. In IEEE International Conference
on Multimedia and Expo, pages 1033–1036, July 2006. 28
[33] D. Li and H. Lu. Avoiding false alarms due to illumination variation in shot detection.
pages 828 – 836, October 2000.
[34] J. Li, J. H. Lim, and Q. Tian. Automatic summarization for personal digital photos. In
Fourth Pacific Rim Conference on Information, Communications and Signal Processing
and Proceedings of the Joint Conference of the Fourth International Conference on
Multimedia, volume 3, pages 1536–1540, Dec. 2003. 28
[35] R. Lienhart and A. Zaccarin. A system for reliable dissolve detection in videos. volume
III, pages 406 – 409, 2001.
[36] S. H. Lim, Q. Lin, and A. Petruszka. Automatic creation of face composite images for
consumer applications. In IEEE International Conference on Acoustics Speech and Signal
Processing, pages 1642–1645, March 2010. 28
[37] G. Littlewort, M. Bartlett, L. Salamanca, and J. Reilly.
Automated measurement
of children’s facial expressions during problem solving tasks. In IEEE International
Conference on Automatic Face Gesture Recognition and Workshops, pages 30–35, March
2011. 28
[38] Z. Liu and H. Ai. Automatic eye state recognition and closed-eye photo correction. In
19th International Conference on Pattern Recognition, pages 1–4, Dec. 2008. 28
[39] H. Lu and Y. Tan. An effective postrefinement method for shot boundary detection.
CirSysVideo, 15:1407 – 1421, November 2005.
[40] P. Lucey, J. F. Cohn, I. Matthews, S. Lucey, S. Sridharan, J. Howlett, and K. M. Prkachin.
Automatically detecting pain in video through facial action units. IEEE Transactions on
Systems, Man, and Cybernetics, Part B, 41(3):664–674, 2011. 27
[41] S. Malassiotis and F. Tsalakanidou.
Recognizing facial expressions from 3d video:
Current results and future prospects. In IEEE International Conference on Automatic
Face Gesture Recognition and Workshops, pages 597–602, March 2011.
[42] C. Ngo, T. Pong, and R. Chin. Detection of gradual transitions through temporal slice
analysis. pages 36 – 41, 1999.
[43] J. C. Niebles, H. Wang, and L. Fei-Fei. Unsupervised learning of human action categories
using spatial-temporal words. International Journal of Computer Vision, 79(3):299–318,
2008. 48, 49
[44] S. O’Hara, Y. M. Lui, and B. Draper. Unsupervised learning of human expressions,
gestures, and actions. In IEEE International Conference on Automatic Face Gesture
Recognition and Workshops, pages 1–8, March 2011.
[45] B. Pal and S. Chandran. Sequence based temporal segmentation of cricket videos. In
Sequence based Temporal Segmentation of Cricket Videos, 2010. iv, xi, 7, 8, 37, 38, 40,
41, 68
[46] N. Patel and I. Sethi. Video shot detection and characterization for video databases. Pattern
Recognition, 30:583 – 592, April 1997.
[47] C. Petersohn. Fraunhofer hhi at trecvid 2004. shot boundary detection system. November
2004.
[48] Z. Rasheed and M. Shah. Scene detection in hollywood movies and tv shows. volume II,
pages 343 – 348, June 2003.
[49] S. Shahraray. Scene change detection and content-based sampling of video sequence.
pages 2 – 13, February 1995.
[50] R. Shaw and P. Schmitz. Community annotation and remix: a research platform and
pilot deployment. In HCM ’06: Proceedings of the 1st ACM international workshop on
Human-centered multimedia, pages 89–98, New York, NY, USA, 2006. ACM.
[51] H. Soyel and H. Demirel. Improved sift matching for pose robust facial expression
recognition. In IEEE International Conference on Automatic Face Gesture Recognition
and Workshops, pages 585–590, March 2011.
[52] G. Stratou, A. Ghosh, P. Debevec, and L.-P. Morency. Effect of illumination on automatic
expression recognition: A novel 3d relightable facial database. In IEEE International
Conference on Automatic Face Gesture Recognition and Workshops, pages 611–618,
March 2011.
[53] C. Sun and R. Nevatia. Large-scale web video event classification by use of fisher vectors.
In Applications of Computer Vision (WACV), 2013 IEEE Workshop on, pages 15–22, Jan
2013. 49
[54] D. Swanberg, C. Shu, and R. Jain. Knowledge guided parsing in video database. pages 13
– 24, May 1993.
[55] Y. Takahashi, N. Nitta, and N. Babaguchi. Video summarization for large sports video
archives. Multimedia and Expo, 2005. ICME 2005. IEEE International Conference on,
pages 1170–1173, July 2005. 11, 12, 20
[56] D. Tjondronegoro, Y.-P. P. Chen, and B. Pham. Highlights for more complete sports video
summarization. Multimedia, IEEE, 11(4):22–37, Oct.-Dec. 2004.
[57] H. Tong, J. He, M. Li, C. Zhang, and W.-Y. Ma. Graph based multi-modality learning.
In MULTIMEDIA ’05: Proceedings of the 13th annual ACM international conference on
Multimedia, pages 862–871, New York, NY, USA, 2005. ACM.
[58] G. Tzimiropoulos, S. Zafeiriou, and M. Pantic. Principal component analysis of image
gradient orientations for face recognition. In IEEE International Conference on Automatic
Face Gesture Recognition and Workshops, pages 553–558, March 2011.
[59] P. Viola and M. Jones. Robust real-time face detection. International Journal of Computer
Vision, 57(2):137–154, May 2004. 16
[60] P. A. Viola and M. J. Jones. Robust real-time face detection. International Journal of
Computer Vision, 57(2):137–154, 2004. 29
[61] T. Vlachos. Cut detection in video sequences using phase correlation. Signal Processing
Letters, pages 173 – 175, July 2000.
[62] A. Waibel, M. Bett, and M. Finke. Meeting browser: Tracking and summarizing meetings.
In Proceedings DARPA Broadcast News Transcription and Understanding Workshop,
pages 281–286, February 1998. 12
[63] S.-F. Wong and R. Cipolla.
Extracting spatiotemporal interest points using global
information. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference
on, pages 1–8, Oct. 2007. 49
[64] S.-F. Wong, T.-K. Kim, and R. Cipolla. Learning motion categories using both semantic
and structural information. In Computer Vision and Pattern Recognition, 2007. CVPR ’07.
IEEE Conference on, pages 1–6, June 2007. 48, 49
[65] L. Xie, S.-F. Chang, A. Divakaran, and H. Sun. Unsupervised discovery of multilevel
statistical video structures using hierarchical hidden markov models. In Multimedia and
Expo, 2003. ICME ’03. Proceedings. 2003 International Conference on, volume 3, pages
III–29–32 vol.3, July 2003. 48
[66] D. Xu, X. Li, Z. Liu, and Y. Yuan. Anchorperson extraction for picture in picture news
video. Pattern Recogn. Lett., 25(14):1587–1594, 2004. 12
[67] G. Xu, Y.-F. Ma, H.-J. Zhang, and S.-Q. Yang. An hmm-based framework for video
semantic analysis. Circuits and Systems for Video Technology, IEEE Transactions on,
15(11):1422–1433, Nov. 2005. 48
[68] C. Yeo, Y.-W. Zhu, Q. Sun, and S.-F. Chang. A framework for sub-window shot detection.
pages 84 – 91, 2005.
[69] H.-W. Yoo, H.-J. Ryoo, and D.-S. Jang. Gradual shot boundary detection using localized
edge blocks. Multimedia Tools and Applications, 28:283 – 300, 2006.
[70] G. Yuliang and X. De. A solution to illumination variation problem in shot detection.
pages 81 – 84, November 2004.
[71] R. Zabih, J. Miller, and K. Mai. Feature-based algorithms for detecting and classifying
scene breaks. 1995.
[72] H. Zhang, A. Kankanhalli, and S. Smoliar. Automatic partitioning of full-motion video.
ACM Multimedia Systems, 1:10 – 28, 1993.
[73] Y. Zhang, L. Gao, and S. Zhang. Feature-based automatic portrait generation system. In
WRI World Congress on Computer Science and Information Engineering, volume 3, pages
6–10, 31 2009-april 2 2009.
[74] Z. Zhang, G. Potamianos, A. W. Senior, and T. S. Huang. Joint face and head tracking
inside multi-camera smart rooms. Signal, Image and Video Processing, 1:163–178, 2007.
12
[75] Y. Zhuang, Y. Rui, T. Huang, and S. Mehrotra. Adaptive key frame extraction using
unsupervised clustering.
In Proceedings of the International Conference on Image
Processing, volume 1, pages 866–870, 1998. 11
Acknowledgments
I would like to express my special appreciation and thanks to my advisor Prof. Sharat
Chandran, you have been a tremendous mentor for me. I would like to thank you for
encouraging my research and for allowing me to grow as a research scholar. I would also
like to thank my committee members, Prof. Subhasis Chaudhuri and Prof. Shabbir Merchant, for their service even at personal hardship. I would like to thank the CSE Department
Staff members for their constant support.
More than academic support, I have many, many people to thank for listening to and, at times,
having to tolerate me over the past years. I express my gratitude and appreciation for their
friendship. ViGIL lab members have been unwavering in their personal and professional support
during the time I spent at the University.
I would like to thank my colleagues from Amazon, who have been of great help. I would
especially like to thank my mentor Dr. Arati Deo, for all the support. Your advice on both
research as well as on my career have been priceless.
A special thanks to my family. Words cannot express how grateful I am to my family members
for all of the sacrifices that you have made on my behalf. I would also like to thank all of my
friends who supported me and encouraged me to strive towards my goal. I would like to express
appreciation to my beloved husband Sudhakar who was always my support in the moments
when there was no one to answer my queries. At the end I would like to thank my beloved
daughter T.S. Harini for her love, understanding and prayers.
Nithya Sudhakar
Date:
Nithya Sudhakar